
Re: [f-cpu] RC5, F-CPU and srotl


> hi !
> This mailing list becomes difficult to follow... but here i am
> anyway. I don't know if i can answer other posts.
Sorry ;-)

> Cedric BAIL wrote:
> > > i just read the chapter about RC5 in Bruce Schneier's book :
> > > it looks very interesting and i wonder why the code size
> > > can't be reduced ...
> > It's easy, we have two solutions :
> >  - either we reduce the code, use only 31 registers, and run around 600
> > instructions before doing something in memory.
> it's NOT about not using memory or not using registers.
> In a usual case, you would have 4KB of dedicated instruction and data caches.
> maybe more, maybe less. If you put some data (the added constants generated
> by the key) in L1, you free some registers that can be used for pointing
> memory "streams", so there is a _continuous_ flow of data through the CPU,
> _NOT_ "bursts" every 600 cycles or so. usually, the memory system won't
> keep up and you'll slow down all the system. I think it's possible to find
> a compromise with maybe 500 or 800 instructions, leaving enough space in L1
> for some other useful code.
OK, I wasn't clear. I write my RC5 functions so that they run for something
like 20000 cycles (computing 40 keys) before doing any operation in the DCache,
and the core must stay in the ICache during the main loop. So the pressure on
memory will be very small, and I imagine that will increase speed. I will save
the registers at the start and restore them at the end. Each time I will lose
a lot of time, but the data will still stay in the DCache, so it's not a problem.

In any case, I think you should look at the _4.cpp RC5 ANSI core. You will
more easily understand what I mean.
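
A minimal sketch of the batching shape described above, in C rather than
F-CPU assembly. The names check_batch, KEYS_PER_BATCH and toy_mix are
hypothetical (they are not from the dnetc sources), and toy_mix is only a
stand-in for the real RC5 rounds. The point is the structure: all working
state lives in locals that a compiler can keep in registers, and memory is
written at most once per batch of 40 keys.

```c
#include <assert.h>
#include <stdint.h>

#define KEYS_PER_BATCH 40u  /* figure from the text: 40 keys per pass */

/* Toy stand-in for the per-key work (NOT the RC5 algorithm). */
static uint32_t toy_mix(uint32_t key)
{
    uint32_t x = key * 2654435761u;  /* arbitrary odd multiplier */
    return x ^ (x >> 16);
}

/* Test KEYS_PER_BATCH consecutive keys starting at first_key.
 * Returns 1 and stores the matching key in *found if one matches target,
 * otherwise returns 0 and leaves *found untouched.
 * The only store to memory is the final write to *found. */
static int check_batch(uint32_t first_key, uint32_t target, uint32_t *found)
{
    assert(found != 0);
    uint32_t hit = 0;
    int have_hit = 0;
    for (uint32_t i = 0; i < KEYS_PER_BATCH; i++) {
        uint32_t key = first_key + i;
        if (!have_hit && toy_mix(key) == target) {
            hit = key;       /* stays in a local, no memory traffic */
            have_hit = 1;
        }
    }
    if (have_hit)
        *found = hit;        /* single store, once per batch */
    return have_hit;
}
```

A caller would invoke check_batch once per block of 40 keys, so the
register save and restore cost is paid once per batch and not once per key.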

> >  - or we use all the registers, compute 2 more keys before doing something
> > in memory, and get better performance, I think.
> today, performance is also (like in the 70s) constrained by the memory.
> in today's systems, you can have a bandwidth of less than one byte per
> instruction (provided the instruction is already in Icache and we hit L2 or
> the local SDRAM).
> You HAVE to interleave memory accesses (by small chunks) and computations.
> Otherwise, your nicely optimised code will run for 600 cycles and
> stall almost completely another 600 cycles.
Don't forget, my main loop will take 1200 instructions... And a loop is for
looping ;-)

> > > > I didn't find a specially designed core for IA64.
> > > it's just a matter of time...
> > Or because nobody has this type of hardware...
> or because when you can afford one, you can afford a dedicated RC5 HW
> ;-)
The objective of the distributed.net project is not to say: buy a new core,
but: use your CPU to do...
> i'm not speaking about recoding the *client*, only the decoding algo.
That's what I am talking about...

> if you start from the inner loop, there should be no problem.
> You then propagate all the constraints to the client : platform detection,
> tuning of the keys... and since you have the sources, you can make
> your own f-cpu patch.
The problem is in the test. You must do a test for each chunk...
> there is a simple way to do your thing in C :
Grrr, I know how to use function pointers ;-)

<snip stupid code ;-) (you know what an enum and a switch are)>
> Is it too difficult to do ?
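
A minimal sketch of the function-pointer approach mentioned above, so
that the platform test is done once and not for each chunk. Everything
here is hypothetical: platform_t, core_fn, core_generic, core_fcpu64 and
select_core are illustrative names, not the dnetc client's API, and the
two cores only report how many keys they would process.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { PLATFORM_GENERIC, PLATFORM_FCPU64 } platform_t;

/* A core takes a starting key and a key count and returns the number
 * of keys it actually processed. */
typedef uint32_t (*core_fn)(uint32_t first_key, uint32_t count);

/* Hypothetical generic core: one key at a time. */
static uint32_t core_generic(uint32_t first_key, uint32_t count)
{
    (void)first_key;
    return count;
}

/* Hypothetical 64-bit core: assumed to work on pairs of keys
 * (two 32-bit lanes per register), so it rounds down to an even count. */
static uint32_t core_fcpu64(uint32_t first_key, uint32_t count)
{
    (void)first_key;
    return count & ~1u;
}

/* The enum and switch run once, at start-up. */
static core_fn select_core(platform_t p)
{
    core_fn chosen = core_generic;
    switch (p) {
    case PLATFORM_FCPU64:  chosen = core_fcpu64;  break;
    case PLATFORM_GENERIC: chosen = core_generic; break;
    }
    assert(chosen != 0);
    return chosen;
}
```

The per-chunk loop then just calls through the pointer returned by
select_core, with no test inside the loop.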

> > I am sure too, but I want to see what the difference is with the 64-bit
> > version.
> version of what ? RC5 or F-CPU ?
Ah, the RC5 version never changes, it comes from the dnetc project ;-)

> > So, ok if we want to reduce the code needed, we will need to put data in
> > memory and manipulate some stupid table...
> if you mean "simple" table, it's ok.
Yes, but it's not OK. I don't want to access memory when I can stay
in registers.

> If you have 16 rounds and a 64-bit block, you need 16*2*32 = 1024 bits
> (128 bytes) of cache. for 128-bit blocks, you need 2048 bits (256 bytes)
> of data cache.
> Because the rounds use sequential, linear access, there is no cache
> penalty (there should be some auto-prefetching of the next cache line).
I think you didn't look at the file I gave you...
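
For reference, a minimal sketch of the table-based variant under
discussion: RC5 with 32-bit words, reading the expanded key table S
strictly in order. The round structure follows RC5 as published, which
uses 2*(rounds+1) subkey words; the function names are my own, and the
key expansion that fills S is left out. This is not the dnetc core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate left/right by the low 5 bits of n (data-dependent rotation). */
static uint32_t rotl32(uint32_t x, uint32_t n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

static uint32_t rotr32(uint32_t x, uint32_t n)
{
    n &= 31u;
    return (x >> n) | (x << ((32u - n) & 31u));
}

/* Size in bytes of the expanded key table for a given round count. */
static size_t rc5_table_bytes(unsigned rounds)
{
    return 2u * (rounds + 1u) * sizeof(uint32_t);
}

/* Encrypt one 64-bit block (a, b); S is walked sequentially. */
static void rc5_encrypt(const uint32_t *S, unsigned rounds,
                        uint32_t *a, uint32_t *b)
{
    assert(S != 0 && a != 0 && b != 0);
    uint32_t A = *a + S[0];
    uint32_t B = *b + S[1];
    for (unsigned i = 1; i <= rounds; i++) {
        A = rotl32(A ^ B, B) + S[2 * i];
        B = rotl32(B ^ A, A) + S[2 * i + 1];
    }
    *a = A;
    *b = B;
}

/* Inverse of rc5_encrypt: walks S backwards. */
static void rc5_decrypt(const uint32_t *S, unsigned rounds,
                        uint32_t *a, uint32_t *b)
{
    assert(S != 0 && a != 0 && b != 0);
    uint32_t A = *a;
    uint32_t B = *b;
    for (unsigned i = rounds; i >= 1; i--) {
        B = rotr32(B - S[2 * i + 1], A) ^ A;
        A = rotr32(A - S[2 * i], B) ^ B;
    }
    *b = B - S[1];
    *a = A - S[0];
}
```

Whether S lives in the data cache or its words are held in registers
across the whole loop is exactly the trade-off argued in this thread.
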
> Furthermore, if you decode several blocks at once (for example, a 64-bit
> core with 64-bit blocks), you can do :

>    loadi.64 8, [rp], rd;
>    sdupi.32 0, rd, r1;
>    sdupi.32 1, rd, r2; // or something like that
> --> you execute only one load and you get 4 32-bit values
>   in 2 registers with 3 instructions.
Uh, sorry, but what are you doing here?
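
One possible reading of the quoted sequence, sketched in C. This ASSUMES
that sdupi.32 n copies 32-bit lane n of the source into both 32-bit
lanes of a 64-bit destination; I have not checked that against the F-CPU
manual, and load64 and dup32 are hypothetical helper names. Under that
assumption, one 64-bit load plus two duplications gives four 32-bit
values in two registers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One 64-bit load from memory (stands in for loadi.64). */
static uint64_t load64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Assumed meaning of sdupi.32: broadcast 32-bit lane `lane` (0 = low
 * half, 1 = high half) of reg into both halves of the result. */
static uint64_t dup32(uint64_t reg, unsigned lane)
{
    assert(lane < 2u);
    uint64_t v = (reg >> (32u * lane)) & 0xFFFFFFFFu;
    return v | (v << 32);
}
```

With rd = load64(rp), r1 = dup32(rd, 0) and r2 = dup32(rd, 1), each
register holds the same subkey twice, ready to be applied to two blocks
in parallel.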

> > So I prefer to have big code (but smaller than 8KB) rather than do stupid
> > operations in memory, not use all the
> > registers, and forget that I am not on an x86 CPU and that I have 63
> > registers !
> i think you are mistaken : the goal is not to use ALL the registers.
> There are other things that come into the game, such as the time it
> takes to load and store the whole damn register set. Some computations are
> less memory intensive and might be happy to spread in the whole register
> set.
If I save and restore these registers only once every 20000 cycles or more,
and if I don't use the L1 cache during all those cycles, where is the problem?
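
A rough sketch of the arithmetic behind that question. The helper names
are hypothetical, and the cost of one cycle per byte moved is an
ASSUMPTION loosely based on the "less than one byte per instruction"
figure quoted earlier, not a measured F-CPU number.

```c
#include <assert.h>

/* Bytes needed to hold nregs registers of reg_bits bits each. */
static unsigned regset_bytes(unsigned nregs, unsigned reg_bits)
{
    assert(reg_bits % 8u == 0u);
    return nregs * (reg_bits / 8u);
}

/* Share of total time spent saving and restoring the register set,
 * in parts per thousand, for one save plus one restore per compute
 * phase of compute_cycles cycles. */
static unsigned spill_overhead_permille(unsigned bytes,
                                        unsigned cycles_per_byte,
                                        unsigned compute_cycles)
{
    unsigned spill = 2u * bytes * cycles_per_byte;  /* save + restore */
    assert(spill + compute_cycles > 0u);
    return (1000u * spill) / (spill + compute_cycles);
}
```

Under those assumptions, 63 registers of 64 bits are 504 bytes, and the
overhead is 47 permille for a 20000-cycle loop against 626 permille for
a 600-cycle one, which is why the length of the loop matters so much here.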

> RC5 is not "intensive" but this becomes a bottleneck if you don't take
> care of the steadiness of the memory streams. It takes quite a while
> to dump/flush 512 bytes. Don't forget that the core
> often runs 10x faster than the memory system.
That's why I want to stay in registers.
> If we ever find some time to meet, we'll read the documents and draft
> some code together, i'll show you some tricks.
And me too ;-)
