[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [f-cpu] RC5, F-CPU and srotl

hi again,

Cedric BAIL wrote:
> Hi,
> > hi !
> >
> > This mailing list becomes difficult to follow... but here i am
> > anyway. I don't know if i can answer other posts.
> Sorry ;-)

but finally i prefer to answer your posts, they are more instructive than
others ;-)

> > Cedric BAIL wrote:
> > > > i just read the chapter about RC5 in Bruce Schneier's book :
> > > > it looks very interesting and i wonder why the code size
> > > > can't be reduced ...
> > > It's easy, we have two solutions :
> > >  - or we reduce the code, we only use 31 registers and we are around 600
> > > instructions before doing something in memory.
> > it''s NOT about not using memory or not using registers.
> > In an usual case, you would have 4KB of dedicated instruction and data caches.
> > maybe more, maybe less. If you put some data (the added constants generated
> > by the key) in L1, you free some registers that can be used for pointing
> > memory "streams", so there is a _continuous_ flow of data through the CPU,
> > _NOT_ "bursts" every 600 cycles or so. usually, the memory system won't
> > keep up and you'll slow down all the system. I think it's possible to find
> > a compromise with maybe 500 or 800 instructions, leaving enough space in L1
> > for some other useful code.
> So ok, I wasn't clear. I code my rc5 functions so that we do something like
> 20000 cycles (computes 40 keys) before doing any operation in DCache and the
> core must stay during the main loop in ICache. So the presure on memory
> will be very small and I imagin that will increase speed. I will save the
> registers at start and restore them at the end. In each case I will lost
> a lot of time, but the data will style stay in DCache so it's not a problem.
> At least, I think that you must look to the _4.cpp rc5 ansi core. You will
> more easily understand what I mean.

i see things like 

      ROUND3EVEN (S1_02, S2_02, S3_02, S4_02);
      ROUND3ODD  (S1_03, S2_03, S3_03, S4_03);
      ROUND3EVEN (S1_04, S2_04, S3_04, S4_04);
      ROUND3ODD  (S1_05, S2_05, S3_05, S4_05);
      ROUND3EVEN (S1_06, S2_06, S3_06, S4_06);
      ROUND3ODD  (S1_07, S2_07, S3_07, S4_07);
      ROUND3EVEN (S1_08, S2_08, S3_08, S4_08);
      ROUND3ODD  (S1_09, S2_09, S3_09, S4_09);
      ROUND3EVEN (S1_10, S2_10, S3_10, S4_10);
      ROUND3ODD  (S1_11, S2_11, S3_11, S4_11);
      ROUND3EVEN (S1_12, S2_12, S3_12, S4_12);
      ROUND3ODD  (S1_13, S2_13, S3_13, S4_13);
      ROUND3EVEN (S1_14, S2_14, S3_14, S4_14);
      ROUND3ODD  (S1_15, S2_15, S3_15, S4_15);
      ROUND3EVEN (S1_16, S2_16, S3_16, S4_16);
      ROUND3ODD  (S1_17, S2_17, S3_17, S4_17);
      ROUND3EVEN (S1_18, S2_18, S3_18, S4_18);
      ROUND3ODD  (S1_19, S2_19, S3_19, S4_19);
      ROUND3EVEN (S1_20, S2_20, S3_20, S4_20);
      ROUND3ODD  (S1_21, S2_21, S3_21, S4_21);
      ROUND3EVEN (S1_22, S2_22, S3_22, S4_22);
      ROUND3ODD  (S1_23, S2_23, S3_23, S4_23);
and this reinforces my belief that it is possible
to use the memory (it fits in L1) for storing the coefs.
This means that we can loop several with a loop core containing
4 rounds (either parallel or sequential) or less.
I see no particular reason why we would be forced to use ALL
the registers for a single key.

The macros for the rounds seem to be rather well designed
and suitable for FC0, the loop size is not too small or too
large, the instructions seem correctly interleaved and the average
distance of 3 instructions between computation and use is "good".

To further my remark about using the L1 cache, here is some code :

#define  ROUND2ODD(S1N, S2N, S3N, S4N)	\
*    tmp1 = S1N;			\   <-- instead of using a register, we can use a load instead.
    A1 += Llo1;			\
*    tmp2 = S2N;			\
    A2 += Llo2;			\
*    tmp3 = S3N;			\
    A3 += Llo3;			\
*    tmp4 = S4N;			\
    A4 += Llo4;			\
    A1 += tmp1;			\
    A2 += tmp2;			\
..... rest of the code

look, there are 32 instructions (i presume that it could be possible to re-optimise
the code to remove some "temp"s !) and only 4 loads. it is a perfectly sustainable

However, i miss the signification of this code :

>       if (rc5unitwork->cypher.lo == eA1 &&
>   	    rc5unitwork->cypher.hi == ROTL(eB1 ^ eA1, eA1) +
>   	      ROTL3(S1_25 + A1 + ROTL(Llo1 + A1 + Lhi1, A1 + Lhi1))) return kiter;

> > >  - or we use all the register, compute 2 more keys before doing something
> > > in memory and having better performance I thing.
> > today, performance is also (like in the 70s) constrained by the memory.
> > in today's systems, you can have a bandwidth of less than one byte per
> > instruction (provided the instruction is already in Icache and we hit L2 or
> > the local SDRAM).
> > You HAVE to interleave memory accesses (by small chunks) and computations.
> > Otherwise, your nicely optimised code will runn during 600 cycles and
> > stall almost completely another 600 cycles.
> Don't forget, my main loop will take 1200 instructions... And a loop is for
> looping ;-)

if you say so...

> > > > > I didn't find a specially designed core for IA64.
> > > > it's just a matter of time...
> > > Or because nobody have this type of hardware...
> > or because when you can afford one, you can afford a dedicated RC5 HW
> > ;-)
> The objectif of the distributed.net project is not to say : buy a new core,
> but use your cpu to do...

that's cool if we want to heat up the room while we're away ;-P

> > i'm not speaking about recoding the *client*, only the decoding algo.
> It's what I speak about...
> > if you start from the inner loop, there should be no problem.
> > You then propagate all the constraints to the client : platform detection,
> > tuning of the keys... and since you have the sources, you can make
> > your own f-cpu patch.
> The problem is in the test. You must do a test for each chunk...

where is this test ?

> > there is a simple way to do your thing in C :
> Grrr, I know howto use pointer on function ;-)
> <snip stupid code ;-) (you now what a enum and a switch his)>
> > Is it too difficult to do ?

ok you reassure me. but then i don't see the point of your previous remark.

> > > I am sure too, but I wan't to see what is the difference with the 64
> > > bits version.
> > version of what ? RC5 or F-CPU ?
> A the RC5 version never change, it came from the dnetc projet ;-)
who knows ...

> > > So, ok if we want to reduce the code needed, we will need to put data in
> > > memory and manipulate some stupid table...
> > if you mean "simple" table, it's ok.
> Yes, nut it's not ok. I didn't wan't to access to memory when I can stay
> in register.

programming is an exercise on balancing register vs memory usage, speed vs 
memory footprint, unrolling vs factoring... in fewer words, as someone
on usenet write in his sig, it's an exercise on caching.

If your program footprint explodes, i don't see the point of
"not accessing memory at all costs". The larger the program/code,
the more bugs. <insert personal experience here.>
If your bloated code does always do the _same_ thing all over
the cache, then there's something really dumb happening.
Frankly, you can reduce the code size by 4 or 8, which is reasonable and safe.

What to do with the extra registers ? more computations, of course !
Imagine a 256-bit, 4-issue F-CPU where there is only one unit per
instruction slot (1 slot for boolean ops, 2 slots for additions, 1 slot
for register move or memory store/load, or something like that).
The exta registers will allow to compute even more stuff and still
execute with no pipeline stall or empty issue slot. You can simply
copy-paste the code, interleave it and rename the registers so you
are sure that every execution unit is working all the time.

> > If you have 16 rounds and a 64-bit block, you need 16*2*32=1KB of
> > cache. for 128-bit blocks, you need 2KB of data cache.
> > Because the rounds use sequential, linear access, there is no cache
> > penalty (there should be some auto-prefetching of the next cache line).
> I think that you didn't look to the file I gave to you...

not quite, but it shows that there are 12 rounds and the blocks are 64-bits,
so the "working set" uses 32-bit integers.

> > Furthermore, if you decode several blocks at once (for example, a 64-bit
> > core with 64-bit blocks), you can do :
> >    loadi.64 8, [rp], rd;
> >    sdupi.32 0, rd, r1;
> >    sdupi.32 1, rd, r2; // or something like that
> > --> you execute only one load and you get 4 32-bit values
> >   in 2 registers with 3 instructions.
> euh, sory but what are you doing here ?

if we use F-CPU for enciphering or deciphering a message (in a real-world
communication application, for example), i presume that the keys for each
round don't change across the blocks --> we can use each round's value
several times, once for each block, and we can store several blocks
in the register set with the SIMD operations.

> > > So I prefer to have a big code (but smaller than 8KB), than to do stupid
> > > operation in memory and not use all the
> > > register and forgot that I am not on an x86 CPU and that I have 63
> > > registers !
> > i think you are mistaken : the goal is not to use ALL the registers.
> > There are other things that come into the game, such as the time it
> > takes to load and store the whole damn register set. Some computations are
> > less memory intensive and might be happy to spread in the whole register
> > set.
> If I save and restore this register only after 20000 cycles or more, and
> if I didn't use the L1 cache during all this cycle, where is the problem ?
i need some more explanations, i'm not a distributed.net specialist.
and i don't have much time to crawl inside their code (which would
not be more difficult to read with some more comments, say 3x more).

> > RC5 is not "intensive" but this becomes a bottleneck if you don't take
> > care of the steadiness of the memory streams. It takes quite a while
> > to dump/flush 512 bytes. Don't forget that the core
> > often runs 10x faster than the memory system.
> It's why I want to stay in register.

except that here,
 - the required bandwidth is well below the maltong point
 - the footprint is well below the thrashing point
 - with the extra registers, you can perform more computations,
   such as testing stuff in parallel or compute larger blocks per round.

> > If we ever find some time to meet, we'll read the documents and draft
> > some code together, i'll show you some tricks.
> And me too ;-)

i *like* that mutual misunderstanding that ends up learning stuffs :-)))

> A+
>   Cedric
PS: i got your post twice, c'est encore un coup  la VM  Rik ?
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/