
Re: [f-cpu] RC5, F-CPU and srotl



Hi,
> hi again,

> but finally i prefer to answer your posts, they are more instructive
> than others ;-)
Thanks ;-)

I must add something before replying to your post. The objective of the
dnetc project is to break RC5 keys. So you increment the key and you
check whether it decrypts the block they give you.
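Roughly, the outer loop looks like this (only a sketch to fix the ideas, not
the real dnetc code; rc5_try_key() and the block64 struct are invented for
the example):

    #include <stdint.h>

    typedef struct { uint32_t lo, hi; } block64;

    /* hypothetical helper: expand 'key', encrypt 'plain' and
       compare the result with the given 'cypher' block */
    extern int rc5_try_key(block64 key, block64 plain, block64 cypher);

    uint32_t search_chunk(block64 key, block64 plain, block64 cypher,
                          uint32_t iterations)
    {
        for (uint32_t kiter = 0; kiter < iterations; kiter++) {
            if (rc5_try_key(key, plain, cypher))
                return kiter;        /* this key maps plain to cypher */
            if (++key.lo == 0)       /* increment the 64-bit key */
                key.hi++;
        }
        return iterations;           /* key not found in this chunk */
    }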

> i see things like 
> 
>       ROUND3EVEN (S1_02, S2_02, S3_02, S4_02);
>       ROUND3ODD  (S1_03, S2_03, S3_03, S4_03);
>       ROUND3EVEN (S1_04, S2_04, S3_04, S4_04);
>       ROUND3ODD  (S1_05, S2_05, S3_05, S4_05);
>       ROUND3EVEN (S1_06, S2_06, S3_06, S4_06);
>       ROUND3ODD  (S1_07, S2_07, S3_07, S4_07);
>       ROUND3EVEN (S1_08, S2_08, S3_08, S4_08);
>       ROUND3ODD  (S1_09, S2_09, S3_09, S4_09);
>       ROUND3EVEN (S1_10, S2_10, S3_10, S4_10);
>       ROUND3ODD  (S1_11, S2_11, S3_11, S4_11);
>       ROUND3EVEN (S1_12, S2_12, S3_12, S4_12);
>       ROUND3ODD  (S1_13, S2_13, S3_13, S4_13);
>       ROUND3EVEN (S1_14, S2_14, S3_14, S4_14);
>       ROUND3ODD  (S1_15, S2_15, S3_15, S4_15);
>       ROUND3EVEN (S1_16, S2_16, S3_16, S4_16);
>       ROUND3ODD  (S1_17, S2_17, S3_17, S4_17);
>       ROUND3EVEN (S1_18, S2_18, S3_18, S4_18);
>       ROUND3ODD  (S1_19, S2_19, S3_19, S4_19);
>       ROUND3EVEN (S1_20, S2_20, S3_20, S4_20);
>       ROUND3ODD  (S1_21, S2_21, S3_21, S4_21);
>       ROUND3EVEN (S1_22, S2_22, S3_22, S4_22);
>       ROUND3ODD  (S1_23, S2_23, S3_23, S4_23);
   
> and this reinforces my belief that it is possible
> to use the memory (it fits in L1) for storing the coefs.
I know it's possible to use memory, but you will not reduce the use
of the ICache and you will add a lot of accesses to the DCache,
and I don't see why you absolutely want to use the DCache.
(If you want an example that uses a simple table for S1..4, look
at the other file I posted.)

> This means that we can loop several times with a loop core containing
> 4 rounds (either parallel or sequential) or less.
Well, where did this 4th round come from?

> I see no particular reason why we would be forced to use ALL
> the registers for a single key.
Not for a single key, but for 4 keys.
 
> The macros for the rounds seem to be rather well designed
> and suitable for FC0, the loop size is not too small or too
> large, the instructions seem correctly interleaved and the average
> distance of 3 instructions between computation and use is "good".
That's why I selected this algorithm.
 
> To further my remark about using the L1 cache, here is some code :
 
> #define  ROUND2ODD(S1N, S2N, S3N, S4N)	\
> *    tmp1 = S1N;			\   <-- instead of using a register, we can use a load here.
>     A1 += Llo1;			\
> *    tmp2 = S2N;			\
>     A2 += Llo2;			\
> *    tmp3 = S3N;			\
>     A3 += Llo3;			\
> *    tmp4 = S4N;			\
>     A4 += Llo4;			\
>     A1 += tmp1;			\
>     A2 += tmp2;			\
> ..... rest of the code

The question is: why do you absolutely want to use the DCache?
The registers are quicker than the DCache, so why?

> look, there are 32 instructions (i presume that it could be possible to
> re-optimise the code to remove some "temp"s !) and only 4 loads. it is a
> perfectly sustainable bandwidth.

> However, i miss the signification of this code :
> 
> >       if (rc5unitwork->cypher.lo == eA1 &&
> >   	    rc5unitwork->cypher.hi == ROTL(eB1 ^ eA1, eA1) +
> >   	      ROTL3(S1_25 + A1 + ROTL(Llo1 + A1 + Lhi1, A1 + Lhi1))) return
> >								 kiter;
It compares the decrypted block with the block they give you.
It checks whether the key is the one you are looking for!

> > The problem is in the test. You must do a test for each chunk...
> where is this test ?
The code you included in your post ;-) (the one you didn't understand).
 
> ok you reassure me. but then i don't see the point of your previous
> remark.
Perhaps because I don't understand where you want to use this code.

> programming is an exercise on balancing register vs memory usage, speed vs 
> memory footprint, unrolling vs factoring... in fewer words, as someone
> on usenet write in his sig, it's an exercise on caching.
Yes, but if you look at the optimised core for the K7, you will see that
they use tables but not unrolled loops. I don't know why, but I am
not a K7 "guru".
 
> If your program footprint explodes, i don't see the point of
> "not accessing memory at all costs". The larger the program/code,
> the more bugs. <insert personal experience here.>
> If your bloated code does always do the _same_ thing all over
> the cache, then there's something really dumb happening.
> Frankly, you can reduce the code size by 4 or 8, which is reasonable and
> safe.
You can't reduce the code to much less than 600 instructions. If you
use a small loop to compute the different rounds, you will use too much
bandwidth just to reduce the instruction count. I think it's preferable
to use all of the ICache and nothing from the DCache.
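To make the trade-off concrete, the small-loop version would look more or less
like this (only a sketch with invented names; S[] is assumed to be already
filled with the P/Q constants, and the unrolled core keeps these 26 values in
registers or immediates instead of reloading them):

    #include <stdint.h>

    #define ROTL(x, s)  (((x) << ((s) & 31)) | ((x) >> ((32 - (s)) & 31)))
    #define ROTL3(x)    ROTL((x), 3)

    /* Table-driven key mixing (3 passes over S): tiny code, but every step
       goes through the DCache for S[i], which is exactly what I want to
       avoid. */
    static void mix_with_table(uint32_t S[26], uint32_t L[2])
    {
        uint32_t A = 0, B = 0;
        for (int i = 0, j = 0, k = 0; k < 3 * 26; k++) {
            A = S[i] = ROTL3(S[i] + A + B);          /* DCache load + store */
            B = L[j] = ROTL(L[j] + A + B, A + B);    /* key halves stay hot  */
            i = (i + 1) % 26;
            j = (j + 1) % 2;
        }
    }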

> What to do with the extra registers ? more computations, of course !
> Imagine a 256-bit, 4-issue F-CPU where there is only one unit per
> instruction slot (1 slot for boolean ops, 2 slots for additions, 1 slot
> for register move or memory store/load, or something like that).
> The exta registers will allow to compute even more stuff and still
> execute with no pipeline stall or empty issue slot. You can simply
> copy-paste the code, interleave it and rename the registers so you
> are sure that every execution unit is working all the time.
... Well, I actually compute 4 keys in the same pass. If we have a
256-bit F-CPU, we only need to rewrite the test and we will compute
16 keys per pass. No register renaming, nothing complicated to do...
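Schematically, the interleaving just puts the independent key computations
side by side; an illustrative sketch (the names are invented, ROTL3 is the
usual rotate-by-3):

    #include <stdint.h>

    #define ROTL(x, s)  (((x) << ((s) & 31)) | ((x) >> ((32 - (s)) & 31)))
    #define ROTL3(x)    ROTL((x), 3)

    /* One half-round step computed for 4 independent keys at once.
       Each lane has its own A/B state, so there are no dependencies
       between the lanes and the issue slots stay busy. A wider machine
       simply gets more lanes. */
    static inline void half_round_4keys(const uint32_t S[4],
                                        uint32_t A[4], uint32_t B[4])
    {
        A[0] = ROTL3(A[0] + B[0] + S[0]);   /* key 1 */
        A[1] = ROTL3(A[1] + B[1] + S[1]);   /* key 2, independent of key 1 */
        A[2] = ROTL3(A[2] + B[2] + S[2]);   /* key 3 */
        A[3] = ROTL3(A[3] + B[3] + S[3]);   /* key 4 */
    }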

> > > If you have 16 rounds and a 64-bit block, you need 16*2*32=1KB of
> > > cache. for 128-bit blocks, you need 2KB of data cache.
> > > Because the rounds use sequential, linear access, there is no cache
> > > penalty (there should be some auto-prefetching of the next cache line).
> > I think that you didn't look to the file I gave to you...
> not quite, but it shows that there are 12 rounds and the blocks are
> 64-bits, so the "working set" uses 32-bit integers.
Er, I see only 3 rounds. For the rest it's OK.
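For the record, if I remember the parameters correctly (RC5-32/12, the variant
of the contest), the expanded table S has 2*(12+1) = 26 32-bit words, i.e.
104 bytes per key, so even the tables for 4 keys would only need about
416 bytes.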

> > > Furthermore, if you decode several blocks at once (for example, a 64-bit
> > > core with 64-bit blocks), you can do :
> > 
> > >    loadi.64 8, [rp], rd;
> > >    sdupi.32 0, rd, r1;
> > >    sdupi.32 1, rd, r2; // or something like that
> > > --> you execute only one load and you get 4 32-bit values
> > >   in 2 registers with 3 instructions.
> > euh, sory but what are you doing here ?
> if we use F-CPU for enciphering or deciphering a message (in a real-world
> communication application, for example), i presume that the keys for each
> round don't change across the blocks --> we can use each round's value
> several times, once for each block, and we can store several blocks
> in the register set with the SIMD operations.
Oops, you want to do a brute-force attack in a communication application...
I don't understand.

> > > > So I prefer to have a big code (but smaller than 8KB), than to do stupid
> > > > operation in memory and not use all the
> > > > register and forgot that I am not on an x86 CPU and that I have 63
> > > > registers !
> > > i think you are mistaken : the goal is not to use ALL the registers.
> > > There are other things that come into the game, such as the time it
> > > takes to load and store the whole damn register set. Some computations are
> > > less memory intensive and might be happy to spread in the whole register
> > > set.
> > If I save and restore this register only after 20000 cycles or more, and
> > if I didn't use the L1 cache during all this cycle, where is the problem ?
> i need some more explanations, i'm not a distributed.net specialist.
> and i don't have much time to crawl inside their code (which would
> not be more difficult to read with some more comments, say 3x more).
OK, if you want all the source code for this project, go to their web site.
But they have a lot of different projects, and RC5 is one of the oldest.

> > > RC5 is not "intensive" but this becomes a bottleneck if you don't take
> > > care of the steadiness of the memory streams. It takes quite a while
> > > to dump/flush 512 bytes. Don't forget that the core
> > > often runs 10x faster than the memory system.
> > It's why I want to stay in register.
> except that here,
>  - the required bandwidth is well below the melting point
What do you mean by the "melting point"?

>  - the footprint is well below the thrashing point
Same question, but with thrashing.

>  - with the extra registers, you can perform more computations,
That's what I do: with 31 registers I compute 2 keys, with 62... 4 keys.

>    such as testing stuff in parallel or compute larger blocks per round.
 
A+
  Cedric

> PS: i got your post twice, is it Rik's VM striking again?
No: "it's Netscape's fault this time"
