[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [f-cpu] RC5, F-CPU and srotl



> > I must add something before repplying to your post. The objectif of the
> > dnetc project is to break the rc5 keys. So you increment the keys and you
> > verify if it decrypt the bloc they give to you.
> fine, however i don't see where the verification is performed.
>
> From my point of view, the dnet project is a specific part of the
> application domain of F-CPU. Can i point that crypto is used in
> communication (like in SSH) and storage (encrypted filesystems) ?
> This means that if we find something interesting when examining dnet,
> it can also enhance the performance of a normal computer that works
> in a secured environment, such as an internet server. It is maybe
> worth noting it and may explain why i don't focus exclusively on
> RC5 cracking (though i know it's a good entry point and a "cool way"
> to make others think about F-CPU).
Hum, the objectif is not to implement a RC5 cracking client, but to see what 
are the problem with our special F-CPU stuff (variable SIMD).


> no because it will reside in a smaller loop, with maybe
> do {
>        ROUND3EVEN (S1[i], S2[i], S3[i], S4[i]);
>        ROUND3ODD  (S1[i+1], S2[i+1], S3[i+1], S4[i+1]);
>        i+=2;
> } while (i<12);
I think that to solve this discussion the best way is to code a virtual 
machin for FC-0, with cycle count, so that we can see what is the best
implementation.

> in asm, we will need 4 pointer registers (renamed S1 to S4)
> and they use post-incremented addressing. The time to execute the
> block is enough to pre-load the LSU for the next "round" (loop cycle).

> > > This means that we can loop several with a loop core containing
> > > 4 rounds (either parallel or sequential) or less.
> > Well, where did this 4th round came from ?
> look at the code i copied at the top of the mail :
> i count 12 "ROUND3EVEN" and "ROUND3ODD" : these are "rounds"
> as the name says. Look at a crypto book, it's the Feistel
> structure denomination). and you can remark that 24/2=12.

> Now if we put 2*"ROUND3EVEN" and 2*"ROUND3ODD" in the loop body,
> we can loop 6* and spare some number of registers that can be
> assigned to other duties, such as searching other keys.
Perhaps, perhaps not, because you will need more memory bandwith to use
this register to compute other keys. The second problem is that you will
have a bigger loop to attack all the register. I think that your solution need
to much band with for data and instruction...

> > > I see no particular reason why we would be forced to use ALL
> > > the registers for a single key.
> > Not for a single key, but for 4 keys.
> ok, ok, but what if there was a mean to search 8 keys or maybe even 12 or
> 16 ? i don't think it's impossible or even impracticable, and i give a
> proof later in this post.
The problem is that you use more memory bandwith, with latency and you
will compute less key than me in the same time. The objectif is not to
test 16 or more key in the same pass but to test them as quick as possible.


> in this case, the difference does not appear because
>  - the data set fits in Dcache
>  - the access pattern is deterministic and linear (that is : ideal)
>
> In the ideal case, the distance between LSU operations on a give register
> must be spaced (at least half a dozen cycles) but the result of a load is
> usable _immediately_ (at first glance) after the instruction.
> This means that even without auto-prefetch, there is no latency if the
> distance between loads is respected. If you use all LSU lines (8 currently,
> this number corresponds to the average L1 latency) you can sustain the
> maximum throughput.
Ok, don't forget we need to save and restore all the register. I have a 
suggestion about them. On x86, if you use register to put your param when you 
do a function call you really optimise your application.

I propose to use the first 8 registers to put the return address (R1), and the
first params (R2 to R8). For the returned value we put the result in R1.
If we have variable number of parameter (like print) we detect it and put
them on stack (R63).

So with this spec we can start to work on the main calcul of the function 
without the need to first save them and in many case we will have enough 
register to work (I think that the majority of the function only use 8 
registers).

If the used language is C, we know that the parameter will be lost, but for 
language like Pascal (or one who use byRef call), that is great, you can code
a function like swap for example in very few instruction :

move R2, R4
move R3, R2
move R4, R3

That as quick as a macro without is border effect.

(I thing that I have perhaps forgot something about this problem, because why 
didn't people use this type of call ?)

> You can also think about the quantity of memory that is available :
>  - a L1 would contain 4 or 8 KB of memory
>  - the register set can contain 512 bytes (maximum)
> A RC5 core computation seems to require around 6 registers, so it seems
> possible to compute at least 8 simultaneous scalar operations (6*8=48,
> still leaving some registers for other duties) and thus 16 "keys" with
> a SIMD 64-bit F-CPU. This means however that the loop will contain
> only one Feistel round (both even and odd), but for all 8(16) keys in
> parallel.

> Clearly, this means that if we implement a FC3 with 4 or 6 instructions
> decoded per cycle, RC5 will still fly.

> > > ok you reassure me. but then i don't see the point of your previous
> > > remark.
> > Perhaps, because I don't understand where you want to use this code.
> RC5 is RC5, you use it to encode and decode streams (files, network
> packets...). DNET uses both and performs extra stuff, so it first needs a
> good understanding of RC5, and then a good overview of their cracking
> strategy.
> So i start with the simplest things : RC5 stream coding.
False, this is not the same problem. In one case you have a big amount of 
random data to crypt (your stream). And in the other case you 64 bits
crypted data and the 64 bits uncrypted data, and the only thing that is 
changing is the key (Only incremented).

> > > programming is an exercise on balancing register vs memory usage, speed
> > > vs memory footprint, unrolling vs factoring... in fewer words, as
> > > someone on usenet write in his sig, it's an exercise on caching.
> > Yes but if you look to the optimised core for K7, you will see that
> > they use table, but not unrolling loop. I don't know why, but I am
> > not a K7 "gourou".
> i have not seen their code (there's no mention about AMD or K7 in your ANSI
> codes) but from my , ... hmmmm...., "knowledge" of the architecture, it is
> due to... register pressure.
Sorry, but the optimised version for different core are present on their web 
site. I think you don't understand me, they use unrolled loop, and I don't 
find any core that use a "rolled" loop, so why ?

> hey, you can sustain 256 bits per cycle of bandwidth from and to L1 !
> the bandwidth is here to be used :-) If you have a "core" that fits in 1000
> instructions, it's ok (fits in 4KB of Icache). But the LSU is here to
> relieve the register set from temporarily unused data and it's dumb to not
> use it because as long as it fits in L1, you feel no latency.
Ok, I now understand your point of view. I think that only a good virtual 
machin can say who as the best solution.

> from what is written above, it's possible to do 16 keys with a 64-bit CPU.
> Mega-quote : ---->   "Use the cache, Luke" ;-P <-----
I understood, master ;-)  But I think that you too much cache.

> a "round" in the Feistel structure is the thing that is defined in the
> macro (where the computation is performed). The main code instanciates the
> macro 12x, this means that there are 12 "rounds" (plaintext attack on RC5
> is the only known solution when 16 rounds are used, according to Schneier).

> What you seem to call "round" is the higher level of the DNET code,
> where the S variables are generated, then text is encrypted and finally
> decrypted. maybe i missed a few thousands stuffs, but "ROUND3xxx" means
> that it's the 3rd kind of round, not the 3rd round.
Ok, I speak about the DNET round, and you the Schneier round.

> i'm just contemplating using FC0 in a secured environment where lots of
> data must be encrypted. because we don't have to recompute S all the time,
> the stream can be really fast (a 1st generation FC0 could at least not
> blush when compared to faster CPUs).
Ok, but that's not our current problem.

> > ok, if you want all the source code for this project go to their web
> > site. But they have a lot of different project and RC5 his the one of the
> > oldest.
> ??? is there a recommended URL for the source ?
www.distributed.net ;-)

> ooops i should not type with my mittens on ;-)
> "melting" point, or "it's not a bottleneck" because L1 bandwidth
> is not a problem when data is accessed with regular patterns.
Ok, I understand.

> it's the same as when your computer "swaps" to the hard disk :
> it is because it requires more memory than physically available,
> so it uses some slower, larger temporary storage.
> The same goes with L1, L2 etc...
Ok, I understand this too.

> mmmmm from what i read, with 32 registers it's possible to compute 4 keys :
> what did i miss ? maybe the compiler automatically maps the S variables
> to memory, don't you think ?

> let's count :
> >  u32 kiter = 0;
> >   u32 keycount = tslice;
>
> 2
>
> >   u32 S1_00,S1_01,S1_02,S1_03,S1_04,S1_05,S1_06,S1_07,S1_08,S1_09,
> >       S1_10,S1_11,S1_12,S1_13,S1_14,S1_15,S1_16,S1_17,S1_18,S1_19,
> >       S1_20,S1_21,S1_22,S1_23,S1_24,S1_25;
>
> 26
>
> >   u32 S2_00,S2_01,S2_02,S2_03,S2_04,S2_05,S2_06,S2_07,S2_08,S2_09,
> >       S2_10,S2_11,S2_12,S2_13,S2_14,S2_15,S2_16,S2_17,S2_18,S2_19,
> >       S2_20,S2_21,S2_22,S2_23,S2_24,S2_25;
>
> 26
>
> >   u32 S3_00,S3_01,S3_02,S3_03,S3_04,S3_05,S3_06,S3_07,S3_08,S3_09,
> >       S3_10,S3_11,S3_12,S3_13,S3_14,S3_15,S3_16,S3_17,S3_18,S3_19,
> >       S3_20,S3_21,S3_22,S3_23,S3_24,S3_25;
>
> 26
>
> >   u32 S4_00,S4_01,S4_02,S4_03,S4_04,S4_05,S4_06,S4_07,S4_08,S4_09,
> >       S4_10,S4_11,S4_12,S4_13,S4_14,S4_15,S4_16,S4_17,S4_18,S4_19,
> >       S4_20,S4_21,S4_22,S4_23,S4_24,S4_25;
>
> 26
>
> this part is more interesting :
> >   u32 A1, Llo1, Lhi1;
> >   u32 A2, Llo2, Lhi2;
> >   u32 A3, Llo3, Lhi3;
> >   u32 A4, Llo4, Lhi4;
> >   u32 tmp1, tmp2, tmp3, tmp4;
>
> 12+4=16
you forgot eA1, eA2, eA3, eA4, eB1, eB2, eB3 and eB4

> the total number is 16 + 26*4 + 2 = 122
total number = 130...

> So if i count correctly, you can fit this in the F-CPU register set
> simply because a 64-bit SIMD register can contain 2*u32.

> Then you can remove some of the code that loads the S, but you have
> to re-schedule the macros because temp1 to temp4 may be useless.
> at least "tmp1 = S1N;" etc. can be removed (register name inference).
Of course, I reschedule the macro to complain with F-CPU coding rules.

> Later, i see : "u32 cS0, Q;"
> and "u32 eA1, eB1, eA2, eB2, eA3, eB3, eA4, eB4;"
> so it adds 5 more registers : argh.
Ah, you see them (cS0 and Q are constant so that not realy a problem).
So you understand my current problem.

A+
   Cedric

*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/