
Re: [f-cpu] RC5, F-CPU and srotl

hello !

Cedric BAIL wrote:
> Hi,
> > hi again,
> I must add something before replying to your post. The objective of the
> dnetc project is to break the RC5 keys. So you increment the keys and you
> verify if it decrypts the block they give to you.

fine, however i don't see where the verification is performed.

From my point of view, the dnet project is one specific part of the
application domain of F-CPU. Can i point out that crypto is used in
communication (like in SSH) and storage (encrypted filesystems) ?
This means that if we find something interesting when examining dnet,
it can also enhance the performance of a normal computer that works
in a secured environment, such as an internet server. It is maybe
worth noting, and may explain why i don't focus exclusively on
RC5 cracking (though i know it's a good entry point and a "cool way"
to make others think about F-CPU).

> > i see things like
> >
> >       ROUND3EVEN (S1_02, S2_02, S3_02, S4_02);
> >       ROUND3ODD  (S1_03, S2_03, S3_03, S4_03);
> >       ROUND3EVEN (S1_04, S2_04, S3_04, S4_04);
> >       ROUND3ODD  (S1_05, S2_05, S3_05, S4_05);
> >       ROUND3EVEN (S1_06, S2_06, S3_06, S4_06);
> >       ROUND3ODD  (S1_07, S2_07, S3_07, S4_07);
> >       ROUND3EVEN (S1_08, S2_08, S3_08, S4_08);
> >       ROUND3ODD  (S1_09, S2_09, S3_09, S4_09);
> >       ROUND3EVEN (S1_10, S2_10, S3_10, S4_10);
> >       ROUND3ODD  (S1_11, S2_11, S3_11, S4_11);
> >       ROUND3EVEN (S1_12, S2_12, S3_12, S4_12);
> >       ROUND3ODD  (S1_13, S2_13, S3_13, S4_13);
> >       ROUND3EVEN (S1_14, S2_14, S3_14, S4_14);
> >       ROUND3ODD  (S1_15, S2_15, S3_15, S4_15);
> >       ROUND3EVEN (S1_16, S2_16, S3_16, S4_16);
> >       ROUND3ODD  (S1_17, S2_17, S3_17, S4_17);
> >       ROUND3EVEN (S1_18, S2_18, S3_18, S4_18);
> >       ROUND3ODD  (S1_19, S2_19, S3_19, S4_19);
> >       ROUND3EVEN (S1_20, S2_20, S3_20, S4_20);
> >       ROUND3ODD  (S1_21, S2_21, S3_21, S4_21);
> >       ROUND3EVEN (S1_22, S2_22, S3_22, S4_22);
> >       ROUND3ODD  (S1_23, S2_23, S3_23, S4_23);
> > and this reinforces my belief that it is possible
> > to use the memory (it fits in L1) for storing the coefs.
> I know it's possible to use memory, but you will not reduce the use
> of the ICache and you will add a lot of accesses to the DCache,

no because it will reside in a smaller loop, with maybe

int i = 0;
do {
       ROUND3EVEN (S1[2*i],   S2[2*i],   S3[2*i],   S4[2*i]);
       ROUND3ODD  (S1[2*i+1], S2[2*i+1], S3[2*i+1], S4[2*i+1]);
       i++;
} while (i < 12);

in asm, we will need 4 pointer registers (named S1 to S4)
that use post-incremented addressing. The time to execute the
block is enough to pre-load the LSU for the next "round" (loop iteration).

> and I don't see why you absolutely want to use the DCache.
This frees a lot of registers that can be used for other computing
purposes. In the case of stream encryption, the "S" are stored in L1
and the data being encrypted is kept in the remaining registers.
This provides enough work so that the code can be rescheduled easily,
and even executed on superscalar versions.

The dnet case looks different, but i can't involve myself
in it now. The case of the SIMD shift has been proven and there
does not seem to be any difficulty in implementing it in F-CPU ;
we simply disagree on a "small" implementation detail :-)

> (If you want an example that uses simple tables for S1..4, look
> in the other file I have posted).
"the other file" ? which ? there is a lot of them :-/

> > This means that we can loop several times with a loop core containing
> > 4 rounds (either parallel or sequential) or fewer.
> Well, where did this 4th round come from ?

look at the code i copied at the top of the mail :
i count 12 "ROUND3EVEN" and 12 "ROUND3ODD" : these are "rounds",
as the name says (look at a crypto book, it's the Feistel
structure denomination), and you can remark that 24/2=12.

Now if we put 2*"ROUND3EVEN" and 2*"ROUND3ODD" in the loop body,
we can loop 6 times and spare a number of registers that can be
assigned to other duties, such as searching other keys.

> > I see no particular reason why we would be forced to use ALL
> > the registers for a single key.
> Not for a single key, but for 4 keys.
ok, ok, but what if there was a means to search 8 keys, or maybe even 12 or 16 ?
i don't think it's impossible or even impracticable, and i give a proof
later in this post.

> > The macros for the rounds seem to be rather well designed
> > and suitable for FC0, the loop size is not too small or too
> > large, the instructions seem correctly interleaved and the average
> > distance of 3 instructions between computation and use is "good".
> It's why I selected this algo.
you have better taste than i thought ;-P
however, i don't like the parts that try to pre-optimize the code
for 2-address operations.

> > To further my remark about using the L1 cache, here is some code :
> > #define  ROUND2ODD(S1N, S2N, S3N, S4N)        \
> > *    tmp1 = S1N;                      \   <-- instead of a register, we can use a load
> >     A1 += Llo1;                       \
> > *    tmp2 = S2N;                      \
> >     A2 += Llo2;                       \
> > *    tmp3 = S3N;                      \
> >     A3 += Llo3;                       \
> > *    tmp4 = S4N;                      \
> >     A4 += Llo4;                       \
> >     A1 += tmp1;                       \
> >     A2 += tmp2;                       \
> > ..... rest of the code
> The question is why did you absolutely want to use the DCache ?
> The registers are quicker than the DCache, so why ?
in this case, the difference does not appear because
 - the data set fits in Dcache
 - the access pattern is deterministic and linear (that is : ideal)

In the ideal case, LSU operations on a given register must be spaced
(by at least half a dozen cycles), but the result of a load is
usable _immediately_ (at first sight) after the instruction.
This means that even without auto-prefetch, there is no latency if the distance
between loads is respected. If you use all the LSU lines (8 currently ; this number
corresponds to the average L1 latency) you can sustain the maximum throughput.

You can also think about the quantity of memory that is available :
 - an L1 would contain 4 or 8 KB of memory
 - the register set can contain 512 bytes (maximum)
An RC5 core computation seems to require around 6 registers, so it seems
possible to compute at least 8 simultaneous scalar operations (6*8=48,
still leaving some registers for other duties) and thus 16 "keys" with
a SIMD 64-bit F-CPU. This means however that the loop will contain
only one Feistel round (both even and odd), but for all 8 (16) keys in
parallel.

Clearly, this means that if we implement a FC3 with 4 or 6 instructions
decoded per cycle, RC5 will still fly.

> > look, there are 32 instructions (i presume that it could be possible to
> > re-optimise the code to remove some "temp"s !) and only 4 loads. it is a
> > perfectly sustainable bandwidth.
> > However, i miss the signification of this code :
> >
> > >       if (rc5unitwork->cypher.lo == eA1 &&
> > >         rc5unitwork->cypher.hi == ROTL(eB1 ^ eA1, eA1) +
> > >           ROTL3(S1_25 + A1 + ROTL(Llo1 + A1 + Lhi1, A1 + Lhi1))) return
> > >                                                              kiter;
> It compares the decrypted block with the block they give to you.
> It's for checking if the key is the one you are looking for !
well, nothing worth fussing about, then ;-)

> > > The problem is in the test. You must do a test for each chunk...
> > where is this test ?
> The code you included in your post ;-) (The one you didn't understand)
may one forgive me. i thought they knew how to write comments. 

> > ok you reassure me. but then i don't see the point of your previous
> > remark.
> Perhaps, because I don't understand where you want to use this code.
RC5 is RC5, you use it to encode and decode streams (files, network packets...).
DNET uses both and performs extra stuff, so it first needs a good understanding
of RC5, and then a good overview of their cracking strategy.
So i start with the simplest things : RC5 stream coding.

> > programming is an exercise in balancing register vs memory usage, speed vs
> > memory footprint, unrolling vs factoring... in fewer words, as someone
> > on usenet wrote in his sig, it's an exercise in caching.
> Yes, but if you look at the optimised core for K7, you will see that
> they use tables, but no loop unrolling. I don't know why, but I am
> not a K7 "guru".
i have not seen their code (there's no mention of AMD or K7 in your ANSI sources)
but from my, ... hmmmm..., "knowledge" of the architecture, it is due to...
register pressure.

However, there is no loss in using memory in this case (as i explained :
linear access pattern and small dataset), so the strategy for F-CPU would be
to use similar code in a loop, where the 8* unrolling is just
cut-paste-rename and the computations are done in parallel.

Given the 8 "general purpose registers" of x86 and the 32 regs of other RISC
platforms (see the _4-rg source), you see that it is possible to compute
at least 8 keys in scalar mode and much more in SIMD mode.

> > If your program footprint explodes, i don't see the point of
> > "not accessing memory at all costs". The larger the program/code,
> > the more bugs. <insert personal experience here.>
> > If your bloated code does always do the _same_ thing all over
> > the cache, then there's something really dumb happening.
> > Frankly, you can reduce the code size by 4 or 8, which is reasonable and
> > safe.
> You can't reduce the code much below 600 instructions. If you
> use small loops to compute the different rounds, you will use too much
> bandwidth only to reduce the number of instructions. I think that it's
> preferable to use all the ICache and nothing of the DCache.

hey, you can sustain 256 bits of bandwidth per cycle from and to L1 !
the bandwidth is here to be used :-) If you have a "core" that fits in 1000
instructions, it's ok (it fits in 4KB of Icache). But the LSU is here to relieve
the register set of temporarily unused data, and it's dumb not to use it
because, as long as the data fits in L1, you feel no latency.

> > What to do with the extra registers ? more computations, of course !
> > Imagine a 256-bit, 4-issue F-CPU where there is only one unit per
> > instruction slot (1 slot for boolean ops, 2 slots for additions, 1 slot
> > for register move or memory store/load, or something like that).
> > The exta registers will allow to compute even more stuff and still
> > execute with no pipeline stall or empty issue slot. You can simply
> > copy-paste the code, interleave it and rename the registers so you
> > are sure that every execution unit is working all the time.
> ... Well I actually compute 4 keys in the same pass. If we have a
> 256 bits F-CPU, we only need to rewrite the test and we will compute
> 16 keys per pass. No register renaming, nothing complicated to do...

from what is written above, it's possible to do 16 keys with a 64-bit CPU.

Mega-quote : ---->   "Use the cache, Luke" ;-P <-----

> > > > If you have 16 rounds and a 64-bit block, you need 16*2*32=1KB of
> > > > cache. for 128-bit blocks, you need 2KB of data cache.
> > > > Because the rounds use sequential, linear access, there is no cache
> > > > penalty (there should be some auto-prefetching of the next cache line).
> > > I think that you didn't look to the file I gave to you...
> > not quite, but it shows that there are 12 rounds and the blocks are
> > 64-bits, so the "working set" uses 32-bit integers.
> uh, I see only 3 rounds. For the rest it's ok.

a "round" in the Feistel structure is the thing that is defined in the macro
(where the computation is performed). The main code instantiates the macro
12 times, so there are 12 "rounds" (a plaintext attack on RC5 is the only
known solution when 16 rounds are used, according to Schneier).

What you seem to call a "round" is the higher level of the DNET code,
where the S variables are generated, then text is encrypted and finally
decrypted. Maybe i missed a few thousand things, but "ROUND3xxx" means
that it's the 3rd kind of round, not the 3rd round.

> > > > Furthermore, if you decode several blocks at once (for example, a 64-bit
> > > > core with 64-bit blocks), you can do :
> > >
> > > >    loadi.64 8, [rp], rd;
> > > >    sdupi.32 0, rd, r1;
> > > >    sdupi.32 1, rd, r2; // or something like that
> > > > --> you execute only one load and you get 4 32-bit values
> > > >   in 2 registers with 3 instructions.
> > > uh, sorry, but what are you doing here ?
> > if we use F-CPU for enciphering or deciphering a message (in a real-world
> > communication application, for example), i presume that the keys for each
> > round don't change across the blocks --> we can use each round's value
> > several times, once for each block, and we can store several blocks
> > in the register set with the SIMD operations.
> oops, you want to do a brute force attack in a communication application...
> I don't understand.

i'm just contemplating using FC0 in a secured environment where lots of data
must be encrypted. Because we don't have to recompute S all the time, the
stream can be really fast (a 1st-generation FC0 would at least not blush
when compared to faster CPUs).

> > > > > So I prefer to have big code (but smaller than 8KB) than to do stupid
> > > > > operations in memory, not use all the
> > > > > registers, and forget that I am not on an x86 CPU and that I have 63
> > > > > registers !
> > > > i think you are mistaken : the goal is not to use ALL the registers.
> > > > There are other things that come into the game, such as the time it
> > > > takes to load and store the whole damn register set. Some computations are
> > > > less memory intensive and might be happy to spread in the whole register
> > > > set.
> > > If I save and restore these registers only after 20000 cycles or more, and
> > > if I don't use the L1 cache during all these cycles, where is the problem ?
> > i need some more explanations, i'm not a distributed.net specialist.
> > and i don't have much time to crawl inside their code (which would
> > not be more difficult to read with some more comments, say 3x more).
> ok, if you want all the source code for this project, go to their web site.
> But they have a lot of different projects and RC5 is one of the oldest.

??? is there a recommended URL for the source ?

> > > > RC5 is not "intensive" but this becomes a bottleneck if you don't take
> > > > care of the steadiness of the memory streams. It takes quite a while
> > > > to dump/flush 512 bytes. Don't forget that the core
> > > > often runs 10x faster than the memory system.
> > > It's why I want to stay in register.
> > except that here,
> >  - the required bandwidth is well below the maltong point
> What is a maltong point ?

ooops i should not type with my mittens on ;-)
"melting" point, or "it's not a bottleneck" because L1 bandwidth
is not a problem when data is accessed with regular patterns.

> >  - the footprint is well below the thrashing point
> Same question, but with thrashing.

it's the same as when your computer "swaps" to the hard disk :
it happens because the program requires more memory than is physically
available, so it uses some slower, larger temporary storage.
The same goes for L1, L2 etc...

> >  - with the extra registers, you can perform more computations,
> It's what I do : with 31 registers I compute 2 keys, with 62... 4 keys.
mmmmm from what i read, with 32 registers it's possible to compute 4 keys :
what did i miss ? maybe the compiler automatically maps the S variables
to memory, don't you think ?

let's count :
>  u32 kiter = 0;
>   u32 keycount = tslice;
>   u32 S1_00,S1_01,S1_02,S1_03,S1_04,S1_05,S1_06,S1_07,S1_08,S1_09,
>       S1_10,S1_11,S1_12,S1_13,S1_14,S1_15,S1_16,S1_17,S1_18,S1_19,
>       S1_20,S1_21,S1_22,S1_23,S1_24,S1_25;
>   u32 S2_00,S2_01,S2_02,S2_03,S2_04,S2_05,S2_06,S2_07,S2_08,S2_09,
>       S2_10,S2_11,S2_12,S2_13,S2_14,S2_15,S2_16,S2_17,S2_18,S2_19,
>       S2_20,S2_21,S2_22,S2_23,S2_24,S2_25;
>   u32 S3_00,S3_01,S3_02,S3_03,S3_04,S3_05,S3_06,S3_07,S3_08,S3_09,
>       S3_10,S3_11,S3_12,S3_13,S3_14,S3_15,S3_16,S3_17,S3_18,S3_19,
>       S3_20,S3_21,S3_22,S3_23,S3_24,S3_25;
>   u32 S4_00,S4_01,S4_02,S4_03,S4_04,S4_05,S4_06,S4_07,S4_08,S4_09,
>       S4_10,S4_11,S4_12,S4_13,S4_14,S4_15,S4_16,S4_17,S4_18,S4_19,
>       S4_20,S4_21,S4_22,S4_23,S4_24,S4_25;

this part is more interesting :
>   u32 A1, Llo1, Lhi1;
>   u32 A2, Llo2, Lhi2;
>   u32 A3, Llo3, Lhi3;
>   u32 A4, Llo4, Lhi4;
>   u32 tmp1, tmp2, tmp3, tmp4;

the total number is 16 + 26*4 + 2 = 122

So if i count correctly, you can fit this in the F-CPU register set
simply because a 64-bit SIMD register can contain 2*u32.

Then you can remove some of the code that loads the S, but you have
to re-schedule the macros because tmp1 to tmp4 may be useless.
At least "tmp1 = S1N;" etc. can be removed (register name inference).

Later, i see : "u32 cS0, Q;"
and "u32 eA1, eB1, eA2, eB2, eA3, eB3, eA4, eB4;"
so it adds 10 more u32 values (5 more 64-bit registers) : argh.

> A+
>   Cedric
> > PS: i got your post twice, is that Rik's VM striking again ?
> No : "that's netscape's doing"

"finally, when netscape freezes, it's rock stable." :-P

np : Die Toten Hosen, "Opium für's Volk".
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/