[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [f-cpu] RC5, F-CPU and srotl



hi !

This mailing list becomes difficult to follow... but here i am
anyway. I don't know if i can answer other posts.

Cedric BAIL wrote:
> > i just read the chapter about RC5 in Bruce Schneier's book :
> > it looks very interesting and i wonder why the code size
> > can't be reduced ...
> It's easy, we have two solutions :
>  - or we reduce the code, we only use 31 registers and we are around 600
> instructions before doing something in memory.
it''s NOT about not using memory or not using registers.
In an usual case, you would have 4KB of dedicated instruction and data caches.
maybe more, maybe less. If you put some data (the added constants generated
by the key) in L1, you free some registers that can be used for pointing
memory "streams", so there is a _continuous_ flow of data through the CPU,
_NOT_ "bursts" every 600 cycles or so. usually, the memory system won't
keep up and you'll slow down all the system. I think it's possible to find
a compromise with maybe 500 or 800 instructions, leaving enough space in L1
for some other useful code.

>  - or we use all the register, compute 2 more keys before doing something
> in memory and having better performance I thing.
today, performance is also (like in the 70s) constrained by the memory.
in today's systems, you can have a bandwidth of less than one byte per instruction
(provided the instruction is already in Icache and we hit L2 or the local SDRAM).
You HAVE to interleave memory accesses (by small chunks) and computations.
Otherwise, your nicely optimised code will runn during 600 cycles and stall
almost completely another 600 cycles.

> > > > i am almost sure that a packing operation would avoid the
> > > > last 2 instructions.
> > > What did you mean by a packing operation ?
> > look at page 129 (?) of your PDF F-CPU manual :
> Ok, I look at this page, and it's exactly what I want, so I will
> use it for my final code.

do you trust me, now ? :-P

> > but please forget this, because it's optional and SRB must work before.
> > We don't have a LSU yet, so it's really difficult to speak about this.
> Ok, we will speak about it later.
so ...
<snip>
 
> > i thought it was as clear in the manual :-)
> Yes, but not so clear. So I really ask me how to know the number of chunk,
> or the real size of the register.
you can also look at some examples in the manual (instruction set part).

> > > I didn't find a specially designed core for IA64.
> > it's just a matter of time...
> Or because nobody have this type of hardware...
or because when you can afford one, you can afford a dedicated RC5 HW ;-)

> > what is the block size you use ? 64-bits ? and how many rounds ?
> > If you store the coeffs in registers, then no wonder you need so many
> > registers. However, using postincremented loads, you can sustain your
> > throughput. What Bruce Schneier describes is pretty simple so your
> > implementation is probably too hairy... or optimized for a CPU
> > which is not at all adapted to this task (and there is not only one).
> > but i'm sure that even a SHARC DSP can do the job wihout heating.
> 
> > using the size flags wisely with the SIMD flag on, you SHOULD be able to
> > do a core-width-independent code. i'm sure you can but this probably
> > requires you to start from scratch, not from ansi or distributed.net
> > sources.
> hum, the objectif is not to recode a client from scratch, but to have the
> same base client and have only the specific code that perform the calcul
> in F-CPU asm and the core selection system perhaps.

i'm not speaking about recoding the *client*, only the decoding algo.

> But you have at least one problem if you want to do a generic rc5 code for
> F-CPU you need to know how many keys you compute in a pass and how big
> the register are. It's really hard to do a generic algorithm (look at nicolas
> post).

if you start from the inner loop, there should be no problem.
You then propagate all the constraints to the client : platform detection,
tuning of the keys... and since you have the sources, you can make your
own f-cpu patch.

> > >    I think, that it's preferable to detect on wich CPU we are at start and
> > > then select the good core. (like what the current dnetc client do on x86).
> > F-CPU can do even better.
> So be clear : We will lost time to detect the CPU at each time the function
> is called (around 100 000 call). So it's not the good way to solved the
> problem, I will look for a good answer but at the start not during the call.

heh, i thought you were wiser ;-)

there is a simple way to do your thing in C :

void compute_1() {
  ****
}

void compute_2() {
  ****
}

void compute_3() {
  ****
}

void compute_ptr() = &compute_1;

int main () {

 // detect platform version
  if (CPU_TYPE == TYPE_2)
    compute_ptr = &compute_2;    
  else {
    if (CPU_TYPE == TYPE_3)
      compute_ptr = &compute_3;
    // else : default code
  }

  for (i=0; i<N; i++)
    compute_ptr();
	/* this translates to :
             - compute_ptr is mapped to a register (r1 for example)
             - just call with jmpa r1,r2 (where r2 is the return address pointer)
             - the functions return with jmpa r2,r0;
         */
}

Is it too difficult to do ?

> > i'm SURE RC5 can work in SIMD mode on F-CPU, including FC0. with 128-bit
> > or 256-bit cores, it could even be able to process 2 or 4 blocks at once.
> > a 64-bit core can code/decode a 128-bit block. i have no source code but
> > i'm pretty confident with this gut feeling.
> I am sure too, but I wan't to see what is the difference with the 64 bits
> version.

version of what ? RC5 or F-CPU ?

> > > > However, i wonder if there is a way to "factor" some code from the
> > > > core and reduce the code size. there _should_ be a way to minimize this
> > > > code.
> > > I think that it is not possible, because I use all the registers and each
> > > line are different.
> > héhé :-P
> > from what my book says, you're trying to grok an already optimised code.
> > if we restart from the definition, you'll see it's almost straight-forward.
> > good night Petit Scarabée,
> So, ok if we want to reduce the code needed, we will need to put data in
> memory and manipulate some stupid table...
if you mean "simple" table, it's ok.
If you have 16 rounds and a 64-bit block, you need 16*2*32=1KB of cache.
for 128-bit blocks, you need 2KB of data cache.
Because the rounds use sequential, linear access, there is no cache penalty
(there should be some auto-prefetching of the next cache line).

Furthermore, if you decode several blocks at once (for example, a 64-bit core
with 64-bit blocks), you can do :

   loadi.64 8, [rp], rd;
   sdupi.32 0, rd, r1;
   sdupi.32 1, rd, r2; // or something like that
--> you execute only one load and you get 4 32-bit values
  in 2 registers with 3 instructions.


> So I prefer to have a big code (but
> smaller than 8KB), than to do stupid operation in memory and not use all the
> register and forgot that I am not on an x86 CPU and that I have 63 registers !

i think you are mistaken : the goal is not to use ALL the registers.
There are other things that come into the game, such as the time it takes
to load and store the whole damn register set. Some computations are less
memory intensive and might be happy to spread in the whole register set.
RC5 is not "intensive" but this becomes a bottleneck if you don't take
care of the steadiness of the memory streams. It takes quite a while to
dump/flush 512 bytes. Don't forget that the core
often runs 10x faster than the memory system.

If we ever find some time to meet, we'll read the documents and draft some
code together, i'll show you some tricks.

read you soon,

> A+
>   Cedric
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/