
Re: [f-cpu] RC5, F-CPU and srotl



> i just read the chapter about RC5 in Bruce Schneier's book :
> it looks very interesting and i wonder why the code size
> can't be reduced ...
It's simple; we have two options:
 - either we reduce the code: we use only 31 registers and need around 600
instructions before touching memory,
 - or we use all the registers and compute 2 more keys before touching
memory, which I think gives better performance.
 
> > > i am almost sure that a packing operation would avoid the
> > > last 2 instructions.
> > What did you mean by a packing operation ?
> look at page 129 (?) of your PDF F-CPU manual :
OK, I looked at this page and it's exactly what I want, so I will
use it in my final code.

> mmmm and what about using them for the function prolog and epilog ?
> two registers would contain the range of used registers and they
> would be saved on the stack or restored. This would make the compiler
> happier because there would be no determined register allocation.
> only the stack pointer and the low and high bounds are "fixed"
> (then the instruction would need automatic post-increment)...
> 
> but please forget this, because it's optional and SRB must work before.
> We don't have a LSU yet, so it's really difficult to speak about this.
Ok, we will speak about it later.

> > I think that it will do :
> > for i := imm6(1) to imm6(1) + imm6(2) do
> >   store 8, ri, [r3]
> > done
> > or something like that.
> no, because
>  - please don't add an adder in the Critical DataPath
The add operation is needed, and we must check whether it can do its save
before starting the storem/loadm. I see the problem, and the difference from
the SRB mechanism.

>  - your store is likely to create a trap, but storem/loadm uses the SRB
>    which is specified to not trap.
Yes, but that means your storem/loadm only works on physical addresses, right?

>  - the SRB will "snoop" the Xbar, in case a register is to be saved AND used


> i thought it was as clear in the manual :-)
Yes, but it is not that clear. So I am really wondering how to find out the
number of chunks, or the real size of the registers.

> > I didn't find a specially designed core for IA64.
> it's just a matter of time...
Or it's because nobody has this type of hardware...
 
> what is the block size you use ? 64-bits ? and how many rounds ?
> If you store the coeffs in registers, then no wonder you need so many
> registers. However, using postincremented loads, you can sustain your
> throughput. What Bruce Schneier describes is pretty simple so your
> implementation is probably too hairy... or optimized for a CPU
> which is not at all adapted to this task (and there is not only one).
> but i'm sure that even a SHARC DSP can do the job without heating.
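For reference, here is a minimal C sketch of the RC5 round as the algorithm is defined, to show how small the core is before any unrolling. It assumes 32-bit words (so a 64-bit block) and 12 rounds; neither is stated in this thread. The names rotl32, rotr32, rc5_encrypt, rc5_decrypt and ROUNDS are mine, and the key table S is assumed to be already expanded.

```c
#include <stdint.h>

#define ROUNDS 12  /* assumption: round count is not given in this thread */

/* Rotate left by the low 5 bits of n (the data-dependent rotation). */
uint32_t rotl32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Rotate right by the low 5 bits of n (inverse of rotl32). */
uint32_t rotr32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}

/* Encrypt one block (two 32-bit halves) in place.
 * S must hold 2 * (ROUNDS + 1) expanded key words. */
void rc5_encrypt(const uint32_t *S, uint32_t *a, uint32_t *b)
{
    uint32_t A = *a + S[0];
    uint32_t B = *b + S[1];
    for (int i = 1; i <= ROUNDS; i++) {
        A = rotl32(A ^ B, B) + S[2 * i];
        B = rotl32(B ^ A, A) + S[2 * i + 1];
    }
    *a = A;
    *b = B;
}

/* Decrypt one block in place by undoing the rounds in reverse order. */
void rc5_decrypt(const uint32_t *S, uint32_t *a, uint32_t *b)
{
    uint32_t A = *a;
    uint32_t B = *b;
    for (int i = ROUNDS; i >= 1; i--) {
        B = rotr32(B - S[2 * i + 1], A) ^ A;
        A = rotr32(A - S[2 * i], B) ^ B;
    }
    *b = B - S[1];
    *a = A - S[0];
}
```

Without a rotate instruction, each rotl32 costs two shifts, a subtract and an OR, as written above. This sketch keeps S in memory; the unrolled version discussed here keeps it in registers instead.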

> using the size flags wisely with the SIMD flag on, you SHOULD be able to
> do a core-width-independent code. i'm sure you can but this probably
> requires you to start from scratch, not from ansi or distributed.net
> sources.
Hmm, the objective is not to recode a client from scratch, but to keep the
same base client and write only the specific code that performs the
computation in F-CPU asm, and perhaps the core selection system.

But there is at least one problem if you want to write generic RC5 code for
F-CPU: you need to know how many keys you compute per pass and how big
the registers are. It's really hard to write a generic algorithm (see
Nicolas's post).

> >    I think that it's preferable to detect which CPU we are on at start and
> > then select the right core (like the current dnetc client does on x86).
> F-CPU can do even better.
To be clear: we would lose time detecting the CPU each time the function
is called (around 100 000 calls). So that is not the right way to solve the
problem. I will look for a good answer, but one that runs at startup, not
during the call.
 
> i'm SURE RC5 can work in SIMD mode on F-CPU, including FC0. with 128-bit
> or 256-bit cores, it could even be able to process 2 or 4 blocks at once.
> a 64-bit core can code/decode a 128-bit block. i have no source code but
> i'm pretty confident with this gut feeling.
I am sure too, but I want to see what the difference is compared to the
64-bit version.
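To make the comparison concrete, here is a C sketch of two 32-bit lanes packed in one 64-bit register, which is how I picture two keys being processed at once. It assumes that a SIMD rotate works on each 32-bit chunk independently with its own count; that is my reading, not a quote from the manual. The names simd_rotl32x2 and simd_add32x2 are mine.

```c
#include <stdint.h>

/* Rotate left by the low 5 bits of n. */
uint32_t rotl32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Rotate each 32-bit lane of x left by the count held in the
 * matching lane of n. */
uint64_t simd_rotl32x2(uint64_t x, uint64_t n)
{
    uint32_t lo = rotl32((uint32_t)x, (uint32_t)n);
    uint32_t hi = rotl32((uint32_t)(x >> 32), (uint32_t)(n >> 32));
    return ((uint64_t)hi << 32) | lo;
}

/* Add lane by lane: no carry crosses from the low lane to the high one. */
uint64_t simd_add32x2(uint64_t x, uint64_t y)
{
    uint32_t lo = (uint32_t)x + (uint32_t)y;
    uint32_t hi = (uint32_t)(x >> 32) + (uint32_t)(y >> 32);
    return ((uint64_t)hi << 32) | lo;
}
```

With wider registers the same pattern would simply have more lanes, which is why the number of keys per pass depends on the register size.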
 
> > > > PS2: Currently the main loop needs 1200 instructions with a real srotl
> > > > instruction, and 2300 instructions without it...
> > > As you can see, an instruction can influence other things : 1K2
> > > instructions require almost 5 Kbytes of code and an 8KB instruction cache
> > > is enough.
> > > But 2300 instructions require 9200 bytes and there would be some cache
> > > thrashing with only 8KB of cache...
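The arithmetic quoted above can be checked with a few lines of C. It assumes 4 bytes per instruction, which is what "2300 instructions require 9200 bytes" implies; the names code_bytes and fits_in_icache are mine.

```c
#include <stddef.h>

enum {
    INSN_BYTES   = 4,        /* assumption: fixed 32-bit instructions */
    ICACHE_BYTES = 8 * 1024  /* the 8KB instruction cache from the thread */
};

/* Size in bytes of a loop made of `instructions` instructions. */
size_t code_bytes(size_t instructions)
{
    return instructions * INSN_BYTES;
}

/* Nonzero if the whole loop fits in the instruction cache. */
int fits_in_icache(size_t instructions)
{
    return code_bytes(instructions) <= ICACHE_BYTES;
}
```

So 1200 instructions take 4800 bytes and fit in 8192, while 2300 instructions take 9200 bytes and do not.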

> > > However, i wonder if there is a way to "factor" some code from the
> > > core and reduce the code size. there _should_ be a way to minimize this
> > > code.
> > I think that it is not possible, because I use all the registers and each
> > line is different.

> héhé :-P
> from what my book says, you're trying to grok an already optimised code.
> if we restart from the definition, you'll see it's almost straight-forward. 
> good night Petit Scarabée,
So, OK: if we want to reduce the code size, we will need to put data in
memory and manipulate some stupid table... I would rather have big code (but
smaller than 8KB) than do stupid operations in memory, leave registers
unused, and forget that I am not on an x86 CPU and that I have 63 registers!
 

A+
  Cedric