[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [f-cpu] RC5, F-CPU and srotl

hi !

i just read the chapter about RC5 in Bruce Schneier's book :
it looks very interesting and i wonder why the code size
can't be reduced ...

Cedric BAIL wrote:
> hi,
> > what does that mean ?
> That mean that currently only the low part of the rotation parameter
> define all the rotation of the different chunk.

ok i see here the problem is. i think it can be solved, at least with
my own SHL unit structure.

> > thanks for the code ! however, i don't know if i have enough time
> > to read it tonight (i have to setup my homebrewed LFS).
> ;-)
and i don't speak about the efforts and time required to read the list ;-P
i still haven't rebooted...

btw, i'm angry because KUDZU has thrashed my X configuration on the laptop.
i won't' be able to use ANY graphic application before i have rebuilt LFS and X.

btw, who has ever used DirectFB ? it's a small, simple and fast
alternative to X11.

> > > %macro simdrotl 3
> > >         shiftri 32, %1, %3
> > >         rotl.q  %2, %1, %1
> > >         shiftri 32, %2, %2
> > >         rotl.q  %2, %3, %3
> > >         shiftri 32, %3, %3
> > >         or      %3, %1, %1
> > > %endmacro
> >
> > i am puzzled.
> >
> > first i am surprised that you have to use this kind of code,
> > i would have expected that a SIMD rotation already existed.
> > you shouldn't have to write this kind of macro.
> So you understand the problem.

yup and i thought that Michael had implemented a more general version.

> > second, i would have written this differently (remember
> > that there is at least 1 cycle of latency for the shifts)
> >          shiftri 32, %1, %3
> >          shiftri 32, %2, %2  // swapping this one saves some cycles
> >          rotl.q  %2, %1, %1
> >          rotl.q  %2, %3, %3
> >          shiftri 32, %3, %3  //
> >          or      %3, %1, %1  // would a packing operation work here ?
> Ok, not so much different.
yup but from the performance point of view, there is a difference ;-)

> > i am almost sure that a packing operation would avoid the
> > last 2 instructions.
> What did you mean by a packing operation ?

look at page 129 (?) of your PDF F-CPU manual :


2.3.4 mix
mixl r3, r2, r1 mixh r3, r2, r1
Mix two halves of r3 and r2 and puts the result into r1.

[here is a little drawing that shows how it works]

Depending on the h ag, the lower or higher part of r3 and r2 are interleaved. The size of the
source chunks is determined by the size ags. This instruction is useful to interleave words in a butter y"
fashion or reverse a little matrix. Or simply it can be used to create an extended form of the result of an
addition with carry.


> > But the first problem is still the most annoying : i had expected that
> > Michael supported "real" SIMD operation. This comforts me in thinking
> > that i have to write my on shift unit.
> > The problem is not to prove that there is (or not) an algorithm that uses
> > a specific opcode variant, it is more : we have to design an "orthogonal"
> > instruction set which allows all (or as many as possible) combinations
> > of parameters. This eases the design of the core, the compiler and the
> > applications.
> I real agry with you

but this will solved one day, don't worry.

> > >         The second point, is about the storem and loadm operation,
> > > for this algorithm that saturate all the register bank, we need before the
> > > start of the loop to save all the registers. The problem is that the storem
> > > and loadm need actually a register that contain the number of registers to
> > > save... That stupid, to save all the register we need to do :
> > >         storei          8, R1, R63
> > >         loadconsx.0     62, R1
> > >         storem          R1, R2, R63
> > > The solution is easy : storem 63, R1, R3 ...
> > do you mean that 2 forms of the storem instruction are needed ?
> I think, that only the immediate form is needed.

mmmm and what about using them for the function prolog and epilog ?
two registers would contain the range of used registers and they
would be saved on the stack or restored. This would make the compiler
happier because there would be no determined register allocation.
only the stack pointer and the low and high bounds are "fixed"
(then the instruction would need automatic post-increment)...

but please forget this, because it's optional and SRB must work before.
We don't have a LSU yet, so it's really difficult to speak about this.

> > if an immediate and a register form are enough, why not, though
> > the scheduling is quite different... This is why the operands
> > must be _all_ immediate or _all_ registers_ (otherwise it becomes
> > really complex).
> >
> > don't forget however that the pointer must be in the middle position,
> > so you would write either
> >   storem/loadm r1,[r2],r3
> > or
> >   storem/loadm imm8,[r2],imm6
> "Oups", I have done a error, I mean this :
> storem/loadm imm6(1), imm6(2), [r3]

yep, but pointers are hardwired to the middle position.

> I think that it will do :
> for i := imm6(1) to imm6(1) + imm6(2) do
>   store 8, ri, [r3]
> done
> or some think like that.

no, because
 - please don't add an adder in the Critical DataPath
 - your store is likely to create a trap, but storem/loadm uses the SRB
   which is specified to not trap.
 - the SRB will "snoop" the Xbar, in case a register is to be saved AND used

> > but this forces to add an imm6 field where there is nothing yet.
> > that's ugly.
> A I don't understand where the imm8 came from ? We only have 63 registers,
> right ?

sure but the field in this location is an imm8 (with sign extension).
the value can be truncated, however.

> > >         And now, I have a question about a not really clear feature,the
> > > size register. I didn't really understand what they say. Did they say how
> > > many chunk divide the register ? Did they say how big the chunk are (but in
> > > that case how many are they ?) ? Or some thing else.
> > There are two things to consider : the size of the register and the
> > size of the sub-parts.
> > * When the SIMD flag is set, the register is implicitely considered as
> > being the widest. The size attributes specify the size of the chunks.
> > More specificly : the whole register is written back.
> > * When there is no SIMD flag, the register size is defined by the size flags.
> > Only the LSB of the register is written, depending on the attributes.
> Ok, it's clearer now.

i thought it was as clear in the manual :-)

> > >         A good news for the end, 63 registers is enough (I only need one
> > > more ;-), I think that I will find it). And our RC5 algorithm only use
> > > register and never do any operation in memory... that really great, no
> > > other core can compute in the same time 4 keys directly in registers.
> > are you sure ? I'd consider IA64 as a tough competitor, here.
> I didn't find a specially designed core for IA64.
it's just a matter of time...

> (I must say that during
> a scope we need 64 registers, and not only a part of them).
what is the block size you use ? 64-bits ? and how many rounds ?
If you store the coeffs in registers, then no wonder you need so many
registers. However, using postincremented loads, you can sustain your
throughput. What Bruce Schneier describes is pretty simple so your
implementation is probably too hairy... or optimized for a CPU
which is not at all adapted to this task (and there is not only one).
but i'm sure that even a SHARC DSP can do the job wihout heating.

> > > PS1: I hope that I will finish the rc5_f-cpu core for next week. I think
> > > that coding real algorithm on F-CPU is a good start to see what we does
> > > wrong or good. I think that I must design an other version for a F-CPU
> > > that will have 128 bits or 256 bits registers.
> > can you make a "generic" version that scales easily with the core's size ?
> > by using the size flags, you should be able to find a way to do that.
> I think that is not the good way to have performance, because every n keys
> we will call a core detect function to know on wich platform we are and then
> adapt our code to this platform.

using the size flags wisely with the SIMD flag on, you SHOULD be able to
do a core-width-independent code. i'm sure you can but this probably
requires you to start from scratch, not from ansi or distributed.net sources.

>    I think, that it's preferable to detect on wich CPU we are at start and
> then select the good core. (like what the current dnetc client do on x86).
F-CPU can do even better.

> But why not, I didn't think to the question, but it is perhaps possible to
> do that easily (I only ask me about the test function, but I will see).

i'm SURE RC5 can work in SIMD mode on F-CPU, including FC0. with 128-bit
or 256-bit cores, it could even be able to process 2 or 4 blocks at once.
a 64-bit core can code/decode a 128-bit block. i have no source code but
i'm pretty confident with this gut feeling.

> > > PS2: Actually the main loop need 1200 instructions with a real srotl
> > > instruction, and without it need 2300 instructions...
> > As you can see, an instruction can influence other things : 1K2
> > instructions requires almost 5 Kbytes of code and a 8KB instruction cache
> > is enough.
> > But 2300 instructions require 9200 bytes and there would be some cache
> > thrashing with only 8KB of cache...
> > However, i wonder if there is a way to "factor" some code from the
> > core and reduce the code size. there _should_ be a way to minimize this
> > code.
> I think that it is not possible, because I use all the registers and each
> line are different.

héhé :-P

from what my book says, you're trying to grok an already optimised code.
if we restart from the definition, you'll see it's almost straight-forward.

good night Petit Scarabée,

> A+
>   Cedric
when can we meet again and work on the manual ?
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/