[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [f-cpu] RC5, F-CPU and srotl



hi !

cedric wrote:
> Hi,
> 
>         Last week, Michael Riepe explain me how the rot and shift simd
> instruction currently work in the FC0. He answers that it wasn't really a simd
> shift and rotation operation.

what does that mean ?

>         So I ask some friend to find algorithm that need rotation and that can be
> easily port on a simd processor. The answer whas the rc5 algorithm from the
> distributed.net project (If you don't know this project look at
> www.distributed.net) . I started to port it on the FC-0 (The file that is
> attached to this post contain the C ansi version of this algorithm). I
> selected the rc5ansi_4-rg.cpp version who work on 4 keys in a single pass
> and approximatively need 62 registers.

thanks for the code ! however, i don't know if i have enough time
to read it tonight (i have to setup my homebrewed LFS).

>         My code is currently not finished, but I have some little think to say.
> First I did'nt see the need of the currently srotl.* instruction. If you
> prefer, in this code, we need a srotl instruction, but I must use this
> version :
> %macro simdrotl 3
>         shiftri 32, %1, %3
>         rotl.q  %2, %1, %1
>         shiftri 32, %2, %2
>         rotl.q  %2, %3, %3
>         shiftri 32, %3, %3
>         or      %3, %1, %1
> %endmacro

i am puzzled.

first i am surprised that you have to use this kind of code,
i would have expected that a SIMD rotation already existed.
you shouldn't have to write this kind of macro.

second, i would have written this differently (remember
that there is at least 1 cycle of latency for the shifts)

         shiftri 32, %1, %3
         shiftri 32, %2, %2  // swapping this one saves some cycles
         rotl.q  %2, %1, %1
         rotl.q  %2, %3, %3
         shiftri 32, %3, %3  //
         or      %3, %1, %1  // would a packing operation work here ?

i am almost sure that a packing operation would avoid the
last 2 instructions.

But the first problem is still the most annoying : i had expected that
Michael supported "real" SIMD operation. This comforts me in thinking that
i have to write my on shift unit.

> The other rotation operation need only immediate. And as you can see,
> for the srotl, I will need more register if I want to compute 16 bits data, or
> if the size of the register became bigger. (And I currently didn't find an
> algorithm that use srotl as we design it).
The problem is not to prove that there is (or not) an algorithm that uses
a specific opcode variant, it is more : we have to design an "orthogonal"
instruction set which allows all (or as many as possible) combinations
of parameters. This eases the design of the core, the compiler and the applications.

>         An other thing about the srotl, it currently double the asm code.
> It mean that, if we have a real srotl operation, we will have the same or
> better performance than the K7 core (who is actually the best for this
> algorithm). So my question is what is the cost of having a real srotl
> instruction ?

From my point of view, it would not be really expensive. I wonder
what Michael thinks about this but after all, the SHL unit is "just
a bunch of multiplexers"...

>         The second point, is about the storem and loadm operation,
> for this algorithm that saturate all the register bank, we need before the
> start of the loop to save all the registers. The problem is that the storem
> and loadm need actually a register that contain the number of registers to
> save... That stupid, to save all the register we need to do :
>         storei          8, R1, R63
>         loadconsx.0     62, R1
>         storem          R1, R2, R63
> The solution is easy : storem 63, R1, R3 ...

do you mean that 2 forms of the storem instruction are needed ?
if an immediate and a register form are enough, why not, though
the scheduling is quite different... This is why the operands
must be _all_ immediate or _all_ registers_ (otherwise it becomes
really complex).

don't forget however that the pointer must be in the middle position,
so you would write either
  storem/loadm r1,[r2],r3
or
  storem/loadm imm8,[r2],imm6

but this forces to add an imm6 field where there is nothing yet.
that's ugly.

>         And now, I have a question about a not really clear feature, the
> size register. I didn't really understand what they say. Did they say how
> many chunk divide the register ? Did they say how big the chunk are (but in
> that case how many are they ?) ? Or some thing else.

There are two things to consider : the size of the register and the size
of the sub-parts.
* When the SIMD flag is set, the register is implicitely considered as
being the widest. The size attributes specify the size of the chunks.
More specificly : the whole register is written back.
* When there is no SIMD flag, the register size is defined by the size flags.
Only the LSB of the register is written, depending on the attributes.

>         Why this question, only because the rc5 algorithm is really the
> typical algorithm that can use big simd register (If you have 128bits
> registers, you can compute 8 keys in the same pass...). But if we work in 128
> bits the srotl operation will be abolutly needed !

obviously !

>         A good news for the end, 63 registers is enough (I only need one
> more ;-), I think that I will find it). And our RC5 algorithm only use
> register and never do any operation in memory... that really great, no other
> core can compute in the same time 4 keys directly in registers.

are you sure ? I'd consider IA64 as a tough competitor, here.

> A+
>   Cedric
> 
> PS1: I hope that I will finish the rc5_f-cpu core for next week. I think that
> coding real algorithm on F-CPU is a good start to see what we does wrong
> or good. I think that I must design an other version for a F-CPU that will
> have 128 bits or 256 bits registers.

can you make a "generic" version that scales easily with the core's size ?
by using the size flags, you should be able to find a way to do that.

> PS2: Actually the main loop need 1200 instructions with a real srotl
> instruction, and without it need 2300 instructions...

As you can see, an instruction can influence other things : 1K2 instructions
requires almost 5 Kbytes of code and a 8KB instruction cache is enough.
But 2300 instructions require 9200 bytes and there would be some cache thrashing
with only 8KB of cache...

However, i wonder if there is a way to "factor" some code from the core
and reduce the code size. there _should_ be a way to minimize this code.

WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

PS : looks like i'll have to read Bruce Schneier's book for remembering
how RC5 works.
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/