[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[f-cpu] RC5, F-CPU and srotl
Hi,
Last week, Michael Riepe explain me how the rot and shift simd
instruction currently work in the FC0. He answers that it wasn't really a simd
shift and rotation operation.
So I ask some friend to find algorithm that need rotation and that can be
easily port on a simd processor. The answer whas the rc5 algorithm from the
distributed.net project (If you don't know this project look at
www.distributed.net) . I started to port it on the FC-0 (The file that is
attached to this post contain the C ansi version of this algorithm). I
selected the rc5ansi_4-rg.cpp version who work on 4 keys in a single pass
and approximatively need 62 registers.
My code is currently not finished, but I have some little think to say.
First I did'nt see the need of the currently srotl.* instruction. If you
prefer, in this code, we need a srotl instruction, but I must use this
version :
%macro simdrotl 3
shiftri 32, %1, %3
rotl.q %2, %1, %1
shiftri 32, %2, %2
rotl.q %2, %3, %3
shiftri 32, %3, %3
or %3, %1, %1
%endmacro
The other rotation operation need only immediate. And as you can see,
for the srotl, I will need more register if I want to compute 16 bits data, or
if the size of the register became bigger. (And I currently didn't find an
algorithm that use srotl as we design it).
An other thing about the srotl, it currently double the asm code.
It mean that, if we have a real srotl operation, we will have the same or
better performance than the K7 core (who is actually the best for this
algorithm). So my question is what is the cost of having a real srotl
instruction ?
The second point, is about the storem and loadm operation,
for this algorithm that saturate all the register bank, we need before the
start of the loop to save all the registers. The problem is that the storem
and loadm need actually a register that contain the number of registers to
save... That stupid, to save all the register we need to do :
storei 8, R1, R63
loadconsx.0 62, R1
storem R1, R2, R63
The solution is easy : storem 63, R1, R3 ...
And now, I have a question about a not really clear feature, the
size register. I didn't really understand what they say. Did they say how
many chunk divide the register ? Did they say how big the chunk are (but in
that case how many are they ?) ? Or some thing else.
Why this question, only because the rc5 algorithm is really the
typical algorithm that can use big simd register (If you have 128bits
registers, you can compute 8 keys in the same pass...). But if we work in 128
bits the srotl operation will be abolutly needed !
A good news for the end, 63 registers is enough (I only need one
more ;-), I think that I will find it). And our RC5 algorithm only use
register and never do any operation in memory... that really great, no other
core can compute in the same time 4 keys directly in registers.
A+
Cedric
PS1: I hope that I will finish the rc5_f-cpu core for next week. I think that
coding real algorithm on F-CPU is a good start to see what we does wrong
or good. I think that I must design an other version for a F-CPU that will
have 128 bits or 256 bits registers.
PS2: Actually the main loop need 1200 instructions with a real srotl
instruction, and without it need 2300 instructions...
rc5-ansi.tar.gz