[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[f-cpu] RC5, F-CPU and srotl


	Last week, Michael Riepe explain me how the rot and shift simd
instruction currently work in the FC0. He answers that it wasn't really a simd
shift and rotation operation.

	So I ask some friend to find algorithm that need rotation and that can be
easily port on a simd processor. The answer whas the rc5 algorithm from the
distributed.net project (If you don't know this project look at 
www.distributed.net) . I started to port it on the FC-0 (The file that is  
attached to this post contain the C ansi version of this algorithm). I 
selected the rc5ansi_4-rg.cpp version who work on 4 keys in a single pass 
and approximatively need 62 registers.

	My code is currently not finished, but I have some little think to say.
First I did'nt see the need of the currently srotl.* instruction. If you 
prefer, in this code, we need a srotl instruction, but I must use this 
version :
%macro simdrotl 3
        shiftri 32, %1, %3
        rotl.q  %2, %1, %1
        shiftri 32, %2, %2
        rotl.q  %2, %3, %3
        shiftri 32, %3, %3
        or      %3, %1, %1

The other rotation operation need only immediate. And as you can see,
for the srotl, I will need more register if I want to compute 16 bits data, or
if the size of the register became bigger. (And I currently didn't find an 
algorithm that use srotl as we design it).
	An other thing about the srotl, it currently double the asm code.
It mean that, if we have a real srotl operation, we will have the same or 
better performance than the K7 core (who is actually the best for this 
algorithm). So my question is what is the cost of having a real srotl
instruction ?

	The second point, is about the storem and loadm operation,
for this algorithm that saturate all the register bank, we need before the
start of the loop to save all the registers. The problem is that the storem
and loadm need actually a register that contain the number of registers to 
save... That stupid, to save all the register we need to do :
	storei		8, R1, R63
	loadconsx.0	62, R1
	storem		R1, R2, R63
The solution is easy : storem 63, R1, R3 ...

	And now, I have a question about a not really clear feature, the
size register. I didn't really understand what they say. Did they say how 
many chunk divide the register ? Did they say how big the chunk are (but in 
that case how many are they ?) ? Or some thing else.
	Why this question, only because the rc5 algorithm is really the
typical algorithm that can use big simd register (If you have 128bits 
registers, you can compute 8 keys in the same pass...). But if we work in 128 
bits the srotl operation will be abolutly needed !

	A good news for the end, 63 registers is enough (I only need one
more ;-), I think that I will find it). And our RC5 algorithm only use 
register and never do any operation in memory... that really great, no other 
core can compute in the same time 4 keys directly in registers.


PS1: I hope that I will finish the rc5_f-cpu core for next week. I think that 
coding real algorithm on F-CPU is a good start to see what we does wrong
or good. I think that I must design an other version for a F-CPU that will 
have 128 bits or 256 bits registers.

PS2: Actually the main loop need 1200 instructions with a real srotl 
instruction, and without it need 2300 instructions...