[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [f-cpu] RC5, F-CPU and srotl



On Sun, Apr 07, 2002 at 11:57:35PM +0200, cedric wrote:
> Hi,
> 
> 	Last week, Michael Riepe explain me how the rot and shift simd
> instruction currently work in the FC0. He answers that it wasn't really a simd
> shift and rotation operation.

Depends on your definitions of `really' and `simd' ;)

> 	So I ask some friend to find algorithm that need rotation and that can be
> easily port on a simd processor. The answer whas the rc5 algorithm from the
> distributed.net project (If you don't know this project look at 
> www.distributed.net) . I started to port it on the FC-0 (The file that is  
> attached to this post contain the C ansi version of this algorithm). I 
> selected the rc5ansi_4-rg.cpp version who work on 4 keys in a single pass 
> and approximatively need 62 registers.
> 
> 	My code is currently not finished, but I have some little think to say.
> First I did'nt see the need of the currently srotl.* instruction. If you 
> prefer, in this code, we need a srotl instruction, but I must use this 
> version :
> %macro simdrotl 3
>         shiftri 32, %1, %3
>         rotl.q  %2, %1, %1
>         shiftri 32, %2, %2
>         rotl.q  %2, %3, %3
>         shiftri 32, %3, %3
>         or      %3, %1, %1
> %endmacro

I guess that should read `shiftli 32, %3, %3', right?

> The other rotation operation need only immediate. And as you can see,
> for the srotl, I will need more register if I want to compute 16 bits data, or
> if the size of the register became bigger. (And I currently didn't find an 
> algorithm that use srotl as we design it).

Here's a 16-bit version, using slightly less than 3 instructions per
slice, and only 2 registers:

	rotl.d  %2, %1, %1
	shiftri 16, %2, %2
	rotri   16, %1, %1
	rotl.d  %2, %1, %1
	shiftri 16, %2, %2
	rotri   16, %1, %1
	rotl.d  %2, %1, %1
	shiftri 16, %2, %2
	rotri   16, %1, %1
	rotl.d  %2, %1, %1
	rotr    16, %1, %1

> 	An other thing about the srotl, it currently double the asm code.
> It mean that, if we have a real srotl operation, we will have the same or 
> better performance than the K7 core (who is actually the best for this 
> algorithm). So my question is what is the cost of having a real srotl
> instruction ?

A more complex shifter unit, with more delay.

> 	The second point, is about the storem and loadm operation,
> for this algorithm that saturate all the register bank, we need before the
> start of the loop to save all the registers. The problem is that the storem
> and loadm need actually a register that contain the number of registers to 
> save... That stupid, to save all the register we need to do :
> 	storei		8, R1, R63
> 	loadconsx.0	62, R1
> 	storem		R1, R2, R63
> The solution is easy : storem 63, R1, R3 ...

A while ago (precisely: Aug 15, 2001), I wrote this:

	- The loadm/storem has a surprising operand order
	  (start,src/dest,count), and it's not clear whether the
	  register *numbers* or the register *contents* serve as the
	  start/count values.  I suggest the former, and I would also
	  change the operands to (firstreg, lastreg, memaddr) which is
	  much easier to grok for humans.

Two days later, Yann replied:

	 - whether it is the contents or the value of the address does
	   not change much except that the value is know 2 cycles
	   before or after. i'd prefer to use the register number than
	   its value, though, if possible.  though using the register
	   contents might also help.

That is, the first and third operand can be considered immediate operands
in disguise.

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/