
Re: [f-cpu] RC5, F-CPU and srotl

To: f-cpu@seul.org
Subject: Re: [f-cpu] RC5, F-CPU and srotl
From: Cedric BAIL <cedric.bail@free.fr>
Date: Mon, 08 Apr 2002 01:21:13 +0200 (MEST)
Delivered-To: archiver@seul.org
Delivered-To: f-cpu-outgoing@seul.org
Delivered-To: f-cpu@seul.org
Delivery-Date: Sun, 07 Apr 2002 19:21:15 -0400
In-Reply-To: <3CB0B928.865D54EF@f-cpu.org>
References: <20020407165956.6B13CAB@postfix2-1.free.fr> <3CB0B928.865D54EF@f-cpu.org>
Reply-To: f-cpu@seul.org
Sender: owner-f-cpu@seul.org
User-Agent: IMP/PHP IMAP webmail program 2.2.42

hi,

> what does that mean ?
It means that currently only the low part of the rotation parameter
defines the rotation amount for all the chunks.
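
To make the difference concrete, here is a minimal Python sketch (a model,
not F-CPU code). It assumes a 64-bit register holding two 32-bit chunks and
a rotation count taken modulo 32; the function names are invented for this
illustration and are not part of any F-CPU tool.

```python
MASK32 = 0xFFFFFFFF

def rotl32(value, count):
    """Rotate a 32-bit value left by count (taken modulo 32)."""
    count &= 31
    value &= MASK32
    return ((value << count) | (value >> (32 - count))) & MASK32

def simd_rotl_shared_count(reg, counts):
    """Model of the behaviour described above (assumption): both chunks
    are rotated by the count found in the LOW chunk of `counts`."""
    count = counts & MASK32
    lo = rotl32(reg & MASK32, count)
    hi = rotl32((reg >> 32) & MASK32, count)
    return (hi << 32) | lo

def simd_rotl_per_chunk(reg, counts):
    """Model of a 'real' SIMD rotate (assumption): each chunk is rotated
    by the count stored in the matching chunk of `counts`."""
    lo = rotl32(reg & MASK32, counts & MASK32)
    hi = rotl32((reg >> 32) & MASK32, (counts >> 32) & MASK32)
    return (hi << 32) | lo
```

RC5 needs the second form, because each key being tested produces its own
data-dependent rotation count.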
 
> thanks for the code ! however, i don't know if i have enough time
> to read it tonight (i have to setup my homebrewed LFS).
;-)
 
> > %macro simdrotl 3
> >         shiftri 32, %1, %3
> >         rotl.q  %2, %1, %1
> >         shiftri 32, %2, %2
> >         rotl.q  %2, %3, %3
> >         shiftri 32, %3, %3
> >         or      %3, %1, %1
> > %endmacro
> 
> i am puzzled.
> 
> first i am surprised that you have to use this kind of code,
> i would have expected that a SIMD rotation already existed.
> you shouldn't have to write this kind of macro.
So you understand the problem.

> second, i would have written this differently (remember
> that there is at least 1 cycle of latency for the shifts)
 
>          shiftri 32, %1, %3
>          shiftri 32, %2, %2  // swapping this one saves some cycles
>          rotl.q  %2, %1, %1
>          rotl.q  %2, %3, %3
>          shiftri 32, %3, %3  //
>          or      %3, %1, %1  // would a packing operation work here ?
OK, that is not so different.
 
> i am almost sure that a packing operation would avoid the
> last 2 instructions.
What do you mean by a packing operation?
 
> But the first problem is still the most annoying : i had expected that
> Michael supported "real" SIMD operation. This comforts me in thinking
> that i have to write my own shift unit.
 
> The problem is not to prove that there is (or not) an algorithm that uses
> a specific opcode variant, it is more : we have to design an "orthogonal"
> instruction set which allows all (or as many as possible) combinations
> of parameters. This eases the design of the core, the compiler and the
> applications.
I really agree with you.
 
> From my point of view, it would not be really expensive. I wonder
> what Michael thinks about this but after all, the SHL unit is "just
> a bunch of multiplexers"...
 
> >         The second point is about the storem and loadm operations.
> > This algorithm saturates the whole register bank, so we need to save all
> > the registers before the start of the loop. The problem is that storem
> > and loadm currently need a register that contains the number of registers
> > to save... That is stupid: to save all the registers we need to do :
> >         storei          8, R1, R63
> >         loadconsx.0     62, R1
> >         storem          R1, R2, R63
> > The solution is easy : storem 63, R1, R3 ...

> do you mean that 2 forms of the storem instruction are needed ?
I think that only the immediate form is needed.

> if an immediate and a register form are enough, why not, though
> the scheduling is quite different... This is why the operands
> must be _all_ immediate or _all_ registers_ (otherwise it becomes
> really complex).
> 
> don't forget however that the pointer must be in the middle position,
> so you would write either
>   storem/loadm r1,[r2],r3
> or
>   storem/loadm imm8,[r2],imm6
Oops, I made an error. I meant this :
storem/loadm imm6(1), imm6(2), [r3]

I think that it would do :
for i := imm6(1) to imm6(1) + imm6(2) do
  store 8, ri, [r3]
done

or something like that.
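
The loop above can be modelled in a few lines of Python. This is only a
sketch under stated assumptions: the second immediate is read as the number
of registers to store (the inclusive bound in the pseudocode would store
one more), each store writes 8 bytes and advances the address by 8, and
the name storem_imm and its parameters are invented for the illustration.

```python
def storem_imm(regs, memory, first, count, base):
    """Store `count` consecutive registers, starting at register `first`,
    to `memory` from byte address `base`. Returns the next free address.

    `regs` is a list of 64 register values and `memory` is a dict keyed by
    byte address (both are assumptions of this model).
    """
    if first < 0 or count < 0 or first + count > len(regs):
        raise ValueError("register range out of bounds")
    addr = base
    for i in range(first, first + count):
        memory[addr] = regs[i]  # one 8-byte store per register
        addr += 8               # assumed post-increment of the pointer
    return addr
```

With both immediates encoded in the instruction, no register has to be
sacrificed to hold the count, which is the point when the whole register
bank is live.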

> but this forces to add an imm6 field where there is nothing yet.
> that's ugly.
I don't understand where the imm8 comes from. We only have 63 registers,
right?

> >         And now, I have a question about a feature that is not really
> > clear to me: the size register. I did not really understand what it
> > specifies. Does it say how many chunks the register is divided into? Does
> > it say how big the chunks are (but in that case, how many are there)? Or
> > something else?
> There are two things to consider : the size of the register and the
> size of the sub-parts.
> * When the SIMD flag is set, the register is implicitly considered as
> being the widest. The size attributes specify the size of the chunks.
> More specifically : the whole register is written back.
> * When there is no SIMD flag, the register size is defined by the size flags.
> Only the LSB of the register is written, depending on the attributes.
Ok, it's clearer now.
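
As I understand the two rules, the write-back could be modelled like this.
It is a Python sketch only: it assumes a 64-bit register and that the bits
above the selected size keep their old value when the SIMD flag is not set
(the explanation says only the LSB part is written). The function name and
parameters are invented for the illustration.

```python
def write_back(old_reg, result, size_bits, simd, reg_bits=64):
    """Return the new register value after an operation writes `result`.

    simd=True : the whole register is written back.
    simd=False: only the low `size_bits` bits are written; the upper bits
                are assumed to keep their previous value.
    """
    full_mask = (1 << reg_bits) - 1
    if simd:
        return result & full_mask
    low_mask = (1 << size_bits) - 1
    return (old_reg & ~low_mask & full_mask) | (result & low_mask)
```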
 
> >         Good news to finish: 63 registers are enough (I only need one
> > more ;-), I think that I will find it). And our RC5 algorithm only uses
> > registers and never does any operation in memory... that is really great,
> > no other core can compute 4 keys at the same time directly in registers.
> are you sure ? I'd consider IA64 as a tough competitor, here.
I did not find a core designed specially for IA64. (I must say that within
a scope we need 64 registers, and not only a part of them.)

> > PS1: I hope that I will finish the rc5_f-cpu core by next week. I think
> > that coding a real algorithm on F-CPU is a good way to see what we do
> > wrong or right. I think that I must design another version for an F-CPU
> > that will have 128-bit or 256-bit registers.
> can you make a "generic" version that scales easily with the core's size ?
> by using the size flags, you should be able to find a way to do that.
I think that this is not a good way to get performance, because every n keys
we would call a core-detect function to know which platform we are on and
then adapt our code to this platform.
   I think that it is preferable to detect which CPU we are on at start and
then select the right core (like the current dnetc client does on x86).
But why not: I have not thought about the question yet, and it is perhaps
possible to do that easily (I only wonder about the test function, but I will
see).
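
The "detect once at start, then select the core" idea could look like the
following Python sketch. Everything here is hypothetical: the core names,
the table and the select_core function are placeholders, and how the
register width is actually detected is left out.

```python
# Hypothetical table: register width in bits -> name of the matching core.
CORES = {64: "rc5_core_64", 128: "rc5_core_128", 256: "rc5_core_256"}

def select_core(register_bits, cores=CORES):
    """Pick the widest available core that fits the detected width.

    Meant to be called once at start-up, so the main loop never has to
    test which platform it is running on.
    """
    usable = [width for width in cores if width <= register_bits]
    if not usable:
        raise ValueError("no core for %d-bit registers" % register_bits)
    return cores[max(usable)]
```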

> > PS2: Currently the main loop needs 1200 instructions with a real srotl
> > instruction, and without it, it needs 2300 instructions...
> As you can see, an instruction can influence other things : 1K2
> instructions requires almost 5 Kbytes of code and a 8KB instruction cache
> is enough.
> But 2300 instructions require 9200 bytes and there would be some cache
> thrashing with only 8KB of cache...
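
The arithmetic behind these figures, as a small Python check. It assumes
4 bytes per instruction (which is what 2300 instructions = 9200 bytes
implies) and the 8KB instruction cache mentioned above; the names are
invented for the illustration.

```python
INSTRUCTION_BYTES = 4      # assumed fixed instruction size
ICACHE_BYTES = 8 * 1024    # 8KB instruction cache from the discussion

def loop_footprint(instruction_count):
    """Code size in bytes of a loop with the given instruction count."""
    return instruction_count * INSTRUCTION_BYTES

def fits_in_icache(instruction_count):
    """True if the whole loop fits in the instruction cache."""
    return loop_footprint(instruction_count) <= ICACHE_BYTES
```

So the 1200-instruction loop (4800 bytes) fits, while the 2300-instruction
loop (9200 bytes) is about 1000 bytes over the cache size.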

> However, i wonder if there is a way to "factor" some code from the
> core and reduce the code size. there _should_ be a way to minimize this
> code.
I think that it is not possible, because I use all the registers and each
line is different.

A+
  Cedric


