
Re: [f-cpu] 3r2w -> 2r1w



hi !

nicO wrote:
> Yann Guidon a écrit :
> > nicO wrote:
> > > So i propose to change all references in the manual concerning register
> > > access from rn and rn+1 to rn (with n even) and rn+1. So why ?
> > currently, i do rn and rn xor 1. this is handy and friendly
> It's not enough because you always need 5 ports.
Either you have the 2 write ports, or the core (despite the higher
frequency) runs software slower because of the contention on the
write bus. You have the choice...

> > > Usually, silicon foundries give memory generators for dual-ported memory
> > > and nothing else (it's the same for FPGAs). So (as for LEON), we could use
> > > 2 areas of such memory to produce the 3 needed ports for the memory (so we
> > > duplicate data).
> > to be more precise, i have read that LEON uses 2 banks of 1R1W,
> > the written data is sent to both the W ports and the R ports
> > provide data in parallel.
> Yep, ... to make a 2R1W register bank... as for us !
with the difference that we need 3 banks and this still does not
solve the problem if you don't have 2 write ports.
unless you find a better trick. i have the beginning of an idea,
which is explained later in my post.
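
to make the trick concrete, here is a minimal Python sketch of a 2R1W
register file built from two mirrored 1R1W banks (all the names are
hypothetical, it only models the port arithmetic) :

class Bank:
    """One 1R1W memory : one read and one write per cycle."""
    def __init__(self, nregs):
        self.mem = [0] * nregs

class ReplicatedRegFile:
    """2R1W register file made of two mirrored 1R1W banks."""
    def __init__(self, nregs=64):
        self.banks = [Bank(nregs), Bank(nregs)]

    def write(self, addr, value):
        # the single write port is broadcast to both banks,
        # so both copies always hold the same data
        for bank in self.banks:
            bank.mem[addr] = value

    def read2(self, addr_a, addr_b):
        # each bank serves one read port ; the addresses are independent
        return self.banks[0].mem[addr_a], self.banks[1].mem[addr_b]

rf = ReplicatedRegFile()
rf.write(3, 42)
rf.write(5, 99)
print(rf.read2(3, 5))   # (42, 99) : two reads per cycle, one write

replication scales the read ports (one more bank per port) but not the
write ports, which is exactly the limitation above.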

> > > So each EU could receive two 128-bit data and write ONE 128-bit datum.
> > > Write enables could be used to minimise power consumption when only one
> > > byte is written.
> > >
> > > Comments ?
> >
> > there is a big problem if you can't write 2 different data
> > (at different addresses). The 2R2W instructions are only
> > one side of the problem and you have overlooked a detail :
> > instructions such as "load post-incremented" (a 2R2W instruction)
> > perform an addition and a load in parallel. The result of
> > the load will be ready sooner (3 cycles starting from the decoding
> > stage) than the addition (dec+xbar+asu1+asu2+xbar = 5 cycles min.,
> > without even counting the TLB and the cache lookup). Do you see
> > where this goes ?
> 
> Humm, an add slower than a load : looks so odd !

it is because "load" is the instruction, and not what it "really" does.
"load" and "store" transfer data to and from the LSU. this is a
_deterministic_ instruction because it goes through the following units :
 - LSU
 - LSU alignment/shift
 - Xbar
 - R7 (Register set)
(in the reverse order for "store").

i said it : forget about the other cores ! in FC0, the load and store
instructions do NOT go to the memory ! it's the LSU's job, and the
decode stage only asks whether a given pointer has the associated data ready ;
if the loaded data is not ready, the instruction is stalled at the Xbar
stage.

On the other hand, for a load/store to work, it requires that the pointer
goes through the validation and fetch process, which is completely
hidden from the user, except concerning the time it takes.
When updating the pointer, the other side of the "fork" goes through
the following :
 - Dec  (reads the pointer's value)
 - Xbar
 - ASU1 (compute the new pointer value)
 - ASU2
 - TLB1  +  Xbar
 - TLB2  +  write R7
 - cache and LSU lookup (may require more cycles if there's a cache miss)
 - allocate and update the LSU line
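
to put rough numbers on the two halves, here is a small Python summary
(stage names from the lists above ; one cycle per stage is my assumption,
as are all the code names) :

DATA_RETURN = ["LSU", "LSU align/shift", "Xbar", "R7 write"]
POINTER_UPDATE = ["Dec", "Xbar", "ASU1", "ASU2",
                  "TLB1 + Xbar", "TLB2 + R7 write"]

for name, path in (("data return", DATA_RETURN),
                   ("pointer update", POINTER_UPDATE)):
    print(f"{name:14s}: {len(path)} stages : " + " -> ".join(path))

# the data side retires in ~3 cycles once the value sits in the LSU
# buffer, while the pointer side needs 5 cycles min. before it can
# write R7 : the two writebacks of one instruction land on different
# cycles, hence the interest of two independent write ports.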
 
the load/store instruction is both deterministic and exempt from
exceptions inside the "execution core". When it arrives at the LSU,
this unit takes the misses into account and manages the 8 buffers.
It communicates with the decoder through flags that indicate whether data
is ready or not, and whether a register has been validated as a pointer.
This way, TLB misses can fire only at the decode/xbar stage.
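
a minimal Python sketch of that flag handshake (a software model of what
is really hardware signals ; every name here is hypothetical) :

class LSULine:
    def __init__(self):
        self.is_pointer = False   # register validated as a pointer
        self.data_ready = False   # fetched data present in the buffer

class LSU:
    NUM_LINES = 8                 # the 8 buffers mentioned above
    def __init__(self):
        self.lines = {}           # register number -> LSULine

    def validate_pointer(self, reg):
        self.lines.setdefault(reg, LSULine()).is_pointer = True

    def fill(self, reg):
        self.lines[reg].data_ready = True

    def decoder_can_issue_load(self, reg):
        # the decode stage only asks : is this pointer's data ready ?
        line = self.lines.get(reg)
        return line is not None and line.is_pointer and line.data_ready

lsu = LSU()
lsu.validate_pointer(3)
print(lsu.decoder_can_issue_load(3))  # False : stalls at the Xbar stage
lsu.fill(3)
print(lsu.decoder_can_issue_load(3))  # True : the load completes quickly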

> Such an instruction will be split at the decode stage : one part goes
> through the ALU, the other through the load pipe. The load will be much
> (much) slower than the add, so the data will never be ready at the same
> time, so we could add a little delay from time to time.
read above : there is no "load pipe", at least not in the sense you know
from other CPUs. of course, accessing the memory is slower than doing
an addition, but it happens in the OTHER branch of the pipe.
Otherwise, if your "load pipe" contains the TLB lookup, you
introduce the possibility of an exception firing in the pipeline,
while our design goal is to avoid that completely.

> > If you issue several loads in a row (say : 6), the last
> > ones will be stalled even if all the data is ready,
> > because your bus will not sustain the 2 writes per cycle
> > at different addresses. It is not a solution to delay the
> > fastest data because the situation would become even worse
> > (you know that memory access speed is critical for computer
> > performance).
> ??? The data will come even more slowly compared to the core speed, so ?
i will make you a little drawing so you can understand.
in the last example, i assume that all the requested data is already
present in the LSU's buffer (a kind of L0 cache and reorder buffer
which contains 8*32 bytes).
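
in the meantime, here is a toy Python model of the example : 6
post-incremented loads, each retiring TWO results (data + updated
pointer), with all data assumed ready in the LSU buffer so that only
the write bus limits the rate (the function name is hypothetical) :

def cycles_to_retire(n_loads, write_ports):
    writes = 2 * n_loads            # data + pointer per load
    # ceiling division : the bus retires at most write_ports per cycle
    return -(-writes // write_ports)

for ports in (1, 2):
    print(f"{ports} write port(s) : {cycles_to_retire(6, ports)} cycles "
          f"to retire 6 post-incremented loads")
# 1 port : 12 cycles (the last loads stall) ; 2 ports : 6 cycles.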

> On another point, i propose to definitely add the 8-register load
> instruction.
???? this means that you transfer 8*64 = 512 bits at once, while our
maximal internal bus width is currently 256 bits !

> This is a kind of preload for loops. Having tried to manage prefetch on
> x86, i can say that the timing is very hard to get right : if you prefetch
> too early, the data can be scratched (evicted) and it goes slower. If you
> make it too late, you add some instructions to be performed, so it goes
> slower...
there is already a prefetch instruction in the f-cpu manual IIRC.

> This instruction could use the SRB hardware. It would load in chunks of
> registers ((R0)R1-R7, R8-R15, R16-R23, R24-R31, ...).
why do you want to mix prefetch and register load ? i really don't
understand your intention.
note that, on another side, the LSU is also in charge of recognizing
"memory streams" : it will issue contiguous memory fetches if it
sees that several contiguous fetches have triggered cache misses
in the past.
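
a naive Python sketch of such stream recognition (entirely hypothetical,
the actual LSU logic is not specified here) :

LINE_BYTES = 32                     # LSU buffer line size, as above

class StreamDetector:
    def __init__(self, threshold=2):
        self.last_miss_line = None
        self.run = 0
        self.threshold = threshold

    def on_cache_miss(self, addr):
        line = addr // LINE_BYTES
        if self.last_miss_line is not None and line == self.last_miss_line + 1:
            self.run += 1
        else:
            self.run = 0
        self.last_miss_line = line
        if self.run >= self.threshold:
            # contiguous misses : fetch the next line ahead of time
            print(f"prefetch line {line + 1} (stream detected)")

d = StreamDetector()
for a in (0, 32, 64, 96):           # four contiguous misses
    d.on_cache_miss(a)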

> > However, if you can't "pay" multiported SRAM (whatever
> > the configuration, say 3*(2W1R)), you have the possibility to
> > "downgrade" your implementation : you will not support
> > any 2W operation and you will reduce the scheduler to 1 write
> > per cycle. the other operations will have to be emulated
> > and the performance will suffer a lot... but it is still
> > possible. You will still be able to run "out of order completion"
> > code but with limited features and a low tolerance for
> > bus contention.
> 
> ???

in simpler words : with only 1 write port (even if it's 128-bit wide)
you won't be able to execute the following instruction chunk :

add r1,r2,r3
and r4,r5,r6

because the add completes in 2 cycles and the and in 1 cycle,
both results arrive at the write bus in the same cycle, which creates
a contention ("hazard" if you prefer).
the scheduler will have to delay the AND.
Having 2 write ports not only helps with 2R2W operations, but also
with badly scheduled instructions that finish at the same time,
and the above example is a case that statistically happens
more often than 2R2W instructions.
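
a tiny Python model of that hazard (latencies from the example above,
the scheduler itself is a hypothetical stand-in) :

def schedule(instrs, write_ports):
    busy = {}                        # cycle -> number of writebacks
    for issue, (name, latency) in enumerate(instrs):
        wb = issue + latency
        while busy.get(wb, 0) >= write_ports:
            wb += 1                  # slipped by the write-bus contention
        busy[wb] = busy.get(wb, 0) + 1
        print(f"{name} : issued cycle {issue}, writes back cycle {wb}")

schedule([("add r1,r2,r3", 2), ("and r4,r5,r6", 1)], write_ports=1)
# the add writes back at cycle 2 ; the and must slip to cycle 3.
schedule([("add r1,r2,r3", 2), ("and r4,r5,r6", 1)], write_ports=2)
# with 2 write ports, both retire at cycle 2 and nothing stalls.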

> > As a side note, i'm realistic enough to see that we won't
> > be able to "implement" the FC0 before a long time (how many years ?)
> > so i am not concerned "directly" with this issue. This is interesting
> > to see how we can modify the architecture with respect to this
> > economic problem but you know that this industry evolves very quickly.
> > Too quickly for us.
> Maybe, but we will always have a 20% time penalty, at least !

as long as we are conscious about it, it's ok.

> > my conclusion is : as long as you can sustain 2 writes per cycle,
> > it's ok for me. Even if you have to use 4 register banks which are
> > addressed with their 2 LSBs (i use the analogy with a 4-way set-associative
> > cache, so in this case there are 4 register banks of 16 registers each).
> > 128-bit registers are not a "decent solution" IMHO : it's really too large
> > and not useful in 80% of the cases where 2W/cycle is necessary.
> ?? Not useful, and you need 2 write ports only for one instruction : a
> post-incremented load !
no, as shown in the previous simple example.
of course, you can decide to have only one write port but the compiler
will be more complex (it will have to manage all the write hazards).
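
for completeness, a Python sketch of the 4-bank arrangement quoted above :
the 2 LSBs of the register number select the bank, the remaining bits
index within it, like a 4-way set-associative lookup (hypothetical model) :

NUM_BANKS = 4
banks = [[0] * 16 for _ in range(NUM_BANKS)]   # 4 banks * 16 regs = 64

def locate(reg):
    return reg & 3, reg >> 2        # (bank, index within the bank)

def dual_write(reg_a, val_a, reg_b, val_b):
    (ba, ia), (bb, ib) = locate(reg_a), locate(reg_b)
    if ba == bb:
        return False                # bank conflict : needs 2 ports per bank
    banks[ba][ia], banks[bb][ib] = val_a, val_b
    return True

print(dual_write(1, 10, 2, 20))     # True : banks 1 and 2, no conflict
print(dual_write(1, 10, 5, 50))     # False : both map to bank 1

two writes per cycle then succeed whenever the two destinations land in
different banks, which is what the scheduler would have to guarantee.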

> Do you remember the rule : if you add 10% more silicon, it must increase
> the speed by 10%, and i don't think that holds for the post-incremented load.
i'm sorry but :
 - the load and store instructions are _critical_ instructions in the RISC world.
 - IIRC the original remark was about 10% more critical datapath,
    and not 10% more silicon. we can put datapaths in parallel.
 - the "10%" rule is valid for "classical" designs, while in the F-CPU case
   we are in another "extreme" situation :
   * silicon surface is not "expensive" (the package and the test cost as
     much, so 10% more silicon does not "mechanically" mean a 10% higher price)
   * we consider that we are already at maximum speed, so operations
     can't be "compressed"

> > i hope we will be able to discuss this matter in depth.
> > i am currently designing the scheduler so it's the right time.
> I should write mine before you finish yours !! to compare.
the comparison will certainly be interesting, sure :-)

is it possible to meet you ?
i have to draw some pictures that will be useful for you.

> > merry christmas everybody, btw !
> The same !
>  nicO
> > > nicO
> > WHYGEE

-- 
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/