
Re: [f-cpu] 3r2w -> 2r1w



hi,

nicO wrote:
> 
> We have already spoken about that. I believe that Whygee was ok about it.
> But a recent post seems to say the opposite.

i think that it is not yet completely clear.

> So i propose to change all references in the manual that describe register
> access as rn and rn+1 to rn with n even and rn+1. So why ?

currently, i use rn and rn xor 1. this is handy and friendly
when one unrolls a loop once (there is only one bit of difference
between the 2 iterations of the code and no need to re-schedule
all the instructions).
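
a small sketch of the difference between the two pairing rules may help.
it is only an illustration : the function names are hypothetical, and
the wrap-around at 64 registers in the "+1" rule is an assumption.

```python
NUM_REGS = 64  # register count assumed for this sketch


def pair_xor(n: int) -> int:
    """Partner register under the 'rn xor 1' rule: flip the lowest bit."""
    return n ^ 1


def pair_plus_one(n: int) -> int:
    """Partner register under the 'rn+1' rule (wrap-around is an assumption)."""
    return (n + 1) % NUM_REGS


for n in (4, 5):
    # With xor, r4 and r5 point at each other; with +1, r5 pairs with r6.
    print(f"r{n}: xor partner r{pair_xor(n)}, +1 partner r{pair_plus_one(n)}")
```

with the xor rule both registers always sit in the same aligned pair,
so an unrolled iteration only has to flip one bit of each register number.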

> Because it becomes possible to have only 3 ports to access the register
> bank instead of 5. Most 3r2w instructions only need access to an
> Rn+1 register, and this adds 2 complete new access ports to the memory and an
> incrementer unit.

AFAIK there is no 3R2W instruction "as such". However it is
true that the register set must sustain this rate, at least when
it supports the 3R1W or the 2R2W operations.

> If we align accesses, we can manage a 128-bit data path. so in fact, the
> registers are packed by 2. So all instructions which need more than one
> register will use the adjacent one. If we want access to a single register
> we simply use muxes.
> 
> There are 2 main reasons to use fewer access ports : -the speed of the
> register bank and -the availability of such memory.
> 
> The more ports you put on a memory, the slower it runs, that's physics.
i won't contradict you on that point ;-) that's why there is
one full clock cycle for the register reads.

> Usually silicon foundries provide memory generators for dual ported memory and
> nothing else (it is the same for FPGA). So (as for leon), we could use 2
> areas of such memory to produce the 3 needed ports for the memory (so we
> duplicate data).
to be more precise, i have read that LEON uses 2 banks of 1R1W :
the written data is sent to both W ports, and the 2 R ports
provide data in parallel.
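
as a rough behavioural sketch of that arrangement (not LEON's actual
code : the class and method names are hypothetical, and the 64-entry
size is an assumption) :

```python
class Bank1R1W:
    """Behavioural model of a memory with one read port and one write port."""

    def __init__(self, size: int) -> None:
        self.cells = [0] * size

    def read(self, addr: int) -> int:
        return self.cells[addr]

    def write(self, addr: int, value: int) -> None:
        self.cells[addr] = value


class RegFile2R1W:
    """Two mirrored 1R1W banks presented as one 2R1W register file."""

    def __init__(self, size: int = 64) -> None:  # size is an assumption
        self.bank_a = Bank1R1W(size)
        self.bank_b = Bank1R1W(size)

    def write(self, addr: int, value: int) -> None:
        # The single write is broadcast to both banks so they stay identical.
        self.bank_a.write(addr, value)
        self.bank_b.write(addr, value)

    def read2(self, addr_a: int, addr_b: int) -> tuple[int, int]:
        # Each bank serves one of the two simultaneous reads.
        return self.bank_a.read(addr_a), self.bank_b.read(addr_b)


rf = RegFile2R1W()
rf.write(3, 42)
rf.write(7, 99)
print(rf.read2(3, 7))
```

the price is that every register is stored twice, and there is still
only one write per cycle.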

however, some arithmetic shows that S(2R1W) > 2*S(1R1W).

> In fact, asking the foundry to provide a specific item costs a
> lot of money, so at least for the Fc0, i should consider that fact.
> 
> In my company they will use leon. And they will never have the money to ask
> for multiported memory (don't even think of using arrays of flipflops if
> you want speed !!).
> 
> So each EU could receive 2 128-bit data words and write ONE 128-bit data word. Write
> enable could be used to minimise power consumption when only one byte is
> written.
> 
> Comments ?

there is a big problem if you can't write 2 different data words
(at different addresses). The 2R2W instructions are only
one side of the problem and you have overlooked a detail :
instructions such as "load post-incremented" (a 2R2W instruction)
perform an addition and a load in parallel. The result of
the load will be ready faster (3 cycles starting from the decoding
stage) than the addition (dec+xbar+asu1+asu2+xbar=5 cycles min.,
without even counting the TLB and the cache lookup). Do you see
where this is going ?

If you issue several loads in a row (say : 6), the last
ones will be stalled even if all the data is ready,
because your bus will not sustain the 2 writes per cycle
at different addresses. It is not a solution to delay the
fastest data because the situation would become even worse
(you know that memory access speed is critical for computer
performance).
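
a toy model makes the contention visible. it is only a sketch under
simplifying assumptions that are not part of the design : one
post-incremented load is issued per cycle, results wait in a queue for
a free write port, and TLB and cache latency are ignored. the function
name is hypothetical; the two latencies are the ones quoted above.

```python
from collections import deque

LOAD_LATENCY = 3  # cycles from decode until the loaded data is ready
ADD_LATENCY = 5   # dec + xbar + asu1 + asu2 + xbar


def last_writeback_cycle(num_loads: int, write_ports: int) -> int:
    """Cycle in which the final result reaches the register set."""
    # The load issued at cycle i produces two results: data and new pointer.
    ready = sorted(
        [i + LOAD_LATENCY for i in range(num_loads)]
        + [i + ADD_LATENCY for i in range(num_loads)]
    )
    pending = deque(ready)
    cycle = 0
    last = 0
    while pending:
        written = 0
        # Retire ready results, oldest first, up to the number of write ports.
        while pending and pending[0] <= cycle and written < write_ports:
            pending.popleft()
            written += 1
            last = cycle
        cycle += 1
    return last


print(last_writeback_cycle(6, write_ports=2))
print(last_writeback_cycle(6, write_ports=1))
```

in this model, 6 loads finish writing back at cycle 10 with 2 write
ports and at cycle 14 with a single one, and the gap keeps growing with
the number of loads.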

However, if you can't "pay" for multiported SRAM (whatever
the configuration, say 3*(2W1R)), you have the possibility to
"downgrade" your implementation : you will not support
any 2W operation and you will reduce the scheduler to 1 write
per cycle. The other operations will have to be emulated
and the performance will suffer a lot... but it is still
possible. You will still be able to run "out of order completion"
code but with limited features and a low tolerance for
bus contention.


As a side note, i'm realistic enough to see that we won't
be able to "implement" the FC0 for a long time (how many years ?)
so i am not concerned "directly" with this issue. It is interesting
to see how we can modify the architecture with respect to this
economic problem but you know that this industry evolves very quickly.
Too quickly for us.


my conclusion is : as long as you can sustain 2 writes per cycle,
it's ok for me. Even if you have to use 4 register banks which are
addressed with their 2 LSB (i use the analogy with a 4-way set associative
cache, so in this case there are 4 register banks of 16 registers each).
128-bit registers are not a "decent solution" IMHO : it's really too large
and not useful in 80% of the cases where 2W/cycle is necessary.
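
to illustrate the 4-bank idea, here is a minimal sketch. the helper
names are hypothetical, and it assumes that each bank has a single
write port, so that 2 writes can share a cycle only when they fall in
different banks.

```python
NUM_BANKS = 4        # banks selected by the 2 LSB of the register number
REGS_PER_BANK = 16   # 4 * 16 = 64 registers in total


def bank_of(reg: int) -> int:
    """Bank selected by the 2 least significant bits."""
    return reg & (NUM_BANKS - 1)


def index_in_bank(reg: int) -> int:
    """Row inside the bank, taken from the remaining upper bits."""
    return reg >> 2


def can_write_same_cycle(reg_a: int, reg_b: int) -> bool:
    """True when the two destinations do not collide on one bank."""
    return bank_of(reg_a) != bank_of(reg_b)


print(can_write_same_cycle(4, 5))  # neighbouring registers, different banks
print(can_write_same_cycle(4, 8))  # same 2 LSB, so both land in bank 0
```

note that rn and rn xor 1 always differ in their lowest bit, so such a
pair never collides on a bank.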

i hope we will be able to discuss this matter in depth.
i am currently designing the scheduler so it's the right time.

merry christmas everybody, btw !

> nicO
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~