Re: [f-cpu] 3r2w -> 2r1w

To: f-cpu@seul.org
Subject: Re: [f-cpu] 3r2w -> 2r1w
From: nicO <nicolas.boulay@ifrance.com>
Date: Sun, 23 Dec 2001 13:19:49 -0500
Delivered-To: archiver@seul.org
Delivered-To: f-cpu-outgoing@seul.org
Delivered-To: f-cpu@seul.org
Delivery-Date: Sun, 23 Dec 2001 07:18:50 -0500
References: <3C253D09.67D2F7FB@ifrance.com> <3C25567E.6BD0BE9C@f-cpu.org>
Reply-To: f-cpu@seul.org
Sender: owner-f-cpu@seul.org

Yann Guidon wrote:
> 
> hi,
> 
> nicO wrote:
> >
> > We have already talked about that. I believe that Whygee was OK with it.
> > But a recent post seems to say the opposite.
> 
> i think that it is not yet completely clear.
> 
> > So i propose to change every reference in the manual that describes
> > register access as rn and rn+1 into rn (with n even) and rn+1. Why?
> 
> currently, i do rn and rn xor 1. this is handy and friendly

It's not enough, because you still need 5 ports.

> when one unrolls his loop once (there is only one bit of difference
> between the 2 iterations of the code and no need to re-schedule
> all the instructions).
> 
> > Because it becomes possible to access the register bank with only
> > 3 ports instead of 5. Most 3r2w instructions only need the extra
> > access for an Rn+1 register, and that alone adds 2 complete access
> > ports to the memory, plus an incrementer unit.
> 
> AFAIK there is no 3R2W instruction "as such". However this is
> true that the register set must sustain this rate, at least when
> it supports the 3R1W or the 2R2W operations.
> 
> > If we align accesses, we can manage a 128-bit data path, so in effect
> > the registers are packed in pairs. Every instruction that needs more
> > than one register then uses the adjacent one. If we want to access a
> > single register, we simply use a mux.
> >
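The aligned-pair layout proposed above can be sketched as a small behavioural model. This is only an illustration, assuming 64 registers of 64 bits stored as 32 rows of 128 bits; the class and method names are hypothetical and do not come from the F-CPU sources.

```python
MASK64 = (1 << 64) - 1


class PairedRegisterFile:
    """Hypothetical model: registers packed two per 128-bit row."""

    def __init__(self, n_regs=64):
        # Row k holds register 2k (even lane) and register 2k+1 (odd lane).
        self.rows = [[0, 0] for _ in range(n_regs // 2)]

    def read_pair(self, n):
        """One port access returns the whole aligned row as (even, odd)."""
        even, odd = self.rows[n >> 1]
        return even, odd

    def read_single(self, n):
        """Single-register read: fetch the row, then a 2:1 mux picks the lane."""
        return self.read_pair(n)[n & 1]

    def write(self, n, value):
        """Write one 64-bit lane; the other lane keeps its value (write enable)."""
        self.rows[n >> 1][n & 1] = value & MASK64


rf = PairedRegisterFile()
rf.write(6, 0xAAAA)
rf.write(7, 0xBBBB)
pair = rf.read_pair(6)      # same row whether addressed as r6 or r7
single = rf.read_single(7)
```

In this model rn and rn xor 1 always share a row, since `n >> 1` equals `(n ^ 1) >> 1`, so a single 128-bit read returns both of them.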
> > There are 2 main reasons to use fewer access ports: the speed of the
> > register bank and the availability of such memories.
> >
> > The more ports you put on a memory, the slower it runs; that's physics.
> i won't contradict you on that point ;-) that's why there is
> one full clock cycle for the register reads.
> 
> > Usually silicon foundries provide memory generators for dual-ported
> > memory and nothing else (it is the same for FPGAs). So (as LEON does)
> > we could use 2 blocks of such memory to produce the 3 ports we need
> > (so we duplicate the data).
> to be more precise, i have read that LEON uses 2 banks of 1R1W,
> the written data is sent to both the W ports and the R ports
> provide data in parallel.
>

Yep, ... to make a 2R1W register bank... as for us !
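The LEON-style arrangement described above (two 1R1W banks, every write sent to both, each read port served by its own bank) can be sketched as follows. It is a behavioural illustration only; the class names are made up for this example.

```python
class Bank1R1W:
    """A memory block with one read port and one write port."""

    def __init__(self, size):
        self.cells = [0] * size

    def read(self, addr):
        return self.cells[addr]

    def write(self, addr, value):
        self.cells[addr] = value


class RegFile2R1W:
    """2R1W register file built from two duplicated 1R1W banks."""

    def __init__(self, size=64):
        self.bank_a = Bank1R1W(size)
        self.bank_b = Bank1R1W(size)

    def write(self, addr, value):
        # The single write is broadcast so both copies stay identical.
        self.bank_a.write(addr, value)
        self.bank_b.write(addr, value)

    def read2(self, addr_a, addr_b):
        # Each read port uses its own bank, so the two reads never conflict.
        return self.bank_a.read(addr_a), self.bank_b.read(addr_b)


rf = RegFile2R1W()
rf.write(3, 11)
rf.write(5, 22)
operands = rf.read2(3, 5)
```

The cost is that the data is stored twice, which is the duplication mentioned in the quoted text.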
 
> however, some arithmetic shows that S(2R1W) > 2*S(1R1W).
> 
> > In fact, asking the foundry for a custom block costs a lot of
> > money, so at least for the FC0, that should be taken into account.
> >
> > In my company they will use LEON, and they will never have the money
> > to order multiported memory (don't even think of using an array of
> > flip-flops if you want speed !!).
> >
> > So each EU could receive two 128-bit operands and write ONE 128-bit
> > result. Write enables could be used to minimise power consumption when
> > only one byte is written.
> >
> > Comments ?
> 
> there is a big problem if you can't write 2 different data
> (at different addresses). The 2R2W instructions are only
> one side of the problem and you have overlooked a detail :
> instructions such as "load post-incremented" (a 2R2W instruction)
> perform an addition and a load in parallel. The result of
> the load will be ready faster (3 cycles starting from the decoding
> stage) than the addition (dec+xbar+asu1+asu2+xbar=5 cycles min.,
> without even counting the TLB and the cache lookup). Do you see
> where it goes ?
> 

Hmm, an add slower than a load: that looks so odd !

Such an instruction will be split at the decoder stage: one part goes
through the ALU, the other through the load pipe. The load will be much
(much) slower than the add, so the two results will never be ready at the
same time, and we could afford to add a little delay from time to time.

> If you issue several loads in a row (say : 6), the last
> ones will be stalled even if all the data is ready,
> because your bus will not sustain the 2 writes per cycle
> at different addresses. It is not a solution to delay the
> fastest data because the situation would become even worse
> (you know that memory access speed is critical for computer
> performance).
>
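The write-port contention argument above can be checked with a toy model. It assumes one post-incremented load is issued per cycle, that each one produces a load result after 3 cycles and an add result after 5 cycles (the figures quoted above, ignoring TLB and cache), and that results are written back in the order they become ready. The function name and the model itself are illustrative, not part of any F-CPU scheduler.

```python
def writeback_schedule(n_ops, write_ports, load_latency=3, add_latency=5):
    """Return (last_writeback_cycle, total_cycles_spent_waiting).

    Op i is issued at cycle i and yields two results: the loaded data at
    i + load_latency and the incremented pointer at i + add_latency.
    """
    ready = sorted(
        [i + load_latency for i in range(n_ops)]
        + [i + add_latency for i in range(n_ops)]
    )
    cycle = 0        # cycle currently being filled with writes
    used = 0         # write ports already taken in that cycle
    total_wait = 0
    for r in ready:
        if r > cycle:
            # Result is not ready yet: jump ahead to its ready cycle.
            cycle, used = r, 0
        elif used == write_ports:
            # All ports are busy this cycle: the result waits one more cycle.
            cycle, used = cycle + 1, 0
        used += 1
        total_wait += cycle - r
    return cycle, total_wait


two_ports = writeback_schedule(6, write_ports=2)
one_port = writeback_schedule(6, write_ports=1)
```

With 6 operations in a row, this model finishes its last write at cycle 10 with no waiting when 2 write ports are available, and at cycle 14 with 24 result-cycles of waiting when there is only 1. A real pipeline would also propagate the stall back to the issue stage, which this sketch does not model.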

??? The data will arrive even more slowly compared to the core speed, so ?
On another point, i propose to definitely add the 8-register load
instruction.
This is a kind of preload for loops. Having tried to manage prefetch on
x86, i can say that the timing is very hard to get right: if you prefetch
too early, the data may be evicted before it is used, and it goes slower.
If you prefetch too late, you have only added instructions to execute, so
it goes slower...

This instruction could use the SRB hardware. It would load registers in
chunks of 8 ((R0)R1-R7, R8-R15, R16-R23, R24-R31, ...).
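The chunk numbering above can be illustrated with a few lines of code. This is a sketch under stated assumptions: 64 registers, memory modelled as a flat list of words, consecutive words going to consecutive registers, and R0 left untouched (the parentheses around R0 above are read here as "skipped"). None of these helper names exist in the F-CPU manual.

```python
CHUNK = 8  # registers loaded by one hypothetical "8-register load"


def chunk_registers(chunk):
    """Register numbers covered by a chunk: 0 -> R0..R7, 1 -> R8..R15, ..."""
    return list(range(chunk * CHUNK, chunk * CHUNK + CHUNK))


def load_chunk(regs, memory, base, chunk):
    """Copy 8 consecutive memory words into one aligned chunk of registers."""
    for offset, r in enumerate(chunk_registers(chunk)):
        if r == 0:
            continue  # assumption: R0 is never written
        regs[r] = memory[base + offset]


regs = [0] * 64
memory = list(range(100, 200))
load_chunk(regs, memory, base=0, chunk=0)
load_chunk(regs, memory, base=16, chunk=1)
```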
 
> However, if you can't "pay" multiported SRAM (whatever
> the configuration, say 3*(2W1R)), you have the possibility to
> "downgrade" your implementation : you will not support
> any 2W operation and you will reduce the scheduler to 1 write
> per cycle. the other operations will have to be emulated
> and the performance will suffer a lot... but it is still
> possible. You will still be able to run "out of order completion"
> code but with limited features and a low tolerance for
> bus contention.
>

???
 
> As a side note, i'm realistic enough to see that we won't
> be able to "implement" the FC0 before a long time (how many years ?)
> so i am not concerned "directly" with this issue. This is interesting
> to see how we can modify the architecture with respect to this
> economic problem but you know that this industry evolves very quickly.
> Too quickly for us.
> 

Maybe, but we will always have a 20% time penalty, at least !

> my conclusion is : as long as you can sustain 2 writes per cycle,
> it's ok for me. Even if you have to use 4 register banks which are
> addressed with their 2LSB (i use the analogy with a 4-way set associative
> cache, so in this case there are 4 register banks of 16 registers each).
> 128-bit registers are not a "decent solution" IMHO : they are really too
> large and not useful in 80% of the cases where 2W/cycle is necessary.
> 
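The 4-bank arrangement mentioned above (banks selected by the 2 LSBs of the register number, 16 registers per bank) can be sketched like this. The sketch assumes each bank has a single write port, so two writes can share a cycle only when they target different banks; the function names are illustrative.

```python
N_BANKS = 4  # 64 registers split into 4 banks of 16


def bank_of(reg):
    """Bank index: the 2 least significant bits of the register number."""
    return reg & (N_BANKS - 1)


def index_in_bank(reg):
    """Position inside the bank: the remaining upper bits."""
    return reg >> 2


def can_write_same_cycle(reg_a, reg_b):
    """Assumes one write port per bank: no conflict if the banks differ."""
    return bank_of(reg_a) != bank_of(reg_b)


pair_ok = can_write_same_cycle(4, 5)    # banks 0 and 1
conflict = can_write_same_cycle(4, 8)   # both in bank 0
```

Under this mapping rn and rn xor 1 always differ in their lowest bit, so they always land in different banks and can be written in the same cycle.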

?? Not useful, and yet you need 2 write ports for only one instruction: a
post-increment load !
Do you remember the rule: if you add 10% more silicon, it must increase
the speed by 10%, and i don't think that holds for the post-increment load.

> i hope we will be able to discuss this matter in depth.
> i am currently designing the scheduler so it's the right time.
> 

I should write mine before you finish yours, to compare !!

> merry christmas everybody, btw !
>

The same !
 nicO
> > nicO
> WHYGEE
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> *************************************************************
> To unsubscribe, send an e-mail to majordomo@seul.org with
> unsubscribe f-cpu       in the body. http://f-cpu.seul.org/

Follow-Ups:
- Re: [f-cpu] 3r2w -> 2r1w
  - From: Yann Guidon <whygee@f-cpu.org>

References:
- [f-cpu] 3r2w -> 2r1w
  - From: nicO <nicolas.boulay@ifrance.com>
- Re: [f-cpu] 3r2w -> 2r1w
  - From: Yann Guidon <whygee@f-cpu.org>