
Re: [f-cpu] 3r2w -> 2r1w



Yann Guidon wrote:
> 
> hi !
> 
> nicO wrote:
> > Yann Guidon wrote:
> > > nicO wrote:
> > > > So I propose to change all references in the manual concerning register
> > > > access from rn and rn+1 to rn (with n even) and rn+1. Why?
> > > currently, i do rn and rn xor 1. this is handy and friendly
> > It's not enough because you still need 5 ports.
> Either you have the 2 write ports, or the core (despite the higher
> frequency) runs software slower because of the contention on the
> write bus. you have the choice...
> 

I don't think you will make up for the time penalty of a 5-ported
memory.
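For illustration, the two pairing conventions quoted at the top of the thread (rn with n even paired with rn+1, versus rn paired with rn xor 1) can be contrasted in a few lines. This is a sketch of my own, not code from the F-CPU sources; the function names are invented:

```python
# Contrast the two register-pairing schemes discussed above
# (illustrative sketch, not F-CPU code).

def pair_xor(n):
    """Yann's scheme: the partner of rn is r(n xor 1), for any n."""
    return (n, n ^ 1)

def pair_even(n):
    """nicO's proposal: n must be even, the partner is rn+1."""
    assert n % 2 == 0, "n must be even in this scheme"
    return (n, n + 1)

# For even n, both schemes name the same pair...
assert pair_xor(4) == pair_even(4) == (4, 5)
# ...but xor also works for odd n (the pair is merely reversed):
assert pair_xor(5) == (5, 4)
```

The xor form needs no alignment restriction, which is why it is described as handy: decoding the partner register is a single inverter on the LSB.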

> > > > Usually silicon foundries provide memory generators for dual-ported memory
> > > > and nothing else (it's the same for FPGAs). So (as for LEON), we could use 2
> > > > areas of such memory to produce the 3 needed ports for the memory (so we
> > > > duplicate the data).
> > > to be more precise, i have read that LEON uses 2 banks of 1R1W :
> > > the written data is sent to both W ports, and the R ports
> > > provide data in parallel.
> > Yep, ... to make a 2R1W register bank... as for us !
> with the difference that we need 3 banks and this still does not
> solve the problem if you don't have 2 write ports.
> unless you find a better trick. i have the beginning of an idea,
> which is explained later in my post.
> 
> > > > So each EU could receive two 128-bit operands and write ONE 128-bit result.
> > > > Write enables could be used to minimise power consumption when only one
> > > > byte is written.
> > > >
> > > > Comments ?
> > >
> > > there is a big problem if you can't write 2 different data
> > > (at different addresses). The 2R2W instructions are only
> > > one side of the problem and you have overlooked a detail :
> > > instructions such as "load post-incremented" (a 2R2W instruction)
> > > perform an addition and a load in parallel. The result of
> > > the load will be ready faster (3 cycles starting from the decoding
> > > stage) than the addition (dec+xbar+asu1+asu2+xbar = 5 cycles min.,
> > > without even counting the TLB and the cache lookup). Do you see
> > > where this goes ?
> >
> > Hmm, an add slower than a load : that looks odd !
> 
> it is because "load" is the instruction, and not what it "really" does.
> "load" and "store" transfer data to and from the LSU. this is a
> _deterministic_ instruction because it goes through the following units :
>  - LSU
>  - LSU alignment/shift
>  - Xbar
>  - R7 (Register set)
> (in the reverse order for "store").
> 
> i said it before : forget about the other cores ! in FC0, the load and store
> instructions do NOT go to the memory ! it's the LSU's job, and the
> decode stage only asks if a given pointer has the associated data ready,
> so if a loaded data is not ready, the instruction is stalled at the Xbar
> stage.
> 

And what if the data isn't in the "L0 cache" ? You still have to build the
hardware to handle that case.

> On the other hand, for a load/store to work, it requires that the pointer
> goes through the validation and fetch process, which is completely
> hidden from the user, except concerning the time it takes.
> When updating the pointer, the other side of the "fork" goes through
> the following :
>  - Dec  (reads the pointer's value)
>  - Xbar
>  - ASU1 (compute the new pointer value)
>  - ASU2
>  - TLB1  +  Xbar
>  - TLB2  +  write R7
>  - cache and LSU lookup (may require more cycles if there's a cache miss)
>  - allocate and update the LSU line
> 
> the load/store instruction is both deterministic and exempt from
> exceptions inside the "execution core". When it arrives to the LSU,
> this unit takes the misses into account and manages the 8 buffers.
> It communicates with the decoder with flags that indicate if data
> is ready or not, or if a register has been validated as a pointer.
> This way, we can fire the TLB misses only at decode/xbar stage.

In case of a miss, you still have to handle it !
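The stage counts quoted above can be tallied in a tiny sketch. The stage names and the 3- and 5-cycle figures are taken from the text; the modelling itself is my own simplification:

```python
# Tally the per-path latencies quoted in the mail (illustrative only).

LOAD_PATH   = ["LSU", "LSU align/shift", "Xbar"]        # "3 cycles"
UPDATE_PATH = ["Dec", "Xbar", "ASU1", "ASU2", "Xbar"]   # "5 cycles min."

load_cycles   = len(LOAD_PATH)     # loaded data ready after 3 cycles
update_cycles = len(UPDATE_PATH)   # incremented pointer ready after 5+ cycles

# The load result is ready before the new pointer value, so a single write
# port would have to serialize the two writebacks of one 2R2W instruction.
assert load_cycles < update_cycles
```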

> 
> > Such an instruction will be split at the decode stage : one part goes through
> > the ALU, the other through the load pipe. The load will be much (much)
> > slower than the add, so the data will never be ready at the same time,
> > so we could add a little delay, from time to time.
> read above : there is no "load pipe", at least in the sense that you know
> for other CPUs. of course, accessing the memory is slower than doing
> an addition, but it is in the OTHER branch of the pipe.
> Otherwise, if your "load pipe" contains the TLB lookup, you
> introduce the possibility of an exception firing in the pipeline,
> while our design goal is to avoid it completely.
> 
> > > If you issue several loads in a row (say : 6), the last
> > > ones will be stalled even if all the data is ready,
> > > because your bus will not sustain the 2 writes per cycle
> > > at different addresses. It is not a solution to delay the
> > > fastest data because the situation would become even worse
> > > (you know that memory access speed is critical for computer
> > > performance).
> > ??? The data will arrive even more slowly compared to the core speed, so ?
> i will make you a little drawing so you can understand.
> in the last example, i assume that all the requested data is already
> present in the LSU's buffer (a kind of L0 cache and reorder buffer
> which contains 8*32 bytes).
> 

I know (but you must remember that this cache will be much slower
than a register read).
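A toy model of the "several loads in a row" case may help: assume, as in the example above, that all requested data already sits in the LSU buffer, and that the register file accepts `write_ports` independent writes per cycle. The function and numbers are illustrative, not the actual FC0 scheduler:

```python
# Toy model: each cycle, at most `write_ports` register writes can retire.
# All loaded data is assumed already present in the LSU buffer.

def cycles_to_drain(n_loads, write_ports):
    """Cycles needed to write back n_loads ready results."""
    pending, cycles = n_loads, 0
    while pending > 0:
        pending -= min(pending, write_ports)
        cycles += 1
    return cycles

assert cycles_to_drain(6, write_ports=1) == 6  # the last loads stall
assert cycles_to_drain(6, write_ports=2) == 3  # 2 writes/cycle sustained
```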

> > On another point, i propose to definitely add the 8-register load
> > instruction.
> ???? this means that you transfer 8*64=512 bits at once, while our
> maximal internal bus width is 256 bits currently !
>

Yep ! You described something similar using the SRB memory mechanism. So in
that case it will take several cycles ; it would be great to manage burst
access to the main memory.
 
> > This is a kind of preload for loops. Having tried to manage prefetch on x86, i
> > can say that the timing is very hard to get right : if you prefetch too
> > early, the data can be scratched and it goes slower. If you do it too
> > late, you add some instructions to be performed, so it goes slower...
> there is already a prefetch instruction in the f-cpu manual IIRC.
> 

I know. And i know it's very difficult to use. That's why i prefer
preload.

> > This instruction could use the SRB hardware. It will load in chunks of
> > registers ((R0)R1-R7, R8-R15, R16-R23, R24-R31, ...).
> why do you want to mix prefetch and register load ? i really don't
> understand your intention.

No, it's a load, but i call it a preload, because you effectively do a load,
but 8 at a time, in advance.
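As I understand the proposal, a "preload" would fetch one aligned chunk of 8 registers per instruction, following the (R0)R1-R7, R8-R15, ... grouping quoted above. A speculative sketch of the chunking (my own interpretation; r0 is skipped since it is not a real destination):

```python
# Speculative sketch of the 8-register preload chunking described above.

def preload_chunk(base):
    """Register numbers touched by one 8-register preload at `base`."""
    assert base % 8 == 0, "chunks are aligned on 8-register boundaries"
    return [r for r in range(base, base + 8) if r != 0]  # r0 excluded

assert preload_chunk(0) == [1, 2, 3, 4, 5, 6, 7]   # the "(R0)R1-R7" group
assert preload_chunk(8) == [8, 9, 10, 11, 12, 13, 14, 15]
```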

> remark that on another side, the LSU is also in charge of recognizing
> "memory streams", it will issue contiguous memory fetches if it
> sees that several contiguous fetches have triggered cache misses
> in the past.
>

A predictor, hmm, i have a doubt... a very big one ;p It's so easy to get
the opposite effect !
 
> > > However, if you can't "pay" multiported SRAM (whatever
> > > the configuration, say 3*(2W1R)), you have the possibility to
> > > "downgrade" your implementation : you will not support
> > > any 2W operation and you will reduce the scheduler to 1 write
> > > per cycle. the other operations will have to be emulated
> > > and the performance will suffer a lot... but it is still
> > > possible. You will still be able to run "out of order completion"
> > > code but with limited features and a low tolerance for
> > > bus contention.
> >
> > ???
> 
> in simpler words : with only 1 write port (even if it's 128-bit wide)
> you won't be able to execute the following instruction chunk :
> 
> add r1,r2,r3
> and r4,r5,r6
> 
> because the add takes 2 cycles and the and takes 1 cycle,
> they will create a contention ("hazard" if you prefer) on the write bus.
> the scheduler will have to delay the AND.
> Having 2 write ports not only helps with 2R2W operations, but also
> with badly scheduled instructions that finish at the same time,
> and the above example is a case that statistically happens
> more often than 2R2W instructions.
>

That's true, but you can be nicer to the CPU and schedule it the other
way around. And it will work :
and r4,r5,r6
add r1,r2,r3
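The reordering above can be checked with a toy write-bus model, assuming (as in the quoted example) a 2-cycle add, a 1-cycle and, one instruction issued per cycle and a single write port. This is a sketch of mine, not the FC0 scheduler:

```python
# Toy writeback-time check for the add/and example (one write port).

ADD_LAT, AND_LAT = 2, 1  # latencies in cycles, as quoted above

def completion_cycles(program):
    """Issue one instruction per cycle; return each result's writeback cycle."""
    return [issue + (ADD_LAT if op == "add" else AND_LAT)
            for issue, op in enumerate(program)]

# Original order: both results want the write bus on cycle 2 -> contention.
assert completion_cycles(["add", "and"]) == [2, 2]
# Swapped order: writebacks land on cycles 1 and 3 -> no conflict.
assert completion_cycles(["and", "add"]) == [1, 3]
```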
 
> > > As a side note, i'm realistic enough to see that we won't
> > > be able to "implement" the FC0 before a long time (how many years ?)
> > > so i am not concerned "directly" with this issue. This is interesting
> > > to see how we can modify the architecture with respect to this
> > > economic problem but you know that this industry evolves very quickly.
> > > Too quickly for us.
> > Maybe, but we will always have a 20% time penalty, at least !
> 
> as long as we are conscious about it, it's ok.
>

I think it's overkill.
 
> > > my conclusion is : as long as you can sustain 2 writes per cycle,
> > > it's ok for me. Even if you have to use 4 register banks which are
> > > addressed with their 2LSB (i use the analogy with a 4-way set associative
> > > cache, so in this case there are 4 register banks of 16 registers each).
> > > 128-bit registers is not a "decent solution" IMHO : it's really too large
> > > and not useful in 80% of the cases where 2W/cycle is necessary.
> > ?? Not useful, and you need 2 write ports for only one instruction : the
> > post-incremented load !
> no, as shown in the previous simple example.

Even with 2 ports you will have contention. Imagine using a divider
which takes 30 cycles or more. If you write some horrible code, all the data
can be ready at the same time : you could always need more ports. But don't
forget the bypass net ; you can always read the data by this means (the
r4 of your instruction could be read while the add unit is frozen).

> of course, you can decide to have only one write port but the compiler
> will be more complex (it will have to manage all the write hazards).
> 

You always have write hazards with multi-cycle instructions and variable
latencies. You would need one port per unit + the memory load/store latency. It
will never end ! But that's not a problem most of the time. The real problem
is reusing the result immediately, and that can be done with the bypass
net.
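A minimal sketch of what "reading the data through the bypass net" could look like, assuming a result can sit on an EU's output latch before its register-file writeback. Names and structure are invented for illustration, not the F-CPU implementation:

```python
# Minimal bypass (forwarding) sketch: a consumer reads an in-flight result
# from the producer's output latch instead of waiting for the writeback.

def read_operand(reg, regfile, bypass):
    """Prefer the fresh value on the bypass net over the register file."""
    return bypass[reg] if reg in bypass else regfile[reg]

regfile = {"r4": 0}    # stale copy, writeback not done yet
bypass  = {"r4": 42}   # result sitting on an EU's output latch

# The consumer gets the fresh value even though r4 was never written back.
assert read_operand("r4", regfile, bypass) == 42
```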

> > Do you remember the rule : if you add 10% more silicon, it must increase
> > the speed by 10%, and i don't think that holds for the post-incremented load.
> i'm sorry but :
>  - the load and store instructions are _critical_ instructions in RISC world.

Sure, it's critical, but having a double write port to write one datum when
the second could take almost 150 cycles seems really big.

If you force loads/stores to be deterministic, it's as if you forced every
load to hit the L0 cache, and you only shift the problem.

>  - IIRC the original remark was 10% more critical datapath,
>     and not 10% more silicon. we can put datapaths in parallel.

mmh, ok!

>  - the "10%" rule is valid for "classical" designs, while in the F-CPU case
>    we are in another "extreme" situation :
>    * silicon surface is not "expensive" (the package and the test are as
>      expensive, so 10% more silicon is not "mechanically" 10% higher price)

!!!! All these things are linked, but chips are sold by the mm², so
only the size matters.

>    * we consider that we are already at maximum speed and operations
>      can't be "compressed"
> 
> > > i hope we will be able to discuss this matter in depth.
> > > i am currently designing the scheduler so it's the right time.
> > I should write mine before you finish yours !! to compare.
> the comparison will certainly be interesting, sure :-)
> 
> is it possible to meet you ?
> i have to draw some pictures that will be useful for you.
> 
> > > merry christmas everybody, btw !
> > The same !
> >  nicO
> > > > nicO
> > > WHYGEE

I have understood your cache, which links register numbers and memory
contents. To speed up the cache, it's possible to use 64 direct registers
of 256 bits each. It will be as fast as a register bank access. But i
don't see how you handle aliases for the data (an old thread spoke about
that a while ago).

nicO
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/