
Re: [f-cpu] 3r2w -> 2r1w



hello,

nicO wrote:
> Yann Guidon a écrit :
> > nicO wrote:
> > > Yann Guidon a écrit :
> > > > nicO wrote:
> > > > > So i propose to change all references in the manual concerning register
> > > > > access from rn and rn+1 to rn (with n even) and rn+1. So why ?
> > > > currently, i do rn and rn xor 1. this is handy and friendly
> > > It's not enough because you always need 5 ports.
> > Either you have the 2 write ports, or the core (despite the higher
> > frequency) runs software slower because of the contention on the
> > write bus. you have the choice...
> I don't think that you will balance the penalty time of a 5-ported
> memory.

<back to the future>
FC0's register set was designed by taking into account that
other CPUs of that time already had huge register sets. The IA64
was already at the "Merced" stage and you know that its register set
can be compared to a football stadium.
</back to the future>

You say that you can't directly use the "average" 2-port memory.
To me it's the same thing as saying "Linux is not ready for the desktop",
it is simply not correlated. In the beginning, we thought about F-CPU
with comparisons to what was technically possible, in the present
and in the future. We did not take "IP" into account because that
did not exist at that time (at least in the same form as today).

However it is still possible to use 1R1W memory blocks and assemble
them to form any configuration. If you give me 16*16 bit registers
with 1 read and 1 write port, i can assemble them into something
that will work nicely in 75% of the cases.
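To illustrate the assembly idea, here is a sketch of my own (the class names and the banking-by-LSB choice are assumptions, not the actual F-CPU design): duplicated 1R1W blocks provide 2 reads per cycle, and banking on the address LSB provides 2 writes per cycle whenever the two destinations differ in parity -- which is the "works in most cases" kind of situation.

```python
# Hypothetical sketch: building a 2R2W register file from 1R1W blocks.
# Two tricks (names and structure are illustrative, not F-CPU RTL):
#  - duplication: every write goes to two mirror copies, so two reads
#    can be served at once (one from each copy);
#  - banking: splitting by the address LSB allows two writes per cycle
#    as long as the two addresses land in different banks.

class OneR1W:
    """A plain 1-read/1-write memory block of 16 entries."""
    def __init__(self):
        self.mem = [0] * 16
    def read(self, addr):
        return self.mem[addr]
    def write(self, addr, value):
        self.mem[addr] = value

class BankedRegFile:
    """2R2W behaviour built from four 1R1W blocks (2 banks x 2 mirrors)."""
    def __init__(self):
        # banks[address LSB][mirror copy]
        self.banks = [[OneR1W(), OneR1W()] for _ in range(2)]
    def read2(self, a, b):
        # each read is served by a different mirror, so they never conflict
        return (self.banks[a & 1][0].read(a >> 1),
                self.banks[b & 1][1].read(b >> 1))
    def write2(self, a, va, b, vb):
        if (a & 1) == (b & 1):
            raise RuntimeError("bank conflict: one write must be delayed")
        for bank, addr, val in ((self.banks[a & 1], a, va),
                                (self.banks[b & 1], b, vb)):
            for mirror in bank:          # keep both mirrors coherent
                mirror.write(addr >> 1, val)

rf = BankedRegFile()
rf.write2(2, 100, 5, 200)      # r2 and r5 fall in different banks: OK
print(rf.read2(2, 5))          # -> (100, 200)
```

Two mirror copies make the reads conflict-free; only two simultaneous writes to the same bank must be serialized by the scheduler, which is the residual fraction of cases.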


> > > > there is a big problem if you can't write 2 different data
> > > > (at different addresses). The 2R2W instructions are only
> > > > one side of the problem and you have overlooked a detail :
> > > > instructions such as "load post-incremented" (a 2R2W instruction)
> > > > perform an addition and a load in parallel. The result of
> > > > the load will be ready faster (3 cycles starting from the decoding
> > > > stage) than the addition (dec+xbar+asu1+asu2+xbar = 5 cycles min.,
> > > > without even counting the TLB and the cache lookup). Do you see
> > > > where this goes ?
> > >
> > > Humm, an add slower than a load : looks so odd !
> >
> > it is because "load" is the instruction, and not what it "really" does.
> > "load" and "store" transfer data to and from the LSU. this is a
> > _deterministic_ instruction because it goes through the following units :
> >  - LSU
> >  - LSU alignment/shift
> >  - Xbar
> >  - R7 (Register set)
> > (in the reverse order for "store").
> >
> > i said it before : forget about the other cores ! in FC0, the load and store
> > instructions do NOT go to the memory ! it's the LSU's job, and the
> > decode stage only asks if a given pointer has the associated data ready,
> > so if a loaded data is not ready, the instruction is stalled at the Xbar
> > stage.
> >
> 
> And what if the data aren't in the "L0 cache" ? you always must create
> the hardware to handle that case.

yes, of course. But the stall is not inside the "execution pipeline"
and an instruction that requests data that are pointed to by a register
will be held in the decoder until the data is ready.
This way, all delays inside the "execution pipeline" (that is : which
communicate through the internal Xbar, from and to the register set and
the execution units) are completely static. If there is a memory "miss"
then the load instruction will be halted _before_ execution, and not
in the middle (where it is harder to manage).
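A tiny model may make the "stall before execution" point clearer. This is purely illustrative : the function and its timing numbers are invented, not FC0's.

```python
# Illustrative in-order model: a load whose data is not yet in the LSU
# buffer is held at the decode stage until the LSU raises its "ready"
# flag; instructions already past decode keep their fixed latencies.

def run(instructions, lsu_ready):
    """instructions: list of (name, needs_lsu_data); lsu_ready: the
    cycle at which the LSU signals 'data ready'.
    Returns {name: cycle at which the instruction leaves the decoder}."""
    issued = {}
    cycle = 0
    for name, needs_lsu_data in instructions:
        if needs_lsu_data:
            # the stall happens here, in the decoder, never in the
            # middle of the execution pipeline
            cycle = max(cycle, lsu_ready)
        issued[name] = cycle
        cycle += 1
    return issued

prog = [("add", False), ("load", True), ("sub", False)]
print(run(prog, lsu_ready=4))  # -> {'add': 0, 'load': 4, 'sub': 5}
```

The "add" before the load is unaffected; once the data arrives, the ordering and the latencies downstream are the same as in the hit case.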

> > the load/store instruction is both deterministic and exempt from
> > exceptions inside the "execution core". When it arrives at the LSU,
> > this unit takes the misses into account and manages the 8 buffers.
> > It communicates with the decoder with flags that indicate if data
> > is ready or not, or if a register has been validated as a pointer.
> > This way, we can fire the TLB misses only at decode/xbar stage.
> 
> In case of a miss, you have to handle it !

yes but the only effect (for the program) is that the instruction
is halted at decoding stage, while all other instructions before it
continue to execute. When execution restarts, all the instruction
ordering and latencies are still deterministic.

concerning the cache misses, i don't think we need to reinvent the wheel.

> > > ??? The data will come even more slowly compared to the core speed, so ?
> > i will make you a little drawing so you can understand.
> > in the last example, i assume that all the requested data is already
> > present in the LSU's buffer (a kind of L0 cache and reorder buffer
> > which contains 8*32 bytes).
> 
> I know (but you must remember that this cache will be much slower
> than a register read).

i know. that's why the LSU uses 2 cycles.

> > > In another point, i propose to definitely add the 8-register load
> > > instruction.
> > ???? this means that you transfer 8*64=512 bits at once, while our
> > maximal internal bus width is currently 256 bits !
> 
> Yep ! You speak about something similar using the SRB memory mechanism. So in
> that case, it will take several cycles, and it will be great to manage burst
> access to the main memory.

considering that memory fetches can be atomic, why do you want them
to write to contiguous registers on top of that ?

> > > This is a kind of preload for loops. Having tried to manage prefetch on x86, i
> > > can say that the timing is very complicated to get right : if you prefetch too
> > > early, the data could be scratched and it goes slower. If you make it too
> > > late, you add some instructions to be performed, so it goes slower...
> > there is already a prefetch instruction in the f-cpu manual IIRC.
> 
> I know. And i know it's very difficult to use. That's why i prefer
> preload.

i see no difference : you can "preload" a L0 line simply by doing
a load to R0, and you get 256 bits. there will be no penalty until you want
to really use the data (with another "load" instruction). it's like a
"touch" or "cache warming" with no impact on the register set.

On top of that, once you "consume" data (read or write in the line),
a second line is allocated and prefetched by the LSU and the fetcher,
thus implementing "double buffering" (a line is being used while another
is being loaded from memory).
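The double-buffering behaviour can be sketched like this. Only the 32-byte line size comes from the "8*32 bytes" figure above; the rest (names, structure) is my own illustration of the principle, not the LSU design.

```python
# Hedged sketch of the double buffering described above: while one L0
# line is being consumed, the next one is prefetched, so a sequential
# reader only ever takes the first demand miss.

class StreamBuffer:
    LINE = 32  # bytes per L0 line, as in the 8*32-byte buffer

    def __init__(self, memory):
        self.memory = memory          # backing store: a bytes object
        self.lines = {}               # line address -> line data
        self.fetches = 0              # how many lines went to memory

    def _fetch(self, line_addr):
        if line_addr not in self.lines:
            self.lines[line_addr] = self.memory[line_addr:line_addr + self.LINE]
            self.fetches += 1

    def read(self, addr):
        line = addr - addr % self.LINE
        self._fetch(line)             # demand fetch (a miss stalls here)
        self._fetch(line + self.LINE) # prefetch the next line in advance
        return self.lines[line][addr % self.LINE]

mem = bytes(range(256)) * 4
buf = StreamBuffer(mem)
for a in range(64):                   # walk two full lines sequentially
    buf.read(a)
print(buf.fetches)                    # -> 3 (lines 0, 32 and 64 fetched)
```

Only the fetch of line 0 is a demand miss; lines 32 and 64 are brought in while the previous line is being consumed.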

> > > This instruction could use the SRB hardware. It will load in chunks of
> > > registers ((R0)R1-R7,R8-R15,R16-R23,R24-R31,...).
> > why do you want to mix prefetch and register load ? i really don't
> > understand your intention.
> No, it's a load, but i call them preload, because you effectively do a load
> but 8 at a time, in advance.

what is the purpose of writing to 8 registers ?

> > remark that on another side, the LSU is also in charge of recognizing
> > "memory streams", it will issue contiguous memory fetches if it
> > sees that several contiguous fetches have triggered cache misses
> > in the past.
> 
> A predictor, hum, i have a doubt... a very big one ;p It's so easy to get
> the opposite effect !

nobody forces you to use it. When it is written in the F-CPU source code,
you will be able to set a flag that disables its compilation if you want.

> > >
> > > ???
> >
> > in simpler words : with only 1 write port (even if it's 128-bit wide)
> > you won't be able to execute the following instruction chunk :
> >
> > add r1,r2,r3
> > and r4,r5,r6
> >
> > because the add takes 2 cycles and the and takes 1 cycle,
> > it will create a contention ("hazard" if you prefer) on the write bus.
> > the scheduler will have to delay the AND.
> > Having 2 write ports not only helps with 2R2W operations, but also
> > with badly scheduled instructions that finish at the same time,
> > and the above example is a case that statistically happens
> > more often than 2R2W instructions.
> 
> That's true but you can be nicer to the CPU and schedule it another
> way. And it will work.
> and r4,r5,r6
> add r1,r2,r3

sure it will work, but you know that no operation is completely independent
from everything : all the instructions before and after can come from a global
dataflow analysis. swapping two instructions can trigger the re-analysis of all
the instruction scheduling in the block.
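The write-bus hazard quoted above is easy to reproduce with a toy schedule checker. This is my own formalisation for illustration, not the FC0 scheduler.

```python
# Toy writeback model: each instruction issues at cycle i (one per
# cycle, in order) and writes its result back at cycle i + latency.
# 'add r1,r2,r3' (2 cycles) followed by 'and r4,r5,r6' (1 cycle) both
# complete at cycle 2 -> with a single write port, that's a hazard.
from collections import Counter

def writeback_conflicts(program, ports):
    """program: list of (name, latency). Returns the cycles where more
    results complete than there are write ports to absorb them."""
    done = Counter(issue + lat for issue, (_, lat) in enumerate(program))
    return sorted(c for c, n in done.items() if n > ports)

prog = [("add", 2), ("and", 1)]
print(writeback_conflicts(prog, ports=1))  # -> [2] (both finish at cycle 2)
print(writeback_conflicts(prog, ports=2))  # -> []  (two ports absorb it)
```

Swapping the two instructions removes this particular collision, but as noted above it may force a re-analysis of the whole block's schedule.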

> > > > my conclusion is : as long as you can sustain 2 writes per cycles,
> > > > it's ok for me. Even if you have to use 4 register banks which are
> > > > addressed with their 2LSB (i use the analogy with a 4-way set associative
> > > > cache, so in this case there are 4 register banks of 16 registers each).
> > > > 128-bit registers is not a "decent solution" IMHO : it's really too large
> > > > and not useful in 80% of the cases where 2W/cycle is necessary.
> > > ?? Not useful, and you need 2 write ports only for one instruction : the
> > > post-incremented load !
> > no, as shown in the previous simple example.
> Even with 2 ports you will have contention.
of course, but less on average.

> Imagine the use of a divider
> which takes 30 cycles or more. If you write some horrible code : all data
> are ready at the same time. You could always need more ports. But don't
> forget the bypass net, you could always read the data by this means (the
> R4 of your instruction could be read, but the add unit is frozen).
adding "write buffers" to the register set is close to register renaming.
one of FC0's design goals is to not implement it.
Concerning the write port number, 2 is satisfying for FC0.
A 2-issue FC0 would be happy with 3 write ports.

Btw, one of the reasons why the P6 core is so fucked up is that
it can "retire" fewer instructions than it can issue. the moral of
the P6 story is : you can always try to schedule everything perfectly,
but not only will you never succeed (it's highly non-deterministic and the
register pressure is extreme), on top of that the retirement unit
will never sustain the flood.

> > of course, you can decide to have only one write port but the compiler
> > will be more complex (it will have to manage all the write hazards).
> 
> You always have write hazards with multi-cycle instructions and variable
> latency. You need one port per unit + the memory load/store latency. It
> will never end !
if even 2 ports are not enough, then only 1 port is much worse :-/

> But that's not a problem most of the time. The problem
> is to immediately reuse the result, which can be done with the bypass net.
I agree in part with that, as long as the control logic remains
human-readable...

> > > Do you remember the rule : if you add 10% more silicon it must increase
> > > the speed by 10%, and i don't think that holds for the post-incremented load.
> > i'm sorry but :
> >  - the load and store instructions are _critical_ instructions in RISC world.
> Sure it's critical, but having a double write port to write one data item
> when the second could take almost 150 cycles seems really big.
unless you code like an otter (sloppily, that is), this must not happen for every instruction, right ?

> If you force load/store to be deterministic, it's as if you force the load
> to hit the L0 cache, and you only shift the problem.
of course, i push the problem where it's not a problem anymore.
All design activity is like that.

As for the L0, it acts as a fine-grained reorder buffer, or a "write buffer",
or whatever, and current "big" CPUs have such things. i simply changed
or added some more functions.

> >  - the "10%" rule is valid for "classical" designs, while in the F-CPU case
> >    we are in another "extreme" situation :
> > >    * silicon surface is not "expensive" (the package and the test are as
> > >      expensive, so 10% more silicon does not "mechanically" mean a 10% higher price)
> 
> !!!! All these things are linked, but chips are sold by the mm², so
> only the size is important.
however, when the size is proportional to the performance, what do you do ?
:-)

> I have well understood your cache which links register numbers and memory
> content. To speed up the cache, it's possible to directly use 64
> registers of 256 bits.
it would be REALLY too big... and not useful because not every register
contains a "pointer".

> It will be as fast as a register bank access. But i don't see
> how you handle aliases for data (an old thread spoke about that).

i have found 3 strategies to handle aliases in the LSU and the fetcher.
All have good and bad sides... please note that i avoid by all possible
means the use of an intermediary LUT that transforms the register
number into an L0 line number, because it creates an overhead.
The mechanism must be an "associative memory" : you give the key
and you get the data, not an index.
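In other words, something like this (a behavioural illustration of the key->data requirement only, not the FC0 circuit; the names are invented) :

```python
# Associative lookup: the register number is the key and the lookup
# returns the L0 line's data directly, with no intermediate
# register -> line-index LUT in between.

class AssociativeL0:
    def __init__(self, nlines=8):
        # each line holds (tag, data); the tag is a register number
        self.lines = [(None, None)] * nlines

    def lookup(self, reg):
        # all tag comparators fire in parallel in hardware;
        # here we just scan the lines
        for tag, data in self.lines:
            if tag == reg:
                return data
        return None                    # miss: the decoder stalls the load

    def allocate(self, reg, data):
        for i, (tag, _) in enumerate(self.lines):
            if tag is None:
                self.lines[i] = (reg, data)
                return
        raise RuntimeError("no free line: replacement policy needed")

l0 = AssociativeL0()
l0.allocate(3, b"line for r3")
print(l0.lookup(3))   # -> b'line for r3'
print(l0.lookup(5))   # -> None (miss)
```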

 1) big and naughty : "copy everything."
the idea is that there is still a 1-to-1 association between a register
number and a physical line. But in this strategy, we allow the furiously
silly option of having the same logical address mapped to several physical
lines. This has the advantage that you can have as many aliases as there
are lines, but you can imagine the crazy mess needed to maintain the coherency
between the lines... it _could_ be done, but i don't like it because the
coherency management can become too close to a non-associative memory.

 2) working with pairs of lines.
This sounds natural at first glance because the L0 buffers will
certainly use a double-buffering strategy. If we work by pairs,
there will be fewer problems with double-buffering.
I have counted that given 2 registers and 2 lines, there are 8 legal
associations, so it can be coded into 3 bits. This means that when
the comparison with the register number is successful, we need a very low
overhead to know which line is read. It is also reversible : given
an address, we can determine which register(s) to deallocate (for
example when we want to flush a line, we have to flush the associated
registers). This is good (tm) but the alias is limited to 2 registers.
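One plausible way to arrive at these 8 associations (an assumption on my part, consistent with the 3-bit figure above) : each of the 2 registers is mapped to line 0, to line 1, or left unmapped, and the all-unmapped state carries no association to store :

```python
# Enumerate the hypothetical association states of a (2 registers,
# 2 lines) pair: 3 choices per register, minus the empty state.
from itertools import product

states = [s for s in product(("line0", "line1", "unmapped"), repeat=2)
          if s != ("unmapped", "unmapped")]
print(len(states))  # -> 8, which indeed fits in 3 bits
```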

 3) the '<=>' strategy
This one is a bit more sophisticated but it is similar to 2).
However, here we extend to 3 registers because we don't work
with pairs but with 3 consecutive lines and registers.
Register comparator number i can be associated with line i-1, i or i+1.
One line is associated exclusively with 1 address (like in 2)).
One register can be associated with one line, but one line can be linked
to up to 3 registers (except when i=0 or i=7, where there are only 2 neighbours).
It is more flexible for the double-buffering because there is more choice.
I have not finished my analysis but i think i'll work on that strategy.

Warning !
Increasing the associativity will increase the overhead, so i think that
the 3rd method is a good compromise between simplicity, flexibility
and efficiency. Using a 4-register neighbourhood would double the logic
and the overhead. The 3rd method already requires 22 memory bits and the
LRU and replacement strategy is more complex than the simple
direct-associativity strategy.
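For reference, the neighbourhood rule of the 3rd method boils down to this (my own formalisation, assuming the 8 registers and 8 lines discussed above) :

```python
# Register comparator i may only be associated with line i-1, i or i+1,
# clamped at the edges where only 2 neighbours exist (i=0 and i=7).

def legal_lines(i, nlines=8):
    return [l for l in (i - 1, i, i + 1) if 0 <= l < nlines]

print(legal_lines(0))  # -> [0, 1]      (edge case: 2 neighbours)
print(legal_lines(4))  # -> [3, 4, 5]
print(legal_lines(7))  # -> [6, 7]
```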


> nicO
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~