
Re: [f-cpu] 3r2w -> 2r1w, 8 loads inst, L0 coherency



I would have liked to take more time to answer, but it's hard!

Debate: 3r2w vs 2r1w

3r2w, 64 bits
 Less write contention
 In 2 years, the technology will have caught up

2r1w, 128 bits
 Technology usable soon
 Faster access time
 Equal bandwidth
 Smaller
 Dual-ported memory is available even in FPGAs

The big point concerns the penalty due to write contention: is it a
killer for 2r1w or not?
- I think no.
- Whygee thinks yes.

The following discusses loading 8 registers at the same time, and how
the L0 should work to maintain coherency.

Have a good read !

Yann Guidon wrote:
> 
> hello,
> 
> nicO wrote:
> > Yann Guidon wrote:
> > > nicO wrote:
> > > > Yann Guidon wrote:
> > > > > nicO wrote:
> > > > > > So i propose to change all references in the manual concerning register
> > > > > > access from rn and rn+1 to rn (with n even) and rn+1. Why?
> > > > > currently, i do rn and rn xor 1. this is handy and friendly
> > > > It's not enough because you still need 5 ports.
> > > Either you have the 2 write ports, or the core (despite the higher
> > > frequency) runs software slower because of the contention on the
> > > write bus. You have the choice...
> > I don't think that you will offset the access-time penalty of
> > 5-ported memory.
> 
> <back to the future>
> FC0's register set was designed by taking into account that
> other CPUs of that time already had huge register sets. The IA64
> was already at the "Merced" stage and you know that we can compare
> its register set to a football stadium.
> </back to the future>
>

Yes, at that time, the team spoke about a "Merced killer". Since then the
Intel CPU has committed suicide and we are waiting for McKinley.
 
> You say that you can't directly use the "average" 2-port memory.
> To me it's the same thing as saying "Linux is not ready for the desktop",
> it is simply not correlated. In the beginning, we thought about F-CPU
> with comparisons to what was technically possible, in the present
> and in the future. We did not take "IP" into account because that
> did not exist at that time (at least in the same form as today).
>

So now we have some information: the F-CPU can't start out by fighting
well-established CPUs such as x86. Everybody knows x86 is crap, but it's
compatible and cheap (if you don't believe me, look at the price of the
Sun Blade 1000 750, which doesn't reach the speed of the >1600 MHz CPUs
from Intel or AMD); that's the only reason this line of CPUs continues.
To start out, F-CPU must find an easier way in.
 
x86/PowerPC concerns very, very few chip makers. Everybody (Thomson,
Nokia, Alcatel, Sony, ...) who uses a CPU in their chips could use an IP
core; usually they use the ones from ARM (and do the debugging for
them...), but F-CPU (like Leon) could have a very great opportunity to
win this market. Why? Simply because companies don't like to show their
source code!

> However it is still possible to use 1R1W memory blocks and assemble
> them to form any configuration. If you give me 16*16 bit registers
> with 1 read and 1 write port, i can assemble them into something
> that will work nicely in 75% of the cases.
>

Yes, but it could be slow and large. Still, I'm curious to see how you
create 2 write ports out of many 1W-ported memories.
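For reference, one known way to assemble write ports from 1W blocks (a
sketch of the generic "one bank per write port plus a live value table"
technique — this is my assumption of an approach, not necessarily what
Yann has in mind; all names are invented):

```python
# Build a register file with 2 write ports from single-write-port banks:
# each write port owns its own bank, and a small "live value table" (LVT)
# records, per register, which bank holds the freshest copy.

class MultiWriteRegFile:
    def __init__(self, size=64, write_ports=2):
        # one bank per write port; each bank only ever sees 1 write/cycle
        self.banks = [[0] * size for _ in range(write_ports)]
        self.lvt = [0] * size  # which bank holds the latest value

    def write(self, port, addr, value):
        self.banks[port][addr] = value
        self.lvt[addr] = port

    def read(self, addr):
        # a read consults the LVT, then the selected bank
        return self.banks[self.lvt[addr]][addr]

rf = MultiWriteRegFile()
rf.write(0, 5, 111)   # port 0 writes r5
rf.write(1, 6, 222)   # port 1 writes r6 in the same cycle
rf.write(1, 5, 333)   # later, port 1 overwrites r5
assert rf.read(5) == 333 and rf.read(6) == 222
```

The cost nicO worries about is visible here: the banks are duplicated
(and would be duplicated again per read port), plus the LVT adds a lookup
in front of every read.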
 
> > > > > there is a big problem if you can't write 2 different data
> > > > > (at different addresses). The 2R2W instructions are only
> > > > > one side of the problem and you have overlooked a detail :
> > > > > instructions such as "load post-incremented" (a 2R2W instruction)
> > > > > performs an addition and a load in parallel. The result of
> > > > > the load will be ready faster (3 cycle starting from the decoding
> > > > > stage) than the addition (dec+xbar+asu1+asu2+xbar=5 cycles min.,
> > > > > without even counting the TLB and the cache lookup). Do you see
> > > > > where it goes ?
> > > >
> > > > Hmm, an add slower than a load: that looks so odd!
> > >
> > > it is because "load" is the instruction, and not what it "really" does.
> > > "load" and "store" transfer data to and from the LSU. this is a
> > > _deterministic_ instruction because it goes through the following units :
> > >  - LSU
> > >  - LSU alignment/shift
> > >  - Xbar
> > >  - R7 (Register set)
> > > (in the reverse order for "store").
> > >
> > > i told it : forget about the other cores ! in FC0, the load and store
> > > instructions do NOT go to the memory ! it's the LSU's job and the
> > > decode stage only asks if a given pointer has the associated data ready,
> > > so if a loaded data is not ready, the instruction is stalled at the Xbar
> > > stage.
> > >
> >
> > And what if the data isn't in the "L0 cache"? You still have to create
> > the hardware to handle that case.
> 
> yes, of course. But the stall is not inside the "execution pipeline"
> and an instruction that requests data that are pointed to by a register
> will be held in the decoder until the data is ready.
> This way, all delays inside the "execution pipeline" (that is : which
> communicate through the internal Xbar, from and to the register set and
> the execution units) are completely static. If there is a memory "miss"
> then the load instruction will be halted _before_ execution, and not
> in the middle (where it is harder to manage).
> 

Bah! With my asynchronous pipeline it could be done: the units say when
they are ready, and you can freeze them if there is contention at the
write port, so you absolutely don't care how a unit works internally.
(Whygee: with your proposal, can you handle the case where multicycle
instructions are pipelined, so that "some" instructions run at the same
time?) Coherency is maintained by register dependencies only.

> > > the load/store instruction is both deterministic and exempt from
> > > exceptions inside the "execution core". When it arrives to the LSU,
> > > this unit takes the misses into account and manages the 8 buffers.
> > > It communicates with the decoder with flags that indicate if data
> > > is ready or not, or if a register has been validated as a pointer.
> > > This way, we can fire the TLB misses only at decode/xbar stage.
> >
> > In case of a miss, you have to handle it !
> 
> yes but the only effect (for the program) is that the instruction
> is halted at decoding stage, while all other instructions before it
> continue to execute. When execution restarts, all the instruction
> ordering and latency is still deterministic.
> 

So you don't do any write combining, any write buffering? That is used
very often and it speeds up execution too. The L0 cache could be used to
do this. I don't understand blocking on an L0 miss: with lots of loads,
you will kill the performance. With non-blocking loads/stores you could
issue more transactions and use all the features of the DRAM buses
(interleaved accesses to hide latency...).
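To make the write-combining point concrete, here is a minimal sketch of
the idea (an illustration of the generic technique, not FC0's or any
real LSU's design): byte stores to the same 32-byte line are merged in a
buffer and flushed to memory as a single burst.

```python
# Minimal write-combining buffer: merge stores per 32-byte line,
# then issue one memory transaction (burst) per dirty line on flush.

LINE = 32

class WriteCombiningBuffer:
    def __init__(self):
        self.lines = {}   # line base address -> {offset: byte}
        self.flushes = 0  # memory transactions actually issued

    def store(self, addr, byte):
        base = addr - addr % LINE
        self.lines.setdefault(base, {})[addr % LINE] = byte

    def flush(self):
        self.flushes += len(self.lines)  # one burst per dirty line
        self.lines.clear()

wc = WriteCombiningBuffer()
for i in range(32):          # 32 separate byte stores to one line...
    wc.store(0x1000 + i, i)
wc.flush()
assert wc.flushes == 1       # ...combine into a single burst
```

Without combining, those 32 stores would each be a separate transaction
on the memory bus; that is the speedup nicO is referring to.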

> concerning the cache misses, i don't think we need to reinvent the wheel.

Compared to the core, caches are sloooow. In modern CPUs, the L1 cache is
a 3-stage pipeline; the P4 uses a 2-stage cache, which is why it is only
8 KB. As far as I can see, the F-CPU will be much more deeply pipelined
than the P4...
 
> > > > ??? The data will come even more slowly compared to the core speed, so ?
> > > i will make you a little drawing so you can understand.
> > > in the last example, i assume that all the requested data is already
> > > present in the LSU's buffer (a kind of L0 cache and reorder buffer
> > > which contains 8*32 bytes).
> >
> > I know (but you must remember that this cache will be much slower
> > than a register read).
> 
> i know. that's why the LSU uses 2 cycles.
>

:D lol! It's always funny to settle such things when so little else is
defined. I think you are doing the work in the wrong order.
 
> > > > On another point, i propose to definitely add the 8-register load
> > > > instruction.
> > > ???? this means that you transfer 8*64=512 bits at once, while our
> > > maximal internal bus width is 256 bits currently !
> >
> > Yep ! You speak about something similar using the SRB memory mechanism. So in
> > that case, it will take several cycles; it will be great to manage burst
> > accesses to the main memory.
> 
> considering that memory fetches can be atomic, why do you want them
> to write to contiguous registers (on top of that ?)
>

Because burst accesses are aligned, because you couldn't fit that much
information in an instruction word, and with 128-bit buses it helps!
 
> > > > This is a kind of preload for loop. To try to manage prefetch in x86, i
> > > > can say that the timing is very complicated to have : you prefetch too
> > > > early, the data could be scratch and it goes slower. If you make it too
> > > > late, you add some instruction to be performed, so it goes slower...
> > > there is already a prefetch instruction in the f-cpu manual IIRC.
> >
> > I know. And i know it's very difficult to use. That's why i prefer
> > preload.
> 
> i see no difference : you can "preload" a L0 line simply by doing
> a load to R0, and you get 256 bits. there will be no penalty until you want
> to really use the data (with another "load" instruction). it's like a
> "touch" or "cache warming" with no incidence on the register set.
>

That's where you make a mistake! It's absolutely not the same! With a
prefetch you perform no load; you guess that the data will be ready when
you ask for it (it's really hard to know when, and it changes a lot
between generations). With a preload, the data is actually loaded. With
such an instruction (and non-blocking loads/stores) you could use a
double-buffering technique (load 8 data while computing on 8 others,
then swap!).
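The double-buffering scheme described here can be sketched as follows
(illustrative Python, not F-CPU code; the `* 2` computation is just a
stand-in for real work on each group of 8):

```python
# Double buffering: while the core computes on one group of 8 values,
# the next group of 8 is (conceptually) being loaded; then they swap.

def process(data, chunk=8):
    results = []
    buf_a = data[0:chunk]                  # "preload" the first 8 values
    for i in range(chunk, len(data) + chunk, chunk):
        buf_b = data[i:i + chunk]          # load the next 8 in the background
        results += [x * 2 for x in buf_a]  # ...while computing on these 8
        buf_a = buf_b                      # swap!
    return results

assert process(list(range(16))) == [x * 2 for x in range(16)]
```

The point is that the load latency of each group is hidden behind the
computation on the previous group, which only works if the loads are
non-blocking.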

So with a prefetch, you perform _2_ pseudo-loads instead of one. You can
kill your performance if you issue too many prefetches that haven't
finished. The problem is the distance calculation to be done, and the
fact that this number changes with each version of the CPU (cache speed /
core speed / SDRAM speed ...). Try the program from my first article and
you will see: it can become horrible!
 
> On top of that, once you "consume" data (read or write in the line),
> a second line is allocated and prefetched by the LSU and the fetcher,
> thus implementing "double buffering" (a line is being used while another
> is being loaded from memory).
>

Prefetches are theoretically good, but bad behavior can kill your
performance.
 
> > > > This instruction could use the SRB hardware. It will load in chunks of
> > > > registers ((R0)R1-R7,R8-R15,R16-R23,R24-R31,...).
> > > why do you want to mix prefetch and register load ? i really don't
> > > understand your intention.
> > No, it's a load, but i call it preload, because you effectively do a load
> > but 8 at a time, in advance.
> 
> what is the purpose of writing to 8 registers ?
>

Double buffering, and using the maximum memory bandwidth; and I think 8
is a good number.
 
> > > remark that on another side, the LSU is also in charge of recognizing
> > > "memory streams", it will issue contiguous memory fetches if it
> > > sees that several contiguous fetches have triggered cache misses
> > > in the past.
> >
> > A predictor, hmm, i have a doubt... a very big one ;p It's so easy to
> > get the opposite effect !
> 
> nobody forces you to use it. When it is written in the F-CPU source code,
> you can set a flag that removes its compilation if you want.
> 
> > > >
> > > > ???
> > >
> > > in simpler words : with only 1 write port (even if it's 128-bit wide)
> > > you won't be able to execute the following instruction chunk :
> > >
> > > add r1,r2,r3
> > > and r4,r5,r6
> > >
> > > because the add operates in 2 cycles and the and in 1 cycle,
> > > it will create a contention ("hazard" if you prefer) on the write bus.
> > > the scheduler will have to delay the AND.
> > > Having 2 write ports not only helps with 2R2W operations, but also
> > > with badly scheduled instructions that finish at the same time,
> > > and the above example is a case that is statistically happening
> > > more often than 2R2W instructions.
> >
> > That's true but you can be kinder to the cpu and schedule it another
> > way. And it will work :
> > and r4,r5,r6
> > add r1,r2,r3
> 
> sure it will work, but you know that no operation is completely independent
> from everything : all the instructions before and after can come from a global
> dataflow analysis. swapping two instructions can cause the re-analysis of all
> the instruction scheduling in the block.
>

Yep! 2 ports will always be faster than 1, but i don't think that the
difference will be that important.
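The add/and example quoted above can be checked with a toy timing model
(my assumptions: one instruction issued per cycle, each result written
back `latency` cycles later, at most `ports` writebacks per cycle — a
sketch, not FC0's actual scheduler):

```python
# Toy model of write-bus contention: delayed instructions keep their
# result until a write port is free.

def finish_cycles(latencies, ports):
    writes = {}          # cycle -> number of results written that cycle
    done = []
    for issue, lat in enumerate(latencies):
        t = issue + lat
        while writes.get(t, 0) >= ports:  # write bus busy: delay the result
            t += 1
        writes[t] = writes.get(t, 0) + 1
        done.append(t)
    return done

# add (2 cycles) issued at cycle 0, and (1 cycle) issued at cycle 1:
# both would finish at cycle 2.
assert finish_cycles([2, 1], ports=1) == [2, 3]  # 1 port: the AND is delayed
assert finish_cycles([2, 1], ports=2) == [2, 2]  # 2 ports: no contention
```

Whether that one-cycle delay matters "in average" is exactly the open
question of this debate; the model only shows where the stall comes from.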
 
> > > > > my conclusion is : as long as you can sustain 2 writes per cycles,
> > > > > it's ok for me. Even if you have to use 4 register banks which are
> > > > > addressed with their 2LSB (i use the analogy with a 4-way set associative
> > > > > cache, so in this case there are 4 register banks of 16 registers each).
> > > > > 128-bit registers is not a "decent solution" IMHO : it's really too large
> > > > > and not useful in 80% of the cases where 2W/cycle is necessary.
> > > > ?? Not useful, and you need 2 write ports for only one instruction: a
> > > > post-incremented load !
> > > no, as shown in the previous simple example.
> > Even with 2 ports you will have contention.
> of course, but less in average.
> 
> > Imagine the use of a divider
> > which takes 30 cycles or more. If you write some horrible code where all
> > results are ready at the same time, you could always need more ports. But don't
> > forget the bypass net : you can always read the data by that means (the
> > R4 of your instruction could be read, but the add unit is frozen).
> adding "write buffers" to the register set is close to register renaming.
> one of FC0's design goals is to not implement it.

???? The bypass net has nothing to do with register renaming!

> Concerning the write port number, 2 is satisfying for FC0.
> A 2-issue FC0 would be happy with 3 write ports.
> 
> Btw, one of the reasons why the P6 core is so fucked up is because
> it can "retire" fewer instructions than it can issue. the moral of
> the P6 story is : you can always try to schedule everything perfectly,
> not only you'll never succeed (it's highly undeterministic and the
> register pressure is extreme), but on top of that the retirement unit
> will never sustain the flood.
>

I just reread the Intel spec. The retirement unit in the P6 can retire 3
µops and the scheduler can issue 3 µops. So i don't understand your point
of view. Register pressure in the P6 isn't the same because of register
renaming: if you introduce dependencies other than read-after-write, they
are renamed away, so?
 
> > > of course, you can decide to have only one write port but the compiler
> > > will be more complex (it will have to manage all the write hazards).
> >
> > You always have write hazards with multi-cycle instructions and variable
> > latencies. You need one port per unit + the memory load/store latency. It
> > will never end !
> if even with only 2 ports it's not enough, then with only 1 port it's much worse :-/
> 

I believe that you want to avoid all contention that way.

> > But that's not a problem most of the time. The problem
> > is immediately reusing the result, and that can be done with the bypass net.
> I agree in part with that, as long as the control logic remains
> human-readable...
>

That's super easy: beside the register bank you read another unit (next
to the scoreboard) that signals a hit if the data is present.
 
> > > > Do you remember the rule : if you add 10% more silicon it must increase
> > > > the speed by 10%, and i don't think that holds for the post-incremented load.
> > > i'm sorry but :
> > >  - the load and store instructions are _critical_ instructions in RISC world.
> > Sure it's critical, but having a double write port to write one datum
> > when the second may take almost 150 cycles seems really expensive.
> unless you code like an otter, this must not happen for every instruction, right ?
>

I didn't know an otter could code!
 
> > If you force loads/stores to be deterministic, it's as if you force every
> > load to hit in the L0 cache, and you only shift the problem elsewhere.
> of course, i push the problem where it's not a problem anymore.
> All design activity is like that.

So you will reintroduce latency that is usually hidden by buffers,
asynchronous operation, ...

> 
> As for the L0, it acts as a fine-grained reorder buffer, or a "write buffer",
> or whatever, and current "big" CPUs have such things. i simply changed
> or added some more functions.
>

Those write buffers are inside the load/store unit and don't work as you
describe at all.
 
> > >  - the "10%" rule is valid for "classical" designs, while in the F-CPU case
> > >    we are in another "extreme" situation :
> > >    * silicon surface is not "expensive" (the package and the test are as
> > >      expensive, so 10% more silicon is not "mechanically" a 10% higher price)
> >
> > !!!! All these things are linked, but chips are sold by the mm², so
> > only the size matters.
> however, when the size is proportional to the performance, what do you do ?
> :-)
> 

It's not really true! And we look for the best ratio between mm²/E ...
If you want more power, put in ... 2 F-CPUs :p

> > I have well understood your cache, which links register numbers and
> > memory contents. To speed up the cache it's possible to use a direct-mapped
> > array of 64 registers of 256 bits.
> it would be REALLY too big... and not useful because not every register
> contains a "pointer".
>

You avoid the flag memory, the big muxes and the control. It's only 2 KB
of memory: very common, even in FPGAs (contrary to multiported
memory...).
 
> > It will be as fast as a register bank access. But i don't see
> > how you handle aliases for data (an old thread already spoke about that).
> 
> i have found 3 strategies to handle aliases in the LSU and the fetcher.
> all have bad and good sides... please note that i avoid by all possible
> means the use of an intermediary LUT that transforms the register
> number into a L0 line number, because it creates an overhead.
> The mechanism must be an "associative memory" : you give the key
> and you get the data, not an index.
> 

It's called a CAM; it's used for TLBs, and it's sloooow and very big
because you need an XOR array to check equality of the key for each line.
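For readers unfamiliar with the term, here is what such an associative
lookup does, sketched in software (illustrative only — in hardware all
the per-line comparisons happen in parallel, which is exactly the XOR
comparator array nicO mentions):

```python
# Content-addressable memory (CAM) sketch: look up data by key,
# not by index; every line's key is compared against the search key.

class CAM:
    def __init__(self, nlines):
        self.keys = [None] * nlines
        self.data = [None] * nlines

    def insert(self, line, key, value):
        self.keys[line] = key
        self.data[line] = value

    def lookup(self, key):
        # in hardware: one equality comparator (XOR tree) per line, in parallel
        matches = [i for i, k in enumerate(self.keys) if k == key]
        return self.data[matches[0]] if matches else None

cam = CAM(8)
cam.insert(3, 0x4000, "line-3 payload")
assert cam.lookup(0x4000) == "line-3 payload"
assert cam.lookup(0x5000) is None
```

The area and speed cost grows with the number of lines, since each line
needs its own full-width comparator.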

>  1) big and naughty : "copy everything."
> the idea is that there is still a 1-to-1 association between a register
> number and a physical line. But in this strategy, we allow the furiously
> silly strategy of having the same logical address mapped to several physical
> lines. This has the advantage that you can have as many aliases as there
> are lines, but you imagine the crazy mess that maintains the coherency
> between the lines... it _could_ be done but i don't like it because the
> coherency management can become too close to a non-associative memory.
>

I'm sorry, but i understand nothing. The problem is having 2 or more
registers that point to the same memory location. One could be coherent
while the others are not. What happens if i try to access the memory
through a non-coherent register? We must use the address somewhere (as
in a typical write buffer). Otherwise, how can we be sure to access the
right data without that information?
 
>  2) working with pairs of lines.
> This sounds natural at first glance because the L0 buffers will
> certainly use a double-buffering strategy. If we work by pairs,
> there will be less problems with double-buffering.
> I have counted that given 2 registers and 2 lines, there are 8 legal
> associations, so it can be coded into 3 bits. This means that when
> the comparison with the register number is successful, we need a very low
> overhead to know which line is read. It is also reversible : given
> an address, we can determine which register(s) to deallocate (for
> example when we want to flush a line, we have to flush the associated
> registers). This is good (tm) but the alias is limited to 2 registers.
> 

So each line is associated with 2 registers. How do you do that? With a
double-entry CAM?

And what if a stupid compiler creates 3 aliases?

>  3) the '<=>' strategy
> This one is a bit more sophisticated but it is similar to 2).
> however here we now extend to 3 registers because we don't work
> with pairs, but with 3 consecutive lines and registers.
> register comparator number i can be associated to the line i-1, i or i+1.
> One line is associated exclusively to 1 address (like 2)).
> One register can be associated to one line, but one line can be linked
> up to 3 registers (except when i=0 or i=7, where there are only 2 neighbours).
> It is more flexible for the double-buffering because there is more choice.
> I have not finished my analysis but i think i'll work on that strategy.
> 
> Warning !
> Augmenting the associativity will augment the overhead, so i think that
> the 3rd method is a good compromise between simplicity, flexibility
> and efficiency. using a 4-register neighbourhood will double the logic
> and the overhead. the 3rd method already requires 22 memory bits and the
> LRU and replacement strategy is more complex than the simple
> direct-associativity strategy.
> 

Sure! But you don't solve the general cases (in fact the worst cases).
This cache could work very well for jumps, but for data there are so
many coherency problems. Adding associativity isn't a good solution
because you limit the number of possible aliases. I have understood that
programmers don't like having such coding rules (they call them "stupid"
because they don't understand their usefulness).
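To make the limitation of the '<=>' strategy quoted above concrete, here
is its neighbourhood rule in miniature (my reading of Yann's description:
register comparator i may be associated with line i-1, i or i+1, clamped
at the edges — a sketch, not the actual LSU logic):

```python
# The '<=>' neighbourhood rule: which L0 lines may register i map to?

NLINES = 8

def candidate_lines(i):
    # register i can be linked to lines i-1, i, i+1, within bounds
    return [j for j in (i - 1, i, i + 1) if 0 <= j < NLINES]

assert candidate_lines(0) == [0, 1]      # only 2 neighbours at the edges
assert candidate_lines(7) == [6, 7]
assert candidate_lines(3) == [2, 3, 4]   # 3 choices in the middle
```

This shows both sides of the argument: the flexibility is real (up to 3
choices per register), but so is the cap on aliases that nicO objects to,
since a compiler that creates more aliases than neighbours cannot be
served.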

nicO

> > nicO
> WHYGEE
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> *************************************************************
> To unsubscribe, send an e-mail to majordomo@seul.org with
> unsubscribe f-cpu       in the body. http://f-cpu.seul.org/