
Re: [f-cpu] OOC vs. OOE (scoreboard & tomasulo)



hi !

devik wrote:

Hello,

I know many people here have too much work, but I'd
like to start a short thread to get some deeper
knowledge in this field.

I was looking for information related to the subject but
haven't found much. So I studied Tomasulo, looked at
SPARC, MIPS and Alpha, and tried to form my own opinion.
I'd like to hear the (personal) feelings and comments of
other people here (which are valuable to me, by the way).

At first glance, OOC is much simpler to implement.
In OOE you have to think about:
- a ROB, because of exceptions and control speculation
- for Tomasulo-based designs, an expensive CDB and many
 comparators (but bypassing needs them anyway)

Similarly, the gains of OOE, more specifically of Tomasulo-like
designs:
- ISA is the same for different number and speed of EUs

now look at the latest x86 architectures : their organisation has
changed so radically in recent years that it is impossible to develop
"optimised" software for several platforms at once.
all that matters today is compatibility with the ISA.
so let's start with a simple architecture...

- adaptation of EU control to the actual data flow, including
 dynamic loop unrolling
- can work around memory latency (!)

this one is very important, but the complexity and its price is
probably too high for today's F-CPU team ....

- simple register renaming allows much fewer architectural
 registers without sacrificing performance

register renaming = more transistors and longer pipeline.

- a simpler datapath: you can build a 4-way CPU with only 1R1W
 register sets

So what's the real performance difference between OOE and OOC
(let's forget about the ISA binary compatibility issue)?
IMHO it is memory loads.

there is another one : today (as of y2k, as opposed to the 60's),
pipeline depths and clock frequencies have increased a lot.
Tomasulo and/or OOOE can compensate for several cycles,
but the "really hurting" latencies come from cache misses to "remote"
memories, whether SDRAM, the HDD or another CPU...
and these last far longer than the few cycles that OOOE can
compensate for.

When a Tomasulo-based CPU hits a memory load, it starts the load and
keeps going. All dependent instructions are placed into
reservation stations (RS) near (!) their EUs.

this means that it is able to compensate for L1, maybe L2, cache misses
under good conditions. This is equivalent to the reordering capability of
compilers. So let's spare ourselves the complex reordering stuff that
consumes power and dev/debug time ....

We can load many dataflow traces into the CPU this way (limited
by the number of RS, but not necessarily by the internal register
set size), and these get started asynchronously by the memory
subsystem when data arrives from slower memory.

All other EUs are typically semi-predictable (we know the exact
latency once we know the data size) and we can always schedule
them well for OOC. But loads are killers.

right, but we can't do magic : memory latency increases as the memory's size does.
it's the compiler's job to organise data so cache misses are reduced, and the user's
task to code cleanly enough to allow the compiler to do its work....

- prefetch/preload
is nice, but often you don't know the correct address in advance.

it's possible for most numerical and DSP/imaging code so it's ok....

Take a simple case: deleting an element from a doubly linked
list. You need to load x->prev and x->next and update them:
x->prev->next = x->next;
x->next->prev = x->prev;
load [r1],r3      ; r3 = x->prev   (r1 = &x, prev at offset 0)
load [r1+8],r4    ; r4 = x->next   (next at offset 8)
store r4,[r3+8]   ; x->prev->next = x->next
store r3,[r4]     ; x->next->prev = x->prev

you chose one of the worst cases ....
F-CPU is not well suited to linked lists.
However, if you need maximum performance, you can
probably find a way to interleave memory references
(but then you'll need to manage several lists at the same time)

Assume an L1 latency of 2 cycles and an L2 latency of 20 cycles.

that's not exactly how it is for FC0.
L0 is around 3 cycles (that is : the minimum number of
cycles to get data from the LSU), L1 is a few cycles more
and is triggered on a miss (let's say around 4 cycles on top of
the LSU/L0 cycles). L2 is not well known because
it depends on which block is chosen (it's usually a "standard cell").

If x->prev is in the L2 cache and x->next is in the L1 cache, then
a 1-issue Tomasulo machine will fill the 4 RS in 4 cycles; by then the
second load has finished, which also starts the first store.
The second load-store pair completes some time later, and
we can run other code in between.
An OOC CPU will stall at the first load for 20 cycles ..

that's the "price" for a simple core.
And you can see that even though it's simple,
it's far from finished.

Maybe an OOOE core will be done for FC1.
The debate has been open for a long time, but let's finish FC0 now :-)

So, do you think a good OOC design can perform better than
a good OOE design for generic code?

the problem is more one of managing complexity : FC0 is already
complex enough that it isn't finished. If OOOE were used,
then i believe that no code would be available yet....

I don't think so, as long as the latency of memory loads
remains highly nondeterministic.

hints can come from the program, from tracing and many other methods...

-------
Don't read below if your mind is labile ;)
If memory loads are the only problem, what about something
like a "machine code select(2)" for OOC ? I mean the unix API
select(2).

That is to say, in asm: now stall on these (already started)
loads and execute the part of the code (below) corresponding to
whichever load completed. For the linked list above it would read:
xload [r1],r3
xload [r1+8],r4
wait r3,a,1,r4,b,1,c
a:store r4,[r3+8]
b:store r3,[r4]
c:

xload is a non-blocking load (like a prefetch); the wait line stands
for: wait for r3; when it completes, schedule 1 insn at relative
address a ... similarly for r4; when both have finished,
continue at c.
You probably get the (crazy) idea .. it is food for thought.
The idea is very fresh and maybe I'll dump it tomorrow.

But I'm still interested in your ideas about the subject.

as nicO said, the main problem is to know how to not break the whole architecture
in order to allow context switches and interrupts.

Theoretically, it is possible to make a conditional instruction based on
whether the load or store has completed. Though, for the load, it's far
from certain, because i remember that the load HAS TO COMPLETE before
another instruction is executed; otherwise it could leave the pipeline
in a state where an interrupt or fault would not be recoverable.

Imagine you do :
load [Rx] -> Ry
store [Rv] <- Rw

now, imagine that the load 'passes' but is not completed.
the store traps for whatever reason (misaligned, page fault ...)
and the SRB is started : upon return, the SRB will restore the old value of Ry.

Well, i agree : that's the "old SRB" version;
it could be possible (and, i presume, necessary) to add an
LSU sync to it (the register is "dirty" as long as the load or store
is not completed).

But i have come to the conclusion that memory and the compiler/source code
are the main limits, however sophisticated the CPU. Look at the
most recent CPUs : memory is still the main problem, much
more than execution strategies or ISA issues... So i take this as
a "fixed limit", and even compared to a competing MIPS
or similar architecture, the memory structure of F-CPU is not that bad....

Now, to answer your question, it may be possible to include
conditional instructions based on the availability of loaded data,
but in the end .... what does that change ? take a look at usual
code : it doesn't care whether data are ready,
it just wants to use them. So the "data loaded" condition
would be useless....
However, a SMT core would compensate for this more easily.

good night,
devik

have fun,
YG

*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/