
[f-cpu] OOC vs. OOE (scoreboard & tomasulo)



Hello,

I know many people here have too much work, but I'd
like to start a short thread in order to get
deeper knowledge of the field.

I was looking for some info related to the Subj. but
haven't found much. So I learned Tomasulo,
looked at Sparc, MIPS and Alpha and tried to
form my own opinion. I'd like to hear the (personal)
feelings and comments of other people here
(which are valuable to me, by the way).

At first glance OOC is much simpler to implement.
In OOE you have to think about
- a ROB, because of exceptions and control speculation
- for Tomasulo-based designs, an expensive CDB and many
  comparators (but bypass needs them anyway)

Similarly, the gains of OOE, more specifically of Tomasulo-like
designs:
- the ISA is the same for different numbers and speeds of EUs
- EU control adapts to the actual data flow, including
  dynamic loop unrolling
- it can work around memory latency (!)
- simple reg. renaming allows far fewer architectural
  registers without sacrificing performance
- simpler datapath: you can do a 4-way CPU with only 1r1w
  register sets

So what's the real performance difference between OOE and OOC
(let's forget about the ISA binary compatibility thing)?
IMHO it is the memory load.
When a Tomasulo-based CPU hits a memory load, it starts it and
goes on. All dependent instructions are placed into
reservation stations (RS) near (!) their EUs.
We can load many dataflow traces into the CPU this way (limited
by the number of RS but not necessarily by the internal register
set size), and these get started asynchronously by the memory
subsystem when data arrives from slower memory.
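
To make the wake-up mechanism concrete, here is a minimal sketch
in Python. It is a toy model, not any real design: the names
RSEntry and broadcast are hypothetical, operands are tracked by
register name instead of renamed tags, and the values are made up.
It only shows that an entry sits in its RS until every operand it
waits for has appeared on the CDB.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RSEntry:
    """One reservation-station entry (hypothetical toy structure)."""
    op: str
    # operand name -> value, None while the producer is still in flight
    waiting: Dict[str, Optional[int]]

    def ready(self) -> bool:
        return all(v is not None for v in self.waiting.values())


def broadcast(stations: List[RSEntry], tag: str, value: int) -> List[str]:
    """Put (tag, value) on the common data bus.

    Every entry compares the tag against its pending operands (this is
    where the comparators go) and captures the value on a match.
    Returns the ops that became ready because of this broadcast.
    """
    woken = []
    for rs in stations:
        if tag in rs.waiting and rs.waiting[tag] is None:
            rs.waiting[tag] = value
            if rs.ready():
                woken.append(rs.op)
    return woken


if __name__ == "__main__":
    # Two entries that wait for both loaded registers (made-up values).
    stations = [
        RSEntry("store r4,[r3+8]", {"r3": None, "r4": None}),
        RSEntry("store r3,[r4]", {"r3": None, "r4": None}),
    ]
    print(broadcast(stations, "r4", 0x2000))  # fast load arrives first
    print(broadcast(stations, "r3", 0x1000))  # slow load arrives later
```

The first broadcast wakes nothing because r3 is still pending; the
second one makes both entries ready at once.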

All other EUs are typically semi-predictable (we know the exact
latency when we know the data size) and we can always schedule
them well for OOC. But loads are killers - prefetch/preload
is nice, but often you don't know the correct address in advance.
Take a simple case - deleting an element from a doubly linked
list. You need to load x->prev and x->next and update them.
x->prev->next = x->next;
x->next->prev = x->prev;
load [r1],r3
load [r1+8],r4
store r4,[r3+8]
store r3,[r4]

Assume an L1 latency of 2 cycles and an L2 latency of 20.
If x->prev is in L2 cache and x->next is in L1 cache, then a
1-issue Tomasulo CPU will fill 4 RS in 4 cycles; by then the
second load has finished, which also starts the first store.
The second load-store pair starts some time later, and
we can run other code in between.
An OOC CPU will stall at the first load for 20 cycles ..
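
The difference can be put into numbers with a small timing model.
This is only a sketch under loud assumptions: one instruction is
issued or dispatched per cycle, there are unlimited RS and EUs,
stores take 1 cycle, memory ordering is ignored, and the in-order
machine stalls at the first instruction that reads a pending
register (one possible reading of the stall above). The two add
instructions are hypothetical independent code appended after the
list update; Insn and finish_times are made-up names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Insn:
    name: str
    dst: Optional[str]        # register written, None for stores
    srcs: Tuple[str, ...]     # registers read
    latency: int              # cycles from start to result


def finish_times(prog: List[Insn], in_order: bool) -> Dict[str, int]:
    """Return the cycle at which each instruction finishes."""
    ready: Dict[str, int] = {}    # register -> cycle its value is available
    done: Dict[str, int] = {}
    next_slot = 0                 # next free issue/dispatch cycle
    for insn in prog:
        operands = max((ready.get(r, 0) for r in insn.srcs), default=0)
        start = max(next_slot, operands)
        if in_order:
            next_slot = start + 1     # a stalled insn blocks all later ones
        else:
            next_slot += 1            # it waits in its RS, dispatch goes on
        done[insn.name] = start + insn.latency
        if insn.dst is not None:
            ready[insn.dst] = done[insn.name]
    return done


PROGRAM = [
    Insn("load1", "r3", ("r1",), 20),        # load [r1],r3    (L2 hit)
    Insn("load2", "r4", ("r1",), 2),         # load [r1+8],r4  (L1 hit)
    Insn("store1", None, ("r3", "r4"), 1),   # store r4,[r3+8]
    Insn("store2", None, ("r3", "r4"), 1),   # store r3,[r4]
    Insn("add1", "r5", ("r6", "r7"), 1),     # hypothetical independent code
    Insn("add2", "r8", ("r5", "r6"), 1),
]

if __name__ == "__main__":
    print("dataflow:", finish_times(PROGRAM, in_order=False))
    print("in-order:", finish_times(PROGRAM, in_order=True))
```

Under these assumptions the independent add2 finishes at cycle 6
in the dataflow model and at cycle 24 in the in-order one, while
the stores finish at about the same time in both.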

So what do you think: can a good OOC perform better than a
good OOE for generic code?
I don't think so, as long as the latency of memory loads stays
highly nondeterministic.

-------
Don't read below if your mind is labile ;)
If memory load is the only problem, what about something
like a "machine code select(2)" for OOC? I mean the unix API
select(2).

That is, to say in asm: now stall on these (started) four
loads and execute the part of the code that follows depending on
which load completed. For the linked list above it would read:
xload [r1],r3
xload [r1+8],r4
wait r3,a,1,r4,b,1,c
a:store r4,[r3+8]
b:store r3,[r4]
c:

xload is a non-blocking load (like prefetch); the wait line stands
for:
wait for r3; when it completes, schedule 1 insn at relative
address a ... similarly for r4; when both have finished,
continue at c.
You probably get the (crazy) idea .. it is food for thought.
The idea is too fresh and maybe I'll dump it tomorrow.
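
For what it's worth, the intended behaviour of the wait line can
be emulated in a few lines of Python. This is only a sketch of the
semantics as described above, not a proposal for hardware:
emulate_wait is a made-up name, the completion cycles are assumed
numbers, and the "code at a label" is just a callback.

```python
import heapq
from typing import Callable, Dict, List, Tuple


def emulate_wait(pending: Dict[str, int],
                 targets: Dict[str, Callable[[int], None]]
                 ) -> List[Tuple[int, str]]:
    """Serve each started xload in completion order.

    pending maps a register to the (assumed) cycle its xload completes,
    targets maps the same register to the code scheduled on completion.
    Returns the (cycle, register) pairs in the order they were served.
    """
    queue = [(cycle, reg) for reg, cycle in pending.items()]
    heapq.heapify(queue)                  # earliest completion first
    served = []
    while queue:
        cycle, reg = heapq.heappop(queue)
        targets[reg](cycle)               # run the insn(s) at that label
        served.append((cycle, reg))
    return served                         # all loads done: continue at c


if __name__ == "__main__":
    log = []
    # wait r3,a,1,r4,b,1,c with r3 coming from L2 and r4 from L1
    emulate_wait(
        {"r3": 20, "r4": 2},
        {"r3": lambda c: log.append(("a", c)),
         "r4": lambda c: log.append(("b", c))},
    )
    print(log)
```

Here b runs first because r4 arrives first, then a, and then
execution would fall through to c. Note that in this particular
sequence each store reads both r3 and r4, so a real machine would
presumably still have to interlock on the other register; the
sketch ignores that.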

But I'm still interested in your ideas about the Subj.

good night,
devik
