
Re: [f-cpu] OOC vs. OOE (scoreboard & tomasulo)

To: f-cpu@seul.org
Subject: Re: [f-cpu] OOC vs. OOE (scoreboard & tomasulo)
From: nico <nicolas.boulay@ifrance.com>
Date: Sun, 2 Mar 2003 23:38:10 +0000
Delivery-date: Sun, 02 Mar 2003 17:39:05 -0500
In-reply-to: <Pine.LNX.4.33.0303021959510.525-100000@devix>
References: <Pine.LNX.4.33.0303021959510.525-100000@devix>
Reply-to: f-cpu@seul.org
Sender: owner-f-cpu@seul.org

On Sun, 2 Mar 2003 23:10:05 +0100 (CET)
devik <devik@cdi.cz> wrote:

> Hello,
> 
> I know many people here have too much work, but I'd
> like to start a short thread in order to get
> deeper knowledge of the field.
> 
> I was looking for some information related to the subject
> but haven't found much. So I learned Tomasulo,
> looked at SPARC, MIPS and Alpha, and tried to

MIPS is often cited as a very good architecture, much more often than
SPARC or Alpha. Cray is cited a lot, too.

And the very new Cray X1 uses a modified MIPS ISA.

> form my own opinion. I'd like to hear the (personal)
> feelings and comments of other people here
> (which are valuable to me, by the way).
> 
> At first glance OOC is much simpler to implement.
> In OOE you have to think about
> - a ROB, because of exceptions and control speculation
> - for Tomasulo-based designs, an expensive CDB and many
>   comparators (but bypass needs them anyway)
> 
> Similarly, the gains of OOE, more specifically of Tomasulo-like
> designs:
> - the ISA is the same for different numbers and speeds of EUs
> - EU control adapts to the actual data flow, including
>   dynamic loop unrolling
> - can work around memory latency (!)
> - simple register renaming allows far fewer architectural
>   registers without sacrificing performance
> - simpler datapath: you can build a 4-way CPU with only 1r1w
>   register sets

One other nice thing: a three-operand instruction set (R1 op R2 -> R3)
is not really mandatory. We could use R1 op R2 -> R2; this could be a
saving, because the fanout of instructions is very small (1.4 on average).
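
The fanout mentioned above is the number of later instructions that read
a produced value before its register is overwritten. A minimal sketch of
how such an average could be measured, assuming a hypothetical trace
format of (dest, sources) tuples; the name mean_fanout and the sample
trace are illustrative only and do not come from any F-CPU tool:

```python
def mean_fanout(trace):
    """Average number of consumer instructions per produced value.

    trace: list of (dest, srcs) tuples; dest is a register name or None
    (e.g. for a store), srcs is a tuple of register names read.
    """
    readers = []   # readers[i] = how many instructions read value i
    live = {}      # register -> index of the value it currently holds
    for dest, srcs in trace:
        for reg in set(srcs):          # count each instruction once per value
            if reg in live:
                readers[live[reg]] += 1
        if dest is not None:           # a new value replaces the old one
            live[dest] = len(readers)
            readers.append(0)
    return sum(readers) / len(readers) if readers else 0.0


# Hypothetical trace: r1 is read twice, r2 and r3 once each.
sample = [
    ("r1", ()),
    ("r2", ("r1",)),
    ("r3", ("r1", "r2")),
    (None, ("r3",)),
]
print(mean_fanout(sample))
```

A value with a fanout of 1 is dead after its single use, which is the
case where overwriting a source register (R1 op R2 -> R2) costs nothing.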

> 
> So what's the real performance difference between OOE and OOC
> (let's forget about the ISA binary compatibility thing)?
> IMHO it is memory loads.
> When a Tomasulo-based CPU hits a memory load, it starts it and
> moves on. All dependent instructions are placed into
> reservation stations (RS) near (!) their EUs.
> We can load many dataflow traces into the CPU this way (limited
> by the number of RS but not necessarily by the internal register
> set size), and these get started asynchronously by the memory
> subsystem when data arrives from slower memory.
> 
> All other EUs are typically semi-predictable (we know the exact
> latency when we know the data size) and we can always schedule
> them well for OOC. But loads are killers: prefetch/preload
> is nice, but often you don't know the correct address in advance.
> Take a simple case: deleting an element from a doubly linked
> list. You need to load x->prev and x->next and update them.
> x->prev->next = x->next;
> x->next->prev = x->prev;
> load [r1],r3
> load [r1+8],r4
> store r4,[r3+8]
> store r3,[r4]
> 
> Assume an L1 latency of 2 cycles and an L2 latency of 20.
> If x->prev is in the L2 cache and x->next is in the L1 cache, then
> a 1-issue Tomasulo CPU will fill 4 RS in 4 cycles; by then the second
> load has finished, which also starts the first store.
> The second load-store pair is started some time later, and
> we can run other code in between.
> An OOC CPU will stall at the first load for 20 cycles ..
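
The comparison above can be played through with a toy model. This is
only a sketch under loud assumptions: a 1-issue machine, unlimited RS
and EUs, stores taking 1 cycle, and an in-order machine that stalls at
the first use of a missing operand (not at the load itself). The names
Instr and schedule, and the extra independent instruction "other", are
made up for illustration. Each store lists both r3 and r4 as operands,
as in the assembly above, so in this model both stores wait for the
slower load either way; the difference shows in when independent work
can run.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    name: str
    dest: Optional[str]      # register written, or None for a store
    srcs: Tuple[str, ...]    # registers read
    latency: int             # cycles from start of execution to result


def schedule(program: List[Instr], stall_issue: bool) -> dict:
    """Return {name: (issue_cycle, done_cycle)} for a 1-issue machine."""
    ready = {}   # register -> cycle at which its value is available
    issue = 0    # next free issue slot
    result = {}
    for ins in program:
        operands = max((ready.get(r, 0) for r in ins.srcs), default=0)
        if stall_issue:
            # In-order model: the issue stage itself waits for operands.
            issue = max(issue, operands)
            start = issue
        else:
            # RS model: issue goes on, the instruction waits in its RS.
            start = max(issue, operands)
        done = start + ins.latency
        if ins.dest is not None:
            ready[ins.dest] = done
        result[ins.name] = (issue, done)
        issue += 1
    return result


L1_LATENCY, L2_LATENCY = 2, 20
program = [
    Instr("load_prev", "r3", ("r1",), L2_LATENCY),   # load [r1],r3
    Instr("load_next", "r4", ("r1",), L1_LATENCY),   # load [r1+8],r4
    Instr("store_a", None, ("r3", "r4"), 1),         # store r4,[r3+8]
    Instr("store_b", None, ("r3", "r4"), 1),         # store r3,[r4]
    Instr("other", "r5", ("r6",), 1),                # assumed independent work
]

rs_model = schedule(program, stall_issue=False)
inorder_model = schedule(program, stall_issue=True)
print("RS model:      ", rs_model["other"])
print("in-order model:", inorder_model["other"])
```

With these numbers the independent instruction issues at cycle 4 in the
RS model and at cycle 22 in the stalling model.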
> 
> So what do you think: can a good OOC design perform better than
> a good OOE one for generic code?
> I don't think so, as long as memory load latency stays
> highly nondeterministic.
> 
> -------
> Don't read below if your mind is labile ;)
> If memory load is the only problem, what about something
> like a "machine code select(2)" for OOC? I mean the Unix
> select(2) API.

Ouch... polling many load results?

> 
> That is, to say in asm: now stall on these four (already started)
> loads and execute the part of the code that follows depending on
> which load completed. For the linked list above it would read:
> xload [r1],r3
> xload [r1+8],r4
> wait r3,a,1,r4,b,1,c
> a:store r4,[r3+8]
> b:store r3,[r4]
> c:
> 
> xload is a non-blocking load (like prefetch); the wait line stands
> for:
> wait for r3; when it completes, schedule 1 insn at relative
> address a ... similarly for r4; when both have finished,
> continue at c.
> You probably get the (crazy) idea .. it is food for thought.
> The idea is too fresh and maybe I'll dump it tomorrow.
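
The dispatch order described for the wait line can be sketched as a
small deterministic model. Everything here is an assumption made for
illustration: the function name wait, the handler signature, and the
completion cycles (20 for r3, 2 for r4, taken from the L2/L1 latencies
quoted earlier). The sketch only records which handler runs when; it
does not check whether the instruction at a or b has all of its other
operands available at that point.

```python
def wait(pending, handlers):
    """Run one handler per load in completion order; return the resume cycle.

    pending:  {register: cycle at which its xload completes}
    handlers: {register: callable(register, cycle)}
    """
    for reg in sorted(pending, key=lambda r: pending[r]):
        handlers[reg](reg, pending[reg])
    # Execution continues at label c once every load has completed.
    return max(pending.values(), default=0)


log = []


def record(label):
    """Build a handler that logs (cycle, label, register)."""
    def handler(reg, cycle):
        log.append((cycle, label, reg))
    return handler


# wait r3,a,1,r4,b,1,c with r3 arriving late and r4 arriving early
pending = {"r3": 20, "r4": 2}
resume_cycle = wait(pending, {"r3": record("a"), "r4": record("b")})
print(log, resume_cycle)
```

Here the handler at b runs first because r4 arrives first, and the code
at c resumes only after the slower load.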
> 
> But I'm still interested in your ideas about the subject.
> 

The most annoying thing is how to manage the state of the CPU. State
must be saved and restored at a context switch, and the wait instruction
can have a great many fields.

nicO

> good night,
> devik
> 
> *************************************************************
> To unsubscribe, send an e-mail to majordomo@seul.org with
> unsubscribe f-cpu       in the body. http://f-cpu.seul.org/
