Re: [f-cpu] OOC vs. OOE (scoreboard & tomasulo)
On Sun, 2 Mar 2003 23:10:05 +0100 (CET)
devik <devik@cdi.cz> wrote:
> Hello,
>
> I know many people here have too much work, but I'd
> like to start a short thread in order to get
> deeper knowledge of the field.
>
> I was looking for some info related to the subject but
> haven't found much. So I studied Tomasulo,
> looked at SPARC, MIPS and Alpha and tried to
MIPS is often cited as a very good architecture, much more often than
SPARC or Alpha. Cray is cited a lot, too.
And the very new Cray X1 uses a modified MIPS ISA.
> form my own opinion. I'd like to hear the (personal)
> feelings and comments of other people here
> (which are valuable to me, by the way).
>
> At first glance OOC (out-of-order completion) is much simpler to
> implement. In OOE (out-of-order execution) you have to think about
> - a ROB, because of exceptions and control speculation
> - for Tomasulo-based designs, an expensive CDB and many
> comparators (but bypass needs them anyway)
>
> Similarly, the gains of OOE, more specifically of Tomasulo-like
> designs:
> - the ISA is the same for a different number and speed of EUs
> - EU control adapts to the actual data flow, including
> dynamic loop unrolling
> - can work around memory latency (!)
> - simple register renaming allows far fewer architectural
> registers without sacrificing performance
> - simpler datapath: you can build a 4-way CPU with only 1r1w
> register sets
One other nice thing: an R1 op R2 -> R3 instruction set isn't really
an obligation. We could use R1 op R2 -> R2; this could be a saving because
the fanout of instructions is very small (1.4 on average).
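
To make "fanout" concrete, here is a rough sketch. It assumes fanout means the number of later instructions that read a value before its register is overwritten. The trace format, the function name and the toy four-instruction trace are made up for illustration; this is not how the 1.4 figure was measured.

```python
def average_fanout(trace):
    """trace: list of (dst, srcs); dst is a register name or None,
    srcs is a tuple of register names read by the instruction."""
    live = {}    # register -> index in `counts` of the value it currently holds
    counts = []  # counts[i] = number of instructions that read value i
    for dst, srcs in trace:
        # Count each reading instruction once, even if it names a register twice.
        for reg in set(srcs):
            if reg in live:
                counts[live[reg]] += 1
        if dst is not None:
            # A write starts a new value; the old value can no longer be read.
            live[dst] = len(counts)
            counts.append(0)
    return sum(counts) / len(counts) if counts else 0.0


# Hypothetical toy trace: two loads followed by two stores that read both results.
TRACE = [
    ("r3", ("r1",)),
    ("r4", ("r1",)),
    (None, ("r4", "r3")),
    (None, ("r3", "r4")),
]

print(average_fanout(TRACE))
```

Values whose producer lies outside the trace (r1 here) are not counted. With a low average fanout, overwriting a source register as in R1 op R2 -> R2 rarely destroys a value that is still needed.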
>
> So what's the real performance difference (let's forget about the ISA
> binary compatibility thing) between OOE & OOC?
> IMHO it is the memory load.
> When a Tomasulo-based CPU hits a memory load it starts it and
> goes on. All dependent instructions are placed into
> reservation stations (RS) near (!) their EUs.
> We can load many dataflow traces into the CPU this way (limited
> by the number of RS but not necessarily by the internal register
> set size) and these get started asynchronously by the memory
> subsystem when data arrives from slower memory.
>
> All other EUs are typically semi-predictable (we know the exact
> latency when we know the data size) and we can always schedule
> them well for OOC. But loads are killers: prefetch/preload
> is nice, but often you don't know the correct address in advance.
> Take a simple case: deleting an element from a doubly linked
> list. You need to load x->prev and x->next and update them.
> x->prev->next = x->next;
> x->next->prev = x->prev;
> load [r1],r3
> load [r1+8],r4
> store r4,[r3+8]
> store r3,[r4]
>
> Assume an L1 latency of 2 and an L2 latency of 20 cycles.
> If x->prev is in the L2 cache and x->next is in the L1 cache, then
> a 1-issue Tomasulo machine will fill 4 RS in 4 cycles; then the second
> load finishes, which also starts the first store.
> The second load-store is started some time later and
> we can run other code in between.
> An OOC CPU will stall at the first load for 20 cycles ...
>
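
A rough way to check the numbers above is a tiny timing model. This is only a sketch under stated assumptions: one instruction issued per cycle, a store needs all of its source registers and takes 1 cycle (my assumption, not from the post), and the in-order machine stalls on first use of a missing operand rather than at the load itself. The names Insn and schedule are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

L1_LATENCY = 2    # cycles, from the example above
L2_LATENCY = 20   # cycles, from the example above
STORE_LATENCY = 1 # assumption for this sketch


@dataclass
class Insn:
    text: str
    dst: Optional[str]      # register written, None for stores
    srcs: Tuple[str, ...]   # registers read
    latency: int


PROGRAM = [
    Insn("load [r1],r3", "r3", ("r1",), L2_LATENCY),     # x->prev, hits L2
    Insn("load [r1+8],r4", "r4", ("r1",), L1_LATENCY),   # x->next, hits L1
    Insn("store r4,[r3+8]", None, ("r4", "r3"), STORE_LATENCY),
    Insn("store r3,[r4]", None, ("r3", "r4"), STORE_LATENCY),
]


def schedule(prog, in_order, initial_ready=None):
    """Return (rows, free): rows are (text, issue, start, done) per instruction,
    free is the first cycle in which the issue stage can take unrelated code."""
    ready = dict(initial_ready or {"r1": 0})
    issue = 0
    rows = []
    for insn in prog:
        operands = max(ready[s] for s in insn.srcs)
        if in_order:
            # Stall-on-use: the issue stage itself waits for the operands.
            issue = max(issue, operands)
            start = issue
        else:
            # Tomasulo-like: park the instruction in an RS, keep issuing.
            start = max(issue, operands)
        done = start + insn.latency
        if insn.dst is not None:
            ready[insn.dst] = done
        rows.append((insn.text, issue, start, done))
        issue += 1
    return rows, issue


for label, mode in (("dataflow", False), ("in-order", True)):
    rows, free = schedule(PROGRAM, in_order=mode)
    print(label, "issue stage free at cycle", free)
    for row in rows:
        print("   ", row)
```

In this model the dataflow machine has its issue stage free after cycle 4, while the in-order one is blocked until cycle 22. Note that both stores read r3 here, so in this simplified model neither store executes before the slow load returns; the gain is the unrelated code that can be issued in the meantime.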
> So what do you think: can a good OOC perform better than
> a good OOE for generic code?
> I don't think so, as long as the latency of memory loads stays
> highly nondeterministic.
>
> -------
> Don't read below if your mind is labile ;)
> If memory load is the only problem, what about something
> like a "machine code select(2)" for OOC? I mean the Unix API
> select(2).
Ouch... polling many load results?
>
> That is, to say in asm: now stall on these (started) four
> loads and execute the part of the code (that follows) depending on
> which load completed. For the linked list above it would read:
> xload [r1],r3
> xload [r1+8],r4
> wait r3,a,1,r4,b,1,c
> a:store r4,[r3+8]
> b:store r3,[r4]
> c:
>
> xload is a non-blocking load (like prefetch); the wait line stands
> for:
> wait for r3; when it completes, schedule 1 insn at relative
> address a ... similarly for r4; when both have finished,
> continue at c.
> You probably got the (crazy) idea .. it is food for thought.
> The idea is too fresh and maybe I'll dump it tomorrow.
>
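
The semantics of the proposed wait can be sketched in a few lines. This is only an illustration, assuming each pending xload has a known completion cycle and that handlers run in completion order; the function run_wait and the tuple layout are hypothetical. It does not model a handler that also needs the other, still pending, register (as store r4,[r3+8] does).

```python
import heapq


def run_wait(pending, continue_label):
    """pending: list of (register, completion_cycle, label, n_insns).
    Returns the dispatch trace as (cycle, label, n_insns) tuples."""
    # Order the outstanding loads by completion time, like select(2)
    # returning whichever descriptor becomes ready first.
    heap = [(cycle, reg, label, n) for reg, cycle, label, n in pending]
    heapq.heapify(heap)
    trace = []
    cycle = 0
    while heap:
        cycle, reg, label, n = heapq.heappop(heap)
        trace.append((cycle, label, n))   # schedule n insns at `label`
    # All loads have completed: fall through to the continuation.
    trace.append((cycle, continue_label, 0))
    return trace


# wait r3,a,1,r4,b,1,c with r3 coming from L2 (20 cycles) and r4 from L1 (2 cycles)
TRACE = run_wait([("r3", 20, "a", 1), ("r4", 2, "b", 1)], "c")
print(TRACE)
```

With these latencies the handler at b is dispatched first, then a, then execution continues at c.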
> But I'm still interested in your ideas about the subject.
>
The most annoying thing is how to manage the state of the CPU. State
must be saved and restored at a context switch, and the wait instruction
can have so many fields.
nicO
> good night,
> devik
>
> *************************************************************
> To unsubscribe, send an e-mail to majordomo@seul.org with
> unsubscribe f-cpu in the body. http://f-cpu.seul.org/