
[f-cpu] latest gcc & immediate addressing [Was: BOUNCE f-cpu@seul.org:...]



> can you upload the file to the anonymous FTP
> at ftp.seul.org : log in as anonymous and enter anything
> as pwd. then go to pub/f-cpu/contrib and upload in binary :-)

Ahh ok done :) It is there as gcc32fcpu_20021229.tgz.
There is also a readme in the tar explaining what is done
in gcc and what problems with porting we have (mainly
memory subsys and pointer aliasing).
I really miss a paper about the intended LSU internals. There
are some bits on the list but it is hard to see what is
still valid right now.

> BTW, don't be too much obsessed by code size,
> i have seen that IA64 code is much larger (often 2x)
> than x86, so it's ok given the current low optimisation.
> then we'll have to teach GCC new tricks to get the last
> 10% of performance.

Yes. I should rather compare the number of instructions; it
can give us a rough estimate of optimizer quality. I do
it because the x86 optimizer is the best one present in gcc,
so I can see how far behind I am. If I got code 3 times
larger I'd have to look for a bug.

To check my understanding: if I do
loadcons sym,r1
...
load.64 r1,r2

then the load has to check the TLB and so it will stall the
pipeline (because it could trap). So I have to add a prefetch
somewhere, am I right ? It will mark r1 as a pointer, which
tells me: r1 is TLB validated - so the OS TLB handler
should erase all these pointer flags when the TLB is changed (?).

The biggest problem I see is the pointer-reg<=>cache-line
tie. From my experience with gcc (and the general feeling of
the comments in its code) I see that it is not possible to
do complete alias analysis of C memory references.
Functions often return pointers derived from their
arguments and at many places you end up with three or
four pointers to similar memory locations (typically
the stack). There is no way to say that they can be combined
into a single register using the post-modify trick.
Also, post-modify of loads and stores ties them together
implicitly, so they can't be run in parallel on future
multi-issue F-CPUs.
The whole F-CPU has almost no kludges in its ISA but I feel
that the current memory subsys will be a bottleneck.

I'd rather see an indexed addressing mode. Imagine a +-8bit
immediate offset. Each register would have 8 flags
telling you whether it is safe (read: will not TLB trap)
to access memory addressed by the register after
adding/subtracting an offset having N nonzero LSB bits
(N is 1 to 8).
This way you can hide the 8 bit adder in the LSU pipeline, out
of the critical path of program control flow. Simply, when you
use a nonzero immediate offset, the load will have a larger
latency (by 1 cycle). For stores it is even hidden by the cache.

The 8 flags above would be filled at the same place as the
"pointer" flag today - computed when checking the TLB.
Normally they would all be 1 if the TLB entry is valid and the
pointer is not near the "edge" of a page, and the higher flags
would turn to zero as the pointer approaches an edge.
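
A minimal sketch in C of how such flags could be derived from
an address may make the idea concrete. Everything here is an
assumption for illustration only: the 4 KiB page size, the
name offset_safety_flags, the encoding (bit N-1 set means
"offsets with up to N nonzero LSB bits, added or subtracted,
stay inside the page"), and the fact that access width is
ignored. It is not taken from any F-CPU document.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u /* assumed page size, must be a power of two */

/* Hypothetical helper: returns 8 flags for a TLB-validated address.
 * Bit (n-1) is set when every offset of magnitude below 2^n, added
 * or subtracted, still lands in the same page as addr.
 * Access width is ignored in this sketch. */
static uint8_t offset_safety_flags(uint64_t addr)
{
    assert((PAGE_SIZE & (PAGE_SIZE - 1u)) == 0u);

    uint64_t in_page = addr & (PAGE_SIZE - 1u);
    uint64_t below = in_page;                  /* room towards page start */
    uint64_t above = PAGE_SIZE - 1u - in_page; /* room towards page end */
    uint8_t flags = 0;

    for (int n = 1; n <= 8; n++) {
        uint64_t reach = (1u << n) - 1u; /* largest offset with n LSB bits */
        if (below >= reach && above >= reach)
            flags |= (uint8_t)(1u << (n - 1));
    }
    return flags;
}
```

In this model a pointer in the middle of a page gets all
eight flags, and the higher flags drop first as the pointer
moves towards either page edge.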

This 8bit offset would be heavily used by structure
accesses, local frame variables, in-memory arguments...
For example, for a doubly linked list operation you
now need at least 3 pointers - pointers to "prev",
"next" and data (a reference count, for example).
All these point to similar locations. If you use
post-inc/dec you have to serialize them. With the
immediate offset addressing mode you can do it with
one pointer on a single cache line and DO IT IN PARALLEL.
See item deletion in a typical doubly linked list with a
destructor (the linux kernel uses MANY of them):
struct L {
 struct L *next,*prev;
 funct dtr;  // destructor fn pointer
};
// r1 is pointer to item to delete

1. Unlink operation with current F-CPU ISA:
loadi.64 $8,r1,r2 // ->next
// 2 stalls
loadi.64 $8,r1,r3 // ->prev
// 2 stalls
load.64 r1,r4     // ->dtr
maddi $8,r2,r5    // ptr to ->next->prev
store r3,r2       // new fwlink
store r5,r3       // new backlink
call r4

In the above you can't parallelize the loadi instructions
because they are interlocked. You can use different registers:

maddi $8,r1,r3
loadi.64 $16,r1,r2 // ->next
// 1 stall
load.64 r3,r3      // ->prev
load.64 r1,r4      // ->dtr
...

We saved 3 stalls but need a second pointer to the
same cache line; maddi & loadi can be parallelized.

With direct indexing:
load.64 r1,r2
loadoi.64 $8,r1,r3
loadoi.64 $16,r1,r4
point_1:
storeoi.64 $8,r2,r3
store.64 r3,r2
call r4

The above is the shortest code, has no stalls, and if
we can stuff/move other code to the point_1 label it
can run on a 3-issue F-CPU without RAW stalls.
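
For reference, here is a sketch of the C source that the
sequences above are meant to implement. The typedef for
funct, the names unlink_and_destroy, count_destroy and
demo_unlink, and the circular three-item list are my
assumptions for illustration; only struct L comes from the
example above. The three field reads use the same base
pointer with constant offsets, which is exactly the case
where an immediate offset mode needs no second pointer.

```c
#include <assert.h>
#include <stddef.h>

struct L;
typedef void (*funct)(struct L *); /* assumed destructor signature */

struct L {
    struct L *next, *prev;
    funct dtr; /* destructor fn pointer */
};

/* Unlink item from its list and call its destructor.
 * All three loads are independent reads off the same base. */
static void unlink_and_destroy(struct L *item)
{
    assert(item && item->next && item->prev && item->dtr);

    struct L *next = item->next;
    struct L *prev = item->prev;
    funct dtr = item->dtr;

    next->prev = prev; /* new backlink */
    prev->next = next; /* new forward link */
    dtr(item);
}

static int destroyed; /* counts destructor calls in the demo */

static void count_destroy(struct L *item)
{
    (void)item;
    destroyed++;
}

/* Hypothetical demo: circular list a <-> b <-> c, delete b.
 * Returns 1 when a and c end up linked to each other. */
static int demo_unlink(void)
{
    struct L a, b, c;

    a.next = &b; a.prev = &c; a.dtr = count_destroy;
    b.next = &c; b.prev = &a; b.dtr = count_destroy;
    c.next = &a; c.prev = &b; c.dtr = count_destroy;

    destroyed = 0;
    unlink_and_destroy(&b);
    return a.next == &c && c.prev == &a && destroyed == 1;
}
```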

Finally I'd replace post-inc/dec by madd/msub. From my
analysis (again gcc) post-modify was used mainly in function
prolog/epilog and sometimes in loops (circa 0.5% of code).
madd/msub can write the result to another register - more
flexibility.

Another interesting case is prolog/epilog saving without
SRB/msave/mload. Without immediate addressing you CAN'T
parallelize saves at all. Even if you use 2 pointers
on a 2-issue cpu:
storei.64 $16,r1,x
storei.64 $16,r2,x
// 2 cycle stall
storei.64 $16,r1,x
storei.64 $16,r2,x

You get stalls IMHO. With immediate addressing you will
be able to save 32 registers (all callee saved) with a
single pointer register ...
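
A rough back-of-envelope model of that difference, written as
C so the arithmetic can be checked. This is only a toy under
my own assumptions: post-modify stores issue in groups of one
store per pointer, each group is followed by a fixed stall
before the updated pointers can be reused, and with immediate
offsets the LSU is assumed to accept one store per issue slot
every cycle. The function names are made up for this sketch.

```c
#include <assert.h>

/* ceil(a / b) for positive integers */
static unsigned ceil_div(unsigned a, unsigned b)
{
    return (a + b - 1) / b;
}

/* Cycles to issue n_stores post-modify stores through n_pointers
 * pointer registers, with `stall` idle cycles between groups. */
static unsigned postmodify_save_cycles(unsigned n_stores,
                                       unsigned n_pointers,
                                       unsigned issue_width,
                                       unsigned stall)
{
    assert(n_stores > 0 && n_pointers > 0 && issue_width > 0);

    unsigned per_group = n_pointers < issue_width ? n_pointers
                                                  : issue_width;
    unsigned groups = ceil_div(n_stores, per_group);
    return groups + (groups - 1) * stall;
}

/* Cycles to issue n_stores stores that all use base + immediate,
 * so no store waits for a pointer update (toy assumption). */
static unsigned immediate_save_cycles(unsigned n_stores,
                                      unsigned issue_width)
{
    assert(n_stores > 0 && issue_width > 0);
    return ceil_div(n_stores, issue_width);
}
```

With the 4 stores, 2 pointers, 2-issue and 2-cycle stall of
the listing above, this toy model gives 4 cycles for
post-modify against 2 with immediate offsets.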

BTW, I read somewhere that a "load" of noncached data will
stall the load at the decoder !? Is it true ? I always thought
that it will "schedule" the load to the LSU and the LSU will
snoop on the decoder to find a cycle with a free write slot later.
If none is found it will stall the pipeline when the target
register is needed or the cache line is about to be flushed ...

devik
