
Re: [f-cpu] New suggestion about call convention

Antoine wrote:

> Just my two cents... About the idea that register optimization
> could be done at linking time. I don't know if F-CPU is meant
> to work as an isolated architecture, with only dedicated tools.
> But if you want to benefit from existing software, you must think
> in terms of compatibility with the Gnu tools.
> That means :
> - compiler capabilities

Can be added. Hey, it's Free Software, isn't it?

> - linker capabilities

Can be added as well (although I would prefer a specialized F-CPU
linker). Note that GNU ld is not required on all operating systems;
gcc is usually capable of using the native linker (and assembler) of
the OS it runs on.

> - object file format (ELF, etc.)

ELF is no problem at all. It was designed for this kind of thing.
Other executable formats are rare nowadays (and most of them aren't
64-bit capable anyway, while ELF is).

> I'd bet (but I may be wrong ;-)) that none of those are ready
> for such a thing as link-time or load-time register reallocation.
> I'd also bet that if nobody does it at the moment, maybe it's
> because there are serious problems with it (despite the fact that
> compilation time is cheap so there is much incentive to optimize
> as far as possible in this phase). For once the optimization
> problem itself doesn't seem very simple, especially when programs
> get big.

It shouldn't be more complex than the register allocation pass inside the
compiler. It's just on a higher level - replace `instruction' with
`function', and you've got it.

> An example is the mysql daemon (which is a respectable database,
> but not a big and featureful monster as Postgres or Oracle).
> [root@fsol antoine]# strip -s /usr/sbin/mysqld 
> [root@fsol antoine]# ls -la /usr/sbin/mysqld 
> -rwxr-xr-x    1 root     root      1957780 nov  5 21:38 /usr/sbin/mysqld*
> This is quite a big executable indeed. Of course there is not only
> code. Given that stuff other than code (strings, empty data...) is
> likely to be highly compressable, we can still estimate an order of
> magnitude of the actual code :

The command you are looking for is called `size', BTW :)

> [antoine@fsol antoine]$ gzip -c /usr/sbin/mysqld > mysqld
> [antoine@fsol antoine]$ ls -la mysqld 
> -rw-rw-r--    1 antoine  antoine    871030 nov  5 21:48 mysqld
> So we can, quite safely, assume that the mysqld code (again, not
> that a big database program compared to others) is several hundreds
> of kilobytes big (and this would probably be worse on a RISC
> machine, with lots of small instructions using lots of registers !).


> I don't think it's reasonable to expect processing some global
> register allocation (how do you do that ? ;-)) on such a big
> piece of code. Also, keep in mind that registers may have a
> special role and not be easily exchangeable (not really the case
> in F-CPU, except for instructions that also write to the register 
> next to the output register).

Not global register allocation; that's a different issue. We're
talking only about the `callee-saved' registers. And register pairs
should be easy to handle if the register sets are contiguous.

> Another strong argument is that you can't even be sure that there
> is a good solution to the optimization problem (even if you have
> invented an outstanding algorithm that makes you sure to find the
> optimum, and maybe it also helps you demonstrate that P==NP at the
> same time :-)). The optimum may be not much better than the common
> case.

I do not expect to get optimal results. The `register mask' approach
won't deliver optimal performance either. It may even perform worse than
a clever static allocation.

I just want to eliminate save/restore points where they're not necessary;
that's better than a `smart' save/restore, IMHO. The algorithm I have in
mind will simply process the call tree bottom-up (recursive functions will
be handled separately). That can be done in linear time - each function
is processed at most once. Unused functions need no processing at all; in
fact, they may be removed during the process (another win). Self-contained
(`leaf') functions need no modifications either. Transposing a function
also requires linear time: each instruction is processed exactly once.
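The bottom-up pass can be sketched in a few lines of C. This is only a
toy model under stated assumptions - the names (`struct func',
`allocate') are hypothetical, each function has at most one callee, and
a real linker would of course work on relocation data rather than on
such a structure. It assumes 32 allocatable registers and contiguous
per-function register ranges, as discussed above:

```c
/* Minimal sketch of the `lazy save' pass: process callees first,
   then shift (`transpose') the caller above the register range its
   callees occupy.  Hypothetical toy model, not a real linker pass. */

#include <assert.h>
#include <stddef.h>

#define NREGS 32

struct func {
    const char *name;
    int nregs;            /* registers this function's own code needs   */
    int base;             /* first register assigned by the pass        */
    int total;            /* registers used by this function + callees  */
    int saves;            /* registers a wrapper must save (0 if none)  */
    struct func *callee;  /* single callee, for simplicity              */
};

static void allocate(struct func *f)
{
    if (f->callee == NULL) {            /* leaf: stays unchanged        */
        f->base  = 0;
        f->total = f->nregs;
        return;
    }
    allocate(f->callee);                /* bottom-up: callee first      */
    if (f->callee->total + f->nregs <= NREGS) {
        /* enough room: caller lives above the callee, no saves needed */
        f->base  = f->callee->total;
        f->total = f->callee->total + f->nregs;
        f->saves = 0;
    } else {
        /* overflow: wrap the callee in a save/restore stub and reuse
           the registers that the caller clobbers */
        f->saves = f->callee->total + f->nregs - NREGS;
        f->base  = NREGS - f->nregs;
        f->total = NREGS;
    }
}
```

Each function is visited exactly once, so the pass is linear in the
number of functions, as claimed.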

In addition to that, encapsulated code modules (e.g. object files) can
be preprocessed to save compilation time. Shared libraries need special
handling because they can be replaced after the programs using them have
been compiled. This is the only place where the calling conventions have
to be strictly enforced.

> For example, if you have three functions (A, B, C) that each
> allocate 32 different registers. B calls C. A calls in turn :
> C then B. So sometimes C is called from B which is called from
> A, sometimes C is called directly from A. How do you think the
> optimal solution will look like, and will that optimality be
> satisfactory ?

In that case, you'll have to save registers at each call (assuming that
you have no more than 32 registers available). Everything else requires
a more detailed analysis of the code.

> (if you optimize C for being called by B, it won't be optimized
> for being called by A, and vice-versa ; of course, you can 
> duplicate C in two versions but beware of cache thrashing, given
> that this is a very simple example)

Let's take slightly different numbers: 16 registers per function, 32
registers available. The execution sequence shall look like this:

	A -> C => A -> B -> C => B => A

where `->' indicates a call, and `=>' indicates a return. When you
process the call tree (remember: bottom-up), you will first encounter the
`B -> C' call. Let's assume that C uses the first half of the available
registers; then B can use the second half, and the whole `B+C' code uses
all available registers. Only the code of B will have to be `transposed';
C stays unchanged (it's a `leaf' function). No registers need to be saved.
No function is duplicated (we're done with B, and it will stay that way
for the rest of the time).

Now look at A. It calls both B+C and C. B+C uses the whole register set,
while C uses only the first half. You have several options now; an obvious
approach is to save the second half when you enter B+C, and allocate
the same second half to A. Note that B+C doesn't need to be modified -
the linker will simply put a save/restore `wrapper' around it and use its
entry point instead of the real one. That is, another function D may still
call B without saving and restoring registers, using the original entry
point, or use another wrapper if it has different requirements. Finally,
A is transposed and we're done.

Ok, let's count the save/restore points. 16 registers have to be saved
when A calls B, and restored afterwards - and that's all. With your
masking approach, there would be `smart' save/restore points at the
entries of both B and C; assuming that 50% of the allocated registers
need saving, 24 registers would have been saved during the execution of
A. That doesn't sound too bad; but what happens if B calls C repeatedly,
in a loop? In that case, you'll have to save the same registers again
and again, and the number becomes MUCH higher, while it stays the same
with static allocation.

I have to admit that static (link-time) allocation and register masking
together will probably do even better. It also depends very much on the
code - the linker may have to take loops etc. into account when it selects
the save/restore points. But in general, the `lazy save' algorithm
I outlined above should work pretty well.

Now YG will probably ask again where I get `those crazy ideas' from...
It's basically the same `register windowing' approach that SPARC and
Itanium CPUs use - only in software.

 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"