[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [f-cpu] New suggestion about call convention

To: f-cpu@seul.org
Subject: Re: [f-cpu] New suggestion about call convention
From: Michael Riepe <michael@stud.uni-hannover.de>
Date: Wed, 6 Nov 2002 05:38:48 +0100
Delivered-To: archiver@seul.org
Delivered-To: f-cpu-outgoing@seul.org
Delivered-To: f-cpu@seul.org
Delivery-Date: Wed, 06 Nov 2002 08:20:22 -0500
In-Reply-To: <200211052229.00764.cedric.bail@free.fr>; from cedric on Tue, Nov 05, 2002 at 10:29:00PM +0000
References: <200211042333.19012.cedric.bail@free.fr> <20021105185014.20040@thrai.stud.uni-hannover.de> <200211052229.00764.cedric.bail@free.fr>
Reply-To: f-cpu@seul.org
Sender: owner-f-cpu@seul.org

On Tue, Nov 05, 2002 at 10:29:00PM +0000, cedric wrote:
> > First, a decent F-CPU compiler should not select registers randomly.
> > It should analyze every function and assign register numbers so that
> > chances for a collision are minimized.
> 
> It depend if you have knowledge about what the other module librairie do.
> You will certainly agree that if you don't have any knowledge about the 
> register usage, random is the best solution.

The goal is to have that knowledge available - either via calling
conventions, or via hints inside the object file, or via code analysis
if everything else fails. As a last resort, you can still save all
registers when you enter an unknown function.

> > That is, in a module containing several functions that call each other,
> > each function should use a different set of registers. Additionally,
> > functions can have both an "internal" entry point for intra-module calls
> > (which need not save any registers) and a "public" entry point that
> > saves all registers used inside it, prepares for restoring them, and
> > then dispatches to the internal entry point. Unless there are recursive
> > functions, you'll have to save registers only when you enter the module
> > (and restore them when you leave it).
> 
> It's a good idea for having quick call intra-module. The problem is in the 
> epilogue, how did you know from where you came ?

By changing the `return address' register, as indicated in the code
I included in my original mail.

> > Ideally, each function will use a contiguous set of registers, and there
> > will be an entry in the object file reporting which registers it uses.
> > That way it becomes possible to further minimize collisions at link time
> > by renumbering the registers inside a function - you may also call it
> > "deferred register allocation" if you like.
> 
> It's an interesting idea, but for each instruction in each function in each 
> module you need to search in a "database" which chunk to replace and then 
> replace it with a mask. But before taking your decision you need to execute a 
> allocation algorithm. Currently, with only relocation on x86, a lot of people 
> are thinking that time needed by linking is slow...

You mean run-time (dynamic) linking? We're going to do it at compile
time; there will be no overhead when the program starts.

You don't need a complicated database look-up. You just do the
following:

	get the instruction word
	for all register fields do
		if the register number has to be replaced
			replace it with the new register number

You can use a table-based translation (64 table entries) or an algorithmic
approach (add a constant to the register number if it's inside the range
that is supposed to be `pitch-shifted'). The number of register fields is
available from the opcode (another table with 256 entries). Nothing more,
nothing less. In either case, transposing an instruction will take only
a handful of CPU cycles.

The allocation algorithm isn't complicated either (I outlined it in my
other mail).

>   But the idea is good, what did you think about a double solution : for .o 
> (not librairie), you use your technique and for the rest the technique 
> suggested by Arnaud.

Inside a shared library or a program, reallocation can be used as well.
It just won't work at the interface between programs and shared libraries.

>   In the prologue you use the mask technique and put a second entry point for 
> optimal call. (I don't see why you need to have contiguous register, you only 
> need a 64 bits mask per function, so 2 techniques can be used together).

Contiguous register numbers make the allocation and transposition easier,
expecially when register pairs are involved. And it's what compilers
usually do: start with the first, and always pick the next available
register.

> > 	new_entry_point:
> > 		// allocate stack space here
> > 		storem xyz
> > 		move r63, saved_reg	// save return address
> > 		loadaddri old_entry_point, temp_reg
> > 		jump temp_reg, r63	// call original function
> > 	restore_code:
> > 		move saved_reg, r63	// restore return address
> > 		loadm xyz
> > 		// deallocate stack space here
> > 		jump r63
> 
> Problem with this is that you have a double jump each time you call a function 
> outside of your module. Why didn't you use something like this :

As I already mentioned, that piece of code can be improved :)

On the other hand, the `double jump' allows us to put multiple wrappers
around the same function without touching the function itself.

[...]
> With this you have both possibility. In fact you don't need to add a jump for 
> entry point with your method too, because you can add the 2 entries points
> when you compile.

The link time register reallocation relies on the fact that any number
of wrappers can be added to a function. That has to be done at link time
because the compiler doesn't know how many wrappers it has to provide.

> A possibility to improve your idea is to specify a other 
> registre Quick Entry, if qe == r0 => don't restore else restore so the 
> epilogue/prologue will look like :
> 
> low_entry_point:
> 	loadcons.1 0xFFFF, r0
> 	loadcons.2 0xFFFF, r0
> 	loadcons.3 0xFFFF, r0
> 	and r0, mr, r1
> 	maskstore r1, [sp]
> quick_entry_point:
> 	// My function
> 	jmpz qe, r63
> restore_code:
> 	loadcons.1 0xFFFF, r0
> 	loadcons.2 0xFFFF, r0
> 	loadcons.3 0xFFFF, r0
> 	and r0, mr, r1
> 	maskload r1, [sp]
> 	jmp r63

Nice idea, but it still allows only a single wrapper per function.

BTW: Your restore code is broken; you'll have to use the modified mask
when you restore the registers, and then put the original mask back in
place.

> > The same algorithm works with both functions and modules, and it only
> > needs the set of used registers per function/module and a dependency
> > graph. Both can be easily constructed from intermediate code. Register
> > renumbering is done on the object code directly (thanks for the uniform
> > register model and instruction format of the F-CPU which allow us to
> > `transpose' the object code -- note to guitar players: think `capo' ;-).
> 
> I don't understood you here, we have a lot of different instruction : 2r1w, 
> 1i1w, 1i1r, 1i1r1w, 3r2w... For each you have a special case.

But the register numbers are always encoded in the same 1...3 fields
of the instruction word - and that's all that matters. We don't have to
touch the instructions itself, only the register numbers.

> > Conclusion: Using a smart linker, we can resolve register collision
> > issues statically. We can also take profiling (feedback) information
> > into account, like you suggested, to optimize the number and placement of
> > save/restore points. Therefore, I see no real need to do it at run time.
> 
> The problem is doing a optimising linker will really improve the code, but 
> will not work for shared code (librairie are loaded only once) and will 
> certainly be really slow for big code... The mr is not as optimal as your 
> proposition but didn't cost so much each time you run a application.

Au contraire, mon ami! (I always wanted to say that ;)

Link time register allocation is done *once* for each program - before
it is run. At run time, you'll only waste cycles when you have to
save/restore registers. The `mask' approach also needs some cycles for
bookkeeping (that is, the mask operations), which makes it slower than
link time allocation in the worst case.

One might consider the `mask' approach as an alternative for the
program/library interface (after all, that looks like the weak point
of my approach). But there is a good argument against it: A reasonably
complex program will use *all* registers internally. That is, you always
have to save registers when you enter a shared library - with either
approach. Both link time allocation (when the library is built) and
register masking (at run time) help to reduce the number of registers
you have to save, but link time allocation is superior because it has
less run time overhead.

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/

Follow-Ups:
- Re: [f-cpu] New suggestion about call convention
  - From: cedric <cedric.bail@free.fr>

References:
- [f-cpu] New suggestion about call convention
  - From: cedric <cedric.bail@free.fr>
- Re: [f-cpu] New suggestion about call convention
  - From: Michael Riepe <michael@stud.uni-hannover.de>
- Re: [f-cpu] New suggestion about call convention
  - From: cedric <cedric.bail@free.fr>

Prev by Date: Rep:Re: Re: [f-cpu] New suggestion about call convention
Next by Date: Re: [f-cpu] New suggestion about call convention
Prev by thread: Re: [f-cpu] New suggestion about call convention
Next by thread: Re: [f-cpu] New suggestion about call convention
Index(es):
- Date
- Thread