
[f-cpu] Stack Frame Layout and Varargs Handling



It's time to specify argument passing and the like in more detail...

To summarize: r1...r15 hold the first 15 arguments; the rest are pushed on
the stack (downward), with the stack pointer (r62) pointing to argument
#16 when the function is entered.  Note that r1 need not hold the real
first argument but may hold a `hidden' argument (a pointer to a buffer
that is used to return a struct, for example), with the real first
argument following in r2.

Values (including structs and unions) that fit in 8 bytes shall be
passed in a register.  Larger data that is not declared `const' shall
be copied to a temporary location in memory, and the address of the copy
shall be passed to the function.  Large data that is declared `const' in
the function prototype shall be passed by reference, to avoid copying.
Arguments passed on the stack shall always occupy 8 bytes, to preserve
alignment of the stack pointer.
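
As a rough illustration of the placement rule, here is a small C sketch
that maps an argument slot number to its location at function entry.
The names `arg_slot' and `arg_slot_for' are made up for this example.
It also assumes that argument #17 sits 8 bytes above argument #16, which
matches an argument array that is walked with an ascending pointer.

```c
#include <assert.h>

/* Hypothetical helper, not part of the proposal: describes where the
 * n-th argument slot lives when the callee is entered.  Slots are
 * counted from 1, and a hidden struct-return pointer counts as slot 1. */
struct arg_slot {
    int in_register;   /* 1: passed in a register, 0: passed on the stack */
    int reg;           /* register number (1..15) if in_register is set */
    long sp_offset;    /* byte offset from r62 at entry otherwise */
};

static struct arg_slot arg_slot_for(int n)
{
    struct arg_slot s = {0, 0, 0};

    assert(n >= 1);
    if (n <= 15) {
        s.in_register = 1;
        s.reg = n;
    } else {
        /* r62 points at argument #16; every stack slot is 8 bytes.
         * Assumption: later arguments sit at higher addresses. */
        s.sp_offset = 8L * (n - 16);
    }
    return s;
}
```

For instance, slot 15 lands in r15, slot 16 at offset 0 from r62, and
slot 18 at offset 16.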

Return values are handled similarly:  If they fit into 8 bytes, they
shall be returned in r1.  Larger structs and unions shall be returned
in a temporary memory location that is allocated by the caller, and
passed by reference in a hidden argument (that is, in r1).  On return,
r1 shall hold the address of the result.
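
To make the hidden argument more concrete, the following sketch shows
how a function returning a large struct might look after lowering,
written out as plain C.  `struct big' and `make_big_lowered' are
invented for illustration; a real compiler would do this rewrite
internally.

```c
#include <assert.h>
#include <stddef.h>

struct big { long long a, b, c; };   /* 24 bytes: does not fit in 8 */

/* Source-level view:   struct big make_big(long long x);
 *
 * Lowered view under the convention described above: the caller
 * allocates the buffer and passes its address as a hidden first
 * argument (r1), so the real first argument moves to r2.  On return,
 * r1 again holds the address of the result. */
static struct big *make_big_lowered(struct big *result, long long x)
{
    assert(result != NULL);
    result->a = x;
    result->b = 2 * x;
    result->c = 3 * x;
    return result;
}
```

The caller would typically point `result' at a temporary in its own
frame and copy from there if needed.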

A `varargs' function (one that accepts a variable number of arguments)
which has fewer than 15 `fixed' (that is, explicitly named) arguments
shall push all unnamed arguments that were passed in registers to
the stack before it calls the va_start() macro for the first time.
The additional pushed arguments shall form a contiguous array with the
arguments pushed by the caller, if any.  E.g. the function

    #include <stdarg.h>
    #include <stddef.h>     /* size_t */

    int
    snprintf(char *str, size_t size, const char *format, ...) {
        va_list ap;
        ...
        va_start(ap, format);
        ...
        va_end(ap);
    }

shall push r4...r15 to the stack before it executes va_start().
`ap' will most likely be a pointer to the argument array on the stack.
When the function returns, it shall deallocate the memory area used for
unnamed arguments, discarding their contents (ideally, it will be part
of the function's stack frame).  Other arguments on the stack shall be
deallocated by the caller (see the discussion of the `scratch space'
below for a simple solution).

Note that the function need not save the `unnamed argument' registers
immediately when it is entered.  It can also reserve space for them
and store them later, before it calls va_start() for the first time.
If it doesn't call va_start() at all, it need not save anything.
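
The bookkeeping a compiler needs for this is small.  A minimal sketch,
with made-up helper names, assuming that `named_slots' counts every
named argument slot (including a hidden struct-return pointer, if any):

```c
#include <assert.h>

#define ARG_REGS 15   /* r1...r15 carry arguments */

/* Hypothetical helper: number of the first register that may hold an
 * unnamed argument, or 0 if all unnamed arguments are on the stack. */
static int first_unnamed_reg(int named_slots)
{
    assert(named_slots >= 0);
    return named_slots < ARG_REGS ? named_slots + 1 : 0;
}

/* Hypothetical helper: how many registers the callee has to push
 * before its first va_start(). */
static int unnamed_regs_to_save(int named_slots)
{
    assert(named_slots >= 0);
    return named_slots < ARG_REGS ? ARG_REGS - named_slots : 0;
}
```

For a function with 3 named arguments this yields r4 as the first
register to save and 12 registers (96 bytes) in total.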

Remember that functions only have to obey these rules if (a) they have
external linkage or (b) their address is passed around as a pointer.
Private functions (those that can only be called from within the same
source file) may use any calling convention that seems suitable - if
they aren't inlined anyway.

=== Implementation Example ===

The stack frame of a function may look as follows (shown from highest
address to lowest):

        // caller-pushed args go here
    stack_pointer_at_function_entry:
        .space 8 * number_of_unnamed_args_saved // for varargs functions only
    address_of_varargs_array:
        .space 8 * number_of_callee_saved_regs  // including frame pointer, if needed
    frame_pointer_after_prologue:               // if needed
        .space space_for_local_variables        // if needed
        // dynamically allocated locals go here (for alloca())
        .space number_of_scratch_bytes          // if needed
    stack_pointer_after_prologue:
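
The same layout can be written down as offsets from the stack pointer
at function entry.  This is only a sketch: the struct and function
names are invented, and it ignores anything allocated by alloca().

```c
#include <assert.h>

/* Byte offsets relative to the stack pointer at function entry.
 * All are zero or negative, since the stack grows downward. */
struct frame_offsets {
    long varargs_array;   /* lowest address of the saved unnamed args */
    long frame_pointer;   /* lowest address of the callee-saved area */
    long stack_pointer;   /* bottom of the scratch space, before alloca() */
};

/* Hypothetical helper.  callee_saved counts every 8-byte slot in the
 * save area, including the frame pointer if it is saved there. */
static struct frame_offsets
frame_offsets_for(int unnamed_saved, int callee_saved, long locals, long scratch)
{
    struct frame_offsets f;

    assert(unnamed_saved >= 0 && callee_saved >= 0);
    assert(locals >= 0 && scratch >= 0);
    f.varargs_array = -8L * unnamed_saved;
    f.frame_pointer = f.varargs_array - 8L * callee_saved;
    f.stack_pointer = f.frame_pointer - locals - scratch;
    return f;
}
```

With 12 saved unnamed arguments, 5 save slots, 64 bytes of locals and
32 bytes of scratch, the offsets come out as -96, -136 and -232.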

The `scratch space' provides space for arguments to other functions,
return value buffers and the like.  It should be large enough for
all possible calls, so that the stack pointer need not move during
execution of the function unless alloca() is called.  alloca() may then
be implemented as

    alloca:
        addi $7,r1,r1   // align
        andni $7,r1,r1
        sub r1,r62,r62  // allocate
        addi $number_of_scratch_bytes,r62,r1    // return value
        jmp r63

(it may also be inlined, of course).  Note that the contents of the
scratch space are clobbered by the call - but that will be true for
all functions, since they may modify their arguments.
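
For readers who prefer C to assembly, the arithmetic of the alloca
routine above can be modelled like this.  `model_alloca' is a made-up
name, and the stack pointer is simulated with a plain integer.

```c
#include <assert.h>
#include <stdint.h>

/* Models the alloca sequence: round the size up to a multiple of 8,
 * move the (simulated) stack pointer down by that amount, and return
 * the address just above the scratch space. */
static uint64_t model_alloca(uint64_t *sp, uint64_t size, uint64_t scratch_bytes)
{
    assert(sp != 0);
    size = (size + 7) & ~(uint64_t)7;   /* addi $7 / andni $7: align */
    *sp -= size;                        /* sub: allocate */
    return *sp + scratch_bytes;         /* addi: return value */
}
```

The scratch space thus stays at the bottom of the frame, and the new
block sits directly above it.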

A typical, simple function prologue may look like this:

        subi $8*number_of_unnamed_args_saved+8,r62,r62  // at most 68 bytes
        storei $-8,r62,r61      // save frame pointer
        // save callee-saved register(s)
        ...
        storei $-8,r62,r34
        storei $-8,r62,r33
        storei $-8,r62,r32
        // save return address and allocate space for locals
        loadconsx $-(space_for_local_variables+number_of_scratch_bytes),r16
        move r62,r61            // set new frame pointer
        store r16,r62,r63       // use storei if constant is small enough

and the matching epilogue would be:

        move r61,r62            // deallocate locals and scratch
        loadi $8,r62,r63        // restore return address
        loadi $8,r62,r32        // restore callee-saved register(s)
        loadi $8,r62,r33
        loadi $8,r62,r34
        ...
        // restore frame pointer and return
        loadi $8*number_of_unnamed_args_saved+8,r62,r61
        jmp r63

This is suboptimal, however.  There are read-after-write dependencies
between the individual loadi (or storei) instructions which will cause
the CPU to stall for 2 cycles (due to the repeated use of r62).  It's
faster to use three pointers in a round-robin fashion:

        // arrange pointers so that temporary pointers are used first
        subi $8*number_of_unnamed_args_saved+8,r62,r16
        subi $8*number_of_unnamed_args_saved+16,r62,r17
        subi $8*number_of_unnamed_args_saved+24,r62,r62
        storei $-24,r16,r61
        storei $-24,r17,r34
        storei $-24,r62,r33
        storei $-24,r16,r32
        loadconsx $space_for_local_variables+number_of_scratch_bytes,r18
        move r17,r61
        store r17,r63
        sub r18,r17,r62         // use subi if constant is small enough

        // function code goes here

        // arrange pointers so that r62 is used in last loadi
        move r61,r17
        addi $8,r61,r62
        addi $16,r61,r16
        loadi $24,r17,r63
        loadi $24,r62,r32
        loadi $24,r16,r33
        loadi $24,r17,r34
        loadi $8*number_of_unnamed_args_saved+8,r62,r61
        jmp r63

Of course this kind of code is harder to generate.  But it's worth the
effort if there are several registers to save: you need 4...5 additional
instructions, but you'll also save 4 cycles per register that is saved
and restored (provided that the save/restore area is prefetched).

Varargs registers can be stored in a similar fashion.  For `snprintf'
above, which has 3 named parameters, the code may look like this:

        addi $8*number_of_callee_saved_regs,r61,r16
        addi $8*number_of_callee_saved_regs+8,r61,r17
        addi $8*number_of_callee_saved_regs+16,r61,r18
        storei $24,r16,r4
        storei $24,r17,r5
        storei $24,r18,r6
        storei $24,r16,r7
        storei $24,r17,r8
        storei $24,r18,r9
        storei $24,r16,r10
        storei $24,r17,r11
        storei $24,r18,r12
        store r16,r13
        store r17,r14
        store r18,r15

va_start() will be a macro or inline function that calculates the value of
`r61 + 8 * number_of_callee_saved_regs' and stores it in `ap'.  The rest
of <stdarg.h> can be implemented as follows:

    typedef void **va_list;

    #define va_arg(ap, type)  (*(type*)(sizeof(type) <= 8 ? (ap)++ : *(ap)++))
    #define va_copy(dst, src) ((void)((dst) = (src)))
    #define va_end(ap)        ((void)0)

(or similar).
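
To see how such a va_arg walks the argument array, here is a small
host-side sketch.  The macro is an adapted copy with an `sk_' prefix
(hypothetical, to stay clear of the host's own <stdarg.h>), and the
array is built by hand instead of by a prologue.  It assumes a 64-bit
host where a pointer fills one 8-byte slot.

```c
#include <assert.h>

typedef void **sk_va_list;

/* Small types are read from the slot itself; larger ones through the
 * address stored in the slot (they were passed by reference). */
#define sk_va_arg(ap, type) \
    (*(type *)(sizeof(type) <= 8 ? (void *)(ap)++ : *(ap)++))

/* One 8-byte argument slot as it would sit in the saved array. */
union sk_slot {
    long long i;   /* integer passed by value */
    void *p;       /* address of an argument larger than 8 bytes */
};

struct sk_pair { long long x, y; };   /* 16 bytes: passed by reference */

/* Walks a two-slot array: first a long long, then a struct sk_pair. */
static long long sk_sum_args(union sk_slot *slots)
{
    sk_va_list ap = (sk_va_list)slots;
    long long first;
    struct sk_pair pair;

    assert(sizeof(union sk_slot) == 8);   /* sketch assumes a 64-bit host */
    first = sk_va_arg(ap, long long);
    pair = sk_va_arg(ap, struct sk_pair);
    return first + pair.x + pair.y;
}
```

Each call advances `ap' by exactly one slot, whether the argument was
passed by value or by reference.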

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"