[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [f-cpu] latest gcc & immediate addressing [Was: BOUNCE f-cpu@seul.org:...] (fwd)



On Tue, Jan 07, 2003 at 01:23:48PM +0100, devik wrote:
> Ehh I forgot .. I fwd my reply to group too as someone might
> like to comment something.
> ---------
> 
> > I finally got it up and running now, but the assembler still doesn't
> > grok the code, for several reasons:
> 
> hi, you seem to be pretty familiar with gcc internals, did you
> think about changing/fixing some things directly in a code ?

I wasn't sure whether your working version differs from the released
one, and merging any changes might have been difficult. And I really
prefer to hack my own code ;-)

> I'll try to address all of these but later when I get some more
> time.
> Just now only some comments/questions. PLEASE read and comment
> it. It would help to decide how to do some things.
> 
> >     - The ranges for immediate operands are wrong. umin/umax take
> >       unsigned bytes, smin/smax signed bytes, for example (btw:
> >       smin/smax are missing).
> 
> ahh yes - there was a lot of experiments with ranges and this code
> is not checked yet. And smin/smax was not there because I finshed
> that part before you told me that there will be signed versions ..
> 
> >     - For shift/rotate ops, the second operand should always have
> >       QImode.  That avoids an unnecessary explicit zero extension
> >       when the operand is not a full register.  Immediate operands
> >       are unsigned, BTW.
> 
> ok, one comment which is VERY important conceptional thing. Because
> gcc doesn't handle SIMD implicitly (yet) I use only non-simd (except
> for instrinstic funtions - not done yet). When we do some operation
> in SI mode we can rely on CPU to zero extend it to 64bits (for FC0).
> This holds for all insns I know about.

Unfortunately, this is not always true: the ROP2 (logical) operations
behave differently. They really suck when it comes to efficient non-SIMD
code generation (in particular, conditional branches); we should talk
about changing the ISA definition (without sacrificing the current
functionality, of course). E.g. we could define the following variants:

    // these are new
    <op>.<and|or>.<size>    // truncates result to chunk size
    <op>.<size>             // truncates result to chunk size
    mux.<size>              // truncates result to chunk size

    // these correspond to the current ones without `s' prefix
    s<op>.<and|or>.<size>   // always operates on full register (size used for combine)
    s<op>.<size>            // always operates on full register (size is ignored)
    smux.<size>             // always operates on full register (size is ignored)

Then your assumption would be valid again.

> It seems natural to use it and define zero_extend as no-op (I tested
> it using subregs). Other CPUs in gcc often define TRULY_NOOP_TRUNCATION
> to tell that we can truncate just by pretending register as already
> truncated - ie. accessing 64bit register with .16 insn truncates
> DI to HI.
> We can't use it on FCPU because of 2 reasons:
> 1) it would colide with our no-op extension, producing no code
>    for case { uint i; ushort s = (uchar)i; } where is should
>    produce either truncate or zextend

In that case, you MUST explicitly truncate `i' to 1 byte, probably
with `move.8 reg(i), reg(s)', whether or not TRULY_NOOP_TRUNCATION is
true.

> 2) it makes jmpz/nz problematic because it tests full 64bit
>    register - TRULY_NOOP_TRUNCATION leaves garbage in upper
>    bits and if((uchar)i) where i==256 jumps even if low 8bits
>    is zero. If we would want TRULY_NOOP_TRUNCATION we would
>    need to emit "and" before each such test.

I just wondered whether we can use `jmpl/nl' for conditional branches.
Compare and x[n]or.{and|or} operations will always set the LSB to 1 if
the result was true.

> If we will decide to use noop zero_extend then the code will
> be shorter and also there is no problem with rotation size above.
> 
> >     - logic operations take 9-bit signed immediate operands.
> 
> uhh .. my 2.7 manual doesn't state it :( Where can I find it ?

Dunno. I read it somewhere (maybe in the source code), and implemented
it in both the assembler and the emulator.

> >     - STACK_BOUNDARY must be at least 64 (fullword!). FUNCTION_BOUNDARY
> >       should be the size of a cache line (256 bits).  And even more
> >       important: STRICT_ALIGNMENT must be 1 (unaligned memory access
> >       is strictly forbidden).
> 
> good point - it is another thing I didn't find in manual. So
> that f-cpu needs natural align on all data ?

Yes, up to 8 bytes at least. It's not clear yet whether machines with
larger registers will have even stricter alignment constraints. For
the moment, we should concentrate on the 64-bit version.

> >     - We pass arguments in r1...r15, not r2...r15.  Leaving r1 unused
> >       is pointless.  On the other hand, we also pass arguments in
> >       registers when the function has variable arguments - currently,
> >       gcc puts the `fixed' arguments in registers and pushes the
> >       rest on the stack?!
> 
> then the manual pg 223 is wrong again. About variable arguments,
> yes they go to the stack just now - I have to look how can I force
> gcc to find them in registers. Again this convention seems not to be
> documented.

We talked about it here. In any case (varargs or not), arguments 1...15
are passed in r1...r15, the rest (if any) on the stack.

> >     - The function prologue is wrong. `storei $-8, r62, r61' uses
> >       *post*decrement, not predecrement.  I suggest you calculate
> >       the required amount of memory, subtract it from r62, and then
> >       use a temporary register with postincrement for storing
> >       registers and initializing locals.
> 
> I know. I simply assume that stack pointer points to next free
> slot instead of last used one. Then in prolog I can use post
> decrement and save several insn and additional register (which
> can save us troubles with register-cache aliasing).

We could declare that to be true. The caller would have to provide
the first empty register-save slot (and maybe even more stack memory).
There are also architectures where the caller reserves stack space for
parameters it passes in registers, for scratch space and similar things.
But I guess we'll have to play with different stack conventions for a
while before we can make a good decision.

> Epilog first substracts 8 from SP - it is ok because at end of
> fn there is often a lot of free scheduler slots where this
> substract insn fits at no expense (which is not case in prolog).
> 
> However I just realized that it is not interrupt/signal safe :-(
> I really have to decrement in prolog ...
> 
> >     - ASM_OUTPUT_REG_PUSH and ASM_OUTPUT_REG_POP are wrong. The
> >       former because it assumes predecrement again, the latter
> >       because the instruction is called `loadi', not `load'.
> 
> yes. I still didn't reallized what are these macros used for. I hope
> to be able to undef them completely.

The compiler uses them internally, IIRC.

> >     - There's a much easier way to specify the pattern for `mux'
> >       (note the constraints for operands 3 and 4):
> >
> >         (define_insn "*mux"
> >             [(set (match_operand 0 "register_operand" "=r,r")
> >                 (ior
> >                     (and
> >                         (match_operand 1 "register_operand" "r,r")
> >                         (match_operand 2 "register_operand" "r,r"))
> >                     (and
> >                         (not (match_operand 3 "register_operand" "1,2"))
> >                         (match_operand 4 "register_operand" "0,0"))))
> >      [....]
> >             "@
> >             mux %1,%2,%0
> >             mux %2,%1,%0"
> >             [(set_attr "type" "logic")])
> 
> Are you sure ? Constraints are tested AFTER matching. So that your
> version will probably took: a&b | c&~d
> as one which matches and the will try to use consraints to ensure
> correct register allocation. It will found that none matches and
> will abort. See this except fro manual:
> "More precisely, the two operands that match must include one input-only
>  operand and one output-only operand."
> So that you can use that 0,0 but not 1,2 constraint.

This kind of pattern is also used in other machine descriptions,
but it really seems to be wrong - if I try to compile something like
`a & b | ~c & d' with optimization, the reload pass fails :-(
Ok, forget it...

[...]
> >     - The `*loadcons' patterns look suspicious to me. Better drop them.
> >       In fact, better don't use loadcons at all, use loadconsx
> >       instead (and rely on the assembler to optimize the instruction
> >       sequence).
> 
> :) They are under developement still but the really describe what
> loadcons does !!
> They should be also able to translate code:
> a = b & 0xffffffff0000ffff | 0x45330000;
> to single loadcons.1.

Isn't that a rather rare case? It only seems suitable for access to
certain bitfields.

> Also we really need to tell gcc about all loadcons isns because splitter
> will need them to do correct scheduling. Loadcons are ideal "stuffers"
> for free schedule slots.

That's true. But then you also should tear apart any bigger
loadcons[x] sequence.

> >     - `loadaddri' (without `d') is correct for a LABEL_REF (while a
> >       SYMBOL_REF would require `loadaddrid'). But you got the offset
> >       wrong: it should read `loadaddri $label-.-4, reg'.
> 
> Ok about the offset. But SYMBOL_REF doesn't imply loadaddrid ! SYMBOL_REFs
> are used both for external data & code pointers and there is currently
> no way to distinguish them (unfortunately) ! Thus the warning comment in
> the asm. LABEL_REF is used ONLY for internal labels BUT also it can be
> used with loadaddrid sometimes ! (gcc allows you to take pointer to
> label, switch() statements needs jump table at label and gcc also labels
> private data tables.

The documentation seemed to be pretty clear that a LABEL_REF always
refers to code, while a SYMBOL_REF refers to data. It even mentions
that:

	"The reason for using a distinct expression type for code label
	references is so that jump optimization can distinguish them."

> Thus we will need to analyze insn dataflow to found which kind of
> loadaddr to use - and there are further possible cases where (G)CSE
> will try to use resulting pointer for both data & code access (it
> is legal to read from function pointer AFAIK).

*sigh*

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/