[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [f-cpu] new files



hello,

Juergen Goeritz wrote:
> On Sat, 20 Apr 2002, Yann Guidon wrote:
> >Juergen Goeritz wrote:
> >> have been reading through dct and the proposed optimization
> >> techniques. Sounds very interesting to me. From that 'small'
> >> example already the benefits are visible. But do you really
> >> think this scheme can also be used on a global level?
> >manually, by hand and with a single brain, obviously not.
> >However i believe that the method can be automated and the user
> >should be able to look at the esulting work.
> Yup.

usually, you can't automate what you can't do by hand.
some people might misunderstand my intentions ("oh gosh, not in asm !")
but practice is a good way to learn how to react in complex situations
and then teach computers how to deal with it. I might be reinventing
the wheel but at least i know how and why it spins :-)

> >> It looks to me like a try and error algorithm regarding the
> >> 64 register boundary - the long it fits okay - but what do
> >> you do when e.g. you start with much more regs required and
> >> the algorithm can't manage to fit into 64, what is your
> >> proposal then (maybe I overread something?) ?
> >
> >If there is some "spilling", the global dataflow graph must
> >be analysed : the longest "branches" (data that is not modified
> >or used during a long time) must be stored to memory.
> >For example if you have a graph "width" of 100 data and
> >you have only 63 usable registers, then some of them must be
> >used to point at the data we'll "spill" and the remaining
> >registers (let's say 55) are allocated to contain the 55 shortest
> >"branches" of the graph. Once you have determined the spilling
> >sequence, you can compute the post-increments for the load/stores.
> >The temporary (spilled) variables should not be arbitrarily
> >allocated, a global analysis must also be performed to reduce
> >the length of the strides (there is only a +256/-256 limit
> >for the immediate post-incremented load/store so if you spill
> >more than 256 or 512 bytes, you might run into problems
> >that a global algorithm would avoid).
> 
> The hint flag for the data cache on write would be used again :)

if you spill more registers than L1 can contain, you have to add another
level of optimisation, but i don't see a working set that would need that.
At that point, it is better to perform the simple graph analysis by taking
that limit into account.

well, i might be the only one who understands what i write, but give me some
time to wake up :-)

> >i presume it's not pretty clear, but i'll maybe write another
> >paper about that. I have a few ideas in mind for the subject.
> 
> Yes good idea. Please read my other mail about "fitter" design
> in the 'Winograd DCT' line.

there's an interesting idea here.

> >> The point is, how should the compiler do the optimization
> >> and resolving in case of 'fit' and 'no-fit'.
> >i guess that it is compiler-dependent : each one can have
> >a specific strategy to allow problems to appear or not...
> 
> Yes, but that wasn't my point. Maybe you could address this
> overload resolution in your further maybe-paper, do you?

my next paper is about the 8*8 DCT, so there is obviously a
register set "overflow" : there are 64 registers in output and 64 registers
in output, plus the 4 "twiddle" constants, and the many temporary variables...

the first part will show the "classic" approach and i'll try to go beyond that.

> >> And second,
> >> maybe even more important, how do you create the latency
> >> table for compiler input and keep it consistent with the
> >> advancing versions of f-cpu???
> >
> >there are currently architecture-specific flags in most compilers.
> >This flag would change the latency table. i don't see black
> >magic there.
> 
> And ther are really any compilers counting cycles in
> a significant amount of opcodes for every new version?

actually, i don't think that they "count" cycles, rather they can
try to build a "suitable" codee "by construction". they don't
"count" after the code is made, but rather they use the latency
informations during the build phase.

> JG
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/