
Rep:Re: Re: [f-cpu] New suggestion about call convention



-----Original Message-----
From: whygee@club-internet.fr
To: f-cpu@seul.org
Date: 06/11/02
Subject: Re: Re: [f-cpu] New suggestion about call convention


hi,

>From: nico

>On Tue, 05 Nov 2002 00:50:27 +0100
>Yann Guidon wrote:

<snip intro> 
>> <snip examples>
>> 
>> >The problems of the first solution are :
>> >- complexity
>> >- popcount unit must not be optional
>> >- blocks the CPU for 3/4 cycles (before being sure that no TLB trap
>> >  happens)
>> >  
>> not only that, but :
>>  - instruction lifelength is not static ==> more difficult to decode
>>    and schedule
>
>??? I need to see a proof of that.

proof of what ?
 * the instruction lifelength is not static because
 the number of operations is indicated by a register,
 not a field in the opcode.
 * if the instruction is not equivalent to a static
 dataflow graph, then it is not possible to schedule it
 in FC0.

Now you can admit that it is "a bit more complex" than
a simple ADD or even the division unit (which is an
exception to the static scheduling because it has a
static datapath).
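
To make the first point concrete, here is a minimal sketch (Python, purely
illustrative). It assumes a hypothetical mask-driven store-multiple that
spends one cycle per selected register plus one setup cycle. The names
masked_store_latency and ADD_LATENCY and all cycle counts are assumptions
for illustration, not FC0 figures. The only point is that the second latency
cannot be known before the mask register has been read.

```python
# Illustrative model only: cycle counts are assumptions, not FC0 figures.

ADD_LATENCY = 1  # assumed fixed latency, known from the opcode alone


def popcount(mask: int) -> int:
    """Number of set bits in the register mask."""
    return bin(mask).count("1")


def masked_store_latency(mask: int, setup_cycles: int = 1) -> int:
    """Assumed cost: one setup cycle plus one cycle per selected register.

    The result depends on a run-time register value (the mask), so the
    decoder cannot derive it from the opcode.
    """
    return setup_cycles + popcount(mask)


if __name__ == "__main__":
    print("ADD:", ADD_LATENCY, "cycle(s), whatever the operands are")
    for mask in (0b0001, 0b1011, 0xFFFF):
        print(f"mask={mask:#06x}: {masked_store_latency(mask)} cycle(s)")
```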

>>>>Maybe with the vision of the scheduler that you have (with FIFO and
"simulation" of behavior), which needs to stop the pipeline in case of a
cache miss because the latency is too high for your NP algorithm..

>>>It's always possible to have a multicycle load/store unit that handles
that stuff.

>>  - instruction cannot be interrupted in the middle
>>      (IRQ/whatever) ==> IRQ response time is unpredictable :-(

>Like our /0 trick,

gni ?

>>>> "divide by zero". All instructions are in order at the beginning.
You check for the exception, and then you can finish the instruction. So
there is no write to cancel.

> the pipeline should check IRQ first.
FC0 doesn't "check IRQ".
The new instruction flow is inserted in the pipeline
whenever it is available and can be issued.
It can be ok to delay IRQ while an instruction
waits for the operands to be ready before it is
issued, but allowing more delay (particularly
when it could have been avoided with the use of
discrete instructions) reduces the system's
responsiveness. It may be completely off-topic
for an average desktop PC, but F-CPU is not
meant to be used only there.
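
A rough sketch of the responsiveness argument, assuming (hypothetically)
that the new instruction flow can only be inserted at an instruction
boundary, and that a blocking save of 16 registers holds the pipeline for
16 cycles. The function name irq_wait and the numbers are illustrative
assumptions.

```python
from itertools import accumulate


def irq_wait(durations, arrival):
    """Cycles between IRQ arrival and the next instruction boundary.

    durations: cycles each instruction occupies the issue slot (assumed).
    arrival:   cycle at which the IRQ is raised.
    """
    boundaries = [0] + list(accumulate(durations))
    for boundary in boundaries:
        if boundary >= arrival:
            return boundary - arrival
    raise ValueError("IRQ arrives after the end of the modelled stream")


if __name__ == "__main__":
    blocking = [16]      # one multicycle instruction saving 16 registers
    discrete = [1] * 16  # sixteen single-cycle stores
    print("blocking:", irq_wait(blocking, arrival=1), "cycles of delay")
    print("discrete:", irq_wait(discrete, arrival=1), "cycles of delay")
```

With these assumed figures the blocking form delays the IRQ by 15 cycles,
and the delay grows with the number of registers selected, while the
discrete form does not add any.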


> And then the following stay asynchronous.

?

do you mean that IRQ is blocked when the instruction
runs, or do you allow asynchronous IRQ in the middle
of the instruction ?...

>>>Neither. Exceptions are raised as early as possible; it's the first
job of the pipeline for every instruction (you could picture it as "putting
the exception check in the decoder stage"; that's not actually possible,
but it's an image)

>>  - it can't be pipelined (issued and then another instruction can be 
>> decoded)
>It could.
then tell me how.

>>>>:) Everything could be pipelined. Like divide,...

> Where is the problem ?
people have sexy ideas but no way to integrate
them in the existing framework.

Think about it : the existing FC0 pipeline
is designed in such a way that an instruction
implements a simple function : "add" is decoded,
operands are fetched, result is computed and
written back. THAT can be pipelined and it works well.

>>>That's RISC stuff, but F-cpu is quite far from that, no?

Now if an instruction must perform several steps,
it has to "stay" in the decoder, so that the steps

>>>No ! The load/store unit holds it.

can reuse the existing pipeline. This means that
the instruction is "blocking" because no other
instruction can start decoding. This is why it is
not "pipelinable" because even if the rest of the
data pipeline is used, the instruction fetch and decode
pipeline is stalled and no IRQ can be acknowledged.
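
The stall can be pictured with a toy model. It assumes that a simple
instruction holds the decoder for one cycle and that a multi-step
instruction holds it for one cycle per step. The name decode_entry_cycles
and the occupancy figures are hypothetical, not measured on FC0.

```python
def decode_entry_cycles(decode_occupancy):
    """Return the cycle at which each instruction enters the decoder.

    decode_occupancy[i] is the assumed number of cycles instruction i
    keeps the decoder busy (1 for a simple instruction, k for an
    instruction that must replay k steps through the pipeline).
    """
    entries = []
    cycle = 0
    for occupancy in decode_occupancy:
        entries.append(cycle)
        cycle += occupancy  # nothing else can be decoded meanwhile
    return entries


if __name__ == "__main__":
    print(decode_entry_cycles([1, 1, 1, 1]))  # simple instructions only
    print(decode_entry_cycles([1, 8, 1, 1]))  # second one has 8 steps
```

In the second case every instruction behind the 8-step one enters the
decoder 7 cycles later, even though the data pipeline itself stays busy.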

>>>There will always be the problem of read/write port contention. Things
get heavy, and not in a simple way. It's as hard as the SRB stuff.

> You have to play with a contention on the register bank.
i wouldn't call that "play"....

>>  - the read port is connected to the instruction buffer ==> it is not
>> possible to generate the sequence of registers to be saved. And even
>> a counter would not be ok (in order to generate the register numbers),
>> because the mask can have holes !
>
>You could mask holes. But then you lose cycles.
heh. that's what i meant.

> I'm pretty sure that a
>"sequencer generator" could be used.
a #what# ?...

and don't forget about the "tight pipeline stages" :
if the "solution" takes more than 1 cycle per register,
then it's worthless.
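
For reference, the sequence of register numbers for a mask with holes can
be sketched in software as a find-lowest-set-bit loop, which in hardware
would correspond to something like a priority encoder. This is only an
illustration: register_sequence and counter_steps are hypothetical names,
and nothing here shows that such logic fits in one FC0 pipeline stage.

```python
def register_sequence(mask: int):
    """Yield the register numbers selected by mask, lowest first.

    One register is produced per step, so holes in the mask cost nothing.
    """
    while mask:
        lowest = mask & -mask          # isolate the lowest set bit
        yield lowest.bit_length() - 1  # bit position = register number
        mask &= mask - 1               # clear that bit


def counter_steps(mask: int) -> int:
    """Steps a plain counter needs: it scans every position up to the
    highest set bit, holes included."""
    return mask.bit_length()


if __name__ == "__main__":
    mask = 0b10110010  # registers 1, 4, 5 and 7 selected
    regs = list(register_sequence(mask))
    print("registers:", regs)
    print("encoder steps:", len(regs), "counter steps:", counter_steps(mask))
```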

>>>For a very long time I have said that 6 gates per stage is far from
realistic, but...

>> >For the second solution :
>> >- complexity
>> >- popcount unit must not be optional
>> >- blocks the CPU for 3/4 cycles like the first solution, but you need
>> >to use this instruction more frequently than the previous solution;
>> >however, this solution gives you the possibility to skip a chunk if
>> >not needed.
>> same remarks as before.
>> it's a multicycle, CISC instruction with most of the problems.
>
>the biggest problem is the connection of the read/write port that
>annoys the instruction buffer, but that is the case with SRB, too.

remember that SRB is "optional" .....

>>>And remember that I didn't like that so many things could be optional !
Otherwise there will be no compatibility at all between F-CPUs.

>> >Sorry for the length, but I hope it is interesting,
>> > Cedric
>> >
>> well, at last it made it to this list.
>> 
>> YG 
>
>Maybe the idea of Michael is better (SW). It's okay if the linker could
>really do the job. Otherwise...

i would go for it. However, the loadm/storem have
some of the problems of the masked load/stores.

>>>>hmm, not my own version ! :) "m" could signify 4 or 8 registers at
a time, so you could read or write a complete cache line in one cycle
(so addresses are aligned, and even 1K-aligned to simplify things even
more) :) (it's really easy to make it 4x or 8x the width of a register,
and to select through a mux (for usual use) or not (for "m" operations)).
So you could manipulate R0-3/r4-7/r8-11/... in one cycle.
I love this because it's really simple to do in HW. It provides a huge
bandwidth from cache to CPU (hmm, let's see: 256*4=1024-bit interface,
nice :)

>>>>You don't increase true memory bandwidth, but it is easier to hide
latency. Imagine unrolling loops.
nicO
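
A small sketch of the grouped "m" access described above, assuming groups
of 4 registers aligned on multiples of 4 and the 256-bit register width
implied by the "256*4" figure. GROUP_SIZE, REGISTER_BITS and the function
names are assumptions for illustration.

```python
REGISTER_BITS = 256  # assumed width, taken from the "256*4" figure
GROUP_SIZE = 4       # assumed: "m" moves 4 registers at once


def group_registers(reg: int, group_size: int = GROUP_SIZE):
    """Registers moved together with `reg` by one aligned "m" access."""
    base = reg - reg % group_size  # align down to the start of the group
    return list(range(base, base + group_size))


def interface_bits(group_size: int = GROUP_SIZE) -> int:
    """Width of the cache-to-register path needed for one-cycle access."""
    return REGISTER_BITS * group_size


if __name__ == "__main__":
    print(group_registers(0))  # R0-3
    print(group_registers(5))  # r4-7, because groups are aligned
    print(interface_bits(), "bits")
```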

YG



*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/