
Re: [f-cpu] new cjump instruction



On Tue, Apr 15, 2003 at 03:17:45AM +0200, Yann Guidon wrote:
[...]
> >The fetcher could increment the address as soon as it requests a cache
> >line (in parallel, just as a postincremented load would do).  Maybe even
> >earlier if we use a `tandem register' (that is, double buffering).
> >  
> >
> well, too much prefetch can overload the memory system
> and increase the real execution time.
> But of course, since the sources are "free", anybody can "play"
> with the Fetcher's strategy.

How can one line be too much?

[...]
> >Yep.  But I'd rather present the whole cache line to the decoder, and
> >let it send back a `next line' bit.
> >
> hmmm ... no.
> even a cache line is a buffer that must be carefully handled.
> and a decoder decodes, a fetcher fetches.
> At one point, there must be the "current instruction" that is being 
> decoded :
> this is where there is the limit between "fetch" and "decode".

Since there is no PC register, why should there be a current
instruction? ;)

IMHO, the fetcher ends at the line buffers.  If a buffer is filled, the
line is "fetched", and everything else is decoding.  Whether instructions
are decoded one by one or in parallel is not a fetcher but a decoder
issue.
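The boundary I have in mind can be sketched in a few lines. This is only an illustration of the handoff, not the FC0 implementation; the line size and the buffer list are made up for the example.

```python
LINE_WORDS = 8  # hypothetical: 8 instructions per cache line

class Fetcher:
    """Fetcher-side: its job ends when a line buffer is filled."""
    def __init__(self, memory):
        self.memory = memory      # list of cache lines
        self.next_line = 0        # no PC, just the next line address
        self.buffers = []         # filled line buffers, in order

    def prefetch(self):
        """Fill a buffer and advance the line address right away."""
        if self.next_line < len(self.memory):
            self.buffers.append(self.memory[self.next_line])
            self.next_line += 1

class Decoder:
    """Decode-side: everything past the line buffers, including the
    choice of the `current' instruction(s) within a line."""
    def __init__(self, fetcher):
        self.fetcher = fetcher

    def take_line(self):
        # Whether the line is then decoded one by one or in parallel
        # is a decoder-internal decision.
        return self.fetcher.buffers.pop(0)
```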

> If you provide several instructions for parallel decoding,
> it increases the decoding logic and the problems.
> Furthermore, in the classical computing world, only one
> instruction is "active" at a time. So i consider that it is present
> in one "pipeline register", it is updated by the "fetcher" upon
> request from the decode and issue stages.

Let's not argue about words, please.  In the end, IF/D is going to be
a single unit anyway, not separate fetcher and decoder units, so it
really doesn't matter *who* selects the active instruction.

> >  Alternatively, the decoder could
> >use an `instruction window' that contains 2...4 (or more) instructions.
> >In case we ever want to do a kind of peephole optimization here (or issue
> >several instructions per cycle), we wouldn't have to change the fetcher.
> >  
> >
> KISS, you know ?

For FC0, yes.  But I'm also trying to look into the future sometimes.

> first, it's usually the compiler's job to optimise the code sequences, and
> it has opportunities to perform them on a much larger window than 2 to 4 
> instructions.

Yes and no.  It's the compiler's job, sure (and so on).  But there are
things a compiler can't do.

> Next, we have 64 registers and potentially 4 register addresses,
> how are you going to compare 8 to 16 6-bit words together ?
> (theoretically, that's 8*7/2=28 to 16*15/2=120 comparators).
> the clock speed and/or the pipeline depth are going to suffer.

I'm not sure that I have to.  But I'll have to think about it.

> >BTW: One possible optimization would be `loadcons clustering'.  If there
> >are several loadcons[x] instructions in a row (with the same destination
> >register), the decoder might put the constants together and process all
> >the loads in a single cycle.
> >  
> >
> naaah ....
> 
> how are you going to deal with "clusters" that spread across a cache
> line boundary (or worse) through a page ?

Not at all.

> Next, you'll have to put comparators, that determine that the
> loadcons is sent to the same register. That's O(log2(8)), at least
> 3 logic depths for 8 instructions/line.

First, I'm going to check which instructions are loadcons[x]es at all.
That's a 7-bit comparator with fixed second argument, depth=2.  I'll also
compare the destination registers of consecutive instructions (6 XOR
gates and combining NOR gates, depth=2). Then I'll combine those bits
with an AND (depth=3) and I have an indicator that these two consecutive
instructions can be clustered.

Of course this operation will benefit from the predecoding stage I
mentioned earlier.
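The pairwise check above can be sketched as follows. Note the encoding is hypothetical (a 7-bit opcode in the top bits and a 6-bit destination register in the low bits of a 32-bit word); it is not the real F-CPU instruction format, just enough to show the logic.

```python
LOADCONS_OPCODE = 0x51  # hypothetical opcode value

def is_loadcons(insn):
    """7-bit compare against a fixed value (logic depth ~2)."""
    return (insn >> 25) & 0x7F == LOADCONS_OPCODE

def same_destination(a, b):
    """6 XOR gates plus a combining NOR (logic depth ~2)."""
    return (a ^ b) & 0x3F == 0

def cluster_pairs(line):
    """For each pair of consecutive instructions in a fetched line,
    report whether they may be merged into one loadcons cluster
    (final AND brings the total depth to ~3)."""
    return [
        is_loadcons(a) and is_loadcons(b) and same_destination(a, b)
        for a, b in zip(line, line[1:])
    ]
```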

> Then, you'll have to make sure that the constants are in the correct order :
> no place for constant optimisations and "holes" in the constant, you have
> to provide contiguous indices.

The order doesn't matter. The `right' (later) instruction will have
precedence over the `left' (earlier) one.  That's a simple MUX
operation for all but the first instruction in the cluster.
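In software terms, the per-chunk MUX amounts to this; assuming (hypothetically) that each loadcons[x] writes one 16-bit chunk of the destination, a later write to the same chunk simply overrides an earlier one, so no ordering or contiguity check is needed.

```python
def merge_cluster(cluster):
    """cluster: list of (chunk_index, constant16) pairs in program
    order.  The later (`right') instruction has precedence -- that is
    the MUX select in hardware."""
    chunks = {}
    for idx, const in cluster:
        chunks[idx] = const & 0xFFFF   # later write wins
    value = 0
    for idx, const in chunks.items():
        value |= const << (16 * idx)
    return value
```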

> so that's reasons i'm able to find in my post-tiredness state,
> wait till i get some sleep if you want more :-)

Not necessary ;)
But let's talk about this stuff again when IF/D is implemented.
BTW: Will you write it?

[...]
> >>* currently, i don't think that predecoding
> >>can bring any advantage (yet). It will be useful in later architectures
> >>but i see no case where it can bring any benefit.
> >>    
> >>
> >The main idea was to simplify the decoder (lower latency),
> >
> what complex instruction needs preprocessing ?

Probably all of them.  I'm mainly concerned about the complexity of
the decoder.

[...]
> >E.g. if there is an unconditional jump, the fetcher may skip
> >the `linear' prefetch cycle to save memory bandwidth (and a buffer);
> >
> hmmm ... there are many ways to code an unconditional jump,
> like a call without ever using return for example.

That's something we can't detect, right.  Since it will cause a
useless prefetch, I consider it bad coding style ;)

> Jump instructions are easy to find but they offer quite some freedom
> and even return can be conditional (since it uses the main "jump" form).

Conditional jumps are not a problem, since they should not turn
prefetching off.

> You're going to put 8 complex comparators that will add 1 bit to the 
> cache lines.
> It's probably interesting but i'm curious to know how many tens of percents
> of wallclock time it is going to save.

We'll have to investigate that.  But on the other hand, every single
cycle counts :)

> >if there is a syscall instruction, it may prefetch code from the syscall
> >handler, and so on.
> >  
> >
> syscall can have either an immediate form or a register form :
> prefetch will not be possible for the register form.

In my versions of the manual, it's always been strictly immediate.

[...]
> now there is another question : split or unified TLB ?

A split TLB would harmonize with the split L1 cache and the separate IF/D
and L/S units, and would decouple code and data TLB lookups.  But it's
less space efficient (fixed limit of code and data TLB entries), and
probably also harder to implement.  That is, I'm not sure.

[...]
> >An adder for loadaddri won't fit into a single pipeline stage.
> 
> if there is a carry at the Xth position, it can stall the process in 
> order to compute the MSB ?

Please don't.  Remember that the immediate value is sign-extended,
and that the high part may go up or down.
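A small worked example of why a carry-only stall doesn't cover it; the widths below are illustrative, but the point is that a sign-extended immediate can make the high part go up (carry) or down (borrow), so both directions would have to be detected.

```python
def loadaddri(base, imm16):
    """64-bit base plus a sign-extended 16-bit immediate (sketch)."""
    if imm16 & 0x8000:               # sign-extend the immediate
        imm = imm16 - 0x10000
    else:
        imm = imm16
    return (base + imm) & 0xFFFFFFFFFFFFFFFF

# positive immediate: a carry ripples *into* the high part
assert loadaddri(0x0000_0000_FFFF_FFF0, 0x0020) == 0x0000_0001_0000_0010
# negative immediate: the high part goes *down* (borrow)
assert loadaddri(0x0000_0001_0000_0000, 0x8000) == 0x0000_0000_FFFF_8000
```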

[...]
> Now that most interesting units are "ready", there is quite a lot of 
> work to do in
> order to connect these units together and then schedule operations.

Yep.  But there also are some things to improve.  E.g. `widen' is still
unimplemented (not a big issue), and the SHL unit doesn't scale in byte
mode (in particular, `permute'/`sdup' will be a problem when the word
size increases).  And I think I have found a way to implement add/sub
with signed saturation.

> And i don't know what has happened to the Register Set's sources.

Did you lose them?

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/