[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

(LSU!) Re: [f-cpu] manual update



hi,

nico wrote:

> In fact, it's very easy to explain. But It's not the strong point of
> Whygee ;D
it's a problem with more people than you'd agree...

And please read this post carefully, it's probably the clearest
thing i could write. The invalidation problem is certainly
a big point that you have not yet addressed and that has forced
me to abandon the rough idea you have.

> To speed up things, register adress are used as an inputs inside  kind
> of cache and return the content of a cache line (or more) pointed by the
> content of the register. So we bypass the register read and we can
> immediately acceded to the content of the memory before having read the
> register bank. To verify the integrity, we check the real adresse with
> the adress read in the register.
no need to check the integrity afterwards. The address and the
register's contents MUST be coherent at the time the load/store/jump
instruction is issued. Otherwise, what would be the benefit ?
Of course, race conditions must be avoided (and they can appear
everywhere because the latency is pretty high) but they must be
solved "by design", not by explicit coherency checks...

> This memory could react as a caches. But i think that a 64 registers
> adresse isn't so much and we can use a memory instead of a cache system
> (with 8 lines instead of 64).
in theory you have a point. Except that you open a big
can of worms and 64 lines is really a lot, small FPGAs
could not contain that. The LSU is also a multiported
memory and more lines would make the chip even bigger.
I estimate that the FC0 clock speed is limited by the
register set (and possibly by the Xbar), so if you add
another big array, you further slow the clock down.
Finally, the alias problems appear. Data duplication is
often a sign of bad design, particularly when it is
in the critical path...

Don't forget that the LSU has to speak directly with the
cache and at least one I/O (otherwise, where would the
informations come ???) which is certainly not 256 bit wide.
Furthermore, writes must have a byte-granularity and store
some "dirty" flags. Finally, how would you separate the
instructions from the data ? (because the registers can point
to both types) Would there be a big block from which instructions
are read and data are read/written ? then that would be
a multiported block, at least with 5 read ports and 4 write
ports. The Register set already totals 5 ports and it's a
big slow piece of silicon...

The original idea of 8 lines for the LSU and 8 lines for the fetcher
is not such a bad idea. The critical datapath is already a problem
but it can be pipelined. The register-line associativity problem
is the price to pay, if we want to work at reasonable speeds
and use a reasonable amount of silicon (though i already know
the usual objections to these arguments). Here is the central problem :

When a LSU line is flushed (selected by the LRU mechanism, and
a line must be replaced with data coming from the cache)
you have to reverse-map the allocation between
the registers and the lines. And this part is extremely important :
if you do it slowly, there are more risks of race conditions...
The reverse mapping is usually done by a CAM, just like what i do
with the configuration i promote now. Except that my CAM has
at less entries than yours. And i use the CAM (comparators)
at the decode stage, while you need the comparators at the time
of replacement. That's the structural difference, but it can
become important.

> The problem are alias
not if you design your system to avoid this condition.

> what's up if 2 registers write on the same
> location thought 2 differents pointers ? This line are no more up to
> date (this unit are the LSU, if i have a good mem).
In the current LSU system,
writes and reads are serialised (race conditions can be detected
and avoided). If your program is correctly designed (that is :
the order of the operations is important, as it usually is the case)
i don't see why there would be a problem.

Of course, if your system promotes data duplication, alias
and coherency are some of the many inherent problems.
Another problem is that the unit is larger --> consumes
more power and is slower...

> Whygee seems to have found an idea to handel that with 2 pointers. I
> personnaly beleive that's to much glue logic.
there is a compromise to do at one point.
The goal is to reduce the amount of logic to the minimum.
(As written above, less logic = less surface = less power = faster)
You have to agree (otherwise there's something wong)
that a reverse-mapping mechanism is necessary :
when the LRU logic of the line indicates which line to evict
from the LSU or the Fetcher (because we need a line
for data coming from main RAM or caches), you have to
invalidate the registers that were associated to this line
(so that any use of this register as a pointer will trigger
another line replacement). LRU makes data replication impossible
and transparent to the programmer, but some coherency maintainance
is necessary.

Remember that the mechanism you propose was the "simple"
solution i first found : looks nice and fast, but the
invalidation logic is an order of magnitude more complex.
If i give you the number of the line that is being invalidated,
how do you invalidate the numbers ?

  - first solution : each LSU and fetcher line has another
63-bit wide field, with each bit representing the associated
register. When the line is invalidated, the corresponding bit
is sent to reset the "valid" flag of the register.
  ===> note that this means that we have a minimum of
       63*16 bits of memory. The 63-bit field contains
       the same information as the information stored by the
       register flag, except that the encoding is a bit different.
63*16=1008 bits of memory, the quarter of the size of the
normal register set. That's much. And the wires are pretty long
and they are a lot : it's not a good thing for the surface,
the speed, etc.

  - second solution : use a "CAM", put 2x 4-bit comparators at
    each regsister flag (we have to be able to invalidate both
    LSU and fetcher lines at the same time). The flag contains
    6 bits : 1 "Checked" bit, 1 "valid" bit, 1 Instruction/data bit
    and 3 line bits (the address). When the correct line address is
    detected on the 2x3-bit input (coming from the LSU and Fetcher),
    the flags are cleared.
====> that's better : only 3*63 bits of storage (a total of 400 bits
when we also count the other flags). However you have to compare
126 values ! Even though the comparators are 3-bits each, that's a lot.

Here, notice that i take your assumption that full associativity
is required. All 63 registers can be used as pointers at a time
(as long as the addressed space fits inside the LSU capacity).

  - Whygee's solution :
take a program and count how many pointers can be used at a time.
Count on the LRU to evict pointers that were not used for a certain time...
I'll put the upper limit arbitrarily to 32, but it can't exceed 63
anyway. A limit of 24 would be more realistic (it still leaves
some room for a SW call stack) but let's say it's a user-defined
parameter.

This number also represents the number of comparators.
They are 6-bit wide because they compare the register number
(that comes from the instruction word) with the number that is
locally stored (between the LSU and the fetcher). An additional
flag will indicate whether the data is valid, or we could also
reset the value to 0 (register #0 is invalid as a pointer).
With 32 register comparators, we need 32 comparators
of 6 bits = 192 bits of storage. If that's too much,
use only 24 comparators and you need 144 bits.

The "counting argument" is one of the many points. Another point
is the drive : with the 2nd solution, the invalidation step
needs that 6 wires (from the LSU and the fetcher) feed 63 inputs each.
With my solution, we need 6 wires to drive 32 inputs each with the
register to find.

The last (and most interesting argument) is that invalidation
doesn't need to propagate invalidation signals through long wires :
for speed and locality reasons, the comparators are next or very close
to the lines to which they are associated. When a match is detected,
they can send the "hit" signal to the line faster than with the 2nd
solution, which has to decode the register number, then decode the
line number. That's a 2-level indirection (6->63 then 3->8)
that is probably not faster than comparators because of the wire lengths.

Note that the problem can become different in a superscalar
architecture. Decoding only one instruction per cycle has
the advantage of simplicity but a 2- or 3-instruction wide
CPU will certainly require another strategy...

Concerning the association between a register and a line, it is
not necessarily "one to one" : it is possible to associate
a register to several lines (3 or 4) and the reverse is necessarily
possible (a line can be pointed to by more than one register).
This depends on the number of comparators. It adds some storage
(it needs to know which line is associated to which register)
but not inside the critical datapath (it's managed by the LRU
mechanics in a single cycle).

Concerning the register invalidation process, the signal comes
from the LRU mechanism, which is triggered by the replacement logic,
which is itself used when a new cache line is fetched for example.
With nicO's approach (multi-CPU system that can "share" cache lines),
it can also come from a remote CPU but it's his business,
the mechanism is the same.

> This mecanism could be only used for code because it's "mostly" read
> only, so there no alias problem. But for data, i don't know.
If you say you don't know... :-P

Concerning aliases : they must be avoided "by design".
The address that comes from the TLB is compared with all the
other addresses in the fetcher and the LSU ==> no line
will answer at the beginning. The least recently use line
will be selected to hold the data. Later, when the address
is wanted again, only ONE line will answer. This is independent
from the register number. pointer alias does not necessarily
mean duplication of the data with the existing LSU.

> I have proposed instead of using the register adress to use the stream number
> of the load&store instructions.
? maybe i didn't understand.

> That's the main purpose of the stream : differentiate data stream. 
at least you have understood the principle ;-)
but this story is different from the story of associating
registers to pointers. Remember that there is no "stream hint"
for the jump instructions, but the jump uses the same
mechanism as load and store.

happy voting,

> nicO
>> WHYGEE
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/