Rep:(LSU!) Re: [f-cpu] manual update
- To: <firstname.lastname@example.org>
- Subject: Rep:(LSU!) Re: [f-cpu] manual update
- From: "Nicolas Boulay" <email@example.com>
- Date: Mon, 10 Jun 2002 15:14:58 GMT
- Delivered-To: firstname.lastname@example.org
- Delivery-Date: Mon, 10 Jun 2002 11:15:16 -0400
- Reply-To: email@example.com
- Send-By: 126.96.36.199 with Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt; FR 15/06/2000)
- Sender: firstname.lastname@example.org
From: Yann Guidon <email@example.com>
Subject: (LSU!) Re: [f-cpu] manual update
> In fact, it's very easy to explain. But it's not the strong point of
> Whygee ;D
it's a problem for more people than you'd admit...
And please read this post carefully, it's probably the clearest
thing i could write. The invalidation problem is certainly
a big point that you have not yet addressed, and it has forced
me to abandon the rough idea you describe.
> To speed things up, the register address is used as an input to a kind
> of cache, which returns the content of the cache line (or lines) pointed
> to by the content of the register. So we bypass the register read and
> can access the memory content immediately, before having read the
> register bank. To verify the integrity, we compare the real address
> with the address read from the register.
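(A sketch of the mechanism as described above, as a hypothetical Python model — this is not f-cpu code and all names are invented:)

```python
# Hypothetical model of the proposed bypass: the register NUMBER indexes a
# small memory whose line holds the last address this register pointed to,
# plus the data at that address. The load is started before the register
# bank is read; the address check happens afterwards.

class SpeculativeLSU:
    def __init__(self):
        self.lines = {}                      # reg number -> (address, data)

    def fill(self, reg, address, data):
        self.lines[reg] = (address, data)

    def speculative_load(self, reg):
        # Bypass: no register-bank read is needed to get the data.
        return self.lines.get(reg)           # None means a miss

    def verify(self, reg, real_address):
        # Integrity check, once the register bank has been read.
        entry = self.lines.get(reg)
        return entry is not None and entry[0] == real_address

lsu = SpeculativeLSU()
lsu.fill(reg=3, address=0x1000, data=42)
entry = lsu.speculative_load(3)              # data available immediately
assert entry == (0x1000, 42)
assert lsu.verify(3, real_address=0x1000)    # speculation was correct
assert not lsu.verify(3, real_address=0x2000)  # pointer changed: replay
```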
no need to check the integrity afterwards. The address and the
register's contents MUST be coherent at the time the load/store/jump
instruction is issued. Otherwise, what would be the benefit ?
>>> More parallel operations. At the verification stage you could abort
a load that was already started.
Of course, race conditions must be avoided (and they can appear
everywhere because the latency is pretty high) but they must be
solved "by design", not by explicit coherency checks...
>>> Yes ! But how ?
> This memory could behave like a cache. But i think that 64 register
> addresses isn't so much, and we can use a plain memory (64 lines)
> instead of a cache (8 lines).
in theory you have a point. Except that you open a big
can of worms, and 64 lines is really a lot; small FPGAs
could not contain that. The LSU is also a multiported
>>> A small FPGA couldn't handle the multiply unit either !
A big FPGA can have ~100,000 bits of storage, in blocks of 4,000 bits. This
isn't a problem at all. This memory could have 2 ports (one write, one read).
memory and more lines would make the chip even bigger.
I estimate that the FC0 clock speed is limited by the
register set (and possibly by the Xbar), so if you add
another big array, you further slow the clock down.
>>> The true blocking point is the number of ports to the register bank
Finally, the alias problems appear. Data duplication is
often a sign of bad design, particularly when it is
in the critical path...
Don't forget that the LSU has to speak directly with the
cache and at least one I/O port (otherwise, where would the
information come from ?), which is certainly not 256 bits wide.
>>> It speaks only with the data bus for the data part and with the
instruction bus for the program part (Harvard architecture). That's all.
Furthermore, writes must have a byte-granularity and store
some "dirty" flags. Finally, how would you separate the
instructions from the data ? (because the registers can point
>>> The hints aren't given by the same instruction.
to both types) Would there be a big block from which instructions
are read and data are read/written ? then that would be
a multiported block, at least with 5 read ports and 4 write
ports. The Register set already totals 5 ports and it's a
big slow piece of silicon...
The original idea of 8 lines for the LSU and 8 lines for the fetcher
is not such a bad idea. The critical datapath is already a problem
but it can be pipelined. The register-line associativity problem
>>> A cache-like system is much slower than a simple RAM.
is the price to pay, if we want to work at reasonable speeds
and use a reasonable amount of silicon (though i already know
the usual objections to these arguments). Here is the central problem :
When an LSU line is flushed (selected by the LRU mechanism, because
a line must be replaced with data coming from the cache),
you have to reverse-map the allocation between
the registers and the lines. And this part is extremely important :
if you do it slowly, there are more risks of race conditions...
The reverse mapping is usually done by a CAM, just like what i do
with the configuration i promote now. Except that my CAM has
fewer entries than yours. And i use the CAM (comparators)
at the decode stage, while you need the comparators at the time
of replacement. That's the structural difference, but it can
>>> True CAMs are much too slow. An array of comparators is much bigger
than a decoder. (A CAM is only a cache with an associativity equal to
the number of lines.) A cache always needs a fill strategy.
Here, there is no need for that !
A hint that a register contains a pointer -> the corresponding line is filled.
In the case of data, i don't use the register number but the stream number.
> The problem is aliasing
not if you design your system to avoid this condition.
> what happens if 2 registers write to the same
> location through 2 different pointers ? The line is no longer up to
> date (this unit is the LSU, if i remember correctly).
In the current LSU system,
writes and reads are serialised (race conditions can be detected
and avoided). If your program is correctly designed (that is :
the order of the operations matters, as is usually the case)
i don't see why there would be a problem.
>>> You assert that. The compiler will be much too tricky if the alias
problem remains.
Of course, if your system promotes data duplication, alias
and coherency are some of the many inherent problems.
Another problem is that the unit is larger --> consumes
more power and is slower...
>>> That's the problem !
> Whygee seems to have found an idea to handle that with 2 pointers. I
> personally believe that's too much glue logic.
there is a compromise to be made at some point.
The goal is to reduce the amount of logic to the minimum.
(As written above, less logic = less surface = less power = faster.)
You have to agree (otherwise there's something wrong)
that a reverse-mapping mechanism is necessary :
when the LRU logic of the line indicates which line to evict
from the LSU or the Fetcher (because we need a line
for data coming from main RAM or caches), you have to
invalidate the registers that were associated to this line
(so that any use of this register as a pointer will trigger
another line replacement).
>>> That's sure. That's how caches work.
LRU makes data replication impossible
and transparent to the programmer, but some coherency maintenance is needed.
>>> There is no coherency between the true address and the memory
content. You never use the true address of the data, so ?
Remember that the mechanism you propose was the "simple"
solution i first found : looks nice and fast, but the
invalidation logic is an order of magnitude more complex.
If i give you the number of the line that is being invalidated,
how do you invalidate the numbers ?
- first solution : each LSU and fetcher line has another
63-bit wide field, with each bit representing the associated
register. When the line is invalidated, the corresponding bit
is sent to reset the "valid" flag of the register.
===> note that this means that we have a minimum of
63*16 bits of memory. The 63-bit field contains
the same information as the information stored by the
register flag, except that the encoding is a bit different.
63*16 = 1008 bits of memory, a quarter of the size of the
normal register set. That's a lot. And the wires are long
and numerous : it's not a good thing for the surface,
the speed, etc.
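(A sketch of this first solution, as a hypothetical Python model — the names are invented, but the storage figures match the text:)

```python
# Hypothetical model of solution 1: each of the 16 lines (8 LSU + 8 fetcher)
# carries a 63-bit field, one bit per register currently pointing into it.
# Invalidating a line broadcasts resets to the registers' "valid" flags.

NUM_REGS, NUM_LINES = 63, 16

line_reg_mask = [0] * NUM_LINES       # the extra storage: 63 bits per line
reg_valid = [False] * NUM_REGS        # per-register "valid as pointer" flag

def associate(line, reg):
    line_reg_mask[line] |= 1 << reg
    reg_valid[reg] = True

def invalidate_line(line):
    # Each set bit resets the corresponding register flag.
    for reg in range(NUM_REGS):
        if (line_reg_mask[line] >> reg) & 1:
            reg_valid[reg] = False
    line_reg_mask[line] = 0

assert NUM_REGS * NUM_LINES == 1008   # the 1008 bits of extra memory
associate(2, 5)
associate(2, 17)
invalidate_line(2)
assert not reg_valid[5] and not reg_valid[17]
```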
>>> Isn't that what is called the tag memory plane of a cache ?
- second solution : use a "CAM", put 2x 3-bit comparators at
each register flag (we have to be able to invalidate both
LSU and fetcher lines at the same time). The flag contains
6 bits : 1 "Checked" bit, 1 "valid" bit, 1 Instruction/data bit
and 3 line bits (the address). When the correct line address is
detected on the 2x3-bit input (coming from the LSU and Fetcher),
the flags are cleared.
====> that's better : only 3*63 bits of storage (a total of about 400 bits
when we also count the other flags). However you have to compare
126 values ! Even though the comparators are 3 bits each, that's a lot.
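(A sketch of this second solution, again as a hypothetical Python model with invented names:)

```python
# Hypothetical model of solution 2: the flag attached to each register stores
# a valid bit, an instruction/data bit and the 3-bit line address. In hardware,
# invalidating a line drives its number to one comparator per register
# (126 compares, since LSU and fetcher lines can be invalidated together).

NUM_REGS = 63
flags = [{"valid": False, "is_insn": False, "line": 0} for _ in range(NUM_REGS)]

def associate(reg, line, is_insn):
    flags[reg].update(valid=True, is_insn=is_insn, line=line)

def invalidate(line, is_insn):
    for f in flags:                  # one 3-bit comparator per register
        if f["valid"] and f["is_insn"] == is_insn and f["line"] == line:
            f["valid"] = False

associate(reg=10, line=5, is_insn=False)
associate(reg=11, line=5, is_insn=True)
invalidate(line=5, is_insn=False)    # only the DATA pointer is cleared
assert not flags[10]["valid"] and flags[11]["valid"]
```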
>>> I didn't really understand what the index of the CAM is and what the
indexed data is.
Here, notice that i take your assumption that full associativity
is required. All 63 registers can be used as pointers at a time
(as long as the addressed space fits inside the LSU capacity).
- Whygee's solution :
take a program and count how many pointers can be used at a time.
Count on the LRU to evict pointers that were not used for a certain time.
I'll put the upper limit arbitrarily at 32, but it can't exceed 63
anyway. A limit of 24 would be more realistic (it still leaves
some room for a SW call stack) but let's say it's a user-defined parameter.
This number also represents the number of comparators.
They are 6-bit wide because they compare the register number
(that comes from the instruction word) with the number that is
locally stored (between the LSU and the fetcher). An additional
flag will indicate whether the data is valid, or we could also
reset the value to 0 (register #0 is invalid as a pointer).
With 32 register comparators, we need 32 comparators
of 6 bits = 192 bits of storage. If that's too much,
use only 24 comparators and you need 144 bits.
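(A sketch of this third scheme, as a hypothetical Python model — slot allocation and LRU details are my own simplifications:)

```python
# Hypothetical model of Whygee's scheme: only N pointer slots (24..32), each
# storing a 6-bit register number. The comparison happens at DECODE time,
# against the register number taken from the instruction word. A slot
# holding r0 is free, since register #0 is invalid as a pointer.

NUM_SLOTS = 32                        # 32 slots * 6 bits = 192 bits of storage
slots = [0] * NUM_SLOTS

def allocate(reg):
    for i, r in enumerate(slots):
        if r == 0:                    # first free slot
            slots[i] = reg
            return i
    return None                       # all slots busy: the LRU must evict one

def lookup(reg):
    # Decode stage: 6-bit comparators match the instruction's register field.
    for i, r in enumerate(slots):
        if r == reg and reg != 0:
            return i
    return None                       # miss -> normal (slow) load

def invalidate(slot):
    slots[slot] = 0                   # resetting to r0 marks the slot free

i = allocate(12)
assert lookup(12) == i
invalidate(i)
assert lookup(12) is None
assert NUM_SLOTS * 6 == 192
```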
The "counting argument" is one of the many points. Another point
is the drive : with the 2nd solution, the invalidation step
requires that 6 wires (from the LSU and the fetcher) feed 63 inputs each.
With my solution, we need 6 wires to drive only 32 inputs each with the
register number to find.
The last (and most interesting) argument is that invalidation
signals don't need to propagate through long wires :
for speed and locality reasons, the comparators are next to, or very close
to, the lines with which they are associated. When a match is detected,
they can send the "hit" signal to the line faster than with the 2nd
solution, which has to decode the register number, then decode the
line number. That's a 2-level indirection (6->63 then 3->8)
that is probably not faster than comparators, because of the wire delays.
Note that the problem can become different in a superscalar
architecture. Decoding only one instruction per cycle has
the advantage of simplicity but a 2- or 3-instruction wide
CPU will certainly require another strategy...
Concerning the association between a register and a line, it is
not necessarily "one to one" : it is possible to associate
a register to several lines (3 or 4) and the reverse is necessarily
possible (a line can be pointed to by more than one register).
This depends on the number of comparators. It adds some storage
(it needs to know which line is associated to which register)
but not inside the critical datapath (it's managed by the LRU
mechanics in a single cycle).
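(A sketch of this many-to-many association, as a hypothetical Python model; the table structure is my own illustration:)

```python
# Hypothetical model of the non-"one to one" association: a register may map
# to several lines (3 or 4) and a line may be pointed to by several registers.
# The extra table is managed by the LRU logic, outside the critical datapath.

from collections import defaultdict

MAX_LINES_PER_REG = 4
reg_to_lines = defaultdict(set)       # register -> up to 4 lines
line_to_regs = defaultdict(set)       # line -> all registers pointing into it

def associate(reg, line):
    if len(reg_to_lines[reg]) < MAX_LINES_PER_REG:
        reg_to_lines[reg].add(line)
        line_to_regs[line].add(reg)

def evict(line):
    # LRU eviction: detach the line from every register that used it.
    for reg in line_to_regs.pop(line, set()):
        reg_to_lines[reg].discard(line)

associate(7, 0); associate(7, 1); associate(8, 1)
evict(1)
assert reg_to_lines[7] == {0} and reg_to_lines[8] == set()
```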
>>> But i don't see how you could find that 2 prefetch lines have the same
address. That's our problem. All the rest of the talk is about cache techniques.
Concerning the register invalidation process, the signal comes
from the LRU mechanism, which is triggered by the replacement logic,
which is itself used when a new cache line is fetched for example.
With nicO's approach (multi-CPU system that can "share" cache lines),
>>> ??? share a cache ??? Do you mean the L2 inside a multi-CPU system ?
it can also come from a remote CPU but it's his business,
the mechanism is the same.
> This mechanism could only be used for code, because it's "mostly" read
> only, so there is no alias problem. But for data, i don't know.
If you say you don't know... :-P
Concerning aliases : they must be avoided "by design".
>>> By the program, you mean ?
The address that comes from the TLB is compared with all the
other addresses in the fetcher and the LSU ==> no line
will answer at the beginning. The least recently used line
will be selected to hold the data. Later, when the address
is wanted again, only ONE line will answer. This is independent
from the register number. pointer alias does not necessarily
mean duplication of the data with the existing LSU.
>>> Ok ! When you want to fill the LSU, you compare each store address with
the addresses of the cache lines already stored. So a line could be associated
with more than one register. So you need a memory that contains the link
between the LSU lines and the registers.
In the case of a read, you go reg# -> line# -> content of the memory. That's 2
reads : one of 63*6 bits and one of any(8)*cache line.
In the case of filling, you need 8 comparators of 64 bits to detect a match.
I have understood ! It's exactly how "write buffers"/"prefetch buffers" work.
> I have proposed to use the stream of the load & store instructions
> instead of the register address.
? maybe i didn't understand.
>>> Instead of linking register address and data, i prefer to link the data
stream and memory. That's the goal of memory streams ! Not checking
coherency between them.
> That's the main purpose of the stream : differentiate data stream.
at least you have understood the principle ;-)
but this story is different from the story of associating
registers to pointers. Remember that there is no "stream hint"
for the jump instructions, but the jump uses the same
mechanism as load and store.
>>> That's why i speak about memory stream numbers for data and register
addresses for instructions.
>>> Soon done ! It has been closed since 20:00 :p
To unsubscribe, send an e-mail to firstname.lastname@example.org with
unsubscribe f-cpu in the body. http://f-cpu.seul.org/