Re: Rep:Re: [f-cpu] virtually or physically-addressed cache ?
> From: Michael Riepe
> > On Sat, Mar 02, 2002 at 02:07:32AM +0100, Yann Guidon wrote:
> > > Nicolas Boulay wrote:
> > > > > - synonym problem (several different virtual addresses cannot span the same
> > > > > physical addresses without being duplicated in cache).
> > > > Rejected. This one causes severe problems.
> > > > >>> Not really, it only wastes space!
> > > i do not agree with "only".
> > Neither do I (unless we're talking about the I-cache only).
> Why might there be severe problems? Well, a cache contains both a tag and
> data associated with a virtual address. If two different virtual addresses
> that span the same physical address are present in the cache, the two
> cache entries might hold different data, so coherency would be broken:
> which data should be kept and written back to external memory?
this problem is "solved" by using physical tags, that's all :-)
> > > > > (2) physically-addressed caches (physical tags)
> > > > > - do virtual-to-physical address translation on every access
> > > > Not necessarily. The TLB lookup can be started as soon as loadaddr (or
> > > > one of its variants) is called, and doesn't have to be repeated in all
> > > > cases (e.g. with a postincremented pointer, you'll perform a range check
> > > > first and only do a full lookup if the range check fails).
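Michael's range check could be sketched like this (names are mine; a software model of a hardware mechanism, under the assumption of simple 4 KiB pages): the last translation is cached, and a postincremented pointer only pays for a full TLB lookup when it walks out of the cached page.

```c
#include <stdint.h>

/* Cached last translation: full lookup only on a page-range-check failure. */
#define PAGE_SHIFT 12
#define PAGE_MASK  (~((uint64_t)(1u << PAGE_SHIFT) - 1))

static uint64_t cached_vbase = ~(uint64_t)0;  /* virtual page base   */
static uint64_t cached_pbase;                 /* physical page base  */
static unsigned full_lookups;                 /* slow-path counter   */

static uint64_t tlb_full_lookup(uint64_t va)  /* toy: identity map   */
{
    full_lookups++;
    return va & PAGE_MASK;
}

static uint64_t translate(uint64_t va)
{
    if ((va & PAGE_MASK) != cached_vbase) {   /* range check failed  */
        cached_vbase = va & PAGE_MASK;
        cached_pbase = tlb_full_lookup(va);
    }
    return cached_pbase | (va & ~PAGE_MASK);  /* fast path: no TLB   */
}
```

Streaming 8-byte loads through memory then costs one full lookup per 4 KiB page instead of one per access.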
> > > >
> > > > >>> Consider that the LSU is a kind of virtually-addressed cache.
> > > that's one perspective.
> > I'm not sure if I fully understand what exactly the LSU is supposed to do.
> > In my mental picture, its job was to keep the data cache filled, not
> > cache things itself.
> I thought the LSU was the "Load Store Unit", a functional unit that handles LOAD and STORE?
This is the programmer's point of view. It is correct, but the LSU (and its
symmetric counterpart, the instruction fetcher) do much more than that, because
"handle" implies many things.
Marco Al wrote:
> From: "Christophe" <firstname.lastname@example.org>
> > Why might there be severe problems? Well, a cache contains both a tag and
> > data associated with a virtual address. If two different virtual addresses
> > that span the same physical address are present in the cache, the two
> > cache entries might hold different data, so coherency would be broken:
> > which data should be kept and written back to external memory?
> There is a simple software solution to this: don't allow the OS to let that
> happen :)
i don't know if it is wise to put that kind of pressure on the OS.
> With 64 bits to go around you don't really need per-process memory spaces
i don't think that this argument is valid. F-CPU defines 64-bit pointers
but an implementation might use only a small subset thereof. For example, a dumb
prototype might use 5+16 = 21 address bits: 5 bits for the LSU index (each cache
line is 32 bytes wide) and 16 bits for the remaining comparators (the address
comparators of the LSU, for example). A first commercial version might increase
that to 5+32 = 37 bits, and others might choose to reduce or increase the
physically addressable space, ranging from "embedded" versions to mainframes...
I think that the VMID (named ASID in MIPS) is still necessary and overlapping pages
might help communication between processes from different VMIDs, thus reducing
the cost of copying pages from one process to another.
> (you still need per-process memory protection of course, but you don't
> need to go to external memory to change that).
> From: Marco Al <email@example.com>
> > From: "Christophe" <firstname.lastname@example.org>
> A lazy solution, in my opinion. Not viable for a microkernel
> or an exokernel, which really counts on sharing
> some physical pages at different virtual addresses with different access rights.
> Anyway, if an OS programmer wants to use per-process address spaces,
> he should be able to do so. Having it
> doesn't prevent us from being able to use a unique address space if we like.
that's more or less the point i made above :-)
Let's implement just enough, but not too much, so that if it's unused,
there is no waste of silicon surface...
> michael :
> I'm not sure if I fully understand what exactly the LSU is supposed to do.
> In my mental picture, its job was to keep the data cache filled, not
> cache things itself.
> >>> I see it as a simple bubble buffer that acts as a cache (linked to the
> register numbers for the I-cache and to the memory tags for the D-cache)
* it is indeed a buffer that has the same behaviour as a small cache,
but reduced to 8 lines of 256 bits.
* The fetcher and the LSU are symmetrical, with the _only_ difference
in structure that the fetcher handles 32-bit data only (8 per line)
and does not have to handle writes from the pipeline.
* Each line is "linked" to zero, one or more register numbers,
to ease decoding and static scheduling. So there is a constant number
of cycles for transferring data from the buffer to the pipeline,
and in the other direction (to the LSU).
* Upon accessing a line, a 3-bit LRU counter is updated,
so the LSU and the fetcher work in an almost-ideal Least Recently Used fashion.
* A pair of lines (possibly non-contiguous, but it's not clear yet how) will
be formed to handle "streams" (contiguous accesses to/from memory)
with a double-buffering scheme.
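The buffer described in the points above could be modelled like this in C (a sketch under my own naming; the field layout is illustrative, only the stated parameters — 8 lines of 256 bits, 3-bit LRU counters, register links — come from the mail):

```c
#include <stdint.h>

/* LSU buffer model: 8 lines of 256 bits, 3-bit LRU, register links. */
#define NLINES 8

typedef struct {
    int      valid;
    uint64_t tag;          /* memory tag (line address)               */
    uint64_t data[4];      /* 256 bits = 4 x 64                       */
    unsigned lru;          /* 3-bit counter: 0 = most recently used   */
    uint64_t reg_links;    /* bitmap: which registers point here      */
} lsu_line_t;

static lsu_line_t lsu[NLINES];

/* On a hit, promote the line to MRU and age the younger lines:
 * with 8 lines, 3 bits are exactly enough to keep a total order. */
static void lru_touch(unsigned hit)
{
    unsigned old = lsu[hit].lru, i;
    for (i = 0; i < NLINES; i++)
        if (lsu[i].valid && lsu[i].lru < old)
            lsu[i].lru++;          /* age the younger lines */
    lsu[hit].lru = 0;              /* hit line becomes MRU  */
}

/* Victim selection: a free line first, else the highest counter. */
static unsigned lru_victim(void)
{
    unsigned i, best = 0;
    for (i = 0; i < NLINES; i++) {
        if (!lsu[i].valid)
            return i;
        if (lsu[i].lru > lsu[best].lru)
            best = i;
    }
    return best;
}
```

Note why 3 bits suffice: 2^3 = 8 counter values for 8 lines means the counters can encode a complete recency ordering, which is what makes the "almost-ideal LRU" claim possible.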
Those were the behaviours from the pipeline's point of view, but they are
not the only characteristics of the LSU and the fetcher!
* The LSU and the fetcher are the only devices connected to their
associated L1 cache: this simplifies the layout and protocol,
because there is a unique 256-bit bus for (simultaneously) reading
or writing a line.
* The design of the LSU and the fetcher is tightly coupled to their
associated L1 caches, which can be considered like a "reservoir" or
"swap space" for the memory buffers.
* The L1 strategy is to write back on cache-line replacement: this is
logical in a system where there is no "main memory" but a "private memory range".
In a sense, the L1 is like a large FIFO from which one element can be
taken and put in front again, while the line arriving at the end of the FIFO
is written to memory.
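The write-back-on-replacement policy just described can be sketched as follows (a direct-mapped toy L1 of my own choosing; the mail fixes the policy, not the organisation): a line stays dirty in L1 until it is evicted, and only then goes out on the memory bus.

```c
#include <stdint.h>

/* Write-back on replacement: dirty data leaves L1 only at eviction time. */
#define L1_LINES   64
#define LINE_BYTES 32

typedef struct {
    int      valid, dirty;
    uint64_t tag;
    uint8_t  data[LINE_BYTES];
} l1_line_t;

static l1_line_t l1[L1_LINES];
static unsigned  writebacks;           /* evictions that reach memory */

static void memory_write(uint64_t tag, const uint8_t *data)
{
    (void)tag; (void)data;
    writebacks++;                      /* stand-in for the (SD)RAM bus */
}

static l1_line_t *l1_access(uint64_t addr, int is_write)
{
    uint64_t tag = addr / LINE_BYTES;
    l1_line_t *l = &l1[tag % L1_LINES];
    if (l->valid && l->tag != tag) {   /* replacement: the FIFO "tail" */
        if (l->dirty)
            memory_write(l->tag, l->data);
        l->valid = 0;
    }
    if (!l->valid) {
        l->valid = 1;
        l->dirty = 0;
        l->tag   = tag;                /* refill from memory elided    */
    }
    if (is_write)
        l->dirty = 1;                  /* no write-through: stays here */
    return l;
}
```

Repeated writes to the same line cost nothing on the memory bus; the bus is only used when a dirty line falls out of the "FIFO".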
* All memory reads and writes to/from L1 go through the corresponding
memory buffer (LSU or fetcher). This is both a cause and a consequence
of the fact that the L1 must be kept as simple as possible but must
handle multi-word transactions on 32-bit and 64-bit words. The "scatter
and gather" of words within a cache line is handled only by the LSU or
the fetcher.
* nicO doesn't agree with this: the LSU and the fetcher are connected to their
own L1, as well as to the other memory system buses: L2, I/O bus, private (SD)RAM bus.
I justify this with several facts and desirable properties:
- perform the line split (256 bits to and from 8, 16, 32 and 64 bits) in a single
location. This increases the local density but simplifies all the rest.
- having a single location where data is considered coherent. If the CPU
has direct access to all the memory cache layers, it simplifies the
invalidation problem. When an external system asks our CPU for a data read
or write, all there is to do is hand the address to the LSU
and the fetcher, which then do their work as they usually do for the
pipeline.
- reducing the overall memory latency. Having all (or most) layers connected
to a single point might be a small overhead but avoids the "avalanche" effect
on a cache miss, and this becomes particularly important when the code works
mostly with uncached or scattered data. OTOH, the LSU and the fetcher are
designed so that the minimum latency is hidden when consecutive accesses
are performed (with stream detection that automatically enables the
double-buffering scheme).
Damn, this is starting to look like a patent claim list :-/
To unsubscribe, send an e-mail to email@example.com with
unsubscribe f-cpu in the body. http://f-cpu.seul.org/