
Re: [f-cpu] virtually or physically-addressed cache ?



-----Original Message-----
From: Yann Guidon <whygee@f-cpu.org>
To: f-cpu@seul.org
Date: 04/03/02
Subject: Re: [f-cpu] virtually or physically-addressed cache ?

hello,

Christophe wrote:
> From: Michael Riepe
> > On Sat, Mar 02, 2002 at 02:07:32AM +0100, Yann Guidon wrote:
> > > Nicolas Boulay wrote:
> > > > > - synonym problem (several different virtual addresses cannot
> > > > >   map to the same physical address without being duplicated
> > > > >   in the cache).
> > > > Rejected. This one causes severe problems.
> > > > >>> Not really, it only wastes space !
> > > i do not agree with "only".
> > Neither do I (unless we're talking about the I-cache only).
> Why might there be severe problems ? Well, a cache contains both a
> tag and data associated with a virtual address. If two different
> virtual addresses map to the same physical address and are both
> present in the cache, the two cache entries may hold different data,
> so coherency would be broken : which data should be kept and written
> back to external memory ?

this problem is "solved" by using physical tags, that's all :-)

>>> Sure ! But then you must put the TLB in the critical data path of
the memory system, so you slow down every memory access.
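
To make the synonym scenario concrete : on a POSIX system, two virtual
mappings of one physical page can be created from user space (a
minimal sketch ; the shared-memory object and all names here are
illustrative, nothing from the F-CPU spec). A virtually-tagged cache
without alias handling would give each mapping its own line :

/* synonym.c : two virtual addresses for one physical page */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    int fd = shm_open("/synonym_demo", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 4096);

    /* two distinct virtual addresses backed by the same physical page */
    char *va1 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *va2 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* with virtual tags, va1 and va2 would occupy two different cache
       lines, and a write through va1 would not be seen through va2
       until one line is written back ; physical tags hit one line */
    strcpy(va1, "hello");
    printf("%s\n", va2);

    shm_unlink("/synonym_demo");
    return 0;
}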

> > > > > (2) physically-addressed caches (physical tags)
> > > > > - do virtual-to-physical address translation on every access
> > > > Not necessarily. The TLB lookup can be started as soon as
> > > > loadaddr (or one of its variants) is called, and doesn't have
> > > > to be repeated in all cases (e.g. with a postincremented
> > > > pointer, you'll perform a range check first and only do a full
> > > > lookup if the range check fails).
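
As an illustration, the range-check trick might look like this in
software (names invented for the sketch ; the real mechanism would be
hardware sitting next to the TLB) :

#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  (~(((uint64_t)1 << PAGE_SHIFT) - 1))

typedef struct {
    uint64_t last_vpage;   /* virtual page of the previous access */
    uint64_t last_ppage;   /* its translation */
} TranslationCache;        /* initialise last_vpage to ~0 (no match) */

/* stand-in for the real TLB walk ; identity map for the sketch */
static uint64_t tlb_full_lookup(uint64_t vaddr) { return vaddr; }

uint64_t translate(TranslationCache *tc, uint64_t vaddr)
{
    uint64_t vpage = vaddr & PAGE_MASK;
    if (vpage != tc->last_vpage) {      /* range check failed : */
        tc->last_vpage = vpage;         /* do the full lookup   */
        tc->last_ppage = tlb_full_lookup(vaddr) & PAGE_MASK;
    }
    /* a postincremented pointer stays on the same page most of the
       time, so translation costs one comparison, not a TLB access */
    return tc->last_ppage | (vaddr & ~PAGE_MASK);
}
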
> > > >
> > > > >>> Consider that the LSU is a kind of virtually-addressed
cache.
> > > that's one perspective.
> > I'm not sure if I fully understand what exactly the LSU is
> > supposed to do. In my mental picture, its job was to keep the data
> > cache filled, not cache things itself.
> I thought the LSU was the "Load/Store Unit", a functional unit that
> handles LOAD and STORE ?

This is the programmer's point of view. It is correct, but the LSU
(and its symmetric counterpart, the instruction fetcher) does much
more than that, because "handle" implies many things.

Marco Al wrote:
> From: "Christophe" <christophe.avoinne@laposte.net>
> > <snip>
> 
> There is a simple software solution to this : don't allow the OS to
> let that happen :)
i don't know if it is wise to put that kind of pressure on the OS.

> With 64 bits to go around you don't really need per-process memory
> spaces
i don't think that this argument is valid. F-CPU defines 64-bit
pointers but an implementation might only support a small subset
thereof. For example, a dumb prototype might use 5+16 address bits :
5 bits for the LSU index (each cache line is 32 bytes wide) and 16
bits for the remaining comparators (the address comparators of the
LSU, for example). A first commercial version might increase that to
5+32=37 bits,

>>>> All of these are details !

and others might choose to reduce or increase the physically
addressable space, ranging from "embedded" versions to mainframes...
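
For illustration, the "dumb prototype" numbers above decompose an
address like this (macros invented here, not part of any spec) :

#include <stdint.h>

#define OFFSET_BITS 5    /* 2^5 = 32-byte cache lines */
#define TAG_BITS    16   /* bits fed to the LSU's address comparators */

static inline uint32_t line_offset(uint64_t addr)   /* byte in line */
{
    return (uint32_t)(addr & ((1u << OFFSET_BITS) - 1));
}

static inline uint32_t line_tag(uint64_t addr)      /* compared bits */
{
    return (uint32_t)((addr >> OFFSET_BITS) & ((1u << TAG_BITS) - 1));
}
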
I think that the VMID (named ASID in MIPS) is still necessary, and
overlapping pages might help communication between processes with
different VMIDs, thus reducing the cost of copying pages from one
process to another.

> (you still need per-process memory protection of course, but you
> don't need to go to external memory to change that).
> 
> Marco

Christophe wrote:
> From: Marco Al <marco@simplex.nl>
> > From: "Christophe" <christophe.avoinne@laposte.net>
<snip>
> Lazy solution, in my opinion. Not viable for a micro-kernel or
> exo-kernel, which really counts on sharing some physical pages at
> different virtual addresses with different access rights.
>
> Anyway, if an OS programmer wants to use per-process address spaces,
> he should be able to do so. Having them doesn't prevent us from
> using a single address space if we like.

that's more or less the point i made above :-)
Let's implement just enough, but not too much, so that if it's unused,
there is no waste of silicon surface...

nicO:
> michael :
>
> I'm not sure if I fully understand what exactly the LSU is supposed
> to do. In my mental picture, its job was to keep the data cache
> filled, not cache things itself.
>
> >>> I see it as a simple bubble buffer that acts as a cache (linked
> to the register number for the I-cache and to the memory tags for
> the D-cache).
> nicO

* It is indeed a buffer that behaves like a small cache, but reduced
to 8 lines of 256 bits.

* The fetcher and the LSU are symmetrical, the _only_ structural
difference being that the fetcher handles 32-bit data only (8 words
per line) and does not have to handle writes from the pipeline.

* Each line is "linked" to zero, one or more register numbers, to
ease decoding and static scheduling. So there is a constant number of
cycles for transferring data from the buffer to the pipeline, and for
the other direction (to the LSU).
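
Put together, one possible reading of the three points above is a
structure like this (field names and widths invented for the sketch,
not the actual F-CPU definition) :

#include <stdint.h>

#define LSU_LINES 8

typedef struct {
    uint64_t data[4];       /* 4 x 64 = 256 bits of line data */
    uint64_t tag;           /* address of the line */
    uint8_t  lru;           /* 3-bit LRU counter (sketched below) */
    uint8_t  valid, dirty;  /* line state */
    uint8_t  reg_link[4];   /* register numbers "linked" to this line */
    uint8_t  nlinks;        /* zero, one or more linked registers */
} LsuLine;

typedef struct {
    LsuLine line[LSU_LINES];  /* the fetcher is similar, minus writes */
} Lsu;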

>>> Details aside, the only interesting thing here is that the data
must be in the LSU ; otherwise the entire CPU is frozen, so there are
no variable-cycle instructions in the pipeline.

* Upon accessing a line, a 3-bit LRU counter is updated, so the LSU
and the fetcher work in an almost-ideal Least Recently Used fashion.

>>> So you speak about the replacement policy of the cache. Nothing
new ?
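
For reference, true LRU over 8 lines with a 3-bit counter per line
can be sketched like this (an illustration only, not the RTL) :

#include <stdint.h>

#define NLINES 8

typedef struct { uint8_t age[NLINES]; } LruState;   /* 0 = youngest */

void lru_init(LruState *s)
{
    for (int i = 0; i < NLINES; i++)
        s->age[i] = (uint8_t)i;    /* start with a 0..7 permutation */
}

void lru_touch(LruState *s, int line)   /* called on every access */
{
    uint8_t old = s->age[line];
    for (int i = 0; i < NLINES; i++)
        if (s->age[i] < old)
            s->age[i]++;           /* age everyone who was younger */
    s->age[line] = 0;              /* accessed line becomes youngest */
}

int lru_victim(const LruState *s)  /* line to replace on a miss */
{
    for (int i = 0; i < NLINES; i++)
        if (s->age[i] == 7)
            return i;              /* the single oldest line */
    return 0;                      /* not reached : ages stay a permutation */
}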

* A pair of lines (they could be non-contiguous, but it's not clear
yet how) will be formed to handle "streams" (contiguous accesses
to/from memory) with a double-buffering scheme, as sketched below.
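
The stream detection and the buffer swap could be sketched like this
(names and the detection threshold are invented, since the real
policy isn't fixed yet) :

#include <stdint.h>
#include <stdbool.h>

#define LINE_BYTES 32

typedef struct {
    uint64_t cur_line;    /* line currently served to the pipeline */
    uint64_t next_line;   /* line being prefetched in the background */
    uint64_t last_addr;   /* previous access, for stream detection */
    bool     streaming;
} StreamState;

static void start_prefetch(uint64_t addr) { (void)addr; /* hw fill */ }

void on_access(StreamState *s, uint64_t addr)
{
    uint64_t line = addr / LINE_BYTES;

    /* two consecutive accesses to adjacent 64-bit words : a stream */
    if (addr == s->last_addr + 8)
        s->streaming = true;
    s->last_addr = addr;

    if (line != s->cur_line) {
        s->cur_line = line;
        if (s->streaming) {
            /* the pipeline entered a new line : the paired buffer
               should already hold it, so refill the freed buffer
               with the following line and the swap can repeat */
            s->next_line = line + 1;
            start_prefetch(s->next_line * LINE_BYTES);
        }
    }
}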

These were the behaviours seen from the pipeline's point of view, but
they are not the only characteristics of the LSU and the fetcher !

>>> So it's a cache where the data must be present, otherwise the
pipeline is frozen.

* The LSU and the fetcher are the only devices connected to their
associated L1 cache : this simplifies the layout and the protocol,
because there is a unique 256-bit bus for (simultaneously) reading or
writing a line.

* The design of the LSU and the fetcher is tightly coupled to their
associated L1 caches, which can be considered as a "reservoir" or
"swap space" for the memory buffers.

* The L1 strategy is to write back on cache line replacement : this is
logical in a system where there is no "main memory" but a "private
memory range".

>>> In the case of distributed memory you could NOT use write-back,
but write-through or write-invalidate. Otherwise, you will have
coherency problems.
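
To make the disagreement concrete, the difference between the two
policies is only *when* memory is updated (toy code, invented
helpers) :

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t tag;     /* address the line maps */
    uint64_t data;
    bool     dirty;
} Line;

static void mem_write(uint64_t addr, uint64_t v) { (void)addr; (void)v; }

/* write-back : only the cache line changes ; memory stays stale
   until the line is evicted, which is what worries remote readers */
void store_writeback(Line *l, uint64_t v)
{
    l->data  = v;
    l->dirty = true;
}

void evict_writeback(Line *l)
{
    if (l->dirty)
        mem_write(l->tag, l->data);   /* the deferred update */
    l->dirty = false;
}

/* write-through : memory is updated on every store, so another node
   reading memory always sees the current value */
void store_writethrough(Line *l, uint64_t v)
{
    l->data = v;
    mem_write(l->tag, v);
}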

In a sense, the L1 is like a large FIFO from which one element can be
taken and put at the front again, while the line arriving at the end
of the FIFO is written to memory.

>>> That's the typical goal of a cache !

* All memory reads and writes to/from L1 go through the corresponding
memory buffer (LSU or fetcher). This is both a cause and a consequence
of the fact that the L1 must be kept as simple as possible but must
handle multi-word transactions on 32-bit and 64-bit words. The cache
line's "scatter and gather" of words is handled only by the LSU or the
fetcher.
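
The "scatter and gather" in question is just the extraction and
insertion of naturally-aligned 8/16/32/64-bit items within the
256-bit line ; in software it would look like this (an illustration
only) :

#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[32]; } Line256;   /* a 256-bit cache line */

/* gather : pull one aligned item of `size` bytes (1, 2, 4 or 8) out
   of the line ; the mask forces natural alignment within the line
   (real hardware would rather trap on a misaligned offset) */
uint64_t line_gather(const Line256 *l, unsigned offset, unsigned size)
{
    uint64_t v = 0;
    memcpy(&v, &l->b[offset & (32 - size)], size);  /* little-endian host */
    return v;
}

/* scatter : put one aligned item back into the line */
void line_scatter(Line256 *l, unsigned offset, unsigned size, uint64_t v)
{
    memcpy(&l->b[offset & (32 - size)], &v, size);
}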

* nicO doesn't agree on this : the LSU and the fetcher are connected
to their own L1, as well as to the other memory system buses : L2,
I/O bus, private (SD)RAM bus.

>>> That's only a strange manner of presenting it. Usually, the memory
hierarchy is ... hierarchical. You don't mix things. But it's a black
box, so you can introduce as many connections as you want between the
blocks.

I justify this with several facts and desirable things :
 - perform the line split (256 bits to and from 8, 16, 32 and 64 bits)
in a single location. This increases the local density but simplifies
all the rest.

>>> So it simplifies the layout ? I'm not sure, because you only
change the upper-level point of view, not the functionality.

 - having a single location where data is considered coherent. If the
CPU has direct access to all the memory cache layers, it simplifies
the invalidation problem. When an external system asks our CPU for a
data read or write, all there is to do is hand the address to the LSU
and fetcher, which then do their work as they usually do for the
execution pipeline.

>>>> Ouch ! If an external access occurs to the memory, why bother the
CPU of the node !!! The system could provide access to the data very
easily without problems.

 - reducing the overall memory latency. Having all (or most) layers
connected to a single point might be a small overhead, but it avoids
the "avalanche" effect on a cache miss, and this becomes particularly
important when the code works mostly with uncached or scattered data.
OTOH, the LSU and the fetcher are designed so that the minimum latency
is hidden when consecutive accesses are performed (with a streaming
detection which automatically enables double buffering).

>>> So you want a cache that gives the result of its fetch
immediately. You don't want to wait for the complete load of a cache
line. Nothing new ! cf :
http://www.cs.washington.edu/education/courses/471/00au/Lectures/cacheAdv.pdf
When you speak of the avalanche effect, does that mean that you
initiate an access to every memory at each demand (LSU, L1, L2, main
memory) ? Then each time you must cancel all the accesses, for very
little gain in the common case. Most of the time there is a penalty
for a cancelled burst transfer. In an SMP system (with CPUs on the
same die) it's overkill !
I think that the use of a flag in the instruction word could help to
bypass the L1 cache in the case of non-cacheable accesses to memory.
nicO
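
That flag idea could be as simple as a hint carried by the load
itself (a sketch with invented names ; the stubs stand in for the two
data paths) :

#include <stdint.h>

typedef enum { ACCESS_CACHED, ACCESS_UNCACHED } AccessHint;

static uint64_t memory_read(uint64_t addr)     { (void)addr; return 0; }
static uint64_t l1_read_or_fill(uint64_t addr) { (void)addr; return 0; }

/* a load that carries its own cacheability hint */
uint64_t load(uint64_t addr, AccessHint hint)
{
    if (hint == ACCESS_UNCACHED)
        return memory_read(addr);    /* bypass L1, go straight to the bus */
    return l1_read_or_fill(addr);    /* normal cached path */
}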

Damn looking like a patent claim list :-/

WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/

 

