
Re: [f-cpu] Where are the LSU and fetcher descriptions??



plop,

nico@seul.org wrote:

Hi.

Still in the investigation process, I read some of the VHDL code found in
snapshot_jws_30_07_2002.tar.bz2 and snapshot_yg_29_07_2002.tbz.
I see no description of the LSU and only a very simple description of the fetcher.
What is the status of these units?

not very well described :)

maybe it would be better if you dared to write some source code ?

Where can I find some docs to start?

hmm, in the F-CPU manual there are some hints in the instruction descriptions.

plus i gave a deep explanation during the ISIMA conference.

It can be summarized as a multiported buffer
that acts as a fine-grained cache of the memory
(L1 included). The LSU and Fetcher also play
the role of building whole 256-bit lines for the
D-Cache and I-Cache, so the L1 + FC0 core
can have a simple 32/64/128-bit memory interface
to the outside world. Information is considered
'valid' only at the place of LSU and Fetcher in the
CPU system (NUMA is used for multi-CPU systems).
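As a rough illustration of the line-building role described above, here is a minimal Python sketch of packing narrow bus transfers into one 256-bit line. Only the 256-bit line width and the 32/64/128-bit bus widths come from the message; the function name, the "beat" terminology and the lowest-address-first placement are assumptions made for the example, not the FC0 design.

```python
LINE_BITS = 256  # line width stated in the message


def assemble_line(beats, bus_bits):
    """Pack successive bus transfers ("beats") into one 256-bit line.

    The first beat lands in the lowest bits of the line; that ordering
    is an assumption of this sketch.
    """
    if bus_bits not in (32, 64, 128):
        raise ValueError("bus width must be 32, 64 or 128 bits")
    needed = LINE_BITS // bus_bits
    if len(beats) != needed:
        raise ValueError(f"expected {needed} beats, got {len(beats)}")
    line = 0
    for i, beat in enumerate(beats):
        if not 0 <= beat < (1 << bus_bits):
            raise ValueError("beat does not fit the bus width")
        line |= beat << (i * bus_bits)
    return line
```

With a 64-bit interface the line needs four beats, with a 32-bit interface eight, which is why the core never has to see anything narrower than a whole line.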

I'm studying these 2 very complex areas more deeply. I now have many ideas
for implementing them. Now I should learn how to make a clean brain dump.

heh. ideas are not everything.

Here is my "result".

The LSU has many pitfalls: multiple levels of cache, virtual memory management,
DMA management, DRAM controller.

these things are completely unrelated to each other and are /not/ a problem for FC0.
these are pitfalls in your reasoning, probably because you start with a different mind set.

and you don't address the access time and bandwidth needs of today's CPU.
FC0's LSU and the Fetcher are designed so the main memory access time can be roughly
8 clock cycles. If you use old P&H concepts, the CPU core will stall much more often
or run at a lower clock frequency.

I found many simple ideas in the classic
book (by D. Patterson): a very simple and fast L1 cache, a very
complex/clever L2 cache. The problem is where to put DMA: between the L2 and
L1 caches (so L2 is part of the DRAM controller), or between L2 and the
controller, or DMA could be part of a big block with L2 and the controller
(I/O doesn't need to be cached but we should take care of aliasing).

forget about P&H and DMA (for the current time) and think about how the core can access the main memory.
DMA is not an issue, it can be plugged anywhere later, and could even have its own LSU port.

The L1 cache uses Address Space Number fields to avoid a cache flush on a context
change.

in "your" design, maybe, but in FC0 it uses plain physical addresses.

remember: past the LSU/Fetcher barrier, all addresses are physical.
this avoids any concern about aliasing etc.

L2 is a big victim buffer to avoid L1/L2 data duplication. L2
could also be used as a prefetch buffer.

we didn't speak about L2 yet, or at least this issue is not a problem at all,
so we can use a standard cache block out of any foundry's library.

L1 has a very low access latency,

not necessarily, since for example L1 can be a direct-mapped,
physically addressed memory. Count 4 or 5 cycles
for a full read operation including the pipeline and the TLB.
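To make "direct-mapped, physically addressed" concrete, here is a small Python sketch of how a physical address splits into tag, index and offset, and how a lookup then needs only one tag compare. The 32-byte line matches the 256-bit lines mentioned earlier; the number of lines and all names are hypothetical values picked for the example.

```python
LINE_BYTES = 32   # one 256-bit line
NUM_LINES = 256   # hypothetical: gives an 8 KiB cache


def split_physical_address(paddr):
    """Return (tag, index, offset) for a physical byte address."""
    offset = paddr % LINE_BYTES
    index = (paddr // LINE_BYTES) % NUM_LINES
    tag = paddr // (LINE_BYTES * NUM_LINES)
    return tag, index, offset


class DirectMappedL1:
    """Tag store only; the data array is omitted from this sketch."""

    def __init__(self):
        self.tags = [None] * NUM_LINES

    def access(self, paddr):
        """Return True on a hit; on a miss, fill the line and return False."""
        tag, index, _ = split_physical_address(paddr)
        hit = self.tags[index] == tag
        self.tags[index] = tag
        return hit
```

Because the index comes from the physical address, the TLB translation has to finish before the lookup can start, which is one reason the read takes several cycles rather than one.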

L2 must have a very low miss rate. L2 must be physically mapped to avoid
duplication when 2 processes map the same physical pages.

We could use fields in the instruction to enable different policies in the L1 cache
(write-allocate or not, write-through/write-back). This policy could
also be influenced by VM fields.

there are some free bits in the load/store instructions, the "stream hint flags" can be used for this purpose.

I have also thought about F-bus2, which looks like the CAN bus. AMBA buses are very
complex because they deal with split and retry, so the idea is to split all
requests, as on the CAN bus. It complicates the "client" a lot but is a
must to hide latency and maximise the use of the bus.

if you want it, make it yourself.

The Fetcher is connected to L0 and L1.

you mean LSU and L1 ?
The LSU is more sophisticated than a dumb RAM array,
otherwise it would have been coded long ago.

* Historical hack : To accelerate jumps there is a direct link between the
register number in the register set and a cache line of L0.

it's not a "direct link" but an associative memory, so the link is dynamic
and some kinds of tables must be managed (i don't call this "direct" but what the heck).

This is trivial for the Icache but there are strong aliasing issues with the dcache.

it's not really the most important problem :
the LSU's and Fetcher's buffers are designed so only one has a copy
of the data. It's possible because the TLB output feeds both LSU and
Fetcher's address comparators so we can detect if requested data is
already somewhere else.
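A minimal Python sketch of that "only one has a copy" rule: the translated physical address is checked against both buffers before a line is fetched. The class and function names are hypothetical, and real hardware would do both comparisons in parallel rather than one after the other.

```python
LINE_BYTES = 32  # one 256-bit line


class LineBuffer:
    """Stands in for the LSU's or the Fetcher's set of buffered lines."""

    def __init__(self, name):
        self.name = name
        self.lines = set()  # physical line numbers currently held

    def holds(self, paddr):
        return paddr // LINE_BYTES in self.lines


def request_line(requester, other, paddr):
    """Say where the line comes from without ever creating a second copy.

    Returns "local" if the requester already holds it, "other" if the
    other unit holds it (nothing is fetched), else "memory" after
    allocating the line in the requester.
    """
    line = paddr // LINE_BYTES
    if line in requester.lines:
        return "local"
    if line in other.lines:
        return "other"
    requester.lines.add(line)
    return "memory"
```

What happens in the "other" case (forward a read-only copy, stall, or trap) is a policy choice that this sketch deliberately leaves open.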

It's coherent with the fact that you usually don't
overwrite the instruction you're currently executing.
in the case of reading something close to the instruction (such as an immediate),
a read-only version can be obtained from the L1 as well
(at the price of some penalty cycles, as the 256-bit line
is transferred through a link that is only 64 bits wide)
but writes will either trap, stall or not be executed
(please make your choice).

the real pain is to design a FAST 2-way associative memory
that remains coherent when something changes at any end,
that is : the entry must be updated either when a linked register is
written with a new value, or when a Fetcher's line is flushed by LRU.
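The coherency rule above can be modelled in a few lines of Python. This is a behavioural sketch only: it ignores the 2-way associativity and the speed problem, which are the actual difficulty, and it simplifies "updated" to "dropped". All names are hypothetical.

```python
class LinkTable:
    """Dynamic association between register numbers and Fetcher lines."""

    def __init__(self):
        self.reg_to_line = {}

    def link(self, reg, line):
        """Record that the address held in `reg` points into `line`."""
        self.reg_to_line[reg] = line

    def lookup(self, reg):
        """Return the linked line, or None if the jump target is not buffered."""
        return self.reg_to_line.get(reg)

    def on_register_write(self, reg):
        """A new value in the register makes its old link stale."""
        self.reg_to_line.pop(reg, None)

    def on_line_flush(self, line):
        """When LRU evicts a line, every register linked to it loses its link."""
        stale = [r for r, l in self.reg_to_line.items() if l == line]
        for r in stale:
            del self.reg_to_line[r]
```

Both events have to reach the table: a register write invalidates by register number, a flush invalidates by line number, so the structure must be searchable from either end.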

* instruction formats are not well defined today. There are many of them;
they should be analysed in the new manual.

which itself is 1 year old.

* There are 2 kinds of instruction scheduler : one using a FIFO to
calculate when the result will arrive in order to avoid conflicts, and a simpler
one which freezes the shortest pipeline in case of a write conflict.

The good old FIFO is implemented by Jaap : http://f-cpu.seul.org/new/scheduler.png
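For readers who have not seen the FIFO approach, here is a generic Python sketch of the idea: a shift register of reservation slots tracks in which future cycle the register-file write port is already taken, and an instruction issues only if its completion slot is free. This is not Jaap's implementation; the single write port, the depth of 8 and all names are assumptions made for the example.

```python
class WritebackScheduler:
    """Reservation FIFO for one register-file write port (assumed)."""

    def __init__(self, depth=8):
        # slots[k] is True when the write port is reserved k cycles from now
        self.slots = [False] * depth

    def try_issue(self, latency):
        """Reserve the write port `latency` cycles ahead.

        Returns False (the instruction must wait) if another result is
        already due in that cycle.
        """
        if not 0 <= latency < len(self.slots):
            raise ValueError("latency exceeds FIFO depth")
        if self.slots[latency]:
            return False
        self.slots[latency] = True
        return True

    def tick(self):
        """Advance one clock cycle: every reservation moves one slot closer."""
        self.slots = self.slots[1:] + [False]
```

Here a 3-cycle operation issued at cycle 0 blocks a 2-cycle operation at cycle 1, since both results would arrive at cycle 3; the simpler scheme in the second half of the bullet resolves the same clash by freezing a pipeline instead of delaying the issue.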

* There is no reorder buffer but exceptions must be thrown in order, so there
is a common pipeline for all instructions to calculate them. This is a major
problem for the floating point unit, which will be very slow or non-IEEE
compliant.

??????

if you want reorder buffers, make them yourself, as well as all the coherency management logic
and the self-test procedures. you know that the more complex the core, the less testability can be ensured.
the effect is that the design effort is higher and the computer is available later.
All that so you can have asynchronous exceptions on FP, which happen ... very infrequently on good code.


This whole post ends up rehashing old stinky stuff and i'm not sure that Pierre is helped any further.


Regards,
nicO


greets,
--
Pierre
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu in the body. http://f-cpu.seul.org/

