
Re: [f-cpu] Where are the LSU and fetcher descriptions??



> plop,
>
> nico@seul.org wrote:
>
>>>Hi.
>>>
>>>Still in the investigation process, I read some of the VHDL code found in
>>>snapshot_jws_30_07_2002.tar.bz2 and snapshot_yg_29_07_2002.tbz.
>>>I see no description for the LSU and only a very simple description for
>>>the Fetcher. What is the status of these units?
>>>
>>>
>>not very well described :)
>>
>>
> maybe it would be better if you dared to write some source code?
>
>>>Where can I find some documentation to start?
>>>
>>>
>>Hmm, in the F-CPU manual there are some hints in the instruction
>>descriptions.
>>
>>
> Plus, I gave a deep explanation during the ISIMA conference.
>
> It can be summarized as a multiported buffer
> that acts as a fine-grained cache of the memory
> (L1 included). The LSU and Fetcher also play
> the role of building whole 256-bit lines for the
> D-Cache and I-Cache, so the L1 + FC0 core
> can have a simple 32/64/128-bit memory interface
> to the outside world. Information is considered
> 'valid' only at the LSU and Fetcher in the
> CPU system (NUMA is used for multi-CPU systems).
>
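To picture the line-building role described above, here is a minimal
behavioral sketch in C (not the project's VHDL; the widths and names are
illustrative assumptions): a fill buffer accumulates 64-bit beats from the
external bus until the whole 256-bit line is valid, then hands it to the
cache.

    #include <stdint.h>

    #define LINE_BYTES     32           /* 256-bit cache line        */
    #define BEAT_BYTES     8            /* 64-bit external data bus  */
    #define BEATS_PER_LINE (LINE_BYTES / BEAT_BYTES)

    typedef struct {
        uint64_t beats[BEATS_PER_LINE]; /* the line being assembled  */
        uint8_t  valid_mask;            /* one bit per received beat */
        uint64_t line_addr;             /* physical address of line  */
    } fill_buffer;

    /* Accumulate one 64-bit beat; return 1 once the whole 256-bit   */
    /* line is assembled and can go to the D-Cache or I-Cache.       */
    static int fill_push(fill_buffer *fb, uint64_t addr, uint64_t data)
    {
        unsigned idx = (unsigned)(addr / BEAT_BYTES) % BEATS_PER_LINE;
        fb->beats[idx] = data;
        fb->valid_mask |= (uint8_t)(1u << idx);
        return fb->valid_mask == (1u << BEATS_PER_LINE) - 1u;
    }

In hardware the valid mask is just a few flip-flops and the index
computation is plain bit slicing; the C only models the behavior.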
>>I'm studying these 2 very complex areas more deeply. I now have many ideas
>>for implementing them. I should now learn how to make a clean brain dump.
>>
>>
> heh. ideas are not everything.
>
>>Here is my "result".
>>
>>The LSU has many pitfalls: multiple levels of cache, virtual memory
>>management, DMA management, the DRAM controller.
>>
> these things are completely unrelated to each other and are /not/ a
> problem for FC0.

Funny... I hope not too many experts are reading this.

> these are pitfalls in your reasoning, probably because you start with a
> different mindset.
>

I'm fed up with your personal attacks... Buy a punching bag and stop
bashing people on the list...

> and you don't address the access time and bandwidth needs of today's CPUs.
> FC0's LSU and the Fetcher are designed so the main memory access time
> can be roughly 8 clock cycles. If you use old P&H concepts, the CPU core will stall

8 cycles...
Main memory accesses are around 150-200 cycles,
L2 accesses around 10,
L1 accesses 2 or 3.

It was fun to read you bashing P&H, but it would be great if you at least
understood the whole book. The 3rd edition is less than 2 years old.
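For the record, those numbers plug straight into the classic average
memory access time estimate. A back-of-the-envelope C sketch (the miss
rates are illustrative guesses, not measurements):

    #include <stdio.h>

    /* AMAT = hit_time + miss_rate * miss_penalty, applied per level. */
    int main(void)
    {
        double l1_hit = 3.0,  l1_miss = 0.05;  /* ~2-3 cycles, 5% miss  */
        double l2_hit = 10.0, l2_miss = 0.10;  /* ~10 cycles, 10% miss  */
        double mem    = 175.0;                 /* ~150-200 cycles       */

        double amat = l1_hit + l1_miss * (l2_hit + l2_miss * mem);
        printf("average memory access time: %.1f cycles\n", amat);
        return 0;
    }

With these guesses the average is about 4.4 cycles, which shows why the
miss path, not the hit path, dominates the design.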

> much more often
> or run at a lower clock frequency.

But if you try to be more clever, you slow everything down... you can't put
a lot of things in a pipeline stage that is only 6 gates deep...

>
>>I found many simple ideas in the classical
>>book (from D. Patterson): a very simple and fast L1 cache, a very
>>complex/clever L2 cache. The problem is where to put DMA: between the L2
>>and L1 caches (so L2 is part of the DRAM controller), between L2 and the
>>controller, or as part of a big block with L2 and the controller
>>(I/O doesn't need to be cached, but we should take care of aliasing).
>>
>>
> forget about P&H and DMA (for the time being) and think about how the
> core can access the main memory.
> DMA is not an issue, it can be plugged in anywhere later, and could even
> have its own LSU port.

No, it can't. There is a real mess with aliasing and caching. And even in a
small system it's not possible to avoid DMA.

>
>>The L1 cache uses Address Space Number fields to avoid cache flushes on
>>context switches.
>>
> in "your" design, maybe, but in FC0 it uses plain physical addresses.
>

Much too slow. VM translation is slower than the I-cache or D-cache because
most of the time the TLB is a fully associative cache, while L1 caches are
2-way set associative or direct mapped + a victim cache.

Even with a 4-way TLB it will be much too slow.
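To make the ASN idea concrete, here is a minimal C sketch of the tag check
(the field widths and names are illustrative assumptions, not the FC0
design): the ASN is compared as part of the tag, so a context switch only
changes the current ASN register instead of flushing the whole cache.

    #include <stdint.h>

    #define L1_LINES   256            /* direct-mapped, 256 lines   */
    #define LINE_SHIFT 5              /* 32-byte (256-bit) lines    */

    typedef struct {
        uint64_t tag;                 /* upper virtual-address bits */
        uint8_t  asn;                 /* Address Space Number       */
        uint8_t  valid;
    } l1_tag;

    static l1_tag tags[L1_LINES];

    /* Hit only if the line is valid and both the address tag and   */
    /* the current ASN match; no flush on a context switch.         */
    static int l1_hit(uint64_t vaddr, uint8_t current_asn)
    {
        unsigned idx = (unsigned)(vaddr >> LINE_SHIFT) % L1_LINES;
        const l1_tag *t = &tags[idx];
        return t->valid
            && t->tag == (vaddr >> LINE_SHIFT) / L1_LINES
            && t->asn == current_asn;
    }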

> you remember: past the LSU/Fetcher barrier, all addresses are physical.
> this avoids any concern about aliasing etc.

? A lot of aliasing comes from the cache strategy, not from VM.

>
>>L2 is a big victim buffer, to avoid L1/L2 data duplication. L2
>>could also be used as a prefetch buffer.
>>
> we didn't speak about L2 yet, or at least this issue is not a problem at
> all, so we can use a standard cache block out of any foundry's library.

...
Of course we use foundry RAM; where did I say otherwise? But a cache is not
a simple SRAM array, you know...

>
>>L1 has a very low access latency,
>>
> not necessarily, since for example L1 can be a direct-mapped,

A direct-mapped cache has the lowest latency...

> physically addressed memory. Count 4 or 5 cycles
> for a full read operation, including the pipeline and the TLB.

A fetch in 5 cycles? A read in 5 cycles? :) Those latencies are around 2 or
3 these days.

>
>>L2 must have a very low miss rate. L2 must be physically mapped to avoid
>>duplication when two processes map the same physical pages.
>>
>>We could use fields in the instructions to enable different policies in
>>the L1 cache (write-allocate or not, write-through/write-back). This
>>policy could also be influenced by VM fields.
>>
>>
> there are some free bits in the load/store instructions, the "stream
> hint flags" can be used for this purpose.
>

yep.
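As a sketch of how such hint bits could be decoded (the bit positions and
policy names below are made up for illustration; the real encoding is
whatever the manual ends up specifying):

    #include <stdint.h>

    /* Hypothetical cache-policy hints in free load/store bits.      */
    enum write_policy { WRITE_BACK, WRITE_THROUGH };
    enum alloc_policy { WRITE_ALLOCATE, NO_WRITE_ALLOCATE };

    typedef struct {
        enum write_policy wp;
        enum alloc_policy ap;
    } cache_hints;

    /* Assume bit 20 selects write-through and bit 21 disables write */
    /* allocation; both positions are purely illustrative.           */
    static cache_hints decode_hints(uint32_t insn)
    {
        cache_hints h;
        h.wp = (insn & (1u << 20)) ? WRITE_THROUGH : WRITE_BACK;
        h.ap = (insn & (1u << 21)) ? NO_WRITE_ALLOCATE : WRITE_ALLOCATE;
        return h;
    }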

>>I have also thought about F-bus2, which looks like the CAN bus. AMBA buses
>>are very complex in how they deal with split and retry, so the idea is to
>>split all requests, as on the CAN bus. It complicates the "client" a lot,
>>but it is a must to hide latency and maximize the use of the bus.
>>
>>
> if you want it, make it yourself.
>
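For readers unfamiliar with split transactions: each request carries a tag,
the bus is released immediately, and the reply comes back later with the
same tag, so the bus is never held idle while memory works. A minimal C
sketch of the idea (all names and sizes are illustrative assumptions):

    #include <stdint.h>

    /* A split-transaction message: requests and replies are matched */
    /* by tag instead of holding the bus for the whole access.       */
    typedef struct {
        uint8_t  tag;       /* matches a reply to its request */
        uint8_t  is_reply;  /* 0 = request, 1 = reply         */
        uint64_t addr;
        uint64_t data;      /* only meaningful in a reply     */
    } bus_msg;

    #define MAX_PENDING 16
    static uint64_t pending_addr[MAX_PENDING]; /* outstanding requests */

    static bus_msg make_request(uint8_t tag, uint64_t addr)
    {
        bus_msg m = { tag, 0, addr, 0 };
        pending_addr[tag % MAX_PENDING] = addr; /* remember for reply */
        return m;
    }

    /* The tag tells the client which outstanding request completed. */
    static uint64_t addr_of_reply(const bus_msg *reply)
    {
        return pending_addr[reply->tag % MAX_PENDING];
    }

This tag bookkeeping is exactly what "complicates the client": each client
must track its outstanding requests instead of just waiting on the bus.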
>>The Fetcher is connected to L0 and L1.
>>
> you mean LSU and L1?
> The LSU is more sophisticated than a dumb RAM array,
> otherwise it would have been coded a while ago.
>

That's why my previous mail was so long.

>>* Historical hack: to accelerate jumps there is a direct link between a
>>register number in the register set and a cache line of L0.
>>
> it's not a "direct link" but an associative memory, so the link is dynamic
> and some kinds of tables must be managed (i don't call this "direct" but
> what the heck).
>

Much too slow for such a buffer, just to save a few gates. Count a complete
clock cycle to read an SRAM of 64/128 addresses, so an associative buffer
will need at least 2 reads.
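In other words, the jump shortcut behaves like a small associative table
from register number to Fetcher line. A behavioral C sketch (sizes and
names are illustrative assumptions); keep in mind that in hardware this
lookup is a parallel CAM, not a sequential loop:

    #include <stdint.h>

    #define NUM_LINKS 8     /* illustrative number of linked lines */

    typedef struct {
        uint8_t reg;        /* register holding the jump target    */
        uint8_t line;       /* Fetcher/L0 line caching that target */
        uint8_t valid;
    } link_entry;

    static link_entry links[NUM_LINKS];

    /* Associative lookup: in hardware all comparators fire at once; */
    /* the loop only models the behavior.                            */
    static int lookup_line(uint8_t reg)
    {
        for (int i = 0; i < NUM_LINKS; i++)
            if (links[i].valid && links[i].reg == reg)
                return links[i].line;
        return -1;          /* miss: fetch the jump target normally */
    }

    /* The entry must die when the register is overwritten or when   */
    /* the Fetcher line is evicted (the coherency pain mentioned     */
    /* further down in this thread).                                 */
    static void invalidate_reg(uint8_t reg)
    {
        for (int i = 0; i < NUM_LINKS; i++)
            if (links[i].reg == reg)
                links[i].valid = 0;
    }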

>>This is trivial for the I-cache, but there are strong aliasing issues with
>>the D-cache.
>>
> it's not really the most important problem:
> the LSU's and Fetcher's buffers are designed so that only one has a copy
> of the data. It's possible because the TLB output feeds both the LSU's and
> Fetcher's address comparators, so we can detect if the requested data is
> already somewhere else.

?
The mechanism to guarantee that is a kind of fully associative array, which
is damn slow.

>
> It's coherent with the fact that you usually don't
> overwrite the instruction you're currently executing.

Yep. That's why there is no problem with the I-cache.

> in the case of reading something close to the instruction (such as an
> immediate), a read-only version can be obtained from the L1 as well
> (at the price of some penalty cycles, since the 256-bit line
> is transferred through a link that is only 64 bits wide),
> but writes will either trap, stall or not be executed
> (please make your choice).
>
> the real pain is to design a FAST 2-way associative memory
> that remains coherent when something changes at any end,
> that is: the entry must be updated either when a linked register is
> written with a new value, or when a Fetcher's line is flushed by LRU.
>

yep.

>>* The instruction formats are not well defined today. There are many of
>>them; they should be analyzed in the new manual.
>>
> which itself is 1 year old.
>
>>* There are 2 kinds of instruction scheduler: one uses a FIFO to
>>calculate when the result will arrive, to avoid conflicts; a simpler one
>>freezes the shortest pipeline in case of a write conflict.
>>
> The good old FIFO is implemented by Jaap:
> http://f-cpu.seul.org/new/scheduler.png
>
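The FIFO approach can be pictured as a shift register of pending
write-backs: an instruction issues only if the write port is free in the
cycle its result will arrive. A behavioral C sketch (the depth and names
are illustrative assumptions, not Jaap's actual design):

    #define MAX_LATENCY 8          /* deepest pipeline modeled */

    /* slot[i] holds the destination register of the result arriving */
    /* in i cycles, or -1 if that write-back slot is free.           */
    static int slot[MAX_LATENCY];

    static void sched_init(void)
    {
        for (int i = 0; i < MAX_LATENCY; i++)
            slot[i] = -1;
    }

    /* Issue succeeds only if the write port is free when the result */
    /* will arrive; otherwise the instruction waits (no reordering). */
    static int try_issue(int dest_reg, int latency)
    {
        if (slot[latency] >= 0)
            return 0;              /* write-back conflict: stall */
        slot[latency] = dest_reg;
        return 1;
    }

    /* Each cycle the FIFO shifts; the entry at slot 0 is the  */
    /* register written back this cycle (-1 if none).           */
    static int tick(void)
    {
        int written = slot[0];
        for (int i = 0; i + 1 < MAX_LATENCY; i++)
            slot[i] = slot[i + 1];
        slot[MAX_LATENCY - 1] = -1;
        return written;
    }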
>>* There is no reorder buffer, but exceptions must be thrown in order, so
>>there is a common pipeline for all instructions to compute them. This is a
>>major problem for the floating point unit, which will be very slow or not
>>IEEE compliant.
>>
>>
> ??????
>
> if you want reorder buffers, make them yourself, as well as all the
> coherency management logic and the self-test procedures. You know that the
> more complex the core, the less testability can be ensured.
> The effect is that the design effort is higher and the computer is
> available later.
> All that so you can have asynchronous exceptions on FP, which happen ...
> very infrequently in good code.
>

Reorder buffers were created because software guys can't handle out-of-order
exceptions. In F-CPU we try to avoid them by detecting problems very early
in the pipeline. But this can't be done for the FP unit.

>
> All this post ends up rehashing old stinky stuff and I'm not sure that
> Pierre is helped any further.

Buy a punching bag...
nicO

>
>
>>Regards,
>>nicO
>>
>>
>>
>>>greets,
>>>--
>>>Pierre
