
Re: [f-cpu] FC0 and SMT



hi,

Martin Devera wrote:

Hi again (after ... 3 ? yrs),

or so ...

still thinking in my free time about some CPU aspects, and I found myself thinking a lot about SMT à la Niagara (a.k.a. T1).

There are several "methods" to do SMT.
i like the Seymour Cray way (see the Peripheral Processing Unit of the CDC6600)
http://f-cpu.seul.org/whygee/CDC/DesignOfAComputer_CDC6600.pdf
http://f-cpu.seul.org/whygee/CDC/60100000D_6600refMan.pdf
the advantage is that it is simple and very easy to understand,
without the need of complex scoreboards and other mechanisms.
That is more or less what i intend to do with the VSP : 4-way SMT with a very simple 32-bit core.


I was interested in trying an FC0-like core with it.

FC0 is not suited to that, unless you want to dedicate 1/2 of the gates to control logic.


I sketched a simple test architecture to validate it and I decided to share raw ideas here (because I don't know if I will ever get time to realize it).

Goals:
- ultrasimple pipeline (to be fast and small)
- realizable in contemporary FPGA (xilinx xc3s1600e, about $30)

URL ? btw, i could get a 64*6 FPGA emulator for $100 (in Paris). the catch is that it is used, a 1998-era design, with no software, nothing. and it's heavy, and newer chips at the same price are certainly worth more.

Architecture:
I decided to take from FC0 what I find interesting for this project: scoreboard, fixed register decode, zero detection, SRB, no exceptions in the pipeline.

sure, these parts are quite interesting. however it is good for a 64-bit CPU, but probably not for 32-bit, simply because of the decode logic vs operating logic ratio.

The sketch is simple: 4 2r1w 16x32 register files (each for one thread), 4 groups of FUs (probably LSU,ADD,MUL+SHUFFLE,LOGICAL).

In the VSP, there are 4 simple units : ASU, ROP2, SHL (all taken from FC0 and simplified)
plus a new one : "IE" (insert/extract) that performs byte/halfword/word alignments in conjunction
with the memory system. There is also one "SR" read and write port, but that's not critical in HW.
With the VSP, there is NO LSU, but a different mechanism that tries to simplify things
(at the expense of others, like the software coder's sanity, but who cares ? :-P)


i'm not sure that MUL is needed for your design, but you seem to get them "for free" so what's the problem...
And as they seem to be quite fast, using them for SHL is an interesting trick
(ie : one MUL port is a 4->16 bits decoder that feeds the multiplier with the shift amount, i like this idea :-D)
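
A minimal sketch of the shift-by-multiply trick mentioned above: the shift amount is decoded into a one-hot value (a power of two) and fed to the multiplier. The names `one_hot` and `shl_via_mul` are made up for illustration, and the model uses a single 32-bit multiply; how the decode would be split across four 18-bit multipliers (the 4->16 decoder per port) is not modelled here.

```python
MASK32 = 0xFFFFFFFF

def one_hot(amount, width=32):
    """Decode a shift amount into a one-hot value, i.e. 2**amount."""
    if not 0 <= amount < width:
        raise ValueError("shift amount out of range")
    return 1 << amount

def shl_via_mul(value, amount):
    """Shift left by multiplying with the decoded amount, truncated to 32 bits."""
    return (value * one_hot(amount)) & MASK32
```

For every amount from 0 to 31 this gives the same result as an ordinary left shift truncated to 32 bits.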


These groups are connected together via 3 omega networks which also do sign/zero extension (simple to do because we use 4-input LUTs as 2-muxes, so we have the other 2 inputs to define input select plus 1/0 output force).

why omega networks ? you want to do 4 simultaneous operations ?

This alone allows you to do 4 ops per cycle peak.

ooops :-)
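
For readers who have not met omega networks: a sketch of destination-tag routing through one, assuming the textbook construction (a perfect shuffle followed by a column of 2x2 switches at every stage). The function names are hypothetical, and nothing here models the sign/zero extension or the LUT mapping discussed above.

```python
def omega_path(src, dst, n_bits):
    """Return the link number used after each stage when routing src -> dst.

    The network has 2**n_bits ports and n_bits stages.
    """
    mask = (1 << n_bits) - 1
    pos = src
    path = []
    for stage in range(n_bits):
        # Perfect shuffle: rotate the position left by one bit.
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask
        # The 2x2 switch picks its upper/lower output from one destination bit.
        bit = (dst >> (n_bits - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path

def conflicts(routes, n_bits):
    """True if two (src, dst) routes need the same link after some stage."""
    paths = [omega_path(s, d, n_bits) for s, d in routes]
    for stage in range(n_bits):
        links = [p[stage] for p in paths]
        if len(set(links)) != len(links):
            return True
    return False
```

With 4 ports (`n_bits=2`), the straight-through pattern routes without conflict, but `conflicts([(0, 0), (2, 1)], 2)` is true: the network is blocking, so the peak of 4 ops per cycle is only reached for some combinations of thread and FU.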

This datapath, except the LSU, costs us on Spartan-3 (for 32 bits):
LOGICAL: 8 LC
ADD/SUB: 10 LC
MUL: 8 LC + 4 18-bit MULs
SHUFFLE: 50 LC (can do ROL/R and SHL/R)

-> try to combine it with the MUL (see above)

3 OMEGAs: 192 LC
4 RFs: 128 LC
---------------
about 400 LC so far (the 1600E part has 33,000 LC).

yep but the omega is 1/2 of this. big surface and performance hit.

Now we need a scoreboard and a scheduler FIFO, both private to each thread.

now you begin to realize where the problems are.

We can issue from a thread to the appropriate FU when the thread is ready and there is no other, higher-prio thread which is ready and wants the same FU (prio may be fixed per RF).

there are too many parameters that change every cycle....

Thread is ready if:
- there is unused timeslot in thread's scheduler fifo
- there is no dependency on register "in progress"
Note that the ready test is local to the thread. It may be extended to output a third state, "ready if bypass is used", to handle a shared bypass.
Yes, the scheduling still has dark parts (I'd like to avoid adding another pipeline stage).
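
A rough sketch of the per-thread ready test described above. It takes one possible reading of "unused timeslot": the scheduler FIFO is modelled as a bit vector where bit k means the write-back path is already reserved k cycles from now. `ThreadState`, `is_ready` and `issue` are hypothetical names, and the third "ready if bypass is used" state is left out.

```python
from dataclasses import dataclass

@dataclass
class ThreadState:
    busy_regs: int = 0  # scoreboard: bit r set while register r is "in progress"
    slots: int = 0      # bit k set if write-back is reserved k cycles from now

def is_ready(t, srcs, dst, latency):
    """Local ready test: free timeslot and no dependency on a busy register."""
    slot_free = ((t.slots >> latency) & 1) == 0
    needed = 1 << dst
    for r in srcs:
        needed |= 1 << r
    return slot_free and (t.busy_regs & needed) == 0

def issue(t, dst, latency):
    """Record an issued instruction in the scoreboard and the slot vector."""
    t.busy_regs |= 1 << dst
    t.slots |= 1 << latency
```

Both checks only touch the thread's own state, which is what keeps the test local; the cross-thread part is the FU arbitration.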

The cool thing with FC0 is that the decoding is kept rather simple AND it tries to reduce the "fetch to issue" depth.
When more parameters are added (like priorities etc.) it doesn't work anymore, maybe you should look at
another core's concept to solve this.


Basically, the output from the decoder above would control the RF-FU omegas and update the thread FIFOs. The ends of the FIFOs would be encoded (we have 1 cycle for it, because there is always at least that much latency in the RF-FU omegas without bypass) into 8 bits controlling the FU-RF omega.

Problems:
- can hazard checks between threads be done in 1 cycle ? It is a function of 12 variables in the simplest case, so it might be doable

it's still a matter of how much surface and delay you accept to suffer, no ?


- I forgot to add an omega net for the OP in the decoder-FU direction
- how to schedule with bypass
- the fixed-prio model can stall lower-prio threads; round robin is needed, but it makes hazards harder to find

unless you start to duplicate some often-used units (like : 2 or 3xLSU and 2x ASU ?)
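
To make the fixed-priority versus round-robin point in the list above concrete, here is a small sketch of both arbiters for one FU. `requests[i]` is assumed to mean "thread i is ready and wants this FU"; the function names are hypothetical and hazard checking is not modelled.

```python
def fixed_priority(requests):
    """Grant the lowest-numbered requesting thread, or None if nobody asks."""
    for i, wants in enumerate(requests):
        if wants:
            return i
    return None

def round_robin(requests, last):
    """Grant the first requesting thread after the last one that was served."""
    n = len(requests)
    for k in range(1, n + 1):
        i = (last + k) % n
        if requests[i]:
            return i
    return None
```

If threads 0 and 2 both request every cycle, `fixed_priority` always returns 0 and thread 2 starves, while `round_robin` alternates between them at the cost of carrying the `last` state into the decision.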



Jumps:
For near jumps within one page (when an MMU is added; no pages now) I'd jump by replacing the low 12 bits of IP. Because .so libs are loaded on page boundaries anyway, we only need to teach gcc to exploit this. The nice thing is that it can't generate an exception, as the page was already validated (thus no need for a TLB check).
For cross-page jumps and indirect ones, let's use FC0's style but see LSU chapter.
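
The near jump amounts to a mask-and-merge on IP. A minimal sketch, assuming 4 KB pages as implied by the 12 bits; `near_jump` is a made-up name:

```python
PAGE_BITS = 12
PAGE_MASK = (1 << PAGE_BITS) - 1

def near_jump(ip, target12):
    """Replace the low 12 bits of IP; the page number is left untouched."""
    if not 0 <= target12 <= PAGE_MASK:
        raise ValueError("target does not fit in 12 bits")
    return (ip & ~PAGE_MASK) | target12
```

Since the upper bits never change, the result stays in the page that was already validated.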

The replacement of the LSB is interesting but should be carefully examined. because of the complex implications, it is not doable in FC0.

for a "toy CPU", however it might be interesting to test it.

Better : you can still increase the 2 bits to more bits.
If the MSB are cleared, then only the LSB are changed.
If the MSB are set, then it can be used to find the TLB entry more easily,
as we decode only one part of the address and the remaining bits don't change.


So the TLB would be split into several smaller groups of equal size.
Only one is used at the same time for each thread (which reduces
the latency and logic depth).
- each thread has one "current TLB" tag which selects which group of TLB to select
- the high bits (4, 6 or so ?) select the individual TLB entry in the group : it is no longer an associative array,
but a simple array like the register set !
- the TLB entry fills the MSB of the physical address, while the JMP's LSB fill the 12 remaining bits.
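
A sketch of how the grouped, directly indexed TLB could compute a jump target. Everything specific here is an assumption made for illustration: 4 index bits, index 0 standing for "MSB cleared, stay in the current page", a `LookupError` as a placeholder for whatever an empty entry should trigger, and the name `jump_target`.

```python
PAGE_BITS = 12
PAGE_MASK = (1 << PAGE_BITS) - 1
INDEX_BITS = 4  # assumed; the text suggests "4, 6 or so"

def jump_target(phys_ip, tlb_groups, current_group, imm):
    """Compute the physical jump target from a (INDEX_BITS + 12)-bit immediate."""
    offset = imm & PAGE_MASK
    index = (imm >> PAGE_BITS) & ((1 << INDEX_BITS) - 1)
    if index == 0:
        # MSB cleared: only the LSB change, the page stays the same.
        page = phys_ip >> PAGE_BITS
    else:
        # MSB set: plain array lookup in the thread's current group.
        page = tlb_groups[current_group][index]
        if page is None:
            raise LookupError("empty TLB entry")
    return (page << PAGE_BITS) | offset
```

The lookup is an array index, not an associative search, which is where the reduced latency and logic depth would come from.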


what do you think ?

LSU:
Only last 4 registers can contain address.

i would have put 8 (i speak from experience with the VSP)

Address is validated/translated via a special insn which does the TLB check and stores the translated address into a hidden register. This is to catch exceptions before they enter the pipeline.

good, you have learned the FC0 lessons :-) /patintheback :-P

Hidden registers are also part of the context during a switch, so that the OS can invalidate them along with TLB entries.

however the FC0 regenerates the hidden parts when contexts are switched back.
This reduces the quantity of data to backup and avoids any risk of inconsistency
(imagine that you change the TLB entry when the thread is out, and when
it comes back, the hidden part is bogus...)


I'd support OR-addressing (ORing the physical address with a 12-bit oraddr) to facilitate fast access to naturally aligned structures like function-local vars.

my latest programming efforts showed that FC0 sucks at accessing non-contiguous data.
When comparing x86 and ALPHA binaries, this was obvious, and FC0 is even worse.
Even so, when comparing real execution times, the EV6 ALPHA was not left
far behind, despite the 3 instructions that emulate 1 instruction of x86.
but yet, a better thing MUST be found.


The problem with the 12-bit OR-addressing is that it implies that the
line width is equal to the page size. Maybe 8-bit is better as it requires
less space, and it should fit half of the needs.

So this is one of the main issues in F-CPU : once you define an arbitrary size,
it should have no effect on HW. But OR-addressing has many effects,
and it is not easy to find a solution.
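
A small sketch of why OR-addressing depends on alignment: OR gives the same result as an add only when the base and the offset share no set bits, since a + b = (a | b) + (a & b). The names `or_address` and `or_equals_add` are made up, and the offset width is a parameter so 12-bit and 8-bit variants can be compared.

```python
def or_address(base, oraddr, bits=12):
    """Form an address by ORing an offset of at most `bits` bits into the base."""
    if not 0 <= oraddr < (1 << bits):
        raise ValueError("oraddr does not fit")
    return base | oraddr

def or_equals_add(base, oraddr):
    """OR matches ADD exactly when base and offset have no bit in common."""
    return (base & oraddr) == 0
```

For example, `or_address(0x1000, 0x24)` is 0x1024, the same as an add, but `or_address(0x1010, 0x30)` is 0x1030 where an add would give 0x1040. The structure has to be aligned to at least the span of the offsets used.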


The LSU thus doesn't need to know the value of the virtual address, only the number of the pointer register (0-15) to index the appropriate thread's hidden register. Port B of the LSU is then used for oraddr (as you can see, it can come from an immediate or a register, so you can fast-index arrays without an adder) and port A is the value for writes.
I'm not an expert on LSUs
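
A toy model of the validate-then-access scheme described above, under loud assumptions: the TLB is a plain dict from virtual to physical page numbers, the pointer registers are r12 to r15 ("only last 4 registers"), and `ThreadContext`, `validate`, `lsu_address` and `TranslationFault` are hypothetical names.

```python
PAGE_BITS = 12
PAGE_MASK = (1 << PAGE_BITS) - 1
POINTER_REGS = range(12, 16)  # assumed: the last 4 of 16 registers

class TranslationFault(Exception):
    """Raised by the validating insn, i.e. before anything enters the pipeline."""

class ThreadContext:
    def __init__(self, tlb):
        self.tlb = tlb      # virtual page number -> physical page number
        self.hidden = {}    # pointer register number -> translated address

    def validate(self, reg, vaddr):
        """Special insn: TLB check, then store the translated address."""
        if reg not in POINTER_REGS:
            raise ValueError("not a pointer register")
        vpage = vaddr >> PAGE_BITS
        if vpage not in self.tlb:
            raise TranslationFault(hex(vaddr))
        self.hidden[reg] = (self.tlb[vpage] << PAGE_BITS) | (vaddr & PAGE_MASK)

    def lsu_address(self, reg, oraddr):
        """The LSU only sees the register number and the OR offset."""
        return self.hidden[reg] | (oraddr & PAGE_MASK)
```

In this model the load/store path never looks at the virtual address or the TLB; a bad pointer is reported at `validate` time.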

Frankly, the LSU is the dark spot of FC0
and i intend to clean that up with VSP (which reuses parts of FC0 and experience will
be brought back to FC0).
See http://f-cpu.seul.org/whygee/VSP/


I expect no cache in FPGA as its DDR interface will be faster than main clock :)

you forgot about access time. If your DDR has an access time of 10ns, that would be too good (or too expensive ?)


Also I'd use an Alpha-like model: only aligned 32-bit accesses, and teach the shuffle unit to do the rest.

VSP is a bit like that BUT the IE unit makes it completely transparent :-)))
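
A sketch of what "aligned 32-bit accesses plus extract" means in practice: the memory is only ever read as whole words, and a shift and mask pick out the byte or halfword. Little-endian byte order is assumed, and `load_byte` / `load_half` are illustrative names, not the IE unit's actual operations.

```python
def load_byte(words, addr):
    """Read one byte using only an aligned 32-bit word access."""
    word = words[addr >> 2]          # aligned word fetch
    shift = 8 * (addr & 3)           # byte lane within the word
    return (word >> shift) & 0xFF

def load_half(words, addr):
    """Read an aligned 16-bit halfword from an aligned 32-bit word."""
    if addr & 1:
        raise ValueError("halfword address must be 2-byte aligned")
    word = words[addr >> 2]
    return (word >> (8 * (addr & 2))) & 0xFFFF
```

Stores would need the matching insert step (read, merge, write back), which is not shown.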

Other ideas:
- use irregular omegas to have 8 ports on the RF side, so that it can handle 7 threads and use the 8th port for bypass. Niagara needs 4 threads to keep a 1-issue pipeline full, so I guess that 4 threads and a 4-issue (peak) pipeline will stall too often.
But an 8-port regular omega net will probably have too complex/slow controls.


ugh, time to go to sleep. Comments are welcome as usual :)

some of the ideas are interesting, some may cause trouble, but you're certainly going to have "fun" :-)



devik

YG (not reading emails often anymore)
