
[f-cpu] FC0 and SMT



Hi again (after ... 3 ? yrs),

I'm still thinking about some CPU aspects in my free time, and I found myself thinking a lot about SMT à la Niagara (T1).
I was interested in trying an FC0-like core with it. I sketched a simple test architecture to validate the idea and decided to share the raw ideas here (because I don't know if I will ever get time to realize it).


Goals:
- ultrasimple pipeline (to be fast and small)
- realizable in contemporary FPGA (xilinx xc3s1600e, about $30)
- at least 100MHz in FPGA
- 4 thread SMT
- multiissue but not superscalar (per thread)
- no MMU in first phase

Architecture:
I decided to take from FC0 what I find interesting for this project: scoreboard, fixed register decode, zero detection, SRB, no exceptions in the pipeline.
I decided to do it as a 32-bit CPU (some parts are considerably smaller that way), little endian (just because I'm used to it and I can't find the rationale behind multi-endianness).
16 registers, because I can do a fast 2r2w file in Xilinx (I need only 2r1w).
Logic, add/sub, mul (xc2s has many 18-bit 270MHz multipliers), bit shuffle, LSU. No SIMD for now (it would prevent use of the optimized adders and multipliers in the FPGA).


The sketch is simple: four 2r1w 16x32 register files (one per thread) and 4 groups of FUs (probably LSU, ADD, MUL+SHUFFLE, LOGICAL). These groups are connected together via 3 omega networks, which also do sign/zero extension (simple to do, because we use 4-input LUTs as 2-input muxes, so we have the other 2 inputs to define input select plus 1/0 output force).
There is another 32-bit mux on register file output B to mix 16-bit immediates in.
One shared bypass and an associated "before FU" mux might also be useful.
This alone allows 4 ops per cycle peak. This datapath, excluding the LSU, costs on Spartan-3 (for 32 bits):
LOGICAL: 8 LC
ADD/SUB: 10 LC
MUL: 8 LC + 4 18-bit MULs
SHUFFLE: 50 LC (can do ROL/R and SHL/R)
3 OMEGAs: 192 LC
4 RFs: 128 LC
---------------
about 400 LC so far (the 1600E part has about 33,000 LC).
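
To make the omega networks above a bit more concrete, here is a minimal Python sketch of routing in a 4-port omega net. It assumes the classic shuffle-exchange layout with destination-tag routing (the post does not say which variant is meant), and the function name omega_route and the switch encoding are made up for illustration. It shows that some port assignments block, which is what the scheduler would have to respect.

```python
def omega_route(dests, n_bits=2):
    """Try to route an omega network with 2**n_bits ports.

    dests[src] is the destination port for input src, or None if idle.
    Returns one dict per stage mapping switch index -> True (crossed) or
    False (straight), or None if two packets need the same switch output.
    Assumption: classic shuffle-exchange stages, destination-tag routing.
    """
    size = 1 << n_bits
    packets = [(src, d) for src, d in enumerate(dests) if d is not None]
    stages = []
    for stage in range(n_bits):
        config, taken, moved = {}, set(), []
        for line, dest in packets:
            # Perfect shuffle: rotate the line index left by one bit.
            shuffled = ((line << 1) | (line >> (n_bits - 1))) & (size - 1)
            switch, in_port = shuffled >> 1, shuffled & 1
            # Each stage looks at one bit of dest, most significant first.
            out_port = (dest >> (n_bits - 1 - stage)) & 1
            out_line = (switch << 1) | out_port
            if out_line in taken:
                return None  # blocked: two packets want the same output
            taken.add(out_line)
            config[switch] = in_port != out_port
            moved.append((out_line, dest))
        stages.append(config)
        packets = moved
    return stages


print(omega_route([0, 1, 2, 3]))  # identity: every switch stays straight
print(omega_route([0, 2, 1, 3]))  # None: inputs 0 and 2 collide in stage 0
```

With 4 ports there are 2 stages of 2 switches each, so one network needs 4 switch control bits in this model.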


Now we need a scoreboard and a scheduler FIFO, both private to each thread.
We can issue from a thread to the appropriate FU when the thread is ready and no other higher-priority thread is ready and wants the same FU (priority may be fixed per RF). A thread is ready if:
- there is an unused timeslot in the thread's scheduler FIFO
- there is no dependency on a register "in progress"
Note that the ready test is local to the thread. It may be extended to output a third state, "ready if bypass is used", to handle the shared bypass.
Yes, the scheduling still has dark parts (I'd like to avoid adding another pipeline stage).
Basically, the output from the decoder above would control the RF-FU omegas and update the thread FIFOs. The ends of the FIFOs would be encoded (we have 1 cycle for it, because there is always at least the latency of the RF-FU omegas without bypass) into 8 bits controlling the FU-RF omega.
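
The issue rule above can be sketched in a few lines of Python. This is only a behavioural model under assumptions: the names ThreadState, is_ready and issue are hypothetical, the FIFO timeslot check is reduced to a single boolean, and priority is simply the position in the list (index 0 is highest). The bypass "third state" is not modelled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ThreadState:
    wanted_fu: Optional[str]      # FU needed by the next insn, None if none
    regs_used: tuple = ()         # registers the next insn reads or writes
    in_progress: set = field(default_factory=set)  # scoreboard contents
    fifo_slot_free: bool = True   # unused timeslot in the scheduler FIFO


def is_ready(t):
    """Ready test, local to one thread (no knowledge of other threads)."""
    if t.wanted_fu is None or not t.fifo_slot_free:
        return False
    return not any(r in t.in_progress for r in t.regs_used)


def issue(threads):
    """Fixed-priority arbitration: returns {fu_name: thread_index}."""
    grants = {}
    for index, t in enumerate(threads):
        # A lower-priority thread loses if the FU is already granted.
        if is_ready(t) and t.wanted_fu not in grants:
            grants[t.wanted_fu] = index
    return grants


threads = [
    ThreadState("ADD", (1, 2, 3)),
    ThreadState("ADD", (4, 5)),                    # loses ADD to thread 0
    ThreadState("MUL", (1, 2), in_progress={2}),   # blocked by scoreboard
    ThreadState("LSU", (12,)),
]
print(issue(threads))
```

In this model up to 4 grants come out per cycle, one per FU group, which matches the 4 ops per cycle peak mentioned earlier.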


Problems:
- can hazard checks between threads be done in 1 cycle? It is a function of 12 variables in the simplest case, so it might be doable
- I forgot to add an omega net for the OP in the decoder-FU direction
- how to schedule with the bypass
- the fixed-priority model can stall lower-priority threads; round robin is needed, but then it is more complex to find hazards
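
For the last point, a round-robin arbiter for a single FU could look roughly like the sketch below. The function name round_robin_grant and the "last winner" state are assumptions for illustration, and the sketch ignores the harder part named above, which is combining the rotating priority with hazard checks.

```python
def round_robin_grant(requests, last_winner):
    """Pick the next requesting thread after last_winner, wrapping around.

    requests[i] is True if thread i is ready and wants this FU.
    Returns the granted thread index, or None if nobody requests.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_winner + offset) % n
        if requests[candidate]:
            return candidate
    return None


requests = [True, True, False, True]
winner = 0
for _ in range(3):
    winner = round_robin_grant(requests, winner)
    print(winner)
```

The search starts just after the previous winner, so a thread that keeps requesting cannot be starved by the others.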


Jumps:
For near jumps within one page (once an MMU is added; there are no pages now) I'd jump by replacing the low 12 bits of IP. Because .so libs are loaded on a page boundary anyway, we only need to teach gcc to exploit this. The nice thing is that such a jump can't generate an exception, as the page was already validated (thus no need for a TLB check).
For cross-page and indirect jumps, let's use FC0's style, but see the LSU chapter.
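
The near jump described above is just a bit-field replacement. A minimal sketch, assuming a 32-bit IP and 4 KB pages (the 12 bits from the text); the name near_jump is hypothetical:

```python
PAGE_BITS = 12
PAGE_MASK = (1 << PAGE_BITS) - 1   # 0xFFF
IP_MASK = 0xFFFFFFFF               # 32-bit IP assumed


def near_jump(ip, target12):
    """Replace the low 12 bits of ip; the page number is left untouched."""
    if not 0 <= target12 <= PAGE_MASK:
        raise ValueError("near jump target must fit in 12 bits")
    return (ip & IP_MASK & ~PAGE_MASK) | target12


print(hex(near_jump(0x00401ABC, 0x010)))  # 0x401010
```

Since the upper bits never change, the new IP always lies in the page that is already executing, which is why no TLB check is needed.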


LSU:
Only the last 4 registers can contain an address. An address is validated/translated via a special insn which does the TLB check and stores the translated address into a hidden register. This is to catch exceptions before they enter the pipeline. Hidden registers are also part of the context during a switch, so the OS can invalidate them along with TLB entries.
I'd support OR-addressing (ORing the physical address with a 12-bit oraddr) to facilitate fast access to naturally aligned structures like function-local vars.
The LSU thus doesn't need to know the value of the virtual address, only the number of the pointer register (0-15) to index the appropriate thread's hidden register. So port B of the LSU is used for oraddr (as you can see, it can come from an immediate or a register, so you can fast-index arrays without an adder) and port A is the value for writes.
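
A small sketch of the OR-addressing idea, with the hidden registers modelled as a plain dictionary keyed by (thread, pointer register). That layout and the name or_address are assumptions for illustration only. The point is that OR gives the same result as an add only when the base is aligned so that it shares no set bits with oraddr.

```python
ORADDR_MASK = 0xFFF  # 12-bit oraddr, as in the text


def or_address(hidden_regs, thread, ptr_reg, oraddr):
    """Form the physical address without an adder: base OR offset."""
    base = hidden_regs[(thread, ptr_reg)]  # already translated address
    return base | (oraddr & ORADDR_MASK)


hidden = {
    (0, 12): 0x80001000,  # aligned base: low 12 bits are zero
    (0, 13): 0x80001004,  # misaligned base, for comparison
}

aligned = or_address(hidden, 0, 12, 0x024)
misaligned = or_address(hidden, 0, 13, 0x024)
print(hex(aligned), aligned == 0x80001000 + 0x024)        # matches the add
print(hex(misaligned), misaligned == 0x80001004 + 0x024)  # does not match
```

So the compiler (or programmer) would have to place such structures on a boundary at least as large as their size rounded up to a power of two.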
I'm not an expert on LSUs (nor on CPUs generally), so I'd design it as a set of buffers: initiate an external bus cycle when allocating a new buffer, and release the buffer once the bus transaction commits. During release it must snoop for a free timeslot in the LSU-RFx path to write back the load result. I expect no cache in the FPGA, as its DDR interface will be faster than the main clock :)
Also I'd use an Alpha-like model: only aligned 32-bit accesses, and teach the shuffle unit to do the rest.
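
What "the shuffle unit does the rest" could mean for a byte access is sketched below: the LSU only ever moves aligned 32-bit words, and the byte is extracted or merged with shifts and masks. Memory is modelled as a list of 32-bit words, byte order is little endian as chosen above, and the names load_byte and store_byte are hypothetical.

```python
def load_byte(mem_words, addr, signed=False):
    """Byte load built from one aligned 32-bit load plus shift/mask."""
    word = mem_words[addr >> 2]      # aligned word access (the LSU part)
    shift = (addr & 3) * 8           # little endian: byte 0 is the low byte
    byte = (word >> shift) & 0xFF    # the shuffle-unit part
    if signed and byte & 0x80:
        byte -= 0x100                # sign extension
    return byte


def store_byte(mem_words, addr, value):
    """Byte store as read-modify-write of the containing aligned word."""
    shift = (addr & 3) * 8
    word = mem_words[addr >> 2]
    word = (word & ~(0xFF << shift) & 0xFFFFFFFF) | ((value & 0xFF) << shift)
    mem_words[addr >> 2] = word


mem = [0x44332211, 0x80FFFFFF]
print(hex(load_byte(mem, 2)))          # 0x33
print(load_byte(mem, 7, signed=True))  # -128
store_byte(mem, 1, 0xAB)
print(hex(mem[0]))                     # 0x4433ab11
```

Note that a byte store becomes a load, a shuffle and a store, which is the usual price of this model.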


Other ideas:
- use irregular omegas to have 8 ports on the RF side, so that it can handle 7 threads and use the 8th port for bypass. Niagara needs 4 threads to keep a 1-issue pipeline full, so I guess that a 4-thread, 4-issue (peak) pipeline will stall too often.
But an 8-port regular omega net will probably have too complex/slow controls.


Ugh, time to go to sleep. Comments are welcome as usual :)
devik