[f-cpu] FC0 and SMT
- To: f-cpu@xxxxxxxx
- Subject: [f-cpu] FC0 and SMT
- From: Martin Devera <devik@xxxxxx>
- Date: Wed, 22 Feb 2006 01:11:42 +0100
- Delivered-to: archiver@seul.org
- Delivered-to: f-cpu-outgoing@seul.org
- Delivered-to: f-cpu@seul.org
- Delivery-date: Tue, 21 Feb 2006 19:11:53 -0500
- Reply-to: f-cpu@xxxxxxxx
- Sender: owner-f-cpu@xxxxxxxx
- User-agent: Thunderbird 1.5 (X11/20051201)
Hi again (after ... 3 ? yrs),
I still think about some CPU aspects in my free time, and I found myself
thinking a lot about SMT a la Niagara (the UltraSPARC T1).
I wanted to try an FC0-like core with it. I sketched a simple test
architecture to validate the idea and decided to share the raw ideas here
(because I don't know if I will ever get the time to implement it).
Goals:
- ultra-simple pipeline (to be fast and small)
- realizable in a contemporary FPGA (Xilinx xc3s1600e, about $30)
- at least 100 MHz in the FPGA
- 4-thread SMT
- multi-issue but not superscalar (per thread)
- no MMU in the first phase
Architecture:
I decided to take from FC0 what I find interesting for this project:
scoreboard, fixed register decode, zero detection, SRB, no exceptions in
the pipeline.
I decided to do it as a 32-bit CPU (some parts are considerably smaller
then), little endian (just because I'm used to it and I can't find a
rationale for multi-endianness).
16 registers, because I can do a fast 2r2w register file in Xilinx (I need
only 2r1w).
Logic, add/sub, mul (xc3s has many 18-bit 270 MHz multipliers), bit
shuffle, LSU. No SIMD for now (it would prevent the use of the optimized
adders and multipliers in the FPGA).
The sketch is simple: four 2r1w 16x32 register files (one per thread) and
4 groups of FUs (probably LSU, ADD, MUL+SHUFFLE, LOGICAL). These
are connected together via 3 omega networks, which also do sign/zero
extension (simple to do because we use 4-input LUTs as 2-input muxes, so
the other 2 inputs are left for the input select and the 1/0 output force).
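To make the omega network idea concrete, here is a minimal sketch of destination-tag routing. It assumes a textbook omega network (a perfect shuffle followed by 2x2 switches in each stage) with 4 ports; the exact topology is not pinned down above, and the function names are made up for illustration. Sign/zero extension is not modelled.

```python
def omega_route(src, dst, n_bits=2):
    """Positions visited when routing src -> dst through an omega network
    with 2**n_bits ports (assumed textbook topology, hypothetical helper)."""
    size = 1 << n_bits
    pos = src
    path = [pos]
    for stage in range(n_bits):
        # Perfect shuffle: rotate the position bits left by one.
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & (size - 1)
        # 2x2 switch: the low bit is replaced by the destination bit, MSB first.
        bit = (dst >> (n_bits - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path


def has_conflict(routes, n_bits=2):
    """True if two (src, dst) routes need the same switch output in some stage."""
    paths = [omega_route(s, d, n_bits) for s, d in routes]
    for stage in range(1, n_bits + 1):
        seen = set()
        for p in paths:
            if p[stage] in seen:
                return True
            seen.add(p[stage])
    return False
```

An omega network is blocking: under these assumptions `has_conflict([(0, 0), (2, 1)])` is true although the two routes use different sources and destinations, and this is the kind of cross-thread hazard the issue logic would have to detect.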
There is another 32-bit mux on register file output B to mix 16-bit
immediates in.
One shared bypass and an associated "before FU" mux might also be useful.
This alone allows 4 ops per cycle peak. This datapath, excluding the
LSU, costs the following on Spartan-3 (for 32 bits):
LOGICAL: 8 LC
ADD/SUB: 10 LC
MUL: 8 LC + 4 18-bit MULs
SHUFFLE: 50 LC (can do ROL/R and SHL/R)
3 OMEGAs: 192 LC
4 RFs: 128 LC
---------------
about 400 LC so far (the 1600E part has 33,000 LC).
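As a quick check of the arithmetic, the estimate can be summed up like this. The per-unit numbers are the ones listed above; the 33,000 LC device size is the rounded figure quoted in this post.

```python
# Logic-cell (LC) estimates per unit, copied from the list above.
datapath_lc = {
    "LOGICAL": 8,
    "ADD/SUB": 10,
    "MUL": 8,        # plus 4 dedicated 18-bit multipliers, not counted as LCs
    "SHUFFLE": 50,
    "3 OMEGAs": 192,
    "4 RFs": 128,
}

total_lc = sum(datapath_lc.values())      # 396, i.e. "about 400"
device_lc = 33_000                        # rounded size of the 1600E part
share_percent = 100 * total_lc / device_lc
```

So the datapath without the LSU, scoreboard and scheduler takes a little over 1% of the part.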
Now we need a scoreboard and a scheduler FIFO, both private to each
thread.
A thread can issue to the appropriate FU when it is ready and no
higher-priority thread is ready and wants the same FU (priority may
be fixed per RF). A thread is ready if:
- there is an unused timeslot in the thread's scheduler FIFO
- there is no dependency on a register that is "in progress"
Note that the ready test is local to the thread. It may be extended to
output a third state, "ready if bypass is used", to handle the shared bypass.
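The issue rule above can be sketched as a small software model. This is only an illustration, not hardware: the class and field names are made up, and checking the destination register as well as the sources against the scoreboard is an assumption on my side. The bypass state is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    fu: str                       # FU wanted by the next instruction
    srcs: tuple                   # source register numbers
    dst: int                      # destination register number
    fifo_has_free_slot: bool = True
    busy_regs: set = field(default_factory=set)  # scoreboard: regs "in progress"

    def ready(self):
        # Local test only: free FIFO timeslot and no in-progress register used.
        # Including dst in the check is an assumption (avoids write-after-write).
        regs = set(self.srcs) | {self.dst}
        return self.fifo_has_free_slot and not (regs & self.busy_regs)


def arbitrate(threads):
    """threads is ordered highest priority first; returns {fu: thread index}."""
    granted = {}
    for idx, thread in enumerate(threads):
        if thread.ready() and thread.fu not in granted:
            granted[thread.fu] = idx
    return granted


threads = [
    Thread("ADD", (1, 2), 3),
    Thread("ADD", (4, 5), 6),                 # loses ADD to thread 0
    Thread("MUL", (1, 2), 3, busy_regs={2}),  # not ready: r2 is in progress
    Thread("LSU", (0, 1), 2),
]
grants = arbitrate(threads)
```

With fixed priorities thread 1 waits for as long as thread 0 keeps asking for ADD, which is the starvation issue that shows up with this model.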
Yes, the scheduling still has dark parts (I'd like to avoid adding another
pipeline stage).
Basically, the output from the decoder above would control the RF-FU omegas
and update the thread FIFOs. The ends of the FIFOs would be encoded (we have
1 cycle for it, because there is always at least one cycle of latency in the
RF-FU omegas without bypass) into 8 bits controlling the FU-RF omega.
Problems:
- can hazard checks between threads be done in 1 cycle? It is a function
of 12 variables in the simplest case, so it might be doable
- I forgot to add an omega net for the OP in the decoder-FU direction
- how to schedule with bypass
- the fixed-priority model can stall lower-priority threads; round robin is
needed, but then it is more complex to find hazards
Jumps:
For near jumps within one page (once an MMU is added; there are no pages
now) I'd jump by replacing the low 12 bits of IP. Because .so libs are loaded
on a page boundary anyway, we only need to teach gcc to exploit them. The
nice thing is that such a jump can't generate an exception, as the page was
already validated (thus no need for a TLB check).
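A minimal sketch of the near jump, assuming a 32-bit IP and 4 KB pages (12 bits, as above). The function name is only illustrative.

```python
PAGE_BITS = 12                     # 4 KB page, matching the 12-bit replacement
PAGE_MASK = (1 << PAGE_BITS) - 1   # 0xFFF


def near_jump(ip, target12):
    """Keep the page part of a 32-bit IP and replace its low 12 bits."""
    return (ip & ~PAGE_MASK & 0xFFFFFFFF) | (target12 & PAGE_MASK)


new_ip = near_jump(0x00403A10, 0x7C4)   # stays in page 0x00403000
```

No adder is involved and the upper 20 bits never change, so the target is always in the page that was already validated.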
For cross-page jumps and indirect ones, let's use FC0's style but see
LSU chapter.
LSU:
Only the last 4 registers can contain an address. An address is
validated/translated by a special insn which does the TLB check and stores
the translated address in a hidden register. This is to catch exceptions
before they enter the pipeline. Hidden registers are also part of the context
during a switch, so the OS can invalidate them along with TLB entries.
I'd support OR-addressing (ORing the physical address with a 12-bit oraddr)
to facilitate fast access to naturally aligned structures like function-local
vars.
The LSU thus doesn't need to know the value of the virtual address, only the
number of the pointer register (0-15) to index the appropriate thread's
hidden register.
Port B of the LSU is therefore used for oraddr (as you can see, it can come
from an immediate or a register, so you can fast-index arrays without an
adder) and port A carries the value for writes.
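A small sketch of why OR-addressing needs naturally aligned structures. The base addresses here are invented for the example, and the function name is hypothetical.

```python
def or_address(hidden_phys, oraddr):
    """Effective address: translated base ORed with a 12-bit oraddr, no adder."""
    return hidden_phys | (oraddr & 0xFFF)


# Base aligned to 64 bytes: the low 6 bits are zero, so OR behaves like ADD
# for any offset below 64.
aligned_base = 0x0001F400
aligned_ea = or_address(aligned_base, 0x18)

# Base not aligned enough: bit 4 is set both in the base and in the offset,
# so the OR result differs from base + offset.
unaligned_base = 0x0001F410
unaligned_ea = or_address(unaligned_base, 0x18)
```

The first case gives the same address as `aligned_base + 0x18`; the second does not match `unaligned_base + 0x18`, so the compiler or allocator would have to guarantee the alignment.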
I'm no expert on LSUs (nor on CPUs generally), so I'd design it as a
set of buffers: initiate an external bus cycle when allocating a new
buffer and release the buffer once the bus transaction commits. During
release it must snoop for a free timeslot in the LSU-RFx path to write back
the load result. I expect no cache in the FPGA, as its DDR interface will be
faster than the main clock :)
Also I'd use an Alpha-like model: only aligned 32-bit accesses, and teach the
shuffle unit to do the rest.
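Roughly, byte accesses on top of aligned 32-bit accesses could look like this. It is a software model only, little endian as chosen above; memory is a plain list of 32-bit words and the helper names are made up. The shift and mask steps stand in for what the shuffle unit would do.

```python
def load_byte(memory_words, addr):
    """Byte load built from one aligned 32-bit load plus shift and mask."""
    word = memory_words[addr >> 2]     # aligned 32-bit access
    shift = (addr & 3) * 8             # little endian: byte 0 is bits 7..0
    return (word >> shift) & 0xFF


def store_byte(memory_words, addr, value):
    """Byte store as read-modify-write of the containing aligned word."""
    index = addr >> 2
    shift = (addr & 3) * 8
    cleared = memory_words[index] & ~(0xFF << shift) & 0xFFFFFFFF
    memory_words[index] = cleared | ((value & 0xFF) << shift)


memory = [0x44332211]
first = load_byte(memory, 0)
last = load_byte(memory, 3)
store_byte(memory, 1, 0xAB)
```

The price is that a byte store becomes a load, a shuffle and a store, which is the same trade-off the early Alphas made.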
Other ideas:
- use irregular omegas to have 8 ports on the RF side, so that it can handle
7 threads and use the 8th port for bypass. Niagara needs 4 threads to keep a
1-issue pipeline full, so I guess that a 4-thread, 4-issue (peak)
pipeline will stall too often.
But an 8-port regular omega net will probably have too complex/slow controls.
Ugh, time to go to sleep. Comments are welcome as usual :)
devik