Re: [f-cpu] FC0 and SMT
hi,
Martin Devera wrote:
Hi again (after ... 3 ? yrs),
or so ...
still thinking in free time about some cpu aspects and I found myself
thinking a lot about SMT ala Niagara ala T1.
There are several "methods" to do SMT.
i like the Seymour Cray way (see the Peripheral Processing Unit of the
CDC6600)
http://f-cpu.seul.org/whygee/CDC/DesignOfAComputer_CDC6600.pdf
http://f-cpu.seul.org/whygee/CDC/60100000D_6600refMan.pdf
the advantage is that it is simple and very easy to understand,
without the need of complex scoreboards and other mechanisms.
That is more or less what i intend to do with the VSP : 4-way SMT with a
very simple 32-bit core.
I was interested to try FC0 like core with it.
FC0 is not adapted, unless you want to dedicate 1/2 of the gates to
control logic.
I sketched a simple test architecture to validate it and I decided to
share raw ideas here (because I don't know if I will ever get time to
realize it).
Goals:
- ultrasimple pipeline (to be fast and small)
- realizable in contemporary FPGA (xilinx xc3s1600e, about $30)
URL ?
btw, i could get a 64*6 FPGA emulator for $100 (in Paris).
the catch is that it is used, 1998-era design, no software, nothing.
and it's heavy, and newer chips at the same price are certainly
worth more.
Architecture:
I decided to take from FC0 what I find interesting for this project:
scoreboard, fixed register decode, zero detection, SRB, no exceptions in
the pipeline.
sure, these parts are quite interesting.
however it is good for a 64-bit CPU, but probably not for 32-bit,
simply because of the decode logic vs operating logic ratio.
The sketch is simple: 4 2r1w 16x32 register files (each for one
thread), 4 groups of FUs (probably LSU,ADD,MUL+SHUFFLE,LOGICAL).
In the VSP, there are 4 simple units : ASU, ROP2, SHL (all taken from
FC0 and simplified)
plus a new one : "IE" (insert/extract) that performs byte/halfword/word
alignments in conjunction
with the memory system. There is also one "SR" read and write port, but
that's not critical in HW.
With the VSP, there is NO LSU, but a different mechanism that tries to
simplify things
(at the expense of others, like the software coder's sanity, but who
cares ? :-P)
i'm not sure that MUL is needed for your design, but you seem to get
them "for free" so what's the problem...
And as they seem to be quite fast, using them for SHL is an interesting
trick
(ie : one MUL port is a 4->16 bits decoder that feeds the multiplier
with the shift amount, i like this idea :-D)
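A minimal sketch of the shift-through-the-multiplier trick, in Python for illustration only. The function names and the 16-bit width are assumptions made for this example: a 4->16 decoder only covers shift amounts 0..15, so a full 32-bit shift would need a wider decoder or a second pass.

```python
MASK16 = 0xFFFF  # assumed 16-bit datapath, matching a 4->16 decoder


def decode_4_to_16(amount):
    """One-hot decode of a 4-bit shift amount: returns 2**amount."""
    if not 0 <= amount <= 15:
        raise ValueError("shift amount must fit in 4 bits")
    return 1 << amount


def shl_via_mul(value, amount):
    """Shift left by feeding the decoded one-hot value to the multiplier."""
    return (value * decode_4_to_16(amount)) & MASK16


# The multiplier result matches a plain shift for every 4-bit amount.
for amount in range(16):
    assert shl_via_mul(0x1234, amount) == (0x1234 << amount) & MASK16

print(hex(shl_via_mul(0x0003, 4)))  # 0x30
```

A right shift can presumably be obtained the same way by keeping the high half of the product, but that is not shown here.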
These groups are connected together via 3 omega networks which also
do sign/zero extension (simple to do because we use 4-input LUTs as
2-muxes, so we have 2 other inputs to define input select plus 1/0 output
force).
why omega networks ? you want to do 4 simultaneous operations ?
This alone allows you to do 4 ops per cycle peak.
ooops :-)
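One caveat on the "4 ops per cycle peak": an omega network is a blocking network, so the peak is only reached when the 4 routes do not collide inside it. A minimal sketch of destination-tag routing through a 4-port (2-stage) omega, assuming port numbers 0..3 on both sides; the function names are made up for this example.

```python
def omega_route(src, dst, n_bits=2):
    """Positions occupied after each stage by a packet going src -> dst."""
    size = 1 << n_bits
    pos = src
    path = []
    for stage in range(n_bits):
        # Perfect shuffle: rotate the port number left by one bit.
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & (size - 1)
        # 2x2 switch: the low bit becomes the next destination bit (MSB first).
        bit = (dst >> (n_bits - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path


def conflicts(mapping, n_bits=2):
    """True if two routes of {src: dst} need the same link in some stage."""
    paths = [omega_route(s, d, n_bits) for s, d in mapping.items()]
    for stage in range(n_bits):
        links = [p[stage] for p in paths]
        if len(set(links)) != len(links):
            return True
    return False


print(conflicts({0: 0, 1: 1, 2: 2, 3: 3}))  # False: all 4 routes pass
print(conflicts({0: 0, 2: 1}))              # True: both need the same link
```

So even with only two active threads, some source/destination pairs cannot be served in the same cycle, and the scheduler has to know about it.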
This datapath except LSU costs us on spartan3 (for 32bits):
LOGICAL: 8 LC
ADD/SUB: 10 LC
MUL: 8 LC + 4 18-bit MULs
SHUFFLE: 50 LC (can do ROL/R and SHL/R)
-> try to combine it with the MUL (see above)
3 OMEGAs: 192 LC
4 RFs: 128 LC
---------------
about 400 LC till now (1600E part has 33.000 LC).
yep but the omega is 1/2 of this. big surface and performance hit.
Now we need scoreboard and scheduler FIFO where both are private to
each thread.
now you begin to realize where the problems are.
We can issue from a thread to the appropriate FU when it is ready and there is
no other higher-prio thread which is ready and wants the same FU (prio
may be fixed per RF).
there are too many parameters that change every cycle....
Thread is ready if:
- there is unused timeslot in thread's scheduler fifo
- there is no dependency on register "in progress"
Note that the ready test is local to the thread. It may be extended to output
a third state, "ready if bypass is used", to handle shared bypass.
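The ready test and the fixed-priority issue rule described above can be sketched as follows. This is only an illustration under assumptions: the `Thread` fields, the "lower number = higher priority" convention, and checking the destination register as well as the sources are all choices made for this example, not something the design fixes.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    prio: int                 # lower number = higher priority (assumed)
    fifo_free_slots: int      # unused timeslots in the scheduler fifo
    busy_regs: set = field(default_factory=set)  # registers "in progress"


def is_ready(thread, src_regs, dst_reg):
    """Ready test, local to one thread."""
    if thread.fifo_free_slots == 0:
        return False
    # No operand (or destination, an assumption here) may be in progress.
    return not (thread.busy_regs & (set(src_regs) | {dst_reg}))


def issue(requests):
    """requests: list of (thread, fu_name, src_regs, dst_reg).

    Returns {fu_name: thread}: for each FU the highest-priority ready
    thread wins, the others wait for a later cycle.
    """
    granted = {}
    for thread, fu, srcs, dst in sorted(requests, key=lambda r: r[0].prio):
        if fu not in granted and is_ready(thread, srcs, dst):
            granted[fu] = thread
    return granted


t0 = Thread(prio=0, fifo_free_slots=1, busy_regs={3})
t1 = Thread(prio=1, fifo_free_slots=2)
reqs = [(t1, "ADD", [1, 2], 4), (t0, "ADD", [3, 5], 6)]
print(issue(reqs)["ADD"] is t1)  # True: t0 waits on r3, so t1 gets the adder
```

The cross-thread part is only the "fu not in granted" test; everything else is per-thread, which is what keeps the check small.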
Yes the scheduling still has dark parts (I'd like to avoid adding another
pipeline stage).
The cool thing with FC0 is that the decoding is kept rather simple AND
it tries to reduce the "fetch to issue" depth.
When more parameters are added (like priorities etc.) it doesn't work
anymore, maybe you should look at
another core's concept to solve this.
Basically the output from the decoder above would control the RF-FU omegas and
update the thread fifos. The ends of the fifos would be encoded (we have 1 cycle to
do it because there is always at least the latency of the RF-FU omegas without
bypass) into 8 bits controlling the FU-RF omega.
Problems:
- can hazard checks between threads be done in 1 cycle ? It is a
function of 12 vars in the simplest case so it might be doable
it's still a matter of how much surface and delay you accept to suffer, no ?
- I forgot to add omega net for OP in decoder-FU direction
- how to schedule with bypass
- the fixed prios model can stall lower prio threads; round robin is needed but
then it is more complex to find hazards
unless you start to duplicate some often-used units (like : 2 or 3xLSU
and 2x ASU ?)
Jumps:
For near jumps within one page (when MMU is added, no pages now) I'd
jump by replacing the low 12 bits of IP. Because .so libs are loaded
on a page boundary anyway, we only need to teach gcc to exploit
this. The nice thing is that it can't generate an exception, as the page was
already validated (thus no need for a TLB check).
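The near jump amounts to one mask-and-merge, sketched below. `near_jump` and the constant names are made up for this example; the 12-bit page offset is the one from the text.

```python
PAGE_BITS = 12
PAGE_MASK = (1 << PAGE_BITS) - 1


def near_jump(ip, target12):
    """Keep the page part of IP, replace only the low 12 bits."""
    return (ip & ~PAGE_MASK) | (target12 & PAGE_MASK)


print(hex(near_jump(0x40001ABC, 0x123)))  # 0x40001123
```

Since the upper bits never change, the new IP is in the same (already validated) page by construction.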
For cross-page jumps and indirect ones, let's use FC0's style but see
LSU chapter.
The replacement of the LSB is interesting but should be carefully examined.
because of the complex implications, it is not doable in FC0.
for a "toy CPU", however it might be interesting to test it.
Better : you can still increase the 2 bits to more bits.
If the MSB are cleared, then only the LSB are changed.
If the MSB are set, then it can be used to find the TLB entry more easily,
as we decode only one part of the address and the remaining bits don't
change.
So the TLB would be split into several smaller groups of equal size.
Only one is used at the same time for each thread (which reduces
the latency and logic depth).
- each thread has one "current TLB" tag which selects the group of
TLB entries to use
- the high bits (4, 6 or so ?) select the individual TLB entry in the
group : it is no longer an associative array,
but a simple array like the register set !
- the TLB entry fills the MSB of the physical address, while the JMP's
LSB fill the 12 remaining bits.
what do you think ?
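One possible reading of this scheme, as a sketch only. The 4-bit index, the "index 0 means stay in the current page" encoding, and the layout of `tlb_groups` (a list of groups, each a plain list of physical page numbers) are assumptions made to get something runnable.

```python
OFFSET_BITS = 12
OFFSET_MASK = (1 << OFFSET_BITS) - 1
INDEX_BITS = 4  # "4, 6 or so": 4 is picked arbitrarily here


def jump_target(tlb_groups, current_group, index, offset, ip):
    """Compute the jump target from a small index plus a 12-bit offset."""
    offset &= OFFSET_MASK
    if index == 0:
        # MSB cleared: only the LSB of IP are replaced.
        return (ip & ~OFFSET_MASK) | offset
    # MSB set: plain array lookup in the thread's current TLB group,
    # no associative search.
    phys_page = tlb_groups[current_group][index]
    return (phys_page << OFFSET_BITS) | offset


groups = [[None, 0x00ABC, 0x00DEF] + [0] * 13]   # one group of 16 entries
print(hex(jump_target(groups, 0, 0, 0x123, 0x40001ABC)))  # 0x40001123
print(hex(jump_target(groups, 0, 2, 0x010, 0x40001ABC)))  # 0xdef010
```

With this encoding entry 0 of each group is never reachable, so a group holds 15 usable entries; another encoding (a separate flag bit, for example) would avoid that.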
LSU:
Only the last 4 registers can contain an address.
i would have put 8 (i speak from experience with the VSP)
An address is validated/translated via a special insn which does the TLB check
and stores the xlated address into a hidden register. This catches
exceptions before they enter the pipeline.
good, you have learned the FC0 lessons :-) /patintheback :-P
Hidden registers are also part of context during switch so that OS can
invalidate them along with TLB entries.
however the FC0 regenerates the hidden parts when contexts are switched
back.
This reduces the quantity of data to backup and avoids any risk of
inconsistency
(imagine that you change the TLB entry when the thread is out, and when
it comes back, the hidden part is bogus...)
I'd support OR-addressing (ORing the physical address with a 12-bit oraddr)
to facilitate fast access to naturally aligned structures like function
local vars.
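A small sketch of why OR-addressing only works for naturally aligned bases: the OR equals an ADD exactly when the base and the offset share no set bits. The function name and the example addresses are made up.

```python
def or_address(base, oraddr, bits=12):
    """Form an address by ORing a (bits)-wide offset into the base."""
    return base | (oraddr & ((1 << bits) - 1))


aligned_base = 0x7000      # low 12 bits are zero
unaligned_base = 0x7010    # bit 4 already set

# Aligned base: OR gives the same result as an adder would.
assert or_address(aligned_base, 0x18) == aligned_base + 0x18

# Unaligned base: bit 4 is set in both operands, so OR and ADD differ.
assert or_address(unaligned_base, 0x18) != unaligned_base + 0x18

print(hex(or_address(aligned_base, 0x18)))    # 0x7018
print(hex(or_address(unaligned_base, 0x18)))  # 0x7018, not 0x7028
```

In other words the structure has to be aligned to at least the power of two covering the largest offset used, which is the price paid for dropping the adder.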
my latest programming efforts showed that FC0 sucks at accessing
non-contiguous data.
When comparing x86 and ALPHA binaries, this was obvious, and FC0 is even
worse.
Even so, when comparing real execution times, the EV6 ALPHA was not much
left behind despite the 3 instructions that emulate the single x86 instruction.
but still, something better MUST be found.
The problem with the 12-bit OR-addressing is that it implies that the
line width is equal to the page size. Maybe 8-bit is better as it requires
less space, and it should fit half of the needs.
So this is one of the main issues in F-CPU : once you define an arbitrary
size,
it should have no effect on HW. And the OR-addressing has many effects
and it is not easy to find a solution.
The LSU thus doesn't need to know the value of the virtual address, only the number of
the pointer register (0-15) to index the appropriate thread's hidden register.
So port B of the LSU is used for oraddr (as you can see it can come
from an immediate or a register - you can fast-index arrays without an adder)
and port A is the value for writes.
I'm not an expert on LSUs
Frankly, the LSU is the dark spot of FC0
and i intend to clean that up with VSP (which reuses parts of FC0 and
experience will
be brought back to FC0).
See http://f-cpu.seul.org/whygee/VSP/
I expect no cache in FPGA as its DDR interface will be faster than
main clock :)
you forgot about access time. If your DDR has an access time of 10ns,
that would be too good (or too expensive ?)
Also I'd use an Alpha-like model - only aligned 32-bit accesses, and teach the
shuffle unit to do the rest.
VSP is a bit like that BUT the IE unit makes it completely transparent :-)))
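A minimal sketch of the "aligned word access plus extract" model, whether the extract step is done by a shuffle unit or by something like the IE unit. The memory model (a dict of aligned word addresses), the little-endian byte numbering and the function names are assumptions made for this example.

```python
def load_word_aligned(mem, addr):
    """The memory system only ever sees aligned 32-bit reads."""
    if addr % 4:
        raise ValueError("unaligned word access")
    return mem.get(addr, 0)


def extract(word, byte_offset, size):
    """Pick a byte/halfword/word out of a 32-bit word (size in bytes).

    Little-endian byte numbering is assumed; the field itself must be
    naturally aligned inside the word.
    """
    if size not in (1, 2, 4) or byte_offset % size or byte_offset + size > 4:
        raise ValueError("field does not fit in one aligned word")
    return (word >> (8 * byte_offset)) & ((1 << (8 * size)) - 1)


def load(mem, addr, size):
    """Byte/halfword/word load built from an aligned read plus an extract."""
    return extract(load_word_aligned(mem, addr & ~3), addr & 3, size)


mem = {0x100: 0xAABBCCDD}
print(hex(load(mem, 0x101, 1)))  # 0xcc
print(hex(load(mem, 0x102, 2)))  # 0xaabb
```

Stores need the mirror operation (read the word, insert the field, write the word back), which is where the software cost shows up if the hardware doesn't hide it.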
Other ideas:
- use irregular omegas to have 8 ports on RF side, so that it can
handle 7 threads and use 8th port for bypass. Niagara needs 4 threads
to keep 1 issue pipeline full so that I guess that 4 threads and 4
issue (peak) pipeline will stall too often.
But an 8-port regular omega net will probably have too complex/slow
controls.
ugh, time to go to sleep. Comments are welcome as usual :)
some of the ideas are interesting, some may cause trouble, but you're
certainly going to have "fun" :-)
devik
YG (not reading emails often anymore)