Re: [f-cpu] FC0 and SMT
Still thinking in my free time about some CPU aspects, I found myself
thinking a lot about SMT à la Niagara/T1.
There are several "methods" to do SMT.
i like the Seymour Cray way (see the Peripheral Processing Unit of the
CDC6600)
http://f-cpu.seul.org/whygee/CDC/DesignOfAComputer_CDC6600.pdf
http://f-cpu.seul.org/whygee/CDC/60100000D_6600refMan.pdf
the advantage is that it is simple and very easy to understand,
without the need for complex scoreboards and other mechanisms.
That is more or less what i intend to do with the VSP : 4-way SMT with a
very simple 32-bit core.
are you speaking about that 10-way barrel ? It is the way used in Niagara
IIUC : a big register file, with the 2 MSB of the register index selecting
the thread. In case of OOE you could use the same scheduler as in FC0, just
copy the scoreboard bits four times too and add four bits to each FIFO line.
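In C terms, the register-file indexing such a scheme implies would look
roughly like this (the sizes and names here are my own assumptions, not
taken from the VSP or FC0):

    #include <stdint.h>

    #define NTHREADS      4      /* assumed thread count               */
    #define REGS_PER_TID  32     /* assumed 32 architectural registers */

    static uint32_t regfile[NTHREADS * REGS_PER_TID];

    /* physical index = thread id in the 2 MSBs, architectural reg below */
    static inline uint32_t read_reg(unsigned tid, unsigned r)
    {
        return regfile[((tid & 3) << 5) | (r & 31)];
    }

    static inline void write_reg(unsigned tid, unsigned r, uint32_t v)
    {
        regfile[((tid & 3) << 5) | (r & 31)] = v;
    }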
I was looking at a more complex way to utilize the FUs in parallel, but more
on that below.
Goals:
- ultrasimple pipeline (to be fast and small)
- realizable in contemporary FPGA (xilinx xc3s1600e, about $30)
URL ?
http://www.xilinx.com/products/silicon_solutions/fpgas/product_tables.htm#Spartan3E
but oops, the price is for the XC3S400. But http://www.xess.com/prod035.php3
might be the way :-) Nice present for next Christmas for me :)
btw, i could get a 64*6 FPGA emulator for $100 (in Paris).
64*6 ?
In the VSP, there are 4 simple units : ASU, ROP2, SHL (all taken from
FC0 and simplified)
plus a new one : "IE" (insert/extract) that performs byte/halfword/word
alignments in conjunction
yes, I already looked at the VSP a bit. Such an IE FU might be necessary, as
byte accesses are not uncommon. On the other hand, 16-bit accesses are not so
common in 32-bit code :)
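Roughly, in C terms, here is what I understand such an IE unit to do
(function names and the exact semantics are my own guesses, not the VSP's
definition):

    #include <stdint.h>

    /* Extract a byte (size=1) or halfword (size=2) from a 32-bit word,
     * selected by the 2 address LSBs, with optional sign extension.
     * Word-sized accesses need no extraction and bypass the IE unit.   */
    static uint32_t ie_extract(uint32_t word, unsigned lsb2,
                               unsigned size, int sign_extend)
    {
        unsigned bits = 8u * size;                       /* 8 or 16     */
        uint32_t mask = (1u << bits) - 1;
        uint32_t v = (word >> (8u * lsb2)) & mask;
        if (sign_extend && (v >> (bits - 1)))
            v |= ~mask;                                  /* sign-extend */
        return v;
    }

    /* Insert a byte/halfword into a 32-bit word at the same offset. */
    static uint32_t ie_insert(uint32_t word, uint32_t val,
                              unsigned lsb2, unsigned size)
    {
        unsigned bits = 8u * size;
        uint32_t mask = ((1u << bits) - 1) << (8u * lsb2);
        return (word & ~mask) | ((val << (8u * lsb2)) & mask);
    }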
With the VSP, there is NO LSU, but a different mechanism that tries to
simplify things
yes, those indirect regs :)
i'm not sure that MUL is needed for your design, but you seem to get
them "for free" so what's the problem...
And as they seem to be quite fast, using them for SHL is an interesting
trick
(ie : one MUL port is fed by a 4->16-bit decoder that turns the shift
amount into a one-hot factor for the multiplier, i like this idea :-D)
I was thinking about that. The problem is that it locks the placer to
certain parts of the chip, and that it is not so simple to cascade 4 18-bit
MULs to get a 32-bit ROL (there is a Xilinx app note about it).
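To make the trick explicit in software terms: a left shift by n is a
multiplication by 2^n, so the 4->16 decoder only has to produce the one-hot
value 1<<n on one MUL port. A rough C equivalent (just the arithmetic, not
the FPGA mapping):

    #include <stdint.h>

    /* Left shift by n (0..15) on a 16-bit lane, done with the multiplier:
     * the "decoder" builds the one-hot factor 2^n, the MUL does the rest. */
    static uint16_t shl_via_mul(uint16_t x, unsigned n)
    {
        uint16_t one_hot = (uint16_t)(1u << (n & 15));   /* 4->16 decoder */
        return (uint16_t)(x * one_hot);                  /* == x << n     */
    }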
These groups are connected together via 3 omega networks which also
do sign/zero extension (simple to do because we use the 4-input LUTs as
2:1 muxes, so we have the other 2 inputs left to define the input select
plus a forced-0/1 output).
why omega networks ? you want to do 4 simultaneous operations ?
bingo ;-)
LOGICAL: 8 LC
ADD/SUB: 10 LC
MUL: 8 LC + 4 18-bit MULs
SHUFFLE: 50 LC (can do ROL/R and SHL/R)
3 OMEGAs: 192 LC
4 RFs: 128 LC
---------------
about 400 LC so far (the 1600E part has about 33,000 LC).
yep but the omega is 1/2 of this. big surface and performance hit.
Hmm. Performance-wise it is not such a big problem. On Spartan-3 a LUT takes
under 1 ns, and with all the routing I expect about 5 ns.
But you are right about the surface (not to mention the control logic
overhead).
Add to that that it will not sustain the peak rate. I just profiled an
Opteron with its performance counters: on average it can retire 1 insn
per cycle, and the issue unit stalls 50% of the clocks (memory related) on
average, and 80% when in the kernel or mysql. It has 3 int units and it
executes 2 insns at the same time on average.
But the Opteron checks the dependencies of tens of loads in parallel, so
without OOE I'd guess the stall proportion might be 10 times bigger.
Thus, on that average, my 4-way SMT would stall most of the time, and as
you pointed out the omega net is overkill.
Oh so...
them. The nice thing is that it can't generate an exception, as the page was
already validated (thus no need for a TLB check).
For cross-page jumps and indirect ones, let's use FC0's style, but see the
LSU chapter.
The replacement of the LSB is interesting but should be carefully examined.
because of the complex implications, it is not doable in FC0.
For a "toy CPU", however, it might be interesting to test it.
Better : you can still increase the 2 bits to more bits.
If the MSB are cleared, then only the LSB are changed.
If the MSB are set, then they can be used to find the TLB entry more easily,
as we decode only one part of the address and the remaining bits don't
change.
So the TLB would be split into several smaller groups of equal size.
Only one is used at the same time for each thread (which reduces
the latency and logic depth).
- each thread has one "current TLB" tag which selects which group of the TLB
to use
- the high bits (4, 6 or so ?) select the individual TLB entry in the
group : it is no longer an associative array,
but a simple array like the register set !
- the TLB entry fills the MSB of the physical address, while the JMP's
LSB fill the 12 remaining bits.
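If I read this right, the lookup for a same-group jump would be assembled
roughly like this (a C sketch; the field widths, group count and names are
my guesses):

    #include <stdint.h>

    #define PAGE_BITS   12                /* 4 KB pages                    */
    #define GROUP_BITS   4                /* "4, 6 or so" bits per group   */
    #define NGROUPS     16                /* arbitrary number of groups    */

    struct tlb_entry { uint32_t ppn; };   /* physical page number          */

    /* direct-indexed groups : no associative search inside a group */
    static struct tlb_entry tlb[NGROUPS][1 << GROUP_BITS];

    /* cur_group is the thread's "current TLB" tag */
    static uint32_t jump_target(unsigned cur_group, uint32_t jmp)
    {
        unsigned idx = (jmp >> PAGE_BITS) & ((1u << GROUP_BITS) - 1);
        /* the TLB entry supplies the MSBs, the JMP supplies the 12 LSBs */
        return (tlb[cur_group][idx].ppn << PAGE_BITS)
             | (jmp & ((1u << PAGE_BITS) - 1));
    }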
what do you think ?
interesting. The original idea was to have shadow regs with the phys. addr
per thread, so that with LSB replacement you don't need to go through the TLB
(thus making it simpler to increase the thread count). And 12 bits can also
be stuffed into an immediate.
But I'm not sure how to divide the TLB between threads. Because they can
overlap, we would end up with duplicates. In addition, TLBs are
often fully associative, and now we would end up with a direct-mapped cache
(if 4 bits are used, then every 16th page would compete for the same entry,
and a CALL into such a page would cause TLB entry ping-pong).
LSU:
Only the last 4 registers can contain an address.
i would have put 8 (i speak from experience with the VSP)
An address is validated/translated via a special insn which does the TLB
check and stores the translated address into a hidden register. This is to
catch exceptions before they enter the pipeline.
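A behavioural C sketch of that validate-then-use split (the structure, the
names and the TLB/fault helpers are made up for illustration, they are not
defined anywhere):

    #include <stdint.h>

    /* assumed helpers, not defined here */
    int  tlb_lookup(uint32_t vpn, uint32_t *ppn);
    void raise_page_fault(uint32_t vaddr);

    struct addr_reg {
        uint32_t virt;        /* architectural value               */
        uint32_t phys_page;   /* hidden : translated page frame    */
        int      valid;       /* hidden : translation is present   */
    };

    /* "validate" insn : do the TLB check now, take the exception now,
     * and park the physical page in the hidden part of the register. */
    static void insn_validate(struct addr_reg *r, uint32_t virt)
    {
        uint32_t ppn;
        if (!tlb_lookup(virt >> 12, &ppn))
            raise_page_fault(virt);
        r->virt      = virt;
        r->phys_page = ppn << 12;
        r->valid     = 1;
    }

    /* later accesses through this register skip the TLB check,
     * as long as they stay inside the validated page            */
    static uint32_t phys_address(const struct addr_reg *r, uint32_t off12)
    {
        return r->phys_page | (off12 & 0xFFF);
    }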
good, you have learned the FC0 lessons :-) /patintheback :-P
:) of course, even if the f-cpu project were to stall forever, it is an
excellent source of ideas
Hidden registers are also part of the context during a switch, so that the
OS can invalidate them along with the TLB entries.
however the FC0 regenerates the hidden parts when contexts are switched
back.
This reduces the quantity of data to back up and avoids any risk of
inconsistency
(imagine that you change the TLB entry while the thread is out : when
it comes back, the hidden part is bogus...)
yes I know. Only I was not able to figure out how to do that automatic part
effectively. So these are the lines I'm following :
- a ctx switch causes kernel code to run (or is caused by kernel code)
- the kernel controls the TLB's content ; it can examine the hidden fields
(via SRs probably) and repair some when they are no longer valid - but it is
unlikely that a valid page will go out, thus I don't care about the
performance of that path
- there is a benefit : the shadow reg will keep the phys address even when
the TLB has lost that entry - combined with replacement addressing, this
gives us a kind of cache for the TLB
I'd support OR-addressing (ORing the physical address with a 12-bit oraddr)
to facilitate fast access to naturally aligned structures like function
local vars.
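Spelled out in C terms (a sketch; the point being that the OR is only
equivalent to an add when the base is suitably aligned):

    #include <stdint.h>

    /* OR-addressing : no carry chain, so it only behaves like an add when
     * the base has zeros wherever the 12-bit offset has ones, e.g. a
     * naturally aligned block of local variables.                        */
    static uint32_t or_address(uint32_t base_phys, uint32_t oraddr12)
    {
        return base_phys | (oraddr12 & 0xFFF);
    }

    /* example : base = 0x00104000 (aligned), or_address(base, 8) == base+8 ;
     * with an unaligned base the OR silently yields the wrong address,
     * which is why the compiler has to keep such bases aligned.           */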
my latest programming efforts showed that FC0 sucks at accessing
non-contiguous data.
When comparing x86 and ALPHA binaries, this was obvious, and FC0 is even
worse.
Even so, when comparing real execution times, the EV6 ALPHA was not left
much behind, despite the 3 instructions that emulate the 1 instruction of
x86.
But still, a better solution MUST be found.
The problem with the 12-bit OR-addressing is that it implies that the line
width is equal to the page size. Maybe 8 bits is better, as it requires
less space, and it should cover half of the needs.
Are you speaking about the cache-line width ? I don't think so. For my toy
CPU I expected 32-bit-wide cache lines (because I have dual-port 32x512
block memories in the Spartan-3).
I didn't expect to use these 12 bits to index a big L0->LSU mux. I'd
simply use all 32 bits (shadow | 12 bits) to look up the L0 data cache and
start a bus load when it misses. Because I didn't plan an IE unit before,
the L0 output is directly used as the LSU's output (the 2 LSBs are ignored).
In case of a bigger cache line, a few other LSBs control the select mux.
On the gcc side you only need to align the stack when filling on-stack args
AND when allocating local vars. If you need 12 bytes of storage you can
allocate 16 ; if you need 96, you might 64-align it and use 2 pointers to
access it.
Another way is to allow a real 12-bit adder (yes, some latency) which wraps
at the end of a page. Then it is up to the function prologue to check how
many bytes remain to the page end and set up a second pointer if needed.
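A sketch of what such a prologue check could look like (purely hypothetical
codegen, with the frame assumed to grow upward from sp for simplicity):

    #include <stdint.h>

    #define PAGE_SIZE 4096u

    /* prologue idea : if the frame would cross the page boundary, derive a
     * second, separately validated pointer for the part beyond the page.  */
    static void setup_frame_pointers(uint32_t sp, uint32_t frame_size,
                                     uint32_t *fp0, uint32_t *fp1)
    {
        uint32_t to_page_end = PAGE_SIZE - (sp & (PAGE_SIZE - 1));
        *fp0 = sp;                                /* covers the first page  */
        *fp1 = (frame_size > to_page_end)
             ? sp + to_page_end                   /* start of the next page */
             : 0;                                 /* not needed             */
    }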
I expect no cache in the FPGA, as its DDR interface will be faster than the
main clock :)
you forgot about access time. If your DDR has an access time of 10ns,
that would be too good (or too expensive ?)
maybe it doesn't even exist. I'd use SDRAM 133 2-2-2, thus about
5 CPU cycles of latency. I'm not sure whether I can cook up a better one in
the FPGA.
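Rough numbers, assuming a ~100 MHz core clock in the FPGA (my own guess, not
stated above):

    133 MHz SDRAM cycle     ~ 7.5 ns
    tRCD(2) + CL(2)         ~ 4 memory cycles ~ 30 ns to first data
    at ~100 MHz core clock  ~ 3 core cycles, plus resync/transfer
    + precharge (tRP = 2)   ~ 15 ns more when the row is closed

so ~5 CPU cycles looks plausible when the row is already open or freshly
activated, a bit more if a precharge is needed first.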
Martin