
Re: [f-cpu] FC0 and SMT



still thinking in my free time about some CPU aspects, and I found myself thinking a lot about SMT a la Niagara / T1.

There are several "methods" to do SMT.
i like the Seymour Cray way (see the Peripheral Processing Unit of the CDC6600)
http://f-cpu.seul.org/whygee/CDC/DesignOfAComputer_CDC6600.pdf
http://f-cpu.seul.org/whygee/CDC/60100000D_6600refMan.pdf
the advantage is that it is simple and very easy to understand,
without the need of complex scoreboards and other mechanisms.
That is more or less what i intend to do with the VSP : 4-way SMT with a very simple 32-bit core.

are you speaking about that 10-way barrel ? It is the way used in Niagara IIUC : a big register file, and the 2 MSBs are used to select the thread. In case of OOC you could use the same scheduler as in FC0, just copy the scoreboard bits four times too and add four bits to each FIFO line.
I was looking at a more complex way to utilize the FUs in parallel, but more on that below.
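As a side note, a minimal C sketch (not from the mail; 4 threads and 32 registers per thread are assumed sizes) of what "big register file, 2 MSBs select the thread" amounts to:

#include <stdint.h>

#define NTHREADS 4
#define NREGS    32

/* one flat register file shared by all threads:
 * physical index = {thread_id[1:0], arch_reg[4:0]} */
static uint32_t regfile[NTHREADS * NREGS];

static inline uint32_t *reg(unsigned thread_id, unsigned arch_reg)
{
    return &regfile[((thread_id & (NTHREADS - 1)) << 5) | (arch_reg & (NREGS - 1))];
}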


Goals:
- ultrasimple pipeline (to be fast and small)
- realizable in a contemporary FPGA (Xilinx XC3S1600E, about $30)

URL ?

http://www.xilinx.com/products/silicon_solutions/fpgas/product_tables.htm#Spartan3E
but oops, the price is for the XC3S400. But http://www.xess.com/prod035.php3 might be the way :-) Nice present for next Christmas for me :)


btw, i could get a 64*6 FPGA emulator for $100 (in Paris).

64*6 ?

In the VSP, there are 4 simple units : ASU, ROP2, SHL (all taken from FC0 and simplified),
plus a new one : "IE" (insert/extract), that performs byte/halfword/word alignments in conjunction with the memory accesses.

yes, I already looked at the VSP a bit. Such an IE FU might be necessary, as byte accesses are not uncommon. On the other side, 16-bit accesses are not so common in 32-bit code :)


With the VSP, there is NO LSU, but a different mechanism that tries to simplify things

yes, those indirect regs :)

i'm not sure that MUL is needed for your design, but you seem to get them "for free" so what's the problem...
And as they seem to be quite fast, using them for SHL is an interesting trick
(ie : one MUL port is a 4->16-bit decoder that feeds the multiplier with the decoded shift amount, i like this idea :-D)
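A minimal C model of the idea (not from the mail; the mail suggests a 4->16-bit decoder feeding an 18-bit MUL port, while this sketch just shows the principle on a full 32-bit multiply):

#include <stdint.h>

/* SHL via MUL: decode the shift amount into a one-hot power of two,
 * then let the multiplier do the shift. */
static inline uint32_t shl_via_mul(uint32_t x, unsigned n)
{
    uint32_t one_hot = 1u << (n & 31);  /* the decoder's output */
    return x * one_hot;                 /* equals x << n        */
}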

I was thinking about that. The problem is that it locks the placer to certain parts of the chip, and that it is not so simple to cascade four 18-bit MULs to get a 32-bit ROL (there is a Xilinx app note about it).
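For illustration only, a hedged software model of what such a cascade has to compute (the split into 16-bit partial products is an assumption, not the app note's exact structure):

#include <stdint.h>

/* 32-bit rotate-left built from 16x16 partial products, roughly what
 * cascading 18x18 block multipliers amounts to. */
static uint32_t rol32_via_muls(uint32_t x, unsigned n)
{
    unsigned m  = n & 15;               /* shift done inside the MULs        */
    unsigned hi = n & 16;               /* extra 16-bit shift done by wiring */
    uint32_t k  = 1u << m;              /* one-hot decoder output            */

    uint32_t p_lo = (x & 0xFFFFu) * k;  /* two partial products              */
    uint32_t p_hi = (x >> 16) * k;

    uint64_t v = ((uint64_t)p_hi << (16 + hi)) + ((uint64_t)p_lo << hi);
    return (uint32_t)v | (uint32_t)(v >> 32);   /* fold the wrap-around back */
}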


These groups are connected together via 3 omega networks which also do sign/zero extension (simple to do because we use 4-input LUTs as 2-muxes, so that we have the other 2 inputs to define the input select plus a 1/0 output force).

why omega networks ? you want to do 4 simultaneous operations ?

bingo ;-)
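To make the LUT4 trick above concrete, a one-line C model per bit (illustrative only; a LUT programmed to force 1 instead of 0 would differ only in the constant):

/* 2:1 mux with a "force" input, as one 4-input LUT per bit:
 * a, b = data inputs, sel = mux select, force_zero = zero-extension. */
static inline unsigned lut4_mux2_force(unsigned a, unsigned b,
                                       unsigned sel, unsigned force_zero)
{
    return force_zero ? 0u : (sel ? b : a);
}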

LOGICAL: 8 LC
ADD/SUB: 10 LC
MUL: 8 LC + 4 18-bit MULs
SHUFFLE: 50 LC (can do ROL/R and SHL/R)
3 OMEGAs: 192 LC
4 RFs: 128 LC
---------------
about 400 LC so far (the 1600E part has 33,000 LC).

yep but the omega is 1/2 of this. big surface and performance hit.

Hmm. Performance-wise it is not so big a problem. On the Spartan-3 a LUT takes
under 1 ns, and with all the routing I expect about 5 ns.
But you are right about the surface (not speaking about the control logic overhead).
On top of that, it will not sustain the peak rate. I just profiled an Opteron with its performance counters: on average it can retire 1 insn per cycle, and the issue unit stalls 50% of the clocks (mem related) on average, 80% when in the kernel or mysql. It has 3 int units and it executes 2 insns at the same time on average.
But the Opteron checks the dependencies of tens of loads in parallel, so without OOE I'd guess the stall proportion might be 10 times bigger.
Thus on that average my 4-way SMT would stall most of the time and, as you pointed out, the omega net is overkill.
Oh so...


them. There is a nice thing: it can't generate an exception, as the page was already validated (thus no need for a TLB check).
For cross-page jumps and indirect ones, let's use FC0's style, but see the LSU chapter.
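A minimal sketch of the intra-page jump described above (4 KiB pages assumed; the names are illustrative):

#include <stdint.h>

/* Only the page-offset bits of the already-validated PC are replaced,
 * so the jump cannot fault and needs no new TLB lookup. */
static inline uint32_t jump_in_page(uint32_t pc, uint32_t target_off12)
{
    return (pc & ~0xFFFu) | (target_off12 & 0xFFFu);
}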

The replacement of the LSB is interesting but should be carefully examined. Because of the complex implications, it is not doable in FC0.

for a "toy CPU", however it might be interesting to test it.

Better : you can still increase the 2 bits to more bits.
If the MSBs are cleared, then only the LSBs are changed.
If the MSBs are set, then they can be used to find the TLB entry more easily,
as we decode only one part of the address and the remaining bits don't change.


So the TLB would be split into several smaller groups of equal size.
Only one is used at a time for each thread (which reduces
the latency and the logic depth).
- each thread has one "current TLB" tag which selects which group of the TLB to use
- the high bits (4, 6 or so ?) select the individual TLB entry in the group : it is no longer an associative array,
but a simple array like the register set !
- the TLB entry fills the MSBs of the physical address, while the JMP's LSBs fill the 12 remaining bits.
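A hedged C sketch of that lookup (the group count, the 4 index bits and 4 KiB pages are assumptions for illustration):

#include <stdint.h>

#define TLB_GROUPS        4
#define ENTRIES_PER_GROUP 16     /* 4 index bits */
#define PAGE_SHIFT        12

struct tlb_entry {
    uint32_t vpn;    /* virtual page number (kept as a tag for safety) */
    uint32_t ppn;    /* physical page number                           */
    int      valid;
};

static struct tlb_entry tlb[TLB_GROUPS][ENTRIES_PER_GROUP];

/* group: the per-thread "current TLB" tag */
static int translate(unsigned group, uint32_t vaddr, uint32_t *paddr)
{
    unsigned idx = (vaddr >> PAGE_SHIFT) & (ENTRIES_PER_GROUP - 1);
    struct tlb_entry *e = &tlb[group][idx];

    if (!e->valid || e->vpn != (vaddr >> PAGE_SHIFT))
        return -1;   /* miss: trap / refill */

    /* the entry supplies the physical MSBs, the access supplies the 12 LSBs */
    *paddr = (e->ppn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
    return 0;
}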


what do you think ?

interesting. The original idea was to have shadow regs with the phys. addr per thread, so that with LSB replacement you don't need to go through the TLB (thus making a thread count increase simpler). And also, 12 bits are possible to stuff as an immediate.
But I'm not sure how to divide the TLB between threads. Because they can overlap, we would end up with duplicates. And in addition, TLBs are often fully associative, and now we would end up with a direct-mapped cache (if 4 bits are used then every 16th page would compete for the same entry, and a CALL into such a page would cause TLB entry ping-pong).




LSU:
Only the last 4 registers can contain an address.

i would have put 8 (i speak from experience with the VSP)

The address is validated/translated via a special insn which does the TLB check and stores the translated address into a hidden register. This is to catch exceptions before they enter the pipeline.

good, you have learned the FC0 lessons :-) /patintheback :-P

:) of course, even if the f-cpu project were to stall forever, it is an excellent source of ideas
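A minimal software model of that validate-first / access-later split (the names and helper functions are hypothetical, not an actual VSP or FC0 interface):

#include <stdint.h>

struct ptr_reg {
    uint32_t vaddr;       /* architectural pointer value             */
    uint32_t paddr;       /* hidden part: validated physical address */
    int      validated;
};

extern int tlb_translate(uint32_t vaddr, uint32_t *paddr); /* may fault    */
extern uint32_t phys_read32(uint32_t paddr);               /* never faults */

/* the "special insn": do the TLB check now, latch the result */
static int validate(struct ptr_reg *r)
{
    r->validated = (tlb_translate(r->vaddr, &r->paddr) == 0);
    return r->validated ? 0 : -1;     /* any exception is raised here */
}

/* the later access: uses only the hidden register, cannot fault */
static uint32_t load32(const struct ptr_reg *r)
{
    return phys_read32(r->paddr);
}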


The hidden registers are also part of the context during a switch, so that the OS can invalidate them along with the TLB entries.

however, FC0 regenerates the hidden parts when contexts are switched back.
This reduces the quantity of data to back up and avoids any risk of inconsistency
(imagine that you change the TLB entry while the thread is out, and when
it comes back, the hidden part is bogus...)

yes I know, only I was not able to figure out how to do that automatic part effectively. So these are the lines I walk along:
- a ctx switch causes kernel code to run (or is caused by kernel code)
- the kernel controls the TLB's content; it can examine the hidden fields (via SRs probably) and repair some when they are no longer valid - but it is unlikely that a valid page will go out, thus I don't care about the performance of it
- there is a benefit - the shadow reg will keep the phys address even when the TLB has lost that entry - combined with replacement addressing, we have a cache for the TLB in this



I'd support OR-addressing (ORing the physical address with a 12-bit oraddr) to facilitate fast access to naturally aligned structures like function local vars.
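A tiny C illustration of why this works (hedged; the 12-bit width follows the mail, everything else is illustrative):

#include <stdint.h>

/* If the base is aligned so that the bits set by the offset are zero,
 * (base | offset) == (base + offset), and the adder can be skipped.
 * E.g. with a 4 KiB-aligned frame pointer, fp | 8 == fp + 8. */
static inline uint32_t or_addr(uint32_t base, uint32_t off12)
{
    return base | (off12 & 0xFFFu);
}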

my latest programming efforts showed that FC0 sucks at accessing non-contiguous data.
When comparing x86 and ALPHA binaries, this was obvious, and FC0 is even worse.
Even so, when comparing real execution times, the EV6 ALPHA was not left much
behind, despite the 3 instructions that emulate the 1 instruction of x86.
But still, a better thing MUST be found.


The problem with the 12-bit OR-addressing is that it implies that the
line width is equal to the page size. Maybe 8 bits are better as they require
less space, and it should fit half of the needs.

Do you speak about the cache-line width ? I don't think so. For my toy cpu I expected 32-bit wide cache lines (because I have dual-port 32x512 block memories in the Spartan-3).
I didn't expect to use these 12 bits to index a big L0->LSU mux. I'd simply use all 32 bits (shadow | 12 bits) to look up the data in the L0 cache and start a bus load when not found. Because I didn't plan the IE unit before, the L0 output is directly used as the LSU's output (the 2 LSBs are ignored).
In case of a bigger cache line, a few other LSBs are used to control a select mux.
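A hedged model of that L0 lookup (the geometry follows the mail: 512 one-word lines matching a 32x512 block RAM; the tag handling and helper names are assumptions):

#include <stdint.h>

#define L0_LINES 512

static uint32_t l0_data[L0_LINES];
static uint32_t l0_tag[L0_LINES];    /* physical address bits above the index */
static int      l0_valid[L0_LINES];

extern uint32_t bus_load32(uint32_t paddr);   /* slow path */

static uint32_t l0_read32(uint32_t paddr)     /* paddr = shadow | 12-bit offset */
{
    unsigned idx = (paddr >> 2) & (L0_LINES - 1);   /* the 2 LSBs are ignored */
    uint32_t tag = paddr >> 11;

    if (l0_valid[idx] && l0_tag[idx] == tag)
        return l0_data[idx];                  /* hit: word goes straight out */

    l0_data[idx]  = bus_load32(paddr & ~3u);  /* miss: start the bus load */
    l0_tag[idx]   = tag;
    l0_valid[idx] = 1;
    return l0_data[idx];
}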


On the gcc side you only need to align the stack when filling on-stack args AND when allocating local vars. If you need 12 bytes of storage you can allocate 16; if you need 96 you might 64-align it and use 2 pointers to access it.
The other way is to allow for a 12-bit real adder (yes, some latency) which wraps at the end of a page. Then it is up to the function prolog to check how many bytes there are to the page end and set up a second pointer if needed.


I expect no cache in the FPGA as its DDR interface will be faster than the main clock :)

you forgot about access time. If your DDR has an access time of 10ns, that would be too good (or too expensive ?)

maybe it doesn't even exist. I'd use SDRAM 133 2-2-2, thus about 5 cpu cycles of latency. I'm not sure whether I can cook up a better one in the fpga.

Martin