
Re: [f-cpu] F-CPU architecture...


Michael Riepe wrote:


plus !

Tobias Bergmann wrote:

Oh, I forgot to mention: a colleague of mine is writing an OS tool for circuit simulation, synthesis, ATPG, fault sim, ...
It's called signs: http://www.iti.uni-stuttgart.de/~bartscgr/signs/wiki/index.php/Main_Page

Already noticed that on freshmeat.net :) But I didn't look at it yet.

Have a look whenever you can, as it may help you. It seems to be complementary to your own work.


How large would the effort be to add SMT to the FC0 core? I'm thinking of approx. 3-fold SMT.

Too high, IMHO. In particular, the required changes to the register set and crossbar would be real speed killers.


Now comes the real "meat" of this mail:

I recently had an idea for light-weight parallel execution - let's call it "threadlets". By adding explicit fork/join instructions, an application could split itself into threadlets if it sees fit. Of course careful programming would be required because threadlets share the same register set.

The basic idea is that there is a variant of the jump instruction (with two arguments), called "fork":

    fork r2, r1

That will fork a threadlet starting at address r2 and return some kind of threadlet "ID" in r1. Now both the main program and the threadlet can work independently. When the threadlet is done, it will execute a "return from threadlet" instruction. The main program can use the "join" instruction

    join r1

to wait for termination of a particular threadlet, or

    join r0

to wait for all of them. To ease implementation, only the main program will be allowed to fork threadlets or execute special instructions like syscall or get/put.

Note that the core is not required to process threadlets in parallel at all. If support for parallel execution is missing, threadlets will be executed sequentially, in any order (or lack of order, as you like it). In the most simple implementation, the "fork" instruction would turn into a subroutine call (that is, "jump r2, r1"), and "join" would be a no-op.
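
To make the sequential fallback described above concrete, here is a minimal Python sketch. It is not F-CPU code: the class name SequentialCore, the counter used as a threadlet ID, and the 64-entry register list with r0 pinned to zero are all assumptions made for illustration. On real hardware the simplest fallback would put a return address in r1 rather than an ID.

```python
# Hypothetical model of fork/join on a core WITHOUT parallel execution.
# fork behaves like a subroutine call, join has nothing to wait for.

class SequentialCore:
    def __init__(self):
        self.regs = [0] * 64   # assumed size; threadlets share this register set
        self.next_id = 1       # assumed ID scheme: a simple counter
        self.finished = set()

    def fork(self, threadlet):
        """Run the threadlet to completion right away, like a subroutine call."""
        tid = self.next_id
        self.next_id += 1
        threadlet(self.regs)   # the threadlet works on the shared registers
        self.regs[0] = 0       # assumption: r0 always reads as zero
        self.finished.add(tid)
        return tid             # plays the role of the value returned in r1

    def join(self, tid=0):
        """Effectively a no-op: every forked threadlet has already returned."""
        return tid == 0 or tid in self.finished


# Two threadlets writing to disjoint registers, as careful programming requires.
def sum_low(regs):
    regs[10] = sum(range(0, 50))

def sum_high(regs):
    regs[11] = sum(range(50, 100))

core = SequentialCore()
t1 = core.fork(sum_low)
t2 = core.fork(sum_high)
core.join(0)                   # "join r0": wait for all threadlets
total = core.regs[10] + core.regs[11]
```

A core that can run threadlets in parallel would keep the same program-visible behaviour as long as the threadlets touch disjoint registers, which is exactly the burden the proposal places on the programmer.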

On the other hand, a core may execute as many threadlets in parallel as it can. All we need to provide is an IF&D unit that supports multiple instruction streams.

After 5 minutes, I came to this conclusion: SMT or threadlets can certainly be done in FC1, but not in FC0.

Your idea is quite smart, but besides compiler support, a more pragmatic
issue arises: it goes against one of the core ideas in FC0.
Specifically, we can avoid delayed branches and branch prediction because
a taken branch has a very low penalty (1 cycle today). This is because as soon
as an instruction is selected by the Fetcher, it goes directly to the register set's 3 address
buses, so the data is available on the next cycle (ideally, which is also half the time
it takes to decode and issue the instruction).

Now, adding support for multiple instruction streams (whether or not they share the register set)
adds at least one pipeline stage: the instruction must be selected among the threads,
and this decision is ideally based on resource availability (is the register's value available
somewhere on the Xbar, or is the LSU ready?).
This additional stage doubles the branch penalty and breaks the architectural balance.
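
As a rough illustration of what doubling the penalty costs, here is a back-of-the-envelope Python sketch. The 1-cycle and 2-cycle penalties come from the text above; the base CPI of 1.0 (no other stalls) and the 15% taken-branch fraction are assumptions chosen only to make the arithmetic visible, not measurements.

```python
def cycles_per_instruction(taken_fraction, penalty_cycles, base_cpi=1.0):
    """Average cycles per instruction when each taken branch costs penalty_cycles."""
    return base_cpi + taken_fraction * penalty_cycles

TAKEN_FRACTION = 0.15  # hypothetical workload figure, not from the discussion

# Penalty today: 1 cycle per taken branch.
cpi_fc0 = cycles_per_instruction(TAKEN_FRACTION, penalty_cycles=1)

# With one extra thread-selection stage: 2 cycles per taken branch.
cpi_extra_stage = cycles_per_instruction(TAKEN_FRACTION, penalty_cycles=2)

slowdown = cpi_extra_stage / cpi_fc0
print(f"CPI {cpi_fc0:.2f} -> {cpi_extra_stage:.2f} "
      f"({(slowdown - 1) * 100:.1f}% slower per thread)")
```

This only models a single instruction stream; whether other threadlets can hide the extra cycle is a separate question.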

Of course, this penalty can be spread among the running threadlets, but I don't believe in
this ideal scenario, because the other threadlet (the one that would mask the branch penalty) MUST
have 2 completely independent instructions which don't stall AT THIS EXACT MOMENT.

Adding a third and/or fourth threadlet will only increase the complexity and die area,
and compiler support does not even exist yet.

I propose that we finish FC0 the way we designed it almost 5 years ago,
and then we can move on to more sophisticated stuff :-) I believe that FC1
will be quite exciting, but it will be impossible until we finish FC0.

