Re: [f-cpu] F-CPU architecture...
Michael Riepe wrote:
Tobias Bergmann wrote:
Oh I forgot to mention: A colleague of mine is writing an OS tool for
circuit simulation, synthesis, ATPG, fault sim, ...
It's called signs:
Already noticed that on freshmeat.net :)
But I didn't look at it yet.
have a look whenever you can, as it may help you,
it seems to be complementary to your own work.
How large would the effort be to add SMT to the FC0 core? I'm
thinking of approx. 3-fold SMT.
Too high, IMHO. In particular, the required changes to the register
set and crossbar would be real speed killers.
now comes the real "meat" in this mail :
I recently had an idea for light-weight parallel execution - let's
call it "threadlets". By adding explicit fork/join instructions, an
application could split itself into threadlets if it sees fit. Of
course careful programming would be required because threadlets share
the same register set.
The basic idea is that there is a variant of the jump instruction
(with two arguments), called "fork":
fork r2, r1
That will fork a threadlet starting at address r2 and return some kind
of threadlet "ID" in r1. Now both the main program and the threadlet
can work independently. When the threadlet is done, it will execute a
"return from threadlet" instruction. The main program can use the
to wait for termination of a particular threadlet, or
to wait for all of them. To ease implementation, only the main program
will be allowed to fork threadlets or execute special instructions
like syscall or get/put.
Note that the core is not required to process threadlets in parallel
at all. If support for parallel execution is missing, threadlets will
be executed sequentially, in any order (or lack of order, as you like
it). In the most simple implementation, the "fork" instruction would
turn into a subroutine call (that is, "jump r2, r1"), and "join" would
be a no-op.
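
As a rough illustration of the sequential fallback described above, here is a minimal Python sketch. It is a toy model, not F-CPU code: the class `SequentialCore`, the integer threadlet IDs and the demo threadlets are all assumptions made up for this example. It only shows that "fork" can degrade to a plain subroutine call and "join" to a no-op while the program still gets the same result.

```python
class SequentialCore:
    """Toy model of a core WITHOUT parallel threadlet support (hypothetical)."""

    def __init__(self):
        self.regs = {}      # one register set, shared by all threadlets
        self.next_id = 1    # integer threadlet IDs are an assumption

    def fork(self, threadlet):
        """Fallback for "fork": behave like a subroutine call."""
        tid = self.next_id
        self.next_id += 1
        threadlet(self.regs)  # runs until its "return from threadlet"
        return tid

    def join(self, tid=None):
        """Fallback for "join": nothing to wait for, so do nothing."""
        return None


# Two threadlets that write to disjoint registers. Because the register
# set is shared, keeping them disjoint is the programmer's job.
def sum_low(regs):
    regs["r10"] = sum(range(0, 50))


def sum_high(regs):
    regs["r11"] = sum(range(50, 100))


def run_demo():
    core = SequentialCore()
    t1 = core.fork(sum_low)
    t2 = core.fork(sum_high)
    core.join()  # wait for all threadlets (a no-op in this model)
    return core.regs["r10"] + core.regs["r11"], (t1, t2)
```

A core that does run threadlets in parallel would have to produce the same register contents after the join, which is why the two threadlets above must not touch each other's registers.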
On the other hand, a core may execute as many threadlets in parallel
as it can. All we need to provide is an IF&D unit that supports
multiple instruction streams.
After 5 minutes, I came to this conclusion:
SMT or threadlets can certainly be done in FC1, but not in FC0.
Your idea is quite smart, but besides compiler support, another
issue arises: it goes against one of the core ideas in FC0.
Specifically, we can avoid delayed branches and branch prediction because
a taken branch has a very low penalty (1 cycle today). This is because as
soon as an instruction is selected by the Fetcher, it goes directly to the
register set's 3 address buses, so the data is available on the next cycle
(ideally, which is also half the time it takes to decode and issue the
instruction).
Now, adding support for multiple instruction streams (whether or not they
share the register set) adds a minimum of one stage: the instruction must
be selected among the streams, and this decision is ideally based on
resource availability (is the register's value available somewhere on the
Xbar, or is the LSU ready?).
This minimum additional stage doubles the branch penalty and breaks the
Of course, this penalty can be spread among the running threadlets, but
I don't believe in this ideal scenario because the other threadlet (that
will mask the branch penalty) MUST have 2 completely independent
instructions which don't stall AT THIS
Adding a third and/or fourth threadlet will only increase the complexity
and die area, and compiler support is not even there yet.
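
To put rough numbers on the branch-penalty argument above, here is a minimal sketch of a cycle count. Only the 1-cycle penalty and its doubling come from the discussion. The function `run_cycles`, the `hidden_slots` parameter and the workload figures (1000 instructions, 150 taken branches) are invented for illustration, and the model ignores every other source of stalls.

```python
def run_cycles(instructions, taken_branches, branch_penalty, hidden_slots=0):
    """Cycles for one stream issuing one instruction per cycle.

    Each taken branch costs `branch_penalty` bubble cycles. `hidden_slots`
    is the number of those bubbles that another threadlet manages to fill
    with useful, non-stalling instructions (an assumption of this sketch).
    """
    bubbles = taken_branches * branch_penalty
    if not 0 <= hidden_slots <= bubbles:
        raise ValueError("hidden_slots must be between 0 and the bubble count")
    return instructions + bubbles - hidden_slots


# Hypothetical workload: 1000 instructions, 150 of them taken branches.
fc0_today = run_cycles(1000, 150, branch_penalty=1)
extra_stage = run_cycles(1000, 150, branch_penalty=2)
half_hidden = run_cycles(1000, 150, branch_penalty=2, hidden_slots=150)

print(fc0_today, extra_stage, half_hidden)
```

In this model the extra selection stage only breaks even with the current design if a second threadlet fills half of all bubble slots, which is the "ideal scenario" questioned above.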
I propose that we finish FC0 the way we designed it almost 5 years ago,
and then we can move on to more sophisticated stuff :-) I believe that FC1
will be quite exciting, but it will be impossible until we finish FC0.