
Re: [f-cpu] F-CPU architecture...



hello,

Tobias Bergmann wrote:

Yann Guidon wrote:

A random-pattern test will hardly give you 99% coverage at short test times, but you can always prove me wrong ;P

it's not a matter of proving; 99% is there to show that it is never perfect.
it could be 95% or 90%, what the heck ? it's just a boot-time test
that supplements the factory test, which is meant to be 101%.

OK. I see. Even 98% on a production test is a lot these days.

yup but a single fault is enough to make the chip useless.

****************************************************
But this LFSR business has also another function :
reset ALL the registers and other memories
to a known state (usually 0).
****************************************************
that's why the coverage is not so important for the FSM/LFSR.
A multi-cycle approach spares a lot of silicon compared to a huge
RESET signal all over the die.
Plus, we can pre-compute the coverage and the signature
of the BIST on a computer (takes a while, but worth it).
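to make this concrete, here is a minimal, hypothetical VHDL sketch of the
idea (the 16-bit width, the taps and the GOLDEN constant are illustrative
placeholders, not the actual FC0 BIST) : a maximal-length LFSR feeds
pseudo-random patterns to the logic under test, and a MISR with the same
feedback polynomial compacts the responses into a signature that is
compared against the value pre-computed in simulation :

library ieee;
use ieee.std_logic_1164.all;

entity bist_sketch is
  port (
    clk      : in  std_logic;
    start    : in  std_logic;
    response : in  std_logic_vector(15 downto 0); -- datapath output under test
    pattern  : out std_logic_vector(15 downto 0); -- stimulus to the datapath
    pass     : out std_logic
  );
end entity;

architecture rtl of bist_sketch is
  -- x^16 + x^14 + x^13 + x^11 + 1, a maximal-length polynomial
  signal lfsr : std_logic_vector(15 downto 0) := x"ACE1";
  signal misr : std_logic_vector(15 downto 0) := (others => '0');
  -- placeholder : the real value comes from the pre-computed simulation run
  constant GOLDEN : std_logic_vector(15 downto 0) := x"0000";
begin
  pattern <= lfsr;

  process (clk)
  begin
    if rising_edge(clk) then
      if start = '1' then
        lfsr <= x"ACE1";           -- any non-zero seed works
        misr <= (others => '0');
      else
        -- pattern generator : shift in the XOR of the tap bits
        lfsr <= lfsr(14 downto 0) &
                (lfsr(15) xor lfsr(13) xor lfsr(12) xor lfsr(10));
        -- signature register : same structure, folded with the response
        misr <= (misr(14 downto 0) &
                (misr(15) xor misr(13) xor misr(12) xor misr(10)))
                xor response;
      end if;
    end if;
  end process;

  pass <= '1' when misr = GOLDEN else '0';
end architecture;

and while the LFSR sweeps, every register it touches ends up in a known
state, which is what replaces the big global RESET line.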

:) good idea.

it puts some constraints on the LFSR algo, but it makes it more challenging and interesting :-)

well, it's a stress-test too, right ? ;-)
and if it runs for 100K cycles, it's just a fraction of a second
at real speed (at 100 MHz, 100K cycles take only one millisecond),
so the "total" quantity of heat can be negligible
(particularly if the test runs at power-up, when the chip is cold).
no worries here.

Well, your power supply has to be dimensioned for this worst case as well. That makes it more expensive for no good reason.

hmmm not sure. we'll have to "measure" the average and max activity ...

i had thought about defining our own VHDL data types
(instead of std_logic) so we could implement our own coverage tools.
It could also serve to create stats about activity etc...
but that would be very heavy and might not remain accurate
when we implement the core in an ASIC or FPGA.
sometimes, synthesis can radically change the netlist and the low-level
architecture.
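as a lighter alternative (a hypothetical sketch, nothing that exists in
the tree) : instead of replacing std_logic, a small simulation-only probe
can be bound to each interesting signal in the testbench and count its
events; the stats survive synthesis simply by never entering the
synthesized design :

library ieee;
use ieee.std_logic_1164.all;

-- simulation-only activity probe : counts events on one signal and
-- prints the total when the testbench asserts "done"
entity activity_probe is
  generic (name : string := "sig");
  port (
    watched : in std_logic;
    done    : in boolean
  );
end entity;

architecture sim of activity_probe is
begin
  process (watched, done)
    variable events : natural := 0;
  begin
    if done then
      report name & " : " & natural'image(events) & " events";
    elsif watched'event then
      events := events + 1;
    end if;
  end process;
end architecture;

-- in the testbench, one instance per signal of interest :
-- probe0 : entity work.activity_probe
--   generic map (name => "alu_out(0)")
--   port map (watched => alu_out(0), done => sim_done);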

BTW, how many pipeline stages are there currently? And how much delay in the slowest path?

FC0 has an "out-of-order completion" pipeline because there are several kinds of instructions with different latencies. The fastest are the ROP2 instructions (AND/OR/etc.); the slowest is IDIV... add/sub, multiplies and a few others have been pipelined.

Now, if something slow must be interfaced
(keyboard ? mouse ?) there is no reason to let
it interfere with the FC0 directly. So the VSP
handles that (sparing the FC0 a costly spurious context switch)
and enqueues a message to the CPU that will deal with
user interaction. Keymaps and all that stuff can be processed
in the VSP, and message-passing is more suitable than interrupts
for such things.
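a minimal sketch of what such a mailbox could look like (the width, depth
and handshake are assumptions, the real VSP<->CPU link is not defined
yet) : the VSP pushes event messages, and the CPU pops them whenever it
polls, instead of being interrupted for every keystroke :

library ieee;
use ieee.std_logic_1164.all;

entity mailbox is
  generic (
    WIDTH : positive := 32;
    DEPTH : positive := 16
  );
  port (
    clk     : in  std_logic;
    rst     : in  std_logic;
    wr_en   : in  std_logic;                           -- VSP side : push
    wr_data : in  std_logic_vector(WIDTH-1 downto 0);
    rd_en   : in  std_logic;                           -- CPU side : pop
    rd_data : out std_logic_vector(WIDTH-1 downto 0);
    empty   : out std_logic;
    full    : out std_logic
  );
end entity;

architecture rtl of mailbox is
  type ram_t is array (0 to DEPTH-1) of std_logic_vector(WIDTH-1 downto 0);
  signal ram   : ram_t;
  signal wptr  : natural range 0 to DEPTH-1 := 0;
  signal rptr  : natural range 0 to DEPTH-1 := 0;
  signal count : natural range 0 to DEPTH   := 0;
begin
  empty   <= '1' when count = 0     else '0';
  full    <= '1' when count = DEPTH else '0';
  rd_data <= ram(rptr);

  process (clk)
    variable do_push, do_pop : boolean;
  begin
    if rising_edge(clk) then
      if rst = '1' then
        wptr <= 0; rptr <= 0; count <= 0;
      else
        do_push := wr_en = '1' and count /= DEPTH;
        do_pop  := rd_en = '1' and count /= 0;
        if do_push then
          ram(wptr) <= wr_data;
          wptr <= (wptr + 1) mod DEPTH;
        end if;
        if do_pop then
          rptr <= (rptr + 1) mod DEPTH;
        end if;
        -- a push and a pop in the same cycle cancel out
        if do_push and not do_pop then
          count <= count + 1;
        elsif do_pop and not do_push then
          count <= count - 1;
        end if;
      end if;
    end if;
  end process;
end architecture;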

So it is an IOE (Interrupt Offload Engine), to coin a new term for it.

not only. it offloads more than that : low-level HW management, power management, system reconfiguration when a hot-swap event occurs, FPGA upload, etc. the VSP is a "32-bit microcontroller", so it's there to do the nasty work that does not belong in the domain of a 64-register, 64-bit wide CPU. Having few registers, the VSP can switch from one context to another faster than a higher-clocked CPU.

In fact, i envision something completely different from the Linux-on-PC
model, where a lot of space is taken by 1) bootup and configuration
(even if it is erased during normal runs) and 2) peripheral management / drivers.
I wish there were a kind of microkernel on each FC0 of the
system, each with only a driver for the one local peripheral it manages
(as before : network or SCSI or ATA or video or PCI bridge or...)
so each CPU does its job with as little bloat as possible
(then an application is split across the CPUs, with dynamic load balancing
and according to "affinities" with the I/O they require).

But the way i see the system now, the VSP sets up everything
and uploads the kernel to each CPU, then starts them.
The VSP can also act as a remote console (since it has a kbd+mouse
connection, it can be virtualised through serial or ethernet ports
to allow remote management).

I said high performance and I mean it. 1000 pins + >10 GB/s of bandwidth.

hmmm, the price of a 1K-pin FPGA is out of reach of most people's pockets. AMD and Intel can drive costs that low thanks to huuuge volumes and integrated fabs. If you're rich, no problem.

I'm not rich but I have quite nice FPGAs at work.

such as ? :-)

however, as you have seen, FC0 (and F-CPU in general) is just a core.
you can use it to build more complex chips, as long as you provide
a specific set of signals to its interfaces (clock, power, and data/addresses to the bus).

That's good to know...

just read the manual and the docs in the files :-P a core is a core ...

but you know how things go : look at the Alpha.
I believe that the FC0 can reach success because
we hit a price/complexity/performance point that
is on par with other proprietary "embedded processors"
(based on ARM (there are 64-bit versions), MIPS,
SH-x or others). These chips sell for $5 to $10.

OK. So FC0 competes with embedded chips. Fine with me.

that is the best point to start from. x86 proves that we can always scale up, and the F-CPU model has some headroom.

but it is already defined so that you can process 256-bit wide registers,
or wider if you want. so one FC0 could compete with one of the Cell's SPUs
(more or less).

yup, but the memory speed does not increase as fast as CPU speed.

Bandwidth (almost) does.

who wants "almost" ? the truth is that the price for higher bandwidth is quite high. active termination in the latest memory modules consumes AMPERES.


"latency tolerance"... tell that to end users ;-P
and you ALWAYS need bandwidth
(even, and particularly, when you try to hide latency).
"You can't fake bandwidth" (Seymour C.)

nice one.

a famous one, too.

i had another look recently at the CDC 6600 and Cray-1
architecture manuals, and they give a good idea of what
"supercomputer" means. the CDC (circa 1964) could have
10 ongoing memory accesses; the Cray-1 simply has 16
independent memory channels (4 for instructions, 12 for vector data).
The key there is chip-level parallelism : the memory channels were
physical; no single-chip solution can do that (it would require thousands
and thousands of wires).

The way i "solve" the memory bandwidth problem is by going
"multichip" (coherent NUMA) instead of "multicore".

FC0 (the current core) is designed for simplicity and raw computations.
FC1 may be more subtle than that (i have some ideas).

Looking forward to FC1. :-D

let's finish FC0 first........

But it is SIMPLE. And that's good. A very good test case for free HW development.

test case ? it's not a test, IMHO... particularly because we have to reinvent everything from scratch.

I meant there haven't been many distributed free HW projects yet. Some one-man shows and some licence changes of commercial products.

well... F-CPU (as of today) is not much better /o\

until next time,
Tobias

I'm going away for the next few days, have fun !

have fun!

Well, i went to see H2G2 tonight instead, and now i'm back. so i'm going away tomorrow ;-)

YG

