
Re: [f-cpu] F-CPU architecture...


Tobias Bergmann wrote:

Hi Michael, Hi Yann,

Michael wrote:

My colleague is developing a technique to test a processor via a Software-Based Self-Test (SBST) that yields very high fault coverage at low power and short test times.

Sounds interesting. How does that work, approximately?

Starting from a list of structural faults, she generates a program that, when executed, detects those faults.
Like a fault list -> code compiler!

sure, sounds like the obvious way.

Yann Guidon wrote:

Tobias Bergmann wrote:

Yann Guidon wrote:
Do not skip testability issues just because your prototype stage will be FPGA! The LEON made this mistake and was hard to port to ASIC!

don't worry, the BIST is an integral part of the architecture.
it's just that we need to find a way to verify the chip "when" in ASIC form,
so it does not use any gate in FPGA. A flag in the config file does this.
The other trouble is that, knowing how the SW world works,
ROM space will never be enough (who remembers the Xbox
fiasco ?). So to avoid bloat, the idea is to create a kind of LFSR
that sends pseudo-random signals to all units,
and reuse the integrated POPCOUNT unit to create a "signature"
that will be checked at the end. That's simple, fast, rather efficient
but the problem is to generate an LFSR that will give 99% coverage.

Random pattern test will hardly give you 99% at short test times but you can always prove me wrong ;P

it's not a matter of proving, 99% is here to show that it is never perfect. it could be 95 or 90%, what the heck ? it's just a boot-time test that supplements the factory test which is meant to be 101%.

But this LFSR business has also another function :
reset ALL the registers and other memories
to a known state (usually 0).

that's why the coverage is not so important for the FSM/LFSR.
A multi-cycle approach spares a lot of silicon compared to a huge
RESET signal all over the die.
Plus, we can pre-compute the coverage and the signature
of the BIST on a computer (takes a while, but worth it).
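
As a rough illustration of what "pre-computing the coverage" can look like, here is a minimal sketch in Python. Everything in it is an assumption made for the example, not F-CPU code: the circuit is a toy 4-bit ripple-carry adder, the fault model is single stuck-at faults on net stems only, and the pattern source is a textbook 16-bit LFSR (taps 16, 14, 13, 11).

```python
OPS = {"xor": lambda x, y: x ^ y, "and": lambda x, y: x & y, "or": lambda x, y: x | y}

def build_adder(bits):
    """Gate list (out, op, in1, in2) for a toy ripple-carry adder."""
    gates, carry = [], "cin"
    for i in range(bits):
        a, b = f"a{i}", f"b{i}"
        gates += [
            (f"x{i}", "xor", a, b),
            (f"s{i}", "xor", f"x{i}", carry),
            (f"g{i}", "and", a, b),
            (f"p{i}", "and", f"x{i}", carry),
            (f"c{i}", "or", f"g{i}", f"p{i}"),
        ]
        carry = f"c{i}"
    inputs = ["cin"] + [f"{n}{i}" for i in range(bits) for n in "ab"]
    outputs = [f"s{i}" for i in range(bits)] + [carry]
    return gates, inputs, outputs

def simulate(gates, outputs, values, fault=None):
    """Evaluate the netlist; fault is (net, stuck_value) or None."""
    nets = dict(values)
    if fault and fault[0] in nets:          # stuck-at fault on a primary input
        nets[fault[0]] = fault[1]
    for out, op, i1, i2 in gates:
        value = OPS[op](nets[i1], nets[i2])
        nets[out] = fault[1] if fault and fault[0] == out else value
    return tuple(nets[o] for o in outputs)

def lfsr16(state):
    """One step of a 16-bit Fibonacci LFSR, taps 16, 14, 13, 11."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def coverage(bits=4, patterns=32, seed=0xACE1):
    """Fraction of stuck-at faults detected by `patterns` LFSR patterns."""
    gates, inputs, outputs = build_adder(bits)
    nets = inputs + [g[0] for g in gates]
    faults = {(net, v) for net in nets for v in (0, 1)}
    detected, state = set(), seed
    for _ in range(patterns):
        state = lfsr16(state)
        values = {name: (state >> k) & 1 for k, name in enumerate(inputs)}
        good = simulate(gates, outputs, values)
        for fault in faults - detected:
            if simulate(gates, outputs, values, fault) != good:
                detected.add(fault)
    return len(detected) / len(faults)

for n in (1, 4, 16, 64):
    print(n, "patterns:", round(coverage(patterns=n), 3))
```

The same loop, run once on the fault-free model while feeding the outputs into the signature register, would give the reference signature to store for the boot-time check. A real netlist is far larger, which is why this "takes a while".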

So, in production, we can resort to a second level of checks,
using specific code loaded from outside in the cache.
but no need to do these extensive checks at every reset, right ?
Once a chip is considered "good", it usually remains so for a long while.

Oh I even foresee chips testing themselves continuously during operation.

hmmm you want to run zOS on top of that ? :-P

But I agree that this is not yet necessary, and that a random test plus a very short deterministic functional test (SBST) is the most cost-efficient option for F-CPU.

Although in the long-term I'd skip the random test as it generates too much heat!

well, it's a stress-test too, right ? ;-) and if it runs for 100K cycles, it's just a fraction of a second at real speed, so the "total" quantity of heat can be negligible (particularly if the test is running at power-up, when the chip is cold). no worries here.

i've read some papers about it, years ago.
Some even proposed to include specific instructions
to help "boost" the coverage.
Anyway, the POPCOUNT unit is an integral part
of the system. It is not only useful for crypto, it also
helps with signature compression (it takes 2 operands
from the data bus, XORs them together, yields a 6-bit
result, and this goes to "disturb" a "freewheel" 64-bit LFSR,
which can also serve as a weak pseudo-random generator
in practice).
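
A minimal Python sketch of that signature scheme, to make the data flow concrete. Several details are assumptions, since the thread does not specify them: the LFSR taps (64, 63, 61, 60), the seed, the choice to XOR the count into the low bits of the LFSR, and masking the count to 6 bits (so a full 64-bit mismatch wraps to 0). The function names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def lfsr_step(state):
    """Advance a 64-bit Fibonacci LFSR one cycle (assumed taps 64, 63, 61, 60)."""
    bit = ((state >> 63) ^ (state >> 62) ^ (state >> 60) ^ (state >> 59)) & 1
    return ((state << 1) | bit) & MASK64

def popcount6(a, b):
    """XOR two 64-bit operands and count the set bits, truncated to 6 bits."""
    return bin((a ^ b) & MASK64).count("1") & 0x3F

def signature(operand_pairs, seed=1):
    """Fold a stream of operand pairs into a 64-bit signature."""
    state = seed
    for a, b in operand_pairs:
        state = lfsr_step(state)      # the LFSR free-wheels every cycle
        state ^= popcount6(a, b)      # the 6-bit count disturbs its low bits
    return state

pairs = [(0x0123456789ABCDEF, 0xFEDCBA9876543210), (5, 3), (0, MASK64)]
print(hex(signature(pairs)))
```

At the end of the BIST, the final value would be compared against a signature computed beforehand on a fault-free model. In this sketch a single flipped operand bit always changes the result, because the count changes by one and every later step is invertible.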

I want to get rid of random tests, but as written above it's a valid compromise for now.

at school, i learnt to make boundary-scan BISTs. this is a 10-year-old technology that consumes space and time (particularly annoying because FC0 has very short pipeline stages, so adding boundary scan would drop the clock frequency). So i came up with the current method, built on the existing POPCOUNT and free-wheel LFSR. They used to be optional because they were only a security feature, but they are now necessary for BIST. This may change if we find something better for FC1, but a freewheel LFSR as well as hamming distance is going to make F-CPU really attractive to some people, so i don't expect these features to disappear soon...

I'm even thinking about putting a simple RISC like a LEON on die as well and let it handle the I/O and selftest. If done it switches to I/O pass through mode :)

i'm working on http://f-cpu.seul.org/whygee/VSP/ which can do exactly that. But not on the same die.

nice :) Why not on the same die?

it's a matter of separating unrelated functions.

VSP is designed for slow stuff, while FC0 is for bandwidth.
so a FC0 chip will contain its private memory interface,
a front-side bus for communicating with other CPUs (think
of a single hypertransport-like) and a dedicated function
(might be 1000BaseT MACs, PATA/SATA, SCSI,
a PCI bridge, a video framebuffer, etc.).
All these work at high speed.

Now, if something slow must be interfaced
(keyboard ? mouse ?) there is no reason to let
that interfere with the FC0 directly. So the VSP
handles that (sparing a costly spurious context switch to the FC0)
and enqueues a message to the CPU that will deal with
user interaction. Keymaps and all that stuff can be processed
in the VSP and message-passing is more suitable than interrupts
for such things.
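
A small Python sketch of the message-passing idea, for illustration only. The mailbox, its capacity, the message format and the scancode table are all hypothetical; nothing here reflects the actual VSP design.

```python
from collections import deque

# Hypothetical scancode-to-key table, resolved on the slow I/O processor.
KEYMAP = {0x1C: "a", 0x32: "b", 0x5A: "\n"}

class Mailbox:
    """Bounded FIFO between the slow I/O processor and the main CPU."""

    def __init__(self, capacity=16):
        self.queue = deque()
        self.capacity = capacity

    def post(self, message):
        """I/O-processor side: returns False when the mailbox is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(message)
        return True

    def drain(self):
        """CPU side: consume pending messages at a time of its own choosing."""
        while self.queue:
            yield self.queue.popleft()

def vsp_handle_scancode(mailbox, scancode):
    """Translate a raw scancode and enqueue it; unknown codes are dropped."""
    key = KEYMAP.get(scancode)
    if key is None:
        return False
    return mailbox.post(("key", key))

mailbox = Mailbox()
for code in (0x1C, 0xFF, 0x5A):
    vsp_handle_scancode(mailbox, code)
print(list(mailbox.drain()))
```

The main CPU never takes an interrupt per keystroke here: it only sees already-translated messages when it decides to drain the queue.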

I really don't see why we need a "companion" scaled-down CPU
"on the same die" as FC0. As the F-CPU model evolves from
"a big central CPU" to "a scalable swarm of simple, cheap and fast separated chips",
all the different functions can be split and accessed independently,
so we can build a very flexible and "hackable" platform
(in the sense that there are more things to do with a soldering iron
than with a PC-like central architecture).

And don't tell me: "OMG. So much die space wasted!" If F-CPU is to be high end then the size of a wasted LEON is almost 0 in comparison!

it's not a matter of die space. F-CPU is not /that/ large compared to today's cores. It's a matter of I/O. Multicore dies are fun as long as one is not limited by memory bandwidth and pin count. F-CPU is meant to be cheap, so this matters a lot.

Why should it be limited by pins and BW?

because of cost. large dies don't account for all the price. packaging can be extremely expensive. I have seen figures like 1$ / pin.

I said high performance and I mean it. 1000 pins + >10GB BW.

hmmm the price of a 1K-pin FPGA is beyond most people's pockets. AMD and Intel can drive that low thanks to huuuge volumes and integrated fabs. If you're rich, no problem.

however, as you have seen, FC0 and F-CPU in general is just a core.
you can use this to build more complex chips, as long as you provide
a specific set of signals to its interfaces (clock, power, and data/addresses to the bus).

The fastest and easiest way to "consume" your 1K pins
is to widen the buses. Easy. However, half of your 1K pins
will be used for power (Vcc and gnd). This leaves you with
a 256 bit wide private memory bus, SCSI or PATA, and the interco bus.

If you can afford only 250 or 300 pins for the package,
what is better ? A large die with several cores (expensive
because exponentially more prone to defects) which compete
to access external memory ? Or a cheaper, smaller die (well,
it's just a consequence of only one core) with all the memory
bandwidth for itself alone ? If one is going multi-core,
the second solution looks better to me : cheaper, more scalable
(you can tune how many CPUs you want), and all the cores have
their 'private' memory bandwidths, which becomes scalable
(just add modules containing CPU+memory).
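
The "exponentially more prone to defects" argument can be sketched with the simple Poisson yield model, Y = exp(-A * D), where A is the die area and D the defect density. The areas, the defect density and the cost unit below are made-up numbers for illustration, and packaging, test and wafer-edge losses are ignored.

```python
import math

def poisson_yield(area_cm2, defects_per_cm2):
    """Fraction of defect-free dies under the Poisson yield model."""
    return math.exp(-area_cm2 * defects_per_cm2)

def silicon_cost_per_good_die(area_cm2, defects_per_cm2, cost_per_cm2=1.0):
    """Silicon cost of one working die, in arbitrary units."""
    return area_cm2 * cost_per_cm2 / poisson_yield(area_cm2, defects_per_cm2)

D = 0.5                                        # assumed defects per cm^2
single = silicon_cost_per_good_die(0.5, D)     # assumed one-core die, 0.5 cm^2
quad = silicon_cost_per_good_die(2.0, D)       # assumed four-core die, 2.0 cm^2

print("four single-core dies:", round(4 * single, 3))
print("one four-core die:    ", round(quad, 3))
print("ratio:", round(quad / (4 * single), 3))
```

With these assumed figures the four-core die costs roughly twice as much silicon per working part as four separate single-core dies, before the pin-count and memory-bandwidth arguments are even considered.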

Of course we have to consider the constraints. The prototypes won't have as many pins and BW as I suggested above.

sure. but you know how things go : look at the Alpha. I believe that the FC0 can reach success because we hit a price/complexity/performance point that is on par with other proprietary "embedded processors" (made of ARM (there are 64-bit versions), MIPS, SH-x or others). These chips sell for 5 to 10$.

If F-CPU is designed for being a "big iron" from the start,
it will never reach the "embedded CPU" price point,
unless it is considerably stripped.

If you have spare die space, just boost the L2 :-)
(i have an idea about how to make this scalable, fast, multiport
and more importantly : defect and fault tolerant :-P)

nice :D Do tell.

simple, given the constraints :
- maximize die utilization despite fab defects
- one SRAM array is used or provided by the fab (more could cost more in royalties)

So the idea is to reuse the L1 arrays (there are data and instructions caches already).

* These arrays are connected to several buses, in case one branch of one bus is defective.

* Several buses mean multiple simultaneous paths to multiple memories,
so the architecture looks like a multiport array from the outside.

* in the fab, the tests will check which arrays are defective. when one array is found faulty,
a fuse will remove Vcc to avoid shorts and reduce consumption. The bus driver for this
block is also disabled to avoid troubles.

* This also lets us "optimize" the L2 quantity depending on the foundry parameters.
we're not bound to 2^N architectures, but simply multiples of the L1 array size
(so the die surface is optimally filled).

* from there, we can also tune power consumption with software-accessible
bits to turn power on/off for each L2 sub-block.

It is certainly going to be less efficient than specific L2 SRAM cells
(which are designed for power efficiency and compactness). But it's an interesting
thing to try anyway.
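
The list above can be sketched as a small Python model. The class name, the bank size and the linear address mapping are assumptions made for the example; the thread does not say how addresses would be spread over the surviving arrays.

```python
class BankedL2:
    """L2 built from identical L1-sized banks, some of which may be fused off."""

    def __init__(self, fuse_ok, bank_bytes=8192):
        # fuse_ok[i] is False when the factory test blew the fuse of bank i.
        self.bank_bytes = bank_bytes
        self.good_banks = [i for i, ok in enumerate(fuse_ok) if ok]
        # Software-accessible power bits, one per surviving bank.
        self.powered = {i: True for i in self.good_banks}

    def active_banks(self):
        return [i for i in self.good_banks if self.powered[i]]

    @property
    def capacity(self):
        """Usable size: a multiple of the bank size, not necessarily 2^N."""
        return len(self.active_banks()) * self.bank_bytes

    def set_power(self, bank, on):
        if bank not in self.powered:
            raise ValueError("bank is fused off")
        self.powered[bank] = on

    def locate(self, offset):
        """Map a linear offset to (physical bank, offset inside that bank)."""
        index, inner = divmod(offset, self.bank_bytes)
        active = self.active_banks()
        if index >= len(active):
            raise IndexError("offset beyond usable L2 capacity")
        return active[index], inner

# Seven banks on the die, bank 1 found faulty at the fab.
l2 = BankedL2([True, False, True, True, True, True, True], bank_bytes=1024)
print(l2.capacity, l2.locate(1030))
l2.set_power(3, False)          # software switches one bank off to save power
print(l2.capacity, l2.locate(2048))
```

The usable capacity is simply the number of live banks times the bank size, which is how a die with six good arrays out of seven still ships with a working, if oddly sized, L2.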

You know that the largest bottleneck in recent CPUs
is the external memory bandwidth. If you add more cores
on die, you had better be running CPU-bound code.
but most of today's code is memory-bound.

There is some CPU-bound code as well. And latency bound code. What a variety! :P

yup, but the memory speed does not increase as fast as CPU speed.

However, nobody will come after you if you put
X FC0s and Y LEONs on the same die/FPGA
(whatever X and Y). I simply wonder how you will feed
them with instructions and data without resorting to
expensive multi-chip modules :-P

With a latency tolerant design we won't need huge caches and huge bandwidth.

"latency tolerance"... tell that to end users ;-P and you ALWAYS need bandwidth (even, and particularly, when you try to hide latency). "You can't fake bandwidth" (Seymour C.)

Sad thing that F-CPU is not AFAIK.

FC0 (the current core) is designed for simplicity and raw computations. FC1 may be more subtle than that (i have some ideas).

But it is SIMPLE. And that's good. A very good test case for free HW development.

test case ? it's not a test IMHO.... particularly because we have to reinvent everything from scratch.

bis besser,

I'm going away for the next days, have fun !


