
Re: [f-cpu] Verification, Testing, And Random Numbers



Michael Riepe wrote:

[I hope the mail doesn't bounce again]
nope :-)

Hi F-gang!

You seem to be talking about a lot of different things at the same time: a) verification of a design and/or its actual implementation (in VHDL), b) testing the generated circuit (in silicon), c) the famous F-CPU POST, and d) the use of random numbers.
It seems that the discussion is slowly spreading to
related issues, as usual :-)

The second problem with random number tests is that their "error coverage" is too low. You only check a small percentage of the 2**64 (or more) possible input variants, but errors may hide in any of them, so you probably won't catch them all. This leads to Pentium FDIV bugs and the like.

Functional symbolic verification would be the real thing. It transforms both the operands and the result into a special representation (using BDDs or one of their derivatives), performs the operation in either representation, and compares the results. You need only a single run per operation, and the result is as reliable as that of an exhaustive test, only much faster. But there are no tools for it (yet).
Currently, we can't do that inside the development process, as there is no suitable Free Software for it.
One could run a test once in a while at his company, but that's a bottleneck, to say the least.
The best we can do today is to write handcrafted, specific tests, and that is currently enough to go forward,
at least for the EUs. The sequential code for the scheduler and even the POST control signal generator
will be a tougher problem.

Let's come to b), testing. There's no time for doing an exhaustive test: even at 1000 GHz it would take 2**128 / 10**12 = ~3*10**26 seconds to test all variants of a single binary operation like `add'. A year has only ~3*10**7 seconds, so you would have to wait 10**19 years. Are any immortals listening here?
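
The estimate above can be re-checked with a few lines of Python. This is only a sketch of the arithmetic: the function names are made up for illustration, and it assumes one test vector per clock cycle at 10**12 Hz.

```python
# Back-of-the-envelope check of the exhaustive-test estimate.
# Assumption: one test vector per clock cycle at 1000 GHz = 10**12 Hz.
OPERAND_BITS = 64
TESTS_PER_SECOND = 10**12
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # roughly 3 * 10**7


def exhaustive_test_seconds(operand_bits=OPERAND_BITS, rate=TESTS_PER_SECOND):
    """Seconds needed to apply every operand pair of a binary operation."""
    vectors = 2 ** (2 * operand_bits)    # two operands -> 2**128 combinations
    return vectors / rate


def exhaustive_test_years(operand_bits=OPERAND_BITS, rate=TESTS_PER_SECOND):
    """Same duration expressed in years."""
    return exhaustive_test_seconds(operand_bits, rate) / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"{exhaustive_test_seconds():.1e} seconds")
    print(f"{exhaustive_test_years():.1e} years")
```

Both figures land on the orders of magnitude quoted above (10**26 seconds, 10**19 years).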

The usual way is to use one of the common "fault models" and build as many test vectors as you need to test for all detectable faults of a unit. But this requires that you know its actual implementation, which is not available before synthesis. Actually, this is another case of a "systematic" test which is fast but not fully reliable.
The best I have found is the POPCOUNT+LFSR trick. It integrates well
in the core and provides reusable functions without slowing down the clock
(usually, this is done with special, larger and slower boundary scan pipeline latches, and
the F-CPU POST doesn't need them).

Of course, the POST patterns must be correctly designed: we must ensure that the
patterns provide good coverage. It is possible to "instrument" the VHDL code,
making "wrappers" around the unit to verify that enough bits are toggled and to collect
some statistics, but the real efficiency must be checked after synthesis using real
coverage tools :-/

As a consequence, we can't give a formal specification of the LFSR,
not even its polynomial, because it would vary a lot with the included features and
the process....

Random numbers may work for boundary scan testing as well, but they still have the same problems as with verification: there is no "golden" code, and the error coverage is too low.
There are "ATPG" tools (automatic test pattern generators), but they depend on the synthesised design.
So let's just admit that there is no one-size-fits-all test/verification method,
and we can't do all the testing in a portable, generic way.

c) The famous F-CPU POST can't use an exhaustive test, nor can it use test vectors (there is not enough room on the chip - and, after all, errors may occur anywhere, even in the ROM that would hold the vectors). It doesn't make sense to use random numbers, because there is no way to verify the result of an operation - putting "golden" code on a chip is simply impossible.

The only way to perform a reasonable POST is to start from a well-known state, perform a well-known sequence of operations - that is, execute a "built-in" test program -, and compare the result with a well-known value. How that comparison is done is more or less a matter of taste; the LFSR approach outlined by Yann is one possible solution (reminds me of a CRC generator, by the way).
Well, it's more or less that, since modern CRCs use an LFSR instead of simply
adding numbers together (which can easily be defeated). Software-based CRCs
use a table that is generated by an LFSR; I once found a very good document
on the web about this...
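
To make the CRC remark concrete, here is a small Python sketch. It uses the common CRC-32 (the reflected polynomial 0xEDB88320, as in zlib) purely as an example; nothing here is part of the F-CPU design. The bit-serial version is an LFSR stepped once per input bit, the table version precomputes eight such steps for each byte value, and a plain additive checksum is included to show how easily it is defeated.

```python
import zlib

POLY = 0xEDB88320  # reflected CRC-32 polynomial, the one zlib uses


def lfsr_step8(value):
    """Clock the CRC shift register (a Galois LFSR) eight times."""
    for _ in range(8):
        value = (value >> 1) ^ (POLY if value & 1 else 0)
    return value


def crc32_bitwise(data):
    """CRC-32 computed bit-serially, the way a hardware LFSR would do it."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = lfsr_step8(crc ^ byte)
    return crc ^ 0xFFFFFFFF


# Software shortcut: the table is simply the LFSR run for every byte value.
TABLE = [lfsr_step8(n) for n in range(256)]


def crc32_table(data):
    """Table-driven CRC-32: one lookup per byte instead of eight shifts."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF


def additive_checksum(data):
    """Naive checksum: sum of the bytes modulo 2**32."""
    return sum(data) & 0xFFFFFFFF


if __name__ == "__main__":
    msg = b"123456789"
    print(hex(crc32_bitwise(msg)), hex(crc32_table(msg)), hex(zlib.crc32(msg)))
    # Swapping two bytes fools the sum but not the CRC.
    print(additive_checksum(b"ab") == additive_checksum(b"ba"))
    print(crc32_bitwise(b"ab") == crc32_bitwise(b"ba"))
```

The two CRC functions agree with each other and with zlib.crc32, while the additive checksum cannot tell "ab" from "ba".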

The POST also serves to initialise the core to a known state:
due to wiring and fanout issues, not all sequential circuits can have a "RESET", and
a trick is to generate patterns that not only test the units but also leave them in
a well-defined state. I have written the VHDL random code in order to find
which latches need a RESET.

A good example is the register set: routing a RESET line to all 63*64 bits would
be real overkill, increasing delay, area, power etc....
When FC0 is powered up, the POST considers each register value as unknown.
It will write values that are on the Xbar's result bus to every register, then read
the registers back to cycle them through the EUs, write the results back, etc.
Of course, the results will be "monitored" by the POPCOUNT+LFSR, and we
don't care where the error is, we just need to notice single-bit errors (at least).
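
As an illustration only, here is one way such a POPCOUNT+LFSR monitor could behave, written as a Python model. Everything specific in it is an assumption: the 16-bit width, the tap mask 0xB400, the seed, and the way the bit count is folded into the register are all hypothetical, since the real LFSR is deliberately left unspecified.

```python
MASK64 = (1 << 64) - 1
LFSR_TAPS = 0xB400   # hypothetical 16-bit tap mask, not an F-CPU value
LFSR_SEED = 0xACE1   # hypothetical start value


def popcount(word):
    """Number of set bits on a 64-bit result bus (0..64, fits in 7 bits)."""
    return bin(word & MASK64).count("1")


def lfsr_step(state):
    """One step of a 16-bit Galois LFSR. The step is invertible."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= LFSR_TAPS
    return state


def signature(words, seed=LFSR_SEED):
    """Fold the bit count of every observed word into the LFSR state."""
    state = seed
    for word in words:
        state = lfsr_step(state ^ popcount(word))
    return state


if __name__ == "__main__":
    # Arbitrary stand-in for the values seen on the result bus during POST.
    bus = [(0x9E3779B97F4A7C15 * (i + 1)) & MASK64 for i in range(8)]
    golden = signature(bus)
    faulty = list(bus)
    faulty[3] ^= 1 << 17             # inject a single-bit error
    print(hex(golden), hex(signature(faulty)))
```

In this model a single flipped bit in one word always changes the final signature: it changes the bit count by one, and the LFSR step is invertible, so the two runs can never merge again. The compaction is lossy, though: errors that keep the bit count of a word unchanged go unnoticed, and several errors can cancel each other out.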

By cycling data through the EUs and the registers, we test all the functions:
persistence of data in the register set, signal integrity on the wires, correct
operation of the EUs and, indirectly, proper operation of the POST control signals.
At the end of the POST, a 0 value is used (reading R0 for example) and cycled
everywhere so that only 0 is present on the busses, and it is written back to every register
and SR. The core is then ready to fetch data.
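
The sequence described above can be mimicked with a toy Python model. This is a sketch under loud assumptions, not the FC0 control logic: toy_eu is a made-up stand-in for an execution unit, the bus seed and the number of rounds are arbitrary, and R0 is simply taken to read as zero.

```python
import random

NUM_REGS = 64                 # R0 reads as zero, R1..R63 are real storage
MASK64 = (1 << 64) - 1


def toy_eu(x):
    """Made-up execution unit: rotate left by one bit, then invert."""
    rotated = ((x << 1) | (x >> 63)) & MASK64
    return ~rotated & MASK64


def random_power_up(rng):
    """Register file after power-up: unknown contents, R0 still zero."""
    return [0] + [rng.getrandbits(64) for _ in range(NUM_REGS - 1)]


def run_post(power_up_state, rounds=4, bus_seed=0x0123456789ABCDEF):
    """Return (final registers, values seen on the result bus)."""
    regs = list(power_up_state)
    observed = []
    # Phase 1: overwrite every register with whatever is on the result bus.
    bus = bus_seed
    for r in range(1, NUM_REGS):
        regs[r] = bus
        bus = toy_eu(bus)
    # Phase 2: read back, cycle through the EU, write back.
    for _ in range(rounds):
        for r in range(1, NUM_REGS):
            regs[r] = toy_eu(regs[r])
            observed.append(regs[r])   # what the monitor would see
    # Phase 3: propagate the zero read from R0 to every register.
    for r in range(1, NUM_REGS):
        regs[r] = regs[0]
    return regs, observed


if __name__ == "__main__":
    rng = random.Random(2002)
    regs_a, seen_a = run_post(random_power_up(rng))
    regs_b, seen_b = run_post(random_power_up(rng))
    print(seen_a == seen_b, all(v == 0 for v in regs_a + regs_b))
```

Two different power-up states give the same observed sequence and the same all-zero final state, which is the point of the trick: no RESET line is needed for the registers.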

POST can include other units than Execution Units, for example the TLB, caches
and LSU/Fetcher. Testing the rest (beyond the memory interface) is another story ....

Finally, we come to d), random numbers. If you've listened carefully to what I said above, you will probably come to the conclusion that they're not useful at all for verification, testing, or POST.
Concerning FP units, they can help characterize the units,
particularly when it comes to tracking accuracy and rounding problems,
or other IEEE issues like that.

Q.E.D.

Michael.

