
Re: EU Report (was: Re: [f-cpu] Register set revised)

To: f-cpu@seul.org
Subject: Re: EU Report (was: Re: [f-cpu] Register set revised)
From: Yann Guidon <whygee@f-cpu.org>
Date: Fri, 21 Mar 2003 08:40:07 +0100
Delivered-to: archiver@seul.org
Delivered-to: f-cpu-outgoing@seul.org
Delivered-to: f-cpu@seul.org
Delivery-date: Fri, 21 Mar 2003 02:38:41 -0500
Organization: Freedom CPU Project
References: <20030319222448.04918@thrai.stud.uni-hannover.de> <Pine.LNX.4.33.0303201015280.723-100000@devix> <20030320141312.19015@thrai.stud.uni-hannover.de>
Reply-to: f-cpu@seul.org
Sender: owner-f-cpu@seul.org
User-agent: Mozilla/5.0 (Windows; U; Win95; en-US; rv:1.0.0) Gecko/20020530

hi,

(i only reacted after several reads)

Michael Riepe wrote:

On Thu, Mar 20, 2003 at 10:26:32AM +0100, devik wrote:

What exactly is the SAD insn supposed to do?

Sum of Absolute Differences. It takes two SIMD words, computes
abs(a-b) for each pair of corresponding bytes and adds all these differences.

IC. It's a little heavy for a single instruction, but it can be calculated
if there is a `byte adder'. The abs(a-b) part can be calculated with
instructions from the current instruction set, either as `abs(a-b)' or
as `max(a,b)-min(a,b)', whatever is more convenient.
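To pin down the definition, here is a small Python reference model of SAD on one 64-bit word treated as 8 unsigned byte lanes. The lane layout (least significant byte first) and the unsigned interpretation are assumptions; the thread only says "bytes". It also shows that abs(a-b) and max(a,b)-min(a,b) give the same per-lane value when the difference is computed exactly.

```python
def bytes_of(word):
    """Split a 64-bit word into its 8 unsigned byte lanes, least significant first."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]


def sad64(a, b):
    """Sum of absolute differences of the 8 byte lanes of a and b."""
    return sum(abs(x - y) for x, y in zip(bytes_of(a), bytes_of(b)))


def sad64_minmax(a, b):
    """Same result, computed lane-wise as max(a,b) - min(a,b)."""
    return sum(max(x, y) - min(x, y) for x, y in zip(bytes_of(a), bytes_of(b)))


print(sad64(0x00FF, 0xFF00))  # two lanes differ by 255 each
```

This model computes each difference with full precision. If a SIMD implementation instead subtracted unsigned bytes modulo 256 and then applied a signed abs, lanes whose true difference exceeds 128 would come out wrong (200 - 0 wraps to -56, whose abs is 56), whereas max-min never leaves the 0..255 range.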

currently, for FC0, the computation time is directly proportional to the instruction count
(if there is no stall). in this case, "abs(a-b)" takes 2 instructions and "max(a,b)-min(a,b)"
takes 3 instructions (and cycles), even though the min and the max can be computed in parallel.

The SAD is in fact doing 3 things: subtraction, absolute value and finally the parallel addition.
BTW, i have seen instruction sets with the first 2 operations but NEVER with the parallel
addition.
Provided that scheduling interleaves enough work to avoid stalls, it is possible to sustain
the computation of subtractions and absolute values at full speed (as long as the core
can get the data). The parallel addition can be deferred to the end of the loop:
usually, it's an 8*8 block and the results can first be accumulated in columns,
then the final results are combined in the last row with a few add/shifts.
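The column-accumulation scheme can be sketched in Python as follows. Each block is assumed to be 8 rows of 64-bit words holding 8 unsigned pixels; the 16-bit accumulator lanes and the 128-bit packed value are modeling choices for illustration, not FC0 details. The inner loop only does per-lane work, and a single add/shift fold at the end performs the horizontal reduction.

```python
def lanes(word):
    """Split a 64-bit row into 8 unsigned byte lanes, least significant first."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]


def sad_8x8(block_a, block_b):
    """SAD of two 8x8 blocks, each given as 8 rows of 64-bit words."""
    # Inner loop: per-lane |a-b| accumulated into 8 column sums, no reduction yet.
    columns = [0] * 8
    for row_a, row_b in zip(block_a, block_b):
        for i, (x, y) in enumerate(zip(lanes(row_a), lanes(row_b))):
            columns[i] += max(x, y) - min(x, y)
    # Each column is at most 8 * 255 = 2040, so 16-bit lanes cannot overflow.
    packed = 0
    for i, c in enumerate(columns):
        packed |= c << (16 * i)  # 8 lanes of 16 bits = 128 bits
    # Epilogue: repeatedly add the upper half onto the lower half (add/shift fold).
    half = 64
    while half >= 16:
        packed = (packed & ((1 << half) - 1)) + (packed >> half)
        half //= 2
    return packed  # at most 64 * 255 = 16320, still fits in one 16-bit lane
```

The fold never carries across lanes because every partial sum stays below 2^16, which is what makes deferring the reduction to the last step safe.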

This is used when computing motion vectors in MPEG encoders.

note here : it's for the ENcoder, not the DEcoder/player.
and it's not for MPEG1.

I've seen it in several instruction sets. However, I don't know where
it is useful outside of MPEG.
The byte (or chunk) adder will also be useful in vector computations.

for what kinds of algorithms ???

But I doubt that we will have a chunk adder that works with FP numbers.

heh ...

In any case, the chunks of a word can be combined by using `mix':

mix.8 r0, r1, r2 // distributes the chunks across r2 and r3
add.16 r2, r3, r1
mix.16 r0, r1, r2
add.32 r2, r3, r1
mix.32 r0, r1, r2
add.64 r2, r3, r1 // gotcha!

:-)

This will also work with other commutative operations, e.g. mul. A `chunk
add' insn may be more convenient, however (and will also be much faster).
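The mix/add sequence above can be modeled in Python to check that it yields the horizontal sum of the 8 bytes. The exact semantics of F-CPU's `mix` are not spelled out in this thread, so the hypothetical `mix_with_zero` helper below assumes that `mix.<w> r0, r1, r2` zero-extends every w-bit chunk of r1 to 2w bits, putting the lower half of the chunks in r2 and the upper half in r3. As long as every chunk ends up zero-extended, the particular pairing does not change the final sum, since addition is commutative and associative.

```python
def chunks(word, width):
    """Split a 64-bit word into little-endian chunks of `width` bits."""
    return [(word >> (i * width)) & ((1 << width) - 1) for i in range(64 // width)]


def pack(values, width):
    """Inverse of chunks(): pack values (taken modulo 2^width) into one word."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & ((1 << width) - 1)) << (i * width)
    return word


def mix_with_zero(word, width):
    """Assumed model of `mix.<width> r0, r1, r2`: zero-extend each chunk to
    2*width bits; lower half of the chunks -> r2, upper half -> r3."""
    c = chunks(word, width)
    half = len(c) // 2
    return pack(c[:half], 2 * width), pack(c[half:], 2 * width)


def add_chunks(x, y, width):
    """SIMD add of corresponding `width`-bit chunks, modulo 2^width."""
    return pack([a + b for a, b in zip(chunks(x, width), chunks(y, width))], width)


def horizontal_byte_sum(r1):
    """mix.8/add.16, mix.16/add.32, mix.32/add.64, as in the sequence above."""
    for width in (8, 16, 32):
        r2, r3 = mix_with_zero(r1, width)
        r1 = add_chunks(r2, r3, 2 * width)
    return r1


print(horizontal_byte_sum(0x0102030405060708))  # sum of the bytes 1..8
```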

probably, but not used often enough.

tmp = A ± (B + lsb(C))
Y = tmp % pow(2, chunksize)
Z = tmp / pow(2, chunksize)

Since that's a 3r2w operation, don't expect it to be implemented in FC0.
It would be quite useful for adding/subtracting big numbers, however.
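Here is a Python sketch of that formula and of how it chains for big numbers. Reading `/` as floor division is an assumption: it makes Z equal 1 on a carry and -1 (all ones as a register value) on a borrow, and in both cases lsb(Z) is exactly what the next chunk needs as C. The names `chunk_add_sub`, `bignum_add_sub`, `to_limbs` and `from_limbs` are hypothetical; limbs are stored least significant first.

```python
def chunk_add_sub(a, b, c, subtract=False, chunksize=64):
    """Model of the proposed 3r2w insn: tmp = A +/- (B + lsb(C)),
    Y = tmp mod 2^chunksize, Z = floor(tmp / 2^chunksize) as a register value."""
    mask = (1 << chunksize) - 1
    carry_in = c & 1  # lsb(C)
    tmp = a - (b + carry_in) if subtract else a + (b + carry_in)
    y = tmp & mask                 # low chunk of the result
    z = (tmp >> chunksize) & mask  # 1 on carry, all ones on borrow, else 0
    return y, z


def bignum_add_sub(xs, ys, subtract=False, chunksize=64):
    """Add or subtract two equal-length limb lists, feeding each Z into the next C."""
    out, z = [], 0
    for a, b in zip(xs, ys):
        y, z = chunk_add_sub(a, b, z, subtract, chunksize)
        out.append(y)
    return out, z


def to_limbs(n, count, chunksize=64):
    return [(n >> (i * chunksize)) & ((1 << chunksize) - 1) for i in range(count)]


def from_limbs(limbs, chunksize=64):
    return sum(limb << (i * chunksize) for i, limb in enumerate(limbs))


limbs, carry = bignum_add_sub(to_limbs(2**64 - 1, 2), to_limbs(1, 2))
print(from_limbs(limbs) == 2**64, carry)
```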

seems useful. Big number libs (gmp, rsalib...) often have to resort
to tricks, mostly related to carry propagation. The GMP manual
has a lot of information on the topic.

Yep, I know. I've done things like that in the emulator, too.

The problem with this instruction is that we only have three register
number fields in the instruction word. r1 and r1^1 will be the outputs
Y and Z, r2 and r3 will be A and B, respectively -- but where does C
come from? My best guess is to use r1 for that as well -- but then
we'll have to move away r1^1 (which typically contains the result of
the last chunk computed) first.
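The register-pair problem can be illustrated with a toy register file. Under the "use r1 for C as well" guess, the hypothetical `addc` below reads C from r1, writes Y to r1 and Z to r1^1. Chaining limbs then means using the register that holds the carry (r1^1) as the next r1, so the next instruction's Z lands on the register holding the previous Y; that Y has to be moved away first, which is the extra move in the loop. Register numbers are arbitrary and loads are not modeled.

```python
MASK = (1 << 64) - 1


def addc(regs, r1, r2, r3):
    """Hypothetical 3r2w add: C from lsb(r1), A from r2, B from r3.
    Y goes to r1, Z (the carry) goes to r1 ^ 1."""
    tmp = regs[r2] + regs[r3] + (regs[r1] & 1)
    regs[r1] = tmp & MASK
    regs[r1 ^ 1] = tmp >> 64


def add_limbs(xs, ys):
    """Chain addc over little-endian limbs; returns (result limbs, final carry)."""
    regs = [0] * 64
    out = []
    dest = 2  # regs[2] == 0, so the first carry-in is 0
    for a, b in zip(xs, ys):
        regs[10], regs[11] = a, b  # operands A and B
        addc(regs, dest, 10, 11)
        out.append(regs[dest])  # move Y away before the next addc overwrites it
        dest ^= 1               # the carry sits in dest^1: it becomes the next r1
    return out, regs[dest]
```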

Wouldn't it be possible to do CMP in the ASU too? Both work by
"propagating" information from the LSB toward the MSB...

The ASU could compare operands (at least in unsigned mode), but it
won't perform the other operations that EU_CMP supports: min/max/sort
and msb0/msb1 aren't possible with just an adder. And since the adder
is busy enough (remember the instruction census?), it's probably better
to leave these operations where they are. That does not mean that you
can't use `subb' to compare operands for unsigned less-than, of course.
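A short Python sketch of why an adder gives unsigned comparison but not min/max on its own: the borrow-out of a - b is exactly the unsigned a < b flag, while min needs an extra select driven by that flag. The helper names are hypothetical, and the assumption that `subb` exposes the borrow-out is not confirmed by this thread.

```python
MASK64 = (1 << 64) - 1


def sub_with_borrow(a, b, borrow_in=0):
    """64-bit a - b - borrow_in; returns (difference, borrow-out)."""
    tmp = a - b - borrow_in
    return tmp & MASK64, 1 if tmp < 0 else 0


def unsigned_less(a, b):
    """a < b (unsigned) is exactly the borrow-out of a - b."""
    return sub_with_borrow(a, b)[1] == 1


def unsigned_min(a, b):
    """The adder only yields the flag; picking the operand needs a separate select."""
    return a if unsigned_less(a, b) else b
```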

this confirms what i said earlier : it's interesting to have many ways to perform the
same computation, because the exact behaviour varies a bit and can have some benefits
depending on the exact circumstances...

Another question: is the new source available? I'd like to
take a look :)

I'll make a new release of my VHDL sources as soon as I've finished my
CeBIT reports for next month's iX issue. That is, by the end of March.

cool.
then, i'll have to clean up my files ....

YG (playing with ultra-low power electronics)


*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/

Follow-Ups:
- Re: EU Report (was: Re: [f-cpu] Register set revised)
  - From: Michael Riepe <michael@stud.uni-hannover.de>

References:
- EU Report (was: Re: [f-cpu] Register set revised)
  - From: Michael Riepe <michael@stud.uni-hannover.de>
- Re: EU Report (was: Re: [f-cpu] Register set revised)
  - From: devik <devik@cdi.cz>
- Re: EU Report (was: Re: [f-cpu] Register set revised)
  - From: Michael Riepe <michael@stud.uni-hannover.de>
