
Re: [f-cpu] New EU_SHL Instruction



hi,

Michael Riepe wrote:

On Thu, Jan 09, 2003 at 01:59:35AM +0100, Yann Guidon wrote:
[...]

and_reduce (or "combine" as written in ROP2) is not possible
for very wide data.

Furthermore, the xorn.and trick is useful for "detecting" that a byte
corresponds, but if you need to find the index of the character,
the "obvious" answer is to loop over the register.
if you have a result of 0x00FF000000000000, it's not a good solution.
So the idea is to "transpose" the bits in the word, which would become
0x4040404040404040, and the last byte can then be binary encoded
with INC (if it's implemented).
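
To illustrate the transposition described above, here is a minimal Python sketch. It assumes the 64-bit word is viewed as an 8x8 bit matrix (one row per byte, one column per bit), which is consistent with the 0x00FF... -> 0x4040... example but is not taken from any instruction definition. The function name transpose8x8 is made up for this illustration.

```python
def transpose8x8(x: int) -> int:
    """Transpose a 64-bit word viewed as an 8x8 bit matrix.

    Assumption: row = byte index (0 = least significant byte),
    column = bit index inside that byte.
    """
    result = 0
    for row in range(8):
        for col in range(8):
            if (x >> (8 * row + col)) & 1:
                # bit (row, col) moves to (col, row)
                result |= 1 << (8 * col + row)
    return result


word = 0x00FF000000000000          # byte 6 is the one that matched
t = transpose8x8(word)
print(hex(t))                      # 0x4040404040404040

# Every byte of the transposed word now carries the same one-hot pattern,
# so the index of the matching byte can be read from the lowest byte alone.
index = (t & 0xFF).bit_length() - 1
print(index)                       # 6
```

Applying the function twice gives back the original word, which is the reversibility mentioned further down.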

Wouldn't it be sufficient to `collapse' each chunk into a single bit?
That is, if the chunk's value is not zero, the corresponding bit will
be set, otherwise it will be zero:

	r2 = 0xab00cd00ef0000
	collapse.b r2, r1
	r1 <= 0x54
	collapse.d r2, r1
	r1 <= 0x0e

and so on.
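
The proposed behaviour can be modelled in a few lines of Python. This is only a sketch of the semantics as read from the example values: it assumes .b means 8-bit chunks and .d means 16-bit chunks, a 64-bit register, and chunk 0 being the least significant one. The function name collapse and its parameters are hypothetical.

```python
def collapse(value: int, chunk_bits: int, width: int = 64) -> int:
    """Set result bit i when chunk i of value is non-zero (chunk 0 = LSB side)."""
    mask = (1 << chunk_bits) - 1
    result = 0
    for i in range(width // chunk_bits):
        if (value >> (i * chunk_bits)) & mask:
            result |= 1 << i
    return result


r2 = 0x00AB00CD00EF0000
print(hex(collapse(r2, 8)))    # 0x54  (collapse.b in the example)
print(hex(collapse(r2, 16)))   # 0xe   (collapse.d in the example)
```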

no i don't think it's interesting enough since it is not reversible
and loses most bits. you should also propose the AND and OR modes,
to make it more useful.

Another use of the "transpose" instruction is when
you want to perform boolean operations (like a programmed lookup table)
on the consecutive bits in a register. For example, imagine
that the tables of a DES round can be translated into a succession
of ROP2 instructions. Since the tables are the same in the algorithm,
you can use the full width of the registers to compute many lookups
in parallel.

The "transpose" operation helps "split" and "merge" a word into several registers
so each register corresponds to a bit position in the original chunk.
Since the operation is reversible (2 transpositions in a row don't change
the data), only one opcode is needed.

Instead of doing
	for (i=0; i< X; i++)
		a[i] = table[a[i]];

it is possible to do:
	transpose registers
	ROP2 operations
	transpose registers

The first approach is ok when there are only a few lookups,
but when it becomes intensive, it is better to compute
the operations in parallel. It is less sensitive to the
memory architecture and latency, and the ROP2 unit
allows powerful simplifications compared to other CPUs
(there are 8 2R1W operations and 1 3R1W operation)
so the cost is low. With 64 registers, even 64-bit wide,
8-input 8-output boolean operations can be performed.
The lookup/instruction ratio increases with the register width.
The only annoying thing is that the lookup table must
be known at compile time ....
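
The transpose / boolean ops / transpose sequence can be sketched in Python on a toy case. Everything here is an assumption made for illustration: the 2-bit TABLE is invented, split_planes and merge_planes are software stand-ins for the transpose instruction, and the two boolean expressions were derived by hand from TABLE (that is the "known at compile time" part). Plain Python integer operators stand in for ROP2.

```python
TABLE = [2, 3, 1, 0]   # hypothetical 2-bit lookup table, fixed ahead of time


def split_planes(values, nbits):
    """Stand-in for 'transpose': planes[b] holds bit b of every element,
    with element i stored at bit position i."""
    planes = [0] * nbits
    for i, v in enumerate(values):
        for b in range(nbits):
            planes[b] |= ((v >> b) & 1) << i
    return planes


def merge_planes(planes, count):
    """Inverse of split_planes: rebuild the elements from the bit planes."""
    return [sum(((p >> i) & 1) << b for b, p in enumerate(planes))
            for i in range(count)]


def bitsliced_lookup(values):
    n = len(values)
    mask = (1 << n) - 1
    p0, p1 = split_planes(values, 2)
    # Boolean form of TABLE, worked out by hand:
    #   out bit 0 = in0 XOR in1,  out bit 1 = NOT in1
    out0 = p0 ^ p1
    out1 = ~p1 & mask
    return merge_planes([out0, out1], n)


data = [0, 1, 2, 3, 3, 2, 1, 0]
print(bitsliced_lookup(data))          # same result as the loop below
print([TABLE[a] for a in data])
```

Two boolean operations handle all the elements at once, whatever their number, which is why the lookup/instruction ratio grows with the register width.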

A complementary `uncollapse' instruction would be nice,
too (it would allow you to generate chunk masks more easily):

	r2 = 0x5a
	uncollapse.b r2, r1
	r1 <= 0x00ff00ffff00ff00
	uncollapse.d r2, r1
	r1 <= 0x0000ffff0000ffffffff0000ffff0000	// yes, that's 128 bits ;)

Something like that is already possible with SDUP and ROP2
so i don't think it is critical, but if you can implement it for free, why not...
but it should be based on the sdup instruction, so you can "address"
the chunk that you want to duplicate.
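
For reference, the uncollapse semantics from the example above can be modelled like this in Python. It is a sketch under the same assumptions as the example values suggest (.b = 8-bit chunks, .d = 16-bit chunks, 8 mask bits, bit 0 controlling the least significant chunk). The function name and parameters are hypothetical, and this does not model SDUP.

```python
def uncollapse(bits: int, chunk_bits: int, chunks: int = 8) -> int:
    """Expand each mask bit into a chunk of all ones or all zeros."""
    ones = (1 << chunk_bits) - 1
    result = 0
    for i in range(chunks):
        if (bits >> i) & 1:
            result |= ones << (i * chunk_bits)
    return result


print(f"{uncollapse(0x5A, 8):016x}")    # 00ff00ffff00ff00
print(f"{uncollapse(0x5A, 16):032x}")   # 0000ffff0000ffffffff0000ffff0000
```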

YG
