
Rep:Re: [f-cpu] SIMD register



-----Original Message-----
From: Yann Guidon <whygee@f-cpu.org>
To: f-cpu@seul.org
Date: 09/04/02
Subject: Re: [f-cpu] SIMD register

hi,

Nicolas Boulay wrote:
> After reading some IDCT MMX code, I don't think it is really easy to use
> SIMD registers without knowing their size. Such an IDCT uses 8-word chunks
> because that fits the data exactly.

if you read Intel's documents, or code designed for Intel computers,
i understand your doubts. However, some Zen, a clear and clean mind
and some patience will help you read the "basic" books in a different
way.

>>> Nope! It's from the ffmpeg project.

I have already started writing DCT code optimised for F-CPU.
It is surprisingly easy when you know a few tricks. More importantly,
I started from code that already looked good, so it was almost
straightforward.

> A study of a theoretical CPU, using a new and simple compilation algorithm
> that emits vector instructions for the SPEC programs, found that performance
> peaks at around 128-256 bit registers (4*64-bit floats). With bigger
> registers, inter-register dependencies increase and code speed DECREASES.
>
> So size-independent code will slow the code down even more.

I don't agree with you because you assume that the study is perfect.
It is based on prototype code, in very specific conditions and the algo
is probably badly chosen. Add to that that the memory system is probably
not adapted, and you see that this is probably a misleading result.

Don't forget that in the past, most people said "32-bit registers
are too wide, we don't need all these bits" or "8 registers are enough
for any algorithm". Since then, the balance and architecture of
computers have radically changed: it's not wise to say that we won't
need 256-bit registers in the future.

>>> That's the all-purpose argument.

If the use of embedded DRAM increases, your study
might well become a geek's joke.

>>> ??? If you use embedded DRAM, you will increase memory bandwidth
(compared to external DRAM), so I don't understand your point.

Most importantly, nobody today writes code that is independent from
the platform (except in C where the size of the ints is unknown).

>>> You should say exactly the opposite: nobody writes code for a
specific platform. That's the success of Java. That's the marketing of
.NET.
When you write in C, you try to be as portable as possible.
You always think about asm, but only for the core of some codec, to
increase speed. Nowadays even DSP code is written in C because of the
complexity of the processing. A good C compiler that can vectorise, such
as the next version of gcc or the future compiler based on Trimaran,
will do the job. So you won't need to write code by hand any more.

So it's easy to say now that size-independent code is not worth it.
However, with a few programming habits, you could write once a
code that can be executed as is and as fast as possible on any compliant
platform. It's "just" a matter of complying with a programming model,
so you don't have to touch old code.

I think that FC0 is easily scalable to 256 bits and 2 instructions per
cycle. When such a CPU is implemented, we will already have started FC1,
I guess. But we will have to deal with 3 kinds of code, if I follow your
idea correctly.
There is however a simpler solution: in computation-intensive code
(or bandwidth-stressing code), use the maximum-width register (in SIMD
mode).

>>> That's binary compatibility. With a good C compiler you don't
need it. Binary compatibility isn't useful in the world of free
software and open source. ;p

Execute a "get SR_MAX_SIZE, rd" and divide your loop counter by rd
(or play with the loop counter, subtracting rd instead of just 1).
This way, your code can compile and execute on any version of the CPU.
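
[Editor's sketch] The loop pattern described above could look roughly like
this in C. It is only an illustration under stated assumptions:
get_max_lanes() is a hypothetical stand-in for "get SR_MAX_SIZE, rd", the
value it returns (8 lanes of 32 bits) is arbitrary, and the inner lane loop
merely simulates what a single full-width SIMD instruction would do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for "get SR_MAX_SIZE, rd": number of 32-bit lanes
   in one SIMD register. The value 8 is an arbitrary assumption. */
static size_t get_max_lanes(void) { return 8; }

/* c[i] = a[i] + b[i], stepping by a register width read at run time. */
static void vec_add(const int *a, const int *b, int *c, size_t n)
{
    size_t w = get_max_lanes();   /* read once, outside the loop */
    size_t i = 0;

    assert(w > 0);

    /* Main loop: each outer iteration stands for one full-width SIMD op. */
    for (; i + w <= n; i += w)
        for (size_t k = 0; k < w; k++)   /* lanes, parallel in hardware */
            c[i + k] = a[i + k] + b[i + k];

    /* Tail: leftover elements when n is not a multiple of w. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Calling vec_add(a, b, c, 19) with an 8-lane register runs two full-width
iterations and three tail elements; with a wider register the same source
simply takes fewer iterations. The width is read once before the loop, not
on every iteration.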

I don't think it's too complex to do.

>>> It is! You are assuming that every application has this form:
for i
  c[i]=f(a[i],b[i])
But the general case is:
for i
  c[g(i)]=f(a[h(i)],b[k(i)])

This case can be difficult or impossible to vectorise. Depending on
g(), h() and k(), it can add dependencies (for example if a==c!).
In mathematical processing, look at the problem of small matrices: if
you manipulate 8x8 matrices and 8 words fit in a register, what happens to
the algorithm when the register size doubles? You lose the coherency!
If you read the SR each time to find out the register size, you will
lose a lot of time!
nicO
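
[Editor's sketch] The a==c dependency mentioned above can be made concrete
with a small C example. This is an illustration, not code from the thread:
it takes g(i) = i+1 and h(i) = k(i) = i, and the function names and the
16-lane limit are hypothetical. The "chunked" version imitates a w-lane
SIMD operation by loading w inputs before storing w results.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: c[i + 1] = a[i] + b[i], one element at a time. */
static void shift_add_scalar(const int *a, const int *b, int *c, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++)
        c[i + 1] = a[i] + b[i];
}

/* Naive "vector" version: per chunk, load up to w inputs, then store the
   results, as a w-lane SIMD operation would. Assumes 1 <= w <= 16. */
static void shift_add_chunked(const int *a, const int *b, int *c,
                              size_t n, size_t w)
{
    int tmp[16];

    assert(w >= 1 && w <= 16);
    for (size_t i = 0; i + 1 < n; i += w) {
        size_t m = n - 1 - i;          /* elements left to produce */
        if (m > w)
            m = w;
        for (size_t k = 0; k < m; k++) /* all loads happen first */
            tmp[k] = a[i + k] + b[i + k];
        for (size_t k = 0; k < m; k++) /* then all stores */
            c[i + 1 + k] = tmp[k];
    }
}
```

With a and c pointing at the same array {1,1,1,1,1} and b = {1,1,1,1,1},
the scalar loop yields {1,2,3,4,5}, the chunked loop with w = 4 yields
{1,2,2,2,2}, and with w = 2 it yields {1,2,2,3,2}. When the arrays alias,
the result depends on the register width, so the loop cannot be vectorised
naively.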



> nicO
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/

 