
Re: [f-cpu] Conditional load and store, the return



Look for >>> for my answers.

-----Original Message-----
From: Yann Guidon <whygee@f-cpu.org>
To: f-cpu@seul.org
Date: 30/08/02
Subject: Re: [f-cpu] Conditional load and store, the return

hi,

Christophe Avoinne wrote:
> "Cedric BAIL" wrote:
> > The problem is that you don't have any error; in fact the test is
> > false, so no real memory access is done. So you must not re-execute
> > the instruction, and only skip it even if the address is bad. That's
> > where I see a problem: you execute a handler for nothing...
> 'loadCC': if you access memory regardless of the test result, yes,
> you will raise an exception even for a false test. Due to this design
> you need to delay the exception until the test is completed and is
> true.

the condition test AND the pointer check happen at the same time.
The decision (stall, trap or issue) is taken during the Xbar stage.
So there is no problem such as the one Cédric explained (with the null
pointer and the false condition). The issue logic should be smart
enough (and it's not too difficult) to avoid these situations.

Let's say that the condition has precedence over the pointer.
In HW, the condition is evaluated faster than the pointer LUT read,
so it's also a natural choice.

> You should really detail your explanation, because I really don't
> see how you planned to execute 'loadCC' and 'storeCC' and ran into
> that kind of problem.

i hope the problem has disappeared now :-)


In a previous mail, you also wrote :

> From: "Cedric BAIL"
> 'load': you mean you always load the value from memory and assign the
> value to the data register only if the test succeeds? Well, if so, an
> exception will occur before any test anyway.
no. This can be a waste of CPU time (spent in the trap handler)
if the SW is badly coded.

> But if you only access memory just after the test succeeds, an
> exception will occur and you still need to re-execute the
> instruction.
they are accessed at the same time, and the issue logic (a big "AND-OR"
of all the statuses and conditions) will sort things out.

> Because your instruction has no internal state, you cannot use
> partial execution with exceptions. After an exception occurs, you
> cannot finish an instruction by resuming partial execution: you need
> to re-execute the instruction.

The F-CPU instructions are (and should remain so) such that partial
execution is not necessary. The instruction flow is only controlled
at the decode/issue stage.

> > > Again, I don't see any problem.
> >
> > Currently, if you have this:
> >
> > [DATA]
> >    |
> > [CPU 1]---[CPU 2]
> >
> > When CPU 2 does a conditional load/store pair, it will not be able
> > to see if CPU 1 accesses the data.
# error : "access" variable undefined.

for reading, there is no problem.
for writing, the "dirty" flag will change,
so locked things will work (well, unless you
have read too much of the lengthy thread about this).

> > The only reason CPU 2 knows that is because all memory accesses
> > are always sent to CPU 2 by CPU 1... It can be a big overkill.
> > Perhaps I missed something, but the problem exists.
badly written programs. false assumptions.
Communication is an expensive resource that a lot of programmers waste
out of laziness (which they excuse with "portability", "language",
"existing code base", etc.). Currently, we can't afford a complex
and costly MESI scheme. And if you look at PCs, you'll have more
reasons to seek another approach.

>>> It's your usual speech, but programming MIMD computers efficiently
is a research topic. But maybe you will soon overcome this.

> Ok, you are speaking about inter-CPU locking, not intra-CPU locking.
> Well, of course it is the most difficult problem to solve.
... given a certain perspective.
This has been "solved" in many ways by several generations of
programmers and computer designers.

>>> It has *never* been solved satisfactorily. NUMA (Non-Uniform
Memory Access: flat addressing, but each node has its own memory bank
to speed up local accesses; remote accesses take much longer; the
hardware sharing model is of the "central server" type) computers put
great pressure on the OS scheduler. Don't say they are bad
programmers! Scheduling is an NP-complete problem!

>>> COMA (Cache-Only Memory Access), as in the last SUN E10000, uses
the read-replication approach.

>>> From a research point of view, no approach is better than the
others: it depends on the application. My idea is to permit the 4
approaches (you can read an old draft at http://f-cpu.seul.org/nico).
Maybe we should look at L4 and decide what is needed in hardware to do
it.

>>> I really think that not many things are needed. We only have to
find a good way to allow a virtual page to be duplicated on 2 physical
ones (on different nodes). This could add some interesting problems ;p

> CPU1 and CPU2 can access DATA directly; because they both have a
> different LSU, we are stuck.
that's one very simple way to see this :-)

> In fact, the problem only occurs when you want a bi-processor or
> more, so I think you need some extra stuff to allow global locking of
> data between CPUs.
This is the kind of stuff that exists in "high performance" computers,
but when i speak about that, i get flamed. otherwise, i would already

>>>> Yep! Because you never explain correctly how your idea works!

have invented a clean interface for that and the debate would be over.
But there is the argument that "locks" are mostly for local (intra-CPU)
resources and an external lock would slow all CPUs.

>>>> We don't need a list of signals! But a list of bus
functionalities.

> I suppose you want several CPUs able to access the same DATA directly:
> 
> CPU1------------------\
> CPU2------------------+------DATA
> CPU3------------------+
> CPU4------------------/

i would naturally put this in the "G-chip"...

>>> Grrr! We put the memory controller inside the F-CPU to decrease
latency, and you want to add another chip! Think with functionality in
mind (IP!), not chips! Look at what AMD wants to do with the
Opteron...

> You need a bridge to access DATA for all CPUs. It is not possible
> for all CPUs to access DATA at the same time.

4-port SRAMs can now work at around 250 MHz...
but an FPGA would do the job easily as well, and manage the locks.

>>> Limited resources, a big bus for very little use. Why speak one
more time about FPGAs? We design IP, so we can choose later how things
will be implemented. 250 MHz for an FPGA is very, very optimistic!
Our system must contain only memory + F-CPUs + I/O controller, that's
all. One more chip will increase the price for almost nothing.

> Just an idea: beside the LSU of each CPU (internal LSU), we can have
> an external LSU which only contains the locked entries:

does this mean that a specific instruction is needed ?

>>> ???

i wanted to do this through the SRs, in order not to limit the number
of running locks/semaphores.

>>> You can't use SRs for that; look up an old post from Christophe.
That's simply impossible with SRs that can only be written and read.

> if a CPU does a normal load/store, just bypass the external LSU:
> faster behavior.
> if a CPU does a lock load, set a new entry in the external LSU.
> if a CPU does a lock store, check if the entry is in the external
> LSU.
>
> This external LSU can be seen as a special memory.

looks cool, but there is a problem: how do you manage coherency
between the usual 'L0' LSU and the special one?

>>> The LSU is only a cache.

The problem of splitting the functions is a natural one, but i wouldn't
put this in the memory addressing range, as it conflicts with the LSU,
and coherency between the units will add more problems (and shift
the SW problems to the HW, but HW is usually more difficult to design,
particularly when there is no code).

I would propose an independent "lock" space for this purpose,
so it wouldn't conflict with the memory space. The same kinds
of techniques and protections can be enforced (N entries, associative
addressing, trap on illegal ranges...) but it's much simpler.
Instructions would be "lock imm/reg, value, result" and
"release imm/reg, result": no addressing, no TLB access/miss,
no granularity problems (memory should be reserved for high-bandwidth,
large-chunk transfers; a single byte or word is overkill).

>>> Locks aren't so common, but using 20 pins to do that is really
overkill.

You see, i try to bring solutions, too. i hope someone will improve
on them. Thanks to Christophe for the "parallel LSU" idea; it's less
scary than the SRs and can reuse a lot of existing methods.

Btw, in a multi-CPU configuration, an interconnection network can
be dedicated to passing lock messages, if it can't be multiplexed

>>> overkill !!!

with the memory streams on the front-side bus (this last idea,
however, is what will certainly be implemented first).

>>> It's only a new functionality of the bus, a new transaction,
that's all.

This is a common technique in large computers.

>>>> Old large computers, with frequencies of 50 MHz, with no real
wire-routing problems, that cost millions of dollars,...
nicO

> A+
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/

