Rep:Re: Rep:Re: [f-cpu] Conditionnal load and store, the return

To: <f-cpu@seul.org>
Subject: Rep:Re: Rep:Re: [f-cpu] Conditionnal load and store, the return
From: "Nicolas Boulay" <nicolas.boulay@ifrance.com>
Date: Fri, 30 Aug 2002 11:57:45 GMT
Delivered-To: archiver@seul.org
Delivered-To: f-cpu-outgoing@seul.org
Delivered-To: f-cpu@seul.org
Delivery-Date: Fri, 30 Aug 2002 07:57:54 -0400
Reply-To: f-cpu@seul.org
Send-By: 140.94.82.18 with Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt; FR 15/06/2000)
Sender: owner-f-cpu@seul.org
-----Original Message-----
From: whygee@club-internet.fr
To: f-cpu@seul.org
Date: 30/08/02
Subject: Re: Rep:Re: [f-cpu] Conditionnal load and store, the return

hi !

"Nicolas Boulay" wrote :
>De: Yann Guidon

>hi,
>
>Christophe Avoinne wrote:
>> From: "Cedric BAIL"
>> > > Again, I don't see any problem.
>> >
>> > Currently if you have this :
>> >
>> > [DATA]
>> >    |
>> > [CPU 1]---[CPU 2]
>> >
>> >
>> > When CPU2 does a conditional load/store pair, it will not be able to see
>> > if CPU 1 access to the data.
># error : "access" variable undefined.
>
>for reading, there is no problem.
>for writing, the "dirty" flag will change
>so locked things will work (well, unless you
>read all the lengthy thread about this too much).
>
>> > The only way CPU 2 could know that is if
>> > every memory access were always sent to CPU 2 by CPU 1... That can be
>> > big overkill. Perhaps I am missing something, but the problem exists.
>badly written programs. false assumptions.
>communication is an expensive resource, that a lot of programmers waste
>for laziness reasons (that they excuse with "portability", "language",
>"existing code base" etc....). Currently, we can't afford a complex
>and costly MESI thing. And if you look at PCs, you'll have more reasons
>to seek another approach.
>
>>>>It's your usual speech, but programming MIMD computers efficiently is
>a research topic. But maybe you will soon overcome this.
well, this research seems to be endless and unfruitful.

>>>>>>Who are you to pass such a judgement ? I hope my last teacher in
parallel computing will never read such words !

unless you read some literature. You have found a paper/course
on COMA, i have found other resources : you see, it's not a
desperate case.

>>>>Ok, it's not. But it's a very hard problem. Not solved at all.

>> Ok you are speaking about inter-cpu locking, not intra-cpu locking. Well, of
>> course it is the most difficult problem to solve.
>.... given a certain perspective.
>This has been "solved" in many ways by several generations of
>programmers and computer designers..
>
>>>> It has *never* been sold satisfactorily.
oh, yes. Pentium is better because it is sold in larger
quantities. cool architectural argument.

>>>> Oops. You must read "solved", not "sold". Sorry.

> NUMA (non-uniform
>memory access: flat addressing, but each node has its own memory bank to
>speed up local access; remote accesses take much longer; the hardware sharing
>model is of the "central server" type) computers put great pressure on the
>OS scheduler. Don't say they are bad programmers ! Scheduling is an
>NP-complete problem !

then tell that to the people who put tens or hundreds of DSP
in a rack. And scheduling is not a new science either.
just type this keyword in google and you'll find heuristics.

>>>>>> Sorry, but it's stupid ! You can't compare general computing with
highly specific DSP architectures, where the topology is defined for the
application and well known when the software is written. In such a world, you
manipulate message passing directly, and there is no OS handling data
streams without knowing what the software does.

>>>>> Everything is controlled. The SW knows exactly the hardware on which
it will run. There is absolutely no question of security,
or even scheduling !

>>>> The COMA (cache-only memory architecture) of the latest SUN E10000 uses the
>read-replication approach.
>
>>>> From a research point of view, no approach is better than the others:
>it depends on the application. My idea is to permit the 4 approaches
>(you can read an old draft at http://f-cpu.seul.org/nico ). Maybe we should
>look at L4 and decide what is needed in hardware to do it.
good luck.

>>>> I really think that not so many things are needed. We only have to
>find a good way to allow a virtual page to be duplicated on 2
>physical ones (on different nodes). This could add some interesting
>problems ;p

do you think it's impossible ?
i believe that F-CPU is not the first computer to attempt
that. look at a T3D/E.

>>>I have looked at it. This architecture is called non-CC-NUMA, which
means non-cache-coherent non-uniform memory access. It's a real pain
to write a program in that mode. And sorry, this computer is only NUMA,
nothing new.

>> CPU1 and CPU2 can both access DATA directly; because they each have a
>> different LSU, we are stuck.
>that's one very simple way to see this :-)
>
>> In fact, the problem only occurs when you want a bi-processor or more, so I
>> think you need some extra stuff to allow global locking of data between CPUs.
>This is the kind of stuff that exists in "high performance" computers
>but when i speak about that, i get flamed. otherwise, i would already 
>
>>>>> Yep ! Because you never explain correctly how your idea works !

ditto. one more reason for this thread to end up with
Godwin points distributions. If it ever ends.

>>>>>:p

>have invented a clean interface for that and the debate would be over.
>But there is the argument that "locks" are mostly for local (intra-CPU)
>resources and an external lock would slow all CPUs.
>
>>>>> We don't need a list of signals ! But a list of bus functionality.

i don't see where i speak about signals, here.

>>>>>I read "interface" so I thought about your "fbus".

>> I suppose you want several CPUs able to access the same DATA directly :
>> 
>> CPU1------------------\
>> CPU2------------------+------DATA
>> CPU3------------------+
>> CPU4------------------/
>
>i would naturally put this in the "G-chip"...
>
>>>> Grrr ! We put the memory controller inside the f-cpu to decrease
>latency and you want to add another chip !
this is not related. The local memory is accessed by the same CPU but
several CPUs want to talk together. If there is a "hub",
then it's the best place to store the inter-CPU locks.

>>>So you lose all the benefit of leaving out the northbridge chipset.
Welcome back to the 90's !

> Think with functionality in mind (IP !), not chips !
just like you, i think whatever pleases me.

>>>>Yep ! But in that case, it's nonsense.

> look at what AMD wants to do with Opteron...
and then count the pins ;-)

>>>>Yes, count them : only memory interface + interconnection buses. In
your proposal, maybe the f-cpu will have fewer pins, but the bandwidth to
the chip will decrease and the latency will increase. And I can't imagine the
size of the gchip !

>> You need a bridge to access DATA for all CPUs. It is not possible for all
>> CPUs to access DATA at the same time.
>4-port SRAMs can now work around 250MHz....
>but a FPGA would do the job easily as well, and manage the locks.
>>>> Limited resources, a big bus for very little use. Why speak once
>more about FPGAs ? We design IP, so we can then choose when things
>will be implemented. 250 MHz for an FPGA is very, very optimistic !
>Our system must have only memory + f-cpus + I/O controller, that's all.
>One more chip will increase the price for almost nothing.

<yawn>

>> Just an idea : beside the LSU for each CPU (internal LSU), we can have an
>> external LSU which only contains the locked entries :
>does this mean that a specific instruction is needed ?
>
>>>> ???

yes : separate unit implies a separate instruction.
pure example of associated ideas.

>>>From my point of view, the LSU is a kind of L0 cache. So some instructions
give hints to it, but those are only hints.

>i wanted to do this through the SRs, in order to not limit the number
>of running locks/semaphores.
>
>>>> You can't use SRs for that; dig up an old post from Christophe. That's
>simply impossible with SRs that can only be written and read.

you forgot that since we design the stuff, we are free to
modify whatever parts do not fit in the "big picture".
but let's forget about SRs now. The "pseudo-LSU" idea
seems promising (but protection, scheduling and atomicity
are still problems if we don't use the register association flags).

>> if a CPU does a normal load/store, just bypass the external LSU : faster behavior.
>> if a CPU does a lock load, set a new entry in the external LSU.
>> if a CPU does a lock store, check if the entry is in the external LSU.
>> This external LSU can be seen as a special memory.
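
A minimal sketch of the three rules above, as a toy software model rather than anything resembling real F-CPU hardware. The class `ExternalLSU`, its method names, and the policy that a second lock load on the same address takes over the entry are all assumptions made for illustration; the thread does not specify what happens on a conflicting lock load.

```python
class ExternalLSU:
    """Toy model: shared DATA plus a table holding only the locked entries."""

    def __init__(self):
        self.memory = {}   # shared DATA: address -> value
        self.entries = {}  # locked entries: address -> id of the reserving CPU

    def load(self, addr):
        # Normal load: bypasses the locked-entry table entirely.
        return self.memory.get(addr, 0)

    def store(self, addr, value):
        # Normal store: bypasses the locked-entry table entirely.
        self.memory[addr] = value

    def lock_load(self, cpu, addr):
        # Lock load: set a new entry (assumed to replace any older one).
        self.entries[addr] = cpu
        return self.memory.get(addr, 0)

    def lock_store(self, cpu, addr, value):
        # Lock store: succeeds only if this CPU's entry is still present.
        if self.entries.get(addr) != cpu:
            return False
        del self.entries[addr]
        self.memory[addr] = value
        return True


def atomic_increment(lsu, cpu, addr):
    """Retry a lock load / lock store pair until the store succeeds."""
    while True:
        old = lsu.lock_load(cpu, addr)
        if lsu.lock_store(cpu, addr, old + 1):
            return old + 1
```

In this model, a normal store from another CPU does not touch the entry table, so it would go unnoticed by a pending lock store. That is one way to read the coherency question raised right after this proposal.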
>
>looks cool but there is a problem : how do you manage coherency between
>the usual 'L0'  LSU and the special one ?
>
>>>> The LSU is only a cache.

and cache coherency is not an issue for you ?
what is the _real_ cost for this ?
 * if the same addressing space (memory) is used,
   then this will conflict with other ongoing streams.
 * if the LSU is parallel to the lock unit, the units
    must be synchronised and keep coherency.
 * if the units are in series, then one will slow
    down the other.
To these problems, there is a simple solution :
create a new addressing space that does not coincide
with the memory space. We can apply some of the techniques
of the LSU and forget about coherency.

>>>>I must rethink that. But if you need to keep the external LSU
coherent with the internal LSU, I don't understand the point of the
external LSU.

>The problem of splitting the functions is a natural one, but i wouldn't
>put this in the memory addressing range, as it conflicts with the LSU
>and coherency between the units will add more problems (and shift
>the SW problems to the HW, but HW is usually more difficult to design,
>particularly when there is no code).
>
>I would propose an independent "lock" space for this purpose,
>so this wouldn't conflict with the memory space. The same kinds
>of techniques and protections can be enforced (N entries, associative
>addressing, trap on illegal ranges...) but it's much simpler.
>Instructions would be "lock imm/reg, value, result" and
>"release imm/reg, result". no addressing, no TLB access/miss,
>no granularity problems (memory should be reserved for high-bandwidth,
>large chunk transfers, and a single byte or word is an overkill).
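
The independent "lock" space described above can be sketched as a small associative table. This is only an illustration under stated assumptions: the names `LockSpace` and `LockSpaceTrap` are hypothetical, "value" is taken to be a tag identifying the owner, and "result" is taken to be a success flag. The message does not define either operand.

```python
class LockSpaceTrap(Exception):
    """Stands in for the trap raised on an illegal lock id."""


class LockSpace:
    """Toy lock space: N entries, associative lookup, no memory addressing."""

    def __init__(self, n_entries=4, max_id=1 << 16):
        self.n_entries = n_entries  # assumed table capacity
        self.max_id = max_id        # assumed legal range is 0 .. max_id - 1
        self.entries = {}           # lock id -> owner tag

    def _check(self, lock_id):
        if not 0 <= lock_id < self.max_id:
            raise LockSpaceTrap(lock_id)

    def lock(self, lock_id, value):
        """Model of 'lock imm/reg, value, result': True if acquired."""
        self._check(lock_id)
        if lock_id in self.entries:
            return False  # already held by someone
        if len(self.entries) >= self.n_entries:
            return False  # table full
        self.entries[lock_id] = value
        return True

    def release(self, lock_id):
        """Model of 'release imm/reg, result': True if the lock was held."""
        self._check(lock_id)
        if lock_id in self.entries:
            del self.entries[lock_id]
            return True
        return False
```

Because the lock id is just a key in its own space, nothing here goes through the TLB or depends on memory line granularity, which is the point being argued.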
>
>>>> locks aren't so common, but using 20 pins to do that
> is really overkill.
it is also an overkill to use memory (with the 256-bit granularity)
for a single 32-bit lock. And we are not forced to use

>>> no it's not !!!!!! Such locks are really uncommon in programs. So losing
even one pin is already too much.

external pins either. Look at the bottom of the message
where we start to agree that the FSB can support special
messages that can be harmlessly interleaved in the memory
streams and bursts.

>>>That's called bus functionality, and we could extend Wishbone
for it.

>You see, i try to bring solutions, too. i hope someone will improve
>on them. Thanks to Christophe for the "parallel LSU" idea, it's less
>scary than the SRs and can reuse a lot of existing methods.
>
>Btw, in a multi-CPU configuration, an interconnection network can
>be dedicated to passing lock messages, if it can't be multiplexed
>
>>>> overkill !!!
i said "if".

but it seems that what we want is a contradiction in itself.
We want "high-end performance" with low-level technology, and
even though there are some techniques, we can't get more than
what we pay.

>>>>False ! We work in a world where pin count is a very limited
resource for many different reasons (power consumption, package price,
routing,...).

>>>> So a pin represents some Gbits/s of bandwidth. If it's not often
used, this number decreases, and it's a waste of I/O capacity !

>>>>imagine that 99% of the time the I/O buses are used to communicate with other
CPUs, and 1% for locking. What do you prefer : increasing the speed of the 1%
by 50% or increasing the speed of the 99% by 10% ?
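
The trade-off in this question can be worked out with the usual Amdahl-style formula. The sketch below assumes the 99% and 1% figures are fractions of total bus time and that "increasing the speed by 50%" means a local speedup factor of 1.5; the function name is hypothetical.

```python
def overall_speedup(fraction, local_speedup):
    """Overall speedup when only `fraction` of the time runs `local_speedup` times faster."""
    new_time = fraction / local_speedup + (1.0 - fraction)
    return 1.0 / new_time


# Case 1: only the 1% spent on locking becomes 50% faster.
lock_case = overall_speedup(0.01, 1.5)

# Case 2: the 99% spent on ordinary inter-CPU traffic becomes 10% faster.
data_case = overall_speedup(0.99, 1.1)

print(f"locking 50% faster : x{lock_case:.4f} overall")
print(f"traffic 10% faster : x{data_case:.4f} overall")
```

Under these assumptions the first option gains about 0.3% overall and the second about 9.9%, which is the imbalance the question is pointing at.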

>with the memory streams on the front-side bus (this last idea
>is however what will be certainly implemented first).
>
>>>> It's only a new functionality of the bus, a new transaction, that's
all.
that's what i said. it's not difficult to modify existing
protocols to include these "messages".

>>>Yep !

>This is a common technique in large computers.
>
>>>>> Old large computers, with frequencies of 50 MHz, with
> no real wire routing problems, that cost millions of dollars,...
sorry about the wire routing problems. If they cost so
much, it's because the routing is the main problem.
and if the routing is so expensive, it's efficient.
you get what you pay, there is no magic.

>>>So we know that for low cost we can't have lots of wires. I'd rather buy a big
PC than a bi-UIII, which costs 5x the price for the same
performance...

see you on thursday,

> nicO

>> A+
>WHYGEE

*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/

