
Re: Rep:Re: [f-cpu] Conditionnal load and store, the return



hi !

"Nicolas Boulay" wrote :
>De: Yann Guidon

>hi,
>
>Christophe Avoinne wrote:
>> From: "Cedric BAIL"
>> > > Again, I don't see any problem.
>> >
>> > Currently if you have this :
>> >
>> > [DATA]
>> >    |
>> > [CPU 1]---[CPU 2]
>> >
>> >
>> > When CPU2 does a conditional load/store pair, it will not be able to see
>> > whether CPU1 accesses the data.
># error : "access" variable undefined.
>
>for reading, there is no problem.
>for writing, the "dirty" flag will change
>so locked things will work (well, unless you
>read all the lengthy thread about this too much).
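the mechanism is basically load-locked/store-conditional. A minimal C sketch of the idea, assuming a per-CPU reservation that any foreign write invalidates (the "dirty" flag); all names are invented for illustration, not the actual F-CPU design:

```c
#include <stdbool.h>
#include <stdint.h>

enum { N_CPUS = 2, MEM_WORDS = 16 };

static uint64_t memory[MEM_WORDS];

typedef struct {
    uint64_t addr;
    bool     valid;  /* cleared when another CPU writes addr */
} reservation_t;

static reservation_t resv[N_CPUS];

static uint64_t load_locked(int cpu, uint64_t a)
{
    resv[cpu].addr  = a;
    resv[cpu].valid = true;
    return memory[a];
}

static void plain_store(int cpu, uint64_t a, uint64_t v)
{
    memory[a] = v;
    /* the write marks the location dirty: kill the other CPUs' reservations */
    for (int i = 0; i < N_CPUS; i++)
        if (i != cpu && resv[i].valid && resv[i].addr == a)
            resv[i].valid = false;
}

static bool store_conditional(int cpu, uint64_t a, uint64_t v)
{
    if (!resv[cpu].valid || resv[cpu].addr != a)
        return false;            /* lost the race: software retries */
    memory[a] = v;
    resv[cpu].valid = false;
    return true;
}
```

so CPU2 does not need to see every access by CPU1: if CPU1 stores to the watched address between CPU2's load_locked and store_conditional, the store_conditional simply fails and CPU2 retries. Only writes need to be visible.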
>
>> > The only way CPU2 can know that is if
>> > all memory accesses are always sent to CPU2 by CPU1.... It can be a
>> > big overkill. Perhaps I am missing something, but the problem exists.
>badly written programs. false assumptions.
>communication is an expensive resource, which a lot of programmers waste
>out of laziness (excused with "portability", "language",
>"existing code base", etc.). Currently, we can't afford a complex
>and costly MESI thing. And if you look at PCs, you'll have more reasons
>to seek another approach.
>
>>>>It's your usual speech, but programming MIMD computers efficiently is
>a research topic. But maybe you will soon overcome this. 
well, this research seems to be endless and unfruitful,
unless you read some literature. You have found a paper/course
on COMA, i have found other resources: you see, it's not a
desperate case.

>> Ok, you are speaking about inter-CPU locking, not intra-CPU locking. Well of
>> course it is the most difficult problem to solve.
>.... given a certain perspective.
>This has been "solved" in many ways by several generations of
>programmers
>and computer designers..
>
>>>> It has *never* been solved satisfactorily.
oh, yes. Pentium is better because it is sold in larger
quantities. cool architectural argument.

> NUMA (Non-Uniform
>Memory Access: flat addressing, but each node has its own memory bank to
>speed up local accesses; remote accesses take much longer; the hardware
>sharing model is of the "central server" type) computers put great pressure
>on the OS scheduler. Don't say they are bad programmers! Scheduling is an
>NP-complete problem! 

then tell that to the people who put tens or hundreds of DSPs
in a rack. And scheduling is not a new science either:
just type this keyword into Google and you'll find heuristics.

>>>> The COMA (Cache-Only Memory Access) of the latest Sun E10000 uses the read-replication approach.
>
>>>> From a research point of view, no approach is better than the others:
>it depends on the application. My idea is to permit all four approaches
>(you could read an old draft at http://f-cpu.seul.org/nico ). Maybe we should
>look at L4 and decide what is needed in hardware to do it. 
good luck.

>>>> I really think that not that much is needed. We only have to
>find a good way to allow a virtual page to be duplicated onto two
>physical ones (on different nodes). This could add some interesting
>problems ;p

do you think it's impossible ?
i believe that F-CPU is not the first computer to attempt
that. look at a T3D/E.

>> CPU1 and CPU2 can access DATA directly; because they both have a
>> different LSU, we are stuck.
>that's one very simple way to see this :-)
>
>> In fact, the problem only occurs when you want a bi-processor or more, so I
>> think you need extra stuff to allow global locking of data between CPUs.
>This is the kind of stuff that exists in "high performance" computers
>but when i speak about that, i get flamed. otherwise, i would already 
>
>>>>> Yep ! Because you never explain correctly how your idea works !

ditto. one more reason for this thread to end up in
Godwin-point distributions. If it ever ends.

>have invented a clean interface for that and the debate would be over.
>But there is the argument that "locks" are mostly for local (intra-CPU)
>resources and an external lock would slow all CPUs.
>
>>>>> We don't need a list of signals ! But a list of bus functionality.

i don't see where i speak about signals, here.

>> I suppose you want several CPU able to access the same DATA directly :
>> 
>> CPU1------------------\
>> CPU2------------------+------DATA
>> CPU3------------------+
>> CPU4------------------/
>
>i would naturally put this in the "G-chip"...
>
>>>> Grrr ! We put the memory controller inside the F-CPU to decrease
>latency and you want to add another chip !
this is not related. The local memory is accessed by the same CPU,
but several CPUs want to talk together. If there is a "hub",
then it's the best place to store the inter-CPU locks.

> Think with functionality in mind (IP !), not chips !
just like you, i think whatever pleases me.

> look at what AMD wants to do with the Opteron... 
and then count the pins ;-)

>> You need a bridge for all CPUs to access DATA. It is not possible for all
>> CPUs to access DATA at the same time.
>4-port SRAMs can now work at around 250 MHz....
>but an FPGA would do the job easily as well, and manage the locks.
>>>> Limited resources, a big bus for very little use. Why talk about
>FPGAs one more time ? We design IP, so we can then choose how things
>will be implemented. 250 MHz for an FPGA is very, very optimistic !
>Our system must have only memory + F-CPUs + an I/O controller, that's all.
>One more chip will increase the price for almost nothing.

<yawn>

>> Just an idea : beside the LSU for each CPU (internal LSU), we can have an
>> external LSU which only contains the locked entries :
>does this mean that a specific instruction is needed ?
>
>>>> ???

yes : separate unit implies a separate instruction.
pure example of associated ideas.

>i wanted to do this through the SRs, in order to not limit the number
>of running locks/semaphores.
>
>>>> You can't use SRs for that; find an old post by Christophe again. That's
>simply impossible with SRs, which can only be written and read.

you forgot that since we design the stuff, we are free to
modify whatever parts do not fit in the "big picture".
but let's forget about SRs now. The "pseudo-LSU" idea
seems promising (but protection, scheduling and atomicity
are still problems if we don't use the register association flags).

>> if a CPU does a normal load/store, just bypass the external LSU : faster behavior.
>> if a CPU does a lock load, set a new entry in the external LSU.
>> if a CPU does a lock store, check whether the entry is in the external LSU.
>> This external LSU can be seen as a special memory.
>
>looks cool but there is a problem : how do you manage coherency between
>the usual 'L0'  LSU and the special one ?
>
>>>> The LSU is only a cache.

and cache coherency is not an issue for you ?
what is the _real_ cost of this ?
 * if the same addressing space (memory) is used,
   then this will conflict with other ongoing streams.
 * if the LSU is parallel to the lock unit, the units
   must be synchronised and keep coherency.
 * if the units are in series, then one will slow
   down the other.
To these problems, there is a simple solution :
create a new addressing space that does not coincide
with the memory space. We can apply some of the techniques
of the LSU and forget about coherency.
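to make this concrete, here is a minimal C sketch of such a lock space: a small associative table indexed by lock identifiers that never touch the memory space, so there is no coherency with the LSU to maintain. The entry count, field and function names are all invented for illustration, not a proposed interface:

```c
#include <stdbool.h>
#include <stdint.h>

#define N_LOCKS 8   /* invented: N associative entries */

typedef struct {
    uint64_t id;      /* lock identifier, NOT a memory address */
    uint64_t owner;   /* which CPU holds it */
    bool     held;
} lock_entry_t;

static lock_entry_t lock_space[N_LOCKS];

/* "lock" operation: returns true on success, false if the lock is
 * already held or the table is full (a full table would trap in HW) */
static bool lock_acquire(uint64_t cpu, uint64_t id)
{
    int free_slot = -1;
    for (int i = 0; i < N_LOCKS; i++) {
        if (lock_space[i].held && lock_space[i].id == id)
            return false;                 /* held by someone else */
        if (!lock_space[i].held && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return false;                     /* table full */
    lock_space[free_slot] = (lock_entry_t){ id, cpu, true };
    return true;
}

/* "release" operation: only the owner may free its entry
 * (releasing a lock you don't own would trap in HW) */
static bool lock_release(uint64_t cpu, uint64_t id)
{
    for (int i = 0; i < N_LOCKS; i++)
        if (lock_space[i].held && lock_space[i].id == id
            && lock_space[i].owner == cpu) {
            lock_space[i].held = false;
            return true;
        }
    return false;
}
```

since the identifiers live in their own space, none of this interacts with memory bursts or the LSU pipelines.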

>The problem of splitting the functions is a natural one, but i wouldn't
>put this in the memory addressing range, as it conflicts with the LSU
>and coherency between the units will add more problems (and shift
>the SW problems to the HW, but HW is usually more difficult to design,
>particularly when there is no code).
>
>I would propose an independent "lock" space for this purpose,
>so this wouldn't conflict with the memory space. The same kinds
>of techniques and protections can be enforced (N entries, associative
>addressing, trap on illegal ranges...) but it's much simpler.
>Instructions would be "lock imm/reg, value, result" and
>"release imm/reg, result". no addressing, no TLB access/miss,
>no granularity problems (memory should be reserved for high-bandwidth,
>large chunk transfers, and a single byte or word is overkill).
>
>>>> locks aren't so common, but using 20 pins to do that
> is really overkill.
it is also overkill to use memory (with its 256-bit granularity)
for a single 32-bit lock. And we are not forced to use
external pins either. Look at the bottom of the message,
where we start to agree that the FSB can support special
messages that can be harmlessly interleaved in the memory
streams and bursts.
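a rough sketch of how such messages could share the bus, assuming (purely for illustration) 4-beat memory bursts and 1-beat lock messages; the type names and beat counts are invented:

```c
#include <stdint.h>

/* lock/release messages as just another short transaction type,
 * interleaved between memory bursts on the FSB */
typedef enum { T_MEM_BURST, T_LOCK_MSG, T_RELEASE_MSG } txn_type_t;

typedef struct {
    txn_type_t type;
    uint64_t   arg;   /* memory address, or lock identifier */
} bus_txn_t;

/* a memory burst occupies 4 bus beats, a lock message only 1,
 * so slipping a message between two bursts costs almost nothing */
static int bus_beats(const bus_txn_t *q, int n)
{
    int beats = 0;
    for (int i = 0; i < n; i++)
        beats += (q[i].type == T_MEM_BURST) ? 4 : 1;
    return beats;
}
```

a lock message between two bursts adds a single beat to the stream, which is why no dedicated pins or extra chip are strictly required.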

>You see, i try to bring solutions, too. i hope someone will improve
>on them. Thanks to Christophe for the "parallel LSU" idea, it's less
>scary than the SRs and can reuse a lot of existing methods.
>
>Btw, in a multi-CPU configuration, an interconnection network can
>be dedicated to passing lock messages, if it can't be multiplexed
>
>>>> overkill !!!
i said "if".

but it seems that what we want is a contradiction in itself.
We want "high-end performance" with low-end technology, and
even though there are some techniques, we can't get more than
what we pay for.

>with the memory streams on the front-side bus (this last idea
>is, however, what will certainly be implemented first).
>
>>>> It's only a new functionality of the bus, a new transaction, that's all.
that's what i said. it's not difficult to modify existing
protocols to include these "messages".

>This is a common technique in large computers.
>
>>>>> Old large computers, with frequencies of 50 MHz, with
> no real wire-routing problems, that cost millions of dollars,...
sorry about the wire routing problems. If they cost so
much, it's because the routing is the main problem.
and if the routing is so expensive, it's because that is what makes them efficient.
you get what you pay for, there is no magic.

see you on thursday,

> nicO

>> A+
>WHYGEE

*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/