
Re: [f-cpu] another DATE report



Well, it's true I'm shy in English :-)

More seriously, I think the issue of the time penalty of using CAS and CAS2 is
overstated. The solution Yann offers to replace them with a fixed array of
semaphores accessed through the SR mechanism is evidence that either he doesn't
know what we are really talking about or he misunderstands the real purpose of
CAS2. Yann, let me tell you that your solution is really clumsy (I will explain
why). Personally, I think you are far too confident in yourself - especially
since you keep saying that you are the only one who writes code. Please, stop
with that !!! You shouldn't speak that way simply because you don't want to
hear about something you regard as crap. You will exasperate too many people
with such behaviour.

Ok, take it easy. Let us try a more constructive debate :

Why I find your solution clumsy compared with non-blocking synchro
primitives :

- To access an SR, we have two instructions : PUT and GET. The first is just a
WRITE operation; the second is just a READ operation.

- If you are speaking about a range of SRs used as mutexes - BLOCKING synchro
primitives, by the way - how do you stop a task on a processor in order to
switch to another one ? The only way I see is : GET to acquire a semaphore and
PUT to release it. Checking whether a task must wait can be done by reading the
value returned by the GET. But what happens if another processor, or a task
switch, comes between the GET instruction and the code that checks whether we
must stop the task, and that other processor or task releases the semaphore ?

device-side hardware GET : atomic { r1 = sr[r2]; sr[r2] = 1; }
device-side hardware PUT : atomic { sr[r2] = 0; }

task A :- executes GET for mutex #1
<< task switch >>
task B :- executes PUT for mutex #1 : releasing mutex #1.
task B :- wakes up any task waiting for mutex #1 : the wait list is still empty.
task B :- ...
<< task switch >>
task A :- Our earlier GET told us that mutex #1 was already acquired by task B, so
task A must stop and go into the mutex #1 wait list.
<< task switch >>
task B :- ...
<< task switch >>
task B :- ...
<< task switch >>
task B :- ...
task B :- termination of task

task A will never be awakened, because no other task will ever release mutex #1 !
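
To make the lost wake-up concrete, here is a minimal C sketch of a mutex
acquire built on GET/PUT. The names get_sr(), put_sr() and sleep_on_waitlist()
are hypothetical wrappers I use for illustration only; they are not part of any
existing F-CPU specification :

    /* hypothetical wrappers around the SR instructions :
     *   get_sr(id) : atomic { old = sr[id]; sr[id] = 1; return old; }
     *   put_sr(id) : atomic { sr[id] = 0; }                           */
    extern long get_sr(int id);
    extern void put_sr(int id);
    extern void sleep_on_waitlist(int id);  /* hypothetical scheduler call */

    void mutex_acquire(int id)
    {
        long old = get_sr(id);  /* read the old value, mark the SR as taken */

        /* <-- race window : a task switch here lets the owner PUT the mutex
         *     and scan an empty wait list before we have enqueued ourselves */

        if (old != 0)
            sleep_on_waitlist(id);  /* we sleep, but the wake-up is already lost */
    }

The test and the decision to sleep are not atomic with the GET, so the trace
above puts task A to sleep forever.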

Please, as you can see, it is a BLOCKING synchro primitive which does not spare
us the necessity of stopping a task on demand. So you will need to call a trap,
which is a very costly operation, especially with the way you intend to handle
your traps (CMB saving)...

- If you are speaking about a range of SRs used as semaphores, there is no way :
PUT cannot tell us whether we must wake up a task in the semaphore's wait list,
because it is a pure WRITE operation. Or you must have two ranges instead of
one and only use GET.

- fixed-range semaphores : a very limited resource.

I need a semaphore for a structure, so I must allocate a semaphore and take its
SR number as an ID, if one is available (too bad, the resource is scarce !).
Once I have it, I must record it in the structure so that the code using the
structure can find it when acquiring or releasing : every time I do a GET or a
PUT, I first have to read the structure (in memory) to fetch the semaphore ID,
so you cannot avoid the memory access anyway.
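
A small sketch of what this looks like in C (the field and function names are
made up for illustration, and it reuses the hypothetical get_sr() wrapper shown
above) :

    struct shared_thing {
        int          sem_id;  /* SR number allocated for this structure */
        struct node *head;    /* the data actually being protected      */
    };

    void lock_thing(struct shared_thing *t)
    {
        int id = t->sem_id;   /* extra memory access just to learn which SR to use */
        while (get_sr(id) != 0)
            ;                 /* spin, trap or go to sleep */
    }

With CAS, the lock word itself lives inside the structure, so the access that
fetches it is also the access that acquires it.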

I cannot figure out how to use your semaphores in a Linux port for your F-CPU...

So the best solution is to use CAS and CAS2, provided the external bus supports
something like a READ-CHECK-MODIFY transaction :

- semaphores and mutexes can be built with them without the synchronisation
problem I described above
- it is the most elegant way to get the fastest spinlocks on both UP and SMP
architectures
- the memory word used for synchronisation is embedded in the structure itself,
so there is no extra memory access
- no range limit
- CPU locking only concerns the processors that try to access the same memory
word concurrently
- no need for blocking synchro when using them on stacks, queues, deques, etc.
(see the sketch below)
- etc. etc. etc.
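
As an illustration of the last point, here is a minimal sketch of a lock-free
stack push. cas() is a hypothetical C wrapper around a single-word
compare-and-swap (store the new value only if the word still holds the expected
old one, and report whether it did) - it is not an existing F-CPU intrinsic :

    struct node { struct node *next; /* ... payload ... */ };

    /* hypothetical wrapper :
     * atomic { if (*addr == expected) { *addr = desired; return 1; } return 0; } */
    extern int cas(struct node **addr, struct node *expected, struct node *desired);

    void push(struct node **top, struct node *n)
    {
        struct node *old;
        do {
            old = *top;              /* read the current head            */
            n->next = old;           /* link the new node in front of it */
        } while (!cas(top, old, n)); /* retry if another CPU changed *top */
    }

No task ever blocks : the loser of a race simply retries with the new head, so
there is no wait list and no lost wake-up. The matching pop is where CAS2 helps,
since comparing a generation counter together with the pointer avoids the
classic ABA problem.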

----- Original Message -----
From: nico <nicolas@seul.org>
To: <f-cpu@seul.org>
Sent: Sunday, March 17, 2002 12:03 PM
Subject: Re: [f-cpu] another DATE report


> Yann Guidon a écrit :
> >
> > hi !
> >
> > nico wrote:
> > > hi,
> > > I'm back from vacation !
> >
> > oh yeah ! :-)
> > let's got back to the synchro primitives :-)
> >
> > And remember that i still don't agree with using the memory
> > address space for doing that...
> >
> > > Yann Guidon a écrit :
> > > >
> > > > hi !
> > > >
> > > > nico wrote:
> > > <...>
> > > > > No, it is absolutely necessary in order to handle synchronisation
> > > > > problems easily. Christophe convinced me; maybe he could resend his
> > > > > URL for the paper that talks about it.
> > > > > With such a technique, in the common case where there is no collision,
> > > > > the only cost is the CAS2 instruction itself : there is no OS call
> > > > > overhead, no task switch, nothing. Only 4 atomic transactions ! We
> > > > > must have that !
> > > >
> > > > i agree that CAS-like operations are interesting, no doubt about that.
> > > > However, what you propose is uglier than what i heard before, and
> > > > you know that we have discussed a lot ;-)
> > > >
> > > > * first, there must be an instruction that supports this kind of
> > > > transaction. There is none, and this instruction would block the
> > > > pipeline. You couldn't, for example, "pipeline" the CASes.
> > >
> > > That's not a problem !
> >
> > excuse me ?
> >
> > if you can't pipeline an instruction, there is no reason
> > for using a "RISC" architecture and the scalability is compromised.
> > Do not imagine that a computer with 2048 CPUs is science fiction;
> > the largest computers have around 10K CPUs, if i remember. Communication
> > and synchronisation grow _at_best_ as the log of the number of CPUs,
> > so if you can't sustain several simultaneously ongoing synchs,
> > your mega computer will spend its life waiting for "semaphores"
> > or other signals.
> >
>
> CAS2 is the fastest synchro primitive that you can find. SR-style
> semaphores are simply impossible. I asked Christophe to write up his
> demonstration but he's shy in English.
>
> You want to pipeline something heavier than the div instruction ! But you
> use such an instruction each time you want to update a piece of data, so
> very rarely on each processor. If a thousand processors communicate with
> each other it's not a problem at all.
>
> Besides, I don't believe we will ever see 10k-chip computers again; they
> are far too expensive, and "outside" communication becomes far too costly.
>
> > pipelining the synchro primitives is a way to enhance the communications
> > inside the system. This way, a CPU can send a synchro request to N other
> > CPUs, then request the answer from all CPUs. An external memory reference
> > takes tens of cycles, so you understand that if you request the answer
> > immediately, the CPU will stall for tens of cycles. thank you.
> >
>
> There is never any answer from the other CPUs. That's exactly the good
> thing about this primitive !
>
> > >  Most of the other technique are much more costly
> > > (call to the os, trap,...)
> >
> > costly in what ?
> > It's always a matter of "HW/SW Interface" (cf P&H's book)
> > if it's not done in HW, it's in SW (and vice versa), so we're
> > always moving the cost around, here and there...
> >
>
> No ! Sorry, trapping into kernel mode is much more costly than the slowest
> CPU instruction.
>
> > > > * second, probably no device would support this operation. I don't know
> > > > how you would like to implement that, but F-CPU is not alone in the
> > > > system (otherwise there would be no use for an I/O bus :-D) and the
> > > > other elements must support the same protocol. If there is only one
> > > > memory block, ok, but if other CPUs of a different sort are present,
> > > > there might be risks.
> > >
> > > If the arbiter of the bus does it, it's not a problem any more ! It is
> > > the arbiter that grants access to the devices.
> >
> > who speaks about a "bus" ? i speak about a "system", whether "on a chip"
> > or not. The interconnection between the different (active and passive)
> > elements of the system might have any topology and use heterogeneous
> > protocols which won't all comply with our little specification.
> > And sorry, when i wrote "(otherwise there would be no use for an I/O bus
> > :-D)" i meant "I/O interface". my apologies.
> >
>
> It doesn't change anything ! In any case there is always an arbiter, so
> if the arbiter supports CAS2, there is no problem for F-CPUs connected
> together. An I/O device doesn't have a CAS2 instruction, so it simply
> does not use it ! What's the problem ?
>
> > > > * third, and more difficult for me (because i'm one of the people who
> > > > code), your feature requires a more complex state machine than needed.
> > > > Generally such a complex thing is not used 99.95% of the time, so i
> > > > think it's an overhead on my shoulders, unless you want to do it.
> > >
> > > ?? It's only a 4-state state machine; your scheduler will be much
> > > more painful !
> >
> > i wouldn't describe the design of the scheduler as a "state machine",
> > even though there are always people who think about everything as state
> > machines. It's not about theory : the scheduler works with a very simple
> > principle (despite its complex implementation) : "if all the necessary
> > resources are available, then issue the instruction". The "state" is
> > not "inside" a monolithic state machine like people usually see. In fact,
> > a stateless machine is just a boolean operation, even if the result is
> > memorised by pipeline latches. it can be interrupted without problem
> > because there is no "single state" but a more complex cooperation.
> >
> > However, putting Compare And Swap in as an instruction creates a wholly CISC
> > situation. What you describe is a pure RISC vs CISC problem.
> >
> >                        !!!!!!!!!!!!!!!!!!
>
> Not at all. CAS2 operations are found in the PowerPC, for example. Because
> such instructions are atomic, you can't compare them to a usual "CISC"
> operation.
>
> Another way to say that is : how do you tell the difference between a CISCy
> instruction and a RISCy one ? I'd be glad to know ! The only thing that
> I can see is that a CISCy operation can take an arbitrary number of clock
> cycles. We could add that a RISCy operation has a fixed-size instruction
> word (to ease decoding) (but for code density some DSPs use variable-size
> 8/16/32/48-bit instructions with a "wide decoder", so "many" instructions
> can be decoded at the same time and the code is denser). For me that's all !
>
> >
> > I propose to create a device, looking like a RAM module, that is plugged
> > into the system and acts as a "mailbox". This way, using the already existing
> > instructions (load and store), and a simple mechanism inside the "mailbox",
> > the "CAS" operation is performed by the external device and NOT by the CPUs.
> > With a single access port, one can ensure that all accesses are serialised
> > and coherency is thus ensured.
> >
>
> I'm not sure it could work.
>
> > I still don't like this idea because it uses the general memory addressing
> > space for doing things that don't work with cache-line granularity
> > (-> waste of bandwidth), but i hope that this compromise will help keep
> > the F-CPU simple and working.
> > At least, this solution does not create artificial limits on the CPI and
> > operating frequency.
> >
> >                        !!!!!!!!!!!!!!!!!!
> >
> > > > > > > - maybe things to manage the L2 cache (from my point of view the
> > > > > > > L2 is tied to the DRAM controller, the L1 is tied to the CPU).
> > > > > > > Non-cached accesses, for example, to avoid cache thrashing.
> > > > > > what do you mean by "things to manage cache L2" ? what things ?
> > > > >
> > > > > For example, caching or not caching data (imagine the F-CPU as the
> > > > > core of the GeForce 8; a rendered image doesn't need to be cached).
> > > > there's a flag in the Load and Store instructions which specifies if
> > > > the data is to be cached.
> > >
> > > Grrr! I am talking about the bus that connects the devices internally. I
> > > know there is such a flag. But there is no interest unless our internal
> > > bus handles it and informs the DRAM controller.
> >
> > maybe things are not clear.
> > I thought that this need was already addressed anyway.
> >
> > > > > To extend that, i would try to apply distributed memory in some
> > > > > control logic to invalidate some cache lines...
> > > >
> > > > if we work with private and public memory spaces, there is no
> > > > invalidation logic required.
> > >
> > > ????? I think you are missing one thing completely. Invalidate is one of
> > > 4 techniques, but none of them is superior to the others; it depends
> > > very much on the application.
> >
> > why can't we keep it simple in the beginning ?
> > If we continuously add features here and there, because a paper you
> > read says it wins 1%, F-CPU will always remain vaporware.
> > Unless you have some hidden code somewhere.
> >
> > I try to write code now, and i hope you understand that i can't answer
> > to all questions and trolls (though i don't dislike it). Without code,
> > all our blabla is lost time. Some people have already written code,
> > i hope to see yours one day. really.
> >
> > > > > > see you soon,
> > > > > Yep ! I leave Paris for 1 week. Don't blow up my mail box !
> > > > send me a postcard ;-)
> > > >
> > > > if you go on vacation, take some time to write a white paper about how
> > > > to use Wishbone in the simplest case (1 CPU + 1 memory). I'll try to
> > > > make that for the VCI interface. Some VHDL code would be nice, too.
> > >
> > > Too late!
> >
> > no code, no glory ;-P
> >
> > > nicO
> > WHYGEE
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/