
Re: [f-cpu] another DATE report



Yann Guidon wrote:
> 
> hi !
> 
> nico wrote:
> > hi,
> > I'm back from vacation !
> 
> oh yeah ! :-)
> let's get back to the synchro primitives :-)
> 
> And remember that i still don't agree with using the memory
> address space for doing that...
> 
> > Yann Guidon wrote:
> > >
> > > hi !
> > >
> > > nico wrote:
> > <...>
> > > > Nope, it is absolutely necessary, to handle synchronisation problems very
> > > > easily. Christophe convinced me; maybe he could resend his URL for the
> > > > paper that discusses it.
> > > > With such a technique, in the common case where there is no collision, the
> > > > only cost is this CAS2 instruction: there is no OS call overhead, no task
> > > > switch, nothing. Only 4 atomic transactions! We must have that!
> > >
> > > i agree that CAS-like operations are interesting, no doubt about that.
> > > However, what you propose is uglier than what i heard before, and
> > > you know that we have discussed a lot ;-)
> > >
> > > * first, there must be an instruction that supports this kind of
> > > transaction. there is none and this instruction would block the pipeline.
> > > you couldn't for example "pipeline" the CASes.
> >
> > That's not a problem !
> 
> excuse me ?
> 
> if you can't pipeline an instruction, there is no reason
> for using a "RISC" architecture and the scalability is compromised.
> Do not imagine that a computer with 2048 CPU is science fiction,
> the largest computers have around 10K CPU i remember. Communication
> and synchronisation grow _at_best_ as the log of the number of CPU,
> so if you can't sustain several simultaneously ongoing synchs,
> your mega computer will spend its life waiting for "semaphores"
> or other signals.
> 

CAS2 is the fastest synchro primitive you can find. SR-style
semaphores are simply impossible. I told Christophe to write up his
demonstration, but he's shy in English.

You want to pipeline something more than the div instruction! But you
only use such an instruction when you want to update shared data, so
very rarely on each processor. Even if a thousand processors communicate
with each other, it's not a problem at all.
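
A minimal sketch of the kind of update discussed here, assuming CAS2 means
a double compare-and-swap on two memory words. The names SharedMemory and
transfer are hypothetical, and the lock only stands in for the atomicity
that the hardware (or the bus arbiter) would provide. It is an
illustration, not the F-CPU design:

```python
import threading


class SharedMemory:
    """Toy word-addressed memory. The lock only emulates hardware atomicity."""

    def __init__(self, size):
        self.cells = [0] * size
        self._atomic = threading.Lock()

    def load(self, addr):
        return self.cells[addr]

    def cas2(self, addr1, addr2, old1, old2, new1, new2):
        """Double compare-and-swap: write both cells only if both still match."""
        with self._atomic:
            if self.cells[addr1] == old1 and self.cells[addr2] == old2:
                self.cells[addr1] = new1
                self.cells[addr2] = new2
                return True
            return False


def transfer(mem, src, dst, amount):
    """Move `amount` from cell src to cell dst; return the number of CAS2 tries."""
    tries = 0
    while True:
        tries += 1
        a, b = mem.load(src), mem.load(dst)   # plain, unsynchronised reads
        if mem.cas2(src, dst, a, b, a - amount, b + amount):
            return tries                      # 1 when nobody collided with us


mem = SharedMemory(2)
mem.cells[0] = 1000

# Four "CPUs" each move 100 units, one at a time, with no OS-level locking.
workers = [
    threading.Thread(target=lambda: [transfer(mem, 0, 1, 1) for _ in range(100)])
    for _ in range(4)
]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(mem.cells)  # [600, 400]
```

When there is no collision, transfer() returns after a single CAS2; a
retry only happens if another thread changed one of the two words between
the reads and the CAS2.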

Apart from that, I don't believe we will see 10k-chip computers again:
they are far too expensive, and "outside" communication has become far
too costly.

> pipelining the synchro primitives is a way to enhance the communications
> inside the system. This way, a CPU can send a synchro request to N other CPU,
> then request the answer from all CPUs. An external memory reference uses
> tens of cycles so you understand that if you request the answer immediately,
> the CPU will stall during tens of cycles. thank you.
>
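
The stall argument quoted above can be put into numbers with a small
sketch. The figures are assumptions chosen only for illustration (8
requests, 40 cycles per external reference, one request issued per
cycle), and the function names are hypothetical:

```python
def blocking_cycles(n_requests, latency, issue=1):
    """Each request is issued, then the CPU stalls until its answer is back."""
    return n_requests * (issue + latency)


def pipelined_cycles(n_requests, latency, issue=1):
    """All requests are issued back to back; the answers overlap in flight."""
    if n_requests == 0:
        return 0
    return n_requests * issue + latency


n, latency = 8, 40   # assumed values, chosen only for illustration
print(blocking_cycles(n, latency))    # 328
print(pipelined_cycles(n, latency))   # 48
```

Under these assumptions the blocking cost grows with n times the latency,
while the pipelined cost pays the latency only once.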

There is never any answer from the other CPUs. That's precisely the
good thing about this primitive!
 
> >  Most of the other technique are much more costly
> > (call to the os, trap,...)
> 
> costly in what ?
> It's always a matter of "HW/SW Interface" (cf P&H's book)
> if it's not done in HW, it's in SW (and vice versa), so we're
> always moving the cost around, here and there...
>

Nope! Sorry, trapping into kernel mode is much more costly than the slowest
CPU instruction.
 
> > > * second, probably no device would support this operation. I don't know
> > > how you would like to implement that but F-CPU is not alone in the system
> > > (otherwise there would be no use for an I/O bus :-D) and the other elements
> > > must support the same protocol. If there is only one memory block, ok,
> > > but if other CPUs of a different sort are present, there might be risks.
> >
> > If the bus arbiter does it, it's not a problem any more! It is the
> > arbiter that grants access to the devices.
> 
> who speaks about a "bus" ? i speak about a "system", whether "on a chip" or not.
> The interconnection between the different (active and passive) elements of
> the system might have any topology and use heterogeneous protocols which
> won't all comply to our little specification.
> And sorry, when i wrote "(otherwise there would be no use for an I/O bus :-D)"
> i meant "I/O interface". my apologies.
> 

It doesn't change anything! In any case there is always an arbiter, so
if the arbiter supports CAS2, there is no problem for F-CPUs connected
together. An I/O device doesn't have a CAS2 instruction, so it doesn't
use it! What's the problem?

> > > * third, and more difficult for me (because i'm one of the people who code),
> > > your feature requires a more complex state machine than needed. Generally
> > > such a complex thing is not used 99.95% of the time, so i think it's an
> > > overhead on my shoulders, unless you want to do it.
> >
> > ?? it's only a 4-state state machine; your scheduler will be much
> > more painful!
> 
> i wouldn't describe the design of the scheduler as a "state machine",
> even though there are always people who think about everything as state
> machines. It's not about theory : the scheduler works with a very simple
> principle (despite its complex implementation) : "if all the necessary
> resources are available, then issue the instruction". The "state" is
> not "inside" a monolithic state machine like people usually see. In fact,
> a stateless machine is just a boolean operation, even if the result is
> memorised by pipeline latches. it can be interrupted without problem
> because there is no "single state" but a more complex cooperation.
> 
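
The issue rule quoted above ("if all the necessary resources are
available, then issue the instruction") can be sketched as a pure boolean
function. The resource bits below are hypothetical and do not come from
the F-CPU sources:

```python
# Hypothetical resource bits, one per thing an instruction may need.
SRC_REGS_READY = 0b001
EXEC_UNIT_FREE = 0b010
WRITE_PORT_FREE = 0b100


def can_issue(needed, available):
    """Pure boolean function: True when every needed bit is also available."""
    return (needed & ~available) == 0


add_needs = SRC_REGS_READY | EXEC_UNIT_FREE | WRITE_PORT_FREE
print(can_issue(add_needs, 0b111))  # True
print(can_issue(add_needs, 0b101))  # False: the execution unit is busy
```

The function keeps no state of its own: the result depends only on the
current availability bits.
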
> However, putting Compare And Swap as an instruction creates a wholly CISC
> situation. What you describe is a pure RISC vs CISC problem.
> 
>                        !!!!!!!!!!!!!!!!!!

Not at all. CAS2 operations are found in the PowerPC, for example. Because
such instructions are atomic, you can't compare them to a usual "CISC"
operation.

Another way to put it: how do you tell a CISCy instruction from a RISCy
one? I'd be glad to know! The only thing I can see is that a CISCy
operation can take an arbitrary number of clock cycles. We could add that
a RISCy operation has a fixed-size instruction word (to ease decoding)
(but for code density some DSPs use variable-size 8-16-32-48 instructions
with a "wide decoder", so "many" instructions can be decoded at the same
time and the code is denser). For me that's all!

> 
> I propose to create a device, looking like a RAM module that is plugged
> in the system and acts as a "mailbox". This way, using the already existing
> instructions (load and store), and a simple mechanism inside the "mailbox",
> the "CAS" operation is performed by the external device and NOT by the CPUs.
> With a single access port, one can ensure that all accesses are serialised
> and coherency is thus ensured.
>
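
A rough sketch of how such a "mailbox" could behave, assuming a register
layout that is entirely hypothetical (VALUE, EXPECTED, NEW, COMMIT, with
one staging pair per CPU). The CPUs only issue loads and stores; the
compare-and-swap itself happens inside the device, whose single port is
emulated here by a lock:

```python
import threading


class Mailbox:
    """Toy memory-mapped device; register offsets are hypothetical."""

    VALUE, EXPECTED, NEW, COMMIT = range(4)

    def __init__(self, n_cpus):
        self._port = threading.Lock()       # single port: accesses are serialised
        self.value = 0
        self._staged = [[0, 0] for _ in range(n_cpus)]  # per-CPU (expected, new)

    def store(self, cpu, reg, data):
        with self._port:
            if reg == self.EXPECTED:
                self._staged[cpu][0] = data
            elif reg == self.NEW:
                self._staged[cpu][1] = data
            elif reg == self.VALUE:
                self.value = data
            else:
                raise ValueError("register is not writable")

    def load(self, cpu, reg):
        with self._port:
            if reg == self.VALUE:
                return self.value
            if reg == self.COMMIT:
                # Reading COMMIT triggers the compare-and-swap in the device.
                expected, new = self._staged[cpu]
                if self.value == expected:
                    self.value = new
                    return 1
                return 0
            raise ValueError("register is not readable")


def mailbox_cas(box, cpu, expected, new):
    """CAS built from plain stores and one load, as seen from a CPU."""
    box.store(cpu, Mailbox.EXPECTED, expected)
    box.store(cpu, Mailbox.NEW, new)
    return box.load(cpu, Mailbox.COMMIT) == 1


box = Mailbox(n_cpus=2)
print(mailbox_cas(box, 0, expected=0, new=7))  # True
print(mailbox_cas(box, 1, expected=0, new=9))  # False: the value is already 7
```

The per-CPU staging registers are a design choice of this sketch: with a
single shared pair, two CPUs could overwrite each other's operands between
the stores and the committing load.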

I'm not sure it could work.
 
> I still don't like this idea because it uses the general memory addressing space
> for doing things that don't work with cache line granularity (-> waste of bandwidth)
> but i hope that this compromise will help keep the F-CPU simple and working.
> At least, this solution does not create artificial limits on the CPI and
> operating frequency.
> 
>                        !!!!!!!!!!!!!!!!!!
> 
> > > > > > - maybe things to manage cache L2 (from my point of view L2 is tied to
> > > > > > the DRAM controller, the L1 is tied to the CPU). Non-cached accesses,
> > > > > > for example, to avoid cache thrashing.
> > > > > what do you mean by "things to manage cache L2" ? what things ?
> > > >
> > > > For example, caching or not caching data (imagine the F-CPU as the core of
> > > > the GeForce 8: the rendered image doesn't need to be cached).
> > > there's a flag in the Load and Store instructions which specify if the
> > > data is to be cached.
> >
> > Grrr! I'm talking about the bus that connects the devices internally. I know
> > there is such a flag. But it is of no use unless our internal bus handles it
> > and informs the DRAM controller.
> 
> maybe things are not clear.
> I thought that this need was already addressed anyway.
> 
> > > > To extend that, I would try to use some distributed-memory control logic to
> > > > invalidate some cache lines...
> > >
> > > if we work with private and public memory spaces, there is no invalidation
> > > logic required.
> >
> > ????? I think you are completely missing one thing. Invalidation is one of 4
> > techniques, but none of them is superior to the others; it depends very much
> > on the application.
> 
> why can't we keep it simple in the beginning ?
> If we continuously add features here and there, because a paper you
> read says it wins 1%, F-CPU will always remain vaporware.
> Unless you have some hidden code somewhere.
> 
> I try to write code now, and i hope you understand that i can't answer
> to all questions and trolls (though i don't dislike it). Without code,
> all our blabla is lost time. Some people have already written code,
> i hope to see yours one day. really.
> 
> > > > > see you soon,
> > > > Yep ! I leave Paris for 1 week. Don't blow up my mail box !
> > > send me a postcard ;-)
> > >
> > > if you go on vacation, take some time to write a white paper about how to
> > > use Wishbone in the most simple case (1 CPU + 1 memory). I'll try to make
> > > that for the VCI interface. Some VHDL code would be nice, too.
> >
> > Too late!
> 
> no code, no glory ;-P
> 
> > nicO
> WHYGEE
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> *************************************************************
> To unsubscribe, send an e-mail to majordomo@seul.org with
> unsubscribe f-cpu       in the body. http://f-cpu.seul.org/