
Re: [f-cpu] another DATE report



hi !

nico wrote:
> hi,
> I'm back from vacation !

oh yeah ! :-)
let's get back to the synchro primitives :-)

And remember that i still don't agree with using the memory
address space for doing that...

> Yann Guidon wrote:
> >
> > hi !
> >
> > nico wrote:
> <...>
> > > Nope, absolutely necessary to handle synchronisation problems very easily.
> > > Christophe convinced me; maybe he could resend his URL for the paper that
> > > talks about it.
> > > With such a technique, in the common case where there is no collision, the only
> > > cost is this CAS2 instruction: there is no OS call overhead, no task
> > > switch, nothing. Only 4 atomic transactions ! We must have that !
> >
> > i agree that CAS-like operations are interesting, no doubt about that.
> > However, what you propose is uglier than what i heard before, and
> > you know that we have discussed a lot ;-)
> >
> > * first, there must be an instruction that supports this kind of
> > transaction. there is none and this instruction would block the pipeline.
> > you couldn't for example "pipeline" the CASes.
> 
> That's not a problem !

excuse me ?

if you can't pipeline an instruction, there is no reason
for using a "RISC" architecture and the scalability is compromised.
Do not imagine that a computer with 2048 CPUs is science fiction,
the largest computers have around 10K CPUs, as far as i remember. Communication
and synchronisation costs grow _at_best_ as the log of the number of CPUs,
so if you can't sustain several simultaneously ongoing synchs,
your mega computer will spend its life waiting for "semaphores"
or other signals.

pipelining the synchro primitives is a way to enhance the communications
inside the system. This way, a CPU can send a synchro request to N other CPUs,
then request the answer from all of them. An external memory reference takes
tens of cycles, so you understand that if you request the answer immediately,
the CPU will stall for tens of cycles. thank you.
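
To put rough numbers on this, here is a minimal sketch of the cycle count for both styles. The latency and issue cost are made-up illustrative values, not F-CPU figures, and the model assumes that an answer which has already arrived can be picked up for free.

```python
# Toy cycle count: blocking vs pipelined synchronisation requests.
# LATENCY and ISSUE_COST are illustrative assumptions, not F-CPU figures.

LATENCY = 40      # cycles before the answer to one external request is back (assumed)
ISSUE_COST = 1    # cycles needed to issue one request (assumed)


def blocking_cycles(n_targets, latency=LATENCY, issue=ISSUE_COST):
    """Issue a request, wait for its answer, then move on to the next CPU."""
    return n_targets * (issue + latency)


def pipelined_cycles(n_targets, latency=LATENCY, issue=ISSUE_COST):
    """Issue all requests back to back, then collect the answers.

    While the last request is being issued the earlier ones are already
    in flight, so only the latency of the last request is fully exposed.
    """
    if n_targets == 0:
        return 0
    return n_targets * issue + latency


if __name__ == "__main__":
    for n in (1, 8, 64):
        print(n, blocking_cycles(n), pipelined_cycles(n))
```

With these assumed values, 8 targets cost 328 cycles when each answer is awaited immediately and 48 cycles when the requests are pipelined.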

>  Most of the other techniques are much more costly
> (call to the os, trap,...)

costly in what ?
It's always a matter of "HW/SW Interface" (cf P&H's book)
if it's not done in HW, it's in SW (and vice versa), so we're
always moving the cost around, here and there...

> > * second, probably no device would support this operation. I don't know
> > how you would like to implement that but F-CPU is not alone in the system
> > (otherwise there would be no use for an I/O bus :-D) and the other elements
> > must support the same protocol. If there is only one memory block, ok,
> > but if other CPUs of a different sort are present, there might be risks.
> 
> If the arbiter of the bus does it, it's not a problem any more ! It is the
> arbiter that grants access to the devices.

who speaks about a "bus" ? i speak about a "system", whether "on a chip" or not.
The interconnection between the different (active and passive) elements of
the system might have any topology and use heterogeneous protocols which
won't all comply with our little specification.
And sorry, when i wrote "(otherwise there would be no use for an I/O bus :-D)"
i meant "I/O interface". my apologies.

> > * third, and more difficult for me (because i'm one of the people who code),
> > your feature requires a more complex state machine than needed. Generally
> > such a complex thing is not used 99.95% of the time, so i think it's an
> > overhead on my shoulders, unless you want to do it.
> 
> ?? it's only a 4-state state machine, your scheduler will be much
> more painful !

i wouldn't describe the design of the scheduler as a "state machine",
even though there are always people who think about everything as state
machines. It's not about theory : the scheduler works with a very simple
principle (despite its complex implementation) : "if all the necessary
resources are available, then issue the instruction". The "state" is
not "inside" a monolithic state machine like people usually see. In fact,
a stateless machine is just a boolean operation, even if the result is
memorised by pipeline latches. it can be interrupted without problem
because there is no "single state" but a more complex cooperation.
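
As a rough illustration of that principle (not the actual F-CPU scheduler code), the issue decision can be written as a pure boolean function of what an instruction needs and what is free in the current cycle. The resource names below are hypothetical.

```python
# Sketch of "if all the necessary resources are available, then issue".
# The resource names and instruction descriptions are hypothetical.

def can_issue(needed, available):
    """Pure boolean function: nothing is remembered between calls."""
    return needed <= available   # set inclusion: every needed resource is free


# Registers and units assumed to be free in the current cycle.
available = {"R1", "R2", "ASU"}

add_needs = {"R1", "R2", "ASU"}    # a hypothetical "add" instruction
load_needs = {"R3", "LSU"}         # a hypothetical "load" instruction

print(can_issue(add_needs, available))    # True
print(can_issue(load_needs, available))   # False
```

The only "state" in this sketch is the set of free resources, which is recomputed every cycle from the rest of the pipeline.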

However, putting Compare And Swap as an instruction creates a wholly CISC
situation. What you describe is a pure RISC vs CISC problem.

                       !!!!!!!!!!!!!!!!!!

I propose to create a device, looking like a RAM module that is plugged
in the system and acts as a "mailbox". This way, using the already existing
instructions (load and store), and a simple mechanism inside the "mailbox",
the "CAS" operation is performed by the external device and NOT by the CPUs.
With a single access port, one can ensure that all accesses are serialised
and coherency is thus ensured.

I still don't like this idea because it uses the general memory addressing space
for doing things that don't work with cache line granularity (-> waste of bandwidth)
but i hope that this compromise will help keep the F-CPU simple and working.
At least, this solution does not create artificial limits on the CPI and
operating frequency.
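
A minimal software model of such a mailbox could look like the sketch below. Everything about the layout is an assumption made for illustration: each CPU gets a private pair of operand registers ("expected" and "new"), and a load from a "commit" address triggers the compare-and-swap inside the device and returns the previous value. The lock stands in for the single access port.

```python
import threading


class Mailbox:
    """Toy model of a memory-mapped mailbox that performs the CAS itself.

    Hypothetical layout: one (expected, new) operand pair per CPU, plus
    a number of shared cells. The CPUs only ever do loads and stores.
    """

    def __init__(self, n_cells, n_cpus):
        self.cells = [0] * n_cells
        self.operands = [[0, 0] for _ in range(n_cpus)]
        self._port = threading.Lock()   # single access port: one access at a time

    def store_expected(self, cpu, value):
        with self._port:
            self.operands[cpu][0] = value

    def store_new(self, cpu, value):
        with self._port:
            self.operands[cpu][1] = value

    def load_commit(self, cpu, cell):
        """A plain load as seen by the CPU; the device does the compare and swap."""
        with self._port:
            old = self.cells[cell]
            expected, new = self.operands[cpu]
            if old == expected:
                self.cells[cell] = new
            return old


def try_lock(mbox, cpu, cell):
    """Take a lock with two stores and one load (0 means 'free')."""
    mbox.store_expected(cpu, 0)
    mbox.store_new(cpu, cpu + 1)
    return mbox.load_commit(cpu, cell) == 0


mbox = Mailbox(n_cells=4, n_cpus=2)
print(try_lock(mbox, 0, 0))   # True: the cell was free
print(try_lock(mbox, 1, 0))   # False: CPU 0 already holds it
```

Because the operands are private to each CPU and the commit is a single serialised access, two CPUs interleaving their stores cannot corrupt each other's transaction in this model.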

                       !!!!!!!!!!!!!!!!!!

> > > > > - maybe things to manage cache L2 (from my point of view L2 is tied to
> > > > > the DRAM controller, the L1 is tied to the CPU). Non-cached accesses, for
> > > > > example, to avoid cache thrashing.
> > > > what do you mean by "things to manage cache L2" ? what things ?
> > >
> > > For example, caching or not caching data (imagine the F-CPU as the core of
> > > the Geoforce 8: a rendered image doesn't need to be cached).
> > there's a flag in the Load and Store instructions which specifies if the
> > data is to be cached.
> 
> Grrr! I'm speaking about the bus that connects the devices internally. I know there
> is such a flag. But it is of no interest if our internal bus does not
> handle it to inform the DRAM controller.

maybe things are not clear.
I thought that this need was already addressed anyway.

> > > To extend that, i would try to apply distributed memory in some control logic to
> > > make invalidate some cache line...
> >
> > if we work with private and public memory spaces, there is no invalidation
> > logic required.
> 
> ????? I think you completely miss one thing. Invalidation is one of 4
> techniques, but none of them is superior to the others; it depends
> much on the application.

why can't we keep it simple in the beginning ?
If we continuously add features here and there, because a paper you
read says it wins 1%, F-CPU will always remain vaporware.
Unless you have some hidden code somewhere.

I try to write code now, and i hope you understand that i can't answer
all questions and trolls (though i don't dislike it). Without code,
all our blabla is lost time. Some people have already written code,
i hope to see yours one day. really.

> > > > see you soon,
> > > Yep ! I leave Paris for 1 week. Don't blow my mail box !
> > send me a postcard ;-)
> >
> > if you go on vacation, take some time to write a white paper about how to
> > use Wishbone in the most simple case (1 CPU + 1 memory). I'll try to make
> > that for the VCI interface. Some VHDL code would be nice, too.
> 
> Too late!

no code, no glory ;-P

> nicO
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~