
Re: [f-cpu] F-CPU vs. Itanium



hello,

Martin Devera wrote:
> > > ;-] recently I started to use uClibc (for our embedded apps)
> > > it is simple, complete and only a 250 KB .so (with a 30 KB dynamic
> > > loader)
> >
> > arghflk ! <drooling>
> > and how fast does this compile (compared to Glibc) ? :-)
> 
> argh .. fucked pine .. I had been writing a reply for 30 minutes and it crashed ;(
> Now I'm going to write it again.
this doesn't only happen with pine ;-)
Fortunately, Netscape designed the "save as draft" button ;-P

> > > Yes it is the best way - I know something about hw but not so
> > > much as you and other guys here.
> > some of your later comments show that you're not clueless either.
> ;) I've learned a lot here.
me too, and so have a lot of other people, i think. In that respect, even if no
prototype ever gets made, F-CPU can't be a failure.

> By the way, do you know Bochs ? It would
> be interesting to change its CPU model to F-CPU; then, when the compiler
> is ready, you could emulate Linux inside it (as it has a free BIOS and
> PCI, VGA, net, FDD and HDD models).

mmm IIRC, Bochs is not GNU ?
It poses the same problem as a plain C simulator : we already have to deal with
a VHDL source tree and there are too few contributors yet. We can't hire
anybody, you know : it's all volunteer work, and it's done when it's done.

> > who shared their experience. In some places it looks like a worthless
> > compromise, and in other places, where no solution exists yet,
> > new things are imagined.
> is the SRB such a new idea ? It amazed me when I first saw it.

after some research, i think that some other computers use
this kind of technique, but i'm not sure.
SRB was created while writing a mail during the "how many registers" thread,
and it is closely tied to the FC0 architecture, but a similar principle
can perhaps be found in the i860 (or is it the i960 ?).

> > forget the concept of "simple logic" :-) at that scale, anything takes
> > time. Of course now, transmission takes more time than computation, but
> > even that has a "cost".
> I've been thinking about it last night and I think I understand
> now how tight it is these days.
it takes some time to realise :-) Even though FC0 was ahead of
the others when it started, it is now merely comparable to them. It's time
to do something.

> > that's accurate, even though i'm less optimistic now.
> > if i participate in FC1, i'll take more precautions,
> > because it seems that FC0's register set will be the slowest
> > part of the whole core.
> I think so. If you want high parallelism and you want to feed N FUs
> simultaneously, you need each FU to have its own ports to the register
> file. That means you need sum[i=0..N, eu_ports(i)] buses in the register
> file. IIRC the average ILP in programs is about 4, so we'd need
> about 16 ports to the RF. That seems way too much to me.

a realistic RF has maybe 8 ports. I've worked with chips using 8-ported
semi-custom SRAM, and speed was not the issue. I think that a FC0 with
3 write and 5 read ports overall is possible, but really
slow if we can't access full-custom technologies (as Intel and IBM do).
Beyond that, another strategy must be used.
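To sanity-check the port arithmetic above, here is a tiny sketch. The FU operand shapes (2-read/1-write by default, 3-read for some units) are my illustrative assumptions, not F-CPU specifications:

```python
# Rough port-count estimate for a shared register file (a sketch, not FC0 data).
# Assumption: each functional unit (FU) reads `reads_per_fu` operands and
# writes `writes_per_fu` results per cycle, so it needs that many dedicated
# ports on the register file to be fed every cycle.

def rf_ports(num_fus, reads_per_fu=2, writes_per_fu=1):
    """Total register-file ports needed to feed num_fus FUs each cycle."""
    return num_fus * (reads_per_fu + writes_per_fu)

# With an average ILP of about 4, feeding 4 FUs at once:
print(rf_ports(4))                   # 12 ports with plain 2r1w FUs
print(rf_ports(4, reads_per_fu=3))   # 16 ports if the FUs take 3 operands
```

This is why the port count explodes so quickly compared to the ~8 ports that semi-custom SRAM realistically allows.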

> As you've written later in this mail, it seems you are convinced
> that some form of split register set is necessary (as in a TTA), so
> now I know that this part of the debate is a bit void ;)
> I found some partially interesting articles - nothing new but
> interesting:
> http://www.opencores.org/articles?cmd=view_article&id=4
> http://www.opencores.org/projects/or2k/
i haven't followed Opencores for a while...

> > For example, nicO wants to reduce the number of ports and i thought
> > about a way to achieve that. However, making a 3r2w register set
> > with 1r1w blocks does not reduce the surface and the complexity because
> > a lot of things must be duplicated :
> >  - you can do a 2r1w bank by using 2*1r1w blocks, so having 3 read ports
> >    requires 3 identical blocks.
> >  - you can't do a Xr2w bank as simply as a 2r1w bank because the data
> >    must be written to all the sub-blocks. One solution is to use
> >    some kind of FIFO that serializes the write but it's not a suitable
> >    solution in this case. One solution i proposed was to ensure that
> >    two simultaneous writes would not write the same sub-block, but this
> >    can reduce the overall CPU efficiency.
> >
> > (i'll describe the trick later)
> 
> I'm interested ;) Do you mean "one 64-bit register latch" by
> the term sub-block here ?

i mean : splitting the 64-register block into 4 or 8 sub-blocks
(8 or 16 registers each). It's a solution i don't like, but others
(with fewer sentiments and more interest in peak numbers than
sustained numbers) would use it or have used it.
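A toy cost model of the replication described above may make the problem concrete. The construction (one storage copy per read port, one bank of copies per write port) is the classic way to build multiported files from 1r1w blocks, and is my assumption for illustration, not a description of actual FC0 plans:

```python
# Toy cost model: building a multiported register file from 1r1w blocks.
# Replicate the whole storage once per read port so reads never conflict,
# and once more per write port so each write port owns a bank it can
# update alone (a small table then tracks which bank holds the live copy
# of each register).  Illustrative assumption, not an FC0 design.

def blocks_needed(read_ports, write_ports):
    """Full-size 1r1w blocks needed for a (read_ports)r(write_ports)w file."""
    return read_ports * write_ports

print(blocks_needed(2, 1))  # 2 blocks for 2r1w, as in the mail
print(blocks_needed(3, 1))  # 3 identical blocks for 3r1w
print(blocks_needed(3, 2))  # 6 blocks for 3r2w: the surface is not reduced
```

The multiplication is exactly why building 3r2w out of 1r1w blocks saves neither surface nor complexity.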

> I'm also interested in how you expect SMP to be done. As
> I spent a lot of time with the memory manager of Linux, I'm curious
> whether it will work with F-CPU SMP.
> Would it be a kind of NUMA machine ?
probably, but this issue is outside the scope of the F-CPU project.

This also highly depends on how the memory interface is implemented.

> > the speed of the inverter is not directly linked to the frequency of the whole chip
> > because a lot of parameters have recently become prominent :
> Hmm, did you read about the new BJT chip design ? I can't remember which
> university did it, but they have a working 3-ported 32x31 register set
> operating at 16GHz with only 20W of thermal loss.
"only" ?...

> They use differential
> signal lines as pairs with a low (200mV) swing. They planned to test a 200GHz
> gate ... The only big problem is that they require a pair of wires of the
> same length for each signal.
this is very technology dependent. it can't be done at the VHDL level ...

> > > You can do it on code which is one module with sources. The big advantage
> > > of compile/link is compile speed. I can't imagine developers waiting
> > > several hours when recompiling a moderate project linked to glibc - you
> > > would have to compile glibc along with it to do global optimization.
> > > But if it is the only way to go for speed .. well.
> >
> > This depends on the local definition of speed.
> > When you develop an algorithm, compile time is moderately crucial.
> > Often, only small parts of the code base are modified, so incremental
> > compilation is possible (unless there's an avalanche effect). Then, when
> > you are finished, you can let your computer run while you sleep for some
> > deep optimisations, if you want.
> 
> And what about the ABI ? Do you want to do these optimizations inside one
> compilation unit only, or also between a .so and a binary ? Then you will
> have no ABI, and no closed software will run efficiently on it.
this is the user's choice : either speed or interface. I know a lot of people
will use the default settings anyway; they will want to reuse the existing
methods and will certainly be disappointed by the performance.

> On one side it can be good for open sw, but OTOH it can make M$'s monopoly
> bigger - because it will no longer be possible to write efficient
> sw for a closed-source OS.
who knows. But my concern is about the core, despite all the discussions about
languages...

> > There's an archive in Mexico (i forgot the URL).
> > it's a pretty large archive (around 20MB of files and attachments)
> I'll look for it ! 20MB is not so much for the 10Mbit Internet pipe here :)
lucky :-) i'm on a dial-up (RTC) modem...

> > what do you think about this ? there's only a 1/8 overhead/bloat
> > and it's pretty portable across different implementations
> > (recognise this opcode as a NOP if not useful, extract only the needed
> > information otherwise...)
> Wonderful ! The only word I can say :->
you're the only one who looks enthusiastic ;-)

> > 1:
> >  loadimm  1, r1 ; // or something like that.
> >  store   r2, r1 ; // here, r2 already points to the location p; the address is
> >                   // disambiguated at execution time (otherwise it would trap
> >                   // if a TLB miss occurred).
> > 2:
> >  load    r3, r4 ; // like above, r3 already points to *q. If q==p, the value
> >                   // of r4 becomes that of r1. there might be a small delay
> >                   // (1 or 2 cycles at most) if no bypass is designed in the LSU.
> > 3:
> >   add r2, r4, r2; // there's a bypass on the Xbar. The only bad surprise comes from
> >                   // the future reuse of r2 as a pointer without a previous explicit
> >                   // prefetch : a few cycles of penalty. my first estimations are
> >                   // less than i expected (1 or 2 cycles), but this will not hold
> >                   // in the future or in far-from-ideal cases.
> 
> maybe I misused the term disambiguation - I understand that the code
> above will work just fine. But often you can do this:
> loada   r3, r4  ; start loading r4, add r3 to the disambig. mem (DM)
???
> loadimm  1, r1
> store   r2, r1  ; if r2 is in the DM, remove it
> verify  r3, r4  ; if r3 is not in the DM, behave as a load (instead of a nop)
???
> add r2, r4, r2
> 
> So loada will have time to fetch the data during the loadimm.
> IMHO this code should be faster (by only one cycle in this particular
> case).
> But you can do it only if you are sure [r3] is not changed later
> by a store. And you never know (at compile time) whether two pointers
> might be the same (if they have the same type).

in FC0, the LSU is an 8*256 buffer where each line can be associated
with a register (or several), so if there's an alias, there is no problem.
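For comparison, the loada/store/verify scheme quoted above can be sketched as a tiny simulation. This is only my reading of the proposal; the class name, the dict-based memory and the return conventions are all illustrative assumptions:

```python
# Minimal sketch of the disambiguation-memory (DM) idea from the mail above.
# loada starts a speculative load and records the address; a store to the
# same address evicts it from the DM; verify re-loads only if the address
# was evicted (otherwise it behaves as a nop and the speculation holds).
# All names and the dict-based memory model are illustrative assumptions.

class DM:
    def __init__(self, mem):
        self.mem = mem          # memory, modeled as a dict: addr -> value
        self.tracked = set()    # addresses covered by a pending loada

    def loada(self, addr):
        """Speculative load: fetch early and remember the address."""
        self.tracked.add(addr)
        return self.mem[addr]

    def store(self, addr, value):
        """A store to a tracked address invalidates the speculation."""
        self.mem[addr] = value
        self.tracked.discard(addr)

    def verify(self, addr, speculative_value):
        """Behave as a load if the address was clobbered, else as a nop."""
        if addr in self.tracked:
            self.tracked.discard(addr)
            return speculative_value      # nop: the speculation held
        return self.mem[addr]             # re-load: an alias was detected

dm = DM({0x10: 5, 0x20: 7})
r4 = dm.loada(0x10)          # start loading *q early
dm.store(0x20, 1)            # store to p: a different address, DM keeps 0x10
print(dm.verify(0x10, r4))   # 5 -> no alias, verify acts as a nop
```

If the store had hit 0x10 instead, verify would re-load the fresh value, which is exactly the "behave as load" fallback described in the quote.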

> The verify costs another cycle here, so it is no win. But I've seen
> larger samples in the IA-64 docbook where much more was saved.
> But it is probably tied to a superscalar architecture - in FC0
> it will be simpler to do prefetch. So forget my kidding ;)
it's always interesting to chat about different things.

for example, it reinforces my idea that the LSU design is a really cool thing.
it's a bit complex, and it doesn't look like other computers' (so people
might have more difficulty using it), but it's a one-does-everything unit.

more about this later...

> regards, devik
good night,
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/