Re: [f-cpu] F-CPU vs. Itanium

> > ;-] recently I started to use uClibc (for our embedded apps)
> > it is simple, complete and only 250kb SO (with 30k dynamic
> > loader)
> arghflk ! <drooling>
> and how fast does this compile (compared to Glibc) ? :-)

argh .. fucked pine .. I was writing the reply for 30 minutes and it crashed ;(
Now I'm going to write it again.
Well, look at www.uclibc.org. It compiles about five times faster
here (IIRC) and it misses only a few parts - international support
like iconv and gettext, and some exotic functions (wordexp).

> > Yes it is the best way - I know something about hw but not so
> > much as you and other guys here.
> some of your later comments show that you're not clueless either.

;) I've learned a lot here. By the way, do you know Bochs? It would
be interesting to change its CPU model to f-cpu, and when the compiler
is ready you could emulate Linux inside it (as it has a free BIOS and
PCI, VGA, net, FDD and HDD models).

> who shared their experience. In some places it looks like a worthless
> compromise and in other places, where no existing solution exists,
> new things are imagined.

Is the SRB such a new idea? It amazed me when I first saw it.

> forget the concept of "simple logic" :-) at that scale, anything takes
> time. Of course now, transmission takes more time than computation, but
> even that has a "cost".

I was thinking about it last night and I think I understand
now how tight it is these days.

> that's accurate, even though now i'm less optimistic.
> if i participate in FC1, i'll take more precautions
> because it seems that FC0's register set will be the slowest
> part of the whole core.

I think so. If you want a lot of parallelism and you want to feed N FUs
simultaneously, each FU needs its own ports to the register file.
That means you need sum[i=1..N, eu_ports(i)] buses in the register
file. IIRC the average ILP in programs is about 4, so at roughly 4 ports
per FU we'd need about 16 ports to the RF. That seems way too much to me.
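
To put rough numbers on that sum, here is a minimal sketch in Python. The
per-FU port counts (3 read + 1 write) are only my assumption for
illustration, not FC0's real figures, and the function name is made up:

```python
# Hypothetical sketch: total register-file ports needed to feed N
# functional units (FUs) in the same cycle. The port counts used below
# are illustrative assumptions, not taken from any F-CPU specification.

def total_rf_ports(fu_ports):
    """Sum the read and write ports over all FUs.

    fu_ports is a list of (read_ports, write_ports) tuples, one per FU.
    """
    return sum(reads + writes for reads, writes in fu_ports)


# Assume 4 FUs issue together, each with 3 read ports and 1 write port.
fus = [(3, 1)] * 4
print(total_rf_ports(fus))  # prints 16
```

Every extra FU adds its full set of ports, so the count grows linearly
with the issue width even before wiring and area are considered.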

As you wrote later in this mail, you seem convinced that some form
of split register set is necessary (like in TTA), so now I know that
this part of the debate is a bit void ;)
I found some partially interesting articles - nothing new but

> For example, nicO wants to reduce the number of ports and i thought
> about a way to achieve that. However, making a 3r2w register set
> with 1r1w blocks does not reduce the surface and the complexity because
> a lot of things must be duplicated :
>  - you can do a 2r1w bank by using 2*1r1w blocks, so having 3 read ports
>    requires 3 identical blocks.
>  - you can't do a Xr2w bank as simply as a 2r1w bank because the data
>    must be written to all the sub-blocks. One solution is to use
>    some kind of FIFO that serializes the write but it's not a suitable
>    solution in this case. One solution i proposed was to ensure that
>    two simultaneous writes would not write the same sub-block, but this
>    can reduce the overall CPU efficiency.
> (i'll describe the trick later)
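
If I read the 2r1w / 3r1w construction above right, it behaves like this
minimal Python sketch: every read port gets its own 1r1w copy and the
single write is broadcast to all copies, so nothing is saved in storage.
The class and method names are mine and purely hypothetical:

```python
# Hypothetical model of a multi-read-port register bank built from
# 1r1w blocks. Names are invented for illustration only.

class ReplicatedBank:
    def __init__(self, num_regs, read_ports):
        # One full copy of the registers per read port.
        self.copies = [[0] * num_regs for _ in range(read_ports)]

    def write(self, reg, value):
        # The single write port must update every copy so that all
        # read ports see the same contents.
        for copy in self.copies:
            copy[reg] = value

    def read(self, port, reg):
        # Each read port reads from its own private copy.
        return self.copies[port][reg]

    def storage_cells(self):
        # Storage grows linearly with the number of read ports.
        return sum(len(copy) for copy in self.copies)


bank = ReplicatedBank(num_regs=64, read_ports=3)
bank.write(5, 42)
print([bank.read(p, 5) for p in range(3)])  # prints [42, 42, 42]
print(bank.storage_cells())                 # prints 192
```

A second write port does not fit this scheme as easily, because both
writes would have to reach every copy in the same cycle.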

I'm interested ;) By "sub-block", do you mean "one 64-bit register
latch" here?
I'm also interested in how you expect SMP to be done. As I have
spent a lot of time with the Linux memory manager, I'm curious
whether it will work with f-cpu SMP.
Would it be a kind of NUMA machine?

> the speed of the inverter is not directly linked to the frequency of the whole chip
> because a lot of parameters have recently become prominent :

Hmm, did you read about the new BJT chip design? I can't remember which
university did it, but they have a working 3-ported 32x31 register set
operating at 16GHz with only 20W of thermal loss. They use differential
signal lines, as pairs with a low (200mV) swing. They planned to test a
200GHz gate ... The only big problem is that they require a pair of
wires of the same length for each signal.

> > You can do it on code which is one module with sources. The big advantage
> > of compile/link is compile speed. I can't imagine developers waiting
> > several hours when recompiling a moderate project linked to glibc - you
> > would have to compile glibc along with it to do global optimization.
> > But if it is the only way to go for speed .. well.
> This depends on the local definition of speed.
> When you develop an algorithm, compile time is moderately crucial.
> Often, only small parts of the code base are modified, so incremental
> compilation is possible (unless there's an avalanche effect). Then when
> you are finished, you can let your computer run when you sleep for some
> deep optimisations, if you want.

And what about the ABI? Do you want to do these optimizations inside one
compilation unit only, or also between a .so and the binary? Then you
will have no ABI and no closed software will run efficiently on it.
On one side it can be good for open sw, but OTOH it can make M$'s
monopoly bigger - because it will no longer be possible to write
efficient sw for a closed-source OS.

> There's an archive in Mexico (i forgot the URL).
> it's a pretty large archive (around 20MB of files and attachments)

I'll look for it! 20MB is not so much for the 10Mbit Internet pipe here :)

> what do you think about this ? there's only a 1/8 overhead/bloat
> and it's pretty portable across different implementations
> (recognize this opcode as NOP if not useful, extract only the needed
> information otherwise...)

Wonderful! The only word I can say :->

> 1:
>  loadimm  1, r1 ; // or something like that.
>  store   r2, r1 ; // here, r2 already points to the location p, the address is
>                   // "desambiguified" at execution (otherwise it would trap
>                   // if a TLB miss occurred).
> 2:
>  load    r3, r4 ; // like above, r3 already points to *q. If q==p, the value
>                   // of r4 becomes that of r1. there might be a small delay
>                   // (1 or 2 cycles at most) if no bypass is designed in the LSU.
> 3:
>   add r2, r4, r2; // there's a bypass on the Xbar. The only bad surprise comes from
>                   // the future reuse of r2 as a pointer without a previous explicit
>                   // prefetch, a few cycles of penalty. my first estimations are
>                   // less than i expected (1 or 2 cycles) but this will not be true
>                   // in the future or in far-from-ideal cases.

Maybe I misused the term disambiguation - I understand that the code
above will work just fine. But often you can do this:
loada   r3, r4  ; start loading r4, add r3 to the disambiguation memory (DM)
loadimm  1, r1
store   r2, r1  ; if r2 is in the DM, remove it
verify  r3, r4  ; if r3 is not in the DM, behave as a load (instead of a nop)
add r2, r4, r2

So loada will have time to get the data during loadimm.
IMHO this code should be faster (only one cycle in this particular
case). But you can hoist the load like this only if you are sure [r3] is
not changed later by the store, and at compile time you never know
whether two pointers might be the same (if they are the same type).
That is why the verify is there. It costs another cycle here, so it is
no win. But I've seen larger samples in the IA64 docbook where much more
was saved. It is probably tied to superscalar architectures - in FC0
it will be simpler to do a prefetch. So forget my kidding ;)
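
For the record, the loada/verify idea can be modelled in a few lines of
Python. This is only a sketch under my own assumptions: the DM is a
plain set of addresses, and the class and method names are invented,
not F-CPU or IA-64 opcodes:

```python
# Hypothetical model of the loada/verify idea. The "DM"
# (disambiguation memory) is modelled as a set of addresses.

class Machine:
    def __init__(self, memory):
        self.mem = dict(memory)   # address -> value
        self.dm = set()           # addresses with a pending early load

    def loada(self, addr):
        # Early load: fetch now and remember the address in the DM.
        self.dm.add(addr)
        return self.mem[addr]

    def store(self, addr, value):
        # A store to a tracked address invalidates the early load.
        self.mem[addr] = value
        self.dm.discard(addr)

    def verify(self, addr, early_value):
        # Entry survived: the early value is still good (acts as a nop).
        if addr in self.dm:
            self.dm.discard(addr)
            return early_value
        # Entry was removed by a store: behave as a normal load.
        return self.mem[addr]


def run(p, q):
    m = Machine({0x10: 7, 0x20: 9})
    r4 = m.loada(q)        # loada  r3, r4
    m.store(p, 1)          # store  r2, r1  (r1 == 1)
    r4 = m.verify(q, r4)   # verify r3, r4
    return r4


print(run(p=0x10, q=0x20))  # pointers differ: prints 9
print(run(p=0x20, q=0x20))  # pointers alias:  prints 1
```

In both cases r4 ends up with the value a plain load placed after the
store would have returned; the only difference is whether the reload
was needed.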

regards, devik

To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/