
Re: [f-cpu] Conditional load and store, the return



hi,

Christophe Avoinne wrote:
> "Cedric BAIL" wrote :
> > The problem is that you don't have any error: in fact the test is false, so no
> > real memory access is done. So you must not re-execute the instruction, only
> > pass over it, even if the address is bad. That's where I see a problem: you execute a
> > handler for nothing...
> 'loadCC': if you access memory regardless of the test result, yes, you will
> raise an exception even for a false test. With this design you need to delay
> the exception until the test has completed and is true.

The condition test AND the pointer check happen at the same time.
The decision (stall, trap or issue) is taken during the Xbar stage.
So there is no problem such as the one Cédric explained (a NULL pointer
with a false condition). The issue logic should be smart enough (and it's
not too difficult) to avoid these situations.

Let's say that the condition has precedence over the pointer.
In HW, the condition is available sooner than the result of the pointer
LUT read, so it's also a natural choice.
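A minimal sketch of that issue-stage decision, assuming four hypothetical input signals (condition ready/true, pointer check ready/valid). The mail lists three outcomes (stall, trap, issue); here a false condition is modeled as a separate SKIP outcome, meaning the instruction goes through as a no-op with no memory access and no trap. The names are illustrative, not F-CPU signal names.

```python
from enum import Enum


class Decision(Enum):
    STALL = "stall"  # inputs not ready yet: hold the instruction a cycle
    TRAP = "trap"    # condition true but pointer invalid: raise an exception
    ISSUE = "issue"  # condition true and pointer valid: do the memory access
    SKIP = "skip"    # condition false: no-op, no access, no trap


def issue_decision(cond_ready: bool, cond_true: bool,
                   ptr_ready: bool, ptr_valid: bool) -> Decision:
    """Hypothetical model of the issue logic for a conditional load/store.

    The condition has precedence: once it is known to be false, the
    pointer check is ignored, so a bad pointer cannot cause a trap.
    """
    if not cond_ready:
        return Decision.STALL
    if not cond_true:
        return Decision.SKIP
    if not ptr_ready:
        return Decision.STALL
    if not ptr_valid:
        return Decision.TRAP
    return Decision.ISSUE


# Cedric's case: NULL pointer with a false condition -> no trap.
print(issue_decision(True, False, True, False))  # Decision.SKIP
```

In hardware this whole function collapses into the "big AND-OR" of status bits; checking the condition first costs nothing because, as noted, it resolves before the pointer LUT does.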

> You should really detail your explanation, because I really don't see how you
> plan to execute 'loadCC' and 'storeCC' without running into that kind of problem.

I hope the problem has disappeared now :-)


In a previous mail, you also wrote:

> From: "Cedric BAIL"
> 'load': you mean you always load the value from memory and assign the value
> to the data register only if the test succeeds? Well, if so, an exception
> will occur before any test anyway.
No. This could waste CPU time (spent in the trap handler)
if the SW is badly coded.

> But if you only access memory just after the test succeeds, an exception
> will occur and you still need to re-execute the instruction.
They are checked at the same time, and the issue logic (a big "AND-OR"
of all the status flags and conditions) will sort things out.

> Because your instruction has no internal state, you cannot use partial
> execution with exceptions. After an exception occurs, you cannot finish an
> instruction by resuming a partial execution: you need to re-execute the
> instruction.

The F-CPU instructions are (and should remain so) such that partial
execution is not necessary. The instruction flow is only controlled
at the decode/issue stage.

> > > Again, I don't see any problem.
> >
> > Currently if you have this :
> >
> > [DATA]
> >    |
> > [CPU 1]---[CPU 2]
> >
> >
> > When CPU 2 does a conditional load/store pair, it will not be able to see
> > whether CPU 1 accessed the data.
# error : "access" variable undefined.

For reading, there is no problem.
For writing, the "dirty" flag will change,
so locked sequences will work (well, unless you
read too much into the lengthy thread about this).
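A toy model of why reads are harmless and writes are visible, assuming a hypothetical per-CPU reservation whose "dirty" bit is set whenever another CPU writes the reserved line. The class and method names (`SharedMemory`, `load_locked`, `store_conditional`) and the 8-word line size are illustrative assumptions, not F-CPU instructions.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    line: int            # line index reserved by the locked load
    dirty: bool = False  # set when another CPU writes that line


class SharedMemory:
    """Toy model: a write by one CPU marks other CPUs' reservations dirty."""

    def __init__(self, size: int, line_size: int = 8):
        self.data = [0] * size
        self.line_size = line_size
        self.reservations: dict = {}  # cpu id -> Reservation

    def _line(self, addr: int) -> int:
        return addr // self.line_size

    def load(self, cpu: int, addr: int) -> int:
        # Plain reads never disturb anyone's reservation.
        return self.data[addr]

    def store(self, cpu: int, addr: int, value: int) -> None:
        self.data[addr] = value
        line = self._line(addr)
        for other, res in self.reservations.items():
            if other != cpu and res.line == line:
                res.dirty = True  # another CPU wrote: flag changes

    def load_locked(self, cpu: int, addr: int) -> int:
        self.reservations[cpu] = Reservation(self._line(addr))
        return self.data[addr]

    def store_conditional(self, cpu: int, addr: int, value: int) -> bool:
        res = self.reservations.pop(cpu, None)
        if res is None or res.line != self._line(addr) or res.dirty:
            return False  # no reservation, wrong line, or line was written
        self.store(cpu, addr, value)
        return True


mem = SharedMemory(64)
v = mem.load_locked(2, 0)
mem.store(1, 3, 42)  # CPU 1 writes into the same line
print(mem.store_conditional(2, 0, v + 1))  # False
```

CPU 2 never needs to see CPU 1's traffic in general; it only needs the one flag attached to the line it reserved.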

> > The only way CPU 2 could know that is if all of CPU 1's memory
> > accesses were always sent to CPU 2... That can be big
> > overkill. Perhaps I'm missing something, but the problem exists.
Badly written programs. False assumptions.
Communication is an expensive resource that a lot of programmers waste
out of laziness (which they excuse with "portability", "language",
"existing code base", etc.). Currently, we can't afford a complex
and costly MESI protocol. And if you look at PCs, you'll find even more
reasons to seek another approach.

> OK, you are speaking about inter-CPU locking, not intra-CPU locking. Well, of
> course that is the most difficult problem to solve.
... given a certain perspective.
This has been "solved" in many ways by several generations of programmers
and computer designers.

> CPU1 and CPU2 can access DATA directly; because they each have a
> different LSU, we are stuck.
that's one very simple way to see this :-)

> In fact, the problem only occurs when you want two or more processors, so I
> think you need some extra hardware to allow global locking of data between CPUs.
This is the kind of stuff that exists in "high performance" computers,
but when I talk about it, I get flamed. Otherwise, I would already have
invented a clean interface for it and the debate would be over.
But there is the argument that "locks" are mostly for local (intra-CPU)
resources and that an external lock would slow down all CPUs.

> I suppose you want several CPUs to be able to access the same DATA directly:
> 
> CPU1------------------\
> CPU2------------------+------DATA
> CPU3------------------+
> CPU4------------------/

I would naturally put this in the "G-chip"...

> You need a bridge so that all CPUs can access DATA. It is not possible for all
> CPUs to access DATA at the same time.

4-port SRAMs can now run at around 250 MHz...
but an FPGA could easily do the job as well, and manage the locks.

> Just an idea: besides the LSU of each CPU (internal LSU), we could have an
> external LSU which only contains the locked entries:

Does this mean that a specific instruction is needed?
I wanted to do this through the SRs, so as not to limit the number
of active locks/semaphores.

> If a CPU does a normal load/store, it just bypasses the external LSU: faster behavior.
> If a CPU does a lock load, it sets a new entry in the external LSU.
> If a CPU does a lock store, it checks whether the entry is in the external LSU.
> 
> This external LSU can be seen as a special memory.
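A minimal sketch of the external-LSU idea as quoted, under stated assumptions: the table is keyed by address, a lock load fails (returns `None`) if another CPU already holds the entry or the table is full, and a successful lock store releases the entry. The failure rules and all names (`ExternalLSU`, `lock_load`, `lock_store`) are hypothetical; the proposal does not specify them.

```python
class ExternalLSU:
    """Hypothetical shared table that holds only the locked entries."""

    def __init__(self, memory: list, capacity: int):
        self.memory = memory
        self.capacity = capacity
        self.entries: dict = {}  # address -> id of the CPU holding the lock

    # Normal accesses bypass the table entirely (the "faster behavior").
    def load(self, addr: int) -> int:
        return self.memory[addr]

    def store(self, addr: int, value: int) -> None:
        self.memory[addr] = value

    def lock_load(self, cpu: int, addr: int):
        """Set a new entry; return None if the lock cannot be taken."""
        owner = self.entries.get(addr)
        if owner is not None and owner != cpu:
            return None  # assumption: held by another CPU -> fail
        if owner is None and len(self.entries) >= self.capacity:
            return None  # assumption: table full -> fail
        self.entries[addr] = cpu
        return self.memory[addr]

    def lock_store(self, cpu: int, addr: int, value: int) -> bool:
        """Store only if this CPU's entry is present, then release it."""
        if self.entries.get(addr) != cpu:
            return False
        self.memory[addr] = value
        del self.entries[addr]
        return True


mem = [0] * 16
xlsu = ExternalLSU(mem, capacity=4)
v = xlsu.lock_load(1, 5)              # CPU 1 takes the lock on word 5
print(xlsu.lock_load(2, 5))           # None: CPU 2 cannot take it
print(xlsu.lock_store(1, 5, v + 1))   # True
```

Note that a plain `store` to a locked address goes straight to memory without consulting the table, so nothing here keeps the normal path and the locked path consistent.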

Looks cool, but there is a problem: how do you manage coherency between
the usual 'L0' LSU and the special one?

The problem of splitting the functions is a natural one, but I wouldn't
put this in the memory address range, as it conflicts with the LSU,
and coherency between the units adds more problems (and shifts
the SW problems to the HW, which is usually more difficult to design,
particularly when there is no code).

I would propose an independent "lock" space for this purpose,
so that it wouldn't conflict with the memory space. The same kinds
of techniques and protections can be enforced (N entries, associative
addressing, trap on illegal ranges...), but it's much simpler.
Instructions would be "lock imm/reg, value, result" and
"release imm/reg, result". No addressing, no TLB access/miss,
no granularity problems (memory should be reserved for high-bandwidth,
large-chunk transfers, and using it for a single byte or word is overkill).
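A rough model of the proposed lock space, assuming N associative entries keyed by a lock ID, a range of IDs the task may use (anything outside it traps), and a result of 1 for success and 0 for failure. The mail does not say what `value` holds; here it is simply stored with the entry (it could be an owner tag). `LockSpace` and `LockTrap` are hypothetical names.

```python
class LockTrap(Exception):
    """Models the trap taken on a lock ID outside the permitted range."""


class LockSpace:
    """Hypothetical lock space kept apart from memory: N associative
    entries keyed by a lock ID, with no addresses and no TLB."""

    def __init__(self, n_entries: int, legal_ids: range):
        self.n_entries = n_entries
        self.legal_ids = legal_ids  # IDs this task is allowed to use
        self.table: dict = {}       # lock ID -> value stored by 'lock'

    def _check(self, lock_id: int) -> None:
        if lock_id not in self.legal_ids:
            raise LockTrap(f"illegal lock id {lock_id}")

    def lock(self, lock_id: int, value: int) -> int:
        """'lock imm/reg, value, result': result is 1 if acquired, else 0."""
        self._check(lock_id)
        if lock_id in self.table or len(self.table) >= self.n_entries:
            return 0  # already held, or all N entries in use
        self.table[lock_id] = value
        return 1

    def release(self, lock_id: int) -> int:
        """'release imm/reg, result': result is 1 if the lock was held."""
        self._check(lock_id)
        if lock_id in self.table:
            del self.table[lock_id]
            return 1
        return 0


locks = LockSpace(n_entries=4, legal_ids=range(0, 64))
print(locks.lock(7, value=1))  # 1: acquired
print(locks.lock(7, value=2))  # 0: already held
print(locks.release(7))        # 1: released
```

Because a lock is just an ID, there is no line size, no address translation and no interaction with the data caches; the only shared state is the small associative table.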

You see, I try to bring solutions, too. I hope someone will improve
on them. Thanks to Christophe for the "parallel LSU" idea: it's less
scary than the SRs and can reuse a lot of existing methods.

Btw, in a multi-CPU configuration, an interconnection network can
be dedicated to passing lock messages, if they can't be multiplexed
with the memory streams on the front-side bus (this last idea, however,
is what will certainly be implemented first).
This is a common technique in large computers.

> A+
WHYGEE