[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[f-cpu] CAS in FC0


Christophe wrote :
> > I do not doubt that what you say is true, here. the problem
> > is still : "how do you think it can be implemented in the existing framework" ?
> I'm sure you would find a solution if you wanted to storm your brain ;-).
i do not like this kind of answer, even though i know this is humorous
(and we lack from it here). The allusion that i'm not doing my best
to find a solution is a lie and i hope everybody notice the smiley.

Come on : you and others ask for something, debate from it etc.
But nothing "more". You speak about "simple state machines" and
other abstract things and you take its realisation as granted,
no plan, no code, just words. If you rely on me for finding
solutions to YOUR problems, we won't get very far. i'm a bit
tired from being the "F-CPU man" and this kind of behaviour
doesn't help. The rule is usually : if you want something,
you have to find the solution, not ask others to do the dirty
work, otherwise it's not fair and too easy for those who add
bells and whistles at no effort.

But you're lucky this time. As if i had no late work for the
university and not enough work on the F-CPU VHDL source code,
manual and utilities (i just finished a first version of FROMFS
and started rebuilding YGASM)....


I have to read and reread what Christophe has already written
(thank you for your patience). I may have a beginning of a solution
for the CAS (not CAS2) instruction in F-CPU and the implementation
in FC0. However it is (still) ugly and has some drawbacks.

**** Concerning the format :

It must be an "optional" opcode because it requires some specific fields
(using R1 as an operand fetched from the register set).

here is a copy/paste of one of your posts :
cas r1,r2,[r3] :
IN :
    r1 : requested value, r2 : new value, r3 : pointer
    r1 : new requested value, r2 : 0 if CAS successes

this should more look like
 cas r1,[r2],r3 :
IN :
    r1 : requested value, r2 : pointer, r3 : new value
    r3 : new requested value, r2 : 0 if CAS successes

- the pointer must be in the middle
- the main destination register must be on the right
  (we have the choice of using r1 and r3 (load format),
   or r3 and r3^1 (standard format))

**** Concerning the scheduling : if CAS becomes an instruction, it
is pipelined. This means that there is a "schedule" for each operation.
For CAS, there is a problem : ALL WRITES occur during the 5th cycle
of the pipeline (after the instruction was decoded). It makes things
easier and there is nothing to worry about. But with a CAS, there is
a read then a write, so the write cycles occurs several cycles later !
So the scheduler must be modified to verify that no store instruction
enters the pipeline when CAS is finishing. Nothing impossible here
but it adds yet another condition to check and the instruction issue
logic is already pretty fat :-/
OTOH there is only one pointer and it requires only one check during
the first cycles, so the "late" store is not "much dangerous"
(as long as CAS is the only exception to the rule).

You describe it like this :
>    - read the actual value in memory location.
In fact it works like this :
 - decode opcode + register read + check pointer validity
 - Xbar + issue + LSU read
 - At this point, there is a problem : the register's value
   must now enter the execution units but the LSU has not
   yet aligned and shifted the 256-bit line, so the 2 operands
   have not entered the pipeline at the same time : HEADACHE #1 !
    There is no clean solution to this problem. Adding a delay
    is not a good thing. reducing the LSU depth will slow down
    the clock. h*l* sh*t.

>    - if the actual value equals to the requested value, set the requested value
>      with the new value else set the requested value with the actual value
>    - new value is set to true if it equals to requested value.
Usually this would be done in the (extended) INC unit, where an unsigned
comparison can be performed and a MUX can be switched from the result
(min/max instruction, but not yet implemented). But here the dataflow is a bit
different because one operand of the swap is not the operand of the comparison.

>    - write the requested value in memory location.
So the result of the CAS operation goes to Xbar, then to the R7 and LSU shifter,
and finally (if validated) the last cycle is a LSU write.

I don't think it's impossible but it's a smelling worm can.


**** Concerning data locking :

In the pipelined (and hypothetical) example, the write occurs depending on the
result of the comparison and the possible access to our local memory
was not addressed. Now, i will examine the conditions that interrupt the
write-back :

- IRQ or trap : in this case, the CAS instruction continues its way through
the pipeline. It finishes soon enough to avoid any interference with the
IRQ routine's code (or so i hope).

- Another CPU writes to the currently used LSU line. Or worse ! our own code
makes a write to the same location, an hypothetical code like :
  CAS r3,[r1],r4
  store [r1],r2

As we know, the store will complete before CAS. But in the general case,
the CAS must fail when the line that is pointed to by [r1] is modified since
the read. This is nothing more than the "dirty" flag, but in practice we need
a "real" dedicated flag anyway because each LSU byte can have a lot of "dirty"
bits already on, as the result of normal operation. More generally : the
"lock" flag of the pointed line is set when we read, and it is reset when
_any_ write occurs. Not difficult to do.

So here i point this fact : there is a way to perform CAS (and other "locked"
operations) without a "flag" (like what nicO said) but a collection of flags,
one per LSU line. This is important because it allows several simultaneous
locks to be done at a time. It is not a bottleneck either, as would a single
"flag" be (remember : "whygee's scalability rants"). It is another "hidden"
flag that depends on the implemented HW and the theoretical limit is the number
of registers itself (if there is one day a version which can have 63 pointers
at a time). I would have liked something better but it's already a win.

However, if we "lock" the LSU line, there is a problem : only one CAS
can occur at a time in a LSU line. This means that a large part of the line
is wasted, unless there are other read-only variables in the same line
(during the lifelength of the CAS). Those who don't like this can modify
the source and increase the LSU complexity (and decrease the clock speed),
those who put several CAS variables in the same line can either
enter in a deadlock (in the worst case) or read the manual
(you get what you pay for).

That said, and by remembering that the I/O interface is connected only
to the LSU/fetcher, you get the picture : if a CPU writes data to the
same LSU line as this of the "reserved" line, CAS will fail to write back.
Simple, doable with existing gates, and it validates my architecture
(no need for MESI, LSU is connected to L1, L2, SDRAM and I/O).


Now that we can "tag" the LSU lines, there is still the scheduling problem.
in fact, the real problem is to think of CAS as an "instruction".
Things become _so_easy_ if we use a combination of instructions !!!
We do not need any "CAS instruction" but a variant of the usual
load and store instructions, just like the locked versions in ALPHA
(and other CPUs).

* "load locked" will "tag" the line it selected for reading. it's just a bit
that the issue logic has to set after decoding the instruction.

* "store conditional" will proceed if two conditions are met : the specified
condition is true (we can check for LSB, MSB, zero...) and the "tag" is still
set. I often wondered if it was worth it to add a conditional store, but now
i'm convinced. Except that we need a version that checks the "tag".

What about the scheduling ? it's almost as fast, if there are all the
bypass networks !

  load_tag [r1],r2
  xor  r2,r3,r4
  if r4==0 store_locked [r1],r3

(or something like that)
>From the scheduling point of view, there is no change to the existing
structure :-) and it executes as fast as a single instruction because 
there are bypass nets. You can also check whatever condition you want,
not only egality but superiority or any funky stuff.

I am not sure it's possible to do a CAS2 with that, simply because
it's not possible to check 2 pointers at a time in a single cycle
(because the instructions are not decoded at the same time).


The previous code is interruptible, nice, i hope you like it.
But i still have to find how an external device can do a store_locked
in another CPU's LSU :-/ Because it is where things become REALLY
annoying and you wish there was a central mailbox, as i proposed before.
I hope that Christophe will let me explain that when i find time to do it.

enough babbling,
good night.
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/