
Re: [f-cpu] Re: Navier-Stokes



hi !

Juergen Goeritz wrote:
> On Thu, 18 Apr 2002, Yann Guidon wrote:
> >mojn',
> moin,
jo gojl,

> >Juergen Goeritz wrote:
> >> But if I want to use a switchable strategy I need some means
> >> to control this beast, don't I? Are there any registers where
> >> I could add those control bits or do I have to make a separate
> >> set?
> >At this time, there is nothing prepared yet. At least a few SRs should
> >be present to give the programmer the HW configuration (block size, line
> >width, replacement strategy) at least in a hardwired way, so the user
> >can select the proper algorithm. But a configurable setup is not excluded,
> >as long as it doesn't hurt the cooperation of tasks (like the MTRRs, it
> >must be accessed only by the superuser).
> 
> How about special register access methodology opcodes?
> This way you define the entry point how to access them
> but do not need to fix everything (layout, number, etc.)
> from the beginning. Just have it read/write for superuser
> only, one parameter being SR#.

i don't understand exactly, but just in case, here is the current method :

- there are only get, put, geti and puti as means to access the SRs.
  These instructions stall the pipeline until it is certain that there is no
  access violation.

- the SRs are either hardwired (read-only, for example : the mask revision),
  supervisor (read-only too, if you are not running with a suitable privilege)
  or user-controlled (for example your private size registers).

- the hardwired registers provide information to the running system, they are
  defined during the CPU design and say how many SRs there are, what kind of CPU,
  what is the register size, what opcodes and what units are provided...

- The "superuser" registers control those features that have any impact on the
  running system, for example the trap handler addresses, the TLB, the memory mapping...
  Note that where possible, every resource gets its own individual "authorisation"
  bit, so a piece of code can be allowed to deal with only one resource at a time.
  A task can only reset this bit (when it is set), but not set it, to enforce
  the system's protection. This is a bit similar to the Hurd tokens but can
  be used in other places such as Linux as well.

- The "user" registers are those which have no protection bit, for example the
  task's private performance counters or size attributes. A user can be "granted" access
  to an SR-controlled resource if the super-user modifies the associated SR enable bit,
  but otherwise all the task has is a virtual address range to play with.

There is no specific opcode for managing that, it's all done with get/put
and it traps if the necessary rights are not granted, that's all.
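
As a rough illustration of the rules above, here is a minimal C model of the get/put checks. This is only a sketch: the SR numbers, the register contents, the names (sr_get, sr_put, sr_set_grant) and the choice that reads never trap for a valid SR number are all assumptions made for the example, not the actual F-CPU definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical protection classes, named after the three kinds of SR above. */
typedef enum { SR_HARDWIRED, SR_SUPERVISOR, SR_USER } sr_class;
typedef enum { SR_OK, SR_TRAP } sr_status;

typedef struct {
    sr_class cls;
    bool     grant;  /* per-resource authorisation bit (supervisor SRs only) */
    uint64_t value;
} sr_entry;

#define SR_COUNT 4
/* Toy register file; the numbering and contents are invented. */
static sr_entry sr_file[SR_COUNT] = {
    { SR_HARDWIRED,  false, 0x0001 }, /* e.g. mask revision */
    { SR_SUPERVISOR, false, 0x1000 }, /* e.g. trap handler address */
    { SR_SUPERVISOR, true,  0x2000 }, /* supervisor SR granted to the task */
    { SR_USER,       false, 0      }, /* e.g. private performance counter */
};

/* get: assumed here to succeed for every existing SR. */
sr_status sr_get(unsigned n, uint64_t *out) {
    assert(out != NULL);
    if (n >= SR_COUNT) return SR_TRAP;
    *out = sr_file[n].value;
    return SR_OK;
}

/* put: traps on hardwired SRs, and on supervisor SRs unless the caller
   is the supervisor or holds the authorisation bit for that SR. */
sr_status sr_put(unsigned n, uint64_t v, bool supervisor) {
    if (n >= SR_COUNT) return SR_TRAP;
    sr_entry *e = &sr_file[n];
    if (e->cls == SR_HARDWIRED) return SR_TRAP;
    if (e->cls == SR_SUPERVISOR && !supervisor && !e->grant) return SR_TRAP;
    e->value = v;
    return SR_OK;
}

/* A task may clear its authorisation bit but only the supervisor may set it. */
sr_status sr_set_grant(unsigned n, bool on, bool supervisor) {
    if (n >= SR_COUNT || sr_file[n].cls != SR_SUPERVISOR) return SR_TRAP;
    if (on && !supervisor) return SR_TRAP;
    sr_file[n].grant = on;
    return SR_OK;
}
```

In this model a user task can write SR 2 until it gives up the bit with sr_set_grant(2, false, false); after that the same put traps, and only the supervisor can hand the bit back.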

> >> From this I understand that the load/store opcodes each have
> >> a flag telling the cache what to do with the data, i.e. keep
> >> or forget immediately after use.
> >yup. that's in the manual since... huh... a long time.
> 
> ... JG at his high desk carefully blowing the dust off
> the ancient manual nearly falling apart. Carefully turning
> each page to not have them crumbling to dust. Only by the
> contrast enhancer glasses he is able to read the fading
> letters from the yellowed surface... :-D

just in case you have not found it, it's almost at the end ;-)

> >> >Together and with some adaptative algorithms, this is enough AFAIK.
> >> >Adaptative strip-mining is an efficient way to process large data sets
> >> >at the speed of the L1. I think that multi-level strip-mining is also
> >> >possible, though a bit more complex, but if i think what you think correctly,
> >> >this will do the trick.
> >>
> >> Yes, this would do the trick. It's still open though how the
> >> high level language developer (f,c,c++) could use/influence
> >> this option manually.
> >
> >Adaptative strip-mining is a very high-level construct which
> >requires the user to read the system clock so he can dynamically
> >adapt a set of parameters (usually a buffer size, which will converge
> >to the size of the L1 minus the most used global variables).
> >It's not often straight-forward but highly portable across platforms,
> >because it will adapt to it automatically. For example, i had
> >designed a program on a PMMX and found different performances
> >when run on a PII because the cache strategy is completely different.
> 
> Anyway, who wants to end up programming around hardware cache
> strategies? :-/
nobody "wants" but there are cases (yours ?) where this becomes necessary.

Fortunately, the cache has become quantitatively and qualitatively better
with the PII because the L2 cache was much closer to the core. The bandwidth
and the response time got better and they probably implemented a
speculative prefetching mechanism. So my code ran at L1 speed from L2-size
buffers on a PII, whereas on a PMMX it ran at L1 speed only from L1-size buffers.
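
For readers who have not met the technique, here is a minimal C sketch of the strip-mining idea discussed above. It is simplified: instead of adapting continuously while the program runs, it does a one-shot sweep over power-of-two strip sizes and keeps the fastest one. All names (process_strips, tune_block, timed_cost, workload) are hypothetical, and clock() stands in for whatever system clock is available.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Process n floats in strips of `block` elements. Each strip is touched
   twice, so the second pass runs from cache when the strip fits in it. */
static float process_strips(float *data, size_t n, size_t block) {
    float sum = 0.0f;
    for (size_t start = 0; start < n; start += block) {
        size_t end = (n - start < block) ? n : start + block;
        for (size_t i = start; i < end; i++)   /* pass 1: update in place */
            data[i] = data[i] * 0.5f + 1.0f;
        for (size_t i = start; i < end; i++)   /* pass 2: reduce */
            sum += data[i];
    }
    return sum;
}

typedef struct { float *data; size_t n; } workload;
typedef double (*cost_fn)(size_t block, void *ctx);

/* Measure one run with the given strip size, in seconds of CPU time. */
static double timed_cost(size_t block, void *ctx) {
    workload *w = ctx;
    clock_t t0 = clock();
    volatile float sink = process_strips(w->data, w->n, block);
    (void)sink;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Try min_block, 2*min_block, ... up to max_block and return the cheapest. */
size_t tune_block(cost_fn cost, void *ctx, size_t min_block, size_t max_block) {
    assert(min_block > 0 && min_block <= max_block);
    size_t best = min_block;
    double best_cost = cost(min_block, ctx);
    for (size_t b = min_block; b <= max_block / 2; ) {
        b *= 2;
        double c = cost(b, ctx);
        if (c < best_cost) { best_cost = c; best = b; }
    }
    return best;
}
```

A caller would fill a workload with its buffer, call tune_block(timed_cost, &w, 256, 1 << 20) once (or periodically), and then use the returned strip size for the real processing. On a machine like the PII described above, one would expect the chosen size to drift towards the L2 size rather than the L1 size, but that depends entirely on the hardware and on timer resolution.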

> Could probably be easier to change the cache strategy on the fly?

if you want to change the cache strategy, there must be more than one
in HW. Intel's MTRRs (and the competitors' equivalents) provide a means
to modify the cache strategy for a fixed number of memory ranges
(for example, the video memory can be set as write-combining and
multi-CPU shared memory space can be set as uncacheable).
It is usually possible to modify this while the system is "alive"
but there's still a risk of losing data if you switch from one
policy to another and forget to properly flush the caches etc...
But that's for the x86 world.
It is possible to do this for F-CPU but i don't see the point of an MTRR-like
system, at least for the first-generation F-CPUs. I am not exactly sure
of what you have in mind but if it's simply changing the cache
from write-through to write-back, do not forget that F-CPU is a multitasking
system and a user task should not change an environmental variable
that might impact the performance of the other tasks.
So as a rule of thumb, the FC0 has a private memory range (as fast as possible)
and a public range (uncached), and the task can specify whether to keep data in cache
or not with the "hint bit" (flush) in the load and store instructions. This
does not impact multi-tasking systems at all and is very simple to implement.
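
To show why a per-access hint is enough for streaming data, here is a toy C model of a direct-mapped cache where each access carries a "keep" flag. It is only one possible reading of the hint bit (a line accessed without "keep" is not allocated, and is dropped if it was already present); the cache geometry and the names toy_cache and toy_access are invented for the example and say nothing about the real FC0 cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LINES      8   /* toy cache: 8 direct-mapped lines */
#define LINE_BYTES 32  /* assumed line size */

typedef struct {
    bool     valid[LINES];
    uint64_t tag[LINES];
} toy_cache;

/* Returns true on a hit.
   keep == true : normal access, the line is (or stays) cached.
   keep == false: "forget after use", a miss allocates nothing and
                  a hit releases the line once it has been served. */
bool toy_access(toy_cache *c, uint64_t addr, bool keep) {
    assert(c != 0);
    uint64_t line = addr / LINE_BYTES;
    unsigned idx  = (unsigned)(line % LINES);
    bool hit = c->valid[idx] && c->tag[idx] == line;
    if (keep) {
        c->valid[idx] = true;
        c->tag[idx]   = line;
    } else if (hit) {
        c->valid[idx] = false;
    }
    return hit;
}
```

With this model, a task can walk through a large buffer with keep == false and its working set (accessed with keep == true) stays in place, because the streaming accesses never evict anything. Nothing global is changed, so other tasks are not affected.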


> >However, the LSU hints are not "portable" and not accessible from
> >portable C code. Intrinsics or macros are probably necessary,
> >but it's as ugly as using MMX intrinsics in C code, so go figure...
> >But if a compiler is "smart" (?) enough, it should do the job.
> >This goes along with the same process that is used to globally
> >allocate the registers, because program-wise statistics (and even
> >profiling) are necessary to set the good flags at the right place.
> 
> Jo, jo. But there ain't no PD compilers around that could
> do the job, are there?
you mean, profiling compiler ?

> The gcc 2stage profiling optimization
> features are not thaaat convenient to use for global...

profiling is not often used, but it is useful in constrained real-time applications,
for example if you want to track where your soft DVD player wastes time.
If you have a relatively coherent program you can attempt to do some auto-tuning
(adaptive programs), which has the advantage of being portable, but you often
forget what "convenience" means when you are CPU-bound, memory-bound and resource-bound, no ?

> JG
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~