
The little trick to double graphics performance

>> 4) There is a little trick that can double or treble the graphics
>> performance of the box with modern processors.  Some games become
>> fast enough to be played.  :-) This is shell work, but we need to hack
>> Xconfigurator in order to collect some of the data needed.

>What kind of hack?

   Pentium Pro and later architectures <P II, P III, Celeron> have a kind of
graphics cache (write combining) and some additional registers to control it.
These are disabled at powerup.  When the PPro came out, a few programs
appeared to turn them on: CTPPro and Fastvid.  They made a big difference,
and are still used to tweak PIIs and PIIIs.  I am enclosing the readme files
for both.  These are, of course, Windows programs, but the docs say what they
do quite well.  How unusual for Windows programs. :-)  I also have the
binaries if you want to see them.


FastVid 1.03.  Copyright 1996 by John Hinkley.  72466.1403@compuserve.com



According to Intel, enabling Write Posting (see below) on 82450 steppings
before B0 could result in "rare" problems on the PCI bus.  If the program
indicates that Write Posting is enabled when you first run this program
(the "Before" message) then you have a B0 or later stepping of the 82450
and don't need to worry about the A2 bugs.  The problem will manifest
itself when there are high levels of traffic on the PCI bus -- multiple
devices reading and writing at the same time.  A typical example where you
might have problems is when playing multimedia files like AVI and MPEG
animations.  The write combining options of this program can be used
without problems on any version of the 82450.

If you have a pre-B0 motherboard you may want to play around with write
posting to see the difference it makes but you shouldn't enable it all the
time -- it will occasionally lock up your computer.  If you really want or
need write posting you should consider getting a new motherboard.


			End of Warning.

This program enables Write Posting, banked VGA Write Combining and SVGA
linear frame buffer Write Combining on Pentium Pro motherboards based on
the 82450 chipset.  This will significantly improve graphic performance
from DOS and Win95.

The program must execute privileged instructions so it must be run in real
mode.  For the time being that means it must be run from DOS.  You cannot
run it from a DOS window or a full screen DOS session from Windows 3.x,
Win95, WinNT or OS/2.  For DOS, Windows 3.x and Win95 you can include the
program in your AUTOEXEC.BAT file (keep in mind that DOS4GW.EXE must be in
your search path).  If you try to run the program from a protected mode OS
you will get a DOS4GW error message and register dump.


Steppings of the 82450 chipset before B0 have bugs which forced Intel to
disable Write Posting -- in essence cache writes have been disabled for
the PCI bus.  The B0 stepping has been fixed and Write Posting is enabled
by default by the BIOS.  The difference is easily visible in writing to
video memory.  An A2 motherboard can only write about 8MB/sec to the
graphics card, a B0 motherboard gets about 18MB/sec.
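
To put these throughput figures in perspective, here is a rough sketch in
Python (not part of FASTVID) of the highest full-screen frame rate a given
write bandwidth allows.  It assumes every frame is rewritten in full at one
byte per pixel (8-bit colour); that assumption and the function name max_fps
are illustrative only, and real programs spend time on other work too.

```python
def max_fps(bytes_per_sec, width, height, bytes_per_pixel=1):
    """Upper bound on full-frame updates per second for a write bandwidth."""
    frame_bytes = width * height * bytes_per_pixel
    return bytes_per_sec / frame_bytes


# Throughput figures quoted above for the A2 and B0 steppings.
for label, rate in (("A2, write posting off", 8e6),
                    ("B0, write posting on", 18e6)):
    print(f"{label}: at most {max_fps(rate, 640, 480):.0f} fps at 640x480x8")
```

Under these assumptions 8MB/sec caps a 640x480 8-bit display at roughly 26
full frames per second, and 18MB/sec at roughly 59.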

But this is not the entire story.  With the Pentium Pro, Intel also
decided that Write Combining (the combining of several writes into a cache
line that can be bursted out the PCI bus) should be the responsibility of
the O/S, not the BIOS or hardware.  By enabling Write Combining the
throughput to video RAM can be further increased to 88MB/sec or more.
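
The idea behind Write Combining can be illustrated with a toy model in
Python.  This is only a sketch: it assumes a single combining buffer of one
32-byte cache line that is flushed whenever a write lands in a different
line, which is simpler than what the processor really does.  The function
name bus_transactions is made up for this illustration.

```python
LINE = 32  # bytes per cache line on the Pentium Pro


def bus_transactions(addresses, combine):
    """Count bus transactions for a sequence of 1-byte writes.

    Without combining, every write is its own transaction.  With combining,
    consecutive writes that fall into the same 32-byte line are merged and
    go out as one burst when a write to a different line arrives.
    """
    if not combine:
        return len(addresses)
    bursts = 0
    current_line = None
    for addr in addresses:
        line = addr // LINE
        if line != current_line:  # new line: previous one is flushed
            bursts += 1
            current_line = line
    return bursts


# One 640-byte scanline written byte by byte, starting at A0000.
writes = list(range(0xA0000, 0xA0000 + 640))
print(bus_transactions(writes, combine=False))  # 640
print(bus_transactions(writes, combine=True))   # 20
```

In this model sequential writes collapse into one burst per cache line, which
is why linear copies into video memory benefit the most.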

There are two mechanisms for which Write Combining needs to be enabled:
the banked VGA mechanism (the 128KB from A0000 to BFFFF) and the unbanked,
linear frame buffer that many of the newer cards support.  I will
henceforth refer to linear frame buffer write combining as LFBWC and
banked VGA write combining as BVWC.

Most low resolution DOS graphics applications and games use the banked
mechanism.  But since the VESA committee has defined a standard, and
UNIVBE and a few of the graphic card manufacturers have provided the VESA
services, some of the latest games are using the linear frame buffer (Duke
Nukem 3D and the Quake demo are two examples).  The linear frame buffer
usually gives better performance since it removes the requirement to
switch banks in hires modes.

I have only personally tested this program with 2MB and 4MB Matrox MGA
Millennium cards.  It has been run by others with other cards (S3 964
based, S3 968 based, Tseng 6000 based) and most benefit to some extent. 
The Number Nine Imagine 128 card does not seem to benefit.

With a 2MB Millennium there were problems with BVWC -- most hires VESA
modes would result in vertical stripes over the entire screen.  This
appears to be either a hardware or software bug on the part of Matrox.  I
found a workaround which eliminates the stripes but you don't get the full
speed enhancement from BVWC.  The LFB will still run at full speed.
Using a negative value for the number of megabytes (for example
FASTVID x11 -2) will enable this workaround.

I haven't tested FASTVID on any 8MB graphic cards but I think it will work
properly.  Please let me know if you find otherwise.

On many graphic cards, enabling the BVWC results in problems with some
programs that use VGA mode 0x12 (640x480x16colors).  This appears to be
either a hardware or software problem on the part of Matrox.  The problem
stems from the BVWC so you can run with that disabled if necessary.  Users
of other graphic cards have indicated the same problem.  Note that this is
not the same as the "vertical stripe" problem mentioned above.

Unfortunately, I have found that EMM386 interferes in some way with LFBWC
(tested with DOS6.2 and 7.0).  I also have reports from beta testers that
QEMM can interfere with LFBWC under DOS.  When running DOS you must remove
EMM386 from your CONFIG.SYS file for LFBWC to work (BVWC is not affected
by EMM386).  If EMM386 is loaded you will see no increase in speed of the
linear frame buffer.  Early on I wasn't able to get LFBWC with Win95. 
After removing EMM386, LFBWC started working from both DOS and Win95
sessions.  At some point I re-enabled EMM386 and found that LFBWC continued
to work from Win95.  I don't know what caused this "permanent" change
(maybe re-installing the graphic driver with EMM386 removed and LFBWC
turned on did it) -- let me know if you find out...

LFB write combining requires that FASTVID know where the linear frame
buffer is located.  Different graphic card manufacturers put it at
different addresses.  The LFBWC code in FASTVID currently queries any
installed VESA BIOS Extension driver for the LFB address so you should
install your VESA driver before FASTVID.  If you don't have a VESA driver
loaded (keep in mind that many cards have the driver in BIOS so you don't
need to explicitly load one) or your VESA driver doesn't support the LFB,
you will have to supply an address.  Theoretically this program will work
for any LFB address above 0x80000000 but I have only tested and verified
that it works for the Matrox MGA Millennium at 0xFF000000.  (Others have
successfully used it with other graphics cards at other addresses).  If
you supply an incorrect LFB address you will not see any increase in speed
of the LFBWC.

If this program can't automatically detect the LFB address, you can
determine its location from Win95.  Select Start, Settings, Control Panel,
(or My Computer, Control Panel), System, Device Manager, Display Adaptors,
your graphics card, Resources.  Scroll to the bottom of the Resource
Settings box and you will see a line that reads: "Memory Range XXXXXXXX -
YYYYYYYY".  The first value is the location of the linear frame buffer. 
For the Matrox MGA Millennium it is 0xFF000000.  If you have another
address take note of it and input it into FASTVID when asked.



	X controls Write Posting.
	Y controls VGA (banked) Write Combining.
	Z controls SVGA (linear frame buffer) Write Combining.
		For all three, 0 disables, 1 enables, any other value
		results in no change from the current setting.
	N indicates the amount of video memory in MegaBytes.
		Valid values are 2, 4, and 8.  Also valid are -2, -4,
		and -8 to apply the special "vertical stripe" patch.
	ADDRESS is the address of the linear frame buffer in hex.
		The Matrox MGA Millennium has it at FF000000.

Example 1: FASTVID4

	If no arguments are supplied you run through a question and answer
	dialogue and the program sets up the environment.  It will also
	tell you what the equivalent command line is for the options you
	selected.

Example 2: FASTVID4 111 4 FF000000

	Write posting is enabled
	VGA Write Combining is enabled
	SVGA Write Combining is enabled for 4MB video memory at FF000000

Example 3: FASTVID4 x01 4 FF000000

	The write posting setting is not changed by FASTVID
	VGA Write Combining is disabled
	SVGA Write Combining is enabled for 4MB video memory at FF000000

Example 4: FASTVID4 111 -2 FF000000

	Write posting is enabled
	VGA Write Combining is enabled.  "Vertical stripe" patch applied.
	SVGA Write Combining is enabled for 2MB video memory at FF000000

Example 5: FASTVID4 111 -4 FF000000

	Write posting is enabled
	VGA Write Combining is enabled.  "Vertical stripe" patch applied.
	SVGA Write Combining is enabled for 4MB video memory at FF000000
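
The argument rules above can be summarized in a short Python sketch.  This is
not FASTVID's own code; the function name decode_fastvid_args and the
dictionary keys are made up for illustration, and treating N and ADDRESS as
optional is an assumption based on the shorter command lines shown elsewhere
in this document.

```python
def decode_fastvid_args(argv):
    """Decode arguments of the form 'XYZ N ADDRESS' as described above."""
    def flag(ch):
        # 0 disables, 1 enables, anything else leaves the setting alone.
        return {"0": "disable", "1": "enable"}.get(ch, "unchanged")

    xyz = argv[0]
    if len(xyz) != 3:
        raise ValueError("first argument must be three characters (XYZ)")
    settings = {
        "write_posting": flag(xyz[0]),
        "vga_wc": flag(xyz[1]),
        "lfb_wc": flag(xyz[2]),
    }
    if len(argv) > 1:
        n = int(argv[1])
        if abs(n) not in (2, 4, 8):
            raise ValueError("N must be 2, 4, 8, -2, -4 or -8")
        settings["video_mb"] = abs(n)
        settings["stripe_patch"] = n < 0  # negative N selects the patch
    if len(argv) > 2:
        settings["lfb_address"] = int(argv[2], 16)  # hex, no prefix needed
    return settings


print(decode_fastvid_args("x01 -2 FF000000".split()))
```

For "x01 -2 FF000000" this reports write posting unchanged, VGA Write
Combining disabled, and SVGA Write Combining enabled for 2MB at FF000000 with
the "vertical stripe" patch applied.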


Included is a test program called VSPEED.EXE that reports the video
throughput for bit blit operations from DRAM to VRAM for both the banked
VGA and linear frame buffer mechanisms.

If you experience difficulties with VSPEED try using -l or -L on the
command line to eliminate the linear frame buffer test.  For example:

	VSPEED -l

will test only the banked VGA mechanism.

	VSPEED

will test both the banked VGA and the linear frame buffer (assuming the
card and VESA driver support it).


Sample VSPEED results from an Intel Aurora motherboard with the B0
stepping of the 82450 and a 4MB Matrox MGA Millennium:

FASTVID 000
Copy DRAM to banked VGA:           8.07 million bytes per second
Copy DRAM to linear framebuffer:   8.14 million bytes per second

FASTVID 100
Copy DRAM to banked VGA:          18.72 million bytes per second
Copy DRAM to linear framebuffer:  18.91 million bytes per second

FASTVID 011
Copy DRAM to banked VGA:          37.95 million bytes per second
Copy DRAM to linear framebuffer:  39.60 million bytes per second

FASTVID 111
Copy DRAM to banked VGA:          87.72 million bytes per second
Copy DRAM to linear framebuffer:  93.46 million bytes per second

FASTVID 111 -2
Copy DRAM to banked VGA:          49.20 million bytes per second
Copy DRAM to linear framebuffer:  93.46 million bytes per second


Sample VSPEED results from other cards with FASTVID 111 (unverified):

STB Powergraph (S3 Trio64)                       48 MillionBytes/sec
Spea/V7 (S3 Trio64)                              78 MillionBytes/sec
GXE64Pro (S3-964)                                22 MillionBytes/sec


The following tests were run on an Intel Aurora motherboard with the B0
stepping of the 82450, 64MB DRAM (all four SIMM sockets populated), and a
4MB Matrox MGA Millennium.  The "000" setting simulates an A2 motherboard
where Write Posting is disabled.

program:               fastvid setting: 000     100     011     111

VSPEED (LFB, million bytes/sec)           8      19      40      93
Duke Nukem 3D (640x480, fps)             14      25      18      31
Doom Benchmark (fps)                     38      70      48      74
640x480 FLC animation (fps)              25      48      88     121
Chris's 3D benchmark (SVGA)              21      38      66      77

Note that differences in motherboard and graphic card design may lead to
different results.  Most notably, some cards cannot sustain 93MB/sec in
the VSPEED test.
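
The relative gains are easier to compare as ratios against the "000"
baseline.  The following Python sketch simply divides the numbers from the
table above; the helper name speedups is made up for illustration.

```python
SETTINGS = ("000", "100", "011", "111")

# Figures copied from the table above.
RESULTS = {
    "VSPEED (LFB, million bytes/sec)": (8, 19, 40, 93),
    "Duke Nukem 3D (640x480, fps)": (14, 25, 18, 31),
    "Doom Benchmark (fps)": (38, 70, 48, 74),
    "640x480 FLC animation (fps)": (25, 48, 88, 121),
    "Chris's 3D benchmark (SVGA)": (21, 38, 66, 77),
}


def speedups(values):
    """Return each value divided by the first (the '000' baseline)."""
    base = values[0]
    return [round(v / base, 1) for v in values]


for name, values in RESULTS.items():
    ratios = ", ".join(f"{s}: {r}x" for s, r in zip(SETTINGS, speedups(values)))
    print(f"{name:34s} {ratios}")
```

The raw copy test gains the most from "111", while games that spend more time
computing than drawing gain less.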

The above are all DOS applications.  If you have an A2 motherboard turning
on write posting will increase the WinBench96 Graphic Winmark score by
about 25 percent.  The write combining features don't make much difference
to the Graphic Winmark score but there _are_ circumstances where write
combining can make a big difference.  One example is using the Media
Player to play an animation to a high resolution, highcolor or truecolor
window.  For example:

Run Win95 in a high resolution, direct color mode; say 1024x768, 24bits
per pixel.  Start the Media Player.  Open \FUNSTUFF\VIDEOS\WEEZER.AVI from
the Win95 CD-ROM.  Enlarge the playback window to nearly full screen (do
not use the Media Player's "full screen" option -- if you do it will
change the screen to a lower resolution 8 bit mode for playback).  Press
the Play button.  With write posting and write combining turned off you
will get very poor results, about 2 frames per second.  With write posting
on and write combining off that will improve to about 4 frames per second.
With both write posting and write combining on you will get very smooth
playback with the frame rate too fast to count.

You can see similar effects with the Hover! game on the Win95 CD-ROM.
Again, with Win95 in a hires direct color mode, enlarge the game window as
large as it will let you (about 640x480).  With write posting and write
combining off you will get poor performance.  With write posting on the
game will be playable.  With write posting and write combining the action
will be very smooth.

If you have a pre-B0 motherboard you can still benefit from write
combining (without fear of encountering the 82450 bugs) in the above DOS
and Win95 situations.

			further descriptions:

1) Write Posting:

Write Posting is where the processor "posts" data to the PCI bus and then
goes on its way without waiting for the write operation to complete.
Because of bugs in the pre-B0 steppings of the 82450 chipset Write Posting is
disabled on early Pentium Pro motherboards.  This severely limits the PCI
throughput to about 8MB/sec.  Most Pentium motherboards these days can get
over 80MB/sec, 10 times faster.  FASTVID can enable Write Posting on these
motherboards, increasing PCI throughput to about 18MB/sec.  You don't want to
do this routinely because the bugs in the chipset will eventually cause the
PCI bus to hang, forcing a reboot of the machine.  Motherboards with the B0
stepping have this bug fixed and Write Posting enabled by default.

2) Banked VGA Write Combining (BVGAWC):

This function allows separate writes to the banked VGA mechanism to be
combined into a cacheline that can be bursted out to video memory via the PCI
bus.  I believe this used to be handled in hardware but Intel decided to make
it a programmable function with the Pentium Pro to make the motherboard
architecture more general.  If you enable BVGAWC with FASTVID PCI throughput
will increase from 18MB/sec (B0 motherboard) to 90MB/sec for programs that
use the banked VGA mechanism (most DOS games).  If you enable only BVGAWC on
an early motherboard (Write Posting remains off) the bus bandwidth increases
from 8MB/sec to about 40MB/sec.  Some of the newer motherboards (ASUS for
instance) have this as a BIOS setup option.

3) Linear Frame Buffer Write Combining:

Many newer graphics cards have their graphics memory mapped linearly at very
high physical addresses (in addition to the banked VGA mechanism at A000:0000
and B000:0000) beyond the 2GB mark.  The reason for doing this is to make
access to video memory simpler and faster -- programs (and Windows drivers)
don't have to switch banks all the time to access all of video memory.  I
believe Pentium motherboards enable Write Combining for all high addresses
but the Pentium Pro design requires the use of the processor's MSRs (model
specific registers) to enable Write Combining.  Again, this was done to
generalize the motherboard design.  You can theoretically have multiple
devices mapped in high address space with different cacheability options.
Intel believes that the proper place for this to be handled is within a PNP
operating system.
Unfortunately, no operating system yet supports this.  As with BVGAWC, LFBWC
will increase PCI throughput from 18MB/sec to 90MB/sec (or 8MB/sec to
45MB/sec with Write Posting off) for programs that use the linear frame
buffer (some of the new hires DOS games, Windows drivers).

Exactly how much a difference any of these functions makes depends on the
applications being run and the graphics card you're using.  If you are using
a very slow graphics card you won't see much difference.  Programs that do
very little graphic output will show little or no difference.  Programs that
do lots of graphic output (realtime 3D games, multimedia animations) can show
a large difference.  There are even circumstances under Win95, OS/2 and NT
where the difference can be huge.

Here are some results with my Matrox MGA Millennium:

                                      A2      B0     FASTVID
 Duke Nukem 3D (640x480, fps)         14      25       31
 Doom Benchmark (fps)                 38      70       74
 640x480 FLC animation (fps)          25      48      121
 Chris's 3D benchmark (SVGA)          21      38       77
 Win95 Media Player* (fps)             2       5       15**

* WEEZER.AVI from the Win95 CD-ROM enlarged to _nearly_ full screen at
1152x864, 32 bits-per-pixel.

** the frame rate was too fast to count, 15fps is an estimate -- the
animation played fairly smoothly.


Notes for SuperMicro P6SNF and P6DNF motherboard users, and other 82440
(Natoma chipset) based motherboards:

Several FASTVID users have reported that one BIOS setting on these
motherboards conflicts with FASTVID resulting in a system crash.  If you
experience these crashes try turning off "USWC write combining" (Uncacheable
Speculative Write Combining) using the BIOS setup procedure.

FASTVID's controls for write-posting don't seem to have any effect on
82440 motherboards.  Presumably this means that write-posting is
controlled by a different mechanism.  I suggest using "FASTVID x11" so
that FASTVID doesn't attempt to change the write-posting option if you
have one of these motherboards.


Pro-formance, c't 10/96, p. 124, Andreas Stiller, translated by Thomas Pabst
ctppro.exe   executable program

The Pentium Pro CPU contains internal registers (Memory Type
Range Registers, MTRRs), which have to be programmed to reach
full PCI performance, e.g. for the frame buffer of the video card.
When they are, the transfer performance of fast video cards
(Matrox Millennium, ET6000, etc.) goes up from about 20 MB/sec to
about 90 MB/sec, as long as the chipset write buffers are enabled
as well.

The MTRRs of the PPro are divided into the Fixed Range MTRRs, which
cover main memory between 0 and 1 MByte (where you can find the
VGA buffer at A0000 - BFFFF), and 8 Variable Range MTRRs, responsible
for the address range above 1 MB.

For the respective address ranges you can choose between the following
memory attributes:

UC  uncached
WC  Write Combining
WP  Write Protect
WT  Write Through
WB  Write Back

The first Variable Range MTRR (no. 0) is responsible for the normal main memory
and hence usually set to WB. The second Variable Range MTRR (no. 1) is usually
not used, so you can use it for the linear frame buffer (LFB) of the video
card, which has to be set to WC (write combining). To find the location
of the frame buffer, have a look into the Win95 Device Manager under
'Resources' of the video adapter.
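
As an illustration of what such a setting amounts to, the Python sketch below
computes the two register values for a variable range MTRR.  It follows the
register layout from Intel's processor documentation as I understand it (base
register at MSR 0x200 + 2n, mask register at 0x201 + 2n, 36 physical address
bits on the PPro); it is not taken from ctppro, and the function name
variable_mtrr is made up.  It only calculates values and does not write any
registers, which would need privileged code.

```python
# Memory type encodings used in the MTRRs.
MEM_TYPES = {"UC": 0, "WC": 1, "WT": 4, "WP": 5, "WB": 6}
PHYS_BITS = 36  # physical address width of the Pentium Pro


def variable_mtrr(n, base, size, mem_type):
    """Return {msr_number: value} for variable range MTRR 'n'."""
    if size < 4096 or size & (size - 1):
        raise ValueError("size must be a power of two of at least 4 KB")
    if base % size:
        raise ValueError("base must be aligned to size")
    addr_mask = (1 << PHYS_BITS) - 1
    # Base register: address bits 35..12 plus the memory type in bits 7..0.
    physbase = (base & addr_mask & ~0xFFF) | MEM_TYPES[mem_type]
    # Mask register: address mask in bits 35..12 plus the valid bit (bit 11).
    physmask = (addr_mask & ~(size - 1)) | (1 << 11)
    return {0x200 + 2 * n: physbase, 0x201 + 2 * n: physmask}


# MTRR 1, 4 MB frame buffer at FF000000, write combining.
for msr, value in variable_mtrr(1, 0xFF000000, 4 << 20, "WC").items():
    print(f"MSR {msr:#x} = {value:#011x}")
```

For a 4 MB frame buffer at FF000000 this gives 0x0FF000001 for the base
register and 0xFFFC00800 for the mask register.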

ctppro can display and change the contents of the MTRRs and is able
to change some special bits in the Intel Orion (450GX/KX) and Intel
Natoma (440FX) chipsets, to e.g. enable/disable the PCI write buffers
and more.

Command line parameters:

V              : sets VGA-memory A0000-BFFFF to WC
V:aa           : sets VGA-memory A0000-BFFFF to attribute 'aa'
n/xx,yy:aa     : sets MTRR 'n' for start address 'xx' and size 'yy' to 'aa'
xx,yy:aa       : sets MTRR 1 for start address 'xx' and size 'yy' to 'aa'
xx,yy          : sets MTRR 1 for start address 'xx' and size 'yy' to 'WC'

'xx' and 'yy' have to be either hex or decimal with additional K, M, G for 
kB, MB or GB.

M:             : enable 'Fast String Move'

S1             : set bits displayed under 1) to ON or '1'
S123           : set bits displayed under 1), 2), 3) to ON or '1'
R1             : set bits displayed under 1) to OFF or '0'
R123           : set bits displayed under 1), 2), 3) to OFF or '0'

P              : display frame buffer address (detected from PCI header)
FRAME,yy       : set MTRR1 to detected frame buffer address with size 'yy'
FRAME          : set MTRR1 to detected frame buffer address with detected size,
                 this can sometimes lead to crashes or system instabilities

You can combine several command line parameters, separated by spaces, for
use in batch files.

If you want to use this with Windows 95, you can add the command to your AUTOEXEC.BAT.

For further help please use the help function of ctppro.