
Re: Yet another take on ppfDirTree



Bjarke Hammersholt Roune wrote:
>I bet everybody thought I had said everything I had to say about how to most
>efficiently find filenames in a tree by now. Well, think again :)
>
>I've implemented yet another feature into the tree (or, well, it's actually
>a separate class that uses the tree class, but anyway) to speed up the
>lookup process.
>
>This time, the idea is hashing the top of the tree with a standard
>hash-table. It's like chained hashing, just with trees instead of linked
>lists (the specific ADT used is a template parameter, so you could use a
>linked list without changing the class if it had the right interface).
>
>The hashtable computes the offset into the array by taking the hash value
>of the entry-to-find (which we are computing anyway) and using the %
>operator to get it within the valid range of the array. Each position in
>the table is a separate tree, and when the key is mapped to one location, a
>search is made in that tree for that key exactly as if a hash-table had not
>been used.
>
>The benefit of this is that each tree is smaller by a factor of S,
>where S is the size of the hash-table, assuming a perfect-distribution hash
>function.
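
Just so we're talking about the same thing, here's a rough sketch of how
I read that scheme (all names are made up here, this is not the actual
ppf code):

struct Entry;   // whatever the tree nodes store

// A fixed-size array of trees; the hash value picks the bucket, and the
// tree in that bucket is then searched as usual. The bucket ADT is a
// template parameter, as you describe.
template <class Tree>
class HashedTreeTable
{
public:
    explicit HashedTreeTable (unsigned size)
        : m_buckets (new Tree[size]), m_size (size) {}
    ~HashedTreeTable () { delete[] m_buckets; }

    // hash is the value we compute for the name anyway
    Entry *Find (const char *name, unsigned hash) const
        { return m_buckets[hash % m_size].Find (name); }

    void Insert (Entry *e, unsigned hash)
        { m_buckets[hash % m_size].Insert (e); }

private:
    Tree     *m_buckets;
    unsigned  m_size;
};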

I'm now thinking of using plain normal hashing *instead* of trees. I didn't
consider it before, because I generated all directories by inserting the
entries one by one, without knowing beforehand how many entries they would
eventually hold. Hashing has big problems with such a setting. But recently
I thought a bit more about it and found that the solution is actually quite
simple. (The book "The Practice of Programming" also helped a bit.)

The idea is that we basically have four kinds of directories:

* Dirs read from Format 0 PakFiles
  We know exactly how many entries those have, so we can initialize the
  hashtable to an optimal size. Note that the hash insert operation is so
  fast that we don't have to store the dir entries as a "precalculated"
  data structure anymore (just storing the hash value with each entry is
  sufficient). That makes the Pak format (and with it Pak
  reading/writing) simpler.

* Dirs read from Format 1 PakFiles
  To keep things simple I wouldn't store dir sizes in the PakFile, so we
  don't know how large the dirs are. But we *do* know that Format 1 Paks
  are usually small, so using a (small) fixed hashtable size will neither
  slow things down significantly nor waste too much memory (a size of
  32 to 256 would be good IMHO)

* Fill directories
  That's the dirs in the VFS hierarchy that just exist to make some path
  valid, e.g. if you mount something to /some/stupid/path/mountpoint , 
  the fill dirs /some/ , /some/stupid/ and /some/stupid/path/ are
  automatically generated. Those usually only hold 1 (the most common case)
  to 3 entries, with larger sizes being a very rare exception. That means
  very small fixed hashtable sizes (4 or 8) are fine.

* Other dirs
  That's for example the dirs in a format 0 PakFile while it is being
  constructed or a dir read via FTP or so. These dirs can be large and they
  have to be grown dynamically. But they are not speed critical, so we can
  start with a medium hashtable size (64 to 256) and "rehash" the thing
  later, when dir construction is completed (a rough sketch of that
  follows right after this list).
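
Something like this minimal sketch is what I have in mind for that last
case (hypothetical names, not actual ppf code - chained hashing, with the
raw hash value kept in each entry):

#include <cstring>

struct DirEntry
{
    const char *name;
    unsigned    hash;   // raw 32-bit hash, stored with the entry
    DirEntry   *next;   // chain link within one table slot
};

class DirHashTable
{
public:
    explicit DirHashTable (unsigned size)
        : m_slots (new DirEntry *[size]), m_size (size)
    {
        for (unsigned i = 0; i < size; i++)
            m_slots[i] = 0;
    }
    ~DirHashTable () { delete[] m_slots; }  // entries are owned elsewhere

    void Insert (DirEntry *e)
    {
        DirEntry **slot = &m_slots[e->hash % m_size];
        e->next = *slot;
        *slot = e;
    }

    DirEntry *Find (const char *name, unsigned hash) const
    {
        for (DirEntry *e = m_slots[hash % m_size]; e; e = e->next)
            if (e->hash == hash && std::strcmp (e->name, name) == 0)
                return e;
        return 0;
    }

    // Called when dir construction is complete and the entry count is
    // known: move everything into a table of (near) optimal size.
    void Rehash (unsigned newsize)
    {
        DirEntry **old     = m_slots;
        unsigned   oldsize = m_size;
        m_slots = new DirEntry *[newsize];
        m_size  = newsize;
        for (unsigned i = 0; i < newsize; i++)
            m_slots[i] = 0;
        for (unsigned i = 0; i < oldsize; i++)
            for (DirEntry *e = old[i], *next; e; e = next)
            {
                next = e->next;
                Insert (e);   // stored hash -> no string hashed twice
            }
        delete[] old;
    }

private:
    DirEntry **m_slots;
    unsigned   m_size;
};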

That means (almost) perfectly normal hashing can be applied. I'm more and
more convinced this is a good idea: hashing is dead simple, much simpler
than the current balanced tree stuff, and thus it makes the code simpler
and easier to maintain and debug. And it's faster ;)

(In the book mentioned above the authors describe a simple, yet good hash
function which essentially generates a hash key evenly distributed over a
32-bit range and then takes that value modulo the hashtable size. This is
very simple and can cope with any hashtable size.)
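
For reference, the function from the book is essentially this (I've only
moved the final modulo out to the caller, so that the raw 32-bit value
can be stored with each entry for later rehashing):

/* string hash from "The Practice of Programming" (Kernighan/Pike) */
enum { MULTIPLIER = 31 };

unsigned int Hash (const char *str)
{
    unsigned int h = 0;
    for (const unsigned char *p = (const unsigned char *) str; *p; p++)
        h = MULTIPLIER * h + *p;
    return h;   /* the table index is then h % tablesize */
}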


>On another note, I think it might be worthwhile to implement a
>memory-handler for ppfFile objects (or something like that. Those things you
>get a pointer to when opening a file), as many games are likely to be
>causing the allocation of a lot of these. For some games, the number will be
>thousands and thousands (especially if each level causes a reopening of
>everything).

I agree. Perhaps this isn't suitable in all cases, but for example for
reading Format 0 Paks (where we know exactly how many entries we have and
where either all entries or none are freed) it would certainly be useful.
And (especially for this case) it should also handle the memory for the
entry name strings. Should be quite simple - we can even have a field
in the PakFile saying *exactly* how much memory is needed.
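
A minimal sketch of what I mean (hypothetical names again, and alignment
handling is omitted for brevity): one block, sized by that PakFile field,
out of which both the entries and their name strings are carved, and
which is freed in one piece:

#include <cstdlib>
#include <cstring>

class EntryPool
{
public:
    // nbytes comes straight from the PakFile field that says exactly
    // how much memory this dir needs (entries + name strings)
    explicit EntryPool (std::size_t nbytes)
        : m_base ((char *) std::malloc (nbytes)), m_next (m_base) {}
    ~EntryPool () { std::free (m_base); }  // frees everything at once

    void *Alloc (std::size_t n)
    {
        char *p = m_next;
        m_next += n;    // bump allocation, no per-entry bookkeeping
        return p;
    }

    char *StrDup (const char *s)   // for the entry name strings
    {
        std::size_t len = std::strlen (s) + 1;
        return (char *) std::memcpy (Alloc (len), s, len);
    }

private:
    char *m_base, *m_next;
};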


	Christian
--

Drive A: not responding...Formatting C: instead