
Re: Sv: Yet another take on ppfDirTree



Bjarke Hammersholt Roune wrote:
>>I'm now thinking of using plain normal hashing *instead* of trees. I didn't
>>consider it before, because I generated all directories by inserting the
>>entries one by one, without knowing beforehand how many of them have to be
>>in. Hashing has big problems with such a setting. But recently I thought a
>>bit more about it and found that the solution for that is actually quite
>>simple. (The "The Practice of Programming" book also helped a bit).
>>
>Why were you inserting entries one-by-one?

Because it was easier back then. Inserting one by one can be used in any
situation and the PakFile format doesn't need to have any additional info
for this to work.

>I take it you are going to use chained hashing, as the only viable
>alternative, open-addressing with double hashing is not very viable anyway...
>:)

Er, I think so. I don't know the term "chained hashing" as such, but it
sounds good ;)
I mean having an array (the hashtable) of pointers to linked lists holding
the entries.
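
Roughly like this (just a sketch with made-up names, not actual PenguinFile
code):

-------------------------
// one directory entry, living in the chain of its hashtable bucket
struct DirEntry
{
    char         *name;     // entry name
    unsigned int  hashval;  // full 32 bit hash value of the name
    DirEntry     *next;     // next entry in the same bucket (or 0)
    // ... the actual file / subdir data ...
};

// a directory: just an array of bucket heads
struct DirHash
{
    DirEntry   **table;     // tablesize pointers to chain heads
    unsigned int tablesize;
};
-------------------------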

>You are going to be using more memory. You have at least the hash-table,
>which is going to be using up 4 times its size in bytes for pointers. That's
>not that much memory, though.
>
>There also is a pointer for each entry (for the list). That's 4 bytes times
>the amount of entries used up for pointers (even if they are NULL). Still not
>that much memory, though.

That memory use is negligible.

>There also is the problem of having to allocate nodes for each entry. That's
>an amount of allocations equal to the amount of entries. That's slow, and it
>can potentially fragment memory.

Er, see below.

>Now, of course, I've assumed that we are actually going to be using a
>linked-list to make the hash-table chained. We of course don't have to do
>that. We might as well use an array. So, we can limit the memory usage to

I don't think arrays would be good. The dynamic sizing of lists effectively
makes things simpler.

>Now comes the question of how large a load factor we want the table to have.
>I think somewhere between 2 and 3 is good. What do you think?

You mean HashTableSize = NrOfEntries * LoadFactor ?
Certainly not less than 1, and more than 3 won't bring much additional benefit.
I guess we'll simply have to experiment with that.

>>* Dirs read from Format 0 PakFiles
>>  We know exactly how many entries those have, so we can initialize the
>>  hashtable to an optimal size. Note that the hash insert operation is so
>>  fast that we don't have to store the dir entries as a "precalculated"
>>  data structure anymore (just storing the hash value with each entry is
>>  sufficient). That makes the Pak format (and with this Pak
>>  reading/writing) simpler.
>>
>Even if insertion doesn't take *a lot* of time, I don't see why we wouldn't
>want to store it as we are going to be needing it. The code for PenguinFile

Because in about 95% of the cases (if our hash function etc. are good)
inserting simply means
if (hashtable [hashkey] == 0)
	hashtable [hashkey] = entry;

If there already is some entry at that position, some sorting has to be done
- which is also extremely cheap given the very short chain lengths that
prevail.
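
Spelled out a bit more (again only a sketch, building on the little structs
above; hashval is the raw 32 bit hash stored with each entry, and keeping the
chains sorted by name is just one possible choice):

-------------------------
#include <cstring>   // std::strcmp

void hash_insert (DirHash &dir, DirEntry *entry)
{
    unsigned int key = entry->hashval % dir.tablesize;

    if (dir.table [key] == 0)      // the ~95% case: bucket is empty
    {
        entry->next = 0;
        dir.table [key] = entry;
        return;
    }

    // collision: put the entry into the (very short) chain,
    // keeping the chain sorted by name
    DirEntry **link = &dir.table [key];
    while (*link != 0 && std::strcmp ((*link)->name, entry->name) < 0)
        link = &(*link)->next;
    entry->next = *link;
    *link = entry;
}
-------------------------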

Of course the worst case performance would be horrible, but with proper
tuning of the hash function and the load factor a worst case situation
would also be extremely unlikely! And no matter how fast Pak mounting is,
the file lookup code still has to cope with such a worst case if it ever
occurs.

I'm sure that the little speed gain given by storing all hash info in the
PakFile doesn't justify the added complexity.

Optimization is good, but only if done in the right places.

>is free. If somebody can't figure out how to write a pak file properly, they
>can just take the parts that do what they can't figure out from
>PenguinFile. This includes the hash-table layout and generation algorithm.

Right, but as I described above it actually doesn't matter whether we
optimize that case in Pak mounting or not - a worst case situation will in
any case slow down the lookup operations - and we can't do much about that.
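
For completeness, the lookup is just the mirror image of the insert sketch
above (same made-up names; in the worst case it degenerates into a plain list
scan):

-------------------------
#include <cstring>   // std::strcmp

// hashval: the raw 32 bit hash of name, computed by the caller
DirEntry *hash_lookup (const DirHash &dir, const char *name,
                       unsigned int hashval)
{
    // walk the chain of the bucket the name hashes to; in the worst case
    // (everything in one bucket) this is a plain list scan
    for (DirEntry *e = dir.table [hashval % dir.tablesize] ; e != 0 ; e = e->next)
    {
        int cmp = std::strcmp (e->name, name);
        if (cmp == 0)
            return e;    // found it
        if (cmp > 0)
            break;       // chains are sorted by name, so we can stop early
    }
    return 0;            // not in this directory
}
-------------------------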

>>* Dirs read from Format 1 PakFiles
>>  To keep things simple I wouldn't store dir sizes in the PakFile, so we
>>  don't know how large the dirs are. But we *do* know that Format 1 Paks
>>  are usually small, so using a (small) fixed hashtable size will neither
>>  slow things down significantly nor waste too much memory (a size of
>>  32 to 256 would be good IMHO)
>>
>Why wouldn't we store dir sizes in the pakfiles? How would this make things
>complicated? It's a data-field like any other.

Yep, but one that has to be updated later (remember that files can be added
to F1 Paks), which means that the dir structures we keep in RAM have to
keep track of where their data is in the PakFiles etc.
Again the cost (complexity) / speed (or memory) ratio isn't good enough here
(because F1 Paks and their dirs are small enough to make the speed gain
unnoticeable except for worst case situations).

>>* Other dirs
>>  That's for example the dirs in a format 0 PakFile while it is being
>>  constructed or a dir read via FTP or so. These dirs can be large and they
>>  have to be grown dynamically. But they are not speed critical, so we can
>>  start with a medium hashtable size (64 to 256) and "rehash" the thing
>>  later (when dir construction is completed)
>>
>I don't think I like that, but I guess it's the least bad solution. Except
>for the format 0 Pakfile dirs while they are being built thing. Why would we
>need to hash these? We aren't going to be doing any searching anyway... (?)

Wrong. As soon as we add something to them (during construction) we have to
look up the destination directory. Of course speed isn't important in that
case, so we could use some linked list or dynamic array as well, but that
would mean adding another algorithm just to make things slower ;)
And of course the default hashtable size for PakF0 construction can be
larger than the default one for e.g. ftp dirs.
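
The "rehash later" step would then be roughly this (sketch again, same
made-up names as above):

-------------------------
void rehash (DirHash &dir, unsigned int newsize)
{
    DirEntry   **oldtable = dir.table;
    unsigned int oldsize  = dir.tablesize;

    dir.table     = new DirEntry* [newsize];
    dir.tablesize = newsize;
    for (unsigned int i = 0 ; i < newsize ; i++)
        dir.table [i] = 0;

    // move every entry over into the properly sized table; the stored
    // hash values are reused, so no name has to be hashed again
    for (unsigned int i = 0 ; i < oldsize ; i++)
        for (DirEntry *e = oldtable [i] ; e != 0 ; )
        {
            DirEntry *next = e->next;
            hash_insert (dir, e);
            e = next;
        }

    delete [] oldtable;
}
-------------------------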

>>That means (almost) perfectly normal hashing can be applied. I more and
>>more think this is a good idea, since hashing is dead simple, much simpler
>>than the current balanced tree stuff, and thus it makes the code simpler
>>and easier to maintain and debug. And it's faster ;)
>>
>No maintenance should be needed for the tree as a result of templates. The
>tree is already debugged. If it's faster... I suspect that the hash-table
>solution you suggest is faster, and therefore superior. It's of course simpler
>too, agreed.

It's certainly faster in the average case. The only drawback I see now is
the abysmal performance in the worst case, but with well-tuned hashing
such a case should be so unlikely that we can neglect it.

>>(In the book mentioned above the authors describe a simple, yet good hash
>>function which essentially generates a hash key evenly distributed over a
>>32Bit range and then calcs that value modulo the hashtable size. This is
>>very simple and can cope with any hashtable size.)
>>
>And this hash function is for strings, right? What is it?

Right. Here's the code:

-------------------------
enum { MULTIPLIER = 31 };   /* NHASH (used below) is the hashtable size */

unsigned int hash (char *str)
{
  unsigned int h;
  unsigned char *p;

  h = 0;
  for (p = (unsigned char *) str ; *p != '\0' ; p++)
    h = MULTIPLIER * h + *p;
  return h % NHASH;
}
-------------------------

(NHASH is the hashtable size - in our case it would be given to the
function as a parameter)

According to the authors (that's Brian W. Kernighan and Rob Pike),
"Experiments show that for a wide variety of strings it's hard to construct
a hash function that does appreciably better than the one above, [...]"

We'd of course have to do some benchmarking & tuning with it, but the
basics should be fine.
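
For PenguinFile the table size would simply become a parameter (or we return
the raw 32 bit value and do the modulo at the call site - that raw value is
what we'd store with each entry anyway). Roughly:

-------------------------
unsigned int hash (const char *str, unsigned int tablesize)
{
    enum { MULTIPLIER = 31 };
    unsigned int h = 0;

    for (const unsigned char *p = (const unsigned char *) str ; *p != '\0' ; p++)
        h = MULTIPLIER * h + *p;
    return h % tablesize;   // drop the modulo to get the raw 32 bit value
}
-------------------------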

>>>On another note, I think it might be worthwhile to implement a
>>>memory-handler for ppfFile objects (or something like that. Those things
>>>you get a pointer to when opening a file), as many games are likely to be
>>>causing the allocation of a lot of these. For some games, the number will
>>>be thousands and thousands (especially if each level causes a reopening of
>>>everything).
>>
>>I agree. Perhaps this isn't suitable in all cases, but for example for
>>reading Format 0 Paks (where we know exactly how many entries we have and
>>where either all or no entries are freed) it would certainly be useful.
>>And (especially for this case) it should also handle the memory for the
>>entry name strings. Should be quite simple - we can even have some field
>>in the PakFiles saying *exactly* how much memory is needed.
>>
>We wouldn't know how much memory is needed, as there is no saying how many
>files will be opened, regardless of how many files there are. Even if a
>program mounts a PakFile with a million files in it, that program might
>still only have 1 file open at any time.

Ah, sorry, I misunderstood you. I thought you were talking about ppf_File
instead of ppfFILE ;)

But again I agree. Preallocating space for n ppfFILE structs should do the
job (1024 should be fine - the only kind of software I know that really needs
more is *very* busy web servers - and we can always grow the pool if needed).
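
I.e. something in this direction (only a sketch - what ppfFILE really
contains doesn't matter here, and growing the pool is left out):

-------------------------
struct ppfFILE { /* file position, flags, ... - contents don't matter here */ };

// hand out ppfFILE structs from one preallocated block instead of
// doing a new/delete per open/close
class FileHandlePool
{
public:
    FileHandlePool (unsigned int n = 1024)
        : block (new ppfFILE [n]), free_list (new ppfFILE* [n]), nfree (n)
    {
        for (unsigned int i = 0 ; i < n ; i++)
            free_list [i] = &block [i];
    }
    ~FileHandlePool () { delete [] free_list; delete [] block; }

    ppfFILE *get ()           { return nfree > 0 ? free_list [--nfree] : 0; }  // 0 => grow the pool
    void     put (ppfFILE *f) { free_list [nfree++] = f; }

private:
    ppfFILE     *block;      // the preallocated structs themselves
    ppfFILE    **free_list;  // stack of currently unused structs
    unsigned int nfree;
};
-------------------------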

>I'm working on an allocator that's very much faster than the standard new
>both on allocation and deallocation.

Good.

	Christian
--

Drive A: not responding...Formatting C: instead