[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Sv: Yet another take on ppfDirTree

>I'm now thinking of using plain normal hashing *instead* of trees. I didn't
>consider it before, because I generated all directories by inserting the
>entries one by one, without knowing beforehand how many of them have to be
>in. Hashing has big problems wich such a setting. But recently I thought a
>bit more about it and found that the solution for that is actually quite
>simple. (The "The Practice of Programming" book also helped a bit).
Why were you inserting entries one-by-one?

I take it you are going to use chained hashing, as the only viable
alternative, open-adressing with double hashing is not very viable anyway...

You are going to be using more memory. You have atleast the hash-table,
which is going to be using up 4 times its size in bytes for pointers. That's
not that much memory, though.

There also is a pointer for each entry (for the list). That's 4 times the
amount of entries used up for pointers (even if they are NULL). Still not
that much memory, though.

There also is the problem of having to allocate nodes for each entry. That's
an amount of allocations equal to the amount of entries. That's slow, and it
can potentially fragment memory.

Now, of course, I've assumed that we are actually going to be using a
linked-list to make the hash-table chained. We of course don't have to do
that. We might aswell use an array. So, we can limited the memory usage to
the memory the data we are storing actually fills, plus a pointer and a size
for each position in the hash-table, so it's 8 times the size of the
hash-table we use as overhead memory.

Now comes the question of how large a load factor we want the table to have.
I think somewhere betwen 2 and 3 is good. What do you think?

>* Dirs read from Format 0 PakFiles
>  We know exactly how many entries those have, so we can initialize the
>  hashtable to an optimal size. Note that the hash insert operation is so
>  fast that we don't have to store the dir entries as a "precalculated"
>  data structure anymore (just storing the hash value with each entry is
>  sufficient). That makes the Pak format (and with this Pak
>  reading/writing) simpler.
Even if insertion doesn't take *alot* of time, I don't see why we wouldn't
want to store it as we are going to be needing it. The code for PenguinFile
is free. If somebody can't figure out how to write a pak file properly, they
can just take the parts that does what they can't figure out from
PenguinFile. This includes the hash-table layout and generation algorithm.

>* Dirs read from Format 1 PakFiles
>  To keep things sinple I wouldn't store dir sizes in the PakFile, so we
>  don't know how large the dirs are. But we *do* know that Format 1 Paks
>  are usually small, so using a (small) fixed hashtable size will neither
>  slow things down significantly nor waste too much memory (a size of
>  32 to 256 would be good IMHO)
Why wouldn't we store dir sizes in the pakfiles? How would this make things
complicated? It's a data-field like any other.

>* Fill directories
>  That's the dirs in the VFS hierarchy that just exist to make some path
>  valid, e.g. if you mount something to /some/stupid/path/mountpoint ,
>  the fill dirs /some/ , /some/stupid/ and /some/stupid/path/ are
>  automatically generated. Those usually only hold 1 (the most common case)
>  to 3 entries, with larger sizes being a very rare exception. That means
>  very small fixed hastable sizes (4 or 8) are fine.
Sounds fine.

>* Other dirs
>  That's for example the dirs in a format 0 PakFile while it is being
>  constructed or a dir read via FTP or so. These dirs can be large and they
>  have to be grown dynamically. But they are not speed critical, so we can
>  start with a medium hashtable size (64 to 256) and "rehash" the thing
>  later (when dir construction is completed)
I don't think I like that, but I guess its the least bad solution. Except
for the format 0 Pakfile dirs while they are being built thing. Why would we
need to hash these? We aren't going to be doing any searching anyway... (?)

>That means (almost) perfectly normal hashing can be applied. I more and
>more think this is a good idea, since hashing is dead simple, much simpler
>than the current balanced tree stuff, and thus it makes the code simpler
>and easier to maintain and debug. And it's faster ;)
No maintenance should be needed for the tree as a result of templates. The
tree is already debugged. If its faster... I suspect that the hash-table
solution you suggest is faster, and therefore superior. Its of course impler
too, agreed.

>(In the book mentioned above the authors describe a simple, yet good hash
>function which essentially generates a hash key evenly distributed over a
>32Bit range and then calcs that value modulo the hashtable size. This is
>very simple and can cope with any hashtable size.)
And this hash function is for strings, right? What is it?

>>On another note, I think it might be worthwhile to implement a
>>memory-handler for ppfFile objects (or something like that. Those things
>>get a pointer to when opening a file), as many games are likely to be
>>causing the allocation of alot of these. For some games, the number will
>>thousands and thousands (especially if each level causes a reopening of
>I agree. Perhaps this isn't suitable in all cases, but for example for
>reading Format 0 Paks (where we know exactly how many entries we have and
>where either all or none entries are freed) it would certainly be useful.
>And (especially for this case) it should also handle the memory for the
>entry name strings. Should be quite simple - we can even have some field
>int the PakFiles saying *exactly* how much memory is needed.
We wouldn't know how much memory is needed, as there is no saying how many
files will be opened, regardless of how many files there is. Even if a
program mounts a PakFile with a million files in it, that program might
still only have 1 file open at any time.

I'm working on an allocator that's very much faster than the standard new
both on allocation and deallocation.