
Sv: Sv: Yet another take on ppfDirTree



>>I take it you are going to use chained hashing, as the only viable
>>alternative, open-addressing with double hashing, is not very viable
>>anyway... :)
>
>Er, I think yes. I don't know the term "chained hashing" as such, but it
>sounds good ;)
>I mean having an array (the hashtable) with pointers to linked lists
>holding the entries.
>
That's chained hashing. You have a chain (read: linked list) of entries
attached to each array position.
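
Just to make sure we're talking about the same thing, here's a minimal
sketch of that layout (made-up names, not the actual PenguinFile classes):

-------------------------
// Chained hash table: an array of chain head pointers, one per position.
struct Entry
{
    char  *name;   // entry name, used as the hash key
    Entry *next;   // next entry in this position's chain
    // ... other entry data ...
};

class HashTable
{
public:
    explicit HashTable (unsigned int size)
        : m_size (size), m_table (new Entry* [size])
    {
        for (unsigned int i = 0 ; i < size ; i++)
            m_table [i] = 0;
    }

    ~HashTable () { delete [] m_table; }

    // push the new entry onto the head of the chain at its position
    void insert (Entry *e, unsigned int hashkey)
    {
        e->next = m_table [hashkey];
        m_table [hashkey] = e;
    }

private:
    unsigned int   m_size;
    Entry        **m_table;  // one chain head pointer per position
};
-------------------------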

>>There also is the problem of having to allocate nodes for each entry.
>>That's an amount of allocations equal to the amount of entries. That's
>>slow, and it can potentially fragment memory.
>>
>>Now, of course, I've assumed that we are actually going to be using a
>>linked-list to make the hash-table chained. We of course don't have to do
>>that. We might as well use an array. So, we can limit the memory usage to
>
>I don't think arrays would be good. The dynamic sizing of lists makes
>things effectively simpler.
>
You can resize arrays too... I do see how it wouldn't be a good idea,
though, if you don't store the hash-table as a hash-table, but just as a big
bunch of entries.

Write a class for these arrays if you're afraid of complexity. Either way,
you can't honestly mean that using an array is so mind-bogglingly difficult
and complex that it's unviable... It's something that everybody has done
hundreds of times. Well, ok, at least I have. I don't see how a programmer
could get around using and resizing arrays in the course of a career, no
matter how short-lived.

>>Now comes the question of how large a load factor we want the table to
>>have. I think somewhere between 2 and 3 is good. What do you think?
>
>You mean HashTableSize = NrOfEntries * LoadFactor ?
>Certainly not less than 1, and more than 3 won't bring much more benefits.
>I guess we'll simply have to experiment with that.
>
That depends on what you mean by "HashTableSize". Here, HashTableSize means
"number of positions in the table", and the formula is:

LoadFactor = NrOfEntries / HashTableSize

So, with a load factor of 2, we have twice as many entries as we have
positions, so on average each position holds 2 entries.
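
In other words (a trivial sketch, just to pin the formula down):

-------------------------
// Illustrative only: derive the table size from the entry count and the
// desired load factor (LoadFactor = NrOfEntries / HashTableSize).
unsigned int tableSize (unsigned int nrOfEntries, double loadFactor)
{
    unsigned int size = (unsigned int) (nrOfEntries / loadFactor);
    return (size > 0) ? size : 1;
}

// e.g. 1000 entries with a load factor of 2 gives 500 positions,
// i.e. 2 entries per position on average.
-------------------------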

>>Even if insertion doesn't take *a lot* of time, I don't see why we wouldn't
>>want to store it as we are going to be needing it. The code for
>>PenguinFile
>
>Because in about 95% of the cases (if our hash function etc are good)
>inserting simply means
>if (hashtable [hashkey] == 0)
>hashtable [hashkey] = entry;
>
The hash-function you propose doesn't take forever to run, but if you have
5,000-10,000 entries, then I think you will find that it will take too long
to hash each and every entry. Ok, pak files with 5,000-10,000 entries will
perhaps not be the most common, but they certainly will be there. It's not
the least unrealistic to assume that all the files in a game combined could
reach this size, especially with the features PFile offers to make file
opening and storage more efficient.

>If there already is some entry at that position some sorting has to be done
>- which is also extremely cheap given the very short chain lengths that
>prevail.
>
Sorting? I see how that would be useful with arrays using a binary search
algorithm. But linked lists? Why not just push the new entry onto the head
of the list?

>Of course the worst case performance would be horrible, but with proper
>tuning of the hash function and the load factor a worst case situation
>would also be extremely unlikely! And no matter how fast Pak mounting is,
>in such a case the file lookup code has to cope with that worst case.
>
With an array-based solution, you could use a binary search algorithm. The
performance of this is comparable to that of a binary search tree. The only
requirement is that the arrays must be sorted in descending or ascending
order.
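
Roughly like this (only a sketch, assuming a dir keeps its entries in one
array sorted by name - not the actual PenguinFile code):

-------------------------
#include <cstring>

struct Entry { char *name; /* ... */ };

// Binary search in an array of entries sorted by name in ascending order.
// Returns the index of the match or -1 if not found. O(log n) worst case.
int find (const Entry *entries, int count, const char *name)
{
    int low = 0;
    int high = count - 1;
    while (low <= high)
    {
        int mid = (low + high) / 2;
        int cmp = std::strcmp (name, entries [mid].name);
        if (cmp == 0)
            return mid;
        if (cmp < 0)
            high = mid - 1;
        else
            low = mid + 1;
    }
    return -1;
}
-------------------------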

>I'm sure that the little speed gain given by storing all hash info in the
>PakFile doesn't rectify the added complexity.
>
I don't think the speed gain is that insubstantial. Besides, I really do not
see how it makes things so much more complex. It takes two lines of code to
do:

(in the function that writes each entry to disk)
WriteToDisk(myEntry.hash);

(in the function that reads each entry from disk)
ReadFromDisk(myEntry.hash);

Have I overlooked something?

>>is free. If somebody can't figure out how to write a pak file properly,
>>they can just take the parts that do what they can't figure out from
>>PenguinFile. This includes the hash-table layout and generation algorithm.
>
>Right, but as I described above it actually doesn't matter whether we
>optimize that case in Pak mounting or not - a worst case situation will in
>any case slow down the lookup operations - and we can't do much about that.
>
Using arrays and binary search, we still have O(log n) even in the
worst-case.

>>Why wouldn't we store dir sizes in the pakfiles? How would this make
>>things complicated? It's a data-field like any other.
>
>Yep, but one that has to be updated later (remember that files can be added
>to F1 Paks), which means that the dir structures we keep in RAM have to
>keep track of where their data is in the PakFiles etc.
>Again the cost (complexity) / speed (or memory) ratio isn't good enough
>here (because F1 Paks and their dirs are small enough to make the speed
>gain unnoticeable except for worst case situations).
>
I forgot we were talking about format 1. You are of course correct.

>>>* Other dirs
>>>  That's for example the dirs in a format 0 PakFile while it is being
>>>
>>need to hash these? We aren't going to be doing any searching anyway... (?)
>
>Wrong. As soon as we add something to them (during construction) we have to
>look up the destination directory. Of course speed isn't important in that
>
I think I agree.

>>>(In the book mentioned above the authors describe a simple, yet good hash
>>>function which essentially generates a hash key evenly distributed over a
>>>32Bit range and then calcs that value modulo the hashtable size. This is
>>>very simple and can cope with any hashtable size.)
>>>
>>And this hash function is for strings, right? What is it?
>
>Right. Here's the code:
>
>-------------------------
>enum { MULTIPLIER = 31 };
>
>unsigned int hash (char *str)
>{
>  unsigned int h;
>  unsigned char *p;
>
>  h = 0;
>  for (p = (unsigned char *) str ; *p != '\0' ; p++)
>    h = MULTIPLIER * h + *p;
>  return h % NHASH;
>}
>-------------------------
>
>(NHASH is the hashtable size - in our case that would be given to the
>function as a parameter)
>
Actually, I considered using a hash-function quite similar to this. I
decided not to because it does a lot of multiplication, which is slow. Now
that we've moved to a solely hash-table based solution, the extra time used
is worth it. However, using such a hash function, I do think we'll want to
store hash-info.
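
What I have in mind is roughly this (hypothetical, building on the function
you quoted and on the WriteToDisk/ReadFromDisk pseudo-code above):

-------------------------
// Compute the full 32-bit value (the quoted function without the final
// modulo) when the Pak is written, store it with the entry, and let
// mounting do only the cheap modulo.
enum { MULTIPLIER = 31 };

unsigned int hash32 (const char *str)
{
    unsigned int h = 0;
    for (const unsigned char *p = (const unsigned char *) str ; *p != '\0' ; p++)
        h = MULTIPLIER * h + *p;
    return h;                                 // no "% NHASH" here
}

// when writing the Pak:  WriteToDisk (hash32 (myEntry.name));
// when mounting:         hashkey = myEntry.hash % HashTableSize;
-------------------------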

>We'd of course have to do some benchmarking & tuning with it, but the
>basics should be fine.
>
Outside of assembler optimizations (which could process 4 to 8 bytes of the
string per iteration, depending on the architecture), I don't see how you
could make it faster.

>>>>On another note, I think it might be worthwhile to implement a
>>>>memory-handler for ppfFile objects (or something like that. Those things
>>>>
>>>I agree. Perhaps this isn't suitable in all cases, but for example for
>>>reading Format 0 Paks (where we know exactly how many entries we have and
>>>where either all or none entries are freed) it would certainly be useful.
>>>And (especially for this case) it should also handle the memory for the
>>>entry name strings. Should be quite simple - we can even have some field
>>>in the PakFiles saying *exactly* how much memory is needed.
>>>
>>We wouldn't know how much memory is needed, as there is no saying how many
>>files will be opened, regardless of how many files there is. Even if a
>>program mounts a PakFile with a million files in it, that program might
>>still only have 1 file open at any time.
>>
>Ah, sorry. misunderstood you. I thought you were talking about ppf_File
>instead of ppfFILE ;)
>
>But again I agree. Preallocating space for n ppfFILE structs (1024 should
>be fine. The only kind of software I know that really needs more are *very*
>busy web servers. And we always can grow it if needed.)
>
I remember you saying that it would be perfectly ok behavior for a program
to open all of the files it needs at start-up or level-load, and just keep
them open. That could grow beyond 1024 in some games. But as you say, we
can always grow.

>>I'm working on an allocator that's very much faster than the standard new
>>both on allocation and deallocation.
>>
>Good.
>
My allocator has constant time performance for both allocation and
deallocation. Allocation might cause the allocator to request more memory,
but that is also constant-time.

There are three drawbacks, however. Each allocator can only allocate memory
in blocks of one size. This size is a constructor parameter and cannot be
changed at a later time, except when the allocator is told to deallocate all
of its memory. This means that my allocator will not be able to provide for
strings, which are by definition variable-sized.

The other drawback is that it's very bad at returning memory to the heap.
That is, doing so is a very expensive operation for it, O(n^2). I've
designed it so that it's possible to do this incrementally, so that it can
be an idle-time job.

The third drawback is that the allocator cannot allocate stuff that is less
than 4 bytes.
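
For reference, the core of it looks roughly like this (a simplified sketch,
not the actual code - the chunk bookkeeping and the incremental
return-to-heap part are left out):

-------------------------
#include <cstddef>

// Fixed-size block allocator: one free list of blocks, O(1) alloc and
// dealloc. Growing chains a new chunk of blocks onto the free list.
class BlockAllocator
{
public:
    explicit BlockAllocator (std::size_t blockSize)
        : m_blockSize (blockSize), m_free (0)
    {
        if (m_blockSize < sizeof (Block))  // a block must at least hold a pointer
            m_blockSize = sizeof (Block);
    }

    void *alloc ()
    {
        if (m_free == 0)
            grow ();
        Block *b = m_free;                 // pop the first free block
        m_free = m_free->next;
        return b;
    }

    void dealloc (void *p)                 // push the block back - O(1)
    {
        Block *b = (Block *) p;
        b->next = m_free;
        m_free = b;
    }

private:
    struct Block { Block *next; };         // this is why blocks need >= 4 bytes

    void grow ()
    {
        const std::size_t n = 256;         // blocks per chunk - tune as needed
        char *chunk = new char [n * m_blockSize];
        for (std::size_t i = 0 ; i < n ; i++)
        {
            Block *b = (Block *) (chunk + i * m_blockSize);
            b->next = m_free;
            m_free = b;
        }
    }

    std::size_t  m_blockSize;
    Block       *m_free;
};
-------------------------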

That string thing... I'll write an allocator that doesn't support
deallocation, except for "everything at once", and that requires its exact
size to be specified at construction. That could be used for strings, and
would be very fast.
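
Something along these lines (again just a sketch):

-------------------------
#include <cstddef>

// String allocator: one block of exactly the size given at construction,
// allocation just bumps a pointer, and the only "free" is releasing
// everything at once.
class StringArena
{
public:
    explicit StringArena (std::size_t totalSize)
        : m_begin (new char [totalSize]),
          m_next (m_begin),
          m_end (m_begin + totalSize)
    {
    }

    ~StringArena () { delete [] m_begin; }

    char *alloc (std::size_t bytes)
    {
        if (m_next + bytes > m_end)
            return 0;                      // arena exhausted
        char *p = m_next;
        m_next += bytes;
        return p;
    }

    void freeAll ()                        // "everything at once"
    {
        m_next = m_begin;
    }

private:
    char *m_begin;
    char *m_next;
    char *m_end;
};
-------------------------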