[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

SV: SV: SV: Progress Update

>> Either you haven't understod what I've said, or I'm misunderstanding you.
>Obviously, there is some misunderstanding between each other. I could be
>wrong, but just in the case I'm right...
I now know where the problem is. The reason is that we have some quite
specific needs of the HashTable, which makes my use of it quite economic,
even if it wouldn't be in so many other situations.

>> A HashTable is a completely standard hashtable. There's two arrays, one
>> the hash table itself, and one for the entries that the positions in the
>> hash table points to. There are no linked lists in HashTable.
>I use only a single array in my "regular hashtables". What is the other
>containing exactly? Is that for going the "other way" (get the key for
>an entry?)??? I use linked lists to implements my buckets because I'm
>lazy. Since I like my buckets to be rather shallow (4 or 8 at most), I
>don't worry too much about the performance of going thru a bucket. They
>could be dynamically allocated arrays, since they are going to be fixed
>size anyway.
Actually, the HashTable has to either NEVER change, or only be intialised
with un-hash-table'd data ONCE. This makes it important that all data can be
added at once.

This isn't strictly true, but the situation where the hash-table needs to
grow one entry at a time is a situation where it doesn't need to get
accessed for the duration of the adds, and these add operations are taking
place when a pak file is getting compiled. The compilation time of a pak is
completely unimportant, as its a one-time thing that happens at the
developer side, not the game-side (it shouldn't be very processor intensive
anyway). This is why the following implementation is effecient:

You yourself mentioned dynamically allocated arrays for implementing chained
hashing. That's exactly what I'm doing. Imagine this setup, but all the
arrays are filled to maximum capacity (ie, no empty gaps). Now try to
imagine these arrays being put, in memory, rigth after one another, so they
form one consequtive block.

Now, instead of allocating each and every array by itself, its possible to
simply allocate the large block once, and split it up into smaller segments,
one segment for each position in the hash table array.

I think you can appreciate how terribly ineffecient adding and removing
entries with such an implementation is: you have update the hash-table for
all positions that come after the position that the entry you are adding or
removing hashes to, and you have to completely reallocate the entry array
(ie, where the hash table points point to). This is pretty heavy for just
one entry manipulation... Hence, the DynHashTable.

>Is that second array you're talking about actually a *number* of
>secondary level arrays, composing the buckets themselves (my
>"dynamically allocated arrays" improvement on my hashtables)? Or is
>there a wholly separate array?
If you meant what I just said, then yes. If you mean like an array of arrays
(ie, like char**), then no.

>I smell the misunderstanding partly because of things like these two
>arrays you have and the single one I have. We are clearly having two
>different approaches. Maybe I can improve *my* coding, which I like to
>do, and maybe I can help you improve *your* coding, which I also like to
>do. I win either way. :-)

My implementation only works well for hash tables that are never, or almost
never changed.

>As for the linked lists, I was talking about your DynHashTable.
Well, they are better than a dynamic array, as they don't need any
reallocation (again, my utilization of an allocator makes the picture a bit
more complicated, but the linked list come out the winner), and I only need
to access them once, ever, as long as removes are done before adds, and no
added entries not to be removed right after they have been added. If this is
not the case, the linked list will also have to be searched for the entry to
be removed.

>Nice idea. I use the order of the hash table as the maximum size of
>buckets, so that as the size of the hash table increases (and thus
>becomes more expensive to grow), it takes larger buckets to trigger the
>growing of the table.
>For the very worst case, you'd have a 2^32 entry hash table, each with
>32 entries in their bucket, which makes for a huge amount of total
>entries. Even the worst case of having to look thru all of the 32
>entries in a bucket is completely insignificant compared to the worst
>case of a simple array or a linked list, thus I feel I'm winning enough.
>I think that my straight method isn't O(n) as you mentioned, but rather
>O(n log n), which is nice enough for me.
Hmm... I'm not sure I understand you, but it would seem to me that if all
entries in the hash table hashes to the same location in the table, then
they would all have to be stored in that location, making you search a
serial one, which is clearly both very ineffecient and O(n).

Are you doing some kind of overflow checking that makes entries "spill over"
to the next position if the position gets full. That would be ineffecient
for my porpuse.

>Do you add entries that often? Adding an entry to a normal hash table,
>for the case where it doesn't need to grow the table (that the regular
>case), is quite fast.
I *always* have to grow, not the table, but entry array for each add or
remove, aswell as update the hash table.

Changing the hash table requires a conversion to a DynHashTable and a
conversion back to a HashTable. Notice that this is 99,999999999% close to
being exactly as effecient as simply having a ResizeTable() method. Both
conversions are implemented very effeciently.

>Once in a while, it has to grow the table, that's
>all. The fact that adding an element isn't done in bounded time
>sometimes is a big disadvantage, that's why the hash tables in the C++
>STL are actually *not* hash tables, but red-black trees!
Well, I don't do alot of changing to the table...

>The improvement you're getting with your DynHashTable is only visible
>for a large number of successive add entry operations, *or* when you
>need adding to be done in bounded time.
Which accounts for all cases I'll be doing it.

Except for the entry lists maintained over sub-directories in directories.
These are quite small, though, and the extra effort is minimal compared to
the importance of fast directory look-up.

>The reasoning is simply that it wasn't worth the effort to switch from
>one type of hash table to another, then back as we're finished adding
>stuff to it. We have one Hashtable class, that's it.
Well, going HashTable->DynHashTable is free, going DynHashTable->HashTable
isn't more expensive than adding one entry would normally be.

>I just felt that discussing THE most optimized algorithm for such a
>relatively unimportant part of the system is the kind of thing that
>makes projects of the scope of PenguinPlay bog down. Heck, regular Unix
>ufs uses a darn ARRAY for directories entries!
Actually, it wasn't as much a discussion as me telling Christian how I had
done it, letting everyone who cared listen in. The reason is that the
PenguinPlay and PenguinFile lists are one and the same, so if you subscribe
to the PenguinPlay list, you also get PenguinFile mail. This is limited to
the amount of mail that two persons can create on the subject, though (me
and Christian), so it shouldn't be a problem.

The one and almost only point (so far) of PenguinFile is to provide for fast
and effecient file handling (convinience is the other goal). The opening of
files is *the* area in which really substantial gains can be made, while
still keeping things quite simple. Thus, this is quite an important thing.

We also plan things like dynamic on-demand disk-caching of slower medias
like CD-ROMs or DVDs and transparent memory mapping of files. This is still
a bit far off right now though.

>Unix got where it is now with the "90% of the work with 10% of the
>effort" rule, guys...
Well, yes, but it wouldn't work if they simply said "Well, there's much less
effort in doing nothing at all, so lets leave it at that"... ;)

>Otherwise, you get Lisp and Lisp Machines, and you
>know where they are now (in a garbage dump and museums). Nobody cares
>that they were the best machines and environments ever conceived. Don't
>get pedantic with that last 10% of work, Unix is *FINE* without it, see?
I think people *would* have cared if they could've had the best possible
solution, while still having the complete free choice of OS and everything
else... Hmm? :)

>Where does PenguinPlay want to be? On all the Linux machines? Or in the
>garbage dump?
It wants to fulfill its porpuse, which in PenguinFile's case is solely
convinience and effeciency.