
Re: SV: SV: Progress Update



Bjarke Hammersholt Roune wrote:

> > Why the heck isn't this a normal hash table? No need to sort, no need
> > for linked lists. For the (once in a while) table resizing, you need a
> > costly realloc and only binary logical operations (masking the
> > significant parts of the key (they are already hashed, since they're in
> > the hash table), giving you the straight index).
> 
> Either you haven't understood what I've said, or I'm misunderstanding you.

Obviously, there is some misunderstanding between us. I could be
wrong, but just in case I'm right...

> A HashTable is a completely standard hashtable. There are two arrays, one
> for the hash table itself, and one for the entries that the positions in
> the hash table point to. There are no linked lists in HashTable.

I use only a single array in my "regular hashtables". What does the
other one contain, exactly? Is it for going the "other way" (getting
the key for an entry)? I use linked lists to implement my buckets
because I'm lazy. Since I like my buckets to be rather shallow (4 or
8 entries at most), I don't worry too much about the performance of
going through a bucket. They could be dynamically allocated arrays,
since they are going to be fixed size anyway.
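
To make sure we're talking about the same thing, here is roughly the
shape of what I mean (a minimal sketch with made-up names, not my
actual code):

struct Entry {
    unsigned key;      // the already-hashed key
    void    *value;
    Entry   *next;     // next entry in this bucket's chain
};

class ChainedHashTable {
    Entry  **table;    // the single array: one chain head per slot
    unsigned size;     // always a power of two, so masking works

public:
    void *Lookup(unsigned key) const {
        // mask the significant bits of the key to get the slot
        for (Entry *e = table[key & (size - 1)]; e != 0; e = e->next)
            if (e->key == key)
                return e->value;
        return 0;      // not found
    }
};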

Is that second array you're talking about actually a *number* of
secondary level arrays, composing the buckets themselves (my
"dynamically allocated arrays" improvement on my hashtables)? Or is
there a wholly separate array?

I suspect the misunderstanding partly because of things like this:
you have two arrays where I have a single one. We clearly have two
different approaches. Maybe I can improve *my* coding, which I like
to do, and maybe I can help you improve *your* coding, which I also
like to do. I win either way. :-)

As for the linked lists, I was talking about your DynHashTable.

> The only mild deviation from the normal hash table using chained hashing
> is that each segment of the entry array (i.e., each segment of entries
> that hash to the same position in the hash table) is sorted in ascending
> order. This makes it possible to do a binary search.
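
If I follow you, a lookup then bisects the segment for its slot,
something like this (a sketch under my assumptions about your
layout; the names are mine, not yours):

// Entries hashing to the same slot sit in one contiguous, ascending
// segment of the entry array, so a lookup can binary-search it.
struct Segment { unsigned first, count; };  // per-slot segment bounds

int FindInSegment(const unsigned *entries, Segment s, unsigned key)
{
    unsigned lo = s.first, hi = s.first + s.count;
    while (lo < hi) {                      // classic binary search
        unsigned mid = lo + (hi - lo) / 2;
        if (entries[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < s.first + s.count && entries[lo] == key)
        return (int)lo;                    // found at index lo
    return -1;                             // not in this segment
}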

Nice idea. I use the order of the hash table (the log2 of its size)
as the maximum size of buckets, so that as the hash table grows (and
thus becomes more expensive to grow further), it takes larger buckets
to trigger the growing of the table.
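
Concretely, the trigger looks something like this (a sketch; I'm
assuming "order" means log2 of the table size, which matches the
2^32 table / 32-entry bucket example below):

// Grow the table (rather than the bucket) once a bucket would get
// deeper than log2(tableSize).
bool NeedsGrow(unsigned tableSize, unsigned bucketDepth)
{
    unsigned maxDepth = 0;
    for (unsigned s = tableSize; s > 1; s >>= 1)
        ++maxDepth;                  // maxDepth = log2(tableSize)
    return bucketDepth >= maxDepth;
}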

For the very worst case, you'd have a 2^32 entry hash table, each
slot with 32 entries in its bucket, which makes for a huge number of
total entries (2^37, about 137 billion). Even the worst case of
having to look through all 32 entries in a bucket is completely
insignificant compared to the worst case of a simple array or a
linked list, so I feel I'm winning enough.

I think a lookup with my straight method isn't O(n) as you mentioned,
but rather O(log n) in the worst case (the bucket scan is bounded by
the order of the table), which is nice enough for me.

> DynHashTable is only used when altering the hash table, and it does this
> in the most efficient way possible: with only ONE reallocation, no matter
> the amount of entries that get added (it does allocate for a linked list,
> but that's pretty efficient due to a special allocator I'm using). If I
> moved the alteration functionality into HashTable and gave it an
> AddEntry() method, then that would be extremely inefficient. I could do
> an AddEntries(). That would still only require one reallocation, but it
> would also require the client code to have the entries to be added in one
> array, which might not always be possible. The solution with DynHashTable
> is better because it allows me to do what can be done in terms of
> optimizing this process, while making client code both smaller and
> simpler.
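
If I read you right, that's the "batch the adds, commit once" idea.
A self-contained sketch of that shape (this is NOT your DynHashTable,
and I'm simplifying to one globally sorted array instead of per-slot
segments):

#include <vector>
#include <algorithm>
#include <iterator>

// Pending adds pile up cheaply in a side structure; one Commit()
// merges them into the sorted entry array in a single pass, with a
// single allocation, no matter how many entries were added.
class BatchAdder {
    std::vector<unsigned> pending;
public:
    void AddEntry(unsigned key) { pending.push_back(key); }

    void Commit(std::vector<unsigned> &entries) {
        std::sort(pending.begin(), pending.end());
        std::vector<unsigned> merged;
        merged.reserve(entries.size() + pending.size()); // the ONE alloc
        std::merge(entries.begin(), entries.end(),
                   pending.begin(), pending.end(),
                   std::back_inserter(merged));
        entries.swap(merged);
        pending.clear();
    }
};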

Do you add entries that often? Adding an entry to a normal hash
table, in the case where it doesn't need to grow the table (that's
the regular case), is quite fast. Once in a while, it has to grow the
table, that's all. The fact that adding an element sometimes isn't
done in bounded time is a big disadvantage, though; that's why the
associative containers in the C++ STL are actually *not* hash tables,
but red-black trees!
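
For instance (and this is the trade-off I mean), the tree-based STL
containers give you the bounded, logarithmic insert:

#include <map>
#include <string>
#include <cstdio>

// std::map is a balanced (red-black) tree underneath: every insert
// and lookup is O(log n) worst case, with no unbounded rehash pause.
int main()
{
    std::map<std::string, int> m;
    m["foo"] = 1;                  // O(log n), always
    m["bar"] = 2;

    std::map<std::string, int>::iterator it = m.find("foo");
    if (it != m.end())
        std::printf("%s -> %d\n", it->first.c_str(), it->second);
    return 0;
}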

The improvement you're getting with your DynHashTable is only visible
for a large number of successive add-entry operations, *or* when you
need adding to be done in bounded time. While there is nothing that
can be done to fix the unbounded time required to add an entry to a
hash table (you have to alter the algorithm, like going to your
DynHashTable or to a red-black tree like the STL), I feel that the
case where you're adding a bunch of entries to the hash table is not
very important. Since I grow the maximum bucket size with the size of
the hash table (to the square root, or whatever) and the table
doubles in size, the rate at which growing the table needs to happen
goes down as the cost of doing so goes up. When growing the hash
table becomes serious business, you can be sure that you can add
quite a few entries (at very little cost) before you need to grow
again.
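
To put a number on that: with doubling, the total rehashing work over
a whole run of inserts stays under twice the final count, so each add
is O(1) on average. A quick sketch that counts the work:

#include <cstdio>

// Simulate n inserts into a doubling table and total up the rehash
// work; it comes out below 2n, i.e. amortized O(1) per insert (even
// though the single insert that triggers a growth is slow).
int main()
{
    unsigned long size = 1, rehashed = 0;
    const unsigned long n = 1000000;
    for (unsigned long i = 1; i <= n; ++i) {
        if (i > size) {         // table "full": grow and rehash all
            rehashed += size;   // cost of moving the existing entries
            size *= 2;
        }
    }
    std::printf("inserts: %lu, rehash work: %lu (< 2n)\n", n, rehashed);
    return 0;
}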

The reasoning is simply that it wasn't worth the effort to switch
from one type of hash table to another, then back once we're finished
adding stuff to it. We have one Hashtable class, that's it.

I just felt that discussing THE most optimized algorithm for such a
relatively unimportant part of the system is the kind of thing that
makes projects of the scope of PenguinPlay bog down. Heck, the
regular Unix ufs uses a darn ARRAY for directory entries!

Unix got where it is now with the "90% of the work for 10% of the
effort" rule, guys... Otherwise, you get Lisp and Lisp Machines, and
you know where they are now (in garbage dumps and museums). Nobody
cares that they were the best machines and environments ever
conceived. Don't get pedantic with that last 10% of the work; Unix is
*FINE* without it, see?

Where does PenguinPlay want to be? On all the Linux machines? Or in the
garbage dump?

-- 
Pierre Phaneuf
Ludus Design, http://ludusdesign.com/
"First they ignore you. Then they laugh at you.
Then they fight you. Then you win." -- Gandhi