
I need my head examined (on ppf_DirTree)



I sit down and begin writing a short email about how ppf_DirTree could be
optimised, and, guess what, I of course forget what the single most
important reason I began writing it in the first place was... Guess it's
just one of those days...

The current way ppf_DirTree finds out how two filenames relate to each
other (less than, equal to or greater than) is to simply compare the full
filenames using strcmp(). This works nicely: as all filenames must be
unique, we have a unique identifier.

The disadvantage of using the filenames as the key is that comparing
strings is a relatively expensive operation. The question then becomes
whether or not there is some more efficient way of finding out how two
filenames relate to each other.

We could use the length of the string as our primary key, and the filename
itself as our secondary key. So, we only go ahead and do the expensive
filename comparison if we know that the strings have the same length. This
has a number of drawbacks, though:

A) If the tree contains many files of the same length (pix001.gif,
   pix002.gif, pix003.gif etc.), we will be comparing filenames each and
   every time anyway, and we have then in fact made our compare algorithm
   slower, as we are comparing more than we were before.

B) We would have to know the length of each filename in the tree, and if
   we haven't been told what it is, we would have to compute it for each
   and every filename we wished to search for.

C) We would have to store the length of the filename together with the
   entry data for each node, making our tree take up more memory than
   would otherwise be the case.
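
To make that concrete, here is a minimal sketch of what such a two-level
compare could look like. The node layout and the names are mine, made up
for illustration; they are not the real ppf_DirTree structures:

    #include <string.h>   /* strcmp */
    #include <stddef.h>   /* size_t */

    /* Hypothetical node layout; note the extra cached length (drawback C). */
    struct Node
    {
        const char *name;
        size_t      len;     /* strlen(name), cached at insertion time */
        /* ... entry data, left/right child pointers ... */
    };

    /* Length is the primary key, the filename itself the secondary key. */
    static int CompareByLength(const struct Node *a, const struct Node *b)
    {
        if (a->len != b->len)
            return (a->len < b->len) ? -1 : 1;
        return strcmp(a->name, b->name);  /* only when the lengths match */
    }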

strlen() is therefore not that efficient a hash function to use. However,
isn't there some way in which we can hash the filename into something that
has a larger chance of being unique? Well, we could XOR each byte of the
string together (a plain OR would quickly saturate towards all bits set,
so XOR is what we want here). That would be quite a good hash value, as it
would change with any single-byte change in the filename, so pix001.gif
and pix002.gif would hash to different values. Using such a hash
algorithm, we would have a pretty good chance of having the hash values
spread out over most of the 0-255 range of a byte, while only making each
node larger by 1 byte. For each 1024 files, that would be 1 KB.
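
A minimal sketch of such a byte-XOR hash (the function name is mine, not
from the ppf_ sources):

    /* XOR every byte of the filename together into a one-byte hash. */
    /* pix001.gif and pix002.gif differ in a single byte, so their    */
    /* hashes differ too ('1' ^ '2' == 0x03).                         */
    static unsigned char HashName(const char *name)
    {
        unsigned char h = 0;
        while (*name)
            h ^= (unsigned char)*name++;
        return h;
    }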

That sounds a bit expensive, XORing each byte (char) of the string
together, but it really is not. Consider what is needed to compare two
identical filenames that have a length of 8: 8 compares, 9 checks for null
and 9 iterations of some kind. Some implementations might use assembly
optimisations that make it possible to compare 4 bytes at a time, or even
8 bytes if the platform is 64-bit, but the benefit is minimal with such
short strings (filenames are usually not that long), and it requires that
the strings are properly aligned, which takes a little extra overhead to
ensure.

Well, what is needed to find the hash value of a filename of length 8? 8
XORs, 9 checks for null and 9 iterations of some kind. I don't know for
sure, but it seems reasonable that a compare and an XOR operation take
roughly the same amount of time.

Imagine a balanced binary search tree with 255 entries. Such a tree would
have a depth of 8 and be full, so all leaves would have a depth of 8. Now
imagine that we have a filename, and we need to find out whether or not
this filename is in the tree somewhere, and if it is, we need to find the
exact node in which it is contained. Using the filename as our key and no
hashing, and weighting each level by the number of nodes on it (half of
the nodes of a full tree are in its leaves), we would on average have to
do:

    (1*1 + 2*2 + 3*4 + 4*8 + 5*16 + 6*32 + 7*64 + 8*128) / 255 ~= 7

checks on this key. That's, on average, about 7 calls to strcmp(). If we
on the other hand were to hash the key to a hash value rather than use the
key directly, we would have to do one initial hash calculation and then,
on average, 7 compares.

Now, the question is whether 7 calls to strcmp() are slower or faster than
one hash calculation, 1 call to strcmp() and 7 hash value compares. The
answer I come to is that hashing is faster. So in a 255 node tree of our
kind, hashing is faster on average.
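
In code, the search step could look something like the sketch below; the
tree would then have to be ordered by (hash, name) rather than by the name
alone. Again, the names and layout are hypothetical:

    /* Hypothetical node with a cached one-byte hash. */
    struct HNode
    {
        const char   *name;
        unsigned char hash;   /* HashName(name), computed at insertion */
    };

    /* Cheap one-byte compare first; the expensive strcmp() only runs */
    /* when the hashes match, i.e. at the target node or on the rare  */
    /* (roughly 1 in 256) collision.                                  */
    static int CompareHashed(unsigned char keyhash, const char *keyname,
                             const struct HNode *node)
    {
        if (keyhash != node->hash)
            return (keyhash < node->hash) ? -1 : 1;
        return strcmp(keyname, node->name);
    }

The caller computes keyhash = HashName(keyname) once before walking the
tree, which is the single up-front cost mentioned above.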

How about the best-case scenario, where the filename we are looking for is
contained in the root node? In this case, hashing is slower by the initial
hash calculation (roughly the cost of one strcmp()) plus one hash value
compare. In any case where the answer is in neither the root nor its
children, hashing is faster, so even in a tree with just 9 nodes, hashing
will still be faster on average (in a tree where all the filenames look
somewhat alike), and it gets faster and faster compared to always
comparing the full filenames the larger the tree grows.

Notice that I have not taken into account situations in which two
filenames have the same hash value but are not identical. The reason for
this is that, with evenly spread hash values, it statistically only
happens about once in every 256 compares, so the effect of this should be
quite minimal.

Other hash algorithms could be to XOR only the first 4 or so bytes (chars)
of the string, or the last 4 (without the extension), or perhaps to use a
multi-byte hash value, perhaps storing two separate hash values, one in
each byte. The upper byte could be the combined (XORed) value of the last
4 bytes of the string (without extension) and the lower byte could be the
length, or perhaps the combined value of the first 4 bytes or something
like that (I think that would be inefficient, though). There are many
possibilities.
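
As one illustration of the multi-byte idea (my sketch, not a worked-out
design): the upper byte XORs the last 4 characters before the extension,
and the lower byte holds the length:

    #include <string.h>   /* strlen, strrchr */

    /* Two-byte variant: upper byte = XOR of the last 4 chars before */
    /* the extension, lower byte = the filename length.              */
    static unsigned short HashName2(const char *name)
    {
        size_t len = strlen(name);
        const char *dot = strrchr(name, '.');
        size_t end = dot ? (size_t)(dot - name) : len;
        size_t start = (end > 4) ? end - 4 : 0;
        unsigned char h = 0;
        size_t i;

        for (i = start; i < end; i++)
            h ^= (unsigned char)name[i];
        return (unsigned short)((h << 8) | (len & 0xFF));
    }

This costs one more byte per node than the one-byte hash, and since
related filenames often share a length, most of the discriminating power
still comes from the upper byte.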

Also, I was thinking about the fact that there usually aren't that many
directories. Perhaps it would be more efficient to store directories in
another kind of structure than a tree?