
To: <penguinplay@sunsite.auc.dk>
Subject: I need my head examined (on ppf_DirTree)
From: "Bjarke Hammersholt Roune" <asmild@post6.tele.dk>
Date: Sun, 18 Jul 1999 03:35:29 +0200
Reply-To: penguinplay@sunsite.auc.dk

I sit down and begin writing a short email about how ppf_DirTree could be optimised, and, guess what, I of course forget the single most important reason I began writing it in the first place... Guess it's just one of those days...

The current way ppf_DirTree finds out how two filenames relate to each other (less than, equal to or greater than) is simply to compare the full filenames using strcmp(). This works nicely: all filenames must be unique, so we have a unique identifier. The disadvantage of using the filenames as the key is that comparing strings is a relatively expensive operation. The question then becomes whether there is some more efficient way of finding out how two filenames relate to each other.

We could use the length of the string as our primary key, and the filename itself as our secondary key, so that we only do the expensive filename comparison when we know the strings have the same length. This has a number of drawbacks, though:

A) If the tree contains many filenames of the same length (pix001.gif, pix002.gif, pix003.gif etc.), we will be comparing filenames every time anyway, and we have in fact made our comparison slower, as we are comparing more than we were before.

B) We would have to know the length of each filename, and if we haven't been told what it is, we would have to compute it for each and every filename we wished to search for.

C) We would have to store the length of the filename together with the entry data in each node, making our tree take up more memory than would otherwise be the case.

strlen() is therefore not that efficient a hash function to use. However, isn't there some way we can hash the filename into something that has a larger chance of being unique? Well, we could do something like XORing each byte of the string together.
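As a sketch of what I mean (this is not actual ppf_DirTree code; the function name is just for illustration):

```cpp
#include <cstdint>

// Hypothetical helper: XOR every byte of the filename together,
// yielding a one-byte hash in the range 0-255.
inline std::uint8_t HashFilename(const char* name)
{
    std::uint8_t hash = 0;
    for (; *name != '\0'; ++name)            // one pass over the string
        hash ^= static_cast<std::uint8_t>(*name);
    return hash;
}
```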
That would certainly be a very good hash value, as it would change with any single-byte change in the filename, so pix001.gif and pix002.gif would hash to different values. Using such a hash algorithm, we would have a pretty good chance of having the hash values spread out to statistically fill the entire 0-255 range of a byte, while only making each node larger by 1 byte. For every 1024 files, that is 1 KB.

It sounds a bit expensive to XOR each byte (char) of the string together, but it really is not. Consider what is needed to compare two identical filenames of length 8: 8 compares, 9 checks for null and 9 iterations of some kind. Some implementations might use assembly optimisations that make it possible to compare 4 bytes at a time, or even 8 bytes if the platform is 64-bit, but the benefit is minimal with strings this short (filenames are usually not that long), and it requires that the strings be properly aligned, which takes a little extra overhead to ensure. Well, what is needed to find the hash value of a filename of length 8? 8 XORs, 9 checks for null and 9 iterations of some kind. I don't know for certain, but it seems reasonable that a compare and an XOR operation take roughly the same amount of time.

Now imagine a balanced binary search tree with 256 entries. Such a tree would be essentially full, with a depth of 8, so all leaves would also have a depth of 8. Imagine that we have a filename and need to find out whether it is in the tree somewhere, and if it is, exactly which node contains it. Using the filename as our key with no hashing, we would on average have to do (8+7+6+5+4+3+2+1) / 8 = 4.5 checks on this key. That's, on average, 4.5 calls to strcmp(). If we instead hash the key to a hash value rather than using the key directly, we would have to do one initial hash calculation, and then, on average, 4.5 compares.
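The comparison itself would then look something like this (again just a sketch with made-up names, not the real ppf_DirTree code): compare the cached one-byte hashes first, and only fall back to the expensive strcmp() when the hashes are equal.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical three-way comparison: hash is the primary key,
// the filename itself is the secondary key.
int CompareEntries(std::uint8_t hashA, const char* nameA,
                   std::uint8_t hashB, const char* nameB)
{
    if (hashA != hashB)                       // cheap one-byte compare
        return (hashA < hashB) ? -1 : 1;
    return std::strcmp(nameA, nameB);         // hashes tie: compare names
}
```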
Now, the question is whether 4.5 calls to strcmp() are slower or faster than one hash calculation, 1 call to strcmp() and 4.5 hash-value compares. The answer I come to is that hashing is faster. So in a 256-node tree of our kind, hashing is faster on average.

How about a best-case scenario, where the filename we are looking for is contained by the root node? In this case, hashing is slower by one hash-value compare plus the initial hash calculation (which costs roughly one strcmp()). In any case where the answer is found neither at the root nor at its children, hashing is faster, so even in a tree with just 9 nodes, hashing will still be faster on average (in a tree where all the filenames look somewhat alike), and it gets faster and faster compared to always comparing the full filenames the larger the tree grows.

Notice that I have not taken into account situations in which two filenames have the same hash value but are not identical. The reason for this is that it statistically only happens once in every 256 compares, so the effect of this should be quite minimal.

Other hash algorithms could be to XOR only the first 4 or so bytes (chars) of the string, or the last 4 (without the extension), or perhaps to use a multi-byte hash value, perhaps storing two separate hash values, one in each byte. The upper byte could be the combined (XORed) value of the last 4 bytes of the string (without extension) and the lower byte could be the length, or perhaps the combined value of the first 4 bytes or something like that (I think that would be inefficient, though). There are many possibilities.

Also, I was thinking about the fact that there usually aren't that many directories. Perhaps it would be more efficient to store directories in another kind of structure than a tree?
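To make the whole scheme concrete, here is a sketch of a lookup over such a tree (hypothetical node layout and function names, not the actual ppf_DirTree structures): each node caches the one-byte hash of its filename next to the entry data, the sought name is hashed once up front, and strcmp() only runs on hash ties.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical node: caching the hash costs 1 extra byte per node.
struct Node
{
    std::uint8_t hash;   // XOR of all bytes of the filename
    const char*  name;
    Node*        left;
    Node*        right;
};

std::uint8_t HashName(const char* s)
{
    std::uint8_t h = 0;
    for (; *s; ++s)
        h ^= static_cast<std::uint8_t>(*s);
    return h;
}

Node* Find(Node* root, const char* name)
{
    const std::uint8_t h = HashName(name);   // one hash calculation up front
    while (root != nullptr)
    {
        // Primary key: the cached hash. Secondary key: the name itself.
        const int cmp = (h != root->hash) ? (h < root->hash ? -1 : 1)
                                          : std::strcmp(name, root->name);
        if (cmp == 0)
            return root;                     // exact node found
        root = (cmp < 0) ? root->left : root->right;
    }
    return nullptr;                          // not in the tree
}
```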
