
Strangest thing happened; it seems I'm perfectly normal. That doctor must be a fraud! I want my money back!



This is a reply to three emails.

>>The current implementation of ppf_DirTree can handle a preconstructed tree
>>by passing it the root node of this tree. ppf_DirTree becomes read-only in
>>this case, though (why?). This can easily be implemented with this
>>
>Because if the preconstructed tree doesn't satisfy the "search tree" or
>"balanced" requirements, changing it could get very messy.
>
If it's unbalanced or not even a search tree, and that is a problem (which it
is), why not just MAKE it a balanced binary search tree (and, if my idea is
used, also a complete one)? How would you end up with such a tree, anyway?

>Reading would
>simply return wrong values sometimes, but writing could seriously break
>things.
>
If you can't trust the information you get from such a tree, isn't it
rather useless?

>I'll have a second look at this when I implement the overloading thing,
>however...
>
We want overloaded mounting to be as fast as, or nearly as fast as,
non-overloaded mounting. If we were to merge the trees of all the paks we
mount, I really don't think this would be the case. We would also have a
very large problem un-merging a pak. Of course, we could store in each and
every node which pak it belonged to and which files, if any, it was hiding
(i.e., files in the same location having the same name). This would in
effect be adding a linked list and a pointer to each and every node. We
could also add this data to the entry data, ppf_Directory and ppf_File. That
prospect somehow doesn't sound a whole lot better to me. A last possibility
would be to start over and re-merge all paks that had not been unmounted
yet, but that's even worse.

If this is to be done by merging, then mounting/unmounting would have to be
done in mount/unmount sessions, where the client program informs PenguinFile
that it would like to do some mounting/unmounting. If it's not done in
sessions, 3 mounts (beyond the first mount) would result in 3 separate sorts
instead of just 1. The same goes for unmounting.

I think keeping the trees for separate paks separate is a more elegant way
to handle this issue. It avoids the large-scale sorting and file-tracking,
not to mention the mess, involved in merging and unmerging the pak file
structure at every mount or unmount (or mount/unmount session). Implementing
things this way would be very easy: when looking for a file (let's say we've
already found the appropriate directory), that file would first be searched
for in the tree which belongs to the pak with the highest priority. If it's
there, we've found our file and can stop searching. If it's not there, we go
on to the tree with the next highest priority and search there; if it's
there, we've found our file and can stop searching. If it isn't there, we go
on to the tree with the third highest priority, and so on, until we either
find the file or run out of trees.
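
In code, that lookup loop could look something like this (a rough sketch;
PakTree, Find() and FindOverloaded() are made-up names for illustration,
not the actual ppf_DirTree interface):

#include <list>

// Hypothetical stand-ins for the real PenguinFile types.
struct ppf_File;

struct PakTree
{
    int priority;                      // higher value = searched first
    ppf_File *Find(const char *name);  // returns 0 if not present
};

// 'trees' is kept sorted by descending priority at mount time.
// Search each pak's tree in turn; the first hit wins, so files in
// lower-priority paks are hidden (overloaded) automatically.
ppf_File *FindOverloaded(std::list<PakTree *> &trees, const char *name)
{
    for (std::list<PakTree *>::iterator i = trees.begin();
         i != trees.end(); ++i)
    {
        if (ppf_File *f = (*i)->Find(name))
            return f;   // found it in the highest-priority pak that has it
    }
    return 0;           // ran out of trees - no such file
}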

In this implementation, mounting is as simple as adding the file/directory
trees for that pak to the rest of the file/directory trees and giving it a
priority. Files that are supposed to be hidden (overloaded) by any of those
trees will be so automatically. Unmounting is as simple as removing the
trees of the pak being unmounted. Any files or directories previously hidden
(overloaded) by the pak will automatically pop into view again. Notice that
no sorting whatsoever is needed.
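
Mounting and unmounting then reduce to an ordered list insertion and a
removal; again a sketch, using the same made-up PakTree type as above:

#include <list>

// (PakTree as in the previous sketch.)
// Mounting is just an ordered insertion - no sorting of file entries.
void Mount(std::list<PakTree *> &trees, PakTree *pak)
{
    std::list<PakTree *>::iterator i = trees.begin();
    while (i != trees.end() && (*i)->priority >= pak->priority)
        ++i;
    trees.insert(i, pak);
}

// Unmounting just removes the pak's trees; anything they were hiding
// (overloading) pops back into view by itself.
void Unmount(std::list<PakTree *> &trees, PakTree *pak)
{
    trees.remove(pak);
}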

I find this implementation efficient, elegant, easy to code and OOP'y
(things take care of themselves, and do so with minimum effort). What more
can you ask for? We also get rid of that annoying mount/unmount session
thingy.

Notice that hashing would really speed things up in an implementation of
this kind, as we will be doing searches in several trees instead of just 1.
With hashing, the performance loss should be minimal, and in any case it
would be spread out, in contrast to the performance loss related to sorting,
which would all be concentrated at the same moments (at every mount or
unmount, or mount/unmount session).

>>is implemented, though; some implementations wouldn't need a sort). Just
>>mounting a pak file that didn't overlap with any other pak files would not
>>cause a sort, as it would already be sorted before it was written to the
>>pak file (right?).
>>
>For format 0, yes. Format 1 stores all dir entries in an unsorted linked
>list, so some sorting has to be done at mounting time. But format 1
>paks won't be very big anyway, so this doesn't matter.
>
I thought format 1 paks were supposed to be used by developers while
developing the game, so they could easily alter their paks without having to
recompile them all the time. You mean format 1 paks are also supposed to
find their way into the release versions?

Why would format 1 paks be small? I mean, what's stopping them from being
large? I don't remember any size restrictions being mentioned in the
documentation...

If format 1 dir entries are all stored in an unsorted linked list, why would
we need any sorting? An unsorted linked list is by definition not sorted...


>>>strlen() is therefore not that efficient a hash function to use. However,
>>>isn't there some way in which we can hash the filename into something
>>>that has a larger chance of being unique? Well, we could do something
>>>like ORing each byte of the string together. That would certainly be a
>>>very good hash value, as it would change with any change in the filename,
>>>so pix001.gif and pix002.gif would hash to different values. Using such a
>>>hash algorithm, we would have a pretty good chance of having the hash
>>>values spread out to statistically fill the entire 0-255 range of a byte,
>>>this only making each node larger by 1 byte. For each 1024 files, that
>>>would be 1 kB.
>>>
>>ORing all the bytes together would of course be completely pointless, as
>>it would just give you a value of 0xFF or something very close. What I
>>meant was, of course, XOR - eXclusive OR.
>>
>Well, but then you have to do the same processing on the string you compare
>with (the one you're searching) -
>
Well, this would of course be done beforehand, when the pak file was being
created. If it wasn't, this kind of hashing would be rather senseless.

> and with that you lose much of the speed
>benefit (xoring all bytes in a string should be slower than a worst-case
>comparison - and you don't have many comparisons).
>That could give some minor benefit for very large directories, but else ...
>
You would have to hash the filename of the file you were searching for prior
to the search, but as I said above, the filenames you would be searching
against would already have been hashed.
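
For concreteness, here is what the XOR hash under discussion could look
like (my own sketch; HashName() is a made-up name, not actual PenguinFile
code):

// XOR all bytes of the filename together into a one-byte hash.
// Two names that differ in exactly one character always hash
// differently, since the result changes by (old char XOR new char).
unsigned char HashName(const char *name)
{
    unsigned char h = 0;
    while (*name)
        h ^= (unsigned char)*name++;
    return h;
}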

>>That sounds a bit expensive, to OR each byte (char) of the string
>>together, but it really is not. Consider what is needed to compare two
>>identical filenames that have a length of 8: 8 compares, 9 checks for null
>>and 9 iterations of some kind. Some implementations might utilise assembly
>>
>Or less. It only has to compare the strings until they differ.
>
Well, in this case I was talking about identical strings, but in any other
case, you would be correct.

>>Now, the question is whether 4.5 calls to strcmp() is slower or faster
>>than a hash calculation, 1 call to strcmp() and 4.5 hash value compares.
>>The answer I come to is that hashing is faster. So in a 256 node tree of
>>our kind, hashing is faster on average.
>>
>I don't think the speed gain would be big enough to make up for the
>additional effort (and, more importantly, complexity). But you could simply
>implement it and do some benchmarking.
>
I'll do that. There wouldn't be much complexity in this, though. All that's
needed is 1 function to do the hashes, and then we just compare away. Not
much to do or understand there.
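
To illustrate how little is involved, the per-node comparison would boil
down to something like this (hypothetical node layout, not the actual
ppf_DirTree one):

#include <string.h>

// Hypothetical node carrying a precomputed one-byte hash.
struct Node
{
    unsigned char hash;   // XOR hash of name, computed at pak creation
    const char   *name;
};

// The cheap one-byte compare rejects nearly all non-matches; strcmp()
// only runs when the hashes agree (including the rare collisions).
int Matches(const Node *n, unsigned char wantedHash, const char *wantedName)
{
    return n->hash == wantedHash && strcmp(n->name, wantedName) == 0;
}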

>>Also, I was thinking about the fact that there usually aren't that many
>>directories. Perhaps it would be more efficient to store directories in
>>another kind of structure than a tree?
>>
>For example?
>Arrays and lists are clearly slower.
>
Well, I haven't actually come up with anything yet on this... I can say,
though, that hashing for directories probably wouldn't be a good idea.

>Hashes have very unpredictable performance, as it depends on how similar
>the names are.
>
Depends on what you mean. This hash's performance would be very predictable,
as it changes with ANY change made, no matter how small (pix001.gif and
pix002.gif will hash differently). Therefore, this hash will use the full
0-255 range of a byte (it would also fill the entire range of a short or an
int, but in this case I think a byte is sufficient).

If you mean performance compared to full-string matching, then you are
correct that in situations where the filenames are very different
(donkey.txt, monkey.txt, dog.txt, fish.txt), hashing would not do better
(except in the case where the hashes match, in which case a full-string
match must still be made to be sure of equality), because full-string
matching already bails out at the first differing character for such names.

So the performance of this hash is actually much more predictable than that
of full-string matching.
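
A quick worked check of the pix001.gif/pix002.gif claim, reusing the
HashName() sketch from above:

#include <stdio.h>

unsigned char HashName(const char *name)
{
    unsigned char h = 0;
    while (*name)
        h ^= (unsigned char)*name++;
    return h;
}

int main()
{
    // The names differ in one character ('1' vs '2'), so the hashes
    // differ by '1' ^ '2' == 0x03: this prints 0x16 and then 0x15.
    printf("%#04x\n", HashName("pix001.gif"));
    printf("%#04x\n", HashName("pix002.gif"));
    return 0;
}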

>And other than that... ?
>
You haven't touched on these issues, which I raised in my emails:

1. Implementing a complete binary tree instead of the current incomplete
one.
2. Storing the nodes in arrays rather than dynamically allocating each and
every one (becomes feasible if the tree is complete; see the sketch below).
3. Implementing the tree as a template.
4. Using separate trees for files and dirs (becomes feasible if the tree is
templated).
5. Mounting in sessions (bad idea).
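
On points 1-4, a minimal sketch of what I mean (my illustration, assuming
the standard array layout for a complete binary tree):

// A complete binary search tree stored in one flat array: node i's
// children live at 2*i+1 and 2*i+2, so there are no child pointers
// and only a single allocation instead of one per node. Making it a
// template (point 3) is what would let files and dirs (point 4) get
// separate trees out of the same code.
template <class Entry>
struct ArrayTree
{
    Entry *nodes;   // one contiguous block, filled level by level
    int    count;

    static int Left(int i)   { return 2 * i + 1; }
    static int Right(int i)  { return 2 * i + 2; }
    static int Parent(int i) { return (i - 1) / 2; }
};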