
Funny situation (was: Re: Serialization et al)

Writing from my new Netscape E-mail reader. My first impression is that
I like this *much* better than Microsoft Outlook. Thanks go to the guy
who told me to switch (sorry, I can't remember the name).

> >> Well, no. A core assumption behind that serialization scheme was "the data
> >> structures should be able to serialize themselves" - and that's a good
> >> thing.
> >>
> >Well, yes, when it makes sense. It makes sense for HashTable to know how to
> >serialize itself in a format that is ideally suited to itself, without
> >caring too much about anything else.
> >
> >It doesn't make so much sense for it to know how to serialize itself so that
> >it meets the requirements of a multitude of different formats. That really
> >has nothing to do with any of its main responsibilities.
> Hmmm, I hadn't considered that. You're right.
This is a weird situation to say the least... My arguments have
convinced you that my original idea was best, and your arguments have
convinced me that my original idea was best. I can't say that has ever
happened to me before... :)

Anyway, perhaps I should explain why I thought my own arguments were
not as strong as I portray them here, at least not when compared to the
solution you previously thought best (i.e., specializing through
derivation).

Above, I state that HashTable should only know how to serialize itself
in a format that makes sense relative to itself. To me, that is pretty
sensible.

However, your idea is/was to make derivatives of HashTable. Thus,
HashTable still only knows how to serialize itself in a way that makes
sense relative to itself, but its derivative class ZipHashTable will
also be able to serialize Zip files. The reason this is IMHO still
appropriate is that the Zip format makes sense relative to
ZipHashTable.

ZipHashTable might be implemented in a radically different way than
HashTable (in that case, it would probably be better to have a pure
virtual HashTable base class, and make PakHashTable, ZipHashTable,
TarHashTable, etc.), if a different internal representation of the data
would make it more efficient when dealing with Zip files.
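A minimal sketch of what that class layout could look like. Only the
names HashTable and ZipHashTable come from the discussion; the method
names and the std::map stand-in are my assumptions:

```cpp
#include <cstdio>
#include <map>
#include <string>

// Pure virtual base class: each derivative picks its own internal
// representation and its own serialization format.
class HashTable {
public:
    virtual ~HashTable() {}
    virtual void Add(const std::string& key, long offset) = 0;
    virtual bool Find(const std::string& key, long& offset) const = 0;
    virtual bool SerializeTo(FILE* f) const = 0;
};

// Hypothetical derivative for Zip archives; free to use whatever
// layout is most efficient for the Zip format (a std::map stands in
// for the real hash table here).
class ZipHashTable : public HashTable {
    std::map<std::string, long> entries_;
public:
    void Add(const std::string& key, long offset) { entries_[key] = offset; }
    bool Find(const std::string& key, long& offset) const {
        std::map<std::string, long>::const_iterator it = entries_.find(key);
        if (it == entries_.end()) return false;
        offset = it->second;
        return true;
    }
    bool SerializeTo(FILE* f) const {
        return f != 0; // would write the Zip-specific layout here
    }
};
```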

Taking all these things into consideration, I have come up with a third
solution that is a hybrid of the two proposed ones. My explanation is a
bit longish, so I've placed it at the end of this E-mail.

For now, though, I wouldn't recommend making the changes necessary to
implement this new solution. Let's get something working first.

> >I do, however, think that what should count the most is how easy the format
> >is to handle programmatically, not manually. It is, afterall, a binary
> >format.
> Well, yes. Actually it should be a compromise of both ;)
> A format that's very simple to write is fine, but quite problematic if you
> can't tell how it actually looks like. Imagine some image manipulation
> software say: "Well, the image files are written by this code, and that one
> can read them again. At least for the cases we tested. You want to write a
> viewer for them? Uh, you could reverse-engineer it..." ;)
He he :)

What I meant was that IMHO it's most important that the serialization
code is both easy to write and easy to explain. I believe that is the
case for the current format.

Now, of course, you are correct that verifying the product of the
format (i.e., the actual pak files) manually shouldn't be so insanely
difficult that it basically becomes impossible.

> Wait, <searching> Ok, here is an example of the LISP code I was thinking of:
> (defun fak1 (n)
>    (fak_iter 1 1 n)
> )
> (defun fak_iter (product counter n)
>    (if (> counter n) product
>        (fak_iter (* counter product) (1+ counter) n)
>    )
> )
> It calculates the factorial of n
> Our Prof presented this as example of an "iterative" process (meaning that
> a compiler can optimize it to an iterative process) ;)
> The recursive version is:
> (defun fak (n)
>    (if (= n 1) 1
>        (* n (fak (1- n)))
>    )
> )
> Just as little anecdote...
I've just decided I don't like LISP :)

C++ is so superior to that.
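Incidentally, the two quoted variants translate almost line-for-line
into C++ (keeping the quoted fak/fak1/fak_iter names):

```cpp
// Accumulator version: the recursive call is in tail position, so a
// compiler can optimize it into an iterative loop.
long fak_iter(long product, long counter, long n) {
    if (counter > n) return product;
    return fak_iter(counter * product, counter + 1, n);
}

long fak1(long n) { return fak_iter(1, 1, n); }

// Plainly recursive version: the multiply happens after the recursive
// call returns, so a stack frame is kept per level.
long fak(long n) {
    if (n == 1) return 1;
    return n * fak(n - 1);
}
```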

> >It is my understanding that when a function calls itself recursively, it
> >gets its code loaded to the instruction-stack (what's the proper name?)
> >twice. Have I misunderstood this? Doesn't doing it this way save a lot of
> >pushing/popping on the stack?
> Calling a function pushes the instruction pointer on the stack (usually
> 32bit) and some of the processor registers. Then space on the stack is
> allocated for the function's local variables (by simply adjusting the stack
> pointer). No code is pushed.
> That amounts to about, say, 6-10 pushes on an X86 plus an add (or sub)
> operation on the stack pointer.
Ahh, thank you very much for explaining this to me. I agree, the code I
made would have been much better had it been done with recursion. We
should change it sometime when we haven't got anything better to do (if
we choose to implement the solution I detail at the end of this E-mail,
doing it along with that would make sense).

> >Anyway, I think this works quite well (even if those 10 lines might look a
> >bit arcane the first few seconds), so instead of polishing off code that
> >already works, I'd recommend getting on with more important matters.
> Agreed. It was only a side note anyway ;)
Well, thanks for pointing it out, though. You've improved my future
code.

That other solution I mentioned above:

We have a pure virtual base class. Let's call it
FileSystemEntityContainer. The name might need a bit of shortening, but
otherwise, it should be ok.

FileSystemEntityContainer declares pure virtual functions for entry
look-up, entry addition, entry removal, and serialization to and from
storage. It will also implement some accessor methods exposing basic
properties of the container, like the number of contained entries.

FileSystemEntityContainer is not meant to be a collection, but a
repository (someone has been reading a bit too much theory lately ;),
which means it is meant to actually store and manage its entries, not
just hold a pointer to them.

FileSystemEntityContainer is a template (this *allows* it to be
implemented as a collection). It has template parameters that specify
what kind of entries it will store, as well as what kind of key it will
accept when searching for such an entry. That's it; nothing
miscellaneous.

FileSystemEntityContainer requires that classes used as the entry-type
parameter must declare these methods:

key-type GetKey() const
entry-type& operator=(const entry-type&)

If serialization is also intended, these additional methods must also
be declared:

bool SerializeTo(FILE*) const
bool SerializeFrom(FILE*)

FileSystemEntityContainer requires that classes used as the key-type
parameter must declare these methods:

bool operator==(const key-type&) const
bool operator>(const key-type&) const
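Put together, a sketch of the interface and of classes meeting those
requirements could look like this. The concrete names (FileName,
FileEntry) and the non-listed method names are made up for
illustration, and the SerializeTo/From bodies are placeholders:

```cpp
#include <cstdio>
#include <string>

// The templated repository interface: look-up, adding, removal,
// serialization, and a basic-property accessor.
template <class EntryType, class KeyType>
class FileSystemEntityContainer {
public:
    virtual ~FileSystemEntityContainer() {}
    virtual EntryType* Find(const KeyType& key) = 0;
    virtual bool Add(const EntryType& entry) = 0;
    virtual bool Remove(const KeyType& key) = 0;
    virtual bool SerializeTo(FILE* f) const = 0;
    virtual bool SerializeFrom(FILE* f) = 0;
    virtual unsigned GetEntryCount() const = 0;
};

// A key type only needs equality and ordering.
class FileName {
    std::string name_;
public:
    FileName(const char* n) : name_(n) {}
    bool operator==(const FileName& o) const { return name_ == o.name_; }
    bool operator>(const FileName& o) const { return name_ > o.name_; }
};

// An entry type needs GetKey and assignment; SerializeTo/From only if
// serialization is intended.
class FileEntry {
    FileName name_;
    long offset_;
public:
    FileEntry(const char* n, long off) : name_(n), offset_(off) {}
    FileName GetKey() const { return name_; }
    FileEntry& operator=(const FileEntry& o) {
        name_ = o.name_;
        offset_ = o.offset_;
        return *this;
    }
    bool SerializeTo(FILE* f) const { return f != 0; }  // placeholder
    bool SerializeFrom(FILE* f) { return f != 0; }      // placeholder
};
```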

Directory will store its contained files and directories in
FileSystemEntityContainers. These containers are created in the
constructors of Directory's derivatives. This eliminates the need for
virtual factory methods (of which we would otherwise need two), at no
inconvenience or cost. The Directory destructor doesn't take care of
calling delete on its FileSystemEntityContainers, as its constructor
wasn't the one that allocated them.
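A bare-bones sketch of that ownership scheme, with the container
reduced to a non-template stand-in and ZipDirectory/ZipFileContainer
as hypothetical names:

```cpp
// Non-template stand-in for the container described above.
class FileSystemEntityContainer {
public:
    virtual ~FileSystemEntityContainer() {}
};

class ZipFileContainer : public FileSystemEntityContainer {};

class Directory {
protected:
    FileSystemEntityContainer* files_;
    // The base constructor does not allocate the container...
    Directory() : files_(0) {}
public:
    // ...so the base destructor does not delete it either.
    virtual ~Directory() {}
    FileSystemEntityContainer* GetFiles() const { return files_; }
};

class ZipDirectory : public Directory {
public:
    // The derivative's constructor creates the concrete container,
    // which removes the need for a virtual factory method.
    ZipDirectory() { files_ = new ZipFileContainer; }
    // Whoever allocated it is responsible for deleting it.
    ~ZipDirectory() { delete files_; }
};
```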

I believe this to be an extremely flexible way of doing it. I can't
think of much we wouldn't be able to do here. Notice that the exact
algorithm used internally in FileSystemEntityContainer derivatives is
completely abstracted away. I believe this to be a good thing.

Now, there is still the issue of where to place the serialization code.
I've been thinking about this quite a lot, actually. We have two
options:
A) Place it in the FileSystemEntityContainer derivative.
Argument: The FileSystemEntityContainer will know best how to serialize
itself, as it is the only class that can appropriately make assumptions
about its own implementation.

Bad thing: It's best for code reuse and encapsulation to place
format-specific code in a Directory derivative, as that is the class
that should know about the format.

B) Place it in the Directory derivative.
Argument: The Directory derivative *represents* the format to the
"world". Format-specific functionality should therefore be encapsulated
within the Directory derivative, as that is the appropriate place to
put format-specific behavior.

Bad thing: It's best for efficiency and encapsulation to place
implementation-specific code in the only class that is allowed to know
a lot about implementation details: the class itself. In this case,
that is the FileSystemEntityContainer derivative.
Both A) and B) are supported by good arguments, and both are good
solutions. Actually, they are both so good that I don't think it's
appropriate to say that one will clearly be the better solution in all
cases. I think the best answer to the question "What is the most
appropriate solution?" is: "It depends."

It depends on the specific situation. Placing the serialization code in
the Directory derivative will allow us to reuse the code in the
corresponding FileSystemEntityContainer derivative if other Directory
derivatives need a FileSystemEntityContainer derivative that utilizes
the same algorithm (like a hash table algorithm).

Placing the serialization code in the FileSystemEntityContainer
derivative will allow us to depend on intimate implementation details of
the specific FileSystemEntityContainer derivative we are using, thus
possibly making stuff work considerably faster.

In each situation, the speed gain from placing the code in the
FileSystemEntityContainer derivative must be weighed against the
code-reuse benefit of placing it in the Directory derivative.
Sometimes the speed gain will be non-existent, and other times we won't
want to reuse the code in the FileSystemEntityContainer derivative
anyway.
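To make the trade-off concrete, here is a hypothetical sketch of both
placements. HashContainer and ZipDirectory are made-up names, and the
"serialization" bodies just count entries instead of actually writing
anything:

```cpp
#include <map>
#include <string>

// Option A: the container derivative serializes itself, so it may walk
// its private representation directly (fast, but format-specific code
// ends up inside the container).
class HashContainer {
    std::map<std::string, long> entries_; // stand-in representation
public:
    void Add(const std::string& key, long value) { entries_[key] = value; }
    unsigned GetEntryCount() const { return (unsigned)entries_.size(); }
    // A: would write the entries using direct knowledge of the layout.
    unsigned SerializeSelf() const { return GetEntryCount(); }
};

// Option B: the Directory derivative serializes the container through
// its public interface only, so the container stays reusable across
// formats.
class ZipDirectory {
    HashContainer files_;
public:
    HashContainer& Files() { return files_; }
    // B: would write the Zip-specific layout via public accessors only.
    unsigned Serialize() const { return files_.GetEntryCount(); }
};
```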

Phew! :)

Anyway, I believe the above solution incorporates the best qualities of
both previously proposed solutions, while steering clear of their
limitations.