SEUL: sdoc draft spec

[This is sent to seul-research because some people there are
interested in sdoc. If you want to hear more about this, subscribe with
"echo subscribe seul-pub-www | mail majordomo@seul.org"]

The following is a mixture of overview documentation, technical
documentation, and other mutterings. Whatever it is, it's certainly
not the final version. :)


sdoc is a script that parses and manipulates html-style markup text. It
is designed to be fully xml-compliant, and it might even become fully
sgml-compliant one day. (That's much more difficult to verify, given the
length of the sgml specification.)

In brief, sdoc associates with each tag a separate perl script, called a
'handler'. sdoc parses the tag for its name, its parameters, and if
applicable the 'inside text' (if it's a paired tag), and then passes
this information to the handler. The handler does whatever it wants
to do, and then passes back new text. sdoc replaces the old tag with the
new text that the handler supplies. sdoc continues making passes over the
document until no handlers have made any changes. A single handler can be
responsible for multiple tags; if no handler is specified for a tag, it is
assigned a 'default' handler, which simply passes the text through.
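
To make the pass loop concrete, here is a rough model of it. It is written
in Python rather than perl, it only understands unpaired tags, and every
name in it (run_passes, default_handler, the max_passes safety limit, the
<today> and <date> tags) is made up for illustration. It is a sketch of the
idea, not sdoc's actual code.

```python
import re

# Toy tag syntax: <name params>. Closing tags such as </p> are not matched.
TAG = re.compile(r"<(\w+)([^<>]*)>")


def default_handler(tagname, params, original):
    """Default handler: pass the tag through untouched."""
    return original


def run_passes(document, handlers, max_passes=20):
    """Apply handlers pass after pass until a pass changes nothing."""
    for pass_number in range(1, max_passes + 1):

        def replace(match):
            tagname, params = match.group(1), match.group(2).strip()
            handler = handlers.get(tagname, default_handler)
            return handler(tagname, params, match.group(0))

        new_document = TAG.sub(replace, document)
        if new_document == document:
            # No handler changed anything on this pass, so we are done.
            return new_document, pass_number
        document = new_document
    # max_passes is an assumption of this sketch, not part of the spec.
    raise RuntimeError("handlers never settled")


# Hypothetical handlers: <today> expands to a <date> tag, which a later
# pass expands to literal text.
handlers = {
    "today": lambda tagname, params, original: "<date fmt=iso>",
    "date": lambda tagname, params, original: "1998-07-01",
}
result, passes = run_passes("<p>Built on <today>.</p>", handlers)
```

Here the first pass turns <today> into <date fmt=iso>, the second pass turns
that into text, and the third pass changes nothing, which is what ends the
loop. The <p> tag has no handler, so it goes through the default one.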

As sdoc parses the document, it recurses as far as possible before
calling handlers. This means that the innermost tags will be handled
first, and thus the outer tags will potentially have different 'inside
text' than the document originally started with.
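
The innermost-first order can be sketched like this. Again this is a Python
model with made-up names, and it assumes that a tag is paired whenever a
matching close tag appears later in the text and that same-named tags do not
nest. The real parser will need to be smarter than that.

```python
import re

OPEN_TAG = re.compile(r"<(\w+)([^<>]*)>")


def parse(text, handlers, log):
    """Recurse into paired tags before calling the handler for the outer tag."""
    out, pos = [], 0
    while True:
        match = OPEN_TAG.search(text, pos)
        if match is None:
            out.append(text[pos:])
            return "".join(out)
        out.append(text[pos:match.start()])
        name = match.group(1)
        closing = "</" + name + ">"
        end = text.find(closing, match.end())
        paired = end != -1
        if paired:
            # Handle everything inside first; the outer handler sees the result.
            inner = parse(text[match.end():end], handlers, log)
            pos = end + len(closing)
        else:
            inner, pos = "", match.end()
        log.append((name, inner))  # record the order handlers are called in
        handler = handlers.get(name)
        if handler is not None:
            out.append(handler(inner))
        elif paired:
            out.append(match.group(0) + inner + closing)  # default: pass through
        else:
            out.append(match.group(0))


# Hypothetical tags: <name> is unpaired, <upper> is paired.
handlers = {
    "name": lambda inner: "world",
    "upper": lambda inner: inner.upper(),
}
calls = []
output = parse("<upper>hello <name></upper>", handlers, calls)
```

The handler for <upper> never sees the <name> tag: by the time it is called,
its inside text is already "hello world".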


Gritty details:
There are a number of variables that are available within the scope of
a handler:

# $tagname: This is the name of the tag itself
# $tagpaired: 0 if unpaired tag
# $params: the params for this tag. Use Parse_Params to parse
# $text: This is the text for the tag ('' if unpaired)
# $Pass_Number: Which pass through the document this is
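
Parse_Params is only named above, not specified. As a guess at what such a
parser might do with $params, here is a Python sketch. The function name
parse_params, the lower-casing of keys, and the accepted quoting styles are
all assumptions of mine.

```python
import re

# key="value", key='value', or key=bare
PARAM = re.compile(r"""([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""")


def parse_params(params):
    """Turn a tag's raw parameter string into a dict of name -> value."""
    result = {}
    for key, double_quoted, single_quoted, bare in PARAM.findall(params):
        # Exactly one of the three value groups is non-empty (or all are
        # empty for an empty quoted value, which yields '').
        result[key.lower()] = double_quoted or single_quoted or bare
    return result


meta = parse_params('name="doctype" contents="webpage"')
```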

Once the handler has completed, it should return() an 'error value':
0 means that the handler is finished, and the tag (and inside text)
  should be replaced by $text.
1 indicates that the handler is not yet finished, and the tag (and
  inside text) should be replaced by $text. This is different
  primarily in that it will prevent sdoc from exiting, since the
  handler wants to do something more later.
2 means that the handler is finished, but more parsing is required.
  sdoc will read the following variables from the scope of the handler:
  $prelude, $text, $postlude. It will then replace the tag (and
  inside text) from the document with
  "$prelude . &Parse($text) . $postlude".
  That is, sdoc will do the replacement, and then recurse down onto
  the middle portion of the text that the handler returned. After that,
  it will resume parsing normally. The $prelude and $postlude will be taken
  as literal strings (though of course, later handlers and later passes
  will be free to modify them).
  [is this clear? i fear it is very confusing...]
3 is to 2 as 1 is to 0: the same replacement and re-parse as 2, but the
  handler is not yet finished, so sdoc will not exit.
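
Since the return values are the confusing part, here is how I picture sdoc
acting on them, as a small Python model. The function apply_result, the
scope dict standing in for the handler's variables, and the parse argument
are all hypothetical.

```python
def apply_result(code, scope, parse):
    """Return (replacement text, handler-still-unfinished flag) for a code."""
    if code not in (0, 1, 2, 3):
        raise ValueError("unknown handler return value: %r" % (code,))
    if code in (0, 1):
        # Replace the tag (and inside text) with $text as-is.
        replacement = scope["text"]
    else:
        # Only the middle part is parsed again; prelude and postlude are
        # taken as literal strings for now.
        replacement = (
            scope.get("prelude", "") + parse(scope["text"]) + scope.get("postlude", "")
        )
    unfinished = code in (1, 3)  # 1 and 3 keep sdoc from exiting
    return replacement, unfinished


# str.upper stands in for the recursive parse so the effect is visible.
scope = {"prelude": "<ul>", "text": "<item>", "postlude": "</ul>"}
replaced, unfinished = apply_result(2, scope, str.upper)
```

With return value 2 only "<item>" goes back through the parser here; the
"<ul>" and "</ul>" are left alone until a later handler or pass gets to them.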


Changes from the old sdoc to the new sdoc:

SDOC is now a package of its own. This means that the variable scoping is
much much cleaner. This also means that it's more difficult for handlers
to access arbitrary sdoc internal variables. This means we're going to have
to sit down and decide which variables the handlers ought to be able to get
at, and document them. This is a good thing.

I implemented handlerstacks. That is, each tag is no longer simply
associated with a perl script: now it is associated with an array (stack) of
perl scripts. When a handler is requested, the one at the "top" of the
stack is used. Any handler can push a new script onto a handler stack (add),
or pop the current one off the stack (remove). When no handlers are on a
stack, then that handlerstack behaves like the 'default' handler as
described above.
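
A handlerstack is nothing more than this, shown as a Python model rather
than the perl in sdoc. The class and method names are invented for the
sketch.

```python
class HandlerStacks:
    """One stack of handlers per tag, falling back to a default handler."""

    def __init__(self, default):
        self.default = default
        self.stacks = {}

    def push(self, tag, handler):
        """Add a handler; it becomes the one used for this tag."""
        self.stacks.setdefault(tag, []).append(handler)

    def pop(self, tag):
        """Remove and return the current handler, or None if the stack is empty."""
        stack = self.stacks.get(tag)
        return stack.pop() if stack else None

    def current(self, tag):
        """The handler at the top of the stack, or the default if none is left."""
        stack = self.stacks.get(tag)
        return stack[-1] if stack else self.default


stacks = HandlerStacks(default="default")
stacks.push("title", "plain-title")
stacks.push("title", "fancy-title")  # now shadows plain-title
top = stacks.current("title")
```

Popping "fancy-title" brings "plain-title" back, and popping that leaves the
tag with the default behavior again.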

Doctypes are no longer hard-coded in sdoc. In the previous versions of
sdoc, there was a doctype associated with each document. As soon as you
reached the <meta name="doctype" contents="webpage"> tag, then sdoc would
record that your document was of type 'webpage', and would load the
appropriate handlers (in this case, it would load all the .pl files from
$LIB/doc-types/webpage/ and associate them with tags based on filename). Now
there are a couple of initial handlers that are set during sdoc
initialization, such as "!doctype". Now, when <!doctype foo> is found, the
handler for tag !doctype is called, and it does whatever it wants to do. (In
this case, it would probably load a set of new handlers for that doctype,
and perhaps assign a new default handler.) This means that the !doctype
handler now behaves just like every other handler. There are no special
cases.
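
The "associate them with tags based on filename" step is simple enough to
sketch. This Python model assumes the directory layout mentioned above
(doc-types/<doctype>/*.pl under the lib directory) and that the tag name is
the filename minus its .pl extension; the function name is hypothetical.

```python
from pathlib import Path


def handlers_for_doctype(lib_dir, doctype):
    """Map tag name -> handler script for every .pl file of a doctype."""
    directory = Path(lib_dir) / "doc-types" / doctype
    # path.stem drops the extension, so title.pl handles the <title> tag.
    return {path.stem: path for path in sorted(directory.glob("*.pl"))}
```

A !doctype handler could call something like this and then push each script
it finds onto the handlerstack for the matching tag.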



One day I want to make sdoc an apache plugin module, so it could
dynamically generate webpages. This would allow it to function like eperl.
This is low priority.


Arguments to sdoc:

This is what usage() currently spits out:

"Usage: $0 [-hq] [options] [-o outfile|-O outfile] infile\n" .
"Usage: $0 [-hq] [options] [-o outfile|-O outfile] < infile\n\n" .
"-h, --help\tHelp (this text)\n" .
"-v, --verbose\tVerbose level. 9 is spammy, 0 is quiet (default), 6 is standard\n" .
"-q, --quiet\tQuiet mode (suppress warnings)\n" .
"-o, --output\tRedirect output to outfile, fail if outfile already exists\n" .
"-O, --Output\tRedirect output to outfile, replace if outfile already exists\n" .
"-i, --initial\tRead in initial script from scriptfile\n" .
"-H, --Handler\tAdd handlerset parameter\n" .
"-d, --doctype\tSet doctype variable\n";

-i (the 'initial script') is the script that loads such things as the !doctype
handler, as described above. We should look through it and decide how we can
minimize the number of tag handlers we start out with. This leads to greater

-d is a way of defining the doctype without including a <!doctype> tag in
your document. This is good.

-H is something omega wanted. It appears to collect each argument to -H in
@HandlerSet, which presumably the handlers use if they want to. Omega?

Also, I implemented a -Dfoo=bar parameter on the commandline, which collects
its arguments (as associative pairs) in %defines. Omega wanted this for
handlers or doctypes or something as well. Omega?
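
For the record, this is roughly how -H and -D collect their arguments,
modeled in Python with the standard getopt module. The names handler_set
and defines mirror @HandlerSet and %defines; giving a -D with no "=" an
empty value is my assumption, not something the current code promises.

```python
import getopt


def parse_args(argv):
    """Collect -H values in a list and -Dname=value pairs in a dict."""
    options, remaining = getopt.getopt(argv, "D:H:")
    defines, handler_set = {}, []
    for flag, value in options:
        if flag == "-D":
            # Split on the first "=" only, so values may contain "=".
            name, _, setting = value.partition("=")
            defines[name] = setting
        elif flag == "-H":
            handler_set.append(value)
    return defines, handler_set, remaining


defines, handler_set, files = parse_args(
    ["-Dauthor=arma", "-D", "site=seul", "-H", "nav", "-H", "footer", "page.sdoc"]
)
```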


What to do:

We need to go over this spec, and make sure we agree with what I just said
sdoc is supposed to do. We need to flesh out the areas I missed entirely.
Once we're agreed that this is a good spec, we should make sure that our
implementation of it matches the spec. In particular, we're going to have
to write an initial_script that loads the appropriate initial handlers
(I think maybe !doctype and ?xml and !-- might be good ones to start with),
we're going to have to write new handlers (for the above), and we're going
to have to look at the current handlers and adapt them to the new system if
necessary. They live in /home/seul/lib/web/sdoc/ on cran. The old sdoc
binary can be seen at http://web.mit.edu/arma/Public/sdoc, and an in-flux
version of the new one is at http://web.mit.edu/arma/Public/sdoc.tar.gz.

I want the new sdoc to be compatible with the old sdoc: we should be able
to parse old documents the same way it used to.

In particular, I hope to use sdoc for the following projects initially:

* the seul-research survey: http://www.seul.org/research/survey.html
  I want to write some simple tags like <question name="SMP use"> and have
  it expand that into something that is a good style for a survey. This
  will make it trivial to write the survey, trivial to modify it, and it
  will enforce a consistent format.

* the seul webpages themselves. We currently use sdoc for this. We just
  need to make sure that the new sdoc can be used for this as well. Also,
  omega had grand plans of more complex handlers, which would be made
  easier by handlerstacks and such. More power to him.

* the linuxunited webpages, at http://linuxunited.org/. They want to use
  sdoc, and they're stalling page development until the new sdoc is ready.

* The lu-news project, at http://linuxunited.org/projects/news/
  I want to write a "universal news submission form" in a similar way as I
  want to write the survey above: a couple of simple tags which are used
  to turn a "short-hand description of the form" into a good-looking
  consistent html form.

* Twoducks wants to use sdoc to put his task-helps
  (http://www.seul.org/dev/help/task-help/) into a more convenient form. In
  addition, the User Education project (http://usered.freeservers.com) might
  well be able to make use of sdoc. Indeed, with some simple handlers, we
  can turn sdoc into a universal documentation language -- it can convert
  html-style markup into just about anything you want, and it can handle a
  variety of different xml configurations without much fuss.

As you can tell, several projects are getting backed up because sdoc isn't
ready yet. We should fix this.

People who are interested in any of these applications should start thinking
now about what tags they want supported. That is, in seul-research, we
should come up with each "type" of question we'll have, and maybe how
they should look in terms of final html. If you give me adequate specs,
the actual implementation of a handler will be trivial.
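
To give a feel for how small such a handler is, here is a Python model of
one for the <question name="SMP use"> tag from the survey. The html it
produces, the way it derives a field name, and the function name are all
placeholders of mine; the real output format is exactly the thing we still
need specs for.

```python
from html import escape


def question_handler(params, text):
    """Expand a <question> tag into a labelled text field (placeholder layout)."""
    # Derive a form field name from the question's name, e.g. "smp_use".
    field = escape(params["name"].lower().replace(" ", "_"))
    return (
        '<p class="question">\n'
        '  <label for="%s">%s</label>\n'
        '  <input type="text" id="%s" name="%s">\n'
        "</p>" % (field, escape(text), field, field)
    )


form_row = question_handler({"name": "SMP use"}, "Do you use SMP?")
```

Changing the look of every question in the survey then means changing this
one function.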



arma@seul.org. I don't have much free time, but I know perl really well,
and I have a vague idea of the sdoc specification, and a very good idea of
what we need it to do.
omega@seul.org. He has even less free time, but he has a better idea of the
sdoc specification we were envisioning.
Mark Dunnett, mdunnett@seul.org. He knows perl and is learning xml, and
has volunteered to help make this work. He's the one with the free time
right now, I hope. :)

Nils Lohner, lohner@linuxunited.org, treasurer of debian, wants to use it
for linuxunited
Simon Waldman, swaldman@seul.org, seul webmaster, is using it for seul
Pete St. Onge, pete@seul.org, seul-research webmaster, wants to use it for
the survey
Ken Duck, twoducks@seul.org, seul-dev-help webmaster, wants to use it for
the task-helps
Camilla Fox, cfox@mit.edu, is following the sgml lists and might be able
to answer some questions and tell us where we're being stupid


Comments on any of this are very much appreciated. The more we talk
about this, the more will get accomplished. Argue with me.
(Arguments of the form "sdoc is moot -- here's an xml parser that
already does all this" are very welcome too. :)