
Re: Call for current news site data formats



Roger Dingledine wrote:
> 
> In message <35CD2331.95D3659A@appliedtheory.com>, blizzard@appliedtheory.com writes:
> >If everyone describes their data model I can collect the common elements
> >and make sure that the DTD suits everyone's common needs.
> 
> If everyone could list out the fields that they care about for news, then we
> can get a good first draft of the dtd set up and created. That will be
> awfully useful.

Here's mine...not much to it.  :)

mysql> describe linux_articles;
+--------------+--------------+------+-----+---------+----------------+
| Field        | Type         | Null | Key | Default | Extra          |
+--------------+--------------+------+-----+---------+----------------+
| article_id   | int(11)      |      | PRI | 0       | auto_increment |
| url          | varchar(128) |      |     |         |                |
| title        | varchar(128) |      |     |         |                |
| source       | varchar(128) |      |     |         |                |
| date         | varchar(20)  |      |     |         |                |
| submit       | varchar(128) | YES  |     | NULL    |                |
| article_desc | blob         | YES  |     | NULL    |                |
+--------------+--------------+------+-----+---------+----------------+
7 rows in set (0.00 sec)

mysql> 
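
Just to make this concrete, here's a guess at what one row of that table might look like as an XML record.  The element names are only placeholders based on my column names (the real ones will come out of whatever DTD we settle on), and the values are made up.  Note that article_id is local to my database, so it probably wouldn't go over the wire at all.

  <article>
    <url>http://www.example.com/some-story.html</url>
    <title>Some Headline</title>
    <source>Some News Site</source>
    <date>09-Aug-1998</date>
    <submit>someone@example.com</submit>
    <article_desc>A sentence or two describing the article.</article_desc>
  </article>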


> 
> Chris: at some point we're going to need some conversion utils for converting
> between your xml format and what each of the news sites actually wants to speak.
> are there conversion utils around already that are flexible enough to do this
> easily, or are we going to be writing our own? in particular, can we write a
> universal one, or will it end up being one per site?
> 

Well, every site has different data requirements.  I'm going to guess
that most of them are using MySQL as a database backend and Apache
with Perl embedded as a module.  Personally, my pages are generated by
Perl from a MySQL backend, but they're served as static HTML pages, no
includes or anything.

It's very easy to generate an XML page from a database because the data
is already highly structured.  The only thing individual sites have to
worry about at that point is how to get data in and out of their own
storage formats, and there are XML tools available to do that.  Take a
look at the left-hand column of this page:

http://sunsite.unc.edu/xml/
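
Just to sketch what I mean by "easy", here's roughly what a dump script
could look like in Perl with DBI, run against my table from above.  The
database name, user, and password are made up, and the element names are
the same placeholders as before:

  #!/usr/bin/perl -w
  use strict;
  use DBI;

  # Connection details are made up; substitute your own.
  my $dbh = DBI->connect('DBI:mysql:database=news', 'user', 'password',
                         { RaiseError => 1 });

  my $sth = $dbh->prepare(
      'SELECT url, title, source, date, submit, article_desc
       FROM linux_articles');
  $sth->execute;

  # Escape the characters that XML reserves.  Ampersand has to go first.
  sub xml_escape {
      my $text = shift;
      $text =~ s/&/&amp;/g;
      $text =~ s/</&lt;/g;
      $text =~ s/>/&gt;/g;
      return $text;
  }

  print qq{<?xml version="1.0"?>\n};
  print "<articles>\n";
  while (my $row = $sth->fetchrow_hashref) {
      print "  <article>\n";
      for my $field (qw(url title source date submit article_desc)) {
          # submit and article_desc can be NULL in my schema.
          my $value = xml_escape(defined $row->{$field} ? $row->{$field} : '');
          print "    <$field>$value</$field>\n";
      }
      print "  </article>\n";
  }
  print "</articles>\n";

  $dbh->disconnect;

Any site that can already pull its articles out of a database should be
able to adapt something like that in an afternoon.  Going the other
direction, XML into a site's own storage, is where real parsers come in.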

To validate the DTD and XML pages that I've been putting up, I've been
using XML for Java from IBM, and before that MSXML from Microsoft at
work.  (I don't have Windows at home. :)  Both of these are Java-based
parsers and seem pretty good.  Actually, the one from IBM looks very,
very nice.

There seems to be a lack of a validating XML parser for Perl at this
point, although Larry Wall seems very committed to making Perl the
language for XML.  He has some preliminary code available, but I
haven't looked at it yet.  The problem is that XML requires parsers to
use Unicode, and Perl doesn't grok Unicode, which means part of the
Perl interpreter has to be modified.

There are also a number of decent XML parsers written in C, although
very few of them do validation in addition to checking for
well-formedness.  Hopefully the XML parser that Daniel Veillard is
working on will do validation pretty soon.

In any case, we don't really need a full-fledged XML parser in the
short term just to pass this information around.  We just need
something that checks that a document has all of the elements we
need in it...
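
For that, even something as dumb as the following would do as a first
pass.  It's not a real parser by any stretch, just a quick Perl sketch
that checks whether the elements show up at all, using the same
placeholder tag names as above:

  #!/usr/bin/perl -w
  use strict;

  # Elements we can't live without; same placeholder names as above.
  my @required = qw(url title source date);

  # Slurp the whole document from stdin or a file argument.
  my $doc = do { local $/; <> };

  for my $tag (@required) {
      die "missing required element: <$tag>\n"
          unless $doc =~ m{<$tag>.*?</$tag>}s;
  }
  print "looks OK\n";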

> Another field that might be nice is 'keywords'. That seems like it would allow
> stronger search mechanisms...

Added.  That's a good idea I hadn't thought of.  :)
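
On my end that's just a one-line schema change, something like:

mysql> ALTER TABLE linux_articles ADD keywords varchar(128);

plus a matching keywords element in the updated DTD.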

--Chris

-- 

------------
Christopher Blizzard
http://odin.appliedtheory.com/
------------