
Re: [kidsgames] word familiarity



jeff@smluc.org wrote:

> -->> word: 128 characters (are there any words longer than this? should it be
> -->> shorter or longer?)
> -->
> -->The longest word (not place-name or proper noun) in English is
> -->Antidisestablishmentarianism - a mere 28 letters.  If you allow
> -->proper nouns - but exclude place names, you need to allow 45:
> -->Pneumonoultramicroscopicsilicovolcanoconiosis. There is a town in Wales
> -->called Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch,
> -->but even that is only 58 characters...so I think you're OK with 128.
> -->
> 
> I just gotta know.  HOW the H*** do you KNOW that?

Where I come from they actually educate people in schools!
(I admit I had to check the spelling of Pneumonoultramicroscopicsilicovolcanoconiosis,
and a quick Google web search turned up the URL:

 http://llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch.co.uk

...which is only the longest functioning domain name in the world.)
I've actually visited Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch.
It's quite a pretty little Welsh village.  But if you plan to drive there,
take a map - it's hard to ask for directions!
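
For what it's worth, here is a minimal Python sketch of the field-size
check. MAX_WORD_CHARS, fits() and utf8_size() are names I made up for
illustration, and the assumption that the store might hold UTF-8 bytes
rather than characters is mine, not something settled in this thread.

```python
# Sketch only: check candidate words against the proposed 128-character field.
MAX_WORD_CHARS = 128  # field size proposed earlier in the thread

CANDIDATES = [
    "Antidisestablishmentarianism",
    "Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch",
]


def fits(word, limit=MAX_WORD_CHARS):
    """Return True if the word fits in a field of `limit` characters."""
    return len(word) <= limit


def utf8_size(word):
    """Bytes needed if the store keeps UTF-8 (an assumption, see above)."""
    return len(word.encode("utf-8"))


for word in CANDIDATES:
    print(len(word), utf8_size(word), fits(word))
```

If the field is measured in bytes rather than characters, accented
words need a little more room than their letter count suggests, but
128 still leaves plenty of slack.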

> -->We were attempting a similar thing to what I think you propose
> -->- for each letter, a couple of lines of text, pointers to synonyms
> -->and antonyms, textual and audio pronunciation guide, pictures
> -->for words like OAF, OAK, OAKAPPLE, ...etc.
> -->
> 
> Do you still have ANY of that data, and would it be POSSIBLE to get
> permission to USE it in ours?  It would be incredible if we could slurp a
> large chunk directly into our store.

I don't work there any more - but I know someone who does.
Ask Karl Wood at Philips Research Labs <karl@shevek.f9.co.uk>,
tell him I sent you.  I'd be surprised if someone didn't keep
a copy of the CDROM after the project ended.  IIRC, they pressed
about 50 of them with the letter 'O'.

However, the file format wouldn't have been anything at all standard,
so you would probably need to dump the raw CD information to your
hard drive and then decode it the hard way.  The image, sound and
text formats would all have been very customised because back when
we did this, hardly anyone kept pictures and sounds on computers -
so there were no standard file formats.  Remember, this work was
done before the IBM PC existed!
 
> This is very good information and I for one appreciate you sharing this.

No problem.

> Do you have ANY suggestions for dividing the task in such a way as to avoid
> this type of burn-out?

Well, obviously getting a lot of people involved is the only answer...but
the more people involved, the harder it'll be to maintain a consistent
style across all that data...especially for things like audio samples,
where you'd like to use a single person to do all the words so that
you can play a sentence by replaying the audio for each word in turn.
That would be pretty comical if 26 contributors had each recorded
a different letter of the alphabet!!
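
To make the "replay each word in turn" idea concrete, here is a rough
Python sketch using the standard wave module. It assumes every word is
stored as an uncompressed WAV clip with the same channel count, sample
width and sample rate; join_clips() and make_silent_clip() are
hypothetical helpers, not part of any existing package.

```python
# Sketch only: build a "sentence" by concatenating per-word WAV clips.
import io
import wave


def make_silent_clip(n_frames, rate=8000):
    """Return a mono 16-bit WAV clip of silence (stand-in for a recorded word)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(rate)
        writer.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()


def join_clips(clips):
    """Concatenate WAV clips (as bytes) into one WAV; formats must match."""
    if not clips:
        raise ValueError("need at least one clip")
    fmt = None
    frames = []
    for clip in clips:
        with wave.open(io.BytesIO(clip), "rb") as reader:
            this_fmt = (reader.getnchannels(),
                        reader.getsampwidth(),
                        reader.getframerate())
            if fmt is None:
                fmt = this_fmt
            elif this_fmt != fmt:
                raise ValueError("clips were recorded in different formats")
            frames.append(reader.readframes(reader.getnframes()))
    out = io.BytesIO()
    with wave.open(out, "wb") as writer:
        writer.setnchannels(fmt[0])
        writer.setsampwidth(fmt[1])
        writer.setframerate(fmt[2])
        writer.writeframes(b"".join(frames))
    return out.getvalue()


sentence = join_clips([make_silent_clip(4000), make_silent_clip(2000)])
```

The format check only catches mismatched technical settings. It does
nothing about the real problem above, which is that clips from
different readers will still sound wrong played back to back.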

However, supposing we could get around those 'style' issues, my "Pocket
Oxford English Dictionary" has a thousand pages with perhaps 20 words
to a page.

If someone asked me to contribute one page, it would take me an evening
to do it (remember I'd have to paint a couple of pictures - or at least
search and download a couple of photos or images from the web and check
out the copyright issues for them).

That's probably something you could ask a LOT of people to do - one
evening isn't much time.

So, do you think you could find 1,000 contributors?  If you can only
find 100, then you are asking them to commit two weeks of their spare
time.  If you find just 10 people then we are talking six months each.
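
Here is that arithmetic as a small Python sketch. The page count and
the one-evening-per-page figure are just my rough estimates from above,
and evenings_per_contributor() is a made-up helper name.

```python
# Sketch only: how many evenings each contributor must give up.
import math

PAGES = 1000           # pages in the pocket dictionary (rough figure)
WORDS_PER_PAGE = 20    # rough figure, so about 20,000 words in total
EVENINGS_PER_PAGE = 1  # estimate: one evening of work per page


def evenings_per_contributor(contributors):
    """Evenings each person must commit if the pages are shared out evenly."""
    total_evenings = PAGES * EVENINGS_PER_PAGE
    return math.ceil(total_evenings / contributors)


for n in (1000, 100, 10):
    print(n, "contributors:", evenings_per_contributor(n), "evenings each")
```

With 10 people that works out to 100 evenings each, which is where the
six-month figure comes from once you allow for people not working on it
every single night.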

I think this is still an impossibly large task - even for the resources
of the Internet community.

>  The only way I can see to do that is
> to encourage people to enter data during the course of creating something
> specific.

Hmmm - perhaps.  I think you'd have to put a couple of hundred words
into it to make it worthwhile for anyone to fight to use your database
(it would be MUCH easier to do it yourself for fewer words than that).

Once there are more than a couple of hundred words, it's unlikely that
any new project would add more than a dozen new words (why would I
bother to add pictures, sounds and text for "Elm", "Cedar"
if the database already has "Oak", and "Pine"?)

It's unlikely there will ever be more than a dozen packages using
this data - and that only gets you a dozen contributors...if there
are only a dozen of them then we are at the six-months-each level
of effort.

I hate to be pessimistic about this - but I think a dose of realism
is needed in all the 'gung-ho' enthusiasm that this is creating!

>  Maybe especially when working with their children for specific
> vocabulary, the parent could enter the data for the day's lessons and that
> would be placed in the global repository (assuming they accept that
> choice) and then the next parent that needs to do a lesson with that word
> will not have to do anything.  Am I making any sense?  Basically I want to
> make that part of the project massively parallel to avoid the problems you
> speak of.

Well, I think you are back in the chicken-and-egg situation:

* When there are too few words for the package to be useful, hardly
  anyone will download it.

* When there are just enough words for it to be useful (but nowhere
  near enough words to make this the all-encompassing database you
  would like it to be) - people will have no incentive to add lots of
  new words - and it'll stop growing.

And there is still that issue of consistency - especially for audio
samples.  Once the (single) reader of the original set of words
ceases to be available to record new words, you are unable to add
new words without destroying the valuable ability to replay sets
of word samples.  Obviously, you could have a different reader for
each language - but that really just multiplies the problem. If
your French speaker dies or something - no more French words can
be done.

The (kind of) answer to that is to have one person read the *entire*
dictionary (just the root words at least) into a BIG sound file -
and have people who want to contribute to the database cut out the
word they want from that big file (or set of 26 big files).  I
wonder how long it would take to read the dictionary like that?
20,000 words - two seconds each?  Eleven hours.  I guess that's
do-able if you have enough disk space.  One person could do it in
a couple of weeks at one or two hours a night.
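
A quick Python sketch of that estimate, including a guess at the disk
space. The two-seconds-per-word figure is from above; the sample rates
and sample widths are purely my assumptions for uncompressed mono
audio, and recording_hours() and pcm_bytes() are made-up names.

```python
# Sketch only: recording time and raw audio size for reading the dictionary.
WORDS = 20_000          # root words in the pocket dictionary (rough figure)
SECONDS_PER_WORD = 2.0  # estimate from the paragraph above


def recording_hours(words, seconds_per_word=SECONDS_PER_WORD):
    """Total reading time in hours."""
    return words * seconds_per_word / 3600.0


def pcm_bytes(seconds, sample_rate, bytes_per_sample, channels=1):
    """Size of uncompressed PCM audio in bytes."""
    return int(seconds * sample_rate * bytes_per_sample * channels)


total_seconds = WORDS * SECONDS_PER_WORD
print(f"{recording_hours(WORDS):.1f} hours of reading")

# Two assumed recording formats: CD-quality mono, and a much smaller 8-bit one.
for rate, width in ((44100, 2), (22050, 1)):
    size_mb = pcm_bytes(total_seconds, rate, width) / 1_000_000
    print(f"{rate} Hz, {width} byte(s) per sample: about {size_mb:.0f} MB")
```

At one or two hours a night the reading itself fits into a couple of
weeks, as I said. The disk space depends entirely on which recording
format you pick.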

Oh - but wait.  20,000 words is only the start of it.  My dictionary
has "JUMP" - but not "JUMPED", "JUMPING" or "JUMPS".

<sigh>

> -->So, even downscaling the words to a child's vocabulary, this is a
> -->HUGE undertaking.
> 
> And the benefits, I hope, more than make up for it.  They say open source
> ("freed" software) allows the developers to "do it right".  So the
> question is -- "Is building this database the RIGHT way to do it?"  If so
> then I think we attempt it...  Future generations can fix it on the fly...
> It is theirs TOO.

Yes - this is an incredibly good idea - it would be a VERY useful
resource - and doing it RIGHT is good.  But we keep coming back to
the amount of effort required.
 

-- 
Steve Baker                  http://web2.airmail.net/sjbaker1
sjbaker1@airmail.net (home)  http://www.woodsoup.org/~sbaker
sjbaker@hti.com      (work)
