
Re: [kidsgames] word familiarity



Hello Steve,

On Thu, 17 Feb 2000, Steve Baker wrote:

-->Date: Thu, 17 Feb 2000 01:08:45 -0600
-->From: Steve Baker <sjbaker1@airmail.net>
-->Reply-To: kidsgames@smluc.org
-->To: kidsgames@smluc.org
-->Subject: Re: [kidsgames] word familiarity
-->
-->jeff@smluc.org wrote:
-->
-->> -->> word: 128 characters (are there any words longer than this? should it be
-->> -->> shorter or longer?)
-->> -->
-->> -->The longest word (not place-name or proper noun) in English is
-->> -->Antidisestablishmentarianism - a mere 28 letters.  If you allow
-->> -->proper nouns - but exclude place names, you need to allow 39:
-->> -->Pneumonoultramicroscopicvolcanoconiosis. There is a town in Wales
-->> -->called Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch,
-->> -->but even that is only 58 characters...so I think you're OK with 128.
-->> -->
-->> 
-->> I just gotta know.  HOW the H*** do you KNOW that?
-->
-->Where I come from they actually educate people in schools!

:)
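
The lengths quoted above are easy to check mechanically. Here is a minimal sketch, assuming the 128-character word field proposed earlier; the names MAX_WORD_LENGTH, CANDIDATES and fits are made up for illustration, and the words are copied exactly as spelled in this thread:

```python
# Field width proposed in the thread (an assumption, not a final schema).
MAX_WORD_LENGTH = 128

# Candidate "longest words", copied as they are spelled in this thread.
CANDIDATES = [
    "Antidisestablishmentarianism",
    "Pneumonoultramicroscopicvolcanoconiosis",
    "Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch",
]


def fits(word: str, limit: int = MAX_WORD_LENGTH) -> bool:
    """Return True if the word fits in a field of `limit` characters."""
    return len(word) <= limit


if __name__ == "__main__":
    for word in CANDIDATES:
        print(f"{len(word):3d}  fits={fits(word)}  {word}")
```

Running it prints each word's length next to whether it fits, so the field width can be revisited if someone finds a longer entry.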

-->(I admit I had to check the spelling of Pneumonoultramicroscopicvolcanoconiosis,
-->and a quick Google web search turned up the URL:
-->

:)

--> http://llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch.co.uk
-->
-->...which is only the longest functioning domain name in the world.)
-->I've actually visited Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch.
-->It's quite a pretty little Welsh village.  But if you plan to drive there,
-->take a map - it's hard to ask for directions!
-->

Maybe that 3D data modeling you talk about in another message could be
used to make a virtual visitors' guide?

-->> -->We were attempting a similar thing to what I think you propose
-->> -->- for each letter, a couple of lines of text, pointers to synonyms
-->> -->and antonyms, textual and audio pronunciation guide, pictures
-->> -->for words like OAF, OAK, OAKAPPLE, ...etc.
-->> -->
-->> 
-->> Do you still have ANY of that data, and would it be POSSIBLE to get
-->> permission to USE it in ours?  It would be incredible if we could slurp a
-->> large chunk directly into our store.
-->
-->I don't work there any more - but I know someone who does.

ok.

-->Ask Karl Wood at Philips Research Labs <karl@shevek.f9.co.uk>,
-->tell him I sent you.

Email sent requesting it.  Thank you.

-->  I'd be surprised if someone didn't keep
-->a copy of the CDROM after the project ended.  IIRC, they pressed
-->about 50 of them with the letter 'O'.
-->

"didn't" or "did"?  50 seems like a small number given the number of years
that have passed...?

-->However, the file format wouldn't have been anything at all standard,
-->so you would probably need to dump the raw CD information to your
-->hard drive and then decode it the hard way.

It may be instructive just to see the database layout and such.


-->  The image, sound and
-->text formats would all have been very customised because back when
-->we did this, hardly anyone kept pictures and sounds on computers -
-->so there were no standard file formats.  Remember, this work was
-->done before the IBM PC existed!
--> 

Most modern work was done BEFORE the IBM PC existed.  It's only slightly
older than I am, I think.  When did IBM bear the PC?  Hmmm.  '67 or is
that way too soon... I get the timelines confused....I mean Motorola's
68000 chip was named that because it came out in 1968 right?  Or is that
urban legend....

-->> This is very good information and I for one appreciate you sharing this.
-->
-->No problem.
-->
-->> Do you have ANY suggestions for dividing the task in such a way as to avoid
-->> this type of burn-out?
-->
-->Well, obviously getting a lot of people involved is the only answer...but
-->the more people involved, the harder it'll be to maintain a consistent
-->style across all that data...especially for things like audio samples,
-->where you'd like to use a single person to do all the words so that
-->you can play a sentence by replaying the audio for each word in turn.
-->That would be pretty comical if 26 contributors had each recorded
-->a different letter of the alphabet!!
-->

Yes, it might be funny, but I think it would still be usable.  There is a
case to be made for having some kind of field in the database indicating
"reader" so that a query for audio words can be "sync'ed" to that reader
as you suggest.
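
A rough sketch of what I mean by a "reader" field, using SQLite only because it is handy. The table name audio_sample, its columns, the file paths and the reader names are all invented for illustration; nothing here is a settled schema:

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with one audio sample per (word, reader)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE audio_sample (
            word   TEXT NOT NULL CHECK (length(word) <= 128),
            reader TEXT NOT NULL,
            path   TEXT NOT NULL,
            PRIMARY KEY (word, reader)
        )
        """
    )
    conn.executemany(
        "INSERT INTO audio_sample VALUES (?, ?, ?)",
        [
            ("oak", "alice", "alice/oak.wav"),
            ("oak", "bob", "bob/oak.wav"),
            ("elm", "alice", "alice/elm.wav"),
        ],
    )
    return conn


def samples_for(conn, words, reader):
    """Return {word: path} for the words this one reader has recorded."""
    placeholders = ",".join("?" for _ in words)
    rows = conn.execute(
        "SELECT word, path FROM audio_sample "
        f"WHERE reader = ? AND word IN ({placeholders})",
        [reader, *words],
    )
    return dict(rows)
```

An application that wants a consistent voice asks for one reader and sees which words are missing; one that does not care can simply drop the reader condition.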

-->However, supposing we could get around those 'style' issues, my "Pocket
-->Oxford English Dictionary" has a thousand pages with perhaps 20 words
-->to a page.
-->
-->If someone asked me to contribute one page,

Will you contribute one page, please?

:)

--> it would take me an evening
-->to do it (remember I'd have to paint a couple of pictures - or at least
-->search and download a couple of photos or images from the web and check
-->out the copyright issues for them).
-->

Perhaps the CIA World Factbook and Project Gutenberg can help...

-->That's probably something you could ask a LOT of people to do - one
-->evening isn't much time.
-->

I'm sure we WILL ask many people to do just that once we figure out how to
use a webpage to do it.

-->So, do you think you could find 1,000 contributors?

um, yes I do.

-->  If you can only
-->find 100,

Then it will take longer, but it will eventually work....

--> then you are asking them to commit two weeks of their spare
-->time.  If you find just 10 people then we are talking six months each.
-->

And 10 VERY worn-out people.  This is not the way to do it.
Parallelization across thousands of people is the key.
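
The arithmetic behind those estimates can be written down directly. This is only a sketch using the figures from this thread (1,000 pages, one evening per page); the function name evenings_each is made up, and it assumes pages are shared out evenly:

```python
import math

PAGES = 1000            # pages in the pocket dictionary, per the thread
EVENINGS_PER_PAGE = 1   # one evening of work per page, per the thread


def evenings_each(contributors: int, pages: int = PAGES) -> int:
    """Evenings each contributor must give if pages are shared evenly."""
    return math.ceil(pages * EVENINGS_PER_PAGE / contributors)


if __name__ == "__main__":
    for n in (1000, 100, 10):
        print(n, "contributors ->", evenings_each(n), "evening(s) each")
```

With 10 people that is 100 evenings each, which lines up with the "six months" figure if you assume roughly four working evenings a week.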

-->I think this is still an impossibly large task - even for the resources
-->of the Internet community.
-->

Perhaps so, but society has made it this far by implanting the data
into verbal memory and written libraries; I think this will work too.  I
may be proven wrong, it wouldn't be the first time.

-->>  The only way I can see to do that is
-->> to encourage people to enter data during the course of creating something
-->> specific.
-->
-->Hmmm - perhaps.  I think you'd have to put a couple of hundred words
-->into it to make it worthwhile for anyone to fight to use your database

I hope that using the database will not be a fight; the hope is that it
will be a joy to use.

-->(it would be MUCH easier to do it yourself for fewer words than that).
-->

Perhaps.  How would we entice those who are used to using their own
database to use ours?  What would entice YOU to use this database for one
of your apps?

-->Once there are more than a couple of hundred words, it's unlikely that
-->any new project would add more than a dozen new words (why would I
-->bother to add pictures, sounds and text for "Elm", "Cedar"
-->if the database already has "Oak", and "Pine"?)
-->

Because your project needs Elm and Cedar, and because high school students
could be encouraged to do it by their teachers.  I seem to recall doing
field studies to identify a great deal of flora and fauna; it seems that
one good student could further this database an incredible amount.

-->It's unlikely there will ever be more than a dozen packages using
-->this data - and that only gets you a dozen contributors...if there
-->are only a dozen of them then we are at the six-months-each level
-->of effort.
-->

I certainly hope that you are COMPLETELY wrong about that.

-->I hate to be pessimistic about this - but I think a dose of realism
-->is needed in all the 'gung-ho' enthusiasm that this is creating!
-->

What!?  You don't like 'gung-ho'? :)  Perhaps you could tell the captain
that you're giving her all she's got and the dilithium crystals aren't gonna
hold.....

-->>  Maybe especially when working with their children for specific
-->> vocabulary, the parent could enter the data for the day's lessons and that
-->> would be placed in the global repository (assuming they accept that
-->> choice) and then the next parent that needs to do a lesson with that word
-->> will not have to do anything.  Am I making any sense?  Basically I want to
-->> make that part of the project massively parallel to avoid the problems you
-->> speak of.
-->
-->Well, again, I think that you are in the chicken and egg deal again:
-->

Well scramble that freakin' egg already...;)

-->* When there are too few words for the package to be useful, hardly
-->  anyone will download it.
-->

I'm hoping that the database will be an online resource, used directly
from the web by most applications (probably through CORBA).  For those
cases where it has to be a local app, you are probably correct.

-->* When there are just enough words for it to be useful (but nowhere
-->  near enough words to make this the all-encompassing database you
-->  would like it to be) - people will have no incentive to add lots of
-->  new words - and it'll stop growing.
-->
-->And there is still that issue of consistency - especially for audio
-->samples.  Once the (single) reader of the original set of words
-->ceases to be available to record new words, you are unable to add
-->new words without destroying the valuable ability to replay sets
-->of word samples.

This is "solved" by placing the "reader" data in with the sample so that
it can be sync'ed or not depending on the needs of the individual
application.  Granted it's not going to be perfect.  I dare say we don't
yet even know what perfect is.  Does this mean we should NOT gather what
we can now?
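
To make "sync'ed or not" concrete, here is a small sketch of how an application might choose samples for a sentence. The function names, the sync flag and the sample data are all hypothetical; it assumes we know which words each reader has recorded:

```python
def readers_covering(samples, sentence):
    """samples maps reader -> set of recorded words.

    Return the readers (sorted by name) who have recorded every word.
    """
    needed = {w.lower() for w in sentence.split()}
    return sorted(r for r, words in samples.items() if needed <= words)


def pick_samples(samples, sentence, sync=True):
    """Return [(word, reader)] for the sentence.

    With sync=True a single reader must cover the whole sentence;
    with sync=False we fall back to mixing readers word by word.
    """
    words = [w.lower() for w in sentence.split()]
    full = readers_covering(samples, sentence)
    if full:
        return [(w, full[0]) for w in words]
    if sync:
        raise LookupError("no single reader has recorded every word")
    picked = []
    for w in words:
        available = sorted(r for r, ws in samples.items() if w in ws)
        if not available:
            raise LookupError(f"nobody has recorded {w!r}")
        picked.append((w, available[0]))
    return picked
```

A reading tutor would probably call this with sync=True and report the gaps, while a flash-card game could pass sync=False and accept a mix of voices.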

--> Obviously, you could have a different reader for 
-->each language - but that really just multiplies the problem. If
-->your French speaker dies or something - no more French words can
-->be done.
-->

I don't agree that "no more" can be done.

-->The (kindof) answer to that is to have one person read the *entire*
-->dictionary (just the root words at least) into a BIG sound file -

I read the "First 1000 words" into a sound file, but I've done nothing
with it yet.  It seems to have stopped recording in the middle somewhere,
and I don't know what the copyright is on it.  I mean, the words are all
common words, right?  So.... hmmmm.

-->and have people who want to contribute to the database cut out the
-->word they want from that big file (or set of 26 big files).  I
-->wonder how long it would take to read the dictionary like that?
-->20,000 words - two seconds each?  Eleven hours.  I guess that's
-->do-able if you have enough disk space.  One person could do it in
-->a couple of weeks at one or two hours a night.
-->
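
The reading-time estimate above can be checked with a few lines. The word count and pace come from this thread; the disk-space part assumes uncompressed 16-bit mono audio at 44.1 kHz, which is purely my assumption, and the function names are made up:

```python
import math

WORDS = 20_000          # root words, per the thread
SECONDS_PER_WORD = 2    # reading pace, per the thread


def reading_hours(words=WORDS, seconds_per_word=SECONDS_PER_WORD) -> float:
    """Total reading time in hours."""
    return words * seconds_per_word / 3600


def nights_needed(hours: float, hours_per_night: float) -> int:
    """Whole nights needed at a given number of hours per night."""
    return math.ceil(hours / hours_per_night)


def pcm_bytes(seconds, sample_rate=44_100, bytes_per_sample=2, channels=1):
    """Uncompressed PCM size in bytes; audio parameters are assumptions."""
    return seconds * sample_rate * bytes_per_sample * channels


if __name__ == "__main__":
    hours = reading_hours()
    print(f"{hours:.1f} hours of reading")
    print(nights_needed(hours, 2), "to", nights_needed(hours, 1), "nights")
    print(pcm_bytes(WORDS * SECONDS_PER_WORD) / 1e9, "GB uncompressed")
```

A lower sample rate or compression would shrink the disk figure a great deal, so treat that number as an upper-end guess.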

Yep, it's bigger than one person.... although it seems what the speech
synth. people have been doing is recording public speech for a bunch of
time and then analyzing it to produce a voice based on phonemes, which
only depends on the phoneme set being "sync'ed".

-->Oh - but wait.  20,000 words is only the start of it.  My dictionary
-->has "JUMP" - but not "JUMPED", "JUMPING" or "JUMPS".
-->

So what's wrong with your dictionary? :)

--><sigh>
-->
-->> -->So, even downscaling the words to a child's vocabulary, this is a
-->> -->HUGE undertaking.
-->> 
-->> And the benefits, I hope, more than make up for it.  They say open source
-->> ("freed" software) allows the developers to "do it right".  So the
-->> question is -- "Is building this database the RIGHT way to do it?"  If so
-->> then I think we attempt it...  Future generations can fix it on the fly...
-->> It is theirs TOO.
-->
-->Yes - this is an incredibly good idea - it would be a VERY useful
-->resource - and doing it RIGHT is good.  But we keep coming back to
-->the amount of effort required.
--> 
-->

The journey of 1000 miles is started with but a step.

-- 
Jeff Waddell
jeff@smluc.org

Kids Games Project Coordinator
main website at http://smluc.org/SIA/kidsgames/



-
kidgames@smluc.org  -- To get off this list send "unsubscribe" in the
body of a message to majordomo@smluc.org