
[kidsgames] Re: distributed data collection server



Hi Paul,

On Wed, 23 Feb 2000, Paul Kienzle wrote:

-->Date: Wed, 23 Feb 2000 13:23:20 +0000 (GMT)
-->From: Paul Kienzle <pkienzle@kienzle.powernet.co.uk>
-->Reply-To: kidsgames@smluc.org
-->To: kidsgames@smluc.org
-->
-->
-->I've taken a first crack at a distributed data collection server.

Great!!!

-->It's all pseudocode as of yet.

It's a start.

-->  Anyone care to take it further while I
-->bury myself in my "real" project for a while?

what's so "unreal" about this one?

-->  I need to select a
-->programming language and a database.

PHP and PostgreSQL

-->  It may be wise to find a willing
-->host site first, to see what tools they will accept/provide.  Would
-->this project be easier in Zope?

zope REQUIRES frames...and I Don't LIKE frames....

-->  I hear python is easy enough to
-->learn.
-->

I keep hearing that, yet when I look at the stuff, it seems about as
cryptic as perl...

-->My goal is to generate a free French dictionary to complement our free
-->English one.

Awesome.

-->  I grabbed half the Project Gutenberg French collection,
-->and pretty quickly generated an 18600 word database, all with sentence
-->contexts.

WOW!

-->  Granted, conjugations, Gutenberg license terms, archaism,
-->proper names and other junk will cut that down as low as 12000 head
-->words, but that's still an awfully good start on a dictionary.  Surely
-->more modern texts are available from the French or Canadian
-->governments to push this up to the 50000 range of a good college
-->dictionary. [The 200,000 they list on the back includes all those
-->variations like plural, past tense, different part of speech, etc.]
-->

Any French speakers out there with some URLs for more text?

-->-------------------------------
-->Paul Kienzle
-->pkienzle@kienzle.powernet.co.uk
-->-------------------------------
-->
-->Distributed data collection using a mail service
-->
-->Assumptions
-->-----------
-->
-->(1) Data elements are independent.  This means that your data
-->    collectors do not have to interact with each other.  You can get
-->    around this by generating dependent data after independent data,
-->    or by generating interdependent data elements together.
-->
-->(2) Expertise is widely held.  This means that you don't have
-->    to match data elements to individuals.  Some users can subscribe
-->    as experts, which means that they are willing to spend more time
-->    figuring out how to enter or verify a data element, or will
-->    delegate to someone who can.  If they fail, then the data element
-->    is put on an open challenge list for all takers.  If expertise is
-->    not widely held, then users and data will have to be tagged and
-->    matched for specialities.  E.g., legal, medical, scientific.
-->
-->(3) Users are mostly trustworthy, but error-prone.  That is,
-->    most won't enter bogus data, but they will sometimes make
-->    mistakes. You do need a verification process.  Depending on how
-->    reliable you want the data, you can use random sampling to ensure
-->    that things are mostly correct, or you can verify every
-->    modification.  Keep the identities of the user who initially
-->    entered the data, he who last modified the data, and he who last
-->    verified the data, so that if errors are found, other entries
-->    touched by the same individual can be verified.  From this, a
-->    measure of reliability can be determined for each user.  This also
-->    provides a degree of accountability, which should make the entire
-->    process less error-prone.  Making these identities available in the
-->    released database improves accountability even more, though an
-->    opt-out should be available.  Check out the entries of those who
-->    opt out.  Special mention should be made of those with the most
-->    new entries, the most fixes and the most checks.
-->
-->(4) There is an editor who "owns" the project.  He has the right to add 
-->    and remove users, and to process entries on a per-user basis to
-->    check their reliability.  Anomalous e-mails (and spam) will be
-->    sent to his accounts.  If users detect bogus entries, they have
-->    a way of flagging them to be sent to an editor for further checks.
-->    If traffic gets too high, cycle it across multiple editors.
-->
-->(5) Users control their own level of activity.  In the simplest
-->    form, simply not responding for a while will delay the next
-->    request.  If they delay too long, however, the system will not
-->    know if the request was lost.  So after a period of delay, the
-->    system will resend the request, and after another period of delay,
-->    the system will flag the user as inactive and tell them to
-->    resubscribe to resume activity, and it will send the request on to
-->    another user.  Invalid e-mail addresses will be treated similarly.
-->    A more complicated system would allow the user to set the number
-->    of elements to process at once (useful for those who have
-->    occasional large blocks of time, and intermittent connections) and
-->    the delay between reminders.
-->
-->

This seems very similar to the way gnu-translators works.  That effort
intends to translate the entire gnu.org website (one page at a time)
using many translators instead of relying on one.

-->
-->Management database
-->-------------------
-->
-->(1) List of users who are doing the collecting.  This should include
-->    the following fields:
-->
-->    user-id
-->	table key
-->    user-status
-->	ACTIVE: currently receiving commands 
-->	EXPERT: receiving HARD commands (see below) if any
-->	INACTIVE: not responding to commands 
-->	REMOVED: asked to be removed from the list
-->	KILLED: forcibly removed from the list
-->    name
-->	optional
-->    email
-->	may change during the collection process, so can't be used as
-->	the key
-->    command
-->	data element they are entering/verifying
-->    command date
-->	date the command was sent. used to compute number of days
-->	since command was sent, and to either resend the command or
-->	mark the user as inactive.
-->    entered 
-->	number of elements entered/modified/verified
-->    checked
-->	number of elements checked by others
-->    reliability  
-->	number of elements accepted by others, weighted by the
-->	reliability of those doing the accepting, minus the number of
-->	elements modified by others, weighted by the reliability of
-->	those doing the modifying and by the degree of modification.
-->	Lower limit of zero, upper limit of 1.  Divide by the number
-->	checked by others for the normalized score used in the
-->	calculation. Editors are by definition 100% reliable.  If no
-->	elements are checked, a default normalized reliability is
-->	used, as determined by the editors by random sampling across
-->	all users in the population.
-->    probation
-->	flag. If set, all entries modified by the user are sent to the
-->	editor

	authentication-type: type of authentication verifying the identity
	of this user.
	authentication-key: pgp/gpg key for authentication purposes.
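
For whoever picks this up, here is a rough Python sketch of how the user
record and the reliability bookkeeping described above might look.  The
field names follow the list above.  DEFAULT_RELIABILITY, the function
names and the value 0.5 are made up for illustration, and the "degree of
modification" weighting and the editors-are-100%-reliable rule are left
out to keep it short.

```python
from dataclasses import dataclass

DEFAULT_RELIABILITY = 0.5  # assumed placeholder; editors would set this by sampling


@dataclass
class User:
    user_id: int                 # table key
    email: str                   # may change, so never used as the key
    status: str = "ACTIVE"       # ACTIVE, EXPERT, INACTIVE, REMOVED or KILLED
    name: str = ""               # optional
    entered: int = 0             # elements entered/modified/verified
    checked: int = 0             # elements checked by others
    reliability: float = 0.0     # running weighted sum, not yet normalized
    probation: bool = False      # if set, entries go to the editor


def normalized_reliability(user: User, default: float = DEFAULT_RELIABILITY) -> float:
    """Return the per-check score, clamped to the range 0..1."""
    if user.checked == 0:
        return default           # nobody has checked this user's work yet
    return max(0.0, min(1.0, user.reliability / user.checked))


def record_check(checker: User, author: User, identical: bool) -> None:
    """Credit or debit `author` after `checker` reviewed one of their entries."""
    weight = normalized_reliability(checker)
    author.reliability += weight if identical else -weight
    author.checked += 1
```

Storing the running sum and dividing only when the score is needed keeps
each update to one addition, which is how the data-accepted step further
down seems to use it.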

-->
-->(2) List of data elements to collect/already collected. This should
-->    include the following fields:
-->
-->    data-id
-->	there is a 1-1 mapping between data elements and elements
-->	collected, obviously.  For distribution, however, the
-->	management fields may be removed from the database.  Also,
-->	these fields are independent of the actual data being
-->	collected.
-->    priority
-->	frequency of request; higher is sooner
-->    data-status
-->	READY: data element needs to be entered or verified
-->	HARD: data element needs to be entered or verified by
-->		an expert since the last user could not enter
-->		or verify it as requested
-->	UNKNOWN: data element is put on the challenge page for
-->		anyone to claim since an expert could not enter
-->		or verify it as requested
-->	ACTIVE: someone is entering/verifying the data element
-->	REJECTED: data element is not required in database
-->	DONE: data element has been verified
-->    request-id
-->	user who requested the data element
-->    enter-id
-->	user who entered the data element.  empty if the data has not
-->	yet been entered
-->    modify-id
-->	user who last modified the data element. empty if the data has
-->	never been modified (either because it was entered correctly,
-->	or because it has never been verified)
-->    verify-id
-->	user who "signed-off" on the data element. empty if the data
-->	has not been verified since it was defined or modified.  For
-->	extra reliability, more than one user should sign off on the
-->	data, and this will be an array of verifiers.
-->    supporting data
-->	whatever can be provided to make data entry/verification
-->	easier.  In the case of a dictionary, this would include
-->	sentences which contain the target word and definitions from
-->	other (possibly outdated) sources.
-->
-->
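
The data element record could be sketched the same way.  This is only my
reading of the field list above: the names DataStatus, DataElement and
give_up are invented here, and the escalation rule (an ordinary user
giving up makes the element HARD, an expert giving up makes it UNKNOWN)
is taken from the HARD and UNKNOWN descriptions rather than from any
working code.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class DataStatus(Enum):
    READY = "ready"        # needs to be entered or verified
    HARD = "hard"          # needs an expert
    UNKNOWN = "unknown"    # on the challenge page for anyone to claim
    ACTIVE = "active"      # someone is working on it
    REJECTED = "rejected"  # not required in the database
    DONE = "done"          # verified


@dataclass
class DataElement:
    data_id: str
    priority: int = 0                      # higher is sooner
    status: DataStatus = DataStatus.READY
    request_id: Optional[int] = None       # user who requested the element
    enter_id: Optional[int] = None         # user who entered it, if anyone
    modify_id: Optional[int] = None        # user who last modified it, if anyone
    verify_ids: List[int] = field(default_factory=list)      # users who signed off
    supporting_data: List[str] = field(default_factory=list)  # e.g. context sentences


def give_up(element: DataElement, by_expert: bool) -> DataStatus:
    """Escalate an element that the current user could not handle."""
    element.status = DataStatus.UNKNOWN if by_expert else DataStatus.HARD
    return element.status
```

verify_ids is a list so that more than one user can sign off, as
suggested under verify-id.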
-->Server programs
-->---------------
-->
-->(1) process-message ## called for each message received by the list server
-->
-->	determine user-id ## look how majordomo does it; may need password
-->	if no user-id, and not subscribe command, and not data-challenge
-->	   forward message to an editor: invalid user
-->	if user-status == KILLED
-->	   forward message to an editor: message from killed user
-->	   return
-->	if user-status == REALLY-KILLED
-->	   send user the go-away message
-->	   return
-->
-->	determine nature of request by looking for initial keyword
-->
-->	## list commands
-->	if subscribe,
-->	   if no user-id,
-->	      add user to the table
-->	      send user the welcome message
-->	      set user-status to ACTIVE or EXPERT
-->	      send user the next command
-->	   else
-->	      send user the go-away message
-->	else if unsubscribe
-->	   set user-status to REMOVED
-->	   if command is not empty
-->		set data-status of command to READY
-->		set command to empty
-->	else if address-update
-->	   set email to new address
-->	   if user-status == INACTIVE or REMOVED
-->	      set user-status to ACTIVE
-->	   if command is not empty
-->	      send user the command
-->	   else
-->	      send user the next command
-->	   ## i.e., send user the last command if they have one, or the next command
-->
-->        ## data updates
-->	else if data-rejected
-->	   increment user.activity
-->	   if enter-id is empty
-->	      ## garbage requested
-->	      set data-status to REJECTED
-->	   else
-->	      ## garbage data entered
-->	      forward message to editor: rejected existing entry
-->	      if it is really bad, he should check all other entries of the
-->	      same user, resetting those that are bogus.  This user should
-->	      be warned, and have the probation flag set, or have his
-->	      status set to killed
-->	   send the user the next command
-->	else if data-unknown
-->	   if data-status is HARD
-->		set data-status to UNKNOWN
-->		add element to challenge page
-->	   if data-status is ACTIVE
-->		set data-status to UNKNOWN
-->	   send the user the next command
-->	else if data-accepted
-->	   if data-status is not ACTIVE
-->		forward message to editor: old entry reactivated
-->		return
-->	   increment user.activity
-->	   if enter-id is empty
-->	      add element to the database
-->	      set enter-id to user
-->	   else
-->	      if modify-id is empty
-->	          lookup modifier in enter-id
-->	      else
-->		  lookup modifier in modify-id
-->	      compare element to the database
-->	      if user.checked > 0
-->	          reliability = user.reliability/user.checked
-->		  if reliability < 0, reliability=0
-->	      else reliability = default
-->	      if identical
-->		  set verify-id to user
-->	          add reliability to modifier.reliability
-->	      else
-->		  update element in the database
-->	          set modify-id to user
-->		  subtract reliability from modifier.reliability
-->	      increment modifier.checked
-->	   send the user the next command
-->	else if data-new  ## add new words to define
-->	   for each entry/support pair
-->	      if entry exists, increment priority and update support
-->	      else make new data entries with request-id set to user
-->	else if data-challenge ## definition sent from challenge page
-->	   if no user-id, create new user with user-status INACTIVE
-->	   do everything after the data-status check in data-accepted
-->
-->	## editor commands
-->	## for each command, need to confirm that it is an editor
-->	## sending the command.  Commands include set user-status,
-->	## set data-status and who knows what else.  Many of them
-->	## are better done direct from the account, though CGI is
-->	## a better bet.  Ick.  Learn zope then?
-->
-->	## unknown
-->	else
-->	   forward message to editor: can't parse
-->
-->	subroutine next-command ## send the user the next command
-->	    if user-status == EXPERT
-->		get first data with status HARD in reverse priority order
-->	    else
-->		get first data with status READY in reverse priority order
-->	    set data-status to ACTIVE
-->	    set command to data-id
-->	    set command-date to today
-->	    send data and supporting data to user
-->
-->
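
The next-command subroutine is small enough to try out directly.  A
minimal sketch, assuming plain dicts stand in for the user and data
tables (the key names are my own) and reading "reverse priority order"
as highest priority first.  Actually mailing the element is left out.

```python
from datetime import date


def next_command(user, elements, today=None):
    """Pick the next element for `user`, mark it ACTIVE and record the command.

    Returns the chosen element, or None if nothing is waiting.
    """
    wanted = "HARD" if user["status"] == "EXPERT" else "READY"
    candidates = [e for e in elements if e["status"] == wanted]
    if not candidates:
        return None
    element = max(candidates, key=lambda e: e["priority"])  # higher is sooner
    element["status"] = "ACTIVE"
    user["command"] = element["data_id"]
    user["command_date"] = today or date.today()
    return element
```

The pseudocode does not say what happens when nothing is waiting, so
returning None here is a guess.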
-->(2) daily batch
-->	for each command older than 2*max days
-->	    flag user as inactive
-->	    set data-status as READY
-->	for each command older than max days
-->	    resend command
-->	send editor update message
-->	    #elements, #ACTIVE, #READY, #HARD, #CHALLENGE, #REJECTED, #DONE
-->	send editor random sample of elements which are waiting to be
-->	    verified, weighted to those with the lowest activity to
-->	    checked ratio, and those with the lowest reliability.
-->
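
The timeout part of the daily batch might look something like this.
Again dicts stand in for the tables, MAX_DAYS = 7 is an arbitrary value
I picked, and clearing the user's command when they go inactive is my
assumption (the pseudocode does not say).  The editor report and the
random sample are not covered.

```python
from datetime import date, timedelta

MAX_DAYS = 7  # assumed value for "max days"


def daily_batch(users, elements_by_id, today, max_days=MAX_DAYS):
    """Expire stale commands and return (email, data_id) pairs to resend."""
    resend = []
    for user in users:
        data_id = user.get("command")
        if data_id is None:
            continue                      # nothing outstanding for this user
        age = (today - user["command_date"]).days
        if age > 2 * max_days:
            user["status"] = "INACTIVE"
            user["command"] = None        # assumption: drop the stale command
            elements_by_id[data_id]["status"] = "READY"
        elif age > max_days:
            resend.append((user["email"], data_id))
    return resend
```

Using elif keeps a command that has just expired from also being resent
in the same run, which the two separate loops in the pseudocode would do.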
-->(3) display-cgi # public viewing of database
-->	verify that key is valid
-->	if key is in database, generate display form, including search field
-->	if key is not in database, generate search form
-->
-->(4) entry-cgi # public update of database
-->	verify that key is valid
-->	if key is in database, generate edit form filled with data
-->	if key is not in database, generate empty edit form
-->	clicking the send button sends the user-filled form to the
-->	    database server, along with e-mail address of the user
-->
-->(5) challenge-cgi # public selection of challenge word
-->	generate list of words with status CHALLENGE sorted in reverse
-->	order of priority.  Each word linked to entry form with that word
-->	as the key
-->
-->(6) verify-cgi # editor quality check of entries on a per-user basis
-->	verify that access is by an editor
-->	if user is not in database,  generate form with user key only
-->	else generate list of words entered/modified by that user
-->
-->
-->Specific to dictionary building
-->-------------------------------
-->
-->(1) supporting data tables
-->    document table:
-->	KEY:document-id
-->	source   where the document came from
-->	date     when the document was written
-->	author
-->	title
-->    sentence table:
-->	KEY:sentence-id 
-->	text     text of the sentence
-->	document-id source
-->    concordance:
-->	KEY:word
-->	KEY:sentence-id
-->
-->(2) dictionary entry tables
-->    ## don't know what to put in the dictionary yet
-->
-->(3) process-message program:
-->	...
-->	else if data-document
-->	   add new document entry
-->	   for each sentence
-->	      add new sentence
-->	      for each word in sentence
-->	         unmorph word [maybe??]
-->	         if new, add new word
-->		 increment frequency
-->	         if frequency is small and word is not junk, 
-->		    add to concordance
-->        ...
-->
-->	subroutine next-command
-->	   ...
-->	   retrieve all entries for the word in the concordance table, with text substituted
-->	      for sentence-id, and author-title substituted for document-id
-->	   append to message before sending
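
The data-document step can be tried on a small text without any database.
A minimal sketch under several assumptions: sentences are split naively on
., ! and ?, words are lowercased runs of letters, there is no unmorphing
or junk filter, and "frequency is small" is read as "keep at most
MAX_CONTEXTS sentences per word".  The names add_document and
MAX_CONTEXTS are invented for this example.

```python
import re
from collections import Counter, defaultdict

MAX_CONTEXTS = 3  # assumed cutoff for "frequency is small"


def add_document(text, frequency, concordance, sentences, max_contexts=MAX_CONTEXTS):
    """Add one document's sentences and words to the in-memory tables."""
    for raw in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not raw:
            continue
        sentence_id = len(sentences)      # position in the list serves as the key
        sentences.append(raw)
        for word in re.findall(r"[^\W\d_]+", raw.lower()):
            frequency[word] += 1
            # keep only the first few contexts, and each sentence only once
            if frequency[word] <= max_contexts and sentence_id not in concordance[word]:
                concordance[word].append(sentence_id)


frequency = Counter()
concordance = defaultdict(list)
sentences = []
add_document("Le chat dort. Le chien dort aussi.", frequency, concordance, sentences)
```

A real version would also record the document-id for each sentence and
would need a better tokenizer for elisions such as "l'homme", which this
one splits in two.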

Anybody know if majordomo or mailman are scriptable?

Nice start Paul!

-- 
Jeff Waddell
jeff@smluc.org

Kids Games Project Coordinator
main website at http://smluc.org/SIA/kidsgames/


-
kidsgames@smluc.org  -- To get off this list send "unsubscribe kidsgames"
in the body of a message to majordomo@smluc.org