
No Subject




I've taken a first crack at a distributed data collection server.
It's all pseudocode as of yet.  Anyone care to take it further while I
bury myself in my "real" project for a while?  I need to select a
programming language and a database.  It may be wise to find a willing
host site first, to see what tools they will accept/provide.  Would
this project be easier in Zope?  I hear Python is easy enough to
learn.

My goal is to generate a free French dictionary to complement our free
English one.  I grabbed half the Project Gutenberg French collection,
and pretty quickly generated an 18,600-word database, all with sentence
contexts.  Granted, conjugations, Gutenberg license terms, archaisms,
proper names and other junk will cut that down to as low as 12,000 head
words, but that's still an awfully good start on a dictionary.  Surely
more modern texts are available from the French or Canadian
governments to push this up to the 50,000 range of a good college
dictionary. [The 200,000 they list on the back includes all those
variations like plural, past tense, different part of speech, etc.]

-------------------------------
Paul Kienzle
pkienzle@kienzle.powernet.co.uk
-------------------------------

Distributed data collection using a mail service

Assumptions
-----------

(1) Data elements are independent.  This means that your data
    collectors do not have to interact with each other.  You can get
    around this by generating dependent data after independent data,
    or by generating interdependent data elements together.

(2) Expertise is widely held.  This means that you don't have
    to match data elements to individuals.  Some users can subscribe
    as experts, which means that they are willing to spend more time
    figuring out how to enter or verify a data element, or will
    delegate to someone who can.  If they fail, then the data element
    is put on an open challenge list for all takers.  If expertise is
    not widely held, then users and data will have to be tagged and
    matched for specialities.  E.g., legal, medical, scientific.

(3) Users are mostly trustworthy, but error-prone.  That is,
    most won't enter bogus data, but they will sometimes make
    mistakes.  You do need a verification process.  Depending on how
    reliable you want the data, you can use random sampling to ensure
    that things are mostly correct, or you can verify every
    modification.  Keep the identities of the user who initially
    entered the data, the user who last modified it, and the user who
    last verified it, so that if errors are found, other entries
    touched by the same individual can be verified.  From this, a
    measure of reliability can be determined for each user.  This also
    provides a degree of accountability, which should make the entire
    process less error-prone.  Making these identities available in the
    released database improves accountability even more, though an
    opt-out should be available.  Entries from users who opt out
    should get extra checking.  Special mention should be made of those
    with the most new entries, the most fixes and the most checks.

(4) There is an editor who "owns" the project.  He has the right to add 
    and remove users, and to process entries on a per-user basis to
    check their reliability.  Anomalous e-mails (and spam) will be
    sent to his accounts.  If users detect bogus entries, they have
    a way of flagging them to be sent to an editor for further checks.
    If traffic gets too high, cycle it across multiple editors.

(5) Users control their own level of activity.  In the simplest
    form, simply not responding for a while will delay the next
    request.  If they delay too long, however, the system will not
    know if the request was lost.  So after a period of delay, the
    system will resend the request, and after another period of delay,
    the system will flag the user as inactive and tell them to
    resubscribe to resume activity, and it will send the request on to
    another user.  Invalid e-mail addresses will be treated similarly.
    A more complicated system would allow the user to set the number
    of elements to process at once (useful for those who have
    occasional large blocks of time, and intermittent connections) and
    the delay between reminders.



Management database
-------------------

(1) List of users who are doing the collecting.  This should include
    the following fields:

    user-id
	table key
    user-status
	ACTIVE: currently receiving commands 
	EXPERT: receiving HARD commands (see below) if any
	INACTIVE: not responding to commands 
	REMOVED: asked to be removed from the list
	KILLED: forcibly removed from the list
    name
	optional
    email
	may change during the collection process, so can't be used as
	the key
    command
	data element they are entering/verifying
    command date
	date the command was sent. used to compute number of days
	since command was sent, and to either resend the command or
	mark the user as inactive.
    entered
	number of elements entered/modified/verified (this is the
	user.activity counter in the server pseudocode)
    checked
	number of elements checked by others
    reliability
	running sum: the number of elements accepted by others,
	weighted by the reliability of those doing the accepting,
	minus the number of elements modified by others, weighted by
	the reliability of those doing the modifying and by the degree
	of modification.  Divide by the number checked by others to
	get the normalized score used in the calculation, which has a
	lower limit of zero and an upper limit of 1.  Editors are by
	definition 100% reliable.  If no elements are checked, a
	default normalized reliability is used, as determined by the
	editors by random sampling across all users in the population.
    probation
	flag. If set, all entries modified by the user are sent to the
	editor
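
The reliability bookkeeping above might be sketched in Python roughly
as follows.  This is only an illustration: the names User,
normalized_reliability, record_check and DEFAULT_RELIABILITY, and the
default value 0.8, are assumptions of the sketch, and the "degree of
modification" weighting is left out.

```python
from dataclasses import dataclass

# Assumed population-wide default; the editors would set this by random sampling.
DEFAULT_RELIABILITY = 0.8


@dataclass
class User:
    user_id: int
    reliability: float = 0.0  # raw weighted sum: accepts minus modifications
    checked: int = 0          # number of this user's elements checked by others
    is_editor: bool = False


def normalized_reliability(user: User) -> float:
    """Return a score in [0, 1] used to weight this user's judgements."""
    if user.is_editor:
        return 1.0  # editors are by definition 100% reliable
    if user.checked == 0:
        return DEFAULT_RELIABILITY
    return min(1.0, max(0.0, user.reliability / user.checked))


def record_check(checker: User, author: User, accepted: bool) -> None:
    """Credit or debit the author, weighted by the checker's own reliability."""
    weight = normalized_reliability(checker)
    author.reliability += weight if accepted else -weight
    author.checked += 1
```

Keeping the raw sum and the count separately, and normalizing only on
demand, means one bad check never has to be recomputed across the
whole history.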

(2) List of data elements to collect/already collected. This should
    include the following fields:

    data-id
	there is a 1-1 mapping between data elements and elements
	collected, obviously.  For distribution, however, the
	management fields may be removed from the database.  Also,
	these fields are independent of the actual data being
	collected.
    priority
	frequency of request; higher is sooner
    data-status
	READY: data element needs to be entered or verified
	HARD: data element needs to be entered or verified by
		an expert since the last user could not enter
		or verify it as requested
	UNKNOWN: data element is put on the challenge page for
		anyone to claim since an expert could not enter
		or verify it as requested
	ACTIVE: someone is entering/verifying the data element
	REJECTED: data element is not required in database
	DONE: data element has been verified
    request-id
	user who requested the data element
    enter-id
	user who entered the data element.  empty if the data has not
	yet been entered
    modify-id
	user who last modified the data element. empty if the data has
	never been modified (either because it was entered correctly,
	or because it has never been verified)
    verify-id
	user who "signed-off" on the data element. empty if the data
	has not been verified since it was defined or modified.  For
	extra reliability, more than one user should sign off on the
	data, and this will be an array of verifiers.
    supporting data
	whatever can be provided to make data entry/verification
	easier.  In the case of a dictionary, this would include
	sentences which contain the target word and definitions from
	other (possibly outdated) sources.
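
As a rough illustration, the data element record could look like this
in Python.  The class and method names (DataStatus, DataElement,
last_author, sign_off, modify) are made up for the sketch, and the
"required" number of sign-offs is an assumed parameter.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class DataStatus(Enum):
    READY = "READY"        # needs to be entered or verified
    HARD = "HARD"          # needs an expert
    UNKNOWN = "UNKNOWN"    # on the challenge page
    ACTIVE = "ACTIVE"      # someone is working on it
    REJECTED = "REJECTED"  # not required in the database
    DONE = "DONE"          # verified


@dataclass
class DataElement:
    data_id: str
    priority: int = 0
    status: DataStatus = DataStatus.READY
    request_id: Optional[int] = None
    enter_id: Optional[int] = None
    modify_id: Optional[int] = None
    verify_ids: List[int] = field(default_factory=list)
    supporting_data: List[str] = field(default_factory=list)

    def last_author(self) -> Optional[int]:
        """User whose work a verifier is checking: last modifier, else enterer."""
        return self.modify_id if self.modify_id is not None else self.enter_id

    def sign_off(self, user_id: int, required: int = 1) -> None:
        """Record a verifier; mark DONE once enough distinct users have signed."""
        if user_id not in self.verify_ids:
            self.verify_ids.append(user_id)
        if len(self.verify_ids) >= required:
            self.status = DataStatus.DONE

    def modify(self, user_id: int) -> None:
        """A modification invalidates any earlier sign-offs."""
        self.modify_id = user_id
        self.verify_ids.clear()
```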


Server programs
---------------

(1) process-message ## called for each message received by the list server

	determine user-id ## look how majordomo does it; may need password
	if no user-id, and not subscribe command, and not data-challenge
	   forward message to an editor: invalid user
	   return
	if user-status == KILLED
	   forward message to an editor: message from killed user
	   return
	if user-status == REALLY-KILLED
	   send user the go-away message
	   return

	determine nature of request by looking for initial keyword

	## list commands
	if subscribe,
	   if no user-id,
	      add user to the table
	      send user the welcome message
	      set user-status to ACTIVE or EXPERT
	      send user the next command
	   else
	      send user the go-away message
	else if unsubscribe
	   set user-status to REMOVED
	   if command is not empty
		set data-status of command to READY
		set command to empty
	else if address-update
	   set email to new address
	   if user-status == INACTIVE or REMOVED
	      set user-status to ACTIVE
	   if command is not empty
	      send user the command
	   else
	      send user the next command

        ## data updates
	else if data-rejected
	   increment user.activity
	   if enter-id is empty
	      ## garbage requested
	      set data-status to REJECTED
	   else
	      ## garbage data entered
	      forward message to editor: rejected existing entry
	      ## if it is really bad, the editor should check all other
	      ## entries of the same user, resetting those that are bogus.
	      ## This user should be warned, and have the probation flag
	      ## set, or have his status set to KILLED
	   send the user the next command
	else if data-unknown
	   if user-status == EXPERT
		## an expert could not handle it: open it to all takers
		set data-status to UNKNOWN
		add element to challenge page
	   else
		## an ordinary user could not handle it: pass it to an expert
		set data-status to HARD
	   send the user the next command
	else if data-accepted
	   if data-status is not ACTIVE
		forward message to editor: old entry reactivated
		return
	   increment user.activity
	   if enter-id is empty
	      add element to the database
	      set enter-id to user
	   else
	      if modify-id is empty
	          lookup modifier in enter-id
	      else
		  lookup modifier in modify-id
	      compare element to the database
	      if user.checked > 0
	          reliability = user.reliability/user.checked
		  if reliability < 0, reliability=0
	      else reliability = default
	      if identical
		  set verify-id to user
	          add reliability to modifier.reliability
	      else
		  update element in the database
	          set modify-id to user
		  subtract reliability from modifier.reliability
	      increment modifier.checked
	   send the user the next command
	else if data-new  ## add new words to define
	   for each entry/support pair
	      if entry exists, increment priority and update support
	      else make new data entries with request-id set to user
	else if data-challenge ## definition sent from challenge page
	   if no user-id, create new user with user-status INACTIVE
	   do everything after the data-status check in data-accepted

	## editor commands
	## for each command, need to confirm that it is an editor
	## sending the command.  Commands include set user-status,
	## set data-status and who knows what else.  Many of them
	## are better done direct from the account, though CGI is
	## a better bet.  Ick.  Learn zope then?

	## unknown
	else
	   forward message to editor: can't parse

	subroutine next-command ## send the user the next command
	    if user-status == EXPERT and some data has status HARD
		get first data with status HARD in reverse priority order
	    else
		get first data with status READY in reverse priority order
	    set data-status to ACTIVE
	    set command to data-id
	    set command-date to today
	    send data and supporting data to user
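
A possible Python rendering of the next-command subroutine, using
plain dicts for the user and data records.  The dict keys and the
function name are assumptions of this sketch, as is the rule that an
expert falls back to READY elements when nothing is marked HARD.

```python
from datetime import date
from typing import Optional


def next_command(user: dict, elements: list, today: date) -> Optional[dict]:
    """Pick the highest-priority waiting element and assign it to the user."""
    wanted = "HARD" if user["status"] == "EXPERT" else "READY"
    waiting = [e for e in elements if e["status"] == wanted]
    if not waiting and wanted == "HARD":
        # Assumption: experts take ordinary work when nothing is HARD.
        waiting = [e for e in elements if e["status"] == "READY"]
    if not waiting:
        user["command"] = None
        return None
    # "Reverse priority order": higher priority is served sooner.
    chosen = max(waiting, key=lambda e: e["priority"])
    chosen["status"] = "ACTIVE"
    user["command"] = chosen["data_id"]
    user["command_date"] = today
    return chosen  # the caller mails this, with its supporting data
```

Returning None when nothing is waiting lets the caller decide whether
to stay silent or tell the user the queue is empty.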


(2) daily batch
	for each command older than 2*max days
	    flag user as inactive
	    set data-status to READY
	    set command to empty
	for each remaining command older than max days
	    resend command
	send editor update message
	    #elements, #ACTIVE, #READY, #HARD, #UNKNOWN, #REJECTED, #DONE
	send editor random sample of elements which are waiting to be
	    verified, weighted to those with the lowest activity to
	    checked ratio, and those with the lowest reliability.
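
The timeout half of the daily batch might look like this in Python.
MAX_DAYS = 7 is an arbitrary stand-in for "max", and the dict layout
and function name are assumptions; the editor reports are not shown.

```python
from datetime import date, timedelta

MAX_DAYS = 7  # assumed reminder interval; the design leaves "max" open


def daily_batch(users: list, elements: dict, today: date) -> list:
    """Retire users silent for 2*MAX_DAYS; return those who need a reminder."""
    to_remind = []
    for user in users:
        if user.get("command") is None:
            continue  # nothing outstanding for this user
        age = today - user["command_date"]
        if age > timedelta(days=2 * MAX_DAYS):
            user["status"] = "INACTIVE"
            elements[user["command"]]["status"] = "READY"
            user["command"] = None
        elif age > timedelta(days=MAX_DAYS):
            to_remind.append(user)
    return to_remind
```

As written, a user in the reminder window is returned on every daily
run until they answer or time out; a real server would probably want
to remember that a reminder has already gone out.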

(3) display-cgi # public viewing of database
	verify that key is valid
	if key is in database, generate display form, including search field
	if key is not in database, generate search form

(4) entry-cgi # public update of database
	verify that key is valid
	if key is in database, generate edit form filled with data
	if key is not in database, generate empty edit form
	clicking the send button sends the user-filled form to the
	    database server, along with e-mail address of the user

(5) challenge-cgi # public selection of challenge word
	generate list of words with status UNKNOWN sorted in reverse
	order of priority.  Each word linked to entry form with that word
	as the key

(6) verify-cgi # editor quality check of entries on a per-user basis
	verify that access is by an editor
	if user is not in database,  generate form with user key only
	else generate list of words entered/modified by that user


Specific to dictionary building
-------------------------------

(1) supporting data tables
    document table:
	KEY:document-id
	source   where the document came from
	date     when the document was written
	author
	title
    sentence table:
	KEY:sentence-id 
	text     text of the sentence
	document-id source
    concordance:
	KEY:word
	KEY:sentence-id

(2) dictionary entry tables
    ## don't know what to put in the dictionary yet

(3) process-message program:
	...
	else if data-document
	   add new document entry
	   for each sentence
	      add new sentence
	      for each word in sentence
	         unmorph word [maybe??]
	         if new, add new word
		 increment frequency
	         if frequency is small and word is not junk, 
		    add to concordance
        ...

	subroutine next-command
	   ...
	   retrieve all entries for the word in concordance table, with text
	      substituted for sentence-id, and author-title substituted for
	      document-id
	   append to message before sending
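
The word loop of the data-document step could be sketched in Python
like this.  It does no unmorphing at all, treats "frequency is small"
as an assumed cap of MAX_CONTEXTS sentences per word, and takes the
junk list as a parameter; all the names here are made up for the
example.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

MAX_CONTEXTS = 5  # assumed cap standing in for "frequency is small"
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters, accents included


def index_sentences(
    sentences: Iterable[Tuple[int, str]], junk: Set[str]
) -> Tuple[Counter, Dict[str, List[int]]]:
    """Count every word and keep a few example sentence ids per word."""
    frequency: Counter = Counter()
    concordance: Dict[str, List[int]] = defaultdict(list)
    for sentence_id, text in sentences:
        for word in WORD_RE.findall(text.lower()):
            frequency[word] += 1
            if frequency[word] <= MAX_CONTEXTS and word not in junk:
                if sentence_id not in concordance[word]:
                    concordance[word].append(sentence_id)
    return frequency, dict(concordance)
```

Splitting on letter runs breaks elided forms such as "l'eau" into two
tokens, which is crude but keeps the head word visible.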