[kidsgames] Re: distributed data collection server
Hi Paul,
On Wed, 23 Feb 2000, Paul Kienzle wrote:
-->Date: Wed, 23 Feb 2000 13:23:20 +0000 (GMT)
-->From: Paul Kienzle <pkienzle@kienzle.powernet.co.uk>
-->Reply-To: kidsgames@smluc.org
-->To: kidsgames@smluc.org
-->
-->
-->I've taken a first crack at a distributed data collection server.
Great!!!
-->It's all pseudocode as of yet.
It's a start.
--> Anyone care to take it further while I
-->bury myself in my "real" project for a while?
what's so "unreal" about this one?
--> I need to select a
-->programming language and a database.
php and postgresql
--> It may be wise to find a willing
-->host site first, to see what tools they will accept/provide. Would
-->this project be easier in Zope?
zope REQUIRES frames...and I Don't LIKE frames....
--> I hear python is easy enough to
-->learn.
-->
I keep hearing that, yet when I look at the stuff, it seems about as
cryptic as perl...
-->My goal is to generate a free French dictionary to complement our free
-->English one.
Awesome.
--> I grabbed half the Project Gutenberg French collection,
-->and pretty quickly generated an 18600 word database, all with sentence
-->contexts.
WOW!
--> Granted, conjugations, Gutenberg license terms, archaisms,
-->proper names and other junk will cut that down as low as 12000 head
-->words, but that's still an awfully good start on a dictionary. Surely
-->more modern texts are available from the French or Canadian
-->governments to push this up to the 50000 range of a good college
-->dictionary. [The 200,000 they list on the back includes all those
-->variations like plural, past tense, different part of speech, etc.]
-->
Any French speakers out there with some URLs for more text?
-->-------------------------------
-->Paul Kienzle
-->pkienzle@kienzle.powernet.co.uk
-->-------------------------------
-->
-->Distributed data collection using a mail service
-->
-->Assumptions
-->-----------
-->
-->(1) Data elements are independent. This means that your data
--> collectors do not have to interact with each other. You can get
--> around this by generating dependent data after independent data,
--> or by generating interdependent data elements together.
-->
-->(2) Expertise is widely held. This means that you don't have
--> to match data elements to individuals. Some users can subscribe
--> as experts, which means that they are willing to spend more time
--> figuring out how to enter or verify a data element, or will
--> delegate to someone who can. If they fail, then the data element
--> is put on an open challenge list for all takers. If expertise is
--> not widely held, then users and data will have to be tagged and
--> matched for specialities. E.g., legal, medical, scientific.
-->
-->(3) Users are mostly trustworthy, but error-prone. That is,
--> most won't enter bogus data, but they will sometimes make
--> mistakes. You do need a verification process. Depending on how
--> reliable you want the data, you can use random sampling to ensure
--> that things are mostly correct, or you can verify every
--> modification. Keep the identities of the user who initially
--> entered the data, he who last modified the data, and he who last
--> verified the data, so that if errors are found, other entries
--> touched by the same individual can be verified. From this, a
--> measure of reliability can be determined for each user. This also
--> provides a degree of accountability, which should make the entire
--> process less error-prone. Making these identities available in the
--> released database improves accountability even more, though an
--> opt-out should be available. Check out the entries of those who
--> opt out. Special mention should be made of those with the most
--> new entries, the most fixes and the most checks.
-->
-->(4) There is an editor who "owns" the project. He has the right to add
--> and remove users, and to process entries on a per-user basis to
--> check their reliability. Anomalous e-mails (and spam) will be
--> sent to his accounts. If users detect bogus entries, they have
--> a way of flagging them to be sent to an editor for further checks.
--> If traffic gets too high, cycle it across multiple editors.
-->
-->(5) Users control their own level of activity. In the simplest
--> form, simply not responding for a while will delay the next
--> request. If they delay too long, however, the system will not
--> know if the request was lost. So after a period of delay, the
--> system will resend the request, and after another period of delay,
--> the system will flag the user as inactive and tell them to
--> resubscribe to resume activity, and it will send the request on to
--> another user. Invalid e-mail addresses will be treated similarly.
--> A more complicated system would allow the user to set the number
--> of elements to process at once (useful for those who have
--> occasional large blocks of time, and intermittent connections) and
--> the delay between reminders.
-->
-->
This seems very similar to the way gnu-translators works. That effort
intends to translate the entire gnu.org website (one page at a time)
using many translators instead of relying on one.
-->
-->Management database
-->-------------------
-->
-->(1) List of users who are doing the collecting. This should include
--> the following fields:
-->
-->    user-id
-->        table key
-->    user-status
-->        ACTIVE:   currently receiving commands
-->        EXPERT:   receiving HARD commands (see below) if any
-->        INACTIVE: not responding to commands
-->        REMOVED:  asked to be removed from the list
-->        KILLED:   forcibly removed from the list
-->    name
-->        optional
-->    email
-->        may change during the collection process, so can't be used as
-->        the key
-->    command
-->        data element they are entering/verifying
-->    command date
-->        date the command was sent. used to compute number of days
-->        since command was sent, and to either resend the command or
-->        mark the user as inactive.
-->    entered
-->        number of elements entered/modified/verified
-->    checked
-->        number of elements checked by others
-->    reliability
-->        number of elements accepted by others, weighted by the
-->        reliability of those doing the accepting, minus the number of
-->        elements modified by others, weighted by the reliability of
-->        those doing the modifying and by the degree of modification.
-->        Lower limit of zero, upper limit of 1. Divide by the number
-->        checked by others for the normalized score used in the
-->        calculation. Editors are by definition 100% reliable. If no
-->        elements are checked, a default normalized reliability is
-->        used, as determined by the editors by random sampling across
-->        all users in the population.
-->    probation
-->        flag. If set, all entries modified by the user are sent to the
-->        editor
authentication-type: type of authentication verifying identity of
this user.
authentication-key: pgp/gpg key for authentication purposes.
-->
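As a rough illustration of the user table and the normalized reliability score described above, here is a minimal Python sketch (Python only because it came up earlier in the thread; no language has been chosen). The `DEFAULT_RELIABILITY` value and the `is_editor` flag are placeholders for illustration and are not part of the design above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumed stand-in for the default the editors would derive by random
# sampling; the design above does not give a number.
DEFAULT_RELIABILITY = 0.5


@dataclass
class User:
    user_id: int                          # table key
    email: str                            # may change, so it is not the key
    status: str = "ACTIVE"                # ACTIVE, EXPERT, INACTIVE, REMOVED, KILLED
    name: Optional[str] = None            # optional
    command: Optional[int] = None         # data-id being entered/verified
    command_date: Optional[date] = None   # when the command was sent
    entered: int = 0                      # elements entered/modified/verified
    checked: int = 0                      # elements checked by others
    reliability: float = 0.0              # running weighted sum, not yet normalized
    probation: bool = False               # if set, entries go to the editor
    is_editor: bool = False               # hypothetical flag, not a field listed above


def normalized_reliability(user: User, default: float = DEFAULT_RELIABILITY) -> float:
    """Return the per-check score, clamped to the range [0, 1]."""
    if user.is_editor:
        return 1.0                        # editors are by definition 100% reliable
    if user.checked == 0:
        return default                    # nothing checked yet
    return max(0.0, min(1.0, user.reliability / user.checked))
```

Keeping the running sum and the check count separate means the score can be recomputed at any time without replaying the history.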
-->(2) List of data elements to collect/already collected. This should
--> include the following fields:
-->
-->    data-id
-->        there is a 1-1 mapping between data elements and elements
-->        collected, obviously. For distribution, however, the
-->        management fields may be removed from the database. Also,
-->        these fields are independent of the actual data being
-->        collected.
-->    priority
-->        frequency of request; higher is sooner
-->    data-status
-->        READY:    data element needs to be entered or verified
-->        HARD:     data element needs to be entered or verified by
-->                  an expert since the last user could not enter
-->                  or verify it as requested
-->        UNKNOWN:  data element is put on the challenge page for
-->                  anyone to claim since an expert could not enter
-->                  or verify it as requested
-->        ACTIVE:   someone is entering/verifying the data element
-->        REJECTED: data element is not required in database
-->        DONE:     data element has been verified
-->    request-id
-->        user who requested the data element
-->    enter-id
-->        user who entered the data element. empty if the data has not
-->        yet been entered
-->    modify-id
-->        user who last modified the data element. empty if the data has
-->        never been modified (either because it was entered correctly,
-->        or because it has never been verified)
-->    verify-id
-->        user who "signed-off" on the data element. empty if the data
-->        has not been verified since it was defined or modified. For
-->        extra reliability, more than one user should sign off on the
-->        data, and this will be an array of verifiers.
-->    supporting data
-->        whatever can be provided to make data entry/verification
-->        easier. In the case of a dictionary, this would include
-->        sentences which contain the target word and definitions from
-->        other (possibly outdated) sources.
-->
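The data element record, and the idea of dropping the management fields for distribution, could look something like the sketch below. The `payload` field and the choice of which fields count as "management" are assumptions; the design above leaves both open (and assumption (3) suggests the identities might be kept in a release after all).

```python
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import List, Optional


class DataStatus(Enum):
    READY = "READY"
    HARD = "HARD"
    UNKNOWN = "UNKNOWN"
    ACTIVE = "ACTIVE"
    REJECTED = "REJECTED"
    DONE = "DONE"


@dataclass
class DataElement:
    data_id: int
    priority: int = 0                                     # higher is sooner
    status: DataStatus = DataStatus.READY
    request_id: Optional[int] = None                      # user who requested it
    enter_id: Optional[int] = None                        # user who entered it
    modify_id: Optional[int] = None                       # user who last modified it
    verify_ids: List[int] = field(default_factory=list)   # array of verifiers
    supporting: List[str] = field(default_factory=list)   # e.g. context sentences
    payload: dict = field(default_factory=dict)           # hypothetical: the data itself


# Assumed split: these fields exist only to manage the collection process.
MANAGEMENT_FIELDS = {
    "priority", "status", "request_id", "enter_id", "modify_id", "verify_ids",
}


def for_distribution(element: DataElement) -> dict:
    """Return the element as a plain dict without the management fields."""
    record = asdict(element)
    return {k: v for k, v in record.items() if k not in MANAGEMENT_FIELDS}
```

For example, `for_distribution(DataElement(data_id=7, enter_id=3, payload={"word": "chat"}))` keeps only `data_id`, `supporting` and `payload`.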
-->
-->Server programs
-->---------------
-->
-->(1) process-message   ## called for each message received by the list server
-->
-->    determine user-id   ## look how majordomo does it; may need password
-->    if no user-id, and not subscribe command, and not data-challenge
-->        forward message to an editor: invalid user
-->        return
-->    if user-status == KILLED
-->        forward message to an editor: message from killed user
-->        return
-->    if user-status == REALLY-KILLED
-->        send user the go-away message
-->        return
-->
-->    determine nature of request by looking for initial keyword
-->
-->    ## list commands
-->    if subscribe,
-->        if no user-id,
-->            add user to the table
-->            send user the welcome message
-->            set user-status to ACTIVE or EXPERT
-->            send user the next command
-->        else
-->            send user the go-away message
-->    else if unsubscribe
-->        set user-status to REMOVED
-->        if command is not empty
-->            set data-status of command to READY
-->            set command to empty
-->    else if address-update
-->        set email to new address
-->        if user-status == INACTIVE or REMOVED
-->            set user-status to ACTIVE
-->        ## send user the last command if they have one, or the next command
-->        if command is not empty
-->            send user the command
-->        else
-->            send user the next command
-->
-->    ## data updates
-->    else if data-rejected
-->        increment user.activity
-->        if enter-id is empty
-->            ## garbage requested
-->            set data-status to REJECTED
-->        else
-->            ## garbage data entered
-->            forward message to editor: rejected existing entry
-->                if it is really bad, he should check all other entries of the
-->                same user, resetting those that are bogus. This user should
-->                be warned, and have the probation flag set, or have his
-->                status set to killed
-->        send the user the next command
-->    else if data-unknown
-->        if data-status is HARD
-->            set data-status to UNKNOWN
-->            add element to challenge page
-->        if data-status is ACTIVE
-->            set data-status to UNKNOWN
-->        send the user the next command
-->    else if data-accepted
-->        if data-status is not ACTIVE
-->            forward message to editor: old entry reactivated
-->            return
-->        increment user.activity
-->        if enter-id is empty
-->            add element to the database
-->            set enter-id to user
-->        else
-->            if modify-id is empty
-->                lookup modifier in enter-id
-->            else
-->                lookup modifier in modify-id
-->            compare element to the database
-->            if user.checked > 0
-->                reliability = user.reliability/user.checked
-->                if reliability < 0, reliability=0
-->            else reliability = default
-->            if identical
-->                set verify-id to user
-->                add reliability to modifier.reliability
-->            else
-->                update element in the database
-->                set modify-id to user
-->                subtract reliability from modifier.reliability
-->            increment modifier.checked
-->        send the user the next command
-->    else if data-new   ## add new words to define
-->        for each entry/support pair
-->            if entry exists, increment priority and update support
-->            else make new data entries with request-id set to user
-->    else if data-challenge   ## definition sent from challenge page
-->        if no user-id, create new user with user-status INACTIVE
-->        do everything after the data-status check in data-accepted
-->    ## editor commands
-->    ## for each command, need to confirm that it is an editor
-->    ## sending the command. Commands include set user-status,
-->    ## set data-status and who knows what else. Many of them
-->    ## are better done direct from the account, though CGI is
-->    ## a better bet. Ick. Learn zope then?
-->
-->    ## unknown
-->    else
-->        forward message to editor: can't parse
-->
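The reliability bookkeeping in the data-accepted branch is the fiddliest part of the pseudocode, so here is one possible reading of it as a Python sketch. The `Account` and `Entry` classes and the `DEFAULT_RELIABILITY` value are made up for illustration. Clearing `verify_id` on a modification follows the verify-id field description rather than the pseudocode itself.

```python
from dataclasses import dataclass
from typing import Dict, Optional

DEFAULT_RELIABILITY = 0.5   # assumed placeholder; editors would set this by sampling


@dataclass
class Account:
    reliability: float = 0.0   # running weighted sum
    checked: int = 0           # times this user's work was checked by others


@dataclass
class Entry:
    value: Optional[str] = None
    enter_id: Optional[str] = None
    modify_id: Optional[str] = None
    verify_id: Optional[str] = None


def weight_of(checker: Account) -> float:
    """Normalized reliability of the user doing the checking."""
    if checker.checked > 0:
        return max(0.0, checker.reliability / checker.checked)
    return DEFAULT_RELIABILITY


def accept(entry: Entry, submitted: str, user_id: str,
           accounts: Dict[str, Account]) -> None:
    """Apply one data-accepted message from user_id to entry."""
    if entry.enter_id is None:
        # First submission: nothing to compare against yet.
        entry.value = submitted
        entry.enter_id = user_id
        return
    # Whoever touched the entry last is the one being checked.
    modifier_id = entry.modify_id if entry.modify_id is not None else entry.enter_id
    modifier = accounts[modifier_id]
    weight = weight_of(accounts[user_id])
    if submitted == entry.value:
        entry.verify_id = user_id
        modifier.reliability += weight
    else:
        entry.value = submitted
        entry.modify_id = user_id
        entry.verify_id = None        # modified since the last sign-off
        modifier.reliability -= weight
    modifier.checked += 1
```

Note that the modifier is looked up before `modify_id` is overwritten; otherwise the penalty would land on the user who made the correction.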
-->    subroutine next-command   ## send the user the next command
-->        if user-status == EXPERT
-->            get first data with status HARD in reverse priority order
-->        else
-->            get first data with status READY in reverse priority order
-->        set data-status to ACTIVE
-->        set command to data-id
-->        set command-date to today
-->        send data and supporting data to user
-->
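A small Python sketch of next-command, under a couple of assumptions: experts fall back to READY elements when nothing is HARD (the user-status description says "if any", the pseudocode does not spell it out), and the function returns `None` when nothing is available. The class names are hypothetical, and actually mailing the element is left out.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Element:
    data_id: int
    priority: int        # higher is sooner
    status: str          # "READY", "HARD", "ACTIVE", ...


@dataclass
class Collector:
    status: str                           # "ACTIVE" or "EXPERT"
    command: Optional[int] = None         # data-id currently assigned
    command_date: Optional[date] = None   # when it was assigned


def next_command(user: Collector, elements: List[Element],
                 today: date) -> Optional[Element]:
    """Assign the highest-priority suitable element to the user."""
    # Assumed fallback: experts take READY work when no HARD work exists.
    wanted_order = ["HARD", "READY"] if user.status == "EXPERT" else ["READY"]
    for wanted in wanted_order:
        candidates = [e for e in elements if e.status == wanted]
        if candidates:
            chosen = max(candidates, key=lambda e: e.priority)
            chosen.status = "ACTIVE"
            user.command = chosen.data_id
            user.command_date = today
            return chosen
    return None   # nothing to hand out; the pseudocode does not cover this case
```

A real version would do the selection and the status update in one database transaction, so that two users cannot be handed the same element.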
-->
-->(2) daily batch
-->    for each command older than 2*max days
-->        flag user as inactive
-->        set data-status as READY
-->    for each command older than max days
-->        resend command
-->    send editor update message
-->        #elements, #ACTIVE, #READY, #HARD, #UNKNOWN, #REJECTED, #DONE
-->    send editor random sample of elements which are waiting to be
-->        verified, weighted to those with the lowest activity to
-->        checked ratio, and those with the lowest reliability.
-->
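The resend/inactive part of the daily batch might be sketched as follows. `MAX_DAYS = 7` is an arbitrary assumption (the design only says "max days"), and clearing the user's command when they lapse is an assumption too, so that the element can be handed to someone else. The editor report and the random sample are not covered.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional, Tuple

MAX_DAYS = 7   # assumed value; the design leaves "max days" unspecified


@dataclass
class Collector:
    status: str = "ACTIVE"
    command: Optional[int] = None         # data-id currently assigned
    command_date: Optional[date] = None   # when it was sent


def daily_batch(users: Dict[str, Collector], data_status: Dict[int, str],
                today: date, max_days: int = MAX_DAYS) -> Tuple[List[str], List[str]]:
    """Return (user ids needing a resend, user ids flagged inactive)."""
    resend: List[str] = []
    lapsed: List[str] = []
    for user_id, user in users.items():
        if user.command is None or user.command_date is None:
            continue                              # nothing outstanding
        age = (today - user.command_date).days
        if age > 2 * max_days:
            user.status = "INACTIVE"
            data_status[user.command] = "READY"   # back in the pool
            user.command = None                   # assumed: release the element
            lapsed.append(user_id)
        elif age > max_days:
            resend.append(user_id)
    return resend, lapsed
```

Checking the 2*max case first keeps a lapsed user from also getting a reminder on the same run.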
-->(3) display-cgi   # public viewing of database
-->    verify that key is valid
-->    if key is in database, generate display form, including search field
-->    if key is not in database, generate search form
-->
-->(4) entry-cgi   # public update of database
-->    verify that key is valid
-->    if key is in database, generate edit form filled with data
-->    if key is not in database, generate empty edit form
-->    clicking the send button sends the user-filled form to the
-->        database server, along with e-mail address of the user
-->
-->(5) challenge-cgi   # public selection of challenge word
-->    generate list of words with status UNKNOWN (the challenge list)
-->        sorted in reverse order of priority. Each word linked to entry
-->        form with that word as the key
-->
-->(6) verify-cgi   # editor quality check of entries on a per-user basis
-->    verify that access is by an editor
-->    if user is not in database, generate form with user key only
-->    else generate list of words entered/modified by that user
-->
-->
-->Specific to dictionary building
-->-------------------------------
-->
-->(1) supporting data tables
-->    document table:
-->        KEY:document-id
-->        source          where the document came from
-->        date            when the document was written
-->        author
-->        title
-->    sentence table:
-->        KEY:sentence-id
-->        text            text of the sentence
-->        document-id     source
-->    concordance:
-->        KEY:word
-->        KEY:sentence-id
-->
-->(2) dictionary entry tables
--> ## don't know what to put in the dictionary yet
-->
-->(3) process-message program:
-->    ...
-->    else if data-document
-->        add new document entry
-->        for each sentence
-->            add new sentence
-->            for each word in sentence
-->                unmorph word [maybe??]
-->                if new, add new word
-->                increment frequency
-->                if frequency is small and word is not junk,
-->                    add to concordance
-->    ...
-->
-->    subroutine next-command
-->        ...
-->        retrieve all entries for the word in the concordance table,
-->        with text substituted for sentence-id, and author-title
-->        substituted for document-id
-->        append to message before sending
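For the data-document step, a bare-bones Python sketch of the concordance building could look like this. The `MAX_FREQUENCY` cutoff and the `JUNK` word list are invented placeholders for "frequency is small" and "word is not junk". The "unmorph" step is skipped entirely, and sentences are assumed to be already split.

```python
import re
from collections import Counter, defaultdict
from typing import DefaultDict, List

MAX_FREQUENCY = 3            # assumed cutoff for "frequency is small"
JUNK = {"le", "la", "de"}    # hypothetical junk list


def index_document(sentences: List[str], frequency: Counter,
                   concordance: DefaultDict[str, List[int]],
                   first_sentence_id: int = 0) -> None:
    """Count every word and keep a few context sentences for each one."""
    for offset, text in enumerate(sentences):
        sentence_id = first_sentence_id + offset
        # [^\W\d_]+ matches runs of letters, including accented ones.
        for word in re.findall(r"[^\W\d_]+", text.lower()):
            frequency[word] += 1
            if frequency[word] > MAX_FREQUENCY or word in JUNK:
                continue
            if sentence_id not in concordance[word]:
                concordance[word].append(sentence_id)
```

Called with `frequency = Counter()` and `concordance = defaultdict(list)`, this only stores context for the first few occurrences of each word, which keeps the concordance from being swamped by common words.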
Anybody know if majordomo or mailman are scriptable?
Nice start Paul!
--
Jeff Waddell
jeff@smluc.org
Kids Games Project Coordinator
main website at http://smluc.org/SIA/kidsgames/
-
kidsgames@smluc.org -- To get off this list send "unsubscribe kidsgames"
in the body of a message to majordomo@smluc.org