No Subject
I've taken a first crack at a distributed data collection server.
It's all pseudocode as of yet. Anyone care to take it further while I
bury myself in my "real" project for a while? I need to select a
programming language and a database. It may be wise to find a willing
host site first, to see what tools they will accept/provide. Would
this project be easier in Zope? I hear Python is easy enough to
learn.
My goal is to generate a free French dictionary to complement our free
English one. I grabbed half the Project Gutenberg French collection,
and pretty quickly generated an 18,600-word database, all with sentence
contexts. Granted, conjugations, Gutenberg license terms, archaisms,
proper names and other junk will cut that down to as low as 12,000
headwords, but that's still an awfully good start on a dictionary. Surely
more modern texts are available from the French or Canadian
governments to push this up to the 50,000 range of a good college
dictionary. [The 200,000 they list on the back includes all those
variations like plural, past tense, different part of speech, etc.]
-------------------------------
Paul Kienzle
pkienzle@kienzle.powernet.co.uk
-------------------------------
Distributed data collection using a mail service
Assumptions
-----------
(1) Data elements are independent. This means that your data
collectors do not have to interact with each other. You can get
around this by generating dependent data after independent data,
or by generating interdependent data elements together.
(2) Expertise is widely held. This means that you don't have
to match data elements to individuals. Some users can subscribe
as experts, which means that they are willing to spend more time
figuring out how to enter or verify a data element, or will
delegate to someone who can. If they fail, then the data element
is put on an open challenge list for all takers. If expertise is
not widely held, then users and data will have to be tagged and
matched for specialities. E.g., legal, medical, scientific.
(3) Users are mostly trustworthy, but error-prone. That is,
most won't enter bogus data, but they will sometimes make
mistakes. You do need a verification process. Depending on how
reliable you want the data, you can use random sampling to ensure
that things are mostly correct, or you can verify every
modification. Keep the identities of the user who initially
entered the data, he who last modified the data, and he who last
verified the data, so that if errors are found, other entries
touched by the same individual can be verified. From this, a
measure of reliability can be determined for each user. This also
provides a degree of accountability, which should make the entire
process less error-prone. Making these identities available in the
released database improves accountability even more, though an
opt-out should be available. Check out the entries of those who
opt out. Special mention should be made of those with the most
new entries, the most fixes and the most checks.
(4) There is an editor who "owns" the project. He has the right to add
and remove users, and to process entries on a per-user basis to
check their reliability. Anomalous e-mails (and spam) will be
sent to his accounts. If users detect bogus entries, they have
a way of flagging them to be sent to an editor for further checks.
If traffic gets too high, cycle it across multiple editors.
(5) Users control their own level of activity. In the simplest
form, simply not responding for a while will delay the next
request. If they delay too long, however, the system will not
know if the request was lost. So after a period of delay, the
system will resend the request, and after another period of delay,
the system will flag the user as inactive and tell them to
resubscribe to resume activity, and it will send the request on to
another user. Invalid e-mail addresses will be treated similarly.
A more complicated system would allow the user to set the number
of elements to process at once (useful for those who have
occasional large blocks of time, and intermittent connections) and
the delay between reminders.
Management database
-------------------
(1) List of users who are doing the collecting. This should include
    the following fields:
    user-id
        table key
    user-status
        ACTIVE:   currently receiving commands
        EXPERT:   receiving HARD commands (see below) if any
        INACTIVE: not responding to commands
        REMOVED:  asked to be removed from the list
        KILLED:   forcibly removed from the list
    name
        optional
    email
        may change during the collection process, so can't be used as
        the key
    command
        data element they are entering/verifying
    command date
        date the command was sent. Used to compute the number of days
        since the command was sent, and to either resend the command or
        mark the user as inactive.
    entered
        number of elements entered/modified/verified
    checked
        number of elements checked by others
    reliability
        number of elements accepted by others, weighted by the
        reliability of those doing the accepting, minus the number of
        elements modified by others, weighted by the reliability of
        those doing the modifying and by the degree of modification.
        Divide by the number checked by others for the normalized
        score used in the calculation; the normalized score has a
        lower limit of zero and an upper limit of 1. Editors are by
        definition 100% reliable. If no elements are checked, a
        default normalized reliability is used, as determined by the
        editors by random sampling across all users in the population.
    probation
        flag. If set, all entries modified by the user are sent to the
        editor
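
As a rough illustration only, the user table above could be held in a
record like the following. This is a minimal Python sketch, not a
decision on language or database. The class and field names are my own
renderings of the fields listed above, and DEFAULT_RELIABILITY is a
made-up placeholder for the value the editors would obtain by sampling.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class UserStatus(Enum):
    ACTIVE = "ACTIVE"      # currently receiving commands
    EXPERT = "EXPERT"      # receiving HARD commands, if any
    INACTIVE = "INACTIVE"  # not responding to commands
    REMOVED = "REMOVED"    # asked to be removed from the list
    KILLED = "KILLED"      # forcibly removed from the list


# Hypothetical placeholder: the real default comes from editor sampling.
DEFAULT_RELIABILITY = 0.5


@dataclass
class User:
    user_id: int                          # table key
    email: str                            # may change, so not the key
    status: UserStatus = UserStatus.ACTIVE
    name: Optional[str] = None
    command: Optional[int] = None         # data-id currently assigned
    command_date: Optional[date] = None   # when the command was sent
    entered: int = 0                      # elements entered/modified/verified
    checked: int = 0                      # elements checked by others
    reliability: float = 0.0              # raw running score, not normalized
    probation: bool = False

    def normalized_reliability(self, default: float = DEFAULT_RELIABILITY) -> float:
        """Raw score divided by number checked, clamped to the range 0..1."""
        if self.checked == 0:
            return default
        return min(1.0, max(0.0, self.reliability / self.checked))
```

A user with a raw score of 3.0 over 4 checked elements would come out
at 0.75; a user nobody has checked yet gets the default.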
(2) List of data elements to collect/already collected. This should
    include the following fields:
    data-id
        there is a 1-1 mapping between data elements and elements
        collected, obviously. For distribution, however, the
        management fields may be removed from the database. Also,
        these fields are independent of the actual data being
        collected.
    priority
        frequency of request; higher is sooner
    data-status
        READY:    data element needs to be entered or verified
        HARD:     data element needs to be entered or verified by
                  an expert since the last user could not enter
                  or verify it as requested
        UNKNOWN:  data element is put on the challenge page for
                  anyone to claim since an expert could not enter
                  or verify it as requested
        ACTIVE:   someone is entering/verifying the data element
        REJECTED: data element is not required in database
        DONE:     data element has been verified
    request-id
        user who requested the data element
    enter-id
        user who entered the data element. Empty if the data has not
        yet been entered
    modify-id
        user who last modified the data element. Empty if the data has
        never been modified (either because it was entered correctly,
        or because it has never been verified)
    verify-id
        user who "signed-off" on the data element. Empty if the data
        has not been verified since it was defined or modified. For
        extra reliability, more than one user should sign off on the
        data, and this will be an array of verifiers.
    supporting data
        whatever can be provided to make data entry/verification
        easier. In the case of a dictionary, this would include
        sentences which contain the target word and definitions from
        other (possibly outdated) sources.
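
The management fields for a data element could be sketched the same
way. Again this is only an illustration in Python with names of my own
choosing. The verifier list follows the "array of verifiers" suggestion
above, and last_author is a hypothetical helper for finding whose work
a verifier is checking.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class DataStatus(Enum):
    READY = "READY"        # needs to be entered or verified
    HARD = "HARD"          # needs an expert
    UNKNOWN = "UNKNOWN"    # on the challenge page
    ACTIVE = "ACTIVE"      # someone is working on it
    REJECTED = "REJECTED"  # not required in the database
    DONE = "DONE"          # has been verified


@dataclass
class DataElement:
    data_id: int
    priority: int = 0                     # higher is sooner
    status: DataStatus = DataStatus.READY
    request_id: Optional[int] = None      # user who requested the element
    enter_id: Optional[int] = None        # user who entered it, if anyone
    modify_id: Optional[int] = None       # user who last modified it
    verify_ids: List[int] = field(default_factory=list)  # users who signed off
    supporting: List[str] = field(default_factory=list)  # e.g. context sentences

    def last_author(self) -> Optional[int]:
        """User whose work is being checked: last modifier, else the enterer."""
        return self.modify_id if self.modify_id is not None else self.enter_id
```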
Server programs
---------------
(1) process-message ## called for each message received by the list server
    determine user-id ## look how majordomo does it; may need password
    if no user-id, and not subscribe command, and not data-challenge
        forward message to an editor: invalid user
        return
    if user-status == KILLED
        forward message to an editor: message from killed user
        return
    if user-status == REALLY-KILLED
        send user the go-away message
        return
    determine nature of request by looking for initial keyword
    ## list commands
    if subscribe,
        if no user-id,
            add user to the table
            send user the welcome message
            set user-status to ACTIVE or EXPERT
            send user the next command
        else
            send user the go-away message
    else if unsubscribe
        set user-status to REMOVED
        if command is not empty
            set data-status of command to READY
            set command to empty
    else if address-update
        set email to new address
        if user-status == INACTIVE or REMOVED
            set user-status to ACTIVE
        ## i.e., send user the last command if they have one, or the
        ## next command
        if command is not empty
            send user the command
        else
            send user the next command
    ## data updates
    else if data-rejected
        increment user.activity
        if enter-id is empty
            ## garbage requested
            set data-status to REJECTED
        else
            ## garbage data entered
            forward message to editor: rejected existing entry
            ## if it is really bad, he should check all other entries of
            ## the same user, resetting those that are bogus. This user
            ## should be warned, and have the probation flag set, or have
            ## his status set to killed
        send the user the next command
    else if data-unknown
        if data-status is HARD
            ## an expert could not do it
            set data-status to UNKNOWN
            add element to challenge page
        else if data-status is ACTIVE
            ## the last user could not do it; pass it to an expert
            set data-status to HARD
        send the user the next command
    else if data-accepted
        if data-status is not ACTIVE
            forward message to editor: old entry reactivated
            return
        increment user.activity
        if enter-id is empty
            add element to the database
            set enter-id to user
            set data-status to READY ## still needs to be verified
        else
            if modify-id is empty
                lookup modifier in enter-id
            else
                lookup modifier in modify-id
            compare element to the database
            if user.checked > 0
                reliability = user.reliability/user.checked
                if reliability < 0, reliability=0
            else reliability = default
            if identical
                set verify-id to user
                set data-status to DONE
                add reliability to modifier.reliability
            else
                update element in the database
                set modify-id to user
                set data-status to READY ## needs to be verified again
                subtract reliability from modifier.reliability
            increment modifier.checked
        send the user the next command
    else if data-new ## add new words to define
        for each entry/support pair
            if entry exists, increment priority and update support
            else make new data entries with request-id set to user
    else if data-challenge ## definition sent from challenge page
        if no user-id, create new user with user-status INACTIVE
        do everything after the data-status check in data-accepted
    ## editor commands
    ##   for each command, need to confirm that it is an editor
    ##   sending the command. Commands include set user-status,
    ##   set data-status and who knows what else. Many of them
    ##   are better done direct from the account, though CGI is
    ##   a better bet. Ick. Learn zope then?
    ## unknown
    else
        forward message to editor: can't parse
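
The reliability bookkeeping in the data-accepted branch can be sketched
in a few lines. This is a minimal Python illustration under my own
naming (Scores, checker_weight, record_check are all hypothetical), and
DEFAULT_RELIABILITY again stands in for the editor-determined default.
It ignores the "degree of modification" weighting mentioned in the user
table.

```python
from dataclasses import dataclass

# Hypothetical placeholder for the editor-determined default.
DEFAULT_RELIABILITY = 0.5


@dataclass
class Scores:
    reliability: float = 0.0  # raw running score
    checked: int = 0          # elements of this user checked by others


def checker_weight(checker: Scores, default: float = DEFAULT_RELIABILITY) -> float:
    """Normalized reliability of the user doing the check, floored at zero."""
    if checker.checked > 0:
        return max(0.0, checker.reliability / checker.checked)
    return default


def record_check(checker: Scores, author: Scores, identical: bool) -> None:
    """Credit or debit the author of an element after someone checks it."""
    weight = checker_weight(checker)
    if identical:
        author.reliability += weight  # element accepted as it stood
    else:
        author.reliability -= weight  # element had to be modified
    author.checked += 1
```

So an acceptance by a checker with a raw score of 3 over 4 checks adds
0.75 to the author's raw score, and a later modification by a brand-new
checker takes 0.5 back off.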
    subroutine next-command ## send the user the next command
        if user-status == EXPERT
            get first data with status HARD in reverse priority order
            (highest first)
        else
            get first data with status READY in reverse priority order
            (highest first)
        set data-status to ACTIVE
        set command to data-id
        set command-date to today
        send data and supporting data to user
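
The selection step in next-command might look like this in Python. It
is a sketch only: elements are plain dicts with "status" and "priority"
keys of my own invention, and pick_next is a hypothetical name. Two
things here are my assumptions rather than part of the pseudocode:
experts fall back to READY elements when nothing is HARD (my reading of
"if any" in the user-status list), and the function returns None when
nothing is left to hand out.

```python
from typing import Dict, List, Optional


def pick_next(elements: List[Dict], expert: bool) -> Optional[Dict]:
    """Pick the highest-priority element for a user and mark it ACTIVE."""
    # Assumed fallback: experts get HARD elements first, then READY ones.
    wanted_order = ["HARD", "READY"] if expert else ["READY"]
    for wanted in wanted_order:
        candidates = [e for e in elements if e["status"] == wanted]
        if candidates:
            # Higher priority is sooner.
            chosen = max(candidates, key=lambda e: e["priority"])
            chosen["status"] = "ACTIVE"
            return chosen
    return None  # nothing left for this user
```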
(2) daily batch
    for each command older than 2*max days
        flag user as inactive
        set data-status as READY
    for each other command older than max days
        resend command
    send editor update message
        #elements, #ACTIVE, #READY, #HARD, #UNKNOWN (challenge),
        #REJECTED, #DONE
    send editor random sample of elements which are waiting to be
        verified, weighted to those with the lowest activity to
        checked ratio, and those with the lowest reliability.
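
The two timeouts in the daily batch boil down to a small decision on
the age of each outstanding command. A minimal Python sketch follows;
MAX_DAYS = 7 is an arbitrary value picked for illustration, and the
function name and return strings are hypothetical.

```python
from datetime import date

# Arbitrary illustrative value for "max days".
MAX_DAYS = 7


def classify_command(sent: date, today: date, max_days: int = MAX_DAYS) -> str:
    """Decide what the daily batch should do with an outstanding command."""
    age = (today - sent).days
    if age > 2 * max_days:
        return "deactivate"  # flag user inactive, put element back to READY
    if age > max_days:
        return "resend"      # remind the user
    return "wait"            # still within the normal response window
```

With max days set to 7, a command sent 10 days ago is resent and one
sent 15 days ago causes the user to be flagged as inactive.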
(3) display-cgi # public viewing of database
    verify that key is valid
    if key is in database, generate display form, including search field
    if key is not in database, generate search form
(4) entry-cgi # public update of database
    verify that key is valid
    if key is in database, generate edit form filled with data
    if key is not in database, generate empty edit form
    clicking the send button sends the user-filled form to the
    database server, along with e-mail address of the user
(5) challenge-cgi # public selection of challenge word
    generate list of words with status UNKNOWN (the challenge page)
    sorted in reverse order of priority. Each word linked to entry
    form with that word as the key
(6) verify-cgi # editor quality check of entries on a per-user basis
    verify that access is by an editor
    if user is not in database, generate form with user key only
    else generate list of words entered/modified by that user
Specific to dictionary building
-------------------------------
(1) supporting data tables
    document table:
        KEY:document-id
        source       where the document came from
        date         when the document was written
        author
        title
    sentence table:
        KEY:sentence-id
        text         text of the sentence
        document-id  source
    concordance:
        KEY:word
        KEY:sentence-id
(2) dictionary entry tables
    ## don't know what to put in the dictionary yet
(3) process-message program:
    ...
    else if data-document
        add new document entry
        for each sentence
            add new sentence
            for each word in sentence
                unmorph word [maybe??]
                if new, add new word
                increment frequency
                if frequency is small and word is not junk,
                    add to concordance
    ...
    subroutine next-command
        ...
        retrieve all entries for the word in the concordance table,
        with text substituted for sentence-id, and author-title
        substituted for document-id
        append to message before sending
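
To make the data-document loop concrete, here is a small Python sketch
of the frequency count and concordance. It is only an illustration: the
word splitting is a crude letters-only regular expression (so "l'homme"
becomes "l" and "homme"), there is no unmorphing or junk filtering, and
build_concordance and max_frequency are names I made up. A sentence's
index in the input list stands in for sentence-id.

```python
import re
from collections import Counter, defaultdict

# Runs of letters only (Unicode-aware); a crude stand-in for real tokenizing.
WORD_RE = re.compile(r"[^\W\d_]+")


def build_concordance(sentences, max_frequency=3):
    """Count words and keep sentence ids for words that are still rare."""
    frequency = Counter()
    concordance = defaultdict(list)
    for sentence_id, text in enumerate(sentences):
        for word in WORD_RE.findall(text.lower()):
            frequency[word] += 1
            # Avoid listing the same sentence twice for a repeated word.
            already_listed = concordance[word][-1:] == [sentence_id]
            if frequency[word] <= max_frequency and not already_listed:
                concordance[word].append(sentence_id)
    return frequency, concordance
```

Once a word has been seen more than max_frequency times, no further
sentences are stored for it, which keeps common words like "le" from
swamping the table.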