
Re: [school-discuss] most frequently used words



At 15:09 2002.01.28 -0800, Jeremy C. Reed wrote:
>I am looking for some easy ways to figure out the most commonly used
>words (in English).
>
>But, I would like to categorize them by nouns, verbs, article, pronouns,
>conjunctions, etc.
>
>Does anyone know of any dictionary software that can be used on an Unix
>command-line that can help?

At http://www.georgetown.edu/cball/ling361/index.html is a syllabus for Kathryn B. Taylor's "LING-361 INTRODUCTION TO COMPUTATIONAL LINGUISTICS". It seems the whole course, including assignments, is on-line.

In Part 4, "E-Text and Simple Text Processing", Catherine N. Ball walks students through a process for calculating word frequencies and tagging words by part of speech, first using common Unix tools (also available on Linux), then building Perl scripts to automate the calculations. Source code is included.
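The Unix-tools half of that exercise boils down to the classic one-liner pipeline: fold case, split the text into one word per line, then count. A sketch (the course's own scripts may differ in details; sample.txt here is just invented input):

```shell
# Word frequencies with standard Unix tools, most frequent first.
printf 'The cat and the dog saw the cat\n' > sample.txt  # invented sample input

tr 'A-Z' 'a-z' < sample.txt |  # fold everything to lowercase
  tr -sc 'a-z' '\n' |          # turn each run of non-letters into one newline
  sed '/^$/d' |                # drop any empty lines left over
  sort |                       # group identical words together
  uniq -c |                    # count each group
  sort -rn                     # sort by count, descending
```

For the sample input the top line is "3 the", followed by "2 cat" and the three single-occurrence words.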

I tried only a bit of it on Red Hat 7.2 and found a few incompatibilities in file locations, but those were easily overcome. All the required apps were available.

There is also an on-line version at http://www.georgetown.edu/cball/webtools/web_freqs.html. Below is the output of that URL for this reply, up to here -- just for the fun of it. ;)

Cheers,
ADd

Text name: SchoolForge Reply
Date/time: 1/29/2002 16:8
Word count: 178
Unique words: 134
Sort order: descending
5   IS
5   THE
5   TO
4   A
4   FOR
4   IN
3   AND
3   AT
3   I
3   OF
3   THAT
2   ALSO
2   AN
2   AVAILABLE
2   BE
2   BUT
2   BY
2   CAN
2   IT
2   ON
2   ON-LINE
2   UNIX
2   USED
2   WORDS
1   0800
1   15:09
1   2002.01.28
1   4
1   7.2
1   ALL
1   AM
1   ANY
1   ANYONE
1   APPS
1   ARTICLE
1   ASSIGNMENTS
1   AUTOMATE
1   B
1   BALL
1   BELOW
1   BIT
1   BUILDING
1   C
1   CALCULATING
1   CALCULATIONS
1   CATEGORIZE
1   CATHERINE
1   CODE
1   COMMAND-LINE
1   COMMOM
1   COMMONLY
1   COMPUTATIONAL
1   CONJUNCTIONS
1   COULD
1   COURSE
1   DICTIONARY
1   DOES
1   E-TEXT
1   EASILY
1   EASY
1   ENGLISH
1   ETC
1   FEW
1   FIGURE
1   FILE
1   FIRST
1   FOUND
1   FREQUENCIES
1   HAT
1   HELP
1   HERE
1   HTTP://WWW.GEORGETOWN.EDU/CBALL/LING361/INDEX.HTML
1   HTTP://WWW.GEORGETOWN.EDU/CBALL/WEBTOOLS/WEB_FREQS.HTML
1   INCLUDED
1   INCLUDING
1   INCOMPATIBILITIES
1   INTRODUCTION
1   JEREMY
1   KATHRYN
1   KNOW
1   LIKE
1   LING-361
1   LINGUISTICS
1   LINUX
1   LOCATIONS
1   LOOKING
1   MOST
1   N
1   NOUNS
1   ONLY
1   OUT
1   OUTPUT
1   OVERCOME
1   PART
1   PARTS-OF-SPEECH
1   PERL
1   PROCESS
1   PROCESSING
1   PRONOUNS
1   RED
1   REED
1   REPLY
1   REQUIRED
1   SCRIPTS
1   SEEMS
1   SIMPLE
1   SOFTWARE
1   SOME
1   SOURCE
1   STUDENTS
1   SYLLABUS
1   TAGGING
1   TAYLOR'S
1   TEXT
1   THEM
1   THEN
1   THERE
1   THIS
1   THOSE
1   THROUGH
1   TOOLS
1   TRIED
1   UP
1   URL
1   USING
1   VERBS
1   VERSION
1   WALKS
1   WAYS
1   WERE
1   WHOLE
1   WORD
1   WOULD
1   WROTE 
Processing time: 0.04 CPU seconds.

At 15:09 2002.01.28 -0800, Jeremy C. Reed wrote:
>I am looking for some easy ways to figure out the most commonly used
>words (in English).
>
>But, I would like to categorize them by nouns, verbs, article, pronouns,
>conjunctions, etc.
>
>Does anyone know of any dictionary software that can be used on an Unix
>command-line that can help?
>
>Such as some tool like:
>
>  $ the-dictionary -t frog
>  noun
>  $ the-dictionary -t ahdsjkhgfe
>  [not in dictionary]
>  $
>
>(I already can build a list of frequently used words from miscellanous
>emails, and HTML and txt docs on my system.)
>
>My plan is to build categorized lists of top words for reading practice.
>
>   Jeremy C. Reed
>   http://www.reedmedia.net/
>
>p.s. for example, frequently used words (not categorized):
>
>7.6% the
>3.0% to
>2.6% a
>2.5% of
>2.3% and
>2.0% is
>1.7% in
>1.5% for
>1.0% this
>1.0% that
>1.0% be
>0.8% with
>0.8% if
>0.7% or
>0.7% it
>0.7% are
>0.6% you
>0.6% on
>0.6% not
>0.6% by
>0.6% as
>0.5% from
>0.5% an
>0.4% will
>0.4% which
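
P.S. The lookup tool sketched above could be approximated with nothing fancier than a tagged word list and awk. A sketch -- lexicon.txt and its two-column word/tag format are invented for illustration, not an existing package; a real run would substitute a full tagged lexicon:

```shell
# Approximate "the-dictionary -t WORD" with a tagged word list and awk.
# lexicon.txt is a made-up two-column lexicon: word, then part of speech.
cat > lexicon.txt <<'EOF'
frog noun
run verb
the article
EOF

word=frog
tag=$(awk -v w="$word" '$1 == w { print $2 }' lexicon.txt)
echo "${tag:-[not in dictionary]}"   # prints the tag, or a fallback message
```

Looking up "frog" prints "noun"; a word missing from the list falls through to "[not in dictionary]", matching the interface Jeremy described.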