Re: [school-discuss] most frequently used words
At 15:09 2002.01.28 -0800, Jeremy C. Reed wrote:
>I am looking for some easy ways to figure out the most commonly used
>words (in English).
>
>But, I would like to categorize them by nouns, verbs, article, pronouns,
>conjunctions, etc.
>
>Does anyone know of any dictionary software that can be used on a Unix
>command-line that can help?
At http://www.georgetown.edu/cball/ling361/index.html is a syllabus for Kathryn B. Taylor's "LING-361 INTRODUCTION TO COMPUTATIONAL LINGUISTICS". It seems the whole course, including assignments, is on-line.
In Part 4, "E-Text and Simple Text Processing", Catherine N. Ball walks students through a process for calculating word frequencies and tagging words by parts-of-speech, first using common Unix tools (also available in Linux), then building Perl scripts to automate the calculations. Source code is included.
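For a taste of the word-frequency half of that process without visiting the course pages, the whole thing boils down to the classic Unix pipeline. This is a generic sketch, not Ball's exact commands, and the sample file name is made up for the example:

```shell
# Create a tiny sample text so the sketch runs on its own
# (sample.txt is just a placeholder name).
cat > sample.txt <<'EOF'
The cat sat on the mat. The mat was flat.
EOF

# Split into one word per line, fold case (matching the web tool's
# all-caps output), count duplicate words, list most frequent first.
tr -sc 'A-Za-z' '\n' < sample.txt |
  tr 'a-z' 'A-Z' |
  sort |
  uniq -c |
  sort -rn
```

On the sample above, the top line of output is "3 THE", since "the" appears three times.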
I only tried a bit of it on Red Hat 7.2 and found a few incompatibilities in file locations, but those could be easily overcome. All the apps required were available.
There is also an on-line version at http://www.georgetown.edu/cball/webtools/web_freqs.html. Below is the output of that URL for this reply, up to here -- just for the fun of it. ;)
Cheers,
ADd
Text name: SchoolForge Reply
Date/time: 1/29/2002 16:8
Word count: 178
Unique words: 134
Sort order: descending
5 IS
5 THE
5 TO
4 A
4 FOR
4 IN
3 AND
3 AT
3 I
3 OF
3 THAT
2 ALSO
2 AN
2 AVAILABLE
2 BE
2 BUT
2 BY
2 CAN
2 IT
2 ON
2 ON-LINE
2 UNIX
2 USED
2 WORDS
1 0800
1 15:09
1 2002.01.28
1 4
1 7.2
1 ALL
1 AM
1 ANY
1 ANYONE
1 APPS
1 ARTICLE
1 ASSIGNMENTS
1 AUTOMATE
1 B
1 BALL
1 BELOW
1 BIT
1 BUILDING
1 C
1 CALCULATING
1 CALCULATIONS
1 CATEGORIZE
1 CATHERINE
1 CODE
1 COMMAND-LINE
1 COMMOM
1 COMMONLY
1 COMPUTATIONAL
1 CONJUNCTIONS
1 COULD
1 COURSE
1 DICTIONARY
1 DOES
1 E-TEXT
1 EASILY
1 EASY
1 ENGLISH
1 ETC
1 FEW
1 FIGURE
1 FILE
1 FIRST
1 FOUND
1 FREQUENCIES
1 HAT
1 HELP
1 HERE
1 HTTP://WWW.GEORGETOWN.EDU/CBALL/LING361/INDEX.HTML
1 HTTP://WWW.GEORGETOWN.EDU/CBALL/WEBTOOLS/WEB_FREQS.HTML
1 INCLUDED
1 INCLUDING
1 INCOMPATIBILITIES
1 INTRODUCTION
1 JEREMY
1 KATHRYN
1 KNOW
1 LIKE
1 LING-361
1 LINGUISTICS
1 LINUX
1 LOCATIONS
1 LOOKING
1 MOST
1 N
1 NOUNS
1 ONLY
1 OUT
1 OUTPUT
1 OVERCOME
1 PART
1 PARTS-OF-SPEECH
1 PERL
1 PROCESS
1 PROCESSING
1 PRONOUNS
1 RED
1 REED
1 REPLY
1 REQUIRED
1 SCRIPTS
1 SEEMS
1 SIMPLE
1 SOFTWARE
1 SOME
1 SOURCE
1 STUDENTS
1 SYLLABUS
1 TAGGING
1 TAYLOR'S
1 TEXT
1 THEM
1 THEN
1 THERE
1 THIS
1 THOSE
1 THROUGH
1 TOOLS
1 TRIED
1 UP
1 URL
1 USING
1 VERBS
1 VERSION
1 WALKS
1 WAYS
1 WERE
1 WHOLE
1 WORD
1 WOULD
1 WROTE
Processing time: 0.04 CPU seconds.
At 15:09 2002.01.28 -0800, Jeremy C. Reed wrote:
>I am looking for some easy ways to figure out the most commonly used
>words (in English).
>
>But, I would like to categorize them by nouns, verbs, article, pronouns,
>conjunctions, etc.
>
>Does anyone know of any dictionary software that can be used on a Unix
>command-line that can help?
>
>Such as some tool like:
>
> $ the-dictionary -t frog
> noun
> $ the-dictionary -t ahdsjkhgfe
> [not in dictionary]
> $
>
>(I already can build a list of frequently used words from miscellaneous
>emails, and HTML and txt docs on my system.)
>
>My plan is to build categorized lists of top words for reading practice.
>
> Jeremy C. Reed
> http://www.reedmedia.net/
>
>p.s. for example, frequently used words (not categorized):
>
>7.6% the
>3.0% to
>2.6% a
>2.5% of
>2.3% and
>2.0% is
>1.7% in
>1.5% for
>1.0% this
>1.0% that
>1.0% be
>0.8% with
>0.8% if
>0.7% or
>0.7% it
>0.7% are
>0.6% you
>0.6% on
>0.6% not
>0.6% by
>0.6% as
>0.5% from
>0.5% an
>0.4% will
>0.4% which
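Incidentally, the percentage-style list in Jeremy's p.s. can be approximated by bolting a small awk step onto the same counting pipeline. A sketch, with a stand-in corpus file (corpus.txt is made up for the example, not from the original post):

```shell
# Tiny stand-in corpus so the sketch runs on its own.
cat > corpus.txt <<'EOF'
the cat and the dog and the bird
EOF

# Count words, then let awk convert raw counts into percentages
# of the total word count, and list most frequent first.
tr -sc 'A-Za-z' '\n' < corpus.txt |
  tr 'A-Z' 'a-z' |
  sort |
  uniq -c |
  awk '{ count[$2] = $1; total += $1 }
       END { for (w in count)
               printf "%.1f%% %s\n", 100 * count[w] / total, w }' |
  sort -rn
```

On the eight-word stand-in corpus, "the" occurs three times, so the first line of output is "37.5% the".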