
Re: [seul-edu] [Fwd: "If you know of such a lexical analysis program, ..."]



Hi,

I'm sure this could be done in perl, but isn't it just a one-liner in
the shell?

(echo "awk '";sed 's;^\(.*\)$;/\1/;' < wordlist;echo "' <textfile")|sh

where the file wordlist contains the words you seek, one per line, and
the file textfile is the text you want to search.  The above command
is a bit like a generalised grep (I could have used sed or egrep in
place of awk with minor changes to the line - that would be faster but
even more obscure :-).
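
To make the trick less obscure, here is a small sketch that prints the
awk program the one-liner builds before handing it to sh.  The sample
contents of wordlist and textfile, the scratch directory and the
variable names are my own assumptions for illustration; only the
pipeline itself is the one given above.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# Hypothetical sample data (not supplied in the original message).
printf 'ped\ncor\ncit\n' > wordlist
printf 'The pedestrian crossed.\nNothing to see.\nA citizen of the core.\n' > textfile

# Build the awk command as text: every word in wordlist becomes a /word/ pattern.
generated=$( (echo "awk '"; sed 's;^\(.*\)$;/\1/;' < wordlist; echo "' <textfile") )
echo "$generated"

# Feed the generated text to sh, which is what the one-liner does.
matches=$(echo "$generated" | sh)
echo "$matches"
```

With this sample data the generated program has one pattern per line
between the two quote lines, and the third text line comes out twice
because it matches both cor and cit.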

This prints out the full line containing the word - you could add the
string {print NR} to each awk line, to print the line number of the
occurrence instead.  Each line is reprinted if several of the words
appear in it - use uniq to get rid of the repeats, or use egrep or sed
in place of awk, or combine all the awk patterns on one line rather
than having one on each line.
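
A rough sketch of two of those variations, again with made-up sample
data.  The first appends {print NR} to every generated pattern and
pipes the result through uniq.  The second is only one possible reading
of "combine on one line": it joins the words with | into a single
regular expression using paste and passes it to awk with -v; the
variable names are assumptions of mine.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# Hypothetical sample data (not supplied in the original message).
printf 'ped\ncor\ncit\n' > wordlist
printf 'The pedestrian crossed.\nNothing to see.\nA citizen of the core.\n' > textfile

# Variation 1: print line numbers; uniq drops the adjacent repeats that
# appear when one line matches several words.
linenos=$( (echo "awk '"; sed 's;^\(.*\)$;/\1/ {print NR};' < wordlist; echo "' <textfile") | sh | uniq )
echo "$linenos"

# Variation 2: one combined pattern (ped|cor|cit), so each matching
# line is reported only once and uniq is not needed.
pattern=$(paste -s -d '|' wordlist)
combined=$(awk -v re="$pattern" '$0 ~ re {print NR}' textfile)
echo "$combined"
```

Both variations report lines 1 and 3 for this sample text.  Note that
uniq only removes adjacent duplicates, which is enough here because
awk emits the repeats for a given line one after another.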

Of course if you want to be fancy then you use the above sed command
to generate input to lex and then you are really in business :-)

Or did you want something totally different?

Gunnar

PS Use sh or bash - not csh or tcsh.

> "Brown, Rodney" wrote:
> 
> > S. Barret Dolph posed an interesting problem:
> >
> >       I would like to be able to check some books for root words.
> >
> >       Example.... look for occurrences of ped, cor, cit, in
> > swift.txt
> >
> >       Would it be possible with Perl, or anything else, to find
> > the occurrences of a list
> > of words, show where those words are, and save this task to a
> > file?
> >
> > While not a direct solution, I believe the stemmer in  the GPL
> > program mg
> > <http://www.cs.mu.oz.au/mg/> as described in
> > "The second edition of Managing Gigabytes: Compressing and
> > Indexing Documents and Images
> > by Ian H. Witten, Alistair Moffat, and Timothy C. Bell, is now
> > available (May 1999),
> > published by Morgan Kaufmann Publishing, San Francisco, ISBN
> > 1-55860-570-3."
> > may be a basis for what you want to do. The indexing works on the
> > stemmed words so could
> > go part of the way. I have a copy of mg-1.3f from somewhere
> > (possibly New Zealand) too
> > so the mg-1.2.1 source linked off the page may not be the latest
> > available.
> >
> > While I haven't gone looking, I thought tools for generating
> > concordances etc had been
> > around for ages... You may get more help from the Information
> > Retrieval community
> > (ACM SIGIR for example).
> 
> --
> Doug Loss                 God is a comedian playing
> Data Network Coordinator  to an audience too afraid
> Bloomsburg University     to laugh.
> dloss@bloomu.edu                Voltaire
> 

-- 
-------------------------------------------------------------------------------
URL     http://www.hafro.is/~gunnar          http://www.hi.is/~gunnar
E-mail  gunnar@hafro.is                      gunnar@hi.is
        Marine Research Institute            Univ. of Iceland
        P.O. Box 1390                        Science Institute
        121 Reykjavik, Iceland               Dunhaga 7, 101 Reykjavik, Iceland
Phone   +354-552-0240                        +354-525-5915
FAX     +354-562-3790
Motto: Don't call, don't knock, use e-mail