Re: [seul-edu] [Fwd: Child Internet protection act CIPA]



on Tue, Nov 13, 2001 at 07:07:31PM +1030, Andrew Reid (andrew.reid@plug.cx) wrote:
> On Mon, Nov 12, 2001 at 10:49:57PM -0800, Karsten M. Self wrote:
> 
> > My fix would be a squid proxy, with logs checked reasonably frequently.

<...>

> > I like cutting through the logs with a script that returns just the host
> > for a given site -- fewer pages you have to look at.  Most sex sites
> > aren't buried deep in the URL anyway -- the first page should give you
> > a pretty good idea of what's there, though, naturally, if there's a lot
> > of traffic at some deep level, this may bear investigating.
> 
> That's a good idea. Can you send us a copy of the script to look at?
> Something that runs as a CRON job every night (emailing the results to 
> the librarian or something) that lists users that have been doing
> suspicious things would be good. That could then be tied into
> something that lists all the domains that users have been accessing
> when further investigation is required.

I just use a one-off.  I don't filter by dates, and don't know what
Squid's date format is (seconds from some offset, possibly epoch...yeah,
looks like).

Anyway:

    # access.log is Squid's access log; adjust the path for your setup
    $ sed -ne '/^.*GET http:\/\//s///p' -e '/^.*POST http:\/\//s///p' access.log |
	sed -e '/\/.*/s///' | 
	sort |
	uniq -c |
	sort -rn | 
	cat -n

...will output a list of domains, ordered by hit count.  You'd want to
add an appropriate expression as a prefilter to match the dates of
interest to do time-limited searches.  This will catch sites which are
predominantly inappropriate, but might miss inappropriate content being
served from a site with a diverse class of content.
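
A date prefilter along those lines might look like the sketch below.  It
assumes Squid's native log format, where the first field is the request
time in seconds since the epoch (with a fractional part).  The function
name squid_between is made up for illustration, and the log path in the
usage comment is an assumption about a typical install.

```shell
# Hypothetical helper: keep only log lines whose first field (the epoch
# timestamp, e.g. 1005640651.123) falls in the half-open range
# [start, end).  Reads a native-format access.log on stdin.
squid_between() {
    awk -v start="$1" -v end="$2" \
        '$1 + 0 >= start + 0 && $1 + 0 < end + 0'
}

# Usage sketch (GNU date assumed for -d; log path is an assumption):
#   start=$(date -d '2001-11-01' +%s)
#   end=$(date -d '2001-11-13' +%s)
#   squid_between "$start" "$end" < /var/log/squid/access.log
```

The output is still ordinary log lines, so it can be piped straight into
the sed pipeline above in place of the raw log.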

It takes a while to cut through logs -- I'm timing performance on a
P-200 and PII-233 right now.  It's going to want at least several
minutes.  2:35.435s on the PII-233, for 178,800 lines of log.

My own top ten sites are:

     1    81282 www.fuckedcompany.com
     2     9187 z.iwethey.org
     3     7810 http.us.debian.org
     4     4868 non-us.debian.org
     5     4051 www.sfgate.com
     6     3242 www.theregister.co.uk
     7     2944 graphics.nytimes.com
     8     2847 www.economist.com
     9     2097 www.zdnetasia.com
    10     1880 us.news1.yimg.com

There are 1876 distinct sites listed; this covers a month or so.  Squid
*doesn't* cache SSL references.
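
SSL traffic also won't show up in the GET/POST list above, since Squid
logs it as CONNECT requests to host:port rather than as http:// URLs.
A rough sketch for counting those hosts separately follows.  It assumes
the native log format (method in field 6, host:port in field 7), and
connect_hosts is a made-up name.

```shell
# Hypothetical helper: count CONNECT requests per host, busiest first.
# Assumes Squid's native access.log format on stdin, where field 6 is
# the request method and field 7 is host:port for CONNECT requests.
connect_hosts() {
    awk '$6 == "CONNECT" { sub(/:[0-9]+$/, "", $7); print $7 }' |
        sort |
        uniq -c |
        sort -rn
}

# Usage sketch (log path is an assumption):
#   connect_hosts < /var/log/squid/access.log | cat -n
```

Only the host is visible for these requests, so there are no paths to
inspect, just hit counts.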

Annotating:

  - FuckedCompany uses a meta-refresh tag.  This greatly skews results
    (and annoys me).  Note that it constitutes nearly half of the results
    for this period.  This is the result of a browser window standing
    open, but unviewed, in my browser, for hours of a day.  Little or no
    reflection of reality.  Pud's scoring the banner hits though (well,
    except for Junkbuster....).

  - IWETHEY is a forum site.

  - I've *no* idea what http.us.debian.org and non-us.debian.org are ;-)

  - The remaining sites are news oriented.

Peace.

-- 
Karsten M. Self <kmself@ix.netcom.com>       http://kmself.home.netcom.com/
 What part of "Gestalt" don't you understand?             Home of the brave
  http://gestalt-system.sourceforge.net/                   Land of the free
   Free Dmitry! Boycott Adobe! Repeal the DMCA! http://www.freesklyarov.org
Geek for Hire                     http://kmself.home.netcom.com/resume.html
