Re: [seul-edu] [Fwd: Child Internet protection act CIPA]
on Tue, Nov 13, 2001 at 07:07:31PM +1030, Andrew Reid (andrew.reid@plug.cx) wrote:
> On Mon, Nov 12, 2001 at 10:49:57PM -0800, Karsten M. Self wrote:
>
> > My fix would be a squid proxy, with logs checked reasonably frequently.
<...>
> > I like cutting through the logs with a script that returns just the host
> > for a given site -- fewer pages you have to look at. Most sex sites
> > aren't buried deep in the URL anyway -- the first page should give you
> > a pretty good idea of what's there, though, naturally, if there's a lot
> > of traffic at some deep level, this may bear investigating.
>
> That's a good idea. Can you send us a copy of the script to look at?
> Something that runs as a CRON job every night (emailing the results to
> the librarian or something) that lists users that have been doing
> suspicious things would be good. That could then be tied into
> something that lists all the domains that users have been accessing
> when further investigation is required.
I just use a one-off. I don't filter by dates, and didn't know what
Squid's timestamp format is (seconds from some offset, possibly
epoch...yeah, looks like epoch seconds, with milliseconds).
Anyway:
$ sed -ne 's/^.*GET http:\/\///p' -e 's/^.*POST http:\/\///p' /var/log/squid/access.log |
sed -e 's/\/.*//' |
sort |
uniq -c |
sort -rn |
cat -n
...will output a ranked list of domains, busiest first. You'd want to
add an appropriate expression as a prefilter, matching the dates of
interest, to do time-mediated searches. This approach will flag sites
that are predominantly inappropriate, but may miss inappropriate
content served from a site with a diverse mix of content.
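A sketch of such a date prefilter (the log path and date below are only
examples; assumes GNU date, and Squid's native log format, where the
first field is the request time in epoch seconds):

```shell
# Sketch: keep only Squid access.log lines at/after a given date.
# Field 1 of the native format is epoch seconds (with milliseconds),
# so a plain numeric comparison in awk suffices.
# Assumes GNU date; default path below is only an example.
squid_since() {  # usage: squid_since YYYY-MM-DD [logfile]
    awk -v since="$(date -d "$1" +%s)" '$1 >= since' \
        "${2:-/var/log/squid/access.log}"
}

# e.g. feed the filtered lines into the domain-counting pipeline:
#   squid_since 2001-11-01 | sed -ne 's/^.*GET http:\/\///p' ...
```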
It takes a while to cut through logs -- I'm timing performance on a
P-200 and a PII-233 right now. It wants at least several minutes:
2:35.435 on the PII-233, for 178,800 lines of log.
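For the nightly cron job Andrew asked about, one possible wrapper (the
function, report size, script path, and mail address are all
hypothetical; assumes a working mail(1)):

```shell
#!/bin/sh
# Hypothetical nightly report: top N domains from a Squid access.log.
# Same one-off pipeline as above, with sort -rn to put the busiest
# domains first.
squid_report() {  # usage: squid_report logfile [n]
    sed -ne 's/^.*GET http:\/\///p' \
        -e 's/^.*POST http:\/\///p' "$1" |
        sed -e 's/\/.*//' |
        sort | uniq -c | sort -rn | head -"${2:-20}"
}

# From cron, mail it to the librarian (address is an example):
#   5 0 * * *  /usr/local/sbin/squid-report | mail -s "Squid top domains" librarian@example.org
```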
My own top ten sites are:
1 81282 www.fuckedcompany.com
2 9187 z.iwethey.org
3 7810 http.us.debian.org
4 4868 non-us.debian.org
5 4051 www.sfgate.com
6 3242 www.theregister.co.uk
7 2944 graphics.nytimes.com
8 2847 www.economist.com
9 2097 www.zdnetasia.com
10 1880 us.news1.yimg.com
There are 1876 distinct sites listed; this covers a month or so. Squid
*doesn't* cache SSL references.
Annotating:
- FuckedCompany uses a meta-refresh tag. This greatly skews results
(and annoys me). Note that it constitutes nearly half of the results
for this period -- the result of a browser window standing open, but
unviewed, for hours a day. Little or no reflection of reality. Pud's
scoring the banner hits, though (well, except for Junkbuster...).
- IWETHEY is a forum site.
- I've *no* idea what http.us.debian.org and non-us.debian.org are ;-)
- The remaining sites are news oriented.
Peace.
--
Karsten M. Self <kmself@ix.netcom.com> http://kmself.home.netcom.com/
What part of "Gestalt" don't you understand? Home of the brave
http://gestalt-system.sourceforge.net/ Land of the free
Free Dmitry! Boycott Adobe! Repeal the DMCA! http://www.freesklyarov.org
Geek for Hire http://kmself.home.netcom.com/resume.html