
Re: [tor-dev] Sanitizing and publishing our web server logs



On 8/25/11 10:08 AM, Karsten Loesing wrote:
> we have been discussing sanitizing and publishing our web server logs
> for quite a while now.  The idea is to remove all potentially sensitive
> parts from the logs, publish them in monthly tarballs on the metrics
> website, and analyze them for top visited pages, top downloaded
> packages, etc.  See the tickets #1641 and #2489 for details.
> 
> Here's a suggested sanitizing procedure for our web logs, which are in
> Apache's combined log format:
> 
>  - Ignore everything except GET requests.
>  - Ignore all requests that resulted in a 404 status code.
>  - Rewrite log lines so that they only contain the following fields:
>    - IP address 0.0.0.0 for HTTP requests or 0.0.0.1 for HTTPS requests
> (as logged by our Apache configuration),
>    - the request date (with the time part set to 00:00:00),
>    - the requested URL (cut off at the first encountered "?"),
>    - the HTTP version,
>    - the server's HTTP status code, and
>    - the size of the returned object.
>  - Write all lines from a given virtual host and day to a single output
> file.
>  - Sort the output file alphanumerically to conceal the original order
> of requests.
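
For readers who want to experiment with the procedure quoted above, here
is a minimal Python sketch of it.  It is not the script that produced
the tarballs below.  The names sanitize_line, sanitize_log, LINE_RE, and
ALLOWED_IPS are made up for illustration, and the sketch assumes that
the "- -" placeholders and the timezone offset are kept so the output
still parses as a log line, that one input corresponds to one virtual
host, and that lines which do not parse are simply dropped.

```python
import re
from collections import defaultdict

# Leading fields of Apache's combined log format:
# host ident user [time] "request" status size (Referer and User-Agent follow).
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<date>[^:\]]+):\d{2}:\d{2}:\d{2} (?P<tz>[+-]\d{4})\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+) (?P<version>HTTP/\d\.\d)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# The Apache configuration is said to log only these two addresses.
ALLOWED_IPS = {"0.0.0.0", "0.0.0.1"}


def sanitize_line(line):
    """Return (day, sanitized_line), or None if the line is to be ignored."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # assumption: unparseable lines are dropped
    if m.group("method") != "GET" or m.group("status") == "404":
        return None
    if m.group("ip") not in ALLOWED_IPS:
        return None  # assumption: never pass through an unexpected address
    url = m.group("url").split("?", 1)[0]  # cut off at the first "?"
    sanitized = '%s - - [%s:00:00:00 %s] "GET %s %s" %s %s' % (
        m.group("ip"), m.group("date"), m.group("tz"), url,
        m.group("version"), m.group("status"), m.group("size"))
    return m.group("date"), sanitized


def sanitize_log(lines):
    """Group sanitized lines by day and sort each day's lines."""
    by_day = defaultdict(list)
    for line in lines:
        result = sanitize_line(line)
        if result is not None:
            day, sanitized = result
            by_day[day].append(sanitized)
    # Sorting conceals the original order of requests.
    return {day: sorted(entries) for day, entries in by_day.items()}
```

Each value in the returned dict would be written to one output file for
that virtual host and day.  Referer and User-Agent are discarded because
the regular expression never captures them.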

Pushing this forward.  Here are the sanitized web logs for all of 2010
(155M), covering all our web servers and virtual hosts.  These are the
logs that we'd like to publish on a daily basis:

http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-01.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-02.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-03.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-04.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-05.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-06.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-07.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-08.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-09.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-10.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-11.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-12.tar

The webalizer output for www.torproject.org can be viewed here:

http://freehaven.net/~karsten/volatile/www.torproject.org-webalizer/

So.  Is it safe to publish these logs on a daily basis?  The same
questions from my original mail apply here:

> Is there still anything sensitive in that log file that we should
> remove?  For example:
>  - Do the logs reveal how many pages were cached already on the
> requestor's side (e.g. as repeat accesses)?  Note that log files are
> sorted before being published.
>  - Are there other concerns about making these sanitized log files
> publicly available?
> 
> Are the decisions to remove parts from the logs reasonable?  In particular:
>  - Do we have to take out all requests with 404 status codes?  Some of
> these requests for non-existing URLs contain typos which may not be safe
> to make public.  Should we instead put in some placeholder for the URL
> part and keep the 404 lines to know how many 404s we have per day?
>  - Is there any good reason to keep the portion of a URL after a "?"?
>  - Is it possible to leave some part of Referers in the logs that helps
> us figure out where our traffic originates and what search terms people
> use to find us?
>  - Can we resolve client IP addresses to country codes and include those
> in the logs instead of our 0.0.0.0/0.0.0.1 code for HTTP/HTTPS?  How
> would we handle countries with only a few users per day, e.g., should
> there be a threshold below which we consider requests to come from "a
> country with fewer than XY users"?
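
To make the threshold idea in the last question concrete, here is a
rough Python sketch.  It is only one possible reading, not something we
have implemented: the function name coarsen_countries, the "??"
placeholder, and the use of distinct client IP addresses as a stand-in
for "users" are all assumptions, and the country codes are taken as
given rather than looked up in any particular GeoIP database.

```python
from collections import defaultdict


def coarsen_countries(requests, min_users, placeholder="??"):
    """Replace rare country codes with a placeholder.

    requests: iterable of (client_ip, country_code) pairs for one day.
    Returns one country label per request, in input order.
    """
    requests = list(requests)
    # Approximate "users" by distinct client addresses per country.
    users = defaultdict(set)
    for ip, country in requests:
        users[country].add(ip)
    return [
        country if len(users[country]) >= min_users else placeholder
        for _ip, country in requests
    ]
```

For example, with min_users=2, a day with two distinct addresses from
"de" and a single address from "is" keeps "de" and turns "is" into "??".
The client addresses would only be used for counting and would not
appear in the published lines.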

The next steps will be to make these sanitized logs available on a daily
basis and to publish the sanitized archives from 2008, 2009, and 2011.

I'm going to wait another week (probably longer) for feedback before
taking these next steps.

Best,
Karsten
_______________________________________________
tor-dev mailing list
tor-dev@xxxxxxxxxxxxxxxxxxxx
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev