Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors
#23243: write a spec for web-server-access log descriptors
-------------------------------------+------------------------------
Reporter: iwakeh | Owner: metrics-team
Type: enhancement | Status: needs_review
Priority: Medium | Milestone:
Component: Metrics/Metrics website | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-------------------------------------+------------------------------
Comment (by karsten):
Okay, I tried to specify that, but please
[https://trac.torproject.org/projects/tor/attachment/ticket/23243/webstats-spec.3.txt review carefully].
The part that made this a bit more complex is that there are actually two
places where we need to look at dates/times: 1) when deciding whether to
discard lines that are too old or too new and 2) when deciding when to
publish a sanitized file and never touch it again. Maybe I overcomplicated
this, so if you see a way to simplify what I wrote, please say so!
Here's the diff, if that helps with reviewing:
{{{
diff --git a/webstats-spec.txt b/webstats-spec.txt
index 7e46449..48c0287 100644
--- a/webstats-spec.txt
+++ b/webstats-spec.txt
@@ -3,7 +3,6 @@ Tor webserver logs
Next steps:
- Replace webserver with web server which seems to be Less Bad English
(karsten).
- - Find out what exact delay we'll need for publishing sanitized logs
(iwakeh?)
- Turn this document into XML (karsten)
- Code the decisions (iwakeh)
- Try out the code on actual logs (iwakeh; karsten can make more logs
available)
@@ -30,6 +29,8 @@ LogFormat "0.0.0.2 - %u %{[%d/%b/%Y:00:00:00 %z]}t
\"%r\" %>s %b \"%{Referer}i\"
The main difference to Apache's Common Log Format is that request IP
addresses are removed and the field is instead used to encode whether the
request came in via http:// (0.0.0.0), via https:// (0.0.0.1), or via the
site's onion service (0.0.0.2).
+Tor's webservers are configured to use UTC as timezone, which is also
highly recommended when rewriting request times to "00:00:00" in order for
the subsequent sanitizing steps to work correctly. Alternatively, if the
system timezone is not set to UTC, webservers should keep request times
unchanged and let them be handled by the subsequent sanitizing steps.
+
Tor's webservers are configured to rotate logs at least once per day,
which does not necessarily happen at 00:00:00 UTC. As a result, log files
may contain requests from up to two UTC days and several log files may
contain requests that have been started on the same UTC day.
All access log files written by Tor's webservers follow the naming
convention <hostname>.torproject.org-access.log-YYYYMMDD.
@@ -48,6 +49,8 @@ Log files are expected to contain exactly 1 request per
line. We process these f
- Lines begin with Apache's Common Log Format ("%h %l %u %t \"%r\" %>s
%b") or a compatible format like one of Tor's privacy formats. It is
acceptable if lines start with a format that is compatible to the Common
Log Format and continue with additional fields. Those additional fields
will later be discarded, but the line will not be discarded because of
them.
- The request IP address starts with "0.0.0.", followed by any number
between 0 and 255.
+ - The time the request was received does not lie in the future.
+ - The date the request was received, after converting the request time
to UTC, does not lie more than 1 day in the past. (Bulk imports of
archived logs are exempt from this requirement.)
- The request protocol is HTTP.
- The request method is either GET or HEAD.
- The final status of the request is neither 400 ("Bad Request") nor 404
("Not Found").
@@ -80,9 +83,7 @@ Sanitized log files may additionally be sorted into
directories by virtual host
<virtual-host>/YYYY/MM/<virtual-host>-<physical-host>-access.log-
YYYYMMDD[.xz]
-Due to the fact that the date when a log file was rotated and the start
date of contained requests may not always overlap, we need to delay
publishing sanitized log files until all log files containing requests
from that date are guaranteed to be processed. After this delay, the
sanitized log files are published and not further modified.
-
-XXX What's the delay? End of UTC day + 24 hours? Check current script!
+Due to the fact that the date when a log file was rotated and the start
date of contained requests may not always overlap, we need to delay
publishing sanitized log files until the start date of requests in UTC
plus 2 days. After this delay, all log files containing requests from that
date are assumed to be processed. Sanitized log files are published and
not further modified in the future. (Again, bulk imports of archived logs
are exempt from this.)
As last and certainly not least important sanitizing step, all rewritten
log lines are sorted alphabetically, so that request order cannot be
inferred from sanitized log files.
}}}
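For illustration, the two date decisions in the diff could be sketched
roughly as follows. This is only a sketch under assumptions, not the actual
sanitizing code: the function names is_acceptable and may_publish are made
up, "not more than 1 day in the past" is read as comparing UTC calendar
dates, and "start date plus 2 days" is read as "the current UTC date is at
least two days after the request date". All datetimes are assumed to be
timezone-aware.

```python
from datetime import date, datetime, timedelta, timezone


def is_acceptable(request_time: datetime, now: datetime,
                  bulk_import: bool = False) -> bool:
    """Decision 1 (hypothetical helper): keep or discard a log line.

    Both arguments must be timezone-aware datetimes.
    """
    # A request time in the future is never acceptable,
    # not even during a bulk import.
    if request_time > now:
        return False
    # Bulk imports of archived logs are exempt from the age limit.
    if bulk_import:
        return True
    # Assumption: compare UTC calendar dates, not exact timestamps.
    request_date = request_time.astimezone(timezone.utc).date()
    today = now.astimezone(timezone.utc).date()
    return today - request_date <= timedelta(days=1)


def may_publish(request_date: date, now: datetime) -> bool:
    """Decision 2 (hypothetical helper): publish a sanitized file or wait.

    Assumption: a file for UTC date D is published once the current
    UTC date is D + 2 days or later.
    """
    today = now.astimezone(timezone.utc).date()
    return today >= request_date + timedelta(days=2)


if __name__ == "__main__":
    now = datetime(2017, 9, 20, 12, 0, tzinfo=timezone.utc)
    yesterday = datetime(2017, 9, 19, 0, 0, tzinfo=timezone.utc)
    print(is_acceptable(yesterday, now))          # line is kept
    print(may_publish(date(2017, 9, 19), now))    # file is held back
```

Under this reading, lines for a UTC date D are accepted until the end of
D + 1 and the file for D is published from D + 2 on, so a published file
would not receive further lines outside of bulk imports.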
If you think it's good, I'll continue with the remaining next steps.
Thanks!
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:12>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
_______________________________________________
tor-bugs mailing list
tor-bugs@xxxxxxxxxxxxxxxxxxxx
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs