Re: [tor-bugs] #31916 [Internal Services/Tor Sysadmin Team]: reliability issues with hetzner-nbg1-01
#31916: reliability issues with hetzner-nbg1-01
-------------------------------------------------+-------------------------
Reporter: anarcat | Owner: anarcat
Type: defect | Status: assigned
Priority: Medium | Milestone:
Component: Internal Services/Tor Sysadmin Team | Version:
Severity: Blocker | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-------------------------------------------------+-------------------------
Description changed by anarcat:
Old description:
> The new Prometheus monitoring server (hetzner-nbg1-01.torproject.org) is
> seeing intermittent networking issues. It's proving very difficult to get
> reliable metrics out of it, in any case. From its perspective, random
> hosts blink in and out of existence unreliably, with almost *all* hosts
> (63 of the ~80 monitored) affected over a period of a week. This
> leads me to believe the problem is not with *all* hosts, but with the
> monitoring server itself. The attached screenshot (tpo-overview.png)
> shows the randomness of the problem, as seen from
> hetzner-nbg1-01.torproject.org during the last 7 days.
>
> [[Image(tpo-overview.png, 700)]]
>
> We have another monitoring server hosted in the Hetzner cloud
> (hetzner-hel1-01.torproject.org) which doesn't seem to suffer from the
> same problems. From its perspective, most hosts are healthy over the same
> period, with an average availability of 99.876% over all hosts, which
> includes at least one outlier at 88%. The other (Nagios) monitoring
> server sees the new monitoring server at only 99.728% availability,
> with a total of 30 minutes of downtime over the last 7 days. Note that
> those statistics have a large margin of error as the Nagios checks are
> much less frequent than the Prometheus ones, with a granularity in the
> tens of minutes instead of seconds.
>
> The alert history graph (second attachment, histogram.cgi-nbg1-01.png)
> shows the problem more clearly, especially when compared to a similar
> host in the vicinity (hetzner-nbg01-02, third attachment,
> histogram.cgi-nbg1-02.png).
>
> [[Image(histogram.cgi-nbg1-01.png, 700)]]
> [[Image(histogram.cgi-nbg1-02.png, 700)]]
>
> I would therefore conclude there is a severe and intermittent routing
> issue with this server.
New description:
The new Prometheus monitoring server (hetzner-nbg1-01.torproject.org) is
seeing intermittent networking issues. It's proving very difficult to get
reliable metrics out of it, in any case. From its perspective, random
hosts blink in and out of existence unreliably, with almost *all* hosts
(63 of the ~80 monitored) affected over a period of a week. This leads
me to believe the problem is not with *all* hosts, but with the monitoring
server itself. The attached screenshot (tpo-overview.png) shows the
randomness of the problem, as seen from hetzner-nbg1-01.torproject.org
during the last 7 days.

[[Image(tpo-overview.png, 700)]]

We have another monitoring server hosted in the Hetzner cloud
(hetzner-hel1-01.torproject.org) which doesn't seem to suffer from the
same problems. From its perspective, most hosts are healthy over the same
period, with an average availability of 99.876% over all hosts, which
includes at least one outlier at 88%. The other (Nagios) monitoring server
sees the new monitoring server at only 99.728% availability, with a
total of 30 minutes of downtime over the last 7 days. Note that those
statistics have a large margin of error as the Nagios checks are much less
frequent than the Prometheus ones, with a granularity in the tens of
minutes instead of seconds.
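
As a rough sanity check on the Nagios figures above, the relation between
an availability percentage and downtime over a 7-day window can be sketched
as follows. The function names and the 10-minute check interval are
assumptions made for illustration only; the 99.728% and 30-minute figures
are the ones quoted above.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_minutes(availability_percent: float, window_days: float) -> float:
    """Minutes of downtime implied by an availability percentage."""
    window_minutes = window_days * MINUTES_PER_DAY
    return window_minutes * (1 - availability_percent / 100)


def availability_percent(downtime_min: float, window_days: float) -> float:
    """Availability percentage implied by a number of minutes of downtime."""
    window_minutes = window_days * MINUTES_PER_DAY
    return 100 * (1 - downtime_min / window_minutes)


def failed_checks(downtime_min: float, check_interval_min: float) -> float:
    """Number of failed checks that would account for the reported downtime."""
    return downtime_min / check_interval_min


if __name__ == "__main__":
    # 99.728% over 7 days is roughly 27 minutes, consistent with "30 minutes".
    print(f"{downtime_minutes(99.728, 7):.1f} minutes of downtime")
    # Conversely, exactly 30 minutes of downtime is roughly 99.70% availability.
    print(f"{availability_percent(30, 7):.3f}% availability")
    # Assumed 10-minute check interval: 30 minutes is only a few failed checks,
    # which is why the margin of error on these numbers is large.
    print(f"{failed_checks(30, 10):.0f} failed checks")
```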

The alert history graph (second attachment, histogram.cgi-nbg1-01.png)
shows the problem more clearly, especially when compared to a similar host
in the vicinity (hetzner-nbg01-02, third attachment,
histogram.cgi-nbg1-02.png).

[[Image(histogram.cgi-nbg1-01.png, 700)]]
[[Image(histogram.cgi-nbg1-02.png, 700)]]

I would therefore conclude there is a severe and intermittent routing
issue with this server.

I filed this as an issue in the Hetzner "cloud" web interface and am
waiting for feedback.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/31916#comment:2>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online