
Re: [tor-bugs] #31916 [Internal Services/Tor Sysadmin Team]: reliability issues with hetzner-nbg1-01



#31916: reliability issues with hetzner-nbg1-01
-------------------------------------------------+-------------------------
 Reporter:  anarcat                              |          Owner:  anarcat
     Type:  defect                               |         Status:
                                                 |  assigned
 Priority:  Medium                               |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Blocker                              |     Resolution:
 Keywords:                                       |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+-------------------------
Description changed by anarcat:

Old description:

> The new Prometheus monitoring server (hetzner-nbg1-01.torproject.org) is
> seeing intermittent networking issues, and it is proving very difficult
> to get reliable metrics out of it. From its perspective, random hosts
> blink in and out of existence, with almost *all* hosts (63 of the ~80
> monitored) affected over a period of a week. This leads me to believe
> the problem is not with *all* of those hosts, but with the monitoring
> server itself. The attached screenshot (tpo-overview.png) shows the
> randomness of the problem, as seen from hetzner-nbg1-01.torproject.org
> during the last 7 days.
>
> [[Image(tpo-overview.png, 700)]]
>
> We have another monitoring server hosted in the Hetzner cloud
> (hetzner-hel1-01.torproject.org) which doesn't seem to suffer from the
> same problems. From its perspective, most hosts are healthy over the
> same period, with an average availability of 99.876% over all hosts,
> which includes at least one outlier at 88%. The other (Nagios)
> monitoring server sees the new monitoring server with only 99.728%
> availability, for a total of 30 minutes of downtime over the last 7
> days. Note that those statistics have a large margin of error, as the
> Nagios checks are much less frequent than the Prometheus ones, with a
> granularity in the tens of minutes instead of seconds.
>
> The alert history graph (second attachment, histogram.cgi-nbg1-01.png)
> shows the problem more clearly, especially when compared to a similar
> host in the vicinity (hetzner-nbg1-02, third attachment,
> histogram.cgi-nbg1-02.png).
>
> [[Image(histogram.cgi-nbg1-01.png, 700)]]
> [[Image(histogram.cgi-nbg1-02.png, 700)]]
>
> I would therefore conclude there is a severe and intermittent routing
> issue with this server.

New description:

 The new Prometheus monitoring server (hetzner-nbg1-01.torproject.org) is
 seeing intermittent networking issues, and it is proving very difficult to
 get reliable metrics out of it. From its perspective, random hosts blink
 in and out of existence, with almost *all* hosts (63 of the ~80 monitored)
 affected over a period of a week. This leads me to believe the problem is
 not with *all* of those hosts, but with the monitoring server itself. The
 attached screenshot (tpo-overview.png) shows the randomness of the
 problem, as seen from hetzner-nbg1-01.torproject.org during the last 7
 days.

 [[Image(tpo-overview.png, 700)]]

 We have another monitoring server hosted in the Hetzner cloud
 (hetzner-hel1-01.torproject.org) which doesn't seem to suffer from the
 same problems. From its perspective, most hosts are healthy over the same
 period, with an average availability of 99.876% over all hosts, which
 includes at least one outlier at 88%. The other (Nagios) monitoring server
 sees the new monitoring server with only 99.728% availability, for a total
 of 30 minutes of downtime over the last 7 days. Note that those statistics
 have a large margin of error, as the Nagios checks are much less frequent
 than the Prometheus ones, with a granularity in the tens of minutes
 instead of seconds.
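
 As a rough sanity check on those figures, an availability percentage over
 a 7-day window can be converted into minutes of downtime. This is only a
 sketch: the function name downtime_minutes is made up for illustration and
 is not part of Nagios or Prometheus, and it assumes availability is
 measured uniformly over the whole window.

```python
def downtime_minutes(availability_percent: float, window_days: float) -> float:
    """Minutes of downtime implied by an availability percentage over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_percent / 100)


# Figures quoted in the ticket, over the last 7 days.
nagios_view = downtime_minutes(99.728, 7)      # Nagios view of hetzner-nbg1-01
hel1_average = downtime_minutes(99.876, 7)     # per-host average seen from hetzner-hel1-01

print(f"Nagios view of nbg1-01: {nagios_view:.1f} minutes")
print(f"hel1-01 per-host average: {hel1_average:.1f} minutes")
```

 The first figure comes out a little under the 30 minutes reported, which is
 consistent with the coarse granularity of the Nagios checks.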

 The alert history graph (second attachment, histogram.cgi-nbg1-01.png)
 shows the problem more clearly, especially when compared to a similar host
 in the vicinity (hetzner-nbg1-02, third attachment,
 histogram.cgi-nbg1-02.png).

 [[Image(histogram.cgi-nbg1-01.png, 700)]]
 [[Image(histogram.cgi-nbg1-02.png, 700)]]

 I would therefore conclude there is a severe and intermittent routing
 issue with this server.

 I filed this as an issue in the Hetzner "cloud" web interface and am
 waiting for feedback.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/31916#comment:2>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
_______________________________________________
tor-bugs mailing list
tor-bugs@xxxxxxxxxxxxxxxxxxxx
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs