
Re: [tor-bugs] #31916 [Internal Services/Tor Sysadmin Team]: reliability issues with hetzner-nbg1-01

To: undisclosed-recipients: ;
Subject: Re: [tor-bugs] #31916 [Internal Services/Tor Sysadmin Team]: reliability issues with hetzner-nbg1-01
From: "Tor Bug Tracker & Wiki" <blackhole@xxxxxxxxxxxxxx>
Date: Tue, 01 Oct 2019 20:56:50 -0000
Auto-submitted: auto-generated
Delivered-to: archiver@xxxxxxxx
Delivery-date: Tue, 01 Oct 2019 16:56:59 -0400
In-reply-to: <047.09debacfbe5cd669c9e912e2d70f3d7e@torproject.org>
List-archive: <http://lists.torproject.org/pipermail/tor-bugs/>
List-help: <mailto:tor-bugs-request@lists.torproject.org?subject=help>
List-id: "auto: Tor bug tracker status mails" <tor-bugs.lists.torproject.org>
List-post: <mailto:tor-bugs@lists.torproject.org>
List-subscribe: <https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs>, <mailto:tor-bugs-request@lists.torproject.org?subject=subscribe>
List-unsubscribe: <https://lists.torproject.org/cgi-bin/mailman/options/tor-bugs>, <mailto:tor-bugs-request@lists.torproject.org?subject=unsubscribe>
References: <047.09debacfbe5cd669c9e912e2d70f3d7e@torproject.org>
Reply-to: no-reply@xxxxxxxxxxxxxx, tor-assistants@xxxxxxxxxxxxxx
Sender: "tor-bugs" <tor-bugs-bounces@xxxxxxxxxxxxxxxxxxxx>

#31916: reliability issues with hetzner-nbg1-01
-------------------------------------------------+-------------------------
 Reporter:  anarcat                              |          Owner:  anarcat
     Type:  defect                               |         Status:
                                                 |  assigned
 Priority:  Medium                               |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Blocker                              |     Resolution:
 Keywords:                                       |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+-------------------------
Description changed by anarcat:

Old description:

> The new Prometheus monitoring server (hetzner-nbg1-01.torproject.org) is
> seeing intermittent networking issues. It's proving very difficult to get
> reliable metrics out of it, in any case. From its perspective, random
> hosts blink in and out of existence unreliably, with almost *all* hosts
> (63 of the ~80 monitored) affected over a period of a week. This leads
> me to believe the problem is not with *all* hosts, but with the
> monitoring server itself. The attached screenshot (tpo-overview.png)
> shows the randomness of the problem, as seen from
> hetzner-nbg1-01.torproject.org during the last 7 days.
>
> [[Image(tpo-overview.png, 700)]]
>
> We have another monitoring server hosted in the Hetzner cloud
> (hetzner-hel1-01.torproject.org) which doesn't seem to suffer from the
> same problems. From its perspective, most hosts are healthy over the same
> period, with an average availability of 99.876% over all hosts, which
> includes at least one outlier at 88%. The other (Nagios) monitoring
> server sees the new monitoring server with only 99.728% availability,
> with a total of 30 minutes of downtime over the last 7 days. Note that
> those statistics have a large margin of error, as the Nagios checks are
> much less frequent than the Prometheus ones, with a granularity in the
> tens of minutes instead of seconds.
>
> The alert history graph (second attachment, histogram.cgi-nbg1-01.png)
> shows the problem more clearly, especially when compared to a similar
> host in the vicinity (hetzner-nbg1-02, third attachment,
> histogram.cgi-nbg1-02.png).
>
> [[Image(histogram.cgi-nbg1-01.png, 700)]]
> [[Image(histogram.cgi-nbg1-02.png, 700)]]
>
> I would therefore conclude there is a severe and intermittent routing
> issue with this server.

New description:

 The new Prometheus monitoring server (hetzner-nbg1-01.torproject.org) is
 seeing intermittent networking issues. It's proving very difficult to get
 reliable metrics out of it, in any case. From its perspective, random
 hosts blink in and out of existence unreliably, with almost *all* hosts
 (63 of the ~80 monitored) affected over a period of a week. This leads me
 to believe the problem is not with *all* hosts, but with the monitoring
 server itself. The attached screenshot (tpo-overview.png) shows the
 randomness of the problem, as seen from hetzner-nbg1-01.torproject.org
 during the last 7 days.

 [[Image(tpo-overview.png, 700)]]

 We have another monitoring server hosted in the Hetzner cloud
 (hetzner-hel1-01.torproject.org) which doesn't seem to suffer from the
 same problems. From its perspective, most hosts are healthy over the same
 period, with an average availability of 99.876% over all hosts, which
 includes at least one outlier at 88%. The other (Nagios) monitoring server
 sees the new monitoring server with only 99.728% availability, with a
 total of 30 minutes of downtime over the last 7 days. Note that those
 statistics have a large margin of error, as the Nagios checks are much
 less frequent than the Prometheus ones, with a granularity in the tens of
 minutes instead of seconds.
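
 As a rough sanity check of the figures above, the following Python sketch
 converts between an availability percentage and minutes of downtime over
 a 7-day window. The function names are hypothetical, and the 10-minute
 check interval is an assumption for illustration only (the Nagios
 granularity is only known to be in the tens of minutes).

```python
MINUTES_PER_DAY = 24 * 60


def downtime_minutes(availability_pct: float, window_days: float) -> float:
    """Minutes of downtime implied by an availability percentage."""
    window = window_days * MINUTES_PER_DAY
    return (100.0 - availability_pct) / 100.0 * window


def availability_percent(downtime_min: float, window_days: float) -> float:
    """Availability percentage implied by a number of downtime minutes."""
    window = window_days * MINUTES_PER_DAY
    return 100.0 * (1.0 - downtime_min / window)


if __name__ == "__main__":
    # Figures quoted above: 99.728% availability, 30 minutes, 7 days.
    print(f"{downtime_minutes(99.728, 7):.1f} minutes")  # about 27.4
    print(f"{availability_percent(30, 7):.3f}%")         # about 99.702
    # Assumed check interval (not stated above): 10 minutes.
    # A single missed check then shifts the 7-day figure by this much:
    print(f"{100.0 - availability_percent(10, 7):.3f} percentage points")
```

 The two quoted figures differ by under three minutes of downtime, which
 is less than a single check interval under that assumption.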

 The alert history graph (second attachment, histogram.cgi-nbg1-01.png)
 shows the problem more clearly, especially when compared to a similar host
 in the vicinity (hetzner-nbg1-02, third attachment,
 histogram.cgi-nbg1-02.png).

 [[Image(histogram.cgi-nbg1-01.png, 700)]]
 [[Image(histogram.cgi-nbg1-02.png, 700)]]

 I would therefore conclude there is a severe and intermittent routing
 issue with this server.

 I filed this as an issue in the Hetzner "cloud" web interface and am
 waiting for feedback.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/31916#comment:2>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

_______________________________________________
tor-bugs mailing list
tor-bugs@xxxxxxxxxxxxxxxxxxxx
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs
