[Author Prev][Author Next][Thread Prev][Thread Next][Author Index][Thread Index]
Re: [tor-bugs] #31916 [Internal Services/Tor Sysadmin Team]: reliability issues with hetzner-nbg1-01
#31916: reliability issues with hetzner-nbg1-01
-------------------------------------------------+-------------------------
Reporter: anarcat | Owner: anarcat
Type: defect | Status:
| assigned
Priority: Medium | Milestone:
Component: Internal Services/Tor Sysadmin Team | Version:
Severity: Blocker | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-------------------------------------------------+-------------------------
Comment (by anarcat):
i saw a case of a downtime, live. both server couldn't ping each other for
a few minutes. this happened in the logs, on nbg1:
{{{
Oct 2 19:03:20 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 09[IKE]
establishing CHILD_SA hetzner-hel1-01.torproject.org-hetzner-
nbg1-01.torproject.org{6}
Oct 2 19:03:26 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 08[IKE]
95.216.141.241 is initiating an IKE_SA
Oct 2 19:03:26 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 13[IKE]
IKE_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org[54]
established between
195.201.139.202[195.201.139.202]...95.216.141.241[95.216.141.241]
Oct 2 19:03:26 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 13[IKE]
CHILD_SA hetzner-hel1-01.torproject.org-hetzner-
nbg1-01.torproject.org{164} established with SPIs c0bc9534_i cf59efd5_o
and TS 195.201.139.202/32 2a01:4f8:c2c:1e17::1/128 === 95.216.141.241/32
2a01:4f9:c010:5f1::1/128
}}}
... ie. a new session, negociated in 6 seconds. that's pretty slow, but
tolerable I guess. the problem is that, from nagios' point of view, this
was a much longer downtime:
{{{
Oct 2 18:57:31 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 10[IKE]
establishing CHILD_SA hetzner-hel1-01.torproject.org-hetzner-
nbg1-01.torproject.org{6}
Oct 2 19:00:16 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 16[IKE]
initiating IKE_SA hetzner-hel1-01.torproject.org-hetzner-
nbg1-01.torproject.org[44] to 195.201.139
.202
}}}
ie. the session started 6 *minutes* earlier, and took 3 *more* minutes to
get to the initiating stage. then nagios noticed the node was down,
naturally:
{{{
Oct 2 19:01:11 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: SERVICE
ALERT: hetzner-nbg1-01;process - apache2 -
worker;CRITICAL;SOFT;1;CHECK_NRPE STATE CRITICAL: Sock
et timeout after 50 seconds.
Oct 2 19:01:31 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: HOST ALERT:
hetzner-nbg1-01;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
}}}
then, another three minutes later, ipsec figured it out and fixed the
outage:
{{{
Oct 2 19:03:02 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 12[IKE]
initiating IKE_SA hetzner-hel1-01.torproject.org-hetzner-
nbg1-01.torproject.org[45] to 195.201.139.202
Oct 2 19:03:26 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 14[IKE]
establishing CHILD_SA hetzner-hel1-01.torproject.org-hetzner-
nbg1-01.torproject.org{6}
Oct 2 19:03:26 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 08[IKE]
IKE_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org[45]
established between
95.216.141.241[95.216.141.241]...195.201.139.202[195.201.139.202]
Oct 2 19:03:26 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 08[IKE]
CHILD_SA hetzner-hel1-01.torproject.org-hetzner-
nbg1-01.torproject.org{139} established with SPIs cf59efd5_i c0bc9534_o
and TS 95.216.141.241/32 2a01:4f9:c010:5f1::1/128 === 195.201.139.202/32
2a01:4f8:c2c:1e17::1/128
Oct 2 19:03:41 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: HOST ALERT:
hetzner-nbg1-01;UP;SOFT;4;PING OK - Packet loss = 0%, RTA = 26.98 ms
}}}
A similar problem occured when macrum rebooted. From nbg's point of view,
it went down, for sure, but then it took forever to rekey, as it was stuck
in this state:
{{{
hetzner-nbg1-01.torproject.org-macrum.torproject.org[50]: ESTABLISHED 47
minutes ago,
195.201.139.202[195.201.139.202]...138.201.192.11[138.201.192.11]
hetzner-nbg1-01.torproject.org-macrum.torproject.org[50]: IKEv2 SPIs:
c1869d7083d456cd_i adcdb88009c5736e_r*, pre-shared key reauthentication in
117 minutes
hetzner-nbg1-01.torproject.org-macrum.torproject.org[50]: IKE proposal:
AES_CBC_128/HMAC_SHA2_256_128/PRF_HMAC_SHA2_256/MODP_3072
hetzner-nbg1-01.torproject.org-macrum.torproject.org[50]: Tasks queued:
CHILD_REKEY
hetzner-nbg1-01.torproject.org-macrum.torproject.org[50]: Tasks active:
CHILD_REKEY
hetzner-nbg1-01.torproject.org-macrum.torproject.org{152}: REKEYING,
TUNNEL, reqid 2, expires in 12 minutes
hetzner-nbg1-01.torproject.org-macrum.torproject.org{152}:
195.201.139.202/32 2a01:4f8:c2c:1e17::1/128 === 138.201.192.11/32
138.201.212.224/28 172.30.133.0/24 2a01:4f8:172:39ca::2/128
2a01:4f8:172:39ca:0:dad3::/96
}}}
At that point, the tunnel had already been down for a while - macrum was
rebooted at 19:04 - but it took a full 20 minutes for ipsec to recover:
{{{
Oct 2 19:18:45 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 16[IKE]
establishing CHILD_SA hetzner-nbg1-01.torproject.org-
macrum.torproject.org{2}
Oct 2 19:21:30 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 14[IKE]
initiating IKE_SA hetzner-nbg1-01.torproject.org-macrum.torproject.org[55]
to 138.201.192.11
Oct 2 19:21:30 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 12[IKE]
establishing CHILD_SA hetzner-nbg1-01.torproject.org-
macrum.torproject.org{2}
Oct 2 19:21:30 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 10[IKE]
IKE_SA hetzner-nbg1-01.torproject.org-macrum.torproject.org[55]
established between
195.201.139.202[195.201.139.202]...138.201.192.11[138.201.192.11]
Oct 2 19:21:30 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 10[IKE]
CHILD_SA hetzner-nbg1-01.torproject.org-macrum.torproject.org{166}
established with SPIs cfe1be1c_i c2ce0e71_o and TS 195.201.139.202/32
2a01:4f8:c2c:1e17::1/128 === 138.201.192.11/32 138.201.212.224/28
172.30.133.0/24 2a01:4f8:172:39ca::2/128 2a01:4f8:172:39ca:0:dad3::/96
}}}
in comparison, nagios barely flinched when the server went down:
{{{
Oct 2 19:04:00 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 15[IKE]
deleting IKE_SA hetzner-hel1-01.torproject.org-macrum.torproject.org[31]
between 95.216.141.241[95.216.141.241]...138.201.192.11[138.201.192.11]
Oct 2 19:04:00 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 05[IKE]
initiating IKE_SA hetzner-hel1-01.torproject.org-macrum.torproject.org[46]
to 138.201.192.11
Oct 2 19:04:31 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: HOST ALERT:
macrum;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
Oct 2 19:05:21 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: SERVICE
ALERT: macrum;unwanted process - postgrey;CRITICAL;HARD;1;CHECK_NRPE STATE
CRITICAL: Socket timeout after 50 seconds.
Oct 2 19:05:21 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: SERVICE
ALERT: macrum;SSL cert - host;CRITICAL;HARD;1;CHECK_NRPE STATE CRITICAL:
Socket timeout after 50 seconds.
Oct 2 19:05:30 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 15[IKE]
establishing CHILD_SA hetzner-hel1-01.torproject.org-
macrum.torproject.org{2}
Oct 2 19:05:30 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 06[IKE]
IKE_SA hetzner-hel1-01.torproject.org-macrum.torproject.org[46]
established between
95.216.141.241[95.216.141.241]...138.201.192.11[138.201.192.11]
Oct 2 19:05:30 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 06[IKE]
CHILD_SA hetzner-hel1-01.torproject.org-macrum.torproject.org{140}
established with SPIs c93eb90f_i c8942f69_o and TS 95.216.141.241/32
2a01:4f9:c010:5f1::1/128 === 138.201.192.11/32 138.201.212.224/28
172.30.133.0/24 2a01:4f8:172:39ca::2/128 2a01:4f8:172:39ca:0:dad3::/96
Oct 2 19:05:41 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: HOST ALERT:
macrum;UP;SOFT;2;PING OK - Packet loss = 0%, RTA = 25.46 ms
}}}
19:04:00 is (basically) when macrum went down (syslog stopped writing at
19:03:59), and it brought back up its strongswan service at 19:05:26, so
it took less than 5 seconds for nagios to reconnect.
so there's definitely something wrong with strongswan on that prometheus
server. i've closed the ticket with hetzner as it now seems obvious they
are not the cause of this problem.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/31916#comment:4>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
_______________________________________________
tor-bugs mailing list
tor-bugs@xxxxxxxxxxxxxxxxxxxx
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs