Re: [tor-bugs] #32801 [Internal Services/Tor Sysadmin Team]: major outage: kvm4 down, affected: eugeni (mail, lists), alberti (ldap), pauli (puppet), rouyi (jenkins), etc
- To: undisclosed-recipients: ;
- From: "Tor Bug Tracker & Wiki" <blackhole@xxxxxxxxxxxxxx>
- Date: Wed, 18 Dec 2019 22:00:41 -0000
- Auto-submitted: auto-generated
- Delivered-to: archiver@xxxxxxxx
- Delivery-date: Wed, 18 Dec 2019 18:56:05 -0500
- In-reply-to: <047.38730e9b52b37def070cc25afe6a8d55@torproject.org>
- References: <047.38730e9b52b37def070cc25afe6a8d55@torproject.org>
- Reply-to: no-reply@xxxxxxxxxxxxxx, tor-assistants@xxxxxxxxxxxxxx
- Sender: "tor-bugs" <tor-bugs-bounces@xxxxxxxxxxxxxxxxxxxx>
#32801: major outage: kvm4 down, affected: eugeni (mail, lists), alberti (ldap),
pauli (puppet), rouyi (jenkins), etc
-------------------------------------------------+-------------------------
Reporter: anarcat | Owner: hiro
Type: defect | Status: assigned
Priority: Medium | Milestone:
Component: Internal Services/Tor Sysadmin Team | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-------------------------------------------------+-------------------------
Old description:
> During a security reboot today, kvm4.torproject.org did not return. All
> virtual machines on this host are down and unavailable.
>
> According to the Nextcloud spreadsheet (since LDAP is down), that
> includes:
>
> || host || service || impact || mitigation ||
> || alberti || LDAP, db.tpo || critical, no passwd change || read-only copies everywhere ||
> || build-x86-09 || buildbox || redundant || N/A ||
> || eugeni || incoming mail, lists || critical, total outage || peek at `tor-puppet/modules/postfix/files/virtual` and email people directly ||
> || meronense || metrics? || unclear || ? ||
> || neriniflorum || DNS || redundant, higher TTFB? || possible to remove from rotation ||
> || oo-hetzner-03 || onionoo || redundant? unclear? || ? ||
> || pauli || puppet || major, no config management || use `cumin`, local git copies ||
> || rouyi || jenkins || critical, total outage || ? ||
> || web-hetzner-01 || web mirror || redundant, no effect? || removed from rotation automatically ||
> || weissi || build box || no windows builds || N/A ||
> || woronowii || build box || no windows builds || N/A ||
>
> I'll note that it seems both windows build boxes are on the same machine
> so even if jenkins *would* be able to dispatch builds, we wouldn't be
> able to do those...
>
> A ticket was filed with Hetzner to try and rescue the server.
>
> Our disaster recovery plan so far is to wait for that rescue to succeed,
> which might take up to 24h but hopefully less.
>
> If that fails, I would suggest the following plan:
>
> 1. recover eugeni, pauli, alberti from backups on gnt-fsn or elsewhere
> (we need those three to build new machines)
> 2. build a new ganeti cluster (because we can't recover all of this on
> gnt-fsn)
> 3. restore remaining machines on the new cluster
> 4. decommission kvm4 officially
>
> This could take a few days of work. :(
New description:
During a security reboot today, kvm4.torproject.org did not return. All
virtual machines on this host are down and unavailable.

According to the Nextcloud spreadsheet (since LDAP is down), that
includes:

|| host || service || impact || mitigation ||
|| alberti || LDAP, db.tpo || critical, no passwd change || read-only copies everywhere ||
|| build-x86-09 || buildbox || redundant || N/A ||
|| eugeni || incoming mail, lists || critical, total outage || peek at `tor-puppet/modules/postfix/files/virtual` and email people directly ||
|| meronense || metrics.tpo || critical, total outage || ? ||
|| neriniflorum || DNS || redundant, higher TTFB? || possible to remove from rotation ||
|| oo-hetzner-03 || onionoo || redundant || ? ||
|| pauli || puppet || major, no config management || use `cumin`, local git copies ||
|| rouyi || jenkins || critical, total outage || ? ||
|| web-hetzner-01 || web mirror || redundant, no effect? || removed from rotation automatically ||
|| weissi || build box || no windows builds || N/A ||
|| woronowii || build box || no windows builds || N/A ||

I'll note that it seems both windows build boxes are on the same machine
so even if jenkins *would* be able to dispatch builds, we wouldn't be able
to do those...

A ticket was filed with Hetzner to try and rescue the server.

Our disaster recovery plan so far is to wait for that rescue to succeed,
which might take up to 24h but hopefully less.

If that fails, I would suggest the following plan:

1. recover eugeni, pauli, alberti from backups on gnt-fsn or elsewhere
(we need those three to build new machines)
2. build a new ganeti cluster (because we can't recover all of this on
gnt-fsn)
3. restore remaining machines on the new cluster
4. decommission kvm4 officially

This could take a few days of work. :(
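
The eugeni mitigation above (peek at `tor-puppet/modules/postfix/files/virtual` and email people directly) could be done with a small lookup script. What follows is only a rough sketch, assuming that file follows the usual Postfix virtual(5) table layout (one `address destination[, destination...]` entry per logical line, `#` comments, continuation lines starting with whitespace). The function name `parse_virtual` and the example.org/example.net addresses are made up for illustration and do not come from the ticket.

```python
import re


def parse_virtual(text):
    """Parse Postfix virtual(5)-style text into {address: [destinations]}.

    Assumption: the file uses the standard virtual(5) layout. Regexp or
    other special table types are not handled by this sketch.
    """
    logical_lines = []
    for raw in text.splitlines():
        # Blank lines and comment lines are ignored.
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue
        # A line starting with whitespace continues the previous entry.
        if raw[0].isspace() and logical_lines:
            logical_lines[-1] += " " + raw.strip()
        else:
            logical_lines.append(raw.strip())

    aliases = {}
    for line in logical_lines:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # no destination on this line, skip it
        address, rhs = parts
        # Destinations may be separated by commas and/or whitespace.
        aliases[address.lower()] = [d for d in re.split(r"[,\s]+", rhs) if d]
    return aliases


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the real file.
    sample = (
        "# list owners\n"
        "lists@example.org  alice@example.net,\n"
        "    bob@example.net\n"
        "\n"
        "info@example.org carol@example.net\n"
    )
    for address, destinations in parse_virtual(sample).items():
        print(address, "->", ", ".join(destinations))
```

With a local git copy of tor-puppet, one would read the file's contents and pass them to `parse_virtual`, then look up the alias that is bouncing to find who to email directly.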
--
Comment (by anarcat):
meronense hosts metrics.tpo, so it is critical (or at least not redundant).
onionoo (oo-hetzner-03) is redundant.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/32801#comment:3>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online