Re: [tor-bugs] #32802 [Internal Services/Tor Sysadmin Team]: decomission kvm4
#32802: decomission kvm4
-------------------------------------------------+---------------------
Reporter: anarcat | Owner: tpa
Type: project | Status: new
Priority: High | Milestone:
Component: Internal Services/Tor Sysadmin Team | Version:
Severity: Major | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-------------------------------------------------+---------------------
Comment (by anarcat):
Here's the disaster recovery plan I made up on the fly in #32801, which is
relevant to the discussion here:
> According to the Nextcloud spreadsheet (since LDAP is down), [machines
> running on kvm4] includes:
>
> || host || service || impact || mitigation ||
> || alberti || LDAP, db.tpo || critical, no passwd change || read-only copies everywhere ||
> || build-x86-09 || buildbox || redundant || N/A ||
> || eugeni || incoming mail, lists || critical, total outage || peek at `tor-puppet/modules/postfix/files/virtual` and email people directly ||
> || meronense || metrics.tpo || critical, total outage || ? ||
> || neriniflorum || DNS || redundant, higher TTFB? || possible to remove from rotation ||
> || oo-hetzner-03 || onionoo || redundant || ? ||
> || pauli || puppet || major, no config management || use `cumin`, local git copies ||
> || rouyi || jenkins || critical, total outage || ? ||
> || web-hetzner-01 || web mirror || redundant, no effect? || removed from rotation automatically ||
> || weissi || build box || no windows builds || N/A ||
> || woronowii || build box || no windows builds || N/A ||
>
> I'll note that it seems both windows build boxes are on the same machine
> so even if jenkins *would* be able to dispatch builds, we wouldn't be able
> to do those...
>
> Our disaster recovery plan so far is to wait for that rescue to succeed,
> which might take up to 24h but hopefully less.
>
> If that fails, I would suggest the following plan:
>
> 1. recover eugeni, pauli, alberti from backups on gnt-fsn or elsewhere
>    (we need those three to build new machines)
> 2. build a new ganeti cluster (because we can't recover all of this on
>    gnt-fsn)
> 3. restore remaining machines on the new cluster
> 4. decommission kvm4 officially
>
> This could take a few days of work. :(
Out of that, I would outline the following plan:
1. in the short term: migrate eugeni, pauli and alberti to an HA cluster,
probably gnt-fsn (yes, that means it will be over-allocated even more)
2. in parallel or after (January): add a node or two to the ganeti
cluster
3. migrate meronense, neriniflorum, oo-hetzner-03, and rouyi to the new
cluster
This would leave the following boxes on kvm4, with the following
rationale:
* build-x86-09 - highly redundant, not urgent
* web-hetzner-01 - one web node already present in the gnt-fsn cluster,
moving this will not bring us more redundancy
* weissi - hard to migrate
* woronowii - hard to migrate
At that point we'd have the choice to migrate the two Windows VMs (ugh) and
the build box to the ganeti cluster, and we'd probably decom
web-hetzner-01 or move it to kvm5 or some other host, then decom kvm4.
How does that sound for a plan?
Tickets would need to be created for each one of those tasks.
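
As a rough aid for filing those tickets, here is a minimal Python sketch
that encodes the plan above and derives which hosts stay on kvm4. The
host names and steps come from this comment; the task wording and the
function names (`remaining_on_kvm4`, `ticket_titles`) are made up for
illustration and do not correspond to any existing tool or ticket.

```python
# Hosts currently on kvm4, in the order of the table quoted above.
KVM4_HOSTS = [
    "alberti", "build-x86-09", "eugeni", "meronense", "neriniflorum",
    "oo-hetzner-03", "pauli", "rouyi", "web-hetzner-01", "weissi",
    "woronowii",
]

# Proposed steps as (task, hosts) pairs. The task wording is this sketch's
# own shorthand for the numbered plan, not the title of an existing ticket.
PLAN = [
    ("migrate to gnt-fsn", ["eugeni", "pauli", "alberti"]),
    ("add a node or two to the ganeti cluster", []),
    ("migrate to the new cluster",
     ["meronense", "neriniflorum", "oo-hetzner-03", "rouyi"]),
]


def remaining_on_kvm4(hosts, plan):
    """Return the hosts that no step of the plan moves off kvm4."""
    moved = {host for _task, step_hosts in plan for host in step_hosts}
    return [host for host in hosts if host not in moved]


def ticket_titles(plan):
    """One title per host for migration steps, one per step otherwise."""
    titles = []
    for task, step_hosts in plan:
        if step_hosts:
            titles.extend(f"{task}: {host}" for host in step_hosts)
        else:
            titles.append(task)
    return titles


if __name__ == "__main__":
    for title in ticket_titles(PLAN):
        print(title)
    print("left on kvm4:", ", ".join(remaining_on_kvm4(KVM4_HOSTS, PLAN)))
```

Keeping the plan as plain data makes it easy to check that every host is
either assigned to a step or deliberately left behind.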
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/32802#comment:1>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
_______________________________________________
tor-bugs mailing list
tor-bugs@xxxxxxxxxxxxxxxxxxxx
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs