[tor-bugs] #33958 [Internal Services/Tor Sysadmin Team]: fsn VMs lost connectivity this morning
#33958: fsn VMs lost connectivity this morning
----------------------------------------------------+-----------------
     Reporter:  weasel                              |     Owner:  tpa
         Type:  defect                              |    Status:  new
     Priority:  High                                | Milestone:
    Component:  Internal Services/Tor Sysadmin Team |   Version:
     Severity:  Major                               |  Keywords:
Actual Points:                                      | Parent ID:
       Points:                                      |  Reviewer:
      Sponsor:                                      |
----------------------------------------------------+-----------------
This morning several of our VMs at fsn were without network.
The instances were still running, and `gnt-console` still got me a console
that I could log into, but the machines were not reachable from the
network, nor could they reach the network. Running `tcpdump` on the
bridge interface on the node showed no traffic for the instance.
Migrating an instance brought it back online (tried with vineale, for
example). Rebooting also worked (tried with everything else).
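For reference, the recovery amounted to roughly the following Ganeti
commands (a sketch; vineale is the instance named above, and flags may
need adjusting):
{{{
# live-migrate a stuck instance to its other node; this is what brought
# vineale back online
gnt-instance migrate vineale

# for the remaining instances, a plain reboot also did the job
gnt-instance reboot <instance>
}}}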
The running openvswitch config on a node whose instances had no network
looked like this:
{{{
root@fsn-node-04:~# ovs-vsctl show
ce[...]
    Bridge "br0"
        Port vlan-gntinet
            tag: 4000
            Interface vlan-gntinet
                type: internal
        Port "eth0"
            Interface "eth0"
        Port "br0"
            Interface "br0"
                type: internal
        Port vlan-gntbe
            tag: 4001
            Interface vlan-gntbe
                type: internal
    ovs_version: "2.10.1"
}}}
When it's working, it should look more like this:
{{{
root@fsn-node-04:~# ovs-vsctl show
ce[...]
    Bridge "br0"
        Port "tap3"
            tag: 4000
            trunks: [4000]
            Interface "tap3"
        Port vlan-gntinet
            tag: 4000
            Interface vlan-gntinet
                type: internal
        Port "eth0"
            Interface "eth0"
        Port "tap4"
            tag: 4000
            trunks: [4000]
            Interface "tap4"
        Port "br0"
            Interface "br0"
                type: internal
        Port "tap5"
            tag: 4000
            trunks: [4000]
            Interface "tap5"
        Port "tap1"
            tag: 4000
            trunks: [4000]
            Interface "tap1"
        Port vlan-gntbe
            tag: 4001
            Interface vlan-gntbe
                type: internal
        Port "tap2"
            tag: 4000
            trunks: [4000]
            Interface "tap2"
        Port "tap0"
            tag: 4000
            trunks: [4000]
            Interface "tap0"
    ovs_version: "2.10.1"
}}}
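In principle a missing tap could also be re-attached by hand, using the
tag and trunk values shown above, along these lines (an untested sketch;
tap0 and VLAN 4000 are taken from the output above):
{{{
# re-add a VM's tap to br0 with the access tag and trunk list that
# Ganeti normally sets up when it creates the tap
ovs-vsctl add-port br0 tap0 tag=4000 trunks=4000
}}}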
My first guess was that migrating had somehow screwed up the network
config, but that's probably not what happened, as the issue recurred
shortly afterwards while I was running upgrades.
My current working theory is that the following happened:
- In the morning, once automatically and once manually, we ran package
  upgrades.
- Today this included an openssl update, and openvswitch is linked
  against openssl (see the check sketched below).
- `needrestart` restarted openvswitch.
- Restarting openvswitch does not restore the dynamically added VM taps
  into the bridge.
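That part of the theory is easy to check on a node, for example like
this (a sketch; the ovs-vswitchd path is the usual Debian one):
{{{
# confirm that the openvswitch daemon links against libssl
ldd /usr/sbin/ovs-vswitchd | grep -i ssl

# list, without restarting anything, which services needrestart would
# want to restart after the upgrade
needrestart -b -r l
}}}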
I propose we blacklist openvswitch from being restarted by needrestart.
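A sketch of how that could look, assuming needrestart's usual override
mechanism in /etc/needrestart/conf.d (the exact service name pattern
should be double-checked):
{{{
cat > /etc/needrestart/conf.d/50-openvswitch.conf <<'EOF'
# never let needrestart restart openvswitch: a restart drops the
# dynamically added VM tap ports from the bridge (see #33958)
$nrconf{override_rc}{qr(^openvswitch)} = 0;
EOF
}}}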
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33958>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online