Re: [tor-bugs] #33098 [Internal Services/Tor Sysadmin Team]: fsn-node-03 disk problems
#33098: fsn-node-03 disk problems
-------------------------------------------------+-------------------------
Reporter: anarcat | Owner: anarcat
Type: defect | Status: assigned
Priority: High | Milestone:
Component: Internal Services/Tor Sysadmin Team | Version:
Severity: Blocker | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-------------------------------------------------+-------------------------
Comment (by anarcat):
and we got more SMART email messages about this server. now it's also sdb
that's complaining.
i also noticed that sdb had been complaining even before i opened that
ticket with Hetzner. in fact, what triggered me to open that ticket was the
second smartd email, which i mistakenly thought was caused by sda errors:
that second email was actually about sdb! so swapping out sda wouldn't have
solved that problem.
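just for the record, here's roughly how i pull up what smartd is
complaining about (a minimal sketch using standard smartctl calls against
this box's disks, nothing specific to our setup):

{{{
# identify the disk (model, serial) so we're sure which one we're blaming
smartctl -i /dev/sdb
# the ATA error log that "error count increased from 4 to 5" refers to
smartctl -l error /dev/sdb
# full report: health status, attributes, self-test history
smartctl -a /dev/sdb
}}}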
i commented on hetzner's ticket with the following:
> We're still having trouble with this server.
>
> After a full RAID-1 resync, I rebooted the box, but the new disk was
> kicked out of the array, and not detected as having a RAID superblock:
>
> {{{
> root@fsn-node-03:~# mdadm -E /dev/sda1
> mdadm: No md superblock detected on /dev/sda1.
> }}}
>
> When I started the array and re-added the disk, it started a full resync
> again:
>
> {{{
> root@fsn-node-03:~# mdadm --run /dev/md2
> mdadm: started array /dev/md/2
> root@fsn-node-03:~# mdadm /dev/md2 -a /dev/sda1
> mdadm: added /dev/sda1
> root@fsn-node-03:~# cat /proc/mdstat
> Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
> md2 : active raid1 sda1[2] sdb1[1]
> 9766302720 blocks super 1.2 [2/1] [_U]
> [>....................] recovery = 0.0% (274048/9766302720) finish=593.9min speed=274048K/sec
> bitmap: 0/73 pages [0KB], 65536KB chunk
>
> md1 : active raid1 nvme0n1p3[0] nvme1n1p3[1]
> 937026560 blocks super 1.2 [2/2] [UU]
> bitmap: 1/7 pages [4KB], 65536KB chunk
>
> md0 : active raid1 nvme0n1p2[1] nvme1n1p2[0]
> 523264 blocks super 1.2 [2/2] [UU]
>
> unused devices: <none>
> }}}
>
> Furthermore, I just noticed that I have received smartd notifications
> about the *OTHER* hard drive (sdb):
>
> > Date: Wed, 29 Jan 2020 23:46:38 +0000
> >
> > [...]
> >
> > Device: /dev/sdb [SAT], ATA error count increased from 4 to 5
> >
> > Device info:
> > TOSHIBA MG06ACA10TEY, S/N:[...], WWN:[...], FW:0103, 10.0 TB
>
> We have also seen errors from sdb, the second drive, *before* we opened
> this ticket. That was my mistake: I thought the errors were both from
> the same disk; I couldn't imagine both disks were giving out errors.
>
> At this point, I am wondering if it might not be better to just
> commission a completely new machine rather than try to revive this one. I
> get the strong sense something is wrong with the disk controller on that
> one. We have two other PX62 servers with an identical setup
> (fsn-node-01/PX62-NVMe #[...], fsn-node-02/PX62-NVMe #[...]).
> Both are in production and neither shows the same disk problems.
>
> In any case, I can't use the box like this: its (software) RAID array
> doesn't survive reboots, which tells me there's something very wrong with
> this machine.
>
> Could you look into this again please?
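(for the record, this is roughly how i've been checking which member got
kicked out of md2 after the reboot; a minimal sketch against this box's
devices:)

{{{
# overall array state: which members are active, failed or spare
mdadm --detail /dev/md2
# per-member metadata: compare event counters and array state to see
# which disk fell out of sync (or has no superblock at all)
mdadm --examine /dev/sda1 /dev/sdb1
}}}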
So I think that, worst case, they just swap the machine and we reinstall.
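before we go that far, one thing worth ruling out on our side is a stale
mdadm.conf baked into the initramfs, which is a common reason for arrays
not coming up cleanly after a reboot. roughly the usual Debian dance (a
sketch, not something i've re-run on this box yet):

{{{
# what mdadm currently sees; compare against the ARRAY lines in
# /etc/mdadm/mdadm.conf and add anything that's missing
mdadm --detail --scan
# rebuild the initramfs so the updated config is used at boot
update-initramfs -u
}}}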
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33098#comment:2>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online