Re: [tor-bugs] #33098 [Internal Services/Tor Sysadmin Team]: fsn-node-03 disk problems
#33098: fsn-node-03 disk problems
-------------------------------------------------+-------------------------
Reporter: anarcat | Owner: anarcat
Type: defect | Status: assigned
Priority: High | Milestone:
Component: Internal Services/Tor Sysadmin Team | Version:
Severity: Blocker | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-------------------------------------------------+-------------------------
Comment (by anarcat):
and we got more SMART email messages about this server. now it's also sdb
that's complaining.
i also noticed that sdb had been complaining even before i opened that
ticket with Hetzner. in fact, what triggered me to open that ticket was the
second smartd email, which i mistakenly thought was caused by sda errors:
that second email was actually about sdb! so swapping out sda wouldn't have
solved that problem.
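just for the record, here's roughly how i pull up what smartd is
complaining about (a minimal sketch using standard smartctl calls against
this box's disks, nothing specific to our setup):

{{{
# identify the disk (model, serial) so we're sure which one we're blaming
smartctl -i /dev/sdb
# the ATA error log that "error count increased from 4 to 5" refers to
smartctl -l error /dev/sdb
# full report: health status, attributes, self-test history
smartctl -a /dev/sdb
}}}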
i commented on hetzner's ticket with the following:
> We're still having trouble with this server.
>
> After a full RAID-1 resync, I rebooted the box, but the new disk was
> kicked out of the array, and not detected as having a RAID superblock:
>
> {{{
> root@fsn-node-03:~# mdadm -E /dev/sda1
> mdadm: No md superblock detected on /dev/sda1.
> }}}
>
> When I started the array and re-added the disk, it started a full resync
> again:
>
> {{{
> root@fsn-node-03:~# mdadm --run /dev/md2
> mdadm: started array /dev/md/2
> root@fsn-node-03:~# mdadm /dev/md2 -a /dev/sda1
> mdadm: added /dev/sda1
> root@fsn-node-03:~# cat /proc/mdstat
> Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
> md2 : active raid1 sda1[2] sdb1[1]
> 9766302720 blocks super 1.2 [2/1] [_U]
> [>....................] recovery = 0.0% (274048/9766302720) finish=593.9min speed=274048K/sec
> bitmap: 0/73 pages [0KB], 65536KB chunk
>
> md1 : active raid1 nvme0n1p3[0] nvme1n1p3[1]
> 937026560 blocks super 1.2 [2/2] [UU]
> bitmap: 1/7 pages [4KB], 65536KB chunk
>
> md0 : active raid1 nvme0n1p2[1] nvme1n1p2[0]
> 523264 blocks super 1.2 [2/2] [UU]
>
> unused devices: <none>
> }}}
>
> Furthermore, I just noticed that I have received smartd notifications
> about the *OTHER* hard drive (sdb):
>
> > Date: Wed, 29 Jan 2020 23:46:38 +0000
> >
> > [...]
> >
> > Device: /dev/sdb [SAT], ATA error count increased from 4 to 5
> >
> > Device info:
> > TOSHIBA MG06ACA10TEY, S/N:[...], WWN:[...], FW:0103, 10.0 TB
>
> We have also seen errors from sdb, the second drive, *before* we opened
> this ticket. That was my mistake: I thought the errors were both from
> the same disk; I couldn't imagine both disks were giving out errors.
>
> At this point, I am wondering if it might not be better to just
> commission a completely new machine rather than try to revive this one. I
> get the strong sense something is wrong with the disk controller on that
> one. We have two other PX62 servers with an identical setup
> (fsn-node-01/PX62-NVMe #[...], fsn-node-02/PX62-NVMe #[...]).
> Both are in production and neither shows the same disk problems.
>
> In any case, I can't use the box like this: its (software) RAID array
> doesn't survive reboots, which tells me there's something very wrong with
> this machine.
>
> Could you look into this again please?
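(for the record, this is roughly how i've been checking which member got
kicked out of md2 after the reboot; a minimal sketch against this box's
devices:)

{{{
# overall array state: which members are active, failed or spare
mdadm --detail /dev/md2
# per-member metadata: compare event counters and array state to see
# which disk fell out of sync (or has no superblock at all)
mdadm --examine /dev/sda1 /dev/sdb1
}}}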
So I think that, worst case, they just swap the machine and we reinstall.
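before we go that far, one thing worth ruling out on our side is a stale
mdadm.conf baked into the initramfs, which is a common reason for arrays
not coming up cleanly after a reboot. roughly the usual Debian dance (a
sketch, not something i've re-run on this box yet):

{{{
# what mdadm currently sees; compare against the ARRAY lines in
# /etc/mdadm/mdadm.conf and add anything that's missing
mdadm --detail --scan
# rebuild the initramfs so the updated config is used at boot
update-initramfs -u
}}}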
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33098#comment:2>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online