[tor-bugs] #33098 [Internal Services/Tor Sysadmin Team]: fsn-node-03 disk problems
#33098: fsn-node-03 disk problems
-------------------------------------------------+-------------------------
Reporter: anarcat | Owner: anarcat
Type: defect | Status: assigned
Priority: High | Milestone:
Component: Internal Services/Tor Sysadmin Team | Version:
Severity: Blocker | Keywords:
Actual Points: | Parent ID:
Points: | Reviewer:
Sponsor: |
-------------------------------------------------+-------------------------
For some reason, the hard drive on fsn-node-03 is reporting SMART errors. I
originally filed this ticket with Hetzner:
> Yesterday, we got errors from the SMART daemon on this host, looking like this:
>
> From: root <root@xxxxxxxxxxxxxxxxxxxxxxxxxx>
> Subject: SMART error (ErrorCount) detected on host: fsn-node-03
> To: root@xxxxxxxxxxxxxxxxxxxxxxxxxx
> Date: Tue, 28 Jan 2020 23:35:35 +0000
>
> This message was generated by the smartd daemon running on:
>
> host name: fsn-node-03
> DNS domain: torproject.org
>
> The following warning/error was logged by the smartd daemon:
>
> Device: /dev/sda [SAT], ATA error count increased from 0 to 1
>
> Device info:
> TOSHIBA MG06ACA10TEY, S/N:..., WWN:...., FW:0103, 10.0 TB
>
> For details see host's SYSLOG.
>
> You can also use the smartctl utility for further investigation.
> Another message will be sent in 24 hours if the problem persists.
>
> Another such email was triggered an hour later.
>
> The RAID array the disk is on triggered a rebuild as well, somehow. The
> following messages showed up in dmesg:
>
> [Jan28 20:44] md: resync of RAID array md2
> [Jan28 22:20] ata2.00: exception Emask 0x50 SAct 0x4000 SErr 0x480900 action 0x6 frozen
> [ +0.004419] ata2.00: irq_stat 0x08000000, interface fatal error
> [ +0.001489] ata2: SError: { UnrecovData HostInt 10B8B Handshk }
> [ +0.000781] ata2.00: failed command: WRITE FPDMA QUEUED
> [ +0.000785] ata2.00: cmd 61/00:70:80:52:f6/05:00:ec:00:00/40 tag 14 ncq dma 655360 out
>          res 40/00:70:80:52:f6/00:00:ec:00:00/40 Emask 0x50 (ATA bus error)
> [ +0.001600] ata2.00: status: { DRDY }
> [ +0.000801] ata2: hard resetting link
> [ +0.310126] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [ +0.088155] ata2.00: configured for UDMA/133
> [ +0.000031] ata2: EH complete
> [Jan28 23:27] ata1.00: exception Emask 0x50 SAct 0x1c00 SErr 0x280900 action 0x6 frozen
> [ +0.004338] ata1.00: irq_stat 0x08000000, interface fatal error
> [ +0.001815] ata1: SError: { UnrecovData HostInt 10B8B BadCRC }
> [ +0.000772] ata1.00: failed command: READ FPDMA QUEUED
> [ +0.000738] ata1.00: cmd 60/00:50:00:3b:b1/05:00:47:01:00/40 tag 10 ncq dma 655360 in
>          res 40/00:58:00:40:b1/00:00:47:01:00/40 Emask 0x50 (ATA bus error)
> [ +0.001512] ata1.00: status: { DRDY }
> [ +0.000793] ata1.00: failed command: READ FPDMA QUEUED
> [ +0.000727] ata1.00: cmd 60/00:58:00:40:b1/05:00:47:01:00/40 tag 11 ncq dma 655360 in
>          res 40/00:58:00:40:b1/00:00:47:01:00/40 Emask 0x50 (ATA bus error)
> [ +0.001534] ata1.00: status: { DRDY }
> [ +0.000769] ata1.00: failed command: READ FPDMA QUEUED
> [ +0.000720] ata1.00: cmd 60/00:60:00:45:b1/01:00:47:01:00/40 tag 12 ncq dma 131072 in
>          res 40/00:58:00:40:b1/00:00:47:01:00/40 Emask 0x50 (ATA bus error)
> [ +0.001453] ata1.00: status: { DRDY }
> [ +0.000778] ata1: hard resetting link
> [ +0.556198] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [ +0.001780] ata1.00: configured for UDMA/133
> [ +0.000037] ata1: EH complete
> [Jan28 23:32] perf: interrupt took too long (2518 > 2500), lowering kernel.perf_event_max_sample_rate to 79250
> [Jan29 00:14] ata2.00: exception Emask 0x50 SAct 0x1c000000 SErr 0x480900 action 0x6 frozen
> [ +0.004173] ata2.00: irq_stat 0x08000000, interface fatal error
> [ +0.001996] ata2: SError: { UnrecovData HostInt 10B8B Handshk }
> [ +0.000737] ata2.00: failed command: WRITE FPDMA QUEUED
> [ +0.000729] ata2.00: cmd 61/00:d0:00:62:0e/05:00:86:01:00/40 tag 26 ncq dma 655360 out
>          res 40/00:d0:00:62:0e/00:00:86:01:00/40 Emask 0x50 (ATA bus error)
> [ +0.001486] ata2.00: status: { DRDY }
> [ +0.000854] ata2.00: failed command: WRITE FPDMA QUEUED
> [ +0.000718] ata2.00: cmd 61/00:d8:00:67:0e/05:00:86:01:00/40 tag 27 ncq dma 655360 out
>          res 40/00:d0:00:62:0e/00:00:86:01:00/40 Emask 0x50 (ATA bus error)
> [ +0.001478] ata2.00: status: { DRDY }
> [ +0.000884] ata2.00: failed command: WRITE FPDMA QUEUED
> [ +0.000736] ata2.00: cmd 61/00:e0:00:6c:0e/01:00:86:01:00/40 tag 28 ncq dma 131072 out
>          res 40/00:d0:00:62:0e/00:00:86:01:00/40 Emask 0x50 (ATA bus error)
> [ +0.001453] ata2.00: status: { DRDY }
> [ +0.000760] ata2: hard resetting link
> [ +0.000011] ata1.00: exception Emask 0x50 SAct 0x10000000 SErr 0x280900 action 0x6 frozen
> [ +0.000764] ata1.00: irq_stat 0x08000000, interface fatal error
> [ +0.000725] ata1: SError: { UnrecovData HostInt 10B8B BadCRC }
> [ +0.000712] ata1.00: failed command: READ FPDMA QUEUED
> [ +0.000700] ata1.00: cmd 60/80:e0:00:6d:0e/04:00:86:01:00/40 tag 28 ncq dma 589824 in
>          res 40/00:e0:00:6d:0e/00:00:86:01:00/40 Emask 0x50 (ATA bus error)
> [ +0.001426] ata1.0...
I lost the original message because Hetzner trims replies, but it also
included the `smartctl -x` output for the drive.
40 minutes later, the drive was replaced and the machine booted again.
We had trouble with the `/dev/md2` array: for some reason it wouldn't
autostart after the intervention. I started it by hand, rebuilt the initrd
and rebooted, to no avail.
I tried to repartition the new `sda` drive they added, then added it to
the array, which started syncing.
But after a while, the error came back:
{{{
[Jan29 18:30] ata1.00: exception Emask 0x50 SAct 0x80080 SErr 0x480900 action 0x6 frozen
[ +0.000020] ata1.00: irq_stat 0x08000000, interface fatal error
[ +0.000010] ata1: SError: { UnrecovData HostInt 10B8B Handshk }
[ +0.000012] ata1.00: failed command: READ FPDMA QUEUED
[ +0.000018] ata1.00: cmd 60/20:38:00:98:04/00:00:00:00:00/40 tag 7 ncq dma 16384 in
         res 40/00:98:00:e2:ff/00:00:0e:01:00/40 Emask 0x50 (ATA bus error)
[ +0.000021] ata1.00: status: { DRDY }
[ +0.000010] ata1.00: failed command: WRITE FPDMA QUEUED
[ +0.000015] ata1.00: cmd 61/00:98:00:e2:ff/05:00:0e:01:00/40 tag 19 ncq dma 655360 out
         res 40/00:98:00:e2:ff/00:00:0e:01:00/40 Emask 0x50 (ATA bus error)
[ +0.000012] ata1.00: status: { DRDY }
[ +0.000009] ata1: hard resetting link
[ +0.311884] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ +0.049673] ata1.00: configured for UDMA/133
[ +0.000023] ata1: EH complete
}}}
and smartd sent us another email:
{{{
Device: /dev/sda [SAT], ATA error count increased from 0 to 1
}}}
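For anyone reading the kernel traces above, the `cmd` lines can be decoded by hand. Here is a minimal Python sketch, assuming the usual libata taskfile dump layout (`command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device`) and 512-byte logical sectors. The names `decode_ncq_cmd` and `sact_tags` are made up for illustration and are not part of any tool. For NCQ commands, the sector count sits in the feature registers and the tag in the upper five bits of `nsect`, which should agree with the `tag` and byte count the kernel prints on the same line.

```python
import re

# Taskfile dump as printed by the libata error handler (assumed layout):
# cmd CMD/FEAT:NSECT:LBAL:LBAM:LBAH/HOB_FEAT:HOB_NSECT:HOB_LBAL:HOB_LBAM:HOB_LBAH/DEV
TASKFILE = re.compile(
    r"cmd ([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/"
    r"((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/([0-9a-f]{2})"
)

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors

# ATA opcodes for the two NCQ commands seen in this ticket
NCQ_COMMANDS = {0x60: "READ FPDMA QUEUED", 0x61: "WRITE FPDMA QUEUED"}


def decode_ncq_cmd(line):
    """Decode the 'cmd' taskfile of an NCQ command from one dmesg line."""
    match = TASKFILE.search(line)
    if match is None:
        raise ValueError("no 'cmd' taskfile found in line")
    command = int(match.group(1), 16)
    feature, nsect, lbal, lbam, lbah = (
        int(b, 16) for b in match.group(2).split(":")
    )
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in match.group(3).split(":")
    )
    # NCQ: sector count lives in the feature registers (high byte in HOB)
    sectors = (hob_feature << 8) | feature
    # 48-bit LBA, least significant byte first
    lba = (lbal | lbam << 8 | lbah << 16
           | hob_lbal << 24 | hob_lbam << 32 | hob_lbah << 40)
    return {
        "command": NCQ_COMMANDS.get(command, hex(command)),
        "tag": nsect >> 3,  # NCQ tag is bits 7:3 of the count register
        "lba": lba,
        "bytes": sectors * SECTOR_SIZE,
    }


def sact_tags(sact):
    """List the NCQ tags flagged as outstanding in an SAct bitmask."""
    return [bit for bit in range(32) if (sact >> bit) & 1]
```

Feeding it the first failed write from the original report (`cmd 61/00:70:80:52:f6/05:00:ec:00:00/40`) gives tag 14 and 655360 bytes, matching the kernel's own annotation, and `sact_tags(0x80080)` gives tags 7 and 19, the two commands that failed in the latest trace.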
I reopened the ticket with Hetzner, who will visit the server again
shortly. They also find it strange that the error came back, and suspect
something might be wrong with the SATA cables.
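To sanity-check the cable theory, the `SErr` values in the traces can be decoded into the flag names the kernel prints. The sketch below is a rough Python illustration, assuming the SError bit positions and short names used by the Linux libata driver as I understand them. `decode_serror` and `LINK_LAYER_FLAGS` are hypothetical names, and treating the listed flags as link-level errors is my own grouping, not something the kernel reports.

```python
# SError register bits with the short names libata prints in dmesg
# (assumed mapping; it reproduces the flag lists shown in this ticket).
SERROR_BITS = {
    0: "RecovData", 1: "RecovComm",
    8: "UnrecovData", 9: "Persist", 10: "Proto", 11: "HostInt",
    16: "PHYRdyChg", 17: "PHYInt", 18: "CommWake", 19: "10B8B",
    20: "Dispar", 21: "BadCRC", 22: "Handshk", 23: "LinkSeq",
    24: "TrStaTrns", 25: "UnrecFIS", 26: "DevExch",
}

# Hypothetical grouping: flags raised on the SATA link itself,
# as opposed to errors reported by the drive's media.
LINK_LAYER_FLAGS = {"10B8B", "Dispar", "BadCRC", "Handshk", "LinkSeq"}


def decode_serror(value):
    """Return the names of the SError bits set in value, lowest bit first."""
    return [name for bit, name in sorted(SERROR_BITS.items())
            if (value >> bit) & 1]


def link_layer_flags(value):
    """Return only the flags from LINK_LAYER_FLAGS that are set in value."""
    return [name for name in decode_serror(value) if name in LINK_LAYER_FLAGS]
```

With this mapping, `0x480900` decodes to `UnrecovData HostInt 10B8B Handshk` and `0x280900` to `UnrecovData HostInt 10B8B BadCRC`, the same lists the kernel printed. Every trace in this ticket carries 10B8B plus either Handshk or BadCRC, on both `ata1` and `ata2`, which is at least consistent with a problem on the link (cables, backplane or controller) rather than with a single bad disk. It does not prove it.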
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33098>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online