Re: No I/O errors reported after SATA link hard reset

* Re: No I/O errors reported after SATA link hard reset
@ 2017-08-26 20:58 sonofagun
  2017-08-27 18:42 ` Gionatan Danti
  0 siblings, 1 reply; 12+ messages in thread
From: sonofagun @ 2017-08-26 20:58 UTC (permalink / raw)
  To: linux-scsi

 Hello guys, this is a very interesting thread but I will join it tomorrow!

I read a similar discussion about SSDs some time ago. It took place here [1]. A failure on such devices can lead to complete data loss, not just corruption.

Please install smartmontools and post its output here for each disk so that I can see if your disks are healthy. Also I must see their firmware version as there might be a firmware update available.

[1] https://marc.info/?t=149186660400002&r=1&w=2

* Re: No I/O errors reported after SATA link hard reset
  2017-08-26 20:58 No I/O errors reported after SATA link hard reset sonofagun
@ 2017-08-27 18:42 ` Gionatan Danti
  0 siblings, 0 replies; 12+ messages in thread
From: Gionatan Danti @ 2017-08-27 18:42 UTC (permalink / raw)
  To: sonofagun; +Cc: linux-scsi, linux-scsi-owner

On 26-08-2017 22:58 sonofagun@openmailbox.org wrote:
> Hello guys, this is a very interesting thread but I will join it 
> tomorrow!
> 
> I have read a similar discussion for SSDs some time ago. That took
> place here [1]. Corruption of such devices can lead to complete data
> loss and not just corruption.

I just read the thread at https://marc.info/?t=149186660400002&r=1&w=2; 
it was very interesting. However, it seems to me that it ended without 
a clear solution, right?

Anyway, the opacity of the FTL (flash translation layer) is surely a 
significant cause for concern. Unexpected power losses can wreak 
havoc on SSDs.

> Please install smartmontools and post its output here for each disk so
> that I can see if your disks are healthy. Also I must see their
> firmware version as there might be a firmware update available.

Fortunately, the issue is solved now: I traced it back to a faulty SATA 
power cable. However, the SMART reports of both disks are very 
interesting:


GOOD DISK (sda):
[root@nas ~]# smartctl -A /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   099   006    Pre-fail  Always       -       30483624
  3 Spin_Up_Time            0x0003   093   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       46
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   077   060   030    Pre-fail  Always       -       55353954
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8535
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       44
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   060   045    Old_age   Always       -       33 (Min/Max 30/40)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       24
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       67
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 14 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

Note the low (expected) Start_Stop_Count (46)


BAD DISK (sdb):
[root@nas ~]# smartctl -A /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   106   099   006    Pre-fail  Always       -       11030016
  3 Spin_Up_Time            0x0003   095   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       661
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       60912204
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8536
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       44
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   061   045    Old_age   Always       -       33 (Min/Max 29/39)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       639
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       672
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 14 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

Note the *much* higher Start_Stop_Count (661); however, the 
Power_Cycle_Count was the same (44).

So yes, while HDDs surely are more resilient than SSDs to unexpected 
power losses, a micro power loss which corrupts/invalidates the disk's 
cache content without giving the host a chance to notice *will* cause 
data corruption, sometimes even on acked synchronized writes (I had a 
filesystem journal corruption).

However, as stated in this thread, SATA does not really have a provision 
to detect commands that failed due to micro power losses, nor to detect 
an invalid/corrupted disk cache. So it seems the best "line of defense" 
is to monitor (via SMART) the start/stop or power cycle counts.
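
As a rough illustration of that monitoring idea, here is a minimal Python 
sketch that parses `smartctl -A` output (one attribute row per line) and 
compares Start_Stop_Count against Power_Cycle_Count. The function names 
and the sample constant are made up for this example, and a gap between 
the two counters is only a hint: drives can also spin down for legitimate 
reasons (standby timers, for example), so any alert threshold would have 
to be chosen per system.

```python
# Sample rows copied from the sdb report above (one row per line).
SAMPLE_SDB = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       661
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       44
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       639
"""


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for each attribute row in the text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # An attribute row starts with a numeric ID and has a hex FLAG column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip the row
    return attrs


def start_stop_excess(attrs):
    """Start/stop events not matched by a counted power cycle (hypothetical metric)."""
    return attrs["Start_Stop_Count"] - attrs["Power_Cycle_Count"]


attrs = parse_smart_attributes(SAMPLE_SDB)
print(start_stop_excess(attrs))  # 661 - 44 = 617
```

For the good disk (sda) the same calculation gives 46 - 44 = 2, so the 
difference between the two disks stands out clearly.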

Regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

* No I/O errors reported after SATA link hard reset
@ 2017-08-16 22:27 Gionatan Danti
  2017-08-17  9:24 ` Bernd Schubert
  0 siblings, 1 reply; 12+ messages in thread
From: Gionatan Danti @ 2017-08-16 22:27 UTC (permalink / raw)
  To: linux-scsi

Hi list,
some time ago, I had a filesystem corruption on a simple, two-disk 
RAID1 MD array. On the affected machine, /var/log/messages showed some 
"failed command: WRITE FPDMA QUEUED" entries, but *no* action (i.e. 
kicking off the disk) was taken by MDRAID. I tracked down the problem to 
an unstable power supply (switching power rail/connector solved the 
problem).

In the last few days I had some spare time and I am now able to reliably 
replicate the problem. Basically, when a short power loss happens, the 
SCSI midlayer logs some failed operations, but does *not* pass these 
errors to the higher layers. In other words, no I/O error is returned to 
the calling application. This is the reason why MDRAID did not kick off 
the unstable disk on the machine with the corrupted filesystem.

To replicate the problem, I wrote a large random file on a small MD 
RAID1 array, pulling the power from one disk for about 2 seconds. The 
file write operation stalled for some seconds, then recovered. Running 
an array check resulted in a high number of mismatch_cnt sectors. Dmesg 
logged the following lines:

Aug 16 16:04:02 blackhole kernel: ata6.00: exception Emask 0x50 SAct 
0x7fffffff SErr 0x90a00 action 0xe frozen
Aug 16 16:04:02 blackhole kernel: ata6.00: irq_stat 0x00400000, PHY RDY 
changed
Aug 16 16:04:02 blackhole kernel: ata6: SError: { Persist HostInt 
PHYRdyChg 10B8B }
Aug 16 16:04:02 blackhole kernel: ata6.00: failed command: WRITE FPDMA 
QUEUED
Aug 16 16:04:02 blackhole kernel: ata6.00: cmd 
61/00:00:10:82:09/04:00:00:00:00/40 tag 0 ncq 524288 out#012         res 
40/00:d8:10:72:09/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Aug 16 16:04:02 blackhole kernel: ata6.00: status: { DRDY }
...
Aug 16 16:04:02 blackhole kernel: ata6.00: failed command: WRITE FPDMA 
QUEUED
Aug 16 16:04:02 blackhole kernel: ata6.00: cmd 
61/00:f0:10:7e:09/04:00:00:00:00/40 tag 30 ncq 524288 out#012         
res 40/00:d8:10:72:09/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Aug 16 16:04:02 blackhole kernel: ata6.00: status: { DRDY }
Aug 16 16:04:02 blackhole kernel: ata6: hard resetting link
Aug 16 16:04:03 blackhole kernel: ata6: SATA link down (SStatus 0 
SControl 310)
Aug 16 16:04:04 blackhole kernel: ata6: hard resetting link
Aug 16 16:04:14 blackhole kernel: ata6: softreset failed (device not 
ready)
Aug 16 16:04:14 blackhole kernel: ata6: hard resetting link
Aug 16 16:04:24 blackhole kernel: ata6: softreset failed (device not 
ready)
Aug 16 16:04:24 blackhole kernel: ata6: hard resetting link
Aug 16 16:04:35 blackhole kernel: ata6: link is slow to respond, please 
be patient (ready=0)
Aug 16 16:04:42 blackhole kernel: ata6: SATA link down (SStatus 0 
SControl 310)
Aug 16 16:04:46 blackhole kernel: ata6: hard resetting link
Aug 16 16:04:46 blackhole kernel: ata3: exception Emask 0x10 SAct 0x0 
SErr 0x40d0202 action 0xe frozen
Aug 16 16:04:46 blackhole kernel: ata3: irq_stat 0x00400000, PHY RDY 
changed
Aug 16 16:04:46 blackhole kernel: ata3: SError: { RecovComm Persist 
PHYRdyChg CommWake 10B8B DevExch }
Aug 16 16:04:46 blackhole kernel: ata3: hard resetting link
Aug 16 16:04:51 blackhole kernel: ata3: softreset failed (device not 
ready)
Aug 16 16:04:51 blackhole kernel: ata3: applying PMP SRST workaround and 
retrying
Aug 16 16:04:51 blackhole kernel: ata3: SATA link up 3.0 Gbps (SStatus 
123 SControl 300)
Aug 16 16:04:51 blackhole kernel: ata3.00: configured for UDMA/133
Aug 16 16:04:51 blackhole kernel: ata3: EH complete
Aug 16 16:04:52 blackhole kernel: ata6: softreset failed (device not 
ready)
Aug 16 16:04:52 blackhole kernel: ata6: applying PMP SRST workaround and 
retrying
Aug 16 16:04:52 blackhole kernel: ata6: SATA link up 1.5 Gbps (SStatus 
113 SControl 310)
Aug 16 16:04:52 blackhole kernel: ata6.00: configured for UDMA/133
Aug 16 16:04:52 blackhole kernel: ata6: EH complete

As you can see, while failed SATA operations were logged in dmesg (and 
/var/log/messages), no I/O errors were returned to the upper layer 
(MDRAID) or the calling application. I have to say that I *fully 
expected* some inconsistencies: after all, removing the power wipes the 
disk's volatile DRAM cache, which means data loss. However, I really 
expected some I/O errors to be thrown to the higher layers, causing 
visible reactions (i.e. a disk pushed out of the array). With no I/O 
errors returned, the higher-layer applications are effectively blind.
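
Since the only trace of these events is in the kernel log, one crude way 
to notice them is to scan the log for link resets that were recovered by 
the error handler. The following Python sketch counts "hard resetting 
link" and "EH complete" messages per ATA port. It assumes unwrapped 
syslog lines in the format shown above; the regular expressions and the 
function name are my own, and the message wording may differ between 
kernel versions.

```python
import re
from collections import defaultdict

# Patterns modelled on the log excerpt above (an assumption, not a stable ABI).
RESET_RE = re.compile(r"kernel: (ata\d+): hard resetting link")
EH_DONE_RE = re.compile(r"kernel: (ata\d+): EH complete")


def summarize_ata_recoveries(lines):
    """Return {port: {"resets": n, "eh_complete": m}} for the given log lines."""
    summary = defaultdict(lambda: {"resets": 0, "eh_complete": 0})
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            summary[match.group(1)]["resets"] += 1
            continue
        match = EH_DONE_RE.search(line)
        if match:
            summary[match.group(1)]["eh_complete"] += 1
    return dict(summary)


SAMPLE_LOG = [
    "Aug 16 16:04:02 blackhole kernel: ata6: hard resetting link",
    "Aug 16 16:04:04 blackhole kernel: ata6: hard resetting link",
    "Aug 16 16:04:46 blackhole kernel: ata3: hard resetting link",
    "Aug 16 16:04:51 blackhole kernel: ata3: EH complete",
    "Aug 16 16:04:52 blackhole kernel: ata6: EH complete",
]

for port, counts in sorted(summarize_ata_recoveries(SAMPLE_LOG).items()):
    print(port, counts)
```

A port that shows resets followed by "EH complete" recovered without any 
error reaching the upper layers, which is exactly the case worth 
investigating by hand.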

More concerning is the fact that these undetected errors can slip 
through even when the higher-level application consistently calls sync() 
and/or fsync(). In other words, it seems that even acknowledged writes 
can fail in this manner (and this is consistent with the first machine 
corrupting its filesystem due to journal trashing - XFS journal surely 
uses sync() where appropriate). The mechanism seems to be the following:

- a higher-layer application issues sync();
- a write barrier is generated;
- a first FLUSH CACHE command is sent to the disk;
- data are written to the disk's DRAM cache;
- power is lost! The volatile cache loses its content;
- power is re-established and the disk becomes responsive again;
- a second FLUSH CACHE command is sent to the disk;
- the disk acks each SATA command, but the real data are lost.
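
The sequence above can be sketched with a toy model. This is only an 
illustration of the suspected mechanism, not real ATA semantics: the 
ToyDisk class and its methods are invented for the example, and the 
"disk" is just two dictionaries (persistent media and volatile cache).

```python
class ToyDisk:
    """Toy drive with a volatile write cache (hypothetical, for illustration)."""

    def __init__(self):
        self.media = {}  # survives power loss
        self.cache = {}  # volatile DRAM cache

    def write(self, lba, data):
        # With write caching enabled, the write is acked once it is in DRAM.
        self.cache[lba] = data
        return "ok"

    def flush(self):
        # FLUSH CACHE: persist whatever is currently cached, then ack.
        self.media.update(self.cache)
        self.cache.clear()
        return "ok"

    def power_glitch(self):
        # A short power loss silently empties the volatile cache.
        self.cache.clear()


disk = ToyDisk()
disk.write(10, b"file data")
first = disk.flush()                # first FLUSH CACHE: data reach the media
disk.write(20, b"journal commit")   # lands in the DRAM cache only
disk.power_glitch()                 # power is lost and comes back
second = disk.flush()               # second FLUSH CACHE: nothing left to flush

print(first, second)                # both flushes are acked
print(sorted(disk.media))           # only LBA 10 was persisted
```

From the host's point of view every command succeeded, yet the block 
written between the two flushes never reached the media.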

As a side note, when the power loss or SATA cable disconnection is 
relatively long (over 10 seconds, as per the EH timeout), the SATA disk 
becomes disconnected (and the MD layer acts accordingly):

Aug 16 16:12:20 blackhole kernel: ata6.00: exception Emask 0x50 SAct 
0x7fffffff SErr 0x490a00 action 0xe frozen
Aug 16 16:12:20 blackhole kernel: ata6.00: irq_stat 0x08000000, 
interface fatal error
Aug 16 16:12:20 blackhole kernel: ata6: SError: { Persist HostInt 
PHYRdyChg 10B8B Handshk }
Aug 16 16:12:20 blackhole kernel: ata6.00: failed command: WRITE FPDMA 
QUEUED
Aug 16 16:12:20 blackhole kernel: ata6.00: cmd 
61/00:00:38:88:09/04:00:00:00:00/40 tag 0 ncq 524288 out#012         res 
40/00:d8:38:f4:09/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Aug 16 16:12:20 blackhole kernel: ata6.00: status: { DRDY }
...
Aug 16 16:12:32 blackhole kernel: sd 5:0:0:0: [sdf] FAILED Result: 
hostbyte=DID_OK driverbyte=DRIVER_SENSE
Aug 16 16:12:32 blackhole kernel: sd 5:0:0:0: [sdf] Sense Key : Illegal 
Request [current] [descriptor]
Aug 16 16:12:32 blackhole kernel: sd 5:0:0:0: [sdf] Add. Sense: 
Unaligned write command
Aug 16 16:12:32 blackhole kernel: sd 5:0:0:0: [sdf] CDB: Write(10) 2a 00 
00 09 88 38 00 04 00 00
Aug 16 16:12:32 blackhole kernel: blk_update_request: 23 callbacks 
suppressed
Aug 16 16:12:32 blackhole kernel: blk_update_request: I/O error, dev 
sdf, sector 624696

Now, I have a few questions:
- is the above explanation plausible, or am I (horribly) missing 
something?
- why does the SCSI midlevel not respond to a power loss event by 
immediately offlining the disks?
- is the SCSI midlevel behavior configurable (I know I can lower the EH 
timeout, but is this the right solution)?
- how should one deal with this problem (other than being 100% sure 
power is never lost by any disk)?

Thank you all,
regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

* Re: No I/O errors reported after SATA link hard reset
  2017-08-16 22:27 Gionatan Danti
@ 2017-08-17  9:24 ` Bernd Schubert
  2017-08-17 12:48   ` Tejun Heo
  0 siblings, 1 reply; 12+ messages in thread
From: Bernd Schubert @ 2017-08-17  9:24 UTC (permalink / raw)
  To: Tejun Heo, linux-ide; +Cc: Gionatan Danti, linux-scsi

[This seems to be libata error handling and not scsi, so I added more CCs]

On 08/17/2017 12:27 AM, Gionatan Danti wrote:
> Hi list,
> some time ago, I had a filesystem corruption on a simple, two disks
> RAID1 MD array. On the affected machine, /var/log/messages shown some
> "failed command: WRITE FPDMA QUEUED" entries, but *no* action (ie: kick
> off disk) was taken by MDRAID. I tracked down the problem to an instable
> power supply (switching power rail/connector solved the problem).
> 
> In the latest day I had some spare time and I am now able to regularly
> replicate the problem. Basically, when a short powerloss happens, the
> scsi midlayer logs some failed operations, but does *not* pass these
> errors to higher layer. In other words, no I/O error is returned to the
> calling application. This is the reason why MDRAID did not kick off the
> instable disk on the machine with corrupted filesystem.
> 
> To replicated the problem, I wrote a large random file on a small MD
> RAID1 array, pulling off the power of one disk from about 2 seconds. The
> file write operation stopped for some seconds, than recovered. Running
> an array check resulted in a high number of mismatch_cnt sectors. Dmesg
> logged the following lines:
> 
> Aug 16 16:04:02 blackhole kernel: ata6.00: exception Emask 0x50 SAct
> 0x7fffffff SErr 0x90a00 action 0xe frozen
> Aug 16 16:04:02 blackhole kernel: ata6.00: irq_stat 0x00400000, PHY RDY
> changed
> Aug 16 16:04:02 blackhole kernel: ata6: SError: { Persist HostInt
> PHYRdyChg 10B8B }
> Aug 16 16:04:02 blackhole kernel: ata6.00: failed command: WRITE FPDMA
> QUEUED
> Aug 16 16:04:02 blackhole kernel: ata6.00: cmd
> 61/00:00:10:82:09/04:00:00:00:00/40 tag 0 ncq 524288 out#012         res
> 40/00:d8:10:72:09/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
> Aug 16 16:04:02 blackhole kernel: ata6.00: status: { DRDY }
> ...
> Aug 16 16:04:02 blackhole kernel: ata6.00: failed command: WRITE FPDMA
> QUEUED
> Aug 16 16:04:02 blackhole kernel: ata6.00: cmd
> 61/00:f0:10:7e:09/04:00:00:00:00/40 tag 30 ncq 524288 out#012        
> res 40/00:d8:10:72:09/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
> Aug 16 16:04:02 blackhole kernel: ata6.00: status: { DRDY }
> Aug 16 16:04:02 blackhole kernel: ata6: hard resetting link
> Aug 16 16:04:03 blackhole kernel: ata6: SATA link down (SStatus 0
> SControl 310)
> Aug 16 16:04:04 blackhole kernel: ata6: hard resetting link
> Aug 16 16:04:14 blackhole kernel: ata6: softreset failed (device not ready)
> Aug 16 16:04:14 blackhole kernel: ata6: hard resetting link
> Aug 16 16:04:24 blackhole kernel: ata6: softreset failed (device not ready)
> Aug 16 16:04:24 blackhole kernel: ata6: hard resetting link
> Aug 16 16:04:35 blackhole kernel: ata6: link is slow to respond, please
> be patient (ready=0)
> Aug 16 16:04:42 blackhole kernel: ata6: SATA link down (SStatus 0
> SControl 310)
> Aug 16 16:04:46 blackhole kernel: ata6: hard resetting link
> Aug 16 16:04:46 blackhole kernel: ata3: exception Emask 0x10 SAct 0x0
> SErr 0x40d0202 action 0xe frozen
> Aug 16 16:04:46 blackhole kernel: ata3: irq_stat 0x00400000, PHY RDY
> changed
> Aug 16 16:04:46 blackhole kernel: ata3: SError: { RecovComm Persist
> PHYRdyChg CommWake 10B8B DevExch }
> Aug 16 16:04:46 blackhole kernel: ata3: hard resetting link
> Aug 16 16:04:51 blackhole kernel: ata3: softreset failed (device not ready)
> Aug 16 16:04:51 blackhole kernel: ata3: applying PMP SRST workaround and
> retrying
> Aug 16 16:04:51 blackhole kernel: ata3: SATA link up 3.0 Gbps (SStatus
> 123 SControl 300)
> Aug 16 16:04:51 blackhole kernel: ata3.00: configured for UDMA/133
> Aug 16 16:04:51 blackhole kernel: ata3: EH complete
> Aug 16 16:04:52 blackhole kernel: ata6: softreset failed (device not ready)
> Aug 16 16:04:52 blackhole kernel: ata6: applying PMP SRST workaround and
> retrying
> Aug 16 16:04:52 blackhole kernel: ata6: SATA link up 1.5 Gbps (SStatus
> 113 SControl 310)
> Aug 16 16:04:52 blackhole kernel: ata6.00: configured for UDMA/133
> Aug 16 16:04:52 blackhole kernel: ata6: EH complete
> 
> As you can see, while failed SATA operation were logged in dmesg (and
> /var/log/messages), no I/O errors where returned to the upper layer
> (MDRAID) or the calling application. I had to say that I *fully expect*
> some inconsistencies: after all, removing the power wipes the volatile
> disk's DRAM cache, which means data loss. However, I really expected
> some I/O errors to be thrown to the higher layers, causing visible
> reactions (ie: a disks pushed out the array). With no I/O errors
> returned, the higher layer application are effectively blind.
> 
> More concerning is the fact that these undetected errors can make their
> way even when the higher application consistently calls sync() and/or
> fsync. In other words, it seems than even acknowledged writes can fail
> in this manner (and this is consistent with the first machine corrupting
> its filesystem due to journal trashing - XFS journal surely uses sync()
> where appropriate). The mechanism seems the following:
> 
> - an higher layer application issue sync();
> - a write barrier is generated;
> - a first FLUSH CACHE command is sent to the disk;
> - data are written to the disk's DRAM cache;
> - power is lost! The volatile cache lose its content;
> - power is re-established and the disk become responsive again;
> - a second FLUSH CACHE command is sent to the disk;
> - the disk acks each SATA command, but real data are lost.
> 
> As a side note, when the power loss or SATA cable disconnection is
> relatively long (over 10 seconds, as by eh timeout), the SATA disks
> become disconnected (and the MD layer acts accordlying):
> 
> Aug 16 16:12:20 blackhole kernel: ata6.00: exception Emask 0x50 SAct
> 0x7fffffff SErr 0x490a00 action 0xe frozen
> Aug 16 16:12:20 blackhole kernel: ata6.00: irq_stat 0x08000000,
> interface fatal error
> Aug 16 16:12:20 blackhole kernel: ata6: SError: { Persist HostInt
> PHYRdyChg 10B8B Handshk }
> Aug 16 16:12:20 blackhole kernel: ata6.00: failed command: WRITE FPDMA
> QUEUED
> Aug 16 16:12:20 blackhole kernel: ata6.00: cmd
> 61/00:00:38:88:09/04:00:00:00:00/40 tag 0 ncq 524288 out#012         res
> 40/00:d8:38:f4:09/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
> Aug 16 16:12:20 blackhole kernel: ata6.00: status: { DRDY }
> ...
> Aug 16 16:12:32 blackhole kernel: sd 5:0:0:0: [sdf] FAILED Result:
> hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Aug 16 16:12:32 blackhole kernel: sd 5:0:0:0: [sdf] Sense Key : Illegal
> Request [current] [descriptor]
> Aug 16 16:12:32 blackhole kernel: sd 5:0:0:0: [sdf] Add. Sense:
> Unaligned write command
> Aug 16 16:12:32 blackhole kernel: sd 5:0:0:0: [sdf] CDB: Write(10) 2a 00
> 00 09 88 38 00 04 00 00
> Aug 16 16:12:32 blackhole kernel: blk_update_request: 23 callbacks
> suppressed
> Aug 16 16:12:32 blackhole kernel: blk_update_request: I/O error, dev
> sdf, sector 624696
> 
> Now, I have few questions:
> - is the above explanation plausible, or I am (horribly) missing something?
> - why the scsi midlevel does not respond to a power loss event by
> immediately offlining the disks?
> - is the scsi midlevel behavior configurable (I know I can lower eh
> timeout, but is this the right solution)?
> - how to deal with this problem (other than being 100% sure power is
> never lost by any disks)?


I added the ata mailing list and Tejun.
I already wanted to report the same issue, as a flaky cable caused
libata error handling on one of my systems at home. ATA EH succeeded for
several weeks until several file systems on that system reported
corruption (btrfs and ext4). Failed commands I can see from syslog are
"READ FPDMA QUEUED" and "FLUSH CACHE EXT", but I'm not sure if it is
complete, as the log file is on btrfs and it reports checksum mismatch
for that file. Kernel version is 4.4.0-81-ubuntu, I have not checked yet
if they applied any libata patches.


Thanks,
Bernd


* Re: No I/O errors reported after SATA link hard reset
  2017-08-17  9:24 ` Bernd Schubert
@ 2017-08-17 12:48   ` Tejun Heo
  2017-08-17 13:18     ` Bernd Schubert
  2017-08-17 14:15     ` Gionatan Danti
  0 siblings, 2 replies; 12+ messages in thread
From: Tejun Heo @ 2017-08-17 12:48 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-ide, Gionatan Danti, linux-scsi

Hello,

On Thu, Aug 17, 2017 at 11:24:22AM +0200, Bernd Schubert wrote:
> > More concerning is the fact that these undetected errors can make their
> > way even when the higher application consistently calls sync() and/or
> > fsync. In other words, it seems than even acknowledged writes can fail
> > in this manner (and this is consistent with the first machine corrupting
> > its filesystem due to journal trashing - XFS journal surely uses sync()
> > where appropriate). The mechanism seems the following:
> > 
> > - an higher layer application issue sync();
> > - a write barrier is generated;
> > - a first FLUSH CACHE command is sent to the disk;
> > - data are written to the disk's DRAM cache;
> > - power is lost! The volatile cache lose its content;
> > - power is re-established and the disk become responsive again;
> > - a second FLUSH CACHE command is sent to the disk;
> > - the disk acks each SATA command, but real data are lost.

Recovered errors aren't reported as IO errors and at least from link
state proper there's no way for the driver to tell apart link
glitches and buffer-erasing power issues.

> > Now, I have few questions:
> > - is the above explanation plausible, or I am (horribly) missing something?

For the most part, yes.  To be more accurate, the failure is coming
from libata not being able to tell apart link glitches from the device
getting reset due to power issues.

> > - why the scsi midlevel does not respond to a power loss event by
> > immediately offlining the disks?

Because we don't wanna be ditching disks on temporary link glitches,
which do happen once in a while.

> > - is the scsi midlevel behavior configurable (I know I can lower eh
> > timeout, but is this the right solution)?
> > - how to deal with this problem (other than being 100% sure power is
> > never lost by any disks)?

So, the right way to deal with the problem probably is making use of
the SMART counter which indicates power loss events and verify that
the counter hasn't increased over link issues.  If it changed, the
device should be detached and re-probed, which will make it come back
as a different block device.  Unfortunately, I haven't had the chance
to actually implement that.
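
To make the idea a bit more concrete, the decision could look roughly 
like the Python sketch below: snapshot the counters before the link 
issue, read them again after recovery, and treat any increase as a 
suspected power event. This is a userspace illustration only, not libata 
code. Which SMART attribute actually indicates a power loss is 
vendor-specific, so the counter list, the function names and the sample 
values are all assumptions.

```python
# Counters assumed to hint at a power event (an assumption, not a standard).
POWER_COUNTERS = (
    "Start_Stop_Count",
    "Power-Off_Retract_Count",
    "Power_Cycle_Count",
)


def counters_that_increased(before, after, names=POWER_COUNTERS):
    """Return the watched counters whose raw value grew between two snapshots."""
    return [name for name in names if after.get(name, 0) > before.get(name, 0)]


def classify_recovery(before, after):
    """Return (verdict, increased_counters) for a recovered link issue."""
    increased = counters_that_increased(before, after)
    if increased:
        return "suspected power loss: detach and re-probe", increased
    return "treat as link glitch", increased


# Hypothetical snapshots taken around a recovered link reset.
before = {"Start_Stop_Count": 660, "Power_Cycle_Count": 44}
after = {"Start_Stop_Count": 661, "Power_Cycle_Count": 44}

print(classify_recovery(before, after))
print(classify_recovery(before, before))
```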

Thanks.

-- 
tejun

* Re: No I/O errors reported after SATA link hard reset
  2017-08-17 12:48   ` Tejun Heo
@ 2017-08-17 13:18     ` Bernd Schubert
  2017-08-17 13:25       ` Tejun Heo
  2017-08-17 14:23       ` Gionatan Danti
  2017-08-17 14:15     ` Gionatan Danti
  1 sibling, 2 replies; 12+ messages in thread
From: Bernd Schubert @ 2017-08-17 13:18 UTC (permalink / raw)
  To: Tejun Heo, Bernd Schubert; +Cc: linux-ide, Gionatan Danti, linux-scsi

Hello Tejun,

On 08/17/2017 02:48 PM, Tejun Heo wrote:
> Hello,
> 
> On Thu, Aug 17, 2017 at 11:24:22AM +0200, Bernd Schubert wrote:
>>> More concerning is the fact that these undetected errors can make their
>>> way even when the higher application consistently calls sync() and/or
>>> fsync. In other words, it seems than even acknowledged writes can fail
>>> in this manner (and this is consistent with the first machine corrupting
>>> its filesystem due to journal trashing - XFS journal surely uses sync()
>>> where appropriate). The mechanism seems the following:
>>>
>>> - an higher layer application issue sync();
>>> - a write barrier is generated;
>>> - a first FLUSH CACHE command is sent to the disk;
>>> - data are written to the disk's DRAM cache;
>>> - power is lost! The volatile cache lose its content;
>>> - power is re-established and the disk become responsive again;
>>> - a second FLUSH CACHE command is sent to the disk;
>>> - the disk acks each SATA command, but real data are lost.
> 
> Recovered errors aren't reported as IO errors and at least from link
> state proper there's no way for the driver to tell apart link
> glitches and buffer-erasing power issues.
> 
>>> Now, I have few questions:
>>> - is the above explanation plausible, or I am (horribly) missing something?
> 
> For the most part, yes.  To be more accurate, the failure is coming
> from libata not being able to tell apart link glitches from the device
> getting reset due to power issues.

So for Gionatan the root cause was an unstable power supply, but in my
case there wasn't any power loss, there were just failed SATA commands.
I'm not sure if this was a port or cable issue - once I changed port and
SATA cable the errors disappeared. I didn't change the power supply or
power cable. I'm now basically fighting with the data corruption that it
caused - btrfs at least has checksums, but I didn't have ext4
checksums enabled, so it is hard to figure out which files are corrupt -
silent data corruption is not well handled by backups either.
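
For filesystems without data checksums, one workaround is to keep a 
checksum manifest of the files and compare it later. The Python sketch 
below is only an illustration of that approach, with function names made 
up for the example. A changed digest can of course also come from a 
legitimate modification, so the result still has to be checked against 
what was expected to change.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file below root (relative path) to its digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(old_manifest, new_manifest):
    """Return files present in both manifests whose digest differs."""
    return sorted(
        name
        for name, digest in old_manifest.items()
        if name in new_manifest and new_manifest[name] != digest
    )
```

A manifest built while the data were known to be good can then be 
compared against a fresh one (or against a backup) with changed_files().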

> 
>>> - why the scsi midlevel does not respond to a power loss event by
>>> immediately offlining the disks?
> 
> Because we don't wanna be ditching disks on temporary link glitches,
> which do happen once in a while.
> 
>>> - is the scsi midlevel behavior configurable (I know I can lower eh
>>> timeout, but is this the right solution)?
>>> - how to deal with this problem (other than being 100% sure power is
>>> never lost by any disks)?
> 
> So, the right way to deal with the problem probably is making use of
> the SMART counter which indicates power loss events and verify that
> the counter hasn't increased over link issues.  If it changed, the
> device should be detached and re-probed, which will make it come back
> as a different block device.  Unfortunately, I haven't had the chance
> to actually implement that.

Is it possible that sata eh recovery sends resets to the device, which
makes it evict its cache?

Thanks,
Bernd




* Re: No I/O errors reported after SATA link hard reset
  2017-08-17 13:18     ` Bernd Schubert
@ 2017-08-17 13:25       ` Tejun Heo
  2017-08-17 13:43         ` Bernd Schubert
  2017-08-17 14:23       ` Gionatan Danti
  1 sibling, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2017-08-17 13:25 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Bernd Schubert, linux-ide, Gionatan Danti, linux-scsi

Hello,

On Thu, Aug 17, 2017 at 03:18:06PM +0200, Bernd Schubert wrote:
> So for Gionatan the root cause was an instable power supply, but in my
> case there wasn't any power loss, there were just failed sata commands.
> I'm not sure if this was a port or cable issue - once I changed port and
> sata cable the errors disappeared. I didn't change the power supply or
> power cable. I'm now basically fighting with the data corruption that
> caused - for btrfs it at least has a checksum, but I didn't have ext4
> checksum enabled, so it is hard to figure out which files are corrupts -
> silent data corruption is not well handled by backups either.

No idea there.  Retried and recovered errors shouldn't cause data
corruption.  Flaky power can behave in unexpected ways tho.  What
happens if you hook up the drive on a different power supply but
revert to the port / cable which showed the problem?  What do your
SMART counters say across those failures?

> Is it possible that sata eh recovery sends resets to the device, which
> makes it evict its cache?

That'd be a very broken device.  It sure is theoretically possible but
I haven't seen any reports on such behaviors yet.

Thanks.

-- 
tejun

* Re: No I/O errors reported after SATA link hard reset
  2017-08-17 13:25       ` Tejun Heo
@ 2017-08-17 13:43         ` Bernd Schubert
  0 siblings, 0 replies; 12+ messages in thread
From: Bernd Schubert @ 2017-08-17 13:43 UTC (permalink / raw)
  To: Tejun Heo, Bernd Schubert; +Cc: linux-ide, Gionatan Danti, linux-scsi



On 08/17/2017 03:25 PM, Tejun Heo wrote:
> Hello,
> 
> On Thu, Aug 17, 2017 at 03:18:06PM +0200, Bernd Schubert wrote:
>> So for Gionatan the root cause was an unstable power supply, but in my
>> case there wasn't any power loss, there were just failed SATA commands.
>> I'm not sure if this was a port or cable issue - once I changed port and
>> SATA cable the errors disappeared. I didn't change the power supply or
>> power cable. I'm now basically fighting with the data corruption that
>> caused - btrfs at least has checksums, but I didn't have ext4
>> checksums enabled, so it is hard to figure out which files are corrupt -
>> silent data corruption is not well handled by backups either.
> 
> No idea there.  Retried and recovered errors shouldn't cause data
> corruptions.  Flaky power can behave in unexpected ways tho.  What
> happens if you hook up the drive on a different power supply but
> revert to the port / cable which showed the problem?  What do your
> SMART counters say across those failures?

Hmm, well, I think I threw away the cable already, and I also don't
have spare power supplies at home. It also wasn't that easy to reproduce
the errors; they came up when my wife was working on her system - not
when I was controlling it ;)

> 
>> Is it possible that sata eh recovery sends resets to the device, which
>> makes it evict its cache?
> 
> That'd be a very broken device.  It sure is theoretically possible but
> I haven't seen any reports on such behaviors yet.

I wonder if we couldn't just make the error handler report issues for
people who are running RAID. Gionatan's power loss and my unclear
corruption issue probably wouldn't have happened if the upper md layer
had been told that it should report errors instead of recovering from
them. Although I admit it is a difficult decision what to do with
link glitches.


Thanks,
Bernd


* Re: No I/O errors reported after SATA link hard reset
  2017-08-17 13:18     ` Bernd Schubert
  2017-08-17 13:25       ` Tejun Heo
@ 2017-08-17 14:23       ` Gionatan Danti
  1 sibling, 0 replies; 12+ messages in thread
From: Gionatan Danti @ 2017-08-17 14:23 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Tejun Heo, Bernd Schubert, linux-ide, linux-scsi

Hi Bernd,

On 17-08-2017 15:18 Bernd Schubert wrote:
> 
> So for Gionatan the root cause was an unstable power supply, but in my
> case there wasn't any power loss, there were just failed SATA commands.

I tried many times to replicate the error by briefly
disconnecting/reconnecting the SATA cable, but I had *no* corruption in
this case. Sure, this was only my experience, and a badly behaving disk
firmware can do all sorts of bad things with the volatile cache,
especially when renegotiating the host link.

In my case, I did *not* change the power supply, but rather the SATA
power cable: my theory is that, as the previous cable was shared between
the two disks, low-voltage spikes somehow found their way through and
the second disk simply "rebooted". The new cable is dedicated to the
SATA disk which was previously failing.

What concerns me is that, reading the linux-raid mailing list, many users
have historically reported high mismatch counts in RAID1 arrays. These
mismatches were generally dismissed by saying "RAID1 is prone to false
positives" but, in my experience, these "false mismatches" are quite
rare. This means that many users *are probably suffering* from my
(and your) problem without ever realizing it...
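
For anyone who wants to look at their own arrays, here is a minimal
sketch that reads the mismatch counter the md driver exposes in sysfs
after a check. It assumes the standard /sys/block/mdX/md/mismatch_cnt
layout; the function name read_mismatch_counts is only illustrative, and
the script does not start a check itself.

```python
from pathlib import Path


def read_mismatch_counts(sys_block="/sys/block"):
    """Return {array_name: mismatch_cnt} for every md array under sys_block."""
    counts = {}
    for counter in sorted(Path(sys_block).glob("md*/md/mismatch_cnt")):
        # .../md0/md/mismatch_cnt -> "md0"
        array_name = counter.parent.parent.name
        try:
            counts[array_name] = int(counter.read_text().strip())
        except (OSError, ValueError):
            # Skip arrays whose counter cannot be read or parsed.
            continue
    return counts
```

The counter only reflects the last check or repair pass, so a zero value
on an array that has never been checked says nothing about its contents.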

Regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: No I/O errors reported after SATA link hard reset
  2017-08-17 12:48   ` Tejun Heo
  2017-08-17 13:18     ` Bernd Schubert
@ 2017-08-17 14:15     ` Gionatan Danti
  2017-08-17 14:46       ` Tejun Heo
  1 sibling, 1 reply; 12+ messages in thread
From: Gionatan Danti @ 2017-08-17 14:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Bernd Schubert, linux-ide, linux-scsi, Tejun Heo

Hi Tejun,

On 17-08-2017 14:48 Tejun Heo wrote:
> Recovered errors aren't reported as IO errors and at least from link
> state proper there's no way for the driver to tell apart link
> glitches and buffer-erasing power issues.

OK, so *this* is the root cause of the problem: libata cannot tell
spurious link renegotiations apart from brief power loss/power-up
events. Out of curiosity: is this a SATA-specific problem (i.e. in the
SATA specification), or are SAS disks affected as well?

>> > - why the scsi midlevel does not respond to a power loss event by
>> > immediately offlining the disks?
> 
> Because we don't wanna be ditching disks on temporary link glitches,
> which do happen once in a while.

Any chance of reporting I/O errors to the upper layers *without*
offlining the device? That way, upper layers (e.g. MDRAID) could act in
a more informed way. For example: a single-disk device would simply
retry the failed operation, while MDRAID could take the "badblocks" code
path to deal with the error.

> So, the right way to deal with the problem probably is making use of
> the SMART counter which indicates power loss events and verify that
> the counter hasn't increased over link issues.  If it changed, the
> device should be detached and re-probed, which will make it come back
> as a different block device.  Unfortunately, I haven't had the chance
> to actually implement that.

This is a very good idea; maybe I can implement it in userspace with a
simple, fast polling scheme (for example, every 60 seconds). Such
polling would not prevent all corruption scenarios, but it would at
least inform the user in a timely manner.
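
A minimal sketch of such a userspace poller, assuming smartmontools is
installed and that the drive reports power events through attribute 12
(Power_Cycle_Count) or 192 (Power-Off_Retract_Count). Which attributes a
drive exposes, and what they mean, is vendor-specific, so the IDs and
all function names here are illustrative assumptions.

```python
import subprocess
import time

# Assumed power-related attribute IDs; adjust for the actual drive.
POWER_ATTRS = {12: "Power_Cycle_Count", 192: "Power-Off_Retract_Count"}


def parse_power_counters(smartctl_output):
    """Extract raw values of the power attributes from `smartctl -A` text."""
    counters = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the raw value is column 10.
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            if attr_id in POWER_ATTRS and fields[9].isdigit():
                counters[attr_id] = int(fields[9])
    return counters


def changed_counters(previous, current):
    """Return the attribute IDs whose raw value increased between two polls."""
    return sorted(a for a, v in current.items()
                  if a in previous and v > previous[a])


def read_counters(device):
    # smartctl's exit status is a bitmask, so non-zero is not treated as fatal.
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True, check=False)
    return parse_power_counters(result.stdout)


def poll(device, interval=60):
    """Warn whenever a power-related counter increases on `device`."""
    previous = read_counters(device)
    while True:
        time.sleep(interval)
        current = read_counters(device)
        for attr_id in changed_counters(previous, current):
            print(f"{device}: {POWER_ATTRS[attr_id]} rose from "
                  f"{previous[attr_id]} to {current[attr_id]}: "
                  "possible power loss")
        # Keep old values for attributes missing from a failed read.
        previous = {**previous, **current}
```

Running poll("/dev/sda") as root (the device name is only an example)
would print a warning after the fact; it cannot tell which writes were
lost, so a full array check would still be needed afterwards.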

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: No I/O errors reported after SATA link hard reset
  2017-08-17 14:15     ` Gionatan Danti
@ 2017-08-17 14:46       ` Tejun Heo
  2017-08-17 15:01         ` Gionatan Danti
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2017-08-17 14:46 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: Bernd Schubert, linux-ide, linux-scsi

Hello,

On Thu, Aug 17, 2017 at 04:15:35PM +0200, Gionatan Danti wrote:
> OK, so *this* is the root cause of the problem: libata cannot tell
> spurious link renegotiations apart from brief power loss/power-up
> events. Out of curiosity: is this a SATA-specific problem (i.e. in
> the SATA specification), or are SAS disks affected as well?

No idea about SAS.  They're identical at the link layer tho.

> >Because we don't wanna be ditching disks on temporary link glitches,
> >which do happen once in a while.
> 
> Any chance of reporting I/O errors to the upper layers *without*
> offlining the device? That way, upper layers (e.g. MDRAID) could
> act in a more informed way. For example: a single-disk device would
> simply retry the failed operation, while MDRAID could take the
> "badblocks" code path to deal with the error.

Upper layers can request that errors not be retried, but it won't help
much.  The problem doesn't have much to do with specific commands.  A
power event can take place without any command in flight and lose the
buffered data.  Unless the upper layer is tracking everything that's
being written, there isn't much it can do short of a full scan.  This
is a condition which should be handled on the driver side.

> >So, the right way to deal with the problem probably is making use of
> >the SMART counter which indicates power loss events and verify that
> >the counter hasn't increased over link issues.  If it changed, the
> >device should be detached and re-probed, which will make it come back
> >as a different block device.  Unfortunately, I haven't had the chance
> >to actually implement that.
> 
> This is a very good idea; maybe I can implement it in userspace with
> a simple, fast polling scheme (for example, every 60 seconds). Such
> polling would not prevent all corruption scenarios, but it would at
> least inform the user in a timely manner.

Yeah, looking into getting it implemented on the kernel side.

Thanks.

-- 
tejun


* Re: No I/O errors reported after SATA link hard reset
  2017-08-17 14:46       ` Tejun Heo
@ 2017-08-17 15:01         ` Gionatan Danti
  0 siblings, 0 replies; 12+ messages in thread
From: Gionatan Danti @ 2017-08-17 15:01 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-ide, linux-scsi, Tejun Heo

On 17-08-2017 16:46 Tejun Heo wrote:
> Upper layers can request that errors not be retried, but it won't help
> much.  The problem doesn't have much to do with specific commands.  A
> power event can take place without any command in flight and lose the
> buffered data.  Unless the upper layer is tracking everything that's
> being written, there isn't much it can do short of a full scan.  This
> is a condition which should be handled on the driver side.

True, I was not thinking about buffered (delayed) writes. However, for
synchronized writes it should be possible: after all, for sync() writes
the application is waiting for their completion. This means that if a
power loss/link renegotiation is detected between *the two FLUSH_CACHE
commands*, an I/O error can be reported to the calling application.

What about disks supporting FUA? Are they unaffected by this problem? If
I understand it properly, torn writes remain a potential, but
unavoidable, problem when facing power loss conditions.

By the way, when speaking about a "full scan", are you referring to a
full bus scan/enumeration? Will it change device names when
re-discovering them?

> Yeah, looking into getting it implemented on the kernel side.

Great! Are you thinking about a polling approach or an event-driven
one?

Regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


Thread overview: 12+ messages
2017-08-26 20:58 No I/O errors reported after SATA link hard reset sonofagun
2017-08-27 18:42 ` Gionatan Danti
  -- strict thread matches above, loose matches on Subject: below --
2017-08-16 22:27 Gionatan Danti
2017-08-17  9:24 ` Bernd Schubert
2017-08-17 12:48   ` Tejun Heo
2017-08-17 13:18     ` Bernd Schubert
2017-08-17 13:25       ` Tejun Heo
2017-08-17 13:43         ` Bernd Schubert
2017-08-17 14:23       ` Gionatan Danti
2017-08-17 14:15     ` Gionatan Danti
2017-08-17 14:46       ` Tejun Heo
2017-08-17 15:01         ` Gionatan Danti
