From: Gionatan Danti <g.danti@assyoma.it>
To: sonofagun@openmailbox.org
Cc: linux-scsi@vger.kernel.org, linux-scsi-owner@vger.kernel.org
Subject: Re: No I/O errors reported after SATA link hard reset
Date: Sun, 27 Aug 2017 20:42:52 +0200
Message-ID: <ce5bad3e349648cf2d3444e42b2f641c@assyoma.it>
In-Reply-To: <20170826205848.BC21C4E0007@mta-1.openmailbox.og>
On 26-08-2017 22:58 sonofagun@openmailbox.org wrote:
> Hello guys, this is a very interesting thread but I will join it
> tomorrow!
>
> I have read a similar discussion for SSDs some time ago. That took
> place here [1]. Corruption of such devices can lead to complete data
> loss and not just corruption.
I just read the thread at https://marc.info/?t=149186660400002&r=1&w=2
and found it very interesting. However, it seems to me that it ended
without a clear solution, right?
Anyway, the opacity of the FTL (flash translation layer) is surely a
significant cause for concern. Unexpected power losses can wreak
havoc on SSDs.
> Please install smartmontools and post its output here for each disk so
> that I can see if your disks are healthy. Also I must see their
> firmware version as there might be a firmware update available.
Fortunately, the issue is solved now: I traced it back to a faulty SATA
power cable. However, the SMART reports of both disks are very
interesting:
GOOD DISK (sda):
[root@nas ~]# smartctl -A /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   099   006    Pre-fail  Always       -       30483624
  3 Spin_Up_Time            0x0003   093   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       46
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   077   060   030    Pre-fail  Always       -       55353954
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8535
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       44
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   060   045    Old_age   Always       -       33 (Min/Max 30/40)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       24
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       67
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 14 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
Note the low (expected) Start_Stop_Count (46)
BAD DISK (sdb):
[root@nas ~]# smartctl -A /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   106   099   006    Pre-fail  Always       -       11030016
  3 Spin_Up_Time            0x0003   095   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       661
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       60912204
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8536
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       44
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   061   045    Old_age   Always       -       33 (Min/Max 29/39)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       639
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       672
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 14 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
Note the *much* higher Start_Stop_Count (661); however, the
Power_Cycle_Count was the same (44).
So yes, while HDDs are surely more resilient than SSDs to unexpected
power losses, a micro-powerloss that corrupts or invalidates the disk's
cache content without giving the host a chance to notice *will* cause
data corruption, sometimes even on acknowledged synchronized writes (I
had a filesystem journal corruption).
However, as stated in this thread, SATA does not really have a provision
to detect commands that failed due to micro-powerlosses, nor to detect an
invalid/corrupted disk cache. So it seems the best "line of defense" is
to monitor (via SMART) the start/stop or power cycle counts.
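
As a rough illustration of such monitoring, here is a minimal Python
sketch (an assumption about how one might script it, not a tested tool).
It parses the attribute rows printed by "smartctl -A" and reports a disk
whose Start_Stop_Count has run well ahead of its Power_Cycle_Count, as
happened on sdb above. The function names and the slack of 10 events are
hypothetical choices made for this example, and the heuristic will give
false positives on disks that spin down for power saving.

```python
import subprocess

# Attribute names as printed by `smartctl -A` in the reports above.
WATCHED = ("Start_Stop_Count", "Power_Cycle_Count", "Power-Off_Retract_Count")


def parse_raw_values(smartctl_output):
    """Return {attribute_name: raw_value} for the watched attributes."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # An attribute row has at least 10 columns:
        # ID, name, flag, value, worst, thresh, type, updated, when_failed, raw
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            values[fields[1]] = int(fields[9])
    return values


def unexplained_spin_downs(values, slack=10):
    """Start/stop events not matched by power cycles, or 0 if within slack.

    `slack` is an arbitrary tolerance chosen for this sketch.
    """
    excess = values["Start_Stop_Count"] - values["Power_Cycle_Count"]
    return excess if excess > slack else 0


def read_attributes(device):
    """Run `smartctl -A` on a device (needs root) and parse its output."""
    # smartctl's exit status is a bitmask and may be non-zero even when the
    # table was printed, so the return code is deliberately not checked.
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return parse_raw_values(result.stdout)
```

With the values from the two reports above, unexplained_spin_downs()
gives 0 for sda (46 start/stops against 44 power cycles) and 617 for sdb
(661 against 44). Run from cron on read_attributes("/dev/sdb"), a
non-zero result would be the cue to check cabling and power.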
Regards.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8