* [PATCH 1/4] libata: make ata_tf_to_lba[48]() generic
2008-03-27 10:14 [PATCHSET #upstream] libata: improve FLUSH error handling Tejun Heo
@ 2008-03-27 10:14 ` Tejun Heo
2008-04-04 7:45 ` Jeff Garzik
2008-03-27 10:14 ` [PATCH 2/4] libata: implement ATA_QCFLAG_RETRY Tejun Heo
` (5 subsequent siblings)
6 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2008-03-27 10:14 UTC (permalink / raw)
To: jeff, linux-ide, alan, liml; +Cc: Tejun Heo
ata_tf_to_lba[48]() currently return the LBA in the TF plus 1 for
ata_read_native_max_address(). Make them return the LBA as-is and make
them global so that they can be used to read the LBA off a TF for other purposes.
ata_read_native_max_address() now adds 1 itself.
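For illustration only, the post-patch semantics can be mimicked in a standalone userspace sketch (struct tf_mock is a hypothetical stand-in for the kernel's struct ata_taskfile, not the real type):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the taskfile LBA fields; the real
 * struct ata_taskfile lives in <linux/libata.h>. */
struct tf_mock {
	uint8_t hob_lbah, hob_lbam, hob_lbal;	/* LBA bits 47:24 */
	uint8_t lbah, lbam, lbal;		/* LBA bits 23:0  */
};

/* Assemble the 48-bit LBA the way the patched ata_tf_to_lba48() does:
 * no implicit "+ 1" any more.  A caller that wants a sector count,
 * like ata_read_native_max_address(), adds 1 itself. */
static uint64_t tf_to_lba48(const struct tf_mock *tf)
{
	uint64_t sectors = 0;

	sectors |= ((uint64_t)(tf->hob_lbah & 0xff)) << 40;
	sectors |= ((uint64_t)(tf->hob_lbam & 0xff)) << 32;
	sectors |= ((uint64_t)(tf->hob_lbal & 0xff)) << 24;
	sectors |= (tf->lbah & 0xff) << 16;
	sectors |= (tf->lbam & 0xff) << 8;
	sectors |= (tf->lbal & 0xff);

	return sectors;
}
```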
Signed-off-by: Tejun Heo <htejun@gmail.com>
---
drivers/ata/libata-core.c | 12 ++++++------
drivers/ata/libata.h | 2 ++
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 4851988..1a8601f 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -1255,7 +1255,7 @@ static u64 ata_id_n_sectors(const u16 *id)
}
}
-static u64 ata_tf_to_lba48(struct ata_taskfile *tf)
+u64 ata_tf_to_lba48(const struct ata_taskfile *tf)
{
u64 sectors = 0;
@@ -1266,10 +1266,10 @@ static u64 ata_tf_to_lba48(struct ata_taskfile *tf)
sectors |= (tf->lbam & 0xff) << 8;
sectors |= (tf->lbal & 0xff);
- return ++sectors;
+ return sectors;
}
-static u64 ata_tf_to_lba(struct ata_taskfile *tf)
+u64 ata_tf_to_lba(const struct ata_taskfile *tf)
{
u64 sectors = 0;
@@ -1278,7 +1278,7 @@ static u64 ata_tf_to_lba(struct ata_taskfile *tf)
sectors |= (tf->lbam & 0xff) << 8;
sectors |= (tf->lbal & 0xff);
- return ++sectors;
+ return sectors;
}
/**
@@ -1323,9 +1323,9 @@ static int ata_read_native_max_address(struct ata_device *dev, u64 *max_sectors)
}
if (lba48)
- *max_sectors = ata_tf_to_lba48(&tf);
+ *max_sectors = ata_tf_to_lba48(&tf) + 1;
else
- *max_sectors = ata_tf_to_lba(&tf);
+ *max_sectors = ata_tf_to_lba(&tf) + 1;
if (dev->horkage & ATA_HORKAGE_HPA_SIZE)
(*max_sectors)--;
return 0;
diff --git a/drivers/ata/libata.h b/drivers/ata/libata.h
index aa884f7..20edee0 100644
--- a/drivers/ata/libata.h
+++ b/drivers/ata/libata.h
@@ -61,6 +61,8 @@ extern int libata_fua;
extern int libata_noacpi;
extern int libata_allow_tpm;
extern void ata_force_cbl(struct ata_port *ap);
+extern u64 ata_tf_to_lba(const struct ata_taskfile *tf);
+extern u64 ata_tf_to_lba48(const struct ata_taskfile *tf);
extern struct ata_queued_cmd *ata_qc_new_init(struct ata_device *dev);
extern int ata_build_rw_tf(struct ata_taskfile *tf, struct ata_device *dev,
u64 block, u32 n_block, unsigned int tf_flags,
--
1.5.2.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH 2/4] libata: implement ATA_QCFLAG_RETRY
2008-03-27 10:14 [PATCHSET #upstream] libata: improve FLUSH error handling Tejun Heo
2008-03-27 10:14 ` [PATCH 1/4] libata: make ata_tf_to_lba[48]() generic Tejun Heo
@ 2008-03-27 10:14 ` Tejun Heo
2008-03-27 10:14 ` [PATCH 3/4] libata: kill unused ata_flush_cache() Tejun Heo
` (4 subsequent siblings)
6 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2008-03-27 10:14 UTC (permalink / raw)
To: jeff, linux-ide, alan, liml; +Cc: Tejun Heo
Currently whether a command should be retried after failure is
determined inside ata_eh_finish(). Add ATA_QCFLAG_RETRY and move the
logic into ata_eh_autopsy(). This makes things clearer and helps
extend the retry determination logic.
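In effect, the moved check reduces to a single predicate. Here is a hypothetical userspace sketch of it (the flag values below are illustrative stand-ins, not the real libata constants):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the masks the predicate consults; the
 * real values come from <linux/libata.h>. */
#define AC_ERR_DEV	(1 << 0)
#define AC_ERR_INVALID	(1 << 1)
#define ATA_QCFLAG_IO	(1 << 1)

/* A failed command is worth retrying unless libata itself flagged it
 * invalid, or it is a non-IO command that the device rejected
 * (err_mask is exactly AC_ERR_DEV). */
static bool worth_retrying(unsigned int err_mask, unsigned int qc_flags)
{
	return !(err_mask & AC_ERR_INVALID) &&
	       ((qc_flags & ATA_QCFLAG_IO) || err_mask != AC_ERR_DEV);
}
```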
Signed-off-by: Tejun Heo <htejun@gmail.com>
---
drivers/ata/libata-eh.c | 18 ++++++++----------
include/linux/libata.h | 1 +
2 files changed, 9 insertions(+), 10 deletions(-)
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index a583032..5f3d294 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -1804,6 +1804,11 @@ static void ata_eh_link_autopsy(struct ata_link *link)
if (qc->flags & ATA_QCFLAG_SENSE_VALID)
qc->err_mask &= ~(AC_ERR_DEV | AC_ERR_OTHER);
+ /* determine whether the command is worth retrying */
+ if (!(qc->err_mask & AC_ERR_INVALID) &&
+ ((qc->flags & ATA_QCFLAG_IO) || qc->err_mask != AC_ERR_DEV))
+ qc->flags |= ATA_QCFLAG_RETRY;
+
/* accumulate error info */
ehc->i.dev = qc->dev;
all_err_mask |= qc->err_mask;
@@ -2817,18 +2822,11 @@ void ata_eh_finish(struct ata_port *ap)
/* FIXME: Once EH migration is complete,
* generate sense data in this function,
* considering both err_mask and tf.
- *
- * There's no point in retrying invalid
- * (detected by libata) and non-IO device
- * errors (rejected by device). Finish them
- * immediately.
*/
- if ((qc->err_mask & AC_ERR_INVALID) ||
- (!(qc->flags & ATA_QCFLAG_IO) &&
- qc->err_mask == AC_ERR_DEV))
- ata_eh_qc_complete(qc);
- else
+ if (qc->flags & ATA_QCFLAG_RETRY)
ata_eh_qc_retry(qc);
+ else
+ ata_eh_qc_complete(qc);
} else {
if (qc->flags & ATA_QCFLAG_SENSE_VALID) {
ata_eh_qc_complete(qc);
diff --git a/include/linux/libata.h b/include/linux/libata.h
index b064bfe..d06dd12 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -225,6 +225,7 @@ enum {
ATA_QCFLAG_RESULT_TF = (1 << 4), /* result TF requested */
ATA_QCFLAG_CLEAR_EXCL = (1 << 5), /* clear excl_link on completion */
ATA_QCFLAG_QUIET = (1 << 6), /* don't report device error */
+ ATA_QCFLAG_RETRY = (1 << 7), /* retry after failure */
ATA_QCFLAG_FAILED = (1 << 16), /* cmd failed and is owned by EH */
ATA_QCFLAG_SENSE_VALID = (1 << 17), /* sense data valid */
--
1.5.2.4
* [PATCH 3/4] libata: kill unused ata_flush_cache()
2008-03-27 10:14 [PATCHSET #upstream] libata: improve FLUSH error handling Tejun Heo
2008-03-27 10:14 ` [PATCH 1/4] libata: make ata_tf_to_lba[48]() generic Tejun Heo
2008-03-27 10:14 ` [PATCH 2/4] libata: implement ATA_QCFLAG_RETRY Tejun Heo
@ 2008-03-27 10:14 ` Tejun Heo
2008-03-27 10:14 ` [PATCH 4/4] libata: improve FLUSH error handling Tejun Heo
` (3 subsequent siblings)
6 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2008-03-27 10:14 UTC (permalink / raw)
To: jeff, linux-ide, liml; +Cc: Tejun Heo, Alan Cox
ata_flush_cache() hasn't been in use for quite some time now. Kill it.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
---
drivers/ata/libata-core.c | 26 --------------------------
drivers/ata/libata.h | 1 -
2 files changed, 0 insertions(+), 27 deletions(-)
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 1a8601f..aba9a71 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -6475,32 +6475,6 @@ int ata_link_offline(struct ata_link *link)
return 0;
}
-int ata_flush_cache(struct ata_device *dev)
-{
- unsigned int err_mask;
- u8 cmd;
-
- if (!ata_try_flush_cache(dev))
- return 0;
-
- if (dev->flags & ATA_DFLAG_FLUSH_EXT)
- cmd = ATA_CMD_FLUSH_EXT;
- else
- cmd = ATA_CMD_FLUSH;
-
- /* This is wrong. On a failed flush we get back the LBA of the lost
- sector and we should (assuming it wasn't aborted as unknown) issue
- a further flush command to continue the writeback until it
- does not error */
- err_mask = ata_do_simple_cmd(dev, cmd);
- if (err_mask) {
- ata_dev_printk(dev, KERN_ERR, "failed to flush cache\n");
- return -EIO;
- }
-
- return 0;
-}
-
#ifdef CONFIG_PM
static int ata_host_request_pm(struct ata_host *host, pm_message_t mesg,
unsigned int action, unsigned int ehi_flags,
diff --git a/drivers/ata/libata.h b/drivers/ata/libata.h
index 20edee0..e48d17f 100644
--- a/drivers/ata/libata.h
+++ b/drivers/ata/libata.h
@@ -96,7 +96,6 @@ extern int ata_check_atapi_dma(struct ata_queued_cmd *qc);
extern void ata_dev_select(struct ata_port *ap, unsigned int device,
unsigned int wait, unsigned int can_sleep);
extern void swap_buf_le16(u16 *buf, unsigned int buf_words);
-extern int ata_flush_cache(struct ata_device *dev);
extern void ata_dev_init(struct ata_device *dev);
extern void ata_link_init(struct ata_port *ap, struct ata_link *link, int pmp);
extern int sata_link_init_spd(struct ata_link *link);
--
1.5.2.4
* [PATCH 4/4] libata: improve FLUSH error handling
2008-03-27 10:14 [PATCHSET #upstream] libata: improve FLUSH error handling Tejun Heo
` (2 preceding siblings ...)
2008-03-27 10:14 ` [PATCH 3/4] libata: kill unused ata_flush_cache() Tejun Heo
@ 2008-03-27 10:14 ` Tejun Heo
2008-04-04 7:46 ` Jeff Garzik
2008-03-27 10:23 ` Debug patch to induce errors on FLUSH Tejun Heo
` (2 subsequent siblings)
6 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2008-03-27 10:14 UTC (permalink / raw)
To: jeff, linux-ide, liml; +Cc: Tejun Heo, Alan Cox
When errors occur during flush, instead of writing out the rest and
reporting the errors, ATA halts the processing and reports the error,
so FLUSH must be retried on failure. Also, the spec says that FLUSH
may take more than 30 secs but doesn't mention anything about the
actual limit.
This patch improves FLUSH error handling such that...
1. It's always retried with longer timeout (60s for now) after failure.
2. If the device is making progress by retrying, humor it for longer
(20 tries for now).
3. If the device fails the command with the same failed sector,
retry fewer times (log2 of the original number of tries).
4. If retried FLUSH fails for something other than device error, don't
keep retrying. We're likely wasting time.
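Points 2 and 3 above amount to a simple retry-budget rule, sketched here as standalone C (accounting only; the timeout and command issue of the real ata_eh_retry_flush() are omitted):

```c
#include <assert.h>

/* Count how many FLUSH attempts the retry loop makes when the drive
 * keeps failing.  Halving the budget on a repeated sector gives
 * roughly log2(20) ~= 5 attempts; a drive reporting a different
 * sector each time gets the full budget. */
static int count_attempts(int tries, int same_sector)
{
	int attempts = 0;

	while (tries) {
		attempts++;
		if (same_sector)
			tries /= 2;	/* no progress: fast-fail */
		else
			tries--;	/* progress: humor the drive */
	}
	return attempts;
}
```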
This patch is inspired by Alan's patch to improve ata_flush_cache().
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
---
drivers/ata/libata-eh.c | 131 ++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/libata.h | 5 ++-
2 files changed, 134 insertions(+), 2 deletions(-)
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 5f3d294..195af46 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -1784,6 +1784,7 @@ static void ata_eh_link_autopsy(struct ata_link *link)
if (!(qc->flags & ATA_QCFLAG_FAILED) || qc->dev->link != link)
continue;
+ dev = qc->dev;
/* inherit upper level err_mask */
qc->err_mask |= ehc->i.err_mask;
@@ -1809,8 +1810,19 @@ static void ata_eh_link_autopsy(struct ata_link *link)
((qc->flags & ATA_QCFLAG_IO) || qc->err_mask != AC_ERR_DEV))
qc->flags |= ATA_QCFLAG_RETRY;
+ /* If FLUSH is valid for the device and failed, retry
+ * and don't be quiet about it. As EH is gonna retry,
+ * skip regular retry logic.
+ */
+ if (dev->class == ATA_DEV_ATA && ata_try_flush_cache(dev) &&
+ (qc->tf.command == ATA_CMD_FLUSH ||
+ qc->tf.command == ATA_CMD_FLUSH_EXT)) {
+ ehc->i.dev_action[dev->devno] |= ATA_EH_RETRY_FLUSH;
+ qc->flags &= ~(ATA_QCFLAG_QUIET | ATA_QCFLAG_RETRY);
+ }
+
/* accumulate error info */
- ehc->i.dev = qc->dev;
+ ehc->i.dev = dev;
all_err_mask |= qc->err_mask;
if (qc->flags & ATA_QCFLAG_IO)
eflags |= ATA_EFLAG_IS_IO;
@@ -2496,6 +2508,115 @@ int ata_set_mode(struct ata_link *link, struct ata_device **r_failed_dev)
return rc;
}
+static u64 ata_flush_failed_sector(u8 cmd, const struct ata_taskfile *tf)
+{
+ switch (cmd) {
+ case ATA_CMD_FLUSH:
+ return ata_tf_to_lba(tf);
+ case ATA_CMD_FLUSH_EXT:
+ return ata_tf_to_lba48(tf);
+ default:
+ return -1;
+ }
+}
+
+static int ata_eh_retry_flush(struct ata_device *dev)
+{
+ struct ata_port *ap = dev->link->ap;
+ int tries = ATA_EH_FLUSH_RETRIES;
+ u64 last_failed = -1;
+ unsigned int tag;
+ int rc = 0;
+
+ /* determine the failed qc and read last_failed from it */
+ for (tag = 0; tag < ATA_MAX_QUEUE - 1; tag++) {
+ struct ata_queued_cmd *qc = __ata_qc_from_tag(ap, tag);
+ if ((qc->err_mask & AC_ERR_DEV) && qc->dev == dev) {
+ last_failed = ata_flush_failed_sector(qc->tf.command,
+ &qc->result_tf);
+ break;
+ }
+ }
+
+ /* RETRY_FLUSH action is taken only once and repeats inside
+ * the action if necessary. Mark done on start.
+ */
+ ata_eh_about_to_do(dev->link, dev, ATA_EH_RETRY_FLUSH);
+ ata_eh_done(dev->link, dev, ATA_EH_RETRY_FLUSH);
+
+ while (tries) {
+ struct ata_taskfile tf;
+ unsigned int err_mask;
+ u64 failed;
+ u8 cmd;
+
+ /* this can take a while, keep the user entertained */
+ if (last_failed == -1)
+ ata_dev_printk(dev, KERN_INFO, "retrying FLUSH, "
+ "%d tries left\n", tries);
+ else
+ ata_dev_printk(dev, KERN_INFO, "retrying FLUSH, "
+ "last_sector=%llu, %d tries left\n",
+ (unsigned long long)last_failed, tries);
+
+ /* Build tf. Set the LBA flags so that we don't read
+ * garbage back after command completion.
+ */
+ ata_tf_init(dev, &tf);
+
+ if (dev->flags & ATA_DFLAG_FLUSH_EXT) {
+ tf.command = cmd = ATA_CMD_FLUSH_EXT;
+ tf.flags |= ATA_TFLAG_LBA48;
+ } else {
+ tf.command = cmd = ATA_CMD_FLUSH;
+ tf.flags |= ATA_TFLAG_LBA;
+ }
+
+ tf.flags |= ATA_TFLAG_DEVICE;
+ tf.protocol = ATA_PROT_NODATA;
+
+ /* execute FLUSH */
+ err_mask = ata_exec_internal(dev, &tf, NULL, DMA_NONE, NULL, 0,
+ jiffies_to_msecs(ATA_TMOUT_FLUSH_RETRY));
+
+ if (!err_mask)
+ return 0;
+
+ /* If it failed for anything other than device error,
+ * doing it again isn't likely to fix the situation.
+ */
+ if (err_mask != AC_ERR_DEV || (ap->pflags & ATA_PFLAG_FROZEN)) {
+ ata_dev_printk(dev, KERN_WARNING,
+ "unexpected FLUSH failure, err_mask=0x%x\n",
+ err_mask);
+ rc = -EIO;
+ break;
+ }
+
+ failed = ata_flush_failed_sector(cmd, &tf);
+
+ /* fast-fail if FLUSH isn't making any progress */
+ if (failed == last_failed) {
+ ata_dev_printk(dev, KERN_WARNING,
+ "FLUSH failed at the same sector, "
+ "halving tries\n");
+ tries /= 2;
+ } else {
+ ata_dev_printk(dev, KERN_WARNING,
+ "FLUSH failed at sector %llu\n",
+ (unsigned long long)failed);
+ last_failed = failed;
+ tries--;
+ }
+ }
+
+ /* anyone who sees this message is seriously screwed */
+ ata_dev_printk(dev, KERN_ERR, "ATTENTION: giving up FLUSH, "
+ "device might be losing data\n");
+
+ return rc;
+}
+
static int ata_link_nr_enabled(struct ata_link *link)
{
struct ata_device *dev;
@@ -2753,6 +2874,14 @@ int ata_eh_recover(struct ata_port *ap, ata_prereset_fn_t prereset,
ehc->i.flags &= ~ATA_EHI_SETMODE;
}
+ /* retry FLUSH if requested */
+ ata_link_for_each_dev(dev, link)
+ if (ata_eh_dev_action(dev) & ATA_EH_RETRY_FLUSH) {
+ rc = ata_eh_retry_flush(dev);
+ if (rc)
+ goto dev_fail;
+ }
+
if (ehc->i.action & ATA_EH_LPM)
ata_link_for_each_dev(dev, link)
ata_dev_enable_pm(dev, ap->pm_policy);
diff --git a/include/linux/libata.h b/include/linux/libata.h
index d06dd12..d946840 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -242,6 +242,7 @@ enum {
ATA_TMOUT_BOOT_QUICK = 7 * HZ, /* heuristic */
ATA_TMOUT_INTERNAL = 30 * HZ,
ATA_TMOUT_INTERNAL_QUICK = 5 * HZ,
+ ATA_TMOUT_FLUSH_RETRY = 60 * HZ,
/* FIXME: GoVault needs 2s but we can't afford that without
* parallel probing. 800ms is enough for iVDR disk
@@ -297,9 +298,10 @@ enum {
ATA_EH_HARDRESET = (1 << 2),
ATA_EH_ENABLE_LINK = (1 << 3),
ATA_EH_LPM = (1 << 4), /* link power management action */
+ ATA_EH_RETRY_FLUSH = (1 << 5),
ATA_EH_RESET_MASK = ATA_EH_SOFTRESET | ATA_EH_HARDRESET,
- ATA_EH_PERDEV_MASK = ATA_EH_REVALIDATE,
+ ATA_EH_PERDEV_MASK = ATA_EH_REVALIDATE | ATA_EH_RETRY_FLUSH,
/* ata_eh_info->flags */
ATA_EHI_HOTPLUGGED = (1 << 0), /* could have been hotplugged */
@@ -324,6 +326,7 @@ enum {
ATA_EH_DEV_TRIES = 3,
ATA_EH_PMP_TRIES = 5,
ATA_EH_PMP_LINK_TRIES = 3,
+ ATA_EH_FLUSH_RETRIES = 20,
SATA_PMP_SCR_TIMEOUT = 250,
--
1.5.2.4
* Re: [PATCH 4/4] libata: improve FLUSH error handling
2008-03-27 10:14 ` [PATCH 4/4] libata: improve FLUSH error handling Tejun Heo
@ 2008-04-04 7:46 ` Jeff Garzik
0 siblings, 0 replies; 32+ messages in thread
From: Jeff Garzik @ 2008-04-04 7:46 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-ide, alan, liml
Tejun Heo wrote:
> When errors occur during flush, instead of writing out the rest and
> reporting the errors, ATA halts the processing and report the error,
> so FLUSH must be retried on failure. Also, the spec says that FLUSH
> may take more than 30 secs but doesn't mention anything about the
> actual limit.
>
> This patch improves FLUSH error handling such that...
>
> 1. It's always retried with longer timeout (60s for now) after failure.
>
> 2. If the device is making progress by retrying, humor it for longer
> (20 tries for now).
>
> 3. If the device is fails the command with the same failed sector,
> retry fewer times (log2 of the original number of tries).
>
> 4. If retried FLUSH fails for something other than device error, don't
> keep retrying. We're likely wasting time.
>
> This patch is inspired by Alan's patch to improve ata_flush_cache().
>
> Signed-off-by: Tejun Heo <htejun@gmail.com>
> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
> ---
> drivers/ata/libata-eh.c | 131 ++++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/libata.h | 5 ++-
> 2 files changed, 134 insertions(+), 2 deletions(-)
I didn't apply this, because it seemed like it was still under
discussion a bit.
People are talking in the right direction, so I'm likely to apply
whatever the end result is, it sounds like
* Debug patch to induce errors on FLUSH
2008-03-27 10:14 [PATCHSET #upstream] libata: improve FLUSH error handling Tejun Heo
` (3 preceding siblings ...)
2008-03-27 10:14 ` [PATCH 4/4] libata: improve FLUSH error handling Tejun Heo
@ 2008-03-27 10:23 ` Tejun Heo
2008-03-27 14:24 ` [PATCHSET #upstream] libata: improve FLUSH error handling Mark Lord
2008-03-27 17:51 ` Ric Wheeler
6 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2008-03-27 10:23 UTC (permalink / raw)
To: jeff, linux-ide, alan, liml
Here's the debug patch.
NOT FOR UPSTREAM
---
drivers/ata/libata-core.c | 52 ++++++++++++++++++++++++++++++++++++++++++++--
drivers/ata/libata-eh.c | 2 -
include/linux/libata.h | 1
3 files changed, 52 insertions(+), 3 deletions(-)
Index: work/drivers/ata/libata-core.c
===================================================================
--- work.orig/drivers/ata/libata-core.c
+++ work/drivers/ata/libata-core.c
@@ -146,6 +146,15 @@ int libata_allow_tpm = 0;
module_param_named(allow_tpm, libata_allow_tpm, int, 0444);
MODULE_PARM_DESC(allow_tpm, "Permit the use of TPM commands");
+unsigned int libata_flush_dbg_do_timeout = 0;
+unsigned int libata_flush_dbg_do_deverr = 0;
+unsigned long libata_flush_dbg_fail_sector = 123456;
+unsigned int libata_flush_dbg_fail_increment = 128;
+module_param_named(flush_dbg_do_timeout, libata_flush_dbg_do_timeout, uint, 0644);
+module_param_named(flush_dbg_do_deverr, libata_flush_dbg_do_deverr, uint, 0644);
+module_param_named(flush_dbg_fail_sector, libata_flush_dbg_fail_sector, ulong, 0644);
+module_param_named(flush_dbg_fail_increment, libata_flush_dbg_fail_increment, uint, 0644);
+
MODULE_AUTHOR("Jeff Garzik");
MODULE_DESCRIPTION("Library module for ATA devices");
MODULE_LICENSE("GPL");
@@ -5375,6 +5384,43 @@ static void ata_hsm_qc_complete(struct a
struct ata_port *ap = qc->ap;
unsigned long flags;
+ if (qc->tf.command == ATA_CMD_FLUSH ||
+ qc->tf.command == ATA_CMD_FLUSH_EXT) {
+ if (libata_flush_dbg_do_timeout) {
+ libata_flush_dbg_do_timeout--;
+ ata_dev_printk(qc->dev, KERN_WARNING, "XXX: skipping completion of %02x\n",
+ qc->tf.command);
+ return;
+ }
+
+ if (libata_flush_dbg_do_deverr) {
+ struct ata_taskfile *tf = &qc->result_tf;
+ u64 sect = libata_flush_dbg_fail_sector;
+
+ libata_flush_dbg_do_deverr--;
+
+ qc->err_mask |= AC_ERR_DEV;
+ qc->flags |= ATA_QCFLAG_NO_RTF;
+
+ ata_tf_init(qc->dev, tf);
+
+ tf->command = ATA_DRDY | ATA_ERR;
+ tf->hob_lbah = sect >> 40;
+ tf->hob_lbam = sect >> 32;
+ tf->hob_lbal = sect >> 24;
+ tf->lbah = sect >> 16;
+ tf->lbam = sect >> 8;
+ tf->lbal = sect;
+
+ ata_dev_printk(qc->dev, KERN_WARNING,
+ "XXX: failing %02x with sector %llu\n",
+ qc->tf.command, (unsigned long long)sect);
+
+ libata_flush_dbg_fail_sector +=
+ libata_flush_dbg_fail_increment;
+ }
+ }
+
if (ap->ops->error_handler) {
if (in_wq) {
spin_lock_irqsave(ap->lock, flags);
@@ -5811,8 +5857,10 @@ static void fill_result_tf(struct ata_qu
{
struct ata_port *ap = qc->ap;
- qc->result_tf.flags = qc->tf.flags;
- ap->ops->tf_read(ap, &qc->result_tf);
+ if (!(qc->flags & ATA_QCFLAG_NO_RTF)) {
+ qc->result_tf.flags = qc->tf.flags;
+ ap->ops->tf_read(ap, &qc->result_tf);
+ }
}
static void ata_verify_xfer(struct ata_queued_cmd *qc)
Index: work/include/linux/libata.h
===================================================================
--- work.orig/include/linux/libata.h
+++ work/include/linux/libata.h
@@ -226,6 +226,7 @@ enum {
ATA_QCFLAG_CLEAR_EXCL = (1 << 5), /* clear excl_link on completion */
ATA_QCFLAG_QUIET = (1 << 6), /* don't report device error */
ATA_QCFLAG_RETRY = (1 << 7), /* retry after failure */
+ ATA_QCFLAG_NO_RTF = (1 << 8),
ATA_QCFLAG_FAILED = (1 << 16), /* cmd failed and is owned by EH */
ATA_QCFLAG_SENSE_VALID = (1 << 17), /* sense data valid */
Index: work/drivers/ata/libata-eh.c
===================================================================
--- work.orig/drivers/ata/libata-eh.c
+++ work/drivers/ata/libata-eh.c
@@ -1814,7 +1814,7 @@ static void ata_eh_link_autopsy(struct a
* and don't be quiet about it. As EH is gonna retry,
* skip regular retry logic.
*/
- if (dev->class == ATA_DEV_ATA && ata_try_flush_cache(dev) &&
+ if (dev->class == ATA_DEV_ATA && /*ata_try_flush_cache(dev) &&*/
(qc->tf.command == ATA_CMD_FLUSH ||
qc->tf.command == ATA_CMD_FLUSH_EXT)) {
ehc->i.dev_action[dev->devno] |= ATA_EH_RETRY_FLUSH;
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-27 10:14 [PATCHSET #upstream] libata: improve FLUSH error handling Tejun Heo
` (4 preceding siblings ...)
2008-03-27 10:23 ` Debug patch to induce errors on FLUSH Tejun Heo
@ 2008-03-27 14:24 ` Mark Lord
2008-03-27 14:35 ` Mark Lord
` (2 more replies)
2008-03-27 17:51 ` Ric Wheeler
6 siblings, 3 replies; 32+ messages in thread
From: Mark Lord @ 2008-03-27 14:24 UTC (permalink / raw)
To: Tejun Heo; +Cc: jeff, linux-ide, alan
Tejun Heo wrote:
>
> As the code is being smart against retrying needlessly, it won't be
> too dangerous to increase the 20 tries (taken from Alan's patch) but I
> think it's as good as any other random number. If anyone knows any
> meaningful number, please chime in. The same goes for 60 secs timeout
> too.
..
I really think that we should enforce a strict upper limit on the time
that can be spent inside the flush-cache near-infinite loop being introduced.
Some things rely on I/O completing or failing in a time-deterministic manner.
Really, the entire flush + retries etc.. should never, ever, be permitted
to take more than XX seconds total. Not 60 seconds per retry, but XX seconds
total for the original command + however many retries we can fit in there.
As for the value of XX, well.. make it a sysfs attribute, with a default
of something "sensible". The time bound is really dependent upon how
quickly the drive can empty its onboard cache, or how large a cache it has.
Figure the biggest drives will have no more than, say 64MB of cache for
many years (biggest SATA drive now uses 16MB). Assuming near-worst case
I/O size of 4KB, that's 16384 I/O operations, if none were adjacent on disk.
What's the average access time these days? Say.. 20ms worst case for any
drive with a cache that huge? That's unrealistically slow for data that's
already in the drive cache, but .. 16384 * .020 seconds = 328 seconds.
Absolute theoretical worst case for a drive with a buffer 4X the largest
current size: 328 seconds. That's not taking into account bad-sector
retries for each of those I/O blocks, but *nobody* is going to wait
that long anyway. They'll have long since pulled the power cord or
reached for the BIG RED BUTTON.
On a 16MB cache drive, that number would be 328 / 4 = 82 seconds.
That's what I'd put for the limit.
But we could be slightly nonsensical and agree upon 120 seconds. :)
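For what it's worth, the back-of-envelope numbers above reproduce mechanically (the cache sizes and 20ms access time are the message's assumptions, not measurements):

```c
#include <assert.h>

/* Worst-case cache drain time: cache_mb megabytes of dirty 4KB
 * blocks, none adjacent on disk, each needing one worst-case seek of
 * access_secs seconds. */
static double worst_case_drain_secs(int cache_mb, double access_secs)
{
	int blocks = cache_mb * 1024 / 4;	/* number of 4KB I/Os */

	return blocks * access_secs;
}
```

With a 64MB cache and 20ms per access this gives 16384 * 0.020 ≈ 328 seconds; with 16MB, ≈ 82 seconds, matching the figures quoted above.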
Cheers
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-27 14:24 ` [PATCHSET #upstream] libata: improve FLUSH error handling Mark Lord
@ 2008-03-27 14:35 ` Mark Lord
2008-03-27 15:31 ` Alan Cox
` (2 more replies)
2008-03-27 17:53 ` Ric Wheeler
2008-03-28 7:46 ` Andi Kleen
2 siblings, 3 replies; 32+ messages in thread
From: Mark Lord @ 2008-03-27 14:35 UTC (permalink / raw)
To: Tejun Heo; +Cc: jeff, linux-ide, alan
Mark Lord wrote:
..
> Absolute theoretical worst case for a drive with a buffer 4X the largest
> current size: 328 seconds. Not taking into account having bad-sector
> retries for each of those I/O blocks, but *nobody* is going to wait
> that long anyway. They'll have long since pulled the power cord or
> reached for the BIG RED BUTTON.
..
Speaking of which.. these are all WRITEs.
In 18 years of IDE/ATA development,
I have *never* seen a hard disk drive report a WRITE error.
Which makes sense, if you think about it -- it's rewriting the sector
with new ECC info, so it *should* succeed. The only case where it won't
is if the sector has been marked as "bad" internally, and the drive is
too dumb to try anyway after it runs out of remap space.
In which case we've already lost data, and taking more than a hundred
and twenty seconds isn't going to make a serious difference.
Mmm.. anyone got a spare modern-ish drive to risk destroying?
Say, one of the few still-functioning DeathStars, or a buggy-NCQ Maxtor?
If so, it might be fun to try and produce a no-more-remaps scenario on it.
One could use "hdparm --make-bad-sector" to corrupt a few hundred/thousand
sectors in a row (sequentially numbered).
Then loop and attempt to read from them individually with "hdparm --read-sector"
(should fail on all, but it might force the drive to remap them).
Then finally try and write back to them with "hdparm --write-sector",
and see if a WRITE ERROR is ever reported. Maybe time the individual WRITEs
to see if any of them take more than a few milliseconds.
Perhaps try this whole thing with/without the write cache enabled.
Mmm...
Cheers
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-27 14:35 ` Mark Lord
@ 2008-03-27 15:31 ` Alan Cox
2008-03-27 18:01 ` Ric Wheeler
2008-03-28 1:57 ` Tejun Heo
2 siblings, 0 replies; 32+ messages in thread
From: Alan Cox @ 2008-03-27 15:31 UTC (permalink / raw)
To: Mark Lord; +Cc: Tejun Heo, jeff, linux-ide
> In 18 years of IDE/ATA development,
> I have *never* seen a hard disk drive report a WRITE error.
You don't try hard enough. Also a cache flush write error is more likely
than a write reporting the error, because with caching enabled the disk
only finds out when it comes to try and flush or you do.
> Which makes sense, if you think about it -- it's rewriting the sector
> with new ECC info, so it *should* succeed. The only case where it won't,
> is if the sector has been marked as "bad" internally, and the drive is
> too dumb to try anyways after it runs out of remap space.
Or you have a drive with raid optimised firmware, a magneto-optical or
other similar cases.
Alan
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-27 14:35 ` Mark Lord
2008-03-27 15:31 ` Alan Cox
@ 2008-03-27 18:01 ` Ric Wheeler
2008-03-28 1:57 ` Tejun Heo
2 siblings, 0 replies; 32+ messages in thread
From: Ric Wheeler @ 2008-03-27 18:01 UTC (permalink / raw)
To: Mark Lord; +Cc: Tejun Heo, jeff, linux-ide, alan
Mark Lord wrote:
> Mark Lord wrote:
> ..
>> Absolute theoretical worst case for a drive with a buffer 4X the largest
>> current size: 328 seconds. Not taking into account having bad-sector
>> retries for each of those I/O blocks, but *nobody* is going to wait
>> that long anyway. They'll have long since pulled the power cord or
>> reached for the BIG RED BUTTON.
> ..
>
> Speaking of which.. these are all WRITEs.
>
> In 18 years of IDE/ATA development,
> I have *never* seen a hard disk drive report a WRITE error.
I have seen them in the wild.
>
> Which makes sense, if you think about it -- it's rewriting the sector
> with new ECC info, so it *should* succeed. The only case where it won't,
> is if the sector has been marked as "bad" internally, and the drive is
> too dumb to try anyways after it runs out of remap space.
>
> In which case we've already lost data, and taking more than a hundred
> and twenty seconds isn't going to make a serious difference.
You can definitely start failing writes once your remapped sector table is
exhausted, but to your point, that drive is usually in bad shape at this point
in time.
That makes it more important to fail quickly so that we don't hang waiting for
something that is most likely to be on its last legs...
>
> Mmm.. anyone got a spare modern-ish drive to risk destroying?
> Say, one of the few still-functioning DeathStars, or an buggy-NCQ Maxtor ?
>
> If so, it might be fun to try and produce a no-more-remaps scenario on it.
> One could use "hdparm --make-bad-sector" to corrupt a few hundred/thousand
> sectors in a row (sequentially numbered).
I don't think that this will do it. What happens with our sector corruption, I
believe, is that we corrupt the data integrity bits around the sector. Once we
write, that original sector is repaired since the drive overwrites the junk bits
we gave it. The remapped sector count should not be growing (but it is worth
checking to verify my theory ;-)).
You have my blessing to be mean to a drive that you got from me if that helps ;-)
>
> Then loop and attempt to read from them individually with "hdparm
> --read-sector"
> (should fail on all, but it might force the drive to remap them).
Again, I don't think that reads will ever force a remap.
>
> Then finally try and write back to them with "hdparm --write-sector",
> and see if a WRITE ERROR is ever reported. Maybe time the individual
> WRITEs
> to see if any of them take more than a few milliseconds.
>
> Perhaps try this whole thing with/without the write cache enabled.
>
> Mmm...
>
> Cheers
ric
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-27 14:35 ` Mark Lord
2008-03-27 15:31 ` Alan Cox
2008-03-27 18:01 ` Ric Wheeler
@ 2008-03-28 1:57 ` Tejun Heo
2008-03-28 2:33 ` Mark Lord
2 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2008-03-28 1:57 UTC (permalink / raw)
To: Mark Lord; +Cc: jeff, linux-ide, alan
Hello, Mark.
Mark Lord wrote:
> Speaking of which.. these are all WRITEs.
>
> In 18 years of IDE/ATA development,
> I have *never* seen a hard disk drive report a WRITE error.
>
> Which makes sense, if you think about it -- it's rewriting the sector
> with new ECC info, so it *should* succeed. The only case where it won't,
> is if the sector has been marked as "bad" internally, and the drive is
> too dumb to try anyways after it runs out of remap space.
>
> In which case we've already lost data, and taking more than a hundred
> and twenty seconds isn't going to make a serious difference.
Yeah, the disk must be knee deep in shit to report WRITE failure. I
don't really expect the code to be exercised often, but was mainly trying
to fill the loophole in libata error handling, as this type of behavior is
what the spec requires on FLUSH errors.
I didn't add a global timeout because retries are done iff the drive is
reporting progress.
1. Drives genuinely deep in shit and getting lots of WRITE errors would
report different sectors on each FLUSH and we NEED to keep retrying.
That's what the spec requires, and the FLUSH could be from shutdown; if
so, that would be the drive's last chance to write out its data.
2. There are other issues causing the command to fail (e.g. timeout, HSM
violation or somesuch). This is the case where EH could take a really long
time if it kept retrying, but the posted code doesn't retry in this case.
3. The drive is crazy and reporting errors for no good reason. Unless
the drive is really anti-social and raises such error conditions only
after tens of seconds, this shouldn't take too long. Also, if the LBA
doesn't change between retries, the tries count is halved.
So, I think the code should be safe. Do you still think we need a
global timeout? It is easy to add. I'm just not sure whether we need
it or not.
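The retry policy described above (retry on device error only, halve the
remaining budget when the reported LBA does not advance) can be sketched as a
small userspace model. All names and the structure here are illustrative,
not the actual libata code:

```c
#include <stdint.h>

/* Userspace model of the retry policy described above; the names and
 * structure are illustrative, not the actual libata implementation. */
enum flush_err { ERR_NONE, ERR_DEV, ERR_OTHER };

struct flush_state {
	int tries;              /* retry budget, e.g. 20 */
	uint64_t last_lba;      /* failed LBA reported by the previous try */
	int have_last;          /* has a previous failure been seen? */
};

/* Decide whether a failed FLUSH should be retried.  Returns 1 to retry. */
static int flush_should_retry(struct flush_state *st, enum flush_err err,
			      uint64_t failed_lba)
{
	if (err != ERR_DEV)
		return 0;       /* timeout, HSM violation etc.: don't loop */

	/* same LBA as last time means no progress: halve the budget */
	if (st->have_last && failed_lba == st->last_lba)
		st->tries /= 2;

	st->last_lba = failed_lba;
	st->have_last = 1;
	return --st->tries > 0;
}
```

With a budget of 20, a drive that keeps reporting the same failed LBA
exhausts its budget after only a few attempts, while one that reports a new
LBA each time gets the full run of retries.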
> Mmm.. anyone got a spare modern-ish drive to risk destroying?
> Say, one of the few still-functioning DeathStars, or a buggy-NCQ Maxtor?
>
> If so, it might be fun to try and produce a no-more-remaps scenario on it.
> One could use "hdparm --make-bad-sector" to corrupt a few hundred/thousand
> sectors in a row (sequentially numbered).
>
> Then loop and attempt to read from them individually with "hdparm
> --read-sector"
> (should fail on all, but it might force the drive to remap them).
>
> Then finally try and write back to them with "hdparm --write-sector",
> and see if a WRITE ERROR is ever reported. Maybe time the individual
> WRITEs
> to see if any of them take more than a few milliseconds.
>
> Perhaps try this whole thing with/without the write cache enabled.
>
> Mmm...
Heh... :-)
--
tejun
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-28 1:57 ` Tejun Heo
@ 2008-03-28 2:33 ` Mark Lord
2008-03-28 13:36 ` Ric Wheeler
0 siblings, 1 reply; 32+ messages in thread
From: Mark Lord @ 2008-03-28 2:33 UTC (permalink / raw)
To: Tejun Heo; +Cc: jeff, linux-ide, alan
Tejun Heo wrote:
> Hello, Mark.
>
> Mark Lord wrote:
>> Speaking of which.. these are all WRITEs.
>>
>> In 18 years of IDE/ATA development,
>> I have *never* seen a hard disk drive report a WRITE error.
>>
>> Which makes sense, if you think about it -- it's rewriting the sector
>> with new ECC info, so it *should* succeed. The only case where it won't,
>> is if the sector has been marked as "bad" internally, and the drive is
>> too dumb to try anyways after it runs out of remap space.
>>
>> In which case we've already lost data, and taking more than a hundred
>> and twenty seconds isn't going to make a serious difference.
>
> Yeah, the disk must be knee deep in shit to report WRITE failure. I
> don't really expect the code to be exercised often but was mainly trying
> fill the loophole in libata error handling as this type of behavior is
> what the spec requires on FLUSH errors.
>
> I didn't add global timeout because retries are done iff the drive is
> reporting progress.
>
> 1. Drives genuinely deep in shit and getting lots of WRITE errors would
> report different sectors on each FLUSH and we NEED to keep retrying.
> That's what the spec requires and the FLUSH could be from shutdown and
> if so that would be the drive's last chance to write data to the drive.
>
> 2. There are other issues causing the command to fail (e.g. timeout, HSM
> violation or somesuch). This is the case EH can take a really long time
> if it keeps retrying but the posted code doesn't retry if this is the case.
>
> 3. The drive is crazy and reporting errors for no good reason. Unless
> the drive is really anti-social and raise such error condition only
> after tens of seconds, this shouldn't take too long. Also, if LBA
> doesn't change for each retry, the tries count is halved.
>
> So, I think the code should be safe. Do you still think we need a
> global timeout? It is easy to add. I'm just not sure whether we need
> it or not.
..
With EH becoming more and more capable and complex,
a global deadline for FLUSH looks like a reasonable thing.
People who have no backups can leave it at the default "near-infinity" setting
that is there now, and folks with RAID1 (or better) can set it to a much
shorter number -- so that their system-recovery reboot doesn't take 3 hours
to get past the FLUSH_CACHE on the failing drive. :)
Cheers
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-28 2:33 ` Mark Lord
@ 2008-03-28 13:36 ` Ric Wheeler
2008-03-28 14:52 ` Tejun Heo
0 siblings, 1 reply; 32+ messages in thread
From: Ric Wheeler @ 2008-03-28 13:36 UTC (permalink / raw)
To: Mark Lord; +Cc: Tejun Heo, jeff, linux-ide, alan
Mark Lord wrote:
> Tejun Heo wrote:
>> Hello, Mark.
>>
>> Mark Lord wrote:
>>> Speaking of which.. these are all WRITEs.
>>>
>>> In 18 years of IDE/ATA development,
>>> I have *never* seen a hard disk drive report a WRITE error.
>>>
>>> Which makes sense, if you think about it -- it's rewriting the sector
>>> with new ECC info, so it *should* succeed. The only case where it
>>> won't,
>>> is if the sector has been marked as "bad" internally, and the drive is
>>> too dumb to try anyways after it runs out of remap space.
>>>
>>> In which case we've already lost data, and taking more than a hundred
>>> and twenty seconds isn't going to make a serious difference.
>>
>> Yeah, the disk must be knee deep in shit to report WRITE failure. I
>> don't really expect the code to be exercised often but was mainly trying
>> fill the loophole in libata error handling as this type of behavior is
>> what the spec requires on FLUSH errors.
>>
>> I didn't add global timeout because retries are done iff the drive is
>> reporting progress.
>>
>> 1. Drives genuinely deep in shit and getting lots of WRITE errors would
>> report different sectors on each FLUSH and we NEED to keep retrying.
>> That's what the spec requires and the FLUSH could be from shutdown and
>> if so that would be the drive's last chance to write data to the drive.
>>
>> 2. There are other issues causing the command to fail (e.g. timeout, HSM
>> violation or somesuch). This is the case EH can take a really long time
>> if it keeps retrying but the posted code doesn't retry if this is the
>> case.
>>
>> 3. The drive is crazy and reporting errors for no good reason. Unless
>> the drive is really anti-social and raise such error condition only
>> after tens of seconds, this shouldn't take too long. Also, if LBA
>> doesn't change for each retry, the tries count is halved.
>>
>> So, I think the code should be safe. Do you still think we need a
>> global timeout? It is easy to add. I'm just not sure whether we need
>> it or not.
> ..
>
> With EH becoming more and more capable and complex,
> a global deadline for FLUSH looks like a reasonable thing.
> People who have no backups can leave it at the default "near-infinity"
> setting
> that is there now, and folks with RAID1 (or better) can set it to a much
> shorter number -- so that their system-recovery reboot doesn't take 3 hours
> to get past the FLUSH_CACHE on the failing drive. :)
>
> Cheers
I think that is a really important knob to have. Not just for RAID
systems; we also use FLUSH_CACHE on systems without barriers, mainly
when we power down and do the unmounts, etc.
If you hit a bad block during power down of a laptop, I can imagine that
having a worst case of 30(?) seconds is infinitely better than multiple
minutes ;-)
ric
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-28 13:36 ` Ric Wheeler
@ 2008-03-28 14:52 ` Tejun Heo
2008-03-28 14:53 ` Ric Wheeler
2008-03-28 16:04 ` Mark Lord
0 siblings, 2 replies; 32+ messages in thread
From: Tejun Heo @ 2008-03-28 14:52 UTC (permalink / raw)
To: ric; +Cc: Mark Lord, jeff, linux-ide, alan
Ric Wheeler wrote:
> I think that is a really important knob to have. Not just for RAID
> systems, but we use the FLUSH_CACHE on systems without barriers mainly
> when we power down & do the unmounts, etc.
>
> If you hit a bad block during power down of a laptop, I can image that
> have a worst case of (30?) seconds is infinitely better than multiple
> minutes ;-)
Fully finishing FLUSH CACHE requires command repetition. Not fully
finishing FLUSH CACHE on shutdown means sure data loss. Given that
FLUSH CACHE failure is very rare and it's repeatedly retried if and only
if the device actively indicates failure, I'm not too sure. Also note
that if FLUSH CACHE fails, you cannot even trust the FS journal. Things
can get silently corrupt.
--
tejun
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-28 14:52 ` Tejun Heo
@ 2008-03-28 14:53 ` Ric Wheeler
2008-03-28 15:16 ` Alan Cox
2008-03-28 16:04 ` Mark Lord
1 sibling, 1 reply; 32+ messages in thread
From: Ric Wheeler @ 2008-03-28 14:53 UTC (permalink / raw)
To: Tejun Heo; +Cc: Mark Lord, jeff, linux-ide, alan
Tejun Heo wrote:
> Ric Wheeler wrote:
>> I think that is a really important knob to have. Not just for RAID
>> systems, but we use the FLUSH_CACHE on systems without barriers mainly
>> when we power down & do the unmounts, etc.
>>
>> If you hit a bad block during power down of a laptop, I can image that
>> have a worst case of (30?) seconds is infinitely better than multiple
>> minutes ;-)
>
> Fully finishing FLUSH CACHE requires command repetition. Not fully
> finishing FLUSH CACHE on shutdown means sure data loss. Given that
> FLUSH CACHE failure is very rare and it's repeatedly retried if and only
> if the device actively indicates failure, I'm not too sure. Also note
> that if FLUSH CACHE fails, you cannot even trust the FS journal. Things
> can get silently corrupt.
>
I do agree with the above; we should try to get the FLUSH done according
to spec. I meant to argue that we should bound the time spent. If my
laptop spends more than 30? 60? 120? seconds trying to flush a write
cache, I will probably be looking for a way to force it to power down ;-)
It is also worth noting that most users of ext3 run without barriers
enabled (and the drive write cache enabled) which means that we test
this corruption path on any non-UPS power failure.
ric
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-28 14:53 ` Ric Wheeler
@ 2008-03-28 15:16 ` Alan Cox
2008-03-28 16:57 ` Ric Wheeler
0 siblings, 1 reply; 32+ messages in thread
From: Alan Cox @ 2008-03-28 15:16 UTC (permalink / raw)
To: ric; +Cc: Tejun Heo, Mark Lord, jeff, linux-ide
> I do agree with the above, we should try to get the FLUSH done according
> to spec, I meant to argue that we should bound the time spent. If my
> laptop spends more than 30? 60? 120? seconds trying to flush a write
> cache, I will probably be looking for a way to force it to power down ;-)
But if your PhD thesis is being written back you'd feel differently 8). I am
not sure we can exceed 30 seconds: although we currently set 60-second
I/O timeouts, we are timing out at 30 seconds in some traces I get sent,
so something is knocking our timeout handling back to the default. I've
tried tracing it and so far failed to figure it out.
> It is also worth noting that most users of ext3 run without barriers
> enabled (and the drive write cache enabled) which means that we test
> this corruption path on any non-UPS power failure.
It is most unfortunate that distributions continue to ship that default.
Alan
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-28 15:16 ` Alan Cox
@ 2008-03-28 16:57 ` Ric Wheeler
0 siblings, 0 replies; 32+ messages in thread
From: Ric Wheeler @ 2008-03-28 16:57 UTC (permalink / raw)
To: Alan Cox; +Cc: Tejun Heo, Mark Lord, jeff, linux-ide
Alan Cox wrote:
>> I do agree with the above, we should try to get the FLUSH done according
>> to spec, I meant to argue that we should bound the time spent. If my
>> laptop spends more than 30? 60? 120? seconds trying to flush a write
>> cache, I will probably be looking for a way to force it to power down ;-)
>
> But if your PhD thesis is being written back you'd be different 8). I am
> not sure we can exceed 30 seconds, currently although we set 60 second
> I/O timeouts we are timing out at 30 seconds in some traces I get sent so
> something is messing up our timeout handling back to the default. I've
> tried tracing it and so far failed to figure it out.
The challenge is in getting the retry behaviour right, more than in the
timeout on any one IO. For example, if we have a full 16MB write cache and
the disk is really, really toast (i.e., a head failed, which means each and
every IO in that 16MB will fail), we don't want to do 16MB/512 distinct
30-60 second retries....
That is where Mark's idea about capping the whole sequence of retries
comes into play - we can use the global timer to prevent this from
running into an eternity of retry attempts.
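A global cap of that sort could be as simple as deriving the permitted
number of attempts from an overall deadline; a trivial sketch, where the
helper name and the constants are hypothetical:

```c
/* Hypothetical helper for the global cap discussed above: derive the
 * permitted number of FLUSH attempts from an overall deadline and the
 * per-attempt timeout.  Always allow at least one attempt. */
static int flush_tries_within(long deadline_secs, long per_try_secs)
{
	long tries = deadline_secs / per_try_secs;
	return tries > 0 ? (int)tries : 1;
}
```

A 120-second deadline with the 60-second per-try timeout caps the sequence
at two attempts, rather than one per cached sector.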
>
>> It is also worth noting that most users of ext3 run without barriers
>> enabled (and the drive write cache enabled) which means that we test
>> this corruption path on any non-UPS power failure.
>
> It is most unfortunate that distributions continue to ship that default.
>
> Alan
I have been thinking that running without barriers by default is mostly
OK for laptops (which have a fairly usable UPS in a working battery). If
we destage the write cache robustly, as this thread is discussing, we
should cover almost all normal failure cases.
Desktop and server systems should normally either use barriers or
disable the write cache whenever they have data they care about...
ric
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-28 14:52 ` Tejun Heo
2008-03-28 14:53 ` Ric Wheeler
@ 2008-03-28 16:04 ` Mark Lord
1 sibling, 0 replies; 32+ messages in thread
From: Mark Lord @ 2008-03-28 16:04 UTC (permalink / raw)
To: Tejun Heo; +Cc: ric, jeff, linux-ide, alan
Heh.. and Tomas Lund has just posted here reporting a FLUSH CACHE EXT error,
the first one I've ever seen (probably missed some, by the sounds of things).
Look for subject line: Need help understanding SATA error message.
Cheers
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-27 14:24 ` [PATCHSET #upstream] libata: improve FLUSH error handling Mark Lord
2008-03-27 14:35 ` Mark Lord
@ 2008-03-27 17:53 ` Ric Wheeler
2008-03-27 18:52 ` Jeff Garzik
2008-03-28 7:46 ` Andi Kleen
2 siblings, 1 reply; 32+ messages in thread
From: Ric Wheeler @ 2008-03-27 17:53 UTC (permalink / raw)
To: Mark Lord; +Cc: Tejun Heo, jeff, linux-ide, alan
Mark Lord wrote:
> Tejun Heo wrote:
>>
>> As the code is being smart against retrying needlessly, it won't be
>> too dangerous to increase the 20 tries (taken from Alan's patch) but I
>> think it's as good as any other random number. If anyone knows any
>> meaningful number, please chime in. The same goes for 60 secs timeout
>> too.
> ..
>
> I really think that we should enforce a strict upper limit on the time
> that can be spent inside the flush-cache near-infinite loop being
> introduced.
>
> Some things rely on I/O completing or failing in a time deterministic
> manner.
>
> Really, the entire flush + retries etc.. should never, ever, be permitted
> to take more than XX seconds total. Not 60 seconds per retry, but XX
> seconds
> total for the original command + however many retries we can fit in there.
>
> As for the value of XX, well.. make it a sysfs attribute, with a default
> of something "sensible". The time bounds is really dependent upon how
> quickly the drive can empty its onboard cache, or how large a cache it has.
>
> Figure the biggest drives will have no more than, say 64MB of cache for
> many years (biggest SATA drive now uses 16MB). Assuming near-worst case
> I/O size of 4KB, that's 16384 I/O operations, if none were adjacent on
> disk.
>
> What's the average access time these days? Say.. 20ms worst case for any
> drive with a cache that huge? That's unrealistically slow for data that's
> already in the drive cache, but .. 16384 * .020 seconds = 328 seconds.
>
> Absolute theoretical worst case for a drive with a buffer 4X the largest
> current size: 328 seconds. Not taking into account having bad-sector
> retries for each of those I/O blocks, but *nobody* is going to wait
> that long anyway. They'll have long since pulled the power cord or
> reached for the BIG RED BUTTON.
>
> On a 16MB cache drive, that number would be 328 / 4 = 82 seconds.
>
> That's what I'd put for the limit.
> But we could be slighly nonsensical and agree upon 120 seconds. :)
>
> Cheers
>
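Mark's arithmetic above is easy to check mechanically; this is a standalone
calculation, not libata code:

```c
/* Worst case: every cached block needs its own random-access write,
 * so total time = (cache size / IO size) * per-access time. */
static double flush_worst_case_secs(double cache_bytes, double io_bytes,
				    double access_secs)
{
	return cache_bytes / io_bytes * access_secs;
}
```

64MB / 4KB = 16384 writes at 20ms each comes to about 328 seconds, and a
16MB cache gives about 82 seconds, matching the numbers quoted above.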
I think that the 30 seconds was meant to be that worst case time for the drive
to respond to a command. We try to push vendors to respond in much less time
than that (it's important to get things like the fast fail path for RAID working
correctly), say something like 10-15 seconds.
ric
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-27 17:53 ` Ric Wheeler
@ 2008-03-27 18:52 ` Jeff Garzik
2008-03-27 20:23 ` Ric Wheeler
0 siblings, 1 reply; 32+ messages in thread
From: Jeff Garzik @ 2008-03-27 18:52 UTC (permalink / raw)
To: ric; +Cc: Mark Lord, Tejun Heo, linux-ide, alan
Ric Wheeler wrote:
> I think that the 30 seconds was meant to be that worst case time for the
> drive to respond to a command. We try to push vendors to respond in much
> less time than that (it's important to get things like the fast fail
> path for RAID working correctly), say something like 10-15 seconds.
Multiple vendors say the FLUSH CACHE worst case can exceed 30 seconds
though...
Jeff
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-27 18:52 ` Jeff Garzik
@ 2008-03-27 20:23 ` Ric Wheeler
0 siblings, 0 replies; 32+ messages in thread
From: Ric Wheeler @ 2008-03-27 20:23 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Mark Lord, Tejun Heo, linux-ide, alan
Jeff Garzik wrote:
> Ric Wheeler wrote:
>> I think that the 30 seconds was meant to be that worst case time for
>> the drive to respond to a command. We try to push vendors to respond
>> in much less time than that (it's important to get things like the
>> fast fail path for RAID working correctly), say something like 10-15
>> seconds.
>
>
> Multiple vendors say the FLUSH CACHE worst case can exceed 30 seconds
> though...
>
> Jeff
That might be true, but in that case, I still think we should never retry the
flush ;-)
ric
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-27 14:24 ` [PATCHSET #upstream] libata: improve FLUSH error handling Mark Lord
2008-03-27 14:35 ` Mark Lord
2008-03-27 17:53 ` Ric Wheeler
@ 2008-03-28 7:46 ` Andi Kleen
2008-03-28 8:30 ` Tejun Heo
2 siblings, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2008-03-28 7:46 UTC (permalink / raw)
To: Mark Lord; +Cc: Tejun Heo, jeff, linux-ide, alan
Mark Lord <liml@rtr.ca> writes:
>
> Really, the entire flush + retries etc.. should never, ever, be permitted
> to take more than XX seconds total. Not 60 seconds per retry, but XX seconds
> total for the original command + however many retries we can fit in there.
I fully agree. Also please make that user limit settable via module_param
too, not only sysfs. I recently had the pleasure of dealing with a failing
disk which started to spew errors already during device probing, and it
somewhat annoyed me that it slowed down boot-up time that much. Perhaps
the default during device probing should also be lower?
-Andi
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-28 7:46 ` Andi Kleen
@ 2008-03-28 8:30 ` Tejun Heo
2008-03-28 8:48 ` Andi Kleen
0 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2008-03-28 8:30 UTC (permalink / raw)
To: Andi Kleen; +Cc: Mark Lord, jeff, linux-ide, alan
Hello, Andi.
Andi Kleen wrote:
> Mark Lord <liml@rtr.ca> writes:
>> Really, the entire flush + retries etc.. should never, ever, be
>> permitted to take more than XX seconds total. Not 60 seconds per
>> retry, but XX seconds total for the original command + however many
>> retries we can fit in there.
>
> I fully agree. Also please make that user limit settable use
> module_param too, not only sysfs. I recently had the pleasure to deal
> with a failing disk which started to spew errors already on device
> probing and it somewhat annoyed me that that slowed down boot up time
> that much. Perhaps the default during device probing should be also
> lower?
Adding a global timeout now. Regarding the boot time exception handling,
which version was it, and do you have logs?
--
tejun
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-28 8:30 ` Tejun Heo
@ 2008-03-28 8:48 ` Andi Kleen
2008-03-28 8:53 ` Tejun Heo
0 siblings, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2008-03-28 8:48 UTC (permalink / raw)
To: Tejun Heo; +Cc: Mark Lord, jeff, linux-ide, alan
Tejun Heo <htejun@gmail.com> writes:
> Regarding the boot time exception, which
> version was it and do you have logs?
I tried first 2.6.25-rc4 and then 2.6.24 (similar behaviour).
My impression was also that the exception handling was very
heavy handed -- is it really needed to always reset the link
on each error?
I'll send you full logs (huge) in separate mail.
-Andi
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-28 8:48 ` Andi Kleen
@ 2008-03-28 8:53 ` Tejun Heo
0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2008-03-28 8:53 UTC (permalink / raw)
To: Andi Kleen; +Cc: Mark Lord, jeff, linux-ide, alan
Hello,
Andi Kleen wrote:
> Tejun Heo <htejun@gmail.com> writes:
>
>> Regarding the boot time exception, which
>> version was it and do you have logs?
>
> I tried first 2.6.25-rc4 and then 2.6.24 (similar behaviour).
Hmmm...
> My impression was also that the exception handling was very
> heavy handed -- is it really needed to always reset the link
> on each error?
It depends on the type of error. For things like state machine
violation and timeouts, I think it's safer to reset.
> I'll send you full logs (huge) in separate mail.
Alright.
--
tejun
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-27 10:14 [PATCHSET #upstream] libata: improve FLUSH error handling Tejun Heo
` (5 preceding siblings ...)
2008-03-27 14:24 ` [PATCHSET #upstream] libata: improve FLUSH error handling Mark Lord
@ 2008-03-27 17:51 ` Ric Wheeler
2008-03-27 18:53 ` Jeff Garzik
` (2 more replies)
6 siblings, 3 replies; 32+ messages in thread
From: Ric Wheeler @ 2008-03-27 17:51 UTC (permalink / raw)
To: Tejun Heo; +Cc: jeff, linux-ide, alan, liml
Tejun Heo wrote:
> Hello, all.
>
> I was going through my mailbox and saw Alan's patch[A] which didn't get
> the love it deserved. It turned out that the ata_flush_cache() function
> the patch modifies had been dead for some time. I ended up re-doing
> it in the EH framework and it turned out okay.
>
> After the patchset, on FLUSH failure, the following is done.
>
> 1. It's always retried with longer timeout (60s for now) after
> failure.
>
> 2. If the device is making progress by retrying, humor it for longer
> (20 tries for now).
>
> 3. If the device fails the command with the same failed sector,
> retry fewer times (log2 of the original number of tries).
>
> 4. If retried FLUSH fails for something other than device error,
> don't keep retrying. We're likely wasting time.
Are we sure that it is ever the right thing to reissue a flush command?
I am worried that this might be much closer to the media error class of
device errors than something that will benefit from a retry of any type.
Also, I am unclear as to how we measure the progress of the device if the
flush command has failed.
> As the code is being smart against retrying needlessly, it won't be
> too dangerous to increase the 20 tries (taken from Alan's patch) but I
> think it's as good as any other random number. If anyone knows any
> meaningful number, please chime in. The same goes for 60 secs timeout
> too.
>
> I made a debug patch to trigger timeouts and device errors on FLUSH.
> I'll post the patch as a reply. It adds the following four module
> params which can be written runtime via /sys/module/libata/parameters.
>
> flush_dbg_do_timeout:
> If non-zero value is written, the specified number of FLUSHes
> will be timed out.
>
> flush_dbg_do_deverr:
> If non-zero value is written, the specified number of FLUSHes
> will be terminated with device error.
>
> flush_dbg_fail_sector:
> The failed sector for the next deverr.
>
> flush_dbg_fail_increment:
> Number of sectors to add to fail_sector after each deverr.
>
> I tested different scenarios and it all seems to work fine but it
> would be really great if someone can test this on a (hmmm....) live
> dying drive.
>
> This patchset is for #upstream but generated on top of
>
> #upstream-fixes (4cde32fc4b32e96a99063af3183acdfd54c563f0)
> + [1] libata: ATA_EHI_LPM should be ATA_EH_LPM
>
> as there is a humongous patchset pending review #upstream. Once this
> gets acked, I'll move it over to #upstream. It shouldn't interfere
> too much anyway.
>
> Thanks.
>
We have access to a fair number of flaky drives; I can see if we can test
some of this...
ric
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-27 17:51 ` Ric Wheeler
@ 2008-03-27 18:53 ` Jeff Garzik
2008-03-27 22:00 ` Alan Cox
2008-03-28 2:02 ` Tejun Heo
2 siblings, 0 replies; 32+ messages in thread
From: Jeff Garzik @ 2008-03-27 18:53 UTC (permalink / raw)
To: ric; +Cc: Tejun Heo, linux-ide, alan, liml
Ric Wheeler wrote:
> Are we sure that it is ever the right thing to do to reissue a flush
> command?
I confess no field experience here, but that's what the spec wants at
least...
Jeff
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-27 17:51 ` Ric Wheeler
2008-03-27 18:53 ` Jeff Garzik
@ 2008-03-27 22:00 ` Alan Cox
2008-03-28 2:02 ` Tejun Heo
2 siblings, 0 replies; 32+ messages in thread
From: Alan Cox @ 2008-03-27 22:00 UTC (permalink / raw)
To: ric; +Cc: Tejun Heo, jeff, linux-ide, liml
> Are we sure that it is ever the right thing to do to reissue a flush command?
The spec is pretty clear about this.
> I am worried that this might be much closer to the media error class of device
> errors than something that will benefit from a retry of any type.
There are several cases where it matters now and more where it is going
to matter. RAID firmware is the obvious one, but the upcoming joy is large
physical sector sizes, where a read/modify/write may be required and fail,
yet the remaining sectors can still be written or fired at spare blocks.
> Also, I am unclear as to how we measure the progress of the device if the flush
> command has failed?
The spec says that if a flush fails to write back a block then the
failed block is dropped from the cache. Thus you make progress and you
have a mechanism to report each failed LBA.
Alan
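The behaviour Alan describes can be modelled with a toy drive: a FLUSH stops
at the first bad block, drops it from the cache, and reports its LBA, so a
re-issued FLUSH resumes with the remaining blocks. This is purely
illustrative; nothing here is real drive firmware:

```c
#include <string.h>

/* Toy model of the spec behaviour described above: FLUSH writes back
 * cached blocks in order, stops at the first failure, drops the failed
 * block from the cache, and reports its LBA. */
#define CACHE_MAX 8

struct toy_drive {
	long cache[CACHE_MAX];  /* dirty LBAs still to be written */
	int ncached;
	long bad_lba;           /* one unwritable sector */
};

/* Returns -1 on success (cache fully flushed), else the LBA that failed. */
static long toy_flush(struct toy_drive *d)
{
	int i = 0;

	while (i < d->ncached) {
		long lba = d->cache[i];

		/* the block leaves the cache whether the write worked or not */
		memmove(&d->cache[i], &d->cache[i + 1],
			(d->ncached - i - 1) * sizeof(long));
		d->ncached--;
		if (lba == d->bad_lba)
			return lba;     /* stop and report the failure */
	}
	return -1;
}
```

Each re-issue either drains the cache or reports a later failed LBA, which
is exactly the progress indication the retry logic keys off.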
* Re: [PATCHSET #upstream] libata: improve FLUSH error handling
2008-03-27 17:51 ` Ric Wheeler
2008-03-27 18:53 ` Jeff Garzik
2008-03-27 22:00 ` Alan Cox
@ 2008-03-28 2:02 ` Tejun Heo
2008-03-28 9:48 ` Alan Cox
2 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2008-03-28 2:02 UTC (permalink / raw)
To: ric; +Cc: jeff, linux-ide, alan, liml
Ric Wheeler wrote:
>> 4. If retried FLUSH fails for something other than device error,
>> don't keep retrying. We're likely wasting time.
>
> Are we sure that it is ever the right thing to do to reissue a flush
> command?
>
> I am worried that this might be much closer to the media error class of
> device errors than something that will benefit from a retry of any type.
Heh.. yeah, as Alan and Jeff said, that's the spec. If write failure
occurs during flush, the drive is supposed to stop flushing, remove the
sector from cache and report error to the host. If the host re-issues
flush, the drive continues flushing after the failed sector.
> Also, I am unclear as to how we measure the progress of the device if
> the flush command has failed?
Yes, by reporting the first failing sector. It's probably better to test
failed > last_failed than failed == last_failed, though.
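The difference matters for a drive that bounces between two failed LBAs:
under an inequality test that looks like progress, under a greater-than test
it does not. A minimal illustration, with made-up helper names:

```c
/* Two candidate progress checks for the FLUSH retry loop. */
static int progressed_ne(long failed, long last_failed)
{
	return failed != last_failed;   /* "any change" counts as progress */
}

static int progressed_gt(long failed, long last_failed)
{
	return failed > last_failed;    /* stricter: the LBA must advance */
}
```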
>> as there is a humongous patchset pending review #upstream. Once this
>> gets acked, I'll move it over to #upstream. It shouldn't interfere
>> too much anyway.
>>
> We have access to a fair number of flaky drives, I can see if we can
> test some of this...
If you can trigger flush errors, please go ahead.
Thanks.
--
tejun