[PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling

* [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling
@ 2005-06-26 15:21 Tejun Heo
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 01/09] NCQ: add ata_qc_complete_err() and @drv_err to functions Tejun Heo
                   ` (10 more replies)
  0 siblings, 11 replies; 26+ messages in thread
From: Tejun Heo @ 2005-06-26 15:21 UTC
  To: jgarzik, axboe, linux-ide

 Hello, Jeff.
 Hello, Jens.

 This patchset implements generic completion and error-handling for
NCQ commands.  This patchset assumes that the previous six misc
patches to NCQ are applied.

 In the original implementation, the ahci driver itself handled
completion and error-handling of NCQ commands, and NCQ error-handling
was broken:

 * fails to finish scsi cmds (EH cannot be entered after the first
   time and erring requests never get completed or failed.)
 * doesn't reset host_failed
 * corrupts shost->eh_cmd_q

 New implementation...

 * does as much as possible in the libata-core layer.
 * unifies error and timeout paths and handles all in EH context.

 I tested EH using ASUS P5LD2 (ICH7R) and Samsung HD160JJ, and, for
me, EH is pretty solid now.  I'll post logs from my test runs in a
reply to this mail.

[ Start of patch descriptions ]

01_NCQ_add-ata_qc_complete_err.patch
	: add ata_qc_complete_err() and @drv_err to functions

	In the error path, the error register is read at a later
	stage, during sense buffer construction.  With NCQ, however,
	we need to take the error register value from the log page
	and use it for sense buffer construction.  This patch adds
	ata_qc_complete_err() and adds @drv_err to functions in the
	error path.

02_NCQ_add-timeout-to-ata_read_log_page.patch
	: add timeout to ata_read_log_page()

	Some drives may lock up during read log page after NCQ
	failures (HD160JJ does), so ata_read_log_page() needs a
	timeout.  This patch adds a timeout parameter to
	ata_read_log_page().

03_NCQ_implement-ap_sactive.patch
	: add ap->sactive

	This patch makes libata-core layer aware of ap->sactive
	status.

04_NCQ_export-scsi_retry_command.patch
	: export scsi_retry_command

	Export scsi_retry_command

05_NCQ_add-ncq-helpers.patch
	: implement NCQ helpers

	This patch implements the following NCQ helpers, to be used by
	specific drivers to implement their interrupt and error handlers.

	ata_ncq_complete()	: normal completion of commands
	ata_ncq_abort()		: error completion of commands
	ata_ncq_recover()	: EH recovery helper

06_NCQ_ahci-new-eh.patch
	: convert ahci to use new NCQ helpers

	This patch converts ahci to use new NCQ helpers.

07_NCQ_ahci-stop-dma-before-reset.patch
	: stop dma before reset

	AHCI 1.1 mandates stopping DMA before issuing COMRESET.  The
	original code didn't, which resulted in occasional lockups of
	the controller during EH recovery.  This patch fixes the
	problem.

08_NCQ_remove-or-unexport-unused-functions.patch
	: remove/unexport unused/unnecessary functions

	This patch removes ata_scsi_block_requests() and
	ata_scsi_unblock_requests(), and makes ata_read_log_page() and
	ata_to_sense_error() static.

09_NCQ_ahci-debug.patch
	: causes error or timeout

	This is what I've used for testing EH.  This patch contains
	code for corrupting or skipping specific tags to cause error
	conditions.  If you're curious....

[ End of patch descriptions ]
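
 The description of patch 01 says that, with NCQ, the error register
value has to come from the log page.  A minimal userspace sketch of
that decoding follows, using the same byte offsets as
ata_read_log_10h() in patch 05.  struct ncq_err_info and
decode_log_10h() are hypothetical names used only for illustration;
they are not part of libata.

```c
#include <assert.h>
#include <stdint.h>

#define ATA_ERR 0x01u	/* Status register ERR bit (bit 0) */

/* Hypothetical result type for this sketch; not a libata structure. */
struct ncq_err_info {
	unsigned tag;		/* tag of the failed command */
	uint8_t drv_stat;	/* Status value, with ERR forced on */
	uint8_t drv_err;	/* Error value */
};

/*
 * Decode the first bytes of log page 10h.  Returns 0 on success, or
 * -1 when the NQ bit (bit 7 of byte 0) is set, in which case the page
 * does not describe a failed queued command.
 */
static int decode_log_10h(const uint8_t *page, struct ncq_err_info *out)
{
	if (page[0] & 0x80)
		return -1;

	out->tag = page[0] & 0x1f;		/* low five bits hold the tag */
	out->drv_stat = page[2] | ATA_ERR;	/* same forcing as the patch */
	out->drv_err = page[3];
	return 0;
}
```

 The sketch takes the page as uint8_t rather than char so that the
bytes are never sign-extended; that is a choice made for this example,
not something the patch requires.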

 Thanks a lot.

--
tejun



* Re: [PATCH Linux 2.6.12 01/09] NCQ: add ata_qc_complete_err() and @drv_err to functions
  2005-06-26 15:21 [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling Tejun Heo
@ 2005-06-26 15:21 ` Tejun Heo
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 02/09] NCQ: add timeout to ata_read_log_page() Tejun Heo
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2005-06-26 15:21 UTC
  To: jgarzik, axboe, linux-ide

01_NCQ_add-ata_qc_complete_err.patch

	In the error path, the error register is read at a later
	stage, during sense buffer construction.  With NCQ, however,
	we need to take the error register value from the log page
	and use it for sense buffer construction.  This patch adds
	ata_qc_complete_err() and adds @drv_err to functions in the
	error path.

Signed-off-by: Tejun Heo <htejun@gmail.com>

 drivers/scsi/libata-core.c |   35 +++++++++++++++++++++++----
 drivers/scsi/libata-scsi.c |   57 +++++++++++++++++++++++++--------------------
 include/linux/libata.h     |    3 +-
 3 files changed, 64 insertions(+), 31 deletions(-)

Index: work/drivers/scsi/libata-core.c
===================================================================
--- work.orig/drivers/scsi/libata-core.c	2005-06-27 00:20:28.000000000 +0900
+++ work/drivers/scsi/libata-core.c	2005-06-27 00:20:29.000000000 +0900
@@ -63,7 +63,8 @@ static struct ata_queued_cmd *ata_qc_new
 					      struct ata_device *dev);
 static struct ata_queued_cmd *ata_eh_qc_new_init(struct ata_port *ap,
 						 struct ata_device *dev);
-static int ata_qc_complete_noop(struct ata_queued_cmd *qc, u8 drv_stat);
+static int ata_qc_complete_noop(struct ata_queued_cmd *qc,
+				u8 drv_stat, u8 drv_err);
 static void __ata_qc_complete(struct ata_queued_cmd *qc);
 
 static unsigned int ata_unique_id = 1;
@@ -3100,7 +3101,8 @@ static struct ata_queued_cmd *ata_eh_qc_
 	return qc;
 }
 
-static int ata_qc_complete_noop(struct ata_queued_cmd *qc, u8 drv_stat)
+static int ata_qc_complete_noop(struct ata_queued_cmd *qc,
+				u8 drv_stat, u8 drv_err)
 {
 	return 0;
 }
@@ -3152,9 +3154,10 @@ void ata_qc_free(struct ata_queued_cmd *
 }
 
 /**
- *	ata_qc_complete - Complete an active ATA command
+ *	ata_qc_complete_err - Complete an active ATA command
  *	@qc: Command to complete
  *	@drv_stat: ATA Status register contents
+ *	@drv_err: ATA Error register contents
  *
  *	Indicate to the mid and upper layers that an ATA
  *	command has completed, with either an ok or not-ok status.
@@ -3164,7 +3167,7 @@ void ata_qc_free(struct ata_queued_cmd *
  *
  */
 
-void ata_qc_complete(struct ata_queued_cmd *qc, u8 drv_stat)
+void ata_qc_complete_err(struct ata_queued_cmd *qc, u8 drv_stat, u8 drv_err)
 {
 	int rc;
 
@@ -3175,7 +3178,7 @@ void ata_qc_complete(struct ata_queued_c
 		ata_sg_clean(qc);
 
 	/* call completion callback */
-	rc = qc->complete_fn(qc, drv_stat);
+	rc = qc->complete_fn(qc, drv_stat, drv_err);
 
 	/* if callback indicates not to complete command (non-zero),
 	 * return immediately
@@ -3188,6 +3191,27 @@ void ata_qc_complete(struct ata_queued_c
 	VPRINTK("EXIT\n");
 }
 
+/**
+ *	ata_qc_complete - Complete an active ATA command
+ *	@qc: Command to complete
+ *	@drv_stat: ATA Status register contents
+ *
+ *	This function reads the Error register if necessary and calls
+ *	ata_qc_complete_err().
+ *
+ *	LOCKING:
+ *	spin_lock_irqsave(host_set lock)
+ *
+ */
+
+void ata_qc_complete(struct ata_queued_cmd *qc, u8 drv_stat)
+{
+	u8 drv_err = 0;
+	if ((drv_stat & (ATA_ERR | ATA_BUSY)) == ATA_ERR)
+		drv_err = ata_chk_err(qc->ap);
+	ata_qc_complete_err(qc, drv_stat, drv_err);
+}
+
 static inline int ata_should_dma_map(struct ata_queued_cmd *qc)
 {
 	struct ata_port *ap = qc->ap;
@@ -4579,6 +4603,7 @@ EXPORT_SYMBOL_GPL(ata_std_ports);
 EXPORT_SYMBOL_GPL(ata_device_add);
 EXPORT_SYMBOL_GPL(ata_sg_init);
 EXPORT_SYMBOL_GPL(ata_sg_init_one);
+EXPORT_SYMBOL_GPL(ata_qc_complete_err);
 EXPORT_SYMBOL_GPL(ata_qc_complete);
 EXPORT_SYMBOL_GPL(ata_qc_issue_prot);
 EXPORT_SYMBOL_GPL(ata_eng_timeout);
Index: work/drivers/scsi/libata-scsi.c
===================================================================
--- work.orig/drivers/scsi/libata-scsi.c	2005-06-27 00:20:28.000000000 +0900
+++ work/drivers/scsi/libata-scsi.c	2005-06-27 00:20:29.000000000 +0900
@@ -188,8 +188,10 @@ struct ata_queued_cmd *ata_scsi_qc_new(s
 
 /**
  *	ata_to_sense_error - convert ATA error to SCSI error
- *	@qc: Command that we are erroring out
+ *	@cmd: SCSI command in question
+ *	@apid: ap->id of the port in question (for logging)
  *	@drv_stat: value contained in ATA status register
+ *	@drv_err: value contained in ATA error register
  *
  *	Converts an ATA error into a SCSI error. While we are at it
  *	we decode and dump the ATA error for the user so that they
@@ -200,10 +202,8 @@ struct ata_queued_cmd *ata_scsi_qc_new(s
  *	spin_lock_irqsave(host_set lock)
  */
 
-void ata_to_sense_error(struct ata_queued_cmd *qc, u8 drv_stat)
+void ata_to_sense_error(struct scsi_cmnd *cmd, int apid, u8 drv_stat, u8 drv_err)
 {
-	struct scsi_cmnd *cmd = qc->scsicmd;
-	u8 err = 0;
 	unsigned char *sb = cmd->sense_buffer;
 	/* Based on the 3ware driver translation table */
 	static unsigned char sense_table[][4] = {
@@ -253,17 +253,23 @@ void ata_to_sense_error(struct ata_queue
 	 *	Is this an error we can process/parse
 	 */
 
-	if(drv_stat & ATA_ERR)
-		/* Read the err bits */
-		err = ata_chk_err(qc->ap);
+	if (drv_stat == 0xff && drv_err == 0xff) {
+		/* This is a request to invoke eh.  Returning
+		 * CHECK_CONDITION without actual sense data will make
+		 * SCSI midlayer invoke eh. */
+		return;
+	}
+
+	if(!(drv_stat & ATA_ERR))
+		drv_err = 0;
 
 	/* Display the ATA level error info */
 
-	printk(KERN_WARNING "ata%u: status=0x%02x { ", qc->ap->id, drv_stat);
+	printk(KERN_WARNING "ata%u: status=0x%02x { ", apid, drv_stat);
 	if(drv_stat & 0x80)
 	{
 		printk("Busy ");
-		err = 0;	/* Data is not valid in this case */
+		drv_err = 0;	/* Data is not valid in this case */
 	}
 	else {
 		if(drv_stat & 0x40)	printk("DriveReady ");
@@ -276,21 +282,21 @@ void ata_to_sense_error(struct ata_queue
 	}
 	printk("}\n");
 
-	if(err)
+	if(drv_err)
 	{
-		printk(KERN_WARNING "ata%u: error=0x%02x { ", qc->ap->id, err);
-		if(err & 0x04)		printk("DriveStatusError ");
-		if(err & 0x80)
+		printk(KERN_WARNING "ata%u: error=0x%02x { ", apid, drv_err);
+		if(drv_err & 0x04)	printk("DriveStatusError ");
+		if(drv_err & 0x80)
 		{
-			if(err & 0x04)
+			if(drv_err & 0x04)
 				printk("BadCRC ");
 			else
 				printk("Sector ");
 		}
-		if(err & 0x40)		printk("UncorrectableError ");
-		if(err & 0x10)		printk("SectorIdNotFound ");
-		if(err & 0x02)		printk("TrackZeroNotFound ");
-		if(err & 0x01)		printk("AddrMarkNotFound ");
+		if(drv_err & 0x40)	printk("UncorrectableError ");
+		if(drv_err & 0x10)	printk("SectorIdNotFound ");
+		if(drv_err & 0x02)	printk("TrackZeroNotFound ");
+		if(drv_err & 0x01)	printk("AddrMarkNotFound ");
 		printk("}\n");
 
 		/* Should we dump sector info here too ?? */
@@ -301,7 +307,7 @@ void ata_to_sense_error(struct ata_queue
 	while(sense_table[i][0] != 0xFF)
 	{
 		/* Look for best matches first */
-		if((sense_table[i][0] & err) == sense_table[i][0])
+		if((sense_table[i][0] & drv_err) == sense_table[i][0])
 		{
 			sb[0] = 0x70;
 			sb[2] = sense_table[i][1];
@@ -313,8 +319,8 @@ void ata_to_sense_error(struct ata_queue
 		i++;
 	}
 	/* No immediate match */
-	if(err)
-		printk(KERN_DEBUG "ata%u: no sense translation for 0x%02x\n", qc->ap->id, err);
+	if(drv_err)
+		printk(KERN_DEBUG "ata%u: no sense translation for 0x%02x\n", apid, drv_err);
 
 	i = 0;
 	/* Fall back to interpreting status bits */
@@ -332,7 +338,7 @@ void ata_to_sense_error(struct ata_queue
 		i++;
 	}
 	/* No error ?? */
-	printk(KERN_ERR "ata%u: called with no error (%02X)!\n", qc->ap->id, drv_stat);
+	printk(KERN_ERR "ata%u: called with no error (%02X)!\n", apid, drv_stat);
 	/* additional-sense-code[-qualifier] */
 
 	sb[0] = 0x70;
@@ -753,12 +759,13 @@ static unsigned int ata_scsi_rw_xlat(str
 	return 1;
 }
 
-static int ata_scsi_qc_complete(struct ata_queued_cmd *qc, u8 drv_stat)
+static int ata_scsi_qc_complete(struct ata_queued_cmd *qc,
+				u8 drv_stat, u8 drv_err)
 {
 	struct scsi_cmnd *cmd = qc->scsicmd;
 
 	if (unlikely(drv_stat & (ATA_ERR | ATA_BUSY | ATA_DRQ)))
-		ata_to_sense_error(qc, drv_stat);
+		ata_to_sense_error(cmd, qc->ap->id, drv_stat, drv_err);
 	else
 		cmd->result = SAM_STAT_GOOD;
 
@@ -1391,7 +1398,7 @@ void ata_scsi_badcmd(struct scsi_cmnd *c
 	done(cmd);
 }
 
-static int atapi_qc_complete(struct ata_queued_cmd *qc, u8 drv_stat)
+static int atapi_qc_complete(struct ata_queued_cmd *qc, u8 drv_stat, u8 drv_err)
 {
 	struct scsi_cmnd *cmd = qc->scsicmd;
 
Index: work/include/linux/libata.h
===================================================================
--- work.orig/include/linux/libata.h	2005-06-27 00:20:28.000000000 +0900
+++ work/include/linux/libata.h	2005-06-27 00:20:29.000000000 +0900
@@ -180,7 +180,7 @@ struct ata_port;
 struct ata_queued_cmd;
 
 /* typedefs */
-typedef int (*ata_qc_cb_t) (struct ata_queued_cmd *qc, u8 drv_stat);
+typedef int (*ata_qc_cb_t) (struct ata_queued_cmd *qc, u8 drv_stat, u8 drv_err);
 
 struct ata_ioports {
 	unsigned long		cmd_addr;
@@ -439,6 +439,7 @@ extern void ata_bmdma_start (struct ata_
 extern void ata_bmdma_stop(struct ata_port *ap);
 extern u8   ata_bmdma_status(struct ata_port *ap);
 extern void ata_bmdma_irq_clear(struct ata_port *ap);
+extern void ata_qc_complete_err(struct ata_queued_cmd *qc, u8 drv_stat, u8 drv_err);
 extern void ata_qc_complete(struct ata_queued_cmd *qc, u8 drv_stat);
 extern void ata_eng_timeout(struct ata_port *ap);
 extern void ata_scsi_simulate(u16 *id, struct scsi_cmnd *cmd,
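
The decision made by the new ata_qc_complete() wrapper can be sketched
in standalone userspace C as below.  This is an illustration only, not
kernel code: read_err_fn, pick_drv_err() and fake_read_err() are
hypothetical names, and the bit values assume the usual ATA Status
register layout (ERR is bit 0, BSY is bit 7).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ATA_ERR  0x01u	/* Status bit 0: error */
#define ATA_BUSY 0x80u	/* Status bit 7: device busy */

/* Hypothetical stand-in for reading the port's Error register. */
typedef uint8_t (*read_err_fn)(void *port);

/*
 * Return the Error register value to pass down the completion path.
 * The register is read only when ERR is set and BSY is clear; in
 * every other case 0 is returned.
 */
static uint8_t pick_drv_err(uint8_t drv_stat, read_err_fn read_err, void *port)
{
	if ((drv_stat & (ATA_ERR | ATA_BUSY)) == ATA_ERR)
		return read_err(port);
	return 0;
}

/* Stub used by the example: pretends the device reported ABRT (0x04). */
static uint8_t fake_read_err(void *port)
{
	(void)port;
	return 0x04;
}
```

With the stub above, pick_drv_err(0x41, fake_read_err, NULL) yields
0x04, while a status of 0x81 (BSY set) or 0x40 (no ERR) yields 0
without touching the register.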



* Re: [PATCH Linux 2.6.12 02/09] NCQ: add timeout to ata_read_log_page()
  2005-06-26 15:21 [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling Tejun Heo
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 01/09] NCQ: add ata_qc_complete_err() and @drv_err to functions Tejun Heo
@ 2005-06-26 15:21 ` Tejun Heo
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 03/09] NCQ: add ap->sactive Tejun Heo
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2005-06-26 15:21 UTC
  To: jgarzik, axboe, linux-ide

02_NCQ_add-timeout-to-ata_read_log_page.patch

	Some drives may lock up during read log page after NCQ
	failures (HD160JJ does), so ata_read_log_page() needs a
	timeout.  This patch adds a timeout parameter to
	ata_read_log_page().

Signed-off-by: Tejun Heo <htejun@gmail.com>

 drivers/scsi/ahci.c        |    2 +-
 drivers/scsi/libata-core.c |   35 ++++++++++++++++++++++++++++++++---
 include/linux/libata.h     |    2 +-
 3 files changed, 34 insertions(+), 5 deletions(-)

Index: work/drivers/scsi/ahci.c
===================================================================
--- work.orig/drivers/scsi/ahci.c	2005-06-27 00:20:29.000000000 +0900
+++ work/drivers/scsi/ahci.c	2005-06-27 00:20:29.000000000 +0900
@@ -674,7 +674,7 @@ static void ahci_ncq_timeout(struct ata_
 		goto done;
 	}
 
-	if (ata_read_log_page(ap, 0, READ_LOG_SATA_NCQ_PAGE, buffer, 1)) {
+	if (ata_read_log_page(ap, 0, READ_LOG_SATA_NCQ_PAGE, buffer, 1, 0)) {
 		printk(KERN_ERR "ata%u: unable to read log page\n", ap->id);
 		goto out;
 	}
Index: work/drivers/scsi/libata-core.c
===================================================================
--- work.orig/drivers/scsi/libata-core.c	2005-06-27 00:20:29.000000000 +0900
+++ work/drivers/scsi/libata-core.c	2005-06-27 00:20:29.000000000 +0900
@@ -1309,6 +1309,12 @@ err_out:
 	DPRINTK("EXIT, err\n");
 }
 
+static void ata_read_log_page_timeout(unsigned long data)
+{
+	struct completion *wait = (void *)data;
+	complete(wait);
+}
+
 /**
  *	ata_read_log_page - read a specific log page
  *	@ap: port on which device we wish to probe resides
@@ -1316,6 +1322,7 @@ err_out:
  *	@page: page to read
  *	@buffer: where to store the read data
  *	@sectors: how much data to read
+ *	@timeout: timeout in millisecs
  *
  *	After reading the device information page, we use several
  *	bits of information from it to initialize data structures
@@ -1328,10 +1335,11 @@ err_out:
  */
 
 int ata_read_log_page(struct ata_port *ap, unsigned int device, char page,
-		      char *buffer, unsigned int sectors)
+		      char *buffer, unsigned int sectors, unsigned int timeout)
 {
 	struct ata_device *dev = &ap->device[device];
 	DECLARE_COMPLETION(wait);
+	struct timer_list timer;
 	struct ata_queued_cmd *qc;
 	unsigned long flags;
 	int rc;
@@ -1355,15 +1363,36 @@ int ata_read_log_page(struct ata_port *a
 	qc->waiting = &wait;
 	qc->complete_fn = ata_qc_complete_noop;
 
+	if (timeout) {
+		init_timer(&timer);
+		timer.function = ata_read_log_page_timeout;
+		timer.data = (unsigned long)&wait;
+		timer.expires = jiffies + timeout * HZ / 1000;
+		add_timer(&timer);
+	}
+
 	spin_lock_irqsave(&ap->host_set->lock, flags);
 	rc = ata_qc_issue(qc);
 	spin_unlock_irqrestore(&ap->host_set->lock, flags);
 
-	if (rc)
+	if (rc) {
+		if (timeout)
+			del_timer(&timer);
 		return -EIO;
+	}
 
 	wait_for_completion(&wait);
-	return 0;
+
+	if (timeout && !del_timer(&timer)) {
+		spin_lock_irqsave(&ap->host_set->lock, flags);
+		if (qc->waiting == &wait) {
+			ata_qc_complete(qc, 0);
+			rc = -ETIMEDOUT;
+		}
+		spin_unlock_irqrestore(&ap->host_set->lock, flags);
+	}
+
+	return rc;
 }
 
 /**
Index: work/include/linux/libata.h
===================================================================
--- work.orig/include/linux/libata.h	2005-06-27 00:20:29.000000000 +0900
+++ work/include/linux/libata.h	2005-06-27 00:20:29.000000000 +0900
@@ -453,7 +453,7 @@ extern void ata_scsi_block_requests(stru
 extern void ata_scsi_unblock_requests(struct ata_port *);
 extern void ata_scsi_requeue(struct ata_queued_cmd *);
 extern int ata_read_log_page(struct ata_port *, unsigned int, char, char *,
-			     unsigned int);
+			     unsigned int, unsigned int);
 
 
 #ifdef CONFIG_PCI
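
The patch converts the millisecond timeout to timer ticks with
`timeout * HZ / 1000`.  The userspace sketch below models that
arithmetic so the effect of integer truncation is visible; both helper
names are hypothetical, and the tick rate is passed in as a parameter
instead of using the kernel's HZ constant.

```c
#include <assert.h>

/* Convert milliseconds to ticks, truncating like `timeout * HZ / 1000`. */
static unsigned long ms_to_ticks_trunc(unsigned int ms, unsigned int hz)
{
	return (unsigned long)((unsigned long long)ms * hz / 1000);
}

/* Same conversion, rounded up so a nonzero timeout never becomes 0 ticks. */
static unsigned long ms_to_ticks_ceil(unsigned int ms, unsigned int hz)
{
	return (unsigned long)(((unsigned long long)ms * hz + 999) / 1000);
}
```

For example, at a tick rate of 100 a 5 ms timeout truncates to 0 ticks
but rounds up to 1.  Whether that matters depends on the timeout values
callers actually pass; kernels that provide msecs_to_jiffies() could
use it instead of open-coding the conversion.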



* Re: [PATCH Linux 2.6.12 03/09] NCQ: add ap->sactive
  2005-06-26 15:21 [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling Tejun Heo
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 01/09] NCQ: add ata_qc_complete_err() and @drv_err to functions Tejun Heo
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 02/09] NCQ: add timeout to ata_read_log_page() Tejun Heo
@ 2005-06-26 15:21 ` Tejun Heo
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 04/09] NCQ: export scsi_retry_command Tejun Heo
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2005-06-26 15:21 UTC
  To: jgarzik, axboe, linux-ide

03_NCQ_implement-ap_sactive.patch

	This patch makes libata-core layer aware of ap->sactive
	status.

Signed-off-by: Tejun Heo <htejun@gmail.com>

 drivers/scsi/ahci.c        |    2 +-
 drivers/scsi/libata-core.c |   32 ++++++++++++++++++++------------
 drivers/scsi/libata-scsi.c |    4 ++--
 include/linux/libata.h     |    5 ++---
 4 files changed, 25 insertions(+), 18 deletions(-)

Index: work/drivers/scsi/ahci.c
===================================================================
--- work.orig/drivers/scsi/ahci.c	2005-06-27 00:20:29.000000000 +0900
+++ work/drivers/scsi/ahci.c	2005-06-27 00:20:29.000000000 +0900
@@ -651,7 +651,7 @@ static void ahci_ncq_timeout(struct ata_
 	u32 sactive;
 	int reset;
 
-	printk(KERN_WARNING "ata%u: ncq interrupt error (Q=%d)\n", ap->id, ap->queue_depth);
+	printk(KERN_WARNING "ata%u: ncq interrupt error (Q=%08lx)\n", ap->id, ap->sactive);
 
 	spin_lock_irqsave(&ap->host_set->lock, flags);
 
Index: work/drivers/scsi/libata-core.c
===================================================================
--- work.orig/drivers/scsi/libata-core.c	2005-06-27 00:20:29.000000000 +0900
+++ work/drivers/scsi/libata-core.c	2005-06-27 00:20:29.000000000 +0900
@@ -3143,13 +3143,17 @@ static void __ata_qc_complete(struct ata
 	unsigned int tag = ata_qc_to_tag(qc);
 
 	if (likely(qc->flags & ATA_QCFLAG_ACTIVE)) {
-		assert(ap->queue_depth);
-		ap->queue_depth--;
-
-		if (!ap->queue_depth)
-			ap->flags &= ~ATA_FLAG_NCQ_QUEUED;
-		if (tag == ap->active_tag)
+		assert(ap->flags & ATA_FLAG_INFLIGHT);
+		if (ap->active_tag == ATA_TAG_POISON) {
+			assert(ap->sactive & (1 << tag));
+			ap->sactive &= ~(1 << tag);
+			if (!ap->sactive)
+				ap->flags &= ~ATA_FLAG_INFLIGHT;
+		} else {
+			assert(ap->active_tag == tag);
 			ap->active_tag = ATA_TAG_POISON;
+			ap->flags &= ~ATA_FLAG_INFLIGHT;
+		}
 	}
 
 	qc->flags = 0;
@@ -3285,7 +3289,7 @@ static inline int ata_qc_issue_ok(struct
 	/*
 	 * If nothing is queued, it's always ok to continue.
 	 */
-	if (!ap->queue_depth)
+	if (!(ap->flags & ATA_FLAG_INFLIGHT))
 		return 1;
 
 	/*
@@ -3299,7 +3303,7 @@ static inline int ata_qc_issue_ok(struct
 	 * Command is NCQ, allow it to be queued if the commands that are
 	 * currently running are also NCQ
 	 */
-	if (ap->flags & ATA_FLAG_NCQ_QUEUED)
+	if (ap->sactive)
 		return 1;
 
 	return 0;
@@ -3377,13 +3381,17 @@ int ata_qc_issue(struct ata_queued_cmd *
 
 	ap->ops->qc_prep(qc);
 
-	qc->ap->active_tag = ata_qc_to_tag(qc);
 	qc->flags |= ATA_QCFLAG_ACTIVE;
 
-	if (qc->flags & ATA_QCFLAG_NCQ)
-		ap->flags |= ATA_FLAG_NCQ_QUEUED;
+	if (qc->flags & ATA_QCFLAG_NCQ) {
+		assert(ap->active_tag == ATA_TAG_POISON);
+		ap->sactive |= 1 << ata_qc_to_tag(qc);
+	} else {
+		assert(!ap->sactive || qc->flags & ATA_QCFLAG_PREEMPT);
+		ap->active_tag = ata_qc_to_tag(qc);
+	}
 
-	ap->queue_depth++;
+	ap->flags |= ATA_FLAG_INFLIGHT;
 
 	rc = ap->ops->qc_issue(qc);
 	if (rc != ATA_QC_ISSUE_OK)
Index: work/drivers/scsi/libata-scsi.c
===================================================================
--- work.orig/drivers/scsi/libata-scsi.c	2005-06-27 00:20:29.000000000 +0900
+++ work/drivers/scsi/libata-scsi.c	2005-06-27 00:20:29.000000000 +0900
@@ -162,10 +162,10 @@ struct ata_queued_cmd *ata_scsi_qc_new(s
 	 */
 	if (ap->cmd_waiters)
 		return NULL;
-	if (ap->queue_depth) {
+	if (ap->flags & ATA_FLAG_INFLIGHT) {
 		if (!scsi_rw_ncq_request(dev, cmd))
 			return NULL;
-		if (!(ap->flags & ATA_FLAG_NCQ_QUEUED))
+		if (!ap->sactive)
 			return NULL;
 	}
 
Index: work/include/linux/libata.h
===================================================================
--- work.orig/include/linux/libata.h	2005-06-27 00:20:29.000000000 +0900
+++ work/include/linux/libata.h	2005-06-27 00:20:29.000000000 +0900
@@ -117,7 +117,7 @@ enum {
 	ATA_FLAG_SATA_RESET	= (1 << 7), /* use COMRESET */
 	ATA_FLAG_PIO_DMA	= (1 << 8), /* PIO cmds via DMA */
 	ATA_FLAG_NCQ		= (1 << 9), /* Can do NCQ */
-	ATA_FLAG_NCQ_QUEUED	= (1 << 10), /* NCQ commands are queued */
+	ATA_FLAG_INFLIGHT	= (1 << 10), /* Command(s) in flight */
 
 	ATA_QCFLAG_ACTIVE	= (1 << 1), /* cmd not yet ack'd to scsi lyer */
 	ATA_QCFLAG_SG		= (1 << 3), /* have s/g table? */
@@ -315,9 +315,8 @@ struct ata_port {
 	struct ata_device	device[ATA_MAX_DEVICES];
 
 	struct ata_queued_cmd	qcmd[ATA_MAX_CMDS];
-	unsigned long		qactive;
+	unsigned long		qactive, sactive;
 	unsigned int		active_tag;
-	unsigned int		queue_depth;
 
 	wait_queue_head_t	cmd_wait_queue;
 	unsigned int		cmd_waiters;
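
The bookkeeping this patch introduces can be modelled in userspace
roughly as follows.  This is a sketch under assumptions, not the libata
code: struct port_model and the port_*() functions are hypothetical,
TAG_POISON is a placeholder for ATA_TAG_POISON, and the `inflight`
field stands in for ATA_FLAG_INFLIGHT.

```c
#include <assert.h>

#define TAG_POISON 0xfafbfcfdu	/* placeholder "no active tag" marker */

struct port_model {
	unsigned long sactive;		/* bitmask of in-flight NCQ tags */
	unsigned int active_tag;	/* tag of the single non-NCQ command */
	int inflight;			/* models ATA_FLAG_INFLIGHT */
};

static void port_init(struct port_model *p)
{
	p->sactive = 0;
	p->active_tag = TAG_POISON;
	p->inflight = 0;
}

/* Issue: NCQ commands set a bit in sactive, others take active_tag. */
static void port_issue(struct port_model *p, unsigned tag, int is_ncq)
{
	if (is_ncq)
		p->sactive |= 1UL << tag;
	else
		p->active_tag = tag;
	p->inflight = 1;
}

/* Completion: clear the tag and drop `inflight` once nothing is left. */
static void port_complete(struct port_model *p, unsigned tag)
{
	if (p->active_tag == TAG_POISON) {
		p->sactive &= ~(1UL << tag);
		if (!p->sactive)
			p->inflight = 0;
	} else {
		p->active_tag = TAG_POISON;
		p->inflight = 0;
	}
}
```

The model shifts 1UL rather than a plain int so that tag 31 stays
within the width of the mask; that choice belongs to the sketch and is
not a statement about the patch.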



* Re: [PATCH Linux 2.6.12 04/09] NCQ: export scsi_retry_command
  2005-06-26 15:21 [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling Tejun Heo
                   ` (2 preceding siblings ...)
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 03/09] NCQ: add ap->sactive Tejun Heo
@ 2005-06-26 15:21 ` Tejun Heo
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 05/09] NCQ: implement NCQ helpers Tejun Heo
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2005-06-26 15:21 UTC
  To: jgarzik, axboe, linux-ide

04_NCQ_export-scsi_retry_command.patch

	Export scsi_retry_command

Signed-off-by: Tejun Heo <htejun@gmail.com>

 drivers/scsi/scsi.c      |    1 +
 include/scsi/scsi_cmnd.h |    1 +
 2 files changed, 2 insertions(+)

Index: work/drivers/scsi/scsi.c
===================================================================
--- work.orig/drivers/scsi/scsi.c	2005-06-26 23:55:32.000000000 +0900
+++ work/drivers/scsi/scsi.c	2005-06-27 00:20:30.000000000 +0900
@@ -849,6 +849,7 @@ int scsi_retry_command(struct scsi_cmnd 
 
 	return scsi_queue_insert(cmd, SCSI_MLQUEUE_EH_RETRY);
 }
+EXPORT_SYMBOL(scsi_retry_command);
 
 /*
  * Function:    scsi_finish_command
Index: work/include/scsi/scsi_cmnd.h
===================================================================
--- work.orig/include/scsi/scsi_cmnd.h	2005-06-26 23:55:32.000000000 +0900
+++ work/include/scsi/scsi_cmnd.h	2005-06-27 00:20:30.000000000 +0900
@@ -151,5 +151,6 @@ extern struct scsi_cmnd *scsi_get_comman
 extern void scsi_put_command(struct scsi_cmnd *);
 extern void scsi_io_completion(struct scsi_cmnd *, unsigned int, unsigned int);
 extern void scsi_finish_command(struct scsi_cmnd *cmd);
+extern int scsi_retry_command(struct scsi_cmnd *cmd);
 
 #endif /* _SCSI_SCSI_CMND_H */



* Re: [PATCH Linux 2.6.12 05/09] NCQ: implement NCQ helpers
  2005-06-26 15:21 [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling Tejun Heo
                   ` (3 preceding siblings ...)
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 04/09] NCQ: export scsi_retry_command Tejun Heo
@ 2005-06-26 15:21 ` Tejun Heo
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 06/09] NCQ: convert ahci to use new " Tejun Heo
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2005-06-26 15:21 UTC
  To: jgarzik, axboe, linux-ide

05_NCQ_add-ncq-helpers.patch

	This patch implements the following NCQ helpers, to be used by
	specific drivers to implement their interrupt and error handlers.

	ata_ncq_complete()	: normal completion of commands
	ata_ncq_abort()		: error completion of commands
	ata_ncq_recover()	: EH recovery helper
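
	As an illustration of the normal-completion helper, the mask
	arithmetic it relies on can be sketched in userspace C.  The
	function name ncq_done_mask() and its `illegal` out-parameter
	are hypothetical; the XOR-and-mask logic follows
	ata_ncq_complete() in the diff below.

```c
#include <assert.h>

/*
 * Given the driver's cached bitmask of in-flight tags and the value
 * just read from the SActive register, return the tags that have
 * finished.  *illegal receives bits that are set in the register but
 * were not in the cached mask, which the helper reports as an illegal
 * transition and ignores.
 */
static unsigned long ncq_done_mask(unsigned long cached, unsigned long hw,
				   unsigned long *illegal)
{
	unsigned long changed = cached ^ hw;

	*illegal = changed & hw;	/* newly set bits: never issued */
	return changed & ~hw;		/* newly cleared bits: completed */
}
```

	With a cached mask of 0x0f and a register value of 0x05, tags
	1 and 3 (mask 0x0a) are reported as done and nothing is
	flagged as illegal.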

Signed-off-by: Tejun Heo <htejun@gmail.com>

 drivers/scsi/libata-core.c |  186 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/scsi/libata-scsi.c |   96 ++++++++++++++++++++++-
 drivers/scsi/libata.h      |    4 
 include/linux/libata.h     |    8 +
 4 files changed, 289 insertions(+), 5 deletions(-)

Index: work/drivers/scsi/libata-core.c
===================================================================
--- work.orig/drivers/scsi/libata-core.c	2005-06-27 00:20:29.000000000 +0900
+++ work/drivers/scsi/libata-core.c	2005-06-27 00:20:30.000000000 +0900
@@ -3279,6 +3279,13 @@ static inline int ata_qc_issue_ok(struct
 {
 	if (qc->flags & ATA_QCFLAG_PREEMPT)
 		return 1;
+
+	/*
+	 * Recovery in progress.  Only PREEMPT commands allowed.
+	 */
+	if (ap->flags & (ATA_FLAG_NCQ_FAILED | ATA_FLAG_RECOVERY))
+		return 0;
+
 	/*
 	 * if people are already waiting for a queue drain, don't allow a
 	 * new 'lucky' queuer to get in there
@@ -3827,6 +3834,182 @@ irqreturn_t ata_interrupt (int irq, void
 	return IRQ_RETVAL(handled);
 }
 
+
+/*
+ * NCQ helpers
+ */
+
+/**
+ *	ata_ncq_complete - NCQ driver helper.  Complete requests normally.
+ *	@ap: port in question
+ *
+ *	Complete in-flight commands.  One device per port is assumed.
+ *	This function is meant to be called from a specific driver's
+ *	interrupt routine to complete requests normally.  On
+ *	invocation, if non-NCQ command was in-flight, it's completed
+ *	normally.  If NCQ commands were in-flight, Sactive register is
+ *	read and completed commands are processed.
+ *
+ *	LOCKING:
+ *	spin_lock_irqsave(host_set lock)
+ */
+int ata_ncq_complete(struct ata_port *ap)
+{
+	int nr_done = 0;
+	unsigned long done_mask = 0;
+	unsigned tag;
+
+	if (ap->sactive) {
+		unsigned long new_sactive = scr_read(ap, SCR_ACTIVE);
+		done_mask = new_sactive ^ ap->sactive;
+
+		if (unlikely(done_mask & new_sactive)) {
+			printk(KERN_ERR "ata%u: illegal sactive transition (%08lx->%08lx)\n",
+			       ap->id, ap->sactive, new_sactive);
+			done_mask &= ~new_sactive;
+		}
+	} else if (ap->active_tag != ATA_TAG_POISON)
+		done_mask = 1 << ap->active_tag;
+
+	ata_for_each_tag(tag, done_mask) {
+		struct ata_queued_cmd *qc = ata_qc_from_tag(ap, tag);
+		if (qc) {
+			ata_qc_complete(qc, 0);
+			nr_done++;
+		} else
+			printk(KERN_ERR "ata%u: missing tag %d\n", ap->id, tag);
+	}
+
+	return nr_done;
+}
+
+/**
+ *	ata_ncq_abort - NCQ driver helper.  Abort commands and invoke EH.
+ *	@ap: port in question.
+ *
+ *	Abort in-flight commands and invoke EH.  This function is
+ *	meant to be called from specific driver's interrupt routine to
+ *	indicate error.
+ *
+ *	LOCKING:
+ *	spin_lock_irqsave(host_set lock)
+ */
+
+int ata_ncq_abort(struct ata_port *ap)
+{
+	int nr_aborted = 0;
+	unsigned long sactive = 0;
+	unsigned tag;
+
+	printk(KERN_WARNING "ata%u: aborting commands due to error.  "
+	       "active_tag %d, sactive %08lx\n",
+	       ap->id,
+	       ap->active_tag != ATA_TAG_POISON ? ap->active_tag : -1,
+	       ap->sactive);
+
+	if (ap->sactive) {
+		/* Complete successful requests before aborting. */
+		ata_ncq_complete(ap);
+		if (!ap->sactive)
+			printk(KERN_WARNING "ata%u: device reports successful "
+			       "completion of all NCQ commands after error\n",
+			       ap->id);
+		ap->flags |= ATA_FLAG_NCQ_FAILED;
+		sactive = ap->sactive;
+	} else if (ap->active_tag != ATA_TAG_POISON)
+		sactive = 1 << ap->active_tag;
+
+	ata_for_each_tag(tag, sactive) {
+		struct ata_queued_cmd *qc = ata_qc_from_tag(ap, tag);
+		if (qc) {
+			ata_qc_complete_err(qc, 0xff, 0xff);
+			nr_aborted++;
+		} else
+			printk(KERN_ERR "ata%u: missing tag %d\n", ap->id, tag);
+	}
+
+	return nr_aborted;
+}
+
+static inline int ata_read_log_10h(struct ata_port *ap, unsigned *tagp,
+				   u8 *drv_statp, u8 *drv_errp)
+{
+	char *buffer;
+	int rc;
+
+	if (!(buffer = kmalloc(512, GFP_KERNEL))) {
+		printk(KERN_ERR "ata%u: unable to allocate memory for error\n",
+		       ap->id);
+		return -ENOMEM;
+	}
+
+	if ((rc = ata_read_log_page(ap, 0, READ_LOG_SATA_NCQ_PAGE,
+				    buffer, 1, ATA_READLOG_10H_TIMEOUT)) < 0) {
+		printk(KERN_ERR "ata%u: failed to read log page 10h (%d)\n",
+		       ap->id, rc);
+		goto out;
+	}
+
+	if (buffer[0] & 0x80) {
+		printk(KERN_INFO "ata%u: NQ bit set on log page 10h, timeout?\n",
+		       ap->id);
+		rc = -EIO;
+		goto out;
+	}
+
+	*tagp = buffer[0] & 0x1f;
+	*drv_statp = buffer[2] | ATA_ERR;
+	*drv_errp = buffer[3];
+	rc = 0;
+ out:
+	kfree(buffer);
+	return rc;
+}
+
+/**
+ *	ata_ncq_recover - NCQ driver helper.  Recover from error.
+ *	@ap: port in question
+ *	@did_reset: specific driver performed reset.  Log page 10h might
+ *		    be invalid.
+ *
+ *	This function is to be called from eng_timeout routine of
+ *	specific drivers.  Before calling this function, specific
+ *	drivers are required to
+ *
+ *	- Clear all in-flight requests.  Drivers only have to clear
+ *	  low-level state (like stopping DMA engine and clearing
+ *	  interrupts).  All generic command cancelling are dealt by
+ *	  libata-core layer.
+ *
+ *	- Make the controller ready for new commands.
+ *
+ *	LOCKING:
+ *	Inherited from SCSI layer (in EH context, can sleep)
+ */
+void ata_ncq_recover(struct ata_port *ap, int did_reset)
+{
+	unsigned ncq_abort_tag = ATA_TAG_POISON;
+	u8 stat = 0, err = 0;
+
+	printk(KERN_WARNING "ata%u: recovering from error\n", ap->id);
+
+	if ((ap->flags & ATA_FLAG_NCQ_FAILED) && !did_reset) {
+		if (ata_read_log_10h(ap, &ncq_abort_tag, &stat, &err) == 0)
+			printk(KERN_INFO "ata%u: log_ext_10h, tag=%d stat=%02x err=%02x\n",
+			       ap->id, ncq_abort_tag, stat, err);
+		else {
+			printk(KERN_WARNING "ata%u: resetting...\n", ap->id);
+			ap->ops->phy_reset(ap);
+		}
+	}
+
+	spin_lock_irq(&ap->host_set->lock);
+	ap->flags &= ~ATA_FLAG_NCQ_FAILED;
+	spin_unlock_irq(&ap->host_set->lock);
+
+	ata_scsi_error_abort_cmds(ap, ncq_abort_tag, stat, err);
+}
+
 /**
  *	atapi_packet_task - Write CDB bytes to hardware
  *	@_data: Port to which ATAPI device is attached.
@@ -4683,6 +4866,9 @@ EXPORT_SYMBOL_GPL(ata_scsi_block_request
 EXPORT_SYMBOL_GPL(ata_scsi_unblock_requests);
 EXPORT_SYMBOL_GPL(ata_scsi_requeue);
 EXPORT_SYMBOL_GPL(ata_read_log_page);
+EXPORT_SYMBOL_GPL(ata_ncq_complete);
+EXPORT_SYMBOL_GPL(ata_ncq_abort);
+EXPORT_SYMBOL_GPL(ata_ncq_recover);
 
 #ifdef CONFIG_PCI
 EXPORT_SYMBOL_GPL(pci_test_config_bits);
Index: work/drivers/scsi/libata-scsi.c
===================================================================
--- work.orig/drivers/scsi/libata-scsi.c	2005-06-27 00:20:29.000000000 +0900
+++ work/drivers/scsi/libata-scsi.c	2005-06-27 00:20:30.000000000 +0900
@@ -162,6 +162,8 @@ struct ata_queued_cmd *ata_scsi_qc_new(s
 	 */
 	if (ap->cmd_waiters)
 		return NULL;
+	if (ap->flags & (ATA_FLAG_NCQ_FAILED | ATA_FLAG_RECOVERY))
+		return NULL;
 	if (ap->flags & ATA_FLAG_INFLIGHT) {
 		if (!scsi_rw_ncq_request(dev, cmd))
 			return NULL;
@@ -478,19 +480,103 @@ int ata_scsi_error(struct Scsi_Host *hos
 
 	DPRINTK("ENTER\n");
 
+	spin_lock_irq(&ap->host_set->lock);
+	ap->flags |= ATA_FLAG_RECOVERY;
+	spin_unlock_irq(&ap->host_set->lock);
+
 	ap->ops->eng_timeout(ap);
 
-	/* TODO: this is per-command; when queueing is supported
-	 * this code will either change or move to a more
-	 * appropriate place
-	 */
-	host->host_failed--;
+	spin_lock_irq(&ap->host_set->lock);
+	host->host_failed = 0;
+	ap->flags &= ~ATA_FLAG_RECOVERY;
+	if (ap->cmd_waiters)
+		wake_up(&ap->cmd_wait_queue);
+	spin_unlock_irq(&ap->host_set->lock);
 
 	DPRINTK("EXIT\n");
 	return 0;
 }
 
 /**
+ *	ata_scsi_error_abort_cmds - Finish EH processing of SCSI commands
+ *	@ap: ATA port in recovery
+ *	@ncq_abort_tag: tag of the failed request in NCQ error
+ *	@ncq_abort_stat: Status register value for the failed request
+ *	@ncq_abort_err: Error register value for the failed request
+ *
+ *	Abort or retry SCSI commands after EH handling.  If
+ *	ncq_abort_tag is a valid tag value (not ATA_TAG_POISON), only
+ *	that command is failed and the others are retried.  Otherwise,
+ *	all inflight commands are failed.  This function can handle both
+ *	aborted (by interrupt handler) and timed-out commands.
+ *
+ *	LOCKING:
+ *	Inherited from SCSI layer (in EH context, can sleep)
+ */
+void ata_scsi_error_abort_cmds(struct ata_port *ap, unsigned ncq_abort_tag,
+			       u8 ncq_abort_stat, u8 ncq_abort_err)
+{
+	unsigned long active_mask;
+	unsigned tag;
+	struct ata_queued_cmd *qc;
+	struct Scsi_Host *shost = ap->host;
+	LIST_HEAD(cmds);
+	struct scsi_cmnd *ncq_abort_scmd = NULL, *scmd;
+	struct list_head *le, *tmp;
+
+	spin_lock_irq(&ap->host_set->lock);
+	list_splice_init(&shost->eh_cmd_q, &cmds);
+	spin_unlock_irq(&ap->host_set->lock);
+
+	if (ncq_abort_tag != ATA_TAG_POISON)
+		ncq_abort_scmd = ap->qcmd[ncq_abort_tag].scsicmd;
+
+	/*
+	 * Kill all active commands.
+	 */
+	if (ap->active_tag != ATA_TAG_POISON)
+		active_mask = 1 << ap->active_tag;
+	else
+		active_mask = ap->sactive;
+
+	ata_for_each_tag(tag, active_mask) {
+		u8 stat, err;
+		qc = &ap->qcmd[tag];
+
+		if (ncq_abort_scmd && qc->scsicmd != ncq_abort_scmd) {
+			qc->scsidone = (void *)scsi_retry_command;
+			stat = 0;
+			err = 0;
+		} else {
+			qc->scsidone = scsi_finish_command;
+			stat = ncq_abort_scmd ? ncq_abort_stat : ATA_ERR;
+			err = ncq_abort_scmd ? ncq_abort_err : 0x80;
+		}
+
+		if (qc->scsicmd)
+			list_del_init(&qc->scsicmd->eh_entry);
+		ata_qc_complete_err(qc, stat, err);
+	}
+
+	/*
+	 * Finish off commands qc-completed by interrupt handler.
+	 */
+	list_for_each_safe(le, tmp, &cmds) {
+		scmd = list_entry(le, struct scsi_cmnd, eh_entry);
+
+		if (ncq_abort_scmd && scmd != ncq_abort_scmd)
+			scsi_retry_command(scmd);
+		else {
+			u8 stat, err;
+			stat = ncq_abort_scmd ? ncq_abort_stat : ATA_ERR;
+			err = ncq_abort_scmd ? ncq_abort_err : 0x80;
+			ata_to_sense_error(scmd, ap->id, stat, err);
+			scsi_finish_command(scmd);
+		}
+	}
+}
+
+/**
  *	ata_scsi_flush_xlat - Translate SCSI SYNCHRONIZE CACHE command
  *	@qc: Storage for translated ATA taskfile
  *	@scsicmd: SCSI command to translate (ignored)
Index: work/drivers/scsi/libata.h
===================================================================
--- work.orig/drivers/scsi/libata.h	2005-06-27 00:20:28.000000000 +0900
+++ work/drivers/scsi/libata.h	2005-06-27 00:20:30.000000000 +0900
@@ -28,6 +28,10 @@
 #define DRV_NAME	"libata"
 #define DRV_VERSION	"1.11"	/* must be exactly four chars */
 
+#define ata_for_each_tag(tag, mask) \
+	for (tag = find_first_bit(&mask, ATA_MAX_CMDS); tag < ATA_MAX_CMDS; \
+	     tag = find_next_bit(&mask, ATA_MAX_CMDS, tag + 1))
+
 struct ata_scsi_args {
 	u16			*id;
 	struct scsi_cmnd	*cmd;
Index: work/include/linux/libata.h
===================================================================
--- work.orig/include/linux/libata.h	2005-06-27 00:20:29.000000000 +0900
+++ work/include/linux/libata.h	2005-06-27 00:20:30.000000000 +0900
@@ -87,6 +87,7 @@ enum {
 	ATA_MAX_BUS		= 2,
 	ATA_DEF_BUSY_WAIT	= 10000,
 	ATA_SHORT_PAUSE		= (HZ >> 6) + 1,
+	ATA_READLOG_10H_TIMEOUT	= 5000,	/* in milliseconds */
 
 	ATA_SHT_EMULATED	= 1,
 	ATA_SHT_CMD_PER_LUN	= 1,
@@ -118,6 +119,8 @@ enum {
 	ATA_FLAG_PIO_DMA	= (1 << 8), /* PIO cmds via DMA */
 	ATA_FLAG_NCQ		= (1 << 9), /* Can do NCQ */
 	ATA_FLAG_INFLIGHT	= (1 << 10), /* Command(s) in flight */
+	ATA_FLAG_NCQ_FAILED	= (1 << 11), /* NCQ command(s) failed */
+	ATA_FLAG_RECOVERY	= (1 << 12), /* Recovery in progress */
 
 	ATA_QCFLAG_ACTIVE	= (1 << 1), /* cmd not yet ack'd to scsi lyer */
 	ATA_QCFLAG_SG		= (1 << 3), /* have s/g table? */
@@ -405,6 +408,8 @@ extern int ata_scsi_detect(Scsi_Host_Tem
 extern int ata_scsi_ioctl(struct scsi_device *dev, int cmd, void __user *arg);
 extern int ata_scsi_queuecmd(struct scsi_cmnd *cmd, void (*done)(struct scsi_cmnd *));
 extern int ata_scsi_error(struct Scsi_Host *host);
+extern void ata_scsi_error_abort_cmds(struct ata_port *ap, unsigned ncq_abort_tag,
+				      u8 ncq_abort_stat, u8 ncq_abort_err);
 extern int ata_scsi_release(struct Scsi_Host *host);
 extern unsigned int ata_host_intr(struct ata_port *ap, struct ata_queued_cmd *qc);
 /*
@@ -453,6 +458,9 @@ extern void ata_scsi_unblock_requests(st
 extern void ata_scsi_requeue(struct ata_queued_cmd *);
 extern int ata_read_log_page(struct ata_port *, unsigned int, char, char *,
 			     unsigned int, unsigned int);
+extern int ata_ncq_complete(struct ata_port *);
+extern int ata_ncq_abort(struct ata_port *);
+extern void ata_ncq_recover(struct ata_port *, int);
 
 
 #ifdef CONFIG_PCI
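
For readers following along, the retry-or-fail policy described in the
ata_scsi_error_abort_cmds() comment above can be sketched in plain
userspace C.  This is only an illustration, not code from the patch:
TAG_POISON, enum disposition and classify_tags() are hypothetical names,
and the sketch compares tags directly where the patch compares scsi_cmnd
pointers.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CMDS   32
#define TAG_POISON 0xfafbfcfdU	/* hypothetical "no valid tag" marker */

enum disposition { DISP_NONE, DISP_RETRY, DISP_FAIL };

/*
 * Walk every tag and decide what happens to it.  Tags not set in
 * active_mask are left alone.  If abort_tag is valid, only that tag
 * is failed and the other active tags are retried; otherwise every
 * active tag is failed.
 */
static void classify_tags(uint32_t active_mask, unsigned abort_tag,
			  enum disposition out[MAX_CMDS])
{
	unsigned tag;

	assert(abort_tag == TAG_POISON || abort_tag < MAX_CMDS);

	for (tag = 0; tag < MAX_CMDS; tag++) {
		if (!(active_mask & (UINT32_C(1) << tag)))
			out[tag] = DISP_NONE;
		else if (abort_tag != TAG_POISON && tag != abort_tag)
			out[tag] = DISP_RETRY;
		else
			out[tag] = DISP_FAIL;
	}
}
```

With active_mask 0x0f and abort_tag 3, tags 0-2 are retried and tag 3 is
failed; with TAG_POISON, every active tag is failed.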



* Re: [PATCH Linux 2.6.12 06/09] NCQ: convert ahci to use new NCQ helpers
  2005-06-26 15:21 [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling Tejun Heo
                   ` (4 preceding siblings ...)
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 05/09] NCQ: implement NCQ helpers Tejun Heo
@ 2005-06-26 15:21 ` Tejun Heo
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 07/09] NCQ: stop dma before reset Tejun Heo
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2005-06-26 15:21 UTC (permalink / raw)
  To: jgarzik, axboe, linux-ide

06_NCQ_ahci-new-eh.patch

	This patch converts ahci to use new NCQ helpers.

Signed-off-by: Tejun Heo <htejun@gmail.com>

 ahci.c |  359 +++++++++++------------------------------------------------------
 1 files changed, 63 insertions(+), 296 deletions(-)

Index: work/drivers/scsi/ahci.c
===================================================================
--- work.orig/drivers/scsi/ahci.c	2005-06-27 00:20:29.000000000 +0900
+++ work/drivers/scsi/ahci.c	2005-06-27 00:20:31.000000000 +0900
@@ -171,7 +171,6 @@ struct ahci_port_priv {
 	dma_addr_t		cmd_tbl_dma;
 	void			*rx_fis;
 	dma_addr_t		rx_fis_dma;
-	u32			sactive;
 };
 
 static u32 ahci_scr_read (struct ata_port *ap, unsigned int sc_reg);
@@ -427,6 +426,47 @@ static void ahci_scr_write (struct ata_p
 	writel(val, (void *) ap->ioaddr.scr_addr + (sc_reg * 4));
 }
 
+static void ahci_stop_dma(struct ata_port *ap)
+{
+	void __iomem *port_mmio = (void __iomem *) ap->ioaddr.cmd_addr;
+	u32 tmp;
+	int work;
+
+	/* stop DMA */
+	tmp = readl(port_mmio + PORT_CMD);
+	tmp &= ~PORT_CMD_START;
+	writel(tmp, port_mmio + PORT_CMD);
+
+	/* wait for engine to stop. */
+	work = 500;
+	while (work-- > 0) {
+		tmp = readl(port_mmio + PORT_CMD);
+		if ((tmp & PORT_CMD_LIST_ON) == 0)
+			break;
+		msleep(1);
+	}
+}
+
+static void ahci_start_dma(struct ata_port *ap)
+{
+	void __iomem *port_mmio = (void __iomem *) ap->ioaddr.cmd_addr;
+	u32 tmp;
+
+	/* clear SATA phy error, if any */
+	tmp = readl(port_mmio + PORT_SCR_ERR);
+	writel(tmp, port_mmio + PORT_SCR_ERR);
+
+	/* clear status */
+	tmp = readl(port_mmio + PORT_IRQ_STAT);
+	writel(tmp, port_mmio + PORT_IRQ_STAT);
+
+	/* re-start DMA */
+	tmp = readl(port_mmio + PORT_CMD);
+	tmp |= PORT_CMD_START;
+	writel(tmp, port_mmio + PORT_CMD);
+	readl(port_mmio + PORT_CMD); /* flush */
+}
+
 static void ahci_phy_reset(struct ata_port *ap)
 {
 	void __iomem *port_mmio = (void __iomem *) ap->ioaddr.cmd_addr;
@@ -551,290 +591,36 @@ static void ahci_qc_prep(struct ata_queu
 	ahci_fill_sg(qc, offset);
 }
 
-/*
- * Return 1 if COMRESET was done
- */
-static int ahci_intr_error(struct ata_port *ap, u32 irq_stat)
+static void ahci_eng_timeout(struct ata_port *ap)
 {
-	void *mmio = ap->host_set->mmio_base;
-	void *port_mmio = ahci_port_base(mmio, ap->port_no);
+	void __iomem *port_mmio = (void __iomem *) ap->ioaddr.cmd_addr;
+	int reset = 0;
 	u32 tmp;
-	int work, reset = 0;
 
-	/* stop DMA */
-	tmp = readl(port_mmio + PORT_CMD);
-	tmp &= ~PORT_CMD_START;
-	writel(tmp, port_mmio + PORT_CMD);
-
-	/* wait for engine to stop.  TODO: this could be
-	 * as long as 500 msec
-	 */
-	work = 1000;
-	while (work-- > 0) {
-		tmp = readl(port_mmio + PORT_CMD);
-		if ((tmp & PORT_CMD_LIST_ON) == 0)
-			break;
-		udelay(10);
-	}
-
-	/* clear SATA phy error, if any */
-	tmp = readl(port_mmio + PORT_SCR_ERR);
-	writel(tmp, port_mmio + PORT_SCR_ERR);
-
-	/* clear status */
-	tmp = readl(port_mmio + PORT_IRQ_STAT);
-	writel(tmp, port_mmio + PORT_IRQ_STAT);
+	ahci_stop_dma(ap);
 
-	/* if DRQ/BSY is set, device needs to be reset.
-	 * if so, issue COMRESET
+	/*
+	 * if DRQ/BSY is set, device needs to be reset; otherwise,
+	 * restarting DMA engine should suffice.
 	 */
 	tmp = readl(port_mmio + PORT_TFDATA);
 	if (tmp & (ATA_BUSY | ATA_DRQ)) {
-		printk(KERN_WARNING "ata%u: stat=%x, issuing COMRESET\n", ap->id, tmp);
-		writel(0x301, port_mmio + PORT_SCR_CTL);
-		readl(port_mmio + PORT_SCR_CTL); /* flush */
-		udelay(10);
-		writel(0x300, port_mmio + PORT_SCR_CTL);
-		readl(port_mmio + PORT_SCR_CTL); /* flush */
+		printk(KERN_WARNING DRV_NAME
+		       " ata%u: stat=%x, issuing COMRESET\n", ap->id, tmp);
+		__sata_phy_reset(ap);
 		reset = 1;
 	}
 
-	/* re-start DMA */
-	tmp = readl(port_mmio + PORT_CMD);
-	tmp |= PORT_CMD_START;
-	writel(tmp, port_mmio + PORT_CMD);
-	readl(port_mmio + PORT_CMD); /* flush */
-
-	printk(KERN_WARNING "ata%u: error occurred, port reset\n", ap->id);
-	return reset;
-}
-
-static void ahci_complete_requests(struct ata_port *ap, u32 tag_mask, int err)
-{
-	while (tag_mask) {
-		struct ata_queued_cmd *qc;
-		int tag = ffs(tag_mask) - 1;
-
-		tag_mask &= ~(1 << tag);
-		qc = ata_qc_from_tag(ap, tag);
-		if (qc)
-			ata_qc_complete(qc, err);
-		else
-			printk(KERN_ERR "ahci: missing tag %d\n", tag);
-	}
-}
-
-static void dump_log_page(unsigned char *p)
-{
-	int i;
-
-	printk("LOG 0x10: nq=%d, tag=%d\n", p[0] >> 7, p[0] & 0x1f);
-
-	for (i = 2; i < 14; i++)
-		printk("%d:%d ", i, p[i]);
+	ahci_start_dma(ap);
 
-	printk("\n");
-}
-
-/*
- * TODO: needs to use READ_LOG_EXT/page=10h to retrieve error information
- */
-extern void ata_qc_free(struct ata_queued_cmd *qc);
-static void ahci_ncq_timeout(struct ata_port *ap)
-{
-	struct ahci_port_priv *pp = ap->private_data;
-	void *mmio = ap->host_set->mmio_base;
-	void *port_mmio = ahci_port_base(mmio, ap->port_no);
-	struct ata_queued_cmd *qc;
-	unsigned long flags;
-	char *buffer;
-	u32 sactive;
-	int reset;
-
-	printk(KERN_WARNING "ata%u: ncq interrupt error (Q=%08lx)\n", ap->id, ap->sactive);
-
-	spin_lock_irqsave(&ap->host_set->lock, flags);
-
-	sactive = readl(port_mmio + PORT_SCR_ACT);
-
-	printk(KERN_WARNING "ata%u: SActive 0x%x (0x%x)\n", ap->id, sactive, pp->sactive);
-	reset = ahci_intr_error(ap, readl(port_mmio + PORT_IRQ_STAT));
-
-	spin_unlock_irqrestore(&ap->host_set->lock, flags);
-
-	/*
-	 * if COMRESET was done, we don't have to issue a log page read
-	 */
-	if (reset)
-		goto done;
-
-	buffer = kmalloc(512, GFP_KERNEL);
-	if (!buffer) {
-		printk(KERN_ERR "ata%u: unable to allocate memory for error\n", ap->id);
-		goto done;
-	}
-
-	if (ata_read_log_page(ap, 0, READ_LOG_SATA_NCQ_PAGE, buffer, 1, 0)) {
-		printk(KERN_ERR "ata%u: unable to read log page\n", ap->id);
-		goto out;
-	}
-
-	dump_log_page(buffer);
-
-	/*
-	 * if NQ is cleared, bottom 5 bits contain the tag of the errored
-	 * command
-	 */
-	if ((buffer[0] & (1 << 7)) == 0) {
-		int tag = buffer[0] & 0x1f;
-
-		qc = ata_qc_from_tag(ap, tag);
-		if (qc)
-			ata_qc_complete(qc, ATA_ERR);
-	}
-
-	/*
-	 * requeue the remaining commands
-	 */
-	while (pp->sactive) {
-		int tag = ffs(pp->sactive) - 1;
-
-		pp->sactive &= ~(1 << tag);
-		qc = ata_qc_from_tag(ap, tag);
-		if (qc) {
-			if (qc->scsicmd)
-				ata_qc_free(qc);
-			else
-				ata_qc_complete(qc, ATA_ERR);
-		} else
-			printk(KERN_ERR "ata%u: missing tag %d\n", ap->id, tag);
-	}
-
-out:
-	kfree(buffer);
-done:
-	ata_scsi_unblock_requests(ap);
-}
-
-static void ahci_nonncq_timeout(struct ata_port *ap)
-{
-	void *mmio = ap->host_set->mmio_base;
-	void *port_mmio = ahci_port_base(mmio, ap->port_no);
-	struct ata_queued_cmd *qc;
-
-	DPRINTK("ENTER\n");
-
-	ahci_intr_error(ap, readl(port_mmio + PORT_IRQ_STAT));
-
-	qc = ata_qc_from_tag(ap, ap->active_tag);
-	if (!qc) {
-		printk(KERN_ERR "ata%u: BUG: timeout without command\n",
-		       ap->id);
-	} else {
-		/* hack alert!  We cannot use the supplied completion
-	 	 * function from inside the ->eh_strategy_handler() thread.
-	 	 * libata is the only user of ->eh_strategy_handler() in
-	 	 * any kernel, so the default scsi_done() assumes it is
-	 	 * not being called from the SCSI EH.
-	 	 */
-		qc->scsidone = scsi_finish_command;
-		ata_qc_complete(qc, ATA_ERR);
-	}
-}
-
-static void ahci_eng_timeout(struct ata_port *ap)
-{
-	struct ahci_port_priv *pp = ap->private_data;
-
-	if (pp->sactive)
-		ahci_ncq_timeout(ap);
-	else
-		ahci_nonncq_timeout(ap);
-}
-
-static int ahci_ncq_intr(struct ata_port *ap, u32 status)
-{
-	struct ahci_port_priv *pp = ap->private_data;
-	void *mmio = ap->host_set->mmio_base;
-	void *port_mmio = ahci_port_base(mmio, ap->port_no);
-
-	if (!pp->sactive)
-		return 0;
-
-	if (status & PORT_IRQ_SDB_FIS) {
-		u8 *sdb = pp->rx_fis + RX_FIS_SDB_REG;
-		u32 sactive, mask;
-
-		if (unlikely(sdb[2] & ATA_ERR)) {
-			printk("SDB fis, stat %x, err %x\n", sdb[2], sdb[3]);
-			return 1;
-		}
-
-		/*
-		 * SActive will have the bits cleared for completed commands
-		 */
-		sactive = readl(port_mmio + PORT_SCR_ACT);
-		mask = pp->sactive & ~sactive;
-		if (mask) {
-			ahci_complete_requests(ap, mask, 0);
-			pp->sactive = sactive;
-			return 1;
-		} else
-			printk(KERN_INFO "ata%u: SDB with no bits cleared\n", ap->id);
-	} else if (status & PORT_IRQ_D2H_REG_FIS) {
-		u8 *d2h = pp->rx_fis + RX_FIS_D2H_REG;
-
-		/*
-		 * pre-BSY clear error, let timeout error handling take care
-		 * of it when it kicks in
-		 */
-		if (d2h[2] & ATA_ERR) {
-			VPRINTK("D2H fis, err %x\n", d2h[2]);
-			return 1;
-		}
-
-		printk("D2H fis\n");
-	} else
-		printk(KERN_WARNING "ata%u: unhandled FIS, stat %x\n", ap->id, status);
-
-	return 0;
-}
-
-static void ahci_ncq_intr_error(struct ata_port *ap, u32 status)
-{
-	struct ahci_port_priv *pp = ap->private_data;
-	struct ata_queued_cmd *qc;
-	struct ata_taskfile tf;
-	int tag;
-
-	printk(KERN_ERR "ata%u: NCQ err status 0x%x\n", ap->id, status);
-
-	if (status & PORT_IRQ_D2H_REG_FIS) {
-		ahci_tf_read(ap, &tf);
-		tag = tf.nsect >> 3;
-
-		qc = ata_qc_from_tag(ap, tag);
-		if (qc) {
-			printk(KERN_ERR "ata%u: ending bad tag %d\n", ap->id, tag);
-			pp->sactive &= ~(1 << tag);
-			ata_qc_complete(qc, ATA_ERR);
-		} else
-			printk(KERN_ERR "ata%u: error on tag %d, but not present\n", ap->id, tag);
-	}
-
-	/*
-	 * let command timeout deal with error handling
-	 */
-	ata_scsi_block_requests(ap);
+	/* let libata layer recover */
+	ata_ncq_recover(ap, reset);
 }
 
 static inline int ahci_host_intr(struct ata_port *ap)
 {
-	struct ahci_port_priv *pp = ap->private_data;
-	void *mmio = ap->host_set->mmio_base;
-	void *port_mmio = ahci_port_base(mmio, ap->port_no);
-	struct ata_queued_cmd *qc;
-	u32 status, serr, ci;
+	void __iomem *port_mmio = (void __iomem *) ap->ioaddr.cmd_addr;
+	u32 status, serr;
 
 	serr = readl(port_mmio + PORT_SCR_ERR);
 	writel(serr, port_mmio + PORT_SCR_ERR);
@@ -842,29 +628,13 @@ static inline int ahci_host_intr(struct 
 	status = readl(port_mmio + PORT_IRQ_STAT);
 	writel(status, port_mmio + PORT_IRQ_STAT);
 
-	if (status & PORT_IRQ_FATAL) {
-		printk("ata%u: irq error %x %x, tag %d\n", ap->id, serr, status, ap->active_tag);
-		if (pp->sactive)
-			ahci_ncq_intr_error(ap, status);
-		else
-			ahci_intr_error(ap, status);
-
-		return 1;
-	}
-
-	if (ahci_ncq_intr(ap, status))
-		return 1;
-
-	ci = readl(port_mmio + PORT_CMD_ISSUE);
-
-	if ((ci & (1 << ap->active_tag)) == 0) {
-		VPRINTK("NON-NCQ interrupt\n");
-
-		qc = ata_qc_from_tag(ap, ap->active_tag);
-		if (qc && (qc->flags & ATA_QCFLAG_ACTIVE) &&
-		    !(qc->flags & ATA_QCFLAG_NCQ))
-			ata_qc_complete(qc, 0);
-	}
+	if (!(status & PORT_IRQ_FATAL)) {
+		void *cmd_issue = port_mmio + PORT_CMD_ISSUE;
+		if (ap->sactive ||
+		    (readl(cmd_issue) & (1 << ap->active_tag)) == 0)
+			ata_ncq_complete(ap);
+	} else
+		ata_ncq_abort(ap);
 
 	return 1;
 }
@@ -921,13 +691,10 @@ static irqreturn_t ahci_interrupt (int i
 static int ahci_qc_issue(struct ata_queued_cmd *qc)
 {
 	struct ata_port *ap = qc->ap;
-	struct ahci_port_priv *pp = ap->private_data;
 	void *port_mmio = (void *) ap->ioaddr.cmd_addr;
 	unsigned int tag = ata_qc_to_tag(qc);
 
 	if (qc->flags & ATA_QCFLAG_NCQ) {
-		pp->sactive |= (1 << tag);
-
 		writel(1 << tag, port_mmio + PORT_SCR_ACT);
 		readl(port_mmio + PORT_SCR_ACT);
 	}
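
As a rough illustration of the decision the converted ahci_host_intr()
makes, here is a standalone sketch operating on register snapshots
instead of MMIO.  The names intr_action and pick_intr_action() are
hypothetical, fatal_mask stands in for PORT_IRQ_FATAL, and the
out-of-range active_tag check is an added assumption that is not in the
patch.

```c
#include <assert.h>
#include <stdint.h>

enum intr_action { INTR_NONE, INTR_COMPLETE, INTR_ABORT };

/*
 * irq_stat:   snapshot of the port interrupt status
 * fatal_mask: bits treated as fatal (stand-in for PORT_IRQ_FATAL)
 * sactive:    record of outstanding NCQ tags
 * cmd_issue:  snapshot of the command-issue register
 * active_tag: tag of the single non-NCQ command; only consulted when
 *             sactive is zero
 */
static enum intr_action pick_intr_action(uint32_t irq_stat, uint32_t fatal_mask,
					 uint32_t sactive, uint32_t cmd_issue,
					 unsigned active_tag)
{
	/* a zero mask would silently disable the abort path */
	assert(fatal_mask != 0);

	if (irq_stat & fatal_mask)
		return INTR_ABORT;	/* corresponds to ata_ncq_abort() */
	if (sactive)
		return INTR_COMPLETE;	/* corresponds to ata_ncq_complete() */
	if (active_tag >= 32)
		return INTR_NONE;	/* assumption: no command in flight */
	if (!(cmd_issue & (UINT32_C(1) << active_tag)))
		return INTR_COMPLETE;	/* issue bit cleared, command done */
	return INTR_NONE;
}
```

The point of the sketch is that the driver only classifies the
interrupt; working out which tags actually finished is left to the
libata helpers.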



* Re: [PATCH Linux 2.6.12 07/09] NCQ: stop dma before reset
  2005-06-26 15:21 [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling Tejun Heo
                   ` (5 preceding siblings ...)
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 06/09] NCQ: convert ahci to use new " Tejun Heo
@ 2005-06-26 15:21 ` Tejun Heo
  2005-07-26 21:12   ` Jeff Garzik
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 08/09] NCQ: remove/unexport unused/unnecessary functions Tejun Heo
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2005-06-26 15:21 UTC (permalink / raw)
  To: jgarzik, axboe, linux-ide

07_NCQ_ahci-stop-dma-before-reset.patch

	AHCI 1.1 mandates stopping DMA before issuing COMRESET.  The
	original code didn't, which resulted in occasional lockups of
	the controller during EH recovery.  This patch fixes the
	problem.

Signed-off-by: Tejun Heo <htejun@gmail.com>

 ahci.c |    2 ++
 1 files changed, 2 insertions(+)

Index: work/drivers/scsi/ahci.c
===================================================================
--- work.orig/drivers/scsi/ahci.c	2005-06-27 00:20:31.000000000 +0900
+++ work/drivers/scsi/ahci.c	2005-06-27 00:20:31.000000000 +0900
@@ -474,7 +474,9 @@ static void ahci_phy_reset(struct ata_po
 	struct ata_device *dev = &ap->device[0];
 	u32 tmp;
 
+	ahci_stop_dma(ap);
 	__sata_phy_reset(ap);
+	ahci_start_dma(ap);
 
 	if (ap->flags & ATA_FLAG_PORT_DISABLED)
 		return;
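
The ordering this patch enforces (stop the DMA engine, wait for it to
go idle, reset the link, restart the engine) can be tried out against a
simulated port.  This is a minimal sketch under stated assumptions:
struct mock_port and all functions below are hypothetical, CMD_START and
CMD_LIST_ON merely stand in for PORT_CMD_START and PORT_CMD_LIST_ON, and
the poll loop does not sleep as the driver does.

```c
#include <assert.h>
#include <stdint.h>

#define CMD_START	(1u << 0)	/* stand-in for PORT_CMD_START */
#define CMD_LIST_ON	(1u << 15)	/* stand-in for PORT_CMD_LIST_ON */

struct mock_port {
	uint32_t cmd;		/* simulated command register */
	int stop_latency;	/* reads needed before the engine goes idle */
	int bad_resets;		/* resets issued while the engine was running */
};

/* Reading the register advances the simulated engine toward idle. */
static uint32_t mock_read_cmd(struct mock_port *p)
{
	if (!(p->cmd & CMD_START) && (p->cmd & CMD_LIST_ON)) {
		if (p->stop_latency > 0)
			p->stop_latency--;
		else
			p->cmd &= ~CMD_LIST_ON;
	}
	return p->cmd;
}

/* Clear START, then poll until LIST_ON drops.  0 on success, -1 on timeout. */
static int stop_engine(struct mock_port *p, int max_polls)
{
	assert(max_polls > 0);

	p->cmd &= ~CMD_START;
	while (max_polls-- > 0) {
		if (!(mock_read_cmd(p) & CMD_LIST_ON))
			return 0;
	}
	return -1;
}

static void start_engine(struct mock_port *p)
{
	p->cmd |= CMD_START | CMD_LIST_ON;
}

/* Records a violation if the link is reset while the engine still runs. */
static void reset_link(struct mock_port *p)
{
	if (p->cmd & CMD_LIST_ON)
		p->bad_resets++;
}

static int reset_port(struct mock_port *p)
{
	int rc = stop_engine(p, 500);

	reset_link(p);
	start_engine(p);
	return rc;
}
```

Unlike ahci_stop_dma() in the patch, which returns void, the sketch
reports a stop timeout to the caller; either way the reset still goes
ahead.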



* Re: [PATCH Linux 2.6.12 08/09] NCQ: remove/unexport unused/unnecessary functions
  2005-06-26 15:21 [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling Tejun Heo
                   ` (6 preceding siblings ...)
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 07/09] NCQ: stop dma before reset Tejun Heo
@ 2005-06-26 15:21 ` Tejun Heo
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 09/09] NCQ: causes error or timeout Tejun Heo
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2005-06-26 15:21 UTC (permalink / raw)
  To: jgarzik, axboe, linux-ide

08_NCQ_remove-or-unexport-unused-functions.patch

	This patch removes ata_scsi_block_requests() and
	ata_scsi_unblock_requests(), and makes ata_read_log_page() and
	ata_to_sense_error() static.

Signed-off-by: Tejun Heo <htejun@gmail.com>

 drivers/scsi/libata-core.c |    8 +++-----
 drivers/scsi/libata-scsi.c |   19 ++-----------------
 drivers/scsi/libata.h      |    1 -
 include/linux/libata.h     |    4 ----
 4 files changed, 5 insertions(+), 27 deletions(-)

Index: work/drivers/scsi/libata-core.c
===================================================================
--- work.orig/drivers/scsi/libata-core.c	2005-06-27 00:20:30.000000000 +0900
+++ work/drivers/scsi/libata-core.c	2005-06-27 00:20:32.000000000 +0900
@@ -1334,8 +1334,9 @@ static void ata_read_log_page_timeout(un
  *	Grabs host_set lock.
  */
 
-int ata_read_log_page(struct ata_port *ap, unsigned int device, char page,
-		      char *buffer, unsigned int sectors, unsigned int timeout)
+static int ata_read_log_page(struct ata_port *ap, unsigned int device,
+			     char page, char *buffer, unsigned int sectors,
+			     unsigned int timeout)
 {
 	struct ata_device *dev = &ap->device[device];
 	DECLARE_COMPLETION(wait);
@@ -4862,10 +4863,7 @@ EXPORT_SYMBOL_GPL(ata_dev_classify);
 EXPORT_SYMBOL_GPL(ata_dev_id_string);
 EXPORT_SYMBOL_GPL(ata_scsi_simulate);
 EXPORT_SYMBOL_GPL(ata_scsi_change_queue_depth);
-EXPORT_SYMBOL_GPL(ata_scsi_block_requests);
-EXPORT_SYMBOL_GPL(ata_scsi_unblock_requests);
 EXPORT_SYMBOL_GPL(ata_scsi_requeue);
-EXPORT_SYMBOL_GPL(ata_read_log_page);
 EXPORT_SYMBOL_GPL(ata_ncq_complete);
 EXPORT_SYMBOL_GPL(ata_ncq_abort);
 EXPORT_SYMBOL_GPL(ata_ncq_recover);
Index: work/drivers/scsi/libata-scsi.c
===================================================================
--- work.orig/drivers/scsi/libata-scsi.c	2005-06-27 00:20:30.000000000 +0900
+++ work/drivers/scsi/libata-scsi.c	2005-06-27 00:20:32.000000000 +0900
@@ -204,7 +204,8 @@ struct ata_queued_cmd *ata_scsi_qc_new(s
  *	spin_lock_irqsave(host_set lock)
  */
 
-void ata_to_sense_error(struct scsi_cmnd *cmd, int apid, u8 drv_stat, u8 drv_err)
+static void ata_to_sense_error(struct scsi_cmnd *cmd, int apid,
+			       u8 drv_stat, u8 drv_err)
 {
 	unsigned char *sb = cmd->sense_buffer;
 	/* Based on the 3ware driver translation table */
@@ -445,22 +446,6 @@ void ata_scsi_requeue(struct ata_queued_
 		ata_qc_complete(qc, ATA_ERR);
 }
 
-void ata_scsi_block_requests(struct ata_port *ap)
-{
-	struct Scsi_Host *host = ap->host;
-
-	scsi_block_requests(host);
-}
-
-void ata_scsi_unblock_requests(struct ata_port *ap)
-{
-	struct Scsi_Host *host = ap->host;
-
-	scsi_unblock_requests(host);
-}
-
-
-
 /**
  *	ata_scsi_error - SCSI layer error handler callback
  *	@host: SCSI host on which error occurred
Index: work/include/linux/libata.h
===================================================================
--- work.orig/include/linux/libata.h	2005-06-27 00:20:30.000000000 +0900
+++ work/include/linux/libata.h	2005-06-27 00:20:32.000000000 +0900
@@ -453,11 +453,7 @@ extern int ata_std_bios_param(struct scs
 			      sector_t capacity, int geom[]);
 extern int ata_scsi_slave_config(struct scsi_device *sdev);
 extern int ata_scsi_change_queue_depth(struct scsi_device *, int);
-extern void ata_scsi_block_requests(struct ata_port *);
-extern void ata_scsi_unblock_requests(struct ata_port *);
 extern void ata_scsi_requeue(struct ata_queued_cmd *);
-extern int ata_read_log_page(struct ata_port *, unsigned int, char, char *,
-			     unsigned int, unsigned int);
 extern int ata_ncq_complete(struct ata_port *);
 extern int ata_ncq_abort(struct ata_port *);
 extern void ata_ncq_recover(struct ata_port *, int);
Index: work/drivers/scsi/libata.h
===================================================================
--- work.orig/drivers/scsi/libata.h	2005-06-27 00:20:30.000000000 +0900
+++ work/drivers/scsi/libata.h	2005-06-27 00:20:32.000000000 +0900
@@ -51,7 +51,6 @@ extern void swap_buf_le16(u16 *buf, unsi
 
 
 /* libata-scsi.c */
-extern void ata_to_sense_error(struct ata_queued_cmd *qc, u8 drv_stat);
 extern int ata_scsi_error(struct Scsi_Host *host);
 extern unsigned int ata_scsiop_inq_std(struct ata_scsi_args *args, u8 *rbuf,
 			       unsigned int buflen);



* Re: [PATCH Linux 2.6.12 09/09] NCQ: causes error or timeout
  2005-06-26 15:21 [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling Tejun Heo
                   ` (7 preceding siblings ...)
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 08/09] NCQ: remove/unexport unused/unnecessary functions Tejun Heo
@ 2005-06-26 15:21 ` Tejun Heo
  2005-06-26 15:34 ` test logs Tejun Heo
  2005-06-27 14:33 ` [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling Jens Axboe
  10 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2005-06-26 15:21 UTC (permalink / raw)
  To: jgarzik, axboe, linux-ide

09_NCQ_ahci-debug.patch

	This is what I've used for testing EH.  This patch contains
	code for corrupting or skipping specific tags to cause error
	conditions.  If you're curious....

Signed-off-by: Tejun Heo <htejun@gmail.com>

 ahci.c |   34 ++++++++++++++++++++++++++++++++++
 1 files changed, 34 insertions(+)

Index: work/drivers/scsi/ahci.c
===================================================================
--- work.orig/drivers/scsi/ahci.c	2005-06-27 00:20:31.000000000 +0900
+++ work/drivers/scsi/ahci.c	2005-06-27 00:20:32.000000000 +0900
@@ -696,11 +696,45 @@ static int ahci_qc_issue(struct ata_queu
 	void *port_mmio = (void *) ap->ioaddr.cmd_addr;
 	unsigned int tag = ata_qc_to_tag(qc);
 
+#if 0
+	if (tag == 3) {
+		struct ahci_port_priv *pp = qc->ap->private_data;
+		int offset = tag * AHCI_CMD_TOTAL;
+		u8 *fis = pp->cmd_tbl + offset;
+		printk("AHCI: FAILING %02u: %08x %08x %08x %08x %08x\n",
+		       tag, ((unsigned *)fis)[0],
+		       ((unsigned *)fis)[1], ((unsigned *)fis)[2],
+		       ((unsigned *)fis)[3], ((unsigned *)fis)[4]);
+		fis[4] = 0xff;
+		fis[5] = 0xff;
+		fis[6] = 0xff;
+		fis[8] = 0xff;
+		fis[9] = 0xff;
+		fis[10] = 0xff;
+	}
+#endif
+
+#if 1
+	{
+		struct ahci_port_priv *pp = qc->ap->private_data;
+		unsigned *fis = pp->cmd_tbl + (tag * AHCI_CMD_TOTAL);
+		printk("AHCI: FIS(tag%02u): %08x %08x %08x %08x %08x\n",
+		   tag, fis[0], fis[1], fis[2], fis[3], fis[4]);
+	}
+#endif
+
 	if (qc->flags & ATA_QCFLAG_NCQ) {
 		writel(1 << tag, port_mmio + PORT_SCR_ACT);
 		readl(port_mmio + PORT_SCR_ACT);
 	}
 
+#if 1
+	if (tag == 3) {
+		printk("AHCI: SKIPPING %02u\n", tag);
+		return 0;
+	}
+#endif
+
 	writel(1 << tag, port_mmio + PORT_CMD_ISSUE);
 	readl(port_mmio + PORT_CMD_ISSUE);	/* flush */
 



* Re: test logs
  2005-06-26 15:21 [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling Tejun Heo
                   ` (8 preceding siblings ...)
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 09/09] NCQ: causes error or timeout Tejun Heo
@ 2005-06-26 15:34 ` Tejun Heo
  2005-06-27 14:33 ` [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling Jens Axboe
  10 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2005-06-26 15:34 UTC (permalink / raw)
  To: jgarzik, axboe, linux-ide

 Hello, Jeff.
 Hello, Jens.

 These are some of my test logs.  Oh, BTW, I didn't set up the lksp
config file correctly, so the subject lines say the patches are against
Linux-2.6.12.  They are all against NCQ head of libata-dev-2.6.  Sorry
about the confusion.

##
## Normal recovery from error.  Tag 03 is corrupted before issuing,
## and the drive fails the command.  On recovery, log page 10h reports
## the failing command and that command is failed w/ error.  All other
## in-flight requests are retried.
##
AHCI: FIS(tag02): 08608027 40dd5196 0000000a 08000010 00000000
AHCI: FAILING 03: 08608027 4096c97d 0000000c 08000018 00000000
AHCI: FIS(tag03): 08608027 40ffffff 00ffffff 08000018 00000000
AHCI: FIS(tag00): 08608027 4045382f 00000010 08000000 00000000
ata1: aborting commands due to error.  active_tag -1, sactive 0000000f
ata1: recovering from error
AHCI: FIS(tag31): 002f8027 a0000010 00000000 08000001 00000000
ata1: log_ext_10h, tag=3 stat=51 err=10
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x10 { SectorIdNotFound }
SCSI error : <1 0 0 0> return code = 0x8000002
sdb: Current: sense key: Aborted Command
    Additional sense: Recorded entity not found
end_request: I/O error, dev sdb, sector 211208573
AHCI: FIS(tag00): 08608027 40dd5196 0000000a 08000000 00000000
AHCI: FIS(tag01): 08608027 40bbe638 0000000d 08000008 00000000
AHCI: FIS(tag00): 08608027 4045382f 00000010 08000000 00000000
AHCI: FIS(tag02): 10608027 40780b0c 00000005 08000010 00000000
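
The "log_ext_10h, tag=3 stat=51 err=10" line in the log above comes from
decoding the first bytes of log page 10h.  A minimal userspace sketch of
that decoding, following the byte offsets used in the patch (byte 0
holds the NQ bit and tag, byte 2 the status, byte 3 the error), might
look like this.  struct ncq_err_info and decode_log_10h() are
hypothetical names, and ERR_BIT stands in for ATA_ERR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ERR_BIT	0x01	/* ERR bit of the ATA status register */

struct ncq_err_info {
	unsigned tag;	/* tag of the failed queued command */
	uint8_t stat;	/* status register value, ERR forced on */
	uint8_t err;	/* error register value */
};

/*
 * Returns 0 and fills *info on success, or -1 if the NQ bit (bit 7 of
 * byte 0) is set, meaning the page does not describe a queued command.
 */
static int decode_log_10h(const uint8_t *buf, struct ncq_err_info *info)
{
	assert(buf != NULL && info != NULL);

	if (buf[0] & 0x80)
		return -1;

	info->tag = buf[0] & 0x1f;
	info->stat = buf[2] | ERR_BIT;
	info->err = buf[3];
	return 0;
}
```

A page starting with 0x03, with 0x51 in byte 2 and 0x10 in byte 3,
decodes to the tag, status and error values printed in the log.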

##
## Similar, but this time we happen to send more commands to the drive
## following the corrupted one before the drive reports error.  On
## Samsung HD160JJ, this causes drive lockup, so the following
## read_log_page fails with a timeout, which triggers COMRESET and
## fails all in-flight commands.  The drive becomes fully operational
## after the COMRESET.
##
AHCI: FIS(tag01): 10608027 40348e58 00000001 08000008 00000000
AHCI: FIS(tag02): 10608027 402bfe4b 00000011 08000010 00000000
AHCI: FAILING 03: 10608027 4096c975 0000000c 08000018 00000000
AHCI: FIS(tag03): 10608027 40ffffff 00ffffff 08000018 00000000
AHCI: FIS(tag00): 10608027 4046335b 0000000d 08000000 00000000
AHCI: FIS(tag01): 10608027 40453827 00000010 08000008 00000000
AHCI: FIS(tag04): 10608027 40dd518e 0000000a 08000020 00000000
AHCI: FIS(tag05): 10608027 40bbe630 0000000d 08000028 00000000
AHCI: FIS(tag06): 10608027 4006ff90 00000004 08000030 00000000
ata1: aborting commands due to error.  active_tag -1, sactive 0000007f
ata1: aborting commands due to error.  active_tag -1, sactive 00000000
ata1: recovering from error
AHCI: FIS(tag31): 002f8027 a0000010 00000000 08000001 00000000
ata1: failed to read log page 10h (-110)
ata1: resetting...
ata1: status=0x01 { Error }
ata1: error=0x80 { Sector }
SCSI error : <1 0 0 0> return code = 0x8000002
sdb: Current: sense key: Medium Error
    Additional sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sdb, sector 222704475
ata1: status=0x01 { Error }
ata1: error=0x80 { Sector }
SCSI error : <1 0 0 0> return code = 0x8000002
sdb: Current: sense key: Medium Error
    Additional sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sdb, sector 272971815

##
## This is a timeout example.  Tag 03 is skipped at issue time, so it
## never completes and EH is entered on timeout.
##
AHCI: FIS(tag00): 10608027 40cc69a6 0000000d 08000000 00000000
AHCI: FIS(tag03): 10608027 400d2804 00000008 08000018 00000000
AHCI: SKIPPING 03
AHCI: FIS(tag04): 10608027 409bdbaa 00000000 08000020 00000000
AHCI: FIS(tag05): 10608027 405776d8 00000004 08000028 00000000
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
AHCI: FIS(tag07): 10608027 402f3c4e 00000010 08000038 00000000
AHCI: FIS(tag12): 10608027 402b78dc 0000000f 08000060 00000000
ata1: recovering from error
ata1: status=0x01 { Error }
ata1: error=0x80 { Sector }
SCSI error : <1 0 0 0> return code = 0x8000002
sdb: Current: sense key: Medium Error
    Additional sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sdb, sector 135079940
AHCI: FIS(tag00): 08608027 400d280c 00000008 08000000 00000000
AHCI: FIS(tag01): 10608027 40a514dd 00000001 08000008 00000000
AHCI: FIS(tag02): 10608027 4069f4e2 00000002 08000010 00000000


 Thanks.

-- 
tejun


* Re: [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling
  2005-06-26 15:21 [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling Tejun Heo
                   ` (9 preceding siblings ...)
  2005-06-26 15:34 ` test logs Tejun Heo
@ 2005-06-27 14:33 ` Jens Axboe
  2005-06-30  7:36   ` Jens Axboe
  10 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2005-06-27 14:33 UTC (permalink / raw)
  To: Tejun Heo; +Cc: jgarzik, linux-ide

On Mon, Jun 27 2005, Tejun Heo wrote:
>  Hello, Jeff.
>  Hello, Jens.
> 
>  This patchset implements generic completion and error-handling for
> NCQ commands.  This patchset assumes that the previous six misc
> patches to NCQ are applied.

Excellent, much needed work in that area. I will give it a test spin
here as well, I have one drive that likes to barf with ncq occasionally.

-- 
Jens Axboe



* Re: [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling
  2005-06-27 14:33 ` [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling Jens Axboe
@ 2005-06-30  7:36   ` Jens Axboe
  2005-06-30 10:51     ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2005-06-30  7:36 UTC (permalink / raw)
  To: Tejun Heo; +Cc: jgarzik, linux-ide

On Mon, Jun 27 2005, Jens Axboe wrote:
> On Mon, Jun 27 2005, Tejun Heo wrote:
> >  Hello, Jeff.
> >  Hello, Jens.
> > 
> >  This patchset implements generic completion and error-handling for
> > NCQ commands.  This patchset assumes that the previous six misc
> > patches to NCQ are applied.
> 
> Excellent, much needed work in that area. I will give it a test spin
> here as well, I have one drive that likes to barf with ncq occasionally.

Ok, I've run with this for a few days and finally hit the
drive-stops-responding condition yesterday afternoon. Error recovery
worked a lot better than before, but eventually went down anyways. But
now I got a better look at the error, and it's the drive throwing an
ICRC (error 0x80). Very odd. I've never seen this happen with non-NCQ
operations, however I've seen it now a few times using NCQ. Any ideas?

-- 
Jens Axboe


* Re: [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling
  2005-06-30  7:36   ` Jens Axboe
@ 2005-06-30 10:51     ` Tejun Heo
  2005-06-30 15:26       ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2005-06-30 10:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: jgarzik, linux-ide

Jens Axboe wrote:
> On Mon, Jun 27 2005, Jens Axboe wrote:
> 
>>On Mon, Jun 27 2005, Tejun Heo wrote:
>>
>>> Hello, Jeff.
>>> Hello, Jens.
>>>
>>> This patchset implements generic completion and error-handling for
>>>NCQ commands.  This patchset assumes that the previous six misc
>>>patches to NCQ are applied.
>>
>>Excellent, much needed work in that area. I will give it a test spin
>>here as well, I have one drive that likes to barf with ncq occasionally.
> 
> 
> Ok, I've run with this for a few days and finally hit the
> drive-stops-responding condition yesterday afternoon. Error recovery
> worked a lot better than before, but eventually went down anyways. But
> now I got a better look at the error, and it's the drive throwing an
> ICRC (error 0x80). Very odd. I've never seen this happen with non-NCQ
> operations, however I've seen it now a few times using NCQ. Any ideas?
> 

  Hello, Jens.

  Can you please describe in detail how the drive went down?  If
possible, log messages w/ the debug message patch applied would be
great.  As the EH now resets both the controller (on entry to EH) and
the drive (on timeout), we should be able to recover unless something
goes very wrong.

  I'm currently rewriting the sil24 driver to make it look saner and
support NCQ.  Once I'm done with it (maybe one or two more days... I
hope), I'll do the second take of the generic NCQ patches, including
the ATAPI EH fix and other changes, and it would be great to have your
failure log messages before doing that.

  Thanks.  :-)

-- 
tejun


* Re: [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling
  2005-06-30 10:51     ` Tejun Heo
@ 2005-06-30 15:26       ` Jens Axboe
  2005-07-01  0:20         ` Tejun
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2005-06-30 15:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: jgarzik, linux-ide

On Thu, Jun 30 2005, Tejun Heo wrote:
> Jens Axboe wrote:
> >On Mon, Jun 27 2005, Jens Axboe wrote:
> >
> >>On Mon, Jun 27 2005, Tejun Heo wrote:
> >>
> >>>Hello, Jeff.
> >>>Hello, Jens.
> >>>
> >>>This patchset implements generic completion and error-handling for
> >>>NCQ commands.  This patchset assumes that the previous six misc
> >>>patches to NCQ are applied.
> >>
> >>Excellent, much needed work in that area. I will give it a test spin
> >>here as well, I have one drive that likes to barf with ncq occasionally.
> >
> >
> >Ok, I've run with this for a few days and finally hit the
> >drive-stops-responding condition yesterday afternoon. Error recovery
> >worked a lot better than before, but eventually went down anyways. But
> >now I got a better look at the error, and it's the drive throwing an
> >ICRC (error 0x80). Very odd. I've never seen this happen with non-NCQ
> >operations, however I've seen it now a few times using NCQ. Any ideas?
> >
> 
>  Hello, Jens.
> 
>  Can you please describe how the drive went down in detail?  If 
> possible, log messages w/ the debug message patch applied would be 
> great.  As the EH now resets both the controller (on entry to EH) and 
> the drive (on timeout), we should be able to recover unless something 
> goes very strange.

I'm pretty sure it wasn't the fault of the error handling, although I
cannot say for sure of course. I don't have the log saved, but what
happened was that the drive threw a 0x80 ICRC error, the drive was
COMRESET, io was errored, and then nothing happened after that. Access
to the drive hung.

I will save the log the next time it occurs; I could not this time since
I was working on the machine remotely and needed it rebooted.

>  I'm currently trying to rewrite sil24 driver to make it look saner and 
> support NCQ.  Once I'm done with it (maybe one or two more days... I 
> hope), I'll do the second take of generic NCQ patches including ATAPI EH 
> fix and stuff and it would be great to have your failure log message 
> before doing that.

It should trigger again within a day or two, I will send it when it
does. Can you resend the debug patch?

-- 
Jens Axboe



* Re: [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling
  2005-06-30 15:26       ` Jens Axboe
@ 2005-07-01  0:20         ` Tejun
  2005-07-01  8:59           ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun @ 2005-07-01  0:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: jgarzik, linux-ide

On Thu, Jun 30, 2005 at 05:26:20PM +0200, Jens Axboe wrote:
> On Thu, Jun 30 2005, Tejun Heo wrote:
> > Jens Axboe wrote:
> > >On Mon, Jun 27 2005, Jens Axboe wrote:
> > >
> > >>On Mon, Jun 27 2005, Tejun Heo wrote:
> > >>
> > >>>Hello, Jeff.
> > >>>Hello, Jens.
> > >>>
> > >>>This patchset implements generic completion and error-handling for
> > >>>NCQ commands.  This patchset assumes that the previous six misc
> > >>>patches to NCQ are applied.
> > >>
> > >>Excellent, much needed work in that area. I will give it a test spin
> > >>here as well, I have one drive that likes to barf with ncq occasionally.
> > >
> > >
> > >Ok, I've run with this for a few days and finally hit the
> > >drive-stops-responding condition yesterday afternoon. Error recovery
> > >worked a lot better than before, but eventually went down anyways. But
> > >now I got a better look at the error, and it's the drive throwing an
> > >ICRC (error 0x80). Very odd. I've never seen this happen with non-NCQ
> > >operations, however I've seen it now a few times using NCQ. Any ideas?
> > >
> > 
> >  Hello, Jens.
> > 
> >  Can you please describe how the drive went down in detail?  If 
> > possible, log messages w/ the debug message patch applied would be 
> > great.  As the EH now resets both the controller (on entry to EH) and 
> > the drive (on timeout), we should be able to recover unless something 
> > goes very strange.
> 
> I'm pretty sure it wasn't the fault of the error handling, although I
> cannot say for sure of course. I don't have the log saved, but what
> happened was that the drive threw an 0x80 icrc error, drive was
> COMRESET, io was errored, and then nothing happened after that. Access
> to the drive hung.
> 
> I will save the log the next time it occurs, I could not this time since
> I was working on the machine remotely and needed it rebooted.
> 
> >  I'm currently trying to rewrite sil24 driver to make it look saner and 
> > support NCQ.  Once I'm done with it (maybe one or two more days... I 
> > hope), I'll do the second take of generic NCQ patches including ATAPI EH 
> > fix and stuff and it would be great to have your failure log message 
> > before doing that.
> 
> It should trigger again within a day or two, I will send it when it
> does. Can you resend the debug patch?
> 
> -- 
> Jens Axboe


 Hi, Jens.

 I converted most of the debug messages I used during development into
warning messages when posting the patchset and forgot about it, so
I never posted the debug patch.  Sorry about that.  Here's a small
patch which adds some more messages, though.  It also printk's the FIS
on each command issue in ahci.c:ahci_qc_issue(); if you think that
would fill your log excessively, feel free to turn it off.  It
probably wouldn't matter anyway.


Index: work/drivers/scsi/ahci.c
===================================================================
--- work.orig/drivers/scsi/ahci.c	2005-07-01 08:41:17.000000000 +0900
+++ work/drivers/scsi/ahci.c	2005-07-01 09:18:34.000000000 +0900
@@ -635,8 +635,14 @@ static inline int ahci_host_intr(struct 
 		if (ap->sactive ||
 		    (readl(cmd_issue) & (1 << ap->active_tag)) == 0)
 			ata_ncq_complete(ap);
-	} else
+	} else {
+		printk("AHCI: ata%u: error irq, status=%08x stat=%02x err=%02x sstat=%08x serr=%08x\n",
+		       ap->id, status,
+		       ahci_check_status(ap), ahci_check_err(ap),
+		       readl(port_mmio + PORT_SCR_STAT),
+		       readl(port_mmio + PORT_SCR_ERR));
 		ata_ncq_abort(ap);
+	}
 
 	return 1;
 }
@@ -696,6 +702,15 @@ static int ahci_qc_issue(struct ata_queu
 	void *port_mmio = (void *) ap->ioaddr.cmd_addr;
 	unsigned int tag = ata_qc_to_tag(qc);
 
+#if 1
+	{
+		struct ahci_port_priv *pp = qc->ap->private_data;
+		unsigned *fis = pp->cmd_tbl + (tag * AHCI_CMD_TOTAL);
+		printk("AHCI: ata%u: FIS(tag%02u): %08x %08x %08x %08x %08x\n",
+		       ap->id, tag, fis[0], fis[1], fis[2], fis[3], fis[4]);
+	}
+#endif
+
 	if (qc->flags & ATA_QCFLAG_NCQ) {
 		writel(1 << tag, port_mmio + PORT_SCR_ACT);
 		readl(port_mmio + PORT_SCR_ACT);
Index: work/drivers/scsi/libata-core.c
===================================================================
--- work.orig/drivers/scsi/libata-core.c	2005-07-01 08:41:17.000000000 +0900
+++ work/drivers/scsi/libata-core.c	2005-07-01 08:49:42.000000000 +0900
@@ -1485,6 +1485,7 @@ void __sata_phy_reset(struct ata_port *a
 	}
 	scr_write_flush(ap, SCR_CONTROL, 0x300); /* phy wake/clear reset */
 
+	printk("ata%u: started resetting...\n", ap->id);
 	/* wait for phy to become ready, if necessary */
 	do {
 		msleep(200);
@@ -1492,6 +1493,7 @@ void __sata_phy_reset(struct ata_port *a
 		if ((sstatus & 0xf) != 1)
 			break;
 	} while (time_before(jiffies, timeout));
+	printk("ata%u: end resetting, sstatus=%08x\n", ap->id, sstatus);
 
 	/* TODO: phy layer with polling, timeouts, etc. */
 	if (sata_dev_present(ap))


* Re: [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling
  2005-07-01  0:20         ` Tejun
@ 2005-07-01  8:59           ` Jens Axboe
  2005-07-04  5:53             ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2005-07-01  8:59 UTC (permalink / raw)
  To: Tejun; +Cc: jgarzik, linux-ide

On Fri, Jul 01 2005, Tejun wrote:
> On Thu, Jun 30, 2005 at 05:26:20PM +0200, Jens Axboe wrote:
> > On Thu, Jun 30 2005, Tejun Heo wrote:
> > > Jens Axboe wrote:
> > > >On Mon, Jun 27 2005, Jens Axboe wrote:
> > > >
> > > >>On Mon, Jun 27 2005, Tejun Heo wrote:
> > > >>
> > > >>>Hello, Jeff.
> > > >>>Hello, Jens.
> > > >>>
> > > >>>This patchset implements generic completion and error-handling for
> > > >>>NCQ commands.  This patchset assumes that the previous six misc
> > > >>>patches to NCQ are applied.
> > > >>
> > > >>Excellent, much needed work in that area. I will give it a test spin
> > > >>here as well, I have one drive that likes to barf with ncq occasionally.
> > > >
> > > >
> > > >Ok, I've run with this for a few days and finally hit the
> > > >drive-stops-responding condition yesterday afternoon. Error recovery
> > > >worked a lot better than before, but eventually went down anyways. But
> > > >now I got a better look at the error, and it's the drive throwing an
> > > >ICRC (error 0x80). Very odd. I've never seen this happen with non-NCQ
> > > >operations, however I've seen it now a few times using NCQ. Any ideas?
> > > >
> > > 
> > >  Hello, Jens.
> > > 
> > >  Can you please describe how the drive went down in detail?  If 
> > > possible, log messages w/ the debug message patch applied would be 
> > > great.  As the EH now resets both the controller (on entry to EH) and 
> > > the drive (on timeout), we should be able to recover unless something 
> > > goes very strange.
> > 
> > I'm pretty sure it wasn't the fault of the error handling, although I
> > cannot say for sure of course. I don't have the log saved, but what
> > happened was that the drive threw an 0x80 icrc error, drive was
> > COMRESET, io was errored, and then nothing happened after that. Access
> > to the drive hung.
> > 
> > I will save the log the next time it occurs, I could not this time since
> > I was working on the machine remotely and needed it rebooted.
> > 
> > >  I'm currently trying to rewrite sil24 driver to make it look saner and 
> > > support NCQ.  Once I'm done with it (maybe one or two more days... I 
> > > hope), I'll do the second take of generic NCQ patches including ATAPI EH 
> > > fix and stuff and it would be great to have your failure log message 
> > > before doing that.
> > 
> > It should trigger again within a day or two, I will send it when it
> > does. Can you resend the debug patch?
> > 
> > -- 
> > Jens Axboe
> 
> 
>  Hi, Jens.
> 
>  I converted most of debug messages I've used during development into
> warning messages when posting the patchset and forgot about it, so
> I've never posted the debug patch.  Sorry about that.  Here's a small
> patch which adds some more messages though.  The following patch also
> adds printk'ing FIS on each command issue in ahci.c:ahci_qc_issue(),
> if you think it would fill your log excessively, feel free to turn it
> off.  It wouldn't probably matter anyway.

I will have to kill the issue part of the patch, that would generate
insane amounts of printk traffic :-)

I'll boot the kernel and report what happens.

-- 
Jens Axboe



* Re: [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling
  2005-07-01  8:59           ` Jens Axboe
@ 2005-07-04  5:53             ` Jens Axboe
  2005-07-06 12:55               ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2005-07-04  5:53 UTC (permalink / raw)
  To: Tejun; +Cc: jgarzik, linux-ide

On Fri, Jul 01 2005, Jens Axboe wrote:
> >  I converted most of debug messages I've used during development into
> > warning messages when posting the patchset and forgot about it, so
> > I've never posted the debug patch.  Sorry about that.  Here's a small
> > patch which adds some more messages though.  The following patch also
> > adds printk'ing FIS on each command issue in ahci.c:ahci_qc_issue(),
> > if you think it would fill your log excessively, feel free to turn it
> > off.  It wouldn't probably matter anyway.
> 
> I will have to kill the issue part of the patch, that would generate
> insane amounts of printk traffic :-)
> 
> I'll boot the kernel and report what happens.

It triggered last night, but the old kernel was booted. This was the
log:

ahci ata1: stat=d0, issuing COMRESET
ata1: recovering from error
ata1: status=0x01 { Error }
ata1: error=0x80 { Sector }
SCSI error : <0 0 0 0> return code = 0x8000002
sda: Current: sense key=0x3
    ASC=0x11 ASCQ=0x4
end_request: I/O error, dev sda, sector 66255899
Buffer I/O error on device sda2, logical block 8018923
lost page write due to I/O error on sda2
ata1: status=0x01 { Error }
ata1: error=0x80 { Sector }
SCSI error : <0 0 0 0> return code = 0x8000002
sda: Current: sense key=0x3
    ASC=0x11 ASCQ=0x4
end_request: I/O error, dev sda, sector 66239043
Buffer I/O error on device sda2, logical block 8016816
lost page write due to I/O error on sda2
ata1: recovering from error
ata1: status=0x01 { Error }
ata1: error=0x80 { Sector }
SCSI error : <0 0 0 0> return code = 0x8000002
sda: Current: sense key=0x3
    ASC=0x11 ASCQ=0x4
end_request: I/O error, dev sda, sector 66239051
Buffer I/O error on device sda2, logical block 8016817
lost page write due to I/O error on sda2
ata1: status=0x01 { Error }
ata1: error=0x80 { Sector }
SCSI error : <0 0 0 0> return code = 0x8000002
sda: Current: sense key=0x3
    ASC=0x11 ASCQ=0x4
end_request: I/O error, dev sda, sector 35137043

Which just continues. I'll boot the right kernel now. I removed the
reset dependency on the read_log_page() issue; I suspect we still
need that to kick-start the drive after an error.

-- 
Jens Axboe



* Re: [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling
  2005-07-04  5:53             ` Jens Axboe
@ 2005-07-06 12:55               ` Jens Axboe
  2005-07-06 13:00                 ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2005-07-06 12:55 UTC (permalink / raw)
  To: Tejun; +Cc: jgarzik, linux-ide

On Mon, Jul 04 2005, Jens Axboe wrote:
> On Fri, Jul 01 2005, Jens Axboe wrote:
> > >  I converted most of debug messages I've used during development into
> > > warning messages when posting the patchset and forgot about it, so
> > > I've never posted the debug patch.  Sorry about that.  Here's a small
> > > patch which adds some more messages though.  The following patch also
> > > adds printk'ing FIS on each command issue in ahci.c:ahci_qc_issue(),
> > > if you think it would fill your log excessively, feel free to turn it
> > > off.  It wouldn't probably matter anyway.
> > 
> > I will have to kill the issue part of the patch, that would generate
> > insane amounts of printk traffic :-)
> > 
> > I'll boot the kernel and report what happens.
> 
> It triggered last night, but the old kernel was booted. This was the
> log:
> 
> ahci ata1: stat=d0, issuing COMRESET
> ata1: recovering from error
> ata1: status=0x01 { Error }
> ata1: error=0x80 { Sector }
> SCSI error : <0 0 0 0> return code = 0x8000002
> sda: Current: sense key=0x3
>     ASC=0x11 ASCQ=0x4
> end_request: I/O error, dev sda, sector 66255899
> Buffer I/O error on device sda2, logical block 8018923
> lost page write due to I/O error on sda2
> ata1: status=0x01 { Error }
> ata1: error=0x80 { Sector }
> SCSI error : <0 0 0 0> return code = 0x8000002
> sda: Current: sense key=0x3
>     ASC=0x11 ASCQ=0x4
> end_request: I/O error, dev sda, sector 66239043
> Buffer I/O error on device sda2, logical block 8016816
> lost page write due to I/O error on sda2
> ata1: recovering from error
> ata1: status=0x01 { Error }
> ata1: error=0x80 { Sector }
> SCSI error : <0 0 0 0> return code = 0x8000002
> sda: Current: sense key=0x3
>     ASC=0x11 ASCQ=0x4
> end_request: I/O error, dev sda, sector 66239051
> Buffer I/O error on device sda2, logical block 8016817
> lost page write due to I/O error on sda2
> ata1: status=0x01 { Error }
> ata1: error=0x80 { Sector }
> SCSI error : <0 0 0 0> return code = 0x8000002
> sda: Current: sense key=0x3
>     ASC=0x11 ASCQ=0x4
> end_request: I/O error, dev sda, sector 35137043

This is with the extra debug. Given that it is the timeout triggering,
only the sstatus is new.

ahci ata1: stat=d0, issuing COMRESET
ata1: started resetting...
ata1: end resetting, sstatus=00000113
ata1: recovering from error
ata1: status=0x01 { Error }
ata1: error=0x80 { Sector }
SCSI error : <0 0 0 0> return code = 0x8000002
sda: Current: sense key=0x3
    ASC=0x11 ASCQ=0x4
end_request: I/O error, dev sda, sector 66190875
Buffer I/O error on device sda2, logical block 8010795
lost page write due to I/O error on sda2
ata1: status=0x01 { Error }
ata1: error=0x80 { Sector }
SCSI error : <0 0 0 0> return code = 0x8000002
sda: Current: sense key=0x3
    ASC=0x11 ASCQ=0x4
end_request: I/O error, dev sda, sector 66159699
Buffer I/O error on device sda2, logical block 8006898
lost page write due to I/O error on sda2
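
The sstatus value printed after the reset is the raw SATA SStatus register. As a minimal sketch (the struct and helper names below are hypothetical, chosen only for illustration), its three 4-bit fields can be split out like this:

```c
#include <stdint.h>

/* Fields of the SATA SStatus (SCR0) register. */
struct sstatus_fields {
	unsigned det;	/* bits 3:0   device detection */
	unsigned spd;	/* bits 7:4   negotiated interface speed */
	unsigned ipm;	/* bits 11:8  interface power management state */
};

static struct sstatus_fields decode_sstatus(uint32_t sstatus)
{
	struct sstatus_fields f;

	f.det = sstatus & 0xf;
	f.spd = (sstatus >> 4) & 0xf;
	f.ipm = (sstatus >> 8) & 0xf;
	return f;
}

/* DET == 3 means a device is present and phy communication is established. */
static int phy_link_up(uint32_t sstatus)
{
	return (sstatus & 0xf) == 3;
}
```

For sstatus=00000113 this yields DET=3, SPD=1 (1.5 Gbps) and IPM=1 (active), so the link itself came back after COMRESET. The wait loop in __sata_phy_reset() only checks that DET has left the value 1 (device detected, phy not yet established), so a link-up reading does not by itself say whether the drive is responding to commands.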


-- 
Jens Axboe



* Re: [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling
  2005-07-06 12:55               ` Jens Axboe
@ 2005-07-06 13:00                 ` Jens Axboe
  2005-07-06 15:11                   ` Tejun Heo
  2005-07-08  8:03                   ` Jens Axboe
  0 siblings, 2 replies; 26+ messages in thread
From: Jens Axboe @ 2005-07-06 13:00 UTC (permalink / raw)
  To: Tejun; +Cc: jgarzik, linux-ide

On Wed, Jul 06 2005, Jens Axboe wrote:
> On Mon, Jul 04 2005, Jens Axboe wrote:
> > On Fri, Jul 01 2005, Jens Axboe wrote:
> > > >  I converted most of debug messages I've used during development into
> > > > warning messages when posting the patchset and forgot about it, so
> > > > I've never posted the debug patch.  Sorry about that.  Here's a small
> > > > patch which adds some more messages though.  The following patch also
> > > > adds printk'ing FIS on each command issue in ahci.c:ahci_qc_issue(),
> > > > if you think it would fill your log excessively, feel free to turn it
> > > > off.  It wouldn't probably matter anyway.
> > > 
> > > I will have to kill the issue part of the patch, that would generate
> > > insane amounts of printk traffic :-)
> > > 
> > > I'll boot the kernel and report what happens.
> > 
> > It triggered last night, but the old kernel was booted. This was the
> > log:
> > 
> > ahci ata1: stat=d0, issuing COMRESET
> > ata1: recovering from error
> > ata1: status=0x01 { Error }
> > ata1: error=0x80 { Sector }
> > SCSI error : <0 0 0 0> return code = 0x8000002
> > sda: Current: sense key=0x3
> >     ASC=0x11 ASCQ=0x4
> > end_request: I/O error, dev sda, sector 66255899
> > Buffer I/O error on device sda2, logical block 8018923
> > lost page write due to I/O error on sda2
> > ata1: status=0x01 { Error }
> > ata1: error=0x80 { Sector }
> > SCSI error : <0 0 0 0> return code = 0x8000002
> > sda: Current: sense key=0x3
> >     ASC=0x11 ASCQ=0x4
> > end_request: I/O error, dev sda, sector 66239043
> > Buffer I/O error on device sda2, logical block 8016816
> > lost page write due to I/O error on sda2
> > ata1: recovering from error
> > ata1: status=0x01 { Error }
> > ata1: error=0x80 { Sector }
> > SCSI error : <0 0 0 0> return code = 0x8000002
> > sda: Current: sense key=0x3
> >     ASC=0x11 ASCQ=0x4
> > end_request: I/O error, dev sda, sector 66239051
> > Buffer I/O error on device sda2, logical block 8016817
> > lost page write due to I/O error on sda2
> > ata1: status=0x01 { Error }
> > ata1: error=0x80 { Sector }
> > SCSI error : <0 0 0 0> return code = 0x8000002
> > sda: Current: sense key=0x3
> >     ASC=0x11 ASCQ=0x4
> > end_request: I/O error, dev sda, sector 35137043
> 
> This is with the extra debug. Given that it is the timeout triggering,
> only the sstatus is new.
> 
> ahci ata1: stat=d0, issuing COMRESET
> ata1: started resetting...
> ata1: end resetting, sstatus=00000113
> ata1: recovering from error
> ata1: status=0x01 { Error }
> ata1: error=0x80 { Sector }
> SCSI error : <0 0 0 0> return code = 0x8000002
> sda: Current: sense key=0x3
>     ASC=0x11 ASCQ=0x4
> end_request: I/O error, dev sda, sector 66190875
> Buffer I/O error on device sda2, logical block 8010795
> lost page write due to I/O error on sda2
> ata1: status=0x01 { Error }
> ata1: error=0x80 { Sector }
> SCSI error : <0 0 0 0> return code = 0x8000002
> sda: Current: sense key=0x3
>     ASC=0x11 ASCQ=0x4
> end_request: I/O error, dev sda, sector 66159699
> Buffer I/O error on device sda2, logical block 8006898
> lost page write due to I/O error on sda2

btw, the reason it hangs here (I suspect) is that your read_log_page()
logic is wrong - not every error condition will have NCQ_FAILED set
before entering ncq_recover. The timeout will not, for instance.
Testing... As usual, this will take days.

-- 
Jens Axboe



* Re: [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling
  2005-07-06 13:00                 ` Jens Axboe
@ 2005-07-06 15:11                   ` Tejun Heo
  2005-07-08  8:03                   ` Jens Axboe
  1 sibling, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2005-07-06 15:11 UTC (permalink / raw)
  To: Jens Axboe; +Cc: jgarzik, linux-ide

Jens Axboe wrote:
> On Wed, Jul 06 2005, Jens Axboe wrote:
> 
>>On Mon, Jul 04 2005, Jens Axboe wrote:
>>
>>>On Fri, Jul 01 2005, Jens Axboe wrote:
>>>
>>>>> I converted most of debug messages I've used during development into
>>>>>warning messages when posting the patchset and forgot about it, so
>>>>>I've never posted the debug patch.  Sorry about that.  Here's a small
>>>>>patch which adds some more messages though.  The following patch also
>>>>>adds printk'ing FIS on each command issue in ahci.c:ahci_qc_issue(),
>>>>>if you think it would fill your log excessively, feel free to turn it
>>>>>off.  It wouldn't probably matter anyway.
>>>>
>>>>I will have to kill the issue part of the patch, that would generate
>>>>insane amounts of printk traffic :-)
>>>>
>>>>I'll boot the kernel and report what happens.
>>>
>>>It triggered last night, but the old kernel was booted. This was the
>>>log:
>>>
>>>ahci ata1: stat=d0, issuing COMRESET
>>>ata1: recovering from error
>>>ata1: status=0x01 { Error }
>>>ata1: error=0x80 { Sector }
>>>SCSI error : <0 0 0 0> return code = 0x8000002
>>>sda: Current: sense key=0x3
>>>    ASC=0x11 ASCQ=0x4
>>>end_request: I/O error, dev sda, sector 66255899
>>>Buffer I/O error on device sda2, logical block 8018923
>>>lost page write due to I/O error on sda2
>>>ata1: status=0x01 { Error }
>>>ata1: error=0x80 { Sector }
>>>SCSI error : <0 0 0 0> return code = 0x8000002
>>>sda: Current: sense key=0x3
>>>    ASC=0x11 ASCQ=0x4
>>>end_request: I/O error, dev sda, sector 66239043
>>>Buffer I/O error on device sda2, logical block 8016816
>>>lost page write due to I/O error on sda2
>>>ata1: recovering from error
>>>ata1: status=0x01 { Error }
>>>ata1: error=0x80 { Sector }
>>>SCSI error : <0 0 0 0> return code = 0x8000002
>>>sda: Current: sense key=0x3
>>>    ASC=0x11 ASCQ=0x4
>>>end_request: I/O error, dev sda, sector 66239051
>>>Buffer I/O error on device sda2, logical block 8016817
>>>lost page write due to I/O error on sda2
>>>ata1: status=0x01 { Error }
>>>ata1: error=0x80 { Sector }
>>>SCSI error : <0 0 0 0> return code = 0x8000002
>>>sda: Current: sense key=0x3
>>>    ASC=0x11 ASCQ=0x4
>>>end_request: I/O error, dev sda, sector 35137043
>>
>>This is with the extra debug. Given that it is the timeout triggering,
>>only the sstatus is new.
>>
>>ahci ata1: stat=d0, issuing COMRESET
>>ata1: started resetting...
>>ata1: end resetting, sstatus=00000113
>>ata1: recovering from error
>>ata1: status=0x01 { Error }
>>ata1: error=0x80 { Sector }
>>SCSI error : <0 0 0 0> return code = 0x8000002
>>sda: Current: sense key=0x3
>>    ASC=0x11 ASCQ=0x4
>>end_request: I/O error, dev sda, sector 66190875
>>Buffer I/O error on device sda2, logical block 8010795
>>lost page write due to I/O error on sda2
>>ata1: status=0x01 { Error }
>>ata1: error=0x80 { Sector }
>>SCSI error : <0 0 0 0> return code = 0x8000002
>>sda: Current: sense key=0x3
>>    ASC=0x11 ASCQ=0x4
>>end_request: I/O error, dev sda, sector 66159699
>>Buffer I/O error on device sda2, logical block 8006898
>>lost page write due to I/O error on sda2
> 
> 
> btw, the reason it hangs here (I suspect) is that your read_log_page()
> logic is wrong - not every error condition will have NCQ_FAILED set
> before entering ncq_recover. The timeout will not, for instance.
> Testing... As usual, this will take days.
> 

  I thought log page 10h would be valid only after the drive reported
an error during NCQ processing.  That's why it doesn't read the log
page on timeouts.  Hmmm, maybe we should read log page 10h on any NCQ
failure but discard the result on timeout.  Please let me know how
your testing goes.
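
The idea of reading log page 10h on every NCQ failure but ignoring its contents after a timeout could look roughly like the sketch below. This is not code from the patchset: the function name, the struct and the return codes are hypothetical, and only the page layout (byte 0 holds the NQ bit and the tag, bytes 2 and 3 hold Status and Error, and all 512 bytes sum to zero) follows the ATA definition of the queued error log.

```c
#include <stddef.h>
#include <stdint.h>

#define LOG_10H_SIZE 512

/* What EH needs from the queued error log (hypothetical type). */
struct ncq_err_info {
	uint8_t tag;	/* tag of the failed queued command */
	uint8_t status;	/* Status register value for that command */
	uint8_t error;	/* Error register value for that command */
};

/*
 * Returns 0 and fills *info when the page identifies a failed tag.
 * Returns a negative value when the contents must not be used.
 */
static int ncq_parse_log_10h(const uint8_t *buf, int timed_out,
			     struct ncq_err_info *info)
{
	uint8_t csum = 0;
	size_t i;

	/* On timeout the drive reported no error, so discard the result. */
	if (timed_out)
		return -1;

	/* A valid page sums to zero over all 512 bytes. */
	for (i = 0; i < LOG_10H_SIZE; i++)
		csum += buf[i];
	if (csum != 0)
		return -2;

	/* NQ set: the failing command was not a queued command. */
	if (buf[0] & 0x80)
		return -3;

	info->tag = buf[0] & 0x1f;
	info->status = buf[2];
	info->error = buf[3];
	return 0;
}
```

With this split, the caller would always issue the read and fall back to failing every outstanding command whenever the function returns non-zero, which covers both the timeout case and a garbled page.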

  Thanks. :-)

-- 
tejun


* Re: [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling
  2005-07-06 13:00                 ` Jens Axboe
  2005-07-06 15:11                   ` Tejun Heo
@ 2005-07-08  8:03                   ` Jens Axboe
  2005-07-08 10:27                     ` Tejun Heo
  1 sibling, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2005-07-08  8:03 UTC (permalink / raw)
  To: Tejun; +Cc: jgarzik, linux-ide

Hi,

Ok, one more error, this time from irq context:

AHCI: ata1: error irq, status=40000001 stat=51 err=04 sstat=00000113 serr=00000000
ata1: aborting commands due to error.  active_tag -1, sactive 00000001
ahci: sactive 1
ata1: recovering from error
sactive=0
ata1: failed to read log page 10h (-110)
ata1: resetting...
ata1: started resetting...
ata1: end resetting, sstatus=00000113
ata1: status=0x01 { Error }
ata1: error=0x80 { Sector }
SCSI error : <0 0 0 0> return code = 0x8000002
sda: Current: sense key=0x3
    ASC=0x11 ASCQ=0x4
end_request: I/O error, dev sda, sector 2120875
Buffer I/O error on device sda2, logical block 2045
lost page write due to I/O error on sda2
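
The stat= and err= bytes in these messages are the ATA Status and Error registers. The sketch below names their bits using the classic task-file definitions (several of the Status bits, such as DSC, CORR and IDX, are historical names). The table and function names are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct bit_name {
	uint8_t mask;
	const char *name;
};

static const struct bit_name ata_status_bits[] = {
	{ 0x80, "BSY" }, { 0x40, "DRDY" }, { 0x20, "DF" },  { 0x10, "DSC" },
	{ 0x08, "DRQ" }, { 0x04, "CORR" }, { 0x02, "IDX" }, { 0x01, "ERR" },
};

static const struct bit_name ata_error_bits[] = {
	{ 0x80, "ICRC" }, { 0x40, "UNC" },  { 0x20, "MC" },    { 0x10, "IDNF" },
	{ 0x08, "MCR" },  { 0x04, "ABRT" }, { 0x02, "TK0NF" }, { 0x01, "AMNF" },
};

/* Write the names of all set bits into out, separated by spaces. */
static void format_bits(uint8_t val, const struct bit_name *tbl, size_t n,
			char *out, size_t outlen)
{
	size_t i, used = 0;

	if (outlen == 0)
		return;
	out[0] = '\0';
	for (i = 0; i < n; i++) {
		int w;

		if (!(val & tbl[i].mask))
			continue;
		w = snprintf(out + used, outlen - used, "%s%s",
			     used ? " " : "", tbl[i].name);
		if (w < 0 || (size_t)w >= outlen - used)
			break;	/* output truncated, stop here */
		used += (size_t)w;
	}
}

static void format_ata_status(uint8_t stat, char *out, size_t outlen)
{
	format_bits(stat, ata_status_bits, 8, out, outlen);
}

static void format_ata_error(uint8_t err, char *out, size_t outlen)
{
	format_bits(err, ata_error_bits, 8, out, outlen);
}
```

Read this way, stat=51 err=04 is DRDY DSC ERR with ABRT, and the stat=d0 seen before COMRESET is BSY DRDY DSC. Bit 7 of the Error register is ICRC for Ultra DMA transfers but was the bad-block bit in older ATA revisions, which is presumably why the kernel prints "{ Sector }" for error=0x80.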

-- 
Jens Axboe



* Re: [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling
  2005-07-08  8:03                   ` Jens Axboe
@ 2005-07-08 10:27                     ` Tejun Heo
  2005-07-08 13:54                       ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2005-07-08 10:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: jgarzik, linux-ide

On Fri, Jul 08, 2005 at 10:03:48AM +0200, Jens Axboe wrote:
> Hi,
> 
> Ok, one more error, this time from irq context:
> 
> AHCI: ata1: error irq, status=40000001 stat=51 err=04 sstat=00000113 serr=00000000
> ata1: aborting commands due to error.  active_tag -1, sactive 00000001
> ahci: sactive 1
> ata1: recovering from error
> sactive=0
> ata1: failed to read log page 10h (-110)
> ata1: resetting...
> ata1: started resetting...
> ata1: end resetting, sstatus=00000113
> ata1: status=0x01 { Error }
> ata1: error=0x80 { Sector }
> SCSI error : <0 0 0 0> return code = 0x8000002
> sda: Current: sense key=0x3
>     ASC=0x11 ASCQ=0x4
> end_request: I/O error, dev sda, sector 2120875
> Buffer I/O error on device sda2, logical block 2045
> lost page write due to I/O error on sda2
> 
> -- 
> Jens Axboe

 Hi, Jens.

 I also have a weird lockup log.  This log was generated with the
second take of the NCQ patchset I posted yesterday.  It's a Samsung
HD160JJ on ICH7R AHCI.  Every command with tag#3 is scrambled, so
recovery operations are performed constantly (log 10h; if that fails,
COMRESET).  After a few hours, the drive failed to come back online
after COMRESET.  Rebooting didn't work.  BIOS couldn't detect/recover
it.  I had to power-cycle to bring it online again.  I'm running a
similar test with a much lower error rate (5~6 errors per 3000
requests) to avoid too many COMRESETs and, for more than six hours,
it's been running okay.  Error log follows.

 Jens, can you run your test with the new patchset?
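
 For reading the FIS dumps in this log, here is a minimal userspace
sketch (not code from the patchset) of how the dumped dwords of a
host-to-device register FIS map to command, tag and LBA for an FPDMA
command.  It assumes the debug output prints the command FIS as
consecutive little-endian dwords; the function names are made up for
illustration.

```c
#include <stdint.h>

/*
 * Illustration only: the names below are hypothetical and are not part
 * of libata or ahci.  Assumes dw[] holds the dumped dwords in order.
 */

/* Command opcode is byte 2 of dword 0 (0x60 is READ FPDMA QUEUED). */
static unsigned int fis_command(const uint32_t dw[5])
{
	return (dw[0] >> 16) & 0xff;
}

/* NCQ tag lives in bits 7:3 of the sector count field (dword 3, byte 0). */
static unsigned int fis_ncq_tag(const uint32_t dw[5])
{
	return (dw[3] & 0xff) >> 3;
}

/* 48-bit LBA: low 24 bits in dword 1, high 24 bits in dword 2. */
static uint64_t fis_lba48(const uint32_t dw[5])
{
	return (uint64_t)(dw[1] & 0xffffff) |
	       ((uint64_t)(dw[2] & 0xffffff) << 24);
}
```

 Fed the "FAILING 03" line from 01:55:33 (28608027 408fe875 0000000c
08000018 00000000), this gives command 0x60, tag 3 and LBA 210757749,
which is one of the sectors end_request reports after that recovery.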


Jul  9 01:55:33 jtj kernel: AHCI2: FIS(tag00) dec65b00: 20608027 40492a16 0000000d 08000000 00000000
Jul  9 01:55:33 jtj kernel: AHCI2: FIS(tag01) de412380: 20608027 4033c992 00000006 08000008 00000000
Jul  9 01:55:33 jtj kernel: AHCI2: FIS(tag02) dec59680: 20608027 40f62998 00000007 08000010 00000000
Jul  9 01:55:33 jtj kernel: AHCI2: FAILING 03 dec65980: 28608027 408fe875 0000000c 08000018 00000000
Jul  9 01:55:33 jtj kernel: AHCI2: FIS(tag03) dec65980: 28608027 40ffffff 00ffffff 08000018 00000000
Jul  9 01:55:33 jtj kernel: AHCI2: FIS(tag00) df512800: 28608027 40af6dbe 0000000b 08000000 00000000
Jul  9 01:55:33 jtj kernel: AHCI2: FIS(tag04) dec59c80: 28608027 4054fb45 0000000b 08000020 00000000
Jul  9 01:55:33 jtj kernel: AHCI2: FIS(tag01) dec59e00: 28608027 404e6f5b 0000000a 08000008 00000000
Jul  9 01:55:33 jtj kernel: ata2: aborting commands due to error.  active_tag -1, sactive 0000001f
Jul  9 01:55:33 jtj kernel: ata2: requesting check condition for failed scmd df512800 tag 0
Jul  9 01:55:33 jtj kernel: ata2: requesting check condition for failed scmd dec59e00 tag 1
Jul  9 01:55:33 jtj kernel: ata2: requesting check condition for failed scmd dec65980 tag 3
Jul  9 01:55:33 jtj kernel: ata2: requesting check condition for failed scmd dec59c80 tag 4
Jul  9 01:55:33 jtj kernel: ata2: recovering from error
Jul  9 01:55:33 jtj kernel: ata2: aborting commands due to error.  active_tag -1, sactive 0000001b
Jul  9 01:55:33 jtj kernel: ata2: ignoring command completion for tag 0 during recovery
Jul  9 01:55:33 jtj kernel: ata2: ignoring command completion for tag 1 during recovery
Jul  9 01:55:33 jtj kernel: ata2: ignoring command completion for tag 3 during recovery
Jul  9 01:55:33 jtj kernel: ata2: ignoring command completion for tag 4 during recovery
Jul  9 01:55:33 jtj kernel: ata2: ignoring command error for tag 0 during recovery
Jul  9 01:55:33 jtj kernel: ata2: ignoring command error for tag 1 during recovery
Jul  9 01:55:33 jtj kernel: ata2: ignoring command error for tag 3 during recovery
Jul  9 01:55:33 jtj kernel: ata2: ignoring command error for tag 4 during recovery
Jul  9 01:55:33 jtj kernel: AHCI2: FIS(tag02) 00000000: 002f8027 a0000010 00000000 08000001 00000000
Jul  9 01:55:38 jtj kernel: ata2: failed to read log page 10h (-110)
Jul  9 01:55:38 jtj kernel: ata2: resetting...
Jul  9 01:55:38 jtj kernel: ata2: ignoring command completion for tag 0 during recovery
Jul  9 01:55:38 jtj kernel: ata2: ignoring command completion for tag 1 during recovery
Jul  9 01:55:38 jtj kernel: ata2: ignoring command completion for tag 3 during recovery
Jul  9 01:55:38 jtj kernel: ata2: ignoring command completion for tag 4 during recovery
Jul  9 01:55:38 jtj kernel: ata2: ignoring command completion for tag 0 during recovery
Jul  9 01:55:38 jtj kernel: ata2: ignoring command completion for tag 1 during recovery
Jul  9 01:55:38 jtj kernel: ata2: ignoring command completion for tag 3 during recovery
Jul  9 01:55:38 jtj kernel: ata2: ignoring command completion for tag 4 during recovery
Jul  9 01:55:38 jtj kernel: ata2: ignoring command completion for tag 0 during recovery
Jul  9 01:55:38 jtj kernel: ata2: ignoring command completion for tag 1 during recovery
Jul  9 01:55:38 jtj kernel: ata2: ignoring command completion for tag 3 during recovery
Jul  9 01:55:38 jtj kernel: ata2: ignoring command completion for tag 4 during recovery
Jul  9 01:55:38 jtj kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jul  9 01:55:38 jtj kernel: ata2: error=0x04 { DriveStatusError }
Jul  9 01:55:38 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:55:39 jtj kernel: sdb: Current: sense key: Aborted Command
Jul  9 01:55:39 jtj kernel:     Additional sense: No additional sense information
Jul  9 01:55:39 jtj kernel: end_request: I/O error, dev sdb, sector 196046270
Jul  9 01:55:39 jtj kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jul  9 01:55:39 jtj kernel: ata2: error=0x04 { DriveStatusError }
Jul  9 01:55:39 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:55:39 jtj kernel: sdb: Current: sense key: Aborted Command
Jul  9 01:55:39 jtj kernel:     Additional sense: No additional sense information
Jul  9 01:55:39 jtj kernel: end_request: I/O error, dev sdb, sector 172912475
Jul  9 01:55:39 jtj kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jul  9 01:55:39 jtj kernel: ata2: error=0x04 { DriveStatusError }
Jul  9 01:55:39 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:55:39 jtj kernel: sdb: Current: sense key: Aborted Command
Jul  9 01:55:39 jtj kernel:     Additional sense: No additional sense information
Jul  9 01:55:39 jtj kernel: end_request: I/O error, dev sdb, sector 210757749
Jul  9 01:55:39 jtj kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jul  9 01:55:39 jtj kernel: ata2: error=0x04 { DriveStatusError }
Jul  9 01:55:39 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:55:39 jtj kernel: sdb: Current: sense key: Aborted Command
Jul  9 01:55:39 jtj kernel:     Additional sense: No additional sense information
Jul  9 01:55:39 jtj kernel: end_request: I/O error, dev sdb, sector 190118725
Jul  9 01:55:39 jtj kernel: ata2: recovery complete
<
< Resuming operation after recovery
<
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag00) dec59c80: 20608027 4054fb4d 0000000b 08000000 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag01) dec65980: 20608027 408fe87d 0000000c 08000008 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag02) dec59e00: 20608027 404e6f63 0000000a 08000010 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag00) df512800: 20608027 40af6dc6 0000000b 08000000 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FAILING 03 dec59680: 28608027 402c12c1 00000001 08000018 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag03) dec59680: 28608027 40ffffff 00ffffff 08000018 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag01) de412380: 28608027 40f81a7a 0000000e 08000008 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag00) dec65b00: 28608027 40cc1ef3 0000000c 08000000 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag02) dec65080: 28608027 40cb8b56 00000003 08000010 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag04) dec59980: 28608027 40b3098c 0000000a 08000020 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag05) dec59b00: 28608027 4003cd11 00000000 08000028 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag06) de412500: 28608027 401c742f 00000007 08000030 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag07) de412680: 28608027 40ff60f3 00000006 08000038 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag08) dec65800: 28608027 405a982c 0000000f 08000040 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag09) dec59800: 28608027 40da1b58 00000011 08000048 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag10) de412980: 28608027 40dded2e 00000002 08000050 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag11) de412800: 28608027 403c47f0 00000006 08000058 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag12) df512980: 28608027 405a6009 0000000c 08000060 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag13) dec65500: 28608027 40dd6792 00000009 08000068 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag14) dec65380: 28608027 40d893d5 0000000b 08000070 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag15) dec59500: 28608027 4016977b 0000000a 08000078 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag16) df512680: 28608027 40ef168c 0000000c 08000080 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag17) dec65c80: 28608027 40d892f2 00000001 08000088 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag18) dec65200: 28608027 4039b3ae 00000010 08000090 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag19) dec65e00: 28608027 405d0c0a 00000011 08000098 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag20) de412e00: 28608027 40a52ad3 00000002 080000a0 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag21) de412080: 28608027 405432a4 00000004 080000a8 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag22) de412c80: 28608027 40f7f7cf 00000002 080000b0 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag23) de412200: 28608027 40ab0ea9 00000011 080000b8 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag24) de412b00: 28608027 4025057f 00000004 080000c0 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag25) dec65680: 28608027 40d2921c 00000006 080000c8 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag26) dec59c80: 28608027 40086324 0000000b 080000d0 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag27) dec65980: 28608027 408976b5 00000001 080000d8 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag28) dec59e00: 28608027 40ff90bc 00000001 080000e0 00000000
Jul  9 01:55:39 jtj kernel: AHCI2: FIS(tag29) df512800: 28608027 40eaf199 00000002 080000e8 00000000
<
< All commands timed out.  Recovery kicks in.  Drive seems locked up already.
<
Jul  9 01:56:09 jtj kernel: ata2: recovering from timeout
Jul  9 01:56:09 jtj kernel: ata2: stat=40, issuing COMRESET
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 0 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 1 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 2 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 3 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 4 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 5 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 6 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 7 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 8 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 9 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 10 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 11 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 12 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 13 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 14 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 15 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 16 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 17 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 18 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 19 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 20 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 21 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 22 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 23 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 24 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 25 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 26 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 27 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 28 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 29 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 0 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 1 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 2 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 3 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 4 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 5 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 6 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 7 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 8 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 9 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 10 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 11 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 12 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 13 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 14 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 15 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 16 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 17 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 18 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 19 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 20 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 21 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 22 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 23 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 24 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 25 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 26 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 27 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 28 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 29 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 0 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 1 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 2 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 3 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 4 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 5 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 6 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 7 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 8 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 9 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 10 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 11 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 12 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 13 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 14 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 15 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 16 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 17 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 18 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 19 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 20 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 21 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 22 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 23 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 24 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 25 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 26 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 27 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 28 during recovery
Jul  9 01:56:09 jtj kernel: ata2: ignoring command completion for tag 29 during recovery
Jul  9 01:56:16 jtj kernel: ata2 is slow to respond, please be patient
Jul  9 01:56:39 jtj kernel: ata2 failed to respond (30 secs)
<
< Drive fails to become ata_ok after COMRESET and libata disables the drive.
<
Jul  9 01:56:39 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:39 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:39 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:39 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:39 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:39 jtj kernel: end_request: I/O error, dev sdb, sector 214703859
Jul  9 01:56:39 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:39 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:39 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:39 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 251140730
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 63671126
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 19665601
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 179505548
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 249105
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 119305263
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 117399795
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 257595436
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 299506520
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 48098606
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 104613872
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 207249417
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 165504914
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 198742997
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 169252731
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 216995468
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 30970610
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 272217006
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 291310602
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 44378835
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 72626852
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 49805263
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 296423081
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 69535103
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 114463260
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 185099044
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 25786037
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 33525948
Jul  9 01:56:40 jtj kernel: ata2: status=0x41 { DriveReady Error }
Jul  9 01:56:40 jtj kernel: ata2: called with no error (41)!
Jul  9 01:56:40 jtj kernel: SCSI error : <2 0 0 0> return code = 0x8000002
Jul  9 01:56:40 jtj kernel: sdb: Current: sense key: Medium Error
Jul  9 01:56:40 jtj kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jul  9 01:56:40 jtj kernel: end_request: I/O error, dev sdb, sector 48951705
Jul  9 01:56:40 jtj kernel: ata2: recovery complete
<
< Drive disabled.  Drive doesn't respond to anything until it's cold
< power cycled.
<
Jul  9 01:56:41 jtj kernel: SCSI error : <2 0 0 0> return code = 0x40000
Jul  9 01:56:41 jtj kernel: end_request: I/O error, dev sdb, sector 48951713
Jul  9 01:56:41 jtj kernel: SCSI error : <2 0 0 0> return code = 0x40000
Jul  9 01:56:41 jtj kernel: end_request: I/O error, dev sdb, sector 33525956
Jul  9 01:56:41 jtj kernel: SCSI error : <2 0 0 0> return code = 0x40000
Jul  9 01:56:41 jtj kernel: end_request: I/O error, dev sdb, sector 25786045
Jul  9 01:56:41 jtj kernel: SCSI error : <2 0 0 0> return code = 0x40000
Jul  9 01:56:41 jtj kernel: end_request: I/O error, dev sdb, sector 185099052
Jul  9 01:56:41 jtj kernel: SCSI error : <2 0 0 0> return code = 0x40000
Jul  9 01:56:41 jtj kernel: end_request: I/O error, dev sdb, sector 114463268
Jul  9 01:56:41 jtj kernel: SCSI error : <2 0 0 0> return code = 0x40000
Jul  9 01:56:41 jtj kernel: end_request: I/O error, dev sdb, sector 69535111
Jul  9 01:56:41 jtj kernel: SCSI error : <2 0 0 0> return code = 0x40000
Jul  9 01:56:41 jtj kernel: end_request: I/O error, dev sdb, sector 296423089
Jul  9 01:56:41 jtj kernel: SCSI error : <2 0 0 0> return code = 0x40000
Jul  9 01:56:41 jtj kernel: end_request: I/O error, dev sdb, sector 49805271
Jul  9 01:56:41 jtj kernel: SCSI error : <2 0 0 0> return code = 0x40000
Jul  9 01:56:41 jtj kernel: end_request: I/O error, dev sdb, sector 72626860
Jul  9 01:56:41 jtj kernel: SCSI error : <2 0 0 0> return code = 0x40000
Jul  9 01:56:41 jtj kernel: end_request: I/O error, dev sdb, sector 44378843
Jul  9 01:56:41 jtj kernel: SCSI error : <2 0 0 0> return code = 0x40000
Jul  9 01:56:41 jtj kernel: end_request: I/O error, dev sdb, sector 291310610
Jul  9 01:56:41 jtj kernel: SCSI error : <2 0 0 0> return code = 0x40000
Jul  9 01:56:41 jtj kernel: end_request: I/O error, dev sdb, sector 272217014
Jul  9 01:56:41 jtj kernel: SCSI error : <2 0 0 0> return code = 0x40000
Jul  9 01:56:41 jtj kernel: end_request: I/O error, dev sdb, sector 30970618
Jul  9 01:56:41 jtj kernel: SCSI error : <2 0 0 0> return code = 0x40000
Jul  9 01:56:41 jtj kernel: end_request: I/O error, dev sdb, sector 216995476
Jul  9 01:56:41 jtj kernel: SCSI error : <2 0 0 0> return code = 0x40000
Jul  9 01:56:41 jtj kernel: end_request: I/O error, dev sdb, sector 169252739
Jul  9 01:56:41 jtj kernel: SCSI error : <2 0 0 0> return code = 0x40000
Jul  9 01:56:41 jtj kernel: end_request: I/O error, dev sdb, sector 198743005
Jul  9 01:56:41 jtj kernel: SCSI error : <2 0 0 0> return code = 0x40000
Jul  9 01:56:41 jtj kernel: end_request: I/O error, dev sdb, sector 165504922
Jul  9 01:56:41 jtj kernel: SCSI error : <2 0 0 0> return code = 0x40000
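
 For reading the two return codes in the log above, a minimal sketch
of how a SCSI result word splits into its four bytes.  This is not
kernel code: the struct and function names are made up for
illustration, and the status byte is taken as the raw SAM status value
rather than through the kernel's own accessor macros.  Under that
layout, 0x8000002 is driver byte 0x08 (sense data valid) with status
0x02 (CHECK CONDITION), and 0x40000 is host byte 0x04 (bad target)
with no device status at all, which fits a drive that has stopped
responding.

```c
#include <stdint.h>

/* Hypothetical container for the four bytes of a SCSI result word. */
struct sketch_scsi_result {
	uint8_t status;	/* bits  7..0:  SAM status from the device */
	uint8_t msg;	/* bits 15..8:  message byte */
	uint8_t host;	/* bits 23..16: host adapter / midlayer code */
	uint8_t driver;	/* bits 31..24: driver code */
};

/* Split a 32-bit result word into its byte fields. */
static struct sketch_scsi_result sketch_decode_result(uint32_t result)
{
	struct sketch_scsi_result r;

	r.status = result & 0xff;
	r.msg    = (result >> 8) & 0xff;
	r.host   = (result >> 16) & 0xff;
	r.driver = (result >> 24) & 0xff;
	return r;
}
```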


* Re: [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling
  2005-07-08 10:27                     ` Tejun Heo
@ 2005-07-08 13:54                       ` Jens Axboe
  0 siblings, 0 replies; 26+ messages in thread
From: Jens Axboe @ 2005-07-08 13:54 UTC (permalink / raw)
  To: Tejun Heo; +Cc: jgarzik, linux-ide

On Fri, Jul 08 2005, Tejun Heo wrote:
> On Fri, Jul 08, 2005 at 10:03:48AM +0200, Jens Axboe wrote:
> > Hi,
> > 
> > Ok, one more error, this time from irq context:
> > 
> > AHCI: ata1: error irq, status=40000001 stat=51 err=04 sstat=00000113
> > serr=00000000
> > ata1: aborting commands due to error.  active_tag -1, sactive 00000001
> > ahci: sactive 1
> > ata1: recovering from error
> > sactive=0
> > ata1: failed to read log page 10h (-110)
> > ata1: resetting...
> > ata1: started resetting...
> > ata1: end resetting, sstatus=00000113
> > ata1: status=0x01 { Error }
> > ata1: error=0x80 { Sector }
> > SCSI error : <0 0 0 0> return code = 0x8000002
> > sda: Current: sense key=0x3
> >     ASC=0x11 ASCQ=0x4
> > end_request: I/O error, dev sda, sector 2120875
> > Buffer I/O error on device sda2, logical block 2045
> > lost page write due to I/O error on sda2
> > 
> > -- 
> > Jens Axboe
> 
>  Hi, Jens.
> 
>  I also have a weird lockup log.  This log is generated with the
> second take of NCQ patchset I've posted yesterday.  It's Samsung
> HD160JJ on ICH7R AHCI.  Every command with tag#3 is scrambled, so
> recovery operations are performed constantly (log 10h, if that fails
> COMRESET).  After a few hours, the drive failed to become online after
> COMRESET.  Rebooting didn't work.  BIOS couldn't detect/recover it.  I
> had to power-cycle to make it online again.  I'm running similar test
> with much lower error-rate (5~6 errors per 3000 requests) to avoid too
> many COMRESET's and, for more than six hours, it's been running okay.
> Error log follows.

very strange, the drive must be really buggered.

>  Jens, can you run your test with the new patchset?

Sure, I will try. But I will be away from this box for the next two
weeks (vacation, then KS/OLS), so I cannot do much about it in the near
future...

-- 
Jens Axboe



* Re: [PATCH Linux 2.6.12 07/09] NCQ: stop dma before reset
  2005-06-26 15:21 ` [PATCH Linux 2.6.12 07/09] NCQ: stop dma before reset Tejun Heo
@ 2005-07-26 21:12   ` Jeff Garzik
  2005-07-27  6:25     ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff Garzik @ 2005-07-26 21:12 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-ide

Tejun Heo wrote:
> 07_NCQ_ahci-stop-dma-before-reset.patch
> 
> 	AHCI 1.1 mandates stopping dma before issuing COMRESET.  The
> 	original code didn't and it resulted in occasional lockup of
> 	the controller during EH recovery.  This patch fixes the
> 	problem.
> 
> Signed-off-by: Tejun Heo <htejun@gmail.com>
> 
>  ahci.c |    2 ++
>  1 files changed, 2 insertions(+)
> 
> Index: work/drivers/scsi/ahci.c
> ===================================================================
> --- work.orig/drivers/scsi/ahci.c	2005-06-27 00:20:31.000000000 +0900
> +++ work/drivers/scsi/ahci.c	2005-06-27 00:20:31.000000000 +0900
> @@ -474,7 +474,9 @@ static void ahci_phy_reset(struct ata_po
>  	struct ata_device *dev = &ap->device[0];
>  	u32 tmp;
>  
> +	ahci_stop_dma(ap);
>  	__sata_phy_reset(ap);
> +	ahci_start_dma(ap);

This is a bit worrisome, because we really shouldn't be calling 
ahci_phy_reset() when DMA is -not- stopped.  That's a violation of the 
state machine.

	Jeff





* Re: [PATCH Linux 2.6.12 07/09] NCQ: stop dma before reset
  2005-07-26 21:12   ` Jeff Garzik
@ 2005-07-27  6:25     ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2005-07-27  6:25 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: axboe, linux-ide

Jeff Garzik wrote:
> Tejun Heo wrote:
> 
>> 07_NCQ_ahci-stop-dma-before-reset.patch
>>
>>     AHCI 1.1 mandates stopping dma before issuing COMRESET.  The
>>     original code didn't and it resulted in occasional lockup of
>>     the controller during EH recovery.  This patch fixes the
>>     problem.
>>
>> Signed-off-by: Tejun Heo <htejun@gmail.com>
>>
>>  ahci.c |    2 ++
>>  1 files changed, 2 insertions(+)
>>
>> Index: work/drivers/scsi/ahci.c
>> ===================================================================
>> --- work.orig/drivers/scsi/ahci.c    2005-06-27 00:20:31.000000000 +0900
>> +++ work/drivers/scsi/ahci.c    2005-06-27 00:20:31.000000000 +0900
>> @@ -474,7 +474,9 @@ static void ahci_phy_reset(struct ata_po
>>      struct ata_device *dev = &ap->device[0];
>>      u32 tmp;
>>  
>> +    ahci_stop_dma(ap);
>>      __sata_phy_reset(ap);
>> +    ahci_start_dma(ap);
> 
> 
> This is a bit worrisome, because we really shouldn't be calling 
> ahci_phy_reset() when DMA is -not- stopped.  That's a violation of the 
> state machine.
> 
>     Jeff

  Hello, Jeff.

  The case occurs when qc's time out.  When qc's time out, we need to 
forcefully terminate those and reset the state machine, so the violation 
of the state machine is necessary there, I think.

  When EH kicks in, ATA_FLAG_RECOVERY gets set, and all non-preempt qc 
completions/errors are ignored until recovery completes.  Then, the 
device gets reset and recovery commands are issued.  IOW, the state 
machine violation occurs while all completion/error notifications from 
the device are being ignored, and, after reset is complete, the state 
machine is restarted from a determined state.
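
  A minimal sketch of that gating, to make the argument concrete.  It 
is a standalone model, not the patchset's code: every sketch_* name is 
hypothetical, and SKETCH_FLAG_RECOVERY merely stands in for 
ATA_FLAG_RECOVERY.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for ATA_FLAG_RECOVERY; not the libata value. */
#define SKETCH_FLAG_RECOVERY	(1u << 0)

struct sketch_port {
	unsigned int flags;
	unsigned int handled;	/* notifications acted upon */
	unsigned int ignored;	/* notifications dropped during recovery */
};

struct sketch_qc {
	bool preempt;		/* true for commands issued by EH itself */
};

/* EH entry: from here on, non-preempt notifications are ignored. */
static void sketch_eh_begin(struct sketch_port *ap)
{
	ap->flags |= SKETCH_FLAG_RECOVERY;
}

/* EH exit: the port is in a known state again, resume normal handling. */
static void sketch_eh_end(struct sketch_port *ap)
{
	assert(ap->flags & SKETCH_FLAG_RECOVERY);
	ap->flags &= ~SKETCH_FLAG_RECOVERY;
}

/* Returns true if the completion/error was handled, false if ignored. */
static bool sketch_qc_notify(struct sketch_port *ap,
			     const struct sketch_qc *qc)
{
	if ((ap->flags & SKETCH_FLAG_RECOVERY) && !qc->preempt) {
		ap->ignored++;
		return false;
	}
	ap->handled++;
	return true;
}
```

  In this model, the stop-dma/reset/start-dma sequence would run 
between sketch_eh_begin() and sketch_eh_end(), so nothing the device 
reports for ordinary qc's in that window reaches the normal path.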

  If I missed something, please point it out.

  Thanks.

-- 
tejun


Thread overview: 26+ messages
2005-06-26 15:21 [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling Tejun Heo
2005-06-26 15:21 ` [PATCH Linux 2.6.12 01/09] NCQ: add ata_qc_complete_err() and @drv_err to functions Tejun Heo
2005-06-26 15:21 ` [PATCH Linux 2.6.12 02/09] NCQ: add timeout to ata_read_log_page() Tejun Heo
2005-06-26 15:21 ` [PATCH Linux 2.6.12 03/09] NCQ: add ap->sactive Tejun Heo
2005-06-26 15:21 ` [PATCH Linux 2.6.12 04/09] NCQ: export scsi_retry_command Tejun Heo
2005-06-26 15:21 ` [PATCH Linux 2.6.12 05/09] NCQ: implement NCQ helpers Tejun Heo
2005-06-26 15:21 ` [PATCH Linux 2.6.12 06/09] NCQ: convert ahci to use new " Tejun Heo
2005-06-26 15:21 ` [PATCH Linux 2.6.12 07/09] NCQ: stop dma before reset Tejun Heo
2005-07-26 21:12   ` Jeff Garzik
2005-07-27  6:25     ` Tejun Heo
2005-06-26 15:21 ` [PATCH Linux 2.6.12 08/09] NCQ: remove/unexport unused/unnecessary functions Tejun Heo
2005-06-26 15:21 ` [PATCH Linux 2.6.12 09/09] NCQ: causes error or timeout Tejun Heo
2005-06-26 15:34 ` test logs Tejun Heo
2005-06-27 14:33 ` [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling Jens Axboe
2005-06-30  7:36   ` Jens Axboe
2005-06-30 10:51     ` Tejun Heo
2005-06-30 15:26       ` Jens Axboe
2005-07-01  0:20         ` Tejun
2005-07-01  8:59           ` Jens Axboe
2005-07-04  5:53             ` Jens Axboe
2005-07-06 12:55               ` Jens Axboe
2005-07-06 13:00                 ` Jens Axboe
2005-07-06 15:11                   ` Tejun Heo
2005-07-08  8:03                   ` Jens Axboe
2005-07-08 10:27                     ` Tejun Heo
2005-07-08 13:54                       ` Jens Axboe
