libata slab corruption saga

* libata slab corruption saga
@ 2005-05-26  7:32 Denis Vlasenko
  2005-05-26  7:47 ` Jeff Garzik
  0 siblings, 1 reply; 3+ messages in thread
From: Denis Vlasenko @ 2005-05-26  7:32 UTC
  To: Jeff Garzik; +Cc: linux-kernel

Hi Jeff,

Unfortunately it still happens even without IRQ sharing.

2005-05-25_14:54:01.79454 kern.err: ata1: command 0x35 timeout, stat 0x50 host_stat 0x1
2005-05-25_14:54:04.10684 kern.err: Slab corruption: start=c19d02fc, len=344
2005-05-25_14:54:04.10985 kern.err: Redzone: 0x5a2cf071/0x5a2cf071.
2005-05-25_14:54:04.10987 kern.err: Last user: [<c03b29f9>](scsi_put_command+0x49/0x80)
2005-05-25_14:54:04.10989 kern.err: 010: 6b 6b 6b 6b 6b 6b 6b 6b 08 0a 9d c1 6b 6b 6b 6b

This is a use-after-free: someone seems to have stored a 4-byte word at offset 0x18.
This word seems to be a kernel pointer (0xc19d0a08).
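
The offset and the pointer value can be re-derived from the dump row itself.
What follows is a minimal userspace sketch, not kernel code. The names
first_clobbered, le32, row_010 and check_dump_row are made up for
illustration, and it assumes the little-endian i386 byte order of the
machine that produced the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POISON_FREE 0x6b  /* fill byte seen throughout the freed objects */

/* Index of the first byte in a dump row that is not POISON_FREE,
 * or -1 if the whole row is still poisoned. */
static int first_clobbered(const uint8_t *row, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (row[i] != POISON_FREE)
            return (int)i;
    return -1;
}

/* Assemble a 32-bit word from 4 bytes stored in little-endian order. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* The "010:" row of the corrupted object, copied byte for byte. */
static const uint8_t row_010[16] = {
    0x6b, 0x6b, 0x6b, 0x6b, 0x6b, 0x6b, 0x6b, 0x6b,
    0x08, 0x0a, 0x9d, 0xc1, 0x6b, 0x6b, 0x6b, 0x6b,
};

/* Hypothetical self-check: call it from main() to verify the arithmetic. */
int check_dump_row(void)
{
    int idx = first_clobbered(row_010, sizeof(row_010));

    assert(idx == 8);                         /* 8 bytes into the row     */
    assert(0x10 + idx == 0x18);               /* row base 0x10 + 8 = 0x18 */
    assert(le32(row_010 + idx) == 0xc19d0a08u);
    return 0;
}
```

The row starts at object offset 0x10 and the first non-poison byte is 8 bytes
in, which gives 0x18. Reading the four bytes 08 0a 9d c1 as a little-endian
word gives 0xc19d0a08.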

I may be mistaken, but I think it is scsi_cmnd.eh_entry.next.
It seems that the scsi_cmnd was freed (see below), and offset 0x18
in scsi_cmnd is eh_entry:

struct list_head {
         struct list_head *next, *prev;
};

struct scsi_cmnd {
        int     sc_magic;
        struct scsi_device *device;
        unsigned short state;
        unsigned short owner;
        struct scsi_request *sc_request;
        struct list_head list;  /* scsi_cmnd participates in queue lists */
        struct list_head eh_entry; /* entry for the host eh_cmd_q */
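
The 0x18 figure can be checked mechanically. This is only a sketch under
stated assumptions: it models the i386 layout by replacing every pointer
with a 4-byte integer, so it gives the same answer on a 64-bit build host.
The *_32 type names and the ptr32 typedef are hypothetical stand-ins, not
kernel types, and only the leading members quoted above are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-bit pointers, as on the i386 box that logged the oops. */
typedef uint32_t ptr32;

struct list_head_32 {
    ptr32 next, prev;
};

/* Leading members of struct scsi_cmnd as quoted in the message. */
struct scsi_cmnd_head_32 {
    int32_t  sc_magic;              /* 0x00 */
    ptr32    device;                /* 0x04 */
    uint16_t state;                 /* 0x08 */
    uint16_t owner;                 /* 0x0a */
    ptr32    sc_request;            /* 0x0c */
    struct list_head_32 list;       /* 0x10 */
    struct list_head_32 eh_entry;   /* 0x18 */
};

/* Compile-time check: the build fails if the model disagrees. */
static_assert(offsetof(struct scsi_cmnd_head_32, eh_entry) == 0x18,
              "eh_entry expected at offset 0x18 in the 32-bit model");
static_assert(offsetof(struct list_head_32, next) == 0,
              "next is the first member of list_head");
```

Because next is the first member of list_head, eh_entry.next sits at the same
offset as eh_entry itself, 0x18 in this model.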

2005-05-25_14:54:04.10991 kern.err: Prev obj: start=c19d0198, len=344
2005-05-25_14:54:04.10993 kern.err: Redzone: 0x5a2cf071/0x5a2cf071.
2005-05-25_14:54:04.10995 kern.err: Last user: [<c03b29f9>](scsi_put_command+0x49/0x80)
2005-05-25_14:54:04.10996 kern.err: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
2005-05-25_14:54:04.10998 kern.err: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
2005-05-25_14:54:04.11002 kern.err: Next obj: start=c19d0460, len=344
2005-05-25_14:54:04.11004 kern.err: Redzone: 0x5a2cf071/0x5a2cf071.
2005-05-25_14:54:04.11006 kern.err: Last user: [<c03b29f9>](scsi_put_command+0x49/0x80)
2005-05-25_14:54:04.11007 kern.err: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
2005-05-25_14:54:04.11009 kern.err: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

It looks like "Last user: scsi_put_command+0x49" corresponds to list_empty(),
although the asm output is a bit strange. Judge for yourself:

void scsi_put_command(struct scsi_cmnd *cmd)
{
        struct scsi_device *sdev = cmd->device;
        struct Scsi_Host *shost = sdev->host;
        unsigned long flags;

        /* serious error if the command hasn't come from a device list */
        spin_lock_irqsave(&cmd->device->list_lock, flags);
        BUG_ON(list_empty(&cmd->list));
        list_del_init(&cmd->list);
        spin_unlock(&cmd->device->list_lock);
        /* changing locks here, don't need to restore the irq state */
        spin_lock(&shost->free_list_lock);
asm("#0");
        if (unlikely(list_empty(&shost->free_list))) {   <==============
asm("#1");
                list_add(&cmd->list, &shost->free_list);
                cmd = NULL;
        }
        spin_unlock_irqrestore(&shost->free_list_lock, flags);

        if (likely(cmd != NULL))
                kmem_cache_free(shost->cmd_pool->slab, cmd);

        put_device(&sdev->sdev_gendev);
}

Corresponding asm:

#APP
        #0
#NO_APP
        leal    20(%esi), %edx
        movl    20(%esi), %eax
        cmpl    %edx, %eax
        je      .L132
.L127:
#APP
        pushl %edi ; popfl
#NO_APP
        testl   %ebx, %ebx
        je      .L130
        pushl   %ebx
        movl    16(%esi), %eax
        movl    (%eax), %ecx
        pushl   %ecx
        call    kmem_cache_free
        popl    %eax    <========================== scsi_put_command+0x49
        popl    %edx
.L130:
        movl    -16(%ebp), %eax
        addl    $400, %eax
        movl    %eax, 8(%ebp)
        leal    -12(%ebp), %esp
        popl    %ebx
        popl    %esi
        popl    %edi
        popl    %ebp
        jmp     put_device
.L132:
#APP
        #1
#NO_APP

Hope this helps.
--
vda



* Re: libata slab corruption saga
  2005-05-26  7:32 libata slab corruption saga Denis Vlasenko
@ 2005-05-26  7:47 ` Jeff Garzik
  2005-05-27  5:37   ` Denis Vlasenko
  0 siblings, 1 reply; 3+ messages in thread
From: Jeff Garzik @ 2005-05-26  7:47 UTC
  To: Denis Vlasenko; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 49 bytes --]

Does the attached patch change things?

	Jeff




[-- Attachment #2: atapi-fix-error-handling.patch --]
[-- Type: text/x-patch, Size: 4387 bytes --]

From: Hannes Reinecke <hare@suse.de>
Subject: Fix sata atapi error handling
References: 70918

SCSI commands which end up on the error handler need special attention;
we have to make sure that eh_cmd_q is properly emptied or scsi_eh will
try to forever finalize the command.

With this patch eh_cmd_q is explicitly emptied if not done so in the
strategy handler and a proper abort sequence is executed for each
command if required.
We rely on the strategy handler to fill out proper sense information for
us as SATA is 'special' when it comes to command sense gathering.

Signed-off-by: Kurt Garloff <garloff@suse.de>
Signed-off-by: Jens Axboe <axboe@suse.de>
Acked-by: Andreas Gruenbacher <agruen@suse.de>

Index: linux-2.6.11/drivers/scsi/libata-core.c
===================================================================
--- linux-2.6.11.orig/drivers/scsi/libata-core.c
+++ linux-2.6.11/drivers/scsi/libata-core.c
@@ -41,6 +41,7 @@
 #include <scsi/scsi.h>
 #include "scsi.h"
 #include "scsi_priv.h"
+#include "scsi_logging.h"
 #include <scsi/scsi_host.h>
 #include <linux/libata.h>
 #include <asm/io.h>
@@ -2587,6 +2588,11 @@ static void atapi_request_sense(struct a
 	DPRINTK("EXIT\n");
 }
 
+void ata_qc_timeout_done(struct scsi_cmnd *scmd)
+{
+	return;
+}
+
 /**
  *	ata_qc_timeout - Handle timeout of queued command
  *	@qc: Command that timed out
@@ -2618,17 +2624,16 @@ static void ata_qc_timeout(struct ata_qu
 		struct scsi_cmnd *cmd = qc->scsicmd;
 
 		if (!scsi_eh_eflags_chk(cmd, SCSI_EH_CANCEL_CMD)) {
-
 			/* finish completing original command */
+			qc->scsidone = ata_qc_timeout_done;
+
 			__ata_qc_complete(qc);
 
 			atapi_request_sense(ap, dev, cmd);
 
 			cmd->result = (CHECK_CONDITION << 1) | (DID_OK << 16);
-			scsi_finish_command(cmd);
-
-			goto out;
 		}
+		goto out;
 	}
 
 	/* hack alert!  We cannot use the supplied completion
Index: linux-2.6.11/drivers/scsi/libata-scsi.c
===================================================================
--- linux-2.6.11.orig/drivers/scsi/libata-scsi.c
+++ linux-2.6.11/drivers/scsi/libata-scsi.c
@@ -633,12 +633,6 @@ int ata_scsi_error(struct Scsi_Host *hos
 	ap = (struct ata_port *) &host->hostdata[0];
 	ap->ops->eng_timeout(ap);
 
-	/* TODO: this is per-command; when queueing is supported
-	 * this code will either change or move to a more
-	 * appropriate place
-	 */
-	host->host_failed--;
-
 	DPRINTK("EXIT\n");
 	return 0;
 }
Index: linux-2.6.11/drivers/scsi/scsi_error.c
===================================================================
--- linux-2.6.11.orig/drivers/scsi/scsi_error.c
+++ linux-2.6.11/drivers/scsi/scsi_error.c
@@ -1610,6 +1610,40 @@ static void scsi_unjam_host(struct Scsi_
 	scsi_eh_flush_done_q(&eh_done_q);
 }
 
+static void scsi_invoke_strategy_handler(struct Scsi_Host *shost)
+{
+	int rtn;
+	struct list_head *lh, *lh_sf;
+	struct scsi_cmnd *scmd;
+	unsigned long flags;
+	LIST_HEAD(eh_work_q);
+	LIST_HEAD(eh_done_q);
+
+	rtn = shost->hostt->eh_strategy_handler(shost);
+
+	spin_lock_irqsave(shost->host_lock, flags);
+	list_splice_init(&shost->eh_cmd_q, &eh_work_q);
+	spin_unlock_irqrestore(shost->host_lock, flags);
+
+	SCSI_LOG_ERROR_RECOVERY(1, scsi_eh_prt_fail_stats(shost, &eh_work_q));
+
+	list_for_each_safe(lh, lh_sf, &eh_work_q) {
+		scmd = list_entry(lh, struct scsi_cmnd, eh_entry);
+
+		if (scsi_eh_eflags_chk(scmd, SCSI_EH_CANCEL_CMD) ||
+		    !SCSI_SENSE_VALID(scmd))
+			continue;
+		scmd->retries = scmd->allowed;
+		scsi_eh_finish_cmd(scmd, &eh_done_q);
+	}
+
+	if (!list_empty(&eh_work_q))
+		if (!scsi_eh_abort_cmds(&eh_work_q, &eh_done_q))
+			scsi_eh_ready_devs(shost, &eh_work_q, &eh_done_q);
+
+	scsi_eh_flush_done_q(&eh_done_q);
+}
+
 /**
  * scsi_error_handler - Handle errors/timeouts of SCSI cmds.
  * @data:	Host for which we are running.
@@ -1624,7 +1658,6 @@ static void scsi_unjam_host(struct Scsi_
 int scsi_error_handler(void *data)
 {
 	struct Scsi_Host *shost = (struct Scsi_Host *) data;
-	int rtn;
 	DECLARE_MUTEX_LOCKED(sem);
 
 	/*
@@ -1680,8 +1713,8 @@ int scsi_error_handler(void *data)
 		 * what we need to do to get it up and online again (if we can).
 		 * If we fail, we end up taking the thing offline.
 		 */
-		if (shost->hostt->eh_strategy_handler) 
-			rtn = shost->hostt->eh_strategy_handler(shost);
+		if (shost->hostt->eh_strategy_handler)
+			scsi_invoke_strategy_handler(shost);
 		else
 			scsi_unjam_host(shost);
 


* Re: libata slab corruption saga
  2005-05-26  7:47 ` Jeff Garzik
@ 2005-05-27  5:37   ` Denis Vlasenko
  0 siblings, 0 replies; 3+ messages in thread
From: Denis Vlasenko @ 2005-05-27  5:37 UTC
  To: Jeff Garzik; +Cc: linux-kernel

On Thursday 26 May 2005 10:47, Jeff Garzik wrote:
> Does the attached patch change things?

Yes. As soon as the first ATA error occurs:

22:01:59.006541500 kern.err: ata1: command 0x25 timeout, stat 0x50 host_stat 0x1
22:01:59.007252500 kern.alert: Unable to handle kernel paging request at virtual address 6b6b6b6b
22:01:59.008197500 kern.alert:  printing eip:
22:01:59.009304500 kern.info: c03b5d7a
22:01:59.010231500 kern.alert: *pde = 00000000
22:01:59.010919500 kern.alert: Oops: 0000 [#1]
22:01:59.011948500 kern.info: Modules linked in: snd_pcm_oss snd_mixer_oss snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd soundcore snd_page_alloc ipt_REDIRECT ipt_MASQUERADE ipt_multiport ipt_state iptable_nat ip_conntrack iptable_filter cls_u32 sch_htb iptable_mangle autofs ip_tables
22:01:59.012521500 kern.info: CPU:    0
22:01:59.013618500 kern.info: EIP:    0060:[<c03b5d7a>]    Not tainted VLI
22:01:59.014569500 kern.info: EFLAGS: 00010202   (2.6.12-rc2-cl)
22:01:59.015709500 kern.info: EIP is at scsi_try_to_abort_cmd+0xa/0x50
22:01:59.016749500 kern.info: eax: 6b6b6b6b   ebx: c19d0460   ecx: c19d0460   edx: c19d0478
22:01:59.017802500 kern.info: esi: c19cffb4   edi: c19cffb4   ebp: c19cff84   esp: c19cff80
22:01:59.018791500 kern.info: ds: 007b   es: 007b   ss: 0068
22:01:59.019865500 kern.info: Process scsi_eh_0 (pid: 478, threadinfo=c19cf000 task=c1968530)
22:01:59.020918500 kern.info: Stack: c19d0460 c19cff9c c03b5eb7 c19d0460 c19d0478 c19cffb4 00000246 c19cffc8
22:01:59.021945500 kern.info:        c03b6a92 c19cffb4 c19cffac c19cffac c19cffac c19d0478 c19d0478 c19cffd4
22:01:59.023051500 kern.info:        dfc1507c 00000000 c19cffec c03b6b16 dfc1507c 00000000 00000000 c19cffdc
22:01:59.023907500 kern.info: Call Trace:
22:01:59.025073500 kern.info:  [<c0103cc5>] show_stack+0x75/0x90
22:01:59.027613500 kern.info:  [<c0103e19>] show_registers+0x119/0x190
22:01:59.028311500 kern.info:  [<c0103ff5>] die+0xb5/0x130
22:01:59.029231500 kern.info:  [<c0110728>] do_page_fault+0x458/0x6d6
22:01:59.030303500 kern.info:  [<c01038ef>] error_code+0x4f/0x60
22:01:59.031213500 kern.info:  [<c03b5eb7>] scsi_eh_abort_cmds+0x37/0x90
22:01:59.032241500 kern.info:  [<c03b6a92>] scsi_invoke_strategy_handler+0xd2/0xf0
22:01:59.033137500 kern.info:  [<c03b6b16>] scsi_error_handler+0x66/0xd0
22:01:59.033840500 kern.info:  [<c0100cc5>] kernel_thread_helper+0x5/0x10
22:01:59.034854500 kern.info: Code: 00 00 00 5d c3 8b 43 38 89 43 34 8b 4d 0c 51 53 e8 1c ff ff ff 58 5a eb 88 90 8d b4 26 00 00 00 00 55 89 e5 8b 4d 08 53 8b 41 04 <8b> 00 8b 40 5c 8b 40 20 85 c0 ba 03 20 00 00 74 23 8b 59 2c 85

We hit a use-after-free here:

static int scsi_try_to_abort_cmd(struct scsi_cmnd *scmd)
{
        unsigned long flags;
        int rtn = FAILED;

        if (!scmd->device->host->hostt->eh_abort_handler)
                return rtn;

It seems the struct scsi_cmnd pointed to by scmd is filled with the slab poisoning pattern:

000007d0 <scsi_try_to_abort_cmd>:
     7d0:       55                      push   %ebp
     7d1:       89 e5                   mov    %esp,%ebp
     7d3:       8b 4d 08                mov    0x8(%ebp),%ecx         ecx = scmd
     7d6:       53                      push   %ebx
     7d7:       8b 41 04                mov    0x4(%ecx),%eax         eax = scmd->device (0x6b6b6b6b)
     7da:       8b 00                   mov    (%eax),%eax            trying to get scmd->device->host, BOOM
     7dc:       8b 40 5c                mov    0x5c(%eax),%eax
     7df:       8b 40 20                mov    0x20(%eax),%eax        
     7e2:       85 c0                   test   %eax,%eax
     7e4:       ba 03 20 00 00          mov    $0x2003,%edx
     7e9:       74 23                   je     80e <scsi_try_to_abort_cmd+0x3e>
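
A quick way to tell poison from a real pointer when reading register dumps
like the one above. This is a userspace sketch only: is_free_poison32 and
check_oops_regs are hypothetical helper names, and 0x6b is taken as the
freed-object fill byte because that is what the slab dumps in this thread
show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define POISON_FREE 0x6b  /* fill byte for freed slab objects */

/* True if a 32-bit value consists of four POISON_FREE bytes, i.e. it was
 * most likely loaded from a freed, poisoned object rather than being a
 * valid pointer. */
static bool is_free_poison32(uint32_t v)
{
    return v == (uint32_t)POISON_FREE * 0x01010101u;
}

/* Hypothetical self-check using register values from the oops. */
int check_oops_regs(void)
{
    assert(is_free_poison32(0x6b6b6b6bu));   /* eax: scmd->device        */
    assert(!is_free_poison32(0xc19d0460u));  /* ebx: the scmd pointer    */
    return 0;
}
```

Here eax holds the poison pattern, while ebx (c19d0460) is an ordinary kernel
address and equals the start of the "Next obj" in the earlier slab report.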
--
vda


