public inbox for linux-scsi@vger.kernel.org
* AACRAID fails to initialize after an kexec operation
@ 2007-04-23  7:49 Vivek Goyal
  2007-04-23 13:01 ` Salyzyn, Mark
  0 siblings, 1 reply; 11+ messages in thread
From: Vivek Goyal @ 2007-04-23  7:49 UTC (permalink / raw)
  To: linux-scsi, aacraid; +Cc: Kexec Mailing List, Salyzyn, Mark

Hi,

I am trying to kexec into a 2.6.21-rc6-mm1 kernel on an x86_64 machine and
aacraid panics in the second kernel. The panic message follows.

Any idea what's going on? Please let me know if more details are required.

Adaptec aacraid driver (1.1-5[2437]-mh4)
ACPI: PCI Interrupt 0000:01:02.0[A] -> GSI 25 (level, low) -> IRQ 25
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
 [<0000000000000000>]
PGD 0
Oops: 0000 [1] SMP
last sysfs file:
CPU 4
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.21-rc6-mm1 #2
RIP: 0010:[<0000000000000000>]  [<0000000000000000>]
RSP: 0018:ffff810100c9fc78  EFLAGS: 00010246
RAX: ffff810100c9fcc4 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000001001 RDI: ffff810100e02d30
RBP: ffff810100e02d30 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000002 R12: ffff810100e02800
R13: 0000000000000000 R14: ffffffff80690ee1 R15: 000000000000002d
FS:  0000000000000000(0000) GS:ffff810100cf6440(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo ffff810100c9e000, task ffff810100c9d450)
Stack:  ffffffff8045952f 0000000000000000 ffffffff00000000 ffff810100c9fcc4
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
 ffffffff80690ee1 000000000000002d ffff810100e02d30 0000000000000001
Call Trace:
Inexact backtrace:
 [<ffffffff8045952f>] aac_rx_restart_adapter+0x7e/0x169
 [<ffffffff80459a0a>] _aac_rx_init+0x70/0x2f6
 [<ffffffff80280ee4>] cache_alloc_refill+0xdb/0x1db
 [<ffffffff8045317f>] aac_probe_one+0x1a9/0x462
 [<ffffffff8035f3a0>] pci_device_probe+0xd1/0x138
 [<ffffffff803b3251>] driver_probe_device+0xf7/0x174
 [<ffffffff803b33e4>] __driver_attach+0x6f/0xae
 [<ffffffff803b3375>] __driver_attach+0x0/0xae
 [<ffffffff803b3375>] __driver_attach+0x0/0xae
 [<ffffffff803b261e>] bus_for_each_dev+0x43/0x6e
 [<ffffffff803b2993>] bus_add_driver+0x78/0x19a
 [<ffffffff8035f578>] __pci_register_driver+0x58/0x8d
 [<ffffffff8084a909>] aac_init+0x35/0x70
 [<ffffffff8082d8ad>] kernel_init+0x167/0x2d1
 [<ffffffff8020a998>] child_rip+0xa/0x12
 [<ffffffff8036bd60>] acpi_ds_init_one_object+0x0/0x7c
 [<ffffffff8082d746>] kernel_init+0x0/0x2d1
 [<ffffffff8020a98e>] child_rip+0x0/0x12


Code:  Bad RIP value.
RIP  [<0000000000000000>]
 RSP <ffff810100c9fc78>
CR2: 0000000000000000
Kernel panic - not syncing: Attempted to kill init!

Thanks
Vivek


* RE: AACRAID fails to initialize after an kexec operation
  2007-04-23  7:49 AACRAID fails to initialize after an kexec operation Vivek Goyal
@ 2007-04-23 13:01 ` Salyzyn, Mark
  2007-04-23 13:38   ` [PATCH] aacraid: fails to initialize after a " Salyzyn, Mark
  0 siblings, 1 reply; 11+ messages in thread
From: Salyzyn, Mark @ 2007-04-23 13:01 UTC (permalink / raw)
  To: vgoyal, linux-scsi; +Cc: Kexec Mailing List

[-- Attachment #1: Type: text/plain, Size: 3702 bytes --]

2.6.21-rc6-mm1 contains an earlier kexec patch; that one needs to be
removed and this one put in its place.

Basically the following fragment represents the update in the later
patch that deals with this specific issue.

diff -ru a/drivers/scsi/aacraid/rx.c b/drivers/scsi/aacraid/rx.c
@@ -535,6 +539,8 @@
        }

        /* Failure to reset here is an option ... */
+       dev->a_ops.adapter_sync_cmd = rx_sync_cmd;
+       dev->a_ops.adapter_enable_int = aac_rx_disable_interrupt;
        dev->OIMR = status = rx_readb (dev, MUnit.OIMR);
        if ((((status & 0xff) != 0xff) || reset_devices) &&
          !aac_rx_restart_adapter(dev, 0))

Sincerely -- Mark Salyzyn

[-- Attachment #2: aacraid_kexec_5.patch --]
[-- Type: application/octet-stream, Size: 2670 bytes --]

diff -ru a/drivers/scsi/aacraid/rx.c b/drivers/scsi/aacraid/rx.c
--- a/drivers/scsi/aacraid/rx.c	2007-04-03 11:31:40.288114365 -0400
+++ b/drivers/scsi/aacraid/rx.c	2007-04-03 11:34:12.560873530 -0400
@@ -467,16 +467,19 @@
 	if (bled)
 		printk(KERN_ERR "%s%d: adapter kernel panic'd %x.\n",
 			dev->name, dev->id, bled);
-	else
+	else {
 		bled = aac_adapter_sync_cmd(dev, IOP_RESET_ALWAYS,
 		  0, 0, 0, 0, 0, 0, &var, NULL, NULL, NULL, NULL);
-	if (bled)
+		if (!bled && (var != 0x00000001))
+			bled = -EINVAL;
+	}
+	if (bled && (bled != -ETIMEDOUT))
 		bled = aac_adapter_sync_cmd(dev, IOP_RESET,
 		  0, 0, 0, 0, 0, 0, &var, NULL, NULL, NULL, NULL);
 
-	if (bled)
+	if (bled && (bled != -ETIMEDOUT))
 		return -EINVAL;
-	if (var == 0x3803000F) { /* USE_OTHER_METHOD */
+	if (bled || (var == 0x3803000F)) { /* USE_OTHER_METHOD */
 		rx_writel(dev, MUnit.reserved2, 3);
 		msleep(5000); /* Delay 5 seconds */
 		var = 0x00000001;
@@ -526,6 +529,7 @@
 {
 	unsigned long start;
 	unsigned long status;
+	int restart = 0;
 	int instance = dev->id;
 	const char * name = dev->name;
 
@@ -534,15 +538,21 @@
 		goto error_iounmap;
 	}
 
+	/* Failure to reset here is an option ... */
+	dev->a_ops.adapter_sync_cmd = rx_sync_cmd;
+	dev->a_ops.adapter_enable_int = aac_rx_disable_interrupt;
+	dev->OIMR = status = rx_readb (dev, MUnit.OIMR);
+	if ((((status & 0xff) != 0xff) || reset_devices) &&
+	  !aac_rx_restart_adapter(dev, 0))
+		++restart;
 	/*
 	 *	Check to see if the board panic'd while booting.
 	 */
 	status = rx_readl(dev, MUnit.OMRx[0]);
 	if (status & KERNEL_PANIC) {
-		if ((status = aac_rx_check_health(dev)) <= 0)
-			goto error_iounmap;
-		if (aac_rx_restart_adapter(dev, status))
+		if (aac_rx_restart_adapter(dev, aac_rx_check_health(dev)))
 			goto error_iounmap;
+		++restart;
 	}
 	/*
 	 *	Check to see if the board failed any self tests.
@@ -565,11 +575,23 @@
 	 */
 	while (!((status = rx_readl(dev, MUnit.OMRx[0])) & KERNEL_UP_AND_RUNNING))
 	{
-		if(time_after(jiffies, start+startup_timeout*HZ)) {
+		if ((restart &&
+		  (status & (KERNEL_PANIC|SELF_TEST_FAILED|MONITOR_PANIC))) ||
+		  time_after(jiffies, start+HZ*startup_timeout)) {
 			printk(KERN_ERR "%s%d: adapter kernel failed to start, init status = %lx.\n", 
 					dev->name, instance, status);
 			goto error_iounmap;
 		}
+		if (!restart &&
+		  ((status & (KERNEL_PANIC|SELF_TEST_FAILED|MONITOR_PANIC)) ||
+		  time_after(jiffies, start + HZ *
+		  ((startup_timeout > 60)
+		    ? (startup_timeout - 60)
+		    : (startup_timeout / 2))))) {
+			if (likely(!aac_rx_restart_adapter(dev, aac_rx_check_health(dev))))
+				start = jiffies;
+			++restart;
+		}
 		msleep(1);
 	}
 	/*
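
The last hunk above waits a shortened period before attempting one early restart of the adapter. The deadline arithmetic can be sketched in isolation as below. This is only an illustration: the function name is hypothetical, and the input values mentioned afterwards are examples, not the driver's defaults.

```c
/*
 * Hypothetical helper mirroring the ternary in the hunk above:
 * seconds to wait before an early restart is attempted.
 */
static unsigned long demo_early_restart_after(unsigned long startup_timeout)
{
    return (startup_timeout > 60)
        ? (startup_timeout - 60)   /* long timeouts: leave 60 s for the retry */
        : (startup_timeout / 2);   /* short timeouts: split the time in half  */
}
```

With an example timeout of 180 seconds this gives 120 seconds before the early restart, and with 30 seconds it gives 15. The boundary is abrupt: 60 yields 30, while 61 yields 1.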


* [PATCH] aacraid: fails to initialize after a  kexec operation
  2007-04-23 13:01 ` Salyzyn, Mark
@ 2007-04-23 13:38   ` Salyzyn, Mark
  2007-04-23 16:12     ` Vivek Goyal
  0 siblings, 1 reply; 11+ messages in thread
From: Salyzyn, Mark @ 2007-04-23 13:38 UTC (permalink / raw)
  To: linux-scsi; +Cc: Kexec Mailing List, James Bottomley, vgoyal, Judith Lebzelter

[-- Attachment #1: Type: text/plain, Size: 5440 bytes --]

Missing portion of the kexec changes to the aacraid driver. The platform
functions were not initialized when the restart function is activated
resulting in a panic when these platform functions are called. Please
note that it is NOT a mistake that the disable interrupt handler is used
as the initial value of the enable interrupt platform function, it will
be set up correctly once the adapter is discovered and initialized.
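
As a rough illustration of the failure described above, here is a minimal sketch of a callback table whose entries must be filled in before the restart path uses them. All type and function names are hypothetical stand-ins, not the driver's own definitions. In the kernel there is no guard, so the call through the empty pointer faults at address 0, which matches the RIP of 0000000000000000 in the reported oops.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the driver's types; names are illustrative only. */
struct adapter;

struct adapter_ops {
    int (*sync_cmd)(struct adapter *dev, unsigned int command);
    void (*enable_int)(struct adapter *dev);
};

struct adapter {
    struct adapter_ops ops;
    int interrupts_enabled;
};

static int demo_sync_cmd(struct adapter *dev, unsigned int command)
{
    (void)dev;
    (void)command;
    return 0; /* pretend the adapter accepted the command */
}

static void demo_disable_interrupt(struct adapter *dev)
{
    dev->interrupts_enabled = 0;
}

/* Fill in the callbacks that the restart path relies on. */
static void demo_install_early_ops(struct adapter *dev)
{
    dev->ops.sync_cmd = demo_sync_cmd;
    /* Deliberately the "disable" routine, as the patch description says. */
    dev->ops.enable_int = demo_disable_interrupt;
}

/* Restart path: report an error instead of jumping through a NULL pointer. */
static int demo_restart(struct adapter *dev)
{
    if (dev->ops.sync_cmd == NULL || dev->ops.enable_int == NULL)
        return -1; /* an unguarded call here would fault at address 0 */
    dev->ops.enable_int(dev);
    return dev->ops.sync_cmd(dev, 1u);
}
```

Calling demo_restart() on a zeroed structure returns -1. After demo_install_early_ops() it returns 0 and leaves interrupts disabled, which is the safe state while the adapter is still being brought up.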

Please note that aacraid_kexec_5.patch contains this fix, since an
earlier aacraid_kexec patch was applied to the tree and appears to have
propagated, this is meant to supersede the earlier patch and bring it up
to date with aacraid_kexec_5.patch, but expected to break when the
aacraid_kexec_5.patch propagates... James, can you sort out this mess
(either by stopping aacraid_kexec_5.patch, or by letting this enclosed
patch move with some level of priority)?

ObligatoryDisclaimer: Please accept my condolences regarding Outlook's
handling of patches.

This attached patch is against current scsi-misc-2.6. Also expect this
patch can be applied to 2.6.21-rc6-mm1

Signed-off-by: Mark Salyzyn <aacraid@adaptec.com>

---

Sincerely -- Mark Salyzyn

[-- Attachment #2: aacraid_kexec_fix.patch --]
[-- Type: application/octet-stream, Size: 521 bytes --]

diff -ru a/drivers/scsi/aacraid/rx.c b/drivers/scsi/aacraid/rx.c
--- a/drivers/scsi/aacraid/rx.c	2007-04-23 09:08:34.753022711 -0400
+++ b/drivers/scsi/aacraid/rx.c	2007-04-23 09:25:59.506381713 -0400
@@ -539,6 +539,8 @@
 	}
 
 	/* Failure to reset here is an option ... */
+	dev->a_ops.adapter_sync_cmd = rx_sync_cmd;
+	dev->a_ops.adapter_enable_int = aac_rx_disable_interrupt;
 	dev->OIMR = status = rx_readb (dev, MUnit.OIMR);
 	if ((((status & 0xff) != 0xff) || reset_devices) &&
 	  !aac_rx_restart_adapter(dev, 0))


* Re: [PATCH] aacraid: fails to initialize after a  kexec operation
  2007-04-23 13:38   ` [PATCH] aacraid: fails to initialize after a " Salyzyn, Mark
@ 2007-04-23 16:12     ` Vivek Goyal
  2007-04-23 17:20       ` Salyzyn, Mark
  0 siblings, 1 reply; 11+ messages in thread
From: Vivek Goyal @ 2007-04-23 16:12 UTC (permalink / raw)
  To: Salyzyn, Mark
  Cc: linux-scsi, James Bottomley, Kexec Mailing List, Judith Lebzelter

On Mon, Apr 23, 2007 at 09:38:43AM -0400, Salyzyn, Mark wrote:
> Missing portion of the kexec changes to the aacraid driver. The platform
> functions were not initialized when the restart function is activated
> resulting in a panic when these platform functions are called. Please
> note that it is NOT a mistake that the disable interrupt handler is used
> as the initial value of the enable interrupt platform function, it will
> be set up correctly once the adapter is discovered and initialized.
> 
> Please note that aacraid_kexec_5.patch contains this fix, since an
> earlier aacraid_kexec patch was applied to the tree and appears to have
> propagated, this is meant to supersede the earlier patch and bring it up
> to date with aacraid_kexec_5.patch, but expected to break when the
> aacraid_kexec_5.patch propagates... James, can you sort out this mess
> (either by stopping aacraid_kexec_5.patch, or by letting this enclosed
> patch move with some level of priority)?
> 
> ObligatoryDisclaimer: Please accept my condolences regarding Outlook's
> handling of patches.
> 
> This attached patch is against current scsi-misc-2.6. Also expect this
> patch can be applied to 2.6.21-rc6-mm1
> 
> Signed-off-by: Mark Salyzyn <aacraid@adaptec.com>
> 
> ---
> 

Thanks Mark,

I applied this patch but it still does not work. The oops is gone now, but
I get the following while aacraid is trying to initialize.

Adaptec aacraid driver (1.1-5[2437]-mh4)
ACPI: PCI Interrupt 0000:01:02.0[A] -> GSI 25 (level, low) -> IRQ 25
aacraid: aac_fib_send: first asynchronous command timed out.
Usually a result of a PCI interrupt routing problem;
update mother board BIOS or consider utilizing one of
the SAFE mode kernel options (acpi, apic etc)
aac_fib_free, XferState != 0, fibptr = 0xffff810104140000, XferState = 0x810ad
ACPI: PCI interrupt for device 0000:01:02.0 disabled
aacraid: probe of 0000:01:02.0 failed with error -110
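
For reference, the "error -110" in the last line is a negated errno value. On Linux, 110 is ETIMEDOUT, which fits the "first asynchronous command timed out" message. A minimal user-space sketch for decoding such codes follows; the helper names are hypothetical.

```c
#include <errno.h>
#include <string.h>

/* Map a negative kernel-style return code to its errno description. */
static const char *demo_describe_error(int kernel_rc)
{
    return strerror(-kernel_rc);
}

/* True when a probe return code is the timeout case. */
static int demo_is_timeout(int kernel_rc)
{
    return kernel_rc == -ETIMEDOUT;
}
```

On a Linux system, demo_is_timeout(-110) is expected to be true. The numeric value of ETIMEDOUT differs on other platforms, so compare against the symbolic constant rather than the number.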


Thanks
Vivek



* RE: [PATCH] aacraid: fails to initialize after a  kexec operation
  2007-04-23 16:12     ` Vivek Goyal
@ 2007-04-23 17:20       ` Salyzyn, Mark
  2007-04-24  8:44         ` Vivek Goyal
  0 siblings, 1 reply; 11+ messages in thread
From: Salyzyn, Mark @ 2007-04-23 17:20 UTC (permalink / raw)
  To: vgoyal; +Cc: linux-scsi, James Bottomley, Kexec Mailing List, Judith Lebzelter

That is a failure to route the interrupts and is possibly an issue with
the kernel and the hardware, and not the driver directly (since there is
an expectation that request_irq will connect the interrupt to the
interrupt service routine). Judith reported success in the past with
this patch on her hardware, perhaps the motherboard on your system has
some odd BIOS setup of the hardware that is giving acpi or the apic some
headaches? Can you check out success or failure on other motherboards?
Please try the suggestions from the driver (safe flags)?

Sincerely -- Mark Salyzyn


* Re: [PATCH] aacraid: fails to initialize after a  kexec operation
  2007-04-23 17:20       ` Salyzyn, Mark
@ 2007-04-24  8:44         ` Vivek Goyal
  2007-04-24  9:01           ` Vivek Goyal
  2007-04-24 13:21           ` Salyzyn, Mark
  0 siblings, 2 replies; 11+ messages in thread
From: Vivek Goyal @ 2007-04-24  8:44 UTC (permalink / raw)
  To: Salyzyn, Mark
  Cc: James Bottomley, Kexec Mailing List, Judith Lebzelter, linux-scsi

On Mon, Apr 23, 2007 at 01:20:32PM -0400, Salyzyn, Mark wrote:
> That is a failure to route the interrupts and is possibly an issue with
> the kernel and the hardware, and not the driver directly (since there is
> an expectation that request_irq will connect the interrupt to the
> interrupt service routine). Judith reported success in the past with
> this patch on her hardware, perhaps the motherboard on your system has
> some odd BIOS setup of the hardware that is giving acpi or the apic some
> headaches? Can you check out success or failure on other motherboards?
> Please try the suggestions from the driver (safe flags)?
> 
> Sincerely -- Mark Salyzyn
> 

Hi Mark,

We don't even go through BIOS in kexec and kdump. So BIOS should not be an
issue.

It looks like you send a message to the controller and then wait for an
interrupt from the controller to indicate that the command has completed. In
this case the interrupt never seems to arrive, hence the timeout.

To bypass this problem, I am now booting my second kernel with "irqpoll"
command line option. This will make sure that aacraid interrupt handler
gets invoked even if there is an interrupt routing issue.

This option does let things progress, but it ends up corrupting
something or other on the disk. In three attempts I got three different
types of errors.

In the first attempt I got a continuous stream of the following messages once
the root file system had been mounted.

=============================================
sda1: rw=0, want=9261304112, limit=41945652
attempt to access beyond end of device
sda1: rw=0, want=9261304112, limit=41945652
attempt to access beyond end of device
sda1: rw=0, want=9261304112, limit=41945652
attempt to access beyond end of device
sda1: rw=0, want=9261304112, limit=41945652
attempt to access beyond end of device
sda1: rw=0, want=9261304112, limit=41945652
attempt to access beyond end of device
============================================
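
To put the numbers in that log in perspective, here is a small sketch of the kind of range check that rejects such a request. It assumes that "want" is the end sector of the request and "limit" is the partition size in 512-byte sectors. The helper names are hypothetical, and this is not the block layer's actual code.

```c
#include <stdint.h>

#define DEMO_LIMIT_SECTORS UINT64_C(41945652)   /* "limit" from the log above */
#define DEMO_WANT_SECTORS  UINT64_C(9261304112) /* "want" from the log above  */

/* Returns 1 if [start, start + count) runs past the end of the device. */
static int demo_beyond_end(uint64_t start, uint64_t count, uint64_t limit)
{
    if (count > limit)
        return 1;
    return start > limit - count; /* written this way to avoid overflow */
}

/* Returns 1 if a sector number cannot be represented in 32 bits. */
static int demo_exceeds_32bit(uint64_t sector)
{
    return sector > UINT32_MAX;
}
```

The requested end sector is larger than 2^32, whereas the whole array reported later in this message has 429459456 sectors. That suggests the block number came from garbage metadata rather than from a slightly wrong partition size.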

In the second attempt, it mounted the file system but found some issue
with the "resize" inode and asked me to run fsck manually, which in turn
deleted a whole lot of inodes.

In the third attempt it panicked later when it found ext3 to be corrupted.

=========================================
Creating block device nodes.
Trying to resume from LABEL=SWAP-sda3
No suspend signature on swap, not resuming.
Creating root device.
Mounting root filesystem.
EXT3-fs: Magic mismatch, very weird !
mount: error mouKernel panic - not syncing: Attempted to kill init!
nting /dev/root
=================================================== 
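
For context on the "Magic mismatch" message: ext2 and ext3 keep a 16-bit magic value of 0xEF53 in the superblock. The superblock starts 1024 bytes into the partition and the magic sits 56 bytes into the superblock, stored little-endian. A minimal sketch of that check on a raw buffer follows. The function name is hypothetical and this is not the filesystem's own code.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_SB_OFFSET    1024u   /* superblock starts 1024 bytes in      */
#define DEMO_MAGIC_OFFSET 56u     /* s_magic lies 56 bytes into the block */
#define DEMO_EXT_MAGIC    0xEF53u /* shared by ext2 and ext3              */

/* buf holds the first len bytes of the partition. Returns 1 on a match. */
static int demo_has_ext_magic(const uint8_t *buf, size_t len)
{
    size_t pos = DEMO_SB_OFFSET + DEMO_MAGIC_OFFSET;
    uint16_t magic;

    if (len < pos + 2)
        return 0; /* buffer too short to contain the magic field */
    magic = (uint16_t)(buf[pos] | (buf[pos + 1] << 8)); /* little-endian */
    return magic == DEMO_EXT_MAGIC;
}
```

Running such a check against data read through the adapter in the second kernel, and again after a normal boot, would be one way to tell whether the bytes on disk or the bytes returned by the controller are at fault.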

Following are the relevant aacraid initialization messages on the serial console.

===================================================================
Adaptec aacraid driver (1.1-5[2437]-mh4)
ACPI: PCI Interrupt 0000:01:02.0[A] -> GSI 25 (level, low) -> IRQ 25
AAC0: kernel 5.2-0[11835] Jan  9 2007
AAC0: monitor 5.2-0[11835]
AAC0: bios 5.2-0[11835]
AAC0: serial 1625d1
AAC0: 64bit support enabled.
AAC0: 64 Bit DAC enabled
scsi0 : ServeRAID
scsi 0:0:0:0: Direct-Access     IBM      x366             V1.0 PQ: 0 ANSI: 2
scsi 0:1:0:0: Direct-Access     IBM-ESXS ST973401SS       B519 PQ: 0 ANSI: 5
scsi 0:1:1:0: Direct-Access     IBM-ESXS ST973401SS       B519 PQ: 0 ANSI: 5
scsi 0:1:2:0: Direct-Access     IBM-ESXS ST973401SS       B519 PQ: 0 ANSI: 5
scsi 0:3:0:0: Enclosure         IBM      SAS SES-2 DEVICE 0.09 PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 429459456 512-byte hardware sectors (219883 MB)
sd 0:0:0:0: [sda] Assuming Write Enabled
sd 0:0:0:0: [sda] Assuming drive cache: write through
sd 0:0:0:0: [sda] 429459456 512-byte hardware sectors (219883 MB)
sd 0:0:0:0: [sda] Assuming Write Enabled
sd 0:0:0:0: [sda] Assuming drive cache: write through
 sda: sda1 sda2 sda3 sda4 < sda5 >
sd 0:0:0:0: [sda] Attached SCSI removable disk
sd 0:0:0:0: Attached scsi generic sg0 type 0
scsi 0:1:0:0: Attached scsi generic sg1 type 0
scsi 0:1:1:0: Attached scsi generic sg2 type 0
scsi 0:1:2:0: Attached scsi generic sg3 type 0
scsi 0:3:0:0: Attached scsi generic sg4 type 13
================================================

I am not sure why this reset leaves the file system in a corrupted state. Is
there a better way to handle this, like syncing the outstanding commands
before restarting the adapter?

Should one keep a dedicated partition on the disk, leave it unmounted in the
first kernel, and mount it only in the second kernel to save the dump? I shall
have to test such a configuration.

Thanks
Vivek


* Re: [PATCH] aacraid: fails to initialize after a  kexec operation
  2007-04-24  8:44         ` Vivek Goyal
@ 2007-04-24  9:01           ` Vivek Goyal
  2007-04-24 13:21           ` Salyzyn, Mark
  1 sibling, 0 replies; 11+ messages in thread
From: Vivek Goyal @ 2007-04-24  9:01 UTC (permalink / raw)
  To: Salyzyn, Mark
  Cc: James Bottomley, Kexec Mailing List, Judith Lebzelter, linux-scsi

On Tue, Apr 24, 2007 at 02:14:44PM +0530, Vivek Goyal wrote:
> 
> In the second attempt, it mounted the file system but found some issue
> with the "resize" inode and asked me to run fsck manually, which in turn
> deleted a whole lot of inodes.
> 
> In the third attempt it panicked later when it found ext3 to be corrupted.
> 
> =========================================
> Creating block device nodes.
> Trying to resume from LABEL=SWAP-sda3
> No suspend signature on swap, not resuming.
> Creating root device.
> Mounting root filesystem.
> EXT3-fs: Magic mismatch, very weird !
> mount: error mouKernel panic - not syncing: Attempted to kill init!
> nting /dev/root
> =================================================== 
> 

Hi Mark,

Interesting observation. After the above message I rebooted my system,
expecting that ext3 was corrupted and that I would have to recover it
using fsck. Nothing of that sort happened. The system just booted fine.

This leaves me wondering why ext3 thinks that the magic number is a
mismatch while booting using kexec. Is AACRAID returning the right bytes
from the disk after a reset?

Thanks
Vivek


* RE: [PATCH] aacraid: fails to initialize after a  kexec operation
  2007-04-24  8:44         ` Vivek Goyal
  2007-04-24  9:01           ` Vivek Goyal
@ 2007-04-24 13:21           ` Salyzyn, Mark
  2007-04-30  9:53             ` Vivek Goyal
  1 sibling, 1 reply; 11+ messages in thread
From: Salyzyn, Mark @ 2007-04-24 13:21 UTC (permalink / raw)
  To: vgoyal; +Cc: James Bottomley, Kexec Mailing List, Judith Lebzelter, linux-scsi

The system BIOS sets up the card's PCI configuration and there is code
in the kernel that is capable of picking up some of the BIOS'
information from the BIOS Data Space (not sure if it is actively
collected in your configuration, you need a kernel flag to pick this
up). On kexec this BIOS Data Space information is missing (?) and if
there was any reconfiguration of the PCI space going on (I think only
the Linux BIOS project does this), kexec will inherit it. This issue
strikes me as a corrupted PCI configuration inherited in the kexec case,
such corrupted PCI configurations could be a motherboard specific issue
and can be related to the BIOS' initial setup for the initial kernel. At
least that is my thought process in questioning the motherboard BIOS or
hardware.

Another possibility is that after you have patched over the interrupt
routing issues (a PCI configuration problem), the card has a foreign
array, and the reset and reconfiguration is taking arrays offline. Add
'aacraid.commit=1' to force the foreign arrays to be accepted by the
card.

Could you please check if this issue is specific to your motherboard
model? Could you please check if there is an updated motherboard BIOS
available for it? Could you please check if this issue is specific to
the GB product release cycle? Given the information you have collected,
I would still try the safe flags since there is an interrupt routing
issue.

Another possibility is the reset did not hit your card, the card is not
working correctly or the reset is not working correctly. This feature
was added to the Firmware at the end of 2004, so B11835 certainly would
have it, but that Firmware appears to be an interim test release of the
GB product, and the latest Firmware release to IBM should be B11847 (I
could be mistaken).

Sincerely -- Mark Salyzyn

> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org 
> [mailto:linux-scsi-owner@vger.kernel.org] On Behalf Of Vivek Goyal
> Sent: Tuesday, April 24, 2007 4:45 AM
> To: Salyzyn, Mark
> Cc: James Bottomley; Kexec Mailing List; Judith Lebzelter; 
> linux-scsi@vger.kernel.org
> Subject: Re: [PATCH] aacraid: fails to initialize after a 
> kexec operation
> 
> 
> On Mon, Apr 23, 2007 at 01:20:32PM -0400, Salyzyn, Mark wrote:
> > That is a failure to route the interrupts and is possibly 
> an issue with
> > the kernel and the hardware, and not the driver directly 
> (since there is
> > an expectation that request_irq will connect the interrupt to the
> > interrupt service routine). Judith reported success in the past with
> > this patch on her hardware, perhaps the motherboard on your 
> system has
> > some odd BIOS setup of the hardware that is giving acpi or 
> the apic some
> > headaches? Can you check out success or failure on other 
> motherboards?
> > Please try the suggestions from the driver (safe flags)?
> > 
> > Sincerely -- Mark Salyzyn
> > 
> 
> Hi Mark,
> 
> We don't even go through the BIOS in kexec and kdump, so the BIOS should
> not be an issue.
> 
> It looks like you send a message to the controller and then wait for an
> interrupt from the controller as an indication that the command has
> completed. In this case you never seem to get an interrupt, hence the
> timeout.
> 
> To bypass this problem, I am now booting my second kernel with the
> "irqpoll" command line option. This makes sure that the aacraid interrupt
> handler gets invoked even if there is an interrupt routing issue.
> 
> This option does help things progress, but it ends up corrupting
> something or other on the disk. In three attempts I got three types of
> errors.
> 
> In the first attempt I get a continuous stream of the following messages
> once the root file system has been mounted.
> 
> =============================================
> sda1: rw=0, want=9261304112, limit=41945652
> attempt to access beyond end of device
> sda1: rw=0, want=9261304112, limit=41945652
> attempt to access beyond end of device
> sda1: rw=0, want=9261304112, limit=41945652
> attempt to access beyond end of device
> sda1: rw=0, want=9261304112, limit=41945652
> attempt to access beyond end of device
> sda1: rw=0, want=9261304112, limit=41945652
> attempt to access beyond end of device
> ============================================
> 
> In the second attempt, it mounted the file system but found some issue
> with the "resize" inode and asked me to run fsck manually, which in turn
> deleted a whole lot of inodes.
> 
> In the third attempt it panics later, when it finds ext3 to be corrupted.
> 
> =========================================
> Creating block device nodes.
> Trying to resume from LABEL=SWAP-sda3
> No suspend signature on swap, not resuming.
> Creating root device.
> Mounting root filesystem.
> EXT3-fs: Magic mismatch, very weird !
> mount: error mouKernel panic - not syncing: Attempted to kill init!
> nting /dev/root
> =================================================== 
> 
> The following are the relevant aacraid initialization messages on the
> serial console.
> 
> ===================================================================
> Adaptec aacraid driver (1.1-5[2437]-mh4)
> ACPI: PCI Interrupt 0000:01:02.0[A] -> GSI 25 (level, low) -> IRQ 25
> AAC0: kernel 5.2-0[11835] Jan  9 2007
> AAC0: monitor 5.2-0[11835]
> AAC0: bios 5.2-0[11835]
> AAC0: serial 1625d1
> AAC0: 64bit support enabled.
> AAC0: 64 Bit DAC enabled
> scsi0 : ServeRAID
> scsi 0:0:0:0: Direct-Access     IBM      x366             V1.0 PQ: 0 ANSI: 2
> scsi 0:1:0:0: Direct-Access     IBM-ESXS ST973401SS       B519 PQ: 0 ANSI: 5
> scsi 0:1:1:0: Direct-Access     IBM-ESXS ST973401SS       B519 PQ: 0 ANSI: 5
> scsi 0:1:2:0: Direct-Access     IBM-ESXS ST973401SS       B519 PQ: 0 ANSI: 5
> scsi 0:3:0:0: Enclosure         IBM      SAS SES-2 DEVICE 0.09 PQ: 0 ANSI: 5
> sd 0:0:0:0: [sda] 429459456 512-byte hardware sectors (219883 MB)
> sd 0:0:0:0: [sda] Assuming Write Enabled
> sd 0:0:0:0: [sda] Assuming drive cache: write through
> sd 0:0:0:0: [sda] 429459456 512-byte hardware sectors (219883 MB)
> sd 0:0:0:0: [sda] Assuming Write Enabled
> sd 0:0:0:0: [sda] Assuming drive cache: write through
>  sda: sda1 sda2 sda3 sda4 < sda5 >
> sd 0:0:0:0: [sda] Attached SCSI removable disk
> sd 0:0:0:0: Attached scsi generic sg0 type 0
> scsi 0:1:0:0: Attached scsi generic sg1 type 0
> scsi 0:1:1:0: Attached scsi generic sg2 type 0
> scsi 0:1:2:0: Attached scsi generic sg3 type 0
> scsi 0:3:0:0: Attached scsi generic sg4 type 13
> ================================================
> 
> I am not sure why this reset leaves the file system in a corrupted state.
> Is there a better way to handle this, like syncing the existing commands
> before restarting the adapter?
> 
> Should one keep a dedicated partition on the disk, not mount it in the
> first kernel, and mount it only in the second kernel to save the dump? I
> shall have to test such a configuration.
> 
> Thanks
> Vivek
> 
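
The "attempt to access beyond end of device" lines quoted above can be put
in perspective with a little arithmetic. The following is only a minimal
sketch: it assumes that want= and limit= are counted in 512-byte sectors
(which I believe is how the block layer reports them), and the helper names
parse_range_error and describe are made up for this illustration. The whole
disk size is taken from the "429459456 512-byte hardware sectors" line in
the same log.

```python
import re

SECTOR_BYTES = 512  # assumption: want=/limit= are in 512-byte sectors


def parse_range_error(line):
    """Extract (device, want, limit) from a 'sda1: rw=0, want=..., limit=...' line."""
    m = re.match(r"\s*(\w+): rw=\d+, want=(\d+), limit=(\d+)", line)
    if m is None:
        raise ValueError("not a range-check line: %r" % line)
    return m.group(1), int(m.group(2)), int(m.group(3))


def describe(line, disk_sectors):
    """Compare the requested end sector with the partition and the whole disk."""
    dev, want, limit = parse_range_error(line)
    return {
        "device": dev,
        "limit_gib": limit * SECTOR_BYTES / 2**30,
        "want_gib": want * SECTOR_BYTES / 2**30,
        "beyond_partition": want > limit,
        "beyond_whole_disk": want > disk_sectors,
    }


info = describe("sda1: rw=0, want=9261304112, limit=41945652",
                disk_sectors=429459456)
print(info)
```

Under those assumptions the partition is about 20 GiB while the request
points more than 4000 GiB into it, which is past the end of the whole disk
as well. That would be consistent with a garbage block number read from
damaged metadata rather than with an off-by-a-little partition table.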


* Re: [PATCH] aacraid: fails to initialize after a  kexec operation
  2007-04-24 13:21           ` Salyzyn, Mark
@ 2007-04-30  9:53             ` Vivek Goyal
  2007-04-30 14:11               ` Salyzyn, Mark
  0 siblings, 1 reply; 11+ messages in thread
From: Vivek Goyal @ 2007-04-30  9:53 UTC (permalink / raw)
  To: Salyzyn, Mark
  Cc: James Bottomley, Kexec Mailing List, Judith Lebzelter, linux-scsi,
	Darrick J. Wong

On Tue, Apr 24, 2007 at 09:21:35AM -0400, Salyzyn, Mark wrote:
> The system BIOS sets up the card's PCI configuration and there is code
> in the kernel that is capable of picking up some of the BIOS'
> information from the BIOS Data Space (not sure if it is actively
> collected in your configuration, you need a kernel flag to pick this
> up). On kexec this BIOS Data Space information is missing (?) and if
> there was any reconfiguration of the PCI space going on (I think only
> the Linux BIOS project does this), kexec will inherit it. This issue
> strikes me as a corrupted PCI configuration inherited in the kexec case,
> such corrupted PCI configurations could be a motherboard specific issue
> and can be related to the BIOS' initial setup for the initial kernel. At
> least that is my thought process in questioning the motherboard BIOS or
> hardware.
> 
> Another possibility is that after you have patched over the interrupt
> routing issues (a PCI configuration problem), the card has a foreign
> array, and the reset and reconfiguration is taking arrays offline. Add
> 'aacraid.commit=1' to force the foreign arrays to be accepted by the
> card.
> 

Hi Mark,

So the combination of aacraid.commit=1 and irqpoll has done the trick. I
can now kexec/kdump into the second kernel. I am using an IBM x366 series
machine. There is one array and three disks behind it.

Now a few queries.

- What is the concept of foreign arrays?
- Should we pass aacraid.commit=1 all the time, or is this only for
  some special cases? What's the point in resetting an adapter if it
  does not online the array it is managing?
- For kexec, the device shutdown routine (aac_shutdown) is called. If so,
  then for a normal kexec (not kdump) the adapter should not need to be
  reset?
- We still need to find out why the PCI configuration is getting corrupted,
  why irq routing is not proper, and why irqpoll is required.
 
Thanks
Vivek
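
For anyone wanting to reproduce the workaround described in this message,
the two parameters have to end up on the second kernel's command line. The
following is a hedged sketch only: the kernel and initrd paths and the
root= value are hypothetical, the helper names are invented for this
example, and the kexec options used (-l, -p, --initrd=, --append=) are the
usual kexec-tools ones as far as I know. It only builds the argument list
and does not run anything.

```python
def add_params(cmdline, extra):
    """Return cmdline with each parameter in extra appended at most once."""
    params = cmdline.split()
    for param in extra:
        if param not in params:
            params.append(param)
    return " ".join(params)


def kexec_load_args(kernel, initrd, cmdline, panic_kernel=False):
    """Build an argument list for loading a kernel with kexec-tools.

    panic_kernel=True selects -p (load a kdump capture kernel) instead of
    -l (load a kernel for a normal kexec).
    """
    load_flag = "-p" if panic_kernel else "-l"
    return ["kexec", load_flag, kernel,
            "--initrd=" + initrd,
            "--append=" + cmdline]


# Hypothetical base command line; only irqpoll and aacraid.commit=1 come
# from the thread.
base = "root=/dev/sda1 ro irqpoll"
cmdline = add_params(base, ["irqpoll", "aacraid.commit=1"])
args = kexec_load_args("/boot/vmlinuz-kdump", "/boot/initrd-kdump.img",
                       cmdline, panic_kernel=True)
print(" ".join(args))
```

Keeping the parameter merge separate from the kexec invocation makes it
easy to drop aacraid.commit=1 again if accepting foreign arrays turns out
to be undesirable for kdump.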


* RE: [PATCH] aacraid: fails to initialize after a  kexec operation
  2007-04-30  9:53             ` Vivek Goyal
@ 2007-04-30 14:11               ` Salyzyn, Mark
  2007-05-02  4:21                 ` Vivek Goyal
  0 siblings, 1 reply; 11+ messages in thread
From: Salyzyn, Mark @ 2007-04-30 14:11 UTC (permalink / raw)
  To: vgoyal
  Cc: James Bottomley, Kexec Mailing List, Judith Lebzelter, linux-scsi,
	Darrick J. Wong

Foreign arrays are arrays configured on another adapter and then moved over
to the current host adapter. I do not know why this may be the case in
your situation, but it had the smell of behaving like a foreign array,
hence my suggestion. We use commit=1 for all situations where the
importation of an array is not considered an error and there is no BIOS
to intervene prior to driver load. Typically we advise setting this flag
in embedded systems, or on non-Intel based architectures. Normally on
Intel based systems the card's BIOS prompts the user at boot (to answer
yes) to accept the array configuration should it be detected as foreign.

I see some problems with declaring aacraid.commit=1 for kdump: you are
changing the storage system conditions, and the fact that you have a
foreign array may have been the cause of the primary kernel's failure, so
you would be rubbing out a factor in the system's failure. I would also
hate to store a kernel dump on an array whose status or origin one does
not know.

If there is a clean shutdown, and there are no outstanding commands from
the OS (including the ioctl, so make sure the management software
commands are shut down), I do not see a reason to reset the adapter.

I agree, the irqpoll is troublesome! Could something else in the kexec
kernel be catching the interrupts and dropping them on the floor? Are
there any other devices sharing that same interrupt line that may be
holding the interrupt asserted? What do /proc/irq/* and /proc/interrupts
show? I did not make it clear, but by routing I mean more than just the
PCI hardware: other things also control the path of an interrupt from the
controller hardware to the interrupt service routine, so this may not be
purely an issue of the PCI configuration being corrupted.

Sincerely -- Mark Salyzyn
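
The /proc/interrupts check suggested above can be scripted. What follows is
a minimal sketch, not a definitive tool: it assumes the older layout of
/proc/interrupts in which each numbered row is "IRQ: per-CPU counts, one
chip-name token, comma-separated handler names", and the SAMPLE text is
invented for illustration (only IRQ 25 and the aacraid name come from the
thread). It reports lines shared by several handlers and lines whose count
is still zero.

```python
def parse_interrupts(text):
    """Map IRQ number -> (total count, [handler names]) for numbered rows."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = fields[:ncpu]
        tail = fields[ncpu:]
        if not label.strip().isdigit():
            continue  # skip NMI/LOC/ERR style rows
        if len(counts) < ncpu or not all(c.isdigit() for c in counts):
            continue
        if len(tail) < 2:
            continue  # no handler registered on this line
        # Assumption: tail[0] is the interrupt chip name, the rest are handlers.
        handlers = " ".join(tail[1:]).split(", ")
        table[int(label)] = (sum(int(c) for c in counts), handlers)
    return table


def shared_lines(table):
    """IRQs with more than one registered handler."""
    return {irq: h for irq, (_, h) in table.items() if len(h) > 1}


def silent_lines(table):
    """IRQs that have a handler but have never fired."""
    return sorted(irq for irq, (total, _) in table.items() if total == 0)


# Invented sample in the assumed layout.
SAMPLE = """\
           CPU0       CPU1
  0:     123456          0   IO-APIC-edge      timer
 16:        100        200   IO-APIC-fasteoi   uhci_hcd:usb1, eth0
 25:          0          0   IO-APIC-fasteoi   aacraid
NMI:          0          0
"""

table = parse_interrupts(SAMPLE)
print("shared:", shared_lines(table))
print("silent:", silent_lines(table))
```

On a live system the text would come from reading /proc/interrupts in the
second kernel. A handler that sits alone on its line with a count that
never moves would point at delivery rather than at sharing.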



* Re: [PATCH] aacraid: fails to initialize after a  kexec operation
  2007-04-30 14:11               ` Salyzyn, Mark
@ 2007-05-02  4:21                 ` Vivek Goyal
  0 siblings, 0 replies; 11+ messages in thread
From: Vivek Goyal @ 2007-05-02  4:21 UTC (permalink / raw)
  To: Salyzyn, Mark
  Cc: James Bottomley, Kexec Mailing List, Judith Lebzelter, linux-scsi,
	Darrick J. Wong

On Mon, Apr 30, 2007 at 10:11:03AM -0400, Salyzyn, Mark wrote:
> Foreign arrays are arrays configured on another adapter and then moved over
> to the current host adapter. I do not know why this may be the case in
> your situation, but it had the smell of behaving like a foreign array,
> hence my suggestion. We use commit=1 for all situations where the
> importation of an array is not considered an error and there is no BIOS
> to intervene prior to driver load. Typically we advise setting this flag
> in embedded systems, or on non-Intel based architectures. Normally on
> Intel based systems the card's BIOS prompts the user at boot (to answer
> yes) to accept the array configuration should it be detected as foreign.
> 
> I see some problems with declaring aacraid.commit=1 for kdump: you are
> changing the storage system conditions, and the fact that you have a
> foreign array may have been the cause of the primary kernel's failure, so
> you would be rubbing out a factor in the system's failure. I would also
> hate to store a kernel dump on an array whose status or origin one does
> not know.
> 

Hi Mark,

How does one find out from the BIOS whether an array is local or foreign?
On this machine I have not done any migration. I have not even configured
the array; I think I am using the default factory settings. If I get into
the controller BIOS and query arrays, it shows me one array of type Volume.

So if an adapter is managing both local and foreign arrays, would it online
the local one upon reset but offline the foreign one, so that we can
continue to save the dump?

By the way, you say that foreign arrays are configured on another adapter
and then moved to the current host adapter. Once the move is complete (I
am assuming it will happen in the first kernel), what's the issue with
saving the dump on a foreign array? I think if applications are actively
using the disks behind the foreign array, then it should not be unreliable
to save the dump on those disks?


> If there is a clean shutdown, and there are no outstanding commands from
> the OS (including the ioctl, so make sure the management software
> commands are shut down), I do not see a reason to reset the adapter.
> 

In case of a normal kexec (not kdump) a clean shutdown takes place. All
filesystems are unmounted, processes are stopped, and from the kernel we
call device_shutdown(), which should shut down the device with no pending
interrupts. I am wondering why this does not happen in the case of aacraid
and we end up restarting the adapter even after a clean shutdown using kexec.

> I agree, the irqpoll is troublesome! Could something else in the kexec
> kernel be catching the interrupts and dropping them on the floor? Are
> there any other devices sharing that same interrupt line that may be
> holding the interrupt asserted? What do /proc/irq/* and /proc/interrupts
> show? I did not make it clear, but by routing I mean more than just the
> PCI hardware: other things also control the path of an interrupt from the
> controller hardware to the interrupt service routine, so this may not be
> purely an issue of the PCI configuration being corrupted.
> 

I will look more into it.

Thanks
Vivek


end of thread, other threads:[~2007-05-02  4:21 UTC | newest]

Thread overview: 11+ messages
2007-04-23  7:49 AACRAID fails to initialize after an kexec operation Vivek Goyal
2007-04-23 13:01 ` Salyzyn, Mark
2007-04-23 13:38   ` [PATCH] aacraid: fails to initialize after a " Salyzyn, Mark
2007-04-23 16:12     ` Vivek Goyal
2007-04-23 17:20       ` Salyzyn, Mark
2007-04-24  8:44         ` Vivek Goyal
2007-04-24  9:01           ` Vivek Goyal
2007-04-24 13:21           ` Salyzyn, Mark
2007-04-30  9:53             ` Vivek Goyal
2007-04-30 14:11               ` Salyzyn, Mark
2007-05-02  4:21                 ` Vivek Goyal
