* aic7xxx timer handling bug
@ 2006-01-09 12:06 Andrew Morton
2006-01-09 14:56 ` James Bottomley
2006-01-10 18:11 ` James Bottomley
0 siblings, 2 replies; 4+ messages in thread
From: Andrew Morton @ 2006-01-09 12:06 UTC (permalink / raw)
To: James Bottomley; +Cc: linux-scsi
While doing a binary search for a buggy patch (it was
gregkh-pci-x86-pci-domain-support-the-meat.patch, reported on
linux-kernel), I hit the below.
PIIX4: IDE controller at PCI slot 0000:00:10.0
PIIX4: device not capable of full native PCI mode
PIIX4: device disabled (BIOS)
PIIX4: IDE controller at PCI slot 0000:00:10.0
PIIX4: device not capable of full native PCI mode
PIIX4: device disabled (BIOS)
hda: max request size: 128KiB
hda: 120064896 sectors (61473 MB) w/2048KiB Cache, CHS=65535/16/63, UDMA(33)
hda: cache flushes notsupported
hda: hda1 hda2 hda3
hdc: ATAPI 24X CD-ROM drive, 128kB Cache
Uniform CD-ROM driver Revision: 3.20
ACPI: PCI Interrupt 0000:03:04.0[A]: no GSI - using IRQ 11
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 7.0
<Adaptec 29160 Ultra160 SCSI adapter>
aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
0:0:0:0: Attempting to queue an ABORT message
CDB: 0x12 0x0 0x0 0x0 0x24 0x0
0:0:0:0: Command already completed
aic7xxx_abort returns 0x2002
0:0:0:0: Attempting to queue an ABORT message
CDB: 0x0 0x0 0x0 0x0 0x0 0x0
------------[ cut here ]------------
kernel BUG at kernel/timer.c:293!
invalid opcode: 0000 [#1]
SMP
Modules linked in:
CPU: 1
EIP: 0060:[<c0120388>] Not tainted VLI
EFLAGS: 00010046 (2.6.15)
EIP is at mod_timer+0x21/0x2b
eax: c36e3d18 ebx: c3efde00 ecx: c3f38068 edx: ffff04b2
esi: c3609800 edi: c3f00080 ebp: c3621e5c esp: c3621e5c
ds: 007b es: 007b ss: 0068
Process scsi_eh_0 (pid: 817, threadinfo=c3620000 task=c3cc2550)
Stack: c3621eb4 c02791ed 00000001 00000246 00000000 c3efde98 c3cbd800 c3cbd800
c3cbd900 c3f38068 00000071 00000007 00000000 c3620001 00000000 00000000
00000041 00000001 41607c00 c3efde00 00000071 000003e8 c3621ecc c027fd70
Call Trace:
[<c01036c5>] show_stack+0x6e/0x88
[<c01037fa>] show_registers+0x104/0x15c
[<c01039b2>] die+0xe9/0x166
[<c0103ac1>] do_trap+0x92/0x94
[<c0103d14>] do_invalid_op+0x95/0x9c
[<c01033cf>] error_code+0x4f/0x54
[<c02791ed>] ahc_handle_seqint+0x1324/0x13eb
[<c027fd70>] ahc_pause_and_flushwork+0x3c6/0x441
[<c028ac5f>] ahc_linux_queue_recovery_cmd+0x247/0x84d
[<c028892f>] ahc_linux_abort+0xe/0x2a
[<c026cb61>] scsi_send_eh_cmnd+0xba/0xd2
[<c026ce19>] scsi_eh_tur+0x80/0xbd
[<c026ceb8>] scsi_eh_abort_cmds+0x62/0x72
[<c026d79f>] scsi_unjam_host+0x85/0x97
[<c026d81e>] scsi_error_handler+0x6d/0x91
[<c012a1e9>] kthread+0x9f/0xa4
[<c0100f01>] kernel_thread_helper+0x5/0xb
Code: 0f 0b 08 01 d1 5f 35 c0 eb d3 55 89 e5 83 78 0c 00 74 18 39 50 08 74 07 e
I don't think the PCI domain patch is the cause of this - it's just the
thing which cause the abort handler to run. I'd be suspecting that the
aix7xxx abort handling is bust. I seem to recall seeing someone else
report this a month or so ago?
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: aic7xxx timer handling bug
2006-01-09 12:06 aic7xxx timer handling bug Andrew Morton
@ 2006-01-09 14:56 ` James Bottomley
2006-01-10 18:11 ` James Bottomley
1 sibling, 0 replies; 4+ messages in thread
From: James Bottomley @ 2006-01-09 14:56 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-scsi
On Mon, 2006-01-09 at 04:06 -0800, Andrew Morton wrote:
> While doing a binary search for a buggy patch (it was
> gregkh-pci-x86-pci-domain-support-the-meat.patch, reported on
> linux-kernel), I hit the below.
This is a hard one to solve thanks to Justin and his idiot my driver
knows best policies.
The problem is essentially that recovery can take time and the mid-layer
will get impatient and use a bigger hammer. Justin, therefore, tries to
hold of the timeout until recovery has finished (which can apparently
take forever in his driver). Just pulling out the timer adjustment
might be a welcome cosmetic fix, but it won't solve the underlying
problem: we'll have the mid-layer thinking the command is aborted and
the aic driver still trying to process it.
James
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: aic7xxx timer handling bug
2006-01-09 12:06 aic7xxx timer handling bug Andrew Morton
2006-01-09 14:56 ` James Bottomley
@ 2006-01-10 18:11 ` James Bottomley
2006-01-11 3:30 ` Andrew Morton
1 sibling, 1 reply; 4+ messages in thread
From: James Bottomley @ 2006-01-10 18:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-scsi
On Mon, 2006-01-09 at 04:06 -0800, Andrew Morton wrote:
> While doing a binary search for a buggy patch (it was
> gregkh-pci-x86-pci-domain-support-the-meat.patch, reported on
> linux-kernel), I hit the below.
OK, try this; it should pull out all of the aic7xxx timer handling and
replace it with proper mechanisms (I had to rework the locking a bit to
get this to happen correctly, so caveat emptor).
Unfortunately, aic79xx also needs something like this ... and the patch
there will be much larger.
James
diff --git a/drivers/scsi/aic7xxx/Kconfig.aic7xxx b/drivers/scsi/aic7xxx/Kconfig.aic7xxx
--- a/drivers/scsi/aic7xxx/Kconfig.aic7xxx
+++ b/drivers/scsi/aic7xxx/Kconfig.aic7xxx
@@ -42,13 +42,13 @@ config AIC7XXX_CMDS_PER_DEVICE
config AIC7XXX_RESET_DELAY_MS
int "Initial bus reset delay in milli-seconds"
depends on SCSI_AIC7XXX
- default "15000"
+ default "5000"
---help---
The number of milliseconds to delay after an initial bus reset.
The bus settle delay following all error recovery actions is
dictated by the SCSI layer and is not affected by this value.
- Default: 15000 (15 seconds)
+ Default: 5000 (5 seconds)
config AIC7XXX_PROBE_EISA_VL
bool "Probe for EISA and VL AIC7XXX Adapters"
diff --git a/drivers/scsi/aic7xxx/aic7xxx_osm.c b/drivers/scsi/aic7xxx/aic7xxx_osm.c
--- a/drivers/scsi/aic7xxx/aic7xxx_osm.c
+++ b/drivers/scsi/aic7xxx/aic7xxx_osm.c
@@ -375,7 +375,7 @@ static void ahc_linux_queue_cmd_complete
struct scsi_cmnd *cmd);
static void ahc_linux_sem_timeout(u_long arg);
static void ahc_linux_freeze_simq(struct ahc_softc *ahc);
-static void ahc_linux_release_simq(u_long arg);
+static void ahc_linux_release_simq(struct ahc_softc *ahc);
static int ahc_linux_queue_recovery_cmd(struct scsi_cmnd *cmd, scb_flag flag);
static void ahc_linux_initialize_scsi_bus(struct ahc_softc *ahc);
static u_int ahc_linux_user_tagdepth(struct ahc_softc *ahc,
@@ -1073,7 +1073,6 @@ ahc_linux_register_host(struct ahc_softc
return (ENOMEM);
*((struct ahc_softc **)host->hostdata) = ahc;
- ahc_lock(ahc, &s);
ahc->platform_data->host = host;
host->can_queue = AHC_MAX_QUEUE;
host->cmd_per_lun = 2;
@@ -1084,7 +1083,9 @@ ahc_linux_register_host(struct ahc_softc
host->max_lun = AHC_NUM_LUNS;
host->max_channel = (ahc->features & AHC_TWIN) ? 1 : 0;
host->sg_tablesize = AHC_NSEG;
+ ahc_lock(ahc, &s);
ahc_set_unit(ahc, ahc_linux_unit++);
+ ahc_unlock(ahc, &s);
sprintf(buf, "scsi%d", host->host_no);
new_name = malloc(strlen(buf) + 1, M_DEVBUF, M_NOWAIT);
if (new_name != NULL) {
@@ -1094,7 +1095,6 @@ ahc_linux_register_host(struct ahc_softc
host->unique_id = ahc->unit;
ahc_linux_initialize_scsi_bus(ahc);
ahc_intr_enable(ahc, TRUE);
- ahc_unlock(ahc, &s);
host->transportt = ahc_linux_transport_template;
@@ -1120,10 +1120,13 @@ ahc_linux_initialize_scsi_bus(struct ahc
{
int i;
int numtarg;
+ unsigned long s;
i = 0;
numtarg = 0;
+ ahc_lock(ahc, &s);
+
if (aic7xxx_no_reset != 0)
ahc->flags &= ~(AHC_RESET_BUS_A|AHC_RESET_BUS_B);
@@ -1170,16 +1173,12 @@ ahc_linux_initialize_scsi_bus(struct ahc
ahc_update_neg_request(ahc, &devinfo, tstate,
tinfo, AHC_NEG_ALWAYS);
}
+ ahc_unlock(ahc, &s);
/* Give the bus some time to recover */
if ((ahc->flags & (AHC_RESET_BUS_A|AHC_RESET_BUS_B)) != 0) {
ahc_linux_freeze_simq(ahc);
- init_timer(&ahc->platform_data->reset_timer);
- ahc->platform_data->reset_timer.data = (u_long)ahc;
- ahc->platform_data->reset_timer.expires =
- jiffies + (AIC7XXX_RESET_DELAY * HZ)/1000;
- ahc->platform_data->reset_timer.function =
- ahc_linux_release_simq;
- add_timer(&ahc->platform_data->reset_timer);
+ msleep(AIC7XXX_RESET_DELAY);
+ ahc_linux_release_simq(ahc);
}
}
@@ -2059,6 +2058,9 @@ ahc_linux_sem_timeout(u_long arg)
static void
ahc_linux_freeze_simq(struct ahc_softc *ahc)
{
+ unsigned long s;
+
+ ahc_lock(ahc, &s);
ahc->platform_data->qfrozen++;
if (ahc->platform_data->qfrozen == 1) {
scsi_block_requests(ahc->platform_data->host);
@@ -2068,17 +2070,15 @@ ahc_linux_freeze_simq(struct ahc_softc *
CAM_LUN_WILDCARD, SCB_LIST_NULL,
ROLE_INITIATOR, CAM_REQUEUE_REQ);
}
+ ahc_unlock(ahc, &s);
}
static void
-ahc_linux_release_simq(u_long arg)
+ahc_linux_release_simq(struct ahc_softc *ahc)
{
- struct ahc_softc *ahc;
u_long s;
int unblock_reqs;
- ahc = (struct ahc_softc *)arg;
-
unblock_reqs = 0;
ahc_lock(ahc, &s);
if (ahc->platform_data->qfrozen > 0)
diff --git a/drivers/scsi/aic7xxx/aic7xxx_osm.h b/drivers/scsi/aic7xxx/aic7xxx_osm.h
--- a/drivers/scsi/aic7xxx/aic7xxx_osm.h
+++ b/drivers/scsi/aic7xxx/aic7xxx_osm.h
@@ -223,9 +223,6 @@ int ahc_dmamap_unload(struct ahc_softc *
*/
#define ahc_dmamap_sync(ahc, dma_tag, dmamap, offset, len, op)
-/************************** Timer DataStructures ******************************/
-typedef struct timer_list ahc_timer_t;
-
/********************************** Includes **********************************/
#ifdef CONFIG_AIC7XXX_REG_PRETTY_PRINT
#define AIC_DEBUG_REGISTERS 1
@@ -235,30 +232,9 @@ typedef struct timer_list ahc_timer_t;
#include "aic7xxx.h"
/***************************** Timer Facilities *******************************/
-#define ahc_timer_init init_timer
-#define ahc_timer_stop del_timer_sync
-typedef void ahc_linux_callback_t (u_long);
-static __inline void ahc_timer_reset(ahc_timer_t *timer, int usec,
- ahc_callback_t *func, void *arg);
-static __inline void ahc_scb_timer_reset(struct scb *scb, u_int usec);
-
-static __inline void
-ahc_timer_reset(ahc_timer_t *timer, int usec, ahc_callback_t *func, void *arg)
-{
- struct ahc_softc *ahc;
-
- ahc = (struct ahc_softc *)arg;
- del_timer(timer);
- timer->data = (u_long)arg;
- timer->expires = jiffies + (usec * HZ)/1000000;
- timer->function = (ahc_linux_callback_t*)func;
- add_timer(timer);
-}
-
static __inline void
ahc_scb_timer_reset(struct scb *scb, u_int usec)
{
- mod_timer(&scb->io_ctx->eh_timeout, jiffies + (usec * HZ)/1000000);
}
/***************************** SMP support ************************************/
@@ -393,7 +369,6 @@ struct ahc_platform_data {
spinlock_t spin_lock;
u_int qfrozen;
- struct timer_list reset_timer;
struct semaphore eh_sem;
struct Scsi_Host *host; /* pointer to scsi host */
#define AHC_LINUX_NOIRQ ((uint32_t)~0)
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: aic7xxx timer handling bug
2006-01-10 18:11 ` James Bottomley
@ 2006-01-11 3:30 ` Andrew Morton
0 siblings, 0 replies; 4+ messages in thread
From: Andrew Morton @ 2006-01-11 3:30 UTC (permalink / raw)
To: James Bottomley; +Cc: linux-scsi
James Bottomley <James.Bottomley@SteelEye.com> wrote:
>
> On Mon, 2006-01-09 at 04:06 -0800, Andrew Morton wrote:
> > While doing a binary search for a buggy patch (it was
> > gregkh-pci-x86-pci-domain-support-the-meat.patch, reported on
> > linux-kernel), I hit the below.
>
> OK, try this; it should pull out all of the aic7xxx timer handling and
> replace it with proper mechanisms (I had to rework the locking a bit to
> get this to happen correctly, so caveat emptor).
It fixes the oops. With this +
gregkh-pci-x86-pci-domain-support-the-meat.patch:
26 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID)
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
27 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID)
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
28 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID)
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
29 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID)
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
30 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID)
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
31 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID)
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
Pending list:
2 SCB_CONTROL[0x0] SCB_SCSIID[0x7] SCB_LUN[0x0]
Kernel Free SCB list: 1 0
Untagged Q(0): 2
<<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
scsi0:0:0:0: Cmd aborted from QINFIFO
aic7xxx_abort returns 0x2002
0:0:0:0: scsi: Device offlined - not ready after error recovery
0:0:1:0: Attempting to queue an ABORT message
CDB: 0x12 0x0 0x0 0x0 0x24 0x0
0:0:1:0: Command already completed
aic7xxx_abort returns 0x2002
0:0:1:0: Attempting to queue an ABORT message
CDB: 0x0 0x0 0x0 0x0 0x0 0x0
scsi0: At time of recovery, card was paused
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi0: Dumping Card State in Message-in phase, at SEQADDR 0x103
Card was paused
ACCUM = 0x0, SINDEX = 0x71, DINDEX = 0xe4, ARG_2 = 0x0
HCNT = 0x0 SCBPTR = 0x0
SCSIPHASE[0x8]:(MSG_IN_PHASE) SCSISIGI[0xe6]:(REQI|BSYI|MSGI|IOI|CDI)
ERROR[0x0] SCSIBUSL[0x0] LASTPHASE[0xe0]:(MSGI|IOI|CDI)
SCSISEQ[0x12]:(ENAUTOATNP|ENRSELI) SBLKCTL[0xa]:(SELWIDE|SELBUSB)
SCSIRATE[0x0] SEQCTL[0x10]:(FASTMODE) SEQ_FLAGS[0x0]
SSTAT0[0x2]:(SPIORDY) SSTAT1[0x11]:(REQINIT|PHASEMIS)
SSTAT2[0x10]:(EXP_ACTIVE) SSTAT3[0x0] SIMODE0[0x8]:(ENSWRAP)
SIMODE1[0xac]:(ENSCSIPERR|ENBUSFREE|ENSCSIRST|ENSELTIMO)
SXFRCTL0[0x88]:(SPIOEN|DFON) DFCNTRL[0x0] DFSTATUS[0x89]:(FIFOEMP|HDONE|PRELOAD
STACK: 0x0 0x164 0x179 0x102
SCB count = 4
Kernel NEXTQSCB = 3
Card NEXTQSCB = 2
QINFIFO entries: 2
Waiting Queue entries:
Disconnected Queue entries:
QOUTFIFO entries:
Sequencer Free SCB List: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 2
Sequencer SCB Info:
0 SCB_CONTROL[0xc0]:(DISCENB|TARGET_SCB) SCB_SCSIID[0x17]
SCB_LUN[0x0] SCB_TAG[0xff]
1 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID)
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
2 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID)
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
3 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID)
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
4 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID)
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
That all took about one minute per disk. I have 10 SCSI disks in this
thing.
Then we get a few minutes of:
scsi0:0:5:0: Cmd aborted from QINFIFO
aic7xxx_abort returns 0x2002
0:0:5:0: scsi: Device offlined - not ready after error recovery
0:0:6:0: Attempting to queue an ABORT message
CDB: 0x12 0x0 0x0 0x0 0x24 0x0
0:0:6:0: Command already completed
aic7xxx_abort returns 0x2002
0:0:6:0: Attempting to queue an ABORT message
CDB: 0x0 0x0 0x0 0x0 0x0 0x0
0:0:6:0: Command already completed
aic7xxx_abort returns 0x2002
0:0:6:0: Attempting to queue a TARGET RESET message
CDB: 0x12 0x0 0x0 0x0 0x24 0x0
0:0:6:0: Command not found
aic7xxx_dev_reset returns 0x2002
0:0:6:0: Attempting to queue an ABORT message
CDB: 0x0 0x0 0x0 0x0 0x0 0x0
0:0:6:0: Command already completed
aic7xxx_abort returns 0x2002
0:0:6:0: Attempting to queue an ABORT message
CDB: 0x0 0x0 0x0 0x0 0x0 0x0
0:0:6:0: Command already completed
aic7xxx_abort returns 0x2002
0:0:6:0: scsi: Device offlined - not ready after error recovery
0:0:8:0: Attempting to queue an ABORT message
CDB: 0x12 0x0 0x0 0x0 0x24 0x0
0:0:8:0: Command already completed
aic7xxx_abort returns 0x2002
0:0:8:0: Attempting to queue an ABORT message
CDB: 0x0 0x0 0x0 0x0 0x0 0x0
0:0:8:0: Command already completed
aic7xxx_abort returns 0x2002
0:0:8:0: Attempting to queue a TARGET RESET message
CDB: 0x12 0x0 0x0 0x0 0x24 0x0
0:0:8:0: Command not found
After about 20 minutes, initscripts ran and it almost booted. (This
machine has everything installed on the IDE disk).
Perhaps those timeouts are a bit too long??
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2006-01-11 3:30 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-01-09 12:06 aic7xxx timer handling bug Andrew Morton
2006-01-09 14:56 ` James Bottomley
2006-01-10 18:11 ` James Bottomley
2006-01-11 3:30 ` Andrew Morton
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox