aic7xxx woes in 2.5

public inbox for linux-scsi@vger.kernel.org

* aic7xxx woes in 2.5
@ 2002-12-15  4:31 Andrew Morton
  2002-12-15  6:06 ` Ishikawa
  2002-12-15 20:09 ` Justin T. Gibbs
  0 siblings, 2 replies; 11+ messages in thread
From: Andrew Morton @ 2002-12-15  4:31 UTC (permalink / raw)
  To: linux-scsi


For about six months in the 2.5 series, using aic7xxx, about every fourth boot
one of my disks tends to get:

(scsi1:A:4:0): parity-error detected in Data-in phase: SEQADDR(0x1ae) SCSIRATE(0x88)
scsi1:0:4:0: Attempting to queue an ABORT message

This is invariably fatal.  The box locks and the NMI watchdog
kicks it over.   The call trace is:

(gdb) bt
#0  0xc01d3288 in rep_nop () at include/asm/processor.h:468
#1  0xc01d325d in __delay (loops=98000) at arch/i386/lib/delay.c:63
#2  0xc01d32ad in __const_udelay (xloops=858800) at arch/i386/lib/delay.c:74
#3  0xc01d327c in __udelay (usecs=200) at arch/i386/lib/delay.c:79
#4  0xc0231d7f in ahc_delay (usec=200) at drivers/scsi/aic7xxx/aic7xxx_osm.h:607
#5  0xc022b81e in ahc_clear_critical_section (ahc=0xc3e03000) at drivers/scsi/aic7xxx/aic7xxx_core.c:1392
#6  0xc0235272 in ahc_linux_queue_recovery_cmd (cmd=0xc1766c00, flag=SCB_ABORT) at drivers/scsi/aic7xxx/aic7xxx_linux.c:2490
#7  0xc023569a in ahc_linux_abort (cmd=0xc1766c00) at drivers/scsi/aic7xxx/aic7xxx_linux.c:2667
#8  0xc022592b in scsi_try_to_abort_cmd (scmd=0xc1766c00) at drivers/scsi/scsi_error.c:820
#9  0xc0225a0c in scsi_eh_abort_cmd (sc_todo=0xc1766c00, shost=0xc17de63c) at drivers/scsi/scsi_error.c:902
#10 0xc022614e in scsi_unjam_host (shost=0xc17de63c) at drivers/scsi/scsi_error.c:1532
#11 0xc0226286 in scsi_error_handler (data=0xc17de63c) at drivers/scsi/scsi_error.c:1659

It would seem that the machine locked up in ahc_clear_critical_section():

                do {
                        ahc_delay(200);
                } while (!ahc_is_paused(ahc));

The parity error is intermittent.  But when it happens, the lockup
always happens.
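
The loop quoted above exits only when the chip reports that it is paused, so a controller that never pauses keeps the CPU spinning until the watchdog fires. A minimal sketch of a bounded variant follows, in plain userspace C. The names `fake_chip`, `fake_is_paused` and `poll_until_paused`, and the idea of a poll limit, are hypothetical illustrations and are not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the controller: it reports "paused" only
 * after polls_until_paused polls; a negative value means never. */
struct fake_chip {
    int polls_until_paused;
};

static bool fake_is_paused(struct fake_chip *chip)
{
    if (chip->polls_until_paused < 0)
        return false;               /* wedged: never pauses */
    if (chip->polls_until_paused == 0)
        return true;
    chip->polls_until_paused--;
    return false;
}

/* Poll at most max_polls times. Returns the number of polls used on
 * success, or -1 if the chip never reported paused. A real driver
 * would delay between polls (the original uses ahc_delay(200));
 * the delay is omitted in this sketch. */
static int poll_until_paused(struct fake_chip *chip, int max_polls)
{
    assert(chip != NULL);
    for (int i = 1; i <= max_polls; i++) {
        if (fake_is_paused(chip))
            return i;
    }
    return -1;
}
```

With a limit in place the caller can escalate (for example to a bus reset) instead of hanging; whether that is appropriate for this driver is not something the thread establishes.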

This never happens in 2.4 kernels.

It seems to happen a little more frequently on uniprocessor builds.

So relevant questions would be:

1) Why does only 2.5 get the parity error?

2) Why does the recovery lock up?

3) Does anyone have a diff for Justin's new driver?


lspci:

00:0a.0 SCSI storage controller: Adaptec AIC-7880U (rev 01)
03:04.0 SCSI storage controller: Adaptec 7892A (rev 02)

2.4.19-pre4's dmesg:


scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.5
        <Adaptec 29160 Ultra160 SCSI adapter>
        aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs

scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.5
        <Adaptec aic7880 Ultra SCSI adapter>
        aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs

  Vendor: QUANTUM   Model: ATLAS IV 9 SCA    Rev: 0B0B
  Type:   Direct-Access                      ANSI SCSI revision: 03
  Vendor: QUANTUM   Model: ATLAS 10K 9SCA    Rev: UC81
  Type:   Direct-Access                      ANSI SCSI revision: 03
  Vendor: SEAGATE   Model: ST19101W          Rev: 0014
  Type:   Direct-Access                      ANSI SCSI revision: 02
  Vendor: QUANTUM   Model: QM39100TD-SCA     Rev: N1B0
  Type:   Direct-Access                      ANSI SCSI revision: 02
  Vendor: FUJITSU   Model: MAF3364L SUN36G   Rev: 1213
  Type:   Direct-Access                      ANSI SCSI revision: 02
  Vendor: ESG-SHV   Model: SCA HSBP M4       Rev: 0.63
  Type:   Processor                          ANSI SCSI revision: 02
scsi0:A:0:0: Tagged Queuing enabled.  Depth 253
scsi0:A:1:0: Tagged Queuing enabled.  Depth 253
scsi0:A:2:0: Tagged Queuing enabled.  Depth 253
scsi0:A:4:0: Tagged Queuing enabled.  Depth 253
scsi0:A:5:0: Tagged Queuing enabled.  Depth 253
  Vendor: QUANTUM   Model: ATLAS 10K 9SCA    Rev: UC81
  Type:   Direct-Access                      ANSI SCSI revision: 03
  Vendor: QUANTUM   Model: ATLAS 10K 9SCA    Rev: UCH0
  Type:   Direct-Access                      ANSI SCSI revision: 03
  Vendor: QUANTUM   Model: ATLAS 10K 9SCA    Rev: UC81
  Type:   Direct-Access                      ANSI SCSI revision: 03
  Vendor: QUANTUM   Model: ATLAS 10K 9SCA    Rev: UCP0
  Type:   Direct-Access                      ANSI SCSI revision: 03
  Vendor: FUJITSU   Model: MAF3364L SUN36G   Rev: 1213
  Type:   Direct-Access                      ANSI SCSI revision: 02
  Vendor: ESG-SHV   Model: SCA HSBP M4       Rev: 0.63
  Type:   Processor                          ANSI SCSI revision: 02
scsi1:A:0:0: Tagged Queuing enabled.  Depth 253
scsi1:A:1:0: Tagged Queuing enabled.  Depth 253
scsi1:A:2:0: Tagged Queuing enabled.  Depth 253
scsi1:A:4:0: Tagged Queuing enabled.  Depth 253
scsi1:A:5:0: Tagged Queuing enabled.  Depth 253
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
Attached scsi disk sdb at scsi0, channel 0, id 1, lun 0
Attached scsi disk sdc at scsi0, channel 0, id 2, lun 0
Attached scsi disk sdd at scsi0, channel 0, id 4, lun 0
Attached scsi disk sde at scsi0, channel 0, id 5, lun 0
Attached scsi disk sdf at scsi1, channel 0, id 0, lun 0
Attached scsi disk sdg at scsi1, channel 0, id 1, lun 0
Attached scsi disk sdh at scsi1, channel 0, id 2, lun 0
Attached scsi disk sdi at scsi1, channel 0, id 4, lun 0
Attached scsi disk sdj at scsi1, channel 0, id 5, lun 0
(scsi0:A:0): 40.000MB/s transfers (20.000MHz, offset 31, 16bit)
SCSI device sda: 17942584 512-byte hdwr sectors (9187 MB)
 sda: sda1
(scsi0:A:1): 40.000MB/s transfers (20.000MHz, offset 31, 16bit)
SCSI device sdb: 17938986 512-byte hdwr sectors (9185 MB)
 sdb: sdb1
(scsi0:A:2): 40.000MB/s transfers (20.000MHz, offset 15, 16bit)
SCSI device sdc: 17783240 512-byte hdwr sectors (9105 MB)
 sdc: sdc1
(scsi0:A:4): 40.000MB/s transfers (20.000MHz, offset 31, 16bit)
SCSI device sdd: 17783249 512-byte hdwr sectors (9105 MB)
 sdd: sdd1 < sdd5 >
(scsi0:A:5): 40.000MB/s transfers (20.000MHz, offset 63, 16bit)
SCSI device sde: 71132959 512-byte hdwr sectors (36420 MB)
 sde: sde1 < sde5 sde6 sde7 >
(scsi1:A:0): 40.000MB/s transfers (20.000MHz, offset 8, 16bit)
SCSI device sdf: 17938986 512-byte hdwr sectors (9185 MB)
 sdf: sdf1
(scsi1:A:1): 40.000MB/s transfers (20.000MHz, offset 8, 16bit)
SCSI device sdg: 17938986 512-byte hdwr sectors (9185 MB)
 sdg: sdg1
(scsi1:A:2): 40.000MB/s transfers (20.000MHz, offset 8, 16bit)
SCSI device sdh: 17938986 512-byte hdwr sectors (9185 MB)
 sdh: sdh1
(scsi1:A:4): 40.000MB/s transfers (20.000MHz, offset 8, 16bit)
SCSI device sdi: 17938986 512-byte hdwr sectors (9185 MB)
 sdi: sdi1 < sdi5 sdi6 >
(scsi1:A:5): 40.000MB/s transfers (20.000MHz, offset 8, 16bit)
SCSI device sdj: 71132959 512-byte hdwr sectors (36420 MB)
 sdj: sdj1 < sdj5 sdj6 >
Attached scsi generic sg5 at scsi0, channel 0, id 6, lun 0,  type 3
Attached scsi generic sg11 at scsi1, channel 0, id 6, lun 0,  type 3


* Re: aic7xxx woes in 2.5
  2002-12-15  4:31 aic7xxx woes in 2.5 Andrew Morton
@ 2002-12-15  6:06 ` Ishikawa
  2002-12-15  6:48   ` Andrew Morton
  2002-12-15 20:17   ` Justin T. Gibbs
  2002-12-15 20:09 ` Justin T. Gibbs
  1 sibling, 2 replies; 11+ messages in thread
From: Ishikawa @ 2002-12-15  6:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-scsi

Hi,

> The parity error is intermittent.  But when it happens, the lockup
> always happens.
> 
> This never happens in 2.4 kernels.
> 
> It seems to happen a little more frequently on uniprocessor builds.
> 
> So relevant questions would be:
> 
> 1) Why does only 2.5 get the parity error?

Since you say "uniprocessor builds", maybe you are using 
high-quality dual processor board. But just in case, does your
motherboard support proper PCI parity bus check?
(I remember that when I switched motherboards about two years ago,
I noticed that the SCSI driver warned me of a
parity error and wouldn't start. I had to add a boot
command-line option to ignore the parity error. The
board didn't seem to handle the PCI bus parity bit properly.
A surprise. I switched to another board
a couple of weeks later, which supports
parity without problems.)

So assuming that the PCI parity is handled
correctly on your motherboard, I wonder if it is
possibly a real intermittent parity error.
Maybe 2.5 is now more efficient in
data I/O rate and the exercised bus may encounter
an occasional parity error.  A pure guess.
Frankly, only a hardware engineer with a good diagnostic
tool can tell the real cause if it is a real parity error.

Of course, there is a chance that the parity error
is reported by a slightly buggy driver (the downloaded
firmware may not handle the timing correctly under
the new kernel's timing conditions, etc.).

> 2) Why does the recovery lock up?

A good question. There still may be missed
lock-up path(s) during recovery even in 2.5.

> scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.5
>         <Adaptec 29160 Ultra160 SCSI adapter>
>         aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
> 

I noticed that you have many disks.
Are they in external enclosure?
If not, is the power-supply in your
PC box spec'ed to supply enough power?
[I just had to reassemble a non-Linux PC to
upgrade the power supply after I changed the video card.
(Newer video cards seem to suck in power like a large
gas guzzler.) Initially I replaced the power supply
with a newer one, which I thought was enough
to give the required power. But later on, I realized that
the new one didn't offer enough power: the system
would still crash or hang under heavy usage, so I
upgraded to a larger one. That PC runs fine now.]

It is possible that the PC is running fine but the
power condition may be close to the safety limit and
a real parity error may occur under heavy I/O conditions.

BTW, strange things do happen when we switch
kernels and drivers, don't they?
If only I had a spare PC,
I would try Linux 2.5.xx to see how the
newer SCSI subsystem fares in real-world conditions after
seeing so many problems in the older kernels with
my set of flaky and esoteric hardware: the very long
silent/time-out period of my CD changer drives, and
a Seagate disk that has a few blocks which go bad
after it has heated up enough, etc.
[I still keep the Seagate drive as a test sample for
recovery testing.]

-- 
int main(void){int j=2002;/*(c)2002 cishikawa. */
char t[] ="<CI> @abcdefghijklmnopqrstuvwxyz.,\n\"";
char *i ="h>qtCIuqivb,gCwe\np@.ietCIuqi\"tqkvv is>dnamz";
while(*i)((j+=strchr(t,*i++)-(int)t),(j%=sizeof t-1),
(putchar(t[j])));return 0;}/* under GPL */


* Re: aic7xxx woes in 2.5
  2002-12-15  6:06 ` Ishikawa
@ 2002-12-15  6:48   ` Andrew Morton
  2002-12-15 13:48     ` Ishikawa
  2002-12-15 20:17   ` Justin T. Gibbs
  1 sibling, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2002-12-15  6:48 UTC (permalink / raw)
  To: Ishikawa; +Cc: linux-scsi

Ishikawa wrote:
> 
> Hi,
> 
> > The parity error is intermittent.  But when it happens, the lockup
> > always happens.
> >
> > This never happens in 2.4 kernels.
> >
> > It seems to happen a little more frequently on uniprocessor builds.
> >
> > So relevant questions would be:
> >
> > 1) Why does only 2.5 get the parity error?
> 
> Since you say "uniprocessor builds", maybe you are using
> high-quality dual processor board. But just in case, does your
> motherboard support proper PCI parity bus check?

I expect it supports everything.  It is a 1998-vintage Intel
ad450nx server - these things cost $70,000 in their day.  It
is built like a battleship.  See
http://images.google.com/images?hl=en&lr=&ie=ISO-8859-1&q=ad450nx

But that error is a scsi bus error, not a PCI bus error.

> ...
> > scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.5
> >         <Adaptec 29160 Ultra160 SCSI adapter>
> >         aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
> >
> 
> I noticed that you have many disks.
> Are they in external enclosure?

They are internal.

> If not, is the power-supply in your
> PC box spec'ed to supply enough power?

It has four power supplies and approximately 13 fans ;)


* Re: aic7xxx woes in 2.5
  2002-12-15  6:48   ` Andrew Morton
@ 2002-12-15 13:48     ` Ishikawa
  0 siblings, 0 replies; 11+ messages in thread
From: Ishikawa @ 2002-12-15 13:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-scsi

Hi,

> I expect it supports everything.  It is a 1998-vintage Intel
> ad450nx server - these things cost $70,000 in their day.  It
> is built like a battleship.  See
> http://images.google.com/images?hl=en&lr=&ie=ISO-8859-1&q=ad450nx
> 
> But that error is a scsi bus error, not a PCI bus error.
> 
> > ...
> > > scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.5
> > >         <Adaptec 29160 Ultra160 SCSI adapter>
> > >         aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
> > >
> >
> > I noticed that you have many disks.
> > Are they in external enclosure?
> 
> They are internal.
> 
> > If not, is the power-supply in your
> > PC box spec'ed to supply enough power?
> 
> It has four power supplies and approximately 13 fans ;)

I agree that you have ample reason to suspect
a driver problem, not hardware! :-)

I tried to locate my old posting but could not find
it in my local folder. If I recall correctly,
mine was certainly a PCI parity error, which appeared
somewhere after the PCI_COMMAND_PARITY line in the
startup message below.

> scsi: host order: sym53c8xx:tmscsim
> sym53c8xx: at PCI bus 0, device 12, function 0
> sym53c8xx: setting PCI_COMMAND_PARITY...(fix-up)
> sym53c8xx: 53c895 detected with Tekram NVRAM
> sym53c895-0: rev 0x1 on pci bus 0 device 12 function 0 irq 10
> sym53c895-0: Tekram format NVRAM, ID 7, Fast-40, Parity Checking
> sym53c895-0: SCSI bus mode change from 80 to 80.
> scsi0 : sym53c8xx-1.7.3c-20010512
> sym53c895-0-<4,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 15)
> sym53c895-0-<6,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 31)

Again, possibly the high(er) speed of I/O might
affect the electrical cable condition, but
given the hardware you have, I trust that
you have high-quality cable and terminator, and
so maybe someone who is familiar with the driver code
can help us here.


 


-- 
int main(void){int j=2002;/*(c)2002 cishikawa. */
char t[] ="<CI> @abcdefghijklmnopqrstuvwxyz.,\n\"";
char *i ="h>qtCIuqivb,gCwe\np@.ietCIuqi\"tqkvv is>dnamz";
while(*i)((j+=strchr(t,*i++)-(int)t),(j%=sizeof t-1),
(putchar(t[j])));return 0;}/* under GPL */


* Re: aic7xxx woes in 2.5
  2002-12-15  6:06 ` Ishikawa
  2002-12-15  6:48   ` Andrew Morton
@ 2002-12-15 20:17   ` Justin T. Gibbs
  1 sibling, 0 replies; 11+ messages in thread
From: Justin T. Gibbs @ 2002-12-15 20:17 UTC (permalink / raw)
  To: Ishikawa, Andrew Morton; +Cc: linux-scsi

> Hi,
> 
>> The parity error is intermittent.  But when it happens, the lockup
>> always happens.
>> 
>> This never happens in 2.4 kernels.
>> 
>> It seems to happen a little more frequently on uniprocessor builds.
>> 
>> So relevant questions would be:
>> 
>> 1) Why does only 2.5 get the parity error?
> 
> 
> Since you say "uniprocessor builds", maybe you are using 
> high-quality dual processor board. But just in case, does your
> motherboard support proper PCI parity bus check?

These are SCSI parity errors, not PCI parity errors.

--
Justin


* Re: aic7xxx woes in 2.5
  2002-12-15  4:31 aic7xxx woes in 2.5 Andrew Morton
  2002-12-15  6:06 ` Ishikawa
@ 2002-12-15 20:09 ` Justin T. Gibbs
  2002-12-16  9:40   ` Andrew Morton
  1 sibling, 1 reply; 11+ messages in thread
From: Justin T. Gibbs @ 2002-12-15 20:09 UTC (permalink / raw)
  To: Andrew Morton, linux-scsi

> For about six months in the 2.5 series, using aic7xxx, about every fourth
> boot one of my disks tends to get:
> 
> (scsi1:A:4:0): parity-error detected in Data-in phase: SEQADDR(0x1ae) SCSIRATE(0x88)
> scsi1:0:4:0: Attempting to queue an ABORT message
> 
> This is invariably fatal.

...

> This never happens in 2.4 kernels.
> 
> It seems to happen a little more frequently on uniprocessor builds.
> 
> So relevant questions would be:
> 
> 1) Why does only 2.5 get the parity error?

Most likely different loads on your SCSI bus.  The driver can't "make up"
SCSI bus parity errors.
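
For context, the parallel SCSI bus protects each data byte with an odd parity bit, and the receiving side checks it in hardware. A minimal sketch of that check in C is given here; the function names are hypothetical and this is not driver code, only an illustration of what a "parity error" on one byte means.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Parity bit that makes the total number of 1 bits
 * (eight data bits plus the parity bit) odd. */
static unsigned odd_parity_bit(uint8_t data)
{
    unsigned ones = 0;
    for (int i = 0; i < 8; i++)
        ones += (data >> i) & 1u;
    return (ones % 2 == 0) ? 1u : 0u;
}

/* True if the received parity bit matches the received data byte. */
static bool parity_ok(uint8_t data, unsigned parity)
{
    assert(parity <= 1u);           /* parity is a single bit */
    return odd_parity_bit(data) == parity;
}
```

A single flipped data bit changes the count of 1 bits by one, so the check fails for that byte; two flipped bits in the same byte go undetected.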

> 2) Why does the recovery lock up?

I would actually have to know the sequencer instruction that we
are blocked on in the clear_critical_sections code to be able to
say.  Several recovery bugs have been fixed in later driver versions.

> 3) Does anyone have a diff for Justin's new driver?

Just populate the scsi/aic7xxx directory with the files found
here:

http://people.FreeBSD.org/~gibbs/linux/SRC/

You will need to merge in the Kconfig and Makefile for the scsi
directory, but if you are running a fairly recent kernel, you
can just overwrite those files with those supplied in the linux-2.5
archive supplied at the above URL.

--
Justin


* Re: aic7xxx woes in 2.5
  2002-12-15 20:09 ` Justin T. Gibbs
@ 2002-12-16  9:40   ` Andrew Morton
  2002-12-16 18:52     ` Justin T. Gibbs
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2002-12-16  9:40 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: linux-scsi

"Justin T. Gibbs" wrote:
> 
> > For about six months in the 2.5 series, using aic7xxx, about every fourth
> > boot one of my disks tends to get:
> >
> > (scsi1:A:4:0): parity-error detected in Data-in phase: SEQADDR(0x1ae) SCSIRATE(0x88)
> > scsi1:0:4:0: Attempting to queue an ABORT message
> >
> > This is invariably fatal.
> 
> ...
> 
> > This never happens in 2.4 kernels.
> >
> > It seems to happen a little more frequently on uniprocessor builds.
> >
> > So relevant questions would be:
> >
> > 1) Why does only 2.5 get the parity error?
> 
> Most likely different loads on your SCSI bus.  The driver can't "make up"
> SCSI bus parity errors.

It's very consistent.  Never seen on 2.4.

> > 2) Why does the recovery lock up?
> 
> I would actually have to know the sequencer instruction that we
> are blocked on in the clear_critical_sections code to be able to
> say.  Several recovery bugs have been fixed in later driver versions.

OK, let's move on then.
 
> > 3) Does anyone have a diff for Justin's new driver?
> 
> Just populate the scsi/aic7xxx directory with the files found
> here:
> 
> http://people.FreeBSD.org/~gibbs/linux/SRC/
> 
> You will need to merge in the Kconfig and Makefile for the scsi
> directory, but if you are running a fairly recent kernel, you
> can just overwrite those files with those supplied in the linux-2.5
> archive supplied at the above URL.

That's very awkward and will hamper efforts to get testing done.

I grafted it into the 2.5.52 tree.  The Kconfig entries for
aic7xxx_old seem to be lost.


The driver still has a serious bug in ahc_linux_queue_recovery_cmd().
It does

	ahc_unlock(ahc, &s);

where local variable `s' is uninitialised.  But that gets copied
into the CPU's interrupt flag.


The driver got through recognising the disks and then locked up
strangely:

Program received signal SIGEMT, Emulation trap.
cache_alloc_refill (cachep=0xd00675a0, flags=0) at include/linux/list.h:127
127             prev->next = next;
(gdb) bt
#0  cache_alloc_refill (cachep=0xd00675a0, flags=0) at include/linux/list.h:127
#1  0x00000246 in ?? ()
#2  0xc0135947 in kmalloc (size=256, flags=0) at mm/slab.c:1652
#3  0xc0239835 in ahc_linux_dv_inq (ahc=0xc175e400, cmd=0xc3dd0c00, devinfo=0xc3d77fb0, targ=0xc3dcee00, request_length=96)
    at drivers/scsi/aic7xxx/aic7xxx_osm.c:3303
#4  0xc0237f5d in ahc_linux_dv_target (ahc=0xc175e400, target_offset=4) at drivers/scsi/aic7xxx/aic7xxx_osm.c:2060
#5  0xc0237d47 in ahc_linux_dv_thread (data=0xc175e400) at drivers/scsi/aic7xxx/aic7xxx_osm.c:1955

This is an NMI watchdog interrupt.  In here:

1571                    while (slabp->inuse < cachep->num && batchcount--)
1572                            ac_entry(ac)[ac->avail++] =
1573                                    cache_alloc_one_tail(cachep, slabp);

Presumably due to errors in use of slab-allocated memory.


So I enabled slab debugging and:


Program received signal SIGTRAP, Trace/breakpoint trap.
0xc013606f in kfree (objp=0xc3da5ed4) at mm/slab.c:1452
1452                            BUG();
(gdb) bt
#0  0xc013606f in kfree (objp=0xc3da5ed4) at mm/slab.c:1452
#1  0xc023bca1 in ahc_linux_free_target (ahc=0xc175c000, targ=0xc3dcf800) at drivers/scsi/aic7xxx/aic7xxx_osm.c:4588
#2  0xc023bdbd in ahc_linux_free_device (ahc=0xc175c000, dev=0xc3da4ba4) at drivers/scsi/aic7xxx/aic7xxx_osm.c:4642
#3  0xc023c36d in ahc_done (ahc=0xc175c000, scb=0xc3d78070) at drivers/scsi/aic7xxx/aic7xxx_osm.c:4858
#4  0xc02296bd in ahc_run_qoutfifo (ahc=0xc175c000) at drivers/scsi/aic7xxx/aic7xxx_core.c:344
#5  0xc023b93a in ahc_linux_isr (irq=35, dev_id=0xc175c000, regs=0xd0003f74) at drivers/scsi/aic7xxx/aic7xxx_inline.h:600
#6  0xc010c710 in handle_IRQ_event (irq=35, regs=0xd0003f74, action=0xc3d9c974) at arch/i386/kernel/irq.c:210
#7  0xc010c8f2 in do_IRQ (regs=
      {ebx = -805298176, ecx = 384, edx = -805298176, esi = -1072657832, edi = 0, ebp = -805290072, eax = 17, xds = 104, xes = -1072693144, orig_eax = -221, eip = -1072657788, xcs = 96, eflags = 582, esp = -805290056, xss = -1072657690}) at arch/i386/kernel/irq.c:391
#8  0xc010b114 in common_interrupt () at include/linux/kallsyms.h:39
#9  0xc0108ae6 in cpu_idle () at arch/i386/kernel/process.c:144
#10 0xc039553a in start_secondary (unused=0xc038692d) at arch/i386/kernel/smpboot.c:467

That's in here:

        if (cachep->flags & SLAB_RED_ZONE) {
                objp -= BYTES_PER_WORD;
                if (xchg((unsigned long *)objp, RED_MAGIC1) != RED_MAGIC2)
                        /* Either write before start, or a double free. */
                        BUG();

Presumably a double free in ahc_linux_free_target()
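
The red-zone check quoted above can be illustrated outside the kernel. The sketch below is a userspace approximation under stated assumptions: `rz_alloc`, `rz_free` and the two marker values are hypothetical and are not the kernel's constants or API. A guard word in front of each object is set to one value while the object is live and swapped to another on free, so a second free sees the wrong value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Illustrative marker values, not the kernel's RED_MAGIC constants. */
#define ZONE_FREE   0x5A5A5A5AUL
#define ZONE_IN_USE 0xA5A5A5A5UL

/* Allocate size bytes with one guard word placed in front. */
static void *rz_alloc(size_t size)
{
    unsigned long *base = malloc(sizeof(unsigned long) + size);
    if (!base)
        return NULL;
    base[0] = ZONE_IN_USE;
    return base + 1;
}

/* Mark the object free. Returns false if the guard word was not
 * ZONE_IN_USE, i.e. a write before the start or a double free.
 * The backing memory is deliberately not released here, so that a
 * second call can still inspect the guard word safely. */
static bool rz_free(void *obj)
{
    assert(obj != NULL);
    unsigned long *guard = (unsigned long *)obj - 1;
    unsigned long old = *guard;
    *guard = ZONE_FREE;
    return old == ZONE_IN_USE;
}
```

The first `rz_free()` on an object returns true and the second returns false, which corresponds to the case where the kernel code above calls BUG().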

I can debug further if you like, but would really appreciate unified
diffs, thanks.


* Re: aic7xxx woes in 2.5
  2002-12-16  9:40   ` Andrew Morton
@ 2002-12-16 18:52     ` Justin T. Gibbs
  2002-12-16 19:03       ` Christoph Hellwig
  2002-12-16 19:08       ` Andrew Morton
  0 siblings, 2 replies; 11+ messages in thread
From: Justin T. Gibbs @ 2002-12-16 18:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-scsi

> The driver still has a serious bug in ahc_linux_queue_recovery_cmd().
> It does
> 
> 	ahc_unlock(ahc, &s);

The sole ahc_unlock() in that routine looks like this:

#if LINUX_VERSION_CODE < KERNEL_VERSION(2,5,0)
                ahc_unlock(ahc, &s);
#else
                spin_unlock_irq(ahc->platform_data->host->host_lock);
#endif

Since you are running 2.5.X, the ahc_unlock never occurs.
In 2.4.X, ahd_midlayer_entrypoint_lock() saves the cpu flags
for us, so the variable is never uninitialized in the case
where it actually is compiled in.
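
Which branch of that `#if` is compiled depends only on how the version code compares with `KERNEL_VERSION(2,5,0)`. A small sketch of that comparison follows. It redefines the encoding under the hypothetical name `DEMO_KERNEL_VERSION` so that it builds without kernel headers; the shift-based encoding is the one the kernel headers of this era use, and `uses_legacy_unlock` is an illustrative helper that does not exist in the driver.

```c
#include <assert.h>

/* Same major/minor/patch packing as the kernel's KERNEL_VERSION macro,
 * under a different name so no kernel headers are needed. */
#define DEMO_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Returns 1 if the 2.4-style ahc_unlock() branch would be selected for
 * the given version code, 0 if the 2.5 host_lock branch would be. */
static int uses_legacy_unlock(int version_code)
{
    assert(version_code > 0);
    return version_code < DEMO_KERNEL_VERSION(2, 5, 0);
}
```

For example, a 2.4.19 version code selects the `ahc_unlock()` branch and a 2.5.52 version code selects the `spin_unlock_irq()` branch, provided the source file being built actually contains this conditional.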

> The driver got through recognising the disks and then locked up
> strangely:
> 
> Program received signal SIGEMT, Emulation trap.
> cache_alloc_refill (cachep=0xd00675a0, flags=0) at include/linux/list.h:127
> 127             prev->next = next;
> (gdb) bt
> #0  cache_alloc_refill (cachep=0xd00675a0, flags=0) at include/linux/list.h:127
> #1  0x00000246 in ?? ()
> #2  0xc0135947 in kmalloc (size=256, flags=0) at mm/slab.c:1652
> #3  0xc0239835 in ahc_linux_dv_inq (ahc=0xc175e400, cmd=0xc3dd0c00, devinfo=0xc3d77fb0, targ=0xc3dcee00, request_length=96)
>     at drivers/scsi/aic7xxx/aic7xxx_osm.c:3303
> #4  0xc0237f5d in ahc_linux_dv_target (ahc=0xc175e400, target_offset=4) at drivers/scsi/aic7xxx/aic7xxx_osm.c:2060
> #5  0xc0237d47 in ahc_linux_dv_thread (data=0xc175e400) at drivers/scsi/aic7xxx/aic7xxx_osm.c:1955
> 
> This is an NMI watchdog interrupt.  In here:
> 
> 1571                    while (slabp->inuse < cachep->num && batchcount--)
> 1572                            ac_entry(ac)[ac->avail++] =
> 1573                                    cache_alloc_one_tail(cachep, slabp);
> 
> Presumably due to errors in use of slab-allocated memory.

I'll look into this today.

> I can debug further if you like, but would really appreciate unified
> diffs, thanks.

Against???  That's the whole problem with diffs.  Every person wants them
against something different.  If you can use BK, James Bottomley
has integrated the latest driver into here:

http://linux-scsi.bkbits.net/scsi-aic7xxx-2.5

I have not pulled down this repo to verify it yet though.

--
Justin


* Re: aic7xxx woes in 2.5
  2002-12-16 18:52     ` Justin T. Gibbs
@ 2002-12-16 19:03       ` Christoph Hellwig
  2002-12-16 19:08       ` Andrew Morton
  1 sibling, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2002-12-16 19:03 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: Andrew Morton, linux-scsi

On Mon, Dec 16, 2002 at 11:52:26AM -0700, Justin T. Gibbs wrote:
> Against???  That's the whole problem with diffs.  Every person wants them
> against something different.

Best idea is usually latest official release or current BK tree.
I've put the output of bk export -tpatch -r1.981,1.982 (i.e. vs latest BK)
at http://verein.lst.de/~hch/aic7xxx-2.5.52.patch.gz




* Re: aic7xxx woes in 2.5
  2002-12-16 18:52     ` Justin T. Gibbs
  2002-12-16 19:03       ` Christoph Hellwig
@ 2002-12-16 19:08       ` Andrew Morton
  2002-12-16 19:26         ` Justin T. Gibbs
  1 sibling, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2002-12-16 19:08 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: linux-scsi

"Justin T. Gibbs" wrote:
> 
> > The driver still has a serious bug in ahc_linux_queue_recovery_cmd().
> > It does
> >
> >       ahc_unlock(ahc, &s);
> 
> The sole ahc_unlock() in that routine looks like this:
> 
> #if LINUX_VERSION_CODE < KERNEL_VERSION(2,5,0)
>                 ahc_unlock(ahc, &s);
> #else
>                 spin_unlock_irq(ahc->platform_data->host->host_lock);
> #endif
> 
> Since you are running 2.5.X, the ahc_unlock never occurs.
> In 2.4.X, ahd_midlayer_entrypoint_lock() saves the cpu flags
> for us, so the variable is never uninitialized in the case
> where it actually is compiled in.

In 2.5.52 uniprocessor a

	make drivers/scsi/aic7xxx/aic7xxx_linux.i

gives:

static __inline void
ahc_unlock(struct ahc_softc *ahc, unsigned long *flags)
{
        do { do { (void)(  &ahc->platform_data->spin_lock  ); } while(0) ; __asm__ __volatile__("pushl %0 ; popfl":   :"g" (   *flags  ):"memory", "cc") ; do { } while (0) ; } while (0) ;
}

Which is loading *flags into the CPU's interrupt status register.

On 2.5.52 SMP:

static __inline void
ahc_unlock(struct ahc_softc *ahc, unsigned long *flags)
{
        do { _raw_spin_unlock( &ahc->platform_data->spin_lock ); __asm__ __volatile__("pushl %0 ; popfl":   :"g" (   *flags  ):"memory", "cc") ; do { } while (0) ; } while (0) ;
}

ditto.


And it is being called from ahc_linux_queue_recovery_cmd:

        if (wait) {
                struct timer_list timer;
                int ret;

                ahc_unlock(ahc, &s);
                init_timer(&timer);

So I think the problem is still there.
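
The concern can be shown with a userspace simulation of the save/restore pattern. This is only a sketch: `irqs_enabled`, `demo_lock` and `demo_unlock` are hypothetical stand-ins, not the driver's or the kernel's functions. The point is that the unlock side restores whatever the flags word holds, so it is only correct when a matching lock wrote that word first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simulated CPU interrupt-enable state (userspace stand-in). */
static bool irqs_enabled = true;

/* Save the current state into *flags, then disable interrupts. */
static void demo_lock(unsigned long *flags)
{
    assert(flags != NULL);
    *flags = irqs_enabled ? 1UL : 0UL;
    irqs_enabled = false;
}

/* Restore whatever state *flags holds. If *flags was never written by
 * demo_lock(), this restores an arbitrary value, which is the problem
 * described for the local variable `s' above. */
static void demo_unlock(const unsigned long *flags)
{
    assert(flags != NULL);
    irqs_enabled = (*flags != 0);
}
```

When lock and unlock are paired on the same variable, the state that existed before the lock is restored exactly, including the case where interrupts were already disabled.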

> ..
> >
> > 1571                    while (slabp->inuse < cachep->num && batchcount--)
> > 1572                            ac_entry(ac)[ac->avail++] =
> > 1573                                    cache_alloc_one_tail(cachep, slabp);
> >
> > Presumably due to errors in use of slab-allocated memory.
> 
> I'll look into this today.

Thanks.
 
> > I can debug further if you like, but would really appreciate unified
> > diffs, thanks.
> 
> Against???

The current devel kernel.  Nobody uses anything else, and if they
do, integration of diffs is much easier than a wholesale overwrite.

This is why everyone uses them.


* Re: aic7xxx woes in 2.5
  2002-12-16 19:08       ` Andrew Morton
@ 2002-12-16 19:26         ` Justin T. Gibbs
  0 siblings, 0 replies; 11+ messages in thread
From: Justin T. Gibbs @ 2002-12-16 19:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-scsi

>> Since you are running 2.5.X, the ahc_unlock never occurs.
>> In 2.4.X, ahd_midlayer_entrypoint_lock() saves the cpu flags
>> for us, so the variable is never uninitialized in the case
>> where it actually is compiled in.
> 
> In 2.5.52 uniprocessor a
> 
> 	make drivers/scsi/aic7xxx/aic7xxx_linux.i
> 
> gives:
> 
> static __inline void
> ahc_unlock(struct ahc_softc *ahc, unsigned long *flags)
> {
>         do { do { (void)(  &ahc->platform_data->spin_lock  ); } while(0) ; __asm__ __volatile__("pushl %0 ; popfl":   :"g" (   *flags  ):"memory", "cc") ; do { } while (0) ; } while (0) ;
> }
> 
> Which is loading *flags into the CPU's interrupt status register.

Since I wrote the routine, I'm well aware of how it operates.

> And it is being called from ahc_linux_queue_recovery_cmd:
> 
>         if (wait) {
>                 struct timer_list timer;
>                 int ret;
> 
>                 ahc_unlock(ahc, &s);
>                 init_timer(&timer);

You must have botched the integration of the latest driver from here:
http://people.FreeBSD.org/~gibbs/linux/SRC.

I just downloaded it again (both the archive from the 10th and the 13th)
and neither uses ahc_unlock under 2.5.X in ahc_linux_queue_recovery_cmd().
What are the $Id$ strings at the top of the file?

--
Justin
