Re: 2.5.63-mm2

public inbox for linux-scsi@vger.kernel.org

* Re: 2.5.63-mm2
       [not found] ` <1046815078.12931.79.camel@ibm-b>
@ 2003-03-05  7:40   ` Andrew Morton
  2003-03-05 17:38     ` 2.5.63-mm2 Mike Anderson
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2003-03-05  7:40 UTC (permalink / raw)
  To: Mark Wong; +Cc: linux-scsi

Mark Wong <markw@osdl.org> wrote:
>
> It appears something is conflicting with the old Adaptec AIC7xxx driver.
> My system halts when it attempts to probe the devices (I think that's
> where it stops). So I started using the new AIC7xxx driver and all is
> well. I don't see any messages on the console that point to a cause. Is
> there someplace I can look for a clue to the problem?
> 
> I actually didn't realize I was using the old driver and have no qualms
> about not using it, but if it'll help someone else, I can help gather
> information.
> 

Well, aic7xxx_old fails for me in 2.5.62 as well as in later kernels,
always in the same way.  It gets stuck at boot:

(gdb) bt
#0  wait_for_completion (x=0xc8dbfea8) at kernel/sched.c:1408
#1  0xc024295d in scsi_wait_req (SRpnt=0xf7e22a00, cmnd=0xc8dbfeec, buffer=0xc03fce60, bufflen=36, timeout=6000, 
    retries=3) at drivers/scsi/scsi.c:673
#2  0xc024839f in scsi_probe_lun (sreq=0xf7e22a00, inq_result=0xc03fce60 "", bflags=0xc8dbff20)
    at drivers/scsi/scsi_scan.c:1050
#3  0xc024877e in scsi_probe_and_add_lun (host=0xf7e92600, q=0xc8dbff80, channel=0, id=2, lun=0, bflagsp=0xc8dbff54)
    at drivers/scsi/scsi_scan.c:1372
#4  0xc0248a8e in scsi_scan_target (shost=0xf7e92600, q=0xc8dbff80, channel=0, id=2) at drivers/scsi/scsi_scan.c:1835
#5  0xc0248b7e in scsi_scan_host (shost=0xf7e92600) at drivers/scsi/scsi_scan.c:1892
#6  0xc0243b9e in scsi_add_host (shost=0xf7e92600, dev=0x0) at drivers/scsi/hosts.c:314
#7  0xc0243fbd in scsi_register_host (shost_tp=<incomplete type>) at drivers/scsi/hosts.c:513
#8  0xc03b3259 in init_this_scsi_driver () at drivers/scsi/scsi_module.c:38
#9  0xc03a07d5 in do_initcalls () at init/main.c:472
#10 0xc03a0803 in do_basic_setup () at init/main.c:497
#11 0xc01050a7 in init (unused=0x0) at init/main.c:535


* Re: 2.5.63-mm2
  2003-03-05  7:40   ` 2.5.63-mm2 Andrew Morton
@ 2003-03-05 17:38     ` Mike Anderson
  2003-03-05 18:52       ` 2.5.63-mm2 Mike Anderson
  0 siblings, 1 reply; 6+ messages in thread
From: Mike Anderson @ 2003-03-05 17:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mark Wong, linux-scsi

I can also reproduce the problem on my system now that I have switched
from the new AIC7xxx driver to the old one. I am looking at the problem now.

Andrew Morton [akpm@digeo.com] wrote:
> Mark Wong <markw@osdl.org> wrote:
> >
> > It appears something is conflicting with the old Adaptec AIC7xxx driver.
> > My system halts when it attempts to probe the devices (I think that's
> > where it stops). So I started using the new AIC7xxx driver and all is
> > well. I don't see any messages on the console that point to a cause. Is
> > there someplace I can look for a clue to the problem?
> > 
> > I actually didn't realize I was using the old driver and have no qualms
> > about not using it, but if it'll help someone else, I can help gather
> > information.
> > 
> 
> Well, aic7xxx_old fails for me in 2.5.62 as well as in later kernels,
> always in the same way.  It gets stuck at boot:
> 
> (gdb) bt
> #0  wait_for_completion (x=0xc8dbfea8) at kernel/sched.c:1408
> #1  0xc024295d in scsi_wait_req (SRpnt=0xf7e22a00, cmnd=0xc8dbfeec, buffer=0xc03fce60, bufflen=36, timeout=6000, 
>     retries=3) at drivers/scsi/scsi.c:673
> #2  0xc024839f in scsi_probe_lun (sreq=0xf7e22a00, inq_result=0xc03fce60 "", bflags=0xc8dbff20)
>     at drivers/scsi/scsi_scan.c:1050
> #3  0xc024877e in scsi_probe_and_add_lun (host=0xf7e92600, q=0xc8dbff80, channel=0, id=2, lun=0, bflagsp=0xc8dbff54)
>     at drivers/scsi/scsi_scan.c:1372
> #4  0xc0248a8e in scsi_scan_target (shost=0xf7e92600, q=0xc8dbff80, channel=0, id=2) at drivers/scsi/scsi_scan.c:1835
> #5  0xc0248b7e in scsi_scan_host (shost=0xf7e92600) at drivers/scsi/scsi_scan.c:1892
> #6  0xc0243b9e in scsi_add_host (shost=0xf7e92600, dev=0x0) at drivers/scsi/hosts.c:314
> #7  0xc0243fbd in scsi_register_host (shost_tp=<incomplete type>) at drivers/scsi/hosts.c:513
> #8  0xc03b3259 in init_this_scsi_driver () at drivers/scsi/scsi_module.c:38
> #9  0xc03a07d5 in do_initcalls () at init/main.c:472
> #10 0xc03a0803 in do_basic_setup () at init/main.c:497
> #11 0xc01050a7 in init (unused=0x0) at init/main.c:535
-andmike
--
Michael Anderson
andmike@us.ibm.com



* Re: 2.5.63-mm2
  2003-03-05 17:38     ` 2.5.63-mm2 Mike Anderson
@ 2003-03-05 18:52       ` Mike Anderson
  2003-03-05 21:33         ` 2.5.63-mm2 Patrick Mansfield
  0 siblings, 1 reply; 6+ messages in thread
From: Mike Anderson @ 2003-03-05 18:52 UTC (permalink / raw)
  To: Andrew Morton, Mark Wong, linux-scsi

The patch below fixed the problem on my system. I had the list-empty
checks reversed for the case where the abort and the bus device reset
fail. The condition that causes the error handler to run in the first
place is still unknown. I will look at it when I get a chance.

Mike Anderson [andmike@us.ibm.com] wrote:
> I can also reproduce the problem on my system now that I have switched
> from the new AIC7xxx driver to the old one. I am looking at the problem now.
> 
> Andrew Morton [akpm@digeo.com] wrote:
> > Mark Wong <markw@osdl.org> wrote:
> > >
> > > It appears something is conflicting with the old Adaptec AIC7xxx driver.
> > > My system halts when it attempts to probe the devices (I think that's
> > > where it stops). So I started using the new AIC7xxx driver and all is
> > > well. I don't see any messages on the console that point to a cause. Is
> > > there someplace I can look for a clue to the problem?
> > > 
> > > I actually didn't realize I was using the old driver and have no qualms
> > > about not using it, but if it'll help someone else, I can help gather
> > > information.
> > > 
> > 
-andmike
--
Michael Anderson
andmike@us.ibm.com


=====
name:		00_scsi_error_ready_devs-1.diff
version:	2003-03-05.10:39:28-0800
against:	2.5.63

 scsi_error.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

=====
===== drivers/scsi/scsi_error.c 1.38 vs edited =====
--- 1.38/drivers/scsi/scsi_error.c	Sat Feb 22 08:17:01 2003
+++ edited/drivers/scsi/scsi_error.c	Wed Mar  5 10:14:22 2003
@@ -1490,9 +1490,9 @@
 			       struct list_head *work_q,
 			       struct list_head *done_q)
 {
-	if (scsi_eh_bus_device_reset(shost, work_q, done_q))
-		if (scsi_eh_bus_reset(shost, work_q, done_q))
-			if (scsi_eh_host_reset(work_q, done_q))
+	if (!scsi_eh_bus_device_reset(shost, work_q, done_q))
+		if (!scsi_eh_bus_reset(shost, work_q, done_q))
+			if (!scsi_eh_host_reset(work_q, done_q))
 				scsi_eh_offline_sdevs(work_q, done_q);
 }
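
For illustration, the effect of the reversed checks can be sketched in
user-space C. This is only a toy model under stated assumptions, not the
kernel code: struct eh_queue, eh_step() and the recovers[] array are
hypothetical stand-ins, and each recovery step is assumed (as the "list
empty checks" wording above suggests) to return true once no commands
remain to be recovered.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the error handler's work queue. */
struct eh_queue {
	int pending;    /* commands still needing recovery */
	int steps_run;  /* recovery steps attempted so far */
	bool offlined;  /* true if the last-resort offline step ran */
};

/*
 * One recovery step (device reset, bus reset or host reset in the model).
 * 'recovers' is how many commands this step manages to recover.
 * Assumed convention: return true when the work queue is empty afterwards.
 */
static bool eh_step(struct eh_queue *q, int recovers)
{
	int n = recovers < q->pending ? recovers : q->pending;

	q->steps_run++;
	q->pending -= n;
	return q->pending == 0;
}

/* Pre-patch shape: escalates only when the previous step emptied the queue. */
static void ready_devs_buggy(struct eh_queue *q, const int recovers[3])
{
	if (eh_step(q, recovers[0]))
		if (eh_step(q, recovers[1]))
			if (eh_step(q, recovers[2]))
				q->offlined = true;
}

/* Post-patch shape: escalates only while work remains on the queue. */
static void ready_devs_fixed(struct eh_queue *q, const int recovers[3])
{
	if (!eh_step(q, recovers[0]))
		if (!eh_step(q, recovers[1]))
			if (!eh_step(q, recovers[2]))
				q->offlined = true;
}
```

In this model, with one pending command and a device reset and bus reset
that recover nothing, the pre-patch variant stops after the first step and
leaves the command pending, while the post-patch variant runs all three
steps and reaches the offline fallback only if the host reset also fails.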
 



* Re: 2.5.63-mm2
  2003-03-05 18:52       ` 2.5.63-mm2 Mike Anderson
@ 2003-03-05 21:33         ` Patrick Mansfield
  2003-03-05 22:01           ` 2.5.63-mm2 Mike Anderson
  2003-03-06  7:38           ` 2.5.63-mm2 Matthew Jacob
  0 siblings, 2 replies; 6+ messages in thread
From: Patrick Mansfield @ 2003-03-05 21:33 UTC (permalink / raw)
  To: Andrew Morton, Mark Wong, linux-scsi

On Wed, Mar 05, 2003 at 10:52:31AM -0800, Mike Anderson wrote:
> The patch below fixed the problem on my system. I had the list-empty
> checks reversed for the case where the abort and the bus device reset
> fail. The condition that causes the error handler to run in the first
> place is still unknown. I will look at it when I get a chance.

Mike -

With your patch, I am able to boot again using the feral driver with an
isp1020 (on a NUMAQ system).

However, I still do not know why the feral driver gets a timeout but
qlogicisp does not. It is apparently a read that times out while mounting
root for the first time (readonly), so it is not the first read sent to
the drive (the partition code should already have sent one). It could be
the queue_depth settings (qlogicisp sets it to 1; I have feral setting it
to 16).

On boot (with SCSI error logging on) I see:

[ ... ]

TCP: Hash tables configured (established 524288 bind 65536)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
isp6: Loop ID 125, AL_PA 0x1, Port ID 0x1, Loop State 0x2, Topology 'Private Loop'
Error handler scsi_eh_0 waking up
Error handler scsi_eh_0 waking up
scsi_eh_prt_fail_stats: 0:0:0:0 cmds failed: 0, cancel: 1
Total of 1 commands on 1 devices require eh work
scsi_eh_0: aborting cmd:0xc3f4e600
scsi_eh_0: aborting cmd failed:0xc3f4e600
scsi_eh_0: Sending BDR sdev: 0xc3fba600
isp0: Interrupting Mailbox Command (0x17) Timeout
isp0: Mailbox Command 'ABORT TARGET' failed (TIMEOUT)
scsi_eh_0: BDR failed sdev:0xc3fba600
scsi_eh_0: Sending BRST chan: 0
scsi_try_bus_reset: Snd Bus RST
isp0: Interrupting Mailbox Command (0x18) Timeout
isp0: Mailbox Command 'BUS RESET' failed (TIMEOUT)
scsi_eh_0: BRST failed chan: 0
scsi_eh_0: Sending HRST
scsi_try_host_reset: Snd Host RST
isp0: Differential Mode
isp0: Ultra Mode Capable
isp0: Board Type 1040B, Chip Revision 0x5, loaded F/W Revision 4.66.0
isp0: Last F/W revision was 4.40.0
scsi_eh_done scmd: c3f4e600 result: 2
scsi_send_eh_cmnd: scmd: c3f4e600, rtn:2002
scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
scsi_eh_tur: scmd c3f4e600 rtn 2002
scsi_eh_0: flush retry cmd: c3f4e600
scsi_restart_operations: waking up host to restart
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 112k freed
INIT: version 2.78 booting

[ ... ]

-- Patrick Mansfield


* Re: 2.5.63-mm2
  2003-03-05 21:33         ` 2.5.63-mm2 Patrick Mansfield
@ 2003-03-05 22:01           ` Mike Anderson
  2003-03-06  7:38           ` 2.5.63-mm2 Matthew Jacob
  1 sibling, 0 replies; 6+ messages in thread
From: Mike Anderson @ 2003-03-05 22:01 UTC (permalink / raw)
  To: Patrick Mansfield; +Cc: Andrew Morton, Mark Wong, linux-scsi

Thanks for trying this, Patrick. I have added this configuration to my
list of available systems that exhibit error handling signatures, for
future testing.

Patrick Mansfield [patmans@us.ibm.com] wrote:
> On Wed, Mar 05, 2003 at 10:52:31AM -0800, Mike Anderson wrote:
> > The patch below fixed the problem on my system. I had the list-empty
> > checks reversed for the case where the abort and the bus device reset
> > fail. The condition that causes the error handler to run in the first
> > place is still unknown. I will look at it when I get a chance.
> 
> Mike -
> 
> With your patch, I am able to boot again using the feral driver with an
> isp1020 (on a NUMAQ system).
> 
> However, I still do not know why the feral driver gets a timeout but
> qlogicisp does not. It is apparently a read that times out while mounting
> root for the first time (readonly), so it is not the first read sent to
> the drive (the partition code should already have sent one). It could be
> the queue_depth settings (qlogicisp sets it to 1; I have feral setting it
> to 16).
> 
-andmike
--
Michael Anderson
andmike@us.ibm.com



* Re: 2.5.63-mm2
  2003-03-05 21:33         ` 2.5.63-mm2 Patrick Mansfield
  2003-03-05 22:01           ` 2.5.63-mm2 Mike Anderson
@ 2003-03-06  7:38           ` Matthew Jacob
  1 sibling, 0 replies; 6+ messages in thread
From: Matthew Jacob @ 2003-03-06  7:38 UTC (permalink / raw)
  To: Patrick Mansfield; +Cc: Andrew Morton, Mark Wong, linux-scsi

> However, I still do not know why the feral driver gets a timeout but
> qlogicisp does not. It is apparently a read that times out while mounting
> root for the first time (readonly), so it is not the first read sent to
> the drive

The command in question that's timing out is "ABOUT FIRMWARE".

If at reset you see the pattern "ISP " in mailbox registers 1, 2, 3,
you're supposed to be at the hard PROM, i.e., no firmware has been
loaded and set running by system boot procedures. Unfortunately, on
some platforms the absence of this pattern means that the card is
neither in the hard reset state (i.e., running out of its PROM) nor
actually running f/w. Hence the timeout on the command.
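
As a rough illustration of the check described above, here is a minimal
sketch in C. It is not taken from the feral driver: the function name,
the enum, and in particular the byte layout (two ASCII characters per
16-bit mailbox register, first character in the high byte) are
assumptions made for the example. The point it tries to capture is that
a missing signature does not prove that firmware is running.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical classification of the card state right after reset. */
enum isp_reset_state {
	ISP_AT_HARD_PROM,  /* signature present: no firmware loaded yet */
	ISP_STATE_UNKNOWN  /* signature absent: firmware may or may not be running */
};

/*
 * Classify the card from the values read out of mailbox registers 1..3.
 * Assumption: each register carries two ASCII characters, high byte first,
 * so "ISP " would occupy registers 1 and 2.
 */
static enum isp_reset_state isp_classify(uint16_t mbox1, uint16_t mbox2,
					 uint16_t mbox3)
{
	const uint16_t regs[3] = { mbox1, mbox2, mbox3 };
	char sig[6];
	size_t i;

	/* Unpack the three registers into six bytes, high byte first. */
	for (i = 0; i < 3; i++) {
		sig[2 * i] = (char)(regs[i] >> 8);
		sig[2 * i + 1] = (char)(regs[i] & 0xff);
	}

	/* Only the first four characters, "ISP ", are compared. */
	return memcmp(sig, "ISP ", 4) == 0 ? ISP_AT_HARD_PROM
					   : ISP_STATE_UNKNOWN;
}
```

A caller following this sketch would send firmware queries only after
establishing by some other means that firmware is running, rather than
inferring it from ISP_STATE_UNKNOWN.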

I've always considered this a relatively minor buglet as it doesn't
occur on *that* many systems but I should probably fix it.

-matt



Thread overview: 6+ messages
     [not found] <20030302180959.3c9c437a.akpm@digeo.com>
     [not found] ` <1046815078.12931.79.camel@ibm-b>
2003-03-05  7:40   ` 2.5.63-mm2 Andrew Morton
2003-03-05 17:38     ` 2.5.63-mm2 Mike Anderson
2003-03-05 18:52       ` 2.5.63-mm2 Mike Anderson
2003-03-05 21:33         ` 2.5.63-mm2 Patrick Mansfield
2003-03-05 22:01           ` 2.5.63-mm2 Mike Anderson
2003-03-06  7:38           ` 2.5.63-mm2 Matthew Jacob
