Hi -

I've been having some problems with my USB CD-ROM burner (HP 8200) - it
locks up the machine occasionally.  This is with Linux 2.4.6.

I saw a posting to this list (linux-usb-devel) by Cody Pisto with some
HP 8200 patches (17 Jul 2001), but the geocrawler archive doesn't
archive the attachments - or at least I couldn't figure out how to
retrieve them.  I wouldn't mind if somebody (Cody?) could send me a
copy, but here's what I've found out...

Of course, first you have to patch drivers/usb/Config.in to add an
HP8200 option.

Next, the lockups.  They're caused by an attempt to lock the
io_request_lock spinlock while it's already locked.  I'm running on a
single processor machine.  I'm posting two patches to linux-kernel - one
is enhanced spinlock debugging code that reveals the problem, the other
is a remote debugger stub for Intel so I can use gdb on the kernel, as
is already possible on sparc & ppc.

In short, here's the problem:

The trouble starts when scsi_try_to_abort_command is called (for reasons
I'm still unclear on).  This function is in drivers/scsi/scsi_error.c:

     755 STATIC int scsi_try_to_abort_command(Scsi_Cmnd * SCpnt, int timeout)
     756 {
     757         int rtn;
     758         unsigned long flags;
     759 
     760         SCpnt->eh_state = FAILED;   /* Until we come up with something better */
     761 
     762         if (SCpnt->host->hostt->eh_abort_handler == NULL) {
     763                 return FAILED;
     764         }
     765         /* 
     766          * scsi_done was called just after the command timed out and before
     767          * we had a chance to process it. (DB)
     768          */
     769         if (SCpnt->serial_number == 0)
     770                 return SUCCESS;
     771 
     772         SCpnt->owner = SCSI_OWNER_LOWLEVEL;
     773 
     774         spin_lock_irqsave(&io_request_lock, flags);
     775         rtn = SCpnt->host->hostt->eh_abort_handler(SCpnt);
     776         spin_unlock_irqrestore(&io_request_lock, flags);
     777         return rtn;
     778 }

Notice, at the end, the call to eh_abort_handler on line 775, made while
the io_request_lock is held (it is taken on line 774 and released on
line 776).

For the USB CD-ROM burner, eh_abort_handler is a pointer to
command_abort in drivers/usb/storage/scsiglue.c:

     173 static int command_abort( Scsi_Cmnd *srb )
     174 {
     175         struct us_data *us = (struct us_data *)srb->host->hostdata[0];
     176 
     177         US_DEBUGP("command_abort() called\n");
     178 
     179         /* if we're stuck waiting for an IRQ, simulate it */
     180         if (atomic_read(us->ip_wanted)) {
     181                 US_DEBUGP("-- simulating missing IRQ\n");
     182                 up(&(us->ip_waitq));
     183         }
     184 
     185         /* if the device has been removed, this worked */
     186         if (!us->pusb_dev) {
     187                 US_DEBUGP("-- device removed already\n");
     188                 return SUCCESS;
     189         }
     190 
     191         /* if we have an urb pending, let's wake the control thread up */
     192         if (us->current_urb->status == -EINPROGRESS) {
     193                 /* cancel the URB -- this will automatically wake the thread */
     194                 usb_unlink_urb(us->current_urb);
     195 
     196                 /* wait for us to be done */
     197                 down(&(us->notify));
     198                 return SUCCESS;
     199         }
     200 
     201         US_DEBUGP ("-- nothing to abort\n");
     202         return FAILED;
     203 }

The problem is the down on line 197.  It can sleep, so the kernel
schedules another task while the io_request_lock is still held.  Now, if
anything else comes along that needs the io_request_lock, and runs before
the down completes, it spins on the lock forever: on a single processor
the lock holder never gets to run again to release it, so the kernel
locks up.  Lots of stuff can actually trigger the lockup; I've seen a
page fault trying to read something in from disk cause it, as well as
just a normal disk read from user space.
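
To make the sequence concrete, here's a toy user-space model of it.  This
is only a sketch, not kernel code: io_lock_held, other_task_does_io,
sleep_in_down and abort_path_model are names I made up for illustration,
and the "scheduler" is just a function call.  The drop_lock_around_sleep
flag is there purely for contrast; whether dropping the lock at that
point would be safe in the real command_abort is not something this
model can tell you.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for io_request_lock on a single CPU (hypothetical name). */
static bool io_lock_held;

/* A second task that needs the lock, e.g. an ordinary disk read. */
static bool other_task_does_io(void)
{
    if (io_lock_held)
        return false;   /* real kernel: spins forever, machine locks up */
    io_lock_held = true;    /* take the lock, do the I/O ... */
    io_lock_held = false;   /* ... and release it again */
    return true;
}

/* Model of a sleeping down(): the scheduler lets another task run. */
static bool sleep_in_down(void)
{
    return other_task_does_io();
}

/*
 * Model of the abort path.  Returns true if the task that ran during
 * the sleep could take the lock, false if it would have hung.
 */
bool abort_path_model(bool drop_lock_around_sleep)
{
    bool other_task_ok;

    assert(!io_lock_held);            /* nobody holds the lock on entry */
    io_lock_held = true;              /* spin_lock_irqsave() */
    if (drop_lock_around_sleep)
        io_lock_held = false;         /* contrast case only */
    other_task_ok = sleep_in_down();  /* down(&(us->notify)) */
    if (drop_lock_around_sleep)
        io_lock_held = true;
    io_lock_held = false;             /* spin_unlock_irqrestore() */
    return other_task_ok;
}
```

Calling abort_path_model(false) mirrors the code as quoted above: the
second task finds the lock held and can never get it.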

I'm attaching a kernel gdb trace of one of these lockups.  It's a bit
cryptic, because the kernel gdb doesn't let me switch between tasks, so
I have to read back through a stack dump manually.  Basically, the trace
starts with a BUG() in my revised spinlock code that detects when the
same processor that holds the lock attempts to grab it again.  The
spinlock recorded the PC and task_struct when the lock was first
grabbed, so even though we're looking at the moment when the second task
came along and tried to grab it again, we can use the stored information
to find 1) which task grabbed the lock; 2) what its PC was when it
grabbed it; and 3) (by reading the stack trace) what it's doing now.
In this trace, the answers to those questions are: 1) pid 1370
(comm = "scsi_eh"; unclear what that is); 2) the spin_lock_irqsave in
scsi_try_to_abort_command; and 3) scheduled out from the down on line 197.
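
For anyone who hasn't seen the spinlock patch, the idea behind the check
can be sketched roughly like this.  This is a simplified user-space
illustration, not the patch itself: struct dbg_spinlock, dbg_spin_lock,
dbg_spin_unlock and the result codes are all hypothetical names, and a
string stands in for the recorded PC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum dbg_result { DBG_ACQUIRED, DBG_SELF_DEADLOCK, DBG_CONTENDED };

/* Hypothetical debug lock that remembers who took it and where. */
struct dbg_spinlock {
    bool locked;
    int owner_cpu;           /* CPU that took the lock */
    int owner_pid;           /* task that was running at the time */
    const char *owner_site;  /* stand-in for the recorded PC */
};

enum dbg_result dbg_spin_lock(struct dbg_spinlock *l, int cpu, int pid,
                              const char *site)
{
    if (l->locked) {
        /*
         * Same CPU already holds it: spinning can never succeed, so
         * the real code would BUG() here.  Owner info is left intact
         * so the first locker can still be identified.
         */
        if (l->owner_cpu == cpu)
            return DBG_SELF_DEADLOCK;
        return DBG_CONTENDED;    /* another CPU holds it: normal spin */
    }
    l->locked = true;
    l->owner_cpu = cpu;
    l->owner_pid = pid;
    l->owner_site = site;
    return DBG_ACQUIRED;
}

void dbg_spin_unlock(struct dbg_spinlock *l)
{
    assert(l->locked);           /* unlocking a free lock is a bug */
    l->locked = false;
    l->owner_site = NULL;
}
```

The point is that the second locker trips the check, but the fields
still describe the first one, which is what lets you work backwards
from the BUG() to the task that is sleeping with the lock held.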

Anyway, I don't know enough about this code to try and figure out what
the fix should be, so maybe somebody on this list can suggest it.  Then
I'll need to figure out why scsi_try_to_abort_command() is being called
in the first place - any ideas?  It's about a 50/50 proposition: during
an entire CD burn, sometimes it locks up, and sometimes it completes the
whole thing.

And like I said, I'm attaching the kernel gdb trace... as an
attachment... so geocrawler can lose it too..

-- 
                                        -bwb

                                        Brent Baccala
                                        baccala@freesoft.org
