Hi -

I've been having some problems with my USB CD-ROM burner (HP 8200): it occasionally locks up the machine. This is with Linux 2.4.6. I saw a posting to this list (linux-usb-devel) by Cody Pisto with some HP 8200 patches (17 Jul 2001), but the geocrawler archive doesn't archive the attachments, or at least I couldn't figure out how to retrieve them. I wouldn't mind if somebody (Cody?) could send me a copy, but here's what I've found out...

Of course, first you have to patch drivers/usb/Config.in to add an HP8200 option.

Next, the lockups. They're caused by an attempt to lock the io_request_lock spinlock while it's already locked. I'm running on a single-processor machine. I'm posting two patches to linux-kernel: one is enhanced spinlock debugging code that reveals the problem, the other is a remote debugger stub for Intel so I can use gdb on the kernel as on sparc and ppc.

In short, here's the problem. The trouble starts when scsi_try_to_abort_command is called (for reasons I'm still unclear on). This function is in drivers/scsi/scsi_error.c:

    755 STATIC int scsi_try_to_abort_command(Scsi_Cmnd * SCpnt, int timeout)
    756 {
    757     int rtn;
    758     unsigned long flags;
    759
    760     SCpnt->eh_state = FAILED;   /* Until we come up with something better */
    761
    762     if (SCpnt->host->hostt->eh_abort_handler == NULL) {
    763         return FAILED;
    764     }
    765     /*
    766      * scsi_done was called just after the command timed out and before
    767      * we had a chance to process it. (DB)
    768      */
    769     if (SCpnt->serial_number == 0)
    770         return SUCCESS;
    771
    772     SCpnt->owner = SCSI_OWNER_LOWLEVEL;
    773
    774     spin_lock_irqsave(&io_request_lock, flags);
    775     rtn = SCpnt->host->hostt->eh_abort_handler(SCpnt);
    776     spin_unlock_irqrestore(&io_request_lock, flags);
    777     return rtn;
    778 }

Notice, at the end, the call to eh_abort_handler while the io_request_lock is held (lines 774-776).

For the USB CD-ROM burner, eh_abort_handler is a pointer to command_abort in drivers/usb/storage/scsiglue.c:

    173 static int command_abort( Scsi_Cmnd *srb )
    174 {
    175     struct us_data *us = (struct us_data *)srb->host->hostdata[0];
    176
    177     US_DEBUGP("command_abort() called\n");
    178
    179     /* if we're stuck waiting for an IRQ, simulate it */
    180     if (atomic_read(us->ip_wanted)) {
    181         US_DEBUGP("-- simulating missing IRQ\n");
    182         up(&(us->ip_waitq));
    183     }
    184
    185     /* if the device has been removed, this worked */
    186     if (!us->pusb_dev) {
    187         US_DEBUGP("-- device removed already\n");
    188         return SUCCESS;
    189     }
    190
    191     /* if we have an urb pending, let's wake the control thread up */
    192     if (us->current_urb->status == -EINPROGRESS) {
    193         /* cancel the URB -- this will automatically wake the thread */
    194         usb_unlink_urb(us->current_urb);
    195
    196         /* wait for us to be done */
    197         down(&(us->notify));
    198         return SUCCESS;
    199     }
    200
    201     US_DEBUGP ("-- nothing to abort\n");
    202     return FAILED;
    203 }

The problem is the down on line 197. It causes the kernel to schedule while the io_request_lock is held. Now, if anything else comes along that needs the io_request_lock, and runs before the down completes, the kernel locks up. Lots of stuff can actually trigger the lockup; I've seen a page fault trying to read something in from disk cause it, as well as just a normal disk read from user space.

I'm attaching a kernel gdb trace of one of these lockups. It's a bit cryptic, because the kernel gdb doesn't let me switch between tasks, so I have to read back through a stack dump manually. Basically, the trace starts with a BUG() in my revised spinlock code that detects when the same processor that holds the lock attempts to grab it again.
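
To make that check concrete, here is a rough user-space sketch of the idea. This is not the patch I'm posting to linux-kernel; the names (dbg_lock, dbg_lock_acquire, demo_sleep_while_locked) and the task ids are made up for illustration, and it assumes a single processor, where finding the lock already held can only mean the holder was scheduled away without releasing it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical debug lock for one CPU: remembers who took it and where. */
struct dbg_lock {
    int locked;          /* nonzero while held */
    int owner;           /* id of the task that took the lock */
    const char *where;   /* call site that took the lock */
};

enum dbg_result { DBG_OK = 0, DBG_RELOCK = 1 };

/*
 * Take the lock. On a uniprocessor, spinning on a held lock never ends,
 * because the holder cannot run to release it. Report that case instead
 * of spinning (kernel debugging code would BUG() at this point).
 */
static enum dbg_result dbg_lock_acquire(struct dbg_lock *l, int task,
                                        const char *where)
{
    if (l->locked)
        return DBG_RELOCK;   /* l->owner and l->where identify the holder */
    l->locked = 1;
    l->owner = task;
    l->where = where;
    return DBG_OK;
}

static void dbg_lock_release(struct dbg_lock *l)
{
    assert(l->locked);       /* releasing an unheld lock is a bug too */
    l->locked = 0;
    l->where = NULL;
}

/*
 * Replays the sequence described above with made-up task ids: task 1 takes
 * the lock and goes to sleep without releasing it, then task 2 gets to run
 * and needs the same lock.
 */
static enum dbg_result demo_sleep_while_locked(void)
{
    struct dbg_lock lock = { 0, 0, NULL };
    enum dbg_result first, second;

    first = dbg_lock_acquire(&lock, 1, "scsi_try_to_abort_command");
    assert(first == DBG_OK);

    /* task 1 blocks here (the down on line 197); the lock stays held */

    second = dbg_lock_acquire(&lock, 2, "some other I/O path");
    if (second == DBG_OK)
        dbg_lock_release(&lock);   /* not reached in this scenario */
    return second;
}
```

In this sketch the second acquire is refused, and the lock still carries task 1's id and call site, which is the information that lets you work backwards from the moment of the lockup.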

The spinlock recorded the PC and task_struct when the lock was first grabbed, so even though we're looking at the moment when the second task came along and tried to grab it again, we can use the stored information to find 1) which task grabbed the lock; 2) what its PC was when it grabbed it; and 3) (by reading the stack trace) what it's doing now. In this trace, the answers to those questions are:

  1) pid 1370 (comm = "scsi_eh"; unclear what that is);
  2) the spinlock in scsi_try_to_abort_command; and
  3) scheduled from the down on line 197.

Anyway, I don't know enough about this code to figure out what the fix should be, so maybe somebody on this list can suggest one. Then I'll need to figure out why scsi_try_to_abort_command() is being called in the first place - any ideas? It's about a 50/50 proposition whether an entire CD burn locks up partway through or completes.

And like I said, I'm attaching the kernel gdb trace... as an attachment... so geocrawler can lose it too..

--
                                        -bwb

                                        Brent Baccala
                                        baccala@freesoft.org