* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? [not found] ` <3FA28C9A.5010608@pacbell.net> @ 2004-03-06 2:08 ` David Mosberger 2004-03-06 2:13 ` David Mosberger 2004-03-06 4:55 ` David Brownell 0 siblings, 2 replies; 25+ messages in thread From: David Mosberger @ 2004-03-06 2:08 UTC (permalink / raw) To: David Brownell Cc: davidm, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini OK, finally a bit of progress. If you remember back in October 2003 I reported: > One-line summary: plug-in your USB keyboard, see your machine die. > So, I have this non-name USB keyboard (with built-in 2-port USB > hub) which reliably crashes 2.6.0-test{8,9} on both x86 and ia64. > In retrospect, it's clear to me that the same keyboard also > occasionally crashes 2.4 kernels, but there the problem appears > more seldom. Perhaps once in 10 reboots and once the machine is > booted and the keyboard is running, it keeps on working. The > keyboard in question is a BTC 5141H. After this, I spent a (small) amount of time looking over the HID code etc to see what could be causing it. I could find nothing wrong so I gave up, connected another USB keyboard, and basically ignored the problem. In retrospect, that was Good Thinking, because I was apparently looking at the wrong code: the problem _does_ appear to be coming from the USB HCD, not from the HIDeous code. Specifically, after upgrading to 2.6.4-rc2, _all_ of the ia64 machines I tested would crash as soon as they had _any_ USB keyboard plugged in. That is, the problem no longer was limited to the BTC keyboard, which is special because it has a built-in hub. This was encouraging. Turns out it's this patch that was causing the crashes: http://linux.bkbits.net:8080/linux-2.5/cset@1.1619.1.17 That was strange, because even to my USB-untrained eye the patch looked obviously correct. However, I think the root cause of the problem really has to do with a race-condition between the controller and the driver. In particular, if I apply the patch below, my USB keyboards (including the BTC keyboard) work just fine! === drivers/usb/host/ohci-q.c 1.48 vs edited ==--- 1.48/drivers/usb/host/ohci-q.c Tue Mar 2 05:52:46 2004 +++ edited/drivers/usb/host/ohci-q.c Fri Mar 5 17:25:55 2004 @@ -438,7 +451,7 @@ * behave. frame_no wraps every 2^16 msec, and changes right before * SF is triggered. */ - ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1; + ed->tick = OHCI_FRAME_NO(ohci->hcca) + 2; /* rm_list is just singly linked, for simplicity */ ed->ed_next = ohci->ed_rm_list; However, I think the root-cause of the problem may be this optimization in ohci_irq(): /* we can eliminate a (slow) readl() if _only_ WDH caused this irq */ Indeed, if I apply this patch instead: === drivers/usb/host/ohci-hcd.c 1.56 vs edited ==--- 1.56/drivers/usb/host/ohci-hcd.c Tue Mar 2 05:52:40 2004 +++ edited/drivers/usb/host/ohci-hcd.c Fri Mar 5 17:45:09 2004 @@ -584,7 +584,7 @@ int ints; /* we can eliminate a (slow) readl() if _only_ WDH caused this irq */ - if ((ohci->hcca->done_head != 0) + if (0 && (ohci->hcca->done_head != 0) && ! (le32_to_cpup (&ohci->hcca->done_head) & 0x01)) { ints = OHCI_INTR_WDH; there are no crashes either. So my theory is that I was seeing this sequence of events: - HCD signals WDH interrupt & sends DMA to update the frame number in the host-controller communication area (HCCA) - host gets interrupt, but skips readl() and hence reads a stale frame number N instead of the up-to-date value (N+1) - HCD cancels a transfer descriptor (TD), moves it to the "remove list" and calculates the frame number at which it can be remove from the host-controller's list as N+1 - SOF interrupt arrives (probably was pending already?) - interrupt handler does a readl() and now sees the updated frame-number N+1 - HCD sees that the cancelled TD's time stamp N+1 is <= the current current time stamp (N+1) and goes ahead and removes it from the host-list, while the controller is still looking at the entry being removed - HCD ends up dereferencing a bad pointer and ends up reading from address 0xf0000000, which on our ia64 machines is a read-only area, which then results in a machine-check abort Does this sound plausible? What beats me is why UHCI would have the same issue. I know even less about UHCI than I do about OHCI but perhaps there is a similar problem. --david ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-06 2:08 ` [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? David Mosberger @ 2004-03-06 2:13 ` David Mosberger 2004-03-06 4:55 ` David Brownell 1 sibling, 0 replies; 25+ messages in thread From: David Mosberger @ 2004-03-06 2:13 UTC (permalink / raw) To: davidm Cc: David Brownell, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini Typo-alert: >>>>> On Fri, 5 Mar 2004 18:08:40 -0800, David Mosberger <davidm@linux.hpl.hp.com> said: David> - HCD ends up dereferencing a bad pointer and ends up David> reading from address 0xf0000000, which on our ia64 machines David> is a read-only area, which then results in a machine-check David> abort ^^^^^^^^^ make that "write-only" --david ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-06 2:08 ` [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? David Mosberger 2004-03-06 2:13 ` David Mosberger @ 2004-03-06 4:55 ` David Brownell 2004-03-06 5:49 ` David Mosberger 2004-03-06 9:17 ` David Mosberger 1 sibling, 2 replies; 25+ messages in thread From: David Brownell @ 2004-03-06 4:55 UTC (permalink / raw) To: davidm; +Cc: Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini David Mosberger wrote: > OK, finally a bit of progress. If you remember back in October 2003 I > reported: > > > One-line summary: plug-in your USB keyboard, see your machine die. > > > So, I have this non-name USB keyboard (with built-in 2-port USB > > hub) which reliably crashes 2.6.0-test{8,9} on both x86 and ia64. > > In retrospect, it's clear to me that the same keyboard also > > occasionally crashes 2.4 kernels, but there the problem appears > > more seldom. > > Specifically, after upgrading to 2.6.4-rc2, _all_ of the ia64 machines > I tested would crash as soon as they had _any_ USB keyboard plugged > in. That is, the problem no longer was limited to the BTC keyboard, > which is special because it has a built-in hub. This was encouraging. > > Turns out it's this patch that was causing the crashes: > > http://linux.bkbits.net:8080/linux-2.5/cset@1.1619.1.17 Maybe in 2.6.4-rc2... but not in 2.6.0-test{8,9}!! There's something wierd going on recently for sure, and it's not caused just by that patch. I'm not sure reverting that would make things better overall right now; hard to say. I'm thinking the "disable periodic schedule" patch is also worth looking at. I can imagine some silicon having bugs if that's turned off (just like some _systems_ have issues when it's left on). > That was strange, because even to my USB-untrained eye the patch > looked obviously correct. However, I think the root cause of the > problem really has to do with a race-condition between the controller > and the driver. In particular, if I apply the patch below, my USB > keyboards (including the BTC keyboard) work just fine! > > ... > - ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1; > + ed->tick = OHCI_FRAME_NO(ohci->hcca) + 2; I've seen that _change_ behavior in some regression tests (usbtest test11/test12), as if the extra msec let things quiesce (so only one of two broken states showed, and not the oopsable one) but not _fix_ it. > However, I think the root-cause of the problem may be this optimization > in ohci_irq(): > > /* we can eliminate a (slow) readl() if _only_ WDH caused this irq */ > > Indeed, if I apply this patch instead: > > ... > /* we can eliminate a (slow) readl() if _only_ WDH caused this irq */ > - if ((ohci->hcca->done_head != 0) > + if (0 && (ohci->hcca->done_head != 0) > ... > > there are no crashes either. See this post from Martin Diehl ... my response isn't out yet: http://marc.theaimsgroup.com/?l=linux-usb-devel&m\x107850825815775&w=2 The reason I keep ending up thinking that readl-elimination must be OK (me agreeing with Martin) is that the memory there came from dma_alloc_coherent() ... so if anything's wrong, it'd be at most lack of rmb(), not a stale-cache kind of thing. > So my theory is that I was seeing this sequence of events: > > - HCD signals WDH interrupt & sends DMA to update the frame number in > the host-controller communication area (HCCA) > > - host gets interrupt, but skips readl() and hence reads a stale > frame number N instead of the up-to-date value (N+1) It reads the frame number directly from the controller, so it's not possible that it can be so stale that an rmb() wouldn't fix sequencing issues. What might be possible though is that the donelist gets modified by the time the unlinks get processed, with some extra TDs changing state (from HC perspective) ... haven't explored that possibility. > - HCD cancels a transfer descriptor (TD), moves it to the "remove list" > and calculates the frame number at which it can be remove from > the host-controller's list as N+1 That'd be an ED on the remove list, not a TD. Also in dma-coherent memory. The cancelation would apply to one or more of the TDs queued to that ED. > - SOF interrupt arrives (probably was pending already?) > > - interrupt handler does a readl() and now sees the updated > frame-number N+1 > > - HCD sees that the cancelled TD's time stamp N+1 is <= the current > current time stamp (N+1) and goes ahead and removes it from the > host-list, while the controller is still looking at the entry being > removed This ED-level disagreement between the HC and HCD might explain some issues. I think the current trouble cases are usually where there's only one TD queued to the ED, so that TD and the dummy keep ping-ponging back and forth. The HC certainly seems to overwrite the "dummy TD" -- > - HCD ends up dereferencing a bad pointer and ends up reading from > address 0xf0000000, which on our ia64 machines is a read-only area, > which then results in a machine-check abort I'm surprised that DMA from a read-only area would be a problem! :) If OHCI is getting a PCI error, I'd expect a "UE" IRQ. > Does this sound plausible? Parts of it. There's definite recent nastiness. Of the type that other eyes sometimes see better. > What beats me is why UHCI would have the same issue. I know even less > about UHCI than I do about OHCI but perhaps there is a similar > problem. I still suspect some problem in the HID code. But right now I'm quite certain of a recent-ish OHCI issue. - Dave > > --david > ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-06 4:55 ` David Brownell @ 2004-03-06 5:49 ` David Mosberger 2004-03-06 7:21 ` David Mosberger 2004-03-06 16:37 ` David Brownell 2004-03-06 9:17 ` David Mosberger 1 sibling, 2 replies; 25+ messages in thread From: David Mosberger @ 2004-03-06 5:49 UTC (permalink / raw) To: David Brownell Cc: davidm, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini >>>>> On Fri, 05 Mar 2004 20:55:01 -0800, David Brownell <david-b@pacbell.net> said: >> Turns out it's this patch that was causing the crashes: >> http://linux.bkbits.net:8080/linux-2.5/cset@1.1619.1.17 David.B> Maybe in 2.6.4-rc2... but not in 2.6.0-test{8,9}!! Of course. What I'm saying is that in 2.6.0-test{8,9} it was rare to trigger the problem (only with BTC keyboard) and the change above made it trivial to trigger the keyboard. Basically, your fix in cset 1.1619.1.17 made it more common for stuff to be unlinked in the "deferred" (proper) manner and that made it much more likely to trigger the bug. David.B> The reason I keep ending up thinking that readl-elimination David.B> must be OK (me agreeing with Martin) is that the memory David.B> there came from dma_alloc_coherent() ... so if anything's David.B> wrong, it'd be at most lack of rmb(), not a stale-cache David.B> kind of thing. It's not an issue of DMA coherency, it's an issue of DMA vs. interrupt ordering. I believe the WHD interrupt is arriving at the CPU before the DMA update to the HCCA is done. In my second patch, the readl() at the beginning of the interrupt ensures that the DMA update to the HCCA is completed before the readl() returns data. David.B> It reads the frame number directly from the controller, so David.B> it's not possible that it can be so stale that an rmb() David.B> wouldn't fix sequencing issues. Eh, it's read like this: #define OHCI_FRAME_NO(hccap) ((u16)le32_to_cpup(&(hccap)->frame_no)) finish_unlinks (ohci, OHCI_FRAME_NO(ohci->hcca), ptregs); The HCCA is in host memory. >> - HCD ends up dereferencing a bad pointer and ends up reading >> from address 0xf0000000, which on our ia64 machines is a >> read-only area, which then results in a machine-check abort David.B> I'm surprised that DMA from a read-only area would be a David.B> problem! :) If OHCI is getting a PCI error, I'd expect a David.B> "UE" IRQ. You must have not received my follow-up message. There was a typo in my message: it was supposed to say "write-only" area. David.B> I still suspect some problem in the HID code. But right David.B> now I'm quite certain of a recent-ish OHCI issue. I'm 99% certain that the problem I saw back in October (BTC keyboard) was identical to the one triggered by cset 1.1619.1.17. --david ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-06 5:49 ` David Mosberger @ 2004-03-06 7:21 ` David Mosberger 2004-03-06 8:39 ` David Mosberger 2004-03-06 16:37 ` David Brownell 1 sibling, 1 reply; 25+ messages in thread From: David Mosberger @ 2004-03-06 7:21 UTC (permalink / raw) To: David Brownell, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini, davidm >>>>> On Fri, 5 Mar 2004 21:49:20 -0800, David Mosberger <davidm@linux.hpl.hp.com> said: David> It's not an issue of DMA coherency, it's an issue of DMA David> vs. interrupt ordering. I believe the WHD interrupt is David> arriving at the CPU before the DMA update to the HCCA is David> done. Actually, it looks like I misunderstood the OHCI spec on first reading. It seems like the causal relationship goes like this: (1) Start of Frame -> (2) update HccaFrameNumber -> (3) trigger SF interrupt Now, suppose you get a WDH interrupt between (1) and (2). You'd read the old frame-number yet by the time the interrupt from (3) arrives the HC might already be accessing the ED that you're about to remove. If this is correct, then the first patch is probably a better approach: === drivers/usb/host/ohci-q.c 1.48 vs edited ==--- 1.48/drivers/usb/host/ohci-q.c Tue Mar 2 05:52:46 2004 +++ edited/drivers/usb/host/ohci-q.c Fri Mar 5 17:25:55 2004 @@ -438,7 +451,7 @@ * behave. frame_no wraps every 2^16 msec, and changes right before * SF is triggered. */ - ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1; + ed->tick = OHCI_FRAME_NO(ohci->hcca) + 2; /* rm_list is just singly linked, for simplicity */ ed->ed_next = ohci->ed_rm_list; This actually makes tons of sense if you think of it like jiffies: you need to make sure you delay at least one full frame-interval. If you set the tick to "+ 1" and the current tick is almost over, that requirement is violated. Setting it to "+ 2" should be safe. The only problem I can think of is if the delay between point (1) and (2) were to exceed one frame-interval (1 msec). While unlikely, the right PCI topology and heavy bus traffic perhaps could cause such delays. However, even then it's probably OK because the HC would presumably stall when trying to update the HccaFrameNumber the second time and the previous update hasn't completed yet. Here is one little piece of evidence that's consistent with this explanation: last week I tried to rip some audio tracks off a CD. With PIO, this caused interrupts to get delayed 2-3msec and that caused all kinds of weird effects on the USB bus. Mostly, I'd suddenly lose the keyboard or the mouse, though reconnecting them would "fix" the problem for a short time. --david ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-06 7:21 ` David Mosberger @ 2004-03-06 8:39 ` David Mosberger 0 siblings, 0 replies; 25+ messages in thread From: David Mosberger @ 2004-03-06 8:39 UTC (permalink / raw) To: davidm Cc: David Brownell, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini >>>>> On Fri, 5 Mar 2004 23:21:32 -0800, David Mosberger <davidm@linux.hpl.hp.com> said: David> (1) Start of Frame -> (2) update HccaFrameNumber -> (3) David> trigger SF interrupt David> Now, suppose you get a WDH interrupt between (1) and (2). David> You'd read the old frame-number yet by the time the interrupt David> from (3) arrives the HC might already be accessing the ED David> that you're about to remove. Sorry for the monologue---trying to learn how this is all supposed to work... The OHCI spec says that HccaFrameNumber is updated in this fashion: (a) send Start-of-Frame (b) HccaFrameNumber <- HcFmNumber.StartingFrame (c) start processing ED (& post SF intr if requested) Since start_ed_unlink() uses the following sequence: (1) ed->hwINFO |= ED_DEQUEUE (2) ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1 Then as long as (1) is observed by the HC before (2) (which it should be), the race I described isn't possible: if (2) read the "old" frame-number, then the HC wouldn't have started step (c) yet and hence the HC would observe step (1) and notice that the ED is being dequeued. Converseley, if the HC started to process the ED before (1) completed (i.e., it missed the ED_DEQUEUE flag), then step (2) must have been reading the the new frame-number. OK, I see now the conundrum... BTW: does the value 0xf0000000 bear any special meaning in USB? We already considered whether this would be a NULL-pointer after I/O MMU translation but it is not: the I/O MMU window is at 0x40000000-0x80000000 on the machines in question. --david ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-06 5:49 ` David Mosberger 2004-03-06 7:21 ` David Mosberger @ 2004-03-06 16:37 ` David Brownell 2004-03-08 6:18 ` Grant Grundler 1 sibling, 1 reply; 25+ messages in thread From: David Brownell @ 2004-03-06 16:37 UTC (permalink / raw) To: davidm; +Cc: Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini David Mosberger wrote: > David.B> The reason I keep ending up thinking that readl-elimination > David.B> must be OK (me agreeing with Martin) is that the memory > David.B> there came from dma_alloc_coherent() ... so if anything's > David.B> wrong, it'd be at most lack of rmb(), not a stale-cache > David.B> kind of thing. > > It's not an issue of DMA coherency, it's an issue of DMA vs. interrupt > ordering. I believe the WHD interrupt is arriving at the CPU before Which is what I sketched to Martin, as the reason to be interested in a patch that was equivalent to your second patch. Unfortunately that doesn't check out as being "the" fix here. Martin still saw the problem. (And I don't see how it'd be what gave me several hard lockups ... but it didn't work either, and the day before, without that change, no lockup.) > the DMA update to the HCCA is done. In my second patch, the readl() > at the beginning of the interrupt ensures that the DMA update to > the HCCA is completed before the readl() returns data. DMA-coherent memory is defined as "memory for which a write by either the device or the processor can immediately be read by the processor or device without having to worry about caching effects." Such a write-buffering mechanism is clearly a type of (write-)caching effect, and readl() would be a kind of dma_rmb(), if you will. I suspect the docs are wrong about what dma-coherent means. > If this is correct, then the first patch is probably a better > approach: > > ... > - ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1; > + ed->tick = OHCI_FRAME_NO(ohci->hcca) + 2; > .. > > This actually makes tons of sense if you think of it like jiffies: you > need to make sure you delay at least one full frame-interval. Right, I had that theory too. As I reported, it doesn't check out as the only issue. Plus, even assuming it's correct, adding another millisecond slows down some common operations even more. I'd rather just slow down just the "right" paths, rather than all of them ... :) - Dave ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-06 16:37 ` David Brownell @ 2004-03-08 6:18 ` Grant Grundler 2004-03-08 18:58 ` David Mosberger 0 siblings, 1 reply; 25+ messages in thread From: Grant Grundler @ 2004-03-08 6:18 UTC (permalink / raw) To: David Brownell Cc: davidm, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini On Sat, Mar 06, 2004 at 08:37:43AM -0800, David Brownell wrote: > DMA-coherent memory is defined as "memory for which a write by either > the device or the processor can immediately be read by the processor > or device without having to worry about caching effects." The use of "immediate" here means no other sync function needs to be called to access the data - ie don't need to call pci_sync_single(). In general, the accesses are ordered following PCI ordering rules. But every architecture (including x86) has issues with "inflight" DMA. Line based Interrupts are delivered on a different path than DMA and thus ordering can't be enforced. For example, the code around the following comment in drivers/net/tg3.c: /* * Flush PCI write. This also guarantees that our * status block has been flushed to host memory. */ > `Such a > write-buffering mechanism is clearly a type of (write-)caching effect, No - the data is still in flight and in some deterministic time frame will become visible to the CPU. Calling it a "caching effect" confuses the issues even worse. > and readl() would be a kind of dma_rmb(), if you will. Yes, that's correct - but it's orthogonal to "cache coherent". > I suspect the docs are wrong about what dma-coherent means. Not "wrong", just misunderstood. ;^) hth, grant ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-08 6:18 ` Grant Grundler @ 2004-03-08 18:58 ` David Mosberger 2004-03-08 21:48 ` David Brownell 0 siblings, 1 reply; 25+ messages in thread From: David Mosberger @ 2004-03-08 18:58 UTC (permalink / raw) To: Grant Grundler Cc: David Brownell, davidm, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini >>>>> On Sun, 7 Mar 2004 22:18:02 -0800, Grant Grundler <iod00d@hp.com> said: >> `Such a write-buffering mechanism is clearly a type of >> (write-)caching effect, Grant> No - the data is still in flight and in some deterministic Grant> time frame will become visible to the CPU. Calling it a Grant> "caching effect" confuses the issues even worse. That's why I'm so unhappy that the PCI interface used the term "consistent" memory, when it should have said "coherent". We should nail a plate on everbody's forehead saying: consistency = coherency + ordering Perhaps then people would start to have a clear distincition between the meaning of the two terms (or at least it would force them to think about it! ;-). But in any case, as later experimentation confirmed, the USB bug isn't (just) an ordering issue. The order of operation described in the OHCI spec does not rely on any specific order of interrupt delivery at all, so I was wrong about that. --david ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-08 18:58 ` David Mosberger @ 2004-03-08 21:48 ` David Brownell 2004-03-09 9:15 ` David Mosberger 0 siblings, 1 reply; 25+ messages in thread From: David Brownell @ 2004-03-08 21:48 UTC (permalink / raw) To: davidm Cc: Grant Grundler, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini [-- Attachment #1: Type: text/plain, Size: 647 bytes --] David Mosberger wrote: > consistency = coherency + ordering That's one of the things I meant when I said the docs weren't quite there yet ... the dma_*() API docs don't address how to achieve the ordering, or even mention that it's an issue. Plus, there are definitions of coherency that include event ordering; it's not clear which definition is intended. Anyway, see the attached patch. It goes on top of 2.6.4-bk2 and passes my unlink tests (the ones that haven't been run much on recent kernels!) on several different OHCI hosts. Other issues may be lurking, but this seems to fix a handful of nasties that I could reproduce. - Dave [-- Attachment #2: ohci-0308.patch --] [-- Type: text/plain, Size: 3017 bytes --] Fixes for some recent OHCI instability. - Never munge ed->hwTailP once it's set up, except when appending to the queue. - Don't clear control/bulk "current ED" registers when taking entries off the queue. The chip may still be using the value. - Revert part of previous bk1 patch, which removed a fast path that makes a visible difference in some common operations. Removing it just slowed timings, and didn't fix anything. - Make the "unlink during submit" path behave better (SMP). --- 1.50/drivers/usb/host/ohci-q.c Tue Mar 2 22:48:07 2004 +++ edited/drivers/usb/host/ohci-q.c Mon Mar 8 13:15:08 2004 @@ -282,7 +282,6 @@ if (!ed->hwNextED) { ohci->hc_control &= ~OHCI_CTRL_CLE; writel (ohci->hc_control, &ohci->regs->control); - writel (0, &ohci->regs->ed_controlcurrent); // post those pci writes (void) readl (&ohci->regs->control); } @@ -306,7 +305,6 @@ if (!ed->hwNextED) { ohci->hc_control &= ~OHCI_CTRL_BLE; writel (ohci->hc_control, &ohci->regs->control); - writel (0, &ohci->regs->ed_bulkcurrent); // post those pci writes (void) readl (&ohci->regs->control); } @@ -331,6 +329,19 @@ periodic_unlink (ohci, ed); break; } + + /* NOTE: Except for a couple of exceptionally clean unlink cases + * (like unlinking the only c/b ED, with no TDs) HCs may still be + * caching this operational ED (or its address). Safe unlinking + * involves not marking it ED_IDLE till INTR_SF; we always do that + * if td_list isn't empty. Otherwise the race is small; but ... + */ + if (ed->state == ED_OPER) { + ed->state = ED_IDLE; + ed->hwINFO &= ~(ED_SKIP | ED_DEQUEUE); + ed->hwHeadP &= ~ED_H; + wmb (); + } } @@ -794,8 +805,6 @@ next->next_dl_td = rev; rev = next; - if (ed->hwTailP == cpu_to_le32 (next->td_dma)) - ed->hwTailP = next->hwNextTD; ed->hwHeadP = next->hwNextTD | toggle; } @@ -941,12 +950,7 @@ continue; } - /* patch pointers hc uses ... tail, if we're removing - * an otherwise active td, and whatever td pointer - * points to this td - */ - if (ed->hwTailP == cpu_to_le32 (td->td_dma)) - ed->hwTailP = td->hwNextTD; + /* patch pointer hc uses */ savebits = *prev & ~cpu_to_le32 (TD_MASK); *prev = td->hwNextTD | savebits; @@ -1041,7 +1045,7 @@ /* clean schedule: unlink EDs that are no longer busy */ if (list_empty (&ed->td_list)) - start_ed_unlink (ohci, ed); + ed_deschedule (ohci, ed); /* ... reenabling halted EDs only after fault cleanup */ else if ((ed->hwINFO & (ED_SKIP | ED_DEQUEUE)) == ED_SKIP) { td = list_entry (ed->td_list.next, struct td, td_list); --- 1.57/drivers/usb/host/ohci-hcd.c Tue Mar 2 22:48:07 2004 +++ edited/drivers/usb/host/ohci-hcd.c Mon Mar 8 12:39:24 2004 @@ -233,7 +233,7 @@ spin_lock (&urb->lock); if (urb->status != -EINPROGRESS) { spin_unlock (&urb->lock); - + urb->hcpriv = urb_priv; finish_urb (ohci, urb, 0); retval = 0; goto fail; ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-08 21:48 ` David Brownell @ 2004-03-09 9:15 ` David Mosberger 2004-03-09 17:36 ` David Brownell 0 siblings, 1 reply; 25+ messages in thread From: David Mosberger @ 2004-03-09 9:15 UTC (permalink / raw) To: David Brownell Cc: davidm, Grant Grundler, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini How about something along the following lines? The patch is relative to 2.6.4-rc1. What it does is add a new state ED_DESCHEDULED, which is treated exactly like ED_IDLE, except that in this state, the HC may still be referring to the ED in question. Thus, if ohci_endpoint_disable() finds an ED in this state, it will take it down via start_ed_unlink() and once that is done, the ED will be in ED_IDLE state and hence safe for removal (HC won't be referencing it anymore). I left some printk's for low-frequency events enabled, to give you a better idea of how this is working. Note that I removed the posting of the PCI-writes in ed_deschedule(). I don't think those are needed. As far as I can see, a write-memory barrier at the end of this routine should suffice. With this patch applied: - My dual-CPU Itanium machine boots fine (doesn't hang anymore as with 2.6.4-rc1) - My home workstation (single CPU Itanium with 3 USB devices) now boots and works fine again. Moreover, I don't see any USB keyboard/mouse disconnects anymore when doing heavy IDE traffic. The BTC keyboard, however, still does NOT work. I'm fairly certain now that this is indeed a separate problem in the HID. The reason being that I'm now fairly comfortable with the OHCI HCD and if this bug truly were in the OHCI HCD, then why does it trigger with UHCI as well (I haven't tried EHCI yet; maybe I should try connecting the BTC keyboard via a USB 2.0 hub). Comments? --david === drivers/usb/host/ohci-hcd.c 1.56 vs edited ==--- 1.56/drivers/usb/host/ohci-hcd.c Tue Mar 2 05:52:40 2004 +++ edited/drivers/usb/host/ohci-hcd.c Tue Mar 9 00:24:54 2004 @@ -239,8 +239,11 @@ goto fail; } - /* schedule the ed if needed */ - if (ed->state = ED_IDLE) { + if (ed->state = ED_UNLINK) { + printk("%s: ed %p being unlinked\n", __FUNCTION__, ed); + goto fail0; + } else if (ed->state = ED_IDLE || ed->state = ED_DESCHEDULED) { + /* schedule the ed if needed */ retval = ed_schedule (ohci, ed); if (retval < 0) goto fail0; @@ -344,9 +347,15 @@ if (!ed) goto done; + printk("%s(ed=%p, type=%d, state=%u)\n", __FUNCTION__, ed, ed->type, ed->state); + if (!HCD_IS_RUNNING (ohci->hcd.state)) ed->state = ED_IDLE; switch (ed->state) { + case ED_DESCHEDULED: + start_ed_unlink (ohci, ed); + BUG_ON (ed->state != ED_UNLINK); + /* fall through */ case ED_UNLINK: /* wait for hw to finish? */ /* major IRQ delivery trouble loses INTR_SF too... */ WARN_ON (limit-- = 0); === drivers/usb/host/ohci-q.c 1.48 vs edited ==--- 1.48/drivers/usb/host/ohci-q.c Tue Mar 2 05:52:46 2004 +++ edited/drivers/usb/host/ohci-q.c Tue Mar 9 00:39:39 2004 @@ -168,10 +168,16 @@ { int branch; +#if 0 +printk("%s: ed=%p type=%d\n", __FUNCTION__, ed, ed->type); +#endif ed->state = ED_OPER; ed->ed_prev = 0; ed->ed_next = 0; ed->hwNextED = 0; + ed->hwINFO &= ~ED_SKIP; + WARN_ON (ed->hwHeadP & ED_H); + ed->hwHeadP &= ~ED_H; /* XXX is this really needed?? --davidm */ wmb (); /* we care about rm_list when setting CLE/BLE in case the HC was at @@ -267,24 +273,30 @@ ed, ed->branch, ed->load, ed->interval); } -/* unlink an ed from one of the HC chains. +/* Deschedule an ed from one of the HC chains. * just the link to the ed is unlinked. * the link from the ed still points to another operational ed or 0 - * so the HC can eventually finish the processing of the unlinked ed + * so the HC can eventually finish the processing of the unlinked ed. + * The ED isn't idle after returning from this routine, because the HC + * may continue to refer to it, but as far as the HCD is concerned, + * the ED is reusable. If it is necessary to get the ED into the + * ED_IDLE state, the full unlink-sequence needs to be performed + * (via start_ed_unlink()). */ -static void ed_deschedule (struct ohci_hcd *ohci, struct ed *ed) +static void ed_deschedule (struct ohci_hcd *ohci, struct ed *ed) { +#if 0 +printk("%s: ed=%p type=%d\n", __FUNCTION__, ed, ed->type); +#endif ed->hwINFO |= ED_SKIP; switch (ed->type) { case PIPE_CONTROL: + /* remove ED from the HC's list: */ if (ed->ed_prev = NULL) { if (!ed->hwNextED) { ohci->hc_control &= ~OHCI_CTRL_CLE; writel (ohci->hc_control, &ohci->regs->control); - writel (0, &ohci->regs->ed_controlcurrent); - // post those pci writes - (void) readl (&ohci->regs->control); } writel (le32_to_cpup (&ed->hwNextED), &ohci->regs->ed_controlhead); @@ -292,6 +304,7 @@ ed->ed_prev->ed_next = ed->ed_next; ed->ed_prev->hwNextED = ed->hwNextED; } + /* remove ED from the HCD's list: */ if (ohci->ed_controltail = ed) { ohci->ed_controltail = ed->ed_prev; if (ohci->ed_controltail) @@ -302,13 +315,11 @@ break; case PIPE_BULK: + /* remove ED from the singly-linked HC's list: */ if (ed->ed_prev = NULL) { if (!ed->hwNextED) { ohci->hc_control &= ~OHCI_CTRL_BLE; writel (ohci->hc_control, &ohci->regs->control); - writel (0, &ohci->regs->ed_bulkcurrent); - // post those pci writes - (void) readl (&ohci->regs->control); } writel (le32_to_cpup (&ed->hwNextED), &ohci->regs->ed_bulkhead); @@ -316,6 +327,7 @@ ed->ed_prev->ed_next = ed->ed_next; ed->ed_prev->hwNextED = ed->hwNextED; } + /* remove ED from the HCD's list: */ if (ohci->ed_bulktail = ed) { ohci->ed_bulktail = ed->ed_prev; if (ohci->ed_bulktail) @@ -331,6 +343,12 @@ periodic_unlink (ohci, ed); break; } + ed->state = ED_DESCHEDULED; + /* + * Ensure HCD sees these updates (in particular update of + * hwINFO) before any following updates. + */ + wmb (); } @@ -387,7 +405,7 @@ /* NOTE: only ep0 currently needs this "re"init logic, during * enumeration (after set_address, or if ep0 maxpacket >8). */ - if (ed->state = ED_IDLE) { + if (ed->state = ED_IDLE || ed->state = ED_DESCHEDULED) { u32 info; info = usb_pipedevice (pipe); @@ -418,6 +436,9 @@ done: spin_unlock_irqrestore (&ohci->lock, flags); +#if 0 +printk("new ed %p: type=%u state=%u hwTailP=%x hwHeadP=%x\n", ed, ed->type, ed->state, ed->hwTailP, ed->hwHeadP); +#endif return ed; } @@ -428,17 +449,38 @@ * real work is done at the next start frame (SF) hardware interrupt */ static void start_ed_unlink (struct ohci_hcd *ohci, struct ed *ed) -{ +{ + u32 mask = 0; + + /* deschedule ED as far as the HCD is concerned: */ + ed_deschedule (ohci, ed); + + /* now initiate the unlink sequence to ensure the HC isn't using the ED anymore: */ + if (ed->type = PIPE_CONTROL) + mask = OHCI_CTRL_CLE; + else if (ed->type = PIPE_BULK) + mask = OHCI_CTRL_BLE; + if (mask) { + /* + * Disable scanning of the control or bulk list; this + * ensures that when we get an interrupt with a frame + * # greater than the current frame #, the HC isn't + * using the list anymore. + */ + ohci->hc_control &= ~mask; + writel (ohci->hc_control, &ohci->regs->control); + } ed->hwINFO |= ED_DEQUEUE; ed->state = ED_UNLINK; - ed_deschedule (ohci, ed); /* SF interrupt might get delayed; record the frame counter value that * indicates when the HC isn't looking at it, so concurrent unlinks * behave. frame_no wraps every 2^16 msec, and changes right before * SF is triggered. */ + rmb(); /* ensure ed_deschedule() work is done before we read the frame # */ ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1; +printk("%s: ed=%p tick=%u\n", __FUNCTION__, ed, ed->tick); /* rm_list is just singly linked, for simplicity */ ed->ed_next = ohci->ed_rm_list; @@ -452,6 +494,7 @@ // flush those pci writes (void) readl (&ohci->regs->control); } + /* ??? What if HCD isn't running? Shouldn't that be handled as well? */ } /*-------------------------------------------------------------------------* @@ -614,6 +657,7 @@ info = TD_CC | TD_DP_SETUP | TD_T_DATA0; td_fill (ohci, info, urb->setup_dma, 8, urb, cnt++); if (data_len > 0) { + BUG_ON(data_len >= 4096); info = TD_CC | TD_R | TD_T_DATA1; info |= is_out ? TD_DP_OUT : TD_DP_IN; /* NOTE: mishandles transfers >8K, some >4K */ @@ -758,6 +802,7 @@ struct list_head *tmp = td->td_list.next; u32 toggle = ed->hwHeadP & ED_C; +printk("%s: ed=%p\n", __FUNCTION__, ed); /* clear ed halt; this is the td that caused it, but keep it inactive * until its urb->complete() has a chance to clean up. */ @@ -794,8 +839,6 @@ next->next_dl_td = rev; rev = next; - if (ed->hwTailP = cpu_to_le32 (next->td_dma)) - ed->hwTailP = next->hwNextTD; ed->hwHeadP = next->hwNextTD | toggle; } @@ -881,6 +924,7 @@ { struct ed *ed, **last; +printk("%s(tick=%u)\n", __FUNCTION__, tick); rescan_all: for (last = &ohci->ed_rm_list, ed = *last; ed != NULL; ed = *last) { struct list_head *entry, *tmp; @@ -941,12 +985,7 @@ continue; } - /* patch pointers hc uses ... tail, if we're removing - * an otherwise active td, and whatever td pointer - * points to this td - */ - if (ed->hwTailP = cpu_to_le32 (td->td_dma)) - ed->hwTailP = td->hwNextTD; + /* patch pointers hc uses */ savebits = *prev & ~cpu_to_le32 (TD_MASK); *prev = td->hwNextTD | savebits; @@ -957,23 +996,22 @@ /* if URB is done, clean up */ if (urb_priv->td_cnt = urb_priv->length) { modified = completed = 1; - finish_urb (ohci, urb, regs); + finish_urb (ohci, urb, regs); /* this drops ohci->lock! */ } } if (completed && !list_empty (&ed->td_list)) goto rescan_this; /* ED's now officially unlinked, hc doesn't see */ +#if 1 +printk("%s: done with ed %p\n", __FUNCTION__, ed); +#endif ed->state = ED_IDLE; ed->hwINFO &= ~(ED_SKIP | ED_DEQUEUE); ed->hwHeadP &= ~ED_H; ed->hwNextED = 0; - /* but if there's work queued, reschedule */ - if (!list_empty (&ed->td_list)) { - if (HCD_IS_RUNNING(ohci->hcd.state)) - ed_schedule (ohci, ed); - } + BUG_ON (!list_empty (&ed->td_list)); if (modified) goto rescan_all; @@ -1039,9 +1077,9 @@ if (urb_priv->td_cnt = urb_priv->length) finish_urb (ohci, urb, regs); - /* clean schedule: unlink EDs that are no longer busy */ + /* clean schedule: deschedule EDs that are no longer busy */ if (list_empty (&ed->td_list)) - start_ed_unlink (ohci, ed); + ed_deschedule (ohci, ed); /* ... reenabling halted EDs only after fault cleanup */ else if ((ed->hwINFO & (ED_SKIP | ED_DEQUEUE)) = ED_SKIP) { td = list_entry (ed->td_list.next, struct td, td_list); === drivers/usb/host/ohci.h 1.19 vs edited ==--- 1.19/drivers/usb/host/ohci.h Wed Feb 11 03:42:39 2004 +++ edited/drivers/usb/host/ohci.h Mon Mar 8 23:03:31 2004 @@ -48,6 +48,7 @@ #define ED_IDLE 0x00 /* NOT linked to HC */ #define ED_UNLINK 0x01 /* being unlinked from hc */ #define ED_OPER 0x02 /* IS linked to hc */ +#define ED_DESCHEDULED 0x03 /* idle for HCD, but HC may still hold reference to it */ u8 type; /* PIPE_{BULK,...} */ ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-09 9:15 ` David Mosberger @ 2004-03-09 17:36 ` David Brownell 2004-03-09 17:58 ` David Mosberger 0 siblings, 1 reply; 25+ messages in thread From: David Brownell @ 2004-03-09 17:36 UTC (permalink / raw) To: davidm Cc: Grant Grundler, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini David Mosberger wrote: > How about something along the following lines? The patch is relative > to 2.6.4-rc1. What it does is add a new state ED_DESCHEDULED, which > is treated exactly like ED_IDLE, except that in this state, the HC may > still be referring to the ED in question. Thus, if Sounds exactly like ED_UNLINK -- except maybe that it's not been put onto ed_rm_list (with ED_DEQUEUE set). Why add another state? The parts of this patch that came from the one I sent earlier are obviously correct (what were your test results for that?), and there's non-worrisome noise (printks etc). But some parts worry me. Like changing that code to BUG() on a driver behavior that's perfectly reasonable; and removing some of the PCI posting, which makes it easier for the HC and its driver to disagree about schedule status. > The BTC keyboard, however, still does NOT work. I'm fairly certain > now that this is indeed a separate problem in the HID. The reason That was my original suspicion, you may recall ... :) - Dave ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-09 17:36 ` David Brownell @ 2004-03-09 17:58 ` David Mosberger 2004-03-09 20:39 ` David Brownell 0 siblings, 1 reply; 25+ messages in thread From: David Mosberger @ 2004-03-09 17:58 UTC (permalink / raw) To: David Brownell Cc: davidm, Grant Grundler, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini >>>>> On Tue, 09 Mar 2004 09:36:53 -0800, David Brownell <david-b@pacbell.net> said: David.B> David Mosberger wrote: >> How about something along the following lines? The patch is >> relative to 2.6.4-rc1. What it does is add a new state >> ED_DESCHEDULED, which is treated exactly like ED_IDLE, except >> that in this state, the HC may still be referring to the ED in >> question. Thus, if David.B> Sounds exactly like ED_UNLINK -- except maybe that it's not David.B> been put onto ed_rm_list (with ED_DEQUEUE set). David.B> Why add another state? So you can tell them apart. See ohci_endpoint_disable(). You seem to think small races are OK. I disagree. The smaller the race, the harder it is to debug (and yes, it _will_ bite you eventually). With my patch, you're _guaranteed_ that ED_IDLE means the HC doesn't refer to the ED anymore so you can free the memory and do whatever you want without worry about potentially causing the HC to do something bad. David.B> But some parts worry me. Like changing that code to BUG() David.B> on a driver behavior that's perfectly reasonable; With my patch, the only reason you enter ED_UNLINK state is ohci_endpoint_disable(). If you try to ohci_urb_enqueue() on an ED in this state, it will fail, so I don't there is a problem (I see now that I forgot to set the error-code for this case, that's obviously something that needs to be fixed). But perhaps you had different semantics in mind for ED_UNLINK? David.B> removing some of the PCI posting, which makes it easier for David.B> the HC and its driver to disagree about schedule status. Meaning what? Please explain what sequence of events would cause problems that could be solved by flushing the posted writes. AFAICT, the flushes are just expensive NO-OPs in this particular case. --david ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-09 17:58 ` David Mosberger @ 2004-03-09 20:39 ` David Brownell 2004-03-09 23:32 ` David Mosberger 2004-03-10 6:59 ` David Mosberger 0 siblings, 2 replies; 25+ messages in thread From: David Brownell @ 2004-03-09 20:39 UTC (permalink / raw) To: davidm Cc: Grant Grundler, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini David Mosberger wrote: > > > David Mosberger wrote: > >> ... add a new state > >> ED_DESCHEDULED, which is treated exactly like ED_IDLE, except > >> that in this state, the HC may still be referring to the ED in > >> question. Thus, if > > David.B> Sounds exactly like ED_UNLINK -- except maybe that it's not > David.B> been put onto ed_rm_list (with ED_DEQUEUE set). > David.B> Why add another state? > > So you can tell them apart. See ohci_endpoint_disable(). Not informative. Why doesn't UNLINK work as-is? If there's a bug in how that's handled, better to fix it. I'd be willing to believe there is one, given proof. > You seem to think small races are OK. I disagree. The smaller the I certainly didn't say that, any more than you said that there's no such thing as an engineering tradeoff! (With which statement I'd likewise disagree.) But I certainly did mean to say that I was skeptical you were actually running into one particular race, which I've had an eye on for some time. (And which shouldn't have the failure mode you reported, though some other things might...) > David.B> But some parts worry me. Like changing that code to BUG() > David.B> on a driver behavior that's perfectly reasonable; > > With my patch, the only reason you enter ED_UNLINK state is > ohci_endpoint_disable(). If you try to ohci_urb_enqueue() on an ED in > this state, it will fail, so I don't there is a problem (I see now It's entered in usb_unlink_urb() as always. So as I noted, your patch would make drivers BUG() in cases that are perfectly reasonable according to the API. > that I forgot to set the error-code for this case, that's obviously > something that needs to be fixed). But perhaps you had different > semantics in mind for ED_UNLINK? As I've said more than once on this thread. - Dave ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-09 20:39 ` David Brownell @ 2004-03-09 23:32 ` David Mosberger 2004-03-10 2:53 ` David Brownell 2004-03-10 6:59 ` David Mosberger 1 sibling, 1 reply; 25+ messages in thread From: David Mosberger @ 2004-03-09 23:32 UTC (permalink / raw) To: David Brownell Cc: davidm, Grant Grundler, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini David B., I'll respond to your concerns about my OHCI changes later on. I wanted to make progress on the BTC HID issue first so we have the full picture. Remember how I was asking what's so special about 0xf00000 because that's the address the OHCI HC was reading from that caused the crash? Well, looky here what you get when you interpret the memory at address 0 as an ED: hw=(infoð000000 tailpð000000 headpð000000 nextEDð000000) D'oh! (I suspect this strange memory pattern is a left-over from the memory-testing done by the firmware; I should say thanks to whoever invented this pattern; the problem would have been much harder to debug if the address hadn't convenently fallen on a write-only region of the address-space.) Anyhow, I have been wondering since about Sunday whether it's really safe to write HcControlHeadED (and HcBulkHeadED) with 0. The register description itself is ambiguous. Now I'm finding that Figure 6-5 "List Service Flow" and Section 6.4.2.2 "Locating Endpoint Descriptors" are outright contradictory! The flow-chart suggests that after loading CurrentED with the contents of HeadED, the HC checks whether CurrentED is 0 and, if so, does nothing. However, the text in Section 6.4.2.2 says: ... At this point, the Host Controller checks the BulkListFilled bit or ControlListFilled bit of the HcCommandStatus register. If the respective "Filled" bit is set to 1, there is at least one Endpoint Descriptor on the list which needs service. In this case, the HostController will copy the value of HcControlHeadED or HcBulkHeadED into HcControlCurrentED or HcBulkCurrentED respectively, clear the "Filled" bit to 0, and attempt to process the Endpoint Descriptor now present in the CurrentED register. ... So, if the HC behaves as described in the text, then there is an obvious race: HC HCD - start new frame - find list enabled (CLE/BLE set) - find "Filled" bit enabled (CLF/BLF set) - ed_deschedule() gets called - turn off "list-enabled" bit - store 0 into *HeadED - read *HeadED and store in *CurrentEd - clear "Filled" bit to 0 - process ED at "Current" In other words, whenever de-scheduling the first ED on the control or bulk list, there is a risk that with the right timing, you'll end up processing the "ED" at address zero! So I changed ohci-q.c from donig this: if (ed->ed_prev = NULL) { if (!ed->hwNextED) { ohci->hc_control &= ~OHCI_CTRL_CLE; writel (ohci->hc_control, &ohci->regs->control); } writel (le32_to_cpup (&ed->hwNextED), &ohci->regs->ed_controlhead); } else ... to this: if (ed->ed_prev = NULL) { if (!ed->hwNextED) { ohci->hc_control &= ~OHCI_CTRL_CLE; writel (ohci->hc_control, &ohci->regs->control); } else writel (le32_to_cpup (&ed->hwNextED), &ohci->regs->ed_controlhead); } else ... and now there are no more crashes when plugging in the BTC keyboard. (Note: the change is safe only if the ED being removed remains valid until the clearing of the CLE is observed, i.e, until the next start of frame, which is the case with my patch from yesterday.) Now the next problem is to figure out why one of the URBs submitted by the HID consistently times out the first time round. (Oh, and there is an infinite loop in hidinput_connect() which is trivial to fix.) --david PS: It would have been nice if I had been smart enough to check the memory at address 0 upfront (as described above), but in truth, I found the problem in the reverse order by noticing that the machine died right after/during the ed_deschedule() and the clearing of the *HeadED register was just about the only thing left that could cause the crash. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-09 23:32 ` David Mosberger @ 2004-03-10 2:53 ` David Brownell 2004-03-10 6:11 ` David Mosberger 0 siblings, 1 reply; 25+ messages in thread From: David Brownell @ 2004-03-10 2:53 UTC (permalink / raw) To: davidm Cc: Grant Grundler, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini David Mosberger wrote: > Anyhow, I have been wondering since about Sunday whether it's really > safe to write HcControlHeadED (and HcBulkHeadED) with 0. The register > description itself is ambiguous. Now I'm finding that Figure 6-5 > "List Service Flow" and Section 6.4.2.2 "Locating Endpoint > Descriptors" are outright contradictory! OHCI isn't as tightly specified as one might like. There are also differences in how vendors implemented it (notably initialization); but at least there don't seem to be major incompatibilities that crept in between implementations. > So, if the HC behaves as described in the text, then there is an > obvious race: > .... > So I changed ohci-q.c ... to this: > > if (ed->ed_prev = NULL) { > if (!ed->hwNextED) { > ohci->hc_control &= ~OHCI_CTRL_CLE; > writel (ohci->hc_control, &ohci->regs->control); > } else (added the ELSE) > writel (le32_to_cpup (&ed->hwNextED), > &ohci->regs->ed_controlhead); > } else ... > > and now there are no more crashes when plugging in the BTC keyboard. Interesting. I had those "else"s in the patch I was testing too, and they were literally the last "is this really needed?" change I removed before posting that patch yesterday. I had tested it on four different implementations of OHCI (two on SMP) and the "else" didn't make a difference ... as it might not, given the race you identified. This is a good example of something better found with a fresh set of eyes looking at spec and code! Looks like that's really needed then! Whose OHCI silicon are you testing with, by the way? Good catch, regardless. - Dave ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-10 2:53 ` David Brownell @ 2004-03-10 6:11 ` David Mosberger 0 siblings, 0 replies; 25+ messages in thread From: David Mosberger @ 2004-03-10 6:11 UTC (permalink / raw) To: David Brownell Cc: davidm, Grant Grundler, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini >>>>> On Tue, 09 Mar 2004 18:53:58 -0800, David Brownell <david-b@pacbell.net> said: David.B> Whose OHCI silicon David.B> are you testing with, by the way? lspci claims it's a NEC chip: a0:01.0 USB Controller: NEC Corporation USB (rev 41) (prog-if 10 [OHCI]) Subsystem: NEC Corporation USB Flags: bus master, medium devsel, latency 128, IRQ 62 Memory at 00000000d0062000 (32-bit, non-prefetchable) [size=4K] Capabilities: <available only to root> --david ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-09 20:39 ` David Brownell 2004-03-09 23:32 ` David Mosberger @ 2004-03-10 6:59 ` David Mosberger 2004-03-10 16:22 ` David Brownell 1 sibling, 1 reply; 25+ messages in thread From: David Mosberger @ 2004-03-10 6:59 UTC (permalink / raw) To: David Brownell Cc: davidm, Grant Grundler, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini >>>>> On Tue, 09 Mar 2004 12:39:52 -0800, David Brownell <david-b@pacbell.net> said: David.B> David Mosberger wrote: >> > David Mosberger wrote: >> ... add a new state >> >> ED_DESCHEDULED, which is treated exactly like ED_IDLE, except >> >> that in this state, the HC may still be referring to the ED in >> >> question. Thus, if David.B> Sounds exactly like ED_UNLINK -- except maybe that it's not David.B> been put onto ed_rm_list (with ED_DEQUEUE set). Why add David.B> another state? >> So you can tell them apart. See ohci_endpoint_disable(). David.B> Not informative. Why doesn't UNLINK work as-is? If David.B> there's a bug in how that's handled, better to fix it. I'd David.B> be willing to believe there is one, given proof. The current OHCI relies on the internals of the dma_pool() implementation. If the implementation changed to, say, modify the memory that is free or, heaven forbid, return the memory to the kernel, you'd get _extremely_ difficult to track down race conditions. Even so, the code isn't race-free, like I explained yesterday: - ed_alloc() clears the ED to 0 via memset() - the order in which memset() clears memory is undefined (various from platform to platform etc) - thus you might get a case where hwTailP is 0 but hwHeadP is non-zero, which would cause the HC to happily start dereferencing the descriptor There are probably other potential issues. Frankly, I _don't_ want to have to deal with this. Especially considering that the alternative costs virtually nothing, requires just a few source code changes, and contains the EDs that are potentially still referenced by the HC to the HC code itself. David.B> But some parts worry me. Like changing that code to BUG() David.B> on a driver behavior that's perfectly reasonable; I think the patch below incorporates your suggestions and fixes the problems you pointed out. In particular: - dropped debug printing - allow URB submission during ED_UNLINK again - add a big fat comment in ed_deschedule() why we must not update controlhead when the list goes empty (it's quite counter-intuitive to _not_ do it and I'm worried someone in the feature may want to "fix" this again...) The rest should be the samae as yesterday. The patch is still against 2.6.4-rc2 (and still contains your hwTailP update fixes). Please consider for inclusion. --david === drivers/usb/host/ohci-hcd.c 1.56 vs edited ==--- 1.56/drivers/usb/host/ohci-hcd.c Tue Mar 2 05:52:40 2004 +++ edited/drivers/usb/host/ohci-hcd.c Tue Mar 9 22:29:08 2004 @@ -239,8 +239,8 @@ goto fail; } - /* schedule the ed if needed */ - if (ed->state = ED_IDLE) { + if (ed->state = ED_IDLE || ed->state = ED_DESCHEDULED) { + /* schedule the ed if needed */ retval = ed_schedule (ohci, ed); if (retval < 0) goto fail0; @@ -347,6 +347,10 @@ if (!HCD_IS_RUNNING (ohci->hcd.state)) ed->state = ED_IDLE; switch (ed->state) { + case ED_DESCHEDULED: + start_ed_unlink (ohci, ed); + BUG_ON (ed->state != ED_UNLINK); + /* fall through */ case ED_UNLINK: /* wait for hw to finish? */ /* major IRQ delivery trouble loses INTR_SF too... */ WARN_ON (limit-- = 0); === drivers/usb/host/ohci-q.c 1.48 vs edited ==--- 1.48/drivers/usb/host/ohci-q.c Tue Mar 2 05:52:46 2004 +++ edited/drivers/usb/host/ohci-q.c Tue Mar 9 22:27:04 2004 @@ -172,6 +172,7 @@ ed->ed_prev = 0; ed->ed_next = 0; ed->hwNextED = 0; + ed->hwINFO &= ~ED_SKIP; wmb (); /* we care about rm_list when setting CLE/BLE in case the HC was at @@ -187,6 +188,8 @@ switch (ed->type) { case PIPE_CONTROL: if (ohci->ed_controltail = NULL) { + /* Writing of control-head is safe only if CLE is off. */ + BUG_ON (ohci->hc_control & OHCI_CTRL_CLE); writel (ed->dma, &ohci->regs->ed_controlhead); } else { ohci->ed_controltail->ed_next = ed; @@ -203,6 +206,8 @@ case PIPE_BULK: if (ohci->ed_bulktail = NULL) { + /* Writing of bulk-head is safe only if BLE is off. */ + BUG_ON (ohci->hc_control & OHCI_CTRL_BLE); writel (ed->dma, &ohci->regs->ed_bulkhead); } else { ohci->ed_bulktail->ed_next = ed; @@ -267,10 +272,15 @@ ed, ed->branch, ed->load, ed->interval); } -/* unlink an ed from one of the HC chains. +/* Deschedule an ed from one of the HC chains. * just the link to the ed is unlinked. * the link from the ed still points to another operational ed or 0 - * so the HC can eventually finish the processing of the unlinked ed + * so the HC can eventually finish the processing of the unlinked ed. + * The ED isn't idle after returning from this routine, because the HC + * may continue to refer to it, but as far as the HCD is concerned, + * the ED is reusable. If it is necessary to get the ED into the + * ED_IDLE state, the full unlink-sequence needs to be performed + * (via start_ed_unlink()). */ static void ed_deschedule (struct ohci_hcd *ohci, struct ed *ed) { @@ -278,20 +288,33 @@ switch (ed->type) { case PIPE_CONTROL: + /* remove ED from the HC's list: */ if (ed->ed_prev = NULL) { if (!ed->hwNextED) { ohci->hc_control &= ~OHCI_CTRL_CLE; writel (ohci->hc_control, &ohci->regs->control); - writel (0, &ohci->regs->ed_controlcurrent); - // post those pci writes - (void) readl (&ohci->regs->control); - } - writel (le32_to_cpup (&ed->hwNextED), - &ohci->regs->ed_controlhead); + /* + * Caveat: even though the list is + * empty now, we MUST NOT write 0 to + * controlhead. The OHCI spec is + * ambiguous on this point (Figure 6-5 + * and Section 6.4.2.2 specify + * conflicting behaviors), but there + * are definitely HCs out there which + * will happily try to process an ED + * from address 0. It's OK not to + * update controlhead because we leave + * the ED alive for long enough that + * this is safe. + */ + } else + writel (le32_to_cpup (&ed->hwNextED), + &ohci->regs->ed_controlhead); } else { ed->ed_prev->ed_next = ed->ed_next; ed->ed_prev->hwNextED = ed->hwNextED; } + /* remove ED from the HCD's list: */ if (ohci->ed_controltail = ed) { ohci->ed_controltail = ed->ed_prev; if (ohci->ed_controltail) @@ -302,20 +325,33 @@ break; case PIPE_BULK: + /* remove ED from the singly-linked HC's list: */ if (ed->ed_prev = NULL) { if (!ed->hwNextED) { ohci->hc_control &= ~OHCI_CTRL_BLE; writel (ohci->hc_control, &ohci->regs->control); - writel (0, &ohci->regs->ed_bulkcurrent); - // post those pci writes - (void) readl (&ohci->regs->control); - } - writel (le32_to_cpup (&ed->hwNextED), - &ohci->regs->ed_bulkhead); + /* + * Caveat: even though the list is + * empty now, we MUST NOT write 0 to + * bulkhead. The OHCI spec is + * ambiguous on this point (Figure 6-5 + * and Section 6.4.2.2 specify + * conflicting behaviors), but there + * are definitely HCs out there which + * will happily try to process an ED + * from address 0. It's OK not to + * update controlhead because we leave + * the ED alive for long enough that + * this is safe. + */ + } else + writel (le32_to_cpup (&ed->hwNextED), + &ohci->regs->ed_bulkhead); } else { ed->ed_prev->ed_next = ed->ed_next; ed->ed_prev->hwNextED = ed->hwNextED; } + /* remove ED from the HCD's list: */ if (ohci->ed_bulktail = ed) { ohci->ed_bulktail = ed->ed_prev; if (ohci->ed_bulktail) @@ -331,6 +367,12 @@ periodic_unlink (ohci, ed); break; } + ed->state = ED_DESCHEDULED; + /* + * Ensure HCD sees these updates (in particular update of + * hwINFO) before any following updates. + */ + wmb (); } @@ -387,7 +429,7 @@ /* NOTE: only ep0 currently needs this "re"init logic, during * enumeration (after set_address, or if ep0 maxpacket >8). */ - if (ed->state = ED_IDLE) { + if (ed->state = ED_IDLE || ed->state = ED_DESCHEDULED) { u32 info; info = usb_pipedevice (pipe); @@ -429,15 +471,35 @@ */ static void start_ed_unlink (struct ohci_hcd *ohci, struct ed *ed) { + u32 mask = 0; + + /* deschedule ED as far as the HCD is concerned: */ + ed_deschedule (ohci, ed); + + /* now initiate the unlink sequence to ensure the HC isn't using the ED anymore: */ + if (ed->type = PIPE_CONTROL) + mask = OHCI_CTRL_CLE; + else if (ed->type = PIPE_BULK) + mask = OHCI_CTRL_BLE; + if (mask) { + /* + * Disable scanning of the control or bulk list; this + * ensures that when we get an interrupt with a frame + * # greater than the current frame #, the HC isn't + * using the list anymore. + */ + ohci->hc_control &= ~mask; + writel (ohci->hc_control, &ohci->regs->control); + } ed->hwINFO |= ED_DEQUEUE; ed->state = ED_UNLINK; - ed_deschedule (ohci, ed); /* SF interrupt might get delayed; record the frame counter value that * indicates when the HC isn't looking at it, so concurrent unlinks * behave. frame_no wraps every 2^16 msec, and changes right before * SF is triggered. */ + rmb(); /* ensure ed_deschedule() work is done before we read the frame # */ ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1; /* rm_list is just singly linked, for simplicity */ @@ -452,6 +514,7 @@ // flush those pci writes (void) readl (&ohci->regs->control); } + /* ??? What if HCD isn't running? Shouldn't that be handled as well? */ } /*-------------------------------------------------------------------------* @@ -614,6 +677,7 @@ info = TD_CC | TD_DP_SETUP | TD_T_DATA0; td_fill (ohci, info, urb->setup_dma, 8, urb, cnt++); if (data_len > 0) { + BUG_ON(data_len >= 4096); info = TD_CC | TD_R | TD_T_DATA1; info |= is_out ? TD_DP_OUT : TD_DP_IN; /* NOTE: mishandles transfers >8K, some >4K */ @@ -794,8 +858,6 @@ next->next_dl_td = rev; rev = next; - if (ed->hwTailP = cpu_to_le32 (next->td_dma)) - ed->hwTailP = next->hwNextTD; ed->hwHeadP = next->hwNextTD | toggle; } @@ -941,12 +1003,7 @@ continue; } - /* patch pointers hc uses ... tail, if we're removing - * an otherwise active td, and whatever td pointer - * points to this td - */ - if (ed->hwTailP = cpu_to_le32 (td->td_dma)) - ed->hwTailP = td->hwNextTD; + /* patch pointers hc uses */ savebits = *prev & ~cpu_to_le32 (TD_MASK); *prev = td->hwNextTD | savebits; @@ -957,7 +1014,7 @@ /* if URB is done, clean up */ if (urb_priv->td_cnt = urb_priv->length) { modified = completed = 1; - finish_urb (ohci, urb, regs); + finish_urb (ohci, urb, regs); /* this drops ohci->lock! */ } } if (completed && !list_empty (&ed->td_list)) @@ -1039,9 +1096,9 @@ if (urb_priv->td_cnt = urb_priv->length) finish_urb (ohci, urb, regs); - /* clean schedule: unlink EDs that are no longer busy */ + /* clean schedule: deschedule EDs that are no longer busy */ if (list_empty (&ed->td_list)) - start_ed_unlink (ohci, ed); + ed_deschedule (ohci, ed); /* ... reenabling halted EDs only after fault cleanup */ else if ((ed->hwINFO & (ED_SKIP | ED_DEQUEUE)) = ED_SKIP) { td = list_entry (ed->td_list.next, struct td, td_list); === drivers/usb/host/ohci.h 1.19 vs edited ==--- 1.19/drivers/usb/host/ohci.h Wed Feb 11 03:42:39 2004 +++ edited/drivers/usb/host/ohci.h Mon Mar 8 23:03:31 2004 @@ -48,6 +48,7 @@ #define ED_IDLE 0x00 /* NOT linked to HC */ #define ED_UNLINK 0x01 /* being unlinked from hc */ #define ED_OPER 0x02 /* IS linked to hc */ +#define ED_DESCHEDULED 0x03 /* idle for HCD, but HC may still hold reference to it */ u8 type; /* PIPE_{BULK,...} */ ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-10 6:59 ` David Mosberger @ 2004-03-10 16:22 ` David Brownell 2004-03-10 18:04 ` David Mosberger 0 siblings, 1 reply; 25+ messages in thread From: David Brownell @ 2004-03-10 16:22 UTC (permalink / raw) To: davidm Cc: Grant Grundler, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini I'll send out a revised patch later, thanks! It's also good this code got a more careful read. It seems like some things are not as obvious as I might like... That patch will merge those list corruption fixes I sent, the "else" you verified was needed (ugh!!!), and some of what you include here. It won't add new BUG_ON calls (WARN at worst) or a duplicate ED state (see next). > >> ... add a new state > >> ED_DESCHEDULED, which is treated exactly like ED_IDLE, except > >> that in this state, the HC may still be referring to the ED in > >> question. Thus, if > > David.B> Sounds exactly like ED_UNLINK -- except maybe that it's not > David.B> been put onto ed_rm_list (with ED_DEQUEUE set). Why add > David.B> another state? > ... > > The current OHCI relies on the internals of the dma_pool() > implementation. If the implementation changed to, say, modify the > memory that is free or, heaven forbid, return the memory to the > kernel, you'd get _extremely_ difficult to track down race conditions. The current implementation _does_ poison memory on free, if slab poisoning is enabled. (That's why I asked if you were using it.) And that's been quite handy for reporting list corruption bugs, from races or otherwise. But those list corruption bugs hit a blind spot in that code: it doesn't check for modify-after-free. Which is why those bugs were able to hide for so long! It'd be good if you said _how_ you think it relies on such internals. Some of your debug diagnostics wrongly claimed allocation of "new" EDs when they were just being re-used. That'd make intentional/safe re-use look like a bug or race. > Even so, the code isn't race-free, like I explained yesterday: > > - ed_alloc() clears the ED to 0 via memset() > - the order in which memset() clears memory is undefined (various > from platform to platform etc) There's a wmb() before any ED is handed off to the OHCI silicon; that forces a defined order. Top of ed_schedule(). First use, or Nth re-use; no matter. > - thus you might get a case where hwTailP is 0 but hwHeadP > is non-zero, which would cause the HC to happily start > dereferencing the descriptor If you assume a bug where the ED is freed but still in use, that's hardly the only thing that'd go wrong!! You can't use such a potential bug to prove something else is broken. - Dave ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-10 16:22 ` David Brownell @ 2004-03-10 18:04 ` David Mosberger 2004-03-11 2:43 ` David Brownell 0 siblings, 1 reply; 25+ messages in thread From: David Mosberger @ 2004-03-10 18:04 UTC (permalink / raw) To: David Brownell Cc: davidm, Grant Grundler, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini >>>>> On Wed, 10 Mar 2004 08:22:26 -0800, David Brownell <david-b@pacbell.net> said: David.B> It won't add new BUG_ON calls (WARN at worst) I put them there mostly as assertions. What I'd really want there is a DEBUG_BUG_ON, which is more like assert() in user-land (i.e., production code would drop the checks). WARN_ON() would be fine, too. >> The current OHCI relies on the internals of the dma_pool() >> implementation. If the implementation changed to, say, modify >> the memory that is free or, heaven forbid, return the memory to >> the kernel, you'd get _extremely_ difficult to track down race >> conditions. David.B> It'd be good if you said _how_ you think it relies on such David.B> internals. I thought I did. Suppose somebody changed the dma_pool code such that it would overwrite freed memory with an 0xf00000000000000 pattern. If the HC can still hold a reference to a freed ED (it can without my patch), the HC could see this kind of ED: hw=(info\0000000 tailpð000000 headp\0000000 nextEDð000000) If so, the HC would go ahead and try to interpret the memory at address 0 as a transfer descriptor. Depending on the memory contents, this could cause silent data corruption at an arbitrary address. >> - thus you might get a case where hwTailP is 0 but hwHeadP is >> non-zero, which would cause the HC to happily start dereferencing >> the descriptor David.B> If you assume a bug where the ED is freed but still in use, David.B> that's hardly the only thing that'd go wrong!! You can't David.B> use such a potential bug to prove something else is broken. You lost me here. All I'm saying is that the current code has a dangerous race that can corrupt memory, crash machines, or have all sorts of other wild side-effects. I never claimed this bug had anything todo with the BTC keyboard problem. --david ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-10 18:04 ` David Mosberger @ 2004-03-11 2:43 ` David Brownell 2004-03-11 5:35 ` David Mosberger 0 siblings, 1 reply; 25+ messages in thread From: David Brownell @ 2004-03-11 2:43 UTC (permalink / raw) To: davidm Cc: Grant Grundler, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini David Mosberger wrote: > >> The current OHCI relies on the internals of the dma_pool() > >> implementation. ... > David.B> It'd be good if you said _how_ you think it relies on such > David.B> internals. > > I thought I did. Suppose somebody changed the dma_pool code such that > it would overwrite freed memory with an 0xf00000000000000 pattern. Erm, _anything_ the dma_pool code does with freed memory is legal. Even the old "monkeys flying out of the back of the server" trick! :) Anyway, please (a) see if 2.6.4 works for you, and (b) direct any future followups on this thread _only_ to linux-usb-devel. - Dave ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-11 2:43 ` David Brownell @ 2004-03-11 5:35 ` David Mosberger 0 siblings, 0 replies; 25+ messages in thread From: David Mosberger @ 2004-03-11 5:35 UTC (permalink / raw) To: David Brownell Cc: davidm, Grant Grundler, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini >>>>> On Wed, 10 Mar 2004 18:43:25 -0800, David Brownell <david-b@pacbell.net> said: David.B> David Mosberger wrote: >> The current OHCI relies on the internals of the dma_pool() >> implementation. ... David.B> It'd be good if you said _how_ you think it relies on such David.B> internals. >> I thought I did. Suppose somebody changed the dma_pool code >> such that it would overwrite freed memory with an >> 0xf00000000000000 pattern. David.B> Erm, _anything_ the dma_pool code does with freed memory is David.B> legal. Glad to see we agree on that! David.B> Anyway, please (a) see if 2.6.4 works for you, and (b) David.B> direct any future followups on this thread _only_ to David.B> linux-usb-devel. The patch that's in 2.6.4 definitely looks like a step in the right direction. I still have some concerns. I'll follow up on linux-usb-devel with more details. Thanks, --david ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-06 4:55 ` David Brownell 2004-03-06 5:49 ` David Mosberger @ 2004-03-06 9:17 ` David Mosberger 2004-03-06 17:30 ` David Brownell 1 sibling, 1 reply; 25+ messages in thread From: David Mosberger @ 2004-03-06 9:17 UTC (permalink / raw) To: David Brownell Cc: davidm, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini >>>>> On Fri, 05 Mar 2004 20:55:01 -0800, David Brownell <david-b@pacbell.net> said: >> Does this sound plausible? David.B> Parts of it. There's definite recent nastiness. Of the David.B> type that other eyes sometimes see better. Here is patch #3. It also Works For Me. I was wondering whether it it is really safe to mess with the OHCI control registers the way ed_deschedule() does at a time the OHCI is running. To test this theory, I delayed the ed_deschedule() handling to finish_unlinks(), as shown in the patch below. I don't know whether this is really safe as far as the host's lists are concerned, but it does avoid the crashes. What's the argument as to why it's safe to update the OHCI control registers in ed_deschedule() at the time start_ed_unlink() is running? --david === drivers/usb/host/ohci-q.c 1.48 vs edited ==--- 1.48/drivers/usb/host/ohci-q.c Tue Mar 2 05:52:46 2004 +++ edited/drivers/usb/host/ohci-q.c Sat Mar 6 01:09:16 2004 @@ -274,7 +274,10 @@ */ static void ed_deschedule (struct ohci_hcd *ohci, struct ed *ed) { +#if 0 ed->hwINFO |= ED_SKIP; + wmb(); +#endif switch (ed->type) { case PIPE_CONTROL: @@ -431,7 +434,12 @@ { ed->hwINFO |= ED_DEQUEUE; ed->state = ED_UNLINK; +#if 0 ed_deschedule (ohci, ed); +#else + ed->hwINFO |= ED_SKIP; + wmb(); +#endif /* SF interrupt might get delayed; record the frame counter value that * indicates when the HC isn't looking at it, so concurrent unlinks @@ -896,6 +904,11 @@ last = &ed->ed_next; continue; } + +#if 0 +#else + ed_deschedule (ohci, ed); +#endif if (!list_empty (&ed->td_list)) { struct td *td; ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-06 9:17 ` David Mosberger @ 2004-03-06 17:30 ` David Brownell 2004-03-08 18:49 ` David Mosberger 0 siblings, 1 reply; 25+ messages in thread From: David Brownell @ 2004-03-06 17:30 UTC (permalink / raw) To: davidm; +Cc: Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini David Mosberger wrote: > Here is patch #3. It also Works For Me. I was wondering whether it I've had several "Works For Me" patches too, but then if the silicion got kicked a bit differently it'd not behave... :( > it is really safe to mess with the OHCI control registers the way > ed_deschedule() does at a time the OHCI is running. To test this It must be, or we'd not have had a driver working for several years now! The quick stop/restart cycles haven't seemed to be a problem with any OHCI silicion in the way they are with, for example, VIA EHCI. > theory, I delayed the ed_deschedule() handling to finish_unlinks(), as > shown in the patch below. I don't know whether this is really safe as > far as the host's lists are concerned, but it does avoid the crashes. My suspicions have been focussing on finish_unlinks(). That's really the only place the HCD does anything that could corrupt the ED queues, which is what looks to be happening. Your change doesn't actually _unlink_ in the same way; interesting change, I'll have to think about it. It certainly changes timings. > What's the argument as to why it's safe to update the OHCI control > registers in ed_deschedule() at the time start_ed_unlink() is running? It's always safe to update those registers, except that some silicon doesn't support that while the controller is suspended. - Dave ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? 2004-03-06 17:30 ` David Brownell @ 2004-03-08 18:49 ` David Mosberger 0 siblings, 0 replies; 25+ messages in thread From: David Mosberger @ 2004-03-08 18:49 UTC (permalink / raw) To: David Brownell Cc: davidm, Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini >>>>> On Sat, 06 Mar 2004 09:30:31 -0800, David Brownell <david-b@pacbell.net> said: David.B> David Mosberger wrote: >> Here is patch #3. It also Works For Me. I was wondering whether >> it David.B> I've had several "Works For Me" patches too, but then if David.B> the silicion got kicked a bit differently it'd not David.B> behave... :( Just for completeness, I can confirm this. Depending on the exact USB configuration (varying number and types of devices connected via a varying number of hubs), there are still problems. So clearly my patches didn't address the root cause (yet). --david ^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2004-03-11 5:35 UTC | newest]
Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <200310272235.h9RMZ9x1000602@napali.hpl.hp.com>
[not found] ` <20031028013013.GA3991@kroah.com>
[not found] ` <200310280300.h9S30Hkw003073@napali.hpl.hp.com>
[not found] ` <3FA12A2E.4090308@pacbell.net>
[not found] ` <16289.29015.81760.774530@napali.hpl.hp.com>
[not found] ` <16289.55171.278494.17172@napali.hpl.hp.com>
[not found] ` <3FA28C9A.5010608@pacbell.net>
2004-03-06 2:08 ` [linux-usb-devel] Re: serious 2.6 bug in USB subsystem? David Mosberger
2004-03-06 2:13 ` David Mosberger
2004-03-06 4:55 ` David Brownell
2004-03-06 5:49 ` David Mosberger
2004-03-06 7:21 ` David Mosberger
2004-03-06 8:39 ` David Mosberger
2004-03-06 16:37 ` David Brownell
2004-03-08 6:18 ` Grant Grundler
2004-03-08 18:58 ` David Mosberger
2004-03-08 21:48 ` David Brownell
2004-03-09 9:15 ` David Mosberger
2004-03-09 17:36 ` David Brownell
2004-03-09 17:58 ` David Mosberger
2004-03-09 20:39 ` David Brownell
2004-03-09 23:32 ` David Mosberger
2004-03-10 2:53 ` David Brownell
2004-03-10 6:11 ` David Mosberger
2004-03-10 6:59 ` David Mosberger
2004-03-10 16:22 ` David Brownell
2004-03-10 18:04 ` David Mosberger
2004-03-11 2:43 ` David Brownell
2004-03-11 5:35 ` David Mosberger
2004-03-06 9:17 ` David Mosberger
2004-03-06 17:30 ` David Brownell
2004-03-08 18:49 ` David Mosberger
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox