From: David Brownell <david-b@pacbell.net>
To: davidm@hpl.hp.com
Cc: Greg KH <greg@kroah.com>,
vojtech@suse.cz, linux-usb-devel@lists.sourceforge.net,
linux-kernel@vger.kernel.org, linux-ia64@vger.kernel.org,
pochini@shiny.it
Subject: Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
Date: Fri, 05 Mar 2004 20:55:01 -0800 [thread overview]
Message-ID: <404959A5.6040809@pacbell.net> (raw)
In-Reply-To: <16457.12968.365287.561596@napali.hpl.hp.com>
David Mosberger wrote:
> OK, finally a bit of progress. If you remember back in October 2003 I
> reported:
>
> > One-line summary: plug-in your USB keyboard, see your machine die.
>
> > So, I have this non-name USB keyboard (with built-in 2-port USB
> > hub) which reliably crashes 2.6.0-test{8,9} on both x86 and ia64.
> > In retrospect, it's clear to me that the same keyboard also
> > occasionally crashes 2.4 kernels, but there the problem appears
> > more seldom.
>
> Specifically, after upgrading to 2.6.4-rc2, _all_ of the ia64 machines
> I tested would crash as soon as they had _any_ USB keyboard plugged
> in. That is, the problem no longer was limited to the BTC keyboard,
> which is special because it has a built-in hub. This was encouraging.
>
> Turns out it's this patch that was causing the crashes:
>
> http://linux.bkbits.net:8080/linux-2.5/cset@1.1619.1.17
Maybe in 2.6.4-rc2... but not in 2.6.0-test{8,9}!!
There's something wierd going on recently for sure, and it's
not caused just by that patch. I'm not sure reverting that
would make things better overall right now; hard to say.
I'm thinking the "disable periodic schedule" patch is also
worth looking at. I can imagine some silicon having bugs
if that's turned off (just like some _systems_ have issues
when it's left on).
> That was strange, because even to my USB-untrained eye the patch
> looked obviously correct. However, I think the root cause of the
> problem really has to do with a race-condition between the controller
> and the driver. In particular, if I apply the patch below, my USB
> keyboards (including the BTC keyboard) work just fine!
>
> ...
> - ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1;
> + ed->tick = OHCI_FRAME_NO(ohci->hcca) + 2;
I've seen that _change_ behavior in some regression tests
(usbtest test11/test12), as if the extra msec let things
quiesce (so only one of two broken states showed, and
not the oopsable one) but not _fix_ it.
> However, I think the root-cause of the problem may be this optimization
> in ohci_irq():
>
> /* we can eliminate a (slow) readl() if _only_ WDH caused this irq */
>
> Indeed, if I apply this patch instead:
>
> ...
> /* we can eliminate a (slow) readl() if _only_ WDH caused this irq */
> - if ((ohci->hcca->done_head != 0)
> + if (0 && (ohci->hcca->done_head != 0)
> ...
>
> there are no crashes either.
See this post from Martin Diehl ... my response isn't out yet:
http://marc.theaimsgroup.com/?l=linux-usb-devel&m=107850825815775&w=2
The reason I keep ending up thinking that readl-elimination
must be OK (me agreeing with Martin) is that the memory there
came from dma_alloc_coherent() ... so if anything's wrong,
it'd be at most lack of rmb(), not a stale-cache kind of thing.
> So my theory is that I was seeing this sequence of events:
>
> - HCD signals WDH interrupt & sends DMA to update the frame number in
> the host-controller communication area (HCCA)
>
> - host gets interrupt, but skips readl() and hence reads a stale
> frame number N instead of the up-to-date value (N+1)
It reads the frame number directly from the controller, so it's
not possible that it can be so stale that an rmb() wouldn't fix
sequencing issues.
What might be possible though is that the donelist gets modified
by the time the unlinks get processed, with some extra TDs changing
state (from HC perspective) ... haven't explored that possibility.
> - HCD cancels a transfer descriptor (TD), moves it to the "remove list"
> and calculates the frame number at which it can be remove from
> the host-controller's list as N+1
That'd be an ED on the remove list, not a TD. Also in dma-coherent
memory. The cancelation would apply to one or more of the TDs
queued to that ED.
> - SOF interrupt arrives (probably was pending already?)
>
> - interrupt handler does a readl() and now sees the updated
> frame-number N+1
>
> - HCD sees that the cancelled TD's time stamp N+1 is <= the current
> current time stamp (N+1) and goes ahead and removes it from the
> host-list, while the controller is still looking at the entry being
> removed
This ED-level disagreement between the HC and HCD might explain
some issues. I think the current trouble cases are usually where
there's only one TD queued to the ED, so that TD and the dummy
keep ping-ponging back and forth. The HC certainly seems to
overwrite the "dummy TD" --
> - HCD ends up dereferencing a bad pointer and ends up reading from
> address 0xf0000000, which on our ia64 machines is a read-only area,
> which then results in a machine-check abort
I'm surprised that DMA from a read-only area would be a problem! :)
If OHCI is getting a PCI error, I'd expect a "UE" IRQ.
> Does this sound plausible?
Parts of it. There's definite recent nastiness. Of the type
that other eyes sometimes see better.
> What beats me is why UHCI would have the same issue. I know even less
> about UHCI than I do about OHCI but perhaps there is a similar
> problem.
I still suspect some problem in the HID code. But right now
I'm quite certain of a recent-ish OHCI issue.
- Dave
>
> --david
>
next prev parent reply other threads:[~2004-03-06 4:58 UTC|newest]
Thread overview: 45+ messages / expand[flat|nested] mbox.gz Atom feed top
2003-10-27 22:35 serious 2.6 bug in USB subsystem? David Mosberger
2003-10-28 1:30 ` Greg KH
2003-10-28 3:00 ` David Mosberger
2003-10-30 15:11 ` [linux-usb-devel] " David Brownell
2003-10-30 20:15 ` David Mosberger
[not found] ` <16289.55171.278494.17172@napali.hpl.hp.com>
2003-10-31 16:23 ` David Brownell
2003-10-31 18:34 ` David Mosberger
2003-10-31 18:50 ` Valdis.Kletnieks
2003-10-31 19:28 ` David Brownell
2003-10-31 19:50 ` David Mosberger
2003-10-31 20:06 ` David S. Miller
2004-03-06 2:08 ` David Mosberger
2004-03-06 2:13 ` David Mosberger
2004-03-06 4:55 ` David Brownell [this message]
2004-03-06 5:49 ` David Mosberger
2004-03-06 7:21 ` David Mosberger
2004-03-06 8:39 ` David Mosberger
2004-03-06 16:37 ` David Brownell
2004-03-08 6:18 ` Grant Grundler
2004-03-08 18:58 ` David Mosberger
2004-03-08 21:48 ` David Brownell
2004-03-09 9:15 ` David Mosberger
2004-03-09 17:36 ` David Brownell
2004-03-09 17:58 ` David Mosberger
2004-03-09 20:39 ` David Brownell
2004-03-09 23:32 ` David Mosberger
2004-03-10 2:53 ` David Brownell
2004-03-10 6:11 ` David Mosberger
2004-03-10 6:59 ` David Mosberger
2004-03-10 7:52 ` Wouter Lueks
2004-03-10 16:49 ` David Mosberger
2004-03-10 19:49 ` Wouter Lueks
2004-03-10 16:22 ` David Brownell
2004-03-10 18:04 ` David Mosberger
2004-03-11 2:43 ` David Brownell
2004-03-11 5:35 ` David Mosberger
2004-03-10 19:29 ` Colin Leroy
2004-03-06 9:17 ` [linux-usb-devel] " David Mosberger
2004-03-06 17:30 ` David Brownell
2004-03-07 13:48 ` Matthias Andree
2004-03-08 18:49 ` David Mosberger
2003-11-03 3:46 ` David Brownell
2003-11-03 21:25 ` David Mosberger
2004-03-03 12:33 ` Wouter Lueks
2004-03-03 15:30 ` Wouter Lueks
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=404959A5.6040809@pacbell.net \
--to=david-b@pacbell.net \
--cc=davidm@hpl.hp.com \
--cc=greg@kroah.com \
--cc=linux-ia64@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-usb-devel@lists.sourceforge.net \
--cc=pochini@shiny.it \
--cc=vojtech@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox