From: David Mosberger <davidm@napali.hpl.hp.com>
To: David Brownell <david-b@pacbell.net>
Cc: davidm@hpl.hp.com, Greg KH <greg@kroah.com>,
vojtech@suse.cz, linux-usb-devel@lists.sourceforge.net,
linux-kernel@vger.kernel.org, linux-ia64@vger.kernel.org,
pochini@shiny.it
Subject: Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
Date: Sat, 06 Mar 2004 02:08:40 +0000
Message-ID: <16457.12968.365287.561596@napali.hpl.hp.com>
In-Reply-To: <3FA28C9A.5010608@pacbell.net>
OK, finally a bit of progress. If you remember back in October 2003 I
reported:
> One-line summary: plug-in your USB keyboard, see your machine die.
> So, I have this non-name USB keyboard (with built-in 2-port USB
> hub) which reliably crashes 2.6.0-test{8,9} on both x86 and ia64.
> In retrospect, it's clear to me that the same keyboard also
> occasionally crashes 2.4 kernels, but there the problem appears
> more seldom. Perhaps once in 10 reboots and once the machine is
> booted and the keyboard is running, it keeps on working. The
> keyboard in question is a BTC 5141H.
After this, I spent a (small) amount of time looking over the HID code
etc to see what could be causing it. I could find nothing wrong so I
gave up, connected another USB keyboard, and basically ignored the
problem. In retrospect, that was Good Thinking, because I was
apparently looking at the wrong code: the problem _does_ appear to be
coming from the USB HCD, not from the HIDeous code.
Specifically, after upgrading to 2.6.4-rc2, _all_ of the ia64 machines
I tested would crash as soon as they had _any_ USB keyboard plugged
in. That is, the problem no longer was limited to the BTC keyboard,
which is special because it has a built-in hub. This was encouraging.
Turns out it's this patch that was causing the crashes:
http://linux.bkbits.net:8080/linux-2.5/cset@1.1619.1.17
That was strange, because even to my USB-untrained eye the patch
looked obviously correct. However, I think the root cause of the
problem really has to do with a race-condition between the controller
and the driver. In particular, if I apply the patch below, my USB
keyboards (including the BTC keyboard) work just fine!
===== drivers/usb/host/ohci-q.c 1.48 vs edited =====
--- 1.48/drivers/usb/host/ohci-q.c	Tue Mar  2 05:52:46 2004
+++ edited/drivers/usb/host/ohci-q.c	Fri Mar  5 17:25:55 2004
@@ -438,7 +451,7 @@
 	 * behave.  frame_no wraps every 2^16 msec, and changes right before
 	 * SF is triggered.
 	 */
-	ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1;
+	ed->tick = OHCI_FRAME_NO(ohci->hcca) + 2;
 	/* rm_list is just singly linked, for simplicity */
 	ed->ed_next = ohci->ed_rm_list;
However, I think the root cause of the problem may be this optimization
in ohci_irq():
/* we can eliminate a (slow) readl() if _only_ WDH caused this irq */
Indeed, if I apply this patch instead:
===== drivers/usb/host/ohci-hcd.c 1.56 vs edited =====
--- 1.56/drivers/usb/host/ohci-hcd.c	Tue Mar  2 05:52:40 2004
+++ edited/drivers/usb/host/ohci-hcd.c	Fri Mar  5 17:45:09 2004
@@ -584,7 +584,7 @@
 	int ints;
 	/* we can eliminate a (slow) readl() if _only_ WDH caused this irq */
-	if ((ohci->hcca->done_head != 0)
+	if (0 && (ohci->hcca->done_head != 0)
 			&& ! (le32_to_cpup (&ohci->hcca->done_head) & 0x01)) {
 		ints = OHCI_INTR_WDH;
there are no crashes either.
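
For readers who don't have ohci_irq() in front of them, the shortcut being
disabled above can be modelled roughly as follows. This is only a sketch,
not the driver code: irq_sources(), struct fake_regs and fake_readl() are
made-up names for illustration, done_head is assumed to be already
converted to host byte order, and the bit values are my reading of the
OHCI interrupt-status layout. The point is merely that when done_head is
non-zero and its low bit is clear, the register read is skipped entirely.

```c
#include <stdint.h>

#define OHCI_INTR_WDH 0x00000002u   /* writeback of done_head */
#define OHCI_INTR_SF  0x00000004u   /* start of frame */

/* Hypothetical stand-in for the memory-mapped controller registers. */
struct fake_regs {
    uint32_t intrstatus;   /* what a real readl() would return */
    int      reads;        /* how many (slow) register reads were done */
};

/* Hypothetical stand-in for readl(&regs->intrstatus). */
uint32_t fake_readl(struct fake_regs *regs)
{
    regs->reads++;
    return regs->intrstatus;
}

/*
 * Model of the shortcut: done_head != 0 with bit 0 clear is taken to
 * mean "only WDH is pending", so the register read is skipped.
 * done_head is assumed to be in host byte order here.
 */
uint32_t irq_sources(uint32_t done_head, struct fake_regs *regs)
{
    if (done_head != 0 && !(done_head & 0x01))
        return OHCI_INTR_WDH;       /* fast path: no register access */
    return fake_readl(regs);        /* slow path: ask the controller */
}
```

With the "0 &&" patch applied, the model always takes the slow path, so
every interrupt includes a register read before the handler looks at
anything else in the HCCA.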
So my theory is that I was seeing this sequence of events:
- host controller (HC) signals WDH interrupt & sends DMA to update the
frame number in the host-controller communication area (HCCA)
- host gets interrupt, but skips readl() and hence reads a stale
frame number N instead of the up-to-date value (N+1)
- HCD cancels a transfer descriptor (TD), moves it to the "remove list"
and calculates the frame number at which it can be removed from
the host-controller's list as N+1
- SOF interrupt arrives (probably was pending already?)
- interrupt handler does a readl() and now sees the updated
frame-number N+1
- HCD sees that the cancelled TD's time stamp N+1 is <= the current
time stamp (N+1) and goes ahead and removes it from the
host-list, while the controller is still looking at the entry being
removed
- HCD ends up dereferencing a bad pointer and ends up reading from
address 0xf0000000, which on our ia64 machines is a read-only area,
which then results in a machine-check abort
Does this sound plausible?
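
The frame arithmetic in that sequence can be checked with a small
stand-alone model. It is a sketch under assumptions, not the driver's
code: unlink_tick() and may_reclaim() are hypothetical helpers, and
tick_before() is written here as a signed 16-bit comparison, which is how
I understand the driver copes with the 2^16 wrap of the frame number. The
model shows that a tick computed from a stale frame number N with a delay
of 1 is already reclaimable at frame N+1, whereas a delay of 2 holds the
entry back until N+2.

```c
#include <stdint.h>

/* True if t1 is earlier than t2, treating the 16-bit counter as circular. */
int tick_before(uint16_t t1, uint16_t t2)
{
    return (int16_t)(uint16_t)(t1 - t2) < 0;
}

/*
 * Hypothetical helper: the tick stored at unlink time, computed from the
 * frame number the driver believes is current (possibly stale).
 */
uint16_t unlink_tick(uint16_t seen_frame, uint16_t delay)
{
    return (uint16_t)(seen_frame + delay);
}

/*
 * Hypothetical helper: may an entry with this tick be taken off the
 * remove list once the driver observes frame number `now`?
 */
int may_reclaim(uint16_t now, uint16_t tick)
{
    return !tick_before(now, tick);
}
```

For example, with a stale N = 100 while the controller is really in frame
101, may_reclaim(101, unlink_tick(100, 1)) is true, which is the premature
removal described in the list, while may_reclaim(101, unlink_tick(100, 2))
is false.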
What beats me is why UHCI would have the same issue. I know even less
about UHCI than I do about OHCI but perhaps there is a similar
problem.
--david