Highly critical bug in XHCI Controller

linux-usb.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Highly critical bug in XHCI Controller
@ 2024-11-17  7:33 Markus Rechberger
  2024-11-17 12:44 ` Markus Rechberger
  2024-11-17 13:12 ` Greg KH
  0 siblings, 2 replies; 12+ messages in thread
From: Markus Rechberger @ 2024-11-17  7:33 UTC (permalink / raw)
  To: linux-usb, linux-kernel

Hi,

the issue was first reported at vdr-portal.de
https://www-vdr--portal-de.translate.goog/forum/index.php?thread/136541-empfehlung-dvb-s2-tuner-oder-satip/&postID=1376196&_x_tr_sl=de&_x_tr_tl=en&_x_tr_hl=de&_x_tr_pto=wapp#post1376196

we've got around a highly critical bug in the xhci driver.

https://sundtek.de/support/uxvd32.txt

In xhci.c

The bug is still active in Mainline:
https://github.com/torvalds/linux/blob/master/drivers/usb/host/xhci.c#L2382

static int xhci_check_bw_table(struct xhci_hcd *xhci,
        struct xhci_virt_device *virt_dev,
        int old_active_eps)

bw_table can end up with a NULL pointer.

This problem will lead to a complete kernel crash, rendering the entire
system unusable without any access to the actual linux system.

How to trigger the problem?
Short D+/D- or pull them to ground on a USB device while connecting the
device.

The problem can happen due to following cases:
* a device is getting suddenly disconnected during enumeration
* a faulty cable
* a faulty device 
* a malicious device triggers this issue on purpose
* if there are electrical issues during connecting a device.

A quick hotfix would be to check if bw_table is NULL in
xhci_check_bw_table, however the check should be performed earlier - in
the area where bw_table is supposed to be assigned.

Best Regards,
Markus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Highly critical bug in XHCI Controller
  2024-11-17  7:33 Highly critical bug in XHCI Controller Markus Rechberger
@ 2024-11-17 12:44 ` Markus Rechberger
  2024-11-17 13:12   ` Greg KH
                     ` (2 more replies)
  2024-11-17 13:12 ` Greg KH
  1 sibling, 3 replies; 12+ messages in thread
From: Markus Rechberger @ 2024-11-17 12:44 UTC (permalink / raw)
  To: linux-usb, linux-kernel

Basically the issue comes from hub_port_connect.

drivers/usb/core/hub.c

hub_port_init returns -71 -EPROTO and jumps to loop
https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450

I'd question if usb_ep0_reinit is really required in loop which is
running following functions:
    usb_disable_endpoint(udev, 0 + USB_DIR_IN, true);
    usb_disable_endpoint(udev, 0 + USB_DIR_OUT, true);
    usb_enable_endpoint(udev, &udev->ep0, true);

this is something only experience over the past decades can tell?

usb_enable_endpoint will trigger xhci_endpoint_reset which doesn't do
much, but crashes the entire system with the upstream kernel when it
triggers xhci_check_bw_table).

I removed usb_ep0_reinit here and devices are still workable under
various conditions (again I shorted and pulled D+/D- to ground for
testing).
The NULL PTR check in xhci_check_bw_table would be a second line of
defense but as indicated in the first mail it shouldn't even get there.



As a second issue I found in usb_reset_and_verify device 
https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L6131

        ret = hub_port_init(parent_hub, udev, port1, i, &descriptor);
        if (ret >= 0 || ret == -ENOTCONN || ret == -ENODEV) {
            break;
        }

hub_port_init can also return -71 / -EPROTO, the cases should be very
rare when usb_reset_and_verify_device is triggered and that happens.


I'm just waiting for comments now since this is some critical piece of
infrastructure code before proceeding with a patch.

On Sun, 2024-11-17 at 15:33 +0800, Markus Rechberger wrote:
> Hi,
> 
> 
> the issue was first reported at vdr-portal.de
> https://www-vdr--portal-de.translate.goog/forum/index.php?thread/136541-empfehlung-dvb-s2-tuner-oder-satip/&postID=1376196&_x_tr_sl=de&_x_tr_tl=en&_x_tr_hl=de&_x_tr_pto=wapp#post1376196
> 
> we've got around a highly critical bug in the xhci driver.
> 
> https://sundtek.de/support/uxvd32.txt
> 
> In xhci.c
> 
> The bug is still active in Mainline:
> https://github.com/torvalds/linux/blob/master/drivers/usb/host/xhci.c#L2382
> 
> static int xhci_check_bw_table(struct xhci_hcd *xhci,
>         struct xhci_virt_device *virt_dev,
>         int old_active_eps)
> 
> bw_table can end up with a NULL pointer.
> 
> This problem will lead to a complete kernel crash, rendering the
> entire
> system unusable without any access to the actual linux system.
> 
> How to trigger the problem?
> Short D+/D- or pull them to ground on a USB device while connecting
> the
> device.
> 
> The problem can happen due to following cases:
> * a device is getting suddenly disconnected during enumeration
> * a faulty cable
> * a faulty device 
> * a malicious device triggers this issue on purpose
> * if there are electrical issues during connecting a device.
> 
> A quick hotfix would be to check if bw_table is NULL in
> xhci_check_bw_table, however the check should be performed earlier -
> in
> the area where bw_table is supposed to be assigned.
> 
> Best Regards,
> Markus


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Highly critical bug in XHCI Controller
  2024-11-17  7:33 Highly critical bug in XHCI Controller Markus Rechberger
  2024-11-17 12:44 ` Markus Rechberger
@ 2024-11-17 13:12 ` Greg KH
  1 sibling, 0 replies; 12+ messages in thread
From: Greg KH @ 2024-11-17 13:12 UTC (permalink / raw)
  To: Markus Rechberger; +Cc: linux-usb, linux-kernel

On Sun, Nov 17, 2024 at 03:33:28PM +0800, Markus Rechberger wrote:
> Hi,
> 
> 
> the issue was first reported at vdr-portal.de
> https://www-vdr--portal-de.translate.goog/forum/index.php?thread/136541-empfehlung-dvb-s2-tuner-oder-satip/&postID=1376196&_x_tr_sl=de&_x_tr_tl=en&_x_tr_hl=de&_x_tr_pto=wapp#post1376196
> 
> we've got around a highly critical bug in the xhci driver.
> 
> https://sundtek.de/support/uxvd32.txt
> 
> In xhci.c
> 
> The bug is still active in Mainline:
> https://github.com/torvalds/linux/blob/master/drivers/usb/host/xhci.c#L2382
> 
> static int xhci_check_bw_table(struct xhci_hcd *xhci,
>         struct xhci_virt_device *virt_dev,
>         int old_active_eps)
> 
> bw_table can end up with a NULL pointer.
> 
> This problem will lead to a complete kernel crash, rendering the entire
> system unusable without any access to the actual linux system.
> 
> How to trigger the problem?
> Short D+/D- or pull them to ground on a USB device while connecting the
> device.
> 
> The problem can happen due to following cases:
> * a device is getting suddenly disconnected during enumeration
> * a faulty cable
> * a faulty device 
> * a malicious device triggers this issue on purpose
> * if there are electrical issues during connecting a device.

Bad hardware is always a problem, can you send a patch to resolve this
as you seem to be able to reproduce it?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Highly critical bug in XHCI Controller
  2024-11-17 12:44 ` Markus Rechberger
@ 2024-11-17 13:12   ` Greg KH
  2024-11-17 14:35   ` Michał Pecio
  2024-11-17 15:18   ` Alan Stern
  2 siblings, 0 replies; 12+ messages in thread
From: Greg KH @ 2024-11-17 13:12 UTC (permalink / raw)
  To: Markus Rechberger; +Cc: linux-usb, linux-kernel

On Sun, Nov 17, 2024 at 08:44:16PM +0800, Markus Rechberger wrote:
> Basically the issue comes from hub_port_connect.
> 
> drivers/usb/core/hub.c
> 
> hub_port_init returns -71 -EPROTO and jumps to loop
> https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450
> 
> I'd question if usb_ep0_reinit is really required in loop which is
> running following functions:
>     usb_disable_endpoint(udev, 0 + USB_DIR_IN, true);
>     usb_disable_endpoint(udev, 0 + USB_DIR_OUT, true);
>     usb_enable_endpoint(udev, &udev->ep0, true);
> 
> this is something only experience over the past decades can tell?
> 
> usb_enable_endpoint will trigger xhci_endpoint_reset which doesn't do
> much, but crashes the entire system with the upstream kernel when it
> triggers xhci_check_bw_table).
> 
> I removed usb_ep0_reinit here and devices are still workable under
> various conditions (again I shorted and pulled D+/D- to ground for
> testing).
> The NULL PTR check in xhci_check_bw_table would be a second line of
> defense but as indicated in the first mail it shouldn't even get there.
> 
> 
> 
> As a second issue I found in usb_reset_and_verify device 
> https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L6131
> 
>         ret = hub_port_init(parent_hub, udev, port1, i, &descriptor);
>         if (ret >= 0 || ret == -ENOTCONN || ret == -ENODEV) {
>             break;
>         }
> 
> hub_port_init can also return -71 / -EPROTO, the cases should be very
> rare when usb_reset_and_verify_device is triggered and that happens.
> 
> 
> I'm just waiting for comments now since this is some critical piece of
> infrastructure code before proceeding with a patch.

Send us a patch and we will be glad to review it.

thanks!

greg k-h

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Highly critical bug in XHCI Controller
  2024-11-17 12:44 ` Markus Rechberger
  2024-11-17 13:12   ` Greg KH
@ 2024-11-17 14:35   ` Michał Pecio
  2024-11-17 15:03     ` Markus Rechberger
  2024-11-17 15:18   ` Alan Stern
  2 siblings, 1 reply; 12+ messages in thread
From: Michał Pecio @ 2024-11-17 14:35 UTC (permalink / raw)
  To: linuxusb.ml; +Cc: linux-kernel, linux-usb

Hi,

> Basically the issue comes from hub_port_connect.
> 
> drivers/usb/core/hub.c
> 
> hub_port_init returns -71 -EPROTO and jumps to loop
> https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450
> 
> I'd question if usb_ep0_reinit is really required in loop which is
> running following functions:
>     usb_disable_endpoint(udev, 0 + USB_DIR_IN, true);
>     usb_disable_endpoint(udev, 0 + USB_DIR_OUT, true);
>     usb_enable_endpoint(udev, &udev->ep0, true);
> 
> this is something only experience over the past decades can tell?
> 
> usb_enable_endpoint will trigger xhci_endpoint_reset which doesn't do
> much, but crashes the entire system with the upstream kernel when it
> triggers xhci_check_bw_table).
> 
> I removed usb_ep0_reinit here and devices are still workable under
> various conditions (again I shorted and pulled D+/D- to ground for
> testing).

xHCI isn't the only host controller supported by Linux, and
usb_ep0_reinit() predates it. Maybe it's pointless today, maybe
it isn't, but it's not the root cause of your problem anyway.

> The NULL PTR check in xhci_check_bw_table would be a second line
> of defense but as indicated in the first mail it shouldn't even get
> there.

It's an xHCI bug that BW calculation is attempted on an uninitialized
device and crashes. Looks like a NULL check somewhere is exactly what
is needed, or maybe avoid it completely on EP0 (it's probably no-op).

Other similar bug recently:
https://lore.kernel.org/linux-usb/D3CKQQAETH47.1MUO22RTCH2O3@matfyz.cz/T/#u

Yours too should be unique to those Intel Panther Point chipsets.

> As a second issue I found in usb_reset_and_verify device 
> https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L6131
> 
>         ret = hub_port_init(parent_hub, udev, port1, i, &descriptor);
>         if (ret >= 0 || ret == -ENOTCONN || ret == -ENODEV) {
>             break;
>         }
> 
> hub_port_init can also return -71 / -EPROTO, the cases should be very
> rare when usb_reset_and_verify_device is triggered and that happens.

Right, and the intent seems to be to simply retry in this case.

Regards,
Michal

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Highly critical bug in XHCI Controller
  2024-11-17 14:35   ` Michał Pecio
@ 2024-11-17 15:03     ` Markus Rechberger
  0 siblings, 0 replies; 12+ messages in thread
From: Markus Rechberger @ 2024-11-17 15:03 UTC (permalink / raw)
  To: Michał Pecio; +Cc: linux-kernel, linux-usb

On Sun, 2024-11-17 at 15:35 +0100, Michał Pecio wrote:
> Hi,
> 
> > Basically the issue comes from hub_port_connect.
> > 
> > drivers/usb/core/hub.c
> > 
> > hub_port_init returns -71 -EPROTO and jumps to loop
> > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450
> > 
> > I'd question if usb_ep0_reinit is really required in loop which is
> > running following functions:
> >     usb_disable_endpoint(udev, 0 + USB_DIR_IN, true);
> >     usb_disable_endpoint(udev, 0 + USB_DIR_OUT, true);
> >     usb_enable_endpoint(udev, &udev->ep0, true);
> > 
> > this is something only experience over the past decades can tell?
> > 
> > usb_enable_endpoint will trigger xhci_endpoint_reset which doesn't
> > do
> > much, but crashes the entire system with the upstream kernel when
> > it
> > triggers xhci_check_bw_table).
> > 
> > I removed usb_ep0_reinit here and devices are still workable under
> > various conditions (again I shorted and pulled D+/D- to ground for
> > testing).
> 
> xHCI isn't the only host controller supported by Linux, and
> usb_ep0_reinit() predates it. Maybe it's pointless today, maybe
> it isn't, but it's not the root cause of your problem anyway.
> 

exactly, but it shouldn't go there anyway. This section of the code is
only for 'in case an error already happened'.
That's why I'm asking if anyone knows a practical situation where this
is really needed - and did it ever help?
I'm working with USB across many systems for nearly 2 decades and I
never saw any of those fallbacks succeeding. Usually the way to recover
a device which had connection issues was to reconnect the device
completely.

> > The NULL PTR check in xhci_check_bw_table would be a second line
> > of defense but as indicated in the first mail it shouldn't even get
> > there.
> 
> It's an xHCI bug that BW calculation is attempted on an uninitialized
> device and crashes. Looks like a NULL check somewhere is exactly what
> is needed, or maybe avoid it completely on EP0 (it's probably no-op).
> 

Yes this would be the second line of defense as indicated in my email
before.
I'm preparing 2 small patches with comments in the code.
One commenting out usb_ep0_reinit and the other one a simple NULL PTR
check - but
again the code shouldn't be executed at all when bw_table == NULL
something already
went wrong, the issue should be handled earlier in the code already.

bw_table is not setting itself to NULL or not at all.

Since it's infrastructure code it's a sensitive part.

This issue fully crashed my Asus notebook at least 3 times when working
with a USB ethernet adapter and a USB harddisk. Not only faulty
hardware is causing that problem also simple hotplug connection
problems.

> Other similar bug recently:
> https://lore.kernel.org/linux-usb/D3CKQQAETH47.1MUO22RTCH2O3@matfyz.cz/T/#u
> 
> Yours too should be unique to those Intel Panther Point chipsets.

This is exactly the same problem yes.
The problem will happen with all systems which use the xhci driver.

Best Regards,
Markus

> 
> > As a second issue I found in usb_reset_and_verify device 
> > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L6131
> > 
> >         ret = hub_port_init(parent_hub, udev, port1, i,
> > &descriptor);
> >         if (ret >= 0 || ret == -ENOTCONN || ret == -ENODEV) {
> >             break;
> >         }
> > 
> > hub_port_init can also return -71 / -EPROTO, the cases should be
> > very
> > rare when usb_reset_and_verify_device is triggered and that
> > happens.
> 
> Right, and the intent seems to be to simply retry in this case.
> 
> Regards,
> Michal

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Highly critical bug in XHCI Controller
  2024-11-17 12:44 ` Markus Rechberger
  2024-11-17 13:12   ` Greg KH
  2024-11-17 14:35   ` Michał Pecio
@ 2024-11-17 15:18   ` Alan Stern
  2024-11-17 15:47     ` Markus Rechberger
  2 siblings, 1 reply; 12+ messages in thread
From: Alan Stern @ 2024-11-17 15:18 UTC (permalink / raw)
  To: Markus Rechberger; +Cc: linux-usb, linux-kernel

On Sun, Nov 17, 2024 at 08:44:16PM +0800, Markus Rechberger wrote:
> Basically the issue comes from hub_port_connect.
> 
> drivers/usb/core/hub.c
> 
> hub_port_init returns -71 -EPROTO and jumps to loop
> https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450
> 
> I'd question if usb_ep0_reinit is really required in loop which is
> running following functions:

You mean that usb_ep0_reinit() runs the following, not that the loop 
does.

>     usb_disable_endpoint(udev, 0 + USB_DIR_IN, true);
>     usb_disable_endpoint(udev, 0 + USB_DIR_OUT, true);
>     usb_enable_endpoint(udev, &udev->ep0, true);
> 
> this is something only experience over the past decades can tell?

It _is_ necessary, because the maxpacket size of ep0 may change from
one loop iteration to the next.  Therefore the endpoint must be disabled 
and re-enabled each time the loop repeats.

[Now that I go back through the git log, it appears the only reason for 
exporting usb_ep0_reinit was so that the WUSB driver could call it -- 
see commit fc721f5194dc ("wusb: make ep0_reinit available for modules").  
Since the kernel doesn't support WUSB any more, we should be able to 
stop exporting that function.]

> usb_enable_endpoint will trigger xhci_endpoint_reset which doesn't do
> much, but crashes the entire system with the upstream kernel when it
> triggers xhci_check_bw_table).
> 
> I removed usb_ep0_reinit here and devices are still workable under
> various conditions (again I shorted and pulled D+/D- to ground for
> testing).
> The NULL PTR check in xhci_check_bw_table would be a second line of
> defense but as indicated in the first mail it shouldn't even get there.
> 
> 
> 
> As a second issue I found in usb_reset_and_verify device 
> https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L6131
> 
>         ret = hub_port_init(parent_hub, udev, port1, i, &descriptor);
>         if (ret >= 0 || ret == -ENOTCONN || ret == -ENODEV) {
>             break;
>         }
> 
> hub_port_init can also return -71 / -EPROTO, the cases should be very
> rare when usb_reset_and_verify_device is triggered and that happens.

If that happens, the loop which this code sits inside will simply 
perform another iteration.  That's what  it's supposed to do, not an 
issue at all.

Alan Stern

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Highly critical bug in XHCI Controller
  2024-11-17 15:18   ` Alan Stern
@ 2024-11-17 15:47     ` Markus Rechberger
  2024-11-17 21:02       ` Alan Stern
  0 siblings, 1 reply; 12+ messages in thread
From: Markus Rechberger @ 2024-11-17 15:47 UTC (permalink / raw)
  To: Alan Stern; +Cc: linux-usb, linux-kernel

On Sun, 2024-11-17 at 10:18 -0500, Alan Stern wrote:
> On Sun, Nov 17, 2024 at 08:44:16PM +0800, Markus Rechberger wrote:
> > Basically the issue comes from hub_port_connect.
> > 
> > drivers/usb/core/hub.c
> > 
> > hub_port_init returns -71 -EPROTO and jumps to loop
> > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450
> > 
> > I'd question if usb_ep0_reinit is really required in loop which is
> > running following functions:
> 
> You mean that usb_ep0_reinit() runs the following, not that the loop 
> does.
> 
> >     usb_disable_endpoint(udev, 0 + USB_DIR_IN, true);
> >     usb_disable_endpoint(udev, 0 + USB_DIR_OUT, true);
> >     usb_enable_endpoint(udev, &udev->ep0, true);
> > 
> > this is something only experience over the past decades can tell?
> 
> It _is_ necessary, because the maxpacket size of ep0 may change from
> one loop iteration to the next.  Therefore the endpoint must be
> disabled 
> and re-enabled each time the loop repeats.
> 
> [Now that I go back through the git log, it appears the only reason
> for 
> exporting usb_ep0_reinit was so that the WUSB driver could call it --
> see commit fc721f5194dc ("wusb: make ep0_reinit available for
> modules").  
> Since the kernel doesn't support WUSB any more, we should be able to 
> stop exporting that function.]

This should only go down there in case an error happened earlier no?
In case the hardware signed up correctly it should not even enter that
code.

My experience is just - reconnect the device in case an error happened
those
workarounds did not work properly for the device I deal with (but yes
that's
why I'm asking - maybe someone else has different hardware with
different
experience).

> 
> > usb_enable_endpoint will trigger xhci_endpoint_reset which doesn't
> > do
> > much, but crashes the entire system with the upstream kernel when
> > it
> > triggers xhci_check_bw_table).
> > 
> > I removed usb_ep0_reinit here and devices are still workable under
> > various conditions (again I shorted and pulled D+/D- to ground for
> > testing).
> > The NULL PTR check in xhci_check_bw_table would be a second line of
> > defense but as indicated in the first mail it shouldn't even get
> > there.
> > 
> > 
> > 
> > As a second issue I found in usb_reset_and_verify device 
> > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L6131
> > 
> >         ret = hub_port_init(parent_hub, udev, port1, i,
> > &descriptor);
> >         if (ret >= 0 || ret == -ENOTCONN || ret == -ENODEV) {
> >             break;
> >         }
> > 
> > hub_port_init can also return -71 / -EPROTO, the cases should be
> > very
> > rare when usb_reset_and_verify_device is triggered and that
> > happens.
> 
> If that happens, the loop which this code sits inside will simply 
> perform another iteration.  That's what  it's supposed to do, not an 
> issue at all.
> 

It doesn't cause any issue yes but it's not correct either.
-EPROTO will be returned if I disconnect or short the device here. So
initially someone
thought he should check for -ENODEV/-ENOTCONN (which should also
indicate that the device is gone), so -EPROTO should also be checked in
that case.
Otherwise just remove all those error checks.

> Alan Stern
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Highly critical bug in XHCI Controller
  2024-11-17 15:47     ` Markus Rechberger
@ 2024-11-17 21:02       ` Alan Stern
  2024-11-18  5:14         ` Markus Rechberger
  0 siblings, 1 reply; 12+ messages in thread
From: Alan Stern @ 2024-11-17 21:02 UTC (permalink / raw)
  To: Markus Rechberger; +Cc: linux-usb, linux-kernel

On Sun, Nov 17, 2024 at 11:47:52PM +0800, Markus Rechberger wrote:
> On Sun, 2024-11-17 at 10:18 -0500, Alan Stern wrote:
> > On Sun, Nov 17, 2024 at 08:44:16PM +0800, Markus Rechberger wrote:
> > > Basically the issue comes from hub_port_connect.
> > > 
> > > drivers/usb/core/hub.c
> > > 
> > > hub_port_init returns -71 -EPROTO and jumps to loop
> > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450
> > > 
> > > I'd question if usb_ep0_reinit is really required in loop which is
> > > running following functions:

> This should only go down there in case an error happened earlier no?

Of course, since -EPROTO indicates there was an error.

> In case the hardware signed up correctly it should not even enter that
> code.
> 
> My experience is just - reconnect the device in case an error happened

You can't reconnect some devices; they are permanently attached.  There 
are other reasons why reconnecting might not be practical.

> those
> workarounds did not work properly for the device I deal with (but yes
> that's
> why I'm asking - maybe someone else has different hardware with
> different
> experience).

I doubt we will hear from these people, because they will not realize 
that anything was wrong.

In any case, it is generally recognized that the type of errors leading 
to -EPROTO (bad CRC or no response, for instance) can be temporary or 
intermittent.  Retrying is an appropriate strategy.

> > > As a second issue I found in usb_reset_and_verify device 
> > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L6131
> > > 
> > >         ret = hub_port_init(parent_hub, udev, port1, i,
> > > &descriptor);
> > >         if (ret >= 0 || ret == -ENOTCONN || ret == -ENODEV) {
> > >             break;
> > >         }
> > > 
> > > hub_port_init can also return -71 / -EPROTO, the cases should be
> > > very
> > > rare when usb_reset_and_verify_device is triggered and that
> > > happens.
> > 
> > If that happens, the loop which this code sits inside will simply 
> > perform another iteration.  That's what  it's supposed to do, not an 
> > issue at all.
> > 
> 
> It doesn't cause any issue yes but it's not correct either.

Why not?  What's wrong with it?

> -EPROTO will be returned if I disconnect or short the device here. So
> initially someone
> thought he should check for -ENODEV/-ENOTCONN (which should also
> indicate that the device is gone), so -EPROTO should also be checked in
> that case.
> Otherwise just remove all those error checks.

-EPROTO does not necessarily indicate the device is gone; it indicates 
there was a problem communicating with the device.  The problem might be 
caused by a disconnection, or it might be caused by temporary 
electromagnetic interference (for example), or by something else.

Alan Stern

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Highly critical bug in XHCI Controller
  2024-11-17 21:02       ` Alan Stern
@ 2024-11-18  5:14         ` Markus Rechberger
  2024-11-18 16:03           ` Alan Stern
  2024-11-18 21:23           ` Michał Pecio
  0 siblings, 2 replies; 12+ messages in thread
From: Markus Rechberger @ 2024-11-18  5:14 UTC (permalink / raw)
  To: Alan Stern; +Cc: linux-usb, linux-kernel

On Sun, 2024-11-17 at 16:02 -0500, Alan Stern wrote:
> On Sun, Nov 17, 2024 at 11:47:52PM +0800, Markus Rechberger wrote:
> > On Sun, 2024-11-17 at 10:18 -0500, Alan Stern wrote:
> > > On Sun, Nov 17, 2024 at 08:44:16PM +0800, Markus Rechberger
> > > wrote:
> > > > Basically the issue comes from hub_port_connect.
> > > > 
> > > > drivers/usb/core/hub.c
> > > > 
> > > > hub_port_init returns -71 -EPROTO and jumps to loop
> > > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450
> > > > 
> > > > I'd question if usb_ep0_reinit is really required in loop which
> > > > is
> > > > running following functions:
> 
> > This should only go down there in case an error happened earlier
> > no?
> 
> Of course, since -EPROTO indicates there was an error.

Alan, first of all thank you for your feedback.

Do you know any practical way how to test this?

> 
> > In case the hardware signed up correctly it should not even enter
> > that
> > code.
> > 
> > My experience is just - reconnect the device in case an error
> > happened
> 
> You can't reconnect some devices; they are permanently attached. 
> There 
> are other reasons why reconnecting might not be practical.

I understand what you mean here but isn't it primarily about downsizing
USB 2.0 -> USB 1.1?
The next point would be if eg. an endpoint of a webcam downsizes to 1.1
you don't have to expect a workable device because video would exceed
the bandwidth easily.
Do you know any practical example/hardware where this is really
relevant?

The failure handling will call the xhci reset which is not meant to be
run without fully initialized data structures. In my case the driver
returns -EPROTO (while the device is already disconnected), then first
of all it tries to reset the endpoint which in the upstream kernel
causes the NULL PTR exeption. For XHCI this code stream logic is simply
not meant to be applied in such an error.
My first submitted patch will solve the issue but just papers a higher
logic issue. 

> 
> > those
> > workarounds did not work properly for the device I deal with (but
> > yes
> > that's
> > why I'm asking - maybe someone else has different hardware with
> > different
> > experience).
> 
> I doubt we will hear from these people, because they will not realize
> that anything was wrong.
> 
> In any case, it is generally recognized that the type of errors
> leading 
> to -EPROTO (bad CRC or no response, for instance) can be temporary or
> intermittent.  Retrying is an appropriate strategy.
> 
> > > > As a second issue I found in usb_reset_and_verify device 
> > > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L6131
> > > > 
> > > >         ret = hub_port_init(parent_hub, udev, port1, i,
> > > > &descriptor);
> > > >         if (ret >= 0 || ret == -ENOTCONN || ret == -ENODEV) {
> > > >             break;
> > > >         }
> > > > 
> > > > hub_port_init can also return -71 / -EPROTO, the cases should
> > > > be
> > > > very
> > > > rare when usb_reset_and_verify_device is triggered and that
> > > > happens.
> > > 
> > > If that happens, the loop which this code sits inside will simply
> > > perform another iteration.  That's what  it's supposed to do, not
> > > an 
> > > issue at all.
> > > 
> > 
> > It doesn't cause any issue yes but it's not correct either.
> 
> Why not?  What's wrong with it?
> 
> > -EPROTO will be returned if I disconnect or short the device here.
> > So
> > initially someone
> > thought he should check for -ENODEV/-ENOTCONN (which should also
> > indicate that the device is gone), so -EPROTO should also be
> > checked in
> > that case.
> > Otherwise just remove all those error checks.
> 
> -EPROTO does not necessarily indicate the device is gone; it
> indicates 
> there was a problem communicating with the device.  The problem might
> be 
> caused by a disconnection, or it might be caused by temporary 
> electromagnetic interference (for example), or by something else.

a bad CRC response on a short cable -  something has gone very wrong
then.
If there are temporary EMI issues ... I'd consider that an as an a
absolute failure as well, they
need to be handled in HW rather than a software workaround.

In my experience with USB anything that is a 'temporary' failure can be
considered as 'permanent' failure and I've really seen a lot over the
last 1 1/2 decades.
However issues are mostly related to immature controllers / missing
quirks for some controllers.
Our devices in the field since 2008 usually pump around 100-300mbit
through the USB 2 link,
streaming  devices which usually run for a long period of time (up to
months / years).
'retrying' something on a link where something has gone wrong for sure
never worked properly for me, it would have continued with another
followup issue at some point.

I'm definitely in favor of telling the user if something went wrong
either reconnect or submit a bug report for that particular device to
add some quirk for special handling.

Anyway can you give a particular example where this 'retrying'
mechanism and reloading the endpoint size solves or solved a problem?
Over the years people will certainly forget why this was implemented
and take the code 'as is'
and the next ones will just say it was always like that don't change.
I don't mind keeping it as it is but it should be more clear / obvious
why it's really done that
way.

Markus
> 
> Alan Stern
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Highly critical bug in XHCI Controller
  2024-11-18  5:14         ` Markus Rechberger
@ 2024-11-18 16:03           ` Alan Stern
  2024-11-18 21:23           ` Michał Pecio
  1 sibling, 0 replies; 12+ messages in thread
From: Alan Stern @ 2024-11-18 16:03 UTC (permalink / raw)
  To: Markus Rechberger; +Cc: linux-usb, linux-kernel

On Mon, Nov 18, 2024 at 01:14:09PM +0800, Markus Rechberger wrote:
> 
> 
> On Sun, 2024-11-17 at 16:02 -0500, Alan Stern wrote:
> > On Sun, Nov 17, 2024 at 11:47:52PM +0800, Markus Rechberger wrote:
> > > On Sun, 2024-11-17 at 10:18 -0500, Alan Stern wrote:
> > > > On Sun, Nov 17, 2024 at 08:44:16PM +0800, Markus Rechberger
> > > > wrote:
> > > > > Basically the issue comes from hub_port_connect.
> > > > > 
> > > > > drivers/usb/core/hub.c
> > > > > 
> > > > > hub_port_init returns -71 -EPROTO and jumps to loop
> > > > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450
> > > > > 
> > > > > I'd question if usb_ep0_reinit is really required in loop which
> > > > > is
> > > > > running following functions:
> > 
> > > This should only go down there in case an error happened earlier
> > > no?
> > 
> > Of course, since -EPROTO indicates there was an error.
> 
> Alan, first of all thank you for your feedback.
> 
> Do you know any practical way how to test this?

Not any easy way, no.  Shorting out the signals like you do, but for 
only a few milliseconds, might work.

> > > My experience is just - reconnect the device in case an error
> > > happened
> > 
> > You can't reconnect some devices; they are permanently attached. 
> > There 
> > are other reasons why reconnecting might not be practical.
> 
> 
> I understand what you mean here but isn't it primarily about downsizing
> USB 2.0 -> USB 1.1?

No.  This has nothing to do with changing the speed of the connection.

> The next point would be if eg. an endpoint of a webcam downsizes to 1.1
> you don't have to expect a workable device because video would exceed
> the bandwidth easily.
> Do you know any practical example/hardware where this is really
> relevant?
> 
> The failure handling will call the xhci reset which is not meant to be
> run without fully initialized data structures. In my case the driver
> returns -EPROTO (while the device is already disconnected), then first
> of all it tries to reset the endpoint which in the upstream kernel
> causes the NULL PTR exeption. For XHCI this code stream logic is simply
> not meant to be applied in such an error.
> My first submitted patch will solve the issue but just papers a higher
> logic issue. 

I think it's fair to say that the logic error lies in the xhci-hcd 
driver, not in the hub driver.  xhci-hcd should be prepared for attempts 
to reset endpoint 0 before the device is fully initialized.  After all, 
initializing the device requires us to be able to communicate with it 
(to read its descriptors), which requires ep0 to be working properly.

> > In any case, it is generally recognized that the type of errors
> > leading 
> > to -EPROTO (bad CRC or no response, for instance) can be temporary or
> > intermittent.  Retrying is an appropriate strategy.

> > -EPROTO does not necessarily indicate the device is gone; it
> > indicates 
> > there was a problem communicating with the device.  The problem might
> > be 
> > caused by a disconnection, or it might be caused by temporary 
> > electromagnetic interference (for example), or by something else.
> 
> 
> a bad CRC response on a short cable -  something has gone very wrong
> then.

Just a bad CRC response; the cable length is irrelevant.  Another 
possiblity is a bit-stuffing error.  Yes, something has gone very badly 
wrong, but it may go right again if we wait a few milliseconds and 
retry.

I have seen reports from people who experienced problems of this sort 
caused by interference from a bad fluorescent light fixture or by 
electrical noise from a nearby fan.

> If there are temporary EMI issues ... I'd consider that an as an a
> absolute failure as well, they
> need to be handled in HW rather than a software workaround.
> 
> In my experience with USB anything that is a 'temporary' failure can be
> considered as 'permanent' failure and I've really seen a lot over the
> last 1 1/2 decades.
> However issues are mostly related to immature controllers / missing
> quirks for some controllers.

It's true that errors of this type are quite rare.  Nevertheless, 
retrying is a valid strategy for dealing with them and it doesn't hurt 
anything.

> Our devices in the field since 2008 usually pump around 100-300mbit
> through the USB 2 link,
> streaming  devices which usually run for a long period of time (up to
> months / years).
> 'retrying' something on a link where something has gone wrong for sure
> never worked properly for me, it would have continued with another
> followup issue at some point.

I also changed the ehci-hcd driver to retry multiple times after one of 
these errors, rather than using the strict "3 strikes and you're out" 
approach.  There were cases where that helped.

> I'm definitely in favor of telling the user if something went wrong
> either reconnect or submit a bug report for that particular device to
> add some quirk for special handling.
> 
> Anyway can you give a particular example where this 'retrying'
> mechanism and reloading the endpoint size solves or solved a problem?

I cannot.  The events referred to above occurred many years ago.  They 
were rare then, and they probably are even more rare now.

> Over the years people will certainly forget why this was implemented
> and take the code 'as is'
> and the next ones will just say it was always like that don't change.
> I don't mind keeping it as it is but it should be more clear / obvious
> why it's really done that
> way.

It seems obvious that what the code does is to retry when -EPROTO errors 
occur.  (Also -EILSEQ and -ETIME.)  And the reason for retrying should 
also be pretty obvious: There's a chance that the error will go away 
when the transfer is retried -- why else would you retry something?

This hardly seems like something we need to explain in a comment.  But 
if you want to submit a patch adding such a comment, please feel free to 
do so.

Alan Stern

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Highly critical bug in XHCI Controller
  2024-11-18  5:14         ` Markus Rechberger
  2024-11-18 16:03           ` Alan Stern
@ 2024-11-18 21:23           ` Michał Pecio
  1 sibling, 0 replies; 12+ messages in thread
From: Michał Pecio @ 2024-11-18 21:23 UTC (permalink / raw)
  To: linuxusb.ml; +Cc: linux-kernel, linux-usb, stern

> In my experience with USB anything that is a 'temporary' failure can
> be considered as 'permanent' failure and I've really seen a lot over
> the last 1 1/2 decades.
> However issues are mostly related to immature controllers / missing
> quirks for some controllers.
> Our devices in the field since 2008 usually pump around 100-300mbit
> through the USB 2 link,
> streaming  devices which usually run for a long period of time (up to
> months / years).
> 'retrying' something on a link where something has gone wrong for sure
> never worked properly for me, it would have continued with another
> followup issue at some point.

You may have simply seen hardware going dead or buggy drivers failing
to recover from recoverable errors.

Random bit errors really happen and (excepting isochronous endpoints)
can be recovered from. But if you get -EPROTO on a bulk endpoint, for
example, it means the endpoint halted and should be reset. Few Linux
drivers seem to bother with such things.

I even think xHCI's handling of halted endpoints and usb_clear_halt()
is broken, but it looks like fixing it would break all the buggy class
drivers on the other hand, which are currently "sort of functional".

> Anyway can you give a particular example where this 'retrying'
> mechanism and reloading the endpoint size solves or solved a problem?

It seems to happen when you insert the plug slowly or at an angle, and
contact is briefly lost while the device is being initialized.

The first three lines below come from hub_port_init(), which looks like
it is being called by hub_port_connect() in a loop.

[81169.840924] usb 5-1: new full-speed USB device number 61 using ohci-pci
[81170.387927] usb 5-1: device not accepting address 61, error -62
[81170.742931] usb 5-1: new full-speed USB device number 62 using ohci-pci
[81170.901914] usb 5-1: New USB device found, idVendor=067b, idProduct=2303, bcdDevice= 3.00
[81170.901919] usb 5-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[81170.901921] usb 5-1: Product: USB-Serial Controller
[81170.901922] usb 5-1: Manufacturer: Prolific Technology Inc.

Another example which could trigger retries is a device which includes
a permanent "presence detect" resistor (such as PL2303, coincidentally)
but takes a long time to initialize itself and start responding.

Regards,
Michal

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-11-18 21:24 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-17  7:33 Highly critical bug in XHCI Controller Markus Rechberger
2024-11-17 12:44 ` Markus Rechberger
2024-11-17 13:12   ` Greg KH
2024-11-17 14:35   ` Michał Pecio
2024-11-17 15:03     ` Markus Rechberger
2024-11-17 15:18   ` Alan Stern
2024-11-17 15:47     ` Markus Rechberger
2024-11-17 21:02       ` Alan Stern
2024-11-18  5:14         ` Markus Rechberger
2024-11-18 16:03           ` Alan Stern
2024-11-18 21:23           ` Michał Pecio
2024-11-17 13:12 ` Greg KH

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).