* Highly critical bug in XHCI Controller
@ 2024-11-17 7:33 Markus Rechberger
2024-11-17 12:44 ` Markus Rechberger
2024-11-17 13:12 ` Greg KH
0 siblings, 2 replies; 12+ messages in thread
From: Markus Rechberger @ 2024-11-17 7:33 UTC (permalink / raw)
To: linux-usb, linux-kernel
Hi,
the issue was first reported at vdr-portal.de
https://www-vdr--portal-de.translate.goog/forum/index.php?thread/136541-empfehlung-dvb-s2-tuner-oder-satip/&postID=1376196&_x_tr_sl=de&_x_tr_tl=en&_x_tr_hl=de&_x_tr_pto=wapp#post1376196
we've got around a highly critical bug in the xhci driver.
https://sundtek.de/support/uxvd32.txt
In xhci.c
The bug is still active in Mainline:
https://github.com/torvalds/linux/blob/master/drivers/usb/host/xhci.c#L2382
static int xhci_check_bw_table(struct xhci_hcd *xhci,
struct xhci_virt_device *virt_dev,
int old_active_eps)
bw_table can end up with a NULL pointer.
This problem will lead to a complete kernel crash, rendering the entire
system unusable without any access to the actual linux system.
How to trigger the problem?
Short D+/D- or pull them to ground on a USB device while connecting the
device.
The problem can happen due to following cases:
* a device is getting suddenly disconnected during enumeration
* a faulty cable
* a faulty device
* a malicious device triggers this issue on purpose
* if there are electrical issues during connecting a device.
A quick hotfix would be to check if bw_table is NULL in
xhci_check_bw_table, however the check should be performed earlier - in
the area where bw_table is supposed to be assigned.
Best Regards,
Markus
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: Highly critical bug in XHCI Controller 2024-11-17 7:33 Highly critical bug in XHCI Controller Markus Rechberger @ 2024-11-17 12:44 ` Markus Rechberger 2024-11-17 13:12 ` Greg KH ` (2 more replies) 2024-11-17 13:12 ` Greg KH 1 sibling, 3 replies; 12+ messages in thread From: Markus Rechberger @ 2024-11-17 12:44 UTC (permalink / raw) To: linux-usb, linux-kernel Basically the issue comes from hub_port_connect. drivers/usb/core/hub.c hub_port_init returns -71 -EPROTO and jumps to loop https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450 I'd question if usb_ep0_reinit is really required in loop which is running following functions: usb_disable_endpoint(udev, 0 + USB_DIR_IN, true); usb_disable_endpoint(udev, 0 + USB_DIR_OUT, true); usb_enable_endpoint(udev, &udev->ep0, true); this is something only experience over the past decades can tell? usb_enable_endpoint will trigger xhci_endpoint_reset which doesn't do much, but crashes the entire system with the upstream kernel when it triggers xhci_check_bw_table). I removed usb_ep0_reinit here and devices are still workable under various conditions (again I shorted and pulled D+/D- to ground for testing). The NULL PTR check in xhci_check_bw_table would be a second line of defense but as indicated in the first mail it shouldn't even get there. As a second issue I found in usb_reset_and_verify device https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L6131 ret = hub_port_init(parent_hub, udev, port1, i, &descriptor); if (ret >= 0 || ret == -ENOTCONN || ret == -ENODEV) { break; } hub_port_init can also return -71 / -EPROTO, the cases should be very rare when usb_reset_and_verify_device is triggered and that happens. I'm just waiting for comments now since this is some critical piece of infrastructure code before proceeding with a patch. On Sun, 2024-11-17 at 15:33 +0800, Markus Rechberger wrote: > Hi, > > > the issue was first reported at vdr-portal.de > https://www-vdr--portal-de.translate.goog/forum/index.php?thread/136541-empfehlung-dvb-s2-tuner-oder-satip/&postID=1376196&_x_tr_sl=de&_x_tr_tl=en&_x_tr_hl=de&_x_tr_pto=wapp#post1376196 > > we've got around a highly critical bug in the xhci driver. > > https://sundtek.de/support/uxvd32.txt > > In xhci.c > > The bug is still active in Mainline: > https://github.com/torvalds/linux/blob/master/drivers/usb/host/xhci.c#L2382 > > static int xhci_check_bw_table(struct xhci_hcd *xhci, > struct xhci_virt_device *virt_dev, > int old_active_eps) > > bw_table can end up with a NULL pointer. > > This problem will lead to a complete kernel crash, rendering the > entire > system unusable without any access to the actual linux system. > > How to trigger the problem? > Short D+/D- or pull them to ground on a USB device while connecting > the > device. > > The problem can happen due to following cases: > * a device is getting suddenly disconnected during enumeration > * a faulty cable > * a faulty device > * a malicious device triggers this issue on purpose > * if there are electrical issues during connecting a device. > > A quick hotfix would be to check if bw_table is NULL in > xhci_check_bw_table, however the check should be performed earlier - > in > the area where bw_table is supposed to be assigned. > > Best Regards, > Markus ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Highly critical bug in XHCI Controller 2024-11-17 12:44 ` Markus Rechberger @ 2024-11-17 13:12 ` Greg KH 2024-11-17 14:35 ` Michał Pecio 2024-11-17 15:18 ` Alan Stern 2 siblings, 0 replies; 12+ messages in thread From: Greg KH @ 2024-11-17 13:12 UTC (permalink / raw) To: Markus Rechberger; +Cc: linux-usb, linux-kernel On Sun, Nov 17, 2024 at 08:44:16PM +0800, Markus Rechberger wrote: > Basically the issue comes from hub_port_connect. > > drivers/usb/core/hub.c > > hub_port_init returns -71 -EPROTO and jumps to loop > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450 > > I'd question if usb_ep0_reinit is really required in loop which is > running following functions: > usb_disable_endpoint(udev, 0 + USB_DIR_IN, true); > usb_disable_endpoint(udev, 0 + USB_DIR_OUT, true); > usb_enable_endpoint(udev, &udev->ep0, true); > > this is something only experience over the past decades can tell? > > usb_enable_endpoint will trigger xhci_endpoint_reset which doesn't do > much, but crashes the entire system with the upstream kernel when it > triggers xhci_check_bw_table). > > I removed usb_ep0_reinit here and devices are still workable under > various conditions (again I shorted and pulled D+/D- to ground for > testing). > The NULL PTR check in xhci_check_bw_table would be a second line of > defense but as indicated in the first mail it shouldn't even get there. > > > > As a second issue I found in usb_reset_and_verify device > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L6131 > > ret = hub_port_init(parent_hub, udev, port1, i, &descriptor); > if (ret >= 0 || ret == -ENOTCONN || ret == -ENODEV) { > break; > } > > hub_port_init can also return -71 / -EPROTO, the cases should be very > rare when usb_reset_and_verify_device is triggered and that happens. > > > I'm just waiting for comments now since this is some critical piece of > infrastructure code before proceeding with a patch. Send us a patch and we will be glad to review it. thanks! greg k-h ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Highly critical bug in XHCI Controller 2024-11-17 12:44 ` Markus Rechberger 2024-11-17 13:12 ` Greg KH @ 2024-11-17 14:35 ` Michał Pecio 2024-11-17 15:03 ` Markus Rechberger 2024-11-17 15:18 ` Alan Stern 2 siblings, 1 reply; 12+ messages in thread From: Michał Pecio @ 2024-11-17 14:35 UTC (permalink / raw) To: linuxusb.ml; +Cc: linux-kernel, linux-usb Hi, > Basically the issue comes from hub_port_connect. > > drivers/usb/core/hub.c > > hub_port_init returns -71 -EPROTO and jumps to loop > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450 > > I'd question if usb_ep0_reinit is really required in loop which is > running following functions: > usb_disable_endpoint(udev, 0 + USB_DIR_IN, true); > usb_disable_endpoint(udev, 0 + USB_DIR_OUT, true); > usb_enable_endpoint(udev, &udev->ep0, true); > > this is something only experience over the past decades can tell? > > usb_enable_endpoint will trigger xhci_endpoint_reset which doesn't do > much, but crashes the entire system with the upstream kernel when it > triggers xhci_check_bw_table). > > I removed usb_ep0_reinit here and devices are still workable under > various conditions (again I shorted and pulled D+/D- to ground for > testing). xHCI isn't the only host controller supported by Linux, and usb_ep0_reinit() predates it. Maybe it's pointless today, maybe it isn't, but it's not the root cause of your problem anyway. > The NULL PTR check in xhci_check_bw_table would be a second line > of defense but as indicated in the first mail it shouldn't even get > there. It's an xHCI bug that BW calculation is attempted on an uninitialized device and crashes. Looks like a NULL check somewhere is exactly what is needed, or maybe avoid it completely on EP0 (it's probably no-op). Other similar bug recently: https://lore.kernel.org/linux-usb/D3CKQQAETH47.1MUO22RTCH2O3@matfyz.cz/T/#u Yours too should be unique to those Intel Panther Point chipsets. > As a second issue I found in usb_reset_and_verify device > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L6131 > > ret = hub_port_init(parent_hub, udev, port1, i, &descriptor); > if (ret >= 0 || ret == -ENOTCONN || ret == -ENODEV) { > break; > } > > hub_port_init can also return -71 / -EPROTO, the cases should be very > rare when usb_reset_and_verify_device is triggered and that happens. Right, and the intent seems to be to simply retry in this case. Regards, Michal ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Highly critical bug in XHCI Controller 2024-11-17 14:35 ` Michał Pecio @ 2024-11-17 15:03 ` Markus Rechberger 0 siblings, 0 replies; 12+ messages in thread From: Markus Rechberger @ 2024-11-17 15:03 UTC (permalink / raw) To: Michał Pecio; +Cc: linux-kernel, linux-usb On Sun, 2024-11-17 at 15:35 +0100, Michał Pecio wrote: > Hi, > > > Basically the issue comes from hub_port_connect. > > > > drivers/usb/core/hub.c > > > > hub_port_init returns -71 -EPROTO and jumps to loop > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450 > > > > I'd question if usb_ep0_reinit is really required in loop which is > > running following functions: > > usb_disable_endpoint(udev, 0 + USB_DIR_IN, true); > > usb_disable_endpoint(udev, 0 + USB_DIR_OUT, true); > > usb_enable_endpoint(udev, &udev->ep0, true); > > > > this is something only experience over the past decades can tell? > > > > usb_enable_endpoint will trigger xhci_endpoint_reset which doesn't > > do > > much, but crashes the entire system with the upstream kernel when > > it > > triggers xhci_check_bw_table). > > > > I removed usb_ep0_reinit here and devices are still workable under > > various conditions (again I shorted and pulled D+/D- to ground for > > testing). > > xHCI isn't the only host controller supported by Linux, and > usb_ep0_reinit() predates it. Maybe it's pointless today, maybe > it isn't, but it's not the root cause of your problem anyway. > exactly, but it shouldn't go there anyway. This section of the code is only for 'in case an error already happened'. That's why I'm asking if anyone knows a practical situation where this is really needed - and did it ever help? I'm working with USB across many systems for nearly 2 decades and I never saw any of those fallbacks succeeding. Usually the way to recover a device which had connection issues was to reconnect the device completely. > > The NULL PTR check in xhci_check_bw_table would be a second line > > of defense but as indicated in the first mail it shouldn't even get > > there. > > It's an xHCI bug that BW calculation is attempted on an uninitialized > device and crashes. Looks like a NULL check somewhere is exactly what > is needed, or maybe avoid it completely on EP0 (it's probably no-op). > Yes this would be the second line of defense as indicated in my email before. I'm preparing 2 small patches with comments in the code. One commenting out usb_ep0_reinit and the other one a simple NULL PTR check - but again the code shouldn't be executed at all when bw_table == NULL something already went wrong, the issue should be handled earlier in the code already. bw_table is not setting itself to NULL or not at all. Since it's infrastructure code it's a sensitive part. This issue fully crashed my Asus notebook at least 3 times when working with a USB ethernet adapter and a USB harddisk. Not only faulty hardware is causing that problem also simple hotplug connection problems. > Other similar bug recently: > https://lore.kernel.org/linux-usb/D3CKQQAETH47.1MUO22RTCH2O3@matfyz.cz/T/#u > > Yours too should be unique to those Intel Panther Point chipsets. This is exactly the same problem yes. The problem will happen with all systems which use the xhci driver. Best Regards, Markus > > > As a second issue I found in usb_reset_and_verify device > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L6131 > > > > ret = hub_port_init(parent_hub, udev, port1, i, > > &descriptor); > > if (ret >= 0 || ret == -ENOTCONN || ret == -ENODEV) { > > break; > > } > > > > hub_port_init can also return -71 / -EPROTO, the cases should be > > very > > rare when usb_reset_and_verify_device is triggered and that > > happens. > > Right, and the intent seems to be to simply retry in this case. > > Regards, > Michal ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Highly critical bug in XHCI Controller 2024-11-17 12:44 ` Markus Rechberger 2024-11-17 13:12 ` Greg KH 2024-11-17 14:35 ` Michał Pecio @ 2024-11-17 15:18 ` Alan Stern 2024-11-17 15:47 ` Markus Rechberger 2 siblings, 1 reply; 12+ messages in thread From: Alan Stern @ 2024-11-17 15:18 UTC (permalink / raw) To: Markus Rechberger; +Cc: linux-usb, linux-kernel On Sun, Nov 17, 2024 at 08:44:16PM +0800, Markus Rechberger wrote: > Basically the issue comes from hub_port_connect. > > drivers/usb/core/hub.c > > hub_port_init returns -71 -EPROTO and jumps to loop > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450 > > I'd question if usb_ep0_reinit is really required in loop which is > running following functions: You mean that usb_ep0_reinit() runs the following, not that the loop does. > usb_disable_endpoint(udev, 0 + USB_DIR_IN, true); > usb_disable_endpoint(udev, 0 + USB_DIR_OUT, true); > usb_enable_endpoint(udev, &udev->ep0, true); > > this is something only experience over the past decades can tell? It _is_ necessary, because the maxpacket size of ep0 may change from one loop iteration to the next. Therefore the endpoint must be disabled and re-enabled each time the loop repeats. [Now that I go back through the git log, it appears the only reason for exporting usb_ep0_reinit was so that the WUSB driver could call it -- see commit fc721f5194dc ("wusb: make ep0_reinit available for modules"). Since the kernel doesn't support WUSB any more, we should be able to stop exporting that function.] > usb_enable_endpoint will trigger xhci_endpoint_reset which doesn't do > much, but crashes the entire system with the upstream kernel when it > triggers xhci_check_bw_table). > > I removed usb_ep0_reinit here and devices are still workable under > various conditions (again I shorted and pulled D+/D- to ground for > testing). > The NULL PTR check in xhci_check_bw_table would be a second line of > defense but as indicated in the first mail it shouldn't even get there. > > > > As a second issue I found in usb_reset_and_verify device > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L6131 > > ret = hub_port_init(parent_hub, udev, port1, i, &descriptor); > if (ret >= 0 || ret == -ENOTCONN || ret == -ENODEV) { > break; > } > > hub_port_init can also return -71 / -EPROTO, the cases should be very > rare when usb_reset_and_verify_device is triggered and that happens. If that happens, the loop which this code sits inside will simply perform another iteration. That's what it's supposed to do, not an issue at all. Alan Stern ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Highly critical bug in XHCI Controller 2024-11-17 15:18 ` Alan Stern @ 2024-11-17 15:47 ` Markus Rechberger 2024-11-17 21:02 ` Alan Stern 0 siblings, 1 reply; 12+ messages in thread From: Markus Rechberger @ 2024-11-17 15:47 UTC (permalink / raw) To: Alan Stern; +Cc: linux-usb, linux-kernel On Sun, 2024-11-17 at 10:18 -0500, Alan Stern wrote: > On Sun, Nov 17, 2024 at 08:44:16PM +0800, Markus Rechberger wrote: > > Basically the issue comes from hub_port_connect. > > > > drivers/usb/core/hub.c > > > > hub_port_init returns -71 -EPROTO and jumps to loop > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450 > > > > I'd question if usb_ep0_reinit is really required in loop which is > > running following functions: > > You mean that usb_ep0_reinit() runs the following, not that the loop > does. > > > usb_disable_endpoint(udev, 0 + USB_DIR_IN, true); > > usb_disable_endpoint(udev, 0 + USB_DIR_OUT, true); > > usb_enable_endpoint(udev, &udev->ep0, true); > > > > this is something only experience over the past decades can tell? > > It _is_ necessary, because the maxpacket size of ep0 may change from > one loop iteration to the next. Therefore the endpoint must be > disabled > and re-enabled each time the loop repeats. > > [Now that I go back through the git log, it appears the only reason > for > exporting usb_ep0_reinit was so that the WUSB driver could call it -- > see commit fc721f5194dc ("wusb: make ep0_reinit available for > modules"). > Since the kernel doesn't support WUSB any more, we should be able to > stop exporting that function.] This should only go down there in case an error happened earlier no? In case the hardware signed up correctly it should not even enter that code. My experience is just - reconnect the device in case an error happened those workarounds did not work properly for the device I deal with (but yes that's why I'm asking - maybe someone else has different hardware with different experience). > > > usb_enable_endpoint will trigger xhci_endpoint_reset which doesn't > > do > > much, but crashes the entire system with the upstream kernel when > > it > > triggers xhci_check_bw_table). > > > > I removed usb_ep0_reinit here and devices are still workable under > > various conditions (again I shorted and pulled D+/D- to ground for > > testing). > > The NULL PTR check in xhci_check_bw_table would be a second line of > > defense but as indicated in the first mail it shouldn't even get > > there. > > > > > > > > As a second issue I found in usb_reset_and_verify device > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L6131 > > > > ret = hub_port_init(parent_hub, udev, port1, i, > > &descriptor); > > if (ret >= 0 || ret == -ENOTCONN || ret == -ENODEV) { > > break; > > } > > > > hub_port_init can also return -71 / -EPROTO, the cases should be > > very > > rare when usb_reset_and_verify_device is triggered and that > > happens. > > If that happens, the loop which this code sits inside will simply > perform another iteration. That's what it's supposed to do, not an > issue at all. > It doesn't cause any issue yes but it's not correct either. -EPROTO will be returned if I disconnect or short the device here. So initially someone thought he should check for -ENODEV/-ENOTCONN (which should also indicate that the device is gone), so -EPROTO should also be checked in that case. Otherwise just remove all those error checks. > Alan Stern > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Highly critical bug in XHCI Controller 2024-11-17 15:47 ` Markus Rechberger @ 2024-11-17 21:02 ` Alan Stern 2024-11-18 5:14 ` Markus Rechberger 0 siblings, 1 reply; 12+ messages in thread From: Alan Stern @ 2024-11-17 21:02 UTC (permalink / raw) To: Markus Rechberger; +Cc: linux-usb, linux-kernel On Sun, Nov 17, 2024 at 11:47:52PM +0800, Markus Rechberger wrote: > On Sun, 2024-11-17 at 10:18 -0500, Alan Stern wrote: > > On Sun, Nov 17, 2024 at 08:44:16PM +0800, Markus Rechberger wrote: > > > Basically the issue comes from hub_port_connect. > > > > > > drivers/usb/core/hub.c > > > > > > hub_port_init returns -71 -EPROTO and jumps to loop > > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450 > > > > > > I'd question if usb_ep0_reinit is really required in loop which is > > > running following functions: > This should only go down there in case an error happened earlier no? Of course, since -EPROTO indicates there was an error. > In case the hardware signed up correctly it should not even enter that > code. > > My experience is just - reconnect the device in case an error happened You can't reconnect some devices; they are permanently attached. There are other reasons why reconnecting might not be practical. > those > workarounds did not work properly for the device I deal with (but yes > that's > why I'm asking - maybe someone else has different hardware with > different > experience). I doubt we will hear from these people, because they will not realize that anything was wrong. In any case, it is generally recognized that the type of errors leading to -EPROTO (bad CRC or no response, for instance) can be temporary or intermittent. Retrying is an appropriate strategy. > > > As a second issue I found in usb_reset_and_verify device > > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L6131 > > > > > > ret = hub_port_init(parent_hub, udev, port1, i, > > > &descriptor); > > > if (ret >= 0 || ret == -ENOTCONN || ret == -ENODEV) { > > > break; > > > } > > > > > > hub_port_init can also return -71 / -EPROTO, the cases should be > > > very > > > rare when usb_reset_and_verify_device is triggered and that > > > happens. > > > > If that happens, the loop which this code sits inside will simply > > perform another iteration. That's what it's supposed to do, not an > > issue at all. > > > > It doesn't cause any issue yes but it's not correct either. Why not? What's wrong with it? > -EPROTO will be returned if I disconnect or short the device here. So > initially someone > thought he should check for -ENODEV/-ENOTCONN (which should also > indicate that the device is gone), so -EPROTO should also be checked in > that case. > Otherwise just remove all those error checks. -EPROTO does not necessarily indicate the device is gone; it indicates there was a problem communicating with the device. The problem might be caused by a disconnection, or it might be caused by temporary electromagnetic interference (for example), or by something else. Alan Stern ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Highly critical bug in XHCI Controller 2024-11-17 21:02 ` Alan Stern @ 2024-11-18 5:14 ` Markus Rechberger 2024-11-18 16:03 ` Alan Stern 2024-11-18 21:23 ` Michał Pecio 0 siblings, 2 replies; 12+ messages in thread From: Markus Rechberger @ 2024-11-18 5:14 UTC (permalink / raw) To: Alan Stern; +Cc: linux-usb, linux-kernel On Sun, 2024-11-17 at 16:02 -0500, Alan Stern wrote: > On Sun, Nov 17, 2024 at 11:47:52PM +0800, Markus Rechberger wrote: > > On Sun, 2024-11-17 at 10:18 -0500, Alan Stern wrote: > > > On Sun, Nov 17, 2024 at 08:44:16PM +0800, Markus Rechberger > > > wrote: > > > > Basically the issue comes from hub_port_connect. > > > > > > > > drivers/usb/core/hub.c > > > > > > > > hub_port_init returns -71 -EPROTO and jumps to loop > > > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450 > > > > > > > > I'd question if usb_ep0_reinit is really required in loop which > > > > is > > > > running following functions: > > > This should only go down there in case an error happened earlier > > no? > > Of course, since -EPROTO indicates there was an error. Alan, first of all thank you for your feedback. Do you know any practical way how to test this? > > > In case the hardware signed up correctly it should not even enter > > that > > code. > > > > My experience is just - reconnect the device in case an error > > happened > > You can't reconnect some devices; they are permanently attached. > There > are other reasons why reconnecting might not be practical. I understand what you mean here but isn't it primarily about downsizing USB 2.0 -> USB 1.1? The next point would be if eg. an endpoint of a webcam downsizes to 1.1 you don't have to expect a workable device because video would exceed the bandwidth easily. Do you know any practical example/hardware where this is really relevant? The failure handling will call the xhci reset which is not meant to be run without fully initialized data structures. In my case the driver returns -EPROTO (while the device is already disconnected), then first of all it tries to reset the endpoint which in the upstream kernel causes the NULL PTR exeption. For XHCI this code stream logic is simply not meant to be applied in such an error. My first submitted patch will solve the issue but just papers a higher logic issue. > > > those > > workarounds did not work properly for the device I deal with (but > > yes > > that's > > why I'm asking - maybe someone else has different hardware with > > different > > experience). > > I doubt we will hear from these people, because they will not realize > that anything was wrong. > > In any case, it is generally recognized that the type of errors > leading > to -EPROTO (bad CRC or no response, for instance) can be temporary or > intermittent. Retrying is an appropriate strategy. > > > > > As a second issue I found in usb_reset_and_verify device > > > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L6131 > > > > > > > > ret = hub_port_init(parent_hub, udev, port1, i, > > > > &descriptor); > > > > if (ret >= 0 || ret == -ENOTCONN || ret == -ENODEV) { > > > > break; > > > > } > > > > > > > > hub_port_init can also return -71 / -EPROTO, the cases should > > > > be > > > > very > > > > rare when usb_reset_and_verify_device is triggered and that > > > > happens. > > > > > > If that happens, the loop which this code sits inside will simply > > > perform another iteration. That's what it's supposed to do, not > > > an > > > issue at all. > > > > > > > It doesn't cause any issue yes but it's not correct either. > > Why not? What's wrong with it? > > > -EPROTO will be returned if I disconnect or short the device here. > > So > > initially someone > > thought he should check for -ENODEV/-ENOTCONN (which should also > > indicate that the device is gone), so -EPROTO should also be > > checked in > > that case. > > Otherwise just remove all those error checks. > > -EPROTO does not necessarily indicate the device is gone; it > indicates > there was a problem communicating with the device. The problem might > be > caused by a disconnection, or it might be caused by temporary > electromagnetic interference (for example), or by something else. a bad CRC response on a short cable - something has gone very wrong then. If there are temporary EMI issues ... I'd consider that an as an a absolute failure as well, they need to be handled in HW rather than a software workaround. In my experience with USB anything that is a 'temporary' failure can be considered as 'permanent' failure and I've really seen a lot over the last 1 1/2 decades. However issues are mostly related to immature controllers / missing quirks for some controllers. Our devices in the field since 2008 usually pump around 100-300mbit through the USB 2 link, streaming devices which usually run for a long period of time (up to months / years). 'retrying' something on a link where something has gone wrong for sure never worked properly for me, it would have continued with another followup issue at some point. I'm definitely in favor of telling the user if something went wrong either reconnect or submit a bug report for that particular device to add some quirk for special handling. Anyway can you give a particular example where this 'retrying' mechanism and reloading the endpoint size solves or solved a problem? Over the years people will certainly forget why this was implemented and take the code 'as is' and the next ones will just say it was always like that don't change. I don't mind keeping it as it is but it should be more clear / obvious why it's really done that way. Markus > > Alan Stern > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Highly critical bug in XHCI Controller 2024-11-18 5:14 ` Markus Rechberger @ 2024-11-18 16:03 ` Alan Stern 2024-11-18 21:23 ` Michał Pecio 1 sibling, 0 replies; 12+ messages in thread From: Alan Stern @ 2024-11-18 16:03 UTC (permalink / raw) To: Markus Rechberger; +Cc: linux-usb, linux-kernel On Mon, Nov 18, 2024 at 01:14:09PM +0800, Markus Rechberger wrote: > > > On Sun, 2024-11-17 at 16:02 -0500, Alan Stern wrote: > > On Sun, Nov 17, 2024 at 11:47:52PM +0800, Markus Rechberger wrote: > > > On Sun, 2024-11-17 at 10:18 -0500, Alan Stern wrote: > > > > On Sun, Nov 17, 2024 at 08:44:16PM +0800, Markus Rechberger > > > > wrote: > > > > > Basically the issue comes from hub_port_connect. > > > > > > > > > > drivers/usb/core/hub.c > > > > > > > > > > hub_port_init returns -71 -EPROTO and jumps to loop > > > > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450 > > > > > > > > > > I'd question if usb_ep0_reinit is really required in loop which > > > > > is > > > > > running following functions: > > > > > This should only go down there in case an error happened earlier > > > no? > > > > Of course, since -EPROTO indicates there was an error. > > Alan, first of all thank you for your feedback. > > Do you know any practical way how to test this? Not any easy way, no. Shorting out the signals like you do, but for only a few milliseconds, might work. > > > My experience is just - reconnect the device in case an error > > > happened > > > > You can't reconnect some devices; they are permanently attached. > > There > > are other reasons why reconnecting might not be practical. > > > I understand what you mean here but isn't it primarily about downsizing > USB 2.0 -> USB 1.1? No. This has nothing to do with changing the speed of the connection. > The next point would be if eg. an endpoint of a webcam downsizes to 1.1 > you don't have to expect a workable device because video would exceed > the bandwidth easily. > Do you know any practical example/hardware where this is really > relevant? > > The failure handling will call the xhci reset which is not meant to be > run without fully initialized data structures. In my case the driver > returns -EPROTO (while the device is already disconnected), then first > of all it tries to reset the endpoint which in the upstream kernel > causes the NULL PTR exeption. For XHCI this code stream logic is simply > not meant to be applied in such an error. > My first submitted patch will solve the issue but just papers a higher > logic issue. I think it's fair to say that the logic error lies in the xhci-hcd driver, not in the hub driver. xhci-hcd should be prepared for attempts to reset endpoint 0 before the device is fully initialized. After all, initializing the device requires us to be able to communicate with it (to read its descriptors), which requires ep0 to be working properly. > > In any case, it is generally recognized that the type of errors > > leading > > to -EPROTO (bad CRC or no response, for instance) can be temporary or > > intermittent. Retrying is an appropriate strategy. > > -EPROTO does not necessarily indicate the device is gone; it > > indicates > > there was a problem communicating with the device. The problem might > > be > > caused by a disconnection, or it might be caused by temporary > > electromagnetic interference (for example), or by something else. > > > a bad CRC response on a short cable - something has gone very wrong > then. Just a bad CRC response; the cable length is irrelevant. Another possiblity is a bit-stuffing error. Yes, something has gone very badly wrong, but it may go right again if we wait a few milliseconds and retry. I have seen reports from people who experienced problems of this sort caused by interference from a bad fluorescent light fixture or by electrical noise from a nearby fan. > If there are temporary EMI issues ... I'd consider that an as an a > absolute failure as well, they > need to be handled in HW rather than a software workaround. > > In my experience with USB anything that is a 'temporary' failure can be > considered as 'permanent' failure and I've really seen a lot over the > last 1 1/2 decades. > However issues are mostly related to immature controllers / missing > quirks for some controllers. It's true that errors of this type are quite rare. Nevertheless, retrying is a valid strategy for dealing with them and it doesn't hurt anything. > Our devices in the field since 2008 usually pump around 100-300mbit > through the USB 2 link, > streaming devices which usually run for a long period of time (up to > months / years). > 'retrying' something on a link where something has gone wrong for sure > never worked properly for me, it would have continued with another > followup issue at some point. I also changed the ehci-hcd driver to retry multiple times after one of these errors, rather than using the strict "3 strikes and you're out" approach. There were cases where that helped. > I'm definitely in favor of telling the user if something went wrong > either reconnect or submit a bug report for that particular device to > add some quirk for special handling. > > Anyway can you give a particular example where this 'retrying' > mechanism and reloading the endpoint size solves or solved a problem? I cannot. The events referred to above occurred many years ago. They were rare then, and they probably are even more rare now. > Over the years people will certainly forget why this was implemented > and take the code 'as is' > and the next ones will just say it was always like that don't change. > I don't mind keeping it as it is but it should be more clear / obvious > why it's really done that > way. It seems obvious that what the code does is to retry when -EPROTO errors occur. (Also -EILSEQ and -ETIME.) And the reason for retrying should also be pretty obvious: There's a chance that the error will go away when the transfer is retried -- why else would you retry something? This hardly seems like something we need to explain in a comment. But if you want to submit a patch adding such a comment, please feel free to do so. Alan Stern ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Highly critical bug in XHCI Controller 2024-11-18 5:14 ` Markus Rechberger 2024-11-18 16:03 ` Alan Stern @ 2024-11-18 21:23 ` Michał Pecio 1 sibling, 0 replies; 12+ messages in thread From: Michał Pecio @ 2024-11-18 21:23 UTC (permalink / raw) To: linuxusb.ml; +Cc: linux-kernel, linux-usb, stern > In my experience with USB anything that is a 'temporary' failure can > be considered as 'permanent' failure and I've really seen a lot over > the last 1 1/2 decades. > However issues are mostly related to immature controllers / missing > quirks for some controllers. > Our devices in the field since 2008 usually pump around 100-300mbit > through the USB 2 link, > streaming devices which usually run for a long period of time (up to > months / years). > 'retrying' something on a link where something has gone wrong for sure > never worked properly for me, it would have continued with another > followup issue at some point. You may have simply seen hardware going dead or buggy drivers failing to recover from recoverable errors. Random bit errors really happen and (excepting isochronous endpoints) can be recovered from. But if you get -EPROTO on a bulk endpoint, for example, it means the endpoint halted and should be reset. Few Linux drivers seem to bother with such things. I even think xHCI's handling of halted endpoints and usb_clear_halt() is broken, but it looks like fixing it would break all the buggy class drivers on the other hand, which are currently "sort of functional". > Anyway can you give a particular example where this 'retrying' > mechanism and reloading the endpoint size solves or solved a problem? It seems to happen when you insert the plug slowly or at an angle, and contact is briefly lost while the device is being initialized. The first three lines below come from hub_port_init(), which looks like it is being called by hub_port_connect() in a loop. [81169.840924] usb 5-1: new full-speed USB device number 61 using ohci-pci [81170.387927] usb 5-1: device not accepting address 61, error -62 [81170.742931] usb 5-1: new full-speed USB device number 62 using ohci-pci [81170.901914] usb 5-1: New USB device found, idVendor=067b, idProduct=2303, bcdDevice= 3.00 [81170.901919] usb 5-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0 [81170.901921] usb 5-1: Product: USB-Serial Controller [81170.901922] usb 5-1: Manufacturer: Prolific Technology Inc. Another example which could trigger retries is a device which includes a permanent "presence detect" resistor (such as PL2303, coincidentally) but takes a long time to initialize itself and start responding. Regards, Michal ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Highly critical bug in XHCI Controller 2024-11-17 7:33 Highly critical bug in XHCI Controller Markus Rechberger 2024-11-17 12:44 ` Markus Rechberger @ 2024-11-17 13:12 ` Greg KH 1 sibling, 0 replies; 12+ messages in thread From: Greg KH @ 2024-11-17 13:12 UTC (permalink / raw) To: Markus Rechberger; +Cc: linux-usb, linux-kernel On Sun, Nov 17, 2024 at 03:33:28PM +0800, Markus Rechberger wrote: > Hi, > > > the issue was first reported at vdr-portal.de > https://www-vdr--portal-de.translate.goog/forum/index.php?thread/136541-empfehlung-dvb-s2-tuner-oder-satip/&postID=1376196&_x_tr_sl=de&_x_tr_tl=en&_x_tr_hl=de&_x_tr_pto=wapp#post1376196 > > we've got around a highly critical bug in the xhci driver. > > https://sundtek.de/support/uxvd32.txt > > In xhci.c > > The bug is still active in Mainline: > https://github.com/torvalds/linux/blob/master/drivers/usb/host/xhci.c#L2382 > > static int xhci_check_bw_table(struct xhci_hcd *xhci, > struct xhci_virt_device *virt_dev, > int old_active_eps) > > bw_table can end up with a NULL pointer. > > This problem will lead to a complete kernel crash, rendering the entire > system unusable without any access to the actual linux system. > > How to trigger the problem? > Short D+/D- or pull them to ground on a USB device while connecting the > device. > > The problem can happen due to following cases: > * a device is getting suddenly disconnected during enumeration > * a faulty cable > * a faulty device > * a malicious device triggers this issue on purpose > * if there are electrical issues during connecting a device. Bad hardware is always a problem, can you send a patch to resolve this as you seem to be able to reproduce it? thanks, greg k-h ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2024-11-18 21:24 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-11-17 7:33 Highly critical bug in XHCI Controller Markus Rechberger 2024-11-17 12:44 ` Markus Rechberger 2024-11-17 13:12 ` Greg KH 2024-11-17 14:35 ` Michał Pecio 2024-11-17 15:03 ` Markus Rechberger 2024-11-17 15:18 ` Alan Stern 2024-11-17 15:47 ` Markus Rechberger 2024-11-17 21:02 ` Alan Stern 2024-11-18 5:14 ` Markus Rechberger 2024-11-18 16:03 ` Alan Stern 2024-11-18 21:23 ` Michał Pecio 2024-11-17 13:12 ` Greg KH
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).