* [PATCH] ipmi: Add timeout to unconditional wait in __get_device_id() @ 2026-04-15 11:59 Matt Fleming 2026-04-15 12:16 ` Corey Minyard 0 siblings, 1 reply; 5+ messages in thread From: Matt Fleming @ 2026-04-15 11:59 UTC (permalink / raw) To: Corey Minyard Cc: openipmi-developer, linux-kernel, kernel-team, Matt Fleming, Matt Fleming From: Matt Fleming <mfleming@cloudflare.com> When the BMC does not respond to a "Get Device ID" command, the wait_event() in __get_device_id() blocks forever in TASK_UNINTERRUPTIBLE while holding bmc->dyn_mutex. Every subsequent sysfs reader then piles up in D state. Replace with wait_event_timeout() to return -EIO after 1 second. Signed-off-by: Matt Fleming <matt@readmodwrite.com> --- drivers/char/ipmi/ipmi_msghandler.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c index c41f51c82edd..efa9588e8210 100644 --- a/drivers/char/ipmi/ipmi_msghandler.c +++ b/drivers/char/ipmi/ipmi_msghandler.c @@ -2599,7 +2599,13 @@ static int __get_device_id(struct ipmi_smi *intf, struct bmc_device *bmc) if (rv) goto out_reset_handler; - wait_event(intf->waitq, bmc->dyn_id_set != 2); + if (!wait_event_timeout(intf->waitq, bmc->dyn_id_set != 2, + msecs_to_jiffies(1000))) { + dev_warn(intf->si_dev, + "Timed out waiting for get bmc device id response\n"); + rv = -EIO; + goto out_reset_handler; + } if (!bmc->dyn_id_set) { if (bmc->cc != IPMI_CC_NO_ERROR && -- 2.43.0 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH] ipmi: Add timeout to unconditional wait in __get_device_id() 2026-04-15 11:59 [PATCH] ipmi: Add timeout to unconditional wait in __get_device_id() Matt Fleming @ 2026-04-15 12:16 ` Corey Minyard 2026-04-15 15:46 ` Tony Camuso 0 siblings, 1 reply; 5+ messages in thread From: Corey Minyard @ 2026-04-15 12:16 UTC (permalink / raw) To: Matt Fleming, Tony Camuso Cc: openipmi-developer, linux-kernel, kernel-team, Matt Fleming On Wed, Apr 15, 2026 at 12:59:30PM +0100, Matt Fleming wrote: > From: Matt Fleming <mfleming@cloudflare.com> > > When the BMC does not respond to a "Get Device ID" command, the > wait_event() in __get_device_id() blocks forever in TASK_UNINTERRUPTIBLE > while holding bmc->dyn_mutex. Every subsequent sysfs reader then piles > up in D state. Replace with wait_event_timeout() to return -EIO after 1 > second. This is the second report I have of something like this. So something is up. I'm adding Tony, who reported something like this dealing with the watchdog. The lower level driver should never not return an answer, it is supposed to guarantee that it returns an error if the BMC doesn't respond. So the bug is not here, the bug is elsewhere. My guess is that there is some new failure mode where a BMC is not working but it responds well enough that it sort of works and fools the driver. But that's only a guess. I've seen this before in several scenarios, including a system that put IPMI in the ACPI tree and it sort of worked but there was no BMC present. I had to disable that particular device. What hardware is involved here? Can you give a more detailed example of what's happening in the low-level hardware? If it's KCS there's a debug flag in the drivers/char/ipmi/ipmi_kcs_sm.c file that should help. Thanks, -corey > > Signed-off-by: Matt Fleming <matt@readmodwrite.com> > --- > drivers/char/ipmi/ipmi_msghandler.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c > index c41f51c82edd..efa9588e8210 100644 > --- a/drivers/char/ipmi/ipmi_msghandler.c > +++ b/drivers/char/ipmi/ipmi_msghandler.c > @@ -2599,7 +2599,13 @@ static int __get_device_id(struct ipmi_smi *intf, struct bmc_device *bmc) > if (rv) > goto out_reset_handler; > > - wait_event(intf->waitq, bmc->dyn_id_set != 2); > + if (!wait_event_timeout(intf->waitq, bmc->dyn_id_set != 2, > + msecs_to_jiffies(1000))) { > + dev_warn(intf->si_dev, > + "Timed out waiting for get bmc device id response\n"); > + rv = -EIO; > + goto out_reset_handler; > + } > > if (!bmc->dyn_id_set) { > if (bmc->cc != IPMI_CC_NO_ERROR && > -- > 2.43.0 > ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] ipmi: Add timeout to unconditional wait in __get_device_id() 2026-04-15 12:16 ` Corey Minyard @ 2026-04-15 15:46 ` Tony Camuso 2026-04-15 21:22 ` Frederick Lawler 0 siblings, 1 reply; 5+ messages in thread From: Tony Camuso @ 2026-04-15 15:46 UTC (permalink / raw) To: corey, Matt Fleming Cc: openipmi-developer, linux-kernel, kernel-team, Matt Fleming On Wed, Apr 15, 2026 at 12:59:30PM +0100, Matt Fleming wrote: > From: Matt Fleming <mfl...@cl...> > > When the BMC does not respond to a "Get Device ID" command, the > wait_event() in __get_device_id() blocks forever in > TASK_UNINTERRUPTIBLE while holding bmc->dyn_mutex. Every subsequent > sysfs reader then piles up in D state. Replace with > wait_event_timeout() to return -EIO after 1 second. On Wed, Apr 15, 2026 at 12:17:04PM, Corey Minyard wrote: > This is the second report I have of something like this. So > something is up. I'm adding Tony, who reported something like this > dealing with the watchdog. > > The lower level driver should never not return an answer, it is > supposed to guarantee that it returns an error if the BMC doesn't > respond. So the bug is not here, the bug is elsewhere. I've been tracking a related issue (RHEL customer case) where BMC reset while the IPMI watchdog is active causes D-state hangs. This appears to be the same root cause Matt is hitting. I backported the recent upstream KCS/SI fixes to a RHEL 9 test kernel (54 patches bringing it to mainline parity) and tested today on a Dell R640. Test: Trigger `ipmitool mc reset cold` while watchdog daemon is running. Results with backported fixes: [ 245.376402] IPMI Watchdog: heartbeat completion received [ 246.376392] IPMI Watchdog: heartbeat send failure: -16 [ 247.377560] IPMI Watchdog: heartbeat send failure: -16 ... [ 252.413240] IPMI Watchdog: set timeout error: -16 The watchdog daemon received error 16 (EBUSY) and eventually initiated orderly shutdown: write watchdog device gave error 16 = 'Device or resource busy' shutting down the system because of error 16 Key finding: With the upstream fixes, the driver returns -EBUSY instead of blocking forever. No D-state hang. The watchdog daemon handles the error and initiates orderly reboot. Note: There was still a delay of several minutes before the daemon timed out and triggered shutdown. The driver returned errors promptly, but the watchdog daemon's retry logic (error retry time-out = 120 seconds) extended the overall recovery time. This may warrant a separate look at whether the daemon's retry behavior is appropriate when the BMC is completely unresponsive. This confirms Corey's assessment - the bug is in the lower-level driver not returning errors, not in __get_device_id(). Matt's timeout patch would be a defensive fallback, but the real fix is ensuring KCS/SI properly returns errors when the BMC is unresponsive. Tony ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] ipmi: Add timeout to unconditional wait in __get_device_id() 2026-04-15 15:46 ` Tony Camuso @ 2026-04-15 21:22 ` Frederick Lawler 2026-04-16 14:28 ` Tony Camuso 0 siblings, 1 reply; 5+ messages in thread From: Frederick Lawler @ 2026-04-15 21:22 UTC (permalink / raw) To: Tony Camuso Cc: corey, Matt Fleming, openipmi-developer, linux-kernel, kernel-team, Matt Fleming Hi Corey & Tony, On Wed, Apr 15, 2026 at 11:46:27AM -0400, 'Tony Camuso' via kernel-team wrote: > On Wed, Apr 15, 2026 at 12:59:30PM +0100, Matt Fleming wrote: > > From: Matt Fleming <mfl...@cl...> > > > > When the BMC does not respond to a "Get Device ID" command, the > > wait_event() in __get_device_id() blocks forever in > > TASK_UNINTERRUPTIBLE while holding bmc->dyn_mutex. Every subsequent > > sysfs reader then piles up in D state. Replace with > > wait_event_timeout() to return -EIO after 1 second. > > On Wed, Apr 15, 2026 at 12:17:04PM, Corey Minyard wrote: > > This is the second report I have of something like this. So > > something is up. I'm adding Tony, who reported something like this > > dealing with the watchdog. > > > > The lower level driver should never not return an answer, it is > > supposed to guarantee that it returns an error if the BMC doesn't > > respond. So the bug is not here, the bug is elsewhere. This is a bit of a throwback to our previous discussions around [1]. I did end up applying [2] based on that discussion, and had limited success, but we still have external resets that cause us to enter this undesirable state :( [1]: https://lore.kernel.org/all/aJUMlAG17c6lCgFq@mail.minyard.net/ [2]: https://lore.kernel.org/all/20250807230648.1112569-2-corey@minyard.net/ > > I've been tracking a related issue (RHEL customer case) where BMC > reset while the IPMI watchdog is active causes D-state hangs. This > appears to be the same root cause Matt is hitting. > > I backported the recent upstream KCS/SI fixes to a RHEL 9 test kernel > (54 patches bringing it to mainline parity) and tested today on a > Dell R640. I assume this patch series: "ipmi:watchdog: Fix panic, D-state hang, and lost protection on BMC reset" [3]? [3]: https://lore.kernel.org/all/20260407175134.3367345-1-tcamuso@redhat.com/ ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] ipmi: Add timeout to unconditional wait in __get_device_id() 2026-04-15 21:22 ` Frederick Lawler @ 2026-04-16 14:28 ` Tony Camuso 0 siblings, 0 replies; 5+ messages in thread From: Tony Camuso @ 2026-04-16 14:28 UTC (permalink / raw) To: Frederick Lawler Cc: corey, Matt Fleming, openipmi-developer, linux-kernel, kernel-team, Matt Fleming On 4/15/26 5:22 PM, Frederick Lawler wrote: > Hi Corey & Tony, > > On Wed, Apr 15, 2026 at 11:46:27AM -0400, 'Tony Camuso' via kernel-team wrote: >> On Wed, Apr 15, 2026 at 12:59:30PM +0100, Matt Fleming wrote: >>> From: Matt Fleming <mfl...@cl...> >>> >>> When the BMC does not respond to a "Get Device ID" command, the >>> wait_event() in __get_device_id() blocks forever in >>> TASK_UNINTERRUPTIBLE while holding bmc->dyn_mutex. Every subsequent >>> sysfs reader then piles up in D state. Replace with >>> wait_event_timeout() to return -EIO after 1 second. >> >> On Wed, Apr 15, 2026 at 12:17:04PM, Corey Minyard wrote: >>> This is the second report I have of something like this. So >>> something is up. I'm adding Tony, who reported something like this >>> dealing with the watchdog. >>> >>> The lower level driver should never not return an answer, it is >>> supposed to guarantee that it returns an error if the BMC doesn't >>> respond. So the bug is not here, the bug is elsewhere. > > This is a bit of a throwback to our previous discussions around [1]. > > I did end up applying [2] based on that discussion, and had limited > success, but we still have external resets that cause us to enter > this undesirable state :( In my testing with updates from the Linus tree, after a BMC cold reset: 1. The KCS driver returned -EBUSY to callers (good) 2. The watchdog daemon received the error and initiated shutdown 3. No D-state hang My tests, conducted on a Dell PER640, verified that Corey's upstream fixes cause the driver to properly return errors instead of blocking. At least on that platform. Which hich low-level driver are you using (KCS, BT, SSIF)? The PER640 uses KCS. # cat /sys/class/ipmi/ipmi0/device/params 2>/dev/null kcs,i/o,0xca8,rsp=4,rsi=1,rsh=0,irq=10,ipmb=32 > [1]: https://lore.kernel.org/all/aJUMlAG17c6lCgFq@mail.minyard.net/ > [2]: https://lore.kernel.org/all/20250807230648.1112569-2-corey@minyard.net/ >> >> I've been tracking a related issue (RHEL customer case) where BMC >> reset while the IPMI watchdog is active causes D-state hangs. This >> appears to be the same root cause Matt is hitting. >> >> I backported the recent upstream KCS/SI fixes to a RHEL 9 test kernel >> (54 patches bringing it to mainline parity) and tested today on a >> Dell R640. > > I assume this patch series: "ipmi:watchdog: Fix panic, D-state hang, and > lost protection on BMC reset" [3]? > > [3]: https://lore.kernel.org/all/20260407175134.3367345-1-tcamuso@redhat.com/ > Actually, no. The 54 commits I backported simply bring my RHEL-9 test kernel to parity with the Linus tree, which includes [2] and ... cae66f1a1dcd 2026-02-13 corey@minyard.net ipmi:si: Fix check for a misbehaving BMC ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2026-04-16 14:28 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-04-15 11:59 [PATCH] ipmi: Add timeout to unconditional wait in __get_device_id() Matt Fleming 2026-04-15 12:16 ` Corey Minyard 2026-04-15 15:46 ` Tony Camuso 2026-04-15 21:22 ` Frederick Lawler 2026-04-16 14:28 ` Tony Camuso
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox