From: Corey Minyard <corey@minyard.net>
To: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>,
Guenter Roeck <linux@roeck-us.net>,
Igor Raits <igor@gooddata.com>,
linux-acpi@vger.kernel.org, linux-hwmon@vger.kernel.org,
Daniel Secik <daniel.secik@gooddata.com>,
Zdenek Pesek <zdenek.pesek@gooddata.com>,
Jiri Jurica <jiri.jurica@gooddata.com>,
Huisong Li <lihuisong@huawei.com>
Subject: Re: [BISECTED - impi related]: acpi_power_meter: power*_average sysfs read hangs, mutex deadlock in hwmon_attr_show since v6.18.y
Date: Thu, 12 Feb 2026 16:06:15 -0600 [thread overview]
Message-ID: <aY5O1z6SbrGSDaDs@mail.minyard.net> (raw)
In-Reply-To: <CAJZ5v0iS939x7VMQoNgks=Edichx2C+Qq8_kfiLpLh0ov=gwpw@mail.gmail.com>
On Thu, Feb 12, 2026 at 10:33:15PM +0100, Rafael J. Wysocki wrote:
> On Thu, Feb 12, 2026 at 7:35???PM Corey Minyard <corey@minyard.net> wrote:
> >
> > On Thu, Feb 12, 2026 at 06:22:08PM +0100, Rafael J. Wysocki wrote:
> > > On Thu, Feb 12, 2026 at 5:48???PM Corey Minyard <corey@minyard.net> wrote:
> > > >
> > > > On Thu, Feb 12, 2026 at 01:27:41PM +0100, Rafael J. Wysocki wrote:
> > > > > On Thu, Feb 12, 2026 at 10:11???AM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > On Fri, Feb 6, 2026 at 4:58???PM Corey Minyard <corey@minyard.net> wrote:
> > > > > > > >
> > > > > > > > On Fri, Feb 06, 2026 at 01:08:56PM +0100, Rafael J. Wysocki wrote:
> > > > > > > > > On Thu, Feb 5, 2026 at 11:34???PM Guenter Roeck <linux@roeck-us.net> wrote:
> > > > > > > > > >
> > > > > > > > > > On Thu, Feb 05, 2026 at 08:04:12PM +0100, Rafael J. Wysocki wrote:
> > > > > > > > > > > Cc: Corey
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Feb 5, 2026 at 6:51???PM Guenter Roeck <linux@roeck-us.net> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Feb 05, 2026 at 08:25:57AM +0100, Igor Raits wrote:
> > > > > > > > > > > > > On Wed, Feb 4, 2026 at 11:49???PM Guenter Roeck <linux@roeck-us.net> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On 2/4/26 11:54, Igor Raits wrote:
> > > > > > > > > > > > > > > I have written a patch with the help of AI and it fixes the problem. Attached.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > "No MIME, no links, no compression, no attachments. Just plain text"
> > > > > > > > > > > > >
> > > > > > > > > > > > > Sorry for that, I had assumed that attaching the file would make it in-line.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > ... which means I can not provide inline feedback, which is the whole
> > > > > > > > > > > > > > point of the above.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Your patch crosses subsystems, so it will need to be split in two
> > > > > > > > > > > > > > (assuming the ACPI side is even needed). Also, references to iDRAC
> > > > > > > > > > > > > > in common code seem inappropriate.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes, this I believe was the essential part (it was the last piece in
> > > > > > > > > > > > > my testing which fixed the hanging):
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Then I'll need to ask differently: What happens if you drop the IPMI code,
> > > > > > > > > > > > and just keep the wait_for_completion -> wait_for_completion_timeout
> > > > > > > > > > > > change ? Would that be sufficient to solve the problem ?
> > > > > > > > > > >
> > > > > > > > > > > I'd rather say "Would that be sufficient to make the symptoms go
> > > > > > > > > > > away?" as it most likely papers over the real problem.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Good point. Worse, it may result in UAF or memory leaks.
> > > > > > > > > >
> > > > > > > > > > > > Either case, the need for this change suggests that the ipmi change
> > > > > > > > > > > > may not be complete, since it should send a completion with an error.
> > > > > > > > > > >
> > > > > > > > > > > I think that reverting commit bc3a9d217755 ("ipmi:si: Gracefully
> > > > > > > > > > > handle if the BMC is non-functional") should also be considered as a
> > > > > > > > > > > possible way forward because it clearly did not improve things as
> > > > > > > > > > > expected, at least in this particular case.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I tend to agree. I ran a number of AI code reviews over the patch, and
> > > > > > > > > > each time it finds new (and different) problems. The fact that the acpi
> > > > > > > > > > patch is still needed even after applying the ipmi changes suggests that
> > > > > > > > > > something is still missing in the ipmi code.
> > > > > > > > > >
> > > > > > > > > > > It evidently did something that confuses things quite a bit. Either
> > > > > > > > > > > it is returning IPMI_BUS_ERR instead of IPMI_ERR_UNSPECIFIED, or it is
> > > > > > > > > > > the "hosed" state and refusing to accept messages.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > More than that. My latest AI results are below, just for reference
> > > > > > > > > > (using Gemini 3 with Chris Mason's debug prompts). The prompt I used
> > > > > > > > > > for this run is:
> > > > > > > > >
> > > > > > > > > Well, I guess it's time to send a revert patch then.
> > > > > > > >
> > > > > > > > Thanks for the CC.
> > > > > > > >
> > > > > > > > Let's fix it right in the IPMI driver.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > "
> > > > > > > > > > The top commit in the linux/ directory results in hung tasks if the BMC
> > > > > > > > > > stops responding. Using @review-prompts/kernel/debugging.md analyze the
> > > > > > > > > > patch, identify the reason for the hung task problem, suggest and implement
> > > > > > > > > > a fix. Note that there may be more than one problem in the patch, so analyze
> > > > > > > > > > the complete patch and do not stop after fiding the first regression.
> > > > > > > > > > "
> > > > > > > > > >
> > > > > > > > > > I think that catches most of the problem, but not all of it.
> > > > > > > > > >
> > > > > > > > > > Guenter
> > > > > > > > > >
> > > > > > > > > > ---
> > > > > > > > > >
> > > > > > > > > > Summary of crash or warning:
> > > > > > > > > > Hung task detected in ipmi_si driver when BMC becomes non-functional.
> > > > > > > > > > Processes waiting for IPMI responses (e.g. ipmitool, monitoring agents) enter D state and never recover.
> > > > > > > > > >
> > > > > > > > > > Kernel version if available:
> > > > > > > > > > Top of tree (commit bc3a9d217755f65c137f145600f23bf1d6c31ea9)
> > > > > > > > > >
> > > > > > > > > > Machine type if available:
> > > > > > > > > > Generic Server with BMC
> > > > > > > > > >
> > > > > > > > > > Cleaned up copy of oops or stack trace:
> > > > > > > > > > [ 120.123456] INFO: task ipmitool:1234 blocked for more than 120 seconds.
> > > > > > > > > > [ 120.123457] Not tainted 6.14.0-rc1 #1
> > > > > > > > > > [ 120.123458] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > > > > > > [ 120.123459] task:ipmitool state:D stack: 0 pid: 1234 ppid: 100 flags:0x00000000
> > > > > > > > > > [ 120.123460] Call Trace:
> > > > > > > > > > [ 120.123461] <TASK>
> > > > > > > > > > [ 120.123462] __schedule+0x123/0x456
> > > > > > > > > > [ 120.123463] schedule+0x45/0x78
> > > > > > > > > > [ 120.123464] schedule_timeout+0x9a/0xbc
> > > > > > > > > > [ 120.123465] wait_for_completion+0xde/0xf0
> > > > > > > > > > [ 120.123466] ipmi_request_settime+0x123/0x145
> > > > > > > > > > [ 120.123467] ...
> > > > > > > > > > [ 120.123468] </TASK>
> > > > > > > > > >
> > > > > > > > > > Any other kernel messages you found relevant:
> > > > > > > > > > N/A
> > > > > > > > > >
> > > > > > > > > > Explanation of the problem:
> > > > > > > > > > 1. Hung Task:
> > > > > > > > > > The patch "ipmi:si: Gracefully handle if the BMC is non-functional" introduces a new state `SI_HOSED` to handle BMC failures. When the driver detects that the BMC is not responding, it transitions to `SI_HOSED` and fails the currently processing message (`curr_msg`). However, if a new message is queued via `sender()` (populating `waiting_msg`) during a recovery probe (state `SI_GETTING_FLAGS`), and that probe subsequently fails, the state machine transitions back to `SI_HOSED`. In this transition, the driver checks and fails `curr_msg`, but it neglects to check or fail `waiting_msg`. As a result, the `waiting_msg` remains in the queue indefinitely, causing the waiting process to hang.
> > > > > > > > >
> > > > > > > > > That's quite convincing and it would explain the observed symptoms.
> > > > > > > >
> > > > > > > > Yes, and it's a fairly easy fix, I think. The waiting message just
> > > > > > > > needs to be returned in that case. The following patch should do it:
> > > > > > >
> > > > > > > Jaroslav, it would be good to test the patch below on top of 6.19. I
> > > > > > > can put it on a test git branch if need be, so please let me know.
> > > > > > >
> > > > > > > > diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
> > > > > > > > index 5459ffdde8dc..ff159b1162b9 100644
> > > > > > > > --- a/drivers/char/ipmi/ipmi_si_intf.c
> > > > > > > > +++ b/drivers/char/ipmi/ipmi_si_intf.c
> > > > > > > > @@ -809,6 +809,12 @@ static enum si_sm_result smi_event_handler(struct smi_info *smi_info,
> > > > > > > > */
> > > > > > > > return_hosed_msg(smi_info, IPMI_BUS_ERR);
> > > > > > > > }
> > > > > > > > + if (smi_info->waiting_msg != NULL) {
> > > > > > > > + /* Also handle if there was a message waiting. */
> > > > > > > > + smi_info->curr_msg = smi_info->waiting_msg;
> > > > > > > > + smi_info->waiting_msg = NULL;
> > > > > > > > + return_hosed_msg(smi_info, IPMI_BUS_ERR);
> > > > > > > > + }
> > > > > > > > smi_mod_timer(smi_info, jiffies + SI_TIMEOUT_HOSED);
> > > > > > > > goto out;
> > > > > > > > }
> > > > > >
> > > > > > I apply ^ patch to both 6.18.10 and 6.19 and reproduced the issue on
> > > > > > both, so it does not fix the problem.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > With all due respect to everyone involved (including the AI), this
> > > > > means that we are not anywhere close to fixing the problem and it
> > > > > would be a shame to ship 7.0 with it.
> > > > >
> > > > > I'm sending a revert patch shortly.
> > > >
> > > > Unfortunately, that patch fixed an issue others were having.
> > >
> > > Granted, it broke something else, so it needs to be fixed or reverted.
> >
> > Yes, certainly.
> >
> > >
> > > Maybe there is a way to address the original problem fixed by it differently?
> >
> > I'm not sure. This is not the first attempt...
>
> I see.
>
> > >
> > > Do you have any pointers to any problem reports regarding that one?
> >
> > The original problem came as a patch set:
> >
> > https://lore.kernel.org/lkml/20221007092617.87597-1-zhangyuchen.lcr@bytedance.com/
> >
> > That had a lockup problem, and it had some other issues. So I reworked
> > the code to the current form.
>
> OK, thanks!
>
> > I'm working on qemu now. This needs to be added as part of the test suite, anyway.
>
> There is something in the current code that seems to be problematic.
>
> When acpi_ipmi_space_handler() runs, it calls ipmi_request_settime()
> to queue up a message. AFAICS, if all goes well, this ends up calling
> smi_send() via i_ipmi_request().
>
> If intf->curr_msg is NULL, the new message will not be added to a list
> in there, but intf->curr_msg will be set to point to it instead and
> handlers->sender() will be called on it. But handlers->sender points
> to sender() defined in ipmi_si_intf.c which returns IPMI_BUS_ERR
> without doing anything if smi_info->si_state == SI_HOSED and its
> return value is ignored.
>
> The message is only pointed to by intf->curr_msg at that point and
> AFAICS it will never get actually processed because intf->curr_msg is
> never really dereferenced (it is only compared with other pointers and
> checked against NULL if I'm not mistaken).
>
> It looks like smi_send() needs to check the handlers->sender() return
> value and maybe return it to the caller so i_ipmi_request() can return
> an error if it fails.
Yes, I think you might be right. I've just gotten qemu to a point where
I can test this.
Until that code was added handlers->sender() never returned an error.
Hopefully I can figure this out soon.
-corey
next prev parent reply other threads:[~2026-02-12 22:06 UTC|newest]
Thread overview: 60+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-01-22 18:21 acpi_power_meter: power*_average sysfs read hangs, mutex deadlock in hwmon_attr_show since v6.18.y Jaroslav Pulchart
2026-01-22 18:55 ` Rafael J. Wysocki
2026-01-22 19:51 ` Guenter Roeck
2026-01-22 20:42 ` Rafael J. Wysocki
2026-01-22 22:28 ` Guenter Roeck
2026-01-23 10:19 ` Jaroslav Pulchart
2026-01-23 18:23 ` Guenter Roeck
2026-01-27 11:58 ` Jaroslav Pulchart
2026-01-27 14:24 ` Guenter Roeck
2026-01-29 15:26 ` Jaroslav Pulchart
2026-01-29 18:22 ` Guenter Roeck
2026-02-02 16:48 ` Jaroslav Pulchart
2026-02-02 18:00 ` Guenter Roeck
2026-02-02 18:14 ` Jaroslav Pulchart
2026-02-02 19:26 ` Rafael J. Wysocki
2026-02-03 0:26 ` Guenter Roeck
2026-02-03 8:23 ` Jaroslav Pulchart
2026-02-03 23:21 ` [BISECTED]: " Jaroslav Pulchart
2026-02-04 1:01 ` Guenter Roeck
2026-02-04 8:20 ` Jaroslav Pulchart
2026-02-04 19:54 ` Igor Raits
2026-02-04 22:48 ` Guenter Roeck
2026-02-05 7:25 ` Igor Raits
2026-02-05 17:51 ` Guenter Roeck
2026-02-05 19:04 ` Rafael J. Wysocki
2026-02-05 20:57 ` [BISECTED - impi related]: " Guenter Roeck
2026-02-06 12:08 ` Rafael J. Wysocki
2026-02-06 15:58 ` Corey Minyard
2026-02-06 19:33 ` Rafael J. Wysocki
2026-02-10 16:31 ` Rafael J. Wysocki
2026-02-12 9:10 ` Jaroslav Pulchart
2026-02-12 12:27 ` Rafael J. Wysocki
2026-02-12 16:48 ` Corey Minyard
2026-02-12 17:22 ` Rafael J. Wysocki
2026-02-12 18:34 ` Corey Minyard
2026-02-12 21:33 ` Rafael J. Wysocki
2026-02-12 22:06 ` Corey Minyard [this message]
2026-02-13 6:55 ` Corey Minyard
2026-02-13 12:47 ` Rafael J. Wysocki
2026-02-06 16:08 ` Corey Minyard
2026-02-06 16:31 ` Guenter Roeck
2026-02-06 19:35 ` Rafael J. Wysocki
2026-01-23 2:53 ` lihuisong (C)
2026-01-27 16:13 ` Guenter Roeck
2026-01-28 18:18 ` Guenter Roeck
2026-01-28 18:45 ` Rafael J. Wysocki
2026-01-28 19:52 ` Rafael J. Wysocki
2026-01-28 21:52 ` Guenter Roeck
2026-01-29 14:18 ` [PATCH v1] hwmon: (acpi_power_meter) Fix deadlocks related to acpi_power_meter_notify() Rafael J. Wysocki
2026-01-30 1:07 ` Guenter Roeck
2026-01-30 17:51 ` Rafael J. Wysocki
2026-01-30 18:51 ` Guenter Roeck
2026-01-30 1:47 ` lihuisong (C)
2026-01-30 5:07 ` Guenter Roeck
2026-01-30 8:40 ` lihuisong (C)
2026-01-30 17:53 ` Rafael J. Wysocki
2026-01-31 10:06 ` lihuisong (C)
2026-01-31 15:29 ` Guenter Roeck
2026-02-01 11:40 ` Rafael J. Wysocki
2026-01-29 1:55 ` acpi_power_meter: power*_average sysfs read hangs, mutex deadlock in hwmon_attr_show since v6.18.y lihuisong (C)
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aY5O1z6SbrGSDaDs@mail.minyard.net \
--to=corey@minyard.net \
--cc=daniel.secik@gooddata.com \
--cc=igor@gooddata.com \
--cc=jaroslav.pulchart@gooddata.com \
--cc=jiri.jurica@gooddata.com \
--cc=lihuisong@huawei.com \
--cc=linux-acpi@vger.kernel.org \
--cc=linux-hwmon@vger.kernel.org \
--cc=linux@roeck-us.net \
--cc=rafael@kernel.org \
--cc=zdenek.pesek@gooddata.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.