From: Corey Minyard <corey@minyard.net>
To: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>,
Guenter Roeck <linux@roeck-us.net>,
Igor Raits <igor@gooddata.com>,
linux-acpi@vger.kernel.org, linux-hwmon@vger.kernel.org,
Daniel Secik <daniel.secik@gooddata.com>,
Zdenek Pesek <zdenek.pesek@gooddata.com>,
Jiri Jurica <jiri.jurica@gooddata.com>,
Huisong Li <lihuisong@huawei.com>
Subject: Re: [BISECTED - impi related]: acpi_power_meter: power*_average sysfs read hangs, mutex deadlock in hwmon_attr_show since v6.18.y
Date: Thu, 12 Feb 2026 10:48:17 -0600 [thread overview]
Message-ID: <aY4EUQcGbmT3Rexz@mail.minyard.net> (raw)
In-Reply-To: <CAJZ5v0hCEECyuAJZ4GAn=DeJR-UgtUXOMYYN9JQgEyghVnCYsw@mail.gmail.com>
On Thu, Feb 12, 2026 at 01:27:41PM +0100, Rafael J. Wysocki wrote:
> On Thu, Feb 12, 2026 at 10:11???AM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > On Fri, Feb 6, 2026 at 4:58???PM Corey Minyard <corey@minyard.net> wrote:
> > > >
> > > > On Fri, Feb 06, 2026 at 01:08:56PM +0100, Rafael J. Wysocki wrote:
> > > > > On Thu, Feb 5, 2026 at 11:34???PM Guenter Roeck <linux@roeck-us.net> wrote:
> > > > > >
> > > > > > On Thu, Feb 05, 2026 at 08:04:12PM +0100, Rafael J. Wysocki wrote:
> > > > > > > Cc: Corey
> > > > > > >
> > > > > > > On Thu, Feb 5, 2026 at 6:51???PM Guenter Roeck <linux@roeck-us.net> wrote:
> > > > > > > >
> > > > > > > > On Thu, Feb 05, 2026 at 08:25:57AM +0100, Igor Raits wrote:
> > > > > > > > > On Wed, Feb 4, 2026 at 11:49???PM Guenter Roeck <linux@roeck-us.net> wrote:
> > > > > > > > > >
> > > > > > > > > > On 2/4/26 11:54, Igor Raits wrote:
> > > > > > > > > > > I have written a patch with the help of AI and it fixes the problem. Attached.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > "No MIME, no links, no compression, no attachments. Just plain text"
> > > > > > > > >
> > > > > > > > > Sorry for that, I had assumed that attaching the file would make it in-line.
> > > > > > > > >
> > > > > > > > > > ... which means I can not provide inline feedback, which is the whole
> > > > > > > > > > point of the above.
> > > > > > > > > >
> > > > > > > > > > Your patch crosses subsystems, so it will need to be split in two
> > > > > > > > > > (assuming the ACPI side is even needed). Also, references to iDRAC
> > > > > > > > > > in common code seem inappropriate.
> > > > > > > > >
> > > > > > > > > Yes, this I believe was the essential part (it was the last piece in
> > > > > > > > > my testing which fixed the hanging):
> > > > > > > > >
> > > > > > > >
> > > > > > > > Then I'll need to ask differently: What happens if you drop the IPMI code,
> > > > > > > > and just keep the wait_for_completion -> wait_for_completion_timeout
> > > > > > > > change ? Would that be sufficient to solve the problem ?
> > > > > > >
> > > > > > > I'd rather say "Would that be sufficient to make the symptoms go
> > > > > > > away?" as it most likely papers over the real problem.
> > > > > > >
> > > > > >
> > > > > > Good point. Worse, it may result in UAF or memory leaks.
> > > > > >
> > > > > > > > Either case, the need for this change suggests that the ipmi change
> > > > > > > > may not be complete, since it should send a completion with an error.
> > > > > > >
> > > > > > > I think that reverting commit bc3a9d217755 ("ipmi:si: Gracefully
> > > > > > > handle if the BMC is non-functional") should also be considered as a
> > > > > > > possible way forward because it clearly did not improve things as
> > > > > > > expected, at least in this particular case.
> > > > > > >
> > > > > >
> > > > > > I tend to agree. I ran a number of AI code reviews over the patch, and
> > > > > > each time it finds new (and different) problems. The fact that the acpi
> > > > > > patch is still needed even after applying the ipmi changes suggests that
> > > > > > something is still missing in the ipmi code.
> > > > > >
> > > > > > > It evidently did something that confuses things quite a bit. Either
> > > > > > > it is returning IPMI_BUS_ERR instead of IPMI_ERR_UNSPECIFIED, or it is
> > > > > > > the "hosed" state and refusing to accept messages.
> > > > > > >
> > > > > >
> > > > > > More than that. My latest AI results are below, just for reference
> > > > > > (using Gemini 3 with Chris Mason's debug prompts). The prompt I used
> > > > > > for this run is:
> > > > >
> > > > > Well, I guess it's time to send a revert patch then.
> > > >
> > > > Thanks for the CC.
> > > >
> > > > Let's fix it right in the IPMI driver.
> > > >
> > > > >
> > > > > > "
> > > > > > The top commit in the linux/ directory results in hung tasks if the BMC
> > > > > > stops responding. Using @review-prompts/kernel/debugging.md analyze the
> > > > > > patch, identify the reason for the hung task problem, suggest and implement
> > > > > > a fix. Note that there may be more than one problem in the patch, so analyze
> > > > > > the complete patch and do not stop after fiding the first regression.
> > > > > > "
> > > > > >
> > > > > > I think that catches most of the problem, but not all of it.
> > > > > >
> > > > > > Guenter
> > > > > >
> > > > > > ---
> > > > > >
> > > > > > Summary of crash or warning:
> > > > > > Hung task detected in ipmi_si driver when BMC becomes non-functional.
> > > > > > Processes waiting for IPMI responses (e.g. ipmitool, monitoring agents) enter D state and never recover.
> > > > > >
> > > > > > Kernel version if available:
> > > > > > Top of tree (commit bc3a9d217755f65c137f145600f23bf1d6c31ea9)
> > > > > >
> > > > > > Machine type if available:
> > > > > > Generic Server with BMC
> > > > > >
> > > > > > Cleaned up copy of oops or stack trace:
> > > > > > [ 120.123456] INFO: task ipmitool:1234 blocked for more than 120 seconds.
> > > > > > [ 120.123457] Not tainted 6.14.0-rc1 #1
> > > > > > [ 120.123458] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > > [ 120.123459] task:ipmitool state:D stack: 0 pid: 1234 ppid: 100 flags:0x00000000
> > > > > > [ 120.123460] Call Trace:
> > > > > > [ 120.123461] <TASK>
> > > > > > [ 120.123462] __schedule+0x123/0x456
> > > > > > [ 120.123463] schedule+0x45/0x78
> > > > > > [ 120.123464] schedule_timeout+0x9a/0xbc
> > > > > > [ 120.123465] wait_for_completion+0xde/0xf0
> > > > > > [ 120.123466] ipmi_request_settime+0x123/0x145
> > > > > > [ 120.123467] ...
> > > > > > [ 120.123468] </TASK>
> > > > > >
> > > > > > Any other kernel messages you found relevant:
> > > > > > N/A
> > > > > >
> > > > > > Explanation of the problem:
> > > > > > 1. Hung Task:
> > > > > > The patch "ipmi:si: Gracefully handle if the BMC is non-functional" introduces a new state `SI_HOSED` to handle BMC failures. When the driver detects that the BMC is not responding, it transitions to `SI_HOSED` and fails the currently processing message (`curr_msg`). However, if a new message is queued via `sender()` (populating `waiting_msg`) during a recovery probe (state `SI_GETTING_FLAGS`), and that probe subsequently fails, the state machine transitions back to `SI_HOSED`. In this transition, the driver checks and fails `curr_msg`, but it neglects to check or fail `waiting_msg`. As a result, the `waiting_msg` remains in the queue indefinitely, causing the waiting process to hang.
> > > > >
> > > > > That's quite convincing and it would explain the observed symptoms.
> > > >
> > > > Yes, and it's a fairly easy fix, I think. The waiting message just
> > > > needs to be returned in that case. The following patch should do it:
> > >
> > > Jaroslav, it would be good to test the patch below on top of 6.19. I
> > > can put it on a test git branch if need be, so please let me know.
> > >
> > > > diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
> > > > index 5459ffdde8dc..ff159b1162b9 100644
> > > > --- a/drivers/char/ipmi/ipmi_si_intf.c
> > > > +++ b/drivers/char/ipmi/ipmi_si_intf.c
> > > > @@ -809,6 +809,12 @@ static enum si_sm_result smi_event_handler(struct smi_info *smi_info,
> > > > */
> > > > return_hosed_msg(smi_info, IPMI_BUS_ERR);
> > > > }
> > > > + if (smi_info->waiting_msg != NULL) {
> > > > + /* Also handle if there was a message waiting. */
> > > > + smi_info->curr_msg = smi_info->waiting_msg;
> > > > + smi_info->waiting_msg = NULL;
> > > > + return_hosed_msg(smi_info, IPMI_BUS_ERR);
> > > > + }
> > > > smi_mod_timer(smi_info, jiffies + SI_TIMEOUT_HOSED);
> > > > goto out;
> > > > }
> >
> > I apply ^ patch to both 6.18.10 and 6.19 and reproduced the issue on
> > both, so it does not fix the problem.
>
> Thanks!
>
> With all due respect to everyone involved (including the AI), this
> means that we are not anywhere close to fixing the problem and it
> would be a shame to ship 7.0 with it.
>
> I'm sending a revert patch shortly.
Unfortunately, that patch fixed an issue others were having.
I'm going to try to hack on qemu and see if I can reproduce this. I
don't have any hardware where this is feasible.
-corey
next prev parent reply other threads:[~2026-02-12 16:48 UTC|newest]
Thread overview: 60+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-01-22 18:21 acpi_power_meter: power*_average sysfs read hangs, mutex deadlock in hwmon_attr_show since v6.18.y Jaroslav Pulchart
2026-01-22 18:55 ` Rafael J. Wysocki
2026-01-22 19:51 ` Guenter Roeck
2026-01-22 20:42 ` Rafael J. Wysocki
2026-01-22 22:28 ` Guenter Roeck
2026-01-23 10:19 ` Jaroslav Pulchart
2026-01-23 18:23 ` Guenter Roeck
2026-01-27 11:58 ` Jaroslav Pulchart
2026-01-27 14:24 ` Guenter Roeck
2026-01-29 15:26 ` Jaroslav Pulchart
2026-01-29 18:22 ` Guenter Roeck
2026-02-02 16:48 ` Jaroslav Pulchart
2026-02-02 18:00 ` Guenter Roeck
2026-02-02 18:14 ` Jaroslav Pulchart
2026-02-02 19:26 ` Rafael J. Wysocki
2026-02-03 0:26 ` Guenter Roeck
2026-02-03 8:23 ` Jaroslav Pulchart
2026-02-03 23:21 ` [BISECTED]: " Jaroslav Pulchart
2026-02-04 1:01 ` Guenter Roeck
2026-02-04 8:20 ` Jaroslav Pulchart
2026-02-04 19:54 ` Igor Raits
2026-02-04 22:48 ` Guenter Roeck
2026-02-05 7:25 ` Igor Raits
2026-02-05 17:51 ` Guenter Roeck
2026-02-05 19:04 ` Rafael J. Wysocki
2026-02-05 20:57 ` [BISECTED - impi related]: " Guenter Roeck
2026-02-06 12:08 ` Rafael J. Wysocki
2026-02-06 15:58 ` Corey Minyard
2026-02-06 19:33 ` Rafael J. Wysocki
2026-02-10 16:31 ` Rafael J. Wysocki
2026-02-12 9:10 ` Jaroslav Pulchart
2026-02-12 12:27 ` Rafael J. Wysocki
2026-02-12 16:48 ` Corey Minyard [this message]
2026-02-12 17:22 ` Rafael J. Wysocki
2026-02-12 18:34 ` Corey Minyard
2026-02-12 21:33 ` Rafael J. Wysocki
2026-02-12 22:06 ` Corey Minyard
2026-02-13 6:55 ` Corey Minyard
2026-02-13 12:47 ` Rafael J. Wysocki
2026-02-06 16:08 ` Corey Minyard
2026-02-06 16:31 ` Guenter Roeck
2026-02-06 19:35 ` Rafael J. Wysocki
2026-01-23 2:53 ` lihuisong (C)
2026-01-27 16:13 ` Guenter Roeck
2026-01-28 18:18 ` Guenter Roeck
2026-01-28 18:45 ` Rafael J. Wysocki
2026-01-28 19:52 ` Rafael J. Wysocki
2026-01-28 21:52 ` Guenter Roeck
2026-01-29 14:18 ` [PATCH v1] hwmon: (acpi_power_meter) Fix deadlocks related to acpi_power_meter_notify() Rafael J. Wysocki
2026-01-30 1:07 ` Guenter Roeck
2026-01-30 17:51 ` Rafael J. Wysocki
2026-01-30 18:51 ` Guenter Roeck
2026-01-30 1:47 ` lihuisong (C)
2026-01-30 5:07 ` Guenter Roeck
2026-01-30 8:40 ` lihuisong (C)
2026-01-30 17:53 ` Rafael J. Wysocki
2026-01-31 10:06 ` lihuisong (C)
2026-01-31 15:29 ` Guenter Roeck
2026-02-01 11:40 ` Rafael J. Wysocki
2026-01-29 1:55 ` acpi_power_meter: power*_average sysfs read hangs, mutex deadlock in hwmon_attr_show since v6.18.y lihuisong (C)
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aY4EUQcGbmT3Rexz@mail.minyard.net \
--to=corey@minyard.net \
--cc=daniel.secik@gooddata.com \
--cc=igor@gooddata.com \
--cc=jaroslav.pulchart@gooddata.com \
--cc=jiri.jurica@gooddata.com \
--cc=lihuisong@huawei.com \
--cc=linux-acpi@vger.kernel.org \
--cc=linux-hwmon@vger.kernel.org \
--cc=linux@roeck-us.net \
--cc=rafael@kernel.org \
--cc=zdenek.pesek@gooddata.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.