Date: Thu, 12 Feb 2026 16:06:15 -0600
From: Corey Minyard
To: "Rafael J. Wysocki"
Cc: Jaroslav Pulchart, Guenter Roeck, Igor Raits,
 linux-acpi@vger.kernel.org, linux-hwmon@vger.kernel.org,
 Daniel Secik, Zdenek Pesek, Jiri Jurica, Huisong Li
Subject: Re: [BISECTED - impi related]: acpi_power_meter: power*_average sysfs read hangs, mutex deadlock in hwmon_attr_show since v6.18.y
Reply-To: corey@minyard.net
References: <1642aec8-e8c1-4ad4-a5b7-556feeedfd93@roeck-us.net>
X-Mailing-List: linux-acpi@vger.kernel.org

On Thu, Feb 12, 2026 at 10:33:15PM +0100, Rafael J. Wysocki wrote:
> On Thu, Feb 12, 2026 at 7:35 PM Corey Minyard wrote:
> >
> > On Thu, Feb 12, 2026 at 06:22:08PM +0100, Rafael J. Wysocki wrote:
> > > On Thu, Feb 12, 2026 at 5:48 PM Corey Minyard wrote:
> > > >
> > > > On Thu, Feb 12, 2026 at 01:27:41PM +0100, Rafael J. Wysocki wrote:
> > > > > On Thu, Feb 12, 2026 at 10:11 AM Jaroslav Pulchart
> > > > > wrote:
> > > > > >
> > > > > > >
> > > > > > > On Fri, Feb 6, 2026 at 4:58 PM Corey Minyard wrote:
> > > > > > > >
> > > > > > > > On Fri, Feb 06, 2026 at 01:08:56PM +0100, Rafael J. Wysocki wrote:
> > > > > > > > > On Thu, Feb 5, 2026 at 11:34 PM Guenter Roeck wrote:
> > > > > > > > > >
> > > > > > > > > > On Thu, Feb 05, 2026 at 08:04:12PM +0100, Rafael J.
> > > > > > > > > > Wysocki wrote:
> > > > > > > > > > > Cc: Corey
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Feb 5, 2026 at 6:51 PM Guenter Roeck wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Feb 05, 2026 at 08:25:57AM +0100, Igor Raits wrote:
> > > > > > > > > > > > > On Wed, Feb 4, 2026 at 11:49 PM Guenter Roeck wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On 2/4/26 11:54, Igor Raits wrote:
> > > > > > > > > > > > > > > I have written a patch with the help of AI and it fixes the problem. Attached.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > "No MIME, no links, no compression, no attachments. Just plain text"
> > > > > > > > > > > > >
> > > > > > > > > > > > > Sorry for that, I had assumed that attaching the file would make it in-line.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > ... which means I can not provide inline feedback, which is the whole
> > > > > > > > > > > > > > point of the above.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Your patch crosses subsystems, so it will need to be split in two
> > > > > > > > > > > > > > (assuming the ACPI side is even needed). Also, references to iDRAC
> > > > > > > > > > > > > > in common code seem inappropriate.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes, this I believe was the essential part (it was the last piece in
> > > > > > > > > > > > > my testing which fixed the hanging):
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Then I'll need to ask differently: What happens if you drop the IPMI code,
> > > > > > > > > > > > and just keep the wait_for_completion -> wait_for_completion_timeout
> > > > > > > > > > > > change ? Would that be sufficient to solve the problem ?
> > > > > > > > > > >
> > > > > > > > > > > I'd rather say "Would that be sufficient to make the symptoms go
> > > > > > > > > > > away?" as it most likely papers over the real problem.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Good point. Worse, it may result in UAF or memory leaks.
> > > > > > > > > >
> > > > > > > > > > > > Either case, the need for this change suggests that the ipmi change
> > > > > > > > > > > > may not be complete, since it should send a completion with an error.
> > > > > > > > > > >
> > > > > > > > > > > I think that reverting commit bc3a9d217755 ("ipmi:si: Gracefully
> > > > > > > > > > > handle if the BMC is non-functional") should also be considered as a
> > > > > > > > > > > possible way forward because it clearly did not improve things as
> > > > > > > > > > > expected, at least in this particular case.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I tend to agree. I ran a number of AI code reviews over the patch, and
> > > > > > > > > > each time it finds new (and different) problems. The fact that the acpi
> > > > > > > > > > patch is still needed even after applying the ipmi changes suggests that
> > > > > > > > > > something is still missing in the ipmi code.
> > > > > > > > > >
> > > > > > > > > > > It evidently did something that confuses things quite a bit. Either
> > > > > > > > > > > it is returning IPMI_BUS_ERR instead of IPMI_ERR_UNSPECIFIED, or it is
> > > > > > > > > > > the "hosed" state and refusing to accept messages.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > More than that. My latest AI results are below, just for reference
> > > > > > > > > > (using Gemini 3 with Chris Mason's debug prompts). The prompt I used
> > > > > > > > > > for this run is:
> > > > > > > > >
> > > > > > > > > Well, I guess it's time to send a revert patch then.
> > > > > > > >
> > > > > > > > Thanks for the CC.
> > > > > > > >
> > > > > > > > Let's fix it right in the IPMI driver.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > "
> > > > > > > > > > The top commit in the linux/ directory results in hung tasks if the BMC
> > > > > > > > > > stops responding.
> > > > > > > > > > Using @review-prompts/kernel/debugging.md analyze the
> > > > > > > > > > patch, identify the reason for the hung task problem, suggest and implement
> > > > > > > > > > a fix. Note that there may be more than one problem in the patch, so analyze
> > > > > > > > > > the complete patch and do not stop after fiding the first regression.
> > > > > > > > > > "
> > > > > > > > > >
> > > > > > > > > > I think that catches most of the problem, but not all of it.
> > > > > > > > > >
> > > > > > > > > > Guenter
> > > > > > > > > >
> > > > > > > > > > ---
> > > > > > > > > >
> > > > > > > > > > Summary of crash or warning:
> > > > > > > > > > Hung task detected in ipmi_si driver when BMC becomes non-functional.
> > > > > > > > > > Processes waiting for IPMI responses (e.g. ipmitool, monitoring agents) enter D state and never recover.
> > > > > > > > > >
> > > > > > > > > > Kernel version if available:
> > > > > > > > > > Top of tree (commit bc3a9d217755f65c137f145600f23bf1d6c31ea9)
> > > > > > > > > >
> > > > > > > > > > Machine type if available:
> > > > > > > > > > Generic Server with BMC
> > > > > > > > > >
> > > > > > > > > > Cleaned up copy of oops or stack trace:
> > > > > > > > > > [ 120.123456] INFO: task ipmitool:1234 blocked for more than 120 seconds.
> > > > > > > > > > [ 120.123457] Not tainted 6.14.0-rc1 #1
> > > > > > > > > > [ 120.123458] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > > > > > > [ 120.123459] task:ipmitool state:D stack: 0 pid: 1234 ppid: 100 flags:0x00000000
> > > > > > > > > > [ 120.123460] Call Trace:
> > > > > > > > > > [ 120.123461]
> > > > > > > > > > [ 120.123462] __schedule+0x123/0x456
> > > > > > > > > > [ 120.123463] schedule+0x45/0x78
> > > > > > > > > > [ 120.123464] schedule_timeout+0x9a/0xbc
> > > > > > > > > > [ 120.123465] wait_for_completion+0xde/0xf0
> > > > > > > > > > [ 120.123466] ipmi_request_settime+0x123/0x145
> > > > > > > > > > [ 120.123467] ...
> > > > > > > > > > [ 120.123468]
> > > > > > > > > >
> > > > > > > > > > Any other kernel messages you found relevant:
> > > > > > > > > > N/A
> > > > > > > > > >
> > > > > > > > > > Explanation of the problem:
> > > > > > > > > > 1. Hung Task:
> > > > > > > > > > The patch "ipmi:si: Gracefully handle if the BMC is non-functional" introduces a new state `SI_HOSED` to handle BMC failures. When the driver detects that the BMC is not responding, it transitions to `SI_HOSED` and fails the currently processing message (`curr_msg`). However, if a new message is queued via `sender()` (populating `waiting_msg`) during a recovery probe (state `SI_GETTING_FLAGS`), and that probe subsequently fails, the state machine transitions back to `SI_HOSED`. In this transition, the driver checks and fails `curr_msg`, but it neglects to check or fail `waiting_msg`. As a result, the `waiting_msg` remains in the queue indefinitely, causing the waiting process to hang.
> > > > > > > > > >
> > > > > > > > > That's quite convincing and it would explain the observed symptoms.
> > > > > > > >
> > > > > > > > Yes, and it's a fairly easy fix, I think. The waiting message just
> > > > > > > > needs to be returned in that case. The following patch should do it:
> > > > > > >
> > > > > > > Jaroslav, it would be good to test the patch below on top of 6.19. I
> > > > > > > can put it on a test git branch if need be, so please let me know.
> > > > > > >
> > > > > > > > diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
> > > > > > > > index 5459ffdde8dc..ff159b1162b9 100644
> > > > > > > > --- a/drivers/char/ipmi/ipmi_si_intf.c
> > > > > > > > +++ b/drivers/char/ipmi/ipmi_si_intf.c
> > > > > > > > @@ -809,6 +809,12 @@ static enum si_sm_result smi_event_handler(struct smi_info *smi_info,
> > > > > > > >  			 */
> > > > > > > >  			return_hosed_msg(smi_info, IPMI_BUS_ERR);
> > > > > > > >  		}
> > > > > > > > +		if (smi_info->waiting_msg != NULL) {
> > > > > > > > +			/* Also handle if there was a message waiting. */
> > > > > > > > +			smi_info->curr_msg = smi_info->waiting_msg;
> > > > > > > > +			smi_info->waiting_msg = NULL;
> > > > > > > > +			return_hosed_msg(smi_info, IPMI_BUS_ERR);
> > > > > > > > +		}
> > > > > > > >  		smi_mod_timer(smi_info, jiffies + SI_TIMEOUT_HOSED);
> > > > > > > >  		goto out;
> > > > > > > >  	}
> > > > > >
> > > > > > I apply ^ patch to both 6.18.10 and 6.19 and reproduced the issue on
> > > > > > both, so it does not fix the problem.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > With all due respect to everyone involved (including the AI), this
> > > > > means that we are not anywhere close to fixing the problem and it
> > > > > would be a shame to ship 7.0 with it.
> > > > >
> > > > > I'm sending a revert patch shortly.
> > > >
> > > > Unfortunately, that patch fixed an issue others were having.
> > >
> > > Granted, it broke something else, so it needs to be fixed or reverted.
> >
> > Yes, certainly.
> >
> > >
> > > Maybe there is a way to address the original problem fixed by it differently?
> >
> > I'm not sure. This is not the first attempt...
>
> I see.
>
> > >
> > > Do you have any pointers to any problem reports regarding that one?
> >
> > The original problem came as a patch set:
> >
> > https://lore.kernel.org/lkml/20221007092617.87597-1-zhangyuchen.lcr@bytedance.com/
> >
> > That had a lockup problem, and it had some other issues.
> > So I reworked the code to the current form.
>
> OK, thanks!
>
> > I'm working on qemu now. This needs to be added as part of the test suite, anyway.
>
> There is something in the current code that seems to be problematic.
>
> When acpi_ipmi_space_handler() runs, it calls ipmi_request_settime()
> to queue up a message. AFAICS, if all goes well, this ends up calling
> smi_send() via i_ipmi_request().
>
> If intf->curr_msg is NULL, the new message will not be added to a list
> in there, but intf->curr_msg will be set to point to it instead and
> handlers->sender() will be called on it. But handlers->sender points
> to sender() defined in ipmi_si_intf.c which returns IPMI_BUS_ERR
> without doing anything if smi_info->si_state == SI_HOSED and its
> return value is ignored.
>
> The message is only pointed to by intf->curr_msg at that point and
> AFAICS it will never get actually processed because intf->curr_msg is
> never really dereferenced (it is only compared with other pointers and
> checked against NULL if I'm not mistaken).
>
> It looks like smi_send() needs to check the handlers->sender() return
> value and maybe return it to the caller so i_ipmi_request() can return
> an error if it fails.

Yes, I think you might be right. I've just gotten qemu to a point where
I can test this. Until that code was added handlers->sender() never
returned an error.

Hopefully I can figure this out soon.

-corey
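
For illustration only, here is a small userspace model of the failure mode described in the quoted analysis, assuming that analysis is accurate. It is not the kernel code: `toy_intf`, `toy_msg`, `toy_sender()`, `toy_send_unchecked()`, `toy_send_checked()` and `TOY_BUS_ERR` are all hypothetical stand-ins, and the "checked" variant only shows the general idea of propagating the sender's error, not what an actual fix in smi_send() would have to look like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's real types or values. */
#define TOY_BUS_ERR 0xff	/* placeholder for an IPMI_BUS_ERR-style code */

struct toy_msg {
	bool done;	/* set once the requester has been completed */
	int  err;	/* completion status seen by the requester */
};

struct toy_intf {
	bool hosed;			/* models si_state == SI_HOSED */
	struct toy_msg *curr_msg;	/* models intf->curr_msg */
};

/* Models sender(): refuses the message while the interface is hosed. */
static int toy_sender(struct toy_intf *intf, struct toy_msg *msg)
{
	assert(intf != NULL && msg != NULL);
	return intf->hosed ? TOY_BUS_ERR : 0;
}

/*
 * Return value dropped: the message stays parked in curr_msg and the
 * requester is never completed, so a waiter would block forever.
 */
static int toy_send_unchecked(struct toy_intf *intf, struct toy_msg *msg)
{
	intf->curr_msg = msg;
	toy_sender(intf, msg);
	return 0;
}

/*
 * Return value checked: un-park the message, complete the requester
 * with the error, and hand the error back to the caller.
 */
static int toy_send_checked(struct toy_intf *intf, struct toy_msg *msg)
{
	int rv;

	intf->curr_msg = msg;
	rv = toy_sender(intf, msg);
	if (rv) {
		intf->curr_msg = NULL;
		msg->err = rv;
		msg->done = true;
	}
	return rv;
}
```

With `hosed` set, the unchecked variant reports success while leaving the message undelivered and uncompleted, which matches the hang described above; the checked variant returns the error and completes the message. Whether the real driver should complete the message at that point or leave that to the caller is a design question this sketch does not settle.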