public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Corey Minyard <corey@minyard.net>
To: Matt Fleming <matt@readmodwrite.com>
Cc: Tony Camuso <tcamuso@redhat.com>,
	openipmi-developer@lists.sourceforge.net,
	linux-kernel@vger.kernel.org, kernel-team@cloudflare.com,
	Matt Fleming <mfleming@cloudflare.com>,
	Frederick Lawler <fred@cloudflare.com>
Subject: Re: [PATCH] ipmi: Add timeout to unconditional wait in __get_device_id()
Date: Mon, 20 Apr 2026 11:33:01 -0500	[thread overview]
Message-ID: <aeZVPUV_gPrgITJQ@mail.minyard.net> (raw)
In-Reply-To: <aeUsnI2nHAbtqoqt@matt-Precision-5490>

On Sun, Apr 19, 2026 at 09:50:38PM +0100, Matt Fleming wrote:
> On Fri, Apr 17, 2026 at 06:53:55PM -0500, Corey Minyard wrote:
> > 
> > The EVENT_MSG_BUFFER_FULL flag only gets cleared when a unsuccessful
> > READ_EVENT_MSG_BUFFER command completes.  Getting data from the
> > BMC has higher priority than sending data to the BMC.
> > 
> > If the BMC continually reports success from READ_EVENT_MSG_BUFFER, then
> > that would certainly wedge the driver.  But it would have to continually
> > report success for that command, which would be strange as its supposed
> > to error out when the queue is empty.
>  
> That does indeed appear to be what's happening.
> 
> The implementation of intel-ipmi-oem's OpenBMC READ_EVENT_MSG_BUFFER
> handler does not fail when there is nothing to read,
> 
>   https://github.com/openbmc/intel-ipmi-oem/blob/master/src/bridgingcommands.cpp#L704

Actually, that is so clearly wrong that it's hard to imagine they made
this mistake.  Anyway, a defect needs to be filed against it.  It will
certainly break other software on other OSes.

I think I have an easy workaround, though I'm guessing.  I'm guessing
they are returning zero data bytes. There's no check on the size at that
point in the code (it's later).

Can you try the following patch?

diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
index 4a9e9de4d684..cf8674a93af1 100644
--- a/drivers/char/ipmi/ipmi_si_intf.c
+++ b/drivers/char/ipmi/ipmi_si_intf.c
@@ -630,7 +630,13 @@ static void handle_transaction_done(struct smi_info *smi_info)
                 */
                msg = smi_info->curr_msg;
                smi_info->curr_msg = NULL;
-               if (msg->rsp[2] != 0) {
+               /*
+                * It appears some BMCs, with no event data, return no
+                * data in the message and not a 0x80 error as the
+                * spec says they should.  Shut down processing if
+                * the data is not the right length.
+                */
+               if (msg->rsp[2] != 0 || smi_info->curr_msg->rsp_size != 19) {
                        /* Error getting event, probably done. */
                        msg->done(msg);

 
With your approval on that, I'll send it to Linus after letting it sit
in the next tree for a bit.  Actually, I'll probably add it in any case,
as it's a good check to do.

> 
> > If it's really something like that, I could also look at adding limits
> > for those operations.
> 
> That would be great. Me and Fred would be happy to test out any patch.
> 
> I still think the original patch I sent is a worthwhile defense.
> 
> Our periodic monitoring scripts cause TASK_UNINTERRUPTIBLE tasks to
> block behind one another when we hit these kinds of issues in the IPMI
> code. Untangling that across thousands of machines can be time
> consuming and a more explicit EIO or ETIMEDOUT would help with triage.

Unfortunately, that might have other issues, similar to the ones the
people with the watchdog issue found.  I'll look at it a bit, but those
sorts of things would have to be scattered all over the code, not just
in that one place.  As you say, it would make debugging easier.

I think adding a counter to the number of operations occuring from a
single flag fetch would be a way to avoid this issue.  That's going to
take a little more time, but I'll definitely work on that.

Thanks,

-corey

  reply	other threads:[~2026-04-20 16:33 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-15 11:59 [PATCH] ipmi: Add timeout to unconditional wait in __get_device_id() Matt Fleming
2026-04-15 12:16 ` Corey Minyard
2026-04-15 15:46   ` Tony Camuso
2026-04-15 21:22     ` Frederick Lawler
2026-04-16 14:28       ` Tony Camuso
2026-04-17 16:01         ` Matt Fleming
2026-04-17 15:41   ` Matt Fleming
2026-04-17 22:23   ` Matt Fleming
2026-04-17 23:53     ` Corey Minyard
2026-04-19 20:50       ` Matt Fleming
2026-04-20 16:33         ` Corey Minyard [this message]
2026-04-20 18:11           ` Corey Minyard

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aeZVPUV_gPrgITJQ@mail.minyard.net \
    --to=corey@minyard.net \
    --cc=fred@cloudflare.com \
    --cc=kernel-team@cloudflare.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=matt@readmodwrite.com \
    --cc=mfleming@cloudflare.com \
    --cc=openipmi-developer@lists.sourceforge.net \
    --cc=tcamuso@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox