public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Corey Minyard <corey@minyard.net>
To: Matt Fleming <matt@readmodwrite.com>,
	Tony Camuso <tcamuso@redhat.com>,
	openipmi-developer@lists.sourceforge.net,
	linux-kernel@vger.kernel.org, kernel-team@cloudflare.com,
	Matt Fleming <mfleming@cloudflare.com>,
	Frederick Lawler <fred@cloudflare.com>
Subject: Re: [PATCH] ipmi: Add timeout to unconditional wait in __get_device_id()
Date: Mon, 20 Apr 2026 13:11:25 -0500	[thread overview]
Message-ID: <aeZsTQdBNBzvmNG2@mail.minyard.net> (raw)
In-Reply-To: <aeZVPUV_gPrgITJQ@mail.minyard.net>

On Mon, Apr 20, 2026 at 11:33:01AM -0500, Corey Minyard wrote:
> On Sun, Apr 19, 2026 at 09:50:38PM +0100, Matt Fleming wrote:
> > On Fri, Apr 17, 2026 at 06:53:55PM -0500, Corey Minyard wrote:
> > > 
> > > The EVENT_MSG_BUFFER_FULL flag only gets cleared when a unsuccessful
> > > READ_EVENT_MSG_BUFFER command completes.  Getting data from the
> > > BMC has higher priority than sending data to the BMC.
> > > 
> > > If the BMC continually reports success from READ_EVENT_MSG_BUFFER, then
> > > that would certainly wedge the driver.  But it would have to continually
> > > report success for that command, which would be strange as its supposed
> > > to error out when the queue is empty.
> >  
> > That does indeed appear to be what's happening.
> > 
> > The implementation of intel-ipmi-oem's OpenBMC READ_EVENT_MSG_BUFFER
> > handler does not fail when there is nothing to read,
> > 
> >   https://github.com/openbmc/intel-ipmi-oem/blob/master/src/bridgingcommands.cpp#L704
> 
> Actually, that is so clearly wrong that it's hard to imagine they made
> this mistake.  Anyway, a defect needs to be filed against it.  It will
> certainly break other software on other OSes.
> 
> I think I have an easy workaround, though I'm guessing.  I'm guessing
> they are returning zero data bytes. There's no check on the size at that
> point in the code (it's later).
> 
> Can you try the following patch?

Actually ignore that, I was thinking too much about the other stuff and
didn't actually read my patch.

Here's a working patch:

diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
index 4a9e9de4d684..08c208cc64c5 100644
--- a/drivers/char/ipmi/ipmi_si_intf.c
+++ b/drivers/char/ipmi/ipmi_si_intf.c
@@ -630,7 +630,13 @@ static void handle_transaction_done(struct smi_info *smi_info)
                 */
                msg = smi_info->curr_msg;
                smi_info->curr_msg = NULL;
-               if (msg->rsp[2] != 0) {
+               /*
+                * It appears some BMCs, with no event data, return no
+                * data in the message and not a 0x80 error as the
+                * spec says they should.  Shut down processing if
+                * the data is not the right length.
+                */
+               if (msg->rsp[2] != 0 || msg->rsp_size != 19) {
                        /* Error getting event, probably done. */
                        msg->done(msg);



> 
> diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
> index 4a9e9de4d684..cf8674a93af1 100644
> --- a/drivers/char/ipmi/ipmi_si_intf.c
> +++ b/drivers/char/ipmi/ipmi_si_intf.c
> @@ -630,7 +630,13 @@ static void handle_transaction_done(struct smi_info *smi_info)
>                  */
>                 msg = smi_info->curr_msg;
>                 smi_info->curr_msg = NULL;
> -               if (msg->rsp[2] != 0) {
> +               /*
> +                * It appears some BMCs, with no event data, return no
> +                * data in the message and not a 0x80 error as the
> +                * spec says they should.  Shut down processing if
> +                * the data is not the right length.
> +                */
> +               if (msg->rsp[2] != 0 || smi_info->curr_msg->rsp_size != 19) {
>                         /* Error getting event, probably done. */
>                         msg->done(msg);
> 
>  
> With your approval on that, I'll send it to Linus after letting it sit
> in the next tree for a bit.  Actually, I'll probably add it in any case,
> as it's a good check to do.
> 
> > 
> > > If it's really something like that, I could also look at adding limits
> > > for those operations.
> > 
> > That would be great. Me and Fred would be happy to test out any patch.
> > 
> > I still think the original patch I sent is a worthwhile defense.
> > 
> > Our periodic monitoring scripts cause TASK_UNINTERRUPTIBLE tasks to
> > block behind one another when we hit these kinds of issues in the IPMI
> > code. Untangling that across thousands of machines can be time
> > consuming and a more explicit EIO or ETIMEDOUT would help with triage.
> 
> Unfortunately, that might have other issues, similar to the ones the
> people with the watchdog issue found.  I'll look at it a bit, but those
> sorts of things would have to be scattered all over the code, not just
> in that one place.  As you say, it would make debugging easier.
> 
> I think adding a counter to the number of operations occuring from a
> single flag fetch would be a way to avoid this issue.  That's going to
> take a little more time, but I'll definitely work on that.
> 
> Thanks,
> 
> -corey

      reply	other threads:[~2026-04-20 18:11 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-15 11:59 [PATCH] ipmi: Add timeout to unconditional wait in __get_device_id() Matt Fleming
2026-04-15 12:16 ` Corey Minyard
2026-04-15 15:46   ` Tony Camuso
2026-04-15 21:22     ` Frederick Lawler
2026-04-16 14:28       ` Tony Camuso
2026-04-17 16:01         ` Matt Fleming
2026-04-17 15:41   ` Matt Fleming
2026-04-17 22:23   ` Matt Fleming
2026-04-17 23:53     ` Corey Minyard
2026-04-19 20:50       ` Matt Fleming
2026-04-20 16:33         ` Corey Minyard
2026-04-20 18:11           ` Corey Minyard [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aeZsTQdBNBzvmNG2@mail.minyard.net \
    --to=corey@minyard.net \
    --cc=fred@cloudflare.com \
    --cc=kernel-team@cloudflare.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=matt@readmodwrite.com \
    --cc=mfleming@cloudflare.com \
    --cc=openipmi-developer@lists.sourceforge.net \
    --cc=tcamuso@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox