* Preventing a system power on before BMC Ready
@ 2023-05-02 20:48 Andrew Geissler
  2023-05-02 21:50 ` Michael Richardson
  2023-05-03  0:48 ` Ed Tanous
  0 siblings, 2 replies; 4+ messages in thread
From: Andrew Geissler @ 2023-05-02 20:48 UTC
  To: OpenBMC List

About once a month a bug arrives internally where someone has powered on the
host without waiting for the BMC to reach its Ready state. Our systems for a
variety of reasons require the BMC to be at Ready before initiating a system
power on.

The defects are usually returned as user error in that users are supposed to
know to wait. Our Redfish clients (including the web UI) know to not allow a
power on operation until Ready. Recently however we had a bug where our external
Redfish client allowed a power on before Ready. That client is event driven once
connected to the BMC and because they never got an event about an unexpected BMC
reboot, they allowed a power on before Ready when the BMC came back up. Granted
there is only about a 30s window where we have a problem here, but as we all
know, when there's a window, someone finds it.

That got us brainstorming about some possible solutions:
- Write some code in bmcweb to send a “bmc state change event” anytime bmcweb
  comes up to ensure listening clients know “something” has happened
- Add an optional compile option to bmcweb (or PSM/x86-power-control) to require
  BMC Ready before issuing chassis or system POST requests (return error if not
  at Ready)
- Queue up the power on request and execute it once we reach BMC Ready (not sure
  what type of response that would be to Redfish clients or what error path
  looks like if we never reach Ready?)
- Find a way in the client to better detect an unexpected bmc reboot (heartbeat
  of some sort)
- Push bmcweb further in the startup to BMC Ready, ensuring clients can't talk
  to the BMC until it's near Ready state

Thoughts?
Andrew


* Re: Preventing a system power on before BMC Ready
From: Michael Richardson @ 2023-05-02 21:50 UTC
  To: Andrew Geissler, OpenBMC List



Andrew Geissler <geissonator@gmail.com> wrote:
    > That got us brainstorming about some possible solutions:
    > - Write some code in bmcweb to send a “bmc state change event” anytime
    >   bmcweb comes up to ensure listening clients know “something” has happened

Useful, but not foolproof.

    > - Queue up the power on request and execute it once we reach BMC Ready
    >   (not sure what type of response that would be to Redfish clients or
    >   what error path looks like if we never reach Ready?)

This seems like the best plan.

    > - Push bmcweb further in the startup to BMC Ready, ensuring clients
    >   can't talk to the BMC until it's near Ready state

The problem with this is that if you can't talk to the BMC, then you can't
find out why it was never Ready.




* Re: Preventing a system power on before BMC Ready
From: Ed Tanous @ 2023-05-03  0:48 UTC
  To: Andrew Geissler; +Cc: OpenBMC List


On Tue, May 2, 2023 at 1:49 PM Andrew Geissler <geissonator@gmail.com> wrote:
>
> About once a month a bug arrives internally where someone has powered on the
> host without waiting for the BMC to reach its Ready state. Our systems for a
> variety of reasons require the BMC to be at Ready before initiating a system
> power on.
>
> The defects are usually returned as user error in that users are supposed to
> know to wait. Our Redfish clients (including the web UI) know to not allow a
> power on operation until Ready. Recently however we had a bug where our external
> Redfish client allowed a power on before Ready. That client is event driven once
> connected to the BMC and because they never got an event about an unexpected BMC
> reboot, they allowed a power on before Ready when the BMC came back up. Granted
> there is only about a 30s window where we have a problem here, but as we all
> know, when there's a window, someone finds it.
>
> That got us brainstorming about some possible solutions:
> - Write some code in bmcweb to send a “bmc state change event” anytime bmcweb
>   comes up to ensure listening clients know “something” has happened
> - Add an optional compile option to bmcweb (or PSM/x86-power-control) to require
>   BMC Ready before issuing chassis or system POST requests (return error if not
>   at Ready)

PSM or x86-power-control mods would be my preference.  bmcweb should not be
in charge of business logic.  If the system shouldn't allow power on while
the BMC is not in the Ready state, then the daemons that handle power on need
to have that as a constraint; otherwise you'd have the same problem if a user
tried from IPMI.
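
As a rough illustration of putting the constraint in the daemon, here is a
minimal sketch in Python (the real daemons are C++). `HostStateManager`,
`BMCNotReadyError`, and the `require_bmc_ready` flag are hypothetical names
invented for this example; the state and transition strings are assumed to
follow the phosphor-dbus-interfaces naming.

```python
from dataclasses import dataclass

# Assumed to match the phosphor-dbus-interfaces enum naming.
BMC_READY = "xyz.openbmc_project.State.BMC.BMCState.Ready"
BMC_NOT_READY = "xyz.openbmc_project.State.BMC.BMCState.NotReady"
HOST_ON = "xyz.openbmc_project.State.Host.Transition.On"
HOST_OFF = "xyz.openbmc_project.State.Host.Transition.Off"


class BMCNotReadyError(Exception):
    """Hypothetical error raised when a transition is refused."""


@dataclass
class HostStateManager:
    """Toy stand-in for the daemon that owns host power transitions."""

    bmc_state: str = BMC_NOT_READY
    require_bmc_ready: bool = True  # hypothetical opt-in option
    requested_transition: str = HOST_OFF

    def request_transition(self, transition: str) -> None:
        # The check lives in the daemon, so Redfish, IPMI, and any other
        # front end asking for a transition hit the same constraint.
        if self.require_bmc_ready and self.bmc_state != BMC_READY:
            raise BMCNotReadyError("BMC has not reached Ready")
        self.requested_transition = transition
```

Because every front end goes through `request_transition`, none of them needs
its own copy of the Ready check.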

> - Queue up the power on request and execute it once we reach BMC Ready (not sure
>   what type of response that would be to Redfish clients or what error path
>   looks like if we never reach Ready?)

Redfish has async tasks for this exact use case, and we already have code
to do them.  Alternatively, you could just return an error saying that the
operation is not possible, along with a Retry-After header telling the
user when to retry their request.  We already do this in a few of the
update APIs.
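
A minimal sketch of what such a response could look like, in Python for
brevity. The function name is hypothetical, and the message ID and its
registry version prefix are assumptions modeled on the Redfish Base message
registry, not a copy of what bmcweb emits.

```python
import json


def bmc_not_ready_response(retry_after_seconds: int):
    """Build (status, headers, body) for a refused power-on request."""
    body = {
        "error": {
            # Assumed message ID; the version prefix is illustrative.
            "code": "Base.1.0.ServiceTemporarilyUnavailable",
            "message": (
                "The service is temporarily unavailable. "
                f"Retry in {retry_after_seconds} seconds."
            ),
        }
    }
    headers = {
        "Content-Type": "application/json",
        # Retry-After in its delay-seconds form.
        "Retry-After": str(retry_after_seconds),
    }
    return 503, headers, json.dumps(body)
```

The header gives clients a machine-readable wait time, while the body carries
the same information for anyone reading the error.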

> - Find a way in the client to better detect an unexpected bmc reboot (heartbeat
>   of some sort)
> - Push bmcweb further in the startup to BMC Ready, ensuring clients can't talk
>   to the BMC until it's near Ready state

For your use case, if this is possible, that’s probably easiest and most
client friendly, so long as you can handle the case where the BMC never
hits “Ready”.

>
> Thoughts?
> Andrew
-- 
-Ed


* Re: Preventing a system power on before BMC Ready
From: Andrew Geissler @ 2023-05-09 20:00 UTC
  To: Ed Tanous, Michael Richardson; +Cc: OpenBMC List




> On May 2, 2023, at 7:48 PM, Ed Tanous <ed@tanous.net> wrote:
>
> On Tue, May 2, 2023 at 1:49 PM Andrew Geissler <geissonator@gmail.com> wrote:
> >
> > That got us brainstorming about some possible solutions:
> > - Write some code in bmcweb to send a “bmc state change event” anytime bmcweb
> >   comes up to ensure listening clients know “something” has happened
> > - Add an optional compile option to bmcweb (or PSM/x86-power-control) to require
> >   BMC Ready before issuing chassis or system POST requests (return error if not
> >   at Ready)
> 
> PSM or x86-power-control mods would be my preference.  bmcweb should not be in charge of business logic.  If the system shouldn't allow power on while the BMC is not in the Ready state, then the daemons that handle power on need to have that as a constraint; otherwise you'd have the same problem if a user tried from IPMI.

Thanks for the responses, guys. I’m going to go down the path of an optional config
option in PSM that requires BMC Ready for chassis or host operations. When the BMC
is not at Ready, PSM will return a well-defined D-Bus error that bmcweb can check for
and turn into an error for the Redfish client, indicating that the operation is not
possible yet (and when they should retry).
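
A minimal sketch of that translation step, in Python for brevity. The D-Bus
error name, the `DBusError` class, and `handle_power_on` are all hypothetical
placeholders for illustration; the real error name would come from the PSM
change, and the 30 second default only echoes the window mentioned earlier in
the thread.

```python
# Hypothetical D-Bus error name, used only for this illustration.
BMC_NOT_READY_ERROR = "xyz.openbmc_project.State.Error.BMCNotReady"


class DBusError(Exception):
    """Stand-in for an error reply from a D-Bus call."""

    def __init__(self, name: str):
        super().__init__(name)
        self.name = name


def handle_power_on(set_transition, retry_after_seconds: int = 30):
    """Return (status, headers) for a Redfish power-on request.

    `set_transition` is any callable that asks the state manager for the
    transition and raises DBusError on failure.
    """
    try:
        set_transition()
    except DBusError as err:
        if err.name == BMC_NOT_READY_ERROR:
            # Not possible yet: tell the client when to try again.
            return 503, {"Retry-After": str(retry_after_seconds)}
        # Anything else is treated as an unexpected back-end failure.
        return 500, {}
    return 200, {}
```

Matching on one well-defined error name keeps the "not yet" case distinct
from genuine failures, so only the former invites a retry.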

Long term, we’d really like to see the power on/off operations return a Redfish
task, so clients could track the power operation directly instead of relying on the
polling and/or boot event notifications they need today. That timeline is a ways out
for us, though.

> > - Queue up the power on request and execute it once we reach BMC Ready (not sure
> >   what type of response that would be to Redfish clients or what error path
> >   looks like if we never reach Ready?)
> 
> Redfish has async tasks for this exact use case, and we already have code to do them.  Alternatively you could just return an error that the operation is not possible, along with a retry-after header instructing the user when to retry their request.  We do this in the few update apis already.

Yep, I like the alternative here medium term.

> 
> > - Find a way in the client to better detect an unexpected bmc reboot (heartbeat
> >   of some sort)
> > - Push bmcweb further in the startup to BMC Ready, ensuring clients can't talk
> >   to the BMC until it's near Ready state
> 
> For your use case, if this is possible, that’s probably easiest and most client friendly, so long as you can handle the case where the bmc never hits “ready”

Possible, but our Redfish client potentially manages a lot of systems, so anything that
increases repeated traffic is frowned upon. And since this seems like something that could
affect any Redfish client with similar event-driven requirements, it seems best to ensure
the OpenBMC back end returns an adequate error in this situation.
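
For illustration, a client could act on such an error without adding a
polling loop. This is a minimal Python sketch under stated assumptions:
`power_on_with_retry` and `send_request` are hypothetical names, the error is
assumed to arrive as an HTTP 503, and Retry-After is assumed to use its
delay-seconds form.

```python
import time


def power_on_with_retry(send_request, max_attempts=5, sleep=time.sleep):
    """Send a power-on request, waiting as instructed after each 503.

    `send_request` is a caller-supplied callable returning (status, headers).
    Returns the final status code, or None if max_attempts is 0.
    """
    status = None
    for attempt in range(max_attempts):
        status, headers = send_request()
        if status != 503 or attempt == max_attempts - 1:
            return status
        # Wait for as long as the BMC asked before the next attempt.
        sleep(int(headers.get("Retry-After", "1")))
    return status
```

In the common case this costs one extra request per refused power on, which
keeps the added traffic small even for a client managing many systems.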

> 
> >
> > Thoughts?
> > Andrew
> -- 
> -Ed

