* checkstop processing
@ 2017-11-13 21:34 Sergey Kachkin
  2017-11-14  3:42 ` Joel Stanley
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Sergey Kachkin @ 2017-11-13 21:34 UTC (permalink / raw)
  To: openbmc

Hi all,

I'm investigating checkstop processing and looking for a way to isolate
a faulty component with OpenBMC.
So far, the SEL logs available via REST are not really helpful.

Is there any data source in OpenBMC to troubleshoot checkstops?

I guess eSEL binary data parsed with eSEL.pl can be more informative, but do
we have any procedure to grab the binary SEL data and parse it with the
latest OpenBMC?

Currently it seems that IPL checkstop analysis is not really working: the
faulty component is not deconfigured on the next boot and the gard
list is empty.
This is easy to reproduce by injecting an error manually via putscom.

thanks in advance,

regards,
Sergey


* Re: checkstop processing
  2017-11-13 21:34 checkstop processing Sergey Kachkin
@ 2017-11-14  3:42 ` Joel Stanley
  2017-11-14  5:15   ` Oliver
  2017-11-14  6:00   ` Stewart Smith
  2017-11-14  4:51 ` Oliver
  2017-11-14 13:17 ` Balbir Singh
  2 siblings, 2 replies; 7+ messages in thread
From: Joel Stanley @ 2017-11-14  3:42 UTC (permalink / raw)
  To: Sergey Kachkin, Alistair Popple, Benjamin Herrenschmidt,
	Oliver O'Halloran, bsingharora, Stewart Smith
  Cc: OpenBMC Maillist

On Tue, Nov 14, 2017 at 8:04 AM, Sergey Kachkin <s.kachkin@gmail.com> wrote:
> Hi all,
>
> i'm investigating the checkstop processing and looking for a way to isolate
> a faulty component with OpenBmc.
> So far SEL logs available via REST are not really helpful.
>
> Is there any data source in the openbmc to troubleshoot checkstops?
>
> I guess eSEL binary data parsed with eSEL.pl can be more informative but do
> we have any procedure to grab the binary sel data and parse it with the
> latest obmc?
>
> Currently it seems that IPL checkstop analysis is not really working. i mean
> that faulty component is not deconfigured on the next boot and gard list is
> empty.
> It can be easily duplicated by injecting an error manually via putscom.

I think you've identified an area that would be great for improvement.

I'd like to expand the scope beyond just checkstop to other boot
failures: I've tried to boot machines recently that have failed to
even start hostboot, and I haven't known what has failed.

A tool that inspects recent error logs, and the state of the SBE would
be useful. We can leverage libpdbg to talk to the host.

Cheers,

Joel

* Re: checkstop processing
  2017-11-13 21:34 checkstop processing Sergey Kachkin
  2017-11-14  3:42 ` Joel Stanley
@ 2017-11-14  4:51 ` Oliver
  2017-11-14 13:17 ` Balbir Singh
  2 siblings, 0 replies; 7+ messages in thread
From: Oliver @ 2017-11-14  4:51 UTC (permalink / raw)
  To: Sergey Kachkin; +Cc: openbmc

On Tue, Nov 14, 2017 at 8:34 AM, Sergey Kachkin <s.kachkin@gmail.com> wrote:
> Hi all,
>
> i'm investigating the checkstop processing and looking for a way to isolate
> a faulty component with OpenBmc.

What did you have in mind? The IPL time checkstop analysis that
hostboot does *should* handle all this stuff for you. I'm not sure how
straightforward porting that functionality to the BMC would be since
it might require access to data from the system's MRW.

> So far SEL logs available via REST are not really helpful.
>
> Is there any data source in the openbmc to troubleshoot checkstops?
>
> I guess eSEL binary data parsed with eSEL.pl can be more informative but do
> we have any procedure to grab the binary sel data and parse it with the
> latest obmc?
>
> Currently it seems that IPL checkstop analysis is not really working. i mean
> that faulty component is not deconfigured on the next boot and gard list is
> empty.
> It can be easily duplicated by injecting an error manually via putscom.

What errors are you injecting, and what are you using to check for GARD
records? There's an open bug (SW404983) where hostboot generates bad
gard records that the openpower gard tool doesn't understand. A side
effect of that bug is that hostboot might overwrite existing records
rather than creating a new one. You might be getting bitten by that.

>
> thanks in advance,
>
> regards,
> Sergey
>

* Re: checkstop processing
  2017-11-14  3:42 ` Joel Stanley
@ 2017-11-14  5:15   ` Oliver
  2017-11-14  6:01     ` Stewart Smith
  2017-11-14  6:00   ` Stewart Smith
  1 sibling, 1 reply; 7+ messages in thread
From: Oliver @ 2017-11-14  5:15 UTC (permalink / raw)
  To: Joel Stanley
  Cc: Sergey Kachkin, Alistair Popple, Benjamin Herrenschmidt,
	Balbir Singh, Stewart Smith, OpenBMC Maillist

On Tue, Nov 14, 2017 at 2:42 PM, Joel Stanley <joel@jms.id.au> wrote:
> On Tue, Nov 14, 2017 at 8:04 AM, Sergey Kachkin <s.kachkin@gmail.com> wrote:
>> Hi all,
>>
>> i'm investigating the checkstop processing and looking for a way to isolate
>> a faulty component with OpenBmc.
>> So far SEL logs available via REST are not really helpful.
>>
>> Is there any data source in the openbmc to troubleshoot checkstops?
>>
>> I guess eSEL binary data parsed with eSEL.pl can be more informative but do
>> we have any procedure to grab the binary sel data and parse it with the
>> latest obmc?
>>
>> Currently it seems that IPL checkstop analysis is not really working. i mean
>> that faulty component is not deconfigured on the next boot and gard list is
>> empty.
>> It can be easily duplicated by injecting an error manually via putscom.
>
> I think you've identified an area that would be great for improvement.
>
> I'd like to expand the scope beyond just checkstop to other boot
> failures: I've tried to boot machines recently that have failed to
> even start hostboot, and I haven't known what has failed.
>
> A tool that inspects recent error logs, and the state of the SBE would
> be useful. We can leverage libpdbg to talk to the host.

The SBE stores some state information in CFAM 2809 that we can use to
find out the current istep. I think we can also dump the SBE trace
buffer out of PIB memory on non-secure systems. Parsing the trace
buffer requires the tracehash file from the SBE build, but we can
probably add that to the squashfs file for the host firmware.
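
As a rough illustration of decoding that register once its raw 32-bit
value has been read (for instance with a CFAM read tool such as pdbg),
here is a minimal Python sketch. The field names and bit positions in
ASSUMED_LAYOUT are hypothetical placeholders, not the real register
layout; they would need to be checked against the SBE source. The only
firm assumption is IBM-style bit numbering, where bit 0 is the most
significant bit.

```python
def extract_field(value, start_bit, length, width=32):
    """Return `length` bits starting at IBM-style bit `start_bit`.

    IBM numbering counts from the most significant bit, so bit 0 of a
    32-bit register is the one with value 0x80000000.
    """
    if start_bit < 0 or length <= 0 or start_bit + length > width:
        raise ValueError("field does not fit in the register")
    shift = width - (start_bit + length)
    return (value >> shift) & ((1 << length) - 1)


# HYPOTHETICAL layout, for illustration only: name -> (start_bit, length).
# Replace with the real field definitions from the SBE source.
ASSUMED_LAYOUT = {
    "booted": (0, 1),
    "curr_state": (8, 4),
    "major_step": (12, 8),
    "minor_step": (20, 6),
}


def decode_sbe_state(value, layout=ASSUMED_LAYOUT):
    """Split a raw register value into named fields using `layout`."""
    return {
        name: extract_field(value, start, length)
        for name, (start, length) in layout.items()
    }
```

Keeping the layout in a table rather than in hard-coded shifts means
the decoder itself does not change when the real field positions are
filled in.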

>
> Cheers,
>
> Joel

* Re: checkstop processing
  2017-11-14  3:42 ` Joel Stanley
  2017-11-14  5:15   ` Oliver
@ 2017-11-14  6:00   ` Stewart Smith
  1 sibling, 0 replies; 7+ messages in thread
From: Stewart Smith @ 2017-11-14  6:00 UTC (permalink / raw)
  To: Joel Stanley, Sergey Kachkin, Alistair Popple,
	Benjamin Herrenschmidt, Oliver O'Halloran, bsingharora
  Cc: OpenBMC Maillist

Joel Stanley <joel@jms.id.au> writes:
> On Tue, Nov 14, 2017 at 8:04 AM, Sergey Kachkin <s.kachkin@gmail.com> wrote:
>> Hi all,
>>
>> i'm investigating the checkstop processing and looking for a way to isolate
>> a faulty component with OpenBmc.
>> So far SEL logs available via REST are not really helpful.
>>
>> Is there any data source in the openbmc to troubleshoot checkstops?
>>
>> I guess eSEL binary data parsed with eSEL.pl can be more informative but do
>> we have any procedure to grab the binary sel data and parse it with the
>> latest obmc?
>>
>> Currently it seems that IPL checkstop analysis is not really working. i mean
>> that faulty component is not deconfigured on the next boot and gard list is
>> empty.
>> It can be easily duplicated by injecting an error manually via putscom.
>
> I think you've identified an area that would be great for improvement.

Understatement of the year right there :)

This (of course) isn't an OpenBMC specific problem, but rather an
opportunity for OpenBMC to clearly excel against other BMC
implementations.

I'd love to see even the parsed ESELs show up through the REST API,
rather than the current mess which is literally just "printf("ESEL=%02x
%02x %02x...)".

If we have a PEL hidden in there, there's existing userspace to parse it
too (opal-elog-parse), and there's no reason why the BMC couldn't just
output the text representation of it all in addition to the binary.
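
To make that concrete, here is a small Python sketch of the first step:
turning the hex dump back into raw bytes so that it can be handed to a
parser. It assumes the text looks like "ESEL=" followed by
space-separated two-digit hex values, as in the printf above; the
function name is made up for illustration, and any header that may
precede the PEL inside the eSEL is not handled here.

```python
import re

HEX_BYTE = re.compile(r"[0-9a-fA-F]{2}")


def esel_text_to_bytes(text):
    """Convert 'ESEL=df 00 ...' text into the raw bytes it encodes."""
    _, marker, payload = text.partition("ESEL=")
    if not marker:
        raise ValueError("no ESEL= marker found")
    tokens = payload.split()
    for token in tokens:
        # Reject anything that is not exactly two hex digits.
        if not HEX_BYTE.fullmatch(token):
            raise ValueError("not a hex byte: %r" % token)
    return bytes(int(token, 16) for token in tokens)
```

The resulting bytes could then be written to a file and passed to
whichever parser understands the payload.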

-- 
Stewart Smith
OPAL Architect, IBM.

* Re: checkstop processing
  2017-11-14  5:15   ` Oliver
@ 2017-11-14  6:01     ` Stewart Smith
  0 siblings, 0 replies; 7+ messages in thread
From: Stewart Smith @ 2017-11-14  6:01 UTC (permalink / raw)
  To: Oliver, Joel Stanley
  Cc: Sergey Kachkin, Alistair Popple, Benjamin Herrenschmidt,
	Balbir Singh, OpenBMC Maillist

Oliver <oohall@gmail.com> writes:
> On Tue, Nov 14, 2017 at 2:42 PM, Joel Stanley <joel@jms.id.au> wrote:
>> On Tue, Nov 14, 2017 at 8:04 AM, Sergey Kachkin <s.kachkin@gmail.com> wrote:
>>> Hi all,
>>>
>>> i'm investigating the checkstop processing and looking for a way to isolate
>>> a faulty component with OpenBmc.
>>> So far SEL logs available via REST are not really helpful.
>>>
>>> Is there any data source in the openbmc to troubleshoot checkstops?
>>>
>>> I guess eSEL binary data parsed with eSEL.pl can be more informative but do
>>> we have any procedure to grab the binary sel data and parse it with the
>>> latest obmc?
>>>
>>> Currently it seems that IPL checkstop analysis is not really working. i mean
>>> that faulty component is not deconfigured on the next boot and gard list is
>>> empty.
>>> It can be easily duplicated by injecting an error manually via putscom.
>>
>> I think you've identified an area that would be great for improvement.
>>
>> I'd like to expand the scope beyond just checkstop to other boot
>> failures: I've tried to boot machines recently that have failed to
>> even start hostboot, and I haven't known what has failed.
>>
>> A tool that inspects recent error logs, and the state of the SBE would
>> be useful. We can leverage libpdbg to talk to the host.
>
> The SBE stores some state information in CFAM 2809 that we can use to
> find out the current istep. I think we can also dump the SBE trace
> buffer out of PIB memory on non-secure systems. Parsing the trace
> buffer requires the tracehash file from the SBE build, but we can
> probably add that to the squashfs file for the host firmware.

This would be ideal to put in a sensor for boot progress.

-- 
Stewart Smith
OPAL Architect, IBM.

* Re: checkstop processing
  2017-11-13 21:34 checkstop processing Sergey Kachkin
  2017-11-14  3:42 ` Joel Stanley
  2017-11-14  4:51 ` Oliver
@ 2017-11-14 13:17 ` Balbir Singh
  2 siblings, 0 replies; 7+ messages in thread
From: Balbir Singh @ 2017-11-14 13:17 UTC (permalink / raw)
  To: Sergey Kachkin; +Cc: openbmc

On Tue, Nov 14, 2017 at 8:34 AM, Sergey Kachkin <s.kachkin@gmail.com> wrote:
> Hi all,
>
> i'm investigating the checkstop processing and looking for a way to isolate
> a faulty component with OpenBmc.
> So far SEL logs available via REST are not really helpful.
>
> Is there any data source in the openbmc to troubleshoot checkstops?
>

Not yet! I guess you'd want to use some of the built-in pdbg infrastructure
to look at the checkstop issues.



> I guess eSEL binary data parsed with eSEL.pl can be more informative but do
> we have any procedure to grab the binary sel data and parse it with the
> latest obmc?
>

The workflow, as I understand it, is:

1. Run IPMI commands from another host to extract the eSEL logs.
2. Decode those logs with eSEL.pl.

Hostboot has gotten better at decoding checkstops at boot, so that's a
good first step.



> Currently it seems that IPL checkstop analysis is not really working. i mean
> that faulty component is not deconfigured on the next boot and gard list is
> empty.
> It can be easily duplicated by injecting an error manually via putscom.

I've seen the opposite to be honest. What error are you injecting?

Balbir Singh.
