From: Patrick Williams <patrick@stwcx.xyz>
To: Kun Zhao <zkxz@hotmail.com>
Cc: "openbmc@lists.ozlabs.org" <openbmc@lists.ozlabs.org>
Subject: Re: SQUASHFS errors and OpenBMC hang
Date: Tue, 1 Sep 2020 07:35:06 -0500
Message-ID: <20200901123506.GR3532@heinlein>
In-Reply-To: <BYAPR14MB2342C9C346B57B87F44E3200CF530@BYAPR14MB2342.namprd14.prod.outlook.com>

On Sat, Aug 29, 2020 at 12:40:31AM +0000, Kun Zhao wrote:
> Hi Team,
> 
> I’ve been validating OpenBMC on our POC system for a while, but starting 2 weeks ago the BMC filesystem sometimes reports failures, and after that the BMC sometimes hangs after running for a while. It started on one system and then appeared on another. We tried re-flashing with a programmer and still see the issue. We tried flashing back to the very first known-good OpenBMC image we built and still see the same symptoms. It seems like a SPI ROM failure, but when we flash back the POC system’s original 3rd-party BMC, there is no such issue at all. Has anyone met similar issues before?

Yeah, this does look like a bad SPI NOR.  Have you tried flashing a
fresh image onto the NOR and then reading it back to confirm that all
the bits keep their values?  It is possible that the corruption also
hits the third-party BMC's code, just in a less important location.
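
A minimal sketch of that read-back check, assuming the flashed image and
a dump read back from the NOR (for example with the external programmer)
are both available as files.  The 64 KiB block size is an assumption and
should be adjusted to the part's erase-block size; the function name is
illustrative only.

```python
ERASE_BLOCK = 64 * 1024  # assumed erase-block size; adjust for the actual part


def diff_blocks(expected: bytes, actual: bytes, block_size: int = ERASE_BLOCK):
    """Return (offset, differing_byte_count) for every block that differs."""
    if len(expected) != len(actual):
        raise ValueError("image and read-back dump differ in length")
    mismatches = []
    for offset in range(0, len(expected), block_size):
        want = expected[offset:offset + block_size]
        got = actual[offset:offset + block_size]
        if want != got:
            # Count the individual bytes that changed inside this block.
            count = sum(x != y for x, y in zip(want, got))
            mismatches.append((offset, count))
    return mismatches


# Hypothetical usage; the file names are placeholders:
#   from pathlib import Path
#   image = Path("image.bin").read_bytes()
#   dump = Path("readback.bin").read_bytes()
#   for offset, count in diff_blocks(image, dump):
#       print(f"block at {offset:#x}: {count} byte(s) differ")
```

Repeating the read-back a few times, and again after a power cycle, would
show whether the same offsets fail each time or whether the errors move
around.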

> [ 3.372932] jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e0aa4. {1985,e002,0000004a,78280c2e}

I'm surprised to see anyone using jffs2.  Don't we generally use ubifs
in OpenBMC?  Is there a reason you've chosen to use jffs2?

I don't necessarily think jffs2 will be better or worse in this
particular scenario but we've seen lots of upgrade issues over the years
with jffs2.

> BMC debug console shows the same SQUASHFS error as above, by checking filesystem usage we could see rwfs usage keep increasing like this,
> 
> root@dgx:~# df
> Filesystem      1K-blocks   Used  Available  Use%  Mounted on
> dev                212904      0     212904    0%  /dev
> tmpfs              246728  20172     226556    8%  /run
> /dev/mtdblock4      22656  22656          0  100%  /run/initramfs/ro
> /dev/mtdblock5       4096    880       3216   21%  /run/initramfs/rw
> cow                  4096    880       3216   21%  /
> tmpfs              246728      8     246720    0%  /dev/shm
> tmpfs              246728      0     246728    0%  /sys/fs/cgroup
> tmpfs              246728      0     246728    0%  /tmp
> tmpfs              246728      8     246720    0%  /var/volatile
> 
> and can see more and more ipmid coredump files,

This implies to me that we need to adjust the systemd recovery for
ipmid.  We shouldn't just keep re-launching the same process over and
over after a coredump.  systemd can rate-limit restarts of a unit
(StartLimitIntervalSec= and StartLimitBurst=).
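
As a rough illustration of that thresholding, a drop-in along these
lines would stop systemd from restarting the service once it has been
started too many times in a short window.  The unit name
(phosphor-ipmi-host.service), the drop-in path, and all of the values
are assumptions for the sake of the example, not a tested
recommendation.

```ini
# /etc/systemd/system/phosphor-ipmi-host.service.d/restart-limit.conf
# Unit name, path, and values are assumptions for illustration.
[Unit]
# Allow at most 3 starts within any 60-second window; after that the
# unit enters the failed state and is not restarted again.
StartLimitIntervalSec=60
StartLimitBurst=3

[Service]
# Restart only on unclean exits (including crashes), after a short delay.
Restart=on-failure
RestartSec=5
```

Once the limit has been hit, the unit stays failed until someone runs
`systemctl reset-failed` on it or the BMC reboots, which also keeps the
coredump files from piling up in the read-write filesystem.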

> I found the following actions could trigger this failure,
> 
> 
>   1.  do SSH login to BMC debug console remotely, it will show this error when triggered,
> $ ssh root@<bmc ip>
> ssh_exchange_identification: read: Connection reset by peer
> 
> 
>   2.  set BMC MAC address by fw_setenv in BMC debug console, reboot BMC, and do 'ip -a'.

I have no idea why this procedure would solve SPI NOR issues.  It
doesn't seem connected on the surface.

> The code is based on upstream commit 5ddb5fa99ec259 on master branch.
> The flash layout definition is the default openbmc-flash-layout.dtsi.
> The SPI ROM is Macronix MX25L25635F
> 
> Some questions,
> 
>   1.  Any SPI lock feature enabled in OpenBMC?
>   2.  If yes, do I have to unlock u-boot-env partition before fw_setenv?

There is not, to my knowledge, a software SPI lock.  Some machines have
a 'golden' NOR which they protect in hardware by strapping the
write-protect input pin on the SPI NOR (with a resistor).  Does your
machine use this mechanism?  If so, it is possible that you're booting
from the 'wrong' NOR flash in some conditions and a reboot resets the
chip-select logic in the SPI controller.  (Usually, you have the
watchdog configured to automatically swap the chip-select after some
number of boot failures.)

-- 
Patrick Williams
