From: Russell King - ARM Linux <linux@arm.linux.org.uk>
To: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: "Willy Tarreau" <w@1wt.eu>, "Andrew Lunn" <andrew@lunn.ch>,
	"Jason Cooper" <jason@lakedaemon.net>,
	netdev@vger.kernel.org, "Ethan Tuttle" <ethan@ethantuttle.com>,
	"Ezequiel Garcia" <ezequiel.garcia@free-electrons.com>,
	"Gregory Clément" <gregory.clement@free-electrons.com>,
	linux-arm-kernel@lists.infradead.org
Subject: Re: mvneta: oops in __rcu_read_lock on mirabox
Date: Mon, 16 Sep 2013 18:14:16 +0100
Message-ID: <20130916171416.GM12758@n2100.arm.linux.org.uk>
In-Reply-To: <20130916182450.639084c6@skate>

On Mon, Sep 16, 2013 at 06:24:50PM +0200, Thomas Petazzoni wrote:
> Could this be caused by bitflips in the RAM due to bad timings, or
> overheating or that kind of things?

Well, the SoC is an Armada 370, which uses Marvell's own Sheeva core.
From what I understand, this is a CPU designed entirely by Marvell, so
the interpretation of these codes may not be correct.  Diagnosis is made
harder because Marvell is so secretive with their documentation; indeed
for this CPU there is no information publicly available (there are only
the product briefs).

Bad timings could certainly cause bitflips, as could poor routing of
data line D8 (e.g. incorrect termination or routing causing reflections
on the data line - remember that with modern hardware, almost every
signal is a transmission line).

Marginal or noisy power supplies could also be a problem - for example,
if the impedance of the power supply connections is too great, it may
work with some patterns of use but not others.

There are so many possibilities...

However, if the fault codes above really do equate to what's in the ARMv7
Architecture Reference Manual, I think we can rule out the routing and
RAM chips - because a cache parity error points to bit flips in the cache,
or if there is no cache parity checking implemented, it means something
is corrupting the state of the SoC - which could be due to bad power
supplies.
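
As a rough illustration of what "equating the fault codes to the ARMv7
Architecture Reference Manual" involves, here is a minimal sketch that
extracts the fault status field from a data fault status register value
and checks it against the architectural parity error encodings.  It
assumes the short-descriptor DFSR layout (FS[4] in bit 10, FS[3:0] in
bits 3:0) and that the core follows the architectural encodings, which
may not hold for Marvell's own design.  The function names are made up
for this example.

```python
# Parity-related fault status (FS) encodings from the ARMv7-A
# short-descriptor DFSR format. Assumption: the core reports faults
# using these architectural encodings.
PARITY_FAULTS = {
    0b11000: "asynchronous parity error on memory access",
    0b11001: "synchronous parity error on memory access",
    0b11100: "synchronous parity error on translation table walk, first level",
    0b11110: "synchronous parity error on translation table walk, second level",
}


def fault_status(dfsr):
    """Return the 5-bit FS field: FS[4] is DFSR bit 10, FS[3:0] is bits 3:0."""
    return (((dfsr >> 10) & 1) << 4) | (dfsr & 0xF)


def describe(dfsr):
    """Name the parity fault, or report the raw FS value if it is not one."""
    fs = fault_status(dfsr)
    return PARITY_FAULTS.get(fs, "not a parity error (FS=0b{:05b})".format(fs))
```

For example, describe(0x409) has bit 10 set and a low nibble of 0b1001,
so FS is 0b11001 and it is reported as a synchronous parity error on
memory access.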

How do we get to the bottom of this?  That's a very good question - one
which is going to be very difficult to answer.  Ideally, it means working
with the manufacturer's design team to try and work out what's going on
at the board level, probably using logic analysers to capture the bus
activity leading up to the failure.  It also means checking the power
supplies at the SoC - both that they're within the correct tolerance and
how much noise there is on them.

I think all we can do at the moment is to wait for further reports to roll
in and see whether a better pattern emerges.

If you want to try something, and you suspect it may be heat related,
you could try putting the board inside a container, monitoring the
temperature inside the container, and putting it in your freezer!  Just be
careful that the other devices on the board don't get too cold though -
remember, most consumer electronics are only rated for an *operating*
temperature range of 0°C to 70°C and your freezer will be something like
-20°C - so don't let the ambient temperature inside the container go
below 0°C!  If the CPU is producing lots of heat though, it may keep the
container warm enough that this isn't a problem.  The theory is
that by making the ambient 15 to 20°C cooler, you will also lower the
temperature of the hotter parts by a similar amount.
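
The freezer check can be sketched in a few lines.  This is only an
illustration: the 0°C to 70°C limits are the ones quoted above, the
3°C safety margin and the function names are assumptions made up for
the example, and the part-temperature estimate simply applies the
"similar amount" theory as a one-for-one shift.

```python
OPERATING_MIN_C = 0.0   # lower operating limit quoted in the message
OPERATING_MAX_C = 70.0  # upper operating limit quoted in the message


def ambient_status(ambient_c, margin_c=3.0):
    """Classify a container reading; margin_c is a hypothetical safety margin."""
    if ambient_c < OPERATING_MIN_C or ambient_c > OPERATING_MAX_C:
        return "out of range"
    if ambient_c < OPERATING_MIN_C + margin_c:
        return "near lower limit"
    return "ok"


def estimated_part_temp(part_c, ambient_before_c, ambient_now_c):
    """Assume a part's temperature drops by the same amount as the ambient."""
    return part_c - (ambient_before_c - ambient_now_c)
```

With a part that ran at 80°C in a 25°C room, estimated_part_temp(80, 25, 5)
gives 60°C once the container has settled at 5°C, while ambient_status(-5)
flags that a reading of -5°C is outside the rated range.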
