From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from b.ns.miles-group.at ([95.130.255.144] helo=radon.swed.at)
	by merlin.infradead.org with esmtps (Exim 4.80.1 #2 (Red Hat Linux))
	id 1WAaiP-0006Vh-Bp
	for linux-mtd@lists.infradead.org; Tue, 04 Feb 2014 07:46:42 +0000
Message-ID: <52F09AC9.6090604@nod.at>
Date: Tue, 04 Feb 2014 08:46:17 +0100
From: Richard Weinberger
MIME-Version: 1.0
To: dedekind1@gmail.com
Subject: Re: UBI leb_write_unlock NULL pointer Oops (continuation)
References: <52EF772D.8080207@nod.at> <52EF9FFE.4020405@nod.at> <1391498545.1795.29.camel@sauron.fi.intel.com>
In-Reply-To: <1391498545.1795.29.camel@sauron.fi.intel.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Cc: "Wiedemer, Thorsten \(Lawo AG\)", "linux-mtd@lists.infradead.org"
List-Id: Linux MTD discussion mailing list

On 04.02.2014 08:22, Artem Bityutskiy wrote:
> On Mon, 2014-02-03 at 14:56 +0100, Richard Weinberger wrote:
>> On 03.02.2014 13:51, Wiedemer, Thorsten (Lawo AG) wrote:
>>> Hi,
>>>
>>> I can reproduce it fairly regularly, but not really "quickly". At the moment, I can use a setup of about 70 identical devices.
>>> A test over the last weekend resulted in 6 devices showing the bug.
>>> What we have are multiple processes which write in different intervals some data on the device and sync it, because this data should be available after a power cut.
>>> Perhaps I can force the error more often in writing test processes with shorter write/sync intervals.
>>>
>>> If I have further access to the "big" setup for some days, I will try to make a test without preemption.
>>
>> Hmm, ok.
>> Please also apply this patch, just in case...
>>
>> diff --git a/drivers/mtd/ubi/eba.c b/drivers/mtd/ubi/eba.c
>> index 0e11671d..48fd2aa 100644
>> --- a/drivers/mtd/ubi/eba.c
>> +++ b/drivers/mtd/ubi/eba.c
>> @@ -301,6 +301,7 @@ static void leb_write_unlock(struct ubi_device *ubi, int vol_id, int lnum)
>>
>>  	spin_lock(&ubi->ltree_lock);
>>  	le = ltree_lookup(ubi, vol_id, lnum);
>> +	ubi_assert(le);
>>  	le->users -= 1;
>>  	ubi_assert(le->users >= 0);
>>  	up_write(&le->mutex);
>
> The UBI LEB locking is a bit over-designed, it could be simplified,
> maybe this could help looking for the problem.
>
> This report does really sound like there is something specific to
> Thorsten's system which corrupts memory.

Thorsten sees:
Dec 25 03:59:22 kernel: Unable to handle kernel NULL pointer dereference at virtual address 0000000c
(leb_write_unlock+0x74/0xf0) from [] (ubi_eba_write_leb+0x94/0x820)

In July 2013 we got this report from a user:
[ 300.554525] Unable to handle kernel NULL pointer dereference at virtual address 0000000c
(leb_write_unlock+0xa0/0xf4) from [<802850e0>] (ubi_eba_write_leb+0x568/0x80c)

In both cases we fault at address 0000000c and leb_write_unlock() was called by ubi_eba_write_leb().

The same user also saw the issue in the read path:
[ 38.471134] Unable to handle kernel NULL pointer dereference at virtual address 00000000
(leb_read_unlock+0xa0/0xf4) from [<80285cdc>] (ubi_eba_read_leb+0x404/0x480)

In that case the fault happened at 00000000 directly.
A bit too deterministic for a memory corruption IMHO.

> And it is difficult to debug this via the mailing list. Thorsten should
> start adding various checks like this and try to come closer to the
> root-cause.

Yeah. We also need more oopses, maybe we can spot a pattern.

Thanks,
//richard
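
P.S. A small user-space sketch of why a fault address such as 0000000c is typical of a member access through a NULL struct pointer: the CPU touches base + member offset, so a NULL base gives the bare offset. Everything here is an assumption for illustration. struct demo_entry and member_address() are hypothetical names, and the layout is NOT the real struct ubi_ltree_entry; the real offsets depend on the architecture and kernel configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout, for illustration only. This is not the
 * kernel's struct ubi_ltree_entry. Fixed-width types keep the
 * offsets identical on every platform.
 */
struct demo_entry {
	uint32_t link[3];	/* offset 0x0, 12 bytes of bookkeeping */
	int32_t key;		/* offset 0xc */
	int32_t users;		/* offset 0x10 */
};

/* Compile-time check that the layout matches the comments above. */
static_assert(offsetof(struct demo_entry, key) == 0xc,
	      "key is expected at offset 0xc");
static_assert(offsetof(struct demo_entry, users) == 0x10,
	      "users is expected at offset 0x10");

/*
 * Address that gets accessed when a member located 'offset' bytes
 * into a struct is reached through a pointer whose value is 'base'.
 * With base == 0 (a NULL pointer) the result is the offset itself,
 * which is what the oops reports as the faulting virtual address.
 */
static uintptr_t member_address(uintptr_t base, size_t offset)
{
	return base + (uintptr_t)offset;
}
```

With this made-up layout, member_address(0, offsetof(struct demo_entry, key)) is 0xc, while touching the first member through a NULL pointer gives 0x0. Which member is really involved in the oopses above has to be read from the disassembly of the faulting function, not from this sketch.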