* UBI leb_write_unlock NULL pointer Oops (continuation) on ARM926
[not found] ` <1391500492.1795.36.camel@sauron.fi.intel.com>
@ 2014-02-04 15:45 ` Bill Pringlemeir
2014-02-04 17:05 ` Bill Pringlemeir
0 siblings, 1 reply; 4+ messages in thread
From: Bill Pringlemeir @ 2014-02-04 15:45 UTC (permalink / raw)
To: linux-arm-kernel
>>>> Am 03.02.2014 13:51, schrieb Wiedemer, Thorsten (Lawo AG):
>>>>> I can reproduce it fairly regularly, but not really "quickly". At
>>>>> the moment, I can use a setup of about identical 70 devices. A
>>>>> test over the last weekend resultet In 6 devices showing the bug.
>>>>> What we have are multiple processes which write in different
>>>>> intervals some data on the device and sync it, because this data
>>>>> should be available after a power cut. Perhaps I can force the
>>>>> error more often in writing test processes with shorter write/sync
>>>>> intervals.
>>>>>
>>>>> If I have further access to the "big" setup for some days, I will
>>>>> try to make a test without preemption.
>>> On Mon, 2014-02-03 at 14:56 +0100, Richard Weinberger wrote:
>>>> Hmm, ok.
>>>> Please also apply this patch, just in case...
I don't think this patch will help.
On 4 Feb 2014, dedekind1 at gmail.com wrote:
> May be. Although sometimes corruptions are also deterministics - a
> buffer over-run at the same place causes the same side effects etc.
> But in any case, the only way I know to deal with this issues is start
> putting various prints and assertions, and trying to come closer to the
> root-cause. Sometimes bisecting helps, but this case would be difficult
> to bisect because the reproducability is hard. Indeed, one may think
> that there is no failure duding a day, so the commit as 'good' while it
> may be actually 'bad', the bug just happen to not manifest itself
> quickly enough.
I have seen the same crash on a 2.6.36 system with all of the UbiFs/Ubi
backported. It is also an IMX25 based system.
We have,
PC is at __up_write+0x3c/0x1a8
LR is at __up_write+0x24/0x1a8
pc : [<c0229400>] lr : [<c02293e8>] psr: a0000093
sp : c7225cc8 ip : 00020000 fp : c79fba00
r10: 00000523 r9 : 00000001 r8 : c7b33000
r7 : c793a800 r6 : c7bd473c r5 : c7bd4738 r4 : c7bd4720
r3 : 00000000 r2 : 00000002 r1 : 00000001 r0 : 00000002
Flags: NzCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment user
Code: e4863004 e5953004 e1560003 0a00002a (e593200c)
I run this,
$ printf "\x04\x30\x86\xe4"
"\x04\x30\x95\xe5"
"\x03\x00\x56\xe1"
"\x2a\x00\x00\x0a"
"\x0c\x20\x93\xe5" > crash.dump
$ objdump --disassemble-all -b binary -m arm crash.dump
crash.dump: file format binary
Disassembly of section .data:
00000000 <.data>:
0: e4863004 str r3, [r6], #4
4: e5953004 ldr r3, [r5, #4]
8: e1560003 cmp r6, r3
c: 0a00002a beq 0xbc
10: e593200c ldr r2, [r3, #12]
The values 'r6' and 'r5' are pointers and they are far from non-NULL and
look like good kernel data pointers. Something in the list is 'NULL'.
There is a load of 'r3' as NULL and then the use of '[r3, #12]' which
gives the crash address of '0xc'.
Here is the objdump with source interspersed for my build,
sem->activity = 0;
350: e3a0a000 mov sl, #0
354: e1a05000 mov r5, r0
358: e485a004 str sl, [r5], #4
* list_empty - tests whether a list is empty
* @head: the list to test.
*/
static inline int list_empty(const struct list_head *head)
{
return head->next == head;
35c: e5903004 ldr r3, [r0, #4]
if (!list_empty(&sem->wait_list))
360: e1550003 cmp r5, r3
364: 0a00002b beq 418 <__up_write+0xfc>
/* if we are allowed to wake writers try to grant a single write lock
* if there's a writer at the front of the queue
* - we leave the 'waiting count' incremented to signify potential
* contention
*/
if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
368: e593200c ldr r2, [r3, #12]
36c: e2124002 ands r4, r2, #2
370: 0a000006 beq 390 <__up_write+0x74>
374: ea000034 b 44c <__up_write+0x130>
The compiler picks different registers,
r3/sl+r3, r5/r0, r6/r5 but the code is the same.
The 'rw_semaphore' is
struct rw_semaphore {
__s32 activity;
struct list_head wait_list;
};
So, the 'wait_list' is non-NULL, the rw_semaphore is non-NULL, but
'wait_list->next' is NULL. This seems to be very consistent with this
'oops'.
It seems that the "ltree_lock" doesn't protect the use of the
ltree_lookup() versus insertions and deletions? Ie, ltree_lookup() may
return non-NULL, but an interrupt/pagefault before a 'le->users +/- =
1;' may mean the node is released? On a UP system, does 'spin_lock()'
actually do anything? The rw_semaphore uses spin_lock_irqsave().
However, that doesn't make sense as I think this occurs mainly on a
ARM926 system.
The ARM926 systems do not have proper 'lock free' idioms like
'ldrex/strex' and they try to do atomic operations by locking
interrupts. I think that UbiFs/UBI maybe called on a 'data fault' or
'program fault' (in user space) when memory pressure is present. I have
seen this occur in some sound drivers where the data source is coming
from disk (or maybe the driver uses vmalloc() or something). So I think
on occasion, the ltree_lookup() may not work or there is something weird
with the atomic primatives and data/page faults.
Fwiw,
Bill Pringlemeir.
^ permalink raw reply [flat|nested] 4+ messages in thread
* UBI leb_write_unlock NULL pointer Oops (continuation) on ARM926
2014-02-04 15:45 ` UBI leb_write_unlock NULL pointer Oops (continuation) on ARM926 Bill Pringlemeir
@ 2014-02-04 17:05 ` Bill Pringlemeir
2014-02-04 19:57 ` Bill Pringlemeir
0 siblings, 1 reply; 4+ messages in thread
From: Bill Pringlemeir @ 2014-02-04 17:05 UTC (permalink / raw)
To: linux-arm-kernel
On 4 Feb 2014, bpringlemeir at nbsps.com wrote:
> The ARM926 systems do not have proper 'lock free' idioms like
> 'ldrex/strex' and they try to do atomic operations by locking
> interrupts. I think that UbiFs/UBI maybe called on a 'data fault' or
> 'program fault' (in user space) when memory pressure is present. I have
> seen this occur in some sound drivers where the data source is coming
> from disk (or maybe the driver uses vmalloc() or something). So I think
> on occasion, the ltree_lookup() may not work or there is something weird
> with the atomic primatives and data/page faults.
https://www.google.ca/#q=site:infradead.org+leb_write_unlock+oops
http://lists.infradead.org/pipermail/linux-mtd/2013-May/046907.html
at91sam9g20 - arm926, different MTD driver. Linux 3.6.9
Code: e5903004 e58d2004 e1560003 0a00002a (e593200c)
0: e5903004 ldr r3, [r0, #4]
4: e58d2004 str r2, [sp, #4]
8: e1560003 cmp r6, r3
c: 0a00002a beq 0xbc
10: e593200c ldr r2, [r3, #12]
The code sequence looks identical and the Oops trace, etc is the same.
People from Pengutronix also indicated seeing the same type of Opps; I
think they deal with the IMX, but maybe this was on another board.
Regards,
Bill Pringlemeir.
^ permalink raw reply [flat|nested] 4+ messages in thread
* UBI leb_write_unlock NULL pointer Oops (continuation) on ARM926
2014-02-04 17:05 ` Bill Pringlemeir
@ 2014-02-04 19:57 ` Bill Pringlemeir
2014-02-04 20:07 ` Richard Weinberger
0 siblings, 1 reply; 4+ messages in thread
From: Bill Pringlemeir @ 2014-02-04 19:57 UTC (permalink / raw)
To: linux-arm-kernel
On 4 Feb 2014, bpringlemeir at nbsps.com wrote:
> http://lists.infradead.org/pipermail/linux-mtd/2013-May/046907.html
>
> at91sam9g20 - arm926, different MTD driver. Linux 3.6.9
>
> Code: e5903004 e58d2004 e1560003 0a00002a (e593200c)
>
> 0: e5903004 ldr r3, [r0, #4]
> 4: e58d2004 str r2, [sp, #4]
> 8: e1560003 cmp r6, r3
> c: 0a00002a beq 0xbc
> 10: e593200c ldr r2, [r3, #12]
>
> The code sequence looks identical and the Oops trace, etc is the same.
> People from Pengutronix also indicated seeing the same type of Opps; I
> think they deal with the IMX, but maybe this was on another board.
>>>> schrieb Wiedemer, Thorsten (Lawo AG):
> Ehmm, OK, OK, even with the changes in kernel, ubi_assert() in
> leb_write_unlock() wouldn't have triggered ...
Another up_read() crash,
http://lists.infradead.org/pipermail/linux-mtd/2013-July/047512.html
Code: e1530001 0a000016 e3e01000 e5801000 (e8930003)
00000000 <.data>:
0: e1530001 cmp r3, r1
4: 0a000016 beq 0x64
8: e3e01000 mvn r1, #0
c: e5801000 str r1, [r0]
10: e8930003 ldm r3, {r0, r1}
Thorsten's Oops,
Code: e3e02000 e5842000 e59fc084 e59f0084 (e8930006)
00000000 <.data>:
0: e3e02000 mvn r2, #0
4: e5842000 str r2, [r4]
8: e59fc084 ldr ip, [pc, #132] ; 0x94
c: e59f0084 ldr r0, [pc, #132] ; 0x98
10: e8930006 ldm r3, {r1, r2}
The registers are different, but the instruction sequence is similar.
In my ARM926 build, the __up_read() is,
static inline int list_empty(const struct list_head *head)
{
return head->next == head;
250: e1a01000 mov r1, r0
254: e5b12004 ldr r2, [r1, #4]!
258: e1520001 cmp r2, r1
25c: 0a000017 beq 2c0 <__up_read+0xb0>
__rwsem_wake_one_writer(struct rw_semaphore *sem)
{
struct rwsem_waiter *waiter;
struct task_struct *tsk;
sem->activity = -1;
260: e3e01000 mvn r1, #0
264: e5801000 str r1, [r0]
* in an undefined state.
*/
#ifndef CONFIG_DEBUG_LIST
static inline void list_del(struct list_head *entry)
{
__list_del(entry->prev, entry->next);
268: e8920003 ldm r2, {r0, r1}
* This is only for internal list manipulation where we know
* the prev/next entries already!
*/
static inline void __list_del(struct list_head * prev, struct list_head * next)
{
next->prev = prev;
26c: e5801004 str r1, [r0, #4]
prev->next = next;
270: e5810000 str r0, [r1]
This is the same symptom,
__rwsem_wake_one_writer(struct rw_semaphore *sem)
{
...
waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
list_del(&waiter->list);
The sem->wait_list is non-NULL, but the 'sem->wait_list.next' is NULL. I
would suggest you try with 'DEBUG_LOCK_ALLOC' or something like this.
The crash points are not the failure, it is when we insert a
rw_semaphore of 'NULL' or use some memory that is already freed.
Fwiw,
Bill Pringlemeir.
^ permalink raw reply [flat|nested] 4+ messages in thread
* UBI leb_write_unlock NULL pointer Oops (continuation) on ARM926
2014-02-04 19:57 ` Bill Pringlemeir
@ 2014-02-04 20:07 ` Richard Weinberger
0 siblings, 0 replies; 4+ messages in thread
From: Richard Weinberger @ 2014-02-04 20:07 UTC (permalink / raw)
To: linux-arm-kernel
Am 04.02.2014 20:57, schrieb Bill Pringlemeir:
> On 4 Feb 2014, bpringlemeir at nbsps.com wrote:
>
>> http://lists.infradead.org/pipermail/linux-mtd/2013-May/046907.html
>>
>> at91sam9g20 - arm926, different MTD driver. Linux 3.6.9
>>
>> Code: e5903004 e58d2004 e1560003 0a00002a (e593200c)
>>
>> 0: e5903004 ldr r3, [r0, #4]
>> 4: e58d2004 str r2, [sp, #4]
>> 8: e1560003 cmp r6, r3
>> c: 0a00002a beq 0xbc
>> 10: e593200c ldr r2, [r3, #12]
>>
>> The code sequence looks identical and the Oops trace, etc is the same.
>> People from Pengutronix also indicated seeing the same type of Opps; I
>> think they deal with the IMX, but maybe this was on another board.
>
>>>>> schrieb Wiedemer, Thorsten (Lawo AG):
>
>> Ehmm, OK, OK, even with the changes in kernel, ubi_assert() in
>> leb_write_unlock() wouldn't have triggered ...
>
> Another up_read() crash,
>
> http://lists.infradead.org/pipermail/linux-mtd/2013-July/047512.html
>
> Code: e1530001 0a000016 e3e01000 e5801000 (e8930003)
>
> 00000000 <.data>:
> 0: e1530001 cmp r3, r1
> 4: 0a000016 beq 0x64
> 8: e3e01000 mvn r1, #0
> c: e5801000 str r1, [r0]
> 10: e8930003 ldm r3, {r0, r1}
>
> Thorsten's Oops,
>
> Code: e3e02000 e5842000 e59fc084 e59f0084 (e8930006)
>
> 00000000 <.data>:
> 0: e3e02000 mvn r2, #0
> 4: e5842000 str r2, [r4]
> 8: e59fc084 ldr ip, [pc, #132] ; 0x94
> c: e59f0084 ldr r0, [pc, #132] ; 0x98
> 10: e8930006 ldm r3, {r1, r2}
>
> The registers are different, but the instruction sequence is similar.
> In my ARM926 build, the __up_read() is,
>
> static inline int list_empty(const struct list_head *head)
> {
> return head->next == head;
> 250: e1a01000 mov r1, r0
> 254: e5b12004 ldr r2, [r1, #4]!
> 258: e1520001 cmp r2, r1
> 25c: 0a000017 beq 2c0 <__up_read+0xb0>
> __rwsem_wake_one_writer(struct rw_semaphore *sem)
> {
> struct rwsem_waiter *waiter;
> struct task_struct *tsk;
>
> sem->activity = -1;
> 260: e3e01000 mvn r1, #0
> 264: e5801000 str r1, [r0]
> * in an undefined state.
> */
> #ifndef CONFIG_DEBUG_LIST
> static inline void list_del(struct list_head *entry)
> {
> __list_del(entry->prev, entry->next);
> 268: e8920003 ldm r2, {r0, r1}
> * This is only for internal list manipulation where we know
> * the prev/next entries already!
> */
> static inline void __list_del(struct list_head * prev, struct list_head * next)
> {
> next->prev = prev;
> 26c: e5801004 str r1, [r0, #4]
> prev->next = next;
> 270: e5810000 str r0, [r1]
>
>
> This is the same symptom,
>
> __rwsem_wake_one_writer(struct rw_semaphore *sem)
> {
> ...
> waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
> list_del(&waiter->list);
>
> The sem->wait_list is non-NULL, but the 'sem->wait_list.next' is NULL. I
> would suggest you try with 'DEBUG_LOCK_ALLOC' or something like this.
> The crash points are not the failure, it is when we insert a
> rw_semaphore of 'NULL' or use some memory that is already freed.
CONFIG_DEBUG_LIST please.
Thanks,
//richard
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2014-02-04 20:07 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <D7B1B5F4F3F27A4CB073BF422331203F2A18997F1F@Exchange1.lawo.de>
[not found] ` <CAFLxGvya5WXoKcYmOgeM_SmVVEht1jEzeLG9vHhwFudFU+Ny8A@mail.gmail.com>
[not found] ` <D7B1B5F4F3F27A4CB073BF422331203F2A18997F8B@Exchange1.lawo.de>
[not found] ` <52EF772D.8080207@nod.at>
[not found] ` <D7B1B5F4F3F27A4CB073BF422331203F2A18DD7989@Exchange1.lawo.de>
[not found] ` <52EF9FFE.4020405@nod.at>
[not found] ` <1391498545.1795.29.camel@sauron.fi.intel.com>
[not found] ` <52F09AC9.6090604@nod.at>
[not found] ` <1391500492.1795.36.camel@sauron.fi.intel.com>
2014-02-04 15:45 ` UBI leb_write_unlock NULL pointer Oops (continuation) on ARM926 Bill Pringlemeir
2014-02-04 17:05 ` Bill Pringlemeir
2014-02-04 19:57 ` Bill Pringlemeir
2014-02-04 20:07 ` Richard Weinberger
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).