linux-ext4.vger.kernel.org archive mirror
* ext4 crash on 2.6.37: NULL ptr in ext4_discard_preallocations
@ 2011-02-19 22:37 Mark Lord
  2011-02-20  0:05 ` Ted Ts'o
  0 siblings, 1 reply; 7+ messages in thread
From: Mark Lord @ 2011-02-19 22:37 UTC (permalink / raw)
  To: Linux Kernel, linux-ext4, Theodore Ts'o

32-bit x86 system, 2.6.37 SMP kernel, Core2duo, 3.3GB RAM, no swap.

The system just suddenly switched to fbconsole and dumped a traceback.
Here's the screen-shot photo:  http://rtr.ca/ext4_crash.jpg

Is this a known bug that got fixed in 2.6.37.1 ?

Thanks
-ml


* Re: ext4 crash on 2.6.37: NULL ptr in ext4_discard_preallocations
  2011-02-19 22:37 ext4 crash on 2.6.37: NULL ptr in ext4_discard_preallocations Mark Lord
@ 2011-02-20  0:05 ` Ted Ts'o
  2011-02-20  4:54   ` Mark Lord
  0 siblings, 1 reply; 7+ messages in thread
From: Ted Ts'o @ 2011-02-20  0:05 UTC (permalink / raw)
  To: Mark Lord; +Cc: Linux Kernel, linux-ext4

On Sat, Feb 19, 2011 at 05:37:20PM -0500, Mark Lord wrote:
> 32-bit x86 system, 2.6.37 SMP kernel, Core2duo, 3.3GB RAM, no swap.
> 
> The system just suddenly switched to fbconsole and dumped a traceback.
> Here's the screen-shot photo:  http://rtr.ca/ext4_crash.jpg
> 
> Is this a known bug that got fixed in 2.6.37.1 ?

No, this looks like a new one.

And I can't make the Code: line make sense.  Can you send me the
fs/ext4/mballoc.s file after running the command "make
fs/ext4/mballoc.s" in your build tree where you built this kernel?

Thanks!!

						- Ted


* Re: ext4 crash on 2.6.37: NULL ptr in ext4_discard_preallocations
  2011-02-20  0:05 ` Ted Ts'o
@ 2011-02-20  4:54   ` Mark Lord
  2011-02-20  5:05     ` Mark Lord
  0 siblings, 1 reply; 7+ messages in thread
From: Mark Lord @ 2011-02-20  4:54 UTC (permalink / raw)
  To: Ted Ts'o, Linux Kernel, linux-ext4

On 11-02-19 07:05 PM, Ted Ts'o wrote:
> On Sat, Feb 19, 2011 at 05:37:20PM -0500, Mark Lord wrote:
>> 32-bit x86 system, 2.6.37 SMP kernel, Core2duo, 3.3GB RAM, no swap.
>>
>> The system just suddenly switched to fbconsole and dumped a traceback.
>> Here's the screen-shot photo:  http://rtr.ca/ext4_crash.jpg
>>
>> Is this a known bug that got fixed in 2.6.37.1 ?
> 
> No, this looks like a new one.
> 
> And I can't make the Code: line make sense.  Can you send me the
> fs/ext4/mballoc.s file after running the command "make
> fs/ext4/mballoc.s" in your build tree where you built this kernel?

Sent.  And here's an extract:

.globl ext4_discard_preallocations
        .type   ext4_discard_preallocations, @function
ext4_discard_preallocations:
        pushl   %ebp    #
        pushl   %edi    #
        leal    -136(%eax), %edi        #, ei
        pushl   %esi    #
        pushl   %ebx    #
        subl    $80, %esp       #,
        movl    172(%eax), %esi # <variable>.i_sb, sb
        movl    $0, 76(%esp)    #, group
        movzwl  122(%eax), %edx # <variable>.i_mode, tmp85
        andl    $61440, %edx    #, tmp85
        cmpl    $32768, %edx    #, tmp85
        jne     .L875   #,
        leal    68(%esp), %edx  #, tmp86
        leal    380(%eax), %ebx #, D.45176
        movl    %edx, 68(%esp)  # tmp86, list.next
        addl    $372, %eax      #,
        movl    %edx, 72(%esp)  # tmp86, list.prev
        movl    %eax, 28(%esp)  #, %sfp
.L876:
        movl    %ebx, %eax      # D.45176,
        call    _raw_spin_lock  #
        jmp     .L861   #
.L867:
        cmpl    %ebx, 60(%ebp)  # D.45176, <variable>.pa_obj_lock
        je      .L862   #,
#APP
# 3810 "fs/ext4/mballoc.c" 1
        1:      ud2

I wonder if the 003c offset is that "cmpl %ebx, 60(%ebp)" line?
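
For anyone decoding the extract above: the "andl $61440" / "cmpl $32768"
pair looks like the compiled form of an S_ISREG()-style test on i_mode,
since 61440 is 0170000 (the file-type mask) and 32768 is 0100000 (regular
file) in Linux's mode encoding. A minimal sketch of that check follows. The
SKETCH_* and sketch_* names are hypothetical, chosen so they do not clash
with the system macros, and this is an illustration rather than the kernel
source.

```c
#include <assert.h>
#include <stdint.h>

/* File-type bits as Linux encodes them in i_mode, written in octal.
 * Hypothetical names; the real macros are S_IFMT and S_IFREG. */
#define SKETCH_S_IFMT  0170000u  /* 61440 decimal: file-type mask */
#define SKETCH_S_IFREG 0100000u  /* 32768 decimal: regular file   */

/* Mirrors "andl $61440 / cmpl $32768": nonzero for a regular file. */
static int sketch_is_regular(uint16_t i_mode)
{
    return (i_mode & SKETCH_S_IFMT) == SKETCH_S_IFREG;
}

/* Usage: a regular file passes, a directory does not. */
static void sketch_mode_self_check(void)
{
    assert(sketch_is_regular(0100644));   /* -rw-r--r-- regular file */
    assert(!sketch_is_regular(0040755));  /* drwxr-xr-x directory    */
}
```

If that reading is right, the "jne .L875" simply skips the rest of the
function for anything that is not a regular file.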



* Re: ext4 crash on 2.6.37: NULL ptr in ext4_discard_preallocations
  2011-02-20  4:54   ` Mark Lord
@ 2011-02-20  5:05     ` Mark Lord
  2011-02-20  6:15       ` Ted Ts'o
  0 siblings, 1 reply; 7+ messages in thread
From: Mark Lord @ 2011-02-20  5:05 UTC (permalink / raw)
  To: Ted Ts'o, Linux Kernel, linux-ext4

On 11-02-19 11:54 PM, Mark Lord wrote:
> On 11-02-19 07:05 PM, Ted Ts'o wrote:
>> On Sat, Feb 19, 2011 at 05:37:20PM -0500, Mark Lord wrote:
>>> 32-bit x86 system, 2.6.37 SMP kernel, Core2duo, 3.3GB RAM, no swap.
>>>
>>> The system just suddenly switched to fbconsole and dumped a traceback.
>>> Here's the screen-shot photo:  http://rtr.ca/ext4_crash.jpg
>>>
>>> Is this a known bug that got fixed in 2.6.37.1 ?
>>
>> No, this looks like a new one.
>>
>> And I can't make the Code: line make sense.  Can you send me the
>> fs/ext4/mballoc.s file after running the command "make
>> fs/ext4/mballoc.s" in your build tree where you built this kernel?
> 
> Sent.  And here's an extract:
> 
> .globl ext4_discard_preallocations
>         .type   ext4_discard_preallocations, @function
> ext4_discard_preallocations:
>         pushl   %ebp    #
>         pushl   %edi    #
>         leal    -136(%eax), %edi        #, ei
>         pushl   %esi    #
>         pushl   %ebx    #
>         subl    $80, %esp       #,
>         movl    172(%eax), %esi # <variable>.i_sb, sb
>         movl    $0, 76(%esp)    #, group
>         movzwl  122(%eax), %edx # <variable>.i_mode, tmp85
>         andl    $61440, %edx    #, tmp85
>         cmpl    $32768, %edx    #, tmp85
>         jne     .L875   #,
>         leal    68(%esp), %edx  #, tmp86
>         leal    380(%eax), %ebx #, D.45176
>         movl    %edx, 68(%esp)  # tmp86, list.next
>         addl    $372, %eax      #,
>         movl    %edx, 72(%esp)  # tmp86, list.prev
>         movl    %eax, 28(%esp)  #, %sfp
> .L876:
>         movl    %ebx, %eax      # D.45176,
>         call    _raw_spin_lock  #
>         jmp     .L861   #
> .L867:
>         cmpl    %ebx, 60(%ebp)  # D.45176, <variable>.pa_obj_lock
>         je      .L862   #,
> #APP
> # 3810 "fs/ext4/mballoc.c" 1
>         1:      ud2
> 
> I wonder if the 003c offset is that "cmpl %ebx, 60(%ebp)" line?

I suppose it must be, as there's no other 0x3c offset in that function.
Which means it's probably this line that's crashing:

             BUG_ON(pa->pa_obj_lock != &ei->i_prealloc_lock);

...which could only happen if "pa" was NULL there.
I wonder how that happened ?
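
The reasoning above is plain address arithmetic: "cmpl %ebx, 60(%ebp)"
reads the word at %ebp + 60, so a fault address of 0x3c is what a NULL base
pointer would produce. A small sketch of that arithmetic follows. The struct
layout is hypothetical, padded so the lock pointer lands at byte 60 as in
the assembly; it is not the real struct ext4_prealloc_space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only. The field is a uint32_t
 * because pointers are 32 bits wide on the x86 system in this thread. */
struct sketch_pa {
    unsigned char other_fields[60];
    uint32_t pa_obj_lock;
};

/* Address the CPU would touch for pa->pa_obj_lock, computed with integer
 * arithmetic so that nothing is actually dereferenced. */
static uintptr_t sketch_field_address(uintptr_t pa_base)
{
    return pa_base + offsetof(struct sketch_pa, pa_obj_lock);
}

/* Usage: a NULL base gives exactly the member offset, 60 == 0x3c. */
static void sketch_offset_self_check(void)
{
    assert(offsetof(struct sketch_pa, pa_obj_lock) == 60);
    assert(sketch_field_address(0) == 0x3c);
}
```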


* Re: ext4 crash on 2.6.37: NULL ptr in ext4_discard_preallocations
  2011-02-20  5:05     ` Mark Lord
@ 2011-02-20  6:15       ` Ted Ts'o
  2011-02-20 13:55         ` Mark Lord
  0 siblings, 1 reply; 7+ messages in thread
From: Ted Ts'o @ 2011-02-20  6:15 UTC (permalink / raw)
  To: Mark Lord; +Cc: Linux Kernel, linux-ext4

On Sun, Feb 20, 2011 at 12:05:27AM -0500, Mark Lord wrote:
> I suppose it must be, as there's no other 0x3c offset in that function.
> Which means it's probably this line that's crashing:
> 
>              BUG_ON(pa->pa_obj_lock != &ei->i_prealloc_lock);
> 
> ...which could only happen if "pa" was NULL there.
> I wonder how that happened ?

Which could only happen if ei->i_prealloc_list were not properly
initialized (i.e., it was still NULL).  Which shouldn't ever
happen, since all ext4_inodes are initialized in
ext4_alloc_inode().

Hmm, can you replicate the crash?

					- Ted
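
To illustrate the point about initialization: an initialized kernel list
head points back at itself, so walking it finds no entries, whereas a head
that was only zeroed has next == NULL and a walk would start from a NULL
node. The sketch below uses a stand-in for struct list_head with
hypothetical sketch_* names; it models the idea only, not the ext4 code
path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct sketch_list_head {
    struct sketch_list_head *next, *prev;
};

/* Same idea as INIT_LIST_HEAD(): an empty list points at itself. */
static void sketch_init_list_head(struct sketch_list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Same idea as list_empty(): empty means next loops back to the head. */
static int sketch_list_empty(const struct sketch_list_head *head)
{
    return head->next == head;
}

/* Returns 1 if a walk would start from a NULL node, i.e. the head was
 * zeroed but never initialized. */
static int sketch_walk_starts_at_null(const struct sketch_list_head *head)
{
    return !sketch_list_empty(head) && head->next == NULL;
}

/* Usage: compare an initialized head with a merely zeroed one. */
static void sketch_list_self_check(void)
{
    struct sketch_list_head good, bad;

    sketch_init_list_head(&good);
    memset(&bad, 0, sizeof(bad));

    assert(sketch_list_empty(&good));
    assert(!sketch_walk_starts_at_null(&good));
    assert(sketch_walk_starts_at_null(&bad));
}
```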


* Re: ext4 crash on 2.6.37: NULL ptr in ext4_discard_preallocations
  2011-02-20  6:15       ` Ted Ts'o
@ 2011-02-20 13:55         ` Mark Lord
  2011-02-20 14:39           ` Mark Lord
  0 siblings, 1 reply; 7+ messages in thread
From: Mark Lord @ 2011-02-20 13:55 UTC (permalink / raw)
  To: Ted Ts'o, Linux Kernel, linux-ext4

On 11-02-20 01:15 AM, Ted Ts'o wrote:
> On Sun, Feb 20, 2011 at 12:05:27AM -0500, Mark Lord wrote:
>> I suppose it must be, as there's no other 0x3c offset in that function.
>> Which means it's probably this line that's crashing:
>>
>>              BUG_ON(pa->pa_obj_lock != &ei->i_prealloc_lock);
>>
>> ...which could only happen if "pa" was NULL there.
>> I wonder how that happened ?
> 
> Which could only happen if ei->i_prealloc_list were not properly
> initialized (i.e., it was still NULL).  Which shouldn't ever
> happen, since all ext4_inodes are initialized in
> ext4_alloc_inode().
> 
> Hmm, can you replicate the crash?

So far it has been a one-time deal here,
but stuff like this is pretty serious nonetheless.

I suppose it could also happen if another thread did a list-delete
at the same time as that function was running.  Which would require
that there be a locking bug/confusion somewhere.

Looking over the code, most places use RCU to protect accesses,
except for the fragment that crashed.  That's probably just fine,
but something to reexamine just out of paranoia.

Also, the spinlock pointer appears to be dynamic, one of two
possible spinlocks.  Maybe something got confused there
(well, obviously *something* got confused, so..).

Tough nut to crack, but if I saw strangeness like this again
I'd get really concerned about the state of our top grade filesystems
(had an XFS crash recently on a totally different machine).

I'll poke a bit more, looking specifically at recent ext4 changes.

Thanks Ted.


* Re: ext4 crash on 2.6.37: NULL ptr in ext4_discard_preallocations
  2011-02-20 13:55         ` Mark Lord
@ 2011-02-20 14:39           ` Mark Lord
  0 siblings, 0 replies; 7+ messages in thread
From: Mark Lord @ 2011-02-20 14:39 UTC (permalink / raw)
  To: Ted Ts'o, Linux Kernel, linux-ext4

On 11-02-20 08:55 AM, Mark Lord wrote:
> On 11-02-20 01:15 AM, Ted Ts'o wrote:
>> On Sun, Feb 20, 2011 at 12:05:27AM -0500, Mark Lord wrote:
>>> I suppose it must be, as there's no other 0x3c offset in that function.
>>> Which means it's probably this line that's crashing:
>>>
>>>              BUG_ON(pa->pa_obj_lock != &ei->i_prealloc_lock);
>>>
>>> ...which could only happen if "pa" was NULL there.
>>> I wonder how that happened ?
>>
>> Which could only happen if ei->i_prealloc_list were not properly
>> initialized (i.e., it was still NULL).  Which shouldn't ever
>> happen, since all ext4_inodes are initialized in
>> ext4_alloc_inode().
>>
>> Hmm, can you replicate the crash?
> 
> So far it has been a one time deal here,
> but stuff like this is pretty serious nonetheless.
> 
> I suppose it could also happen if another thread did a list-delete
> at the same time as that function was running.  Which would require
> that there be a locking bug/confusion somewhere.
> 
> Looking over the code, most places use rcu to protect accesses,
> except for the fragment that crashed.  That's probably just fine,
> but something to reexamine just out of paranoia.
> 
> Also, the spinlock pointer appears to be dynamic, one of two
> possible spinlocks.  Maybe something got confused there
> (well, obviously *something* got confused, so..).

That looks like the best candidate:  perhaps pa->pa_obj_lock was
one of the per-cpu lg_prealloc_lock's at that point in time.
In which case an item could be deleted from the pa list
concurrently with the function that actually crashed?

That's as far as I can get with it in the time available.
You folks do know this code much better, so perhaps just expend
a few little grey cells on that theory before calling it quits?

Cheers!

