BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded

public inbox for linux-media@vger.kernel.org
 help / color / mirror / Atom feed

* BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded
@ 2011-03-04  2:06 Andy Walls
  2011-03-04 15:50 ` Devin Heitmueller
  2011-03-05 21:59 ` Andy Walls
  0 siblings, 2 replies; 12+ messages in thread
From: Andy Walls @ 2011-03-04  2:06 UTC (permalink / raw)
  To: linux-kernel, linux-media; +Cc: Devin Heitmueller

Hi,

I got a BUG when loading the cx18.ko module (which in turn requests the
cx18-alsa.ko module) on a kernel built from this repository

	http://git.linuxtv.org/media_tree.git staging/for_v2.6.39

which I beleive is based on 2.6.38-rc2.

The BUG is mmap related and I'm almost certain it has to do with
userspace accessing cx18-alsa.ko ALSA device nodes, since cx18.ko
doesn't provide any mmap() related file ops.

So here is my transcription of a fuzzy digital photo of the screen:

kernel BUG at /home/andy/cx18dev/git/media_tree/mm/mmap.c:2309!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/module/snd_pcm/initstate
Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple ...
...
Pid: 2580, comm: udevd Not tainted 2.6.38-rc2-cx18-vb2-proto+
RIP: 0010:[<ffffffff810eb50b>]  [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0020000000000000
RDX: 0000000000160011 RSI: ffffea____c42___ RDI: 0000000000000202
RBP: ffff____18c1f_58 R08: ffff____________ R09: 0000000000000004
R10: ffff_______bb_38 R11: 0000000000000000 R12: ffff____344a6680
R13: 00007fff22______ R14: ffff____________ R15: 0000000000000001
...
CR2: 0000000000000000 ...
....
Process udevd (pid: 25__, threadinfo ffff________, ...
Stack:
 000000000000015e ffff00003bc0e1d0 0000000000000246 ....
.....
Call Trace:
... mmput+0x63/0xcf
... exit_mm+0x132/0x13f
... do_exit+0x238/0x749
... ? __dequeue_signal+0xfa/0x12f
... do_group_exit+0x7d/0xa5
... get_signal_to_deliver+0x371/0x395
... do_signal+0x72/0x692
... ? do_page_fault+0x24a/0x391
... ? printk+0x41/0x47
... ? sigprocmask+0xa3/0xcd
... do_notify_resume+0x2c/0x64
... retint_signal+0x48/0x8c

Code: ff ff 48 8b 7d d8 4c 89 ea 31 f6 e8 3e fe ff ff 48 89 df e8 78 fe
ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 e0 00 00 00 00 74 04 <0f> 0b
eb fe 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 57
RIP  [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
 RSP <ffff880018c1fc28>
general protection fault: 0000 [#2] SMP
last sysfs file: /sys/devices/virtual/sound/card2/uevent
CPU 1
Modules linked in: cx18-alsa tda9887 tda8290 mxl5005s s5h1409
tuner_simple tuner_types cs5345 tuner cx18 dvb_core cx2341x v4l2_common
videodev v4l2_compat_ioctl32


I'm not very familiar with mmap() nor ALSA and I did not author the
cx18-alsa part of the cx18 driver, so any hints at where to look for the
problem are appreciated.

Regards,
Andy


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded
  2011-03-04  2:06 BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded Andy Walls
@ 2011-03-04 15:50 ` Devin Heitmueller
  2011-03-04 17:13   ` Andy Walls
  2011-03-05 21:59 ` Andy Walls
  1 sibling, 1 reply; 12+ messages in thread
From: Devin Heitmueller @ 2011-03-04 15:50 UTC (permalink / raw)
  To: Andy Walls; +Cc: linux-kernel, linux-media

On Thu, Mar 3, 2011 at 9:06 PM, Andy Walls <awalls@md.metrocast.net> wrote:
> Hi,
>
> I got a BUG when loading the cx18.ko module (which in turn requests the
> cx18-alsa.ko module) on a kernel built from this repository
>
>        http://git.linuxtv.org/media_tree.git staging/for_v2.6.39
>
> which I beleive is based on 2.6.38-rc2.
>
> The BUG is mmap related and I'm almost certain it has to do with
> userspace accessing cx18-alsa.ko ALSA device nodes, since cx18.ko
> doesn't provide any mmap() related file ops.
>
> So here is my transcription of a fuzzy digital photo of the screen:
<snip>
> I'm not very familiar with mmap() nor ALSA and I did not author the
> cx18-alsa part of the cx18 driver, so any hints at where to look for the
> problem are appreciated.

Hi Andy,

I'm traveling on business for about two weeks, so I won't be able to
look into this right now.

Any idea whether this is some new regression?  I'm just trying to
understand whether this is something that has always been there since
I originally added the ALSA support to cx18 or whether it's something
that is new, in which case it might make sense to drag the ALSA people
into the conversation since there haven't been any changes in the cx18
driver lately.

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded
  2011-03-04 15:50 ` Devin Heitmueller
@ 2011-03-04 17:13   ` Andy Walls
  2011-03-07 10:32     ` Takashi Iwai
  0 siblings, 1 reply; 12+ messages in thread
From: Andy Walls @ 2011-03-04 17:13 UTC (permalink / raw)
  To: Devin Heitmueller; +Cc: linux-kernel, linux-media

On Fri, 2011-03-04 at 10:50 -0500, Devin Heitmueller wrote:
> On Thu, Mar 3, 2011 at 9:06 PM, Andy Walls <awalls@md.metrocast.net> wrote:
> > Hi,
> >
> > I got a BUG when loading the cx18.ko module (which in turn requests the
> > cx18-alsa.ko module) on a kernel built from this repository
> >
> >        http://git.linuxtv.org/media_tree.git staging/for_v2.6.39
> >
> > which I beleive is based on 2.6.38-rc2.
> >
> > The BUG is mmap related and I'm almost certain it has to do with
> > userspace accessing cx18-alsa.ko ALSA device nodes, since cx18.ko
> > doesn't provide any mmap() related file ops.
> >
> > So here is my transcription of a fuzzy digital photo of the screen:
> <snip>
> > I'm not very familiar with mmap() nor ALSA and I did not author the
> > cx18-alsa part of the cx18 driver, so any hints at where to look for the
> > problem are appreciated.
> 
> Hi Andy,
> 
> I'm traveling on business for about two weeks, so I won't be able to
> look into this right now.
> 
> Any idea whether this is some new regression?

I do not know.  I normally don't let cx18-alsa.ko load, due to
PulseAudio's persistence at keeping the device nodes open (which makes
unloading the cx18.ko module for development a hassle.)

>   I'm just trying to
> understand whether this is something that has always been there since
> I originally added the ALSA support to cx18 or whether it's something
> that is new, in which case it might make sense to drag the ALSA people
> into the conversation since there haven't been any changes in the cx18
> driver lately.

I can add some information about what is going on in userspace.  This
was on a Fedora 10 machine.  When devices nodes show up, the HAL daemon
and PulseAudio start using the device nodes right away.

That activity  triggers cx18.ko to do a firmware load which gets udevd
running to satisfy firmware requests, and then the cx18 driver issues
some simple commands to the CX23418 firmware, which will have
acknowledgment interrupts coming back from the CX23418.  I resolved the
firmware race in cx18*.ko a while ago, so I'm confident its not an
issue.

The BUG looks like some sort of mmap() race or memory management problem
outside of the cx18*.ko modules, given that mmput(), which appears to be
an mm specific reference counting function, is involved.

It could also be in ALSA I guess.

I'm not sure how in the cx18-alsa.ko things can be screwed up so badly
that it messes up the kernel's reference counting of mm structures.

I'll take a harder look at it myself this weekend, but the kernel mm
system is a little out of my current realm of experience.  Looks like I
get to learn, because I'm not going to bisect a BUG() that halts the
machine and risks disk corruption every time.

Regards,
Andy

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded
  2011-03-04 17:13   ` Andy Walls
@ 2011-03-07 10:32     ` Takashi Iwai
  0 siblings, 0 replies; 12+ messages in thread
From: Takashi Iwai @ 2011-03-07 10:32 UTC (permalink / raw)
  To: Andy Walls; +Cc: Devin Heitmueller, linux-kernel, linux-media

At Fri, 04 Mar 2011 12:13:04 -0500,
Andy Walls wrote:
> 
> On Fri, 2011-03-04 at 10:50 -0500, Devin Heitmueller wrote:
> > On Thu, Mar 3, 2011 at 9:06 PM, Andy Walls <awalls@md.metrocast.net> wrote:
> > > Hi,
> > >
> > > I got a BUG when loading the cx18.ko module (which in turn requests the
> > > cx18-alsa.ko module) on a kernel built from this repository
> > >
> > >        http://git.linuxtv.org/media_tree.git staging/for_v2.6.39
> > >
> > > which I beleive is based on 2.6.38-rc2.
> > >
> > > The BUG is mmap related and I'm almost certain it has to do with
> > > userspace accessing cx18-alsa.ko ALSA device nodes, since cx18.ko
> > > doesn't provide any mmap() related file ops.
> > >
> > > So here is my transcription of a fuzzy digital photo of the screen:
> > <snip>
> > > I'm not very familiar with mmap() nor ALSA and I did not author the
> > > cx18-alsa part of the cx18 driver, so any hints at where to look for the
> > > problem are appreciated.
> > 
> > Hi Andy,
> > 
> > I'm traveling on business for about two weeks, so I won't be able to
> > look into this right now.
> > 
> > Any idea whether this is some new regression?
> 
> I do not know.  I normally don't let cx18-alsa.ko load, due to
> PulseAudio's persistence at keeping the device nodes open (which makes
> unloading the cx18.ko module for development a hassle.)
> 
> 
> >   I'm just trying to
> > understand whether this is something that has always been there since
> > I originally added the ALSA support to cx18 or whether it's something
> > that is new, in which case it might make sense to drag the ALSA people
> > into the conversation since there haven't been any changes in the cx18
> > driver lately.
> 
> I can add some information about what is going on in userspace.  This
> was on a Fedora 10 machine.  When devices nodes show up, the HAL daemon
> and PulseAudio start using the device nodes right away.
> 
> That activity  triggers cx18.ko to do a firmware load which gets udevd
> running to satisfy firmware requests, and then the cx18 driver issues
> some simple commands to the CX23418 firmware, which will have
> acknowledgment interrupts coming back from the CX23418.  I resolved the
> firmware race in cx18*.ko a while ago, so I'm confident its not an
> issue.
> 
> The BUG looks like some sort of mmap() race or memory management problem
> outside of the cx18*.ko modules, given that mmput(), which appears to be
> an mm specific reference counting function, is involved.
> 
> It could also be in ALSA I guess.

There is no change in ALSA core regarding mmap for really long time.
If it's a regression, it must be triggered by some other changes.


Takashi

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded
  2011-03-04  2:06 BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded Andy Walls
  2011-03-04 15:50 ` Devin Heitmueller
@ 2011-03-05 21:59 ` Andy Walls
  2011-03-06  2:03   ` Andy Walls
  1 sibling, 1 reply; 12+ messages in thread
From: Andy Walls @ 2011-03-05 21:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, David Miller, linux-media, Devin Heitmueller, Hugh Dickins,
	Hugh Dickins

On Thu, 2011-03-03 at 21:06 -0500, Andy Walls wrote:
> Hi,
> 
> I got a BUG when loading the cx18.ko module (which in turn requests the
> cx18-alsa.ko module) on a kernel built from this repository
> 
> 	http://git.linuxtv.org/media_tree.git staging/for_v2.6.39
> 
> which I beleive is based on 2.6.38-rc2.

[snip]

> So here is my transcription of a fuzzy digital photo of the screen:
> 
> kernel BUG at /home/andy/cx18dev/git/media_tree/mm/mmap.c:2309!
> invalid opcode: 0000 [#1] SMP
> last sysfs file: /sys/module/snd_pcm/initstate
> Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple ...
> ...
> Pid: 2580, comm: udevd Not tainted 2.6.38-rc2-cx18-vb2-proto+
> RIP: 0010:[<ffffffff810eb50b>]  [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0020000000000000
> RDX: 0000000000160011 RSI: ffffea____c42___ RDI: 0000000000000202
> RBP: ffff____18c1f_58 R08: ffff____________ R09: 0000000000000004
> R10: ffff_______bb_38 R11: 0000000000000000 R12: ffff____344a6680
> R13: 00007fff22______ R14: ffff____________ R15: 0000000000000001
> ...
> CR2: 0000000000000000 ...
> ....
> Process udevd (pid: 25__, threadinfo ffff________, ...
> Stack:
>  000000000000015e ffff00003bc0e1d0 0000000000000246 ....
> .....
> Call Trace:
> ... mmput+0x63/0xcf
> ... exit_mm+0x132/0x13f
> ... do_exit+0x238/0x749
> ... ? __dequeue_signal+0xfa/0x12f
> ... do_group_exit+0x7d/0xa5
> ... get_signal_to_deliver+0x371/0x395
> ... do_signal+0x72/0x692
> ... ? do_page_fault+0x24a/0x391
> ... ? printk+0x41/0x47
> ... ? sigprocmask+0xa3/0xcd
> ... do_notify_resume+0x2c/0x64
> ... retint_signal+0x48/0x8c
> 
> Code: ff ff 48 8b 7d d8 4c 89 ea 31 f6 e8 3e fe ff ff 48 89 df e8 78 fe
> ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 e0 00 00 00 00 74 04 <0f> 0b
> eb fe 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 57
> RIP  [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
>  RSP <ffff880018c1fc28>
> general protection fault: 0000 [#2] SMP
> last sysfs file: /sys/devices/virtual/sound/card2/uevent
> CPU 1
> Modules linked in: cx18-alsa tda9887 tda8290 mxl5005s s5h1409
> tuner_simple tuner_types cs5345 tuner cx18 dvb_core cx2341x v4l2_common
> videodev v4l2_compat_ioctl32


I'm dumping all my previous assumtpions about this BUG.  After a bit of
reading, all I can say is that it's a page table deallocation problem at
process exit.  After all the page table deallocations on exit,
mm->nr_ptes is still > 0, and that's a bad thing.

It apparently happened in a child udevd exiting shortly after cx18.ko
loaded.  The cx18 driver allocating large amounts kernel memory for DMA
buffers upon load may be related to triggering the problem, but I doubt
it is a root cause of the BUG.


This monsterous thread from 5 years ago is somewhat enlightening:

	http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1680.html
	http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1787.html

so it gives me a place to start looking for the problem.

Any advice on what data to collect is appreciated.

Regards,
Andy


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded
  2011-03-05 21:59 ` Andy Walls
@ 2011-03-06  2:03   ` Andy Walls
  2011-03-06 18:37     ` Hugh Dickins
  0 siblings, 1 reply; 12+ messages in thread
From: Andy Walls @ 2011-03-06  2:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, David Miller, linux-media, Devin Heitmueller, Hugh Dickins

On Sat, 2011-03-05 at 16:59 -0500, Andy Walls wrote:
> On Thu, 2011-03-03 at 21:06 -0500, Andy Walls wrote:
> > Hi,
> > 
> > I got a BUG when loading the cx18.ko module (which in turn requests the
> > cx18-alsa.ko module) on a kernel built from this repository
> > 
> > 	http://git.linuxtv.org/media_tree.git staging/for_v2.6.39
> > 
> > which I beleive is based on 2.6.38-rc2.
> 
> [snip]
> 
> > So here is my transcription of a fuzzy digital photo of the screen:
> > 
> > kernel BUG at /home/andy/cx18dev/git/media_tree/mm/mmap.c:2309!
> > invalid opcode: 0000 [#1] SMP
> > last sysfs file: /sys/module/snd_pcm/initstate
> > Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple ...
> > ...
> > Pid: 2580, comm: udevd Not tainted 2.6.38-rc2-cx18-vb2-proto+
> > RIP: 0010:[<ffffffff810eb50b>]  [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
> > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0020000000000000
> > RDX: 0000000000160011 RSI: ffffea____c42___ RDI: 0000000000000202
> > RBP: ffff____18c1f_58 R08: ffff____________ R09: 0000000000000004
> > R10: ffff_______bb_38 R11: 0000000000000000 R12: ffff____344a6680
> > R13: 00007fff22______ R14: ffff____________ R15: 0000000000000001
> > ...
> > CR2: 0000000000000000 ...
> > ....
> > Process udevd (pid: 25__, threadinfo ffff________, ...
> > Stack:
> >  000000000000015e ffff00003bc0e1d0 0000000000000246 ....
> > .....
> > Call Trace:
> > ... mmput+0x63/0xcf
> > ... exit_mm+0x132/0x13f
> > ... do_exit+0x238/0x749
> > ... ? __dequeue_signal+0xfa/0x12f
> > ... do_group_exit+0x7d/0xa5
> > ... get_signal_to_deliver+0x371/0x395
> > ... do_signal+0x72/0x692
> > ... ? do_page_fault+0x24a/0x391
> > ... ? printk+0x41/0x47
> > ... ? sigprocmask+0xa3/0xcd
> > ... do_notify_resume+0x2c/0x64
> > ... retint_signal+0x48/0x8c
> > 
> > Code: ff ff 48 8b 7d d8 4c 89 ea 31 f6 e8 3e fe ff ff 48 89 df e8 78 fe
> > ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 e0 00 00 00 00 74 04 <0f> 0b
> > eb fe 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 57
> > RIP  [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
> >  RSP <ffff880018c1fc28>
> > general protection fault: 0000 [#2] SMP
> > last sysfs file: /sys/devices/virtual/sound/card2/uevent
> > CPU 1
> > Modules linked in: cx18-alsa tda9887 tda8290 mxl5005s s5h1409
> > tuner_simple tuner_types cs5345 tuner cx18 dvb_core cx2341x v4l2_common
> > videodev v4l2_compat_ioctl32
> 
> 
> I'm dumping all my previous assumtpions about this BUG.  After a bit of
> reading, all I can say is that it's a page table deallocation problem at
> process exit.  After all the page table deallocations on exit,
> mm->nr_ptes is still > 0, and that's a bad thing.
> 
> It apparently happened in a child udevd exiting shortly after cx18.ko
> loaded.  The cx18 driver allocating large amounts kernel memory for DMA
> buffers upon load may be related to triggering the problem, but I doubt
> it is a root cause of the BUG.
> 
> 
> This monsterous thread from 5 years ago is somewhat enlightening:
> 
> 	http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1680.html
> 	http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1787.html
> 
> so it gives me a place to start looking for the problem.
> 
> Any advice on what data to collect is appreciated.

When attemtping to reproduce this BUG, I got another bug related to
memory management:

(Details handtyped from a photo):
BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff010f22fa>] remove_vm_area+0x42/0x77
PGD 37cdd067 PUD 336c__67 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:14.4/0000:03:00.0/firmware/0000:03:00.0/loading
CPU 0
Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple tuner_types cx5345 tuner cx18(+) dvb_core cx2341x ...
Pid: 2470, comm: work_for_cpu Tainted: G        W 2.6.28-rc2-cx18-vb2-proto+
RIP: 0010:[<ffffffff010f22fa>]  [<ffffffff010f22fa>] remove_vm_area+0x42/0x77
...
RAX: 0000000000000000 RBX: ffff____35e7c540 RCX: 0000000000001000
RDX: 0000000000000000 ....
...
CR2: 0000000000000000 ....
Stack:
 ffff__0011485968 000000000000001 ffff____1147dc9_ ffffffff_1_f23__
....
Call Trace:
... __vunmap+0x3e/0xbd
... vfree+0x2e/0x30
... dvb_dmx_init+0x7e/0x253 [dvb_core]
... cx18_dvb_register+0xd2/0x75c [cx18]
... cx18_streams_resgister+0x6a/0x26a [cx18]
... cx18_streams_setup+0x3cc/0x486 [cx18]
... cx18_probe+0x11cc/0x12fb [cx18]
......

The code appears to be failing here:

/home/andy/cx18dev/git/media_tree/mm/vmalloc.c:1352
    161d:       eb 06                   jmp    1625 <remove_vm_area+0x45>
    161f:       48 89 c2                mov    %rax,%rdx
    1622:       48 8b 00                mov    (%rax),%rax    <--- Oops  p = &tmp->next)  (tmp = *p)
    1625:       48 39 d8                cmp    %rbx,%rax		(tmp = *p) != vm;	
    1628:       75 f5                   jne    161f <remove_vm_area+0x3f>
/home/andy/cx18dev/git/media_tree/mm/vmalloc.c:1354

Corresponding to this code in mm/vmalloc.c:

struct vm_struct *remove_vm_area(const void *addr)
{
        struct vmap_area *va;

        va = find_vmap_area((unsigned long)addr);
        if (va && va->flags & VM_VM_AREA) {
                struct vm_struct *vm = va->private;
                struct vm_struct *tmp, **p;
                /*
                 * remove from list and disallow access to this vm_struct
                 * before unmap. (address range confliction is maintained by
                 * vmap.)
                 */
                write_lock(&vmlist_lock);
                for (p = &vmlist; (tmp = *p) != vm; p = &tmp->next)  <--- Ooops
			;
[...]

That for() loop appears to assume the vm_struct will be on the vmlist
somewhere.  If it isn't, then I suppose the for() loop could end up
doing a NULL dereference.

This BUG happened in the final stages of the cx18 driver setting up a
CX23418 card instance.  I have 2 cards in this machine, so a number of
buffers had certainly been allocated using kmalloc().  The code in the
dvb_core that is failing got BUG'ed in this case was this:

int dvb_dmx_init(struct dvb_demux *dvbdemux)
{
        int i;
        struct dmx_demux *dmx = &dvbdemux->dmx;

        dvbdemux->cnt_storage = NULL;
        dvbdemux->users = 0;
        dvbdemux->filter = vmalloc(dvbdemux->filternum * sizeof(struct dvb_demux_filter));

        if (!dvbdemux->filter)
                return -ENOMEM;

        dvbdemux->feed = vmalloc(dvbdemux->feednum * sizeof(struct dvb_demux_feed));
        if (!dvbdemux->feed) {
                vfree(dvbdemux->filter);     <------- BUG/Oops happened in this call
                dvbdemux->filter = NULL;
                return -ENOMEM;
        }
...

Which is kind of interesting:
1. The first vmalloc() succeeded.
2. The second vmalloc() failed.
3. The vfree() of the pointer from the first vmalloc() caused an
Oops/BUG.

I'm not sure where to go from here.

Regards,
Andy









> Regards,
> Andy
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-media" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded
  2011-03-06  2:03   ` Andy Walls
@ 2011-03-06 18:37     ` Hugh Dickins
  2011-03-06 21:04       ` Andy Walls
  0 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2011-03-06 18:37 UTC (permalink / raw)
  To: Andy Walls
  Cc: linux-kernel, akpm, David Miller, linux-media, Devin Heitmueller

On Sat, Mar 5, 2011 at 6:03 PM, Andy Walls <awalls@md.metrocast.net> wrote:
> On Sat, 2011-03-05 at 16:59 -0500, Andy Walls wrote:
>> On Thu, 2011-03-03 at 21:06 -0500, Andy Walls wrote:
>> > Hi,
>> >
>> > I got a BUG when loading the cx18.ko module (which in turn requests the
>> > cx18-alsa.ko module) on a kernel built from this repository
>> >
>> >     http://git.linuxtv.org/media_tree.git staging/for_v2.6.39
>> >
>> > which I beleive is based on 2.6.38-rc2.
>>
>> [snip]
>>
>> > So here is my transcription of a fuzzy digital photo of the screen:
>> >
>> > kernel BUG at /home/andy/cx18dev/git/media_tree/mm/mmap.c:2309!
>> > invalid opcode: 0000 [#1] SMP
>> > last sysfs file: /sys/module/snd_pcm/initstate
>> > Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple ...
>> > ...
>> > Pid: 2580, comm: udevd Not tainted 2.6.38-rc2-cx18-vb2-proto+
>> > RIP: 0010:[<ffffffff810eb50b>]  [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
>> > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0020000000000000
>> > RDX: 0000000000160011 RSI: ffffea____c42___ RDI: 0000000000000202
>> > RBP: ffff____18c1f_58 R08: ffff____________ R09: 0000000000000004
>> > R10: ffff_______bb_38 R11: 0000000000000000 R12: ffff____344a6680
>> > R13: 00007fff22______ R14: ffff____________ R15: 0000000000000001
>> > ...
>> > CR2: 0000000000000000 ...
>> > ....
>> > Process udevd (pid: 25__, threadinfo ffff________, ...
>> > Stack:
>> >  000000000000015e ffff00003bc0e1d0 0000000000000246 ....
>> > .....
>> > Call Trace:
>> > ... mmput+0x63/0xcf
>> > ... exit_mm+0x132/0x13f
>> > ... do_exit+0x238/0x749
>> > ... ? __dequeue_signal+0xfa/0x12f
>> > ... do_group_exit+0x7d/0xa5
>> > ... get_signal_to_deliver+0x371/0x395
>> > ... do_signal+0x72/0x692
>> > ... ? do_page_fault+0x24a/0x391
>> > ... ? printk+0x41/0x47
>> > ... ? sigprocmask+0xa3/0xcd
>> > ... do_notify_resume+0x2c/0x64
>> > ... retint_signal+0x48/0x8c
>> >
>> > Code: ff ff 48 8b 7d d8 4c 89 ea 31 f6 e8 3e fe ff ff 48 89 df e8 78 fe
>> > ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 e0 00 00 00 00 74 04 <0f> 0b
>> > eb fe 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 57
>> > RIP  [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
>> >  RSP <ffff880018c1fc28>
>> > general protection fault: 0000 [#2] SMP
>> > last sysfs file: /sys/devices/virtual/sound/card2/uevent
>> > CPU 1
>> > Modules linked in: cx18-alsa tda9887 tda8290 mxl5005s s5h1409
>> > tuner_simple tuner_types cs5345 tuner cx18 dvb_core cx2341x v4l2_common
>> > videodev v4l2_compat_ioctl32
>>
>>
>> I'm dumping all my previous assumtpions about this BUG.  After a bit of
>> reading, all I can say is that it's a page table deallocation problem at
>> process exit.  After all the page table deallocations on exit,
>> mm->nr_ptes is still > 0, and that's a bad thing.
>>
>> It apparently happened in a child udevd exiting shortly after cx18.ko
>> loaded.  The cx18 driver allocating large amounts kernel memory for DMA
>> buffers upon load may be related to triggering the problem, but I doubt
>> it is a root cause of the BUG.
>>
>>
>> This monsterous thread from 5 years ago is somewhat enlightening:
>>
>>       http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1680.html
>>       http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1787.html
>>
>> so it gives me a place to start looking for the problem.
>>
>> Any advice on what data to collect is appreciated.
>
> When attemtping to reproduce this BUG, I got another bug related to
> memory management:
>
> (Details handtyped from a photo):
> BUG: unable to handle kernel NULL pointer dereference at           (null)
> IP: [<ffffffff010f22fa>] remove_vm_area+0x42/0x77
> PGD 37cdd067 PUD 336c__67 PMD 0
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/devices/pci0000:00/0000:00:14.4/0000:03:00.0/firmware/0000:03:00.0/loading
> CPU 0
> Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple tuner_types cx5345 tuner cx18(+) dvb_core cx2341x ...
> Pid: 2470, comm: work_for_cpu Tainted: G        W 2.6.28-rc2-cx18-vb2-proto+
> RIP: 0010:[<ffffffff010f22fa>]  [<ffffffff010f22fa>] remove_vm_area+0x42/0x77
> ...
> RAX: 0000000000000000 RBX: ffff____35e7c540 RCX: 0000000000001000
> RDX: 0000000000000000 ....
> ...
> CR2: 0000000000000000 ....
> Stack:
>  ffff__0011485968 000000000000001 ffff____1147dc9_ ffffffff_1_f23__
> ....
> Call Trace:
> ... __vunmap+0x3e/0xbd
> ... vfree+0x2e/0x30
> ... dvb_dmx_init+0x7e/0x253 [dvb_core]
> ... cx18_dvb_register+0xd2/0x75c [cx18]
> ... cx18_streams_resgister+0x6a/0x26a [cx18]
> ... cx18_streams_setup+0x3cc/0x486 [cx18]
> ... cx18_probe+0x11cc/0x12fb [cx18]
> ......
>
> The code appears to be failing here:
>
> /home/andy/cx18dev/git/media_tree/mm/vmalloc.c:1352
>    161d:       eb 06                   jmp    1625 <remove_vm_area+0x45>
>    161f:       48 89 c2                mov    %rax,%rdx
>    1622:       48 8b 00                mov    (%rax),%rax    <--- Oops  p = &tmp->next)  (tmp = *p)
>    1625:       48 39 d8                cmp    %rbx,%rax                (tmp = *p) != vm;
>    1628:       75 f5                   jne    161f <remove_vm_area+0x3f>
> /home/andy/cx18dev/git/media_tree/mm/vmalloc.c:1354
>
> Corresponding to this code in mm/vmalloc.c:
>
> struct vm_struct *remove_vm_area(const void *addr)
> {
>        struct vmap_area *va;
>
>        va = find_vmap_area((unsigned long)addr);
>        if (va && va->flags & VM_VM_AREA) {
>                struct vm_struct *vm = va->private;
>                struct vm_struct *tmp, **p;
>                /*
>                 * remove from list and disallow access to this vm_struct
>                 * before unmap. (address range confliction is maintained by
>                 * vmap.)
>                 */
>                write_lock(&vmlist_lock);
>                for (p = &vmlist; (tmp = *p) != vm; p = &tmp->next)  <--- Ooops
>                        ;
> [...]
>
> That for() loop appears to assume the vm_struct will be on the vmlist
> somewhere.  If it isn't, then I suppose the for() loop could end up
> doing a NULL dereference.
>
> This BUG happened in the final stages of the cx18 driver setting up a
> CX23418 card instance.  I have 2 cards in this machine, so a number of
> buffers had certainly been allocated using kmalloc().  The code in the
> dvb_core that is failing got BUG'ed in this case was this:
>
> int dvb_dmx_init(struct dvb_demux *dvbdemux)
> {
>        int i;
>        struct dmx_demux *dmx = &dvbdemux->dmx;
>
>        dvbdemux->cnt_storage = NULL;
>        dvbdemux->users = 0;
>        dvbdemux->filter = vmalloc(dvbdemux->filternum * sizeof(struct dvb_demux_filter));
>
>        if (!dvbdemux->filter)
>                return -ENOMEM;
>
>        dvbdemux->feed = vmalloc(dvbdemux->feednum * sizeof(struct dvb_demux_feed));
>        if (!dvbdemux->feed) {
>                vfree(dvbdemux->filter);     <------- BUG/Oops happened in this call
>                dvbdemux->filter = NULL;
>                return -ENOMEM;
>        }
> ...
>
> Which is kind of interesting:
> 1. The first vmalloc() succeeded.
> 2. The second vmalloc() failed.
> 3. The vfree() of the pointer from the first vmalloc() caused an
> Oops/BUG.
>
> I'm not sure where to go from here.

Thanks for all the effort you are putting into investigating this: you
deserve a better response than I can give you.

mm/vmalloc.c's vmap_area handling is entirely separate from
mm/mmap.c's vm_area_struct handling, yet both misbehaviors would be
explained if a next pointer has been corrupted to NULL.

Probably just coincidence that they both manifest that way, though the
underlying problem may turn out to be one.

If you have not already, it would be well worth turning on
CONFIG_DEBUG_LIST and CONFIG_DEBUG_SLAB or CONFIG_SLUB_DEBUG with
CONFIG_SLUB_DEBUG_ON.

If that BUG_ON(mm->nr_ptes ...) in exit_mmap() is preventing you from
getting on with your work, or slowing down reproduction of the
testcase, you should be able to replace it by a WARN_ON.  You will
probably leak at least one page (the page table) and perhaps many
pages (those that that page table points to) each time it hits, but it
shouldn't actually be unsafe to continue - it's really a development
BUG_ON, to check that  new architectures added are freeing all the
page tables they have allocated.

I do expect the underlying problem to be somewhere down the driver
end, given that nobody else has been reporting these issues.  I'm
hoping that once the cx18 guys have time to try to reproduce it,
they'll be better able to track it down. But you are having trouble
reproducing it yourself?  hitting this vmalloc one before you could
reproduce the exit_mmap one?  No chance to bisect it to a particular
commit if you cannot reliably reproduce it.

There was a horrid list corruption bug in early 2.6.38-rc, fixed in
-rc6; but although I guess it could cause all kinds of havoc, its
particular signature was not like this, so I don't really believe that
one was to blame here.

Hugh

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded
  2011-03-06 18:37     ` Hugh Dickins
@ 2011-03-06 21:04       ` Andy Walls
  2011-03-07  2:34         ` Hugh Dickins
                           ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Andy Walls @ 2011-03-06 21:04 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-kernel, akpm, David Miller, linux-media, Devin Heitmueller

On Sun, 2011-03-06 at 10:37 -0800, Hugh Dickins wrote:
> On Sat, Mar 5, 2011 at 6:03 PM, Andy Walls <awalls@md.metrocast.net> wrote:
> > On Sat, 2011-03-05 at 16:59 -0500, Andy Walls wrote:
> >> On Thu, 2011-03-03 at 21:06 -0500, Andy Walls wrote:
> >> > Hi,
> >> >
> >> > I got a BUG when loading the cx18.ko module (which in turn requests the
> >> > cx18-alsa.ko module) on a kernel built from this repository
> >> >
> >> >     http://git.linuxtv.org/media_tree.git staging/for_v2.6.39
> >> >
> >> > which I beleive is based on 2.6.38-rc2.
> >>
> >> [snip]
> >>
> >> > So here is my transcription of a fuzzy digital photo of the screen:
> >> >
> >> > kernel BUG at /home/andy/cx18dev/git/media_tree/mm/mmap.c:2309!
> >> > invalid opcode: 0000 [#1] SMP
> >> > last sysfs file: /sys/module/snd_pcm/initstate
> >> > Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple ...
> >> > ...
> >> > Pid: 2580, comm: udevd Not tainted 2.6.38-rc2-cx18-vb2-proto+
> >> > RIP: 0010:[<ffffffff810eb50b>]  [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
> >> > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0020000000000000
> >> > RDX: 0000000000160011 RSI: ffffea____c42___ RDI: 0000000000000202
> >> > RBP: ffff____18c1f_58 R08: ffff____________ R09: 0000000000000004
> >> > R10: ffff_______bb_38 R11: 0000000000000000 R12: ffff____344a6680
> >> > R13: 00007fff22______ R14: ffff____________ R15: 0000000000000001
> >> > ...
> >> > CR2: 0000000000000000 ...
> >> > ....
> >> > Process udevd (pid: 25__, threadinfo ffff________, ...
> >> > Stack:
> >> >  000000000000015e ffff00003bc0e1d0 0000000000000246 ....
> >> > .....
> >> > Call Trace:
> >> > ... mmput+0x63/0xcf
> >> > ... exit_mm+0x132/0x13f
> >> > ... do_exit+0x238/0x749
> >> > ... ? __dequeue_signal+0xfa/0x12f
> >> > ... do_group_exit+0x7d/0xa5
> >> > ... get_signal_to_deliver+0x371/0x395
> >> > ... do_signal+0x72/0x692
> >> > ... ? do_page_fault+0x24a/0x391
> >> > ... ? printk+0x41/0x47
> >> > ... ? sigprocmask+0xa3/0xcd
> >> > ... do_notify_resume+0x2c/0x64
> >> > ... retint_signal+0x48/0x8c
> >> >
> >> > Code: ff ff 48 8b 7d d8 4c 89 ea 31 f6 e8 3e fe ff ff 48 89 df e8 78 fe
> >> > ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 e0 00 00 00 00 74 04 <0f> 0b
> >> > eb fe 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 57
> >> > RIP  [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
> >> >  RSP <ffff880018c1fc28>
> >> > general protection fault: 0000 [#2] SMP
> >> > last sysfs file: /sys/devices/virtual/sound/card2/uevent
> >> > CPU 1
> >> > Modules linked in: cx18-alsa tda9887 tda8290 mxl5005s s5h1409
> >> > tuner_simple tuner_types cs5345 tuner cx18 dvb_core cx2341x v4l2_common
> >> > videodev v4l2_compat_ioctl32
> >>
> >>
> >> I'm dumping all my previous assumtpions about this BUG.  After a bit of
> >> reading, all I can say is that it's a page table deallocation problem at
> >> process exit.  After all the page table deallocations on exit,
> >> mm->nr_ptes is still > 0, and that's a bad thing.
> >>
> >> It apparently happened in a child udevd exiting shortly after cx18.ko
> >> loaded.  The cx18 driver allocating large amounts kernel memory for DMA
> >> buffers upon load may be related to triggering the problem, but I doubt
> >> it is a root cause of the BUG.
> >>
> >>
> >> This monsterous thread from 5 years ago is somewhat enlightening:
> >>
> >>       http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1680.html
> >>       http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1787.html
> >>
> >> so it gives me a place to start looking for the problem.
> >>
> >> Any advice on what data to collect is appreciated.
> >
> > When attemtping to reproduce this BUG, I got another bug related to
> > memory management:
> >
> > (Details handtyped from a photo):
> > BUG: unable to handle kernel NULL pointer dereference at           (null)
> > IP: [<ffffffff010f22fa>] remove_vm_area+0x42/0x77
> > PGD 37cdd067 PUD 336c__67 PMD 0
> > Oops: 0000 [#1] SMP
> > last sysfs file: /sys/devices/pci0000:00/0000:00:14.4/0000:03:00.0/firmware/0000:03:00.0/loading
> > CPU 0
> > Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple tuner_types cx5345 tuner cx18(+) dvb_core cx2341x ...
> > Pid: 2470, comm: work_for_cpu Tainted: G        W 2.6.28-rc2-cx18-vb2-proto+
> > RIP: 0010:[<ffffffff010f22fa>]  [<ffffffff010f22fa>] remove_vm_area+0x42/0x77
> > ...
> > RAX: 0000000000000000 RBX: ffff____35e7c540 RCX: 0000000000001000
> > RDX: 0000000000000000 ....
> > ...
> > CR2: 0000000000000000 ....
> > Stack:
> >  ffff__0011485968 000000000000001 ffff____1147dc9_ ffffffff_1_f23__
> > ....
> > Call Trace:
> > ... __vunmap+0x3e/0xbd
> > ... vfree+0x2e/0x30
> > ... dvb_dmx_init+0x7e/0x253 [dvb_core]
> > ... cx18_dvb_register+0xd2/0x75c [cx18]
> > ... cx18_streams_resgister+0x6a/0x26a [cx18]
> > ... cx18_streams_setup+0x3cc/0x486 [cx18]
> > ... cx18_probe+0x11cc/0x12fb [cx18]
> > ......
> >
> > The code appears to be failing here:
> >
> > /home/andy/cx18dev/git/media_tree/mm/vmalloc.c:1352
> >    161d:       eb 06                   jmp    1625 <remove_vm_area+0x45>
> >    161f:       48 89 c2                mov    %rax,%rdx
> >    1622:       48 8b 00                mov    (%rax),%rax    <--- Oops  p = &tmp->next)  (tmp = *p)
> >    1625:       48 39 d8                cmp    %rbx,%rax                (tmp = *p) != vm;
> >    1628:       75 f5                   jne    161f <remove_vm_area+0x3f>
> > /home/andy/cx18dev/git/media_tree/mm/vmalloc.c:1354
> >
> > Corresponding to this code in mm/vmalloc.c:
> >
> > struct vm_struct *remove_vm_area(const void *addr)
> > {
> >        struct vmap_area *va;
> >
> >        va = find_vmap_area((unsigned long)addr);
> >        if (va && va->flags & VM_VM_AREA) {
> >                struct vm_struct *vm = va->private;
> >                struct vm_struct *tmp, **p;
> >                /*
> >                 * remove from list and disallow access to this vm_struct
> >                 * before unmap. (address range confliction is maintained by
> >                 * vmap.)
> >                 */
> >                write_lock(&vmlist_lock);
> >                for (p = &vmlist; (tmp = *p) != vm; p = &tmp->next)  <--- Ooops
> >                        ;
> > [...]
> >
> > That for() loop appears to assume the vm_struct will be on the vmlist
> > somewhere.  If it isn't, then I suppose the for() loop could end up
> > doing a NULL dereference.
> >
> > This BUG happened in the final stages of the cx18 driver setting up a
> > CX23418 card instance.  I have 2 cards in this machine, so a number of
> > buffers had certainly been allocated using kmalloc().  The code in the
> > dvb_core that is failing got BUG'ed in this case was this:
> >
> > int dvb_dmx_init(struct dvb_demux *dvbdemux)
> > {
> >        int i;
> >        struct dmx_demux *dmx = &dvbdemux->dmx;
> >
> >        dvbdemux->cnt_storage = NULL;
> >        dvbdemux->users = 0;
> >        dvbdemux->filter = vmalloc(dvbdemux->filternum * sizeof(struct dvb_demux_filter));
> >
> >        if (!dvbdemux->filter)
> >                return -ENOMEM;
> >
> >        dvbdemux->feed = vmalloc(dvbdemux->feednum * sizeof(struct dvb_demux_feed));
> >        if (!dvbdemux->feed) {
> >                vfree(dvbdemux->filter);     <------- BUG/Oops happened in this call
> >                dvbdemux->filter = NULL;
> >                return -ENOMEM;
> >        }
> > ...
> >
> > Which is kind of interesting:
> > 1. The first vmalloc() succeeded.
> > 2. The second vmalloc() failed.
> > 3. The vfree() of the pointer from the first vmalloc() caused an
> > Oops/BUG.
> >
> > I'm not sure where to go from here.
> 
> Thanks for all the effort you are putting into investigating this: you
> deserve a better response than I can give you.
> 
> mm/vmalloc.c's vmap_area handling is entirely separate from
> mm/mmap.c's vm_area_struct handling, yet both misbehaviors would be
> explained if a next pointer has been corrupted to NULL.
> 
> Probably just coincidence that they both manifest that way, though the
> underlying problem may turn out to be one.

Hi Hugh,

I suspect the underlying problem is the same.  Because VM area and
Vmalloc handling is very different, I agree, it is likely not in either
one, but somewhere else.  (e.g. list.[ch] maybe?)

> If you have not already, it would be well worth turning on
> CONFIG_DEBUG_LIST and CONFIG_DEBUG_SLAB or CONFIG_SLUB_DEBUG with
> CONFIG_SLUB_DEBUG_ON.

Will do.

> If that BUG_ON(mm->nr_ptes ...) in exit_mmap() is preventing you from
> getting on with your work, or slowing down reproduction of the
> testcase, you should be able to replace it by a WARN_ON.  You will
> probably leak at least one page (the page table) and perhaps many
> pages (those that that page table points to) each time it hits, but it
> shouldn't actually be unsafe to continue - it's really a development
> BUG_ON, to check that  new architectures added are freeing all the
> page tables they have allocated.

I added debug to dump the VM Areas' start & end values and get past the
BUG.  That's when I got the seconds BUG with involing vmlist. :(


> I do expect the underlying problem to be somewhere down the driver
> end, given that nobody else has been reporting these issues.  I'm
> hoping that once the cx18 guys have time to try to reproduce it,
> they'll be better able to track it down.

Well, I am the cx18 guy (except for Devin who wrote the guts of
cx18-alsa*c).  Given that the cx18 driver has been pretty static for the
past few months, I'm doubting it is there.  

I have been doing a recent change to add dynamic SCB MDL entry
management, but that is in a small 64kB region in the ioremap()'ed
memory of the CX23418 device.

(one patch I have locally hasn't been pushed yet, but here is the one
patch I pushed before I began testing:
http://git.linuxtv.org/awalls/media_tree.git?a=shortlog;h=refs/heads/cx18-vb2-proto )

My overall objective is to start convert cx18 to use the videobuf2
infrastruutre instead of allocating so many buffers at module load.

The cx18 driver kmalloc()'s *a lot* of buffers per card at module load,
and I have two cards in my machine.  I think these consecutive
allocations by the cx18 driver are creating conditions that induce
corruption in the memory management system.  I'm guessing it is probably
due to some locking or list handling bug.




>  But you are having trouble
> reproducing it yourself?

I can't say yet.  I'm currently two for two.

I hate BUG-ing our household "production" machine that has all the kids'
homework, financal tracking, saved e-mails, etc.  I've got to make a
backup first or just slap in a new disk for development.


>   hitting this vmalloc one before you could
> reproduce the exit_mmap one?

I hit the exit_mmap() one; added debug to get past the BUG; and then got
the new Oops/BUG.

Loading the cx18 driver was clearly the action in both cases that caused
the BUGs.

>   No chance to bisect it to a particular
> commit if you cannot reliably reproduce it.

I have to back up home directories before I can begin a bisect on this
one.

> There was a horrid list corruption bug in early 2.6.38-rc, fixed in
> -rc6; but although I guess it could cause all kinds of havoc, its
> particular signature was not like this, so I don't really believe that
> one was to blame here.

Sounds like it may be worth me reviewing the commits that introduced the
failure and the commit that fixed it.  Do you happen to know what they
are?

Regards,
Andy

> Hugh



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded
  2011-03-06 21:04       ` Andy Walls
@ 2011-03-07  2:34         ` Hugh Dickins
  2011-03-09  0:37         ` Andy Walls
  2011-03-11  0:34         ` Andy Walls
  2 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2011-03-07  2:34 UTC (permalink / raw)
  To: Andy Walls
  Cc: linux-kernel, akpm, David Miller, linux-media, Devin Heitmueller

On Sun, 6 Mar 2011, Andy Walls wrote:
> On Sun, 2011-03-06 at 10:37 -0800, Hugh Dickins wrote:
> 
> > There was a horrid list corruption bug in early 2.6.38-rc, fixed in
> > -rc6; but although I guess it could cause all kinds of havoc, its
> > particular signature was not like this, so I don't really believe that
> > one was to blame here.
> 
> Sounds like it may be worth me reviewing the commits that introduced the
> failure and the commit that fixed it.  Do you happen to know what they
> are?

Here are the several fixes, which reference LKML threads and culprits:
it seems to have been a danger since 2.6.33, made much worse recently.

commit ceaaec98ad99859ac90ac6863ad0a6cd075d8e0e
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Thu Feb 17 22:59:19 2011 +0000

    net: deinit automatic LIST_HEAD
    
    commit 9b5e383c11b08784 (net: Introduce
    unregister_netdevice_many()) left an active LIST_HEAD() in
    rollback_registered(), with possible memory corruption.
    
    Even if device is freed without touching its unreg_list (and therefore
    touching the previous memory location holding LISTE_HEAD(single), better
    close the bug for good, since its really subtle.
    
    (Same fix for default_device_exit_batch() for completeness)
    
    Reported-by: Michal Hocko <mhocko@suse.cz>
    Tested-by: Michal Hocko <mhocko@suse.cz>
    Reported-by: Eric W. Biderman <ebiderman@xmission.com>
    Tested-by: Eric W. Biderman <ebiderman@xmission.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    CC: Ingo Molnar <mingo@elte.hu>
    CC: Octavian Purdila <opurdila@ixiacom.com>
    CC: stable <stable@kernel.org> [.33+]
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit f87e6f47933e3ebeced9bb12615e830a72cedce4
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Thu Feb 17 22:54:38 2011 +0000

    net: dont leave active on stack LIST_HEAD
    
    Eric W. Biderman and Michal Hocko reported various memory corruptions
    that we suspected to be related to a LIST head located on stack, that
    was manipulated after thread left function frame (and eventually exited,
    so its stack was freed and reused).
    
    Eric Dumazet suggested the problem was probably coming from commit
    443457242beb (net: factorize
    sync-rcu call in unregister_netdevice_many)
    
    This patch fixes __dev_close() and dev_close() to properly deinit their
    respective LIST_HEAD(single) before exiting.
    
    References: https://lkml.org/lkml/2011/2/16/304
    References: https://lkml.org/lkml/2011/2/14/223
    
    Reported-by: Michal Hocko <mhocko@suse.cz>
    Tested-by: Michal Hocko <mhocko@suse.cz>
    Reported-by: Eric W. Biderman <ebiderman@xmission.com>
    Tested-by: Eric W. Biderman <ebiderman@xmission.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    CC: Ingo Molnar <mingo@elte.hu>
    CC: Octavian Purdila <opurdila@ixiacom.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 3c18d4de86e4a7f93815c081e50e0543fa27200f
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri Feb 18 11:32:28 2011 -0800

    Expand CONFIG_DEBUG_LIST to several other list operations
    
    When list debugging is enabled, we aim to readably show list corruption
    errors, and the basic list_add/list_del operations end up having extra
    debugging code in them to do some basic validation of the list entries.
    
    However, "list_del_init()" and "list_move[_tail]()" ended up avoiding
    the debug code due to how they were written. This fixes that.
    
    So the _next_ time we have list_move() problems with stale list entries,
    we'll hopefully have an easier time finding them..
    
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded
  2011-03-06 21:04       ` Andy Walls
  2011-03-07  2:34         ` Hugh Dickins
@ 2011-03-09  0:37         ` Andy Walls
  2011-03-11  0:34         ` Andy Walls
  2 siblings, 0 replies; 12+ messages in thread
From: Andy Walls @ 2011-03-09  0:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-kernel, akpm, David Miller, linux-media, Devin Heitmueller

On Sun, 2011-03-06 at 16:04 -0500, Andy Walls wrote:
> On Sun, 2011-03-06 at 10:37 -0800, Hugh Dickins wrote:

> > 
> > Thanks for all the effort you are putting into investigating this: you
> > deserve a better response than I can give you.
> > 
> > mm/vmalloc.c's vmap_area handling is entirely separate from
> > mm/mmap.c's vm_area_struct handling, yet both misbehaviors would be
> > explained if a next pointer has been corrupted to NULL.
> > 
> > Probably just coincidence that they both manifest that way, though the
> > underlying problem may turn out to be one.

> > If you have not already, it would be well worth turning on
> > CONFIG_DEBUG_LIST and CONFIG_DEBUG_SLAB or CONFIG_SLUB_DEBUG with
> > CONFIG_SLUB_DEBUG_ON.

> 
> >  But you are having trouble
> > reproducing it yourself?
> 
> I can't say yet.  I'm currently two for two.

After backing up the machine and testing again, I'm now 3 for 3.

This time it happened in the memset() in kernel/module.c:move_module()
when modprobe was trying to load the cx18-alsa.ko module.

        static int move_module(struct module *mod, struct load_info
        *info)
        {
                int i;
                void *ptr;
        
                /* Do the allocs. */
                ptr = module_alloc_update_bounds(mod->core_size);
                /*
                 * The pointer to this block is stored in the module structure
                 * which is inside the block. Just mark it as not being
        a
                 * leak.
                 */
                kmemleak_not_leak(ptr);
                if (!ptr)
                        return -ENOMEM;
        
                memset(ptr, 0, mod->core_size);   <----- Ooops/BUG
        
        /home/andy/cx18dev/git/media_tree/kernel/module.c:2529
            385c:       41 8b 8c 24 64 01 00    mov    0x164(%r12),%ecx
            3863:       00 
            3864:       31 c0                   xor    %eax,%eax
            3866:       48 89 d7                mov    %rdx,%rdi
            3869:       f3 aa                   rep stos %al,%es:(%rdi)  <----- Oops/BUG
        
ptr had a value of 0x0000000000001000

I'm starting a git bisect now.

Regards,
Andy


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded
  2011-03-06 21:04       ` Andy Walls
  2011-03-07  2:34         ` Hugh Dickins
  2011-03-09  0:37         ` Andy Walls
@ 2011-03-11  0:34         ` Andy Walls
  2011-03-11  0:47           ` Hugh Dickins
  2 siblings, 1 reply; 12+ messages in thread
From: Andy Walls @ 2011-03-11  0:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-kernel, akpm, David Miller, linux-media, Devin Heitmueller

On Sun, 2011-03-06 at 16:04 -0500, Andy Walls wrote:
> On Sun, 2011-03-06 at 10:37 -0800, Hugh Dickins wrote:

> > I do expect the underlying problem to be somewhere down the driver
> > end, given that nobody else has been reporting these issues.  I'm
> > hoping that once the cx18 guys have time to try to reproduce it,
> > they'll be better able to track it down.

Hi Hugh,

You were correct.  The mistake was in the cx18 driver, in the last thing
that I touched, of course.  The code causing the bug isn't anywhere
aside from my private repo.

All,

Sorry for all the noise.

The bug was so idiotic, I fell compelled to show the fix:

diff --git a/drivers/media/video/cx18/cx18-scb.c b/drivers/media/video/cx18/cx18
index fd89ad0..d17ffc8 100644
--- a/drivers/media/video/cx18/cx18-scb.c
+++ b/drivers/media/video/cx18/cx18-scb.c
@@ -28,8 +28,8 @@
 
 int cx18_scb_init_mdl_ent_mgmt(struct cx18 *cx)
 {
-       cx->scb_mdl_ent_map = kzalloc(BITS_TO_LONGS(SCB_MDL_ENTRIES),
-                                     GFP_KERNEL);
+       cx->scb_mdl_ent_map = kzalloc(BITS_TO_LONGS(SCB_MDL_ENTRIES)
+                                               * sizeof(long), GFP_KERNEL);
        if (cx->scb_mdl_ent_map == NULL) {
                CX18_ERR("Fatal: unable to allocate bitmap for managing SCB MDL"
                         "entries\n");

So now the subsequent call to

	bitmap_zero(cx->scb_mdl_ent_map, SCB_MDL_ENTRIES);

doesn't walk off the end of what was allocated.

Apparently BITS_TO_LONGS() is not BITS_TO_LONGS_TO_BYTES().

Regards,
Andy


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded
  2011-03-11  0:34         ` Andy Walls
@ 2011-03-11  0:47           ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2011-03-11  0:47 UTC (permalink / raw)
  To: Andy Walls
  Cc: linux-kernel, akpm, David Miller, linux-media, Devin Heitmueller

On Thu, Mar 10, 2011 at 4:34 PM, Andy Walls <awalls@md.metrocast.net> wrote:
> On Sun, 2011-03-06 at 16:04 -0500, Andy Walls wrote:
>> On Sun, 2011-03-06 at 10:37 -0800, Hugh Dickins wrote:
>
>> > I do expect the underlying problem to be somewhere down the driver
>> > end, given that nobody else has been reporting these issues.  I'm
>> > hoping that once the cx18 guys have time to try to reproduce it,
>> > they'll be better able to track it down.
>
> Hi Hugh,
>
> You were correct.  The mistake was in the cx18 driver, in the last thing
> that I touched, of course.  The code causing the bug isn't anywhere
> aside from my private repo.

Thanks a lot for reporting back, Andy: relief all round.

Hugh

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2011-03-11  0:47 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-03-04  2:06 BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded Andy Walls
2011-03-04 15:50 ` Devin Heitmueller
2011-03-04 17:13   ` Andy Walls
2011-03-07 10:32     ` Takashi Iwai
2011-03-05 21:59 ` Andy Walls
2011-03-06  2:03   ` Andy Walls
2011-03-06 18:37     ` Hugh Dickins
2011-03-06 21:04       ` Andy Walls
2011-03-07  2:34         ` Hugh Dickins
2011-03-09  0:37         ` Andy Walls
2011-03-11  0:34         ` Andy Walls
2011-03-11  0:47           ` Hugh Dickins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox