* BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded
@ 2011-03-04  2:06 Andy Walls
  2011-03-04 15:50 ` Devin Heitmueller
  2011-03-05 21:59 ` Andy Walls
  0 siblings, 2 replies; 12+ messages in thread
From: Andy Walls @ 2011-03-04  2:06 UTC (permalink / raw)
  To: linux-kernel, linux-media; +Cc: Devin Heitmueller

Hi,

I got a BUG when loading the cx18.ko module (which in turn requests the
cx18-alsa.ko module) on a kernel built from this repository

http://git.linuxtv.org/media_tree.git staging/for_v2.6.39

which I believe is based on 2.6.38-rc2.

The BUG is mmap related and I'm almost certain it has to do with
userspace accessing cx18-alsa.ko ALSA device nodes, since cx18.ko
doesn't provide any mmap() related file ops.

So here is my transcription of a fuzzy digital photo of the screen:

kernel BUG at /home/andy/cx18dev/git/media_tree/mm/mmap.c:2309!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/module/snd_pcm/initstate
Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple ...
...
Pid: 2580, comm: udevd Not tainted 2.6.38-rc2-cx18-vb2-proto+
RIP: 0010:[<ffffffff810eb50b>] [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0020000000000000
RDX: 0000000000160011 RSI: ffffea____c42___ RDI: 0000000000000202
RBP: ffff____18c1f_58 R08: ffff____________ R09: 0000000000000004
R10: ffff_______bb_38 R11: 0000000000000000 R12: ffff____344a6680
R13: 00007fff22______ R14: ffff____________ R15: 0000000000000001
...
CR2: 0000000000000000 ...
....
Process udevd (pid: 25__, threadinfo ffff________, ...
Stack:
000000000000015e ffff00003bc0e1d0 0000000000000246 ....
.....
Call Trace:
... mmput+0x63/0xcf
... exit_mm+0x132/0x13f
... do_exit+0x238/0x749
... ? __dequeue_signal+0xfa/0x12f
... do_group_exit+0x7d/0xa5
... get_signal_to_deliver+0x371/0x395
... do_signal+0x72/0x692
... ? do_page_fault+0x24a/0x391
... ? printk+0x41/0x47
... ? sigprocmask+0xa3/0xcd
... do_notify_resume+0x2c/0x64
... retint_signal+0x48/0x8c

Code: ff ff 48 8b 7d d8 4c 89 ea 31 f6 e8 3e fe ff ff 48 89 df e8 78 fe
ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 e0 00 00 00 00 74 04 <0f> 0b
eb fe 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 57
RIP [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
RSP <ffff880018c1fc28>
general protection fault: 0000 [#2] SMP
last sysfs file: /sys/devices/virtual/sound/card2/uevent
CPU 1
Modules linked in: cx18-alsa tda9887 tda8290 mxl5005s s5h1409
tuner_simple tuner_types cs5345 tuner cx18 dvb_core cx2341x v4l2_common
videodev v4l2_compat_ioctl32

I'm not very familiar with mmap() or ALSA and I did not author the
cx18-alsa part of the cx18 driver, so any hints at where to look for the
problem are appreciated.

Regards,
Andy

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded 2011-03-04 2:06 BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded Andy Walls @ 2011-03-04 15:50 ` Devin Heitmueller 2011-03-04 17:13 ` Andy Walls 2011-03-05 21:59 ` Andy Walls 1 sibling, 1 reply; 12+ messages in thread From: Devin Heitmueller @ 2011-03-04 15:50 UTC (permalink / raw) To: Andy Walls; +Cc: linux-kernel, linux-media On Thu, Mar 3, 2011 at 9:06 PM, Andy Walls <awalls@md.metrocast.net> wrote: > Hi, > > I got a BUG when loading the cx18.ko module (which in turn requests the > cx18-alsa.ko module) on a kernel built from this repository > > http://git.linuxtv.org/media_tree.git staging/for_v2.6.39 > > which I beleive is based on 2.6.38-rc2. > > The BUG is mmap related and I'm almost certain it has to do with > userspace accessing cx18-alsa.ko ALSA device nodes, since cx18.ko > doesn't provide any mmap() related file ops. > > So here is my transcription of a fuzzy digital photo of the screen: <snip> > I'm not very familiar with mmap() nor ALSA and I did not author the > cx18-alsa part of the cx18 driver, so any hints at where to look for the > problem are appreciated. Hi Andy, I'm traveling on business for about two weeks, so I won't be able to look into this right now. Any idea whether this is some new regression? I'm just trying to understand whether this is something that has always been there since I originally added the ALSA support to cx18 or whether it's something that is new, in which case it might make sense to drag the ALSA people into the conversation since there haven't been any changes in the cx18 driver lately. Devin -- Devin J. Heitmueller - Kernel Labs http://www.kernellabs.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded 2011-03-04 15:50 ` Devin Heitmueller @ 2011-03-04 17:13 ` Andy Walls 2011-03-07 10:32 ` Takashi Iwai 0 siblings, 1 reply; 12+ messages in thread From: Andy Walls @ 2011-03-04 17:13 UTC (permalink / raw) To: Devin Heitmueller; +Cc: linux-kernel, linux-media On Fri, 2011-03-04 at 10:50 -0500, Devin Heitmueller wrote: > On Thu, Mar 3, 2011 at 9:06 PM, Andy Walls <awalls@md.metrocast.net> wrote: > > Hi, > > > > I got a BUG when loading the cx18.ko module (which in turn requests the > > cx18-alsa.ko module) on a kernel built from this repository > > > > http://git.linuxtv.org/media_tree.git staging/for_v2.6.39 > > > > which I beleive is based on 2.6.38-rc2. > > > > The BUG is mmap related and I'm almost certain it has to do with > > userspace accessing cx18-alsa.ko ALSA device nodes, since cx18.ko > > doesn't provide any mmap() related file ops. > > > > So here is my transcription of a fuzzy digital photo of the screen: > <snip> > > I'm not very familiar with mmap() nor ALSA and I did not author the > > cx18-alsa part of the cx18 driver, so any hints at where to look for the > > problem are appreciated. > > Hi Andy, > > I'm traveling on business for about two weeks, so I won't be able to > look into this right now. > > Any idea whether this is some new regression? I do not know. I normally don't let cx18-alsa.ko load, due to PulseAudio's persistence at keeping the device nodes open (which makes unloading the cx18.ko module for development a hassle.) > I'm just trying to > understand whether this is something that has always been there since > I originally added the ALSA support to cx18 or whether it's something > that is new, in which case it might make sense to drag the ALSA people > into the conversation since there haven't been any changes in the cx18 > driver lately. I can add some information about what is going on in userspace. This was on a Fedora 10 machine. 
When device nodes show up, the HAL daemon and PulseAudio start using the device nodes right away.

That activity triggers cx18.ko to do a firmware load which gets udevd running to satisfy firmware requests, and then the cx18 driver issues some simple commands to the CX23418 firmware, which will have acknowledgment interrupts coming back from the CX23418. I resolved the firmware race in cx18*.ko a while ago, so I'm confident it's not an issue.

The BUG looks like some sort of mmap() race or memory management problem outside of the cx18*.ko modules, given that mmput(), which appears to be an mm-specific reference counting function, is involved.

It could also be in ALSA I guess. I'm not sure how things in cx18-alsa.ko can be screwed up so badly that it messes up the kernel's reference counting of mm structures.

I'll take a harder look at it myself this weekend, but the kernel mm system is a little out of my current realm of experience. Looks like I get to learn, because I'm not going to bisect a BUG() that halts the machine and risks disk corruption every time.

Regards,
Andy

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded 2011-03-04 17:13 ` Andy Walls @ 2011-03-07 10:32 ` Takashi Iwai 0 siblings, 0 replies; 12+ messages in thread From: Takashi Iwai @ 2011-03-07 10:32 UTC (permalink / raw) To: Andy Walls; +Cc: Devin Heitmueller, linux-kernel, linux-media At Fri, 04 Mar 2011 12:13:04 -0500, Andy Walls wrote: > > On Fri, 2011-03-04 at 10:50 -0500, Devin Heitmueller wrote: > > On Thu, Mar 3, 2011 at 9:06 PM, Andy Walls <awalls@md.metrocast.net> wrote: > > > Hi, > > > > > > I got a BUG when loading the cx18.ko module (which in turn requests the > > > cx18-alsa.ko module) on a kernel built from this repository > > > > > > http://git.linuxtv.org/media_tree.git staging/for_v2.6.39 > > > > > > which I beleive is based on 2.6.38-rc2. > > > > > > The BUG is mmap related and I'm almost certain it has to do with > > > userspace accessing cx18-alsa.ko ALSA device nodes, since cx18.ko > > > doesn't provide any mmap() related file ops. > > > > > > So here is my transcription of a fuzzy digital photo of the screen: > > <snip> > > > I'm not very familiar with mmap() nor ALSA and I did not author the > > > cx18-alsa part of the cx18 driver, so any hints at where to look for the > > > problem are appreciated. > > > > Hi Andy, > > > > I'm traveling on business for about two weeks, so I won't be able to > > look into this right now. > > > > Any idea whether this is some new regression? > > I do not know. I normally don't let cx18-alsa.ko load, due to > PulseAudio's persistence at keeping the device nodes open (which makes > unloading the cx18.ko module for development a hassle.) > > > > I'm just trying to > > understand whether this is something that has always been there since > > I originally added the ALSA support to cx18 or whether it's something > > that is new, in which case it might make sense to drag the ALSA people > > into the conversation since there haven't been any changes in the cx18 > > driver lately. 
> > I can add some information about what is going on in userspace. This > was on a Fedora 10 machine. When devices nodes show up, the HAL daemon > and PulseAudio start using the device nodes right away. > > That activity triggers cx18.ko to do a firmware load which gets udevd > running to satisfy firmware requests, and then the cx18 driver issues > some simple commands to the CX23418 firmware, which will have > acknowledgment interrupts coming back from the CX23418. I resolved the > firmware race in cx18*.ko a while ago, so I'm confident its not an > issue. > > The BUG looks like some sort of mmap() race or memory management problem > outside of the cx18*.ko modules, given that mmput(), which appears to be > an mm specific reference counting function, is involved. > > It could also be in ALSA I guess. There is no change in ALSA core regarding mmap for really long time. If it's a regression, it must be triggered by some other changes. Takashi ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded 2011-03-04 2:06 BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded Andy Walls 2011-03-04 15:50 ` Devin Heitmueller @ 2011-03-05 21:59 ` Andy Walls 2011-03-06 2:03 ` Andy Walls 1 sibling, 1 reply; 12+ messages in thread From: Andy Walls @ 2011-03-05 21:59 UTC (permalink / raw) To: linux-kernel Cc: akpm, David Miller, linux-media, Devin Heitmueller, Hugh Dickins, Hugh Dickins On Thu, 2011-03-03 at 21:06 -0500, Andy Walls wrote: > Hi, > > I got a BUG when loading the cx18.ko module (which in turn requests the > cx18-alsa.ko module) on a kernel built from this repository > > http://git.linuxtv.org/media_tree.git staging/for_v2.6.39 > > which I beleive is based on 2.6.38-rc2. [snip] > So here is my transcription of a fuzzy digital photo of the screen: > > kernel BUG at /home/andy/cx18dev/git/media_tree/mm/mmap.c:2309! > invalid opcode: 0000 [#1] SMP > last sysfs file: /sys/module/snd_pcm/initstate > Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple ... > ... > Pid: 2580, comm: udevd Not tainted 2.6.38-rc2-cx18-vb2-proto+ > RIP: 0010:[<ffffffff810eb50b>] [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0020000000000000 > RDX: 0000000000160011 RSI: ffffea____c42___ RDI: 0000000000000202 > RBP: ffff____18c1f_58 R08: ffff____________ R09: 0000000000000004 > R10: ffff_______bb_38 R11: 0000000000000000 R12: ffff____344a6680 > R13: 00007fff22______ R14: ffff____________ R15: 0000000000000001 > ... > CR2: 0000000000000000 ... > .... > Process udevd (pid: 25__, threadinfo ffff________, ... > Stack: > 000000000000015e ffff00003bc0e1d0 0000000000000246 .... > ..... > Call Trace: > ... mmput+0x63/0xcf > ... exit_mm+0x132/0x13f > ... do_exit+0x238/0x749 > ... ? __dequeue_signal+0xfa/0x12f > ... do_group_exit+0x7d/0xa5 > ... get_signal_to_deliver+0x371/0x395 > ... do_signal+0x72/0x692 > ... ? do_page_fault+0x24a/0x391 > ... ? printk+0x41/0x47 > ... ? 
sigprocmask+0xa3/0xcd
> ... do_notify_resume+0x2c/0x64
> ... retint_signal+0x48/0x8c
>
> Code: ff ff 48 8b 7d d8 4c 89 ea 31 f6 e8 3e fe ff ff 48 89 df e8 78 fe
> ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 e0 00 00 00 00 74 04 <0f> 0b
> eb fe 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 57
> RIP [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
> RSP <ffff880018c1fc28>
> general protection fault: 0000 [#2] SMP
> last sysfs file: /sys/devices/virtual/sound/card2/uevent
> CPU 1
> Modules linked in: cx18-alsa tda9887 tda8290 mxl5005s s5h1409
> tuner_simple tuner_types cs5345 tuner cx18 dvb_core cx2341x v4l2_common
> videodev v4l2_compat_ioctl32

I'm dumping all my previous assumptions about this BUG. After a bit of reading, all I can say is that it's a page table deallocation problem at process exit. After all the page table deallocations on exit, mm->nr_ptes is still > 0, and that's a bad thing.

It apparently happened in a child udevd exiting shortly after cx18.ko loaded. The cx18 driver allocating large amounts of kernel memory for DMA buffers upon load may be related to triggering the problem, but I doubt it is a root cause of the BUG.

This monstrous thread from 5 years ago is somewhat enlightening:

http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1680.html
http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1787.html

so it gives me a place to start looking for the problem.

Any advice on what data to collect is appreciated.

Regards,
Andy

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded 2011-03-05 21:59 ` Andy Walls @ 2011-03-06 2:03 ` Andy Walls 2011-03-06 18:37 ` Hugh Dickins 0 siblings, 1 reply; 12+ messages in thread From: Andy Walls @ 2011-03-06 2:03 UTC (permalink / raw) To: linux-kernel Cc: akpm, David Miller, linux-media, Devin Heitmueller, Hugh Dickins On Sat, 2011-03-05 at 16:59 -0500, Andy Walls wrote: > On Thu, 2011-03-03 at 21:06 -0500, Andy Walls wrote: > > Hi, > > > > I got a BUG when loading the cx18.ko module (which in turn requests the > > cx18-alsa.ko module) on a kernel built from this repository > > > > http://git.linuxtv.org/media_tree.git staging/for_v2.6.39 > > > > which I beleive is based on 2.6.38-rc2. > > [snip] > > > So here is my transcription of a fuzzy digital photo of the screen: > > > > kernel BUG at /home/andy/cx18dev/git/media_tree/mm/mmap.c:2309! > > invalid opcode: 0000 [#1] SMP > > last sysfs file: /sys/module/snd_pcm/initstate > > Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple ... > > ... > > Pid: 2580, comm: udevd Not tainted 2.6.38-rc2-cx18-vb2-proto+ > > RIP: 0010:[<ffffffff810eb50b>] [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0020000000000000 > > RDX: 0000000000160011 RSI: ffffea____c42___ RDI: 0000000000000202 > > RBP: ffff____18c1f_58 R08: ffff____________ R09: 0000000000000004 > > R10: ffff_______bb_38 R11: 0000000000000000 R12: ffff____344a6680 > > R13: 00007fff22______ R14: ffff____________ R15: 0000000000000001 > > ... > > CR2: 0000000000000000 ... > > .... > > Process udevd (pid: 25__, threadinfo ffff________, ... > > Stack: > > 000000000000015e ffff00003bc0e1d0 0000000000000246 .... > > ..... > > Call Trace: > > ... mmput+0x63/0xcf > > ... exit_mm+0x132/0x13f > > ... do_exit+0x238/0x749 > > ... ? __dequeue_signal+0xfa/0x12f > > ... do_group_exit+0x7d/0xa5 > > ... get_signal_to_deliver+0x371/0x395 > > ... do_signal+0x72/0x692 > > ... ? 
do_page_fault+0x24a/0x391 > > ... ? printk+0x41/0x47 > > ... ? sigprocmask+0xa3/0xcd > > ... do_notify_resume+0x2c/0x64 > > ... retint_signal+0x48/0x8c > > > > Code: ff ff 48 8b 7d d8 4c 89 ea 31 f6 e8 3e fe ff ff 48 89 df e8 78 fe > > ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 e0 00 00 00 00 74 04 <0f> 0b > > eb fe 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 57 > > RIP [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e > > RSP <ffff880018c1fc28> > > general protection fault: 0000 [#2] SMP > > last sysfs file: /sys/devices/virtual/sound/card2/uevent > > CPU 1 > > Modules linked in: cx18-alsa tda9887 tda8290 mxl5005s s5h1409 > > tuner_simple tuner_types cs5345 tuner cx18 dvb_core cx2341x v4l2_common > > videodev v4l2_compat_ioctl32 > > > I'm dumping all my previous assumtpions about this BUG. After a bit of > reading, all I can say is that it's a page table deallocation problem at > process exit. After all the page table deallocations on exit, > mm->nr_ptes is still > 0, and that's a bad thing. > > It apparently happened in a child udevd exiting shortly after cx18.ko > loaded. The cx18 driver allocating large amounts kernel memory for DMA > buffers upon load may be related to triggering the problem, but I doubt > it is a root cause of the BUG. > > > This monsterous thread from 5 years ago is somewhat enlightening: > > http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1680.html > http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1787.html > > so it gives me a place to start looking for the problem. > > Any advice on what data to collect is appreciated. 
When attempting to reproduce this BUG, I got another bug related to
memory management:

(Details hand-typed from a photo):
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff010f22fa>] remove_vm_area+0x42/0x77
PGD 37cdd067 PUD 336c__67 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:14.4/0000:03:00.0/firmware/0000:03:00.0/loading
CPU 0
Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple tuner_types cx5345 tuner cx18(+) dvb_core cx2341x ...
Pid: 2470, comm: work_for_cpu Tainted: G W 2.6.28-rc2-cx18-vb2-proto+
RIP: 0010:[<ffffffff010f22fa>] [<ffffffff010f22fa>] remove_vm_area+0x42/0x77
...
RAX: 0000000000000000 RBX: ffff____35e7c540 RCX: 0000000000001000
RDX: 0000000000000000 ....
...
CR2: 0000000000000000 ....
Stack:
ffff__0011485968 000000000000001 ffff____1147dc9_ ffffffff_1_f23__
....
Call Trace:
... __vunmap+0x3e/0xbd
... vfree+0x2e/0x30
... dvb_dmx_init+0x7e/0x253 [dvb_core]
... cx18_dvb_register+0xd2/0x75c [cx18]
... cx18_streams_resgister+0x6a/0x26a [cx18]
... cx18_streams_setup+0x3cc/0x486 [cx18]
... cx18_probe+0x11cc/0x12fb [cx18]
......

The code appears to be failing here:

/home/andy/cx18dev/git/media_tree/mm/vmalloc.c:1352
161d: eb 06 jmp 1625 <remove_vm_area+0x45>
161f: 48 89 c2 mov %rax,%rdx
1622: 48 8b 00 mov (%rax),%rax <--- Oops p = &tmp->next) (tmp = *p)
1625: 48 39 d8 cmp %rbx,%rax (tmp = *p) != vm;
1628: 75 f5 jne 161f <remove_vm_area+0x3f>
/home/andy/cx18dev/git/media_tree/mm/vmalloc.c:1354

Corresponding to this code in mm/vmalloc.c:

struct vm_struct *remove_vm_area(const void *addr)
{
        struct vmap_area *va;

        va = find_vmap_area((unsigned long)addr);
        if (va && va->flags & VM_VM_AREA) {
                struct vm_struct *vm = va->private;
                struct vm_struct *tmp, **p;
                /*
                 * remove from list and disallow access to this vm_struct
                 * before unmap. (address range confliction is maintained by
                 * vmap.)
                 */
                write_lock(&vmlist_lock);
                for (p = &vmlist; (tmp = *p) != vm; p = &tmp->next) <--- Ooops
                        ;
[...]

That for() loop appears to assume the vm_struct will be on the vmlist
somewhere. If it isn't, then I suppose the for() loop could end up
doing a NULL dereference.

This BUG happened in the final stages of the cx18 driver setting up a
CX23418 card instance. I have 2 cards in this machine, so a number of
buffers had certainly been allocated using kmalloc(). The code in
dvb_core that hit the BUG in this case was this:

int dvb_dmx_init(struct dvb_demux *dvbdemux)
{
        int i;
        struct dmx_demux *dmx = &dvbdemux->dmx;

        dvbdemux->cnt_storage = NULL;
        dvbdemux->users = 0;
        dvbdemux->filter = vmalloc(dvbdemux->filternum * sizeof(struct dvb_demux_filter));

        if (!dvbdemux->filter)
                return -ENOMEM;

        dvbdemux->feed = vmalloc(dvbdemux->feednum * sizeof(struct dvb_demux_feed));
        if (!dvbdemux->feed) {
                vfree(dvbdemux->filter); <------- BUG/Oops happened in this call
                dvbdemux->filter = NULL;
                return -ENOMEM;
        }
...

Which is kind of interesting:
1. The first vmalloc() succeeded.
2. The second vmalloc() failed.
3. The vfree() of the pointer from the first vmalloc() caused an
Oops/BUG.

I'm not sure where to go from here.

Regards,
Andy

> Regards,
> Andy
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-media" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 12+ messages in thread
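The list walk Andy points at in remove_vm_area() can be modelled in userspace. This is only a minimal sketch, not kernel code: struct node and unlink_node() are hypothetical names standing in for struct vm_struct and the quoted for() loop, and the NULL test is an addition made for illustration that the quoted kernel source does not have. Without that test, a target that is not on the list leaves tmp NULL at the end of the walk, and the next read of *p goes through &tmp->next. That would fit the reported fault at `mov (%rax),%rax` with RAX = 0, assuming `next` sits at offset 0 as the displacement-free load suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vm_struct; only the link matters here. */
struct node {
    struct node *next;  /* first member, so &tmp->next == (void *)tmp */
    int id;
};

/*
 * Guarded variant of the quoted walk
 *     for (p = &vmlist; (tmp = *p) != vm; p = &tmp->next) ;
 * The loop stops when the list ends instead of assuming the target is
 * always present. Returns 1 if target was found and unlinked, 0 otherwise.
 */
int unlink_node(struct node **head, struct node *target)
{
    struct node **p, *tmp;

    assert(head != NULL);
    for (p = head; (tmp = *p) != NULL; p = &tmp->next) {
        if (tmp == target) {
            *p = tmp->next;   /* splice the node out of the list */
            tmp->next = NULL;
            return 1;
        }
    }
    return 0;                 /* target was never on the list */
}
```

Calling unlink_node() with a node that was never linked returns 0 rather than faulting. Whether the kernel loop should carry such a check is a separate question; the sketch only shows why a missing or corrupted list entry would produce a NULL dereference at that instruction.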
* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded 2011-03-06 2:03 ` Andy Walls @ 2011-03-06 18:37 ` Hugh Dickins 2011-03-06 21:04 ` Andy Walls 0 siblings, 1 reply; 12+ messages in thread From: Hugh Dickins @ 2011-03-06 18:37 UTC (permalink / raw) To: Andy Walls Cc: linux-kernel, akpm, David Miller, linux-media, Devin Heitmueller On Sat, Mar 5, 2011 at 6:03 PM, Andy Walls <awalls@md.metrocast.net> wrote: > On Sat, 2011-03-05 at 16:59 -0500, Andy Walls wrote: >> On Thu, 2011-03-03 at 21:06 -0500, Andy Walls wrote: >> > Hi, >> > >> > I got a BUG when loading the cx18.ko module (which in turn requests the >> > cx18-alsa.ko module) on a kernel built from this repository >> > >> > http://git.linuxtv.org/media_tree.git staging/for_v2.6.39 >> > >> > which I beleive is based on 2.6.38-rc2. >> >> [snip] >> >> > So here is my transcription of a fuzzy digital photo of the screen: >> > >> > kernel BUG at /home/andy/cx18dev/git/media_tree/mm/mmap.c:2309! >> > invalid opcode: 0000 [#1] SMP >> > last sysfs file: /sys/module/snd_pcm/initstate >> > Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple ... >> > ... >> > Pid: 2580, comm: udevd Not tainted 2.6.38-rc2-cx18-vb2-proto+ >> > RIP: 0010:[<ffffffff810eb50b>] [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e >> > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0020000000000000 >> > RDX: 0000000000160011 RSI: ffffea____c42___ RDI: 0000000000000202 >> > RBP: ffff____18c1f_58 R08: ffff____________ R09: 0000000000000004 >> > R10: ffff_______bb_38 R11: 0000000000000000 R12: ffff____344a6680 >> > R13: 00007fff22______ R14: ffff____________ R15: 0000000000000001 >> > ... >> > CR2: 0000000000000000 ... >> > .... >> > Process udevd (pid: 25__, threadinfo ffff________, ... >> > Stack: >> > 000000000000015e ffff00003bc0e1d0 0000000000000246 .... >> > ..... >> > Call Trace: >> > ... mmput+0x63/0xcf >> > ... exit_mm+0x132/0x13f >> > ... do_exit+0x238/0x749 >> > ... ? __dequeue_signal+0xfa/0x12f >> > ... 
do_group_exit+0x7d/0xa5 >> > ... get_signal_to_deliver+0x371/0x395 >> > ... do_signal+0x72/0x692 >> > ... ? do_page_fault+0x24a/0x391 >> > ... ? printk+0x41/0x47 >> > ... ? sigprocmask+0xa3/0xcd >> > ... do_notify_resume+0x2c/0x64 >> > ... retint_signal+0x48/0x8c >> > >> > Code: ff ff 48 8b 7d d8 4c 89 ea 31 f6 e8 3e fe ff ff 48 89 df e8 78 fe >> > ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 e0 00 00 00 00 74 04 <0f> 0b >> > eb fe 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 57 >> > RIP [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e >> > RSP <ffff880018c1fc28> >> > general protection fault: 0000 [#2] SMP >> > last sysfs file: /sys/devices/virtual/sound/card2/uevent >> > CPU 1 >> > Modules linked in: cx18-alsa tda9887 tda8290 mxl5005s s5h1409 >> > tuner_simple tuner_types cs5345 tuner cx18 dvb_core cx2341x v4l2_common >> > videodev v4l2_compat_ioctl32 >> >> >> I'm dumping all my previous assumtpions about this BUG. After a bit of >> reading, all I can say is that it's a page table deallocation problem at >> process exit. After all the page table deallocations on exit, >> mm->nr_ptes is still > 0, and that's a bad thing. >> >> It apparently happened in a child udevd exiting shortly after cx18.ko >> loaded. The cx18 driver allocating large amounts kernel memory for DMA >> buffers upon load may be related to triggering the problem, but I doubt >> it is a root cause of the BUG. >> >> >> This monsterous thread from 5 years ago is somewhat enlightening: >> >> http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1680.html >> http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1787.html >> >> so it gives me a place to start looking for the problem. >> >> Any advice on what data to collect is appreciated. 
> > When attemtping to reproduce this BUG, I got another bug related to > memory management: > > (Details handtyped from a photo): > BUG: unable to handle kernel NULL pointer dereference at (null) > IP: [<ffffffff010f22fa>] remove_vm_area+0x42/0x77 > PGD 37cdd067 PUD 336c__67 PMD 0 > Oops: 0000 [#1] SMP > last sysfs file: /sys/devices/pci0000:00/0000:00:14.4/0000:03:00.0/firmware/0000:03:00.0/loading > CPU 0 > Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple tuner_types cx5345 tuner cx18(+) dvb_core cx2341x ... > Pid: 2470, comm: work_for_cpu Tainted: G W 2.6.28-rc2-cx18-vb2-proto+ > RIP: 0010:[<ffffffff010f22fa>] [<ffffffff010f22fa>] remove_vm_area+0x42/0x77 > ... > RAX: 0000000000000000 RBX: ffff____35e7c540 RCX: 0000000000001000 > RDX: 0000000000000000 .... > ... > CR2: 0000000000000000 .... > Stack: > ffff__0011485968 000000000000001 ffff____1147dc9_ ffffffff_1_f23__ > .... > Call Trace: > ... __vunmap+0x3e/0xbd > ... vfree+0x2e/0x30 > ... dvb_dmx_init+0x7e/0x253 [dvb_core] > ... cx18_dvb_register+0xd2/0x75c [cx18] > ... cx18_streams_resgister+0x6a/0x26a [cx18] > ... cx18_streams_setup+0x3cc/0x486 [cx18] > ... cx18_probe+0x11cc/0x12fb [cx18] > ...... > > The code appears to be failing here: > > /home/andy/cx18dev/git/media_tree/mm/vmalloc.c:1352 > 161d: eb 06 jmp 1625 <remove_vm_area+0x45> > 161f: 48 89 c2 mov %rax,%rdx > 1622: 48 8b 00 mov (%rax),%rax <--- Oops p = &tmp->next) (tmp = *p) > 1625: 48 39 d8 cmp %rbx,%rax (tmp = *p) != vm; > 1628: 75 f5 jne 161f <remove_vm_area+0x3f> > /home/andy/cx18dev/git/media_tree/mm/vmalloc.c:1354 > > Corresponding to this code in mm/vmalloc.c: > > struct vm_struct *remove_vm_area(const void *addr) > { > struct vmap_area *va; > > va = find_vmap_area((unsigned long)addr); > if (va && va->flags & VM_VM_AREA) { > struct vm_struct *vm = va->private; > struct vm_struct *tmp, **p; > /* > * remove from list and disallow access to this vm_struct > * before unmap. (address range confliction is maintained by > * vmap.) 
> */ > write_lock(&vmlist_lock); > for (p = &vmlist; (tmp = *p) != vm; p = &tmp->next) <--- Ooops > ; > [...] > > That for() loop appears to assume the vm_struct will be on the vmlist > somewhere. If it isn't, then I suppose the for() loop could end up > doing a NULL dereference. > > This BUG happened in the final stages of the cx18 driver setting up a > CX23418 card instance. I have 2 cards in this machine, so a number of > buffers had certainly been allocated using kmalloc(). The code in the > dvb_core that is failing got BUG'ed in this case was this: > > int dvb_dmx_init(struct dvb_demux *dvbdemux) > { > int i; > struct dmx_demux *dmx = &dvbdemux->dmx; > > dvbdemux->cnt_storage = NULL; > dvbdemux->users = 0; > dvbdemux->filter = vmalloc(dvbdemux->filternum * sizeof(struct dvb_demux_filter)); > > if (!dvbdemux->filter) > return -ENOMEM; > > dvbdemux->feed = vmalloc(dvbdemux->feednum * sizeof(struct dvb_demux_feed)); > if (!dvbdemux->feed) { > vfree(dvbdemux->filter); <------- BUG/Oops happened in this call > dvbdemux->filter = NULL; > return -ENOMEM; > } > ... > > Which is kind of interesting: > 1. The first vmalloc() succeeded. > 2. The second vmalloc() failed. > 3. The vfree() of the pointer from the first vmalloc() caused an > Oops/BUG. > > I'm not sure where to go from here. Thanks for all the effort you are putting into investigating this: you deserve a better response than I can give you. mm/vmalloc.c's vmap_area handling is entirely separate from mm/mmap.c's vm_area_struct handling, yet both misbehaviors would be explained if a next pointer has been corrupted to NULL. Probably just coincidence that they both manifest that way, though the underlying problem may turn out to be one. If you have not already, it would be well worth turning on CONFIG_DEBUG_LIST and CONFIG_DEBUG_SLAB or CONFIG_SLUB_DEBUG with CONFIG_SLUB_DEBUG_ON. If that BUG_ON(mm->nr_ptes ...) 
in exit_mmap() is preventing you from getting on with your work, or slowing down reproduction of the testcase, you should be able to replace it by a WARN_ON. You will probably leak at least one page (the page table) and perhaps many pages (those that that page table points to) each time it hits, but it shouldn't actually be unsafe to continue - it's really a development BUG_ON, to check that new architectures added are freeing all the page tables they have allocated.

I do expect the underlying problem to be somewhere down the driver end, given that nobody else has been reporting these issues. I'm hoping that once the cx18 guys have time to try to reproduce it, they'll be better able to track it down.

But you are having trouble reproducing it yourself? Hitting this vmalloc one before you could reproduce the exit_mmap one? No chance to bisect it to a particular commit if you cannot reliably reproduce it.

There was a horrid list corruption bug in early 2.6.38-rc, fixed in -rc6; but although I guess it could cause all kinds of havoc, its particular signature was not like this, so I don't really believe that one was to blame here.

Hugh

^ permalink raw reply [flat|nested] 12+ messages in thread
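The difference between halting on the nr_ptes check and merely warning can be illustrated with a small userspace model. Everything in it is hypothetical: MODEL_WARN_ON, struct mm_model and finish_teardown() are illustration names, and the `> 0` condition follows Andy's description of the check rather than the actual expression at mm/mmap.c:2309, which is elided in this thread.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical userspace model of a WARN_ON-style check: log the failed
 * condition, evaluate to 1 if it was true, and let execution continue.
 */
#define MODEL_WARN_ON(cond) \
    ((cond) ? (fprintf(stderr, "WARNING: %s (%s:%d)\n", \
                       #cond, __FILE__, __LINE__), 1) : 0)

/* Hypothetical stand-in for the one mm_struct field the check looks at. */
struct mm_model {
    unsigned long nr_ptes;  /* page-table pages still accounted to this mm */
};

/*
 * Model of the end of address-space teardown: 'freed' page-table pages were
 * released. Returns how many are still accounted (that is, leaked), warning
 * instead of halting when the count is non-zero.
 */
unsigned long finish_teardown(struct mm_model *mm, unsigned long freed)
{
    assert(mm != NULL);
    mm->nr_ptes -= (freed < mm->nr_ptes) ? freed : mm->nr_ptes;
    if (MODEL_WARN_ON(mm->nr_ptes > 0))
        return mm->nr_ptes;  /* leak is reported, the caller keeps running */
    return 0;
}
```

With three page-table pages accounted and only two freed, finish_teardown() prints a warning and returns 1, which corresponds to the "leak at least one page" outcome described above; a halting check would stop at the same point instead.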
* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded 2011-03-06 18:37 ` Hugh Dickins @ 2011-03-06 21:04 ` Andy Walls 2011-03-07 2:34 ` Hugh Dickins ` (2 more replies) 0 siblings, 3 replies; 12+ messages in thread From: Andy Walls @ 2011-03-06 21:04 UTC (permalink / raw) To: Hugh Dickins Cc: linux-kernel, akpm, David Miller, linux-media, Devin Heitmueller On Sun, 2011-03-06 at 10:37 -0800, Hugh Dickins wrote: > On Sat, Mar 5, 2011 at 6:03 PM, Andy Walls <awalls@md.metrocast.net> wrote: > > On Sat, 2011-03-05 at 16:59 -0500, Andy Walls wrote: > >> On Thu, 2011-03-03 at 21:06 -0500, Andy Walls wrote: > >> > Hi, > >> > > >> > I got a BUG when loading the cx18.ko module (which in turn requests the > >> > cx18-alsa.ko module) on a kernel built from this repository > >> > > >> > http://git.linuxtv.org/media_tree.git staging/for_v2.6.39 > >> > > >> > which I beleive is based on 2.6.38-rc2. > >> > >> [snip] > >> > >> > So here is my transcription of a fuzzy digital photo of the screen: > >> > > >> > kernel BUG at /home/andy/cx18dev/git/media_tree/mm/mmap.c:2309! > >> > invalid opcode: 0000 [#1] SMP > >> > last sysfs file: /sys/module/snd_pcm/initstate > >> > Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple ... > >> > ... > >> > Pid: 2580, comm: udevd Not tainted 2.6.38-rc2-cx18-vb2-proto+ > >> > RIP: 0010:[<ffffffff810eb50b>] [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e > >> > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0020000000000000 > >> > RDX: 0000000000160011 RSI: ffffea____c42___ RDI: 0000000000000202 > >> > RBP: ffff____18c1f_58 R08: ffff____________ R09: 0000000000000004 > >> > R10: ffff_______bb_38 R11: 0000000000000000 R12: ffff____344a6680 > >> > R13: 00007fff22______ R14: ffff____________ R15: 0000000000000001 > >> > ... > >> > CR2: 0000000000000000 ... > >> > .... > >> > Process udevd (pid: 25__, threadinfo ffff________, ... > >> > Stack: > >> > 000000000000015e ffff00003bc0e1d0 0000000000000246 .... > >> > ..... 
> >> > Call Trace: > >> > ... mmput+0x63/0xcf > >> > ... exit_mm+0x132/0x13f > >> > ... do_exit+0x238/0x749 > >> > ... ? __dequeue_signal+0xfa/0x12f > >> > ... do_group_exit+0x7d/0xa5 > >> > ... get_signal_to_deliver+0x371/0x395 > >> > ... do_signal+0x72/0x692 > >> > ... ? do_page_fault+0x24a/0x391 > >> > ... ? printk+0x41/0x47 > >> > ... ? sigprocmask+0xa3/0xcd > >> > ... do_notify_resume+0x2c/0x64 > >> > ... retint_signal+0x48/0x8c > >> > > >> > Code: ff ff 48 8b 7d d8 4c 89 ea 31 f6 e8 3e fe ff ff 48 89 df e8 78 fe > >> > ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 e0 00 00 00 00 74 04 <0f> 0b > >> > eb fe 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 57 > >> > RIP [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e > >> > RSP <ffff880018c1fc28> > >> > general protection fault: 0000 [#2] SMP > >> > last sysfs file: /sys/devices/virtual/sound/card2/uevent > >> > CPU 1 > >> > Modules linked in: cx18-alsa tda9887 tda8290 mxl5005s s5h1409 > >> > tuner_simple tuner_types cs5345 tuner cx18 dvb_core cx2341x v4l2_common > >> > videodev v4l2_compat_ioctl32 > >> > >> > >> I'm dumping all my previous assumtpions about this BUG. After a bit of > >> reading, all I can say is that it's a page table deallocation problem at > >> process exit. After all the page table deallocations on exit, > >> mm->nr_ptes is still > 0, and that's a bad thing. > >> > >> It apparently happened in a child udevd exiting shortly after cx18.ko > >> loaded. The cx18 driver allocating large amounts kernel memory for DMA > >> buffers upon load may be related to triggering the problem, but I doubt > >> it is a root cause of the BUG. > >> > >> > >> This monsterous thread from 5 years ago is somewhat enlightening: > >> > >> http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1680.html > >> http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1787.html > >> > >> so it gives me a place to start looking for the problem. > >> > >> Any advice on what data to collect is appreciated. 
> > > > When attemtping to reproduce this BUG, I got another bug related to > > memory management: > > > > (Details handtyped from a photo): > > BUG: unable to handle kernel NULL pointer dereference at (null) > > IP: [<ffffffff010f22fa>] remove_vm_area+0x42/0x77 > > PGD 37cdd067 PUD 336c__67 PMD 0 > > Oops: 0000 [#1] SMP > > last sysfs file: /sys/devices/pci0000:00/0000:00:14.4/0000:03:00.0/firmware/0000:03:00.0/loading > > CPU 0 > > Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple tuner_types cx5345 tuner cx18(+) dvb_core cx2341x ... > > Pid: 2470, comm: work_for_cpu Tainted: G W 2.6.28-rc2-cx18-vb2-proto+ > > RIP: 0010:[<ffffffff010f22fa>] [<ffffffff010f22fa>] remove_vm_area+0x42/0x77 > > ... > > RAX: 0000000000000000 RBX: ffff____35e7c540 RCX: 0000000000001000 > > RDX: 0000000000000000 .... > > ... > > CR2: 0000000000000000 .... > > Stack: > > ffff__0011485968 000000000000001 ffff____1147dc9_ ffffffff_1_f23__ > > .... > > Call Trace: > > ... __vunmap+0x3e/0xbd > > ... vfree+0x2e/0x30 > > ... dvb_dmx_init+0x7e/0x253 [dvb_core] > > ... cx18_dvb_register+0xd2/0x75c [cx18] > > ... cx18_streams_resgister+0x6a/0x26a [cx18] > > ... cx18_streams_setup+0x3cc/0x486 [cx18] > > ... cx18_probe+0x11cc/0x12fb [cx18] > > ...... 
> > > > The code appears to be failing here: > > > > /home/andy/cx18dev/git/media_tree/mm/vmalloc.c:1352 > > 161d: eb 06 jmp 1625 <remove_vm_area+0x45> > > 161f: 48 89 c2 mov %rax,%rdx > > 1622: 48 8b 00 mov (%rax),%rax <--- Oops p = &tmp->next) (tmp = *p) > > 1625: 48 39 d8 cmp %rbx,%rax (tmp = *p) != vm; > > 1628: 75 f5 jne 161f <remove_vm_area+0x3f> > > /home/andy/cx18dev/git/media_tree/mm/vmalloc.c:1354 > > > > Corresponding to this code in mm/vmalloc.c: > > > > struct vm_struct *remove_vm_area(const void *addr) > > { > > struct vmap_area *va; > > > > va = find_vmap_area((unsigned long)addr); > > if (va && va->flags & VM_VM_AREA) { > > struct vm_struct *vm = va->private; > > struct vm_struct *tmp, **p; > > /* > > * remove from list and disallow access to this vm_struct > > * before unmap. (address range confliction is maintained by > > * vmap.) > > */ > > write_lock(&vmlist_lock); > > for (p = &vmlist; (tmp = *p) != vm; p = &tmp->next) <--- Ooops > > ; > > [...] > > > > That for() loop appears to assume the vm_struct will be on the vmlist > > somewhere. If it isn't, then I suppose the for() loop could end up > > doing a NULL dereference. > > > > This BUG happened in the final stages of the cx18 driver setting up a > > CX23418 card instance. I have 2 cards in this machine, so a number of > > buffers had certainly been allocated using kmalloc(). 
The code in the > > dvb_core that is failing got BUG'ed in this case was this: > > > > int dvb_dmx_init(struct dvb_demux *dvbdemux) > > { > > int i; > > struct dmx_demux *dmx = &dvbdemux->dmx; > > > > dvbdemux->cnt_storage = NULL; > > dvbdemux->users = 0; > > dvbdemux->filter = vmalloc(dvbdemux->filternum * sizeof(struct dvb_demux_filter)); > > > > if (!dvbdemux->filter) > > return -ENOMEM; > > > > dvbdemux->feed = vmalloc(dvbdemux->feednum * sizeof(struct dvb_demux_feed)); > > if (!dvbdemux->feed) { > > vfree(dvbdemux->filter); <------- BUG/Oops happened in this call > > dvbdemux->filter = NULL; > > return -ENOMEM; > > } > > ... > > > > Which is kind of interesting: > > 1. The first vmalloc() succeeded. > > 2. The second vmalloc() failed. > > 3. The vfree() of the pointer from the first vmalloc() caused an > > Oops/BUG. > > > > I'm not sure where to go from here. > > Thanks for all the effort you are putting into investigating this: you > deserve a better response than I can give you. > > mm/vmalloc.c's vmap_area handling is entirely separate from > mm/mmap.c's vm_area_struct handling, yet both misbehaviors would be > explained if a next pointer has been corrupted to NULL. > > Probably just coincidence that they both manifest that way, though the > underlying problem may turn out to be one. Hi Hugh, I suspect the underlying problem is the same. Because VM area and Vmalloc handling is very different, I agree, it is likely not in either one, but somewhere else. (e.g. list.[ch] maybe?) > If you have not already, it would be well worth turning on > CONFIG_DEBUG_LIST and CONFIG_DEBUG_SLAB or CONFIG_SLUB_DEBUG with > CONFIG_SLUB_DEBUG_ON. Will do. > If that BUG_ON(mm->nr_ptes ...) in exit_mmap() is preventing you from > getting on with your work, or slowing down reproduction of the > testcase, you should be able to replace it by a WARN_ON. 
You will > probably leak at least one page (the page table) and perhaps many > pages (those that that page table points to) each time it hits, but it > shouldn't actually be unsafe to continue - it's really a development > BUG_ON, to check that new architectures added are freeing all the > page tables they have allocated. I added debug to dump the VM Areas' start & end values and get past the BUG. That's when I got the second BUG, involving vmlist. :( > I do expect the underlying problem to be somewhere down the driver > end, given that nobody else has been reporting these issues. I'm > hoping that once the cx18 guys have time to try to reproduce it, > they'll be better able to track it down. Well, I am the cx18 guy (except for Devin who wrote the guts of cx18-alsa*c). Given that the cx18 driver has been pretty static for the past few months, I'm doubting it is there. I have been doing a recent change to add dynamic SCB MDL entry management, but that is in a small 64kB region in the ioremap()'ed memory of the CX23418 device. (one patch I have locally hasn't been pushed yet, but here is the one patch I pushed before I began testing: http://git.linuxtv.org/awalls/media_tree.git?a=shortlog;h=refs/heads/cx18-vb2-proto ) My overall objective is to start converting cx18 to use the videobuf2 infrastructure instead of allocating so many buffers at module load. The cx18 driver kmalloc()'s *a lot* of buffers per card at module load, and I have two cards in my machine. I think these consecutive allocations by the cx18 driver are creating conditions that induce corruption in the memory management system. I'm guessing it is probably due to some locking or list handling bug. > But you are having trouble > reproducing it yourself? I can't say yet. I'm currently two for two. I hate BUG-ing our household "production" machine that has all the kids' homework, financial tracking, saved e-mails, etc. I've got to make a backup first or just slap in a new disk for development.
> hitting this vmalloc one before you could > reproduce the exit_mmap one? I hit the exit_mmap() one; added debug to get past the BUG; and then got the new Oops/BUG. Loading the cx18 driver was clearly the action in both cases that caused the BUGs. > No chance to bisect it to a particular > commit if you cannot reliably reproduce it. I have to back up home directories before I can begin a bisect on this one. > There was a horrid list corruption bug in early 2.6.38-rc, fixed in > -rc6; but although I guess it could cause all kinds of havoc, its > particular signature was not like this, so I don't really believe that > one was to blame here. Sounds like it may be worth me reviewing the commits that introduced the failure and the commit that fixed it. Do you happen to know what they are? Regards, Andy > Hugh ^ permalink raw reply [flat|nested] 12+ messages in thread
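For readers following the remove_vm_area() discussion in the message above, here is a minimal userspace sketch of the same pointer-to-pointer list walk. It is not the kernel's code: struct node and list_unlink() are hypothetical names, and the NULL test in the loop condition is an assumption added for illustration. The quoted kernel loop has no such test, which is why a list that has been corrupted, or a target that is not on the list, would end in a NULL dereference as Andy suspected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vm_struct; only the link matters here. */
struct node {
	struct node *next;
	int id;
};

/*
 * Unlink target from the NULL-terminated singly linked list at *head.
 * Returns 1 if target was found and unlinked, 0 if it was not on the list.
 *
 * The quoted kernel loop is "(tmp = *p) != vm", which only terminates when
 * the target is found. Here the loop also stops at the end of the list.
 */
static int list_unlink(struct node **head, struct node *target)
{
	struct node **p, *tmp;

	assert(head != NULL);

	for (p = head; (tmp = *p) != NULL; p = &tmp->next) {
		if (tmp == target) {
			*p = tmp->next;   /* splice the target out */
			tmp->next = NULL;
			return 1;
		}
	}
	return 0;                         /* target was never on the list */
}
```

A defensive walk like this only turns the crash into an error return. It would not have fixed the underlying corruption, which, as the thread goes on to show, came from elsewhere.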
* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded 2011-03-06 21:04 ` Andy Walls @ 2011-03-07 2:34 ` Hugh Dickins 2011-03-09 0:37 ` Andy Walls 2011-03-11 0:34 ` Andy Walls 2 siblings, 0 replies; 12+ messages in thread From: Hugh Dickins @ 2011-03-07 2:34 UTC (permalink / raw) To: Andy Walls Cc: linux-kernel, akpm, David Miller, linux-media, Devin Heitmueller On Sun, 6 Mar 2011, Andy Walls wrote: > On Sun, 2011-03-06 at 10:37 -0800, Hugh Dickins wrote: > > > There was a horrid list corruption bug in early 2.6.38-rc, fixed in > > -rc6; but although I guess it could cause all kinds of havoc, its > > particular signature was not like this, so I don't really believe that > > one was to blame here. > > Sounds like it may be worth me reviewing the commits that introduced the > failure and the commit that fixed it. Do you happen to know what they > are? Here are the several fixes, which reference LKML threads and culprits: it seems to have been a danger since 2.6.33, made much worse recently. commit ceaaec98ad99859ac90ac6863ad0a6cd075d8e0e Author: Eric Dumazet <eric.dumazet@gmail.com> Date: Thu Feb 17 22:59:19 2011 +0000 net: deinit automatic LIST_HEAD commit 9b5e383c11b08784 (net: Introduce unregister_netdevice_many()) left an active LIST_HEAD() in rollback_registered(), with possible memory corruption. Even if device is freed without touching its unreg_list (and therefore touching the previous memory location holding LISTE_HEAD(single), better close the bug for good, since its really subtle. (Same fix for default_device_exit_batch() for completeness) Reported-by: Michal Hocko <mhocko@suse.cz> Tested-by: Michal Hocko <mhocko@suse.cz> Reported-by: Eric W. Biderman <ebiderman@xmission.com> Tested-by: Eric W. 
Biderman <ebiderman@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Ingo Molnar <mingo@elte.hu> CC: Octavian Purdila <opurdila@ixiacom.com> CC: stable <stable@kernel.org> [.33+] Signed-off-by: David S. Miller <davem@davemloft.net> commit f87e6f47933e3ebeced9bb12615e830a72cedce4 Author: Linus Torvalds <torvalds@linux-foundation.org> Date: Thu Feb 17 22:54:38 2011 +0000 net: dont leave active on stack LIST_HEAD Eric W. Biderman and Michal Hocko reported various memory corruptions that we suspected to be related to a LIST head located on stack, that was manipulated after thread left function frame (and eventually exited, so its stack was freed and reused). Eric Dumazet suggested the problem was probably coming from commit 443457242beb (net: factorize sync-rcu call in unregister_netdevice_many) This patch fixes __dev_close() and dev_close() to properly deinit their respective LIST_HEAD(single) before exiting. References: https://lkml.org/lkml/2011/2/16/304 References: https://lkml.org/lkml/2011/2/14/223 Reported-by: Michal Hocko <mhocko@suse.cz> Tested-by: Michal Hocko <mhocko@suse.cz> Reported-by: Eric W. Biderman <ebiderman@xmission.com> Tested-by: Eric W. Biderman <ebiderman@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Ingo Molnar <mingo@elte.hu> CC: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit 3c18d4de86e4a7f93815c081e50e0543fa27200f Author: Linus Torvalds <torvalds@linux-foundation.org> Date: Fri Feb 18 11:32:28 2011 -0800 Expand CONFIG_DEBUG_LIST to several other list operations When list debugging is enabled, we aim to readably show list corruption errors, and the basic list_add/list_del operations end up having extra debugging code in them to do some basic validation of the list entries. 
However, "list_del_init()" and "list_move[_tail]()" ended up avoiding the debug code due to how they were written. This fixes that. So the _next_ time we have list_move() problems with stale list entries, we'll hopefully have an easier time finding them.. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> ^ permalink raw reply [flat|nested] 12+ messages in thread
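The commit messages quoted above describe a list head that lives on a function's stack and is still linked to long-lived objects after the function returns. Below is a rough userspace sketch of that pattern and of the "deinit before returning" fix. The list helpers only imitate the kernel's list API, and struct device, process_single() and the function bodies are hypothetical; none of this is taken from the actual net/core code.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace imitation of the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Unlink entry from its neighbours and leave it pointing at itself. */
static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	list_init(entry);
}

/* Hypothetical long-lived object with an embedded link. */
struct device {
	struct list_head link;
};

/*
 * Put one device on a temporary on-stack list, then unlink the on-stack
 * head before returning. Without the final list_del_init(), dev->link
 * would keep pointing into this stack frame after it has been reused.
 */
static void process_single(struct device *dev)
{
	struct list_head single;

	assert(dev != NULL);
	list_init(&single);
	list_add_tail(&dev->link, &single);
	/* ... batch processing over the list would happen here ... */
	list_del_init(&single);
}
```

After process_single() returns, dev->link points only at itself, so a later list operation on the device cannot scribble on whatever now occupies the old stack frame.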
* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded 2011-03-06 21:04 ` Andy Walls 2011-03-07 2:34 ` Hugh Dickins @ 2011-03-09 0:37 ` Andy Walls 2011-03-11 0:34 ` Andy Walls 2 siblings, 0 replies; 12+ messages in thread From: Andy Walls @ 2011-03-09 0:37 UTC (permalink / raw) To: Hugh Dickins Cc: linux-kernel, akpm, David Miller, linux-media, Devin Heitmueller

On Sun, 2011-03-06 at 16:04 -0500, Andy Walls wrote:
> On Sun, 2011-03-06 at 10:37 -0800, Hugh Dickins wrote:
> >
> > Thanks for all the effort you are putting into investigating this: you
> > deserve a better response than I can give you.
> >
> > mm/vmalloc.c's vmap_area handling is entirely separate from
> > mm/mmap.c's vm_area_struct handling, yet both misbehaviors would be
> > explained if a next pointer has been corrupted to NULL.
> >
> > Probably just coincidence that they both manifest that way, though the
> > underlying problem may turn out to be one.

> > If you have not already, it would be well worth turning on
> > CONFIG_DEBUG_LIST and CONFIG_DEBUG_SLAB or CONFIG_SLUB_DEBUG with
> > CONFIG_SLUB_DEBUG_ON.
>
> > But you are having trouble
> > reproducing it yourself?
>
> I can't say yet. I'm currently two for two.

After backing up the machine and testing again, I'm now 3 for 3.

This time it happened in the memset() in kernel/module.c:move_module()
when modprobe was trying to load the cx18-alsa.ko module.

static int move_module(struct module *mod, struct load_info *info)
{
	int i;
	void *ptr;

	/* Do the allocs. */
	ptr = module_alloc_update_bounds(mod->core_size);
	/*
	 * The pointer to this block is stored in the module structure
	 * which is inside the block. Just mark it as not being a
	 * leak.
	 */
	kmemleak_not_leak(ptr);
	if (!ptr)
		return -ENOMEM;

	memset(ptr, 0, mod->core_size);    <----- Ooops/BUG

/home/andy/cx18dev/git/media_tree/kernel/module.c:2529
    385c:   41 8b 8c 24 64 01 00    mov    0x164(%r12),%ecx
    3863:   00
    3864:   31 c0                   xor    %eax,%eax
    3866:   48 89 d7                mov    %rdx,%rdi
    3869:   f3 aa                   rep stos %al,%es:(%rdi)    <----- Oops/BUG

ptr had a value of 0x0000000000001000

I'm starting a git bisect now.

Regards,
Andy

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded 2011-03-06 21:04 ` Andy Walls 2011-03-07 2:34 ` Hugh Dickins 2011-03-09 0:37 ` Andy Walls @ 2011-03-11 0:34 ` Andy Walls 2011-03-11 0:47 ` Hugh Dickins 2 siblings, 1 reply; 12+ messages in thread From: Andy Walls @ 2011-03-11 0:34 UTC (permalink / raw) To: Hugh Dickins Cc: linux-kernel, akpm, David Miller, linux-media, Devin Heitmueller

On Sun, 2011-03-06 at 16:04 -0500, Andy Walls wrote:
> On Sun, 2011-03-06 at 10:37 -0800, Hugh Dickins wrote:

> > I do expect the underlying problem to be somewhere down the driver
> > end, given that nobody else has been reporting these issues. I'm
> > hoping that once the cx18 guys have time to try to reproduce it,
> > they'll be better able to track it down.

Hi Hugh,

You were correct. The mistake was in the cx18 driver, in the last thing
that I touched, of course. The code causing the bug isn't anywhere
aside from my private repo.

All,

Sorry for all the noise. The bug was so idiotic, I felt compelled to
show the fix:

diff --git a/drivers/media/video/cx18/cx18-scb.c b/drivers/media/video/cx18/cx18
index fd89ad0..d17ffc8 100644
--- a/drivers/media/video/cx18/cx18-scb.c
+++ b/drivers/media/video/cx18/cx18-scb.c
@@ -28,8 +28,8 @@

 int cx18_scb_init_mdl_ent_mgmt(struct cx18 *cx)
 {
-	cx->scb_mdl_ent_map = kzalloc(BITS_TO_LONGS(SCB_MDL_ENTRIES),
-				      GFP_KERNEL);
+	cx->scb_mdl_ent_map = kzalloc(BITS_TO_LONGS(SCB_MDL_ENTRIES)
+				      * sizeof(long), GFP_KERNEL);
 	if (cx->scb_mdl_ent_map == NULL) {
 		CX18_ERR("Fatal: unable to allocate bitmap for managing SCB MDL"
 			 "entries\n");

So now the subsequent call to

	bitmap_zero(cx->scb_mdl_ent_map, SCB_MDL_ENTRIES);

doesn't walk off the end of what was allocated. Apparently
BITS_TO_LONGS() is not BITS_TO_LONGS_TO_BYTES().

Regards,
Andy

^ permalink raw reply related [flat|nested] 12+ messages in thread
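To make the size mismatch in the fix above concrete, here is a small userspace sketch. BITS_PER_LONG and BITS_TO_LONGS() are re-defined locally as stand-ins modelled on the kernel macros of the same names, EXAMPLE_ENTRIES is a made-up value (the real SCB_MDL_ENTRIES is not given in the thread), and the helper functions are hypothetical. The point is only that BITS_TO_LONGS() yields a count of longs, so a byte-sized allocator needs that count multiplied by sizeof(long).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-ins modelled on the kernel macros of the same names. */
#define BITS_PER_LONG     (sizeof(long) * CHAR_BIT)
#define BITS_TO_LONGS(nr) (((nr) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Hypothetical entry count; the real SCB_MDL_ENTRIES is not in the thread. */
#define EXAMPLE_ENTRIES 2048

/* Number of bytes needed to hold a bitmap of nbits bits. */
static size_t bitmap_bytes(size_t nbits)
{
	return BITS_TO_LONGS(nbits) * sizeof(long);
}

/* Allocate a zeroed bitmap; calloc() does the count-times-size multiply. */
static unsigned long *bitmap_alloc_zeroed(size_t nbits)
{
	return calloc(BITS_TO_LONGS(nbits), sizeof(unsigned long));
}

/* Clear every word of the bitmap, touching bitmap_bytes(nbits) bytes. */
static void bitmap_clear_all(unsigned long *map, size_t nbits)
{
	assert(map != NULL);
	memset(map, 0, bitmap_bytes(nbits));
}
```

With the assumed 2048 entries, a bitmap needs 256 bytes, while BITS_TO_LONGS(2048) on a build with 64-bit longs is only 32. Passing that count straight to a byte-sized allocator, as the pre-fix code did, would leave the clearing step writing well past the end of the buffer. In kernel code, kcalloc() expresses the same count-times-size intent as the calloc() call here.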
* Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded 2011-03-11 0:34 ` Andy Walls @ 2011-03-11 0:47 ` Hugh Dickins 0 siblings, 0 replies; 12+ messages in thread From: Hugh Dickins @ 2011-03-11 0:47 UTC (permalink / raw) To: Andy Walls Cc: linux-kernel, akpm, David Miller, linux-media, Devin Heitmueller On Thu, Mar 10, 2011 at 4:34 PM, Andy Walls <awalls@md.metrocast.net> wrote: > On Sun, 2011-03-06 at 16:04 -0500, Andy Walls wrote: >> On Sun, 2011-03-06 at 10:37 -0800, Hugh Dickins wrote: > >> > I do expect the underlying problem to be somewhere down the driver >> > end, given that nobody else has been reporting these issues. I'm >> > hoping that once the cx18 guys have time to try to reproduce it, >> > they'll be better able to track it down. > > Hi Hugh, > > You were correct. The mistake was in the cx18 driver, in the last thing > that I touched, of course. The code causing the bug isn't anywhere > aside from my private repo. Thanks a lot for reporting back, Andy: relief all round. Hugh ^ permalink raw reply [flat|nested] 12+ messages in thread