* BUG: unable to handle kernel NULL pointer dereference in qfq_dequeue()
From: Cong Wang @ 2012-10-08 9:15 UTC
To: stephen hemminger; +Cc: Eric Dumazet, David S. Miller, netdev, Thomas Graf

Hi, all,

We got the following kernel crash on RHEL6, and I confirmed that upstream has
the same problem (I didn't save that kernel log, though):

BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: [<ffffffffa02c3dca>] qfq_dequeue+0x30a/0x490 [sch_qfq]
PGD 1fbed067 PUD 1b103067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:08.0/virtio4/net/eth2/address
CPU 0
Modules linked in: cls_u32 sch_qfq sch_cbq ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 virtio_balloon snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 0, comm: swapper Not tainted 2.6.32-259.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffffa02c3dca>]  [<ffffffffa02c3dca>] qfq_dequeue+0x30a/0x490 [sch_qfq]
RSP: 0018:ffff880002203da0  EFLAGS: 00010287
RAX: ffffffffffffffb0 RBX: ffff88001f45e0c0 RCX: 0000000000000029
RDX: fffffe0000000000 RSI: 0000000000000001 RDI: ffff88001f45f718
RBP: ffff880002203de0 R08: 0000000000000007 R09: 0000000225c602e3
R10: 00000000ffffffff R11: dead000000200200 R12: 0000000000000013
R13: ffff88001f124ea8 R14: ffff88001f45f6b8 R15: 0028940000000000
FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000010 CR3: 000000001b277000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a8d020)
Stack:
 ffff88001f45e000 0028900000000000 ffff880002203de0 ffff88001f4fcc00
<d> ffff88001f4fcc00 0000000000000000 0000000000000001 ffff88001ad640c0
<d> ffff880002203e60 ffffffffa02b9c85 ffff88001f4fcc00 ffff88001f4fcc00
Call Trace:
 <IRQ>
 [<ffffffffa02b9c85>] cbq_dequeue+0x365/0x730 [sch_cbq]
 [<ffffffff81456c3f>] __qdisc_run+0x3f/0xe0
 [<ffffffff81436c00>] net_tx_action+0x130/0x1c0
 [<ffffffff8102b46d>] ? lapic_next_event+0x1d/0x30
 [<ffffffff81073d81>] __do_softirq+0xc1/0x1e0
 [<ffffffff81096b10>] ? hrtimer_interrupt+0x140/0x250
 [<ffffffff8100c24c>] call_softirq+0x1c/0x30
 [<ffffffff8100de85>] do_softirq+0x65/0xa0
 [<ffffffff81073b65>] irq_exit+0x85/0x90
 [<ffffffff81502bc0>] smp_apic_timer_interrupt+0x70/0x9b
 [<ffffffff8100bc13>] apic_timer_interrupt+0x13/0x20
 <EOI>
 [<ffffffff810387cb>] ? native_safe_halt+0xb/0x10
 [<ffffffff810149cd>] default_idle+0x4d/0xb0
 [<ffffffff81009e06>] cpu_idle+0xb6/0x110
 [<ffffffff814e137a>] rest_init+0x7a/0x80
 [<ffffffff81c21f7b>] start_kernel+0x424/0x430
 [<ffffffff81c2133a>] x86_64_start_reservations+0x125/0x129
 [<ffffffff81c21438>] x86_64_start_kernel+0xfa/0x109
Code: 7c 03 50 4d 8b 7e 58 e8 b5 f6 ff ff 48 85 c0 0f 84 3c 01 00 00 41 8b 4e 60 be 01 00 00 00 49 8d 7e 60 48 89 f2 48 d3 e2 48 f7 da <48> 23 50 60 49 39 56 50 0f 84 d6 00 00 00 b8 02 00 00 00 49 89
RIP  [<ffffffffa02c3dca>] qfq_dequeue+0x30a/0x490 [sch_qfq]
 RSP <ffff880002203da0>
CR2: 0000000000000010

This crash can be easily reproduced in KVM guests with the following steps:

1. On virt-guest1, set up a qfq qdisc with this script:
   http://pastebin.com/BRaSXLzq

2. On virt-guest2, start listening on ports 1234 and 1235:

   # nc -l 1234 > /dev/null 2>&1
   # nc -l 1235 > /dev/null 2>&1

3. On virt-guest1, send traffic to virt-guest2:

   # yes | nc $virt_guest2_ip_addr 1234
   # yes | nc $virt_guest2_ip_addr 1235

I am not familiar with the qfq qdisc. Any ideas?

Thanks!
* Re: BUG: unable to handle kernel NULL pointer dereference in qfq_dequeue()
From: Cong Wang @ 2012-10-11 8:38 UTC
To: stephen hemminger; +Cc: Eric Dumazet, David S. Miller, netdev, Thomas Graf

On Mon, 2012-10-08 at 17:15 +0800, Cong Wang wrote:
> Hi, all,
>
> We got the following kernel crash on RHEL6 and I confirmed upstream has
> the same problem (I didn't save this kernel log though):

OK, I got a backtrace from the latest kernel; see below. It seems that
qfq_slot_scan() in qfq_dequeue() returns something bad: 'cl' becomes
'0x10'.

(gdb) bt
#0  delay_tsc (__loops=<optimized out>) at arch/x86/lib/delay.c:69
#1  0xffffffff8143f4c9 in __delay (loops=<optimized out>) at arch/x86/lib/delay.c:112
#2  0xffffffff8143f4ef in __const_udelay (xloops=<optimized out>) at arch/x86/lib/delay.c:126
#3  0xffffffff81914536 in panic (fmt=<optimized out>) at kernel/panic.c:182
#4  0xffffffff8193de42 in oops_end (flags=582, regs=0xffff88007d203d08, signr=9) at arch/x86/kernel/dumpstack.c:248
#5  0xffffffff81913698 in no_context (regs=0xffff88007d203d08, error_code=<optimized out>, address=16, signal=<optimized out>, si_code=<optimized out>) at arch/x86/mm/fault.c:690
#6  0xffffffff81913885 in __bad_area_nosemaphore (regs=0xffff88007d203d08, error_code=0, address=16, si_code=196609) at arch/x86/mm/fault.c:768
#7  0xffffffff819138b9 in bad_area_nosemaphore (regs=<optimized out>, error_code=<optimized out>, address=<optimized out>) at arch/x86/mm/fault.c:775
#8  0xffffffff81940803 in __do_page_fault (regs=0xffff88007d203d08, error_code=0) at arch/x86/mm/fault.c:1105
#9  0xffffffff81940883 in do_page_fault (regs=<optimized out>, error_code=<optimized out>) at arch/x86/mm/fault.c:1237
#10 0xffffffff81940129 in do_async_page_fault (regs=<optimized out>, error_code=<optimized out>) at arch/x86/kernel/kvm.c:246
#11 <signal handler called>
#12 0xffffffff8176c72c in qfq_round_down (shift=41, ts=Cannot access memory at address 0x10
) at net/sched/sch_qfq.c:575
#13 qfq_dequeue (sch=0xffff8800757542c0) at net/sched/sch_qfq.c:819
#14 0xffffffff8175c0cb in cbq_dequeue_prio (prio=1, sch=0xffff880076b15d00) at net/sched/sch_cbq.c:851
#15 cbq_dequeue_1 (sch=0xffff880076b15d00) at net/sched/sch_cbq.c:934
#16 cbq_dequeue (sch=0xffff880076b15d00) at net/sched/sch_cbq.c:973
#17 0xffffffff81750ff7 in dequeue_skb (q=0xffff880076b15d00) at net/sched/sch_generic.c:69
#18 qdisc_restart (q=0xffff880076b15d00) at net/sched/sch_generic.c:178
#19 __qdisc_run (q=0xffff880076b15d00) at net/sched/sch_generic.c:193
#20 0xffffffff81725625 in qdisc_run (q=0xffff880076b15d00) at include/net/pkt_sched.h:99
#21 net_tx_action (h=<optimized out>) at net/core/dev.c:3070
#22 0xffffffff81057069 in __do_softirq () at kernel/softirq.c:247
#23 0xffffffff81945b3c in ?? () at arch/x86/kernel/entry_64.S:1353
#24 0xffffffff81003f36 in do_softirq () at arch/x86/kernel/irq_64.c:106
#25 0xffffffff810572f8 in invoke_softirq () at kernel/softirq.c:329
#26 irq_exit () at kernel/softirq.c:348
#27 0xffffffff81081dd3 in scheduler_ipi () at kernel/sched/core.c:1355
#28 0xffffffff8101fee4 in smp_reschedule_interrupt (regs=<optimized out>) at arch/x86/kernel/smp.c:256
#29 <signal handler called>
#30 0xffffffffffffff02 in ?? ()
Cannot access memory at address 0x246
* Re: BUG: unable to handle kernel NULL pointer dereference in qfq_dequeue()
From: Eric Dumazet @ 2012-10-11 15:05 UTC
To: Cong Wang; +Cc: stephen hemminger, David S. Miller, netdev, Thomas Graf, rizzo

On Thu, 2012-10-11 at 16:38 +0800, Cong Wang wrote:
> On Mon, 2012-10-08 at 17:15 +0800, Cong Wang wrote:
> > Hi, all,
> >
> > We got the following kernel crash on RHEL6 and I confirmed upstream has
> > the same problem (I didn't save this kernel log though):
>
> Ok, I got the backtrace of the latest kernel, see below. Seems
> qfq_slot_scan() in qfq_dequeue() returns something bad, 'cl' becomes
> '0x10'.

Not exactly: cl is -0x50.

> (gdb) bt
> [...]
> #12 0xffffffff8176c72c in qfq_round_down (shift=41, ts=Cannot access
> memory at address 0x10
> ) at net/sched/sch_qfq.c:575
> #13 qfq_dequeue (sch=0xffff8800757542c0) at net/sched/sch_qfq.c:819
> [...]

static struct qfq_class *qfq_slot_head(struct qfq_group *grp)
{
	return hlist_entry(grp->slots[grp->front].first,
			   struct qfq_class, next);
}

The problem is that grp->slots[grp->front].first is NULL here,
so we return RAX = -offsetof(struct qfq_class, next)
(i.e. -0x50: ffffffffffffffb0).

So one bit is set in full_slots while the corresponding slots[] entry is
empty.
I wonder if qfq_slot_remove() is correct?

static void qfq_slot_remove(struct qfq_sched *q, struct qfq_group *grp,
			    struct qfq_class *cl)
{
	unsigned int i, offset;
	u64 roundedS;

	roundedS = qfq_round_down(cl->S, grp->slot_shift);
	offset = (roundedS - grp->S) >> grp->slot_shift;
	i = (grp->front + offset) % QFQ_MAX_SLOTS;

	hlist_del(&cl->next);
	if (hlist_empty(&grp->slots[i]))
		__clear_bit(offset, &grp->full_slots);
}

What guarantee do we have that cl was removed from slots[i] and not from
another slot?
* Re: BUG: unable to handle kernel NULL pointer dereference in qfq_dequeue()
From: Eric Dumazet @ 2012-10-11 15:20 UTC
To: Cong Wang; +Cc: stephen hemminger, David S. Miller, netdev, Thomas Graf, rizzo

On Thu, 2012-10-11 at 17:05 +0200, Eric Dumazet wrote:
> [...]
> So one bit is set in full_slots while the corresponding slots[] is
> empty.
>
> I wonder if qfq_slot_remove() is correct ?

I just realized it's a 2.6.32 Red Hat kernel, while QFQ is a 3.0 addition.

Can you reproduce the bug on a current kernel (3.6 or the git tree)?
* Re: BUG: unable to handle kernel NULL pointer dereference in qfq_dequeue()
From: Cong Wang @ 2012-10-12 1:25 UTC
To: Eric Dumazet; +Cc: stephen hemminger, David S. Miller, netdev, Thomas Graf, rizzo

On Thu, 2012-10-11 at 17:20 +0200, Eric Dumazet wrote:
>
> I just realize its a 2.6.32 redhat kernel, while QFQ is a 3.0 addition.
>
> Can you reproduce the bug on current kernel (3.6 or git tree)
>

Sure, the gdb backtrace is from the latest -net kernel.