* NMI lockup, 2.6.26 release
@ 2008-07-22 18:42 denys
  2008-07-22 20:13 ` Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: denys @ 2008-07-22 18:42 UTC (permalink / raw)
  To: netdev

workload: ifb shapers, HTB, nat, 1 e1000, 3 e100 (heavily used only two e100)

[17146.584293] BUG: NMI Watchdog detected LOCKUP on CPU1, ip c01d78d4, registers:
[17146.584293] Process swapper (pid: 0, ti=c0842000 task=f7c35c80 task.ti=f7c3d000)
[17146.584293] Stack: c29f6498 00000000 00000000 f6c9fcc4 f6c9fcc8 c29f6490 c0842d1c c0133ddc
[17146.584293]        00000001 f6c9fcc4 00000000 c29f6490 f6c9fcc4 c29f6490 c0842d44 c0134413
[17146.584293]        c29f6484 0230ac00 00000f97 00000000 00000286 f6c9f800 f7d2f000 00000000
[17146.584293] Call Trace:
[17146.584293]  [<c0133ddc>] ? enqueue_hrtimer+0xfa/0x106
[17146.584293]  [<c0134413>] ? hrtimer_start+0xee/0x118
[17146.584293]  [<c026790e>] ? qdisc_watchdog_schedule+0x19/0x1f
[17146.584293]  [<f89eade1>] ? htb_dequeue+0x6a8/0x6b3 [sch_htb]
[17146.584293]  [<f899bad4>] ? sfq_enqueue+0x16/0x1be [sch_sfq]
[17146.584293]  [<c02669b9>] ? __qdisc_run+0x5f/0x191
[17146.584293]  [<c025ba50>] ? dev_queue_xmit+0x1ba/0x316
[17146.584293]  [<f8a21245>] ? tcf_mirred+0x132/0x153 [act_mirred]
[17146.584293]  [<f8a21113>] ? tcf_mirred+0x0/0x153 [act_mirred]
[17146.584293]  [<c0268ce4>] ? tcf_action_exec+0x44/0x77
[17146.584293]  [<f89ce792>] ? u32_classify+0x119/0x24e [cls_u32]
[17146.584293]  [<c0266dfa>] ? tc_classify_compat+0x2f/0x5e
[17146.584293]  [<c0267806>] ? tc_classify+0x17/0x78
[17146.584293]  [<f89f10a3>] ? ingress_enqueue+0x1a/0x53 [sch_ingress]
[17146.584293]  [<c0258eca>] ? netif_receive_skb+0x26b/0x406
[17146.584293]  [<c0255bfb>] ? __netdev_alloc_skb+0x17/0x34
[17146.584293]  [<f89b8c7a>] ? e1000_receive_skb+0x13b/0x162 [e1000e]
[17146.584293]  [<f89bb599>] ? e1000_clean_rx_irq+0x1ff/0x28d [e1000e]
[17146.584293]  [<f89b843a>] ? e1000_clean+0x57/0x1de [e1000e]
[17146.584293]  [<c025b006>] ? net_rx_action+0xb3/0x1e0
[17146.584293]  [<c0125757>] ? __do_softirq+0x6f/0xe9
[17146.584293]  [<c010614b>] ? do_softirq+0x5e/0xa8
[17146.584293]  [<c0146eb9>] ? handle_edge_irq+0x0/0x10a
[17146.584293]  [<c01256b5>] ? irq_exit+0x44/0x77
[17146.584293]  [<c0106235>] ? do_IRQ+0xa0/0xb7
[17146.584293]  [<c0108fb5>] ? mwait_idle+0x0/0x43
[17146.584293]  [<c01042fa>] ? common_interrupt+0x2e/0x34
[17146.584293]  [<c0108fb5>] ? mwait_idle+0x0/0x43
[17146.584293]  [<c01300d8>] ? param_array_set+0x96/0xc8
[17146.584293]  [<c0108fee>] ? mwait_idle+0x39/0x43
[17146.584293]  [<c0102596>] ? cpu_idle+0x9a/0xba
[17146.584293]  [<c02aebfb>] ? start_secondary+0x160/0x165
[17146.584293] =======================
[17146.584293] Code: 83 09 01 89 f0 83 26 fe 8b 55 e8 e8 57 ff ff ff eb 42 85 d2 74 15 8b 02 a8 01 75 0f 83 c8 01 89 f7 89 02 83 0b 01 83 26 fe eb 29 <8b> 53 08 39 fa 89 55 ec 75 0f 8b 55 e8 89 d8 89 df e8 26 ff ff

And one before, but also today (not sure about kernel version, between rc8
and release):

[143348.473981] BUG: NMI Watchdog detected LOCKUP on CPU1, ip c01ca1bf, registers:
[143348.473981] Modules linked in: netconsole configfs coretemp hwmon i2c_i801 i2c_core nf_nat_ftp nf_conntrack_ftp softdog nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_proto_gre hangcheck_timer act_mirred sch_ingress act_police cls_u32 sch_sfq sch_htb iptable_nat nf_nat nf_conntrack_ipv4 xt_tcpudp ipt_TTL ipt_ttl xt_NOTRACK nf_conntrack iptable_raw iptable_mangle ifb e1000e iptable_filter ip_tables x_tables 8021q tun tulip r8169 sky2 via_velocity via_rhine sis900 ne2k_pci 8390 tg3 8139too e1000 e100 usb_storage mtdblock mtd_blkdevs usbhid uhci_hcd ehci_hcd ohci_hcd usbcore
[143348.473981] Pid: 0, comm: swapper Not tainted (2.6.26-rc8-build-0029 #33)
[143348.473981] EIP: 0060:[<c01ca1bf>] EFLAGS: 00000082 CPU: 1
[143348.473981] EIP is at rb_insert_color+0x57/0xbc
[143348.473981] EAX: f75c24a4 EBX: f75c24a4 ECX: f75c24a4 EDX: 00000000
[143348.473981] ESI: f75c24a4 EDI: f75c24a4 EBP: c08f2d0c ESP: c08f2cf4
[143348.473981] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[143348.473981] Process swapper (pid: 0, ti=c08f2000 task=f7c314a0 task.ti=f7c3c000)
[143348.473981] Stack: c1ff8480 00000000 00000000 f75c24a4 f75c24a8 c1ff8478 c08f2d2c c0130358
[143348.473981]        00000001 f75c24a4 00000000 c1ff8478 f75c24a4 c1ff8478 c08f2d50 c0130a46
[143348.473981]        55c50c00 0000825f 00000000 00000286 f75c2000 f7c9d000 00000000 c08f2d60
[143348.473981] Call Trace:
[143348.473981]  [<c0130358>] ? enqueue_hrtimer+0xf5/0x101
[143348.473981]  [<c0130a46>] ? hrtimer_start+0xe2/0x10c
[143348.473981]  [<c025545d>] ? qdisc_watchdog_schedule+0x19/0x1f
[143348.473981]  [<f8a11dd6>] ? htb_dequeue+0x6a6/0x6b1 [sch_htb]
[143348.473981]  [<f8988ad0>] ? sfq_enqueue+0x16/0x1c2 [sch_sfq]
[143348.473981]  [<c025451e>] ? __qdisc_run+0x5f/0x187
[143348.473981]  [<c0249a9b>] ? dev_queue_xmit+0x1a3/0x2c0
[143348.473981]  [<f8a1523d>] ? tcf_mirred+0x12d/0x149 [act_mirred]
[143348.473981]  [<f8a15110>] ? tcf_mirred+0x0/0x149 [act_mirred]
[143348.473981]  [<c025682c>] ? tcf_action_exec+0x44/0x77
[143348.473981]  [<f89bd792>] ? u32_classify+0x119/0x24e [cls_u32]
[143348.473981]  [<f88d2da8>] ? e100_exec_cb+0xee/0xf9 [e100]
[143348.473981]  [<f88d221b>] ? e100_xmit_prepare+0x0/0x87 [e100]
[143348.473981]  [<f88d2e0d>] ? e100_xmit_frame+0x5a/0xc4 [e100]
[143348.473981]  [<c02477c0>] ? dev_hard_start_xmit+0x1f8/0x266
[143348.473981]  [<c0254952>] ? tc_classify_compat+0x2f/0x5e
[143348.473981]  [<c0255355>] ? tc_classify+0x17/0x78
[143348.473981]  [<f89f40a2>] ? ingress_enqueue+0x1a/0x53 [sch_ingress]
[143348.473981]  [<c024717f>] ? netif_receive_skb+0x20e/0x393
[143348.473981]  [<f89a8c21>] ? e1000_receive_skb+0x138/0x15f [e1000e]
[143348.473981]  [<f89ab4b8>] ? e1000_clean_rx_irq+0x1fc/0x28a [e1000e]
[143348.473981]  [<f89a8406>] ? e1000_clean+0x50/0x1be [e1000e]
[143348.473981]  [<c0248ff5>] ? net_rx_action+0x8f/0x199
[143348.473981]  [<c0122a63>] ? __do_softirq+0x64/0xcd
[143348.473981]  [<c0105e26>] ? do_softirq+0x55/0x89
[143348.473981]  [<c013e76c>] ? handle_edge_irq+0x0/0x100
[143348.473981]  [<c01229cc>] ? irq_exit+0x38/0x6b
[143348.473981]  [<c0105efa>] ? do_IRQ+0xa0/0xb6
[143348.473981]  [<c0108860>] ? mwait_idle+0x0/0x38
[143348.473981]  [<c01041df>] ? common_interrupt+0x23/0x28
[143348.473981]  [<c0108860>] ? mwait_idle+0x0/0x38
[143348.473981]  [<c0108892>] ? mwait_idle+0x32/0x38
[143348.473981]  [<c010255d>] ? cpu_idle+0x71/0x8a
[143348.473981] =======================
[143348.473981] Code: 8b 43 04 39 f8 89 45 f0 75 0f 8b 55 e8 89 d8 89 df e8 16 ff ff ff 8b 4d f0 83 09 01 89 f0 83 26 fe 8b 55 e8 e8 57 ff ff ff eb 42 <85> d2 74 15 8b 02 a8 01 75 0f 83 c8 01 89 f7 89 02 83 0b 01 83

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-22 18:42 NMI lockup, 2.6.26 release denys
@ 2008-07-22 20:13 ` Jarek Poplawski
  2008-07-22 20:35   ` Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: Jarek Poplawski @ 2008-07-22 20:13 UTC (permalink / raw)
  To: denys; +Cc: netdev

denys@visp.net.lb wrote, On 07/22/2008 08:42 PM:
> workload: ifb shapers, HTB, nat, 1 e1000, 3 e100 (heavily used only two e100)

Maybe it's unconnected, but I'd recommend trying this fresh patch for
possible problems with e1000e vs. netconsole:

http://permalink.gmane.org/gmane.linux.network/100581

Jarek P.

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-22 20:13 ` Jarek Poplawski
@ 2008-07-22 20:35   ` Jarek Poplawski
  2008-07-22 20:46     ` denys
  0 siblings, 1 reply; 29+ messages in thread
From: Jarek Poplawski @ 2008-07-22 20:35 UTC (permalink / raw)
  To: denys; +Cc: netdev

Jarek Poplawski wrote, On 07/22/2008 10:13 PM:
> denys@visp.net.lb wrote, On 07/22/2008 08:42 PM:
>> workload: ifb shapers, HTB, nat, 1 e1000, 3 e100 (heavily used only two e100)
>
> Maybe it's unconnected but I'd recommend to try this fresh patch for
> possible problems with e1000e vs. netconsole:
>
> http://permalink.gmane.org/gmane.linux.network/100581

...and if you have TSO enabled, here is another "maybe unconnected"
recommendation:

http://permalink.gmane.org/gmane.linux.network/99585

Jarek P.

^ permalink raw reply	[flat|nested] 29+ messages in thread
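As an aside, the TSO state Jarek asks about can be inspected and toggled with
ethtool; a quick sketch (the interface name is an assumption):

```shell
# Show the current offload settings for the external interface
# (eth0 is assumed; substitute the real e1000e device name).
ethtool -k eth0 | grep -i segmentation

# Disable TSO as a temporary workaround while testing the patch.
ethtool -K eth0 tso off
```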
* Re: NMI lockup, 2.6.26 release
  2008-07-22 20:35 ` Jarek Poplawski
@ 2008-07-22 20:46   ` denys
  2008-07-22 21:36     ` Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: denys @ 2008-07-22 20:46 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: denys, netdev

First patch - probably not related: netconsole is on e100.
The second patch looks reasonable, because there is an external interface
(this machine is a router) with e1000e and TSO enabled. Maybe some scanning
tool connects to SSH and... BOOM. I will try the patch, thanks a lot!

On Tuesday 22 July 2008, Jarek Poplawski wrote:
> Jarek Poplawski wrote, On 07/22/2008 10:13 PM:
> > denys@visp.net.lb wrote, On 07/22/2008 08:42 PM:
> >> workload: ifb shapers, HTB, nat, 1 e1000, 3 e100 (heavily used only two e100)
> >
> > Maybe it's unconnected but I'd recommend to try this fresh patch for
> > possible problems with e1000e vs. netconsole:
> >
> > http://permalink.gmane.org/gmane.linux.network/100581
>
> ...and if you have TSO enabled, here is another "maybe unconnected"
> recommendation:
>
> http://permalink.gmane.org/gmane.linux.network/99585
>
> Jarek P.
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-22 20:46 ` denys
@ 2008-07-22 21:36   ` Jarek Poplawski
  2008-07-22 21:45     ` denys
  2008-07-23 19:47     ` denys
  0 siblings, 2 replies; 29+ messages in thread
From: Jarek Poplawski @ 2008-07-22 21:36 UTC (permalink / raw)
  To: denys; +Cc: netdev

On Tue, Jul 22, 2008 at 11:46:36PM +0300, denys@visp.net.lb wrote:
> First patch - probably not related. Netconsole is on e100.
> Second patch looks reasonable, because there is external interface (it is
> router) and e1000e there, with enabled TSO. Maybe some scanning tools doing
> connect to SSH and... BOOM.. i will try to patch it, thanks a lot!

On the other hand, it seems the skb corruption fixed by this second patch
should probably show up in some other places as well...

I wonder whether you have tried running this without netconsole (or with
netconsole on another dev/driver), or without the NMI watchdog, and whether
similar lockups still happened?

Jarek P.

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-22 21:36 ` Jarek Poplawski
@ 2008-07-22 21:45   ` denys
  2008-07-23 19:47 ` denys
  0 siblings, 0 replies; 29+ messages in thread
From: denys @ 2008-07-22 21:45 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: denys, netdev

Well, I cannot tell whether the lockups are similar or not; I can only guess.
The machine crashes after a few days, even without netconsole. Sometimes it
reboots, sometimes not, so sometimes I get a call in the middle of the night
and run to the power switch to reboot the PC. I cannot check the screen; the
machine is in an area unreachable for me. This bug has probably been present
since 2.6.23 even...

I will engage kexec with a panic kernel just to be safer, since the box does
not always reboot via the panic sysctl. As I see in the sources, the path to
a panic kexec kernel is much shorter...

On Wednesday 23 July 2008, Jarek Poplawski wrote:
> On Tue, Jul 22, 2008 at 11:46:36PM +0300, denys@visp.net.lb wrote:
> > First patch - probably not related. Netconsole is on e100.
> > Second patch looks reasonable, because there is external interface (it is
> > router) and e1000e there, with enabled TSO. Maybe some scanning tools doing
> > connect to SSH and... BOOM.. i will try to patch it, thanks a lot!
>
> On the other hand, it seems the corruption of skbs fixed by this
> second patch should probably show in some other places...
>
> I wonder if you tried to run this without netconsole (or maybe
> netconsole on another dev/driver) or without this NMI watchdog,
> and similar lockups still happened?
>
> Jarek P.

^ permalink raw reply	[flat|nested] 29+ messages in thread
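The kexec crash-kernel setup denys mentions is configured roughly as follows
(a sketch; the memory reservation size, kernel paths, and append string are
assumptions, not taken from the thread):

```shell
# Reserve memory for the crash kernel at boot by adding this to the
# running kernel's command line (size/offset are assumptions):
#   crashkernel=64M@16M

# Load the panic kernel with kexec -p so a crash jumps straight into
# it, bypassing the normal (longer) reboot path (paths are assumed).
kexec -p /boot/vmlinuz-crash --initrd=/boot/initrd-crash \
    --append="root=/dev/sda1 single irqpoll"

# Belt-and-braces: also ask the kernel to panic on oops and reboot
# after 10 seconds if the kexec path is not taken.
sysctl -w kernel.panic_on_oops=1
sysctl -w kernel.panic=10
```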
* Re: NMI lockup, 2.6.26 release
  2008-07-22 21:36 ` Jarek Poplawski
  2008-07-22 21:45   ` denys
@ 2008-07-23 19:47   ` denys
  2008-07-23 21:09     ` Jarek Poplawski
  2008-07-23 22:26     ` Jarek Poplawski
  1 sibling, 2 replies; 29+ messages in thread
From: denys @ 2008-07-23 19:47 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: denys, netdev

It seems none of the patches helps with the issue. I will recheck that I
applied them correctly. Here is the latest one:

[28896.091934] Process swapper (pid: 0, ti=c0842000 task=f7c35c80 task.ti=f7c3d000)
[28896.091934] Stack: f6ce64c4 f6ce64c4 f6ce64c4 c0842eec c01d7cc9 c2df6498 00000000 00000000
[28896.091934]        f6ce64c4 f6ce64c8 c2df6490 c0842f0c c0133e9c 00000001 f6ce64c4 00000000
[28896.091934]        c2df6490 f6ce64c4 c2df6490 c0842f34 c01344d3 c2df6484 76103400 00001a45
[28896.091934] Call Trace:
[28896.091934]  [<c01d7cc9>] ? rb_insert_color+0x99/0xbc
[28896.091934]  [<c0133e9c>] ? enqueue_hrtimer+0xfa/0x106
[28896.091934]  [<c01344d3>] ? hrtimer_start+0xee/0x118
[28896.091934]  [<c0267db2>] ? qdisc_watchdog_schedule+0x19/0x1f
[28896.091934]  [<f89eadf3>] ? htb_dequeue+0x6a8/0x6b3 [sch_htb]
[28896.091934]  [<c0266e5d>] ? __qdisc_run+0x5f/0x191
[28896.091934]  [<c025b1a5>] ? net_tx_action+0xb4/0xda
[28896.091934]  [<c0125817>] ? __do_softirq+0x6f/0xe9
[28896.091934]  [<c010614b>] ? do_softirq+0x5e/0xa8
[28896.091934]  [<c0146f79>] ? handle_edge_irq+0x0/0x10a
[28896.091934]  [<c0125775>] ? irq_exit+0x44/0x77
[28896.091934]  [<c0106235>] ? do_IRQ+0xa0/0xb7
[28896.091934]  [<c0108fb5>] ? mwait_idle+0x0/0x43
[28896.091934]  [<c01042fa>] ? common_interrupt+0x2e/0x34
[28896.091934]  [<c0108fb5>] ? mwait_idle+0x0/0x43
[28896.091934]  [<c01300d8>] ? param_set_copystring+0x36/0x60
[28896.091934]  [<c0108fee>] ? mwait_idle+0x39/0x43
[28896.091934]  [<c0102596>] ? cpu_idle+0x9a/0xba
[28896.091934]  [<c02af0a7>] ? start_secondary+0x160/0x165
[28896.091934] =======================
[28896.091934] Code: 09 8b 01 83 e0 03 09 d8 89 01 8b 02 89 5a 08 83 e0 03 09 f0 85 f6 89 02 74 0f 3b 5e 08 75 05 89 56 08 eb 07 89 56 04 eb 02 89 17 <8b> 03 83 e0 03 09 d0 89 03 5b 5e 5f 5d c3 55 89 e5 57 89 d7 56

On Wednesday 23 July 2008, Jarek Poplawski wrote:
> On Tue, Jul 22, 2008 at 11:46:36PM +0300, denys@visp.net.lb wrote:
> > First patch - probably not related. Netconsole is on e100.
> > Second patch looks reasonable, because there is external interface (it is
> > router) and e1000e there, with enabled TSO. Maybe some scanning tools doing
> > connect to SSH and... BOOM.. i will try to patch it, thanks a lot!
>
> On the other hand, it seems the corruption of skbs fixed by this
> second patch should probably show in some other places...
>
> I wonder if you tried to run this without netconsole (or maybe
> netconsole on another dev/driver) or without this NMI watchdog,
> and similar lockups still happened?
>
> Jarek P.

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-23 19:47 ` denys
@ 2008-07-23 21:09 ` Jarek Poplawski
  2008-07-23 22:26 ` Jarek Poplawski
  1 sibling, 0 replies; 29+ messages in thread
From: Jarek Poplawski @ 2008-07-23 21:09 UTC (permalink / raw)
To: denys; +Cc: netdev

On Wed, Jul 23, 2008 at 10:47:17PM +0300, denys@visp.net.lb wrote:
> It seems none of the patches helps with the issue.
> I will recheck whether I applied them correctly.

It's a pity! On the other hand, it looks like you have something really
special... But, since all these reports stop at hrtimers, my proposal
is to check where it could hit without them. I attach a patch for
debugging, which brings back a plain timer instead of the hrtimer
watchdog. Alas, I didn't test it, so be cautious. (This should work
with 2.6.26 or .25, I hope.)

Jarek P.

---

 net/sched/sch_htb.c | 28 +++++++++++++++++++++++-----
 1 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 3fb58f4..fdc84c3 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -169,7 +169,8 @@ struct htb_sched {
 	int rate2quantum;	/* quant = rate / rate2quantum */
 	psched_time_t now;	/* cached dequeue time */
-	struct qdisc_watchdog watchdog;
+	//struct qdisc_watchdog watchdog;
+	struct timer_list timer;	/* send delay timer */
 
 	/* non shaped skbs; let them go directly thru */
 	struct sk_buff_head direct_queue;
@@ -893,6 +894,14 @@ next:
 	return skb;
 }
 
+static void htb_timer(unsigned long arg)
+{
+	struct Qdisc *sch = (struct Qdisc *)arg;
+	sch->flags &= ~TCQ_F_THROTTLED;
+	wmb();
+	netif_schedule(sch->dev);
+}
+
 static struct sk_buff *htb_dequeue(struct Qdisc *sch)
 {
 	struct sk_buff *skb = NULL;
@@ -943,7 +952,9 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch)
 		}
 	}
 	sch->qstats.overlimits++;
-	qdisc_watchdog_schedule(&q->watchdog, next_event);
+	//qdisc_watchdog_schedule(&q->watchdog, next_event);
+	mod_timer(&q->timer, (unsigned long)next_event /
+		  PSCHED_TICKS_PER_SEC * HZ);
 fin:
 	return skb;
 }
@@ -996,7 +1007,9 @@ static void htb_reset(struct Qdisc *sch)
 		}
 	}
-	qdisc_watchdog_cancel(&q->watchdog);
+	//qdisc_watchdog_cancel(&q->watchdog);
+	sch->flags &= ~TCQ_F_THROTTLED;
+	del_timer(&q->timer);
 	__skb_queue_purge(&q->direct_queue);
 	sch->q.qlen = 0;
 	memset(q->row, 0, sizeof(q->row));
@@ -1047,7 +1060,11 @@ static int htb_init(struct Qdisc *sch, struct nlattr *opt)
 	for (i = 0; i < TC_HTB_NUMPRIO; i++)
 		INIT_LIST_HEAD(q->drops + i);
 
-	qdisc_watchdog_init(&q->watchdog, sch);
+	//qdisc_watchdog_init(&q->watchdog, sch);
+	q->timer.function = htb_timer;
+	q->timer.data = (unsigned long)sch;
+	init_timer(&q->timer);
+
 	skb_queue_head_init(&q->direct_queue);
 
 	q->direct_qlen = sch->dev->tx_queue_len;
@@ -1262,7 +1279,8 @@ static void htb_destroy(struct Qdisc *sch)
 {
 	struct htb_sched *q = qdisc_priv(sch);
 
-	qdisc_watchdog_cancel(&q->watchdog);
+	//qdisc_watchdog_cancel(&q->watchdog);
+	del_timer_sync(&q->timer);
 	/* This line used to be after htb_destroy_class call below
 	   and surprisingly it worked in 2.4. But it must precede it
 	   because filter need its target class alive to be able to call

^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-23 19:47 ` denys
  2008-07-23 21:09 ` Jarek Poplawski
@ 2008-07-23 22:26 ` Jarek Poplawski
  2008-07-23 23:24   ` Jarek Poplawski
  1 sibling, 1 reply; 29+ messages in thread
From: Jarek Poplawski @ 2008-07-23 22:26 UTC (permalink / raw)
To: denys; +Cc: netdev

(take 2)

Hmm... this should be more accurate.

Jarek P.

---

 net/sched/sch_htb.c | 28 +++++++++++++++++++++++-----
 1 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 3fb58f4..fdc84c3 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -169,7 +169,8 @@ struct htb_sched {
 	int rate2quantum;	/* quant = rate / rate2quantum */
 	psched_time_t now;	/* cached dequeue time */
-	struct qdisc_watchdog watchdog;
+	//struct qdisc_watchdog watchdog;
+	struct timer_list timer;	/* send delay timer */
 
 	/* non shaped skbs; let them go directly thru */
 	struct sk_buff_head direct_queue;
@@ -893,6 +894,14 @@ next:
 	return skb;
 }
 
+static void htb_timer(unsigned long arg)
+{
+	struct Qdisc *sch = (struct Qdisc *)arg;
+	sch->flags &= ~TCQ_F_THROTTLED;
+	wmb();
+	netif_schedule(sch->dev);
+}
+
 static struct sk_buff *htb_dequeue(struct Qdisc *sch)
 {
 	struct sk_buff *skb = NULL;
@@ -943,7 +952,9 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch)
 		}
 	}
 	sch->qstats.overlimits++;
-	qdisc_watchdog_schedule(&q->watchdog, next_event);
+	//qdisc_watchdog_schedule(&q->watchdog, next_event);
+	mod_timer(&q->timer, (unsigned long)next_event * HZ /
+		  PSCHED_TICKS_PER_SEC);
 fin:
 	return skb;
 }
@@ -996,7 +1007,9 @@ static void htb_reset(struct Qdisc *sch)
 		}
 	}
-	qdisc_watchdog_cancel(&q->watchdog);
+	//qdisc_watchdog_cancel(&q->watchdog);
+	sch->flags &= ~TCQ_F_THROTTLED;
+	del_timer(&q->timer);
 	__skb_queue_purge(&q->direct_queue);
 	sch->q.qlen = 0;
 	memset(q->row, 0, sizeof(q->row));
@@ -1047,7 +1060,11 @@ static int htb_init(struct Qdisc *sch, struct nlattr *opt)
 	for (i = 0; i < TC_HTB_NUMPRIO; i++)
 		INIT_LIST_HEAD(q->drops + i);
 
-	qdisc_watchdog_init(&q->watchdog, sch);
+	//qdisc_watchdog_init(&q->watchdog, sch);
+	q->timer.function = htb_timer;
+	q->timer.data = (unsigned long)sch;
+	init_timer(&q->timer);
+
 	skb_queue_head_init(&q->direct_queue);
 
 	q->direct_qlen = sch->dev->tx_queue_len;
@@ -1262,7 +1279,8 @@ static void htb_destroy(struct Qdisc *sch)
 {
 	struct htb_sched *q = qdisc_priv(sch);
 
-	qdisc_watchdog_cancel(&q->watchdog);
+	//qdisc_watchdog_cancel(&q->watchdog);
+	del_timer_sync(&q->timer);
 	/* This line used to be after htb_destroy_class call below
	   and surprisingly it worked in 2.4. But it must precede it
	   because filter need its target class alive to be able to call

^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-23 22:26 ` Jarek Poplawski
@ 2008-07-23 23:24 ` Jarek Poplawski
  2008-07-23 23:56   ` denys
  0 siblings, 1 reply; 29+ messages in thread
From: Jarek Poplawski @ 2008-07-23 23:24 UTC (permalink / raw)
To: denys; +Cc: netdev

(take 3)

...I'm really sorry! If possible, better try to restart with this one.

Jarek P.

---

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 3fb58f4..6b6fff3 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -169,7 +169,8 @@ struct htb_sched {
 	int rate2quantum;	/* quant = rate / rate2quantum */
 	psched_time_t now;	/* cached dequeue time */
-	struct qdisc_watchdog watchdog;
+	//struct qdisc_watchdog watchdog;
+	struct timer_list timer;	/* send delay timer */
 
 	/* non shaped skbs; let them go directly thru */
 	struct sk_buff_head direct_queue;
@@ -893,6 +894,14 @@ next:
 	return skb;
 }
 
+static void htb_timer(unsigned long arg)
+{
+	struct Qdisc *sch = (struct Qdisc *)arg;
+	sch->flags &= ~TCQ_F_THROTTLED;
+	wmb();
+	netif_schedule(sch->dev);
+}
+
 static struct sk_buff *htb_dequeue(struct Qdisc *sch)
 {
 	struct sk_buff *skb = NULL;
@@ -943,7 +952,10 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch)
 		}
 	}
 	sch->qstats.overlimits++;
-	qdisc_watchdog_schedule(&q->watchdog, next_event);
+	//qdisc_watchdog_schedule(&q->watchdog, next_event);
+	next_event = next_event - q->now;
+	mod_timer(&q->timer, jiffies + (unsigned long)next_event * HZ /
+		  PSCHED_TICKS_PER_SEC);
 fin:
 	return skb;
 }
@@ -996,7 +1008,9 @@ static void htb_reset(struct Qdisc *sch)
 		}
 	}
-	qdisc_watchdog_cancel(&q->watchdog);
+	//qdisc_watchdog_cancel(&q->watchdog);
+	sch->flags &= ~TCQ_F_THROTTLED;
+	del_timer(&q->timer);
 	__skb_queue_purge(&q->direct_queue);
 	sch->q.qlen = 0;
 	memset(q->row, 0, sizeof(q->row));
@@ -1047,7 +1061,11 @@ static int htb_init(struct Qdisc *sch, struct nlattr *opt)
 	for (i = 0; i < TC_HTB_NUMPRIO; i++)
 		INIT_LIST_HEAD(q->drops + i);
 
-	qdisc_watchdog_init(&q->watchdog, sch);
+	//qdisc_watchdog_init(&q->watchdog, sch);
+	q->timer.function = htb_timer;
+	q->timer.data = (unsigned long)sch;
+	init_timer(&q->timer);
+
 	skb_queue_head_init(&q->direct_queue);
 
 	q->direct_qlen = sch->dev->tx_queue_len;
@@ -1262,7 +1280,8 @@ static void htb_destroy(struct Qdisc *sch)
 {
 	struct htb_sched *q = qdisc_priv(sch);
 
-	qdisc_watchdog_cancel(&q->watchdog);
+	//qdisc_watchdog_cancel(&q->watchdog);
+	del_timer_sync(&q->timer);
 	/* This line used to be after htb_destroy_class call below
	   and surprisingly it worked in 2.4. But it must precede it
	   because filter need its target class alive to be able to call

^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-23 23:24 ` Jarek Poplawski
@ 2008-07-23 23:56 ` denys
  2008-07-24 14:56   ` denys
  2008-07-25  7:36   ` Jarek Poplawski
  0 siblings, 2 replies; 29+ messages in thread
From: denys @ 2008-07-23 23:56 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: denys, netdev

I don't have any problem restarting :-)
The build system and restart (over kexec) are automated.
Rebooting now.

On Thursday 24 July 2008, Jarek Poplawski wrote:
> (take 3)
>
> ...I'm really sorry! If possible, better try to restart with this one.
>
> Jarek P.
>
> ---
>
> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> index 3fb58f4..6b6fff3 100644
> --- a/net/sched/sch_htb.c
> +++ b/net/sched/sch_htb.c
> @@ -169,7 +169,8 @@ struct htb_sched {
>  	int rate2quantum;	/* quant = rate / rate2quantum */
>  	psched_time_t now;	/* cached dequeue time */
> -	struct qdisc_watchdog watchdog;
> +	//struct qdisc_watchdog watchdog;
> +	struct timer_list timer;	/* send delay timer */
>
>  	/* non shaped skbs; let them go directly thru */
>  	struct sk_buff_head direct_queue;
> @@ -893,6 +894,14 @@ next:
>  	return skb;
>  }
>
> +static void htb_timer(unsigned long arg)
> +{
> +	struct Qdisc *sch = (struct Qdisc *)arg;
> +	sch->flags &= ~TCQ_F_THROTTLED;
> +	wmb();
> +	netif_schedule(sch->dev);
> +}
> +
>  static struct sk_buff *htb_dequeue(struct Qdisc *sch)
>  {
>  	struct sk_buff *skb = NULL;
> @@ -943,7 +952,10 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch)
>  		}
>  	}
>  	sch->qstats.overlimits++;
> -	qdisc_watchdog_schedule(&q->watchdog, next_event);
> +	//qdisc_watchdog_schedule(&q->watchdog, next_event);
> +	next_event = next_event - q->now;
> +	mod_timer(&q->timer, jiffies + (unsigned long)next_event * HZ /
> +		  PSCHED_TICKS_PER_SEC);
>  fin:
>  	return skb;
>  }
> @@ -996,7 +1008,9 @@ static void htb_reset(struct Qdisc *sch)
>  		}
>  	}
> -	qdisc_watchdog_cancel(&q->watchdog);
> +	//qdisc_watchdog_cancel(&q->watchdog);
> +	sch->flags &= ~TCQ_F_THROTTLED;
> +	del_timer(&q->timer);
>  	__skb_queue_purge(&q->direct_queue);
>  	sch->q.qlen = 0;
>  	memset(q->row, 0, sizeof(q->row));
> @@ -1047,7 +1061,11 @@ static int htb_init(struct Qdisc *sch, struct nlattr *opt)
>  	for (i = 0; i < TC_HTB_NUMPRIO; i++)
>  		INIT_LIST_HEAD(q->drops + i);
>
> -	qdisc_watchdog_init(&q->watchdog, sch);
> +	//qdisc_watchdog_init(&q->watchdog, sch);
> +	q->timer.function = htb_timer;
> +	q->timer.data = (unsigned long)sch;
> +	init_timer(&q->timer);
> +
>  	skb_queue_head_init(&q->direct_queue);
>
>  	q->direct_qlen = sch->dev->tx_queue_len;
> @@ -1262,7 +1280,8 @@ static void htb_destroy(struct Qdisc *sch)
>  {
>  	struct htb_sched *q = qdisc_priv(sch);
>
> -	qdisc_watchdog_cancel(&q->watchdog);
> +	//qdisc_watchdog_cancel(&q->watchdog);
> +	del_timer_sync(&q->timer);
>  	/* This line used to be after htb_destroy_class call below
>	   and surprisingly it worked in 2.4. But it must precede it
>	   because filter need its target class alive to be able to call
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-23 23:56 ` denys
@ 2008-07-24 14:56 ` denys
  2008-07-24 17:45   ` Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: denys @ 2008-07-24 14:56 UTC (permalink / raw)
To: denys; +Cc: Jarek Poplawski, netdev

Nothing yet. Still waiting; in 6 hours I will have "peak time".
But yesterday, with the same settings (without the patch), it crashed 3 times.

It is a Core2Duo E6750 CPU, and the clocksource is TSC.
TSC synchronization always passes. HPET is available, so I can try it too if needed.
It was crashing both with nmi_watchdog=1 (which at least brings a crash message) and without it (hard lockup).
The fishy point is ifb, but I have a few hundred NAS servers with ifb + htb
(though surely not at 300-400 Mbps like on this host), and I check the logs on all of them weekly - none of them crashed.

So either it is hardware specific (but it always crashes at the same point),
or maybe memory corruption - it is not server-grade equipment - but it always crashes in the same place, the same way.
I have MCE running, so if there are thermal events it must catch them.
But surely there is a large list of errata for this hardware :-)

On Thursday 24 July 2008, denys@visp.net.lb wrote:
> I don't have any problem restarting :-)
> The build system and restart (over kexec) are automated.
> Rebooting now.

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-24 14:56 ` denys
@ 2008-07-24 17:45 ` Jarek Poplawski
  0 siblings, 0 replies; 29+ messages in thread
From: Jarek Poplawski @ 2008-07-24 17:45 UTC (permalink / raw)
To: denys; +Cc: netdev

On Thu, Jul 24, 2008 at 05:56:43PM +0300, denys@visp.net.lb wrote:
> Nothing yet. Still waiting; in 6 hours I will have "peak time".
> But yesterday, with the same settings (without the patch), it crashed 3 times.

OK, so let's better wait to be sure of something here.

Jarek P.

(Btw., could you try in the meantime to do something to shorten your
lines? It's not the most popular format on kernel lists...)

> It is a Core2Duo E6750 CPU, and the clocksource is TSC.
> TSC synchronization always passes. HPET is available, so I can try it
> too if needed. It was crashing both with nmi_watchdog=1 (which at least
> brings a crash message) and without it (hard lockup).
> The fishy point is ifb, but I have a few hundred NAS servers with
> ifb + htb (though surely not at 300-400 Mbps like on this host), and I
> check the logs on all of them weekly - none of them crashed.
>
> So either it is hardware specific (but it always crashes at the same
> point), or maybe memory corruption - it is not server-grade equipment -
> but it always crashes in the same place, the same way.
> I have MCE running, so if there are thermal events it must catch them.
> But surely there is a large list of errata for this hardware :-)
>
> On Thursday 24 July 2008, denys@visp.net.lb wrote:
> > I don't have any problem restarting :-)
> > The build system and restart (over kexec) are automated.
> > Rebooting now.

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-23 23:56 ` denys
  2008-07-24 14:56 ` denys
@ 2008-07-25  7:36 ` Jarek Poplawski
  2008-07-25 21:09   ` denys
  2008-08-02 12:55   ` Denys Fedoryshchenko
  1 sibling, 2 replies; 29+ messages in thread
From: Jarek Poplawski @ 2008-07-25 7:36 UTC (permalink / raw)
To: denys; +Cc: netdev

Hi Denys,

In case this is still working after the hrtimer -> timer change, here is
another patch for testing: to check whether limiting hrtimer scheduling
could matter here. Btw., could you write what is the approximate number
of HTB qdiscs and classes working on each device of this box (including
ifbs)?

Thanks,
Jarek P.

(This patch should be applied to 2.6.26 or .25 after reverting the
previous debugging patch.)

---

 net/sched/sch_htb.c | 8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 30c999c..ff9e965 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -162,6 +162,7 @@ struct htb_sched {
 	int rate2quantum;	/* quant = rate / rate2quantum */
 	psched_time_t now;	/* cached dequeue time */
+	psched_time_t next_watchdog;
 	struct qdisc_watchdog watchdog;
 
 	/* non shaped skbs; let them go directly thru */
@@ -920,7 +921,11 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch)
 		}
 	}
 	sch->qstats.overlimits++;
-	qdisc_watchdog_schedule(&q->watchdog, next_event);
+	if (q->next_watchdog < q->now || next_event <=
+	    q->next_watchdog - PSCHED_TICKS_PER_SEC / HZ) {
+		qdisc_watchdog_schedule(&q->watchdog, next_event);
+		q->next_watchdog = next_event;
+	}
 fin:
 	return skb;
 }
@@ -973,6 +978,7 @@ static void htb_reset(struct Qdisc *sch)
 		}
 	}
 	qdisc_watchdog_cancel(&q->watchdog);
+	q->next_watchdog = 0;
 	__skb_queue_purge(&q->direct_queue);
 	sch->q.qlen = 0;
 	memset(q->row, 0, sizeof(q->row));

^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-25  7:36 ` Jarek Poplawski
@ 2008-07-25 21:09 ` denys
  2008-07-25 22:31   ` hrtimers lockups Re: NMI lockup, 2.6.26 release Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: denys @ 2008-07-25 21:09 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: denys, netdev

I will try to explain all the details, since maybe anything matters:

around 150-300 Mbit/s passing
Core 2 Duo E6750
3 ifb's
29 htb classes (in total)
26 qdiscs (sfq and bfifo)
NAT is running (465-700K connections)
maximum bfifo qdisc size is 600 Kbyte
almost all filters are u32 (one is police mtu)
quantum is 1514, one is 1515
Load is low (below 30-35% by mpstat)

The only errors I have in dmesg are a LOT of these messages (with
different IPs and ports):
[162014.265116] UDP: short packet: From 200.122.35.205:64599 8409/1480 to 213.254.233.9:6073
[162014.373110] UDP: short packet: From 200.122.35.205:52015 10698/1480 to 213.254.233.9:4855
[162088.232099] UDP: bad checksum. From 96.234.33.9:1077 to 213.254.233.9:49520 ulen 111

I ran time-warp-test from Ingo Molnar - nothing, no warps.

If required, I can send all the rules to a private e-mail.

I will apply the patch in 30-60 minutes (off-peak time). Thanks a lot
for the help!

On Friday 25 July 2008, Jarek Poplawski wrote:
> Hi Denys,
>
> In case this is still working after the hrtimer -> timer change, here is
> another patch for testing: to check whether limiting hrtimer scheduling
> could matter here. Btw., could you write what is the approximate number
> of HTB qdiscs and classes working on each device of this box (including
> ifbs)?
>
> Thanks,
> Jarek P.
>
> (This patch should be applied to 2.6.26 or .25 after reverting the
> previous debugging patch.)
>
> ---

^ permalink raw reply [flat|nested] 29+ messages in thread
* hrtimers lockups Re: NMI lockup, 2.6.26 release
  2008-07-25 21:09 ` denys
@ 2008-07-25 22:31 ` Jarek Poplawski
  0 siblings, 0 replies; 29+ messages in thread
From: Jarek Poplawski @ 2008-07-25 22:31 UTC (permalink / raw)
To: denys; +Cc: Thomas Gleixner, netdev, linux-kernel

Hi,

This netdev thread describes lockups that end up in hrtimers code:

http://marc.info/?l=linux-netdev&m=121675217927170&w=2

Very similar reports from Denys Fedoryshchenko can be found in the
netdev archives a few kernel versions back. It looks like replacing
hrtimers with plain timers in the sch_htb code removes the problems.
I hope Thomas or somebody from linux-kernel could give some clue on
this.

Thanks,
Jarek P.

Denys, read below:

On Sat, Jul 26, 2008 at 12:09:52AM +0300, denys@visp.net.lb wrote:
> I will try to explain all the details, since maybe anything matters:
>
> around 150-300 Mbit/s passing
> Core 2 Duo E6750
> 3 ifb's
> 29 htb classes (in total)
> 26 qdiscs (sfq and bfifo)
> NAT is running (465-700K connections)
> maximum bfifo qdisc size is 600 Kbyte
> mostly all filters are u32 (one is police mtu)
> quantum is 1514, one is 1515
> Load is low (below 30-35% by mpstat)
>
> The only errors I have in dmesg are a LOT of these messages (with
> different IPs and ports):
> [162014.265116] UDP: short packet: From 200.122.35.205:64599 8409/1480 to
> 213.254.233.9:6073
> [162014.373110] UDP: short packet: From 200.122.35.205:52015 10698/1480 to
> 213.254.233.9:4855
>
> [162088.232099] UDP: bad checksum. From 96.234.33.9:1077 to
> 213.254.233.9:49520 ulen 111
>
> I ran time-warp-test from Ingo Molnar - nothing, no warps.
>
> If required, I can send all the rules to a private e-mail.
>
> I will apply the patch in 30-60 minutes (off-peak time). Thanks a lot for the help!

You are very helpful too! But, I think we will need some help from
hrtimers/hardware gurus. IMHO, since it works with timers, the bug
doesn't seem to belong to "netdev". I can't see any obvious possibility
of "abusing" hrtimers with e.g. too big a number of hrtimers with your
config (1 hrtimer per qdisc). So, I'm not very optimistic about this new
patch, but even if it works, it looks like something else is wrong.
That's why I added some CCs to this.

Jarek P.

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-25  7:36 ` Jarek Poplawski
  2008-07-25 21:09 ` denys
@ 2008-08-02 12:55 ` Denys Fedoryshchenko
  2008-08-02 13:07   ` Jarek Poplawski
  1 sibling, 1 reply; 29+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-02 12:55 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: netdev

Sorry for the long delay; I had to make sure that the old change "fixed"
the problem.

Now I am running the second patch. It will also take up to around one
week to make sure that it makes (or doesn't make) any difference,
because sometimes it takes 3-4 days until the crash happens.

On Friday 25 July 2008, Jarek Poplawski wrote:
> Hi Denys,
>
> In case this is still working after the hrtimer -> timer change, here is
> another patch for testing: to check whether limiting hrtimer scheduling
> could matter here. Btw., could you write what is the approximate number
> of HTB qdiscs and classes working on each device of this box (including
> ifbs)?
>
> Thanks,
> Jarek P.

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-08-02 12:55 ` Denys Fedoryshchenko
@ 2008-08-02 13:07 ` Jarek Poplawski
  2008-08-12 11:31   ` Denys Fedoryshchenko
  0 siblings, 1 reply; 29+ messages in thread
From: Jarek Poplawski @ 2008-08-02 13:07 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: netdev

On Sat, Aug 02, 2008 at 03:55:10PM +0300, Denys Fedoryshchenko wrote:
> Sorry for the long delay; I had to make sure that the old change "fixed"
> the problem.
>
> Now I am running the second patch. It will also take up to around one
> week to make sure that it makes (or doesn't make) any difference,
> because sometimes it takes 3-4 days until the crash happens.

OK, don't hurry!

Cheers,
Jarek P.

> On Friday 25 July 2008, Jarek Poplawski wrote:
> > Hi Denys,
> >
> > In case this is still working after the hrtimer -> timer change, here is
> > another patch for testing: to check whether limiting hrtimer scheduling
> > could matter here. Btw., could you write what is the approximate number
> > of HTB qdiscs and classes working on each device of this box (including
> > ifbs)?
> >
> > Thanks,
> > Jarek P.

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-08-02 13:07 ` Jarek Poplawski
@ 2008-08-12 11:31 ` Denys Fedoryshchenko
  2008-08-12 12:40   ` Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-12 11:31 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: netdev

On Saturday 02 August 2008, Jarek Poplawski wrote:
> On Sat, Aug 02, 2008 at 03:55:10PM +0300, Denys Fedoryshchenko wrote:
> > Sorry for the long delay; I had to make sure that the old change
> > "fixed" the problem.
> >
> > Now I am running the second patch. It will also take up to around one
> > week to make sure that it makes (or doesn't make) any difference,
> > because sometimes it takes 3-4 days until the crash happens.
>
> OK, don't hurry!
>
> Cheers,
> Jarek P.

With the second patch it works fine - 9 days of uptime now.

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-08-12 11:31 ` Denys Fedoryshchenko
@ 2008-08-12 12:40 ` Jarek Poplawski
  2008-08-13  7:28   ` Denys Fedoryshchenko
  0 siblings, 1 reply; 29+ messages in thread
From: Jarek Poplawski @ 2008-08-12 12:40 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: netdev

On Tue, Aug 12, 2008 at 02:31:40PM +0300, Denys Fedoryshchenko wrote:
...
> With the second patch it works fine - 9 days of uptime now.

Great! I didn't expect it would be so easy with this strange problem.
So, it looks like hrtimers probably break after some overscheduling.
The only problem now is to find a reasonable limit which is both safe
and doesn't harm resolution too much for others.

IMHO this second patch with 1-jiffie watchdog resolution looks
reasonable and should be acceptable, but it would be nice to check if
we can go lower. Here is "the same" patch with the only change being
the resolution (1/10 of a jiffie). If there are any problems with
testing this, please let me know. (It should be applied after reverting
patch #2.)

Thanks,
Jarek P.
(testing patch #3)
---

 net/sched/sch_htb.c | 8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 30c999c..ff9e965 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -162,6 +162,7 @@ struct htb_sched {
 	int rate2quantum;	/* quant = rate / rate2quantum */
 	psched_time_t now;	/* cached dequeue time */
+	psched_time_t next_watchdog;
 	struct qdisc_watchdog watchdog;
 
 	/* non shaped skbs; let them go directly thru */
@@ -920,7 +921,11 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch)
 		}
 	}
 	sch->qstats.overlimits++;
-	qdisc_watchdog_schedule(&q->watchdog, next_event);
+	if (q->next_watchdog < q->now || next_event <=
+	    q->next_watchdog - PSCHED_TICKS_PER_SEC / (10 * HZ)) {
+		qdisc_watchdog_schedule(&q->watchdog, next_event);
+		q->next_watchdog = next_event;
+	}
 fin:
 	return skb;
 }
@@ -973,6 +978,7 @@ static void htb_reset(struct Qdisc *sch)
 		}
 	}
 	qdisc_watchdog_cancel(&q->watchdog);
+	q->next_watchdog = 0;
 	__skb_queue_purge(&q->direct_queue);
 	sch->q.qlen = 0;
 	memset(q->row, 0, sizeof(q->row));

^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-08-12 12:40 ` Jarek Poplawski
@ 2008-08-13  7:28 ` Denys Fedoryshchenko
  2008-08-13  7:43   ` Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-13 7:28 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: netdev

Just as a proposal: maybe we can catch the situation when "things go
wrong" and panic? Then we could forward some info to the hrtimers guys -
if it is an hrtimers bug...

On Tuesday 12 August 2008, Jarek Poplawski wrote:
> On Tue, Aug 12, 2008 at 02:31:40PM +0300, Denys Fedoryshchenko wrote:
> ...
> > With the second patch it works fine - 9 days of uptime now.
>
> Great! I didn't expect it would be so easy with this strange problem.
> So, it looks like hrtimers probably break after some overscheduling.
> The only problem now is to find a reasonable limit which is both safe
> and doesn't harm resolution too much for others.
>
> IMHO this second patch with 1-jiffie watchdog resolution looks
> reasonable and should be acceptable, but it would be nice to check if
> we can go lower. Here is "the same" patch with the only change being
> the resolution (1/10 of a jiffie). If there are any problems with
> testing this, please let me know. (It should be applied after reverting
> patch #2.)
>
> Thanks,
> Jarek P.
> (testing patch #3)
> ---
>
>  net/sched/sch_htb.c | 8 +++++++-
>  1 files changed, 7 insertions(+), 1 deletions(-)
>
> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> index 30c999c..ff9e965 100644
> --- a/net/sched/sch_htb.c
> +++ b/net/sched/sch_htb.c
> @@ -162,6 +162,7 @@ struct htb_sched {
>  	int rate2quantum;	/* quant = rate / rate2quantum */
>  	psched_time_t now;	/* cached dequeue time */
> +	psched_time_t next_watchdog;
>  	struct qdisc_watchdog watchdog;
>
>  	/* non shaped skbs; let them go directly thru */
> @@ -920,7 +921,11 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch)
>  		}
>  	}
>  	sch->qstats.overlimits++;
> -	qdisc_watchdog_schedule(&q->watchdog, next_event);
> +	if (q->next_watchdog < q->now || next_event <=
> +	    q->next_watchdog - PSCHED_TICKS_PER_SEC / (10 * HZ)) {
> +		qdisc_watchdog_schedule(&q->watchdog, next_event);
> +		q->next_watchdog = next_event;
> +	}
>  fin:
>  	return skb;
>  }
> @@ -973,6 +978,7 @@ static void htb_reset(struct Qdisc *sch)
>  		}
>  	}
>  	qdisc_watchdog_cancel(&q->watchdog);
> +	q->next_watchdog = 0;
>  	__skb_queue_purge(&q->direct_queue);
>  	sch->q.qlen = 0;
>  	memset(q->row, 0, sizeof(q->row));
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-08-13  7:28 ` Denys Fedoryshchenko
@ 2008-08-13  7:43 ` Jarek Poplawski
  2008-08-13  8:02   ` Denys Fedoryshchenko
  0 siblings, 1 reply; 29+ messages in thread
From: Jarek Poplawski @ 2008-08-13 7:43 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: netdev

On Wed, Aug 13, 2008 at 10:28:11AM +0300, Denys Fedoryshchenko wrote:
> Just as a proposal: maybe we can catch the situation when "things go
> wrong" and panic? Then we could forward some info to the hrtimers guys -
> if it is an hrtimers bug...

Yes, it would be best, but I don't know how much I can "use" you and
your clients for debugging this. So, of course, if it's possible, you
could simply edit this patch and try with increased values like
(100 * HZ) or (1000 * HZ), or even something like:

+	if (q->next_watchdog < q->now || next_event <=
+	    q->next_watchdog - 10) {

Alas, the hrtimers guys didn't look very interested, so the main concern
should be to do this optimally in net at least.

Jarek P.

> On Tuesday 12 August 2008, Jarek Poplawski wrote:
> > On Tue, Aug 12, 2008 at 02:31:40PM +0300, Denys Fedoryshchenko wrote:
> > ...
> > > With the second patch it works fine - 9 days of uptime now.
> >
> > Great! I didn't expect it would be so easy with this strange problem.
> > So, it looks like hrtimers probably break after some overscheduling.
> > The only problem now is to find a reasonable limit which is both safe
> > and doesn't harm resolution too much for others.
> >
> > IMHO this second patch with 1-jiffie watchdog resolution looks
> > reasonable and should be acceptable, but it would be nice to check if
> > we can go lower. Here is "the same" patch with the only change being
> > the resolution (1/10 of a jiffie). If there are any problems with
> > testing this, please let me know. (It should be applied after
> > reverting patch #2.)
> >
> > Thanks,
> > Jarek P.
* Re: NMI lockup, 2.6.26 release
  2008-08-13  7:43 ` Jarek Poplawski
@ 2008-08-13  8:02 ` Denys Fedoryshchenko
  2008-08-13  8:49   ` Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-13 8:02 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: netdev

As long as the kernel reboots itself, it won't hurt me much. With the
NMI watchdog I noticed the panic was missing, so nmi_watchdog was
showing the message but not rebooting. It is fixed in the next kernel,
and I patched my kernel - so I guess I will no longer crash+freeze and
will not need to run to the power switch at night.

It can be related to another problem (some corruption) which is not
fixed yet, so preferably we should show the timer guys the exact
location of the problem.

Maybe you can make a patch like:

+	if (q->next_watchdog < q->now || next_event <=
+	    q->next_watchdog - PSCHED_TICKS_PER_SEC / (10 * HZ)) {
+		qdisc_watchdog_schedule(&q->watchdog, next_event);
+		q->next_watchdog = next_event;
+	} else {
		something like BUG()
	}

?

Probably I will also try migrating to "rc" versions of the kernel to
see if the problem still exists there; a lot of changes were done
there... Is the HTB corruption problem finally and completely tracked
down? I saw some discussions about it recently...

On Wednesday 13 August 2008, Jarek Poplawski wrote:
> On Wed, Aug 13, 2008 at 10:28:11AM +0300, Denys Fedoryshchenko wrote:
> > Just as a proposal, maybe we can catch the situation when "things go
> > wrong" and panic? Then we could forward some info to the hrtimers
> > guys? If it is an hrtimers bug...
>
> Yes, that would be best, but I don't know how much I can "use" you
> and your clients for debugging this.
> So, of course, if possible you could simply edit this patch and try
> with increased values like (100 * HZ) or (1000 * HZ), or even
> something like:
>
> +	if (q->next_watchdog < q->now || next_event <=
> +	    q->next_watchdog - 10) {
>
> Alas, the hrtimers guys didn't look very interested, so the main
> concern should be doing this optimally in net, at least.
>
> Jarek P.
* Re: NMI lockup, 2.6.26 release
  2008-08-13  8:02 ` Denys Fedoryshchenko
@ 2008-08-13  8:49 ` Jarek Poplawski
  2008-08-13  9:08   ` Denys Fedoryshchenko
  ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Jarek Poplawski @ 2008-08-13 8:49 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: netdev

On Wed, Aug 13, 2008 at 11:02:34AM +0300, Denys Fedoryshchenko wrote:
...
> Maybe you can make a patch like:
>
> +	if (q->next_watchdog < q->now || next_event <=
> +	    q->next_watchdog - PSCHED_TICKS_PER_SEC / (10 * HZ)) {
> +		qdisc_watchdog_schedule(&q->watchdog, next_event);
> +		q->next_watchdog = next_event;
> +	} else {
> 		something like BUG()
> 	}
> ?

I don't think that's right: there could be some small time differences
between CPUs on SMP, or even some inaccuracy related to hardware, but I
don't think this is the right place or method to verify that. And e.g.
re-scheduling with the same time shouldn't be wrong either.

Anyway, narrowing the problem with such tests should give us a better
understanding of what the real problem could be here. BTW, could you
"remind" us of the .config on this box (especially the various *HZ*,
*TIME* and *TIMERS* settings)?

> Probably I will also try migrating to "rc" versions of the kernel to
> see if the problem still exists there; a lot of changes were done
> there... Is the HTB corruption problem finally and completely tracked
> down? I saw some discussions about it recently...

I doubt current rc versions are stable enough for any production.
HTB waits for one fix, but it's nothing critical if it didn't bother
you until now. There could still be some problems around the schedulers
generally, after the last big changes.

Jarek P.
* Re: NMI lockup, 2.6.26 release
  2008-08-13  8:49 ` Jarek Poplawski
@ 2008-08-13  9:08 ` Denys Fedoryshchenko
  2008-08-14 15:07 ` Denys Fedoryshchenko
  2008-08-15 13:13 ` NMI lockup, 2.6.26 release Denys Fedoryshchenko
  2 siblings, 0 replies; 29+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-13 9:08 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: netdev

On Wednesday 13 August 2008, Jarek Poplawski wrote:
> I don't think that's right: there could be some small time differences
> between CPUs on SMP, or even some inaccuracy related to hardware, but
> I don't think this is the right place or method to verify that. And
> e.g. re-scheduling with the same time shouldn't be wrong either.

OK! Got you. A difference is possible; that machine uses the TSC on a
Core 2 Duo. I ran some code from Ingo Molnar to check whether the TSC
is synchronized - after 2 days it didn't detect anything.

> Anyway, narrowing the problem with such tests should give us a better
> understanding of what the real problem could be here. BTW, could you
> "remind" us of the .config on this box (especially the various *HZ*,
> *TIME* and *TIMERS* settings)?

http://www.nuclearcat.com/files/config_2.6.26.2.txt

The same config is used for 2.6.26.2; this one is from 2.6.26.1.
* Re: NMI lockup, 2.6.26 release
  2008-08-13  8:49 ` Jarek Poplawski
  2008-08-13  9:08 ` Denys Fedoryshchenko
@ 2008-08-14 15:07 ` Denys Fedoryshchenko
  2008-08-14 15:10   ` New: softlockup in 2.6.27-rc3-git2 Denys Fedoryshchenko
  2008-08-15 13:13 ` NMI lockup, 2.6.26 release Denys Fedoryshchenko
  2 siblings, 1 reply; 29+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-14 15:07 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: netdev

On 2.6.27-rc3-git2 I am getting a softlockup 60-120 seconds after
booting. Netconsole is almost dead; I tried to use it to get a stack
trace, but it sends only a few lines of header.

It happens when many tc sessions run in parallel; that's the only info
I have now.

Update: got interesting info, maybe this is the issue:

[   41.496997] =============================================
[   41.496997] [ INFO: possible recursive locking detected ]
[   41.496997] 2.6.27-rc3-git2-build-0030 #5
[   41.496997] ---------------------------------------------
[   41.496997] swapper/0 is trying to acquire lock:
[   41.496997]  (&list->lock#2){-+..}, at: [<c02617d4>] dev_queue_xmit+0x31f/0x481
[   41.496997]
[   41.496997] but task is already holding lock:
[   41.496997]  (&list->lock#2){-+..}, at: [<c0260d17>] netif_receive_skb+0x26e/0x3f6
[   41.496997]
[   41.496997] other info that might help us debug this:
[   41.496997] 5 locks held by swapper/0:
[   41.496997]  #0:  (rcu_read_lock){..--}, at: [<c025f873>] net_rx_action+0x54/0x1e5
[   41.496997]  #1:  (rcu_read_lock){..--}, at: [<c0260bb5>] netif_receive_skb+0x10c/0x3f6
[   41.496997]  #2:  (&list->lock#2){-+..}, at: [<c0260d17>] netif_receive_skb+0x26e/0x3f6
[   41.496997]  #3:  (&p->tcfc_lock){-+..}, at: [<f8a602fc>] tcf_mirred+0x1f/0x14b [act_mirred]
[   41.496997]  #4:  (rcu_read_lock){..--}, at: [<c0261635>] dev_queue_xmit+0x180/0x481
[   41.496997]
[   41.496997] stack backtrace:
[   41.496997] Pid: 0, comm: swapper Not tainted 2.6.27-rc3-git2-build-0030 #5
[   41.496997]  [<c02ba433>] ? printk+0xf/0x14
[   41.496997]  [<c013ed12>] __lock_acquire+0xb3a/0x118a
[   41.496997]  [<c013f3aa>] lock_acquire+0x48/0x64
[   41.496997]  [<c02617d4>] ? dev_queue_xmit+0x31f/0x481
[   41.496997]  [<c02bc901>] _spin_lock+0x1b/0x2a
[   41.496997]  [<c02617d4>] ? dev_queue_xmit+0x31f/0x481
[   41.496997]  [<c02617d4>] dev_queue_xmit+0x31f/0x481
[   41.496997]  [<f8a60408>] tcf_mirred+0x12b/0x14b [act_mirred]
[   41.496997]  [<f8a602dd>] ? tcf_mirred+0x0/0x14b [act_mirred]
[   41.496997]  [<c027102b>] tcf_action_exec+0x43/0x72
[   41.496997]  [<f8a92cd5>] u32_classify+0xf4/0x20b [cls_u32]
[   41.496997]  [<c013dabf>] ? trace_hardirqs_on+0xb/0xd
[   41.496997]  [<c026ea85>] tc_classify_compat+0x2e/0x5d
[   41.496997]  [<c026ebcd>] tc_classify+0x17/0x72
[   41.496997]  [<f8a9c0b2>] ingress_enqueue+0x1a/0x54 [sch_ingress]
[   41.496997]  [<c0260d31>] netif_receive_skb+0x288/0x3f6
[   41.496997]  [<c0260f13>] process_backlog+0x74/0xcb
[   41.496997]  [<c025f8da>] net_rx_action+0xbb/0x1e5
[   41.496997]  [<c0126203>] __do_softirq+0x7b/0xf4
[   41.496997]  [<c0126188>] ? __do_softirq+0x0/0xf4
[   41.496997]  [<c01060b3>] do_softirq+0x65/0xb6
[   41.496997]  [<c014a35d>] ? handle_fasteoi_irq+0x0/0xb6
[   41.496997]  [<c0125e28>] irq_exit+0x44/0x79
[   41.496997]  [<c0106038>] do_IRQ+0xae/0xc4
[   41.496997]  [<c0104288>] common_interrupt+0x28/0x30
[   41.496997]  [<c013007b>] ? find_get_pid+0x2e/0x4d
[   41.496997]  [<c0108d8a>] ? mwait_idle+0x39/0x43
[   41.496997]  [<c01029ee>] cpu_idle+0xbf/0xe1
[   41.496997]  [<c02afc5e>] rest_init+0x4e/0x50
[   41.496997] =======================
* New: softlockup in 2.6.27-rc3-git2
  2008-08-14 15:07 ` Denys Fedoryshchenko
@ 2008-08-14 15:10 ` Denys Fedoryshchenko
  0 siblings, 0 replies; 29+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-14 15:10 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: netdev

Sorry, I had to update the subject.

On Thursday 14 August 2008, Denys Fedoryshchenko wrote:
> On 2.6.27-rc3-git2 I am getting a softlockup 60-120 seconds after
> booting. Netconsole is almost dead; I tried to use it to get a stack
> trace, but it sends only a few lines of header.
>
> It happens when many tc sessions run in parallel; that's the only
> info I have now.
>
> Update: got interesting info, maybe this is the issue:
>
> [possible recursive locking report snipped; quoted in full above]
* Re: NMI lockup, 2.6.26 release
  2008-08-13  8:49 ` Jarek Poplawski
  2008-08-13  9:08 ` Denys Fedoryshchenko
  2008-08-14 15:07 ` Denys Fedoryshchenko
@ 2008-08-15 13:13 ` Denys Fedoryshchenko
  2008-08-15 14:16   ` Jarek Poplawski
  2 siblings, 1 reply; 29+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-15 13:13 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: netdev

On Wednesday 13 August 2008, Jarek Poplawski wrote:
> I doubt current rc versions are stable enough for any production. HTB
> waits for one fix, but it's nothing critical if it didn't bother you
> until now. There could still be some problems around the schedulers
> generally, after the last big changes.

After patching the locking issue, I applied 2.6.27-rc3 with those
patches (the locking one, not the testing patch which limits
resolution) on the shaper that was crashing under load. Without your
patch it was crashing on 2.6.27-rc3 just like before (the NMI watchdog
issuing a panic, and the machine not rebooting) after a few hours of
running.

Sadly, I will not be able to test on this machine anymore, because I
lost access to it and had to change the network structure.

I will try to bring it to my office and simulate the crash by
generating traffic, but most probably it will not work. It seems your
patch is required for mainline, but now I can test it only in theory,
because another machine is running as the shaper now, using HPET, not
TSC (it is an AMD Opteron).
* Re: NMI lockup, 2.6.26 release
  2008-08-15 13:13 ` NMI lockup, 2.6.26 release Denys Fedoryshchenko
@ 2008-08-15 14:16 ` Jarek Poplawski
  0 siblings, 0 replies; 29+ messages in thread
From: Jarek Poplawski @ 2008-08-15 14:16 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: netdev

On Fri, Aug 15, 2008 at 04:13:59PM +0300, Denys Fedoryshchenko wrote:
...
> After patching the locking issue, I applied 2.6.27-rc3 with those
> patches (the locking one, not the testing patch which limits
> resolution) on the shaper that was crashing under load. Without your
> patch it was crashing on 2.6.27-rc3 just like before (the NMI
> watchdog issuing a panic, and the machine not rebooting) after a few
> hours of running.
>
> Sadly, I will not be able to test on this machine anymore, because I
> lost access to it and had to change the network structure.
>
> I will try to bring it to my office and simulate the crash by
> generating traffic, but most probably it will not work. It seems your
> patch is required for mainline, but now I can test it only in theory,
> because another machine is running as the shaper now, using HPET, not
> TSC (it is an AMD Opteron).

Since this bug looks so rare, and is probably hardware dependent, it
looks like fixing it can wait until it bothers somebody again. Most
importantly, a workaround has been found, and we can come back to
improve this.

Thanks,
Jarek P.
end of thread, other threads:[~2008-08-15 14:14 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2008-07-22 18:42 NMI lockup, 2.6.26 release denys
2008-07-22 20:13 ` Jarek Poplawski
2008-07-22 20:35 ` Jarek Poplawski
2008-07-22 20:46 ` denys
2008-07-22 21:36 ` Jarek Poplawski
2008-07-22 21:45 ` denys
2008-07-23 19:47 ` denys
2008-07-23 21:09 ` Jarek Poplawski
2008-07-23 22:26 ` Jarek Poplawski
2008-07-23 23:24 ` Jarek Poplawski
2008-07-23 23:56 ` denys
2008-07-24 14:56 ` denys
2008-07-24 17:45 ` Jarek Poplawski
2008-07-25  7:36 ` Jarek Poplawski
2008-07-25 21:09 ` denys
2008-07-25 22:31 ` hrtimers lockups " Jarek Poplawski
2008-08-02 12:55 ` Denys Fedoryshchenko
2008-08-02 13:07 ` Jarek Poplawski
2008-08-12 11:31 ` Denys Fedoryshchenko
2008-08-12 12:40 ` Jarek Poplawski
2008-08-13  7:28 ` Denys Fedoryshchenko
2008-08-13  7:43 ` Jarek Poplawski
2008-08-13  8:02 ` Denys Fedoryshchenko
2008-08-13  8:49 ` Jarek Poplawski
2008-08-13  9:08 ` Denys Fedoryshchenko
2008-08-14 15:07 ` Denys Fedoryshchenko
2008-08-14 15:10 ` New: softlockup in 2.6.27-rc3-git2 Denys Fedoryshchenko
2008-08-15 13:13 ` NMI lockup, 2.6.26 release Denys Fedoryshchenko
2008-08-15 14:16 ` Jarek Poplawski