* NMI lockup, 2.6.26 release
@ 2008-07-22 18:42 denys
  2008-07-22 20:13 ` Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: denys @ 2008-07-22 18:42 UTC (permalink / raw)
  To: netdev

workload: ifb shapers, HTB, nat, 1 e1000, 3 e100 (heavily used only two e100)

[17146.584293] BUG: NMI Watchdog detected LOCKUP on CPU1, ip c01d78d4, registers:
[17146.584293] Process swapper (pid: 0, ti=c0842000 task=f7c35c80 task.ti=f7c3d000)
[17146.584293] Stack: c29f6498 00000000 00000000 f6c9fcc4 f6c9fcc8 c29f6490 c0842d1c c0133ddc
[17146.584293]        00000001 f6c9fcc4 00000000 c29f6490 f6c9fcc4 c29f6490 c0842d44 c0134413
[17146.584293]        c29f6484 0230ac00 00000f97 00000000 00000286 f6c9f800 f7d2f000 00000000
[17146.584293] Call Trace:
[17146.584293]  [<c0133ddc>] ? enqueue_hrtimer+0xfa/0x106
[17146.584293]  [<c0134413>] ? hrtimer_start+0xee/0x118
[17146.584293]  [<c026790e>] ? qdisc_watchdog_schedule+0x19/0x1f
[17146.584293]  [<f89eade1>] ? htb_dequeue+0x6a8/0x6b3 [sch_htb]
[17146.584293]  [<f899bad4>] ? sfq_enqueue+0x16/0x1be [sch_sfq]
[17146.584293]  [<c02669b9>] ? __qdisc_run+0x5f/0x191
[17146.584293]  [<c025ba50>] ? dev_queue_xmit+0x1ba/0x316
[17146.584293]  [<f8a21245>] ? tcf_mirred+0x132/0x153 [act_mirred]
[17146.584293]  [<f8a21113>] ? tcf_mirred+0x0/0x153 [act_mirred]
[17146.584293]  [<c0268ce4>] ? tcf_action_exec+0x44/0x77
[17146.584293]  [<f89ce792>] ? u32_classify+0x119/0x24e [cls_u32]
[17146.584293]  [<c0266dfa>] ? tc_classify_compat+0x2f/0x5e
[17146.584293]  [<c0267806>] ? tc_classify+0x17/0x78
[17146.584293]  [<f89f10a3>] ? ingress_enqueue+0x1a/0x53 [sch_ingress]
[17146.584293]  [<c0258eca>] ? netif_receive_skb+0x26b/0x406
[17146.584293]  [<c0255bfb>] ? __netdev_alloc_skb+0x17/0x34
[17146.584293]  [<f89b8c7a>] ? e1000_receive_skb+0x13b/0x162 [e1000e]
[17146.584293]  [<f89bb599>] ? e1000_clean_rx_irq+0x1ff/0x28d [e1000e]
[17146.584293]  [<f89b843a>] ? e1000_clean+0x57/0x1de [e1000e]
[17146.584293]  [<c025b006>] ? net_rx_action+0xb3/0x1e0
[17146.584293]  [<c0125757>] ? __do_softirq+0x6f/0xe9
[17146.584293]  [<c010614b>] ? do_softirq+0x5e/0xa8
[17146.584293]  [<c0146eb9>] ? handle_edge_irq+0x0/0x10a
[17146.584293]  [<c01256b5>] ? irq_exit+0x44/0x77
[17146.584293]  [<c0106235>] ? do_IRQ+0xa0/0xb7
[17146.584293]  [<c0108fb5>] ? mwait_idle+0x0/0x43
[17146.584293]  [<c01042fa>] ? common_interrupt+0x2e/0x34
[17146.584293]  [<c0108fb5>] ? mwait_idle+0x0/0x43
[17146.584293]  [<c01300d8>] ? param_array_set+0x96/0xc8
[17146.584293]  [<c0108fee>] ? mwait_idle+0x39/0x43
[17146.584293]  [<c0102596>] ? cpu_idle+0x9a/0xba
[17146.584293]  [<c02aebfb>] ? start_secondary+0x160/0x165
[17146.584293] =======================
[17146.584293] Code: 83 09 01 89 f0 83 26 fe 8b 55 e8 e8 57 ff ff ff eb 42 85 d2 74 15 8b 02 a8 01 75 0f 83 c8 01 89 f7 89 02 83 0b 01 83 26 fe eb 29 <8b> 53 08 39 fa 89 55 ec 75 0f 8b 55 e8 89 d8 89 df e8 26 ff ff

And one before, but also today (not sure about kernel version, between rc8
and release):

[143348.473981] BUG: NMI Watchdog detected LOCKUP on CPU1, ip c01ca1bf, registers:
[143348.473981] Modules linked in: netconsole configfs coretemp hwmon i2c_i801 i2c_core nf_nat_ftp nf_conntrack_ftp softdog nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_proto_gre hangcheck_timer act_mirred sch_ingress act_police cls_u32 sch_sfq sch_htb iptable_nat nf_nat nf_conntrack_ipv4 xt_tcpudp ipt_TTL ipt_ttl xt_NOTRACK nf_conntrack iptable_raw iptable_mangle ifb e1000e iptable_filter ip_tables x_tables 8021q tun tulip r8169 sky2 via_velocity via_rhine sis900 ne2k_pci 8390 tg3 8139too e1000 e100 usb_storage mtdblock mtd_blkdevs usbhid uhci_hcd ehci_hcd ohci_hcd usbcore
[143348.473981] Pid: 0, comm: swapper Not tainted (2.6.26-rc8-build-0029 #33)
[143348.473981] EIP: 0060:[<c01ca1bf>] EFLAGS: 00000082 CPU: 1
[143348.473981] EIP is at rb_insert_color+0x57/0xbc
[143348.473981] EAX: f75c24a4 EBX: f75c24a4 ECX: f75c24a4 EDX: 00000000
[143348.473981] ESI: f75c24a4 EDI: f75c24a4 EBP: c08f2d0c ESP: c08f2cf4
[143348.473981] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[143348.473981] Process swapper (pid: 0, ti=c08f2000 task=f7c314a0 task.ti=f7c3c000)
[143348.473981] Stack: c1ff8480 00000000 00000000 f75c24a4 f75c24a8 c1ff8478 c08f2d2c c0130358
[143348.473981]        00000001 f75c24a4 00000000 c1ff8478 f75c24a4 c1ff8478 c08f2d50 c0130a46
[143348.473981]        55c50c00 0000825f 00000000 00000286 f75c2000 f7c9d000 00000000 c08f2d60
[143348.473981] Call Trace:
[143348.473981]  [<c0130358>] ? enqueue_hrtimer+0xf5/0x101
[143348.473981]  [<c0130a46>] ? hrtimer_start+0xe2/0x10c
[143348.473981]  [<c025545d>] ? qdisc_watchdog_schedule+0x19/0x1f
[143348.473981]  [<f8a11dd6>] ? htb_dequeue+0x6a6/0x6b1 [sch_htb]
[143348.473981]  [<f8988ad0>] ? sfq_enqueue+0x16/0x1c2 [sch_sfq]
[143348.473981]  [<c025451e>] ? __qdisc_run+0x5f/0x187
[143348.473981]  [<c0249a9b>] ? dev_queue_xmit+0x1a3/0x2c0
[143348.473981]  [<f8a1523d>] ? tcf_mirred+0x12d/0x149 [act_mirred]
[143348.473981]  [<f8a15110>] ? tcf_mirred+0x0/0x149 [act_mirred]
[143348.473981]  [<c025682c>] ? tcf_action_exec+0x44/0x77
[143348.473981]  [<f89bd792>] ? u32_classify+0x119/0x24e [cls_u32]
[143348.473981]  [<f88d2da8>] ? e100_exec_cb+0xee/0xf9 [e100]
[143348.473981]  [<f88d221b>] ? e100_xmit_prepare+0x0/0x87 [e100]
[143348.473981]  [<f88d2e0d>] ? e100_xmit_frame+0x5a/0xc4 [e100]
[143348.473981]  [<c02477c0>] ? dev_hard_start_xmit+0x1f8/0x266
[143348.473981]  [<c0254952>] ? tc_classify_compat+0x2f/0x5e
[143348.473981]  [<c0255355>] ? tc_classify+0x17/0x78
[143348.473981]  [<f89f40a2>] ? ingress_enqueue+0x1a/0x53 [sch_ingress]
[143348.473981]  [<c024717f>] ? netif_receive_skb+0x20e/0x393
[143348.473981]  [<f89a8c21>] ? e1000_receive_skb+0x138/0x15f [e1000e]
[143348.473981]  [<f89ab4b8>] ? e1000_clean_rx_irq+0x1fc/0x28a [e1000e]
[143348.473981]  [<f89a8406>] ? e1000_clean+0x50/0x1be [e1000e]
[143348.473981]  [<c0248ff5>] ? net_rx_action+0x8f/0x199
[143348.473981]  [<c0122a63>] ? __do_softirq+0x64/0xcd
[143348.473981]  [<c0105e26>] ? do_softirq+0x55/0x89
[143348.473981]  [<c013e76c>] ? handle_edge_irq+0x0/0x100
[143348.473981]  [<c01229cc>] ? irq_exit+0x38/0x6b
[143348.473981]  [<c0105efa>] ? do_IRQ+0xa0/0xb6
[143348.473981]  [<c0108860>] ? mwait_idle+0x0/0x38
[143348.473981]  [<c01041df>] ? common_interrupt+0x23/0x28
[143348.473981]  [<c0108860>] ? mwait_idle+0x0/0x38
[143348.473981]  [<c0108892>] ? mwait_idle+0x32/0x38
[143348.473981]  [<c010255d>] ? cpu_idle+0x71/0x8a
[143348.473981] =======================
[143348.473981] Code: 8b 43 04 39 f8 89 45 f0 75 0f 8b 55 e8 89 d8 89 df e8 16 ff ff ff 8b 4d f0 83 09 01 89 f0 83 26 fe 8b 55 e8 e8 57 ff ff ff eb 42 <85> d2 74 15 8b 02 a8 01 75 0f 83 c8 01 89 f7 89 02 83 0b 01 83

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-22 18:42 NMI lockup, 2.6.26 release denys
@ 2008-07-22 20:13 ` Jarek Poplawski
  2008-07-22 20:35   ` Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: Jarek Poplawski @ 2008-07-22 20:13 UTC (permalink / raw)
  To: denys; +Cc: netdev

denys@visp.net.lb wrote, On 07/22/2008 08:42 PM:
> workload: ifb shapers, HTB, nat, 1 e1000, 3 e100 (heavily used only two e100)

Maybe it's unconnected, but I'd recommend trying this fresh patch for
possible problems with e1000e vs. netconsole:

http://permalink.gmane.org/gmane.linux.network/100581

Jarek P.

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-22 20:13 ` Jarek Poplawski
@ 2008-07-22 20:35   ` Jarek Poplawski
  2008-07-22 20:46     ` denys
  0 siblings, 1 reply; 29+ messages in thread
From: Jarek Poplawski @ 2008-07-22 20:35 UTC (permalink / raw)
  To: denys; +Cc: netdev

Jarek Poplawski wrote, On 07/22/2008 10:13 PM:
> denys@visp.net.lb wrote, On 07/22/2008 08:42 PM:
>> workload: ifb shapers, HTB, nat, 1 e1000, 3 e100 (heavily used only two e100)
>
> Maybe it's unconnected but I'd recommend to try this fresh patch for
> possible problems with e1000e vs. netconsole:
>
> http://permalink.gmane.org/gmane.linux.network/100581

...and if you have TSO enabled, here is another "maybe unconnected"
recommendation:

http://permalink.gmane.org/gmane.linux.network/99585

Jarek P.

^ permalink raw reply	[flat|nested] 29+ messages in thread
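As an aside, the TSO state Jarek asks about can be inspected and toggled with
ethtool; a quick sketch (the interface name is an assumption):

```shell
# Show the current offload settings for the external interface
# (eth0 is assumed; substitute the real e1000e device name).
ethtool -k eth0 | grep -i segmentation

# Disable TSO as a temporary workaround while testing the patch.
ethtool -K eth0 tso off
```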
* Re: NMI lockup, 2.6.26 release
  2008-07-22 20:35 ` Jarek Poplawski
@ 2008-07-22 20:46   ` denys
  2008-07-22 21:36     ` Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: denys @ 2008-07-22 20:46 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: denys, netdev

First patch - probably not related: netconsole is on e100.
The second patch looks reasonable, because there is an external interface
(this machine is a router) with e1000e and TSO enabled. Maybe some scanning
tool connects to SSH and... BOOM. I will try the patch, thanks a lot!

On Tuesday 22 July 2008, Jarek Poplawski wrote:
> Jarek Poplawski wrote, On 07/22/2008 10:13 PM:
> > denys@visp.net.lb wrote, On 07/22/2008 08:42 PM:
> >> workload: ifb shapers, HTB, nat, 1 e1000, 3 e100 (heavily used only two e100)
> >
> > Maybe it's unconnected but I'd recommend to try this fresh patch for
> > possible problems with e1000e vs. netconsole:
> >
> > http://permalink.gmane.org/gmane.linux.network/100581
>
> ...and if you have TSO enabled, here is another "maybe unconnected"
> recommendation:
>
> http://permalink.gmane.org/gmane.linux.network/99585
>
> Jarek P.
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-22 20:46 ` denys
@ 2008-07-22 21:36   ` Jarek Poplawski
  2008-07-22 21:45     ` denys
  2008-07-23 19:47     ` denys
  0 siblings, 2 replies; 29+ messages in thread
From: Jarek Poplawski @ 2008-07-22 21:36 UTC (permalink / raw)
  To: denys; +Cc: netdev

On Tue, Jul 22, 2008 at 11:46:36PM +0300, denys@visp.net.lb wrote:
> First patch - probably not related. Netconsole is on e100.
> Second patch looks reasonable, because there is external interface (it is
> router) and e1000e there, with enabled TSO. Maybe some scanning tools doing
> connect to SSH and... BOOM.. i will try to patch it, thanks a lot!

On the other hand, it seems the skb corruption fixed by this second patch
should probably show up in some other places as well...

I wonder whether you have tried running this without netconsole (or with
netconsole on another dev/driver), or without the NMI watchdog, and whether
similar lockups still happened?

Jarek P.

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-22 21:36 ` Jarek Poplawski
@ 2008-07-22 21:45   ` denys
  2008-07-23 19:47 ` denys
  0 siblings, 0 replies; 29+ messages in thread
From: denys @ 2008-07-22 21:45 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: denys, netdev

Well, I cannot tell whether the lockups are similar or not; I can only guess.
The machine crashes after a few days, even without netconsole. Sometimes it
reboots, sometimes not, so sometimes I get a call in the middle of the night
and run to the power switch to reboot the PC. I cannot check the screen; the
machine is in an area unreachable for me. This bug has probably been present
since 2.6.23 even...

I will engage kexec with a panic kernel just to be safer, since the box does
not always reboot via the panic sysctl. As I see in the sources, the path to
a panic kexec kernel is much shorter...

On Wednesday 23 July 2008, Jarek Poplawski wrote:
> On Tue, Jul 22, 2008 at 11:46:36PM +0300, denys@visp.net.lb wrote:
> > First patch - probably not related. Netconsole is on e100.
> > Second patch looks reasonable, because there is external interface (it is
> > router) and e1000e there, with enabled TSO. Maybe some scanning tools doing
> > connect to SSH and... BOOM.. i will try to patch it, thanks a lot!
>
> On the other hand, it seems the corruption of skbs fixed by this
> second patch should probably show in some other places...
>
> I wonder if you tried to run this without netconsole (or maybe
> netconsole on another dev/driver) or without this NMI watchdog,
> and similar lockups still happened?
>
> Jarek P.

^ permalink raw reply	[flat|nested] 29+ messages in thread
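The kexec crash-kernel setup denys mentions is configured roughly as follows
(a sketch; the memory reservation size, kernel paths, and append string are
assumptions, not taken from the thread):

```shell
# Reserve memory for the crash kernel at boot by adding this to the
# running kernel's command line (size/offset are assumptions):
#   crashkernel=64M@16M

# Load the panic kernel with kexec -p so a crash jumps straight into
# it, bypassing the normal (longer) reboot path (paths are assumed).
kexec -p /boot/vmlinuz-crash --initrd=/boot/initrd-crash \
    --append="root=/dev/sda1 single irqpoll"

# Belt-and-braces: also ask the kernel to panic on oops and reboot
# after 10 seconds if the kexec path is not taken.
sysctl -w kernel.panic_on_oops=1
sysctl -w kernel.panic=10
```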
* Re: NMI lockup, 2.6.26 release
  2008-07-22 21:36 ` Jarek Poplawski
  2008-07-22 21:45   ` denys
@ 2008-07-23 19:47   ` denys
  2008-07-23 21:09     ` Jarek Poplawski
  2008-07-23 22:26     ` Jarek Poplawski
  1 sibling, 2 replies; 29+ messages in thread
From: denys @ 2008-07-23 19:47 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: denys, netdev

It seems none of the patches helps with the issue. I will recheck that I
applied them correctly. Here is the latest one:

[28896.091934] Process swapper (pid: 0, ti=c0842000 task=f7c35c80 task.ti=f7c3d000)
[28896.091934] Stack: f6ce64c4 f6ce64c4 f6ce64c4 c0842eec c01d7cc9 c2df6498 00000000 00000000
[28896.091934]        f6ce64c4 f6ce64c8 c2df6490 c0842f0c c0133e9c 00000001 f6ce64c4 00000000
[28896.091934]        c2df6490 f6ce64c4 c2df6490 c0842f34 c01344d3 c2df6484 76103400 00001a45
[28896.091934] Call Trace:
[28896.091934]  [<c01d7cc9>] ? rb_insert_color+0x99/0xbc
[28896.091934]  [<c0133e9c>] ? enqueue_hrtimer+0xfa/0x106
[28896.091934]  [<c01344d3>] ? hrtimer_start+0xee/0x118
[28896.091934]  [<c0267db2>] ? qdisc_watchdog_schedule+0x19/0x1f
[28896.091934]  [<f89eadf3>] ? htb_dequeue+0x6a8/0x6b3 [sch_htb]
[28896.091934]  [<c0266e5d>] ? __qdisc_run+0x5f/0x191
[28896.091934]  [<c025b1a5>] ? net_tx_action+0xb4/0xda
[28896.091934]  [<c0125817>] ? __do_softirq+0x6f/0xe9
[28896.091934]  [<c010614b>] ? do_softirq+0x5e/0xa8
[28896.091934]  [<c0146f79>] ? handle_edge_irq+0x0/0x10a
[28896.091934]  [<c0125775>] ? irq_exit+0x44/0x77
[28896.091934]  [<c0106235>] ? do_IRQ+0xa0/0xb7
[28896.091934]  [<c0108fb5>] ? mwait_idle+0x0/0x43
[28896.091934]  [<c01042fa>] ? common_interrupt+0x2e/0x34
[28896.091934]  [<c0108fb5>] ? mwait_idle+0x0/0x43
[28896.091934]  [<c01300d8>] ? param_set_copystring+0x36/0x60
[28896.091934]  [<c0108fee>] ? mwait_idle+0x39/0x43
[28896.091934]  [<c0102596>] ? cpu_idle+0x9a/0xba
[28896.091934]  [<c02af0a7>] ? start_secondary+0x160/0x165
[28896.091934] =======================
[28896.091934] Code: 09 8b 01 83 e0 03 09 d8 89 01 8b 02 89 5a 08 83 e0 03 09 f0 85 f6 89 02 74 0f 3b 5e 08 75 05 89 56 08 eb 07 89 56 04 eb 02 89 17 <8b> 03 83 e0 03 09 d0 89 03 5b 5e 5f 5d c3 55 89 e5 57 89 d7 56

On Wednesday 23 July 2008, Jarek Poplawski wrote:
> On Tue, Jul 22, 2008 at 11:46:36PM +0300, denys@visp.net.lb wrote:
> > First patch - probably not related. Netconsole is on e100.
> > Second patch looks reasonable, because there is external interface (it is
> > router) and e1000e there, with enabled TSO. Maybe some scanning tools doing
> > connect to SSH and... BOOM.. i will try to patch it, thanks a lot!
>
> On the other hand, it seems the corruption of skbs fixed by this
> second patch should probably show in some other places...
>
> I wonder if you tried to run this without netconsole (or maybe
> netconsole on another dev/driver) or without this NMI watchdog,
> and similar lockups still happened?
>
> Jarek P.

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-23 19:47 ` denys
@ 2008-07-23 21:09 ` Jarek Poplawski
  2008-07-23 22:26 ` Jarek Poplawski
  1 sibling, 0 replies; 29+ messages in thread
From: Jarek Poplawski @ 2008-07-23 21:09 UTC (permalink / raw)
To: denys; +Cc: netdev

On Wed, Jul 23, 2008 at 10:47:17PM +0300, denys@visp.net.lb wrote:
> It seems none of the patches helps with the issue.
> I will recheck whether I applied them correctly.

It's a pity! On the other hand, it looks like you have something really
special... But, since all these reports stop at hrtimers, my proposal
is to check where it could hit without them. I attach a patch for
debugging, which brings back a plain timer instead of the hrtimer
watchdog. Alas, I didn't test it, so be cautious. (This should work
with 2.6.26 or .25, I hope.)

Jarek P.

---

 net/sched/sch_htb.c | 28 +++++++++++++++++++++++-----
 1 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 3fb58f4..fdc84c3 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -169,7 +169,8 @@ struct htb_sched {
 	int rate2quantum;	/* quant = rate / rate2quantum */
 	psched_time_t now;	/* cached dequeue time */
-	struct qdisc_watchdog watchdog;
+	//struct qdisc_watchdog watchdog;
+	struct timer_list timer;	/* send delay timer */
 
 	/* non shaped skbs; let them go directly thru */
 	struct sk_buff_head direct_queue;
@@ -893,6 +894,14 @@ next:
 	return skb;
 }
 
+static void htb_timer(unsigned long arg)
+{
+	struct Qdisc *sch = (struct Qdisc *)arg;
+	sch->flags &= ~TCQ_F_THROTTLED;
+	wmb();
+	netif_schedule(sch->dev);
+}
+
 static struct sk_buff *htb_dequeue(struct Qdisc *sch)
 {
 	struct sk_buff *skb = NULL;
@@ -943,7 +952,9 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch)
 		}
 	}
 	sch->qstats.overlimits++;
-	qdisc_watchdog_schedule(&q->watchdog, next_event);
+	//qdisc_watchdog_schedule(&q->watchdog, next_event);
+	mod_timer(&q->timer, (unsigned long)next_event /
+		  PSCHED_TICKS_PER_SEC * HZ);
 fin:
 	return skb;
 }
@@ -996,7 +1007,9 @@ static void htb_reset(struct Qdisc *sch)
 		}
 	}
-	qdisc_watchdog_cancel(&q->watchdog);
+	//qdisc_watchdog_cancel(&q->watchdog);
+	sch->flags &= ~TCQ_F_THROTTLED;
+	del_timer(&q->timer);
 	__skb_queue_purge(&q->direct_queue);
 	sch->q.qlen = 0;
 	memset(q->row, 0, sizeof(q->row));
@@ -1047,7 +1060,11 @@ static int htb_init(struct Qdisc *sch, struct nlattr *opt)
 	for (i = 0; i < TC_HTB_NUMPRIO; i++)
 		INIT_LIST_HEAD(q->drops + i);
 
-	qdisc_watchdog_init(&q->watchdog, sch);
+	//qdisc_watchdog_init(&q->watchdog, sch);
+	q->timer.function = htb_timer;
+	q->timer.data = (unsigned long)sch;
+	init_timer(&q->timer);
+
 	skb_queue_head_init(&q->direct_queue);
 
 	q->direct_qlen = sch->dev->tx_queue_len;
@@ -1262,7 +1279,8 @@ static void htb_destroy(struct Qdisc *sch)
 {
 	struct htb_sched *q = qdisc_priv(sch);
 
-	qdisc_watchdog_cancel(&q->watchdog);
+	//qdisc_watchdog_cancel(&q->watchdog);
+	del_timer_sync(&q->timer);
 	/* This line used to be after htb_destroy_class call below
 	   and surprisingly it worked in 2.4. But it must precede it
 	   because filter need its target class alive to be able to call

^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-23 19:47 ` denys
  2008-07-23 21:09 ` Jarek Poplawski
@ 2008-07-23 22:26 ` Jarek Poplawski
  2008-07-23 23:24   ` Jarek Poplawski
  1 sibling, 1 reply; 29+ messages in thread
From: Jarek Poplawski @ 2008-07-23 22:26 UTC (permalink / raw)
To: denys; +Cc: netdev

(take 2)

Hmm... this should be more accurate.

Jarek P.

---

 net/sched/sch_htb.c | 28 +++++++++++++++++++++++-----
 1 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 3fb58f4..fdc84c3 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -169,7 +169,8 @@ struct htb_sched {
 	int rate2quantum;	/* quant = rate / rate2quantum */
 	psched_time_t now;	/* cached dequeue time */
-	struct qdisc_watchdog watchdog;
+	//struct qdisc_watchdog watchdog;
+	struct timer_list timer;	/* send delay timer */
 
 	/* non shaped skbs; let them go directly thru */
 	struct sk_buff_head direct_queue;
@@ -893,6 +894,14 @@ next:
 	return skb;
 }
 
+static void htb_timer(unsigned long arg)
+{
+	struct Qdisc *sch = (struct Qdisc *)arg;
+	sch->flags &= ~TCQ_F_THROTTLED;
+	wmb();
+	netif_schedule(sch->dev);
+}
+
 static struct sk_buff *htb_dequeue(struct Qdisc *sch)
 {
 	struct sk_buff *skb = NULL;
@@ -943,7 +952,9 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch)
 		}
 	}
 	sch->qstats.overlimits++;
-	qdisc_watchdog_schedule(&q->watchdog, next_event);
+	//qdisc_watchdog_schedule(&q->watchdog, next_event);
+	mod_timer(&q->timer, (unsigned long)next_event * HZ /
+		  PSCHED_TICKS_PER_SEC);
 fin:
 	return skb;
 }
@@ -996,7 +1007,9 @@ static void htb_reset(struct Qdisc *sch)
 		}
 	}
-	qdisc_watchdog_cancel(&q->watchdog);
+	//qdisc_watchdog_cancel(&q->watchdog);
+	sch->flags &= ~TCQ_F_THROTTLED;
+	del_timer(&q->timer);
 	__skb_queue_purge(&q->direct_queue);
 	sch->q.qlen = 0;
 	memset(q->row, 0, sizeof(q->row));
@@ -1047,7 +1060,11 @@ static int htb_init(struct Qdisc *sch, struct nlattr *opt)
 	for (i = 0; i < TC_HTB_NUMPRIO; i++)
 		INIT_LIST_HEAD(q->drops + i);
 
-	qdisc_watchdog_init(&q->watchdog, sch);
+	//qdisc_watchdog_init(&q->watchdog, sch);
+	q->timer.function = htb_timer;
+	q->timer.data = (unsigned long)sch;
+	init_timer(&q->timer);
+
 	skb_queue_head_init(&q->direct_queue);
 
 	q->direct_qlen = sch->dev->tx_queue_len;
@@ -1262,7 +1279,8 @@ static void htb_destroy(struct Qdisc *sch)
 {
 	struct htb_sched *q = qdisc_priv(sch);
 
-	qdisc_watchdog_cancel(&q->watchdog);
+	//qdisc_watchdog_cancel(&q->watchdog);
+	del_timer_sync(&q->timer);
 	/* This line used to be after htb_destroy_class call below
	   and surprisingly it worked in 2.4. But it must precede it
	   because filter need its target class alive to be able to call

^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-23 22:26 ` Jarek Poplawski
@ 2008-07-23 23:24 ` Jarek Poplawski
  2008-07-23 23:56   ` denys
  0 siblings, 1 reply; 29+ messages in thread
From: Jarek Poplawski @ 2008-07-23 23:24 UTC (permalink / raw)
To: denys; +Cc: netdev

(take 3)

...I'm really sorry! If possible, better try to restart with this one.

Jarek P.

---

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 3fb58f4..6b6fff3 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -169,7 +169,8 @@ struct htb_sched {
 	int rate2quantum;	/* quant = rate / rate2quantum */
 	psched_time_t now;	/* cached dequeue time */
-	struct qdisc_watchdog watchdog;
+	//struct qdisc_watchdog watchdog;
+	struct timer_list timer;	/* send delay timer */
 
 	/* non shaped skbs; let them go directly thru */
 	struct sk_buff_head direct_queue;
@@ -893,6 +894,14 @@ next:
 	return skb;
 }
 
+static void htb_timer(unsigned long arg)
+{
+	struct Qdisc *sch = (struct Qdisc *)arg;
+	sch->flags &= ~TCQ_F_THROTTLED;
+	wmb();
+	netif_schedule(sch->dev);
+}
+
 static struct sk_buff *htb_dequeue(struct Qdisc *sch)
 {
 	struct sk_buff *skb = NULL;
@@ -943,7 +952,10 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch)
 		}
 	}
 	sch->qstats.overlimits++;
-	qdisc_watchdog_schedule(&q->watchdog, next_event);
+	//qdisc_watchdog_schedule(&q->watchdog, next_event);
+	next_event = next_event - q->now;
+	mod_timer(&q->timer, jiffies + (unsigned long)next_event * HZ /
+		  PSCHED_TICKS_PER_SEC);
 fin:
 	return skb;
 }
@@ -996,7 +1008,9 @@ static void htb_reset(struct Qdisc *sch)
 		}
 	}
-	qdisc_watchdog_cancel(&q->watchdog);
+	//qdisc_watchdog_cancel(&q->watchdog);
+	sch->flags &= ~TCQ_F_THROTTLED;
+	del_timer(&q->timer);
 	__skb_queue_purge(&q->direct_queue);
 	sch->q.qlen = 0;
 	memset(q->row, 0, sizeof(q->row));
@@ -1047,7 +1061,11 @@ static int htb_init(struct Qdisc *sch, struct nlattr *opt)
 	for (i = 0; i < TC_HTB_NUMPRIO; i++)
 		INIT_LIST_HEAD(q->drops + i);
 
-	qdisc_watchdog_init(&q->watchdog, sch);
+	//qdisc_watchdog_init(&q->watchdog, sch);
+	q->timer.function = htb_timer;
+	q->timer.data = (unsigned long)sch;
+	init_timer(&q->timer);
+
 	skb_queue_head_init(&q->direct_queue);
 
 	q->direct_qlen = sch->dev->tx_queue_len;
@@ -1262,7 +1280,8 @@ static void htb_destroy(struct Qdisc *sch)
 {
 	struct htb_sched *q = qdisc_priv(sch);
 
-	qdisc_watchdog_cancel(&q->watchdog);
+	//qdisc_watchdog_cancel(&q->watchdog);
+	del_timer_sync(&q->timer);
 	/* This line used to be after htb_destroy_class call below
	   and surprisingly it worked in 2.4. But it must precede it
	   because filter need its target class alive to be able to call

^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-23 23:24 ` Jarek Poplawski
@ 2008-07-23 23:56 ` denys
  2008-07-24 14:56   ` denys
  2008-07-25  7:36   ` Jarek Poplawski
  0 siblings, 2 replies; 29+ messages in thread
From: denys @ 2008-07-23 23:56 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: denys, netdev

I don't have any problem restarting :-)
The build system and restart (over kexec) are automated.
Rebooting now.

On Thursday 24 July 2008, Jarek Poplawski wrote:
> (take 3)
>
> ...I'm really sorry! If possible, better try to restart with this one.
>
> Jarek P.
>
> ---
>
> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> index 3fb58f4..6b6fff3 100644
> --- a/net/sched/sch_htb.c
> +++ b/net/sched/sch_htb.c
> @@ -169,7 +169,8 @@ struct htb_sched {
>  	int rate2quantum;	/* quant = rate / rate2quantum */
>  	psched_time_t now;	/* cached dequeue time */
> -	struct qdisc_watchdog watchdog;
> +	//struct qdisc_watchdog watchdog;
> +	struct timer_list timer;	/* send delay timer */
>
>  	/* non shaped skbs; let them go directly thru */
>  	struct sk_buff_head direct_queue;
> @@ -893,6 +894,14 @@ next:
>  	return skb;
>  }
>
> +static void htb_timer(unsigned long arg)
> +{
> +	struct Qdisc *sch = (struct Qdisc *)arg;
> +	sch->flags &= ~TCQ_F_THROTTLED;
> +	wmb();
> +	netif_schedule(sch->dev);
> +}
> +
>  static struct sk_buff *htb_dequeue(struct Qdisc *sch)
>  {
>  	struct sk_buff *skb = NULL;
> @@ -943,7 +952,10 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch)
>  		}
>  	}
>  	sch->qstats.overlimits++;
> -	qdisc_watchdog_schedule(&q->watchdog, next_event);
> +	//qdisc_watchdog_schedule(&q->watchdog, next_event);
> +	next_event = next_event - q->now;
> +	mod_timer(&q->timer, jiffies + (unsigned long)next_event * HZ /
> +		  PSCHED_TICKS_PER_SEC);
>  fin:
>  	return skb;
>  }
> @@ -996,7 +1008,9 @@ static void htb_reset(struct Qdisc *sch)
>  		}
>  	}
> -	qdisc_watchdog_cancel(&q->watchdog);
> +	//qdisc_watchdog_cancel(&q->watchdog);
> +	sch->flags &= ~TCQ_F_THROTTLED;
> +	del_timer(&q->timer);
>  	__skb_queue_purge(&q->direct_queue);
>  	sch->q.qlen = 0;
>  	memset(q->row, 0, sizeof(q->row));
> @@ -1047,7 +1061,11 @@ static int htb_init(struct Qdisc *sch, struct nlattr *opt)
>  	for (i = 0; i < TC_HTB_NUMPRIO; i++)
>  		INIT_LIST_HEAD(q->drops + i);
>
> -	qdisc_watchdog_init(&q->watchdog, sch);
> +	//qdisc_watchdog_init(&q->watchdog, sch);
> +	q->timer.function = htb_timer;
> +	q->timer.data = (unsigned long)sch;
> +	init_timer(&q->timer);
> +
>  	skb_queue_head_init(&q->direct_queue);
>
>  	q->direct_qlen = sch->dev->tx_queue_len;
> @@ -1262,7 +1280,8 @@ static void htb_destroy(struct Qdisc *sch)
>  {
>  	struct htb_sched *q = qdisc_priv(sch);
>
> -	qdisc_watchdog_cancel(&q->watchdog);
> +	//qdisc_watchdog_cancel(&q->watchdog);
> +	del_timer_sync(&q->timer);
>  	/* This line used to be after htb_destroy_class call below
>	   and surprisingly it worked in 2.4. But it must precede it
>	   because filter need its target class alive to be able to call
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-23 23:56 ` denys
@ 2008-07-24 14:56 ` denys
  2008-07-24 17:45   ` Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: denys @ 2008-07-24 14:56 UTC (permalink / raw)
To: denys; +Cc: Jarek Poplawski, netdev

Nothing yet. Still waiting; in 6 hours I will have "peak time".
But yesterday, with the same settings (without the patch), it crashed 3 times.

It is a Core2Duo E6750 CPU, and the clocksource is TSC.
TSC synchronization always passes. HPET is available, so I can try it too if needed.
It was crashing both with nmi_watchdog=1 (which at least brings a crash message) and without it (hard lockup).
The fishy point is ifb, but I have a few hundred NAS servers with ifb + htb
(though surely not at 300-400 Mbps like on this host), and I check the logs on all of them weekly - none of them crashed.

So either it is hardware specific (but it always crashes at the same point),
or maybe memory corruption - it is not server-grade equipment - but it always crashes in the same place, the same way.
I have MCE running, so if there are thermal events it must catch them.
But surely there is a large list of errata for this hardware :-)

On Thursday 24 July 2008, denys@visp.net.lb wrote:
> I don't have any problem restarting :-)
> The build system and restart (over kexec) are automated.
> Rebooting now.

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-24 14:56 ` denys
@ 2008-07-24 17:45 ` Jarek Poplawski
  0 siblings, 0 replies; 29+ messages in thread
From: Jarek Poplawski @ 2008-07-24 17:45 UTC (permalink / raw)
To: denys; +Cc: netdev

On Thu, Jul 24, 2008 at 05:56:43PM +0300, denys@visp.net.lb wrote:
> Nothing yet. Still waiting; in 6 hours I will have "peak time".
> But yesterday, with the same settings (without the patch), it crashed 3 times.

OK, so let's better wait to be sure of something here.

Jarek P.

(Btw., could you try in the meantime to do something to shorten your
lines? It's not the most popular format on kernel lists...)

> It is a Core2Duo E6750 CPU, and the clocksource is TSC.
> TSC synchronization always passes. HPET is available, so I can try it
> too if needed. It was crashing both with nmi_watchdog=1 (which at least
> brings a crash message) and without it (hard lockup).
> The fishy point is ifb, but I have a few hundred NAS servers with
> ifb + htb (though surely not at 300-400 Mbps like on this host), and I
> check the logs on all of them weekly - none of them crashed.
>
> So either it is hardware specific (but it always crashes at the same
> point), or maybe memory corruption - it is not server-grade equipment -
> but it always crashes in the same place, the same way.
> I have MCE running, so if there are thermal events it must catch them.
> But surely there is a large list of errata for this hardware :-)
>
> On Thursday 24 July 2008, denys@visp.net.lb wrote:
> > I don't have any problem restarting :-)
> > The build system and restart (over kexec) are automated.
> > Rebooting now.

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-23 23:56 ` denys
  2008-07-24 14:56 ` denys
@ 2008-07-25  7:36 ` Jarek Poplawski
  2008-07-25 21:09   ` denys
  2008-08-02 12:55   ` Denys Fedoryshchenko
  1 sibling, 2 replies; 29+ messages in thread
From: Jarek Poplawski @ 2008-07-25 7:36 UTC (permalink / raw)
To: denys; +Cc: netdev

Hi Denys,

In case this is still working after the hrtimer -> timer change, here is
another patch for testing: to check whether limiting hrtimer scheduling
could matter here. Btw., could you write what is the approximate number
of HTB qdiscs and classes working on each device of this box (including
ifbs)?

Thanks,
Jarek P.

(This patch should be applied to 2.6.26 or .25 after reverting the
previous debugging patch.)

---

 net/sched/sch_htb.c | 8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 30c999c..ff9e965 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -162,6 +162,7 @@ struct htb_sched {
 	int rate2quantum;	/* quant = rate / rate2quantum */
 	psched_time_t now;	/* cached dequeue time */
+	psched_time_t next_watchdog;
 	struct qdisc_watchdog watchdog;
 
 	/* non shaped skbs; let them go directly thru */
@@ -920,7 +921,11 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch)
 		}
 	}
 	sch->qstats.overlimits++;
-	qdisc_watchdog_schedule(&q->watchdog, next_event);
+	if (q->next_watchdog < q->now || next_event <=
+	    q->next_watchdog - PSCHED_TICKS_PER_SEC / HZ) {
+		qdisc_watchdog_schedule(&q->watchdog, next_event);
+		q->next_watchdog = next_event;
+	}
 fin:
 	return skb;
 }
@@ -973,6 +978,7 @@ static void htb_reset(struct Qdisc *sch)
 		}
 	}
 	qdisc_watchdog_cancel(&q->watchdog);
+	q->next_watchdog = 0;
 	__skb_queue_purge(&q->direct_queue);
 	sch->q.qlen = 0;
 	memset(q->row, 0, sizeof(q->row));

^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-25  7:36 ` Jarek Poplawski
@ 2008-07-25 21:09 ` denys
  2008-07-25 22:31   ` hrtimers lockups Re: NMI lockup, 2.6.26 release Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: denys @ 2008-07-25 21:09 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: denys, netdev

I will try to explain all the details, since maybe anything matters:

around 150-300 Mbit/s passing
Core 2 Duo E6750
3 ifb's
29 htb classes (in total)
26 qdiscs (sfq and bfifo)
NAT is running (465-700K connections)
maximum bfifo qdisc size is 600 Kbyte
almost all filters are u32 (one is police mtu)
quantum is 1514, one is 1515
Load is low (below 30-35% by mpstat)

The only errors I have in dmesg are a LOT of these messages (with
different IPs and ports):
[162014.265116] UDP: short packet: From 200.122.35.205:64599 8409/1480 to 213.254.233.9:6073
[162014.373110] UDP: short packet: From 200.122.35.205:52015 10698/1480 to 213.254.233.9:4855
[162088.232099] UDP: bad checksum. From 96.234.33.9:1077 to 213.254.233.9:49520 ulen 111

I ran time-warp-test from Ingo Molnar - nothing, no warps.

If required, I can send all the rules to a private e-mail.

I will apply the patch in 30-60 minutes (off-peak time). Thanks a lot
for the help!

On Friday 25 July 2008, Jarek Poplawski wrote:
> Hi Denys,
>
> In case this is still working after the hrtimer -> timer change, here is
> another patch for testing: to check whether limiting hrtimer scheduling
> could matter here. Btw., could you write what is the approximate number
> of HTB qdiscs and classes working on each device of this box (including
> ifbs)?
>
> Thanks,
> Jarek P.
>
> (This patch should be applied to 2.6.26 or .25 after reverting the
> previous debugging patch.)
>
> ---

^ permalink raw reply [flat|nested] 29+ messages in thread
* hrtimers lockups Re: NMI lockup, 2.6.26 release
  2008-07-25 21:09 ` denys
@ 2008-07-25 22:31 ` Jarek Poplawski
  0 siblings, 0 replies; 29+ messages in thread
From: Jarek Poplawski @ 2008-07-25 22:31 UTC (permalink / raw)
To: denys; +Cc: Thomas Gleixner, netdev, linux-kernel

Hi,

This netdev thread describes lockups that end up in hrtimers code:

http://marc.info/?l=linux-netdev&m=121675217927170&w=2

Very similar reports from Denys Fedoryshchenko can be found in the
netdev archives a few kernel versions back. It looks like replacing
hrtimers with plain timers in the sch_htb code removes the problems.
I hope Thomas or somebody from linux-kernel could give some clue on
this.

Thanks,
Jarek P.

Denys, read below:

On Sat, Jul 26, 2008 at 12:09:52AM +0300, denys@visp.net.lb wrote:
> I will try to explain all the details, since maybe anything matters:
>
> around 150-300 Mbit/s passing
> Core 2 Duo E6750
> 3 ifb's
> 29 htb classes (in total)
> 26 qdiscs (sfq and bfifo)
> NAT is running (465-700K connections)
> maximum bfifo qdisc size is 600 Kbyte
> mostly all filters are u32 (one is police mtu)
> quantum is 1514, one is 1515
> Load is low (below 30-35% by mpstat)
>
> The only errors I have in dmesg are a LOT of these messages (with
> different IPs and ports):
> [162014.265116] UDP: short packet: From 200.122.35.205:64599 8409/1480 to
> 213.254.233.9:6073
> [162014.373110] UDP: short packet: From 200.122.35.205:52015 10698/1480 to
> 213.254.233.9:4855
>
> [162088.232099] UDP: bad checksum. From 96.234.33.9:1077 to
> 213.254.233.9:49520 ulen 111
>
> I ran time-warp-test from Ingo Molnar - nothing, no warps.
>
> If required, I can send all the rules to a private e-mail.
>
> I will apply the patch in 30-60 minutes (off-peak time). Thanks a lot for the help!

You are very helpful too! But, I think we will need some help from
hrtimers/hardware gurus. IMHO, since it works with timers, the bug
doesn't seem to belong to "netdev". I can't see any obvious possibility
of "abusing" hrtimers with e.g. too big a number of hrtimers with your
config (1 hrtimer per qdisc). So, I'm not very optimistic about this new
patch, but even if it works, it looks like something else is wrong.
That's why I added some CCs to this.

Jarek P.

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-07-25  7:36 ` Jarek Poplawski
  2008-07-25 21:09 ` denys
@ 2008-08-02 12:55 ` Denys Fedoryshchenko
  2008-08-02 13:07   ` Jarek Poplawski
  1 sibling, 1 reply; 29+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-02 12:55 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: netdev

Sorry for the long delay; I had to make sure that the old change "fixed"
the problem.

Now I am running the second patch. It will also take up to around one
week to make sure that it makes (or doesn't make) any difference,
because sometimes it takes 3-4 days until the crash happens.

On Friday 25 July 2008, Jarek Poplawski wrote:
> Hi Denys,
>
> In case this is still working after the hrtimer -> timer change, here is
> another patch for testing: to check whether limiting hrtimer scheduling
> could matter here. Btw., could you write what is the approximate number
> of HTB qdiscs and classes working on each device of this box (including
> ifbs)?
>
> Thanks,
> Jarek P.

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-08-02 12:55 ` Denys Fedoryshchenko
@ 2008-08-02 13:07 ` Jarek Poplawski
  2008-08-12 11:31   ` Denys Fedoryshchenko
  0 siblings, 1 reply; 29+ messages in thread
From: Jarek Poplawski @ 2008-08-02 13:07 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: netdev

On Sat, Aug 02, 2008 at 03:55:10PM +0300, Denys Fedoryshchenko wrote:
> Sorry for the long delay; I had to make sure that the old change "fixed"
> the problem.
>
> Now I am running the second patch. It will also take up to around one
> week to make sure that it makes (or doesn't make) any difference,
> because sometimes it takes 3-4 days until the crash happens.

OK, don't hurry!

Cheers,
Jarek P.

> On Friday 25 July 2008, Jarek Poplawski wrote:
> > Hi Denys,
> >
> > In case this is still working after the hrtimer -> timer change, here is
> > another patch for testing: to check whether limiting hrtimer scheduling
> > could matter here. Btw., could you write what is the approximate number
> > of HTB qdiscs and classes working on each device of this box (including
> > ifbs)?
> >
> > Thanks,
> > Jarek P.

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-08-02 13:07 ` Jarek Poplawski
@ 2008-08-12 11:31 ` Denys Fedoryshchenko
  2008-08-12 12:40   ` Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-12 11:31 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: netdev

On Saturday 02 August 2008, Jarek Poplawski wrote:
> On Sat, Aug 02, 2008 at 03:55:10PM +0300, Denys Fedoryshchenko wrote:
> > Sorry for the long delay; I had to make sure that the old change
> > "fixed" the problem.
> >
> > Now I am running the second patch. It will also take up to around one
> > week to make sure that it makes (or doesn't make) any difference,
> > because sometimes it takes 3-4 days until the crash happens.
>
> OK, don't hurry!
>
> Cheers,
> Jarek P.

With the second patch it works fine - 9 days of uptime now.

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-08-12 11:31 ` Denys Fedoryshchenko
@ 2008-08-12 12:40 ` Jarek Poplawski
  2008-08-13  7:28   ` Denys Fedoryshchenko
  0 siblings, 1 reply; 29+ messages in thread
From: Jarek Poplawski @ 2008-08-12 12:40 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: netdev

On Tue, Aug 12, 2008 at 02:31:40PM +0300, Denys Fedoryshchenko wrote:
...
> With the second patch it works fine - 9 days of uptime now.

Great! I didn't expect it would be so easy with this strange problem.
So, it looks like hrtimers probably break after some overscheduling.
The only problem now is to find a reasonable limit which is both safe
and doesn't harm resolution too much for others.

IMHO this second patch with 1-jiffie watchdog resolution looks
reasonable and should be acceptable, but it would be nice to check if
we can go lower. Here is "the same" patch with the only change being
the resolution (1/10 of a jiffie). If there are any problems with
testing this, please let me know. (It should be applied after reverting
patch #2.)

Thanks,
Jarek P.
(testing patch #3)
---

 net/sched/sch_htb.c | 8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 30c999c..ff9e965 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -162,6 +162,7 @@ struct htb_sched {
 	int rate2quantum;	/* quant = rate / rate2quantum */
 	psched_time_t now;	/* cached dequeue time */
+	psched_time_t next_watchdog;
 	struct qdisc_watchdog watchdog;
 
 	/* non shaped skbs; let them go directly thru */
@@ -920,7 +921,11 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch)
 		}
 	}
 	sch->qstats.overlimits++;
-	qdisc_watchdog_schedule(&q->watchdog, next_event);
+	if (q->next_watchdog < q->now || next_event <=
+	    q->next_watchdog - PSCHED_TICKS_PER_SEC / (10 * HZ)) {
+		qdisc_watchdog_schedule(&q->watchdog, next_event);
+		q->next_watchdog = next_event;
+	}
 fin:
 	return skb;
 }
@@ -973,6 +978,7 @@ static void htb_reset(struct Qdisc *sch)
 		}
 	}
 	qdisc_watchdog_cancel(&q->watchdog);
+	q->next_watchdog = 0;
 	__skb_queue_purge(&q->direct_queue);
 	sch->q.qlen = 0;
 	memset(q->row, 0, sizeof(q->row));

^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-08-12 12:40 ` Jarek Poplawski
@ 2008-08-13  7:28 ` Denys Fedoryshchenko
  2008-08-13  7:43   ` Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-13 7:28 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: netdev

Just as a proposal: maybe we can catch the situation when "things go
wrong" and panic? Then we could forward some info to the hrtimers guys -
if it is an hrtimers bug...

On Tuesday 12 August 2008, Jarek Poplawski wrote:
> On Tue, Aug 12, 2008 at 02:31:40PM +0300, Denys Fedoryshchenko wrote:
> ...
> > With the second patch it works fine - 9 days of uptime now.
>
> Great! I didn't expect it would be so easy with this strange problem.
> So, it looks like hrtimers probably break after some overscheduling.
> The only problem now is to find a reasonable limit which is both safe
> and doesn't harm resolution too much for others.
>
> IMHO this second patch with 1-jiffie watchdog resolution looks
> reasonable and should be acceptable, but it would be nice to check if
> we can go lower. Here is "the same" patch with the only change being
> the resolution (1/10 of a jiffie). If there are any problems with
> testing this, please let me know. (It should be applied after reverting
> patch #2.)
>
> Thanks,
> Jarek P.
> (testing patch #3)
> ---
>
>  net/sched/sch_htb.c | 8 +++++++-
>  1 files changed, 7 insertions(+), 1 deletions(-)
>
> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> index 30c999c..ff9e965 100644
> --- a/net/sched/sch_htb.c
> +++ b/net/sched/sch_htb.c
> @@ -162,6 +162,7 @@ struct htb_sched {
>  	int rate2quantum;	/* quant = rate / rate2quantum */
>  	psched_time_t now;	/* cached dequeue time */
> +	psched_time_t next_watchdog;
>  	struct qdisc_watchdog watchdog;
>
>  	/* non shaped skbs; let them go directly thru */
> @@ -920,7 +921,11 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch)
>  		}
>  	}
>  	sch->qstats.overlimits++;
> -	qdisc_watchdog_schedule(&q->watchdog, next_event);
> +	if (q->next_watchdog < q->now || next_event <=
> +	    q->next_watchdog - PSCHED_TICKS_PER_SEC / (10 * HZ)) {
> +		qdisc_watchdog_schedule(&q->watchdog, next_event);
> +		q->next_watchdog = next_event;
> +	}
>  fin:
>  	return skb;
>  }
> @@ -973,6 +978,7 @@ static void htb_reset(struct Qdisc *sch)
>  		}
>  	}
>  	qdisc_watchdog_cancel(&q->watchdog);
> +	q->next_watchdog = 0;
>  	__skb_queue_purge(&q->direct_queue);
>  	sch->q.qlen = 0;
>  	memset(q->row, 0, sizeof(q->row));
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: NMI lockup, 2.6.26 release
  2008-08-13  7:28 ` Denys Fedoryshchenko
@ 2008-08-13  7:43 ` Jarek Poplawski
  2008-08-13  8:02   ` Denys Fedoryshchenko
  0 siblings, 1 reply; 29+ messages in thread
From: Jarek Poplawski @ 2008-08-13 7:43 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: netdev

On Wed, Aug 13, 2008 at 10:28:11AM +0300, Denys Fedoryshchenko wrote:
> Just as a proposal: maybe we can catch the situation when "things go
> wrong" and panic? Then we could forward some info to the hrtimers guys -
> if it is an hrtimers bug...

Yes, it would be best, but I don't know how much I can "use" you and
your clients for debugging this. So, of course, if it's possible, you
could simply edit this patch and try with increased values like
(100 * HZ) or (1000 * HZ), or even something like:

+	if (q->next_watchdog < q->now || next_event <=
+	    q->next_watchdog - 10) {

Alas, the hrtimers guys didn't look very interested, so the main concern
should be to do this optimally in net at least.

Jarek P.

> On Tuesday 12 August 2008, Jarek Poplawski wrote:
> > On Tue, Aug 12, 2008 at 02:31:40PM +0300, Denys Fedoryshchenko wrote:
> > ...
> > > With the second patch it works fine - 9 days of uptime now.
> >
> > Great! I didn't expect it would be so easy with this strange problem.
> > So, it looks like hrtimers probably break after some overscheduling.
> > The only problem now is to find a reasonable limit which is both safe
> > and doesn't harm resolution too much for others.
> >
> > IMHO this second patch with 1-jiffie watchdog resolution looks
> > reasonable and should be acceptable, but it would be nice to check if
> > we can go lower. Here is "the same" patch with the only change being
> > the resolution (1/10 of a jiffie). If there are any problems with
> > testing this, please let me know. (It should be applied after
> > reverting patch #2.)
> >
> > Thanks,
> > Jarek P.
* Re: NMI lockup, 2.6.26 release
  2008-08-13  7:43 ` Jarek Poplawski
@ 2008-08-13  8:02 ` Denys Fedoryshchenko
  2008-08-13  8:49   ` Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-13 8:02 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: netdev

As long as the kernel reboots itself, it won't hurt me much. With the
NMI watchdog I noticed the panic was missing, so nmi_watchdog was
showing the message but not rebooting. It is fixed in the next kernel,
and I patched my kernel - so I guess I will no longer crash+freeze and
will not need to run to the power switch at night.

It can be related to another problem (some corruption) which is not
fixed yet, so preferably we should show the timer guys the exact
location of the problem.

Maybe you can make a patch like:

+	if (q->next_watchdog < q->now || next_event <=
+	    q->next_watchdog - PSCHED_TICKS_PER_SEC / (10 * HZ)) {
+		qdisc_watchdog_schedule(&q->watchdog, next_event);
+		q->next_watchdog = next_event;
+	} else {
		something like BUG()
	}

?

Probably I will also try migrating to "rc" versions of the kernel to
see if the problem still exists there; a lot of changes were done
there... Is the HTB corruption problem finally and completely tracked
down? I saw some discussions about it recently...

On Wednesday 13 August 2008, Jarek Poplawski wrote:
> On Wed, Aug 13, 2008 at 10:28:11AM +0300, Denys Fedoryshchenko wrote:
> > Just as a proposal, maybe we can catch the situation when "things go
> > wrong" and panic? Then we could forward some info to the hrtimers
> > guys? If it is an hrtimers bug...
>
> Yes, that would be best, but I don't know how much I can "use" you
> and your clients for debugging this.
> So, of course, if possible you could simply edit this patch and try
> with increased values like (100 * HZ) or (1000 * HZ), or even
> something like:
>
> +	if (q->next_watchdog < q->now || next_event <=
> +	    q->next_watchdog - 10) {
>
> Alas, the hrtimers guys didn't look very interested, so the main
> concern should be doing this optimally in net, at least.
>
> Jarek P.
* Re: NMI lockup, 2.6.26 release
  2008-08-13  8:02 ` Denys Fedoryshchenko
@ 2008-08-13  8:49 ` Jarek Poplawski
  2008-08-13  9:08   ` Denys Fedoryshchenko
  ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Jarek Poplawski @ 2008-08-13 8:49 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: netdev

On Wed, Aug 13, 2008 at 11:02:34AM +0300, Denys Fedoryshchenko wrote:
...
> Maybe you can make a patch like:
>
> +	if (q->next_watchdog < q->now || next_event <=
> +	    q->next_watchdog - PSCHED_TICKS_PER_SEC / (10 * HZ)) {
> +		qdisc_watchdog_schedule(&q->watchdog, next_event);
> +		q->next_watchdog = next_event;
> +	} else {
> 		something like BUG()
> 	}
> ?

I don't think that's right: there could be some small time differences
between CPUs on SMP, or even some inaccuracy related to hardware, but I
don't think this is the right place or method to verify that. And e.g.
re-scheduling with the same time shouldn't be wrong either.

Anyway, narrowing the problem with such tests should give us a better
understanding of what the real problem could be here. BTW, could you
"remind" us of the .config on this box (especially the various *HZ*,
*TIME* and *TIMERS* settings)?

> Probably I will also try migrating to "rc" versions of the kernel to
> see if the problem still exists there; a lot of changes were done
> there... Is the HTB corruption problem finally and completely tracked
> down? I saw some discussions about it recently...

I doubt current rc versions are stable enough for any production.
HTB waits for one fix, but it's nothing critical if it didn't bother
you until now. There could still be some problems around the schedulers
generally, after the last big changes.

Jarek P.
* Re: NMI lockup, 2.6.26 release
  2008-08-13  8:49 ` Jarek Poplawski
@ 2008-08-13  9:08 ` Denys Fedoryshchenko
  2008-08-14 15:07 ` Denys Fedoryshchenko
  2008-08-15 13:13 ` NMI lockup, 2.6.26 release Denys Fedoryshchenko
  2 siblings, 0 replies; 29+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-13 9:08 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: netdev

On Wednesday 13 August 2008, Jarek Poplawski wrote:
> I don't think that's right: there could be some small time differences
> between CPUs on SMP, or even some inaccuracy related to hardware, but
> I don't think this is the right place or method to verify that. And
> e.g. re-scheduling with the same time shouldn't be wrong either.

OK! Got you. A difference is possible; that machine uses the TSC on a
Core 2 Duo. I ran some code from Ingo Molnar to check whether the TSC
is synchronized - after 2 days it didn't detect anything.

> Anyway, narrowing the problem with such tests should give us a better
> understanding of what the real problem could be here. BTW, could you
> "remind" us of the .config on this box (especially the various *HZ*,
> *TIME* and *TIMERS* settings)?

http://www.nuclearcat.com/files/config_2.6.26.2.txt

The same config is used for 2.6.26.2; this one is from 2.6.26.1.
* Re: NMI lockup, 2.6.26 release
  2008-08-13  8:49 ` Jarek Poplawski
  2008-08-13  9:08 ` Denys Fedoryshchenko
@ 2008-08-14 15:07 ` Denys Fedoryshchenko
  2008-08-14 15:10   ` New: softlockup in 2.6.27-rc3-git2 Denys Fedoryshchenko
  2008-08-15 13:13 ` NMI lockup, 2.6.26 release Denys Fedoryshchenko
  2 siblings, 1 reply; 29+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-14 15:07 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: netdev

On 2.6.27-rc3-git2 I am getting a softlockup 60-120 seconds after
booting. Netconsole is almost dead; I tried to use it to get a stack
trace, but it sends only a few lines of header.

It happens when many tc sessions run in parallel; that's the only info
I have now.

Update: got interesting info, maybe this is the issue:

[   41.496997] =============================================
[   41.496997] [ INFO: possible recursive locking detected ]
[   41.496997] 2.6.27-rc3-git2-build-0030 #5
[   41.496997] ---------------------------------------------
[   41.496997] swapper/0 is trying to acquire lock:
[   41.496997]  (&list->lock#2){-+..}, at: [<c02617d4>] dev_queue_xmit+0x31f/0x481
[   41.496997]
[   41.496997] but task is already holding lock:
[   41.496997]  (&list->lock#2){-+..}, at: [<c0260d17>] netif_receive_skb+0x26e/0x3f6
[   41.496997]
[   41.496997] other info that might help us debug this:
[   41.496997] 5 locks held by swapper/0:
[   41.496997]  #0:  (rcu_read_lock){..--}, at: [<c025f873>] net_rx_action+0x54/0x1e5
[   41.496997]  #1:  (rcu_read_lock){..--}, at: [<c0260bb5>] netif_receive_skb+0x10c/0x3f6
[   41.496997]  #2:  (&list->lock#2){-+..}, at: [<c0260d17>] netif_receive_skb+0x26e/0x3f6
[   41.496997]  #3:  (&p->tcfc_lock){-+..}, at: [<f8a602fc>] tcf_mirred+0x1f/0x14b [act_mirred]
[   41.496997]  #4:  (rcu_read_lock){..--}, at: [<c0261635>] dev_queue_xmit+0x180/0x481
[   41.496997]
[   41.496997] stack backtrace:
[   41.496997] Pid: 0, comm: swapper Not tainted 2.6.27-rc3-git2-build-0030 #5
[   41.496997]  [<c02ba433>] ? printk+0xf/0x14
[   41.496997]  [<c013ed12>] __lock_acquire+0xb3a/0x118a
[   41.496997]  [<c013f3aa>] lock_acquire+0x48/0x64
[   41.496997]  [<c02617d4>] ? dev_queue_xmit+0x31f/0x481
[   41.496997]  [<c02bc901>] _spin_lock+0x1b/0x2a
[   41.496997]  [<c02617d4>] ? dev_queue_xmit+0x31f/0x481
[   41.496997]  [<c02617d4>] dev_queue_xmit+0x31f/0x481
[   41.496997]  [<f8a60408>] tcf_mirred+0x12b/0x14b [act_mirred]
[   41.496997]  [<f8a602dd>] ? tcf_mirred+0x0/0x14b [act_mirred]
[   41.496997]  [<c027102b>] tcf_action_exec+0x43/0x72
[   41.496997]  [<f8a92cd5>] u32_classify+0xf4/0x20b [cls_u32]
[   41.496997]  [<c013dabf>] ? trace_hardirqs_on+0xb/0xd
[   41.496997]  [<c026ea85>] tc_classify_compat+0x2e/0x5d
[   41.496997]  [<c026ebcd>] tc_classify+0x17/0x72
[   41.496997]  [<f8a9c0b2>] ingress_enqueue+0x1a/0x54 [sch_ingress]
[   41.496997]  [<c0260d31>] netif_receive_skb+0x288/0x3f6
[   41.496997]  [<c0260f13>] process_backlog+0x74/0xcb
[   41.496997]  [<c025f8da>] net_rx_action+0xbb/0x1e5
[   41.496997]  [<c0126203>] __do_softirq+0x7b/0xf4
[   41.496997]  [<c0126188>] ? __do_softirq+0x0/0xf4
[   41.496997]  [<c01060b3>] do_softirq+0x65/0xb6
[   41.496997]  [<c014a35d>] ? handle_fasteoi_irq+0x0/0xb6
[   41.496997]  [<c0125e28>] irq_exit+0x44/0x79
[   41.496997]  [<c0106038>] do_IRQ+0xae/0xc4
[   41.496997]  [<c0104288>] common_interrupt+0x28/0x30
[   41.496997]  [<c013007b>] ? find_get_pid+0x2e/0x4d
[   41.496997]  [<c0108d8a>] ? mwait_idle+0x39/0x43
[   41.496997]  [<c01029ee>] cpu_idle+0xbf/0xe1
[   41.496997]  [<c02afc5e>] rest_init+0x4e/0x50
[   41.496997] =======================
* New: softlockup in 2.6.27-rc3-git2
  2008-08-14 15:07 ` Denys Fedoryshchenko
@ 2008-08-14 15:10 ` Denys Fedoryshchenko
  0 siblings, 0 replies; 29+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-14 15:10 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: netdev

Sorry, I had to update the subject.

On Thursday 14 August 2008, Denys Fedoryshchenko wrote:
> On 2.6.27-rc3-git2 I am getting a softlockup 60-120 seconds after
> booting. Netconsole is almost dead; I tried to use it to get a stack
> trace, but it sends only a few lines of header.
>
> It happens when many tc sessions run in parallel; that's the only
> info I have now.
>
> Update: got interesting info, maybe this is the issue:
>
> [possible recursive locking report snipped; quoted in full above]
* Re: NMI lockup, 2.6.26 release
  2008-08-13  8:49 ` Jarek Poplawski
  2008-08-13  9:08 ` Denys Fedoryshchenko
  2008-08-14 15:07 ` Denys Fedoryshchenko
@ 2008-08-15 13:13 ` Denys Fedoryshchenko
  2008-08-15 14:16   ` Jarek Poplawski
  2 siblings, 1 reply; 29+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-15 13:13 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: netdev

On Wednesday 13 August 2008, Jarek Poplawski wrote:
> I doubt current rc versions are stable enough for any production. HTB
> waits for one fix, but it's nothing critical if it didn't bother you
> until now. There could still be some problems around the schedulers
> generally, after the last big changes.

After patching the locking issue, I applied 2.6.27-rc3 with those
patches (the locking one, not the testing patch which limits
resolution) on the shaper that was crashing under load. Without your
patch it was crashing on 2.6.27-rc3 just like before (the NMI watchdog
issuing a panic, and the machine not rebooting) after a few hours of
running.

Sadly, I will not be able to test on this machine anymore, because I
lost access to it and had to change the network structure.

I will try to bring it to my office and simulate the crash by
generating traffic, but most probably it will not work. It seems your
patch is required for mainline, but now I can test it only in theory,
because another machine is running as the shaper now, using HPET, not
TSC (it is an AMD Opteron).
* Re: NMI lockup, 2.6.26 release
  2008-08-15 13:13 ` NMI lockup, 2.6.26 release Denys Fedoryshchenko
@ 2008-08-15 14:16 ` Jarek Poplawski
  0 siblings, 0 replies; 29+ messages in thread
From: Jarek Poplawski @ 2008-08-15 14:16 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: netdev

On Fri, Aug 15, 2008 at 04:13:59PM +0300, Denys Fedoryshchenko wrote:
...
> After patching the locking issue, I applied 2.6.27-rc3 with those
> patches (the locking one, not the testing patch which limits
> resolution) on the shaper that was crashing under load. Without your
> patch it was crashing on 2.6.27-rc3 just like before (the NMI
> watchdog issuing a panic, and the machine not rebooting) after a few
> hours of running.
>
> Sadly, I will not be able to test on this machine anymore, because I
> lost access to it and had to change the network structure.
>
> I will try to bring it to my office and simulate the crash by
> generating traffic, but most probably it will not work. It seems your
> patch is required for mainline, but now I can test it only in theory,
> because another machine is running as the shaper now, using HPET, not
> TSC (it is an AMD Opteron).

Since this bug looks so rare, and is probably hardware dependent, it
looks like fixing it can wait until it bothers somebody again. Most
importantly, a workaround has been found, and we can come back to
improve this.

Thanks,
Jarek P.
end of thread, other threads:[~2008-08-15 14:14 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2008-07-22 18:42 NMI lockup, 2.6.26 release denys
2008-07-22 20:13 ` Jarek Poplawski
2008-07-22 20:35 ` Jarek Poplawski
2008-07-22 20:46 ` denys
2008-07-22 21:36 ` Jarek Poplawski
2008-07-22 21:45 ` denys
2008-07-23 19:47 ` denys
2008-07-23 21:09 ` Jarek Poplawski
2008-07-23 22:26 ` Jarek Poplawski
2008-07-23 23:24 ` Jarek Poplawski
2008-07-23 23:56 ` denys
2008-07-24 14:56 ` denys
2008-07-24 17:45 ` Jarek Poplawski
2008-07-25  7:36 ` Jarek Poplawski
2008-07-25 21:09 ` denys
2008-07-25 22:31 ` hrtimers lockups " Jarek Poplawski
2008-08-02 12:55 ` Denys Fedoryshchenko
2008-08-02 13:07 ` Jarek Poplawski
2008-08-12 11:31 ` Denys Fedoryshchenko
2008-08-12 12:40 ` Jarek Poplawski
2008-08-13  7:28 ` Denys Fedoryshchenko
2008-08-13  7:43 ` Jarek Poplawski
2008-08-13  8:02 ` Denys Fedoryshchenko
2008-08-13  8:49 ` Jarek Poplawski
2008-08-13  9:08 ` Denys Fedoryshchenko
2008-08-14 15:07 ` Denys Fedoryshchenko
2008-08-14 15:10 ` New: softlockup in 2.6.27-rc3-git2 Denys Fedoryshchenko
2008-08-15 13:13 ` NMI lockup, 2.6.26 release Denys Fedoryshchenko
2008-08-15 14:16 ` Jarek Poplawski