Stephen Hemminger wrote:
> On Thu, 04 Oct 2007 18:27:04 -0700
> Tina Yang wrote:
>
>> Andrew Morton wrote:
>>
>>> (Please respond by emailed reply-to-all, not via the bugzilla web interface)
>>>
>>> On Thu, 4 Oct 2007 16:24:18 -0700 (PDT)
>>> bugme-daemon@bugzilla.kernel.org wrote:
>>>
>>>> http://bugzilla.kernel.org/show_bug.cgi?id=9124
>>>>
>>>>            Summary: Netconsole race crashed the system
>>>>            Product: Networking
>>>>            Version: 2.5
>>>>      KernelVersion: 2.6.9, 2.6.18, 2.6.23
>>>>           Platform: All
>>>>         OS/Version: Linux
>>>>               Tree: Mainline
>>>>             Status: NEW
>>>>           Severity: high
>>>>           Priority: P1
>>>>          Component: Other
>>>>         AssignedTo: acme@ghostprotocols.net
>>>>         ReportedBy: tina.yang@oracle.com
>>>>
>>>>
>>>> Most recent kernel where this bug did not occur:
>>>> Think the problem has always been there.
>>>> Distribution:
>>>> Hardware Environment:
>>>> DELL PowerEdge 2650 (x86)
>>>> DELL PowerEdge 2850 (x86_64)
>>>> HP ProLiant DL380 G5 (x86_64)
>>>> with various NICs - e1000, tg3, bnx2
>>>> Software Environment:
>>>> 2.6.9, 2.6.18, 2.6.23
>>>> Problem Description:
>>>> On 2.6.18 found this issue on e1000 and tg3. On mainline 2.6.23-rc* found this
>>>> issue on e100, tg3 and bnx2. It either panicked
>>>> at netdevice.h:890 or hung the system, and sometimes, depending
>>>> on which NIC is used, printed the following console messages:
>>>> e1000:
>>>> "e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang"
>>>> tg3:
>>>> "NETDEV WATCHDOG: eth4: transmit timed out"
>>>> "tg3: eth4: transmit timed out, resetting"
>>>>
>>>> Steps to reproduce:
>>>> 1. On 2.6.18 (both x86 and x86_64) insert the netconsole module. (NIC: e1000 and tg3)
>>>> 2. Run a moderate I/O load, preferably fio - one process doing async + direct I/O
>>>> using libaio
>>>>
>>>> fio jobfile:
>>>> [global]
>>>> iodepth=1024
>>>> iodepth_batch=60
>>>> randrepeat=1
>>>> size=1024m
>>>> directory=/home/oracle
>>>> numjobs=2
>>>> [job1]
>>>> bs=8k
>>>> direct=1
>>>> ioengine=libaio
>>>> rw=randrw
>>>> filename=file1:file2
>>>>
>>>> 3. From a second console, as root, do "echo t > /proc/sysrq-trigger"
>>>>
>>>> Machine will instantly hang.
>>>>
>>>>
>>>> Crash stack captured on 2.6.9
>>>> PANIC: "kernel BUG at include/linux/netdevice.h:888!"
>>>> #0 [ 23c5e60] disk_dump at f9ca71a2
>>>> #1 [ 23c5e64] printk at 21228d6
>>>> #2 [ 23c5e70] freeze_other_cpus at f9ca6ef5
>>>> #3 [ 23c5e80] start_disk_dump at f9ca6fa0
>>>> #4 [ 23c5e90] try_crashdump at 2133766
>>>> #5 [ 23c5e98] die at 2106354
>>>> #6 [ 23c5ecc] do_invalid_op at 210672f
>>>> #7 [ 23c5f7c] error_code (via invalid_op) at fffecede
>>>> EAX: 00000006 EBX: 00200202 ECX: 00000000 EDX: df287000 EBP: e05ca000
>>>> DS: 007b ESI: 00000001 ES: 007b EDI: e05ca240
>>>> CS: 0060 EIP: f8c82a08 ERR: ffffffff EFLAGS: 00210046
>>>> #8 [ 23c5fb8] tg3_poll at f8c82a08
>>>> #9 [ 23c5fd0] net_rx_action at 227a8da
>>>> #10 [ 23c5fe8] __do_softirq at 2126422
>>>> --- ---
>>>> #0 [25c71cac] do_softirq at 2108460
>>>> #1 [25c71cb4] dev_queue_xmit at 227a0d2
>>>> #2 [25c71ccc] ip_finish_output at 229288d
>>>> #3 [25c71ce4] ip_queue_xmit at 2292fa9
>>>> #4 [25c71dac] tcp_transmit_skb at 22a0ff7
>>>> #5 [25c71dec] tcp_write_xmit at 22a1901
>>>> #6 [25c71e10] tcp_sendmsg at 2297d6d
>>>> #7 [25c71e80] sock_aio_write at 2272512
>>>> #8 [25c71eec] do_sync_write at 215a444
>>>> #9 [25c71f88] vfs_write at 215a53a
>>>> #10 [25c71fa4] sys_write at 215a5f4
>>>> #11 [25c71fc0] system_call at fffec219
>>>>
>>>> net_device in memory,
>>>> name = "eth0\000\000\000\000\000\000\000\000\000\000\000",
>>>> ...
>>>>
>>>>
>>>> Crash stack captured on 2.6.18
>>>> PANIC: "kernel BUG at include/linux/netdevice.h:890!"
>>>> #0 [c072ce30] crash_kexec at c044418a
>>>> #1 [c072ce74] die at c04054d0
>>>> #2 [c072cea4] do_invalid_op at c0405c20
>>>> #3 [c072cf54] error_code (via invalid_op) at c0404ab3
>>>> EAX: 00000007 EBX: 00000202 ECX: 00000000 EDX: f6d9c000 EBP: f6d9c400
>>>> DS: 007b ESI: 00000001 ES: 007b EDI: cb02b280
>>>> CS: 0060 EIP: f8927791 ERR: ffffffff EFLAGS: 00010046
>>>> #4 [c072cf88] tg3_poll at f8927791
>>>> --- ---
>>>> #0 [f7e54f60] do_softirq at c0406433
>>>> #1 [f7e54f6c] do_IRQ at c0406425
>>>> #2 [f7e54fb4] cpu_idle at c0402c8e
>>>>
>>>> net_device in memory,
>>>> name = "eth4\000\000\000\000\000\000\000\000\000\000\000",
>>>> name_hlist = {
>>>>   next = 0x0,
>>>>   pprev = 0xc07d0148
>>>> },
>>>> ...
>>>>
>>>
>>> OK, but in my 2.6.18, include/linux/netdevice.h:890 is a
>>> local_irq_restore() in netif_rx_complete(). I don't see how that can go
>>> BUG.
>>>
>>> Does your 2.6.18 have any patches applied?
>>>
>>> Please tell us what is at include/linux/netdevice.h:890 in your 2.6.18
>>> tree.
>>>
>>> -
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>
>> netdevice.h attached.
>> 890 BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state));
>>
>
> Comparing your version with the original 2.6.18 from kernel.org git shows:
>
> --- 2.6.18/include/linux/netdevice.h 2007-10-04 20:14:51.000000000 -0700
> +++ tina/include/linux//netdevice.h 2007-10-04 20:16:19.000000000 -0700
> @@ -342,6 +342,9 @@
>      /* Instance data managed by the core of Wireless Extensions. */
>      struct iw_public_data * wireless_data;
>
> +    /* pending config used by cfg80211/wext compat code only */
> +    void *cfg80211_wext_pending_config;
> +
>      struct ethtool_ops *ethtool_ops;
>
>      /*
> @@ -386,6 +389,7 @@
>      void *ip6_ptr; /* IPv6 specific data */
>      void *ec_ptr; /* Econet specific data */
>      void *ax25_ptr; /* AX.25 specific data */
> +    void *ieee80211_ptr; /* IEEE 802.11 specific data */
>
>      /*
>       * Cache line mostly used on receive path (including eth_type_trans())
>
>
> So you are not using a "pure" v2.6.18 kernel from kernel.org but more likely
> a distribution kernel that had already integrated the mac80211 stuff.
>

Yes, it's RHEL5 2.6.18-8. Attached is the 2.6.9-42 version, which doesn't have
802.11 and crashed at the same spot - netdevice.h:888. 2.6.23-rc2 and -rc4
also crashed.
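
For anyone following along, here is a rough user-space model of what the check at netdevice.h:890 guards against. It is only a sketch under assumptions: the thread has not established the root cause, and the idea that two pollers (say the softirq path and a second caller) both complete the same scheduled poll is my illustration, not a finding. All names below (`model_dev`, `model_rx_schedule`, `model_rx_complete`, `model_double_complete`) are hypothetical and are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the RX_SCHED bit in dev->state. */
#define MODEL_RX_SCHED_BIT 0

/* Hypothetical stand-in for struct net_device; only the state word. */
struct model_dev {
    unsigned long state;
};

/* Mark the device as scheduled for polling.
 * Returns true if this call set the bit, false if it was already set. */
static bool model_rx_schedule(struct model_dev *dev)
{
    assert(dev != NULL);
    bool was_set = (dev->state >> MODEL_RX_SCHED_BIT) & 1UL;
    dev->state |= 1UL << MODEL_RX_SCHED_BIT;
    return !was_set;
}

/* Complete a poll. Returns false at the point where the kernel's
 * BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state)) would fire,
 * i.e. when the bit has already been cleared by someone else. */
static bool model_rx_complete(struct model_dev *dev)
{
    assert(dev != NULL);
    if (!((dev->state >> MODEL_RX_SCHED_BIT) & 1UL))
        return false;
    dev->state &= ~(1UL << MODEL_RX_SCHED_BIT);
    return true;
}

/* Schedule once, then complete twice, as two uncoordinated pollers
 * might. Returns how many completions would have hit the BUG_ON. */
static int model_double_complete(void)
{
    struct model_dev dev = { 0 };
    int would_bug = 0;

    model_rx_schedule(&dev);
    if (!model_rx_complete(&dev))   /* first poller: bit set, clears it */
        would_bug++;
    if (!model_rx_complete(&dev))   /* second poller: bit already clear */
        would_bug++;
    return would_bug;
}
```

Calling `model_double_complete()` counts one would-be BUG, from the second completion. The model is single-threaded on purpose; it shows only the state-bit invariant, not the actual interleaving on the crashed machines.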