2.6.33-rc5: (e1000): transmit queue 0 timed out

netdev.vger.kernel.org archive mirror

* 2.6.33-rc5: (e1000): transmit queue 0 timed out
@ 2010-01-23 15:37 Alexander Beregalov
  2010-01-23 19:52 ` Rafael J. Wysocki
  2010-01-26  1:07 ` Brandeburg, Jesse
  0 siblings, 2 replies; 7+ messages in thread
From: Alexander Beregalov @ 2010-01-23 15:37 UTC (permalink / raw)
  To: netdev, e1000-devel; +Cc: Rafael J. Wysocki

Hi

It is x86_32, UP

e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
  Tx Queue             <0>
  TDH                  <0>
  TDT                  <1f>
  next_to_use          <1f>
  next_to_clean        <30>
buffer_info[next_to_clean]
  time_stamp           <12d519>
  next_to_watch        <30>
  jiffies              <12da92>
  next_to_watch.status <0>
WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x1c5/0x1d0()
Hardware name:
NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out
Modules linked in: hwmon_vid sata_sil i2c_nforce2
Pid: 0, comm: swapper Not tainted 2.6.33-rc5 #1
Call Trace:
 [<c102a49d>] warn_slowpath_common+0x6d/0xa0
 [<c12ea885>] ? dev_watchdog+0x1c5/0x1d0
 [<c12ea885>] ? dev_watchdog+0x1c5/0x1d0
 [<c102a516>] warn_slowpath_fmt+0x26/0x30
 [<c12ea885>] dev_watchdog+0x1c5/0x1d0
 [<c1033bb7>] ? run_timer_softirq+0xd7/0x240
 [<c1033c31>] run_timer_softirq+0x151/0x240
 [<c1033bb7>] ? run_timer_softirq+0xd7/0x240
 [<c12ea6c0>] ? dev_watchdog+0x0/0x1d0
 [<c102f40a>] __do_softirq+0x7a/0x110
 [<c102f4ed>] do_softirq+0x4d/0x60
 [<c102f625>] irq_exit+0x65/0x70
 [<c1015fe7>] smp_apic_timer_interrupt+0x47/0x80
 [<c11d6904>] ? trace_hardirqs_off_thunk+0xc/0x18
 [<c1350e63>] apic_timer_interrupt+0x2f/0x34
 [<c10088fd>] ? default_idle+0x2d/0x60
 [<c1001b19>] cpu_idle+0x39/0x60
 [<c13451e8>] rest_init+0x48/0x50
 [<c16196b4>] start_kernel+0x26d/0x274
 [<c1619275>] ? unknown_bootoption+0x0/0x19c
 [<c1619068>] i386_start_kernel+0x68/0x6e
---[ end trace 828c510cca9472df ]---
BUG: unable to handle kernel paging request at 2e8ca4f3
IP: [<c1071c51>] put_page+0x11/0x120
*pde = 00000000
Oops: 0000 [#1]
last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
Modules linked in: hwmon_vid sata_sil i2c_nforce2

Pid: 5, comm: events/0 Tainted: G        W  2.6.33-rc5 #1
NF7-S/NF7,NF7-V (nVidia-nForce2)/
EIP: 0060:[<c1071c51>] EFLAGS: 00010282 CPU: 0
EIP is at put_page+0x11/0x120
EAX: 2e8ca4f3 EBX: 2e8ca4f3 ECX: 00000000 EDX: ee960640
ESI: f6482620 EDI: 000016b0 EBP: f7065ea8 ESP: f7065e98
 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
Process events/0 (pid: 5, ti=f7064000 task=f70553c0 task.ti=f7064000)
Stack:
 00000206 00000001 f6482620 000016b0 f7065eb8 c12d3100 f6482620 f71d9f50
<0> f7065ec4 c12d2e32 f80376b0 f7065ecc c12d2ec5 f7065f00 c1276970 cccccccd
<0> f7065f00 f711fafc f711fafc f711faa0 00000000 f702b440 000000f2 f702b440
Call Trace:
 [<c12d3100>] ? skb_release_data+0x90/0xa0
 [<c12d2e32>] ? __kfree_skb+0x12/0x90
 [<c12d2ec5>] ? consume_skb+0x15/0x30
 [<c1276970>] ? e1000_clean_rx_ring+0x80/0x150
 [<c127c743>] ? e1000_down+0x1b3/0x1d0
 [<c127cf60>] ? e1000_reset_task+0x0/0x10
 [<c127cd3b>] ? e1000_reinit_locked+0x4b/0x70
 [<c127cf6d>] ? e1000_reset_task+0xd/0x10
 [<c103a9ea>] ? worker_thread+0x14a/0x230
 [<c103a989>] ? worker_thread+0xe9/0x230
 [<c103e160>] ? autoremove_wake_function+0x0/0x40
 [<c103a8a0>] ? worker_thread+0x0/0x230
 [<c103de6c>] ? kthread+0x6c/0x80
 [<c103de00>] ? kthread+0x0/0x80
 [<c100303a>] ? kernel_thread_helper+0x6/0x1c
Code: 00 00 00 8d bc 27 00 00 00 00 55 b8 e0 1f 07 c1 89 e5 e8 83 93 fc ff c9 c3 90 55 89 e5 83 ec 10 89 5d f4 89 75 f8 89 c3 89 7d fc <66> f7 00 00 c0 0f 85 e4 00 00 00 8b 40 04 85 c0 0f 84 e3 00 00
EIP: [<c1071c51>] put_page+0x11/0x120 SS:ESP 0068:f7065e98
CR2: 000000002e8ca4f3
---[ end trace 828c510cca9472e0 ]---

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired


* Re: 2.6.33-rc5: (e1000): transmit queue 0 timed out
  2010-01-23 15:37 2.6.33-rc5: (e1000): transmit queue 0 timed out Alexander Beregalov
@ 2010-01-23 19:52 ` Rafael J. Wysocki
  2010-01-23 20:04   ` Alexander Beregalov
  2010-01-26  1:07 ` Brandeburg, Jesse
  1 sibling, 1 reply; 7+ messages in thread
From: Rafael J. Wysocki @ 2010-01-23 19:52 UTC (permalink / raw)
  To: Alexander Beregalov; +Cc: netdev, e1000-devel

On Saturday 23 January 2010, Alexander Beregalov wrote:
> Hi
> 
> It is x86_32, UP
> 
> e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
>   Tx Queue             <0>
>   TDH                  <0>
>   TDT                  <1f>
>   next_to_use          <1f>
>   next_to_clean        <30>
> buffer_info[next_to_clean]
>   time_stamp           <12d519>
>   next_to_watch        <30>
>   jiffies              <12da92>
>   next_to_watch.status <0>
> WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x1c5/0x1d0()
> Hardware name:
> NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out
> Modules linked in: hwmon_vid sata_sil i2c_nforce2
> Pid: 0, comm: swapper Not tainted 2.6.33-rc5 #1
> Call Trace:
>  [<c102a49d>] warn_slowpath_common+0x6d/0xa0
>  [<c12ea885>] ? dev_watchdog+0x1c5/0x1d0
>  [<c12ea885>] ? dev_watchdog+0x1c5/0x1d0
>  [<c102a516>] warn_slowpath_fmt+0x26/0x30
>  [<c12ea885>] dev_watchdog+0x1c5/0x1d0
>  [<c1033bb7>] ? run_timer_softirq+0xd7/0x240
>  [<c1033c31>] run_timer_softirq+0x151/0x240
>  [<c1033bb7>] ? run_timer_softirq+0xd7/0x240
>  [<c12ea6c0>] ? dev_watchdog+0x0/0x1d0
>  [<c102f40a>] __do_softirq+0x7a/0x110
>  [<c102f4ed>] do_softirq+0x4d/0x60
>  [<c102f625>] irq_exit+0x65/0x70
>  [<c1015fe7>] smp_apic_timer_interrupt+0x47/0x80
>  [<c11d6904>] ? trace_hardirqs_off_thunk+0xc/0x18
>  [<c1350e63>] apic_timer_interrupt+0x2f/0x34
>  [<c10088fd>] ? default_idle+0x2d/0x60
>  [<c1001b19>] cpu_idle+0x39/0x60
>  [<c13451e8>] rest_init+0x48/0x50
>  [<c16196b4>] start_kernel+0x26d/0x274
>  [<c1619275>] ? unknown_bootoption+0x0/0x19c
>  [<c1619068>] i386_start_kernel+0x68/0x6e
> ---[ end trace 828c510cca9472df ]---
> BUG: unable to handle kernel paging request at 2e8ca4f3
> IP: [<c1071c51>] put_page+0x11/0x120
> *pde = 00000000
> Oops: 0000 [#1]
> last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
> Modules linked in: hwmon_vid sata_sil i2c_nforce2
> 
> Pid: 5, comm: events/0 Tainted: G        W  2.6.33-rc5 #1
> NF7-S/NF7,NF7-V (nVidia-nForce2)/
> EIP: 0060:[<c1071c51>] EFLAGS: 00010282 CPU: 0
> EIP is at put_page+0x11/0x120
> EAX: 2e8ca4f3 EBX: 2e8ca4f3 ECX: 00000000 EDX: ee960640
> ESI: f6482620 EDI: 000016b0 EBP: f7065ea8 ESP: f7065e98
>  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
> Process events/0 (pid: 5, ti=f7064000 task=f70553c0 task.ti=f7064000)
> Stack:
>  00000206 00000001 f6482620 000016b0 f7065eb8 c12d3100 f6482620 f71d9f50
> <0> f7065ec4 c12d2e32 f80376b0 f7065ecc c12d2ec5 f7065f00 c1276970 cccccccd
> <0> f7065f00 f711fafc f711fafc f711faa0 00000000 f702b440 000000f2 f702b440
> Call Trace:
>  [<c12d3100>] ? skb_release_data+0x90/0xa0
>  [<c12d2e32>] ? __kfree_skb+0x12/0x90
>  [<c12d2ec5>] ? consume_skb+0x15/0x30
>  [<c1276970>] ? e1000_clean_rx_ring+0x80/0x150
>  [<c127c743>] ? e1000_down+0x1b3/0x1d0
>  [<c127cf60>] ? e1000_reset_task+0x0/0x10
>  [<c127cd3b>] ? e1000_reinit_locked+0x4b/0x70
>  [<c127cf6d>] ? e1000_reset_task+0xd/0x10
>  [<c103a9ea>] ? worker_thread+0x14a/0x230
>  [<c103a989>] ? worker_thread+0xe9/0x230
>  [<c103e160>] ? autoremove_wake_function+0x0/0x40
>  [<c103a8a0>] ? worker_thread+0x0/0x230
>  [<c103de6c>] ? kthread+0x6c/0x80
>  [<c103de00>] ? kthread+0x0/0x80
>  [<c100303a>] ? kernel_thread_helper+0x6/0x1c
> Code: 00 00 00 8d bc 27 00 00 00 00 55 b8 e0 1f 07 c1 89 e5 e8 83 93
> fc ff c9 c3 90 55 89 e5 83 ec 10 89 5d f4 89 75 f8 89 c3 89 7d fc <66>
> f7 00 00 c0 0f 85 e4 00 00 00 8b 40 04 85 c0 0f 84 e3 00 00
> EIP: [<c1071c51>] put_page+0x11/0x120 SS:ESP 0068:f7065e98
> CR2: 000000002e8ca4f3
> ---[ end trace 828c510cca9472e0 ]---

Am I right in thinking that this is a regression?  If so, what is the last
working kernel?

Rafael


* Re: 2.6.33-rc5: (e1000): transmit queue 0 timed out
  2010-01-23 19:52 ` Rafael J. Wysocki
@ 2010-01-23 20:04   ` Alexander Beregalov
  2010-01-27  1:12     ` Jesse Brandeburg
  0 siblings, 1 reply; 7+ messages in thread
From: Alexander Beregalov @ 2010-01-23 20:04 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: netdev, e1000-devel

On 23 January 2010 22:52, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Saturday 23 January 2010, Alexander Beregalov wrote:
>> Hi
>>
>> It is x86_32, UP
>>
>> e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
>>   Tx Queue             <0>
>>   TDH                  <0>
>>   TDT                  <1f>
>>   next_to_use          <1f>
>>   next_to_clean        <30>
>> buffer_info[next_to_clean]
>>   time_stamp           <12d519>
>>   next_to_watch        <30>
>>   jiffies              <12da92>
>>   next_to_watch.status <0>
>> WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x1c5/0x1d0()
>> Hardware name:
>> NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out
>> Modules linked in: hwmon_vid sata_sil i2c_nforce2
>> Pid: 0, comm: swapper Not tainted 2.6.33-rc5 #1
>> Call Trace:
>>  [<c102a49d>] warn_slowpath_common+0x6d/0xa0
>>  [<c12ea885>] ? dev_watchdog+0x1c5/0x1d0
>>  [<c12ea885>] ? dev_watchdog+0x1c5/0x1d0
>>  [<c102a516>] warn_slowpath_fmt+0x26/0x30
>>  [<c12ea885>] dev_watchdog+0x1c5/0x1d0
>>  [<c1033bb7>] ? run_timer_softirq+0xd7/0x240
>>  [<c1033c31>] run_timer_softirq+0x151/0x240
>>  [<c1033bb7>] ? run_timer_softirq+0xd7/0x240
>>  [<c12ea6c0>] ? dev_watchdog+0x0/0x1d0
>>  [<c102f40a>] __do_softirq+0x7a/0x110
>>  [<c102f4ed>] do_softirq+0x4d/0x60
>>  [<c102f625>] irq_exit+0x65/0x70
>>  [<c1015fe7>] smp_apic_timer_interrupt+0x47/0x80
>>  [<c11d6904>] ? trace_hardirqs_off_thunk+0xc/0x18
>>  [<c1350e63>] apic_timer_interrupt+0x2f/0x34
>>  [<c10088fd>] ? default_idle+0x2d/0x60
>>  [<c1001b19>] cpu_idle+0x39/0x60
>>  [<c13451e8>] rest_init+0x48/0x50
>>  [<c16196b4>] start_kernel+0x26d/0x274
>>  [<c1619275>] ? unknown_bootoption+0x0/0x19c
>>  [<c1619068>] i386_start_kernel+0x68/0x6e
>> ---[ end trace 828c510cca9472df ]---
>> BUG: unable to handle kernel paging request at 2e8ca4f3
>> IP: [<c1071c51>] put_page+0x11/0x120
>> *pde = 00000000
>> Oops: 0000 [#1]
>> last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
>> Modules linked in: hwmon_vid sata_sil i2c_nforce2
>>
>> Pid: 5, comm: events/0 Tainted: G        W  2.6.33-rc5 #1
>> NF7-S/NF7,NF7-V (nVidia-nForce2)/
>> EIP: 0060:[<c1071c51>] EFLAGS: 00010282 CPU: 0
>> EIP is at put_page+0x11/0x120
>> EAX: 2e8ca4f3 EBX: 2e8ca4f3 ECX: 00000000 EDX: ee960640
>> ESI: f6482620 EDI: 000016b0 EBP: f7065ea8 ESP: f7065e98
>>  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
>> Process events/0 (pid: 5, ti=f7064000 task=f70553c0 task.ti=f7064000)
>> Stack:
>>  00000206 00000001 f6482620 000016b0 f7065eb8 c12d3100 f6482620 f71d9f50
>> <0> f7065ec4 c12d2e32 f80376b0 f7065ecc c12d2ec5 f7065f00 c1276970 cccccccd
>> <0> f7065f00 f711fafc f711fafc f711faa0 00000000 f702b440 000000f2 f702b440
>> Call Trace:
>>  [<c12d3100>] ? skb_release_data+0x90/0xa0
>>  [<c12d2e32>] ? __kfree_skb+0x12/0x90
>>  [<c12d2ec5>] ? consume_skb+0x15/0x30
>>  [<c1276970>] ? e1000_clean_rx_ring+0x80/0x150
>>  [<c127c743>] ? e1000_down+0x1b3/0x1d0
>>  [<c127cf60>] ? e1000_reset_task+0x0/0x10
>>  [<c127cd3b>] ? e1000_reinit_locked+0x4b/0x70
>>  [<c127cf6d>] ? e1000_reset_task+0xd/0x10
>>  [<c103a9ea>] ? worker_thread+0x14a/0x230
>>  [<c103a989>] ? worker_thread+0xe9/0x230
>>  [<c103e160>] ? autoremove_wake_function+0x0/0x40
>>  [<c103a8a0>] ? worker_thread+0x0/0x230
>>  [<c103de6c>] ? kthread+0x6c/0x80
>>  [<c103de00>] ? kthread+0x0/0x80
>>  [<c100303a>] ? kernel_thread_helper+0x6/0x1c
>> Code: 00 00 00 8d bc 27 00 00 00 00 55 b8 e0 1f 07 c1 89 e5 e8 83 93
>> fc ff c9 c3 90 55 89 e5 83 ec 10 89 5d f4 89 75 f8 89 c3 89 7d fc <66>
>> f7 00 00 c0 0f 85 e4 00 00 00 8b 40 04 85 c0 0f 84 e3 00 00
>> EIP: [<c1071c51>] put_page+0x11/0x120 SS:ESP 0068:f7065e98
>> CR2: 000000002e8ca4f3
>> ---[ end trace 828c510cca9472e0 ]---
>
> Am I right in thinking that this is a regression?  If so, what is the last
> working kernel?
>
Yes, it is.
It did not happen before -rc5. These kernels worked fine:
2.6.33-rc4
2.6.33-rc4-00189-g6ccf80e
2.6.33-rc4-00204-g7dc9c48
2.6.33-rc4-00399-g24bc734
2.6.33-rc4-00519-g836f48c
The diffstat for 836f48c..v2.6.33-rc5 does not look relevant to me, and
neither does the one for 24bc734..v2.6.33-rc5.
That suggests the bug is not easily reproducible. I am just running my
usual tasks, but it has happened again:

WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x1c5/0x1d0()
Hardware name:
NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out
<..>
BUG kmalloc-2048: Poison overwritten
-----------------------------------------------------------------------------

INFO: 0xf4998022-0xf4998607. First byte 0x0 instead of 0x6b
INFO: Allocated in __netdev_alloc_skb+0x1e/0x40 age=372 cpu=0 pid=1724
INFO: Freed in skb_release_data+0x68/0xa0 age=292 cpu=0 pid=5
INFO: Slab 0xc2283300 objects=15 used=0 fp=0xf499b950 flags=0x40004082
INFO: Object 0xf4998000 @offset=0 fp=0xf499d1e0

  Object 0xf4998000:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf4998010:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf4998020:  6b 6b 00 07 e9 09 d4 79 01 14 2b 09 0b 28 08 00 kk.....y..+..(..
  Object 0xf4998030:  45 20 05 d4 7d 09 40 00 75 06 3c e5 d5 b6 b0 b4 E...}.@.u.<.....
  Object 0xf4998040:  c1 a8 01 02 23 28 af a3 57 1f 82 b2 c7 a5 14 d5 ....#(..W.......
<..>
  Object 0xf49987d0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf49987e0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf49987f0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk.
 Redzone 0xf4998800:  bb bb bb bb                                     ....
 Padding 0xf4998828:  5a 5a 5a 5a 5a 5a 5a 5a                         ZZZZZZZZ
Pid: 1719, comm: rtorrent Tainted: G        W  2.6.33-rc5 #1
Call Trace:
 [<c108d1af>] print_trailer+0xcf/0x120
 [<c108d844>] check_bytes_and_report+0xc4/0xf0
 [<c108da1f>] check_object+0x1af/0x200
 [<c108db4a>] __free_slab+0xda/0x100
 [<c108db8c>] discard_slab+0x1c/0x30
 [<c108f552>] __slab_free+0xd2/0x280
 [<c11d7156>] ? copy_to_user+0x36/0x130
 [<c108f9df>] kfree+0xdf/0x110
 [<c12d30d8>] ? skb_release_data+0x68/0xa0
 [<c12d30d8>] ? skb_release_data+0x68/0xa0
 [<c12d30d8>] skb_release_data+0x68/0xa0
 [<c12d2e32>] __kfree_skb+0x12/0x90
 [<c13006a0>] tcp_recvmsg+0x6c0/0x8d0
 [<c102f691>] ? local_bh_enable_ip+0x61/0xc0
 [<c1350b75>] ? _raw_spin_unlock_bh+0x25/0x30
 [<c10435e5>] ? T.324+0x15/0x1b0
 [<c12cdc73>] sock_common_recvmsg+0x43/0x60
 [<c12cbc87>] sock_recvmsg+0xb7/0xf0
 [<c10435e5>] ? T.324+0x15/0x1b0
 [<c12cc589>] sys_recvfrom+0x79/0xe0
 [<c104d66b>] ? trace_hardirqs_off+0xb/0x10
 [<c104398e>] ? cpu_clock+0x4e/0x60
 [<c104d6a7>] ? lock_release_holdtime+0x37/0x1b0
 [<c1052121>] ? lock_release_non_nested+0x301/0x340
 [<c104d6a7>] ? lock_release_holdtime+0x37/0x1b0
 [<c107bd0a>] ? might_fault+0x4a/0xa0
 [<c12cc626>] sys_recv+0x36/0x40
 [<c12cd74c>] sys_socketcall+0x1ac/0x270
 [<c11d68f4>] ? trace_hardirqs_on_thunk+0xc/0x10
 [<c1002b10>] sysenter_do_call+0x12/0x36
FIX kmalloc-2048: Restoring 0xf4998022-0xf4998607=0x6b
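
For readers unfamiliar with this report format: SLUB fills a freed object
with 0x6b, ends it with 0xa5, and reports the first and last byte that no
longer match. The Python sketch below models only that idea; it is not the
kernel's code. The helper name is made up for illustration, and the
2048-byte object size is an assumption taken from the kmalloc-2048 cache
name.

```python
POISON_FREE = 0x6B  # byte SLUB writes over a freed object
POISON_END = 0xA5   # last byte of a poisoned object


def find_poison_damage(obj: bytes):
    """Return (first, last) offsets that differ from the free-poison pattern, or None."""
    expected = bytes([POISON_FREE]) * (len(obj) - 1) + bytes([POISON_END])
    bad = [i for i, (got, want) in enumerate(zip(obj, expected)) if got != want]
    return (bad[0], bad[-1]) if bad else None


# Hypothetical 2048-byte object, damaged over the offsets from the report
# (0xf4998022-0xf4998607, relative to the object at 0xf4998000).
obj = bytearray([POISON_FREE] * 2047 + [POISON_END])
obj[0x22:0x608] = bytes(0x608 - 0x22)  # stand-in data: zeros, not the real packet
damage = find_poison_damage(bytes(obj))
print(damage)
```

The stand-in zeros only mark where the damage sits; the real contents are
the bytes shown in the dump above.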


* Re: 2.6.33-rc5: (e1000): transmit queue 0 timed out
  2010-01-23 15:37 2.6.33-rc5: (e1000): transmit queue 0 timed out Alexander Beregalov
  2010-01-23 19:52 ` Rafael J. Wysocki
@ 2010-01-26  1:07 ` Brandeburg, Jesse
  2010-01-27  8:55   ` [E1000-devel] " Alexander Beregalov
  1 sibling, 1 reply; 7+ messages in thread
From: Brandeburg, Jesse @ 2010-01-26  1:07 UTC (permalink / raw)
  To: Alexander Beregalov
  Cc: e1000-devel@lists.sourceforge.net, netdev, Rafael J. Wysocki



On Sat, 23 Jan 2010, Alexander Beregalov wrote:
> It is x86_32, UP
> 
> e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
>   Tx Queue             <0>
>   TDH                  <0>

The queue seems not to have been started...  What test are you running?
What kind of traffic and system?  (lspci -vvv please)
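
As a rough aid to reading the register dump, the ring arithmetic can be
sketched in Python. This is only an illustration: the ring size of 256 is
an assumption (the report does not state it), and ring_distance is a
made-up helper, not a driver function.

```python
RING_SIZE = 256  # assumption: the ring size is not stated in the report


def ring_distance(start: int, end: int, ring_size: int = RING_SIZE) -> int:
    """Number of slots from start up to (not including) end, with wrap-around."""
    return (end - start) % ring_size


# Values copied from the "Detected Tx Unit Hang" dump (all hexadecimal).
dump = {
    "TDH": 0x0,
    "TDT": 0x1F,
    "next_to_use": 0x1F,
    "next_to_clean": 0x30,
    "time_stamp": 0x12D519,
    "jiffies": 0x12DA92,
}

# Descriptors between the hardware head and tail pointers.
queued_for_hw = ring_distance(dump["TDH"], dump["TDT"])
# Slots the driver has filled but not yet cleaned (depends on RING_SIZE).
awaiting_cleanup = ring_distance(dump["next_to_clean"], dump["next_to_use"])
# Age of the oldest pending buffer, in jiffies.
stalled_jiffies = dump["jiffies"] - dump["time_stamp"]

print(queued_for_hw, awaiting_cleanup, stalled_jiffies)
```

With these inputs the sketch gives 31 descriptors between TDH and TDT and
a gap of 0x579 (1401) jiffies between time_stamp and jiffies. The
wrap-around figure for next_to_clean depends entirely on the assumed ring
size.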


>   TDT                  <1f>
>   next_to_use          <1f>
>   next_to_clean        <30>
> buffer_info[next_to_clean]
>   time_stamp           <12d519>
>   next_to_watch        <30>
>   jiffies              <12da92>
>   next_to_watch.status <0>
> WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x1c5/0x1d0()
> Hardware name:
> NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out
> Modules linked in: hwmon_vid sata_sil i2c_nforce2
> Pid: 0, comm: swapper Not tainted 2.6.33-rc5 #1
> Call Trace:
>  [<c102a49d>] warn_slowpath_common+0x6d/0xa0
>  [<c12ea885>] ? dev_watchdog+0x1c5/0x1d0
>  [<c12ea885>] ? dev_watchdog+0x1c5/0x1d0
>  [<c102a516>] warn_slowpath_fmt+0x26/0x30
>  [<c12ea885>] dev_watchdog+0x1c5/0x1d0
>  [<c1033bb7>] ? run_timer_softirq+0xd7/0x240
>  [<c1033c31>] run_timer_softirq+0x151/0x240
>  [<c1033bb7>] ? run_timer_softirq+0xd7/0x240
>  [<c12ea6c0>] ? dev_watchdog+0x0/0x1d0
>  [<c102f40a>] __do_softirq+0x7a/0x110
>  [<c102f4ed>] do_softirq+0x4d/0x60
>  [<c102f625>] irq_exit+0x65/0x70
>  [<c1015fe7>] smp_apic_timer_interrupt+0x47/0x80
>  [<c11d6904>] ? trace_hardirqs_off_thunk+0xc/0x18
>  [<c1350e63>] apic_timer_interrupt+0x2f/0x34
>  [<c10088fd>] ? default_idle+0x2d/0x60
>  [<c1001b19>] cpu_idle+0x39/0x60
>  [<c13451e8>] rest_init+0x48/0x50
>  [<c16196b4>] start_kernel+0x26d/0x274
>  [<c1619275>] ? unknown_bootoption+0x0/0x19c
>  [<c1619068>] i386_start_kernel+0x68/0x6e
> ---[ end trace 828c510cca9472df ]---
> BUG: unable to handle kernel paging request at 2e8ca4f3
> IP: [<c1071c51>] put_page+0x11/0x120

Hm, a put_page panic. Are you running with jumbo frames enabled?  Does your
network have jumbo frame traffic on it?

> *pde = 00000000
> Oops: 0000 [#1]
> last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
> Modules linked in: hwmon_vid sata_sil i2c_nforce2
> 
> Pid: 5, comm: events/0 Tainted: G        W  2.6.33-rc5 #1
> NF7-S/NF7,NF7-V (nVidia-nForce2)/
> EIP: 0060:[<c1071c51>] EFLAGS: 00010282 CPU: 0
> EIP is at put_page+0x11/0x120
> EAX: 2e8ca4f3 EBX: 2e8ca4f3 ECX: 00000000 EDX: ee960640
> ESI: f6482620 EDI: 000016b0 EBP: f7065ea8 ESP: f7065e98
>  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
> Process events/0 (pid: 5, ti=f7064000 task=f70553c0 task.ti=f7064000)
> Stack:
>  00000206 00000001 f6482620 000016b0 f7065eb8 c12d3100 f6482620 f71d9f50
> <0> f7065ec4 c12d2e32 f80376b0 f7065ecc c12d2ec5 f7065f00 c1276970 cccccccd
> <0> f7065f00 f711fafc f711fafc f711faa0 00000000 f702b440 000000f2 f702b440
> Call Trace:
>  [<c12d3100>] ? skb_release_data+0x90/0xa0
>  [<c12d2e32>] ? __kfree_skb+0x12/0x90
>  [<c12d2ec5>] ? consume_skb+0x15/0x30
>  [<c1276970>] ? e1000_clean_rx_ring+0x80/0x150
>  [<c127c743>] ? e1000_down+0x1b3/0x1d0
>  [<c127cf60>] ? e1000_reset_task+0x0/0x10
>  [<c127cd3b>] ? e1000_reinit_locked+0x4b/0x70
>  [<c127cf6d>] ? e1000_reset_task+0xd/0x10
>  [<c103a9ea>] ? worker_thread+0x14a/0x230
>  [<c103a989>] ? worker_thread+0xe9/0x230
>  [<c103e160>] ? autoremove_wake_function+0x0/0x40
>  [<c103a8a0>] ? worker_thread+0x0/0x230
>  [<c103de6c>] ? kthread+0x6c/0x80
>  [<c103de00>] ? kthread+0x0/0x80
>  [<c100303a>] ? kernel_thread_helper+0x6/0x1c
> Code: 00 00 00 8d bc 27 00 00 00 00 55 b8 e0 1f 07 c1 89 e5 e8 83 93
> fc ff c9 c3 90 55 89 e5 83 ec 10 89 5d f4 89 75 f8 89 c3 89 7d fc <66>
> f7 00 00 c0 0f 85 e4 00 00 00 8b 40 04 85 c0 0f 84 e3 00 00
> EIP: [<c1071c51>] put_page+0x11/0x120 SS:ESP 0068:f7065e98
> CR2: 000000002e8ca4f3
> ---[ end trace 828c510cca9472e0 ]---


Thanks for the report. Do you believe it to be new to e1000 in 2.6.33-rc5?
Have you had a failure like this before, and/or can you see the same failure
on 2.6.32?




* Re: 2.6.33-rc5: (e1000): transmit queue 0 timed out
  2010-01-23 20:04   ` Alexander Beregalov
@ 2010-01-27  1:12     ` Jesse Brandeburg
  2010-01-27  9:03       ` Alexander Beregalov
  0 siblings, 1 reply; 7+ messages in thread
From: Jesse Brandeburg @ 2010-01-27  1:12 UTC (permalink / raw)
  To: Alexander Beregalov, Jesse Brandeburg
  Cc: Rafael J. Wysocki, netdev, e1000-devel

I also just noticed something else.

On Sat, Jan 23, 2010 at 12:04 PM, Alexander Beregalov
<a.beregalov@gmail.com> wrote:
>>> Pid: 5, comm: events/0 Tainted: G        W  2.6.33-rc5 #1
>>> NF7-S/NF7,NF7-V (nVidia-nForce2)/
>>> EIP: 0060:[<c1071c51>] EFLAGS: 00010282 CPU: 0
>>> EIP is at put_page+0x11/0x120
>>> EAX: 2e8ca4f3 EBX: 2e8ca4f3 ECX: 00000000 EDX: ee960640
>>> ESI: f6482620 EDI: 000016b0 EBP: f7065ea8 ESP: f7065e98
>>>  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
>>> Process events/0 (pid: 5, ti=f7064000 task=f70553c0 task.ti=f7064000)
>>> Stack:
>>>  00000206 00000001 f6482620 000016b0 f7065eb8 c12d3100 f6482620 f71d9f50
>>> <0> f7065ec4 c12d2e32 f80376b0 f7065ecc c12d2ec5 f7065f00 c1276970 cccccccd
>>> <0> f7065f00 f711fafc f711fafc f711faa0 00000000 f702b440 000000f2 f702b440
>>> Call Trace:
>>>  [<c12d3100>] ? skb_release_data+0x90/0xa0
>>>  [<c12d2e32>] ? __kfree_skb+0x12/0x90
>>>  [<c12d2ec5>] ? consume_skb+0x15/0x30
>>>  [<c1276970>] ? e1000_clean_rx_ring+0x80/0x150
>>>  [<c127c743>] ? e1000_down+0x1b3/0x1d0
>>>  [<c127cf60>] ? e1000_reset_task+0x0/0x10
>>>  [<c127cd3b>] ? e1000_reinit_locked+0x4b/0x70
>>>  [<c127cf6d>] ? e1000_reset_task+0xd/0x10
>>>  [<c103a9ea>] ? worker_thread+0x14a/0x230
>>>  [<c103a989>] ? worker_thread+0xe9/0x230
>>>  [<c103e160>] ? autoremove_wake_function+0x0/0x40
>>>  [<c103a8a0>] ? worker_thread+0x0/0x230
>>>  [<c103de6c>] ? kthread+0x6c/0x80
>>>  [<c103de00>] ? kthread+0x0/0x80
>>>  [<c100303a>] ? kernel_thread_helper+0x6/0x1c
> WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x1c5/0x1d0()
> Hardware name:
> NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out

There are at least two problems here; I am not sure yet whether they are
related. The first is the WATCHDOG (the kernel will print this only once
per e1000 driver load). We don't know what is causing it right now. Maybe
we can mock up some ftrace magic to dump the transmit descriptors that are
formed, or you can run the e1000_dump patch code. Let me know if you want
me to generate a version of it for you.

> <..>
> BUG kmalloc-2048: Poison overwritten
> -----------------------------------------------------------------------------
>
> INFO: 0xf4998022-0xf4998607. First byte 0x0 instead of 0x6b
> INFO: Allocated in __netdev_alloc_skb+0x1e/0x40 age=372 cpu=0 pid=1724
> INFO: Freed in skb_release_data+0x68/0xa0 age=292 cpu=0 pid=5
> INFO: Slab 0xc2283300 objects=15 used=0 fp=0xf499b950 flags=0x40004082
> INFO: Object 0xf4998000 @offset=0 fp=0xf499d1e0
>
>  Object 0xf4998000:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
>  Object 0xf4998010:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
>  Object 0xf4998020:  6b 6b 00 07 e9 09 d4 79 01 14 2b 09 0b 28 08 00 kk.....y..+..(..
>  Object 0xf4998030:  45 20 05 d4 7d 09 40 00 75 06 3c e5 d5 b6 b0 b4 E...}.@.u.<.....
>  Object 0xf4998040:  c1 a8 01 02 23 28 af a3 57 1f 82 b2 c7 a5 14 d5 ....#(..W.......
> <..>

Hey, that's an Ethernet/IPv4 packet!  The memory above is typically
allocated at a 2 kB boundary, and the hardware would then start DMAing at
offset 0x22 from that boundary...  How long was the packet that corrupted
data in memory (how many bytes)?
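
The byte count can be read off the slab report itself. A small Python
sketch follows, assuming the "INFO: 0xSTART-0xEND." range is inclusive at
both ends (which matches the "Restoring" line in the report); the helper
name is made up for illustration.

```python
import re

# Copied from the slab report quoted above.
INFO_LINE = "INFO: 0xf4998022-0xf4998607. First byte 0x0 instead of 0x6b"
OBJECT_BASE = 0xF4998000  # from "INFO: Object 0xf4998000 @offset=0"


def damaged_range(info_line: str):
    """Parse the 'INFO: 0xSTART-0xEND.' line and return (start, end) as ints."""
    match = re.search(r"0x([0-9a-f]+)-0x([0-9a-f]+)", info_line)
    if match is None:
        raise ValueError("no address range found")
    return int(match.group(1), 16), int(match.group(2), 16)


start, end = damaged_range(INFO_LINE)
offset = start - OBJECT_BASE  # where the overwrite begins inside the object
length = end - start + 1      # inclusive range, so add one

print(hex(offset), length)
```

For the report above this gives an overwrite of 1510 bytes starting at
offset 0x22 into the object.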

00 07 e9 09 d4 79
dest mac address
01 14 2b 09 0b 28
src mac address
08 00
ethertype (IPv4)
45 20
ip v4, header length 20 bytes, TOS 0x20 (DSCP 8, ECN bits 00)
and so on...
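
The same decode can be done mechanically. Below is a minimal Python sketch
that unpacks the Ethernet and IPv4 headers from the bytes in the slab dump
using the standard field layout; the helper names are made up for
illustration, and only the first 34 bytes of the overwritten area are used.

```python
import struct

# First 34 bytes of the overwritten area, copied from the slab dump
# (object offsets 0x22-0x43): 14 bytes of Ethernet, 20 bytes of IPv4.
RAW = bytes.fromhex(
    "0007e909d479" "01142b090b28" "0800"
    "4520" "05d4" "7d09" "4000" "75" "06" "3ce5"
    "d5b6b0b4" "c1a80102"
)


def mac(raw: bytes) -> str:
    """Format six bytes as a colon-separated MAC address."""
    return ":".join(f"{octet:02x}" for octet in raw)


def decode(frame: bytes) -> dict:
    """Decode the Ethernet header and the fixed 20-byte IPv4 header."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    (ver_ihl, tos, total_len, _ident, _flags_frag,
     ttl, proto, _checksum, src_ip, dst_ip) = struct.unpack(
        "!BBHHHBBH4s4s", frame[14:34])
    return {
        "dst_mac": mac(dst),
        "src_mac": mac(src),
        "ethertype": ethertype,
        "version": ver_ihl >> 4,
        "header_len": (ver_ihl & 0x0F) * 4,  # IHL counts 32-bit words
        "dscp": tos >> 2,                    # upper six bits of the TOS byte
        "ecn": tos & 0x03,                   # lower two bits of the TOS byte
        "total_len": total_len,
        "ttl": ttl,
        "protocol": proto,                   # 6 is TCP
        "src_ip": ".".join(map(str, src_ip)),
        "dst_ip": ".".join(map(str, dst_ip)),
    }


fields = decode(RAW)
frame_len = 14 + fields["total_len"]  # Ethernet header plus IP datagram
damaged_bytes = 0x607 - 0x22 + 1      # inclusive range from the slab report

print(fields)
print(frame_len, damaged_bytes)
```

The IP total length is 1492 bytes, so the frame is 1506 bytes with its
Ethernet header. The damaged range in the slab report is 1510 bytes, which
would fit the frame plus a 4-byte FCS; that last step is an inference, not
something the report states.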

So this is the second issue: we've called kfree on a packet buffer, but the
hardware is still receiving into it (probably because we didn't wait long
enough for receives to stop).

>  Object 0xf49987d0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
>  Object 0xf49987e0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
>  Object 0xf49987f0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk.
>  Redzone 0xf4998800:  bb bb bb bb                                     ....
>  Padding 0xf4998828:  5a 5a 5a 5a 5a 5a 5a 5a                         ZZZZZZZZ
> Pid: 1719, comm: rtorrent Tainted: G        W  2.6.33-rc5 #1
> Call Trace:
>  [<c108d1af>] print_trailer+0xcf/0x120
>  [<c108d844>] check_bytes_and_report+0xc4/0xf0
>  [<c108da1f>] check_object+0x1af/0x200
>  [<c108db4a>] __free_slab+0xda/0x100
>  [<c108db8c>] discard_slab+0x1c/0x30
>  [<c108f552>] __slab_free+0xd2/0x280
>  [<c11d7156>] ? copy_to_user+0x36/0x130
>  [<c108f9df>] kfree+0xdf/0x110
>  [<c12d30d8>] ? skb_release_data+0x68/0xa0
>  [<c12d30d8>] ? skb_release_data+0x68/0xa0
>  [<c12d30d8>] skb_release_data+0x68/0xa0
>  [<c12d2e32>] __kfree_skb+0x12/0x90
>  [<c13006a0>] tcp_recvmsg+0x6c0/0x8d0
>  [<c102f691>] ? local_bh_enable_ip+0x61/0xc0
>  [<c1350b75>] ? _raw_spin_unlock_bh+0x25/0x30
>  [<c10435e5>] ? T.324+0x15/0x1b0
>  [<c12cdc73>] sock_common_recvmsg+0x43/0x60
>  [<c12cbc87>] sock_recvmsg+0xb7/0xf0
>  [<c10435e5>] ? T.324+0x15/0x1b0
>  [<c12cc589>] sys_recvfrom+0x79/0xe0
>  [<c104d66b>] ? trace_hardirqs_off+0xb/0x10
>  [<c104398e>] ? cpu_clock+0x4e/0x60
>  [<c104d6a7>] ? lock_release_holdtime+0x37/0x1b0
>  [<c1052121>] ? lock_release_non_nested+0x301/0x340
>  [<c104d6a7>] ? lock_release_holdtime+0x37/0x1b0
>  [<c107bd0a>] ? might_fault+0x4a/0xa0
>  [<c12cc626>] sys_recv+0x36/0x40
>  [<c12cd74c>] sys_socketcall+0x1ac/0x270
>  [<c11d68f4>] ? trace_hardirqs_on_thunk+0xc/0x10
>  [<c1002b10>] sysenter_do_call+0x12/0x36
> FIX kmalloc-2048: Restoring 0xf4998022-0xf4998607=0x6b

I'll reply here with a patch that checks whether the receive unit has
actually stopped. I still don't know what is causing the NETDEV WATCHDOG,
though, and we need some more debug information for that (from the
e1000_dump patch).

Can you answer any of my other questions too?


* Re: [E1000-devel] 2.6.33-rc5: (e1000): transmit queue 0 timed out
  2010-01-26  1:07 ` Brandeburg, Jesse
@ 2010-01-27  8:55   ` Alexander Beregalov
  0 siblings, 0 replies; 7+ messages in thread
From: Alexander Beregalov @ 2010-01-27  8:55 UTC (permalink / raw)
  To: Brandeburg, Jesse
  Cc: netdev, e1000-devel@lists.sourceforge.net, Rafael J. Wysocki

2010/1/26 Brandeburg, Jesse <jesse.brandeburg@intel.com>:
>
>
> On Sat, 23 Jan 2010, Alexander Beregalov wrote:
>> It is x86_32, UP
>>
>> e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
>>   Tx Queue             <0>
>>   TDH                  <0>
>
> The queue seems not to have been started...  What test are you running?
> What kind of traffic and system?  (lspci -vvv please)

The host just does regular tasks - NFS server and rtorrent client.

01:0a.0 Ethernet controller [0200]: Intel Corporation 82540EM Gigabit Ethernet Controller [8086:100e] (rev 02)
        Subsystem: Intel Corporation PRO/1000 MT Desktop Adapter [8086:002e]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR+ INTx-
        Latency: 64 (63750ns min)
        Interrupt: pin A routed to IRQ 18
        Region 0: Memory at ec000000 (32-bit, non-prefetchable) [size=128K]
        Region 1: Memory at ec020000 (32-bit, non-prefetchable) [size=128K]
        Region 2: I/O ports at a040 [size=64]
        [virtual] Expansion ROM at 60080000 [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [e4] PCI-X non-bridge device
                Command: DPERE- ERO+ RBC=512 OST=1
                Status: Dev=00:00.0 64bit- 133MHz- SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
        Capabilities: [f0] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Kernel driver in use: e1000

>
>
>>   TDT                  <1f>
>>   next_to_use          <1f>
>>   next_to_clean        <30>
>> buffer_info[next_to_clean]
>>   time_stamp           <12d519>
>>   next_to_watch        <30>
>>   jiffies              <12da92>
>>   next_to_watch.status <0>
>> WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x1c5/0x1d0()
>> Hardware name:
>> NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out
>> Modules linked in: hwmon_vid sata_sil i2c_nforce2
>> Pid: 0, comm: swapper Not tainted 2.6.33-rc5 #1
>> Call Trace:
>>  [<c102a49d>] warn_slowpath_common+0x6d/0xa0
>>  [<c12ea885>] ? dev_watchdog+0x1c5/0x1d0
>>  [<c12ea885>] ? dev_watchdog+0x1c5/0x1d0
>>  [<c102a516>] warn_slowpath_fmt+0x26/0x30
>>  [<c12ea885>] dev_watchdog+0x1c5/0x1d0
>>  [<c1033bb7>] ? run_timer_softirq+0xd7/0x240
>>  [<c1033c31>] run_timer_softirq+0x151/0x240
>>  [<c1033bb7>] ? run_timer_softirq+0xd7/0x240
>>  [<c12ea6c0>] ? dev_watchdog+0x0/0x1d0
>>  [<c102f40a>] __do_softirq+0x7a/0x110
>>  [<c102f4ed>] do_softirq+0x4d/0x60
>>  [<c102f625>] irq_exit+0x65/0x70
>>  [<c1015fe7>] smp_apic_timer_interrupt+0x47/0x80
>>  [<c11d6904>] ? trace_hardirqs_off_thunk+0xc/0x18
>>  [<c1350e63>] apic_timer_interrupt+0x2f/0x34
>>  [<c10088fd>] ? default_idle+0x2d/0x60
>>  [<c1001b19>] cpu_idle+0x39/0x60
>>  [<c13451e8>] rest_init+0x48/0x50
>>  [<c16196b4>] start_kernel+0x26d/0x274
>>  [<c1619275>] ? unknown_bootoption+0x0/0x19c
>>  [<c1619068>] i386_start_kernel+0x68/0x6e
>> ---[ end trace 828c510cca9472df ]---
>> BUG: unable to handle kernel paging request at 2e8ca4f3
>> IP: [<c1071c51>] put_page+0x11/0x120
>
> Hm, a put_page panic. Are you running with jumbo frames enabled?  Does your
> network have jumbo frame traffic on it?

Jumbo frames are disabled, no jumbo frame traffic.
>
>> *pde = 00000000
>> Oops: 0000 [#1]
>> last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
>> Modules linked in: hwmon_vid sata_sil i2c_nforce2
>>
>> Pid: 5, comm: events/0 Tainted: G        W  2.6.33-rc5 #1
>> NF7-S/NF7,NF7-V (nVidia-nForce2)/
>> EIP: 0060:[<c1071c51>] EFLAGS: 00010282 CPU: 0
>> EIP is at put_page+0x11/0x120
>> EAX: 2e8ca4f3 EBX: 2e8ca4f3 ECX: 00000000 EDX: ee960640
>> ESI: f6482620 EDI: 000016b0 EBP: f7065ea8 ESP: f7065e98
>>  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
>> Process events/0 (pid: 5, ti=f7064000 task=f70553c0 task.ti=f7064000)
>> Stack:
>>  00000206 00000001 f6482620 000016b0 f7065eb8 c12d3100 f6482620 f71d9f50
>> <0> f7065ec4 c12d2e32 f80376b0 f7065ecc c12d2ec5 f7065f00 c1276970 cccccccd
>> <0> f7065f00 f711fafc f711fafc f711faa0 00000000 f702b440 000000f2 f702b440
>> Call Trace:
>>  [<c12d3100>] ? skb_release_data+0x90/0xa0
>>  [<c12d2e32>] ? __kfree_skb+0x12/0x90
>>  [<c12d2ec5>] ? consume_skb+0x15/0x30
>>  [<c1276970>] ? e1000_clean_rx_ring+0x80/0x150
>>  [<c127c743>] ? e1000_down+0x1b3/0x1d0
>>  [<c127cf60>] ? e1000_reset_task+0x0/0x10
>>  [<c127cd3b>] ? e1000_reinit_locked+0x4b/0x70
>>  [<c127cf6d>] ? e1000_reset_task+0xd/0x10
>>  [<c103a9ea>] ? worker_thread+0x14a/0x230
>>  [<c103a989>] ? worker_thread+0xe9/0x230
>>  [<c103e160>] ? autoremove_wake_function+0x0/0x40
>>  [<c103a8a0>] ? worker_thread+0x0/0x230
>>  [<c103de6c>] ? kthread+0x6c/0x80
>>  [<c103de00>] ? kthread+0x0/0x80
>>  [<c100303a>] ? kernel_thread_helper+0x6/0x1c
>> Code: 00 00 00 8d bc 27 00 00 00 00 55 b8 e0 1f 07 c1 89 e5 e8 83 93
>> fc ff c9 c3 90 55 89 e5 83 ec 10 89 5d f4 89 75 f8 89 c3 89 7d fc <66>
>> f7 00 00 c0 0f 85 e4 00 00 00 8b 40 04 85 c0 0f 84 e3 00 00
>> EIP: [<c1071c51>] put_page+0x11/0x120 SS:ESP 0068:f7065e98
>> CR2: 000000002e8ca4f3
>> ---[ end trace 828c510cca9472e0 ]---
>
>
> Thanks for the report. Do you believe it to be new to e1000 in 2.6.33-rc5?
> Have you had a failure like this before, and/or can you see the same
> failure on 2.6.32?

Yes, I believe it is new to 2.6.33-rc5; I have not seen it before.


* Re: 2.6.33-rc5: (e1000): transmit queue 0 timed out
  2010-01-27  1:12     ` Jesse Brandeburg
@ 2010-01-27  9:03       ` Alexander Beregalov
  0 siblings, 0 replies; 7+ messages in thread
From: Alexander Beregalov @ 2010-01-27  9:03 UTC (permalink / raw)
  To: Jesse Brandeburg; +Cc: Rafael J. Wysocki, netdev, Jesse Brandeburg, e1000-devel

2010/1/27 Jesse Brandeburg <jesse.brandeburg@gmail.com>:
> I also just noticed something else.
>
> On Sat, Jan 23, 2010 at 12:04 PM, Alexander Beregalov
> <a.beregalov@gmail.com> wrote:
>>>> Pid: 5, comm: events/0 Tainted: G        W  2.6.33-rc5 #1
>>>> NF7-S/NF7,NF7-V (nVidia-nForce2)/
>>>> EIP: 0060:[<c1071c51>] EFLAGS: 00010282 CPU: 0
>>>> EIP is at put_page+0x11/0x120
>>>> EAX: 2e8ca4f3 EBX: 2e8ca4f3 ECX: 00000000 EDX: ee960640
>>>> ESI: f6482620 EDI: 000016b0 EBP: f7065ea8 ESP: f7065e98
>>>>  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
>>>> Process events/0 (pid: 5, ti=f7064000 task=f70553c0 task.ti=f7064000)
>>>> Stack:
>>>>  00000206 00000001 f6482620 000016b0 f7065eb8 c12d3100 f6482620 f71d9f50
>>>> <0> f7065ec4 c12d2e32 f80376b0 f7065ecc c12d2ec5 f7065f00 c1276970 cccccccd
>>>> <0> f7065f00 f711fafc f711fafc f711faa0 00000000 f702b440 000000f2 f702b440
>>>> Call Trace:
>>>>  [<c12d3100>] ? skb_release_data+0x90/0xa0
>>>>  [<c12d2e32>] ? __kfree_skb+0x12/0x90
>>>>  [<c12d2ec5>] ? consume_skb+0x15/0x30
>>>>  [<c1276970>] ? e1000_clean_rx_ring+0x80/0x150
>>>>  [<c127c743>] ? e1000_down+0x1b3/0x1d0
>>>>  [<c127cf60>] ? e1000_reset_task+0x0/0x10
>>>>  [<c127cd3b>] ? e1000_reinit_locked+0x4b/0x70
>>>>  [<c127cf6d>] ? e1000_reset_task+0xd/0x10
>>>>  [<c103a9ea>] ? worker_thread+0x14a/0x230
>>>>  [<c103a989>] ? worker_thread+0xe9/0x230
>>>>  [<c103e160>] ? autoremove_wake_function+0x0/0x40
>>>>  [<c103a8a0>] ? worker_thread+0x0/0x230
>>>>  [<c103de6c>] ? kthread+0x6c/0x80
>>>>  [<c103de00>] ? kthread+0x0/0x80
>>>>  [<c100303a>] ? kernel_thread_helper+0x6/0x1c
>> WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x1c5/0x1d0()
>> Hardware name:
>> NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out
>
> There are at least two problems here; I am not sure yet whether they are
> related.  The first is the NETDEV WATCHDOG warning (the kernel will print
> this only once per e1000 driver load).  We don't know what is causing it
> right now.  Maybe we can mock up some ftrace magic to dump the transmit
> descriptors that are formed, or you can run the e1000_dump patch code.
> Let me know if you want me to generate a version of it for you.

Yes please
>
>> <..>
>> BUG kmalloc-2048: Poison overwritten
>> -----------------------------------------------------------------------------
>>
>> INFO: 0xf4998022-0xf4998607. First byte 0x0 instead of 0x6b
>> INFO: Allocated in __netdev_alloc_skb+0x1e/0x40 age=372 cpu=0 pid=1724
>> INFO: Freed in skb_release_data+0x68/0xa0 age=292 cpu=0 pid=5
>> INFO: Slab 0xc2283300 objects=15 used=0 fp=0xf499b950 flags=0x40004082
>> INFO: Object 0xf4998000 @offset=0 fp=0xf499d1e0
>>
>>  Object 0xf4998000:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
>>  Object 0xf4998010:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
>>  Object 0xf4998020:  6b 6b 00 07 e9 09 d4 79 01 14 2b 09 0b 28 08 00 kk.....y..+..(..
>>  Object 0xf4998030:  45 20 05 d4 7d 09 40 00 75 06 3c e5 d5 b6 b0 b4 E...}.@.u.<.....
>>  Object 0xf4998040:  c1 a8 01 02 23 28 af a3 57 1f 82 b2 c7 a5 14 d5 ....#(..W.......
>> <..>
>
> Hey, that's an Ethernet/IPv4 packet!  The memory above is typically
> allocated at a 2 kB boundary, and the hardware would then start DMAing
> at offset 0x22 (0xf4998000 + 0x22 = 0xf4998022, where the overwritten
> range starts).  How many bytes of memory did that packet corrupt?
>
> 00 07 e9 09 d4 79
> dest mac
Yes, this is the mac of the e1000 on the host.
> 01 14 2b 09 0b 28
> src mac address
It looks corrupted. Gateway's mac is 00:14:2b:09:0a:28
> 08 00
> ethertype (IPv4)
> 45 20
> IPv4, header length 20 bytes, TOS 0x20 (DSCP 8 / CS1, ECN bits 00)
> and on...
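
The byte-by-byte decoding above can be reproduced mechanically. The
following is a minimal Python sketch, not part of the original exchange:
the bytes are copied from the quoted SLUB dump starting at object offset
0x22, and the names FRAME_START and decode_headers are hypothetical. It
assumes an Ethernet II frame carrying an IPv4 header without options.

```python
import struct

# Bytes copied from the quoted SLUB dump, starting at object offset 0x22.
FRAME_START = bytes.fromhex(
    "00 07 e9 09 d4 79 01 14 2b 09 0b 28 08 00"   # Ethernet header (14 bytes)
    " 45 20 05 d4 7d 09 40 00 75 06 3c e5"        # IPv4 header, bytes 0-11
    " d5 b6 b0 b4 c1 a8 01 02"                    # IPv4 source and destination
)


def mac(b: bytes) -> str:
    """Format six raw bytes as a colon-separated MAC address."""
    return ":".join(f"{x:02x}" for x in b)


def ipv4(b: bytes) -> str:
    """Format four raw bytes as a dotted-quad IPv4 address."""
    return ".".join(str(x) for x in b)


def decode_headers(raw: bytes) -> dict:
    """Decode an Ethernet II header and the fixed 20-byte IPv4 header."""
    dst, src, ethertype = struct.unpack("!6s6sH", raw[:14])
    (ver_ihl, tos, total_len, _ident, _frag, ttl,
     proto, _csum, src_ip, dst_ip) = struct.unpack("!BBHHHBBH4s4s", raw[14:34])
    return {
        "dst_mac": mac(dst),
        "src_mac": mac(src),
        "ethertype": ethertype,
        "version": ver_ihl >> 4,
        "header_len": (ver_ihl & 0x0F) * 4,  # IHL counts 32-bit words
        "dscp": tos >> 2,                    # upper six bits of the TOS byte
        "ecn": tos & 0x03,                   # lower two bits of the TOS byte
        "total_len": total_len,
        "ttl": ttl,
        "protocol": proto,                   # 6 is TCP
        "src_ip": ipv4(src_ip),
        "dst_ip": ipv4(dst_ip),
    }


if __name__ == "__main__":
    for key, value in decode_headers(FRAME_START).items():
        print(f"{key:>10}: {value}")
```

On these bytes the TOS byte 0x20 splits into DSCP 8 and ECN bits 00, and
the IPv4 total length field (05 d4) reads 1492 bytes.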
>
> So this is the second issue: we've called kfree on a packet buffer, but
> the hardware is still receiving into it (probably because we didn't wait
> long enough for receives to quit).
>
>>  Object 0xf49987d0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
>>  Object 0xf49987e0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
>>  Object 0xf49987f0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk.
>>  Redzone 0xf4998800:  bb bb bb bb                                     ....
>>  Padding 0xf4998828:  5a 5a 5a 5a 5a 5a 5a 5a                         ZZZZZZZZ
>> Pid: 1719, comm: rtorrent Tainted: G        W  2.6.33-rc5 #1
>> Call Trace:
>>  [<c108d1af>] print_trailer+0xcf/0x120
>>  [<c108d844>] check_bytes_and_report+0xc4/0xf0
>>  [<c108da1f>] check_object+0x1af/0x200
>>  [<c108db4a>] __free_slab+0xda/0x100
>>  [<c108db8c>] discard_slab+0x1c/0x30
>>  [<c108f552>] __slab_free+0xd2/0x280
>>  [<c11d7156>] ? copy_to_user+0x36/0x130
>>  [<c108f9df>] kfree+0xdf/0x110
>>  [<c12d30d8>] ? skb_release_data+0x68/0xa0
>>  [<c12d30d8>] ? skb_release_data+0x68/0xa0
>>  [<c12d30d8>] skb_release_data+0x68/0xa0
>>  [<c12d2e32>] __kfree_skb+0x12/0x90
>>  [<c13006a0>] tcp_recvmsg+0x6c0/0x8d0
>>  [<c102f691>] ? local_bh_enable_ip+0x61/0xc0
>>  [<c1350b75>] ? _raw_spin_unlock_bh+0x25/0x30
>>  [<c10435e5>] ? T.324+0x15/0x1b0
>>  [<c12cdc73>] sock_common_recvmsg+0x43/0x60
>>  [<c12cbc87>] sock_recvmsg+0xb7/0xf0
>>  [<c10435e5>] ? T.324+0x15/0x1b0
>>  [<c12cc589>] sys_recvfrom+0x79/0xe0
>>  [<c104d66b>] ? trace_hardirqs_off+0xb/0x10
>>  [<c104398e>] ? cpu_clock+0x4e/0x60
>>  [<c104d6a7>] ? lock_release_holdtime+0x37/0x1b0
>>  [<c1052121>] ? lock_release_non_nested+0x301/0x340
>>  [<c104d6a7>] ? lock_release_holdtime+0x37/0x1b0
>>  [<c107bd0a>] ? might_fault+0x4a/0xa0
>>  [<c12cc626>] sys_recv+0x36/0x40
>>  [<c12cd74c>] sys_socketcall+0x1ac/0x270
>>  [<c11d68f4>] ? trace_hardirqs_on_thunk+0xc/0x10
>>  [<c1002b10>] sysenter_do_call+0x12/0x36
>> FIX kmalloc-2048: Restoring 0xf4998022-0xf4998607=0x6b
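
The question of how many bytes were overwritten can be answered from the
SLUB report itself. The sketch below is an illustration rather than
anything from the original thread; poison_span is a hypothetical helper,
and it assumes the address range in the INFO line is inclusive and that
the 05 d4 bytes in the dump are the IPv4 total length field.

```python
import re

ETH_HEADER_LEN = 14  # destination MAC + source MAC + EtherType
ETH_FCS_LEN = 4      # trailing frame check sequence (CRC32)


def poison_span(info_line: str) -> tuple:
    """Return (first, last, length) for a SLUB 'INFO: 0x..-0x..' line."""
    m = re.search(r"0x([0-9a-f]+)-0x([0-9a-f]+)", info_line)
    if m is None:
        raise ValueError("no address range found")
    first, last = int(m.group(1), 16), int(m.group(2), 16)
    return first, last, last - first + 1  # assumption: range is inclusive


first, last, length = poison_span(
    "INFO: 0xf4998022-0xf4998607. First byte 0x0 instead of 0x6b"
)
offset = first - 0xf4998000                # start offset inside the object
ip_total_len = 0x05d4                      # IPv4 total length from the dump
frame_len = ETH_HEADER_LEN + ip_total_len  # frame length without the FCS

print(f"overwritten bytes : {length}")
print(f"start offset      : {offset:#x}")
print(f"frame without FCS : {frame_len}")
print(f"difference        : {length - frame_len}")
```

Under those assumptions the overwritten span is 1510 bytes starting at
offset 0x22, which is 4 bytes more than the 1506-byte Ethernet header plus
IPv4 datagram. That would be consistent with one complete received frame
including its FCS, although the report alone does not prove it.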
>
> I'll reply here with a patch that checks whether the receive unit has
> actually stopped, but I still don't know what is causing the NETDEV
> WATCHDOG, and we need some more debug information for that (from the
> e1000_dump patch).

Yes please. I would like to help, but I do not know how to reproduce it.

>
> can you answer any of my other questions too?
>


end of thread, other threads:[~2010-01-27  9:03 UTC | newest]

Thread overview: 7+ messages
2010-01-23 15:37 2.6.33-rc5: (e1000): transmit queue 0 timed out Alexander Beregalov
2010-01-23 19:52 ` Rafael J. Wysocki
2010-01-23 20:04   ` Alexander Beregalov
2010-01-27  1:12     ` Jesse Brandeburg
2010-01-27  9:03       ` Alexander Beregalov
2010-01-26  1:07 ` Brandeburg, Jesse
2010-01-27  8:55   ` [E1000-devel] " Alexander Beregalov
