* domU crash with kernel BUG at drivers/net/xen-netfront.c:305
@ 2013-12-27 11:09 Vasily Evseenko
  2013-12-27 11:53 ` Wei Liu
  2014-01-02  5:09 ` annie li
  0 siblings, 2 replies; 10+ messages in thread
From: Vasily Evseenko @ 2013-12-27 11:09 UTC
  To: xen-devel

Hi,

I'm getting a domU crash (roughly every 1-2 days under high TCP network
load) with this message:

-----
[2013-12-26 03:53:18] kernel BUG at drivers/net/xen-netfront.c:305!
[2013-12-26 03:53:18] invalid opcode: 0000 [#1] SMP
[2013-12-26 03:53:18] Modules linked in: ipt_REJECT iptable_filter
xt_set xt_REDIRECT iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
nf_nat_ipv4 nf_nat ip_tables ip_set_hash_net ip_set_hash_ip ip_set
nfnetlink ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state
nf_conntrack ip6table_filter ip6_tables ipv6 ext3 jbd xen_netfront
coretemp hwmon crc32_pclmul crc32c_intel ghash_clmulni_intel microcode
pcspkr ext4 jbd2 mbcache aesni_intel ablk_helper cryptd lrw gf128mul
glue_helper aes_x86_64 xen_blkfront dm_mirror dm_region_hash dm_log dm_mod
[2013-12-26 03:53:18] CPU: 0 PID: 15126 Comm: python Not tainted
3.10.25-11.x86_64 #1
[2013-12-26 03:53:18] task: ffff8801e5d68ac0 ti: ffff8801e7392000
task.ti: ffff8801e7392000
[2013-12-26 03:53:18] RIP: e030:[<ffffffffa015d637>]
[<ffffffffa015d637>] xennet_alloc_rx_buffers+0x347/0x360 [xen_netfront]
[2013-12-26 03:53:18] RSP: e02b:ffff8801f2e03ce0  EFLAGS: 00010282
[2013-12-26 03:53:18] RAX: 00000000000001d4 RBX: ffff8801e5438800 RCX:
0000000000000001
[2013-12-26 03:53:18] RDX: 000000000000002a RSI: 0000000000000000 RDI:
0000000000002200
[2013-12-26 03:53:18] RBP: ffff8801f2e03d40 R08: 0000000000000000 R09:
0000000000001000
[2013-12-26 03:53:18] R10: ffff8801000083c0 R11: dead000000200200 R12:
0000000000000220
[2013-12-26 03:53:18] R13: ffff8801e6eec0c0 R14: 000000000000002a R15:
000000000239642a
[2013-12-26 03:53:18] FS:  00007f4cf48d57e0(0000)
GS:ffff8801f2e00000(0000) knlGS:0000000000000000
[2013-12-26 03:53:18] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[2013-12-26 03:53:18] CR2: ffffffffff600400 CR3: 00000001e0db3000 CR4:
0000000000042660
[2013-12-26 03:53:18] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[2013-12-26 03:53:18] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[2013-12-26 03:53:18] Stack:
[2013-12-26 03:53:18]  ffff8801f2e03df0 02396417e5438000
ffff8801e5439d58 ffff8801e54394f0
[2013-12-26 03:53:18]  ffff8801e5438000 002affff00000013
ffff8801f2e03d40 ffff8801f2e03db0
[2013-12-26 03:53:18]  0000000000000010 ffff8800655e6ac0
ffff8801e5438800 ffff8801e511a000
[2013-12-26 03:53:18] Call Trace:
[2013-12-26 03:53:18]  <IRQ>
[2013-12-26 03:53:18]  [<ffffffffa015dc44>] xennet_poll+0x2f4/0x630
[xen_netfront]
[2013-12-26 03:53:18]  [<ffffffff810640a9>] ? raise_softirq_irqoff+0x9/0x50
[2013-12-26 03:53:18]  [<ffffffff8152050c>] ? dev_kfree_skb_irq+0x5c/0x70
[2013-12-26 03:53:18]  [<ffffffff810e4fb9>] ?
handle_irq_event_percpu+0xc9/0x210
[2013-12-26 03:53:18]  [<ffffffff81528022>] net_rx_action+0x112/0x290
[2013-12-26 03:53:18]  [<ffffffff810e514d>] ? handle_irq_event+0x4d/0x70
[2013-12-26 03:53:18]  [<ffffffff81063c97>] __do_softirq+0xf7/0x270
[2013-12-26 03:53:18]  [<ffffffff81600edc>] call_softirq+0x1c/0x30
[2013-12-26 03:53:18]  [<ffffffff81014505>] do_softirq+0x65/0xa0
[2013-12-26 03:53:18]  [<ffffffff810639c5>] irq_exit+0xc5/0xd0
[2013-12-26 03:53:18]  [<ffffffff81351e45>] xen_evtchn_do_upcall+0x35/0x50
[2013-12-26 03:53:18]  [<ffffffff81600f3e>]
xen_do_hypervisor_callback+0x1e/0x30
[2013-12-26 03:53:18]  <EOI>
[2013-12-26 03:53:18] Code: 8b 35 ee f9 bb e1 48 8d bb 08 0d 00 00 48 83
c6 64 e8 2e f2 f0 e0 8b 83 ec 0c 00 00 31 d2 89 c1 d1 e9 39 d1 76 9e e9
5a ff ff ff <0f> 0b eb fe 0f 0b 0f 1f 00 eb fb 66 66 66 66 66 2e 0f 1f
84 00
[2013-12-26 03:53:18] RIP  [<ffffffffa015d637>]
xennet_alloc_rx_buffers+0x347/0x360 [xen_netfront]
[2013-12-26 03:53:18]  RSP <ffff8801f2e03ce0>
------------
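
For anyone decoding the oops above, here is a minimal sketch. It assumes the usual x86 oops convention that the byte wrapped in <...> on the "Code:" line is the one at RIP. It checks whether the bytes at RIP are 0f 0b (ud2), the instruction BUG() emits on x86, which would be consistent with the "invalid opcode" line. The helper name bytes_at_rip is hypothetical.

```python
# "Code:" bytes copied from the oops above (the log prefix is stripped).
CODE_LINE = (
    "8b 35 ee f9 bb e1 48 8d bb 08 0d 00 00 48 83 "
    "c6 64 e8 2e f2 f0 e0 8b 83 ec 0c 00 00 31 d2 89 c1 d1 e9 39 d1 76 9e e9 "
    "5a ff ff ff <0f> 0b eb fe 0f 0b 0f 1f 00 eb fb 66 66 66 66 66 2e 0f 1f "
    "84 00"
)

# x86 encoding of the ud2 instruction.
UD2 = (0x0F, 0x0B)


def bytes_at_rip(code_line, count=2):
    """Return `count` bytes starting at the <..> marker (assumed to be RIP)."""
    tokens = code_line.split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return tuple(int(tok.strip("<>"), 16) for tok in tokens[start:start + count])


faulting = bytes_at_rip(CODE_LINE)
print("bytes at RIP:", " ".join(f"{b:02x}" for b in faulting))
print("looks like ud2 (BUG trap):", faulting == UD2)
```

This only confirms that the trap came from a BUG()-style assertion; it does not say which condition at xen-netfront.c:305 failed.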

The dom0 and domU kernels are vanilla 3.10.25.
The host server has 4 cores x 2 threads, mapped as: 4 - dom0, 2 - domU,
2 - domU.
I've tried Xen versions 4.2.3 and 4.3.1.
I've also tried disabling offloading on the domU (ethtool -K eth0 tx off
tso off gso off), with no effect.

The domUs are under high TCP load (a lot of small TCP connections; web server).
Sometimes I get this on dom0:
---
[2013-12-26 00:16:30] (XEN) grant_table.c:289:d0 Increased maptrack size
to 2 frames
[2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
99221507
[2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
43646979
[2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
43646979
[2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
99221507
[2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
43646979
[2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
99221507
[2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
99221507
[2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
99221507

---

It seems the root of the problem is in the dom0 messages above. Is it a HW
failure or an overflow of some internal kernel structure?


* Re: domU crash with kernel BUG at drivers/net/xen-netfront.c:305
  2013-12-27 11:09 domU crash with kernel BUG at drivers/net/xen-netfront.c:305 Vasily Evseenko
@ 2013-12-27 11:53 ` Wei Liu
  2013-12-27 12:21   ` Vasily Evseenko
  2014-01-02  5:09 ` annie li
  1 sibling, 1 reply; 10+ messages in thread
From: Wei Liu @ 2013-12-27 11:53 UTC
  To: Vasily Evseenko; +Cc: wei.liu2, xen-devel

On Fri, Dec 27, 2013 at 03:09:17PM +0400, Vasily Evseenko wrote:
> Hi,
> 
> I've got domU crash (~ every 1-2 days under high network (tcp) load)
> with message:
> 

Do you have clear steps on how to reproduce this issue?

> domU's are under high TCP load (a lot of small tcp connections (web server))
> sometimes  i've got on dom0:
> ---
> [2013-12-26 00:16:30] (XEN) grant_table.c:289:d0 Increased maptrack size
> to 2 frames
> [2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
> 99221507
> [2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
> 43646979
> [2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
> 43646979
> [2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
> 99221507
> [2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
> 43646979
> [2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
> 99221507
> [2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
> 99221507
> [2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
> 99221507
> 

A grant reference this big doesn't look right. The frontend is passing
garbage to the backend. Given that the numbers shown above follow a certain
pattern, there might be a bug in the frontend somewhere.
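
To make the pattern in those two numbers visible, here is a small sketch that prints each logged grant reference in hex and splits it into 16-bit halves. The helper name split_halves is hypothetical; the values are copied from the dom0 log quoted above.

```python
# Grant references copied from the "Bad grant reference" lines in the dom0 log.
BAD_GRANT_REFS = [99221507, 43646979]


def split_halves(value):
    """Return (high 16 bits, low 16 bits) of a 32-bit value."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


for ref in BAD_GRANT_REFS:
    high, low = split_halves(ref)
    print(f"{ref:>9} = 0x{ref:08x}  high=0x{high:04x} ({high})  low=0x{low:04x} ({low})")
```

Both values share the low half 0x0003 and differ only in the high half (1514 and 666), so they look more like two packed 16-bit fields than like random corruption. That is an observation on the logged numbers only, not a diagnosis.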

> ---
> 
> It seems the root of problem in dom0 messages above. Is it HW failure or
> some internal kernel structures overflow?
> 

It's not a hardware failure; an overflow is more likely. To fix this, you
need to provide more info on your workload, software setup, etc.

Wei.

> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


* Re: domU crash with kernel BUG at drivers/net/xen-netfront.c:305
  2013-12-27 11:53 ` Wei Liu
@ 2013-12-27 12:21   ` Vasily Evseenko
  2013-12-27 14:20     ` Wei Liu
  2013-12-31 12:56     ` William Dauchy
  0 siblings, 2 replies; 10+ messages in thread
From: Vasily Evseenko @ 2013-12-27 12:21 UTC
  To: Wei Liu; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 2121 bytes --]

Hi,
There are no clear steps to reproduce.
The bug is triggered only by high TCP network load (webserver pattern: many
small connections), once every one to two days.
See the Xen, dom0 and domU info in the attachment. I can provide any
additional info.
I've tried dom0/domU kernels 3.10.23 (vanilla, from centos-xen) and
3.10.25, and Xen 4.2.3 and 4.3.1.

On 12/27/2013 03:53 PM, Wei Liu wrote:
> On Fri, Dec 27, 2013 at 03:09:17PM +0400, Vasily Evseenko wrote:
>> Hi,
>>
>> I've got domU crash (~ every 1-2 days under high network (tcp) load)
>> with message:
>>
> Do you have clear steps on how to reproduce this issue?
>
>> domU's are under high TCP load (a lot of small tcp connections (web server))
>> sometimes  i've got on dom0:
>> ---
>> [2013-12-26 00:16:30] (XEN) grant_table.c:289:d0 Increased maptrack size
>> to 2 frames
>> [2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
>> 99221507
>> [2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
>> 43646979
>> [2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
>> 43646979
>> [2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
>> 99221507
>> [2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
>> 43646979
>> [2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
>> 99221507
>> [2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
>> 99221507
>> [2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
>> 99221507
>>
> Grant reference this big doesn't look right. Frontend is passing garbage
> to backend. Given that the numbers showed above have certain pattern
> there might be a bug in frontend somewhere.
>
>> ---
>>
>> It seems the root of problem in dom0 messages above. Is it HW failure or
>> some internal kernel structures overflow?
>>
> It's not hardware failure, more likely to be overflow. In order to fix
> this you need to provide more info on your workload, software setup etc.
>
> Wei.


[-- Attachment #2: host.tgz --]
[-- Type: application/x-gzip, Size: 19688 bytes --]


* Re: domU crash with kernel BUG at drivers/net/xen-netfront.c:305
  2013-12-27 12:21   ` Vasily Evseenko
@ 2013-12-27 14:20     ` Wei Liu
  2013-12-31 12:56     ` William Dauchy
  1 sibling, 0 replies; 10+ messages in thread
From: Wei Liu @ 2013-12-27 14:20 UTC
  To: Vasily Evseenko; +Cc: Wei Liu, xen-devel

On Fri, Dec 27, 2013 at 04:21:10PM +0400, Vasily Evseenko wrote:
> Hi,
> There are no clear steps to reproduce.
> Bug triggered only by high tcp network load (webserver pattern - many
> small connections) every one - two days.
> See xen, dom0 and domU's info in attachment. I can provide any
> additional info.
> I've tried dom0/domU kernels 3.10.23 (vanilla, from centos-xen) and
> 3.10.25,  xen 4.2.3 and 4.3.1.
> 

From the look of the files I don't actually have enough clues.

I would suggest you try DaveM's latest net stable tree, which has a
bunch of fixes for netback. One interesting bit is that a guest
receive-side patch went in recently.

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git

Wei.


* Re: domU crash with kernel BUG at drivers/net/xen-netfront.c:305
  2013-12-27 12:21   ` Vasily Evseenko
  2013-12-27 14:20     ` Wei Liu
@ 2013-12-31 12:56     ` William Dauchy
  2013-12-31 14:23       ` Vasily Evseenko
  1 sibling, 1 reply; 10+ messages in thread
From: William Dauchy @ 2013-12-31 12:56 UTC
  To: Vasily Evseenko; +Cc: Wei Liu, xen-devel

On Fri, Dec 27, 2013 at 1:21 PM, Vasily Evseenko <svpcom@gmail.com> wrote:
> There are no clear steps to reproduce.
> Bug triggered only by high tcp network load (webserver pattern - many
> small connections) every one - two days.
> See xen, dom0 and domU's info in attachment. I can provide any
> additional info.
> I've tried dom0/domU kernels 3.10.23 (vanilla, from centos-xen) and
> 3.10.25,  xen 4.2.3 and 4.3.1.

Maybe testing with kmemleak enabled on the domU will help get more info.
-- 
William


* Re: domU crash with kernel BUG at drivers/net/xen-netfront.c:305
  2013-12-31 12:56     ` William Dauchy
@ 2013-12-31 14:23       ` Vasily Evseenko
  0 siblings, 0 replies; 10+ messages in thread
From: Vasily Evseenko @ 2013-12-31 14:23 UTC
  To: William Dauchy; +Cc: Wei Liu, xen-devel

I've found a workaround:
Run "ethtool -K vifX.Y tx off tso off gso off"
on the dom0 side (in addition to "ethtool -K eth0 tx off tso off gso off"
on the domU).
Disabling offloading only in the domU is not sufficient.
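
The two halves of the workaround can be sketched as a small helper that builds both command lines without running them. It assumes the common Xen naming convention in which vifX.Y stands for domain ID X and device index Y; the function names and the example domain ID 5 are hypothetical.

```python
# Offload settings taken from the workaround above.
OFFLOAD_ARGS = ["tx", "off", "tso", "off", "gso", "off"]


def ethtool_cmd(iface):
    """Build the ethtool argv that disables tx/tso/gso on one interface."""
    return ["ethtool", "-K", iface] + OFFLOAD_ARGS


def workaround_commands(domid, devid=0, guest_iface="eth0"):
    """Return the command to run in dom0 (on the vif) and in the domU."""
    vif = f"vif{domid}.{devid}"  # assumed naming: vif<domid>.<devid>
    return {"dom0": ethtool_cmd(vif), "domU": ethtool_cmd(guest_iface)}


if __name__ == "__main__":
    # domid=5 is only an example; look up the real ID on dom0.
    for side, cmd in workaround_commands(domid=5).items():
        print(f"{side}: {' '.join(cmd)}")
```

The sketch only prints the commands, so they can be reviewed before being run on the respective side. Note that a vif is typically recreated when the guest restarts, so the dom0 setting may need to be reapplied.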


On 12/31/2013 04:56 PM, William Dauchy wrote:
> On Fri, Dec 27, 2013 at 1:21 PM, Vasily Evseenko <svpcom@gmail.com> wrote:
>> There are no clear steps to reproduce.
>> Bug triggered only by high tcp network load (webserver pattern - many
>> small connections) every one - two days.
>> See xen, dom0 and domU's info in attachment. I can provide any
>> additional info.
>> I've tried dom0/domU kernels 3.10.23 (vanilla, from centos-xen) and
>> 3.10.25,  xen 4.2.3 and 4.3.1.
> maybe testing with kmemleak enabled on the domU will help getting more info.


* Re: domU crash with kernel BUG at drivers/net/xen-netfront.c:305
  2013-12-27 11:09 domU crash with kernel BUG at drivers/net/xen-netfront.c:305 Vasily Evseenko
  2013-12-27 11:53 ` Wei Liu
@ 2014-01-02  5:09 ` annie li
  2014-01-02 11:40   ` Pasi Kärkkäinen
  2014-01-02 12:01   ` Wei Liu
  1 sibling, 2 replies; 10+ messages in thread
From: annie li @ 2014-01-02  5:09 UTC
  To: Vasily Evseenko; +Cc: xen-devel


On 2013/12/27 19:09, Vasily Evseenko wrote:
> Hi,
>
> I've got domU crash (~ every 1-2 days under high network (tcp) load)
> with message:
>
> -----
> [2013-12-26 03:53:18] kernel BUG at drivers/net/xen-netfront.c:305!
> [2013-12-26 03:53:18] invalid opcode: 0000 [#1] SMP
> [2013-12-26 03:53:18] Modules linked in: ipt_REJECT iptable_filter
> xt_set xt_REDIRECT iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
> nf_nat_ipv4 nf_nat
> ip_tables ip_set_hash_net ip_set_hash_ip ip_set nfnetlink ip6t_REJECT
> nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter
> ip6_table
> s ipv6 ext3 jbd xen_netfront coretemp hwmon crc32_pclmul crc32c_intel
> ghash_clmulni_intel microcode pcspkr ext4 jbd2 mbcache aesni_intel
> ablk_helper c
> ryptd lrw gf128mul glue_helper aes_x86_64 xen_blkfront dm_mirror
> dm_region_hash dm_log dm_mod
> [2013-12-26 03:53:18] CPU: 0 PID: 15126 Comm: python Not tainted
> 3.10.25-11.x86_64 #1
> [2013-12-26 03:53:18] task: ffff8801e5d68ac0 ti: ffff8801e7392000
> task.ti: ffff8801e7392000
> [2013-12-26 03:53:18] RIP: e030:[<ffffffffa015d637>]
> [<ffffffffa015d637>] xennet_alloc_rx_buffers+0x347/0x360 [xen_netfront]
> [2013-12-26 03:53:18] RSP: e02b:ffff8801f2e03ce0  EFLAGS: 00010282
> [2013-12-26 03:53:18] RAX: 00000000000001d4 RBX: ffff8801e5438800 RCX:
> 0000000000000001
> [2013-12-26 03:53:18] RDX: 000000000000002a RSI: 0000000000000000 RDI:
> 0000000000002200
> [2013-12-26 03:53:18] RBP: ffff8801f2e03d40 R08: 0000000000000000 R09:
> 0000000000001000
> [2013-12-26 03:53:18] R10: ffff8801000083c0 R11: dead000000200200 R12:
> 0000000000000220
> [2013-12-26 03:53:18] R13: ffff8801e6eec0c0 R14: 000000000000002a R15:
> 000000000239642a
> [2013-12-26 03:53:18] FS:  00007f4cf48d57e0(0000)
> GS:ffff8801f2e00000(0000) knlGS:0000000000000000
> [2013-12-26 03:53:18] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> [2013-12-26 03:53:18] CR2: ffffffffff600400 CR3: 00000001e0db3000 CR4:
> 0000000000042660
> [2013-12-26 03:53:18] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [2013-12-26 03:53:18] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [2013-12-26 03:53:18] Stack:
> [2013-12-26 03:53:18]  ffff8801f2e03df0 02396417e5438000
> ffff8801e5439d58 ffff8801e54394f0
> [2013-12-26 03:53:18]  ffff8801e5438000 002affff00000013
> ffff8801f2e03d40 ffff8801f2e03db0
> [2013-12-26 03:53:18]  0000000000000010 ffff8800655e6ac0
> ffff8801e5438800 ffff8801e511a000
> [2013-12-26 03:53:18] Call Trace:
> [2013-12-26 03:53:18]  <IRQ>
> [2013-12-26 03:53:18]  [<ffffffffa015dc44>] xennet_poll+0x2f4/0x630
> [xen_netfront]
> [2013-12-26 03:53:18]  [<ffffffff810640a9>] ? raise_softirq_irqoff+0x9/0x50
> [2013-12-26 03:53:18]  [<ffffffff8152050c>] ? dev_kfree_skb_irq+0x5c/0x70
> [2013-12-26 03:53:18]  [<ffffffff810e4fb9>] ?
> handle_irq_event_percpu+0xc9/0x210
> [2013-12-26 03:53:18]  [<ffffffff81528022>] net_rx_action+0x112/0x290
> [2013-12-26 03:53:18]  [<ffffffff810e514d>] ? handle_irq_event+0x4d/0x70
> [2013-12-26 03:53:18]  [<ffffffff81063c97>] __do_softirq+0xf7/0x270
> [2013-12-26 03:53:18]  [<ffffffff81600edc>] call_softirq+0x1c/0x30
> [2013-12-26 03:53:18]  [<ffffffff81014505>] do_softirq+0x65/0xa0
> [2013-12-26 03:53:18]  [<ffffffff810639c5>] irq_exit+0xc5/0xd0
> [2013-12-26 03:53:18]  [<ffffffff81351e45>] xen_evtchn_do_upcall+0x35/0x50
> [2013-12-26 03:53:18]  [<ffffffff81600f3e>]
> xen_do_hypervisor_callback+0x1e/0x30
> [2013-12-26 03:53:18]  <EOI>
> [2013-12-26 03:53:18] Code: 8b 35 ee f9 bb e1 48 8d bb 08 0d 00 00 48 83
> c6 64 e8 2e f2 f0 e0 8b 83 ec 0c 00 00 31 d2 89 c1 d1 e9 39 d1 76 9e e9
> 5a ff ff ff <0f> 0b eb fe 0f 0b 0f 1f 00 eb fb 66 66 66 66 66 2e 0f 1f
> 84 00
> [2013-12-26 03:53:18] RIP  [<ffffffffa015d637>]
> xennet_alloc_rx_buffers+0x347/0x360 [xen_netfront]
> [2013-12-26 03:53:18]  RSP <ffff8801f2e03ce0>
> ------------
>
> dom0 and domU kernels are vanilla 3.10.25
> host server has 4 cores x 2 threads with mapping: 4 - dom0, 2 - domU, 2
> - domU
> i've tried xen versions: 4.2.3 and 4.3.1
> also i've tried to disable offloaing on domU:  ethtool -K eth0 tx off
> tso off gso off   ----  no effects
>
> domU's are under high TCP load (a lot of small tcp connections (web server))
> sometimes  i've got on dom0:
> ---
> [2013-12-26 00:16:30] (XEN) grant_table.c:289:d0 Increased maptrack size
> to 2 frames
> [2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
> 99221507
> [2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
> 43646979
> [2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
> 43646979
> [2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
> 99221507
> [2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
> 43646979
> [2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
> 99221507
> [2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
> 99221507
> [2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
> 99221507
>
> ---
>
> It seems the root of problem in dom0 messages above. Is it HW failure or
> some internal kernel structures overflow?
From the stack, this looks very likely to be the same issue as one that
has already been fixed. Netback miscounted slots, so responses overlapped
requests in the ring, and the grant copy then got a wrong grant reference
and reported an error in grant_table.c. See
http://lists.xen.org/archives/html/xen-devel/2013-09/msg01143.html
There was some back-and-forth work on this issue, but the fix seems to
have been in since v3.12-rc4. Would you like to try a newer kernel
version?
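
The overlap described above can be illustrated with a short sketch. It assumes the rx request and response layouts from Xen's public netif.h (an 8-byte slot shared by both, little-endian on x86) and the flag bits NETRXF_data_validated = bit 0 and NETRXF_csum_blank = bit 1. The id, offset and flag values are hypothetical, chosen only to show the effect.

```python
import struct

# Assumed layouts of one RX ring slot (request and response share the bytes):
#   request:  uint16 id, uint16 pad,    uint32 gref
#   response: uint16 id, uint16 offset, uint16 flags, int16 status
RX_REQUEST = struct.Struct("<HHI")
RX_RESPONSE = struct.Struct("<HHHh")

NETRXF_DATA_VALIDATED = 1 << 0
NETRXF_CSUM_BLANK = 1 << 1


def gref_seen_in_slot(resp_id, offset, flags, status):
    """Pack a response, then read the same 8 bytes back as a request."""
    raw = RX_RESPONSE.pack(resp_id, offset, flags, status)
    _req_id, _pad, gref = RX_REQUEST.unpack(raw)
    return gref


flags = NETRXF_DATA_VALIDATED | NETRXF_CSUM_BLANK
for length in (1514, 666):
    # A non-negative status is taken here to be the received length.
    print(f"status={length:>4} -> gref read from slot: {gref_seen_in_slot(0, 0, flags, length)}")
```

Under those assumptions, response lengths of 1514 and 666 bytes read back as grant references 99221507 and 43646979, the two values in the dom0 log. That fits the theory that responses were overwriting requests in the ring, though it does not by itself identify which patch fixes it.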

Thanks
Annie


* Re: domU crash with kernel BUG at drivers/net/xen-netfront.c:305
  2014-01-02  5:09 ` annie li
@ 2014-01-02 11:40   ` Pasi Kärkkäinen
  2014-01-02 12:01   ` Wei Liu
  1 sibling, 0 replies; 10+ messages in thread
From: Pasi Kärkkäinen @ 2014-01-02 11:40 UTC
  To: annie li; +Cc: Vasily Evseenko, xen-devel

On Thu, Jan 02, 2014 at 01:09:35PM +0800, annie li wrote:
> 
> On 2013/12/27 19:09, Vasily Evseenko wrote:
> >Hi,
> >
> >I've got domU crash (~ every 1-2 days under high network (tcp) load)
> >with message:
> >
> >-----
> >[2013-12-26 03:53:18] kernel BUG at drivers/net/xen-netfront.c:305!
> >[2013-12-26 03:53:18] invalid opcode: 0000 [#1] SMP
> >[2013-12-26 03:53:18] Modules linked in: ipt_REJECT iptable_filter
> >xt_set xt_REDIRECT iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
> >nf_nat_ipv4 nf_nat
> >ip_tables ip_set_hash_net ip_set_hash_ip ip_set nfnetlink ip6t_REJECT
> >nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter
> >ip6_table
> >s ipv6 ext3 jbd xen_netfront coretemp hwmon crc32_pclmul crc32c_intel
> >ghash_clmulni_intel microcode pcspkr ext4 jbd2 mbcache aesni_intel
> >ablk_helper c
> >ryptd lrw gf128mul glue_helper aes_x86_64 xen_blkfront dm_mirror
> >dm_region_hash dm_log dm_mod
> >[2013-12-26 03:53:18] CPU: 0 PID: 15126 Comm: python Not tainted
> >3.10.25-11.x86_64 #1
> >[2013-12-26 03:53:18] task: ffff8801e5d68ac0 ti: ffff8801e7392000
> >task.ti: ffff8801e7392000
> >[2013-12-26 03:53:18] RIP: e030:[<ffffffffa015d637>]
> >[<ffffffffa015d637>] xennet_alloc_rx_buffers+0x347/0x360 [xen_netfront]
> >[2013-12-26 03:53:18] RSP: e02b:ffff8801f2e03ce0  EFLAGS: 00010282
> >[2013-12-26 03:53:18] RAX: 00000000000001d4 RBX: ffff8801e5438800 RCX:
> >0000000000000001
> >[2013-12-26 03:53:18] RDX: 000000000000002a RSI: 0000000000000000 RDI:
> >0000000000002200
> >[2013-12-26 03:53:18] RBP: ffff8801f2e03d40 R08: 0000000000000000 R09:
> >0000000000001000
> >[2013-12-26 03:53:18] R10: ffff8801000083c0 R11: dead000000200200 R12:
> >0000000000000220
> >[2013-12-26 03:53:18] R13: ffff8801e6eec0c0 R14: 000000000000002a R15:
> >000000000239642a
> >[2013-12-26 03:53:18] FS:  00007f4cf48d57e0(0000)
> >GS:ffff8801f2e00000(0000) knlGS:0000000000000000
> >[2013-12-26 03:53:18] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> >[2013-12-26 03:53:18] CR2: ffffffffff600400 CR3: 00000001e0db3000 CR4:
> >0000000000042660
> >[2013-12-26 03:53:18] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >0000000000000000
> >[2013-12-26 03:53:18] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> >0000000000000400
> >[2013-12-26 03:53:18] Stack:
> >[2013-12-26 03:53:18]  ffff8801f2e03df0 02396417e5438000
> >ffff8801e5439d58 ffff8801e54394f0
> >[2013-12-26 03:53:18]  ffff8801e5438000 002affff00000013
> >ffff8801f2e03d40 ffff8801f2e03db0
> >[2013-12-26 03:53:18]  0000000000000010 ffff8800655e6ac0
> >ffff8801e5438800 ffff8801e511a000
> >[2013-12-26 03:53:18] Call Trace:
> >[2013-12-26 03:53:18]  <IRQ>
> >[2013-12-26 03:53:18]  [<ffffffffa015dc44>] xennet_poll+0x2f4/0x630
> >[xen_netfront]
> >[2013-12-26 03:53:18]  [<ffffffff810640a9>] ? raise_softirq_irqoff+0x9/0x50
> >[2013-12-26 03:53:18]  [<ffffffff8152050c>] ? dev_kfree_skb_irq+0x5c/0x70
> >[2013-12-26 03:53:18]  [<ffffffff810e4fb9>] ?
> >handle_irq_event_percpu+0xc9/0x210
> >[2013-12-26 03:53:18]  [<ffffffff81528022>] net_rx_action+0x112/0x290
> >[2013-12-26 03:53:18]  [<ffffffff810e514d>] ? handle_irq_event+0x4d/0x70
> >[2013-12-26 03:53:18]  [<ffffffff81063c97>] __do_softirq+0xf7/0x270
> >[2013-12-26 03:53:18]  [<ffffffff81600edc>] call_softirq+0x1c/0x30
> >[2013-12-26 03:53:18]  [<ffffffff81014505>] do_softirq+0x65/0xa0
> >[2013-12-26 03:53:18]  [<ffffffff810639c5>] irq_exit+0xc5/0xd0
> >[2013-12-26 03:53:18]  [<ffffffff81351e45>] xen_evtchn_do_upcall+0x35/0x50
> >[2013-12-26 03:53:18]  [<ffffffff81600f3e>]
> >xen_do_hypervisor_callback+0x1e/0x30
> >[2013-12-26 03:53:18]  <EOI>
> >[2013-12-26 03:53:18] Code: 8b 35 ee f9 bb e1 48 8d bb 08 0d 00 00 48 83
> >c6 64 e8 2e f2 f0 e0 8b 83 ec 0c 00 00 31 d2 89 c1 d1 e9 39 d1 76 9e e9
> >5a ff ff ff <0f> 0b eb fe 0f 0b 0f 1f 00 eb fb 66 66 66 66 66 2e 0f 1f
> >84 00
> >[2013-12-26 03:53:18] RIP  [<ffffffffa015d637>]
> >xennet_alloc_rx_buffers+0x347/0x360 [xen_netfront]
> >[2013-12-26 03:53:18]  RSP <ffff8801f2e03ce0>
> >------------
> >
> >dom0 and domU kernels are vanilla 3.10.25
> >host server has 4 cores x 2 threads with mapping: 4 - dom0, 2 - domU, 2
> >- domU
> >i've tried xen versions: 4.2.3 and 4.3.1
> >also i've tried to disable offloaing on domU:  ethtool -K eth0 tx off
> >tso off gso off   ----  no effects
> >
> >domU's are under high TCP load (a lot of small tcp connections (web server))
> >sometimes  i've got on dom0:
> >---
> >[2013-12-26 00:16:30] (XEN) grant_table.c:289:d0 Increased maptrack size
> >to 2 frames
> >[2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
> >99221507
> >[2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
> >43646979
> >[2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
> >43646979
> >[2013-12-26 03:53:18] (XEN) grant_table.c:1858:d0 Bad grant reference
> >99221507
> >[2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
> >43646979
> >[2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
> >99221507
> >[2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
> >99221507
> >[2013-12-26 06:15:14] (XEN) grant_table.c:1858:d0 Bad grant reference
> >99221507
> >
> >---
> >
> >It seems the root of problem in dom0 messages above. Is it HW failure or
> >some internal kernel structures overflow?
> From the stack, it reminds me this issue is very likely same with
> the one which has been fixed. There is something wrong with counting
> slots in netback, and then responses overlapps request in the ring,
> and grantcopy gets wrong grant reference and throws out error in
> grant_table.c. See
> http://lists.xen.org/archives/html/xen-devel/2013-09/msg01143.html
> There were some back and forth work for this issue, but seems the
> fix patch exists since v3.12-rc4. Would you like to have a try with
> newer kernel version?
> 

If that patch fixes the bug, it sounds like it needs to be backported
to at least 3.10.x as well.

-- Pasi


* Re: domU crash with kernel BUG at drivers/net/xen-netfront.c:305
  2014-01-02  5:09 ` annie li
  2014-01-02 11:40   ` Pasi Kärkkäinen
@ 2014-01-02 12:01   ` Wei Liu
  2014-01-03  6:15     ` annie li
  1 sibling, 1 reply; 10+ messages in thread
From: Wei Liu @ 2014-01-02 12:01 UTC
  To: annie li; +Cc: Vasily Evseenko, wei.liu2, xen-devel

On Thu, Jan 02, 2014 at 01:09:35PM +0800, annie li wrote:
[...]
> >It seems the root of problem in dom0 messages above. Is it HW failure or
> >some internal kernel structures overflow?
> From the stack, it reminds me this issue is very likely same with
> the one which has been fixed. There is something wrong with counting
> slots in netback, and then responses overlapps request in the ring,
> and grantcopy gets wrong grant reference and throws out error in
> grant_table.c. See
> http://lists.xen.org/archives/html/xen-devel/2013-09/msg01143.html
> There were some back and forth work for this issue, but seems the
> fix patch exists since v3.12-rc4. Would you like to have a try with
> newer kernel version?
> 

FWIW the patch you mentioned was backported to the kernel he used.

Wei.

> Thanks
> Annie

* Re: domU crash with kernel BUG at drivers/net/xen-netfront.c:305
  2014-01-02 12:01   ` Wei Liu
@ 2014-01-03  6:15     ` annie li
  0 siblings, 0 replies; 10+ messages in thread
From: annie li @ 2014-01-03  6:15 UTC
  To: Wei Liu; +Cc: Vasily Evseenko, xen-devel


On 2014/1/2 20:01, Wei Liu wrote:
> On Thu, Jan 02, 2014 at 01:09:35PM +0800, annie li wrote:
> [...]
>>> It seems the root of problem in dom0 messages above. Is it HW failure or
>>> some internal kernel structures overflow?
>>  From the stack, it reminds me this issue is very likely same with
>> the one which has been fixed. There is something wrong with counting
>> slots in netback, and then responses overlapps request in the ring,
>> and grantcopy gets wrong grant reference and throws out error in
>> grant_table.c. See
>> http://lists.xen.org/archives/html/xen-devel/2013-09/msg01143.html
>> There were some back and forth work for this issue, but seems the
>> fix patch exists since v3.12-rc4. Would you like to have a try with
>> newer kernel version?
>>
> FWIW the patch you mentioned was backported to the kernel he used.

Yes, it exists in the 3.10.25 kernel he used.
Assuming slot counting in netback is what causes this issue,
maybe http://www.spinics.net/lists/netdev/msg260017.html is the right
fix. That patch fixed an issue caused by slot counting, and it went
into the net-next tree:
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git

Thanks
Annie
