* xen-netfront crash when detaching network while some network activity
@ 2015-05-22 11:49 Marek Marczykowski-Górecki
2015-05-22 16:25 ` [Xen-devel] " David Vrabel
2015-05-26 10:56 ` David Vrabel
0 siblings, 2 replies; 15+ messages in thread
From: Marek Marczykowski-Górecki @ 2015-05-22 11:49 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk, Boris Ostrovsky, David Vrabel; +Cc: xen-devel, netdev
[-- Attachment #1: Type: text/plain, Size: 4374 bytes --]
Hi all,
I'm experiencing xen-netfront crash when doing xl network-detach while
some network activity is going on at the same time. It happens only when
domU has more than one vcpu. Not sure if this matters, but the backend
is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
3.9.4 and 4.1-rc1 as well.
Steps to reproduce:
1. Start the domU with some network interface
2. Call there 'ping -f some-IP'
3. Call 'xl network-detach NAME 0'
The crash message:
[ 54.163670] page:ffffea00004bddc0 count:0 mapcount:0 mapping:          (null) index:0x0
[ 54.163692] flags: 0x3fff8000008000(tail)
[ 54.163704] page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
[ 54.163726] ------------[ cut here ]------------
[ 54.163734] kernel BUG at include/linux/mm.h:343!
[ 54.163742] invalid opcode: 0000 [#1] SMP
[ 54.163752] Modules linked in:
[ 54.163762] CPU: 1 PID: 24 Comm: xenwatch Not tainted 4.1.0-rc1-1.pvops.qubes.x86_64 #4
[ 54.163773] task: ffff8800133c4c00 ti: ffff880012c94000 task.ti: ffff880012c94000
[ 54.163782] RIP: e030:[<ffffffff811843cc>]  [<ffffffff811843cc>] __free_pages+0x4c/0x50
[ 54.163800] RSP: e02b:ffff880012c97be8  EFLAGS: 00010292
[ 54.163808] RAX: 0000000000000044 RBX: 000077ff80000000 RCX: 0000000000000044
[ 54.163817] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880013d0ea00
[ 54.163826] RBP: ffff880012c97be8 R08: 00000000000000f2 R09: 0000000000000000
[ 54.163835] R10: 00000000000000f2 R11: ffffffff8185efc0 R12: 0000000000000000
[ 54.163844] R13: ffff880011814200 R14: ffff880012f77000 R15: 0000000000000004
[ 54.163860] FS:  00007f735f0d8740(0000) GS:ffff880013d00000(0000) knlGS:0000000000000000
[ 54.163870] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 54.163878] CR2: 0000000001652c50 CR3: 0000000012112000 CR4: 0000000000002660
[ 54.163892] Stack:
[ 54.163901]  ffff880012c97c08 ffffffff81184430 0000000000000011 0000000000000004
[ 54.163922]  ffff880012c97c38 ffffffff814100c6 ffff87ffffffffff ffff880011f20d88
[ 54.163943]  ffff880011814200 ffff880011f20000 ffff880012c97ca8 ffffffff814d34e6
[ 54.163964] Call Trace:
[ 54.163977]  [<ffffffff81184430>] free_pages+0x60/0x70
[ 54.163994]  [<ffffffff814100c6>] gnttab_end_foreign_access+0x136/0x170
[ 54.164012]  [<ffffffff814d34e6>] xennet_disconnect_backend.isra.24+0x166/0x390
[ 54.164030]  [<ffffffff814d37a8>] xennet_remove+0x38/0xd0
[ 54.164045]  [<ffffffff8141a009>] xenbus_dev_remove+0x59/0xc0
[ 54.164059]  [<ffffffff81479d27>] __device_release_driver+0x87/0x120
[ 54.164528]  [<ffffffff81479de3>] device_release_driver+0x23/0x30
[ 54.164528]  [<ffffffff81479658>] bus_remove_device+0x108/0x180
[ 54.164528]  [<ffffffff81475861>] device_del+0x141/0x270
[ 54.164528]  [<ffffffff814186a0>] ? unregister_xenbus_watch+0x1d0/0x1d0
[ 54.164528]  [<ffffffff814759b2>] device_unregister+0x22/0x80
[ 54.164528]  [<ffffffff81419e5f>] xenbus_dev_changed+0xaf/0x200
[ 54.164528]  [<ffffffff816ad346>] ? _raw_spin_unlock_irqrestore+0x16/0x20
[ 54.164528]  [<ffffffff814186a0>] ? unregister_xenbus_watch+0x1d0/0x1d0
[ 54.164528]  [<ffffffff8141bdb9>] frontend_changed+0x29/0x60
[ 54.164528]  [<ffffffff814186a0>] ? unregister_xenbus_watch+0x1d0/0x1d0
[ 54.164528]  [<ffffffff8141872e>] xenwatch_thread+0x8e/0x150
[ 54.164528]  [<ffffffff810be2b0>] ? wait_woken+0x90/0x90
[ 54.164528]  [<ffffffff81099958>] kthread+0xd8/0xf0
[ 54.164528]  [<ffffffff81099880>] ? kthread_create_on_node+0x1b0/0x1b0
[ 54.164528]  [<ffffffff816adde2>] ret_from_fork+0x42/0x70
[ 54.164528]  [<ffffffff81099880>] ? kthread_create_on_node+0x1b0/0x1b0
[ 54.164528] Code: f6 74 0c e8 67 f5 ff ff 5d c3 0f 1f 44 00 00 31 f6 e8 99 fd ff ff 5d c3 0f 1f 80 00 00 00 00 48 c7 c6 78 29 a1 81 e8 d4 37 02 00 <0f> 0b 66 90 66 66 66 66 90 48 85 ff 75 06 f3 c3 0f 1f 40 00 55
[ 54.164528] RIP  [<ffffffff811843cc>] __free_pages+0x4c/0x50
[ 54.164528]  RSP <ffff880012c97be8>
[ 54.166002] ---[ end trace 6b847bc27fec6d36 ]---
Any ideas how to fix this? I guess xennet_disconnect_backend should take
some lock.
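For illustration, a minimal sketch of what such locking could look like,
assuming the 4.1-era netfront structures (untested; xennet_tx_buf_gc already
runs under queue->tx_lock from the tx interrupt handler, so the release path
could take the same per-queue locks):

static void xennet_disconnect_backend(struct netfront_info *info)
{
        unsigned int i;
        unsigned int num_queues = info->netdev->real_num_tx_queues;

        for (i = 0; i < num_queues && info->queues; ++i) {
                struct netfront_queue *queue = &info->queues[i];
                unsigned long flags;

                /* ... irq teardown and napi_synchronize() as before ... */

                /* Serialize against xennet_tx_buf_gc(), which runs from
                 * the tx interrupt handler under the same lock. */
                spin_lock_irqsave(&queue->tx_lock, flags);
                xennet_release_tx_bufs(queue);
                spin_unlock_irqrestore(&queue->tx_lock, flags);

                /* Serialize against the rx/NAPI path. */
                spin_lock_bh(&queue->rx_lock);
                xennet_release_rx_bufs(queue);
                spin_unlock_bh(&queue->rx_lock);

                /* ... ring and grant teardown as before ... */
        }
}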
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
2015-05-22 11:49 xen-netfront crash when detaching network while some network activity Marek Marczykowski-Górecki
@ 2015-05-22 16:25 ` David Vrabel
2015-05-22 16:42 ` Marek Marczykowski-Górecki
2015-05-26 10:56 ` David Vrabel
1 sibling, 1 reply; 15+ messages in thread
From: David Vrabel @ 2015-05-22 16:25 UTC (permalink / raw)
To: Marek Marczykowski-Górecki, Konrad Rzeszutek Wilk,
Boris Ostrovsky, David Vrabel
Cc: netdev, xen-devel
On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> Hi all,
>
> I'm experiencing xen-netfront crash when doing xl network-detach while
> some network activity is going on at the same time. It happens only when
> domU has more than one vcpu. Not sure if this matters, but the backend
> is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
> 3.9.4 and 4.1-rc1 as well.
>
> Steps to reproduce:
> 1. Start the domU with some network interface
> 2. Call there 'ping -f some-IP'
> 3. Call 'xl network-detach NAME 0'
I tried this about 10 times without a crash. How reproducible is it?
I used a 4.1-rc4 frontend and a 4.0 backend.
David
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
2015-05-22 16:25 ` [Xen-devel] " David Vrabel
@ 2015-05-22 16:42 ` Marek Marczykowski-Górecki
2015-05-22 16:58 ` David Vrabel
0 siblings, 1 reply; 15+ messages in thread
From: Marek Marczykowski-Górecki @ 2015-05-22 16:42 UTC (permalink / raw)
To: David Vrabel; +Cc: Konrad Rzeszutek Wilk, Boris Ostrovsky, netdev, xen-devel
[-- Attachment #1: Type: text/plain, Size: 1282 bytes --]
On Fri, May 22, 2015 at 05:25:44PM +0100, David Vrabel wrote:
> On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> > Hi all,
> >
> > I'm experiencing xen-netfront crash when doing xl network-detach while
> > some network activity is going on at the same time. It happens only when
> > domU has more than one vcpu. Not sure if this matters, but the backend
> > is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
> > 3.9.4 and 4.1-rc1 as well.
> >
> > Steps to reproduce:
> > 1. Start the domU with some network interface
> > 2. Call there 'ping -f some-IP'
> > 3. Call 'xl network-detach NAME 0'
>
> I tried this about 10 times without a crash. How reproducible is it?
>
> I used a 4.1-rc4 frontend and a 4.0 backend.
It happens every time for me... Do you have at least two vcpus in that
domU? With one vcpu it doesn't crash. The IP I've used for ping is one in
the backend domU, but it shouldn't matter.
Backend is 3.19.6 here. I don't see any changes there between rc1 and
rc4, so I stayed with rc1. With a 4.1-rc1 backend it also crashes for me.
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
2015-05-22 16:42 ` Marek Marczykowski-Górecki
@ 2015-05-22 16:58 ` David Vrabel
2015-05-22 17:13 ` Marek Marczykowski-Górecki
0 siblings, 1 reply; 15+ messages in thread
From: David Vrabel @ 2015-05-22 16:58 UTC (permalink / raw)
To: Marek Marczykowski-Górecki
Cc: Konrad Rzeszutek Wilk, Boris Ostrovsky, netdev, xen-devel
On 22/05/15 17:42, Marek Marczykowski-Górecki wrote:
> On Fri, May 22, 2015 at 05:25:44PM +0100, David Vrabel wrote:
>> On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
>>> Hi all,
>>>
>>> I'm experiencing xen-netfront crash when doing xl network-detach while
>>> some network activity is going on at the same time. It happens only when
>>> domU has more than one vcpu. Not sure if this matters, but the backend
>>> is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
>>> 3.9.4 and 4.1-rc1 as well.
>>>
>>> Steps to reproduce:
>>> 1. Start the domU with some network interface
>>> 2. Call there 'ping -f some-IP'
>>> 3. Call 'xl network-detach NAME 0'
>>
>> I tried this about 10 times without a crash. How reproducible is it?
>>
>> I used a 4.1-rc4 frontend and a 4.0 backend.
>
> It happens every time for me... Do you have at least two vcpus in that
> domU? With one vcpu it doesn't crash. The IP I've used for ping is one in
> the backend domU, but it shouldn't matter.
>
> Backend is 3.19.6 here. I don't see any changes there between rc1 and
> rc4, so I stayed with rc1. With a 4.1-rc1 backend it also crashes for me.
Doesn't repro for me with 4 VCPU PV or PVHVM guests. Is your guest
kernel vanilla or does it have some qubes specific patches on top?
David
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
2015-05-22 16:58 ` David Vrabel
@ 2015-05-22 17:13 ` Marek Marczykowski-Górecki
0 siblings, 0 replies; 15+ messages in thread
From: Marek Marczykowski-Górecki @ 2015-05-22 17:13 UTC (permalink / raw)
To: David Vrabel; +Cc: Konrad Rzeszutek Wilk, Boris Ostrovsky, netdev, xen-devel
[-- Attachment #1: Type: text/plain, Size: 2962 bytes --]
On Fri, May 22, 2015 at 05:58:41PM +0100, David Vrabel wrote:
> On 22/05/15 17:42, Marek Marczykowski-Górecki wrote:
> > On Fri, May 22, 2015 at 05:25:44PM +0100, David Vrabel wrote:
> >> On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> >>> Hi all,
> >>>
> >>> I'm experiencing xen-netfront crash when doing xl network-detach while
> >>> some network activity is going on at the same time. It happens only when
> >>> domU has more than one vcpu. Not sure if this matters, but the backend
> >>> is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
> >>> 3.9.4 and 4.1-rc1 as well.
> >>>
> >>> Steps to reproduce:
> >>> 1. Start the domU with some network interface
> >>> 2. Call there 'ping -f some-IP'
> >>> 3. Call 'xl network-detach NAME 0'
> >>
> >> I tried this about 10 times without a crash. How reproducible is it?
> >>
> >> I used a 4.1-rc4 frontend and a 4.0 backend.
> >
> > It happens every time for me... Do you have at least two vcpus in that
> > domU? With one vcpu it doesn't crash. The IP I've used for ping is one in
> > the backend domU, but it shouldn't matter.
> >
> > Backend is 3.19.6 here. I don't see any changes there between rc1 and
> > rc4, so I stayed with rc1. With a 4.1-rc1 backend it also crashes for me.
>
> Doesn't repro for me with 4 VCPU PV or PVHVM guests.
I've tried with exactly 2 vcpus in frontend domU (PV), but I guess it
shouldn't matter. Backend is also PV.
> Is your guest
> kernel vanilla or does it have some qubes specific patches on top?
This one was from vanilla - both frontend and backend (just qubes
config).
Maybe something about device configuration? Here is xenstore dump:
frontend:
0 = ""
 backend = "/local/domain/66/backend/vif/69/0"
 backend-id = "66"
 state = "4"
 handle = "0"
 mac = "00:16:3e:5e:6c:07"
 multi-queue-num-queues = "2"
 queue-0 = ""
  tx-ring-ref = "1280"
  rx-ring-ref = "1281"
  event-channel-tx = "19"
  event-channel-rx = "20"
 queue-1 = ""
  tx-ring-ref = "1282"
  rx-ring-ref = "1283"
  event-channel-tx = "21"
  event-channel-rx = "22"
 request-rx-copy = "1"
 feature-rx-notify = "1"
 feature-sg = "1"
 feature-gso-tcpv4 = "1"
 feature-gso-tcpv6 = "1"
 feature-ipv6-csum-offload = "1"
backend:
69 = ""
 0 = ""
  frontend = "/local/domain/69/device/vif/0"
  frontend-id = "69"
  online = "1"
  state = "4"
  script = "/etc/xen/scripts/vif-route-qubes"
  mac = "00:16:3e:5e:6c:07"
  ip = "10.137.3.9"
  handle = "0"
  type = "vif"
  feature-sg = "1"
  feature-gso-tcpv4 = "1"
  feature-gso-tcpv6 = "1"
  feature-ipv6-csum-offload = "1"
  feature-rx-copy = "1"
  feature-rx-flip = "0"
  feature-split-event-channels = "1"
  multi-queue-max-queues = "2"
  hotplug-status = "connected"
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: xen-netfront crash when detaching network while some network activity
2015-05-22 11:49 xen-netfront crash when detaching network while some network activity Marek Marczykowski-Górecki
2015-05-22 16:25 ` [Xen-devel] " David Vrabel
@ 2015-05-26 10:56 ` David Vrabel
2015-05-26 22:03 ` Marek Marczykowski-Górecki
1 sibling, 1 reply; 15+ messages in thread
From: David Vrabel @ 2015-05-26 10:56 UTC (permalink / raw)
To: Marek Marczykowski-Górecki, Konrad Rzeszutek Wilk,
Boris Ostrovsky
Cc: netdev, xen-devel
On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> Hi all,
>
> I'm experiencing xen-netfront crash when doing xl network-detach while
> some network activity is going on at the same time. It happens only when
> domU has more than one vcpu. Not sure if this matters, but the backend
> is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
> 3.9.4 and 4.1-rc1 as well.
>
> Steps to reproduce:
> 1. Start the domU with some network interface
> 2. Call there 'ping -f some-IP'
> 3. Call 'xl network-detach NAME 0'
There's a use-after-free in xennet_remove(). Does this patch fix it?
8<--------------------------------
xen-netfront: properly destroy queues when removing device
xennet_remove() freed the queues before freeing the netdevice, which
results in a use-after-free when free_netdev() tries to delete the
napi instances that have already been freed.
Fix this by fully destroying the queues (which includes deleting the napi
instances) before freeing the netdevice.
Reported-by: Marek Marczykowski <marmarek@invisiblethingslab.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
drivers/net/xen-netfront.c | 15 ++-------------
1 file changed, 2 insertions(+), 13 deletions(-)
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 3f45afd..e031c94 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1698,6 +1698,7 @@ static void xennet_destroy_queues(struct netfront_info *info)
 
 		if (netif_running(info->netdev))
 			napi_disable(&queue->napi);
+		del_timer_sync(&queue->rx_refill_timer);
 		netif_napi_del(&queue->napi);
 	}
 
@@ -2102,9 +2103,6 @@ static const struct attribute_group xennet_dev_group = {
 static int xennet_remove(struct xenbus_device *dev)
 {
 	struct netfront_info *info = dev_get_drvdata(&dev->dev);
-	unsigned int num_queues = info->netdev->real_num_tx_queues;
-	struct netfront_queue *queue = NULL;
-	unsigned int i = 0;
 
 	dev_dbg(&dev->dev, "%s\n", dev->nodename);
 
@@ -2112,16 +2110,7 @@ static int xennet_remove(struct xenbus_device *dev)
 
 	unregister_netdev(info->netdev);
 
-	for (i = 0; i < num_queues; ++i) {
-		queue = &info->queues[i];
-		del_timer_sync(&queue->rx_refill_timer);
-	}
-
-	if (num_queues) {
-		kfree(info->queues);
-		info->queues = NULL;
-	}
-
+	xennet_destroy_queues(info);
 	xennet_free_netdev(info->netdev);
 
 	return 0;
--
1.7.10.4
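For context, the xennet_destroy_queues() this patch now calls from
xennet_remove() looks roughly like this in the 4.1-era driver, with the hunk
above applied (a sketch; the rtnl locking detail may differ between versions):

static void xennet_destroy_queues(struct netfront_info *info)
{
        unsigned int i;

        rtnl_lock();

        for (i = 0; i < info->netdev->real_num_tx_queues; i++) {
                struct netfront_queue *queue = &info->queues[i];

                if (netif_running(info->netdev))
                        napi_disable(&queue->napi);
                del_timer_sync(&queue->rx_refill_timer);  /* added above */
                netif_napi_del(&queue->napi);
        }

        rtnl_unlock();

        kfree(info->queues);
        info->queues = NULL;
}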
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: xen-netfront crash when detaching network while some network activity
2015-05-26 10:56 ` David Vrabel
@ 2015-05-26 22:03 ` Marek Marczykowski-Górecki
2015-10-21 18:57 ` Marek Marczykowski-Górecki
0 siblings, 1 reply; 15+ messages in thread
From: Marek Marczykowski-Górecki @ 2015-05-26 22:03 UTC (permalink / raw)
To: David Vrabel; +Cc: Konrad Rzeszutek Wilk, Boris Ostrovsky, xen-devel, netdev
[-- Attachment #1: Type: text/plain, Size: 3296 bytes --]
On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
> On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> > Hi all,
> >
> > I'm experiencing xen-netfront crash when doing xl network-detach while
> > some network activity is going on at the same time. It happens only when
> > domU has more than one vcpu. Not sure if this matters, but the backend
> > is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
> > 3.9.4 and 4.1-rc1 as well.
> >
> > Steps to reproduce:
> > 1. Start the domU with some network interface
> > 2. Call there 'ping -f some-IP'
> > 3. Call 'xl network-detach NAME 0'
>
> There's a use-after-free in xennet_remove(). Does this patch fix it?
Unfortunately not. Note that the crash is in xennet_disconnect_backend,
which is called before xennet_destroy_queues in xennet_remove.
I've tried adding napi_disable and even netif_napi_del just after
napi_synchronize in xennet_disconnect_backend (which would probably
cause a crash when the same cleanup is attempted again later), but it doesn't
help - the crash is the same (still in gnttab_end_foreign_access called
from xennet_disconnect_backend).
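For reference, the experiment described above amounts to roughly this inside
the per-queue loop of the 4.1-era xennet_disconnect_backend() (a sketch of
what was tested, not a fix):

                /* ... irq teardown as in the driver ... */
                queue->tx_evtchn = queue->rx_evtchn = 0;
                queue->tx_irq = queue->rx_irq = 0;

                if (netif_running(info->netdev))
                        napi_synchronize(&queue->napi);
                napi_disable(&queue->napi);    /* added for the experiment */
                netif_napi_del(&queue->napi);  /* added for the experiment */

                xennet_release_tx_bufs(queue); /* crash still happens in here,
                                                * in gnttab_end_foreign_access */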
> 8<--------------------------------
> xen-netfront: properly destroy queues when removing device
>
> xennet_remove() freed the queues before freeing the netdevice, which
> results in a use-after-free when free_netdev() tries to delete the
> napi instances that have already been freed.
>
> Fix this by fully destroying the queues (which includes deleting the napi
> instances) before freeing the netdevice.
>
> Reported-by: Marek Marczykowski <marmarek@invisiblethingslab.com>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
> drivers/net/xen-netfront.c | 15 ++-------------
> 1 file changed, 2 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index 3f45afd..e031c94 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -1698,6 +1698,7 @@ static void xennet_destroy_queues(struct netfront_info *info)
>  
>  		if (netif_running(info->netdev))
>  			napi_disable(&queue->napi);
> +		del_timer_sync(&queue->rx_refill_timer);
>  		netif_napi_del(&queue->napi);
>  	}
>  
> @@ -2102,9 +2103,6 @@ static const struct attribute_group xennet_dev_group = {
>  static int xennet_remove(struct xenbus_device *dev)
>  {
>  	struct netfront_info *info = dev_get_drvdata(&dev->dev);
> -	unsigned int num_queues = info->netdev->real_num_tx_queues;
> -	struct netfront_queue *queue = NULL;
> -	unsigned int i = 0;
>  
>  	dev_dbg(&dev->dev, "%s\n", dev->nodename);
>  
> @@ -2112,16 +2110,7 @@ static int xennet_remove(struct xenbus_device *dev)
>  
>  	unregister_netdev(info->netdev);
>  
> -	for (i = 0; i < num_queues; ++i) {
> -		queue = &info->queues[i];
> -		del_timer_sync(&queue->rx_refill_timer);
> -	}
> -
> -	if (num_queues) {
> -		kfree(info->queues);
> -		info->queues = NULL;
> -	}
> -
> +	xennet_destroy_queues(info);
>  	xennet_free_netdev(info->netdev);
>  
>  	return 0;
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: xen-netfront crash when detaching network while some network activity
2015-05-26 22:03 ` Marek Marczykowski-Górecki
@ 2015-10-21 18:57 ` Marek Marczykowski-Górecki
2015-11-17 2:45 ` Marek Marczykowski-Górecki
2015-11-17 11:59 ` [Xen-devel] " David Vrabel
0 siblings, 2 replies; 15+ messages in thread
From: Marek Marczykowski-Górecki @ 2015-10-21 18:57 UTC (permalink / raw)
To: David Vrabel
Cc: Konrad Rzeszutek Wilk, Boris Ostrovsky, Annie Li, xen-devel,
netdev
[-- Attachment #1: Type: text/plain, Size: 5548 bytes --]
On Wed, May 27, 2015 at 12:03:12AM +0200, Marek Marczykowski-Górecki wrote:
> On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
> > On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> > > Hi all,
> > >
> > > I'm experiencing xen-netfront crash when doing xl network-detach while
> > > some network activity is going on at the same time. It happens only when
> > > domU has more than one vcpu. Not sure if this matters, but the backend
> > > is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
> > > 3.9.4 and 4.1-rc1 as well.
> > >
> > > Steps to reproduce:
> > > 1. Start the domU with some network interface
> > > 2. Call there 'ping -f some-IP'
> > > 3. Call 'xl network-detach NAME 0'
> >
> > There's a use-after-free in xennet_remove(). Does this patch fix it?
>
> Unfortunately not. Note that the crash is in xennet_disconnect_backend,
> which is called before xennet_destroy_queues in xennet_remove.
> I've tried adding napi_disable and even netif_napi_del just after
> napi_synchronize in xennet_disconnect_backend (which would probably
> cause a crash when the same cleanup is attempted again later), but it doesn't
> help - the crash is the same (still in gnttab_end_foreign_access called
> from xennet_disconnect_backend).
Finally I've found some more time to debug this... All tests were redone on
a v4.3-rc6 frontend and a 3.18.17 backend.
Looking at xennet_tx_buf_gc(), I have the impression that the shared page
(queue->grant_tx_page[id]) is/should be freed by some means other than
(indirectly) calling free_page via gnttab_end_foreign_access. Maybe the bug
is that the page _is_ actually freed somewhere else already? At least changing
gnttab_end_foreign_access to gnttab_end_foreign_access_ref makes the crash
go away.
Relevant xennet_tx_buf_gc fragment:
			gnttab_end_foreign_access_ref(
				queue->grant_tx_ref[id], GNTMAP_readonly);
			gnttab_release_grant_reference(
				&queue->gref_tx_head, queue->grant_tx_ref[id]);
			queue->grant_tx_ref[id] = GRANT_INVALID_REF;
			queue->grant_tx_page[id] = NULL;
			add_id_to_freelist(&queue->tx_skb_freelist, queue->tx_skbs, id);
			dev_kfree_skb_irq(skb);
And similar fragment from xennet_release_tx_bufs:
		get_page(queue->grant_tx_page[i]);
		gnttab_end_foreign_access(queue->grant_tx_ref[i],
					  GNTMAP_readonly,
					  (unsigned long)page_address(queue->grant_tx_page[i]));
		queue->grant_tx_page[i] = NULL;
		queue->grant_tx_ref[i] = GRANT_INVALID_REF;
		add_id_to_freelist(&queue->tx_skb_freelist, queue->tx_skbs, i);
		dev_kfree_skb_irq(skb);
Note that both have dev_kfree_skb_irq, but the former uses
gnttab_end_foreign_access_ref, while the latter uses gnttab_end_foreign_access.
Also note that the crash is in gnttab_end_foreign_access, so before
dev_kfree_skb_irq. If it were a double free, I'd expect the crash in the latter.
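For context, gnttab_end_foreign_access() is the variant that also frees the
page, which is consistent with the crash firing there rather than at
dev_kfree_skb_irq. Roughly, from drivers/xen/grant-table.c of this era (a
paraphrased sketch; details vary between versions):

void gnttab_end_foreign_access(grant_ref_t ref, int readonly,
			       unsigned long page)
{
	if (gnttab_end_foreign_access_ref(ref, readonly)) {
		put_free_entry(ref);
		if (page != 0)
			free_page(page);	/* the free_page() where the
						 * VM_BUG_ON fires */
	} else
		/* grant still in use by the backend: free it later */
		gnttab_add_deferred(ref, readonly,
				    page ? virt_to_page(page) : NULL);
}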
This change was introduced by cefe007 "xen-netfront: fix resource leak in
netfront". I'm not sure whether changing gnttab_end_foreign_access back to
gnttab_end_foreign_access_ref would (re)introduce some memory leak.
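Concretely, the change that makes the crash go away amounts to something like
this in xennet_release_tx_bufs (a sketch mirroring the xennet_tx_buf_gc path
above; the actual test may have been an even more minimal substitution, and
whether the grant entry or the page then leaks is exactly the open question):

		/* End foreign access without freeing the page here; the page
		 * is then dropped together with the skb by dev_kfree_skb_irq,
		 * as in xennet_tx_buf_gc. */
		gnttab_end_foreign_access_ref(queue->grant_tx_ref[i],
					      GNTMAP_readonly);
		gnttab_release_grant_reference(&queue->gref_tx_head,
					       queue->grant_tx_ref[i]);
		queue->grant_tx_page[i] = NULL;
		queue->grant_tx_ref[i] = GRANT_INVALID_REF;
		add_id_to_freelist(&queue->tx_skb_freelist, queue->tx_skbs, i);
		dev_kfree_skb_irq(skb);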
Let me paste again the error message:
[ 73.718636] page:ffffea000043b1c0 count:0 mapcount:0 mapping: (null) index:0x0
[ 73.718661] flags: 0x3ffc0000008000(tail)
[ 73.718684] page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
[ 73.718725] ------------[ cut here ]------------
[ 73.718743] kernel BUG at include/linux/mm.h:338!
Also, it all looks quite strange - there is a get_page() call just before
gnttab_end_foreign_access, but page->_count is still 0. Maybe it has something
to do with how get_page() works on "tail" pages (whatever that means)?
static inline void get_page(struct page *page)
{
	if (unlikely(PageTail(page)))
		if (likely(__get_page_tail(page)))
			return;
	/*
	 * Getting a normal page or the head of a compound page
	 * requires to already have an elevated page->_count.
	 */
	VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
	atomic_inc(&page->_count);
}
which (I think) ends up in:
static inline void __get_page_tail_foll(struct page *page,
					bool get_page_head)
{
	/*
	 * If we're getting a tail page, the elevated page->_count is
	 * required only in the head page and we will elevate the head
	 * page->_count and tail page->_mapcount.
	 *
	 * We elevate page_tail->_mapcount for tail pages to force
	 * page_tail->_count to be zero at all times to avoid getting
	 * false positives from get_page_unless_zero() with
	 * speculative page access (like in
	 * page_cache_get_speculative()) on tail pages.
	 */
	VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
	if (get_page_head)
		atomic_inc(&page->first_page->_count);
	get_huge_page_tail(page);
}
So the use counter is incremented in page->first_page->_count, not
page->_count. And according to the comment, it should also elevate
page->_mapcount, but the error message says it does not.
Any ideas?
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: xen-netfront crash when detaching network while some network activity
2015-10-21 18:57 ` Marek Marczykowski-Górecki
@ 2015-11-17 2:45 ` Marek Marczykowski-Górecki
2015-12-01 22:00 ` Konrad Rzeszutek Wilk
2015-11-17 11:59 ` [Xen-devel] " David Vrabel
1 sibling, 1 reply; 15+ messages in thread
From: Marek Marczykowski-Górecki @ 2015-11-17 2:45 UTC (permalink / raw)
To: David Vrabel
Cc: Konrad Rzeszutek Wilk, Boris Ostrovsky, Annie Li, xen-devel,
netdev
[-- Attachment #1: Type: text/plain, Size: 5867 bytes --]
On Wed, Oct 21, 2015 at 08:57:34PM +0200, Marek Marczykowski-Górecki wrote:
> On Wed, May 27, 2015 at 12:03:12AM +0200, Marek Marczykowski-Górecki wrote:
> > On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
> > > On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> > > > Hi all,
> > > >
> > > > I'm experiencing xen-netfront crash when doing xl network-detach while
> > > > some network activity is going on at the same time. It happens only when
> > > > domU has more than one vcpu. Not sure if this matters, but the backend
> > > > is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
> > > > 3.9.4 and 4.1-rc1 as well.
> > > >
> > > > Steps to reproduce:
> > > > 1. Start the domU with some network interface
> > > > 2. Call there 'ping -f some-IP'
> > > > 3. Call 'xl network-detach NAME 0'
> > >
> > > There's a use-after-free in xennet_remove(). Does this patch fix it?
> >
> > Unfortunately not. Note that the crash is in xennet_disconnect_backend,
> > which is called before xennet_destroy_queues in xennet_remove.
> > I've tried adding napi_disable and even netif_napi_del just after
> > napi_synchronize in xennet_disconnect_backend (which would probably
> > cause a crash when the same cleanup is attempted again later), but it doesn't
> > help - the crash is the same (still in gnttab_end_foreign_access called
> > from xennet_disconnect_backend).
>
> Finally I've found some more time to debug this... All tests were redone on
> a v4.3-rc6 frontend and a 3.18.17 backend.
>
> Looking at xennet_tx_buf_gc(), I have the impression that the shared page
> (queue->grant_tx_page[id]) is/should be freed by some means other than
> (indirectly) calling free_page via gnttab_end_foreign_access. Maybe the bug
> is that the page _is_ actually freed somewhere else already? At least changing
> gnttab_end_foreign_access to gnttab_end_foreign_access_ref makes the crash
> go away.
>
> Relevant xennet_tx_buf_gc fragment:
> 			gnttab_end_foreign_access_ref(
> 				queue->grant_tx_ref[id], GNTMAP_readonly);
> 			gnttab_release_grant_reference(
> 				&queue->gref_tx_head, queue->grant_tx_ref[id]);
> 			queue->grant_tx_ref[id] = GRANT_INVALID_REF;
> 			queue->grant_tx_page[id] = NULL;
> 			add_id_to_freelist(&queue->tx_skb_freelist, queue->tx_skbs, id);
> 			dev_kfree_skb_irq(skb);
>
> And similar fragment from xennet_release_tx_bufs:
> 		get_page(queue->grant_tx_page[i]);
> 		gnttab_end_foreign_access(queue->grant_tx_ref[i],
> 					  GNTMAP_readonly,
> 					  (unsigned long)page_address(queue->grant_tx_page[i]));
> 		queue->grant_tx_page[i] = NULL;
> 		queue->grant_tx_ref[i] = GRANT_INVALID_REF;
> 		add_id_to_freelist(&queue->tx_skb_freelist, queue->tx_skbs, i);
> 		dev_kfree_skb_irq(skb);
>
> Note that both have dev_kfree_skb_irq, but the former uses
> gnttab_end_foreign_access_ref, while the latter uses gnttab_end_foreign_access.
> Also note that the crash is in gnttab_end_foreign_access, so before
> dev_kfree_skb_irq. If it were a double free, I'd expect the crash in the latter.
>
> This change was introduced by cefe007 "xen-netfront: fix resource leak in
> netfront". I'm not sure whether changing gnttab_end_foreign_access back to
> gnttab_end_foreign_access_ref would (re)introduce some memory leak.
>
> Let me paste again the error message:
> [ 73.718636] page:ffffea000043b1c0 count:0 mapcount:0 mapping: (null) index:0x0
> [ 73.718661] flags: 0x3ffc0000008000(tail)
> [ 73.718684] page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
> [ 73.718725] ------------[ cut here ]------------
> [ 73.718743] kernel BUG at include/linux/mm.h:338!
>
> Also, it all looks quite strange - there is a get_page() call just before
> gnttab_end_foreign_access, but page->_count is still 0. Maybe it has something
> to do with how get_page() works on "tail" pages (whatever that means)?
>
> static inline void get_page(struct page *page)
> {
> 	if (unlikely(PageTail(page)))
> 		if (likely(__get_page_tail(page)))
> 			return;
> 	/*
> 	 * Getting a normal page or the head of a compound page
> 	 * requires to already have an elevated page->_count.
> 	 */
> 	VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
> 	atomic_inc(&page->_count);
> }
>
> which (I think) ends up in:
>
> static inline void __get_page_tail_foll(struct page *page,
> 					bool get_page_head)
> {
> 	/*
> 	 * If we're getting a tail page, the elevated page->_count is
> 	 * required only in the head page and we will elevate the head
> 	 * page->_count and tail page->_mapcount.
> 	 *
> 	 * We elevate page_tail->_mapcount for tail pages to force
> 	 * page_tail->_count to be zero at all times to avoid getting
> 	 * false positives from get_page_unless_zero() with
> 	 * speculative page access (like in
> 	 * page_cache_get_speculative()) on tail pages.
> 	 */
> 	VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
> 	if (get_page_head)
> 		atomic_inc(&page->first_page->_count);
> 	get_huge_page_tail(page);
> }
>
> So the use counter is incremented in page->first_page->_count, not
> page->_count. And according to the comment, it should also elevate
> page->_mapcount, but the error message says it does not.
>
> Any ideas?
Ping?
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
2015-10-21 18:57 ` Marek Marczykowski-Górecki
2015-11-17 2:45 ` Marek Marczykowski-Górecki
@ 2015-11-17 11:59 ` David Vrabel
1 sibling, 0 replies; 15+ messages in thread
From: David Vrabel @ 2015-11-17 11:59 UTC (permalink / raw)
To: Marek Marczykowski-Górecki, David Vrabel
Cc: netdev, Boris Ostrovsky, xen-devel, Annie Li
On 21/10/15 19:57, Marek Marczykowski-Górecki wrote:
>
> Any ideas?
No, sorry. Netfront looks correct to me.
We take an additional ref for the ref released by
gnttab_release_grant_reference(). The get_page() here is safe since we
haven't freed the page yet (this is done in the subsequent call to
dev_kfree_skb_irq()).
get_page()/put_page() also look fine when used with tail pages.
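In other words, the intended reference flow (assuming the
xennet_release_tx_bufs fragment quoted earlier in the thread) is:

	/*
	 * get_page(page);                            page->_count: n -> n+1
	 * gnttab_end_foreign_access(ref, ..., page);
	 *   -> free_page(page)                       page->_count: n+1 -> n
	 * dev_kfree_skb_irq(skb);
	 *   -> put_page() on the skb frag            page->_count: n -> n-1
	 */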
David
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: xen-netfront crash when detaching network while some network activity
2015-11-17 2:45 ` Marek Marczykowski-Górecki
@ 2015-12-01 22:00 ` Konrad Rzeszutek Wilk
2015-12-01 22:32 ` Marek Marczykowski-Górecki
0 siblings, 1 reply; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-12-01 22:00 UTC (permalink / raw)
To: Marek Marczykowski-Górecki
Cc: Annie Li, Boris Ostrovsky, netdev, David Vrabel, xen-devel
On Tue, Nov 17, 2015 at 03:45:15AM +0100, Marek Marczykowski-Górecki wrote:
> On Wed, Oct 21, 2015 at 08:57:34PM +0200, Marek Marczykowski-Górecki wrote:
> > On Wed, May 27, 2015 at 12:03:12AM +0200, Marek Marczykowski-Górecki wrote:
> > > On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
> > > > On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> > > > > Hi all,
> > > > >
> > > > > I'm experiencing xen-netfront crash when doing xl network-detach while
> > > > > some network activity is going on at the same time. It happens only when
> > > > > domU has more than one vcpu. Not sure if this matters, but the backend
> > > > > is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
> > > > > 3.9.4 and 4.1-rc1 as well.
> > > > >
> > > > > Steps to reproduce:
> > > > > 1. Start the domU with some network interface
> > > > > 2. Call there 'ping -f some-IP'
> > > > > 3. Call 'xl network-detach NAME 0'
Do you see this all the time or just on occasion?
I tried to reproduce it and couldn't see it. Is your VM PV or HVM?
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: xen-netfront crash when detaching network while some network activity
2015-12-01 22:00 ` Konrad Rzeszutek Wilk
@ 2015-12-01 22:32 ` Marek Marczykowski-Górecki
2016-01-20 21:59 ` Konrad Rzeszutek Wilk
0 siblings, 1 reply; 15+ messages in thread
From: Marek Marczykowski-Górecki @ 2015-12-01 22:32 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: David Vrabel, Boris Ostrovsky, Annie Li, xen-devel, netdev
[-- Attachment #1: Type: text/plain, Size: 1866 bytes --]
On Tue, Dec 01, 2015 at 05:00:42PM -0500, Konrad Rzeszutek Wilk wrote:
> On Tue, Nov 17, 2015 at 03:45:15AM +0100, Marek Marczykowski-Górecki wrote:
> > On Wed, Oct 21, 2015 at 08:57:34PM +0200, Marek Marczykowski-Górecki wrote:
> > > On Wed, May 27, 2015 at 12:03:12AM +0200, Marek Marczykowski-Górecki wrote:
> > > > On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
> > > > > On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > I'm experiencing xen-netfront crash when doing xl network-detach while
> > > > > > some network activity is going on at the same time. It happens only when
> > > > > > domU has more than one vcpu. Not sure if this matters, but the backend
> > > > > > is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
> > > > > > 3.9.4 and 4.1-rc1 as well.
> > > > > >
> > > > > > Steps to reproduce:
> > > > > > 1. Start the domU with some network interface
> > > > > > 2. Call there 'ping -f some-IP'
> > > > > > 3. Call 'xl network-detach NAME 0'
>
> Do you see this all the time or just on occasion?
Using the above procedure - all the time.
> I tried to reproduce it and couldn't see it. Is your VM PV or HVM?
PV, started by libvirt. This may have something to do with it: the problem
didn't exist on older Xen (4.1) with domains started by xl. I'm not sure
about the kernel version there, but I think I've tried 3.18 there too,
which has this problem.
But I don't see anything special in the domU config file (neither backend
nor frontend) - it may be some libvirt default, if that's really the
cause. How can I get any useful information about that?
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: xen-netfront crash when detaching network while some network activity
2015-12-01 22:32 ` Marek Marczykowski-Górecki
@ 2016-01-20 21:59 ` Konrad Rzeszutek Wilk
2016-01-21 12:30 ` Joao Martins
0 siblings, 1 reply; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-20 21:59 UTC (permalink / raw)
To: Marek Marczykowski-Górecki, Joao Martins
Cc: Annie Li, Boris Ostrovsky, netdev, David Vrabel, xen-devel
On Tue, Dec 01, 2015 at 11:32:58PM +0100, Marek Marczykowski-Górecki wrote:
> On Tue, Dec 01, 2015 at 05:00:42PM -0500, Konrad Rzeszutek Wilk wrote:
> > On Tue, Nov 17, 2015 at 03:45:15AM +0100, Marek Marczykowski-Górecki wrote:
> > > On Wed, Oct 21, 2015 at 08:57:34PM +0200, Marek Marczykowski-Górecki wrote:
> > > > On Wed, May 27, 2015 at 12:03:12AM +0200, Marek Marczykowski-Górecki wrote:
> > > > > On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
> > > > > > On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I'm experiencing xen-netfront crash when doing xl network-detach while
> > > > > > > some network activity is going on at the same time. It happens only when
> > > > > > > domU has more than one vcpu. Not sure if this matters, but the backend
> > > > > > > is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
> > > > > > > 3.9.4 and 4.1-rc1 as well.
> > > > > > >
> > > > > > > Steps to reproduce:
> > > > > > > 1. Start the domU with some network interface
> > > > > > > 2. Call there 'ping -f some-IP'
> > > > > > > 3. Call 'xl network-detach NAME 0'
> >
> > Do you see this all the time or just on occasion?
>
> Using the above procedure - all the time.
>
> > I tried to reproduce it and couldn't see it. Is your VM PV or HVM?
>
> PV, started by libvirt. This may have something to do with it: the problem
> didn't exist on older Xen (4.1) with domains started by xl. I'm not sure
> about the kernel version there, but I think I've tried 3.18 there too,
> which has this problem.
>
> But I don't see anything special in the domU config file (neither backend
> nor frontend) - it may be some libvirt default, if that's really the
> cause. How can I get any useful information about that?
libvirt naturally does some libxl calls, and they may be different.
Any chance you could give me an idea of:
- What commands you use in libvirt?
- Do you use a bond or bridge?
- What version of libvirt you are using?
Thanks!
CC-ing Joao just in case he has seen this.
>
>
> --
> Best Regards,
> Marek Marczykowski-Górecki
> Invisible Things Lab
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: xen-netfront crash when detaching network while some network activity
2016-01-20 21:59 ` Konrad Rzeszutek Wilk
@ 2016-01-21 12:30 ` Joao Martins
2016-01-22 19:23 ` Marek Marczykowski-Górecki
0 siblings, 1 reply; 15+ messages in thread
From: Joao Martins @ 2016-01-21 12:30 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk, Marek Marczykowski-Górecki
Cc: David Vrabel, Boris Ostrovsky, Annie Li, xen-devel, netdev
On 01/20/2016 09:59 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Dec 01, 2015 at 11:32:58PM +0100, Marek Marczykowski-Górecki wrote:
>> On Tue, Dec 01, 2015 at 05:00:42PM -0500, Konrad Rzeszutek Wilk wrote:
>>> On Tue, Nov 17, 2015 at 03:45:15AM +0100, Marek Marczykowski-Górecki wrote:
>>>> On Wed, Oct 21, 2015 at 08:57:34PM +0200, Marek Marczykowski-Górecki wrote:
>>>>> On Wed, May 27, 2015 at 12:03:12AM +0200, Marek Marczykowski-Górecki wrote:
>>>>>> On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
>>>>>>> On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I'm experiencing xen-netfront crash when doing xl network-detach while
>>>>>>>> some network activity is going on at the same time. It happens only when
>>>>>>>> domU has more than one vcpu. Not sure if this matters, but the backend
>>>>>>>> is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
>>>>>>>> 3.9.4 and 4.1-rc1 as well.
>>>>>>>>
>>>>>>>> Steps to reproduce:
>>>>>>>> 1. Start the domU with some network interface
>>>>>>>> 2. Call there 'ping -f some-IP'
>>>>>>>> 3. Call 'xl network-detach NAME 0'
>>>
>>> Do you see this all the time or just on occasion?
>>
>> Using the above procedure - all the time.
>>
>>> I tried to reproduce it and couldn't see it. Is your VM PV or HVM?
>>
>> PV, started by libvirt. This may have something to do with it: the problem
>> didn't exist on older Xen (4.1) with domains started by xl. I'm not sure
>> about the kernel version there, but I think I've tried 3.18 there too,
>> which has this problem.
>>
>> But I don't see anything special in the domU config file (neither backend
>> nor frontend) - it may be some libvirt default, if that's really the
>> cause. How can I get any useful information about that?
>
> libvirt naturally does some libxl calls, and they may be different.
>
> Any chance you could give me an idea of:
> - What commands you use in libvirt?
> - Do you use a bond or bridge?
> - What version of libvirt you are using?
>
> Thanks!
> CC-ing Joao just in case he has seen this.
>>
Hm, so far I couldn't reproduce the issue with upstream Xen/linux/libvirt, using
either libvirt or plain xl (both on a bridge setup), and irrespective of
both the load and the direction of traffic (be it a ping flood, pktgen with
min-sized packets, or iperf).
>>
>> --
>> Best Regards,
>> Marek Marczykowski-Górecki
>> Invisible Things Lab
>> A: Because it messes up the order in which people normally read text.
>> Q: Why is top-posting such a bad thing?
>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: xen-netfront crash when detaching network while some network activity
2016-01-21 12:30 ` Joao Martins
@ 2016-01-22 19:23 ` Marek Marczykowski-Górecki
0 siblings, 0 replies; 15+ messages in thread
From: Marek Marczykowski-Górecki @ 2016-01-22 19:23 UTC (permalink / raw)
To: Joao Martins
Cc: Konrad Rzeszutek Wilk, David Vrabel, Boris Ostrovsky, Annie Li,
xen-devel, netdev
[-- Attachment #1.1: Type: text/plain, Size: 3602 bytes --]
On Thu, Jan 21, 2016 at 12:30:48PM +0000, Joao Martins wrote:
>
>
> On 01/20/2016 09:59 PM, Konrad Rzeszutek Wilk wrote:
> > On Tue, Dec 01, 2015 at 11:32:58PM +0100, Marek Marczykowski-Górecki wrote:
> >> On Tue, Dec 01, 2015 at 05:00:42PM -0500, Konrad Rzeszutek Wilk wrote:
> >>> On Tue, Nov 17, 2015 at 03:45:15AM +0100, Marek Marczykowski-Górecki wrote:
> >>>> On Wed, Oct 21, 2015 at 08:57:34PM +0200, Marek Marczykowski-Górecki wrote:
> >>>>> On Wed, May 27, 2015 at 12:03:12AM +0200, Marek Marczykowski-Górecki wrote:
> >>>>>> On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
> >>>>>>> On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> I'm experiencing xen-netfront crash when doing xl network-detach while
> >>>>>>>> some network activity is going on at the same time. It happens only when
> >>>>>>>> domU has more than one vcpu. Not sure if this matters, but the backend
> >>>>>>>> is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
> >>>>>>>> 3.9.4 and 4.1-rc1 as well.
> >>>>>>>>
> >>>>>>>> Steps to reproduce:
> >>>>>>>> 1. Start the domU with some network interface
> >>>>>>>> 2. Call there 'ping -f some-IP'
> >>>>>>>> 3. Call 'xl network-detach NAME 0'
> >>>
> >>> Do you see this all the time or just on occasion?
> >>
> >> Using the above procedure - all the time.
> >>
> >>> I tried to reproduce it and couldn't see it. Is your VM PV or HVM?
> >>
> >> PV, started by libvirt. This may have something to do with it: the problem
> >> didn't exist on older Xen (4.1) with domains started by xl. I'm not sure
> >> about the kernel version there, but I think I've tried 3.18 there too,
> >> which has this problem.
> >>
> >> But I don't see anything special in the domU config file (neither backend
> >> nor frontend) - it may be some libvirt default, if that's really the
> >> cause. How can I get any useful information about that?
> >
> > libvirt naturally does some libxl calls, and they may be different.
> >
> > Any chance you could give me an idea of:
> > - What commands you use in libvirt?
> > - Do you use a bond or bridge?
> > - What version of libvirt you are using?
> >
> > Thanks!
> > CC-ing Joao just in case he has seen this.
> >>
> Hm, so far I couldn't reproduce the issue with upstream Xen/linux/libvirt, using
> either libvirt or plain xl (both on a bridge setup), and irrespective of
> both the load and the direction of traffic (be it a ping flood, pktgen with
> min-sized packets, or iperf).
I've run the test again, on vanilla 4.4, and collected some info:
- xenstore dump of frontend (xs-frontend-before.txt)
- xenstore dump of backend (xs-backend-before.txt)
- kernel messages (console output) (console.log)
- kernel config (config-4.4)
- libvirt config of that domain (netdebug.conf)
Versions:
- kernel 4.4 (frontend), 4.2.8 (backend)
- libvirt 1.2.20
- xen 4.6.0
In the backend domain there is no bridge or anything like that - only
routing. The same in the frontend - nothing fancy, just an IP set on eth0
there.
Steps to reproduce were the same:
- start frontend domain (virsh create ...)
- call ping -f
- xl network-detach NAME 0
Note that the crash doesn't happen with the attached patch applied (as noted
in my mail on Oct 21), but I have no idea whether it is a proper fix or
just prevents the crash by coincidence.
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
[-- Attachment #1.2: console.log --]
[-- Type: text/plain, Size: 29664 bytes --]
[ 0.000000] x86/PAT: Configuration [0-7]: WB WT UC- UC WC WP UC UC
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Initializing cgroup subsys cpuacct
[ 0.000000] Linux version 4.4.0-1.pvops.qubes.x86_64 (user@devel-3rdparty) (gcc version 4.9.2 20150212 (Red Hat 4.9.2-6) (GCC) ) #20 SMP Fri Jan 22 00:39:29 CET 2016
[ 0.000000] Command line: root=/dev/mapper/dmroot ro nomodeset console=hvc0 rd_NO_PLYMOUTH 3 rd.break
[ 0.000000] x86/fpu: Legacy x87 FPU detected.
[ 0.000000] x86/fpu: Using 'lazy' FPU context switches.
[ 0.000000] ACPI in unprivileged domain disabled
[ 0.000000] Released 0 page(s)
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
[ 0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[ 0.000000] Xen: [mem 0x0000000000100000-0x00000000f9ffffff] usable
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] DMI not present or invalid.
[ 0.000000] Hypervisor detected: Xen
[ 0.000000] e820: last_pfn = 0xfa000 max_arch_pfn = 0x400000000
[ 0.000000] MTRR: Disabled
[ 0.000000] RAMDISK: [mem 0x02030000-0x027c6fff]
[ 0.000000] NUMA turned off
[ 0.000000] Faking a node at [mem 0x0000000000000000-0x00000000f9ffffff]
[ 0.000000] NODE_DATA(0) allocated [mem 0x18837000-0x18848fff]
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
[ 0.000000] DMA32 [mem 0x0000000001000000-0x00000000f9ffffff]
[ 0.000000] Normal empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009ffff]
[ 0.000000] node 0: [mem 0x0000000000100000-0x00000000f9ffffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x00000000f9ffffff]
[ 0.000000] p2m virtual area at ffffc90000000000, size is 40000000
[ 0.000000] Remapped 0 page(s)
[ 0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[ 0.000000] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
[ 0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[ 0.000000] e820: [mem 0xfa000000-0xffffffff] available for PCI devices
[ 0.000000] Booting paravirtualized kernel on Xen
[ 0.000000] Xen version: 4.6.0 (preserve-AD)
[ 0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[ 0.000000] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:2 nr_node_ids:1
[ 0.000000] PERCPU: Embedded 33 pages/cpu @ffff880013c00000 s98264 r8192 d28712 u1048576
[ 0.000000] Built 1 zonelists in Node order, mobility grouping on. Total pages: 1007882
[ 0.000000] Policy zone: DMA32
[ 0.000000] Kernel command line: root=/dev/mapper/dmroot ro nomodeset console=hvc0 rd_NO_PLYMOUTH 3 rd.break
[ 0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[ 0.000000] Memory: 302592K/4095612K available (6670K kernel code, 1222K rwdata, 2980K rodata, 1452K init, 1424K bss, 3793020K reserved, 0K cma-reserved)
[ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
[ 0.000000] Hierarchical RCU implementation.
[ 0.000000] Build-time adjustment of leaf fanout to 64.
[ 0.000000] RCU restricting CPUs from NR_CPUS=64 to nr_cpu_ids=2.
[ 0.000000] RCU: Adjusting geometry for rcu_fanout_leaf=64, nr_cpu_ids=2
[ 0.000000] Using NULL legacy PIC
[ 0.000000] NR_IRQS:4352 nr_irqs:48 0
[ 0.000000] xen:events: Using FIFO-based ABI
[ 0.000000] Offload RCU callbacks from all CPUs
[ 0.000000] Offload RCU callbacks from CPUs: 0-1.
[ 0.000000] Console: colour dummy device 80x25
[ 0.000000] console [tty0] enabled
[ 0.000000] console [hvc0] enabled
[ 0.000000] clocksource: xen: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 0.000000] installing Xen timer for CPU 0
[ 0.000000] tsc: Detected 1995.104 MHz processor
[ 0.001000] Calibrating delay loop (skipped), value calculated using timer frequency.. 3990.20 BogoMIPS (lpj=1995104)
[ 0.001000] pid_max: default: 32768 minimum: 301
[ 0.001000] Security Framework initialized
[ 0.001611] Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
[ 0.004023] Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
[ 0.005197] Mount-cache hash table entries: 8192 (order: 4, 65536 bytes)
[ 0.005237] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes)
[ 0.005788] Initializing cgroup subsys io
[ 0.005803] Initializing cgroup subsys memory
[ 0.005821] Initializing cgroup subsys devices
[ 0.005832] Initializing cgroup subsys freezer
[ 0.005843] Initializing cgroup subsys net_cls
[ 0.005853] Initializing cgroup subsys perf_event
[ 0.005863] Initializing cgroup subsys net_prio
[ 0.005873] Initializing cgroup subsys hugetlb
[ 0.005956] CPU: Physical Processor ID: 0
[ 0.005963] CPU: Processor Core ID: 1
[ 0.005977] Last level iTLB entries: 4KB 128, 2MB 4, 4MB 4
[ 0.005983] Last level dTLB entries: 4KB 256, 2MB 0, 4MB 32, 1GB 0
[ 0.039457] ftrace: allocating 25787 entries in 101 pages
[ 0.047128] Could not initialize VPMU for cpu 0, error -95
[ 0.047220] Performance Events: unsupported p6 CPU model 15 no PMU driver, software events only.
[ 0.047979] NMI watchdog: disabled (cpu0): hardware events not enabled
[ 0.047990] NMI watchdog: Shutting down hard lockup detector on all cpus
[ 0.048139] SMP alternatives: switching to SMP code
[ 0.078846] installing Xen timer for CPU 1
[ 0.079180] x86: Booted up 1 node, 2 CPUs
[ 0.079316] devtmpfs: initialized
[ 0.083816] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[ 0.084096] atomic64_test: passed for x86-64 platform with CX8 and with SSE
[ 0.084108] pinctrl core: initialized pinctrl subsystem
[ 0.104107] RTC time: 165:165:165, date: 165/165/65
[ 0.110246] NET: Registered protocol family 16
[ 0.110284] xen:grant_table: Grant tables using version 1 layout
[ 0.110310] Grant table initialized
[ 0.111010] PCI: setting up Xen PCI frontend stub
[ 0.115090] ACPI: Interpreter disabled.
[ 0.115090] xen:balloon: Initialising balloon driver
[ 0.181042] xen_balloon: Initialising balloon driver
[ 0.181110] vgaarb: loaded
[ 0.182053] SCSI subsystem initialized
[ 0.182165] dmi: Firmware registration failed.
[ 0.183040] PCI: System does not support PCI
[ 0.183050] PCI: System does not support PCI
[ 0.183237] NetLabel: Initializing
[ 0.183248] NetLabel: domain hash size = 128
[ 0.183254] NetLabel: protocols = UNLABELED CIPSOv4
[ 0.183279] NetLabel: unlabeled traffic allowed by default
[ 0.185019] clocksource: Switched to clocksource xen
[ 0.195845] pnp: PnP ACPI: disabled
[ 0.200085] NET: Registered protocol family 2
[ 0.200441] TCP established hash table entries: 32768 (order: 6, 262144 bytes)
[ 0.200691] TCP bind hash table entries: 32768 (order: 7, 524288 bytes)
[ 0.200901] TCP: Hash tables configured (established 32768 bind 32768)
[ 0.200975] UDP hash table entries: 2048 (order: 4, 65536 bytes)
[ 0.201057] UDP-Lite hash table entries: 2048 (order: 4, 65536 bytes)
[ 0.201180] NET: Registered protocol family 1
[ 0.201284] Unpacking initramfs...
[ 0.219827] Freeing initrd memory: 7772K (ffff880002030000 - ffff8800027c7000)
[ 0.220918] futex hash table entries: 512 (order: 3, 32768 bytes)
[ 0.220977] audit: initializing netlink subsys (disabled)
[ 0.221036] audit: type=2000 audit(1453421870.210:1): initialized
[ 0.221566] HugeTLB registered 2 MB page size, pre-allocated 0 pages
[ 0.224154] zbud: loaded
[ 0.224502] VFS: Disk quotas dquot_6.6.0
[ 0.224567] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[ 0.225281] Key type big_key registered
[ 0.230104] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253)
[ 0.230168] io scheduler noop registered
[ 0.230177] io scheduler deadline registered
[ 0.230239] io scheduler cfq registered (default)
[ 0.230356] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
[ 0.230371] pciehp: PCI Express Hot Plug Controller Driver version: 0.4
[ 0.230721] xen:xen_evtchn: Event-channel device installed
[ 0.231501] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[ 0.232242] Non-volatile memory driver v1.3
[ 0.232585] loop: module loaded
[ 0.283555] blkfront: xvda: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[ 0.287743] tun: Universal TUN/TAP device driver, 1.6
[ 0.287754] tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
[ 0.287959] xen_netfront: Initialising Xen virtual ethernet driver
[ 0.291795] blkfront: xvdb: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[ 0.295457] i8042: PNP: No PS/2 controller found. Probing ports directly.
[ 1.314402] i8042: No controller found
[ 1.314436] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x39843a855e2, max_idle_ns: 881590688655 ns
[ 1.314792] mousedev: PS/2 mouse device common for all mice
[ 1.315247] device-mapper: uevent: version 1.0.3
[ 1.315492] device-mapper: ioctl: 4.34.0-ioctl (2015-10-28) initialised: dm-devel@redhat.com
[ 1.315696] dmi-sysfs: dmi entry is absent.
[ 1.315738] hidraw: raw HID events driver (C) Jiri Kosina
[ 1.315852] drop_monitor: Initializing network drop monitor service
[ 1.315956] Initializing XFRM netlink socket
[ 1.316215] NET: Registered protocol family 10
[ 1.316696] mip6: Mobile IPv6
[ 1.316710] NET: Registered protocol family 17
[ 1.317235] registered taskstats version 1
[ 1.317290] zswap: loaded using pool lzo/zbud
[ 1.386055] blkfront: xvdc: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[ 1.392405] blkfront: xvdd: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[ 1.417232] Magic number: 1:252:3141
[ 1.417399] console [netcon0] enabled
[ 1.417417] netconsole: network logging started
[ 1.417441] hctosys: unable to open rtc device (rtc0)
[ 1.420219] Freeing unused kernel memory: 1452K (ffffffff81d33000 - ffffffff81e9e000)
[ 1.420258] Write protecting the kernel read-only data: 12288k
[ 1.434129] Freeing unused kernel memory: 1508K (ffff880001687000 - ffff880001800000)
[ 1.435108] Freeing unused kernel memory: 1116K (ffff880001ae9000 - ffff880001c00000)
Qubes initramfs script here:
[ 1.440455] random: modprobe urandom read with 33 bits of entropy available
modprobe: chdir(4.4.0-1.pvops.qubes.x86_64): No such file or directory
modprobe: chdir(4.4.0-1.pvops.qubes.x86_64): No such file or directory
Qubes: Cannot load Xen Block Frontend...
Waiting for /dev/xvda* devices...
Qubes: Doing COW setup for AppVM...
sfdisk: Checking that no-one is using this disk right now ...
sfdisk: OK
sfdisk: /dev/xvdc: unrecognized partition table type
sfdisk: No partitions found
sfdisk: Warning: partition 1 does not end at a cylinder boundary
sfdisk: Warning: partition 2 does not start at a cylinder boundary
sfdisk: Warning: partition 2 does not end at a cylinder boundary
sfdisk: Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
[ 1.448877] xvdc: xvdc1 xvdc2
sfdisk: If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
Setting up swapspace version 1, size = 1073737728 bytes
UUID=da20ab8f-e3b2-460f-8ea1-c77204a0f63d
Qubes: done.
modprobe: chdir(4.4.0-1.pvops.qubes.x86_64): No such file or directory
[ 1.464947] EXT4-fs (dm-0): couldn't mount as ext3 due to feature incompatibilities
[ 1.465108] EXT4-fs (dm-0): couldn't mount as ext2 due to feature incompatibilities
[ 1.466566] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
Waiting for /dev/xvdd device...
[ 1.471024] EXT4-fs (xvdd): mounting ext3 file system using the ext4 subsystem
[ 1.471931] EXT4-fs (xvdd): mounted filesystem with ordered data mode. Opts: (null)
[ 1.474541] EXT4-fs (dm-0): re-mounted. Opts: data=ordered
[ 1.484119] EXT4-fs (dm-0): re-mounted. Opts: data=ordered
mount: mounting /tmp/modules/4.4.0-1.pvops.qubes.x86_64 on /sysroot/lib/modules/4.4.0-1.pvops.qubes.x86_64 failed: No such file or directory
[ 1.578686] systemd[1]: systemd 216 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN)
[ 1.578778] systemd[1]: Detected virtualization 'xen'.
[ 1.578793] systemd[1]: Detected architecture 'x86-64'.
Welcome to Fedora 21 (Twenty One)!
[ 1.579269] systemd[1]: No hostname configured.
[ 1.579287] systemd[1]: Set hostname to <localhost>.
[ 1.643346] systemd[1]: Configuration file /usr/lib/systemd/system/auditd.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
[ 1.673346] systemd[1]: Starting Forward Password Requests to Wall Directory Watch.
[ 1.673475] systemd[1]: Started Forward Password Requests to Wall Directory Watch.
[ 1.673505] systemd[1]: Expecting device dev-hvc0.device...
Expecting device dev-hvc0.device...
[ 1.673685] systemd[1]: Starting Remote File Systems.
[ OK ] Reached target Remote File Systems.
[ 1.673773] systemd[1]: Reached target Remote File Systems.
[ 1.673823] systemd[1]: Starting Arbitrary Executable File Formats File System Automount Point.
[ OK ] Set up automount Arbitrary Executab...ats File System Automount Point.
[ 1.674059] systemd[1]: Set up automount Arbitrary Executable File Formats File System Automount Point.
[ 1.674091] systemd[1]: Starting Encrypted Volumes.
[ OK ] Reached target Encrypted Volumes.
[ 1.674175] systemd[1]: Reached target Encrypted Volumes.
[ 1.674201] systemd[1]: Expecting device dev-xvdc1.device...
Expecting device dev-xvdc1.device...
[ 1.674287] systemd[1]: Starting Root Slice.
[ OK ] Created slice Root Slice.
[ 1.690630] systemd[1]: Created slice Root Slice.
[ 1.690661] systemd[1]: Starting Delayed Shutdown Socket.
[ OK ] Listening on Delayed Shutdown Socket.
[ 1.690788] systemd[1]: Listening on Delayed Shutdown Socket.
[ 1.690813] systemd[1]: Starting /dev/initctl Compatibility Named Pipe.
[ OK ] Listening on /dev/initctl Compatibility Named Pipe.
[ 1.690938] systemd[1]: Listening on /dev/initctl Compatibility Named Pipe.
[ 1.690970] systemd[1]: Starting udev Control Socket.
[ OK ] Listening on udev Control Socket.
[ 1.691101] systemd[1]: Listening on udev Control Socket.
[ 1.691130] systemd[1]: Starting udev Kernel Socket.
[ OK ] Listening on udev Kernel Socket.
[ 1.691234] systemd[1]: Listening on udev Kernel Socket.
[ 1.691257] systemd[1]: Starting User and Session Slice.
[ OK ] Created slice User and Session Slice.
[ 1.691711] systemd[1]: Created slice User and Session Slice.
[ 1.691745] systemd[1]: Starting Journal Socket.
[ OK ] Listening on Journal Socket.
[ 1.691894] systemd[1]: Listening on Journal Socket.
[ 1.691942] systemd[1]: Starting System Slice.
[ OK ] Created slice System Slice.
[ 1.692393] systemd[1]: Created slice System Slice.
[ 1.692442] systemd[1]: Mounting Temporary Directory...
Mounting Temporary Directory...
[ 1.692722] systemd[1]: tmp.mount: Directory /tmp to mount over is not empty, mounting anyway.
[ 1.694761] systemd[1]: Starting Journal Socket (/dev/log).
[ 1.696712] systemd[1]: Starting udev Coldplug all Devices...
Starting udev Coldplug all Devices...
[ 1.699255] systemd[1]: Mounting Huge Pages File System...
Mounting Huge Pages File System...
[ 1.701608] systemd[1]: Mounting Debug File System...
Mounting Debug File System...
[ 1.704788] systemd[1]: Mounting POSIX Message Queue File System...
Mounting POSIX Message Queue File System...
[ 1.719399] systemd[1]: Started Create list of required static device nodes for the current kernel.
[ 1.719508] systemd[1]: Starting system-serial\x2dgetty.slice.
[ OK ] Created slice system-serial\x2dgetty.slice.
[ 1.720294] systemd[1]: Created slice system-serial\x2dgetty.slice.
[ 1.720357] systemd[1]: Started Collect Read-Ahead Data.
[ 1.720401] systemd[1]: Started Replay Read-Ahead Data.
[ 1.720437] systemd[1]: Starting File System Check on Root Device...
Starting File System Check on Root Device...
[ 1.727937] systemd[1]: Starting Load Kernel Modules...
Starting Load Kernel Modules...
[ 1.730073] systemd[1]: Starting Setup Virtual Console...
Starting Setup Virtual Console...
[ 1.745613] systemd[1]: Starting Load legacy module configuration...
Starting Load legacy module configuration...
[ 1.764453] systemd[1]: Started Set Up Additional Binary Formats.
[ 1.764540] systemd[1]: Starting Slices.
[ OK ] Reached target Slices.
[ 1.764610] systemd[1]: Reached target Slices.
[ OK ] Mounted POSIX Message Queue File System.
[ 1.765640] systemd[1]: Mounted POSIX Message Queue File System.
[ OK ] Mounted Debug File System.
[ 1.765721] systemd[1]: Mounted Debug File System.
[ OK ] Mounted Huge Pages File System.
[ 1.765779] systemd[1]: Mounted Huge Pages File System.
[ OK ] Mounted Temporary Directory.
[ 1.765860] systemd[1]: Mounted Temporary Directory.
[ OK ] Listening on Journal Socket (/dev/log).
[ 1.766401] systemd[1]: Listening on Journal Socket (/dev/log).
[ 1.768972] systemd[1]: systemd-modules-load.service: main process exited, code=exited, status=1/FAILURE
[FAILED] Failed to start Load Kernel Modules.
See "systemctl status systemd-modules-load.service" for details.
[ 1.769555] systemd[1]: Failed to start Load Kernel Modules.
[ 1.769579] systemd[1]: Unit systemd-modules-load.service entered failed state.
[ 1.774333] systemd[1]: systemd-modules-load.service failed.
[ OK ] Started Setup Virtual Console.
[ 1.774886] systemd[1]: Started Setup Virtual Console.
[ 1.806920] systemd[1]: Mounting Configuration File System...
Mounting Configuration File System...
[ 1.809565] systemd[1]: Starting Apply Kernel Variables...
Starting Apply Kernel Variables...
[ 1.879402] systemd[1]: Mounted FUSE Control File System.
[ 1.879728] systemd[1]: Starting Journal Service...
Starting Journal Service...
[ OK ] Started udev Coldplug all Devices.
[ 1.898192] systemd[1]: Started udev Coldplug all Devices.
[ 1.923752] systemd[1]: Starting Show Plymouth Boot Screen...
Starting Show Plymouth Boot Screen...
[ OK ] Mounted Configuration File System.
[ 1.950124] systemd[1]: Mounted Configuration File System.
[ OK ] Started Load legacy module configuration.
[ 1.950948] systemd[1]: Started Load legacy module configuration.
[ OK ] Started Apply Kernel Variables.
[ 1.951688] systemd[1]: Started Apply Kernel Variables.
[ 1.982704] systemd-fsck[142]: /dev/mapper/dmroot: clean, 139351/655360 files, 1285520/2621440 blocks
[ OK ] Started Journal Service.
[ 1.984482] systemd[1]: Started Journal Service.
[ OK ] Started File System Check on Root Device.
[ 2.049986] EXT4-fs (dm-0): re-mounted. Opts: (null)
[ 2.118620] systemd-journald[173]: Received request to flush runtime journal from PID 1
[ 5.367681] Adding 1048572k swap on /dev/xvdc1. Priority:-1 extents:1 across:1048572k SSFS
Starting Remount Root and Kernel File Systems...
[ OK ] Started Remount Root and Kernel File Systems.
Starting Configure read-only root support...
Starting Flush Journal to Persistent Storage...
Starting Create Static Device Nodes in /dev...
Starting Load/Save Random Seed...
[ OK ] Started Configure read-only root support.
[ OK ] Started Load/Save Random Seed.
[ OK ] Started Create Static Device Nodes in /dev.
Starting udev Kernel Device Manager...
[ OK ] Reached target Local File Systems (Pre).
Mounting /proc/xen...
[ OK ] Mounted /proc/xen.
[ OK ] Reached target Local File Systems.
Starting Qubes DB agent...
Starting Tell Plymouth To Write Out Runtime Data...
[ OK ] Started Tell Plymouth To Write Out Runtime Data.
[ OK ] Started Qubes DB agent.
Starting Qubes Random Seed...
Starting Init Qubes Services settings...
[ OK ] Started udev Kernel Device Manager.
[ OK ] Started Qubes Random Seed.
[ OK ] Started Flush Journal to Persistent Storage.
Starting Create Volatile Files and Directories...
[ OK ] Found device /dev/hvc0.
[ OK ] Started Create Volatile Files and Directories.
Starting Update UTMP about System Boot/Shutdown...
[ OK ] Started Update UTMP about System Boot/Shutdown.
[ OK ] Started Init Qubes Services settings.
[ OK ] Found device /dev/xvdc1.
Activating swap /dev/xvdc1...
[ OK ] Activated swap /dev/xvdc1.
[ OK ] Reached target Swap.
[ OK ] Reached target System Initialization.
[ OK ] Listening on CUPS Printing Service Sockets.
Starting Manage Sound Card State (restore and store)...
[ OK ] Started Manage Sound Card State (restore and store).
[ OK ] Reached target Timers.
[ OK ] Listening on D-Bus System Message Bus Socket.
[ OK ] Reached target Sockets.
[ OK ] Started Show Plymouth Boot Screen.
[ OK ] Reached target Paths.
[ OK ] Reached target Basic System.
Starting ABRT Automated Bug Reporting Tool...
[ OK ] Started ABRT Automated Bug Reporting Tool.
Starting Initialize and mount /rw and /home...
Starting Initialize and mount /rw and /home...
Starting ABRT kernel log watcher...
[ OK ] Started ABRT kernel log watcher.
Starting Install ABRT coredump hook...
Starting Entropy Daemon based on the HAVEGE algorithm...
[ OK ] Started Entropy Daemon based on the HAVEGE algorithm.
Starting Machine Check Exception Logging Daemon...
[ 5.654686] random: nonblocking pool is initialized
Starting Qubes memory information reporter...
Starting Qubes remote exec agent...
Starting ABRT Xorg log watcher...
[ OK ] Started ABRT Xorg log watcher.
Starting Qubes base firewall settings...
Starting Login Service...
Starting Permit User Sessions...
Starting LSB: Start/stop xen driver domain daemon...
Starting D-Bus System Message Bus...
[ 5.823870] EXT4-fs (xvdb): recovery complete
[ OK ] Started D-Bus System Message Bus.
[ 5.831639] EXT4-fs (xvdb): mounted filesystem with ordered data mode. Opts: discard
[ OK ] Started Initialize and mount /rw and /home.
[ OK ] Started Initialize and mount /rw and /home.
[ OK ] Started Install ABRT coredump hook.
[ OK ] Started Machine Check Exception Logging Daemon.
[ OK ] Started Qubes memory information reporter.
[ OK ] Started Qubes remote exec agent.
[FAILED] Failed to start Qubes base firewall settings.
See "systemctl status qubes-iptables.service" for details.
[ OK ] Started Permit User Sessions.
[ OK ] Started LSB: Start/stop xen driver domain daemon.
[ OK ] Started Login Service.
Starting Job spooling tools...
[ OK ] Started Job spooling tools.
Starting Wait for Plymouth Boot Screen to Quit...
Starting Terminate Plymouth Boot Screen...
Starting Qubes misc post-boot actions...
Starting Qubes GUI Agent...
Fedora release 21 (Twenty One)
Kernel 4.4.0-1.pvops.qubes.x86_64 on an x86_64 (hvc0)
netdebug login: root
Last login: Fri Jan 22 01:18:08 on
[root@netdebug ~]# ip r
default via 10.137.2.1 dev eth0
10.137.2.1 dev eth0 scope link
[root@netdebug ~]# ping -f 10.137.2.1
PING 10.137.2.1 (10.137.2.1) 56(84) bytes of data.
(...)
...........[ 111.001689] page:ffffea00004bffc0 count:0 mapcount:0 mapping: (null) index:0x0
[ 111.001710] flags: 0x3fff8000000000()
[ 111.001718] page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
[ 111.001739] ------------[ cut here ]------------
[ 111.001746] kernel BUG at include/linux/mm.h:342!
[ 111.001753] invalid opcode: 0000 [#1] SMP
[ 111.001761] Modules linked in:
[ 111.001769] CPU: 1 PID: 23 Comm: xenwatch Not tainted 4.4.0-1.pvops.qubes.x86_64 #20
[ 111.001780] task: ffff880012c88000 ti: ffff880012c7c000 task.ti: ffff880012c7c000
[ 111.001790] RIP: e030:[<ffffffff81174a38>] [<ffffffff81174a38>] __free_pages+0x38/0x40
[ 111.001807] RSP: e02b:ffff880012c7fc30 EFLAGS: 00010246
[ 111.001814] RAX: 0000000000000044 RBX: 000077ff80000000 RCX: 0000000000000044
[ 111.001822] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880013d0de10
[ 111.001829] RBP: ffff880012c7fc30 R08: 0000000000000000 R09: 0000000000000000
[ 111.001837] R10: ffffea0000453f00 R11: ffffffff81863ec0 R12: 0000000000000000
[ 111.001844] R13: ffff880012eb9c00 R14: ffff880012fff000 R15: 0000000000000000
[ 111.001857] FS: 00007f84805fe700(0000) GS:ffff880013d00000(0000) knlGS:0000000000000000
[ 111.001867] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 111.001874] CR2: 000055af04b6d000 CR3: 00000000008e8000 CR4: 0000000000002660
[ 111.001882] Stack:
[ 111.001887] ffff880012c7fc50 ffffffff81174a9c 000000000000000e 0000000000000004
[ 111.001900] ffff880012c7fc80 ffffffff813f709e ffff87ffffffffff ffff880011528d68
[ 111.001912] ffff880012eb9c00 ffff880011528000 ffff880012c7fce8 ffffffff814b58c2
[ 111.001924] Call Trace:
[ 111.001932] [<ffffffff81174a9c>] free_pages+0x5c/0x70
[ 111.001942] [<ffffffff813f709e>] gnttab_end_foreign_access+0x12e/0x160
[ 111.001952] [<ffffffff814b58c2>] xennet_disconnect_backend.isra.26+0x162/0x3b0
[ 111.001963] [<ffffffff814b5b91>] xennet_remove+0x31/0x80
[ 111.001971] [<ffffffff814011e5>] xenbus_dev_remove+0x55/0xb0
[ 111.001980] [<ffffffff8145d596>] __device_release_driver+0x96/0x130
[ 111.002639] [<ffffffff8145d653>] device_release_driver+0x23/0x30
[ 111.002639] [<ffffffff8145c3a1>] bus_remove_device+0x101/0x170
[ 111.002639] [<ffffffff814586a9>] device_del+0x139/0x270
[ 111.002639] [<ffffffff813ffb60>] ? unregister_xenbus_watch+0x1d0/0x1d0
[ 111.002639] [<ffffffff814587fe>] device_unregister+0x1e/0x60
[ 111.002639] [<ffffffff81401053>] xenbus_dev_changed+0xa3/0x1e0
[ 111.002639] [<ffffffff8167f18b>] ? _raw_spin_lock_irqsave+0x1b/0x40
[ 111.002639] [<ffffffff813ffb60>] ? unregister_xenbus_watch+0x1d0/0x1d0
[ 111.002639] [<ffffffff81402ea5>] frontend_changed+0x25/0x50
[ 111.002639] [<ffffffff813ffbe7>] xenwatch_thread+0x87/0x140
[ 111.002639] [<ffffffff810b3b60>] ? wait_woken+0x80/0x80
[ 111.002639] [<ffffffff81090838>] kthread+0xd8/0xf0
[ 111.002639] [<ffffffff81090760>] ? kthread_park+0x60/0x60
[ 111.002639] [<ffffffff8167f70f>] ret_from_fork+0x3f/0x70
[ 111.002639] [<ffffffff81090760>] ? kthread_park+0x60/0x60
[ 111.002639] Code: c0 74 1c f0 ff 4f 1c 74 02 5d c3 85 f6 74 07 e8 ff f8 ff ff 5d c3 31 f6 e8 f6 fd ff ff 5d c3 48 c7 c6 20 27 a2 81 e8 28 47 02 00 <0f> 0b 66 0f 1f 44 00 00 66 66 66 66 90 48 85 ff 75 02 f3 c3 55
[ 111.002639] RIP [<ffffffff81174a38>] __free_pages+0x38/0x40
[ 111.002639] RSP <ffff880012c7fc30>
[ 111.003417] ---[ end trace e389324fec932a31 ]---
..............................................................\x03
--- 10.137.2.1 ping statistics ---
27465 packets transmitted, 27392 received, 0% packet loss, time 11156ms
rtt min/avg/max/mdev = 0.053/0.071/12.598/0.109 ms, pipe 2, ipg/ewma 0.406/0.060 ms
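For reference, the BUG itself comes from the refcount sanity check in the page
allocator's free path: free_pages() drops the final reference via
put_page_testzero(), and include/linux/mm.h:342 in this 4.4-era tree is the
assertion that the count has not already reached zero. A paraphrased sketch of
that check (from memory, not a verbatim copy of the tree):

  /*
   * Paraphrased sketch of the check at include/linux/mm.h:342 (4.4-era).
   * free_pages() lands here on its way to freeing the page; if the caller
   * passes a page whose count is already zero, the VM_BUG_ON_PAGE() fires
   * and produces exactly the "count:0" dump seen above.
   */
  static inline int put_page_testzero(struct page *page)
  {
          VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0, page);
          return atomic_dec_and_test(&page->_count);
  }

In other words, by the time gnttab_end_foreign_access() calls free_pages() on
the grant page during disconnect, something has already dropped the last
reference to it.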
[-- Attachment #1.3: netdebug.conf --]
[-- Type: text/plain, Size: 1881 bytes --]
<domain type='xen'>
<name>netdebug</name>
<uuid>4e22b40c-e06e-46d1-aaaa-2293487b81b9</uuid>
<memory unit='MiB'>4000</memory>
<currentMemory unit='MiB'>400</currentMemory>
<vcpu placement='static'>2</vcpu>
<os>
<type arch='x86_64' machine='xenpv'>linux</type>
<kernel>/var/lib/qubes/vm-kernels/4.4-debug/vmlinuz</kernel>
<initrd>/var/lib/qubes/vm-kernels/4.4-debug/initramfs</initrd>
<cmdline>root=/dev/mapper/dmroot ro nomodeset console=hvc0 rd_NO_PLYMOUTH 3 rd.break</cmdline>
</os>
<clock offset='utc' adjustment='reset'>
<timer name="tsc" mode="native"/>
</clock>
<on_poweroff>destroy</on_poweroff>
<on_reboot>destroy</on_reboot>
<on_crash>destroy</on_crash>
<devices>
<disk type='block' device='disk'>
<driver name='phy'/>
<source dev='/var/lib/qubes/vm-templates/fedora-21/root.img:/var/lib/qubes/vm-templates/fedora-21/root-cow.img'/>
<target dev='xvda' bus='xen'/>
<readonly/>
<script path='block-snapshot'/>
</disk>
<disk type='block' device='disk'>
<driver name='phy'/>
<source dev='/var/lib/qubes/appvms/netdebug/private.img'/>
<target dev='xvdb' bus='xen'/>
</disk>
<disk type='block' device='disk'>
<driver name='phy'/>
<source dev='/var/lib/qubes/appvms/netdebug/volatile.img'/>
<target dev='xvdc' bus='xen'/>
</disk>
<disk type='block' device='disk'>
<driver name='phy'/>
<source dev='/var/lib/qubes/vm-kernels/4.4-debug/modules.img'/>
<target dev='xvdd' bus='xen'/>
<readonly/>
</disk>
<interface type='ethernet'>
<mac address='00:16:3E:5E:6C:1B'/>
<ip address='10.137.2.29'/>
<script path='vif-route-qubes'/>
<backenddomain name='sys-firewall'/>
</interface>
<console type='pty'>
<target type='xen' port='0'/>
</console>
</devices>
</domain>
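(For completeness: the above is a libvirt domain definition, since Qubes
drives Xen through libvirt. A minimal, hypothetical C sketch of booting such a
definition through the libvirt API follows; the "xen:///" URI and the file
name are assumptions, error handling is trimmed, and you would link with
-lvirt.)

  /* Hypothetical sketch: start a transient domain from a libvirt XML
   * definition like netdebug.conf above. */
  #include <stdio.h>
  #include <libvirt/libvirt.h>

  int main(void)
  {
          char xml[8192];
          size_t n;
          FILE *f = fopen("netdebug.conf", "r");
          virConnectPtr conn;
          virDomainPtr dom;

          if (!f)
                  return 1;
          n = fread(xml, 1, sizeof(xml) - 1, f);
          xml[n] = '\0';
          fclose(f);

          conn = virConnectOpen("xen:///");   /* local Xen driver */
          if (!conn)
                  return 1;
          dom = virDomainCreateXML(conn, xml, 0);  /* boots the VM */
          if (dom) {
                  printf("started %s\n", virDomainGetName(dom));
                  virDomainFree(dom);
          }
          virConnectClose(conn);
          return dom ? 0 : 1;
  }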
[-- Attachment #1.4: xs-backend-before.txt --]
[-- Type: text/plain, Size: 1228 bytes --]
/local/domain/2/backend/vif/49/0/frontend = "/local/domain/49/device/vif/0" (n2,r49)
/local/domain/2/backend/vif/49/0/frontend-id = "49" (n2,r49)
/local/domain/2/backend/vif/49/0/online = "1" (n2,r49)
/local/domain/2/backend/vif/49/0/state = "4" (n2,r49)
/local/domain/2/backend/vif/49/0/script = "/etc/xen/scripts/vif-route-qubes" (n2,r49)
/local/domain/2/backend/vif/49/0/mac = "00:16:3e:5e:6c:1b" (n2,r49)
/local/domain/2/backend/vif/49/0/ip = "10.137.2.29" (n2,r49)
/local/domain/2/backend/vif/49/0/handle = "0" (n2,r49)
/local/domain/2/backend/vif/49/0/type = "vif" (n2,r49)
/local/domain/2/backend/vif/49/0/feature-sg = "1" (n2,r49)
/local/domain/2/backend/vif/49/0/feature-gso-tcpv4 = "1" (n2,r49)
/local/domain/2/backend/vif/49/0/feature-gso-tcpv6 = "1" (n2,r49)
/local/domain/2/backend/vif/49/0/feature-ipv6-csum-offload = "1" (n2,r49)
/local/domain/2/backend/vif/49/0/feature-rx-copy = "1" (n2,r49)
/local/domain/2/backend/vif/49/0/feature-rx-flip = "0" (n2,r49)
/local/domain/2/backend/vif/49/0/feature-split-event-channels = "1" (n2,r49)
/local/domain/2/backend/vif/49/0/multi-queue-max-queues = "2" (n2,r49)
/local/domain/2/backend/vif/49/0/hotplug-status = "connected" (n2,r49)
[-- Attachment #1.5: xs-frontend-before.txt --]
[-- Type: text/plain, Size: 1503 bytes --]
/local/domain/49/device/vif/0 = "" (n49,r2)
/local/domain/49/device/vif/0/backend = "/local/domain/2/backend/vif/49/0" (n49,r2)
/local/domain/49/device/vif/0/backend-id = "2" (n49,r2)
/local/domain/49/device/vif/0/state = "4" (n49,r2)
/local/domain/49/device/vif/0/handle = "0" (n49,r2)
/local/domain/49/device/vif/0/mac = "00:16:3e:5e:6c:1b" (n49,r2)
/local/domain/49/device/vif/0/multi-queue-num-queues = "2" (n49,r2)
/local/domain/49/device/vif/0/queue-0 = "" (n49,r2)
/local/domain/49/device/vif/0/queue-0/tx-ring-ref = "1280" (n49,r2)
/local/domain/49/device/vif/0/queue-0/rx-ring-ref = "1281" (n49,r2)
/local/domain/49/device/vif/0/queue-0/event-channel-tx = "19" (n49,r2)
/local/domain/49/device/vif/0/queue-0/event-channel-rx = "20" (n49,r2)
/local/domain/49/device/vif/0/queue-1 = "" (n49,r2)
/local/domain/49/device/vif/0/queue-1/tx-ring-ref = "1282" (n49,r2)
/local/domain/49/device/vif/0/queue-1/rx-ring-ref = "1283" (n49,r2)
/local/domain/49/device/vif/0/queue-1/event-channel-tx = "21" (n49,r2)
/local/domain/49/device/vif/0/queue-1/event-channel-rx = "22" (n49,r2)
/local/domain/49/device/vif/0/request-rx-copy = "1" (n49,r2)
/local/domain/49/device/vif/0/feature-rx-notify = "1" (n49,r2)
/local/domain/49/device/vif/0/feature-sg = "1" (n49,r2)
/local/domain/49/device/vif/0/feature-gso-tcpv4 = "1" (n49,r2)
/local/domain/49/device/vif/0/feature-gso-tcpv6 = "1" (n49,r2)
/local/domain/49/device/vif/0/feature-ipv6-csum-offload = "1" (n49,r2)
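(The nodes above can be re-checked at runtime. As a minimal sketch, the
frontend state node can be read through libxenstore -- link with -lxenstore.
The domid 49 is specific to this boot; "4" is XenbusStateConnected in
xen/io/xenbus.h, and after a completed detach the whole vif/0 subtree
disappears.)

  /* Minimal sketch: read the vif frontend state node dumped above. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <xenstore.h>

  int main(void)
  {
          struct xs_handle *xsh = xs_open(XS_OPEN_READONLY);
          unsigned int len;
          char *state;

          if (!xsh)
                  return 1;
          state = xs_read(xsh, XBT_NULL,
                          "/local/domain/49/device/vif/0/state", &len);
          printf("vif state: %s\n", state ? state : "(gone)");
          free(state);     /* xs_read() returns a malloc'd buffer */
          xs_close(xsh);
          return 0;
  }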
[-- Attachment #1.6: config-4.4 --]
[-- Type: application/x-troff-man, Size: 91440 bytes --]
[-- Attachment #1.7: xen-netfront-detach-crash.patch --]
[-- Type: text/plain, Size: 1128 bytes --]
When it gets to free_page(queue->grant_tx_page[i]), the use count on this
page is already 0, which causes a crash. Not sure if this is the proper fix
(according to the git log it may introduce a memory leak), but at least it
prevents the crash.
Details in this thread:
http://xen.markmail.org/thread/pw5edbtqienjx4q5
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index f821a97..a5efbb0 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1065,9 +1069,10 @@ static void xennet_release_tx_bufs(struct netfront_queue *queue)
skb = queue->tx_skbs[i].skb;
get_page(queue->grant_tx_page[i]);
- gnttab_end_foreign_access(queue->grant_tx_ref[i],
- GNTMAP_readonly,
- (unsigned long)page_address(queue->grant_tx_page[i]));
+ gnttab_end_foreign_access_ref(
+ queue->grant_tx_ref[i], GNTMAP_readonly);
+ gnttab_release_grant_reference(
+ &queue->gref_tx_head, queue->grant_tx_ref[i]);
queue->grant_tx_page[i] = NULL;
queue->grant_tx_ref[i] = GRANT_INVALID_REF;
add_id_to_freelist(&queue->tx_skb_freelist, queue->tx_skbs, i);
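For anyone reading along, a sketch of the two teardown paths being swapped,
with signatures as in a 4.4-era include/xen/grant_table.h (the annotations
are explanatory, not from the header):

  /* Old path: revokes the grant AND frees the backing page -- this is
   * the free_pages() call that trips VM_BUG_ON_PAGE() when the page
   * count has already reached zero. */
  void gnttab_end_foreign_access(grant_ref_t ref, int readonly,
                                 unsigned long page);

  /* New path, as in the patch above: only revoke the grant... */
  int gnttab_end_foreign_access_ref(grant_ref_t ref, int readonly);

  /* ...and hand the reference back to the driver's preallocated pool,
   * leaving the page's lifetime to whoever still holds a reference. */
  void gnttab_release_grant_reference(grant_ref_t *private_head,
                                      grant_ref_t release);

Note that the get_page() taken just before is never paired with a put on this
path, which matches the possible memory leak mentioned in the description.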
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]
Thread overview: 15+ messages
2015-05-22 11:49 xen-netfront crash when detaching network while some network activity Marek Marczykowski-Górecki
2015-05-22 16:25 ` [Xen-devel] " David Vrabel
2015-05-22 16:42 ` Marek Marczykowski-Górecki
2015-05-22 16:58 ` David Vrabel
2015-05-22 17:13 ` Marek Marczykowski-Górecki
2015-05-26 10:56 ` David Vrabel
2015-05-26 22:03 ` Marek Marczykowski-Górecki
2015-10-21 18:57 ` Marek Marczykowski-Górecki
2015-11-17 2:45 ` Marek Marczykowski-Górecki
2015-12-01 22:00 ` Konrad Rzeszutek Wilk
2015-12-01 22:32 ` Marek Marczykowski-Górecki
2016-01-20 21:59 ` Konrad Rzeszutek Wilk
2016-01-21 12:30 ` Joao Martins
2016-01-22 19:23 ` Marek Marczykowski-Górecki
2015-11-17 11:59 ` [Xen-devel] " David Vrabel