Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* [PATCH AUTOSEL 4.4 09/15] scsi: storvsc: Fix calculation of sub-channel count
From: Sasha Levin @ 2019-04-24 14:51 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Michael Kelley, Martin K . Petersen, Sasha Levin, linux-hyperv,
	linux-scsi
In-Reply-To: <20190424145152.31351-1-sashal@kernel.org>

From: Michael Kelley <mikelley@microsoft.com>

[ Upstream commit 382e06d11e075a40b4094b6ef809f8d4bcc7ab2a ]

When the number of sub-channels offered by Hyper-V is >= the number of CPUs
in the VM, calculate the correct number of sub-channels.  The current code
produces one too many.

This scenario arises only when the number of CPUs is artificially
restricted (for example, with maxcpus=<n> on the kernel boot line), because
Hyper-V normally offers a sub-channel count < number of CPUs.  While the
current code doesn't break, the extra sub-channel is unbalanced across the
CPUs (for example, a total of 5 channels on a VM with 4 CPUs).

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/scsi/storvsc_drv.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 44b7a69d022a..45cd4cf93af3 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -613,13 +613,22 @@ static void handle_sc_creation(struct vmbus_channel *new_sc)
 static void  handle_multichannel_storage(struct hv_device *device, int max_chns)
 {
 	struct storvsc_device *stor_device;
-	int num_cpus = num_online_cpus();
 	int num_sc;
 	struct storvsc_cmd_request *request;
 	struct vstor_packet *vstor_packet;
 	int ret, t;
 
-	num_sc = ((max_chns > num_cpus) ? num_cpus : max_chns);
+	/*
+	 * If the number of CPUs is artificially restricted, such as
+	 * with maxcpus=1 on the kernel boot line, Hyper-V could offer
+	 * sub-channels >= the number of CPUs. These sub-channels
+	 * should not be created. The primary channel is already created
+	 * and assigned to one CPU, so check against # CPUs - 1.
+	 */
+	num_sc = min((int)(num_online_cpus() - 1), max_chns);
+	if (!num_sc)
+		return;
+
 	stor_device = get_out_stor_device(device);
 	if (!stor_device)
 		return;
-- 
2.19.1


^ permalink raw reply related

* [PATCH AUTOSEL 4.9 19/28] scsi: storvsc: Fix calculation of sub-channel count
From: Sasha Levin @ 2019-04-24 14:50 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Michael Kelley, Martin K . Petersen, Sasha Levin, linux-hyperv,
	linux-scsi
In-Reply-To: <20190424145012.30886-1-sashal@kernel.org>

From: Michael Kelley <mikelley@microsoft.com>

[ Upstream commit 382e06d11e075a40b4094b6ef809f8d4bcc7ab2a ]

When the number of sub-channels offered by Hyper-V is >= the number of CPUs
in the VM, calculate the correct number of sub-channels.  The current code
produces one too many.

This scenario arises only when the number of CPUs is artificially
restricted (for example, with maxcpus=<n> on the kernel boot line), because
Hyper-V normally offers a sub-channel count < number of CPUs.  While the
current code doesn't break, the extra sub-channel is unbalanced across the
CPUs (for example, a total of 5 channels on a VM with 4 CPUs).

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/scsi/storvsc_drv.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index d92b2808d191..6df34d68737f 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -641,13 +641,22 @@ static void handle_sc_creation(struct vmbus_channel *new_sc)
 static void  handle_multichannel_storage(struct hv_device *device, int max_chns)
 {
 	struct storvsc_device *stor_device;
-	int num_cpus = num_online_cpus();
 	int num_sc;
 	struct storvsc_cmd_request *request;
 	struct vstor_packet *vstor_packet;
 	int ret, t;
 
-	num_sc = ((max_chns > num_cpus) ? num_cpus : max_chns);
+	/*
+	 * If the number of CPUs is artificially restricted, such as
+	 * with maxcpus=1 on the kernel boot line, Hyper-V could offer
+	 * sub-channels >= the number of CPUs. These sub-channels
+	 * should not be created. The primary channel is already created
+	 * and assigned to one CPU, so check against # CPUs - 1.
+	 */
+	num_sc = min((int)(num_online_cpus() - 1), max_chns);
+	if (!num_sc)
+		return;
+
 	stor_device = get_out_stor_device(device);
 	if (!stor_device)
 		return;
-- 
2.19.1


^ permalink raw reply related

* Re: [GIT PULL] Hyper-V commits for 5.1
From: Greg KH @ 2019-04-25  9:27 UTC (permalink / raw)
  To: Sasha Levin
  Cc: linux-kernel, linux-hyperv, kys, haiyangz, sthemmin, linux-kernel
In-Reply-To: <20190417013452.438FA20821@mail.kernel.org>

On Tue, Apr 16, 2019 at 09:34:51PM -0400, Sasha Levin wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA512
> 
> The following changes since commit 9e98c678c2d6ae3a17cb2de55d17f69dddaa231b:
> 
>   Linux 5.1-rc1 (2019-03-17 14:22:26 -0700)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git tags/hyperv-fixes-signed

Now pulled and pushed out, thanks.

greg k-h

^ permalink raw reply

* [PATCH hyperv-fixes] hv_netvsc: fix race that may miss tx queue wakeup
From: Haiyang Zhang @ 2019-04-30 19:29 UTC (permalink / raw)
  To: sashal@kernel.org, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org
  Cc: Haiyang Zhang, KY Srinivasan, Stephen Hemminger, olaf@aepfle.de,
	vkuznets, davem@davemloft.net, linux-kernel@vger.kernel.org

When the ring buffer is almost full due to RX completion messages, a
TX packet may reach the "low watermark" and cause the queue stopped.
If the TX completion arrives earlier than queue stopping, the wakeup
may be missed.

This patch moves the check for the last pending packet to cover both
EAGAIN and success cases, so the queue will be reliably waked up when
necessary.

Reported-and-tested-by: Stephan Klein <stephan.klein@wegfinder.at>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/hyperv/netvsc.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index fdbeb70..ee19860 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -875,12 +875,6 @@ static inline int netvsc_send_pkt(
 	} else if (ret == -EAGAIN) {
 		netif_tx_stop_queue(txq);
 		ndev_ctx->eth_stats.stop_queue++;
-		if (atomic_read(&nvchan->queue_sends) < 1 &&
-		    !net_device->tx_disable) {
-			netif_tx_wake_queue(txq);
-			ndev_ctx->eth_stats.wake_queue++;
-			ret = -ENOSPC;
-		}
 	} else {
 		netdev_err(ndev,
 			   "Unable to send packet pages %u len %u, ret %d\n",
@@ -888,6 +882,15 @@ static inline int netvsc_send_pkt(
 			   ret);
 	}
 
+	if (netif_tx_queue_stopped(txq) &&
+	    atomic_read(&nvchan->queue_sends) < 1 &&
+	    !net_device->tx_disable) {
+		netif_tx_wake_queue(txq);
+		ndev_ctx->eth_stats.wake_queue++;
+		if (ret == -EAGAIN)
+			ret = -ENOSPC;
+	}
+
 	return ret;
 }
 
-- 
1.8.3.1


^ permalink raw reply related

* Hyperv netvsc - regression for 32-PAE kernel
From: Juliana Rodrigueiro @ 2019-05-02 16:14 UTC (permalink / raw)
  To: linux-hyperv@vger.kernel.org; +Cc: netdev

Hi all.

I have a custom linux OS vm running kernel 3.14 (32b with or without PAE) in 
windows 2012 R2. The vm has one Network Adapter and is generation 1. With this 
setup everything runs fine.

The problem started when I tried to update to kernel 4.19. The Synthetic 
network adapter driver does not successfully loads during boot and then the 
machine gets stuck.

If I remove the Network Adapter and add a Legacy one instead, the system runs 
normally. However, this implies an unacceptable performance regression for my 
use case.

I manage to boot the vm with the Network Adapter by adding "hv_netvsc" to the 
blacklist, so I can inspect the system. Manually running "modprobe -v 
hv_netvsc" doesn't show any errors, just the "instmod" for ucs2_string and 
hv_netvsc, and then hangs forever. The "dmesg" output shows the following 
problems:

[  994.830251] hv_netvsc 0969e9e1-1392-4ed6-a230-d5db70c76a3c (unnamed 
net_device) (uninitialized): 0x0 (len 0)
[  994.830306] hv_netvsc 0969e9e1-1392-4ed6-a230-d5db70c76a3c (unnamed 
net_device) (uninitialized): unhandled rndis message (type 0 len 0)
[  994.830435] hv_netvsc 0969e9e1-1392-4ed6-a230-d5db70c76a3c (unnamed 
net_device) (uninitialized): 0x0 (len 0)
[  994.830440] hv_netvsc 0969e9e1-1392-4ed6-a230-d5db70c76a3c (unnamed 
net_device) (uninitialized): unhandled rndis message (type 0 len 0)

The Network Adapter was "Not connected" during these error messages, but when 
connected to a Virtual Switch the errors are the same, except doubled, so I 
would have four "unhandled rndis message".

I tested kernel 4.19 without PAE, the module is loaded without problems and 
those error messages never appear. 

I also tested other stable kernel versions, for example 4.14.114, and this one 
actually works fine with PAE. At this point, it looked like a bisect could 
help me to get to the offending changes and to understand the problem.

So I got to the following commit:

commit 6ba34171bcbd10321c6cf554e0c1144d170f9d1a
Author: Michael Kelley <mikelley@microsoft.com>
Date:   Thu Aug 2 03:08:24 2018 +0000

    Drivers: hv: vmbus: Remove use of slow_virt_to_phys()

    slow_virt_to_phys() is only implemented for arch/x86.
    Remove its use in arch independent Hyper-V drivers, and
    replace with test for vmalloc() address followed by
    appropriate v-to-p function. This follows the typical
    pattern of other drivers and avoids the need to implement
    slow_virt_to_phys() for Hyper-V on ARM64.

    Signed-off-by: Michael Kelley <mikelley@microsoft.com>
    Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

The catch is that slow_virt_to_phys has a special trick implemented in order 
to keep specifically 32-PAE kernel working, it is explained in a comment 
inside the function.

Reverting this commit makes the kernel 4.19 32-bit PAE work again. However I 
believe a better solution might exist.

Comments are very much appreciated.

Cheers!
Julie R.

^ permalink raw reply

* RE: Hyperv netvsc - regression for 32-PAE kernel
From: Michael Kelley @ 2019-05-02 22:23 UTC (permalink / raw)
  To: Juliana Rodrigueiro, linux-hyperv@vger.kernel.org; +Cc: netdev@vger.kernel.org
In-Reply-To: <6166175.oDc9uM0lzg@rocinante.m.i2n>

From: Juliana Rodrigueiro <juliana.rodrigueiro@intra2net.com> Sent: Thursday, May 2, 2019 9:14 AM
> 
> So I got to the following commit:
> 
> commit 6ba34171bcbd10321c6cf554e0c1144d170f9d1a
> Author: Michael Kelley <mikelley@microsoft.com>
> Date:   Thu Aug 2 03:08:24 2018 +0000
> 
>     Drivers: hv: vmbus: Remove use of slow_virt_to_phys()
> 
>     slow_virt_to_phys() is only implemented for arch/x86.
>     Remove its use in arch independent Hyper-V drivers, and
>     replace with test for vmalloc() address followed by
>     appropriate v-to-p function. This follows the typical
>     pattern of other drivers and avoids the need to implement
>     slow_virt_to_phys() for Hyper-V on ARM64.
> 
>     Signed-off-by: Michael Kelley <mikelley@microsoft.com>
>     Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> 
> The catch is that slow_virt_to_phys has a special trick implemented in order
> to keep specifically 32-PAE kernel working, it is explained in a comment
> inside the function.
> 
> Reverting this commit makes the kernel 4.19 32-bit PAE work again. However I
> believe a better solution might exist.
> 
> Comments are very much appreciated.
> 

Julie -- thanks for tracking down the cause of the issue.  I'll try to
look at this tomorrow and propose a solution.

Michael Kelley


^ permalink raw reply

* RE: Hyperv netvsc - regression for 32-PAE kernel
From: Dexuan Cui @ 2019-05-03 19:58 UTC (permalink / raw)
  To: Michael Kelley, Juliana Rodrigueiro, linux-hyperv@vger.kernel.org
  Cc: netdev@vger.kernel.org
In-Reply-To: <DM5PR2101MB091875296619F1518C109E71D7340@DM5PR2101MB0918.namprd21.prod.outlook.com>

> From: linux-hyperv-owner@vger.kernel.org
> <linux-hyperv-owner@vger.kernel.org> On Behalf Of Michael Kelley
> Sent: Thursday, May 2, 2019 3:24 PM
> To: Juliana Rodrigueiro <juliana.rodrigueiro@intra2net.com>;
> linux-hyperv@vger.kernel.org
> Cc: netdev@vger.kernel.org
> Subject: RE: Hyperv netvsc - regression for 32-PAE kernel
> 
> From: Juliana Rodrigueiro <juliana.rodrigueiro@intra2net.com> Sent: Thursday,
> May 2, 2019 9:14 AM
> >
> > So I got to the following commit:
> >
> > commit 6ba34171bcbd10321c6cf554e0c1144d170f9d1a
> > Author: Michael Kelley <mikelley@microsoft.com>
> > Date:   Thu Aug 2 03:08:24 2018 +0000
> >
> >     Drivers: hv: vmbus: Remove use of slow_virt_to_phys()
> >
> >     slow_virt_to_phys() is only implemented for arch/x86.
> >     Remove its use in arch independent Hyper-V drivers, and
> >     replace with test for vmalloc() address followed by
> >     appropriate v-to-p function. This follows the typical
> >     pattern of other drivers and avoids the need to implement
> >     slow_virt_to_phys() for Hyper-V on ARM64.
> >
> >     Signed-off-by: Michael Kelley <mikelley@microsoft.com>
> >     Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
> >     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> >
> > The catch is that slow_virt_to_phys has a special trick implemented in order
> > to keep specifically 32-PAE kernel working, it is explained in a comment
> > inside the function.
> >
> > Reverting this commit makes the kernel 4.19 32-bit PAE work again. However
> I
> > believe a better solution might exist.
> >
> > Comments are very much appreciated.
> >
> 
> Julie -- thanks for tracking down the cause of the issue.  I'll try to
> look at this tomorrow and propose a solution.
> 
> Michael Kelley

Hi Juliana,
Can you please try the below one-line patch? 

It should fix the issue.

Thanks,
-- Dexuan

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index 23381c4..aaaee5f 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -38,7 +38,7 @@

 static unsigned long virt_to_hvpfn(void *addr)
 {
-       unsigned long paddr;
+       phys_addr_t paddr;

        if (is_vmalloc_addr(addr))
                paddr = page_to_phys(vmalloc_to_page(addr)) +



^ permalink raw reply related

* Re: [PATCH hyperv-fixes] hv_netvsc: fix race that may miss tx queue wakeup
From: David Miller @ 2019-05-04  3:50 UTC (permalink / raw)
  To: haiyangz
  Cc: sashal, linux-hyperv, netdev, kys, sthemmin, olaf, vkuznets,
	linux-kernel
In-Reply-To: <1556652525-83155-1-git-send-email-haiyangz@microsoft.com>

From: Haiyang Zhang <haiyangz@microsoft.com>
Date: Tue, 30 Apr 2019 19:29:07 +0000

> When the ring buffer is almost full due to RX completion messages, a
> TX packet may reach the "low watermark" and cause the queue stopped.
> If the TX completion arrives earlier than queue stopping, the wakeup
> may be missed.
> 
> This patch moves the check for the last pending packet to cover both
> EAGAIN and success cases, so the queue will be reliably waked up when
> necessary.
> 
> Reported-and-tested-by: Stephan Klein <stephan.klein@wegfinder.at>
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>

Applied, thanks.

^ permalink raw reply

* [GIT PULL] Hyper-V commits for 5.2
From: Sasha Levin @ 2019-05-06  3:31 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, linux-hyperv, kys, haiyangz, sthemmin, linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

The following changes since commit 9e98c678c2d6ae3a17cb2de55d17f69dddaa231b:

  Linux 5.1-rc1 (2019-03-17 14:22:26 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git tags/hyperv-next-signed

for you to fetch changes up to a3fb7bf369efa296c1fc68aead1b6fb3c735573b:

  drivers: input: serio: Add a module desription to the hyperv_keyboard driver (2019-04-23 15:41:40 -0400)

- ----------------------------------------------------------------
Adding module description to various hyper-v modules

- ----------------------------------------------------------------
Joseph Salisbury (3):
      drivers: hid: Add a module description line to the hid_hyperv driver
      drivers: hv: Add a module description line to the hv_vmbus driver
      drivers: input: serio: Add a module desription to the hyperv_keyboard driver

 drivers/hid/hid-hyperv.c              | 2 ++
 drivers/hv/vmbus_drv.c                | 1 +
 drivers/input/serio/hyperv-keyboard.c | 2 ++
 3 files changed, 5 insertions(+)
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEE4n5dijQDou9mhzu83qZv95d3LNwFAlzPqkQACgkQ3qZv95d3
LNwABg//b9MataFZTizTER4WnV5+PWOeK+gEh5nlpM63vgzkx9G3WYT7XydhdoT4
vaw77zNEMHChyxIEn6s4BLpOk8nAP3L8VyvgxUDSemmjV9/FCDz4D98IoFMeWuWO
97HER2DUC9r/RejYz7qb2lGMesIkAFZkYAFdcg8coswP1WWS5dM0CnyTFb7wjvFu
MVRHCH3W6i55sZVid2YE2qm2gmza8wIJYkggoTUwKeRcOWyz1s9VNknpTi/8BaeK
0dsnrvfw7jdYL89QKkU4J4n8EycyNopibK2d35kj/6KiMrUs91YXTlncdS68WZ+t
8OhLs7Vk4XV5yxT+LApX0teON7NI5irVzrLNBPkNUcBRwRWD0oJhHCfy4p5ulEQ2
yKWEHJ4noZNGeJvrjEuhEn+mGKxjk8zvxY39qXUddWT2MXyjVyUtyxykDlYFLd+H
b3a3qDIrR/QM5jTmA/xoxjwFoqJnjcMakdz2kZAVqeNVHEi20lQrZJFa1RBw4dD/
hw1SOZUI6thDn4QCTPVJiva4w+zEqCocm1+qwoFevMig0LSH7+IymBde5S7heha6
Hj0BTKKsQndhD/utA8HWpBebfmZSqA0vK8SSfIcPLlAxU9QVeiYU/0qgLEpqFN5/
LCmbe0THf4RdryqJWjZbLe8qRzrhNEetSQibEPY+QuiiOeZJBjw=
=vWC7
-----END PGP SIGNATURE-----

^ permalink raw reply

* Re: Hyperv netvsc - regression for 32-PAE kernel
From: Juliana Rodrigueiro @ 2019-05-06 10:55 UTC (permalink / raw)
  To: Dexuan Cui; +Cc: linux-hyperv
In-Reply-To: <PU1P153MB01698936BF3332FCBF64D65DBF350@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>

Hi Dexuan.

> Can you please try the below one-line patch?

Nice, easy one liner and it works well for me.

I hope this patch will be applied.

Thank you!

Julie R.





^ permalink raw reply

* RE: Hyperv netvsc - regression for 32-PAE kernel
From: Dexuan Cui @ 2019-05-06 15:58 UTC (permalink / raw)
  To: Juliana Rodrigueiro; +Cc: linux-hyperv@vger.kernel.org, Michael Kelley
In-Reply-To: <5083893.yzDQlZqgr7@rocinante.m.i2n>

> From: Juliana Rodrigueiro <juliana.rodrigueiro@intra2net.com>
> Sent: Monday, May 6, 2019 3:56 AM
> To: Dexuan Cui <decui@microsoft.com>
> Cc: linux-hyperv@vger.kernel.org
> Subject: Re: Hyperv netvsc - regression for 32-PAE kernel
> 
> Hi Dexuan.
> 
> > Can you please try the below one-line patch?
> 
> Nice, easy one liner and it works well for me.
> 
> I hope this patch will be applied.
> 
> Thank you!
> 
> Julie R.

Hi Julie,

Thanks for testing the patch!

I'm going to formally send the patch to the list.

Can I add a "Reported-and-tested-by" tag for you?

Thanks,
-- Dexuan

^ permalink raw reply

* [PATCH] Drivers: hv: vmbus: Fix virt_to_hvpfn() for X86_PAE
From: Dexuan Cui @ 2019-05-07  7:46 UTC (permalink / raw)
  To: linux-hyperv@vger.kernel.org, gregkh@linuxfoundation.org,
	Stephen Hemminger, sashal@kernel.org, Sasha Levin, Haiyang Zhang,
	KY Srinivasan, devel@linuxdriverproject.org, Michael Kelley,
	juliana.rodrigueiro@intra2net.com
  Cc: linux-kernel@vger.kernel.org, marcelo.cerri@canonical.com,
	apw@canonical.com, olaf@aepfle.de, vkuznets, jasowang@redhat.com,
	Dexuan Cui, stable@vger.kernel.org

In the case of X86_PAE, unsigned long is u32, but the physical address type
should be u64. Due to the bug here, the netvsc driver can not load
successfully, and sometimes the VM can panic due to memory corruption (the
hypervisor writes data to the wrong location).

Fixes: 6ba34171bcbd ("Drivers: hv: vmbus: Remove use of slow_virt_to_phys()")
Cc: stable@vger.kernel.org
Cc: Michael Kelley <mikelley@microsoft.com>
Reported-and-tested-by: Juliana Rodrigueiro <juliana.rodrigueiro@intra2net.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
 drivers/hv/channel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index 23381c41d087..aaaee5f93193 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -38,7 +38,7 @@
 
 static unsigned long virt_to_hvpfn(void *addr)
 {
-	unsigned long paddr;
+	phys_addr_t paddr;
 
 	if (is_vmalloc_addr(addr))
 		paddr = page_to_phys(vmalloc_to_page(addr)) +
-- 
2.17.1


^ permalink raw reply related

* RE: [PATCH] Drivers: hv: vmbus: Fix virt_to_hvpfn() for X86_PAE
From: Michael Kelley @ 2019-05-07 12:51 UTC (permalink / raw)
  To: Dexuan Cui, linux-hyperv@vger.kernel.org,
	gregkh@linuxfoundation.org, Stephen Hemminger, sashal@kernel.org,
	Sasha Levin, Haiyang Zhang, KY Srinivasan,
	devel@linuxdriverproject.org, juliana.rodrigueiro@intra2net.com
  Cc: linux-kernel@vger.kernel.org, marcelo.cerri@canonical.com,
	apw@canonical.com, olaf@aepfle.de, vkuznets, jasowang@redhat.com,
	stable@vger.kernel.org
In-Reply-To: <1557215147-89776-1-git-send-email-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com> Sent: Tuesday, May 7, 2019 12:47 AM
> 
> In the case of X86_PAE, unsigned long is u32, but the physical address type
> should be u64. Due to the bug here, the netvsc driver can not load
> successfully, and sometimes the VM can panic due to memory corruption (the
> hypervisor writes data to the wrong location).
> 
> Fixes: 6ba34171bcbd ("Drivers: hv: vmbus: Remove use of slow_virt_to_phys()")
> Cc: stable@vger.kernel.org
> Cc: Michael Kelley <mikelley@microsoft.com>
> Reported-and-tested-by: Juliana Rodrigueiro <juliana.rodrigueiro@intra2net.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>

Reviewed-by:  Michael Kelley <mikelley@microsoft.com>

^ permalink raw reply

* Re: [PATCH 2/6] x86: hv: hv_init.c: Replace alloc_page() with kmem_cache_alloc()
From: Maya Nakamura @ 2019-05-08  6:46 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: mikelley, kys, haiyangz, sthemmin, sashal, x86, linux-hyperv,
	linux-kernel
In-Reply-To: <87mukvfynk.fsf@vitty.brq.redhat.com>

On Fri, Apr 12, 2019 at 09:52:47AM +0200, Vitaly Kuznetsov wrote:
> Maya Nakamura <m.maya.nakamura@gmail.com> writes:
> 
> > On Fri, Apr 05, 2019 at 01:31:02PM +0200, Vitaly Kuznetsov wrote:
> >> Maya Nakamura <m.maya.nakamura@gmail.com> writes:
> >> 
> >> > @@ -98,18 +99,20 @@ EXPORT_SYMBOL_GPL(hyperv_pcpu_input_arg);
> >> >  u32 hv_max_vp_index;
> >> >  EXPORT_SYMBOL_GPL(hv_max_vp_index);
> >> >  
> >> > +struct kmem_cache *cachep;
> >> > +EXPORT_SYMBOL_GPL(cachep);
> >> > +
> >> >  static int hv_cpu_init(unsigned int cpu)
> >> >  {
> >> >  	u64 msr_vp_index;
> >> >  	struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
> >> >  	void **input_arg;
> >> > -	struct page *pg;
> >> >  
> >> >  	input_arg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
> >> > -	pg = alloc_page(GFP_KERNEL);
> >> > -	if (unlikely(!pg))
> >> > +	*input_arg = kmem_cache_alloc(cachep, GFP_KERNEL);
> >> 
> >> I'm not sure use of kmem_cache is justified here: pages we allocate are
> >> not cache-line and all these allocations are supposed to persist for the
> >> lifetime of the guest. In case you think that even on x86 it will be
> >> possible to see PAGE_SIZE != HV_HYP_PAGE_SIZE you can use alloc_pages()
> >> instead.
> >> 
> > Thank you for your feedback, Vitaly!
> >
> > Will you please tell me how cache-line relates to kmem_cache?
> >
> > I understand that alloc_pages() would work when PAGE_SIZE <=
> > HV_HYP_PAGE_SIZE, but I think that it would not work if PAGE_SIZE >
> > HV_HYP_PAGE_SIZE.
> 
> Sorry, my bad: I meant to say "not cache-like" (these allocations are
> not 'cache') but the typo made it completely incomprehensible. 
 
No worries! Thank you for sharing your thoughts with me, Vitaly.

Do you know of any alternatives to kmem_cache that can allocate memory
in a specified size (different than a guest page size) with alignment?
Memory allocated by alloc_page() is aligned but limited to the guest
page size, and kmalloc() can allocate a particular size but it seems
that it does not guarantee alignment. I am asking this while considering
the changes for architecture independent code.

> >> Also, in case the idea is to generalize stuff, what will happen if
> >> PAGE_SIZE > HV_HYP_PAGE_SIZE? Who will guarantee proper alignment?
> >> 
> >> I think we can leave hypercall arguments, vp_assist and similar pages
> >> alone for now: the code is not going to be shared among architectures
> >> anyways.
> >> 
> > About the alignment, kmem_cache_create() aligns memory with its third
> > parameter, offset.
> 
> Yes, I know, I was trying to think about a (hypothetical) situation when
> page sizes differ: what would be the memory alignment requirements from
> the hypervisor for e.g. hypercall arguments? In case it's always
> HV_HYP_PAGE_SIZE we're good but could it be PAGE_SIZE (for e.g. TLB
> flush hypercall)? I don't know. For x86 this discussion probably makes
> no sense. I'm, however, struggling to understand what benefit we will
> get from the change. Maybe just leave it as-is for now and fix
> arch-independent code only? And later, if we decide to generalize this
> code, make another approach? (Not insisting, just a suggestion)

Thank you for the suggestion, Vitaly!

The introduction of HV_HYP_PAGE_SIZE is weighing the assumption of the
future page size—it can be bigger based on the general trend, not
smaller, which is a reasonable assumption, I think.

> >> > @@ -338,7 +349,10 @@ void __init hyperv_init(void)
> >> >  	guest_id = generate_guest_id(0, LINUX_VERSION_CODE, 0);
> >> >  	wrmsrl(HV_X64_MSR_GUEST_OS_ID, guest_id);
> >> >  
> >> > -	hv_hypercall_pg  = __vmalloc(PAGE_SIZE, GFP_KERNEL, PAGE_KERNEL_RX);
> >> > +	hv_hypercall_pg = kmem_cache_alloc(cachep, GFP_KERNEL);
> >> > +	if (hv_hypercall_pg)
> >> > +		set_memory_x((unsigned long)hv_hypercall_pg, 1);
> >> 
> >> _RX is not writeable, right?
> >> 
> > Yes, you are correct. I should use set_memory_ro() in addition to
> > set_memory_x().
> >
> >> > @@ -416,6 +431,7 @@ void hyperv_cleanup(void)
> >> >  	 * let hypercall operations fail safely rather than
> >> >  	 * panic the kernel for using invalid hypercall page
> >> >  	 */
> >> > +	kmem_cache_free(cachep, hv_hypercall_pg);
> >> 
> >> Please don't do that: hyperv_cleanup() is called on kexec/kdump and
> >> we're trying to do the bare minimum to allow next kernel to boot. Doing
> >> excessive work here will likely lead to consequent problems (we're
> >> already crashing the case it's kdump!).
> >> 
> > Thank you for the explanation! I will remove that.
> >
> 
> -- 
> Vitaly

^ permalink raw reply

* Re: [PATCH 2/6] x86: hv: hv_init.c: Replace alloc_page() with kmem_cache_alloc()
From: Vitaly Kuznetsov @ 2019-05-08 14:54 UTC (permalink / raw)
  To: Maya Nakamura
  Cc: mikelley, kys, haiyangz, sthemmin, sashal, x86, linux-hyperv,
	linux-kernel
In-Reply-To: <20190508064559.GA54416@maya190131.isni1t2eisqetojrdim5hhf1se.xx.internal.cloudapp.net>

Maya Nakamura <m.maya.nakamura@gmail.com> writes:

> On Fri, Apr 12, 2019 at 09:52:47AM +0200, Vitaly Kuznetsov wrote:
>> Maya Nakamura <m.maya.nakamura@gmail.com> writes:
>> 
>> > On Fri, Apr 05, 2019 at 01:31:02PM +0200, Vitaly Kuznetsov wrote:
>> >> Maya Nakamura <m.maya.nakamura@gmail.com> writes:
>> >> 
>> >> > @@ -98,18 +99,20 @@ EXPORT_SYMBOL_GPL(hyperv_pcpu_input_arg);
>> >> >  u32 hv_max_vp_index;
>> >> >  EXPORT_SYMBOL_GPL(hv_max_vp_index);
>> >> >  
>> >> > +struct kmem_cache *cachep;
>> >> > +EXPORT_SYMBOL_GPL(cachep);
>> >> > +
>> >> >  static int hv_cpu_init(unsigned int cpu)
>> >> >  {
>> >> >  	u64 msr_vp_index;
>> >> >  	struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
>> >> >  	void **input_arg;
>> >> > -	struct page *pg;
>> >> >  
>> >> >  	input_arg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
>> >> > -	pg = alloc_page(GFP_KERNEL);
>> >> > -	if (unlikely(!pg))
>> >> > +	*input_arg = kmem_cache_alloc(cachep, GFP_KERNEL);
>> >> 
>> >> I'm not sure use of kmem_cache is justified here: pages we allocate are
>> >> not cache-line and all these allocations are supposed to persist for the
>> >> lifetime of the guest. In case you think that even on x86 it will be
>> >> possible to see PAGE_SIZE != HV_HYP_PAGE_SIZE you can use alloc_pages()
>> >> instead.
>> >> 
>> > Thank you for your feedback, Vitaly!
>> >
>> > Will you please tell me how cache-line relates to kmem_cache?
>> >
>> > I understand that alloc_pages() would work when PAGE_SIZE <=
>> > HV_HYP_PAGE_SIZE, but I think that it would not work if PAGE_SIZE >
>> > HV_HYP_PAGE_SIZE.
>> 
>> Sorry, my bad: I meant to say "not cache-like" (these allocations are
>> not 'cache') but the typo made it completely incomprehensible. 
>  
> No worries! Thank you for sharing your thoughts with me, Vitaly.
>
> Do you know of any alternatives to kmem_cache that can allocate memory
> in a specified size (different than a guest page size) with alignment?
> Memory allocated by alloc_page() is aligned but limited to the guest
> page size, and kmalloc() can allocate a particular size but it seems
> that it does not guarantee alignment. I am asking this while considering
> the changes for architecture independent code.
>

I think we can consider these allocations being DMA-like (because
Hypervisor accesses this memory too) so you can probably take a look at
dma_pool_create()/dma_pool_alloc() and friends.

>> >> Also, in case the idea is to generalize stuff, what will happen if
>> >> PAGE_SIZE > HV_HYP_PAGE_SIZE? Who will guarantee proper alignment?
>> >> 
>> >> I think we can leave hypercall arguments, vp_assist and similar pages
>> >> alone for now: the code is not going to be shared among architectures
>> >> anyways.
>> >> 
>> > About the alignment, kmem_cache_create() aligns memory with its third
>> > parameter, offset.
>> 
>> Yes, I know, I was trying to think about a (hypothetical) situation when
>> page sizes differ: what would be the memory alignment requirements from
>> the hypervisor for e.g. hypercall arguments? In case it's always
>> HV_HYP_PAGE_SIZE we're good but could it be PAGE_SIZE (for e.g. TLB
>> flush hypercall)? I don't know. For x86 this discussion probably makes
>> no sense. I'm, however, struggling to understand what benefit we will
>> get from the change. Maybe just leave it as-is for now and fix
>> arch-independent code only? And later, if we decide to generalize this
>> code, make another approach? (Not insisting, just a suggestion)
>
> Thank you for the suggestion, Vitaly!
>
> The introduction of HV_HYP_PAGE_SIZE is weighing the assumption of the
> future page size—it can be bigger based on the general trend, not
> smaller, which is a reasonable assumption, I think.
>

Let's spell it out (as BUILD_BUG_ON(HV_HYP_PAGE_SIZE < PAGE_SIZE) or
something like that) then to make it clear.

-- 
Vitaly

^ permalink raw reply

* Re: [PATCH 2/6] x86: hv: hv_init.c: Replace alloc_page() with kmem_cache_alloc()
From: Vitaly Kuznetsov @ 2019-05-08 19:53 UTC (permalink / raw)
  To: Stephen Hemminger, m.maya.nakamura
  Cc: Michael Kelley, KY Srinivasan, Haiyang Zhang, sashal@kernel.org,
	x86@kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <BYAPR21MB1317AC7CA4B242106FCAD698CC320@BYAPR21MB1317.namprd21.prod.outlook.com>

Stephen Hemminger <sthemmin@microsoft.com> writes:

> I would worry that kmem_cache_alloc does not currently have same alignment constraints.
> See discussion here:
> https://lwn.net/SubscriberLink/787740/a886fe4ea6681322/

I think it even was me who reported this bug with XFS originally :-) 

Yes, plain kmalloc() doesn't give you alignment guarantees (it is very
easy to prove, e.g. with CONFIG_KASAN), however, kmem_cache_create() (and
dma_pool_create() to that matter) has explicit 'align' parameter and it
is a bug if it is not respected.

-- 
Vitaly

^ permalink raw reply

* [PATCH] hv_sock: Fix data loss upon socket close
From: Sunil Muthuswamy @ 2019-05-08 23:10 UTC (permalink / raw)
  To: KY Srinivasan, Haiyang Zhang, Stephen Hemminger, Sasha Levin,
	David S. Miller, Dexuan Cui, Michael Kelley
  Cc: netdev@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org

Currently, when a hvsock socket is closed, the socket is shutdown and
immediately a RST is sent. There is no wait for the FIN packet to arrive
from the other end. This can lead to data loss since the connection is
terminated abruptly. This can manifest easily in cases of a fast guest
hvsock writer and a much slower host hvsock reader. Essentially hvsock is
not following the proper STREAM(TCP) closing handshake mechanism.

The fix involves adding support for the delayed close of hvsock, which is
in-line with other socket providers such as virtio. While closing, the
socket waits for a constant timeout, for the FIN packet to arrive from the
other end. On timeout, it will terminate the connection (i.e a RST).

Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
---
 net/vmw_vsock/hyperv_transport.c | 122 ++++++++++++++++++++++++++++-----------
 1 file changed, 87 insertions(+), 35 deletions(-)

diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index a827547..62b986d 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -35,6 +35,9 @@
 /* The MTU is 16KB per the host side's design */
 #define HVS_MTU_SIZE		(1024 * 16)
 
+/* How long to wait for graceful shutdown of a connection */
+#define HVS_CLOSE_TIMEOUT (8 * HZ)
+
 struct vmpipe_proto_header {
 	u32 pkt_type;
 	u32 data_size;
@@ -305,19 +308,33 @@ static void hvs_channel_cb(void *ctx)
 		sk->sk_write_space(sk);
 }
 
-static void hvs_close_connection(struct vmbus_channel *chan)
+static void hvs_do_close_lock_held(struct vsock_sock *vsk,
+				   bool cancel_timeout)
 {
-	struct sock *sk = get_per_channel_state(chan);
-	struct vsock_sock *vsk = vsock_sk(sk);
-
-	lock_sock(sk);
+	struct sock *sk = sk_vsock(vsk);
 
-	sk->sk_state = TCP_CLOSE;
 	sock_set_flag(sk, SOCK_DONE);
-	vsk->peer_shutdown |= SEND_SHUTDOWN | RCV_SHUTDOWN;
-
+	vsk->peer_shutdown = SHUTDOWN_MASK;
+	if (vsock_stream_has_data(vsk) <= 0)
+		sk->sk_state = TCP_CLOSING;
 	sk->sk_state_change(sk);
+	if (vsk->close_work_scheduled &&
+	    (!cancel_timeout || cancel_delayed_work(&vsk->close_work))) {
+		vsk->close_work_scheduled = false;
+		vsock_remove_sock(vsk);
+
+		/* Release the reference taken while scheduling the timeout */
+		sock_put(sk);
+	}
+}
+
+/* Equivalent of a RST */
+static void hvs_close_connection(struct vmbus_channel *chan)
+{
+	struct sock *sk = get_per_channel_state(chan);
 
+	lock_sock(sk);
+	hvs_do_close_lock_held(vsock_sk(sk), true);
 	release_sock(sk);
 }
 
@@ -452,50 +469,80 @@ static int hvs_connect(struct vsock_sock *vsk)
 	return vmbus_send_tl_connect_request(&h->vm_srv_id, &h->host_srv_id);
 }
 
+static inline void hvs_shutdown_lock_held(struct hvsock *hvs, int mode)
+{
+	struct vmpipe_proto_header hdr;
+
+	if (hvs->fin_sent || !hvs->chan)
+		return;
+
+	/* It can't fail: see hvs_channel_writable_bytes(). */
+	(void)hvs_send_data(hvs->chan, (struct hvs_send_buf *)&hdr, 0);
+	hvs->fin_sent = true;
+}
+
 static int hvs_shutdown(struct vsock_sock *vsk, int mode)
 {
 	struct sock *sk = sk_vsock(vsk);
-	struct vmpipe_proto_header hdr;
-	struct hvs_send_buf *send_buf;
-	struct hvsock *hvs;
 
 	if (!(mode & SEND_SHUTDOWN))
 		return 0;
 
 	lock_sock(sk);
+	hvs_shutdown_lock_held(vsk->trans, mode);
+	release_sock(sk);
+	return 0;
+}
 
-	hvs = vsk->trans;
-	if (hvs->fin_sent)
-		goto out;
-
-	send_buf = (struct hvs_send_buf *)&hdr;
+static void hvs_close_timeout(struct work_struct *work)
+{
+	struct vsock_sock *vsk =
+		container_of(work, struct vsock_sock, close_work.work);
+	struct sock *sk = sk_vsock(vsk);
 
-	/* It can't fail: see hvs_channel_writable_bytes(). */
-	(void)hvs_send_data(hvs->chan, send_buf, 0);
+	sock_hold(sk);
+	lock_sock(sk);
+	if (!sock_flag(sk, SOCK_DONE))
+		hvs_do_close_lock_held(vsk, false);
 
-	hvs->fin_sent = true;
-out:
+	vsk->close_work_scheduled = false;
 	release_sock(sk);
-	return 0;
+	sock_put(sk);
 }
 
-static void hvs_release(struct vsock_sock *vsk)
+/* Returns true, if it is safe to remove socket; false otherwise */
+static bool hvs_close_lock_held(struct vsock_sock *vsk)
 {
 	struct sock *sk = sk_vsock(vsk);
-	struct hvsock *hvs = vsk->trans;
-	struct vmbus_channel *chan;
 
-	lock_sock(sk);
+	if (!(sk->sk_state == TCP_ESTABLISHED ||
+	      sk->sk_state == TCP_CLOSING))
+		return true;
 
-	sk->sk_state = TCP_CLOSING;
-	vsock_remove_sock(vsk);
+	if ((sk->sk_shutdown & SHUTDOWN_MASK) != SHUTDOWN_MASK)
+		hvs_shutdown_lock_held(vsk->trans, SHUTDOWN_MASK);
 
-	release_sock(sk);
+	if (sock_flag(sk, SOCK_DONE))
+		return true;
 
-	chan = hvs->chan;
-	if (chan)
-		hvs_shutdown(vsk, RCV_SHUTDOWN | SEND_SHUTDOWN);
+	/* This reference will be dropped by the delayed close routine */
+	sock_hold(sk);
+	INIT_DELAYED_WORK(&vsk->close_work, hvs_close_timeout);
+	vsk->close_work_scheduled = true;
+	schedule_delayed_work(&vsk->close_work, HVS_CLOSE_TIMEOUT);
+	return false;
+}
 
+static void hvs_release(struct vsock_sock *vsk)
+{
+	struct sock *sk = sk_vsock(vsk);
+	bool remove_sock;
+
+	lock_sock(sk);
+	remove_sock = hvs_close_lock_held(vsk);
+	release_sock(sk);
+	if (remove_sock)
+		vsock_remove_sock(vsk);
 }
 
 static void hvs_destruct(struct vsock_sock *vsk)
@@ -532,10 +579,11 @@ static bool hvs_dgram_allow(u32 cid, u32 port)
 	return false;
 }
 
-static int hvs_update_recv_data(struct hvsock *hvs)
+static int hvs_update_recv_data(struct vsock_sock *vsk)
 {
 	struct hvs_recv_buf *recv_buf;
 	u32 payload_len;
+	struct hvsock *hvs = vsk->trans;
 
 	recv_buf = (struct hvs_recv_buf *)(hvs->recv_desc + 1);
 	payload_len = recv_buf->hdr.data_size;
@@ -543,8 +591,12 @@ static int hvs_update_recv_data(struct hvsock *hvs)
 	if (payload_len > HVS_MTU_SIZE)
 		return -EIO;
 
-	if (payload_len == 0)
+	/* Peer shutdown */
+	if (payload_len == 0) {
+		struct sock *sk = sk_vsock(vsk);
 		hvs->vsk->peer_shutdown |= SEND_SHUTDOWN;
+		sk->sk_state_change(sk);
+	}
 
 	hvs->recv_data_len = payload_len;
 	hvs->recv_data_off = 0;
@@ -566,7 +618,7 @@ static ssize_t hvs_stream_dequeue(struct vsock_sock *vsk, struct msghdr *msg,
 
 	if (need_refill) {
 		hvs->recv_desc = hv_pkt_iter_first(hvs->chan);
-		ret = hvs_update_recv_data(hvs);
+		ret = hvs_update_recv_data(vsk);
 		if (ret)
 			return ret;
 	}
@@ -581,7 +633,7 @@ static ssize_t hvs_stream_dequeue(struct vsock_sock *vsk, struct msghdr *msg,
 	if (hvs->recv_data_len == 0) {
 		hvs->recv_desc = hv_pkt_iter_next(hvs->chan, hvs->recv_desc);
 		if (hvs->recv_desc) {
-			ret = hvs_update_recv_data(hvs);
+			ret = hvs_update_recv_data(vsk);
 			if (ret)
 				return ret;
 		}
-- 
2.7.4


^ permalink raw reply related

* Re: [PATCH] Drivers: hv: vmbus: Fix virt_to_hvpfn() for X86_PAE
From: Sasha Levin @ 2019-05-09  1:06 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Dexuan Cui, linux-hyperv@vger.kernel.org,
	gregkh@linuxfoundation.org, Stephen Hemminger, Sasha Levin,
	Haiyang Zhang, KY Srinivasan, devel@linuxdriverproject.org,
	juliana.rodrigueiro@intra2net.com, linux-kernel@vger.kernel.org,
	marcelo.cerri@canonical.com, apw@canonical.com, olaf@aepfle.de,
	vkuznets, jasowang@redhat.com, stable@vger.kernel.org
In-Reply-To: <DM5PR2101MB09188A7DB0777CD50333F94ED7310@DM5PR2101MB0918.namprd21.prod.outlook.com>

On Tue, May 07, 2019 at 12:51:51PM +0000, Michael Kelley wrote:
>From: Dexuan Cui <decui@microsoft.com> Sent: Tuesday, May 7, 2019 12:47 AM
>>
>> In the case of X86_PAE, unsigned long is u32, but the physical address type
>> should be u64. Due to the bug here, the netvsc driver can not load
>> successfully, and sometimes the VM can panic due to memory corruption (the
>> hypervisor writes data to the wrong location).
>>
>> Fixes: 6ba34171bcbd ("Drivers: hv: vmbus: Remove use of slow_virt_to_phys()")
>> Cc: stable@vger.kernel.org
>> Cc: Michael Kelley <mikelley@microsoft.com>
>> Reported-and-tested-by: Juliana Rodrigueiro <juliana.rodrigueiro@intra2net.com>
>> Signed-off-by: Dexuan Cui <decui@microsoft.com>
>
>Reviewed-by:  Michael Kelley <mikelley@microsoft.com>

Queued for hyperv-fixes, thanks!

--
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH] hv_sock: Fix data loss upon socket close
From: David Miller @ 2019-05-09 20:58 UTC (permalink / raw)
  To: sunilmut
  Cc: kys, haiyangz, sthemmin, sashal, decui, mikelley, netdev,
	linux-hyperv, linux-kernel
In-Reply-To: <BN6PR21MB0465168DEA6CABA910832A5BC0320@BN6PR21MB0465.namprd21.prod.outlook.com>

From: Sunil Muthuswamy <sunilmut@microsoft.com>
Date: Wed, 8 May 2019 23:10:35 +0000

> +static inline void hvs_shutdown_lock_held(struct hvsock *hvs, int mode)

Please do not use the inline keyword in foo.c files, let the compiler decide.

Also, longer term thing, I notice that vsock_remove_socket() is very
inefficient locking-wise.  It takes the table lock to do the placement
test, and takes it again to do the removal.  Might even be racy.

^ permalink raw reply

* RE: [PATCH 2/6] x86: hv: hv_init.c: Replace alloc_page() with kmem_cache_alloc()
From: Vitaly Kuznetsov @ 2019-05-10 13:21 UTC (permalink / raw)
  To: Michael Kelley, m.maya.nakamura
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	sashal@kernel.org, x86@kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <MN2PR21MB1232C6ABA5DAC847C8A910E1D70C0@MN2PR21MB1232.namprd21.prod.outlook.com>

Michael Kelley <mikelley@microsoft.com> writes:

> From: Vitaly Kuznetsov <vkuznets@redhat.com>  Sent: Wednesday, May 8, 2019 7:55 AM
>> >>
>> >> Sorry, my bad: I meant to say "not cache-like" (these allocations are
>> >> not 'cache') but the typo made it completely incomprehensible.
>> >
>> > No worries! Thank you for sharing your thoughts with me, Vitaly.
>> >
>> > Do you know of any alternatives to kmem_cache that can allocate memory
>> > in a specified size (different than a guest page size) with alignment?
>> > Memory allocated by alloc_page() is aligned but limited to the guest
>> > page size, and kmalloc() can allocate a particular size but it seems
>> > that it does not guarantee alignment. I am asking this while considering
>> > the changes for architecture independent code.
>> >
>> 
>> I think we can consider these allocations being DMA-like (because
>> Hypervisor accesses this memory too) so you can probably take a look at
>> dma_pool_create()/dma_pool_alloc() and friends.
>> 
>
> I've taken a look at dma_pool_create(), and it takes a "struct device"
> argument with which the DMA pool will be associated.  That probably
> makes DMA pools a bad choice for this usage.  Pages need to be allocated
> pretty early during boot for Hyper-V communication, and even if the
> device subsystem is initialized early enough to create a fake device,
> such a dependency seems rather dubious.

We can probably use dma_pool_create()/dma_pool_alloc() from vmbus module
but these 'early' allocations may not have a device to bind to indeed.

>
> kmem_cache_create/alloc() seems like the only choice to get
> guaranteed alignment.  Do you see any actual problem with
> using kmem_cache_*, other than the naming?  It seems like these
> kmem_cache_*  functions really just act as a sub-allocator for
> known-size allocations, and "cache" is a common usage
> pattern, but not necessarily the only usage pattern.

Yes, it's basically the name - it makes it harder to read the code and
some future refactoring of kmem_cache_* may not take our use-case into
account (as we're misusing the API). We can try renaming it to something
generic of course and see what -mm people have to say :-)

-- 
Vitaly

^ permalink raw reply

* RE: [PATCH 2/6] x86: hv: hv_init.c: Replace alloc_page() with kmem_cache_alloc()
From: Vitaly Kuznetsov @ 2019-05-10 17:45 UTC (permalink / raw)
  To: Michael Kelley, m.maya.nakamura
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	sashal@kernel.org, x86@kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <BYAPR21MB1221962ED2DD7FEE19E7DAB6D70C0@BYAPR21MB1221.namprd21.prod.outlook.com>

Michael Kelley <mikelley@microsoft.com> writes:

> From: Vitaly Kuznetsov <vkuznets@redhat.com>  Sent: Friday, May 10, 2019 6:22 AM
>> >>
>> >> I think we can consider these allocations being DMA-like (because
>> >> Hypervisor accesses this memory too) so you can probably take a look at
>> >> dma_pool_create()/dma_pool_alloc() and friends.
>> >>
>> >
>> > I've taken a look at dma_pool_create(), and it takes a "struct device"
>> > argument with which the DMA pool will be associated.  That probably
>> > makes DMA pools a bad choice for this usage.  Pages need to be allocated
>> > pretty early during boot for Hyper-V communication, and even if the
>> > device subsystem is initialized early enough to create a fake device,
>> > such a dependency seems rather dubious.
>> 
>> We can probably use dma_pool_create()/dma_pool_alloc() from vmbus module
>> but these 'early' allocations may not have a device to bind to indeed.
>> 
>> >
>> > kmem_cache_create/alloc() seems like the only choice to get
>> > guaranteed alignment.  Do you see any actual problem with
>> > using kmem_cache_*, other than the naming?  It seems like these
>> > kmem_cache_*  functions really just act as a sub-allocator for
>> > known-size allocations, and "cache" is a common usage
>> > pattern, but not necessarily the only usage pattern.
>> 
>> Yes, it's basically the name - it makes it harder to read the code and
>> some future refactoring of kmem_cache_* may not take our use-case into
>> account (as we're misusing the API). We can try renaming it to something
>> generic of course and see what -mm people have to say :-)
>> 
>
> This makes me think of creating Hyper-V specific alloc/free functions
> that wrap whatever the backend allocator actually is.  So we have
> hv_alloc_hyperv_page() and hv_free_hyperv_page().  That makes the
> code very readable and the intent is super clear.
>
> As for the backend allocator, an alternative is to write our own simple
> allocator.  It maintains a single free list.  If hv_alloc_hyperv_page() is
> called, and the free list is empty, do alloc_page() and break it up into
> Hyper-V sized pages to replenish the free list.  (On x86, these end up
> being 1-for-1 operations.)  hv_free_hyperv_page() just puts the Hyper-V
> page back on the free list.  Don't worry trying to combine and do
> free_page() since there's very little free'ing done anyway.  And I'm
> assuming GPF_KERNEL is all we need.
>
> If in the future Linux provides an alternate general-purpose allocator
> that guarantees alignment, we can ditch the simple allocator and use
> the new mechanism with some simple code changes in one place.
>
> Thoughts?
>

+1 for adding wrappers and if the allocator turns out to be more-or-less
trivial I think we can live with that for the time being.

-- 
Vitaly

^ permalink raw reply

* RE: [PATCH] hv_sock: Fix data loss upon socket close
From: Dexuan Cui @ 2019-05-11  3:56 UTC (permalink / raw)
  To: Sunil Muthuswamy, KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Sasha Levin, David S. Miller, Michael Kelley
  Cc: netdev@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <BN6PR21MB0465168DEA6CABA910832A5BC0320@BN6PR21MB0465.namprd21.prod.outlook.com>

> From: Sunil Muthuswamy <sunilmut@microsoft.com>
> Sent: Wednesday, May 8, 2019 4:11 PM
>
> Currently, when a hvsock socket is closed, the socket is shutdown and
> immediately a RST is sent. There is no wait for the FIN packet to arrive
> from the other end. This can lead to data loss since the connection is
> terminated abruptly. This can manifest easily in cases of a fast guest
> hvsock writer and a much slower host hvsock reader. Essentially hvsock is
> not following the proper STREAM(TCP) closing handshake mechanism.

Hi Sunil,
It looks to me the above description is inaccurate.

In the upstream Linux kernel, closing a hv_sock file descriptor may hang
in vmbus_hvsock_device_unregister() -> msleep(), until the host side of
the connection is closed. This is bad and should be fixed, but I don't think
the current code can cause data loss: when Linux calls hvs_destruct() ->
vmbus_hvsock_device_unregister() -> vmbus_device_unregister() -> ...
-> vmbus_close() to close the channel, Linux knows the host app has
already called close(), and normally that means the host app has
received all the data from the connection.

BTW, technically speaking, in hv_sock there is no RST packet, while there
is indeed a payload_len==0 packet, which is similar to TCP FIN.

I think by saying "a RST is sent" you mean Linux VM closes the channel.

> The fix involves adding support for the delayed close of hvsock, which is
> in-line with other socket providers such as virtio.

With this "delayed close" patch, Linux's close() won't hang until the host
also closes the connection. This is good!

> While closing, the
> socket waits for a constant timeout, for the FIN packet to arrive from the
> other end. On timeout, it will terminate the connection (i.e a RST).

As I mentioned above, I suppose the "RST" means Linux closes the channel.

When Linux closes a connection, the FIN packet is written into the shared
guest-to-host channel ringbuffer immediately, so the host is able to see it
immediately, but the real question is: what if the host kernel and/or host app
can not (timely) receive the data from the ringbuffer, inclding the FIN?

Does the host kernel guarantee it *always* timely fetches/caches all the
data from a connection, even if the host app has not accept()'d the
conection, or the host app is reading from the connection too slowly?

If the host doesn't guarantee that, then even with this patch there is still
a chance Linux can time out, and close the channel before the host
finishes receiving all the data.

I'm curious how Windows guest implements the "async close"?
Does Windows guest also use the same timeout strategy here? If yes,
what's the timeout value used?

> diff --git a/net/vmw_vsock/hyperv_transport.c
> b/net/vmw_vsock/hyperv_transport.c
> index a827547..62b986d 100644

Sorry, I need more time to review the rest of patch. Will try to reply ASAP.

> -static int hvs_update_recv_data(struct hvsock *hvs)
> +static int hvs_update_recv_data(struct vsock_sock *vsk)
>  {
>       struct hvs_recv_buf *recv_buf;
>       u32 payload_len;
> +     struct hvsock *hvs = vsk->trans;
>
>       recv_buf = (struct hvs_recv_buf *)(hvs->recv_desc + 1);
>       payload_len = recv_buf->hdr.data_size;
> @@ -543,8 +591,12 @@ static int hvs_update_recv_data(struct hvsock *hvs)
>       if (payload_len > HVS_MTU_SIZE)
>               return -EIO;
>
> -     if (payload_len == 0)
> +     /* Peer shutdown */
> +     if (payload_len == 0) {
> +             struct sock *sk = sk_vsock(vsk);
>               hvs->vsk->peer_shutdown |= SEND_SHUTDOWN;
> +             sk->sk_state_change(sk);
> +     }

Can you please explain why we need to call this sk->sk_state_change()?

When we call hvs_update_recv_data(), we hold the lock_sock(sk) lock, and we
know there is at least one byte to read. Since we hold the lock, the other
code paths, which normally are also requried to acquire the lock before
checking vsk->peer_shutdown, can not race with us.

Thanks,
-- Dexuan

^ permalink raw reply

* RE: [PATCH] hv_sock: Fix data loss upon socket close
From: Sunil Muthuswamy @ 2019-05-14 16:33 UTC (permalink / raw)
  To: David Miller
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	sashal@kernel.org, Dexuan Cui, Michael Kelley,
	netdev@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20190509.135809.630741953977432246.davem@davemloft.net>



> -----Original Message-----
> From: linux-hyperv-owner@vger.kernel.org <linux-hyperv-owner@vger.kernel.org> On Behalf Of David Miller
> Sent: Thursday, May 9, 2019 1:58 PM
> To: Sunil Muthuswamy <sunilmut@microsoft.com>
> Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>; Stephen Hemminger
> <sthemmin@microsoft.com>; sashal@kernel.org; Dexuan Cui <decui@microsoft.com>; Michael Kelley <mikelley@microsoft.com>;
> netdev@vger.kernel.org; linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] hv_sock: Fix data loss upon socket close
> 
> From: Sunil Muthuswamy <sunilmut@microsoft.com>
> Date: Wed, 8 May 2019 23:10:35 +0000
> 
> > +static inline void hvs_shutdown_lock_held(struct hvsock *hvs, int mode)
> 
> Please do not use the inline keyword in foo.c files, let the compiler decide.
> 
Thanks, will fix in the next version.
> Also, longer term thing, I notice that vsock_remove_socket() is very
> inefficient locking-wise.  It takes the table lock to do the placement
> test, and takes it again to do the removal.  Might even be racy.
Agreed. The check & remove should be done as an atomic operation.
This can be taken up as a separate patch.

^ permalink raw reply

* RE: [PATCH] hv_sock: Fix data loss upon socket close
From: Sunil Muthuswamy @ 2019-05-14 20:40 UTC (permalink / raw)
  To: Dexuan Cui, KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Sasha Levin, David S. Miller, Michael Kelley
  Cc: netdev@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <PU1P153MB01695C88469F32B9ECC7657EBF0D0@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>



> -----Original Message-----
> From: Dexuan Cui <decui@microsoft.com>
> Sent: Friday, May 10, 2019 8:57 PM
> To: Sunil Muthuswamy <sunilmut@microsoft.com>; KY Srinivasan <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>;
> Stephen Hemminger <sthemmin@microsoft.com>; Sasha Levin <sashal@kernel.org>; David S. Miller <davem@davemloft.net>;
> Michael Kelley <mikelley@microsoft.com>
> Cc: netdev@vger.kernel.org; linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: RE: [PATCH] hv_sock: Fix data loss upon socket close
> 
> > From: Sunil Muthuswamy <sunilmut@microsoft.com>
> > Sent: Wednesday, May 8, 2019 4:11 PM
> >
> > Currently, when a hvsock socket is closed, the socket is shutdown and
> > immediately a RST is sent. There is no wait for the FIN packet to arrive
> > from the other end. This can lead to data loss since the connection is
> > terminated abruptly. This can manifest easily in cases of a fast guest
> > hvsock writer and a much slower host hvsock reader. Essentially hvsock is
> > not following the proper STREAM(TCP) closing handshake mechanism.
> 
> Hi Sunil,
> It looks to me the above description is inaccurate.
> 
> In the upstream Linux kernel, closing a hv_sock file descriptor may hang
> in vmbus_hvsock_device_unregister() -> msleep(), until the host side of
> the connection is closed. This is bad and should be fixed, but I don't think
> the current code can cause data loss: when Linux calls hvs_destruct() ->
> vmbus_hvsock_device_unregister() -> vmbus_device_unregister() -> ...
> -> vmbus_close() to close the channel, Linux knows the host app has
> already called close(), and normally that means the host app has
> received all the data from the connection.
> 
> BTW, technically speaking, in hv_sock there is no RST packet, while there
> is indeed a payload_len==0 packet, which is similar to TCP FIN.
> 
> I think by saying "a RST is sent" you mean Linux VM closes the channel.
> 
> > The fix involves adding support for the delayed close of hvsock, which is
> > in-line with other socket providers such as virtio.
> 
> With this "delayed close" patch, Linux's close() won't hang until the host
> also closes the connection. This is good!
> 
The next version of the patch will only focus on implementing the delayed
(or background) close logic. I will update the title and the description of the
next version patch to more accurately reflect the change. 
> > While closing, the
> > socket waits for a constant timeout, for the FIN packet to arrive from the
> > other end. On timeout, it will terminate the connection (i.e a RST).
> 
> As I mentioned above, I suppose the "RST" means Linux closes the channel.
> 
> When Linux closes a connection, the FIN packet is written into the shared
> guest-to-host channel ringbuffer immediately, so the host is able to see it
> immediately, but the real question is: what if the host kernel and/or host app
> can not (timely) receive the data from the ringbuffer, inclding the FIN?
> 
> Does the host kernel guarantee it *always* timely fetches/caches all the
> data from a connection, even if the host app has not accept()'d the
> conection, or the host app is reading from the connection too slowly?
> 
> If the host doesn't guarantee that, then even with this patch there is still
> a chance Linux can time out, and close the channel before the host
> finishes receiving all the data.
> 
TCP protocol does not guarantee that all the data gets delivered, especially
during close. The applications are required to mitigate this by using a
combination of shutdown() and subsequent read() on both client and server
side.
> I'm curious how Windows guest implements the "async close"?
> Does Windows guest also use the same timeout strategy here? If yes,
> what's the timeout value used?
> 
Windows also implements the delayed close logic in a similar fashion. You can
lookup the MSDN article on 'graceful shutdown'.
> > diff --git a/net/vmw_vsock/hyperv_transport.c
> > b/net/vmw_vsock/hyperv_transport.c
> > index a827547..62b986d 100644
> 
> Sorry, I need more time to review the rest of patch. Will try to reply ASAP.
> 
> > -static int hvs_update_recv_data(struct hvsock *hvs)
> > +static int hvs_update_recv_data(struct vsock_sock *vsk)
> >  {
> >  	struct hvs_recv_buf *recv_buf;
> >  	u32 payload_len;
> > +	struct hvsock *hvs = vsk->trans;
> >
> >  	recv_buf = (struct hvs_recv_buf *)(hvs->recv_desc + 1);
> >  	payload_len = recv_buf->hdr.data_size;
> > @@ -543,8 +591,12 @@ static int hvs_update_recv_data(struct hvsock *hvs)
> >  	if (payload_len > HVS_MTU_SIZE)
> >  		return -EIO;
> >
> > -	if (payload_len == 0)
> > +	/* Peer shutdown */
> > +	if (payload_len == 0) {
> > +		struct sock *sk = sk_vsock(vsk);
> >  		hvs->vsk->peer_shutdown |= SEND_SHUTDOWN;
> > +		sk->sk_state_change(sk);
> > +	}
> 
> Can you please explain why we need to call this sk->sk_state_change()?
> 
> When we call hvs_update_recv_data(), we hold the lock_sock(sk) lock, and we
> know there is at least one byte to read. Since we hold the lock, the other
> code paths, which normally are also requried to acquire the lock before
> checking vsk->peer_shutdown, can not race with us.
> 
This was to wakeup any waiters on socket state change. Since the updated
patch now only focuses on adding the delayed close logic, I will remove this in
the next version.
Thanks for the review so far.
> Thanks,
> -- Dexuan

^ permalink raw reply

* [PATCH v2] hv_sock: Add support for delayed close
From: Sunil Muthuswamy @ 2019-05-15  0:56 UTC (permalink / raw)
  To: KY Srinivasan, Haiyang Zhang, Stephen Hemminger, Sasha Levin,
	David S. Miller, Dexuan Cui, Michael Kelley
  Cc: netdev@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org

Currently, hvsock does not implement any delayed or background close
logic. Whenever the hvsock socket is closed, a FIN is sent to the peer, and
the last reference to the socket is dropped, which leads to a call to
.destruct where the socket can hang indefinitely waiting for the peer to
close it's side. The can cause the user application to hang in the close()
call.

This change implements proper STREAM(TCP) closing handshake mechanism by
sending the FIN to the peer and the waiting for the peer's FIN to arrive
for a given timeout. On timeout, it will try to terminate the connection
(i.e. a RST). This is in-line with other socket providers such as virtio.

This change does not address the hang in the vmbus_hvsock_device_unregister
where it waits indefinitely for the host to rescind the channel. That
should be taken up as a separate fix.

Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
---
Changes since v1:
- Updated the title and description to better reflect the change. The title
was previously called 'hv_sock: Fix data loss upon socket close'
- Removed the sk_state_change call to keep the fix focused.
- Removed 'inline' keyword from the .c file and letting compiler do it.

 net/vmw_vsock/hyperv_transport.c | 108 ++++++++++++++++++++++++++++-----------
 1 file changed, 77 insertions(+), 31 deletions(-)

diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index a827547..982a8dc 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -35,6 +35,9 @@
 /* The MTU is 16KB per the host side's design */
 #define HVS_MTU_SIZE		(1024 * 16)
 
+/* How long to wait for graceful shutdown of a connection */
+#define HVS_CLOSE_TIMEOUT (8 * HZ)
+
 struct vmpipe_proto_header {
 	u32 pkt_type;
 	u32 data_size;
@@ -305,19 +308,32 @@ static void hvs_channel_cb(void *ctx)
 		sk->sk_write_space(sk);
 }
 
-static void hvs_close_connection(struct vmbus_channel *chan)
+static void hvs_do_close_lock_held(struct vsock_sock *vsk,
+				   bool cancel_timeout)
 {
-	struct sock *sk = get_per_channel_state(chan);
-	struct vsock_sock *vsk = vsock_sk(sk);
-
-	lock_sock(sk);
+	struct sock *sk = sk_vsock(vsk);
 
-	sk->sk_state = TCP_CLOSE;
 	sock_set_flag(sk, SOCK_DONE);
-	vsk->peer_shutdown |= SEND_SHUTDOWN | RCV_SHUTDOWN;
-
+	vsk->peer_shutdown = SHUTDOWN_MASK;
+	if (vsock_stream_has_data(vsk) <= 0)
+		sk->sk_state = TCP_CLOSING;
 	sk->sk_state_change(sk);
+	if (vsk->close_work_scheduled &&
+	    (!cancel_timeout || cancel_delayed_work(&vsk->close_work))) {
+		vsk->close_work_scheduled = false;
+		vsock_remove_sock(vsk);
 
+		/* Release the reference taken while scheduling the timeout */
+		sock_put(sk);
+	}
+}
+
+static void hvs_close_connection(struct vmbus_channel *chan)
+{
+	struct sock *sk = get_per_channel_state(chan);
+
+	lock_sock(sk);
+	hvs_do_close_lock_held(vsock_sk(sk), true);
 	release_sock(sk);
 }
 
@@ -452,50 +468,80 @@ static int hvs_connect(struct vsock_sock *vsk)
 	return vmbus_send_tl_connect_request(&h->vm_srv_id, &h->host_srv_id);
 }
 
+static void hvs_shutdown_lock_held(struct hvsock *hvs, int mode)
+{
+	struct vmpipe_proto_header hdr;
+
+	if (hvs->fin_sent || !hvs->chan)
+		return;
+
+	/* It can't fail: see hvs_channel_writable_bytes(). */
+	(void)hvs_send_data(hvs->chan, (struct hvs_send_buf *)&hdr, 0);
+	hvs->fin_sent = true;
+}
+
 static int hvs_shutdown(struct vsock_sock *vsk, int mode)
 {
 	struct sock *sk = sk_vsock(vsk);
-	struct vmpipe_proto_header hdr;
-	struct hvs_send_buf *send_buf;
-	struct hvsock *hvs;
 
 	if (!(mode & SEND_SHUTDOWN))
 		return 0;
 
 	lock_sock(sk);
+	hvs_shutdown_lock_held(vsk->trans, mode);
+	release_sock(sk);
+	return 0;
+}
 
-	hvs = vsk->trans;
-	if (hvs->fin_sent)
-		goto out;
-
-	send_buf = (struct hvs_send_buf *)&hdr;
+static void hvs_close_timeout(struct work_struct *work)
+{
+	struct vsock_sock *vsk =
+		container_of(work, struct vsock_sock, close_work.work);
+	struct sock *sk = sk_vsock(vsk);
 
-	/* It can't fail: see hvs_channel_writable_bytes(). */
-	(void)hvs_send_data(hvs->chan, send_buf, 0);
+	sock_hold(sk);
+	lock_sock(sk);
+	if (!sock_flag(sk, SOCK_DONE))
+		hvs_do_close_lock_held(vsk, false);
 
-	hvs->fin_sent = true;
-out:
+	vsk->close_work_scheduled = false;
 	release_sock(sk);
-	return 0;
+	sock_put(sk);
 }
 
-static void hvs_release(struct vsock_sock *vsk)
+/* Returns true, if it is safe to remove socket; false otherwise */
+static bool hvs_close_lock_held(struct vsock_sock *vsk)
 {
 	struct sock *sk = sk_vsock(vsk);
-	struct hvsock *hvs = vsk->trans;
-	struct vmbus_channel *chan;
 
-	lock_sock(sk);
+	if (!(sk->sk_state == TCP_ESTABLISHED ||
+	      sk->sk_state == TCP_CLOSING))
+		return true;
 
-	sk->sk_state = TCP_CLOSING;
-	vsock_remove_sock(vsk);
+	if ((sk->sk_shutdown & SHUTDOWN_MASK) != SHUTDOWN_MASK)
+		hvs_shutdown_lock_held(vsk->trans, SHUTDOWN_MASK);
 
-	release_sock(sk);
+	if (sock_flag(sk, SOCK_DONE))
+		return true;
 
-	chan = hvs->chan;
-	if (chan)
-		hvs_shutdown(vsk, RCV_SHUTDOWN | SEND_SHUTDOWN);
+	/* This reference will be dropped by the delayed close routine */
+	sock_hold(sk);
+	INIT_DELAYED_WORK(&vsk->close_work, hvs_close_timeout);
+	vsk->close_work_scheduled = true;
+	schedule_delayed_work(&vsk->close_work, HVS_CLOSE_TIMEOUT);
+	return false;
+}
 
+static void hvs_release(struct vsock_sock *vsk)
+{
+	struct sock *sk = sk_vsock(vsk);
+	bool remove_sock;
+
+	lock_sock(sk);
+	remove_sock = hvs_close_lock_held(vsk);
+	release_sock(sk);
+	if (remove_sock)
+		vsock_remove_sock(vsk);
 }
 
 static void hvs_destruct(struct vsock_sock *vsk)
-- 
2.7.4


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox