Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* Re: [PATCH net-next,v5, 0/6] Add software backchannel and mlx5e HV VHCA stats
From: Saeed Mahameed @ 2019-08-23  5:29 UTC (permalink / raw)
  To: davem@davemloft.net, haiyangz
  Cc: kys, sthemmin@microsoft.com, lorenzo.pieralisi@arm.com,
	linux-kernel@vger.kernel.org, Eran Ben Elisha,
	netdev@vger.kernel.org, linux-pci@vger.kernel.org,
	leon@kernel.org, sashal@kernel.org, bhelgaas@google.com,
	linux-hyperv@vger.kernel.org
In-Reply-To: <20190822.153912.2269276523787180347.davem@davemloft.net>

On Thu, 2019-08-22 at 15:39 -0700, David Miller wrote:
> From: Haiyang Zhang <haiyangz@microsoft.com>
> Date: Thu, 22 Aug 2019 22:37:13 +0000
> 
> > The v5 is pretty much the same as v4, except Eran had a fix to
> patch #3 in response to
> > Leon Romanovsky <leon@kernel.org>.
> 
> Well you now have to send me a patch relative to v4 in order to fix
> that.
> 
> When I say "applied", the series is in my tree and is therefore
> permanent.
> It is therefore never appropriate to then post a new version of the
> series.

Dave, I think you didn't reply back to v4 that the series was applied.
So that might have created some confusion for Haiyang.



^ permalink raw reply

* Re: [PATCH net-next,v5, 0/6] Add software backchannel and mlx5e HV VHCA stats
From: David Miller @ 2019-08-23  6:11 UTC (permalink / raw)
  To: saeedm
  Cc: haiyangz, kys, sthemmin, lorenzo.pieralisi, linux-kernel, eranbe,
	netdev, linux-pci, leon, sashal, bhelgaas, linux-hyperv
In-Reply-To: <f7a0ce8822e197ace496a348a14ac6939313d8f6.camel@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Fri, 23 Aug 2019 05:29:48 +0000

> On Thu, 2019-08-22 at 15:39 -0700, David Miller wrote:
>> From: Haiyang Zhang <haiyangz@microsoft.com>
>> Date: Thu, 22 Aug 2019 22:37:13 +0000
>> 
>> > The v5 is pretty much the same as v4, except Eran had a fix to
>> patch #3 in response to
>> > Leon Romanovsky <leon@kernel.org>.
>> 
>> Well you now have to send me a patch relative to v4 in order to fix
>> that.
>> 
>> When I say "applied", the series is in my tree and is therefore
>> permanent.
>> It is therefore never appropriate to then post a new version of the
>> series.
> 
> Dave, I think you didn't reply back to v4 that the series was applied.
> So that might have created some confusion for Haiyang.

I thought I did, sorry, my bad.

^ permalink raw reply

* RE: [PATCH v2 1/3] drivers: hv: vmbus: Introduce latency testing
From: Michael Kelley @ 2019-08-23 16:44 UTC (permalink / raw)
  To: brandonbonaby94
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	sashal@kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20190823033850.GA41496@Test-Virtual-Machine>

From: Branden Bonaby <brandonbonaby94@gmail.com> Sent: Thursday, August 22, 2019 8:39 PM
> > > diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> > > index 09829e15d4a0..c9c63a4033cd 100644
> > > --- a/drivers/hv/connection.c
> > > +++ b/drivers/hv/connection.c
> > > @@ -357,6 +357,9 @@ void vmbus_on_event(unsigned long data)
> > >
> > >  	trace_vmbus_on_event(channel);
> > >
> > > +#ifdef CONFIG_HYPERV_TESTING
> > > +	hv_debug_delay_test(channel, INTERRUPT_DELAY);
> > > +#endif /* CONFIG_HYPERV_TESTING */
> >
> > You are following Vitaly's suggestion to use #ifdef's so no code is
> > generated when HYPERV_TESTING is not enabled.  However, this
> > direct approach to using #ifdef's really clutters the code and makes
> > it harder to read and follow.  The better approach is to use the
> > #ifdef in the include file where the functions are defined.  If
> > HYPERV_TESTING is not enabled, provide a #else that defines
> > the function with an empty implementation for which the compiler
> > will generate no code.   An as example, see the function definition
> > for hyperv_init() in arch/x86/include/asm/mshyperv.h.  There are
> > several functions treated similarly in that include file.
> >
> 
> I checked out the code in arch/x86/include/asm/mshyperv.h, after
> thinking about it, I'm wondering if it would be better just to have
> two files one called hv_debugfs.c and the other hyperv_debugfs.h.
> I could put the code definitions in hv_debugfs.c and at the top
> include the hyperv_debugfs.h file which would house the declarations
> of these functions under the ifdef. Then like you alluded too use
> an #else statement that would have the null implementations of the
> above functions. Then put an #include "hyperv_debugfs.h" in the
> hyperv_vmbus.h file. I figured instead of putting the code directly
> into the vmbus_drv.c file it might be best to put them in a seperate
> file like hv_debugfs.c. This way when we start adding more tests we
> don't bloat the vmbus_drv.c file unnecessarily. The hv_debugfs.c
> file would have the #ifdef CONFIG_HYPERV_TESTING at the top so if
> its not enabled  those null implementations in "hyperv_debugfs.h"
> woud kick in anywhere that included the hyperv_vmbus.h file which
> is what we want.
> 
> what do you think?
> 

I'll preface my comments by saying that how code gets structured
into files is always a bit of a judgment call.  The goal is to group code
into sensible chunks to make it easy to understand and to make it
easy to modify and extend later.  The latter is a prediction about the
future, which may or may not be accurate.   For the former, what's
"easy to understand," is often in the eye of the beholder.  So you may
get different opinions from different reviewers.

That said, I like the idea of a separate hv_debugfs.c file to contain
the implementation of the various functions you have added to
provide the fuzzing capability.   I'm less convinced about the value
of a separate hyperv_debugfs.h file.   I think you have one big
#ifdef CONFIG_HYPERV_TESTING followed by the declarations of
the functions in hv_debugfs.c, followed by #else and null
implementations of those functions.  This is 20 lines of code or so,
and could easily go in hyperv_vmbus.h.

For the new hv_debugfs.c, you can avoid the need for
#ifdef CONFIG_HYPERV_TESTING by modifying the Makefile in
drivers/hv so that hv_debugfs.o is built only if CONFIG_HYPERV_TESTING
is defined.  Look at the current Makefile to see how this is done
with CONFIG_HYPERV_UTILS and CONFIG_HYPERV_BALLOON.

Michael

^ permalink raw reply

* Re: [PATCH v2 1/3] drivers: hv: vmbus: Introduce latency testing
From: Branden Bonaby @ 2019-08-23 17:36 UTC (permalink / raw)
  To: Michael Kelley
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	sashal@kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <DM5PR21MB01379300AF2B441D0B53692DD7A40@DM5PR21MB0137.namprd21.prod.outlook.com>

On Fri, Aug 23, 2019 at 04:44:09PM +0000, Michael Kelley wrote:
> From: Branden Bonaby <brandonbonaby94@gmail.com> Sent: Thursday, August 22, 2019 8:39 PM
> > > > diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> > > > index 09829e15d4a0..c9c63a4033cd 100644
> > > > --- a/drivers/hv/connection.c
> > > > +++ b/drivers/hv/connection.c
> > > > @@ -357,6 +357,9 @@ void vmbus_on_event(unsigned long data)
> > > >
> > > >  	trace_vmbus_on_event(channel);
> > > >
> > > > +#ifdef CONFIG_HYPERV_TESTING
> > > > +	hv_debug_delay_test(channel, INTERRUPT_DELAY);
> > > > +#endif /* CONFIG_HYPERV_TESTING */
> > >
> > > You are following Vitaly's suggestion to use #ifdef's so no code is
> > > generated when HYPERV_TESTING is not enabled.  However, this
> > > direct approach to using #ifdef's really clutters the code and makes
> > > it harder to read and follow.  The better approach is to use the
> > > #ifdef in the include file where the functions are defined.  If
> > > HYPERV_TESTING is not enabled, provide a #else that defines
> > > the function with an empty implementation for which the compiler
> > > will generate no code.   An as example, see the function definition
> > > for hyperv_init() in arch/x86/include/asm/mshyperv.h.  There are
> > > several functions treated similarly in that include file.
> > >
> > 
> > I checked out the code in arch/x86/include/asm/mshyperv.h, after
> > thinking about it, I'm wondering if it would be better just to have
> > two files one called hv_debugfs.c and the other hyperv_debugfs.h.
> > I could put the code definitions in hv_debugfs.c and at the top
> > include the hyperv_debugfs.h file which would house the declarations
> > of these functions under the ifdef. Then like you alluded too use
> > an #else statement that would have the null implementations of the
> > above functions. Then put an #include "hyperv_debugfs.h" in the
> > hyperv_vmbus.h file. I figured instead of putting the code directly
> > into the vmbus_drv.c file it might be best to put them in a seperate
> > file like hv_debugfs.c. This way when we start adding more tests we
> > don't bloat the vmbus_drv.c file unnecessarily. The hv_debugfs.c
> > file would have the #ifdef CONFIG_HYPERV_TESTING at the top so if
> > its not enabled  those null implementations in "hyperv_debugfs.h"
> > woud kick in anywhere that included the hyperv_vmbus.h file which
> > is what we want.
> > 
> > what do you think?
> > 
> 
> I'll preface my comments by saying that how code gets structured
> into files is always a bit of a judgment call.  The goal is to group code
> into sensible chunks to make it easy to understand and to make it
> easy to modify and extend later.  The latter is a prediction about the
> future, which may or may not be accurate.   For the former, what's
> "easy to understand," is often in the eye of the beholder.  So you may
> get different opinions from different reviewers.
> 
> That said, I like the idea of a separate hv_debugfs.c file to contain
> the implementation of the various functions you have added to
> provide the fuzzing capability.   I'm less convinced about the value
> of a separate hyperv_debugfs.h file.   I think you have one big
> #ifdef CONFIG_HYPERV_TESTING followed by the declarations of
> the functions in hv_debugfs.c, followed by #else and null
> implementations of those functions.  This is 20 lines of code or so,
> and could easily go in hyperv_vmbus.h.
> 
> For the new hv_debugfs.c, you can avoid the need for
> #ifdef CONFIG_HYPERV_TESTING by modifying the Makefile in
> drivers/hv so that hv_debugfs.o is built only if CONFIG_HYPERV_TESTING
> is defined.  Look at the current Makefile to see how this is done
> with CONFIG_HYPERV_UTILS and CONFIG_HYPERV_BALLOON.
> 
> Michael
>

I see, that does make sense, I'll go ahead and add these changes.

thanks

branden bonaby

^ permalink raw reply

* RE: [PATCH v3 02/12] x86/hyper-v: Implement hv_is_hibernation_supported()
From: Michael Kelley @ 2019-08-23 19:50 UTC (permalink / raw)
  To: Dexuan Cui, linux-hyperv@vger.kernel.org,
	gregkh@linuxfoundation.org, Stephen Hemminger, Sasha Levin,
	sashal@kernel.org, Haiyang Zhang, KY Srinivasan,
	tglx@linutronix.de
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <1566265863-21252-3-git-send-email-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com> Sent: Monday, August 19, 2019 6:52 PM
> 
> When a Linux VM runs on Hyper-V and hibernates, it must disable the
> memory hot-add/remove and balloon up/down capabilities in the hv_balloon
> driver.

I'm unclear on the above statement.  I think the requirement is that
ballooning must not be active when hibernation is initiated.  Is hibernation
blocked in that case?  If not, what happens?

> 
> By default, Hyper-V does not enable the virtual ACPI S4 state for a VM;
> on recent Hyper-V hosts, the administrator is able to enable the virtual
> ACPI S4 state for a VM, so we hope to use the presence of the virtual ACPI

"we hope" sounds very indefinite. :-(    Does ACPI S4 have to be enabled for
hibernation to be initiated?  Goes back to my first question ....

> S4 state as a hint for hv_balloon to disable the aforementioned
> capabilities.
> 
> The new API will be used by hv_balloon.
> 
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  arch/x86/hyperv/hv_init.c      | 7 +++++++
>  include/asm-generic/mshyperv.h | 2 ++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 78e53d9..6735e45 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -7,6 +7,7 @@
>   * Author : K. Y. Srinivasan <kys@microsoft.com>
>   */
> 
> +#include <linux/acpi.h>
>  #include <linux/efi.h>
>  #include <linux/types.h>
>  #include <asm/apic.h>
> @@ -453,3 +454,9 @@ bool hv_is_hyperv_initialized(void)
>  	return hypercall_msr.enable;
>  }
>  EXPORT_SYMBOL_GPL(hv_is_hyperv_initialized);
> +
> +bool hv_is_hibernation_supported(void)
> +{
> +	return acpi_sleep_state_supported(ACPI_STATE_S4);
> +}
> +EXPORT_SYMBOL_GPL(hv_is_hibernation_supported);
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index 0becb7d..1cb4001 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -166,9 +166,11 @@ static inline int cpumask_to_vpset(struct hv_vpset *vpset,
>  void hyperv_report_panic(struct pt_regs *regs, long err);
>  void hyperv_report_panic_msg(phys_addr_t pa, size_t size);
>  bool hv_is_hyperv_initialized(void);
> +bool hv_is_hibernation_supported(void);
>  void hyperv_cleanup(void);
>  #else /* CONFIG_HYPERV */
>  static inline bool hv_is_hyperv_initialized(void) { return false; }
> +static inline bool hv_is_hibernation_supported(void) { return false; }
>  static inline void hyperv_cleanup(void) {}
>  #endif /* CONFIG_HYPERV */
> 
> --
> 1.8.3.1


^ permalink raw reply

* RE: [PATCH v3 06/12] Drivers: hv: vmbus: Add a helper function is_sub_channel()
From: Michael Kelley @ 2019-08-23 19:51 UTC (permalink / raw)
  To: Dexuan Cui, linux-hyperv@vger.kernel.org,
	gregkh@linuxfoundation.org, Stephen Hemminger, Sasha Levin,
	sashal@kernel.org, Haiyang Zhang, KY Srinivasan,
	tglx@linutronix.de
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <1566265863-21252-7-git-send-email-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com> Sent: Monday, August 19, 2019 6:52 PM
> 
> The existing method of telling if a channel is sub-channel in
> vmbus_process_offer() is cumbersome. This new simple helper function
> is preferred in future.
> 
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  include/linux/hyperv.h | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index 6256cc3..2d39248 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -245,7 +245,10 @@ struct vmbus_channel_offer {
>  		} pipe;
>  	} u;
>  	/*
> -	 * The sub_channel_index is defined in win8.
> +	 * The sub_channel_index is defined in Win8: a value of zero means a
> +	 * primary channel and a value of non-zero means a sub-channel.
> +	 *
> +	 * Before Win8, the field is reserved, meaning it's always zero.
>  	 */
>  	u16 sub_channel_index;
>  	u16 reserved3;
> @@ -934,6 +937,11 @@ static inline bool is_hvsock_channel(const struct vmbus_channel
> *c)
>  		  VMBUS_CHANNEL_TLNPI_PROVIDER_OFFER);
>  }
> 
> +static inline bool is_sub_channel(const struct vmbus_channel *c)
> +{
> +	return c->offermsg.offer.sub_channel_index != 0;
> +}
> +
>  static inline void set_channel_affinity_state(struct vmbus_channel *c,
>  					      enum hv_numa_policy policy)
>  {
> --
> 1.8.3.1

Reviewed-by: Michael Kelley <mikelley@microsoft.com>


^ permalink raw reply

* RE: [PATCH v3 08/12] Drivers: hv: vmbus: Ignore the offers when resuming from hibernation
From: Michael Kelley @ 2019-08-23 19:56 UTC (permalink / raw)
  To: Dexuan Cui, linux-hyperv@vger.kernel.org,
	gregkh@linuxfoundation.org, Stephen Hemminger, Sasha Levin,
	sashal@kernel.org, Haiyang Zhang, KY Srinivasan,
	tglx@linutronix.de
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <1566265863-21252-9-git-send-email-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com>  Sent: Monday, August 19, 2019 6:52 PM
> 
> When the VM resumes, the host re-sends the offers. We should not add the
> offers to the global vmbus_connection.chn_list again.
> 
> This patch assumes the RELIDs of the channels don't change across
> hibernation. Actually this is not always true, especially in the case of
> NIC SR-IOV the VF vmbus device's RELID sometimes can change. A later patch
> will address this issue by mapping the new offers to the old channels and
> fixing up the old channels, if necessary.
> 
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/hv/channel_mgmt.c | 29 ++++++++++++++++++++++++++++-
>  drivers/hv/connection.c   | 27 +++++++++++++++++++++++++++
>  drivers/hv/hyperv_vmbus.h |  3 +++
>  3 files changed, 58 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> index addcef5..f7a1184 100644
> --- a/drivers/hv/channel_mgmt.c
> +++ b/drivers/hv/channel_mgmt.c
> @@ -854,12 +854,39 @@ void vmbus_initiate_unload(bool crash)
>  static void vmbus_onoffer(struct vmbus_channel_message_header *hdr)
>  {
>  	struct vmbus_channel_offer_channel *offer;
> -	struct vmbus_channel *newchannel;
> +	struct vmbus_channel *oldchannel, *newchannel;
> +	size_t offer_sz;
> 
>  	offer = (struct vmbus_channel_offer_channel *)hdr;
> 
>  	trace_vmbus_onoffer(offer);
> 
> +	mutex_lock(&vmbus_connection.channel_mutex);
> +	oldchannel = find_primary_channel_by_offer(offer);
> +	mutex_unlock(&vmbus_connection.channel_mutex);
> +
> +	if (oldchannel != NULL) {
> +		atomic_dec(&vmbus_connection.offer_in_progress);
> +
> +		/*
> +		 * We're resuming from hibernation: we expect the host to send
> +		 * exactly the same offers that we had before the hibernation.
> +		 */
> +		offer_sz = sizeof(*offer);
> +		if (memcmp(offer, &oldchannel->offermsg, offer_sz) == 0)
> +			return;
> +
> +		pr_debug("Mismatched offer from the host (relid=%d)\n",
> +			 offer->child_relid);
> +
> +		print_hex_dump_debug("Old vmbus offer: ", DUMP_PREFIX_OFFSET,
> +				     16, 4, &oldchannel->offermsg, offer_sz,
> +				     false);
> +		print_hex_dump_debug("New vmbus offer: ", DUMP_PREFIX_OFFSET,
> +				     16, 4, offer, offer_sz, false);
> +		return;
> +	}
> +
>  	/* Allocate the channel object and save this offer. */
>  	newchannel = alloc_channel();
>  	if (!newchannel) {
> diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> index 09829e1..6c7a983 100644
> --- a/drivers/hv/connection.c
> +++ b/drivers/hv/connection.c
> @@ -337,6 +337,33 @@ struct vmbus_channel *relid2channel(u32 relid)
>  }
> 
>  /*
> + * find_primary_channel_by_offer - Get the channel object given the new offer.
> + * This is only used in the resume path of hibernation.
> + */
> +struct vmbus_channel *
> +find_primary_channel_by_offer(const struct vmbus_channel_offer_channel *offer)
> +{
> +	struct vmbus_channel *channel;
> +	const guid_t *inst1, *inst2;
> +
> +	WARN_ON(!mutex_is_locked(&vmbus_connection.channel_mutex));
> +
> +	/* Ignore sub-channel offers. */
> +	if (offer->offer.sub_channel_index != 0)
> +		return NULL;
> +
> +	list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
> +		inst1 = &channel->offermsg.offer.if_instance;
> +		inst2 = &offer->offer.if_instance;
> +
> +		if (guid_equal(inst1, inst2))
> +			return channel;
> +	}
> +
> +	return NULL;
> +}

Any particular reason this new function is in connection.c instead of
putting it in channel_mgmt.c where it is called?

> +
> +/*
>   * vmbus_on_event - Process a channel event notification
>   *
>   * For batched channels (default) optimize host to guest signaling
> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> index 9f7fb6d..c42b46d 100644
> --- a/drivers/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -310,6 +310,9 @@ int vmbus_add_channel_kobj(struct hv_device *device_obj,
> 
>  struct vmbus_channel *relid2channel(u32 relid);
> 
> +struct vmbus_channel *
> +find_primary_channel_by_offer(const struct vmbus_channel_offer_channel *offer);
> +
>  void vmbus_free_channels(void);
> 
>  /* Connection interface */
> --
> 1.8.3.1


^ permalink raw reply

* RE: [PATCH v3 10/12] Drivers: hv: vmbus: Clean up hv_sock channels by force upon suspend
From: Michael Kelley @ 2019-08-23 20:02 UTC (permalink / raw)
  To: Dexuan Cui, linux-hyperv@vger.kernel.org,
	gregkh@linuxfoundation.org, Stephen Hemminger, Sasha Levin,
	sashal@kernel.org, Haiyang Zhang, KY Srinivasan,
	tglx@linutronix.de
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <1566265863-21252-11-git-send-email-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com> Sent: Monday, August 19, 2019 6:52 PM
> 
> Fake RESCIND_CHANNEL messages to clean up hv_sock channels by force for
> hibernation. There is no better method to clean up the channels since
> some of the channels may still be referenced by the userspace apps when
> hiberantin is triggered: in this case, the "rescind" fields of the

s/hiberantin/hibernation/

> channels are set, and the apps will thoroughly destroy the channels
> after hibernation.
> 
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/hv/vmbus_drv.c | 55
> ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 55 insertions(+)
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index ce9974b..2bea669 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -24,6 +24,7 @@
>  #include <linux/sched/task_stack.h>
> 
>  #include <asm/mshyperv.h>
> +#include <linux/delay.h>
>  #include <linux/notifier.h>
>  #include <linux/ptrace.h>
>  #include <linux/screen_info.h>
> @@ -1069,6 +1070,41 @@ void vmbus_on_msg_dpc(unsigned long data)
>  	vmbus_signal_eom(msg, message_type);
>  }
> 
> +/*
> + * Fake RESCIND_CHANNEL messages to clean up hv_sock channels by force for
> + * hibernation, because hv_sock connections can not persist across hibernation.
> + */
> +static void vmbus_force_channel_rescinded(struct vmbus_channel *channel)
> +{
> +	struct onmessage_work_context *ctx;
> +	struct vmbus_channel_rescind_offer *rescind;
> +
> +	WARN_ON(!is_hvsock_channel(channel));
> +
> +	/*
> +	 * sizeof(*ctx) is small and the allocation should really not fail,
> +	 * otherwise the state of the hv_sock connections ends up in limbo.
> +	 */
> +	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL | __GFP_NOFAIL);
> +
> +	/*
> +	 * So far, these are not really used by Linux. Just set them to the
> +	 * reasonable values conforming to the definitions of the fields.
> +	 */
> +	ctx->msg.header.message_type = 1;
> +	ctx->msg.header.payload_size = sizeof(*rescind);
> +
> +	/* These values are actually used by Linux. */
> +	rescind = (struct vmbus_channel_rescind_offer *)ctx->msg.u.payload;
> +	rescind->header.msgtype = CHANNELMSG_RESCIND_CHANNELOFFER;
> +	rescind->child_relid = channel->offermsg.child_relid;
> +
> +	INIT_WORK(&ctx->work, vmbus_onmessage_work);
> +
> +	queue_work_on(vmbus_connection.connect_cpu,
> +		      vmbus_connection.work_queue,
> +		      &ctx->work);
> +}
> 
>  /*
>   * Direct callback for channels using other deferred processing
> @@ -2091,6 +2127,25 @@ static int vmbus_acpi_add(struct acpi_device *device)
> 
>  static int vmbus_bus_suspend(struct device *dev)
>  {
> +	struct vmbus_channel *channel;
> +
> +	while (atomic_read(&vmbus_connection.offer_in_progress) != 0) {
> +		/*
> +		 * We wait here until any channel offer is currently
> +		 * being processed.
> +		 */

The wording of the comment is a bit off.  Maybe

		/*
		 * We wait here until the completion of any channel
		 * offers that are currently in progress.
		 */

> +		msleep(1);
> +	}
> +
> +	mutex_lock(&vmbus_connection.channel_mutex);
> +	list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
> +		if (!is_hvsock_channel(channel))
> +			continue;
> +
> +		vmbus_force_channel_rescinded(channel);
> +	}
> +	mutex_unlock(&vmbus_connection.channel_mutex);
> +
>  	vmbus_initiate_unload(false);
> 
>  	vmbus_connection.conn_state = DISCONNECTED;
> --
> 1.8.3.1

Modulo the nits:

Reviewed-by: Michael Kelley <mikelley@microsoft.com>


^ permalink raw reply

* RE: [PATCH v3 11/12] Drivers: hv: vmbus: Suspend after cleaning up hv_sock and sub channels
From: Michael Kelley @ 2019-08-23 20:16 UTC (permalink / raw)
  To: Dexuan Cui, linux-hyperv@vger.kernel.org,
	gregkh@linuxfoundation.org, Stephen Hemminger, Sasha Levin,
	sashal@kernel.org, Haiyang Zhang, KY Srinivasan,
	tglx@linutronix.de
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <1566265863-21252-12-git-send-email-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com> Sent: Monday, August 19, 2019 6:52 PM
> 
> Before suspend, Linux must make sure all the hv_sock channels have been
> properly cleaned up, because a hv_sock connection can not persist across
> hibernation, and the user-space app must be properly notified of the
> state change of the connection.
> 
> Before suspend, Linux also must make sure all the sub-channels have been
> destroyed, i.e. the related channel structs of the sub-channels must be
> properly removed, otherwise they would cause a conflict when the
> sub-channels are recreated upon resume.
> 
> Add a counter to track such channels, and vmbus_bus_suspend() should wait
> for the counter to drop to zero.
> 
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/hv/channel_mgmt.c | 28 ++++++++++++++++++++++++++++
>  drivers/hv/connection.c   |  3 +++
>  drivers/hv/hyperv_vmbus.h | 12 ++++++++++++
>  drivers/hv/vmbus_drv.c    | 41 ++++++++++++++++++++++++++++++++++++++++-
>  4 files changed, 83 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> index f7a1184..8491d1b 100644
> --- a/drivers/hv/channel_mgmt.c
> +++ b/drivers/hv/channel_mgmt.c
> @@ -499,6 +499,8 @@ static void vmbus_add_channel_work(struct work_struct *work)
>  	return;
> 
>  err_deq_chan:
> +	WARN_ON_ONCE(1);
> +

Why this change?  I was thinking maybe it's a debug statement that got
left in.

>  	mutex_lock(&vmbus_connection.channel_mutex);
> 
>  	/*
> @@ -545,6 +547,10 @@ static void vmbus_process_offer(struct vmbus_channel
> *newchannel)
> 
>  	mutex_lock(&vmbus_connection.channel_mutex);
> 
> +	/* Remember the channels that should be cleaned up upon suspend. */
> +	if (is_hvsock_channel(newchannel) || is_sub_channel(newchannel))
> +		atomic_inc(&vmbus_connection.nr_chan_close_on_suspend);
> +
>  	/*
>  	 * Now that we have acquired the channel_mutex,
>  	 * we can release the potentially racing rescind thread.
> @@ -915,6 +921,16 @@ static void vmbus_onoffer(struct
> vmbus_channel_message_header *hdr)
>  	vmbus_process_offer(newchannel);
>  }
> 
> +static void check_ready_for_suspend_event(void)
> +{
> +	/*
> +	 * If all the sub-channels or hv_sock channels have been cleaned up,
> +	 * then it's safe to suspend.
> +	 */
> +	if (atomic_dec_and_test(&vmbus_connection.nr_chan_close_on_suspend))
> +		complete(&vmbus_connection.ready_for_suspend_event);
> +}
> +
>  /*
>   * vmbus_onoffer_rescind - Rescind offer handler.
>   *
> @@ -925,6 +941,7 @@ static void vmbus_onoffer_rescind(struct
> vmbus_channel_message_header *hdr)
>  	struct vmbus_channel_rescind_offer *rescind;
>  	struct vmbus_channel *channel;
>  	struct device *dev;
> +	bool clean_up_chan_for_suspend;
> 
>  	rescind = (struct vmbus_channel_rescind_offer *)hdr;
> 
> @@ -964,6 +981,8 @@ static void vmbus_onoffer_rescind(struct
> vmbus_channel_message_header *hdr)
>  		return;
>  	}
> 
> +	clean_up_chan_for_suspend = is_hvsock_channel(channel) ||
> +				    is_sub_channel(channel);
>  	/*
>  	 * Before setting channel->rescind in vmbus_rescind_cleanup(), we
>  	 * should make sure the channel callback is not running any more.
> @@ -989,6 +1008,10 @@ static void vmbus_onoffer_rescind(struct
> vmbus_channel_message_header *hdr)
>  	if (channel->device_obj) {
>  		if (channel->chn_rescind_callback) {
>  			channel->chn_rescind_callback(channel);
> +
> +			if (clean_up_chan_for_suspend)
> +				check_ready_for_suspend_event();
> +
>  			return;
>  		}
>  		/*
> @@ -1021,6 +1044,11 @@ static void vmbus_onoffer_rescind(struct
> vmbus_channel_message_header *hdr)
>  		}
>  		mutex_unlock(&vmbus_connection.channel_mutex);
>  	}
> +
> +	/* The "channel" may have been freed. Do not access it any longer. */
> +
> +	if (clean_up_chan_for_suspend)
> +		check_ready_for_suspend_event();
>  }

Having to add the above lines twice is a bit clumsy, but the problem is
the overall structure of the vmbus_onoffer_rescind.  The early return in
the case of a rescind_callback function is a bit weird, but I guess it makes
sense since from what I can tell, only uio and hv_sock have rescind callback
functions.   Some minor restructuring might be warranted, but I don't feel
strongly about it.

> 
>  void vmbus_hvsock_device_unregister(struct vmbus_channel *channel)
> diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> index 701d9a8..f15d3115 100644
> --- a/drivers/hv/connection.c
> +++ b/drivers/hv/connection.c
> @@ -26,6 +26,9 @@
>  struct vmbus_connection vmbus_connection = {
>  	.conn_state		= DISCONNECTED,
>  	.next_gpadl_handle	= ATOMIC_INIT(0xE1E10),
> +
> +	.ready_for_suspend_event= COMPLETION_INITIALIZER(
> +				  vmbus_connection.ready_for_suspend_event),
>  };
>  EXPORT_SYMBOL_GPL(vmbus_connection);
> 
> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> index 4610277..9f96e23 100644
> --- a/drivers/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -258,6 +258,18 @@ struct vmbus_connection {
>  	struct workqueue_struct *work_queue;
>  	struct workqueue_struct *handle_primary_chan_wq;
>  	struct workqueue_struct *handle_sub_chan_wq;
> +
> +	/*
> +	 * The number of sub-channels and hv_sock channels that should be
> +	 * cleaned up upon suspend: sub-channels will be re-created upon
> +	 * resume, and hv_sock channels should not survive suspend.
> +	 */
> +	atomic_t nr_chan_close_on_suspend;
> +	/*
> +	 * vmbus_bus_suspend() waits for "nr_chan_close_on_suspend" to
> +	 * drop to zero.
> +	 */
> +	struct completion ready_for_suspend_event;
>  };
> 
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 2bea669..0507157 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2127,7 +2127,8 @@ static int vmbus_acpi_add(struct acpi_device *device)
> 
>  static int vmbus_bus_suspend(struct device *dev)
>  {
> -	struct vmbus_channel *channel;
> +	struct vmbus_channel *channel, *sc;
> +	unsigned long flags;
> 
>  	while (atomic_read(&vmbus_connection.offer_in_progress) != 0) {
>  		/*
> @@ -2146,6 +2147,41 @@ static int vmbus_bus_suspend(struct device *dev)
>  	}
>  	mutex_unlock(&vmbus_connection.channel_mutex);
> 
> +	/*
> +	 * Wait until all the sub-channels and hv_sock channels have been
> +	 * cleaned up. Sub-channels should be destroyed upon suspend, otherwise
> +	 * they would conflict with the new sub-channels that will be created
> +	 * in the resume path. hv_sock channels should also be destroyed, but
> +	 * a hv_sock channel of an established hv_sock connection can not be
> +	 * really destroyed since it may still be referenced by the userspace
> +	 * application, so we just force the hv_sock channel to be rescinded
> +	 * by vmbus_force_channel_rescinded(), and the userspace application
> +	 * will thoroughly destroy the channel after hibernation.
> +	 */
> +	if (atomic_read(&vmbus_connection.nr_chan_close_on_suspend) > 0)

At first glance, the above line seemed useless to me.  You could just do the
wait_for_completion() unconditionally.  But is the intent to handle the case where
the VM never had any sub-channels or hv_sock channels, and so
nr_chan_close_on_suspend never went above 0?  If so, a comment might
be helpful.

> +		wait_for_completion(&vmbus_connection.ready_for_suspend_event);
> +
> +	mutex_lock(&vmbus_connection.channel_mutex);
> +
> +	list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
> +		if (is_hvsock_channel(channel)) {
> +			if (!channel->rescind) {
> +				pr_err("hv_sock channel not rescinded!\n");
> +				WARN_ON_ONCE(1);
> +			}
> +			continue;
> +		}
> +
> +		spin_lock_irqsave(&channel->lock, flags);
> +		list_for_each_entry(sc, &channel->sc_list, sc_list) {
> +			pr_err("Sub-channel not deleted!\n");
> +			WARN_ON_ONCE(1);
> +		}
> +		spin_unlock_irqrestore(&channel->lock, flags);
> +	}
> +
> +	mutex_unlock(&vmbus_connection.channel_mutex);
> +
>  	vmbus_initiate_unload(false);
> 
>  	vmbus_connection.conn_state = DISCONNECTED;
> @@ -2186,6 +2222,9 @@ static int vmbus_bus_resume(struct device *dev)
> 
>  	vmbus_request_offers();
> 
> +	/* Reset the event for the next suspend. */
> +	reinit_completion(&vmbus_connection.ready_for_suspend_event);
> +
>  	return 0;
>  }
> 
> --
> 1.8.3.1


^ permalink raw reply

* RE: [PATCH v3 12/12] Drivers: hv: vmbus: Resume after fixing up old primary channels
From: Michael Kelley @ 2019-08-23 20:25 UTC (permalink / raw)
  To: Dexuan Cui, linux-hyperv@vger.kernel.org,
	gregkh@linuxfoundation.org, Stephen Hemminger, Sasha Levin,
	sashal@kernel.org, Haiyang Zhang, KY Srinivasan,
	tglx@linutronix.de
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <1566265863-21252-13-git-send-email-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com> Sent: Monday, August 19, 2019 6:52 PM
> 
> When the host re-offers the primary channels upon resume, the host only
> guarantees the Instance GUID  doesn't change, so vmbus_bus_suspend()
> should invalidate channel->offermsg.child_relid and figure out the
> number of primary channels that need to be fixed up upon resume.
> 
> Upon resume, vmbus_onoffer() finds the old channel structs, and maps
> the new offers to the old channels, and fixes up the old structs,
> and finally the resume callbacks of the VSC drivers will re-open
> the channels.
> 
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/hv/channel_mgmt.c | 76 +++++++++++++++++++++++++++++++++++------------
>  drivers/hv/connection.c   |  2 ++
>  drivers/hv/hyperv_vmbus.h | 14 +++++++++
>  drivers/hv/vmbus_drv.c    | 17 +++++++++++
>  include/linux/hyperv.h    |  3 ++
>  5 files changed, 93 insertions(+), 19 deletions(-)
> 
> @@ -875,12 +913,21 @@ static void vmbus_onoffer(struct
> vmbus_channel_message_header *hdr)
>  		atomic_dec(&vmbus_connection.offer_in_progress);
> 
>  		/*
> -		 * We're resuming from hibernation: we expect the host to send
> -		 * exactly the same offers that we had before the hibernation.
> +		 * We're resuming from hibernation: all the sub-channel and
> +		 * hv_sock channels we had before the hibernation should have
> +		 * been cleaned up, and now we must be seeing a re-offered
> +		 * primary channel that we had before the hibernation.
>  		 */
> +
> +		WARN_ON(oldchannel->offermsg.child_relid != INVALID_RELID);
> +		/* Fix up the relid. */
> +		oldchannel->offermsg.child_relid = offer->child_relid;
> +
>  		offer_sz = sizeof(*offer);
> -		if (memcmp(offer, &oldchannel->offermsg, offer_sz) == 0)
> +		if (memcmp(offer, &oldchannel->offermsg, offer_sz) == 0) {
> +			check_ready_for_resume_event();
>  			return;
> +		}
> 
>  		pr_debug("Mismatched offer from the host (relid=%d)\n",
>  			 offer->child_relid);
> @@ -890,6 +937,11 @@ static void vmbus_onoffer(struct
> vmbus_channel_message_header *hdr)
>  				     false);
>  		print_hex_dump_debug("New vmbus offer: ", DUMP_PREFIX_OFFSET,
>  				     16, 4, offer, offer_sz, false);
> +
> +		vmbus_setup_channel_state(oldchannel, offer);
> +
> +		check_ready_for_resume_event();

This is the error case where the new offer didn't match some aspect of
the old offer.  Is the intent to proceed and use the new offer?   I can see
that check_ready_for_resume_event() has to be called in the error case,
otherwise the resume operation will hang forever, but I'm not sure about
setting up the channel state and then proceeding as if all is good.

> +
>  		return;
>  	}
> 

^ permalink raw reply

* [PATCH v4 0/9] x86/tlb: Concurrent TLB flushes
From: Nadav Amit @ 2019-08-23 22:41 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit, Borislav Petkov, Boris Ostrovsky, Haiyang Zhang,
	Josh Poimboeuf, Juergen Gross, K. Y. Srinivasan, Paolo Bonzini,
	Rik van Riel, Sasha Levin, Stephen Hemminger, kvm, linux-hyperv,
	virtualization, xen-devel

[ Similar cover-letter to v3 with updated performance numbers on skylake.
  Sorry for the time it since the last version. ]

Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead - each PTE flushing can take 100s
of cycles. This patch-set allows TLB flushes to be run concurrently:
first request the remote CPUs to initiate the flush, then run it
locally, and finally wait for the remote CPUs to finish their work.

In addition, there are various small optimizations to avoid, for
example, unwarranted false-sharing.

The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one has relied on this
behavior of on_each_cpu() that invoked functions first remotely and only
then locally [Peter says he remembers someone might do so, but without
further information it is hard to know how to address it].

Running sysbench on dax/ext4 w/emulated-pmem, write-cache disabled on
2-socket, 56-logical-cores (28+SMT) Skylake, 5 repetitions:

sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
 --file-io-mode=mmap --threads=X --file-fsync-mode=fdatasync run

 Th.   tip-aug22 avg (stdev)   +patch-set avg (stdev)  change
 ---   ---------------------   ----------------------  ------
 1	1152920 (7453)		1169469 (9059)		+1.4%
 2	1545832 (12555)		1584172 (10484) 	+2.4%
 4	2480703 (12039)		2518641 (12875)		+1.5%
 8	3684486 (26007)		3840343 (44144)		+4.2%
 16	4981567 (23565)		5125756 (15458)		+2.8%
 32	5679542 (10116)		5887826 (6121)		+3.6%
 56	5630944 (17937)		5812514 (26264)		+3.2%

(Note that on configurations with up to 28 threads numactl was used to
set all threads on socket 1, which explains the drop in performance when
going to 32 threads).

Running the same benchmark with security mitigations disabled (PTI,
Spectre, MDS):

 Th.   tip-aug22 avg (stdev)   +patch-set avg (stdev)  change
 ---   ---------------------   ----------------------  ------
 1	1444119 (8524)		1469606 (10527)		+1.7%
 2	1921540 (24169)		1961899 (14450)		+2.1%
 4	3073716 (21786)		3199880 (16774)		+4.1%
 8	4700698 (49534)		4802312 (11043)		+2.1%
 16	6005180 (6366)		6006656 (31624)		   0%
 32	6826466 (10496)		6886622 (19110)		+0.8%
 56	6832344 (13468)		6885586 (20646)		+0.8%

The results are somewhat different than the results that have been obtained on
Haswell-X, which were sent before and the maximum performance improvement is
smaller. However, the performance improvement is significant.

v3 -> v4:
  * Merge flush_tlb_func_local and flush_tlb_func_remote() [Peter]
  * Prevent preemption on_each_cpu(). It is not needed, but it prevents
    concerns. [Peter/tglx]
  * Adding acked-, review-by tags

v2 -> v3:
* Open-code the remote/local-flush decision code [Andy]
* Fix hyper-v, Xen implementations [Andrew]
* Fix redundant TLB flushes.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Cc: linux-hyperv@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: x86@kernel.org
Cc: xen-devel@lists.xenproject.org

Nadav Amit (9):
  smp: Run functions concurrently in smp_call_function_many()
  x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  cpumask: Mark functions as pure
  x86/mm/tlb: Remove UV special case
  x86/mm/tlb: Remove unnecessary uses of the inline keyword

 arch/x86/hyperv/mmu.c                 |  10 +-
 arch/x86/include/asm/paravirt.h       |   6 +-
 arch/x86/include/asm/paravirt_types.h |   4 +-
 arch/x86/include/asm/tlbflush.h       |  52 +++----
 arch/x86/include/asm/trace/hyperv.h   |   2 +-
 arch/x86/kernel/kvm.c                 |  11 +-
 arch/x86/kernel/paravirt.c            |   2 +-
 arch/x86/mm/init.c                    |   2 +-
 arch/x86/mm/tlb.c                     | 195 ++++++++++++++------------
 arch/x86/xen/mmu_pv.c                 |  11 +-
 include/linux/cpumask.h               |   6 +-
 include/linux/smp.h                   |  34 ++++-
 include/trace/events/xen.h            |   2 +-
 kernel/smp.c                          | 138 +++++++++---------
 14 files changed, 254 insertions(+), 221 deletions(-)

-- 
2.17.1

^ permalink raw reply

* [PATCH v4 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
From: Nadav Amit @ 2019-08-23 22:41 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Sasha Levin, Borislav Petkov, Juergen Gross, Paolo Bonzini,
	Boris Ostrovsky, linux-hyperv, virtualization, kvm, xen-devel
In-Reply-To: <20190823224153.15223-1-namit@vmware.com>

To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires to check mm_cpumask() one more time, but unless this mask is
updated very frequently, this should impact performance negatively.

Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: Juergen Gross <jgross@suse.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: linux-hyperv@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Reviewed-by: Michael Kelley <mikelley@microsoft.com> # Hyper-v parts
Reviewed-by: Juergen Gross <jgross@suse.com> # Xen and paravirt parts
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/hyperv/mmu.c                 | 10 +++---
 arch/x86/include/asm/paravirt.h       |  6 ++--
 arch/x86/include/asm/paravirt_types.h |  4 +--
 arch/x86/include/asm/tlbflush.h       |  8 ++---
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c                 | 11 +++++--
 arch/x86/kernel/paravirt.c            |  2 +-
 arch/x86/mm/tlb.c                     | 45 ++++++++++++++++++---------
 arch/x86/xen/mmu_pv.c                 | 11 +++----
 include/trace/events/xen.h            |  2 +-
 10 files changed, 61 insertions(+), 40 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..8740d8b21db3 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -50,8 +50,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
 	return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-				    const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+				   const struct flush_tlb_info *info)
 {
 	int cpu, vcpu, gva_n, max_gvas;
 	struct hv_tlb_flush **flush_pcpu;
@@ -59,7 +59,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
 	u64 status = U64_MAX;
 	unsigned long flags;
 
-	trace_hyperv_mmu_flush_tlb_others(cpus, info);
+	trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
 	if (!hv_hypercall_pg)
 		goto do_native;
@@ -156,7 +156,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
 	if (!(status & HV_HYPERCALL_RESULT_MASK))
 		return;
 do_native:
-	native_flush_tlb_others(cpus, info);
+	native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -231,6 +231,6 @@ void hyperv_setup_mmu_ops(void)
 		return;
 
 	pr_info("Using hypercall for remote TLB flush\n");
-	pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+	pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
 	pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 69089d46f128..bc4829c9b3f9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -62,10 +62,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
 	PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void flush_tlb_others(const struct cpumask *cpumask,
-				    const struct flush_tlb_info *info)
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+				   const struct flush_tlb_info *info)
 {
-	PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+	PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
 static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 70b654f3ffe5..63fa751344bf 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -206,8 +206,8 @@ struct pv_mmu_ops {
 	void (*flush_tlb_user)(void);
 	void (*flush_tlb_kernel)(void);
 	void (*flush_tlb_one_user)(unsigned long addr);
-	void (*flush_tlb_others)(const struct cpumask *cpus,
-				 const struct flush_tlb_info *info);
+	void (*flush_tlb_multi)(const struct cpumask *cpus,
+				const struct flush_tlb_info *info);
 
 	void (*tlb_remove_table)(struct mmu_gather *tlb, void *table);
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 2f6e9be163ae..559195f79c2f 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -533,7 +533,7 @@ static inline void __flush_tlb_one_kernel(unsigned long addr)
  *  - flush_tlb_page(vma, vmaddr) flushes one page
  *  - flush_tlb_range(vma, start, end) flushes a range of pages
  *  - flush_tlb_kernel_range(start, end) flushes a range of kernel pages
- *  - flush_tlb_others(cpumask, info) flushes TLBs on other cpus
+ *  - flush_tlb_multi(cpumask, info) flushes TLBs on multiple cpus
  *
  * ..but the i386 has somewhat limited tlb flushing capabilities,
  * and page-granular flushes are available only on i486 and up.
@@ -586,7 +586,7 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
+void native_flush_tlb_multi(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info);
 
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
@@ -610,8 +610,8 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
 #ifndef CONFIG_PARAVIRT
-#define flush_tlb_others(mask, info)	\
-	native_flush_tlb_others(mask, info)
+#define flush_tlb_multi(mask, info)	\
+	native_flush_tlb_multi(mask, info)
 
 #define paravirt_tlb_remove_table(tlb, page) \
 	tlb_remove_page(tlb, (void *)(page))
diff --git a/arch/x86/include/asm/trace/hyperv.h b/arch/x86/include/asm/trace/hyperv.h
index ace464f09681..85ca8560c7f9 100644
--- a/arch/x86/include/asm/trace/hyperv.h
+++ b/arch/x86/include/asm/trace/hyperv.h
@@ -8,7 +8,7 @@
 
 #if IS_ENABLED(CONFIG_HYPERV)
 
-TRACE_EVENT(hyperv_mmu_flush_tlb_others,
+TRACE_EVENT(hyperv_mmu_flush_tlb_multi,
 	    TP_PROTO(const struct cpumask *cpus,
 		     const struct flush_tlb_info *info),
 	    TP_ARGS(cpus, info),
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 4cc967178bf9..0941d2d7f1cb 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -592,7 +592,7 @@ static void __init kvm_apf_trap_init(void)
 
 static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
 
-static void kvm_flush_tlb_others(const struct cpumask *cpumask,
+static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
 			const struct flush_tlb_info *info)
 {
 	u8 state;
@@ -606,6 +606,11 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,
 	 * queue flush_on_enter for pre-empted vCPUs
 	 */
 	for_each_cpu(cpu, flushmask) {
+		/*
+		 * The local vCPU is never preempted, so we do not explicitly
+		 * skip check for local vCPU - it will never be cleared from
+		 * flushmask.
+		 */
 		src = &per_cpu(steal_time, cpu);
 		state = READ_ONCE(src->preempted);
 		if ((state & KVM_VCPU_PREEMPTED)) {
@@ -615,7 +620,7 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,
 		}
 	}
 
-	native_flush_tlb_others(flushmask, info);
+	native_flush_tlb_multi(flushmask, info);
 }
 
 static void __init kvm_guest_init(void)
@@ -637,7 +642,7 @@ static void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
 	    !kvm_para_has_hint(KVM_HINTS_REALTIME) &&
 	    kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
-		pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
+		pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
 		pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 	}
 
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 59d3d2763a9e..5520a04c84ba 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -359,7 +359,7 @@ struct paravirt_patch_template pv_ops = {
 	.mmu.flush_tlb_user	= native_flush_tlb,
 	.mmu.flush_tlb_kernel	= native_flush_tlb_global,
 	.mmu.flush_tlb_one_user	= native_flush_tlb_one_user,
-	.mmu.flush_tlb_others	= native_flush_tlb_others,
+	.mmu.flush_tlb_multi	= native_flush_tlb_multi,
 	.mmu.tlb_remove_table	=
 			(void (*)(struct mmu_gather *, void *))tlb_remove_page,
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index c3ca3545d78a..5376a5447bd0 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -562,7 +562,7 @@ static void flush_tlb_func(void *info)
 		 * garbage into our TLB.  Since switching to init_mm is barely
 		 * slower than a minimal flush, just switch to init_mm.
 		 *
-		 * This should be rare, with native_flush_tlb_others skipping
+		 * This should be rare, with native_flush_tlb_multi() skipping
 		 * IPIs to lazy TLB mode CPUs.
 		 */
 		switch_mm_irqs_off(NULL, &init_mm, NULL);
@@ -660,9 +660,14 @@ static bool tlb_is_not_lazy(int cpu)
 
 static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
-			     const struct flush_tlb_info *info)
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+			    const struct flush_tlb_info *info)
 {
+	/*
+	 * Do accounting and tracing. Note that there are (and have always been)
+	 * cases in which a remote TLB flush will be traced, but eventually
+	 * would not happen.
+	 */
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
 	if (info->end == TLB_FLUSH_ALL)
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -682,10 +687,12 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 		 * means that the percpu tlb_gen variables won't be updated
 		 * and we'll do pointless flushes on future context switches.
 		 *
-		 * Rather than hooking native_flush_tlb_others() here, I think
+		 * Rather than hooking native_flush_tlb_multi() here, I think
 		 * that UV should be updated so that smp_call_function_many(),
 		 * etc, are optimal on UV.
 		 */
+		flush_tlb_func((void *)info);
+
 		cpumask = uv_flush_tlb_others(cpumask, info);
 		if (cpumask)
 			smp_call_function_many(cpumask, flush_tlb_func,
@@ -704,7 +711,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	 * doing a speculative memory access.
 	 */
 	if (info->freed_tables) {
-		smp_call_function_many(cpumask, flush_tlb_func, (void *)info, 1);
+		__smp_call_function_many(cpumask, flush_tlb_func, (void *)info,
+					 SCF_WAIT | SCF_RUN_LOCAL);
 	} else {
 		/*
 		 * Although we could have used on_each_cpu_cond_mask(),
@@ -731,7 +739,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 			if (tlb_is_not_lazy(cpu))
 				__cpumask_set_cpu(cpu, cond_cpumask);
 		}
-		smp_call_function_many(cond_cpumask, flush_tlb_func, (void *)info, 1);
+		__smp_call_function_many(cond_cpumask, flush_tlb_func,
+					 (void *)info, SCF_WAIT | SCF_RUN_LOCAL);
 	}
 }
 
@@ -812,16 +821,20 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
 				  new_tlb_gen);
 
-	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
+	/*
+	 * flush_tlb_multi() is not optimized for the common case in which only
+	 * a local TLB flush is needed. Optimize this use-case by calling
+	 * flush_tlb_func_local() directly in this case.
+	 */
+	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+		flush_tlb_multi(mm_cpumask(mm), info);
+	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
 		flush_tlb_func(info);
 		local_irq_enable();
 	}
 
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), info);
-
 	put_flush_tlb_info();
 	put_cpu();
 }
@@ -875,16 +888,20 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 
 	info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false, 0);
 
-	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
+	/*
+	 * flush_tlb_multi() is not optimized for the common case in which only
+	 * a local TLB flush is needed. Optimize this use-case by calling
+	 * flush_tlb_func_local() directly in this case.
+	 */
+	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+		flush_tlb_multi(&batch->cpumask, info);
+	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
 		flush_tlb_func(info);
 		local_irq_enable();
 	}
 
-	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
-		flush_tlb_others(&batch->cpumask, info);
-
 	cpumask_clear(&batch->cpumask);
 
 	put_flush_tlb_info();
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 26e8b326966d..48f7c7eb4dbc 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1345,8 +1345,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
 	preempt_enable();
 }
 
-static void xen_flush_tlb_others(const struct cpumask *cpus,
-				 const struct flush_tlb_info *info)
+static void xen_flush_tlb_multi(const struct cpumask *cpus,
+				const struct flush_tlb_info *info)
 {
 	struct {
 		struct mmuext_op op;
@@ -1356,7 +1356,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
 	const size_t mc_entry_size = sizeof(args->op) +
 		sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
 
-	trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
+	trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
 
 	if (cpumask_empty(cpus))
 		return;		/* nothing to do */
@@ -1365,9 +1365,8 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
 	args = mcs.args;
 	args->op.arg2.vcpumask = to_cpumask(args->mask);
 
-	/* Remove us, and any offline CPUS. */
+	/* Remove any offline CPUs */
 	cpumask_and(to_cpumask(args->mask), cpus, cpu_online_mask);
-	cpumask_clear_cpu(smp_processor_id(), to_cpumask(args->mask));
 
 	args->op.cmd = MMUEXT_TLB_FLUSH_MULTI;
 	if (info->end != TLB_FLUSH_ALL &&
@@ -2396,7 +2395,7 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = {
 	.flush_tlb_user = xen_flush_tlb,
 	.flush_tlb_kernel = xen_flush_tlb,
 	.flush_tlb_one_user = xen_flush_tlb_one_user,
-	.flush_tlb_others = xen_flush_tlb_others,
+	.flush_tlb_multi = xen_flush_tlb_multi,
 	.tlb_remove_table = tlb_remove_table,
 
 	.pgd_alloc = xen_pgd_alloc,
diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h
index 9a0e8af21310..546022acf160 100644
--- a/include/trace/events/xen.h
+++ b/include/trace/events/xen.h
@@ -362,7 +362,7 @@ TRACE_EVENT(xen_mmu_flush_tlb_one_user,
 	    TP_printk("addr %lx", __entry->addr)
 	);
 
-TRACE_EVENT(xen_mmu_flush_tlb_others,
+TRACE_EVENT(xen_mmu_flush_tlb_multi,
 	    TP_PROTO(const struct cpumask *cpus, struct mm_struct *mm,
 		     unsigned long addr, unsigned long end),
 	    TP_ARGS(cpus, mm, addr, end),
-- 
2.17.1


^ permalink raw reply related

* [GIT PULL] Hyper-V commits for v5.3-rc
From: Sasha Levin @ 2019-08-24 14:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-hyperv, kys, sthemmin, linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

The following changes since commit d1abaeb3be7b5fa6d7a1fbbd2e14e3310005c4c1:

  Linux 5.3-rc5 (2019-08-18 14:31:08 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git tags/hyperv-fixes-signed

for you to fetch changes up to a9fc4340aee041dd186d1fb8f1b5d1e9caf28212:

  Drivers: hv: vmbus: Fix virt_to_hvpfn() for X86_PAE (2019-08-20 12:49:57 -0400)

- ----------------------------------------------------------------
- - Fix for panics and network failures on PAE guests by Dexuan Cui.
- - Fix of a memory leak (and related cleanups) in the hyper-v keyboard
driver by Dexuan Cui.
- - Code cleanups for hyper-v clocksource driver during the merge window
by Dexuan Cui.
- - Fix for a false positive warning in the userspace hyper-v KVP store by
Vitaly Kuznetsov.

- ----------------------------------------------------------------
Dexuan Cui (3):
      Drivers: hv: vmbus: Remove the unused "tsc_page" from struct hv_context
      Input: hyperv-keyboard: Use in-place iterator API in the channel callback
      Drivers: hv: vmbus: Fix virt_to_hvpfn() for X86_PAE

Vitaly Kuznetsov (1):
      Tools: hv: kvp: eliminate 'may be used uninitialized' warning

 drivers/hv/channel.c                  |  2 +-
 drivers/hv/hyperv_vmbus.h             |  2 --
 drivers/input/serio/hyperv-keyboard.c | 35 ++++++-----------------------------
 tools/hv/hv_kvp_daemon.c              |  2 +-
 4 files changed, 8 insertions(+), 33 deletions(-)
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEE4n5dijQDou9mhzu83qZv95d3LNwFAl1hUC4ACgkQ3qZv95d3
LNzfPQ/7B2htJA7ZjiRqqoTFN6yQBGuuyCeoJyZhYOTRc5CZ3IPnEg7wcZ5cJ9xi
3ByRiRPWn3hqYgZXDA7pS6K4vAk/Gkafnq9E7d3SGhSNF9d+n9YzcIG5haRpFCfM
nenQ1WCP6wXeF/VlwOCLnTIKPqWEzwaFAANvhbS/19Ab/6n8ww1J+jilvI1QOBCR
hcUjP6+4Q/QuBZLQ/451ol7KMbAJkCdlq8tNJ2/OUm5dExajJuTVU55W/Qozmf9o
X3Q7nKkK54N+iIj0N2oh8kaH0HzTLWM64qy48KDN5czgiQtTeesHq8BTkAldokPZ
xZgK0jkJkZQL43ZzYs257rslr59j8Ol7CgnnIPrtM0YIE9tCiZdwBblKV0XgL/7m
Op1cQheZc0gavm1ynq3/w0kOOdkBM32LExnJI112LfNP7nQPZ8MX+efT7z2IsMbh
QK/NxcQL7pbgPs640uWqFicMQR+umwwQomEb39LDUh1/uzQEw9YCgVZU1JkcxJjd
Se+ldSi1yEQJ66p1Jf2lyAdDiDg5OwvjZhL1SNmAvvwpw/tSY0t94cwpJxVZ8WuI
pRDvWcVdxBJUExr7u0BaUWu3J8hYl+974GFCt+66M7tKDf8bsCOsfeBEqfPN5PNb
/qU0a5pnQolJmEqdxY7TU0qBgA1xfaNbdJNrBOvsPr0dSRz4ElQ=
=Dwtt
-----END PGP SIGNATURE-----

^ permalink raw reply

* Re: [PATCH] x86/Hyper-V: Fix build error with CONFIG_HYPERV_TSCPAGE=N
From: Sasha Levin @ 2019-08-24 15:12 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: lantianyu1986, Tianyu Lan, linux-hyperv, linux-kernel, kys,
	haiyangz, sthemmin, tglx, mingo, bp, hpa, x86, daniel.lezcano,
	michael.h.kelley
In-Reply-To: <87zhk1pp9p.fsf@vitty.brq.redhat.com>

On Thu, Aug 22, 2019 at 10:39:46AM +0200, Vitaly Kuznetsov wrote:
>lantianyu1986@gmail.com writes:
>
>> From: Tianyu Lan <Tianyu.Lan@microsoft.com>
>>
>> Both Hyper-V tsc page and Hyper-V tsc MSR code use variable
>> hv_sched_clock_offset for their sched clock callback and so
>> define the variable regardless of CONFIG_HYPERV_TSCPAGE setting.
>
>CONFIG_HYPERV_TSCPAGE is gone after my "x86/hyper-v: enable TSC page
>clocksource on 32bit" patch. Do we still have an issue to fix?

Yes. Let's get it fixed on older kernels (as such we need to tag this
one for stable). The 32bit TSC patch won't come in before 5.4 anyway.

Vitaly, does can you ack this patch? It might require you to re-spin
your patch.

--
Thanks,
Sasha

^ permalink raw reply

* RE: [PATCH] video: hyperv: hyperv_fb: Support deferred IO for Hyper-V frame buffer driver
From: Michael Kelley @ 2019-08-24 18:11 UTC (permalink / raw)
  To: Wei Hu, rdunlap@infradead.org, shc_work@mail.ru,
	gregkh@linuxfoundation.org, lee.jones@linaro.org,
	alexandre.belloni@bootlin.com, baijiaju1990@gmail.com,
	fthain@telegraphics.com.au, info@metux.net,
	linux-hyperv@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linux-fbdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	sashal@kernel.org, Stephen Hemminger, Haiyang Zhang,
	KY Srinivasan, Dexuan Cui, Iouri Tarassov
In-Reply-To: <KL1P15301MB026487D86E439FA67B25C42CBBAA0@KL1P15301MB0264.APCP153.PROD.OUTLOOK.COM>

From: Wei Hu <weh@microsoft.com> Sent: Wednesday, August 21, 2019 4:59 AM
>
> > From: Michael Kelley <mikelley@microsoft.com>
> > Sent: Monday, August 19, 2019 6:41 AM
> > To: Wei Hu <weh@microsoft.com>; rdunlap@infradead.org; shc_work@mail.ru;
> 
> > > -	msg.dirt.rect[0].x1 = 0;
> > > -	msg.dirt.rect[0].y1 = 0;
> > > -	msg.dirt.rect[0].x2 = info->var.xres;
> > > -	msg.dirt.rect[0].y2 = info->var.yres;
> > > +	msg.dirt.rect[0].x1 = (x1 < 0 || x1 > x2) ? 0 : x1;
> > > +	msg.dirt.rect[0].y1 = (y2 < 0 || y1 > y2) ? 0 : y1;
> >
> > This should be:
> >
> > 	msg.dirt.rect[0].y1 = (y1 < 0 || y1 > y2) ? 0 : y1;
> >
> > Also, throughout the code, I don't think there are any places where
> > x or y coordinate values are ever negative.  INT_MAX or 0 is used as the
> > sentinel value indicating "not set".  So can all the tests for less than 0
> > now be eliminated, both in this function and in other functions?
> >
> > > +	msg.dirt.rect[0].x2 =
> > > +		(x2 < x1 || x2 > info->var.xres) ? info->var.xres : x2;
> > > +	msg.dirt.rect[0].y2 =
> > > +		(y2 < y1 || y2 > info->var.yres) ? info->var.yres : y2;
> >
> > How exactly is the dirty rectangle specified to Hyper-V?  Suppose the frame
> > buffer resolution is 100x200.  If you want to specify the entire rectangle, the
> > first coordinate is (0, 0).  But what is the second coordinate?  Should it be
> > (99, 199) or (100, 200)?  The above code (and original code) implies it
> > should specified as (100, 200), which is actually a point outside the
> > maximum resolution, which is counter-intuitive and makes me wonder
> > if the code is correct.
> >
> [Wei Hu]
> The current code treat the entire framebuffer rectangle as (0,0) -> (var.xres, var.yres).
> Every time it sends refresh request, these are two points sent to host and host
> seems accept it. See the above (x1, y1) and (x2, y2)  in the deleted lines.
> 
> So in your example the second coordinate is (100, 200).

OK, agreed.  I ran some experiments and confirmed that this is indeed the
Hyper-V behavior.

> 
> 
> > > +/* Deferred IO callback */
> > > +static void synthvid_deferred_io(struct fb_info *p,
> > > +				 struct list_head *pagelist)
> > > +{
> > > +	struct hvfb_par *par = p->par;
> > > +	struct page *page;
> > > +	unsigned long start, end;
> > > +	int y1, y2, miny, maxy;
> > > +	unsigned long flags;
> > > +
> > > +	miny = INT_MAX;
> > > +	maxy = 0;
> > > +
> > > +	list_for_each_entry(page, pagelist, lru) {
> > > +		start = page->index << PAGE_SHIFT;
> > > +		end = start + PAGE_SIZE - 1;
> > > +		y1 = start / p->fix.line_length;
> > > +		y2 = end / p->fix.line_length;
> >
> > The above division rounds down because any remainder is discarded.  I
> > wondered whether rounding down is correct, which got me to thinking
> > about how the dirty rectangle is specified.  Is y2 the index of the last
> > dirty row?  If so, that's not consistent with the code in synthvid_update(),
> > which might choose var.yres as y2, and that's the index of a row outside
> > of the frame buffer.
> >
> [Wei Hu]
> In this place we try to figure out and merge all the faulted pages into one
> big dirty rectangle. A page in memory represents one or multiple lines in
> frame buffer. For example, one faulted page could represent all the linear
> pixels from (x, y) to (x-1, y+1). In this case we just form the dirty rectangle
> as (0, y) -> (var.xres, y+1). Also keep in mind we need to merge multiple
> pages. That's why in the end the dirty rectangle is (0, miny) -> (var.xres, maxy).

Let me give an example of where I think the new code doesn't work.  Suppose
the frame buffer resolution is 1024x768.  With 4 bytes per pixel, each row
is 4096 bytes, or exactly one page.  So each page contains exactly one row of
pixels. For simplicity in my example, let's look at the case when this function
is called with only one dirty page.   The calculation of y1 will identify the row
that is dirty.   The calculation of y2 will identify the same row.  So y1 will
equal y2, and miny will equal maxy.  Then when synthvid_update() is called,
Hyper-V will interpret the parameters as no rows needing to be updated.  In
a more complex case where the pagelist contains multiple dirty pages, maxy
also ends up one less than it needs to be.

I think passing 'maxy + 1' instead of 'maxy' to synthvid_update() will solve
the problem.  It certainly warrants a comment that the calculation of maxy
is "inclusive", while synthvid_update() expects its parameters to be "exclusive"
per Hyper-V's expectations.

There's also another interesting situation.  Suppose the resolution and page size
is such that a page contains multiple rows.  If the last page of the frame buffer
is dirty, this routine could calculate a y2 value identifying a "phantom" row
that is off the end of the frame buffer -- i.e., that is bigger than yres.  You
have synthvid_send() handling that case by clamping the y2 value to yres, but
it might be worth a comment here acknowledging the situation and how it is
handled.  I did a test, and it appears that Hyper-V does its own clamping of
the values passed in, but we should not take a dependency on how Hyper-V
handles incorrect inputs.

> 
> 
> > > +		if (y2 > p->var.yres)
> > > +			y2 = p->var.yres;
> > > +		miny = min_t(int, miny, y1);
> > > +		maxy = max_t(int, maxy, y2);
> > > +
> > > +		/* Copy from dio space to mmio address */
> > > +		if (par->fb_ready) {
> > > +			spin_lock_irqsave(&par->docopy_lock, flags);
> > > +			hvfb_docopy(par, start, PAGE_SIZE);
> > > +			spin_unlock_irqrestore(&par->docopy_lock, flags);
> > > +		}
> > > +	}
> > > +
> > > +	if (par->fb_ready)
> > > +		synthvid_update(p, 0, miny, p->var.xres, maxy);
> > > +}
> 
> 
> 
> 
> > > +
> > > +	/* Copy the dirty rectangle to frame buffer memory */
> > > +	spin_lock_irqsave(&par->docopy_lock, flags);
> > > +	for (j = y1; j <= y2 && x1 < x2; j++) {
> > > +		if (j == info->var.yres)
> > > +			break;
> > > +		hvfb_docopy(par,
> > > +			    j * info->fix.line_length +
> > > +			    (x1 * screen_depth / 8),
> > > +			    (x2 - x1 + 1) * screen_depth / 8);
> >
> > Whether the +1 is needed above gets back to the question I
> > raised earlier about how to interpret the coordinates -- whether
> > the (x2, y2) coordinate is just outside the dirty rectangle or
> > just inside the dirty rectangle.  Most of the code seems to treat
> > it as being just outside the dirty rectangle, in which case the +1
> > should not be used.
> >
> [Wei Hu]
> This dirty rectangle is not from page fault, but rather from frame buffer
> framework when the screen is in text mode. I am not 100% sure if the dirty
> rectangle given from kernel includes on extra line outside or not.  Here I
> just play it safe by copying one extra line in the worst case.

Got it.  I was incorrectly conflating the two ways the frame buffer can get
updated.

I looked at the how x2 and y2 of the dirty rectangle get set in this case.  It
looks to me like it's always calculated as x1 + width, and y1 + height in
hvfb_ondemand_refresh_throttle().  If that's correct, then the dirty
rectangle coordinates are "exclusive", which is what Hyper-V wants.   And
indeed the call to synthvid_update() later in this function assumes the
"exclusive" format.  Also, the "j == info->var.yres" test is correct only
if the y2 value is in the "exclusive" format.

With that being the case, the "for" loop control above should have j < y2
instead of j <= y2, as we don't need to copy the y2 row.  And the +1
isn't needed in the arguments to hvfb_docopy().

I really don't like the idea of copying an extra row, or an extra byte in a
row, "just in case".

Michael

> 
> Suppose dirty rectangle only contain one pixel, for example (0,0) is the only
> pixel changed in the entire frame buffer. If kernel sends me dirty rectangle as
> (0, 0) -> (0, 0), the above function works correctly. If the kernel sends
>  (0, 0) -> (1, 1), then the above function just copies one extra row and one extra
> column, which should also be fine. The hvfb_docopy() takes care of the
> edge cases.
> 
> Thanks,
> Wei

^ permalink raw reply

* Re: [PATCH] x86/Hyper-V: Fix build error with CONFIG_HYPERV_TSCPAGE=N
From: Tianyu Lan @ 2019-08-26  3:31 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Vitaly Kuznetsov, Tianyu Lan, linux-hyperv,
	linux-kernel@vger kernel org, KY Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, the arch/x86 maintainers, Daniel Lezcano,
	michael.h.kelley
In-Reply-To: <20190824151218.GM1581@sasha-vm>

On Sun, Aug 25, 2019 at 1:52 AM Sasha Levin <sashal@kernel.org> wrote:
>
> On Thu, Aug 22, 2019 at 10:39:46AM +0200, Vitaly Kuznetsov wrote:
> >lantianyu1986@gmail.com writes:
> >
> >> From: Tianyu Lan <Tianyu.Lan@microsoft.com>
> >>
> >> Both Hyper-V tsc page and Hyper-V tsc MSR code use variable
> >> hv_sched_clock_offset for their sched clock callback and so
> >> define the variable regardless of CONFIG_HYPERV_TSCPAGE setting.
> >
> >CONFIG_HYPERV_TSCPAGE is gone after my "x86/hyper-v: enable TSC page
> >clocksource on 32bit" patch. Do we still have an issue to fix?
>
> Yes. Let's get it fixed on older kernels (as such we need to tag this
> one for stable). The 32bit TSC patch won't come in before 5.4 anyway.
>
> Vitaly, does can you ack this patch? It might require you to re-spin
> your patch.
>
Hi Sasha:
               Thomas has foled this fix into original patch.
https://lkml.org/lkml/2019/8/23/600
--
Best regards
Tianyu Lan

^ permalink raw reply

* Re: [PATCH] x86/Hyper-V: Fix build error with CONFIG_HYPERV_TSCPAGE=N
From: Vitaly Kuznetsov @ 2019-08-26  7:20 UTC (permalink / raw)
  To: Sasha Levin
  Cc: lantianyu1986, Tianyu Lan, linux-hyperv, linux-kernel, kys,
	haiyangz, sthemmin, mingo, bp, hpa, x86, daniel.lezcano,
	michael.h.kelley, tglx
In-Reply-To: <20190824151218.GM1581@sasha-vm>

Sasha Levin <sashal@kernel.org> writes:

> On Thu, Aug 22, 2019 at 10:39:46AM +0200, Vitaly Kuznetsov wrote:
>>lantianyu1986@gmail.com writes:
>>
>>> From: Tianyu Lan <Tianyu.Lan@microsoft.com>
>>>
>>> Both Hyper-V tsc page and Hyper-V tsc MSR code use variable
>>> hv_sched_clock_offset for their sched clock callback and so
>>> define the variable regardless of CONFIG_HYPERV_TSCPAGE setting.
>>
>>CONFIG_HYPERV_TSCPAGE is gone after my "x86/hyper-v: enable TSC page
>>clocksource on 32bit" patch. Do we still have an issue to fix?
>
> Yes. Let's get it fixed on older kernels (as such we need to tag this
> one for stable). The 32bit TSC patch won't come in before 5.4 anyway.
>
> Vitaly, does can you ack this patch? It might require you to re-spin
> your patch.
>

Sure, no problem,

Acked-by: Vitaly Kuznetsov <vkuznets@redhat.com>

I, however, was under the impression the patch fixes the issue with the
newly introduced sched clock:

commit b74e1d61dbc614ff35ef3ad9267c61ed06b09051
Author: Tianyu Lan <Tianyu.Lan@microsoft.com>
Date:   Wed Aug 14 20:32:16 2019 +0800

    clocksource/hyperv: Add Hyper-V specific sched clock function

(and Fixes: tag is missing)

and this is not in mainline as of v5.3-rc6. In tip/timers/core Thomas
already picked my "clocksource/drivers/hyperv: Enable TSC page
clocksource on 32bit" which also resolves the issue.

So my question is - which older/stable kernel do you have in mind?

-- 
Vitaly

^ permalink raw reply

* Re: [PATCH] x86/Hyper-V: Fix build error with CONFIG_HYPERV_TSCPAGE=N
From: Thomas Gleixner @ 2019-08-26 11:55 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Sasha Levin, lantianyu1986, Tianyu Lan, linux-hyperv,
	linux-kernel, kys, haiyangz, sthemmin, mingo, bp, hpa, x86,
	daniel.lezcano, michael.h.kelley
In-Reply-To: <87d0gso0jf.fsf@vitty.brq.redhat.com>

On Mon, 26 Aug 2019, Vitaly Kuznetsov wrote:
> Sasha Levin <sashal@kernel.org> writes:
> 
> > On Thu, Aug 22, 2019 at 10:39:46AM +0200, Vitaly Kuznetsov wrote:
> >>lantianyu1986@gmail.com writes:
> >>
> >>> From: Tianyu Lan <Tianyu.Lan@microsoft.com>
> >>>
> >>> Both Hyper-V tsc page and Hyper-V tsc MSR code use variable
> >>> hv_sched_clock_offset for their sched clock callback and so
> >>> define the variable regardless of CONFIG_HYPERV_TSCPAGE setting.
> >>
> >>CONFIG_HYPERV_TSCPAGE is gone after my "x86/hyper-v: enable TSC page
> >>clocksource on 32bit" patch. Do we still have an issue to fix?
> >
> > Yes. Let's get it fixed on older kernels (as such we need to tag this
> > one for stable). The 32bit TSC patch won't come in before 5.4 anyway.
> >
> > Vitaly, does can you ack this patch? It might require you to re-spin
> > your patch.
> >
> 
> Sure, no problem,
> 
> Acked-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> 
> I, however, was under the impression the patch fixes the issue with the
> newly introduced sched clock:
> 
> commit b74e1d61dbc614ff35ef3ad9267c61ed06b09051
> Author: Tianyu Lan <Tianyu.Lan@microsoft.com>
> Date:   Wed Aug 14 20:32:16 2019 +0800
> 
>     clocksource/hyperv: Add Hyper-V specific sched clock function
> 
> (and Fixes: tag is missing)
> 
> and this is not in mainline as of v5.3-rc6. In tip/timers/core Thomas
> already picked my "clocksource/drivers/hyperv: Enable TSC page
> clocksource on 32bit" which also resolves the issue.

No. I folded Sashas fix into the original clocksource patch and then added
yours on top.

Thanks,

	tglx

^ permalink raw reply

* [PATCH] PCI: hv: Make functions only used locally static in pci-hyperv.c
From: Krzysztof Wilczynski @ 2019-08-26 15:41 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Sasha Levin,
	Lorenzo Pieralisi, linux-pci, linux-kernel, linux-hyperv

Functions hv_read_config_block(), hv_write_config_block()
and hv_register_block_invalidate() are not used anywhere
else and are local to drivers/pci/controller/pci-hyperv.c,
and do not need to be in global scope, so make these static.

Resolve compiler warning that can be seen when building with
warnings enabled (W=1).

Signed-off-by: Krzysztof Wilczynski <kw@linux.com>
---
 drivers/pci/controller/pci-hyperv.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index f1f300218fab..c9642e429c2d 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -930,7 +930,7 @@ static void hv_pci_read_config_compl(void *context, struct pci_response *resp,
  *
  * Return: 0 on success, -errno on failure
  */
-int hv_read_config_block(struct pci_dev *pdev, void *buf, unsigned int len,
+static int hv_read_config_block(struct pci_dev *pdev, void *buf, unsigned int len,
 			 unsigned int block_id, unsigned int *bytes_returned)
 {
 	struct hv_pcibus_device *hbus =
@@ -1010,7 +1010,7 @@ static void hv_pci_write_config_compl(void *context, struct pci_response *resp,
  *
  * Return: 0 on success, -errno on failure
  */
-int hv_write_config_block(struct pci_dev *pdev, void *buf, unsigned int len,
+static int hv_write_config_block(struct pci_dev *pdev, void *buf, unsigned int len,
 			  unsigned int block_id)
 {
 	struct hv_pcibus_device *hbus =
@@ -1079,7 +1079,7 @@ int hv_write_config_block(struct pci_dev *pdev, void *buf, unsigned int len,
  *
  * Return: 0 on success, -errno on failure
  */
-int hv_register_block_invalidate(struct pci_dev *pdev, void *context,
+static int hv_register_block_invalidate(struct pci_dev *pdev, void *context,
 				 void (*block_invalidate)(void *context,
 							  u64 block_mask))
 {
-- 
2.22.1


^ permalink raw reply related

* Re: [PATCH V3 0/3] KVM/Hyper-V: Add Hyper-V direct tlb flush support
From: Vitaly Kuznetsov @ 2019-08-27  6:41 UTC (permalink / raw)
  To: lantianyu1986
  Cc: Tianyu Lan, kvm, linux-doc, linux-hyperv, linux-kernel, pbonzini,
	rkrcmar, corbet, kys, haiyangz, sthemmin, sashal, tglx, mingo, bp,
	hpa, x86, michael.h.kelley
In-Reply-To: <20190819131737.26942-1-Tianyu.Lan@microsoft.com>

lantianyu1986@gmail.com writes:

> From: Tianyu Lan <Tianyu.Lan@microsoft.com>
>
> This patchset is to add Hyper-V direct tlb support in KVM. Hyper-V
> in L0 can delegate L1 hypervisor to handle tlb flush request from
> L2 guest when direct tlb flush is enabled in L1.
>
> Patch 2 introduces new cap KVM_CAP_HYPERV_DIRECT_TLBFLUSH to enable
> feature from user space. User space should enable this feature only
> when Hyper-V hypervisor capability is exposed to guest and KVM profile
> is hided. There is a parameter conflict between KVM and Hyper-V hypercall.
> We hope L2 guest doesn't use KVM hypercall when the feature is
> enabled. Detail please see comment of new API
> "KVM_CAP_HYPERV_DIRECT_TLBFLUSH"

I was thinking about this for awhile and I think I have a better
proposal. Instead of adding this new capability let's enable direct TLB
flush when KVM guest enables Hyper-V Hypercall page (writes to
HV_X64_MSR_HYPERCALL) - this guarantees that the guest doesn't need KVM
hypercalls as we can't handle both KVM-style and Hyper-V-style
hypercalls simultaneously and kvm_emulate_hypercall() does:

	if (kvm_hv_hypercall_enabled(vcpu->kvm))
		return kvm_hv_hypercall(vcpu);

What do you think?

(and instead of adding the capability we can add kvm.ko module parameter
to enable direct tlb flush unconditionally, like
'hv_direct_tlbflush=-1/0/1' with '-1' being the default (autoselect
based on Hyper-V hypercall enablement, '0' - permanently disabled, '1' -
permanenetly enabled)).

-- 
Vitaly

^ permalink raw reply

* RE: [PATCH v2] video: hyperv: hyperv_fb: Obtain screen resolution from Hyper-V host
From: Wei Hu @ 2019-08-27  8:02 UTC (permalink / raw)
  To: Michael Kelley, b.zolnierkie@samsung.com,
	linux-hyperv@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linux-fbdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	sashal@kernel.org, Stephen Hemminger, Haiyang Zhang,
	KY Srinivasan, Dexuan Cui
  Cc: Iouri Tarassov
In-Reply-To: <DM5PR21MB01370CB55C3011582F6F8C5CD7AA0@DM5PR21MB0137.namprd21.prod.outlook.com>

> -----Original Message-----
> From: Michael Kelley <mikelley@microsoft.com>
> Sent: Thursday, August 22, 2019 7:49 AM
> To: Wei Hu <weh@microsoft.com>; b.zolnierkie@samsung.com; linux-
> hyperv@vger.kernel.org; dri-devel@lists.freedesktop.org; linux-
> fbdev@vger.kernel.org; linux-kernel@vger.kernel.org; sashal@kernel.org;
> Stephen Hemminger <sthemmin@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; KY Srinivasan <kys@microsoft.com>; Dexuan Cui
> <decui@microsoft.com>
> Cc: Iouri Tarassov <iourit@microsoft.com>
> > +
> > +struct synthvid_supported_resolution_resp {
> > +	u8 edid_block[SYNTHVID_EDID_BLOCK_SIZE];
> > +	u8 resolution_count;
> > +	u8 default_resolution_index;
> > +	u8 is_standard;
> > +	struct hvd_screen_info
> > +		supported_resolution[SYNTHVID_MAX_RESOLUTION_COUNT];
> 
> Is there extra whitespace on this line?  Just wondering why it doesn't
> line up.
> 
[Wei Hu] 
No extra whitespace. It would be over 80 characters if I had put them in one line.

^ permalink raw reply

* [PATCH v3] video: hyperv: hyperv_fb: Obtain screen resolution from Hyper-V host
From: Wei Hu @ 2019-08-27 11:08 UTC (permalink / raw)
  To: b.zolnierkie@samsung.com, linux-hyperv@vger.kernel.org,
	dri-devel@lists.freedesktop.org, linux-fbdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, sashal@kernel.org,
	Stephen Hemminger, Haiyang Zhang, KY Srinivasan, Dexuan Cui,
	Michael Kelley
  Cc: Wei Hu, Iouri Tarassov

Beginning from Windows 10 RS5+, VM screen resolution is obtained from host.
The "video=hyperv_fb" boot time option is not needed, but still can be
used to overwrite what the host specifies. The VM resolution on the host
could be set by executing the powershell "set-vmvideo" command.

v2:
- Implemented fallback when version negotiation failed.
- Defined full size for supported_resolution array.

v3:
- Corrected the synthvid major and minor version comparison problem.

Signed-off-by: Iouri Tarassov <iourit@microsoft.com>
Signed-off-by: Wei Hu <weh@microsoft.com>
---
 drivers/video/fbdev/hyperv_fb.c | 159 +++++++++++++++++++++++++++++---
 1 file changed, 147 insertions(+), 12 deletions(-)

diff --git a/drivers/video/fbdev/hyperv_fb.c b/drivers/video/fbdev/hyperv_fb.c
index 00f5bdcc6c6f..1464c6f14687 100644
--- a/drivers/video/fbdev/hyperv_fb.c
+++ b/drivers/video/fbdev/hyperv_fb.c
@@ -23,6 +23,14 @@
  *
  * Portrait orientation is also supported:
  *     For example: video=hyperv_fb:864x1152
+ *
+ * When a Windows 10 RS5+ host is used, the virtual machine screen
+ * resolution is obtained from the host. The "video=hyperv_fb" option is
+ * not needed, but still can be used to overwrite what the host specifies.
+ * The VM resolution on the host could be set by executing the powershell
+ * "set-vmvideo" command. For example
+ *     set-vmvideo -vmname name -horizontalresolution:1920 \
+ * -verticalresolution:1200 -resolutiontype single
  */
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -44,6 +52,10 @@
 #define SYNTHVID_VERSION(major, minor) ((minor) << 16 | (major))
 #define SYNTHVID_VERSION_WIN7 SYNTHVID_VERSION(3, 0)
 #define SYNTHVID_VERSION_WIN8 SYNTHVID_VERSION(3, 2)
+#define SYNTHVID_VERSION_WIN10 SYNTHVID_VERSION(3, 5)
+
+#define SYNTHVID_VER_GET_MAJOR(ver) (ver & 0x0000ffff)
+#define SYNTHVID_VER_GET_MINOR(ver) ((ver & 0xffff0000) >> 16)
 
 #define SYNTHVID_DEPTH_WIN7 16
 #define SYNTHVID_DEPTH_WIN8 32
@@ -82,16 +94,25 @@ enum synthvid_msg_type {
 	SYNTHVID_POINTER_SHAPE		= 8,
 	SYNTHVID_FEATURE_CHANGE		= 9,
 	SYNTHVID_DIRT			= 10,
+	SYNTHVID_RESOLUTION_REQUEST	= 13,
+	SYNTHVID_RESOLUTION_RESPONSE	= 14,
 
-	SYNTHVID_MAX			= 11
+	SYNTHVID_MAX			= 15
 };
 
+#define		SYNTHVID_EDID_BLOCK_SIZE	128
+#define		SYNTHVID_MAX_RESOLUTION_COUNT	64
+
+struct hvd_screen_info {
+	u16 width;
+	u16 height;
+} __packed;
+
 struct synthvid_msg_hdr {
 	u32 type;
 	u32 size;  /* size of this header + payload after this field*/
 } __packed;
 
-
 struct synthvid_version_req {
 	u32 version;
 } __packed;
@@ -102,6 +123,19 @@ struct synthvid_version_resp {
 	u8 max_video_outputs;
 } __packed;
 
+struct synthvid_supported_resolution_req {
+	u8 maximum_resolution_count;
+} __packed;
+
+struct synthvid_supported_resolution_resp {
+	u8 edid_block[SYNTHVID_EDID_BLOCK_SIZE];
+	u8 resolution_count;
+	u8 default_resolution_index;
+	u8 is_standard;
+	struct hvd_screen_info
+		supported_resolution[SYNTHVID_MAX_RESOLUTION_COUNT];
+} __packed;
+
 struct synthvid_vram_location {
 	u64 user_ctx;
 	u8 is_vram_gpa_specified;
@@ -187,6 +221,8 @@ struct synthvid_msg {
 		struct synthvid_pointer_shape ptr_shape;
 		struct synthvid_feature_change feature_chg;
 		struct synthvid_dirt dirt;
+		struct synthvid_supported_resolution_req resolution_req;
+		struct synthvid_supported_resolution_resp resolution_resp;
 	};
 } __packed;
 
@@ -224,6 +260,8 @@ struct hvfb_par {
 
 static uint screen_width = HVFB_WIDTH;
 static uint screen_height = HVFB_HEIGHT;
+static uint screen_width_max = HVFB_WIDTH;
+static uint screen_height_max = HVFB_HEIGHT;
 static uint screen_depth;
 static uint screen_fb_size;
 
@@ -354,6 +392,7 @@ static void synthvid_recv_sub(struct hv_device *hdev)
 
 	/* Complete the wait event */
 	if (msg->vid_hdr.type == SYNTHVID_VERSION_RESPONSE ||
+	    msg->vid_hdr.type == SYNTHVID_RESOLUTION_RESPONSE ||
 	    msg->vid_hdr.type == SYNTHVID_VRAM_LOCATION_ACK) {
 		memcpy(par->init_buf, msg, MAX_VMBUS_PKT_SIZE);
 		complete(&par->wait);
@@ -400,6 +439,17 @@ static void synthvid_receive(void *ctx)
 	} while (bytes_recvd > 0 && ret == 0);
 }
 
+/* Check if the ver1 version is equal or greater than ver2 */
+static inline bool synthvid_ver_eg(u32 ver1, u32 ver2)
+{
+	if (SYNTHVID_VER_GET_MAJOR(ver1) > SYNTHVID_VER_GET_MAJOR(ver2) ||
+	    (SYNTHVID_VER_GET_MAJOR(ver1) == SYNTHVID_VER_GET_MAJOR(ver2) &&
+	     SYNTHVID_VER_GET_MINOR(ver1) >= SYNTHVID_VER_GET_MINOR(ver2)))
+		return true;
+
+	return false;
+}
+
 /* Check synthetic video protocol version with the host */
 static int synthvid_negotiate_ver(struct hv_device *hdev, u32 ver)
 {
@@ -428,6 +478,64 @@ static int synthvid_negotiate_ver(struct hv_device *hdev, u32 ver)
 	}
 
 	par->synthvid_version = ver;
+	pr_info("Synthvid Version major %d, minor %d\n",
+		SYNTHVID_VER_GET_MAJOR(ver), SYNTHVID_VER_GET_MINOR(ver));
+
+out:
+	return ret;
+}
+
+/* Get current resolution from the host */
+static int synthvid_get_supported_resolution(struct hv_device *hdev)
+{
+	struct fb_info *info = hv_get_drvdata(hdev);
+	struct hvfb_par *par = info->par;
+	struct synthvid_msg *msg = (struct synthvid_msg *)par->init_buf;
+	int ret = 0;
+	unsigned long t;
+	u8 index;
+	int i;
+
+	memset(msg, 0, sizeof(struct synthvid_msg));
+	msg->vid_hdr.type = SYNTHVID_RESOLUTION_REQUEST;
+	msg->vid_hdr.size = sizeof(struct synthvid_msg_hdr) +
+		sizeof(struct synthvid_supported_resolution_req);
+
+	msg->resolution_req.maximum_resolution_count =
+		SYNTHVID_MAX_RESOLUTION_COUNT;
+	synthvid_send(hdev, msg);
+
+	t = wait_for_completion_timeout(&par->wait, VSP_TIMEOUT);
+	if (!t) {
+		pr_err("Time out on waiting resolution response\n");
+			ret = -ETIMEDOUT;
+			goto out;
+	}
+
+	if (msg->resolution_resp.resolution_count == 0) {
+		pr_err("No supported resolutions\n");
+		ret = -ENODEV;
+		goto out;
+	}
+
+	index = msg->resolution_resp.default_resolution_index;
+	if (index >= msg->resolution_resp.resolution_count) {
+		pr_err("Invalid resolution index: %d\n", index);
+		ret = -ENODEV;
+		goto out;
+	}
+
+	for (i = 0; i < msg->resolution_resp.resolution_count; i++) {
+		screen_width_max = max_t(unsigned int, screen_width_max,
+		    msg->resolution_resp.supported_resolution[i].width);
+		screen_height_max = max_t(unsigned int, screen_height_max,
+		    msg->resolution_resp.supported_resolution[i].height);
+	}
+
+	screen_width =
+		msg->resolution_resp.supported_resolution[index].width;
+	screen_height =
+		msg->resolution_resp.supported_resolution[index].height;
 
 out:
 	return ret;
@@ -448,11 +556,27 @@ static int synthvid_connect_vsp(struct hv_device *hdev)
 	}
 
 	/* Negotiate the protocol version with host */
-	if (vmbus_proto_version == VERSION_WS2008 ||
-	    vmbus_proto_version == VERSION_WIN7)
-		ret = synthvid_negotiate_ver(hdev, SYNTHVID_VERSION_WIN7);
-	else
+	switch (vmbus_proto_version) {
+	case VERSION_WIN10:
+	case VERSION_WIN10_V5:
+		ret = synthvid_negotiate_ver(hdev, SYNTHVID_VERSION_WIN10);
+		if (!ret)
+			break;
+		/* Fallthrough */
+	case VERSION_WIN8:
+	case VERSION_WIN8_1:
 		ret = synthvid_negotiate_ver(hdev, SYNTHVID_VERSION_WIN8);
+		if (!ret)
+			break;
+		/* Fallthrough */
+	case VERSION_WS2008:
+	case VERSION_WIN7:
+		ret = synthvid_negotiate_ver(hdev, SYNTHVID_VERSION_WIN7);
+		break;
+	default:
+		ret = synthvid_negotiate_ver(hdev, SYNTHVID_VERSION_WIN10);
+		break;
+	}
 
 	if (ret) {
 		pr_err("Synthetic video device version not accepted\n");
@@ -464,6 +588,12 @@ static int synthvid_connect_vsp(struct hv_device *hdev)
 	else
 		screen_depth = SYNTHVID_DEPTH_WIN8;
 
+	if (synthvid_ver_eg(par->synthvid_version, SYNTHVID_VERSION_WIN10)) {
+		ret = synthvid_get_supported_resolution(hdev);
+		if (ret)
+			pr_info("Failed to get supported resolution from host, use default\n");
+	}
+
 	screen_fb_size = hdev->channel->offermsg.offer.
 				mmio_megabytes * 1024 * 1024;
 
@@ -653,6 +783,8 @@ static void hvfb_get_option(struct fb_info *info)
 	}
 
 	if (x < HVFB_WIDTH_MIN || y < HVFB_HEIGHT_MIN ||
+	    (synthvid_ver_eg(par->synthvid_version, SYNTHVID_VERSION_WIN10) &&
+	    (x > screen_width_max || y > screen_height_max)) ||
 	    (par->synthvid_version == SYNTHVID_VERSION_WIN8 &&
 	     x * y * screen_depth / 8 > SYNTHVID_FB_SIZE_WIN8) ||
 	    (par->synthvid_version == SYNTHVID_VERSION_WIN7 &&
@@ -689,8 +821,12 @@ static int hvfb_getmem(struct hv_device *hdev, struct fb_info *info)
 		}
 
 		if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM) ||
-		    pci_resource_len(pdev, 0) < screen_fb_size)
+		    pci_resource_len(pdev, 0) < screen_fb_size) {
+			pr_err("Resource not available or (0x%lx < 0x%lx)\n",
+			       (unsigned long) pci_resource_len(pdev, 0),
+			       (unsigned long) screen_fb_size);
 			goto err1;
+		}
 
 		pot_end = pci_resource_end(pdev, 0);
 		pot_start = pot_end - screen_fb_size + 1;
@@ -781,17 +917,16 @@ static int hvfb_probe(struct hv_device *hdev,
 		goto error1;
 	}
 
+	hvfb_get_option(info);
+	pr_info("Screen resolution: %dx%d, Color depth: %d\n",
+		screen_width, screen_height, screen_depth);
+
 	ret = hvfb_getmem(hdev, info);
 	if (ret) {
 		pr_err("No memory for framebuffer\n");
 		goto error2;
 	}
 
-	hvfb_get_option(info);
-	pr_info("Screen resolution: %dx%d, Color depth: %d\n",
-		screen_width, screen_height, screen_depth);
-
-
 	/* Set up fb_info */
 	info->flags = FBINFO_DEFAULT;
 
-- 
2.20.1


^ permalink raw reply related

* [PATHC v2] video: hyperv: hyperv_fb: Support deferred IO for Hyper-V frame buffer driver
From: Wei Hu @ 2019-08-27 11:25 UTC (permalink / raw)
  To: rdunlap@infradead.org, shc_work@mail.ru,
	gregkh@linuxfoundation.org, lee.jones@linaro.org,
	alexandre.belloni@bootlin.com, baijiaju1990@gmail.com,
	fthain@telegraphics.com.au, info@metux.net,
	linux-hyperv@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linux-fbdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	sashal@kernel.org, Stephen Hemminger, Haiyang Zhang,
	KY Srinivasan, Dexuan Cui, Michael Kelley
  Cc: Wei Hu

Without deferred IO support, hyperv_fb driver informs the host to refresh
the entire guest frame buffer at fixed rate, e.g. at 20Hz, no matter there
is screen update or not. This patch supports deferred IO for screens in
graphics mode and also enables the frame buffer on-demand refresh. The
highest refresh rate is still set at 20Hz.

Currently Hyper-V only takes a physical address from guest as the starting
address of frame buffer. This implies the guest must allocate contiguous
physical memory for frame buffer. In addition, Hyper-V Gen 2 VMs only
accept address from MMIO region as frame buffer address. Due to these
limitations on Hyper-V host, we keep a shadow copy of frame buffer
in the guest. This means one more copy of the dirty rectangle inside
guest when doing the on-demand refresh. This can be optimized in the
future with help from host. For now the host performance gain from deferred
IO outweighs the shadow copy impact in the guest.

v2: Incorporated review comments from Michael Kelley
- Increased dirty rectangle by one row in deferred IO case when sending
to Hyper-V.
- Corrected the dirty rectangle size in the text mode.
- Added more comments.
- Other minor code cleanups.

Signed-off-by: Wei Hu <weh@microsoft.com>
---
 drivers/video/fbdev/Kconfig     |   1 +
 drivers/video/fbdev/hyperv_fb.c | 221 +++++++++++++++++++++++++++++---
 2 files changed, 202 insertions(+), 20 deletions(-)

diff --git a/drivers/video/fbdev/Kconfig b/drivers/video/fbdev/Kconfig
index 1b2f5f31fb6f..e781f89a1824 100644
--- a/drivers/video/fbdev/Kconfig
+++ b/drivers/video/fbdev/Kconfig
@@ -2241,6 +2241,7 @@ config FB_HYPERV
 	select FB_CFB_FILLRECT
 	select FB_CFB_COPYAREA
 	select FB_CFB_IMAGEBLIT
+	select FB_DEFERRED_IO
 	help
 	  This framebuffer driver supports Microsoft Hyper-V Synthetic Video.
 
diff --git a/drivers/video/fbdev/hyperv_fb.c b/drivers/video/fbdev/hyperv_fb.c
index 2ca400c0d621..279a2164a57c 100644
--- a/drivers/video/fbdev/hyperv_fb.c
+++ b/drivers/video/fbdev/hyperv_fb.c
@@ -234,6 +234,7 @@ struct synthvid_msg {
 #define RING_BUFSIZE (256 * 1024)
 #define VSP_TIMEOUT (10 * HZ)
 #define HVFB_UPDATE_DELAY (HZ / 20)
+#define HVFB_ONDEMAND_THROTTLE (HZ / 20)
 
 struct hvfb_par {
 	struct fb_info *info;
@@ -253,6 +254,17 @@ struct hvfb_par {
 	bool synchronous_fb;
 
 	struct notifier_block hvfb_panic_nb;
+
+	/* Memory for deferred IO and frame buffer itself */
+	unsigned char *dio_vp;
+	unsigned char *mmio_vp;
+	unsigned long mmio_pp;
+	spinlock_t docopy_lock; /* Lock to protect memory copy */
+
+	/* Dirty rectangle, protected by delayed_refresh_lock */
+	int x1, y1, x2, y2;
+	bool delayed_refresh;
+	spinlock_t delayed_refresh_lock;
 };
 
 static uint screen_width = HVFB_WIDTH;
@@ -261,6 +273,7 @@ static uint screen_width_max = HVFB_WIDTH;
 static uint screen_height_max = HVFB_HEIGHT;
 static uint screen_depth;
 static uint screen_fb_size;
+static uint dio_fb_size; /* FB size for deferred IO */
 
 /* Send message to Hyper-V host */
 static inline int synthvid_send(struct hv_device *hdev,
@@ -347,28 +360,94 @@ static int synthvid_send_ptr(struct hv_device *hdev)
 }
 
 /* Send updated screen area (dirty rectangle) location to host */
-static int synthvid_update(struct fb_info *info)
+static int
+synthvid_update(struct fb_info *info, int x1, int y1, int x2, int y2)
 {
 	struct hv_device *hdev = device_to_hv_device(info->device);
 	struct synthvid_msg msg;
 
 	memset(&msg, 0, sizeof(struct synthvid_msg));
+	if (x2 == INT_MAX)
+		x2 = info->var.xres;
+	if (y2 == INT_MAX)
+		y2 = info->var.yres;
 
 	msg.vid_hdr.type = SYNTHVID_DIRT;
 	msg.vid_hdr.size = sizeof(struct synthvid_msg_hdr) +
 		sizeof(struct synthvid_dirt);
 	msg.dirt.video_output = 0;
 	msg.dirt.dirt_count = 1;
-	msg.dirt.rect[0].x1 = 0;
-	msg.dirt.rect[0].y1 = 0;
-	msg.dirt.rect[0].x2 = info->var.xres;
-	msg.dirt.rect[0].y2 = info->var.yres;
+	msg.dirt.rect[0].x1 = (x1 > x2) ? 0 : x1;
+	msg.dirt.rect[0].y1 = (y1 > y2) ? 0 : y1;
+	msg.dirt.rect[0].x2 =
+		(x2 < x1 || x2 > info->var.xres) ? info->var.xres : x2;
+	msg.dirt.rect[0].y2 =
+		(y2 < y1 || y2 > info->var.yres) ? info->var.yres : y2;
 
 	synthvid_send(hdev, &msg);
 
 	return 0;
 }
 
+static void hvfb_docopy(struct hvfb_par *par,
+			unsigned long offset,
+			unsigned long size)
+{
+	if (!par || !par->mmio_vp || !par->dio_vp || !par->fb_ready ||
+	    size == 0 || offset >= dio_fb_size)
+		return;
+
+	if (offset + size > dio_fb_size)
+		size = dio_fb_size - offset;
+
+	memcpy(par->mmio_vp + offset, par->dio_vp + offset, size);
+}
+
+/* Deferred IO callback */
+static void synthvid_deferred_io(struct fb_info *p,
+				 struct list_head *pagelist)
+{
+	struct hvfb_par *par = p->par;
+	struct page *page;
+	unsigned long start, end;
+	int y1, y2, miny, maxy;
+	unsigned long flags;
+
+	miny = INT_MAX;
+	maxy = 0;
+
+	/*
+	 * Merge dirty pages. It is possible that last page cross
+	 * over the end of frame buffer row yres. This is taken care of
+	 * in synthvid_update function by clamping the y2
+	 * value to yres.
+	 */
+	list_for_each_entry(page, pagelist, lru) {
+		start = page->index << PAGE_SHIFT;
+		end = start + PAGE_SIZE - 1;
+		y1 = start / p->fix.line_length;
+		y2 = end / p->fix.line_length;
+		if (y2 > p->var.yres)
+			y2 = p->var.yres;
+		miny = min_t(int, miny, y1);
+		maxy = max_t(int, maxy, y2);
+
+		/* Copy from dio space to mmio address */
+		if (par->fb_ready) {
+			spin_lock_irqsave(&par->docopy_lock, flags);
+			hvfb_docopy(par, start, PAGE_SIZE);
+			spin_unlock_irqrestore(&par->docopy_lock, flags);
+		}
+	}
+
+	if (par->fb_ready)
+		synthvid_update(p, 0, miny, p->var.xres, maxy + 1);
+}
+
+static struct fb_deferred_io synthvid_defio = {
+	.delay		= HZ / 20,
+	.deferred_io	= synthvid_deferred_io,
+};
 
 /*
  * Actions on received messages from host:
@@ -604,7 +683,7 @@ static int synthvid_send_config(struct hv_device *hdev)
 	msg->vid_hdr.type = SYNTHVID_VRAM_LOCATION;
 	msg->vid_hdr.size = sizeof(struct synthvid_msg_hdr) +
 		sizeof(struct synthvid_vram_location);
-	msg->vram.user_ctx = msg->vram.vram_gpa = info->fix.smem_start;
+	msg->vram.user_ctx = msg->vram.vram_gpa = par->mmio_pp;
 	msg->vram.is_vram_gpa_specified = 1;
 	synthvid_send(hdev, msg);
 
@@ -614,7 +693,7 @@ static int synthvid_send_config(struct hv_device *hdev)
 		ret = -ETIMEDOUT;
 		goto out;
 	}
-	if (msg->vram_ack.user_ctx != info->fix.smem_start) {
+	if (msg->vram_ack.user_ctx != par->mmio_pp) {
 		pr_err("Unable to set VRAM location\n");
 		ret = -ENODEV;
 		goto out;
@@ -631,19 +710,82 @@ static int synthvid_send_config(struct hv_device *hdev)
 
 /*
  * Delayed work callback:
- * It is called at HVFB_UPDATE_DELAY or longer time interval to process
- * screen updates. It is re-scheduled if further update is necessary.
+ * It is scheduled to call whenever update request is received and it has
+ * not been called in last HVFB_ONDEMAND_THROTTLE time interval.
  */
 static void hvfb_update_work(struct work_struct *w)
 {
 	struct hvfb_par *par = container_of(w, struct hvfb_par, dwork.work);
 	struct fb_info *info = par->info;
+	unsigned long flags;
+	int x1, x2, y1, y2;
+	int j;
+
+	spin_lock_irqsave(&par->delayed_refresh_lock, flags);
+	/* Reset the request flag */
+	par->delayed_refresh = false;
+
+	/* Store the dirty rectangle to local variables */
+	x1 = par->x1;
+	x2 = par->x2;
+	y1 = par->y1;
+	y2 = par->y2;
+
+	/* Clear dirty rectangle */
+	par->x1 = par->y1 = INT_MAX;
+	par->x2 = par->y2 = 0;
+
+	spin_unlock_irqrestore(&par->delayed_refresh_lock, flags);
 
+	if (x1 < 0 || x1 > info->var.xres || x2 < 0 ||
+	    x2 > info->var.xres || y1 < 0 || y1 > info->var.yres ||
+	    y2 < 0 || y2 > info->var.yres || x2 <= x1)
+		return;
+
+	/* Copy the dirty rectangle to frame buffer memory */
+	spin_lock_irqsave(&par->docopy_lock, flags);
+	for (j = y1; j < y2; j++) {
+		if (j == info->var.yres)
+			break;
+		hvfb_docopy(par,
+			    j * info->fix.line_length +
+			    (x1 * screen_depth / 8),
+			    (x2 - x1) * screen_depth / 8);
+	}
+	spin_unlock_irqrestore(&par->docopy_lock, flags);
+
+	/* Refresh */
 	if (par->fb_ready)
-		synthvid_update(info);
+		synthvid_update(info, x1, y1, x2, y2);
+}
+
+/*
+ * Control the on-demand refresh frequency. It schedules a delayed
+ * screen update if it has not yet.
+ */
+static void hvfb_ondemand_refresh_throttle(struct hvfb_par *par,
+					   int x1, int y1, int w, int h)
+{
+	unsigned long flags;
+	int x2 = x1 + w;
+	int y2 = y1 + h;
+
+	spin_lock_irqsave(&par->delayed_refresh_lock, flags);
+
+	/* Merge dirty rectangle */
+	par->x1 = min_t(int, par->x1, x1);
+	par->y1 = min_t(int, par->y1, y1);
+	par->x2 = max_t(int, par->x2, x2);
+	par->y2 = max_t(int, par->y2, y2);
+
+	/* Schedule a delayed screen update if not yet */
+	if (par->delayed_refresh == false) {
+		schedule_delayed_work(&par->dwork,
+				      HVFB_ONDEMAND_THROTTLE);
+		par->delayed_refresh = true;
+	}
 
-	if (par->update)
-		schedule_delayed_work(&par->dwork, HVFB_UPDATE_DELAY);
+	spin_unlock_irqrestore(&par->delayed_refresh_lock, flags);
 }
 
 static int hvfb_on_panic(struct notifier_block *nb,
@@ -655,7 +797,8 @@ static int hvfb_on_panic(struct notifier_block *nb,
 	par = container_of(nb, struct hvfb_par, hvfb_panic_nb);
 	par->synchronous_fb = true;
 	info = par->info;
-	synthvid_update(info);
+	hvfb_docopy(par, 0, dio_fb_size);
+	synthvid_update(info, 0, 0, INT_MAX, INT_MAX);
 
 	return NOTIFY_DONE;
 }
@@ -716,7 +859,10 @@ static void hvfb_cfb_fillrect(struct fb_info *p,
 
 	cfb_fillrect(p, rect);
 	if (par->synchronous_fb)
-		synthvid_update(p);
+		synthvid_update(p, 0, 0, INT_MAX, INT_MAX);
+	else
+		hvfb_ondemand_refresh_throttle(par, rect->dx, rect->dy,
+					       rect->width, rect->height);
 }
 
 static void hvfb_cfb_copyarea(struct fb_info *p,
@@ -726,7 +872,10 @@ static void hvfb_cfb_copyarea(struct fb_info *p,
 
 	cfb_copyarea(p, area);
 	if (par->synchronous_fb)
-		synthvid_update(p);
+		synthvid_update(p, 0, 0, INT_MAX, INT_MAX);
+	else
+		hvfb_ondemand_refresh_throttle(par, area->dx, area->dy,
+					       area->width, area->height);
 }
 
 static void hvfb_cfb_imageblit(struct fb_info *p,
@@ -736,7 +885,10 @@ static void hvfb_cfb_imageblit(struct fb_info *p,
 
 	cfb_imageblit(p, image);
 	if (par->synchronous_fb)
-		synthvid_update(p);
+		synthvid_update(p, 0, 0, INT_MAX, INT_MAX);
+	else
+		hvfb_ondemand_refresh_throttle(par, image->dx, image->dy,
+					       image->width, image->height);
 }
 
 static struct fb_ops hvfb_ops = {
@@ -795,6 +947,9 @@ static int hvfb_getmem(struct hv_device *hdev, struct fb_info *info)
 	resource_size_t pot_start, pot_end;
 	int ret;
 
+	dio_fb_size =
+		screen_width * screen_height * screen_depth / 8;
+
 	if (gen2vm) {
 		pot_start = 0;
 		pot_end = -1;
@@ -829,9 +984,14 @@ static int hvfb_getmem(struct hv_device *hdev, struct fb_info *info)
 	if (!fb_virt)
 		goto err2;
 
+	/* Allocate memory for deferred IO */
+	par->dio_vp = vzalloc(round_up(dio_fb_size, PAGE_SIZE));
+	if (par->dio_vp == NULL)
+		goto err3;
+
 	info->apertures = alloc_apertures(1);
 	if (!info->apertures)
-		goto err3;
+		goto err4;
 
 	if (gen2vm) {
 		info->apertures->ranges[0].base = screen_info.lfb_base;
@@ -843,16 +1003,23 @@ static int hvfb_getmem(struct hv_device *hdev, struct fb_info *info)
 		info->apertures->ranges[0].size = pci_resource_len(pdev, 0);
 	}
 
+	/* Physical address of FB device */
+	par->mmio_pp = par->mem->start;
+	/* Virtual address of FB device */
+	par->mmio_vp = (unsigned char *) fb_virt;
+
 	info->fix.smem_start = par->mem->start;
-	info->fix.smem_len = screen_fb_size;
-	info->screen_base = fb_virt;
-	info->screen_size = screen_fb_size;
+	info->fix.smem_len = dio_fb_size;
+	info->screen_base = par->dio_vp;
+	info->screen_size = dio_fb_size;
 
 	if (!gen2vm)
 		pci_dev_put(pdev);
 
 	return 0;
 
+err4:
+	vfree(par->dio_vp);
 err3:
 	iounmap(fb_virt);
 err2:
@@ -870,6 +1037,7 @@ static void hvfb_putmem(struct fb_info *info)
 {
 	struct hvfb_par *par = info->par;
 
+	vfree(par->dio_vp);
 	iounmap(info->screen_base);
 	vmbus_free_mmio(par->mem->start, screen_fb_size);
 	par->mem = NULL;
@@ -895,6 +1063,12 @@ static int hvfb_probe(struct hv_device *hdev,
 	init_completion(&par->wait);
 	INIT_DELAYED_WORK(&par->dwork, hvfb_update_work);
 
+	par->delayed_refresh = false;
+	spin_lock_init(&par->delayed_refresh_lock);
+	spin_lock_init(&par->docopy_lock);
+	par->x1 = par->y1 = INT_MAX;
+	par->x2 = par->y2 = 0;
+
 	/* Connect to VSP */
 	hv_set_drvdata(hdev, info);
 	ret = synthvid_connect_vsp(hdev);
@@ -946,6 +1120,10 @@ static int hvfb_probe(struct hv_device *hdev,
 	info->fbops = &hvfb_ops;
 	info->pseudo_palette = par->pseudo_palette;
 
+	/* Initialize deferred IO */
+	info->fbdefio = &synthvid_defio;
+	fb_deferred_io_init(info);
+
 	/* Send config to host */
 	ret = synthvid_send_config(hdev);
 	if (ret)
@@ -967,6 +1145,7 @@ static int hvfb_probe(struct hv_device *hdev,
 	return 0;
 
 error:
+	fb_deferred_io_cleanup(info);
 	hvfb_putmem(info);
 error2:
 	vmbus_close(hdev->channel);
@@ -989,6 +1168,8 @@ static int hvfb_remove(struct hv_device *hdev)
 	par->update = false;
 	par->fb_ready = false;
 
+	fb_deferred_io_cleanup(info);
+
 	unregister_framebuffer(info);
 	cancel_delayed_work_sync(&par->dwork);
 
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH V3 0/3] KVM/Hyper-V: Add Hyper-V direct tlb flush support
From: Tianyu Lan @ 2019-08-27 12:17 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Tianyu Lan, kvm, linux-doc, linux-hyperv,
	linux-kernel@vger kernel org, Paolo Bonzini, Radim Krcmar, corbet,
	KY Srinivasan, Haiyang Zhang, Stephen Hemminger, Sasha Levin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	the arch/x86 maintainers, michael.h.kelley
In-Reply-To: <87ftlnm7o8.fsf@vitty.brq.redhat.com>

On Tue, Aug 27, 2019 at 2:41 PM Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>
> lantianyu1986@gmail.com writes:
>
> > From: Tianyu Lan <Tianyu.Lan@microsoft.com>
> >
> > This patchset is to add Hyper-V direct tlb support in KVM. Hyper-V
> > in L0 can delegate L1 hypervisor to handle tlb flush request from
> > L2 guest when direct tlb flush is enabled in L1.
> >
> > Patch 2 introduces new cap KVM_CAP_HYPERV_DIRECT_TLBFLUSH to enable
> > feature from user space. User space should enable this feature only
> > when Hyper-V hypervisor capability is exposed to guest and KVM profile
> > is hided. There is a parameter conflict between KVM and Hyper-V hypercall.
> > We hope L2 guest doesn't use KVM hypercall when the feature is
> > enabled. Detail please see comment of new API
> > "KVM_CAP_HYPERV_DIRECT_TLBFLUSH"
>
> I was thinking about this for awhile and I think I have a better
> proposal. Instead of adding this new capability let's enable direct TLB
> flush when KVM guest enables Hyper-V Hypercall page (writes to
> HV_X64_MSR_HYPERCALL) - this guarantees that the guest doesn't need KVM
> hypercalls as we can't handle both KVM-style and Hyper-V-style
> hypercalls simultaneously and kvm_emulate_hypercall() does:
>
>         if (kvm_hv_hypercall_enabled(vcpu->kvm))
>                 return kvm_hv_hypercall(vcpu);
>
> What do you think?
>
> (and instead of adding the capability we can add kvm.ko module parameter
> to enable direct tlb flush unconditionally, like
> 'hv_direct_tlbflush=-1/0/1' with '-1' being the default (autoselect
> based on Hyper-V hypercall enablement, '0' - permanently disabled, '1' -
> permanenetly enabled)).
>

Hi Vitaly::
     Actually, I had such idea before. But user space should check
whether hv tlb flush
is exposed to VM before enabling direct tlb flush. If no, user space
should not direct
tlb flush for guest since Hyper-V will do more check for each
hypercall from nested
VM with enabling the feauter..

-- 
Best regards
Tianyu Lan

^ permalink raw reply

* Re: [PATCH V3 0/3] KVM/Hyper-V: Add Hyper-V direct tlb flush support
From: Tianyu Lan @ 2019-08-27 12:33 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Tianyu Lan, kvm, linux-doc, linux-hyperv,
	linux-kernel@vger kernel org, Paolo Bonzini, Radim Krcmar, corbet,
	KY Srinivasan, Haiyang Zhang, Stephen Hemminger, Sasha Levin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	the arch/x86 maintainers, michael.h.kelley
In-Reply-To: <CAOLK0pzXPG9tBnQoKGTSNHMwXXrEQ4zZH1uWn2F2mQ2ddVcoFA@mail.gmail.com>

On Tue, Aug 27, 2019 at 8:17 PM Tianyu Lan <lantianyu1986@gmail.com> wrote:
>
> On Tue, Aug 27, 2019 at 2:41 PM Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
> >
> > lantianyu1986@gmail.com writes:
> >
> > > From: Tianyu Lan <Tianyu.Lan@microsoft.com>
> > >
> > > This patchset is to add Hyper-V direct tlb support in KVM. Hyper-V
> > > in L0 can delegate L1 hypervisor to handle tlb flush request from
> > > L2 guest when direct tlb flush is enabled in L1.
> > >
> > > Patch 2 introduces new cap KVM_CAP_HYPERV_DIRECT_TLBFLUSH to enable
> > > feature from user space. User space should enable this feature only
> > > when Hyper-V hypervisor capability is exposed to guest and KVM profile
> > > is hided. There is a parameter conflict between KVM and Hyper-V hypercall.
> > > We hope L2 guest doesn't use KVM hypercall when the feature is
> > > enabled. Detail please see comment of new API
> > > "KVM_CAP_HYPERV_DIRECT_TLBFLUSH"
> >
> > I was thinking about this for awhile and I think I have a better
> > proposal. Instead of adding this new capability let's enable direct TLB
> > flush when KVM guest enables Hyper-V Hypercall page (writes to
> > HV_X64_MSR_HYPERCALL) - this guarantees that the guest doesn't need KVM
> > hypercalls as we can't handle both KVM-style and Hyper-V-style
> > hypercalls simultaneously and kvm_emulate_hypercall() does:
> >
> >         if (kvm_hv_hypercall_enabled(vcpu->kvm))
> >                 return kvm_hv_hypercall(vcpu);
> >
> > What do you think?
> >
> > (and instead of adding the capability we can add kvm.ko module parameter
> > to enable direct tlb flush unconditionally, like
> > 'hv_direct_tlbflush=-1/0/1' with '-1' being the default (autoselect
> > based on Hyper-V hypercall enablement, '0' - permanently disabled, '1' -
> > permanenetly enabled)).
> >
>
> Hi Vitaly::
>      Actually, I had such idea before. But user space should check
> whether hv tlb flush
> is exposed to VM before enabling direct tlb flush. If no, user space
> should not direct
> tlb flush for guest since Hyper-V will do more check for each
> hypercall from nested
> VM with enabling the feauter..
>
Fix the line break.Sorry for noise.

Actually, I had such idea before. But user space should check
whether hv tlb flush is exposed to VM before enabling direct tlb
flush. If no, user space should not direct tlb flush for guest since
Hyper-V will do more check for each hypercall from nested VM
with enabling the feauter..
---
Best regards
Tianyu Lan

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox