Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* RE: [PATCH v3 11/12] Drivers: hv: vmbus: Suspend after cleaning up hv_sock and sub channels
From: Michael Kelley @ 2019-08-23 20:16 UTC (permalink / raw)
  To: Dexuan Cui, linux-hyperv@vger.kernel.org,
	gregkh@linuxfoundation.org, Stephen Hemminger, Sasha Levin,
	sashal@kernel.org, Haiyang Zhang, KY Srinivasan,
	tglx@linutronix.de
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <1566265863-21252-12-git-send-email-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com> Sent: Monday, August 19, 2019 6:52 PM
> 
> Before suspend, Linux must make sure all the hv_sock channels have been
> properly cleaned up, because a hv_sock connection can not persist across
> hibernation, and the user-space app must be properly notified of the
> state change of the connection.
> 
> Before suspend, Linux also must make sure all the sub-channels have been
> destroyed, i.e. the related channel structs of the sub-channels must be
> properly removed, otherwise they would cause a conflict when the
> sub-channels are recreated upon resume.
> 
> Add a counter to track such channels, and vmbus_bus_suspend() should wait
> for the counter to drop to zero.
> 
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/hv/channel_mgmt.c | 28 ++++++++++++++++++++++++++++
>  drivers/hv/connection.c   |  3 +++
>  drivers/hv/hyperv_vmbus.h | 12 ++++++++++++
>  drivers/hv/vmbus_drv.c    | 41 ++++++++++++++++++++++++++++++++++++++++-
>  4 files changed, 83 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> index f7a1184..8491d1b 100644
> --- a/drivers/hv/channel_mgmt.c
> +++ b/drivers/hv/channel_mgmt.c
> @@ -499,6 +499,8 @@ static void vmbus_add_channel_work(struct work_struct *work)
>  	return;
> 
>  err_deq_chan:
> +	WARN_ON_ONCE(1);
> +

Why this change?  I was thinking maybe it's a debug statement that got
left in.

>  	mutex_lock(&vmbus_connection.channel_mutex);
> 
>  	/*
> @@ -545,6 +547,10 @@ static void vmbus_process_offer(struct vmbus_channel
> *newchannel)
> 
>  	mutex_lock(&vmbus_connection.channel_mutex);
> 
> +	/* Remember the channels that should be cleaned up upon suspend. */
> +	if (is_hvsock_channel(newchannel) || is_sub_channel(newchannel))
> +		atomic_inc(&vmbus_connection.nr_chan_close_on_suspend);
> +
>  	/*
>  	 * Now that we have acquired the channel_mutex,
>  	 * we can release the potentially racing rescind thread.
> @@ -915,6 +921,16 @@ static void vmbus_onoffer(struct
> vmbus_channel_message_header *hdr)
>  	vmbus_process_offer(newchannel);
>  }
> 
> +static void check_ready_for_suspend_event(void)
> +{
> +	/*
> +	 * If all the sub-channels or hv_sock channels have been cleaned up,
> +	 * then it's safe to suspend.
> +	 */
> +	if (atomic_dec_and_test(&vmbus_connection.nr_chan_close_on_suspend))
> +		complete(&vmbus_connection.ready_for_suspend_event);
> +}
> +
>  /*
>   * vmbus_onoffer_rescind - Rescind offer handler.
>   *
> @@ -925,6 +941,7 @@ static void vmbus_onoffer_rescind(struct
> vmbus_channel_message_header *hdr)
>  	struct vmbus_channel_rescind_offer *rescind;
>  	struct vmbus_channel *channel;
>  	struct device *dev;
> +	bool clean_up_chan_for_suspend;
> 
>  	rescind = (struct vmbus_channel_rescind_offer *)hdr;
> 
> @@ -964,6 +981,8 @@ static void vmbus_onoffer_rescind(struct
> vmbus_channel_message_header *hdr)
>  		return;
>  	}
> 
> +	clean_up_chan_for_suspend = is_hvsock_channel(channel) ||
> +				    is_sub_channel(channel);
>  	/*
>  	 * Before setting channel->rescind in vmbus_rescind_cleanup(), we
>  	 * should make sure the channel callback is not running any more.
> @@ -989,6 +1008,10 @@ static void vmbus_onoffer_rescind(struct
> vmbus_channel_message_header *hdr)
>  	if (channel->device_obj) {
>  		if (channel->chn_rescind_callback) {
>  			channel->chn_rescind_callback(channel);
> +
> +			if (clean_up_chan_for_suspend)
> +				check_ready_for_suspend_event();
> +
>  			return;
>  		}
>  		/*
> @@ -1021,6 +1044,11 @@ static void vmbus_onoffer_rescind(struct
> vmbus_channel_message_header *hdr)
>  		}
>  		mutex_unlock(&vmbus_connection.channel_mutex);
>  	}
> +
> +	/* The "channel" may have been freed. Do not access it any longer. */
> +
> +	if (clean_up_chan_for_suspend)
> +		check_ready_for_suspend_event();
>  }

Having to add the above lines twice is a bit clumsy, but the problem is
the overall structure of the vmbus_onoffer_rescind.  The early return in
the case of a rescind_callback function is a bit weird, but I guess it makes
sense since from what I can tell, only uio and hv_sock have rescind callback
functions.   Some minor restructuring might be warranted, but I don't feel
strongly about it.

> 
>  void vmbus_hvsock_device_unregister(struct vmbus_channel *channel)
> diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> index 701d9a8..f15d3115 100644
> --- a/drivers/hv/connection.c
> +++ b/drivers/hv/connection.c
> @@ -26,6 +26,9 @@
>  struct vmbus_connection vmbus_connection = {
>  	.conn_state		= DISCONNECTED,
>  	.next_gpadl_handle	= ATOMIC_INIT(0xE1E10),
> +
> +	.ready_for_suspend_event= COMPLETION_INITIALIZER(
> +				  vmbus_connection.ready_for_suspend_event),
>  };
>  EXPORT_SYMBOL_GPL(vmbus_connection);
> 
> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> index 4610277..9f96e23 100644
> --- a/drivers/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -258,6 +258,18 @@ struct vmbus_connection {
>  	struct workqueue_struct *work_queue;
>  	struct workqueue_struct *handle_primary_chan_wq;
>  	struct workqueue_struct *handle_sub_chan_wq;
> +
> +	/*
> +	 * The number of sub-channels and hv_sock channels that should be
> +	 * cleaned up upon suspend: sub-channels will be re-created upon
> +	 * resume, and hv_sock channels should not survive suspend.
> +	 */
> +	atomic_t nr_chan_close_on_suspend;
> +	/*
> +	 * vmbus_bus_suspend() waits for "nr_chan_close_on_suspend" to
> +	 * drop to zero.
> +	 */
> +	struct completion ready_for_suspend_event;
>  };
> 
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 2bea669..0507157 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2127,7 +2127,8 @@ static int vmbus_acpi_add(struct acpi_device *device)
> 
>  static int vmbus_bus_suspend(struct device *dev)
>  {
> -	struct vmbus_channel *channel;
> +	struct vmbus_channel *channel, *sc;
> +	unsigned long flags;
> 
>  	while (atomic_read(&vmbus_connection.offer_in_progress) != 0) {
>  		/*
> @@ -2146,6 +2147,41 @@ static int vmbus_bus_suspend(struct device *dev)
>  	}
>  	mutex_unlock(&vmbus_connection.channel_mutex);
> 
> +	/*
> +	 * Wait until all the sub-channels and hv_sock channels have been
> +	 * cleaned up. Sub-channels should be destroyed upon suspend, otherwise
> +	 * they would conflict with the new sub-channels that will be created
> +	 * in the resume path. hv_sock channels should also be destroyed, but
> +	 * a hv_sock channel of an established hv_sock connection can not be
> +	 * really destroyed since it may still be referenced by the userspace
> +	 * application, so we just force the hv_sock channel to be rescinded
> +	 * by vmbus_force_channel_rescinded(), and the userspace application
> +	 * will thoroughly destroy the channel after hibernation.
> +	 */
> +	if (atomic_read(&vmbus_connection.nr_chan_close_on_suspend) > 0)

At first glance, the above line seemed useless to me.  You could just do the
wait_for_completion() unconditionally.  But is the intent to handle the case where
the VM never had any sub-channels or hv_sock channels, and so
nr_chan_close_on_suspend never went above 0?  If so, a comment might
be helpful.

> +		wait_for_completion(&vmbus_connection.ready_for_suspend_event);
> +
> +	mutex_lock(&vmbus_connection.channel_mutex);
> +
> +	list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
> +		if (is_hvsock_channel(channel)) {
> +			if (!channel->rescind) {
> +				pr_err("hv_sock channel not rescinded!\n");
> +				WARN_ON_ONCE(1);
> +			}
> +			continue;
> +		}
> +
> +		spin_lock_irqsave(&channel->lock, flags);
> +		list_for_each_entry(sc, &channel->sc_list, sc_list) {
> +			pr_err("Sub-channel not deleted!\n");
> +			WARN_ON_ONCE(1);
> +		}
> +		spin_unlock_irqrestore(&channel->lock, flags);
> +	}
> +
> +	mutex_unlock(&vmbus_connection.channel_mutex);
> +
>  	vmbus_initiate_unload(false);
> 
>  	vmbus_connection.conn_state = DISCONNECTED;
> @@ -2186,6 +2222,9 @@ static int vmbus_bus_resume(struct device *dev)
> 
>  	vmbus_request_offers();
> 
> +	/* Reset the event for the next suspend. */
> +	reinit_completion(&vmbus_connection.ready_for_suspend_event);
> +
>  	return 0;
>  }
> 
> --
> 1.8.3.1


^ permalink raw reply

* RE: [PATCH v3 10/12] Drivers: hv: vmbus: Clean up hv_sock channels by force upon suspend
From: Michael Kelley @ 2019-08-23 20:02 UTC (permalink / raw)
  To: Dexuan Cui, linux-hyperv@vger.kernel.org,
	gregkh@linuxfoundation.org, Stephen Hemminger, Sasha Levin,
	sashal@kernel.org, Haiyang Zhang, KY Srinivasan,
	tglx@linutronix.de
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <1566265863-21252-11-git-send-email-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com> Sent: Monday, August 19, 2019 6:52 PM
> 
> Fake RESCIND_CHANNEL messages to clean up hv_sock channels by force for
> hibernation. There is no better method to clean up the channels since
> some of the channels may still be referenced by the userspace apps when
> hiberantin is triggered: in this case, the "rescind" fields of the

s/hiberantin/hibernation/

> channels are set, and the apps will thoroughly destroy the channels
> after hibernation.
> 
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/hv/vmbus_drv.c | 55
> ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 55 insertions(+)
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index ce9974b..2bea669 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -24,6 +24,7 @@
>  #include <linux/sched/task_stack.h>
> 
>  #include <asm/mshyperv.h>
> +#include <linux/delay.h>
>  #include <linux/notifier.h>
>  #include <linux/ptrace.h>
>  #include <linux/screen_info.h>
> @@ -1069,6 +1070,41 @@ void vmbus_on_msg_dpc(unsigned long data)
>  	vmbus_signal_eom(msg, message_type);
>  }
> 
> +/*
> + * Fake RESCIND_CHANNEL messages to clean up hv_sock channels by force for
> + * hibernation, because hv_sock connections can not persist across hibernation.
> + */
> +static void vmbus_force_channel_rescinded(struct vmbus_channel *channel)
> +{
> +	struct onmessage_work_context *ctx;
> +	struct vmbus_channel_rescind_offer *rescind;
> +
> +	WARN_ON(!is_hvsock_channel(channel));
> +
> +	/*
> +	 * sizeof(*ctx) is small and the allocation should really not fail,
> +	 * otherwise the state of the hv_sock connections ends up in limbo.
> +	 */
> +	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL | __GFP_NOFAIL);
> +
> +	/*
> +	 * So far, these are not really used by Linux. Just set them to the
> +	 * reasonable values conforming to the definitions of the fields.
> +	 */
> +	ctx->msg.header.message_type = 1;
> +	ctx->msg.header.payload_size = sizeof(*rescind);
> +
> +	/* These values are actually used by Linux. */
> +	rescind = (struct vmbus_channel_rescind_offer *)ctx->msg.u.payload;
> +	rescind->header.msgtype = CHANNELMSG_RESCIND_CHANNELOFFER;
> +	rescind->child_relid = channel->offermsg.child_relid;
> +
> +	INIT_WORK(&ctx->work, vmbus_onmessage_work);
> +
> +	queue_work_on(vmbus_connection.connect_cpu,
> +		      vmbus_connection.work_queue,
> +		      &ctx->work);
> +}
> 
>  /*
>   * Direct callback for channels using other deferred processing
> @@ -2091,6 +2127,25 @@ static int vmbus_acpi_add(struct acpi_device *device)
> 
>  static int vmbus_bus_suspend(struct device *dev)
>  {
> +	struct vmbus_channel *channel;
> +
> +	while (atomic_read(&vmbus_connection.offer_in_progress) != 0) {
> +		/*
> +		 * We wait here until any channel offer is currently
> +		 * being processed.
> +		 */

The wording of the comment is a bit off.  Maybe

		/*
		 * We wait here until the completion of any channel
		 * offers that are currently in progress.
		 */

> +		msleep(1);
> +	}
> +
> +	mutex_lock(&vmbus_connection.channel_mutex);
> +	list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
> +		if (!is_hvsock_channel(channel))
> +			continue;
> +
> +		vmbus_force_channel_rescinded(channel);
> +	}
> +	mutex_unlock(&vmbus_connection.channel_mutex);
> +
>  	vmbus_initiate_unload(false);
> 
>  	vmbus_connection.conn_state = DISCONNECTED;
> --
> 1.8.3.1

Modulo the nits:

Reviewed-by: Michael Kelley <mikelley@microsoft.com>


^ permalink raw reply

* RE: [PATCH v3 08/12] Drivers: hv: vmbus: Ignore the offers when resuming from hibernation
From: Michael Kelley @ 2019-08-23 19:56 UTC (permalink / raw)
  To: Dexuan Cui, linux-hyperv@vger.kernel.org,
	gregkh@linuxfoundation.org, Stephen Hemminger, Sasha Levin,
	sashal@kernel.org, Haiyang Zhang, KY Srinivasan,
	tglx@linutronix.de
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <1566265863-21252-9-git-send-email-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com>  Sent: Monday, August 19, 2019 6:52 PM
> 
> When the VM resumes, the host re-sends the offers. We should not add the
> offers to the global vmbus_connection.chn_list again.
> 
> This patch assumes the RELIDs of the channels don't change across
> hibernation. Actually this is not always true, especially in the case of
> NIC SR-IOV the VF vmbus device's RELID sometimes can change. A later patch
> will address this issue by mapping the new offers to the old channels and
> fixing up the old channels, if necessary.
> 
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/hv/channel_mgmt.c | 29 ++++++++++++++++++++++++++++-
>  drivers/hv/connection.c   | 27 +++++++++++++++++++++++++++
>  drivers/hv/hyperv_vmbus.h |  3 +++
>  3 files changed, 58 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> index addcef5..f7a1184 100644
> --- a/drivers/hv/channel_mgmt.c
> +++ b/drivers/hv/channel_mgmt.c
> @@ -854,12 +854,39 @@ void vmbus_initiate_unload(bool crash)
>  static void vmbus_onoffer(struct vmbus_channel_message_header *hdr)
>  {
>  	struct vmbus_channel_offer_channel *offer;
> -	struct vmbus_channel *newchannel;
> +	struct vmbus_channel *oldchannel, *newchannel;
> +	size_t offer_sz;
> 
>  	offer = (struct vmbus_channel_offer_channel *)hdr;
> 
>  	trace_vmbus_onoffer(offer);
> 
> +	mutex_lock(&vmbus_connection.channel_mutex);
> +	oldchannel = find_primary_channel_by_offer(offer);
> +	mutex_unlock(&vmbus_connection.channel_mutex);
> +
> +	if (oldchannel != NULL) {
> +		atomic_dec(&vmbus_connection.offer_in_progress);
> +
> +		/*
> +		 * We're resuming from hibernation: we expect the host to send
> +		 * exactly the same offers that we had before the hibernation.
> +		 */
> +		offer_sz = sizeof(*offer);
> +		if (memcmp(offer, &oldchannel->offermsg, offer_sz) == 0)
> +			return;
> +
> +		pr_debug("Mismatched offer from the host (relid=%d)\n",
> +			 offer->child_relid);
> +
> +		print_hex_dump_debug("Old vmbus offer: ", DUMP_PREFIX_OFFSET,
> +				     16, 4, &oldchannel->offermsg, offer_sz,
> +				     false);
> +		print_hex_dump_debug("New vmbus offer: ", DUMP_PREFIX_OFFSET,
> +				     16, 4, offer, offer_sz, false);
> +		return;
> +	}
> +
>  	/* Allocate the channel object and save this offer. */
>  	newchannel = alloc_channel();
>  	if (!newchannel) {
> diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> index 09829e1..6c7a983 100644
> --- a/drivers/hv/connection.c
> +++ b/drivers/hv/connection.c
> @@ -337,6 +337,33 @@ struct vmbus_channel *relid2channel(u32 relid)
>  }
> 
>  /*
> + * find_primary_channel_by_offer - Get the channel object given the new offer.
> + * This is only used in the resume path of hibernation.
> + */
> +struct vmbus_channel *
> +find_primary_channel_by_offer(const struct vmbus_channel_offer_channel *offer)
> +{
> +	struct vmbus_channel *channel;
> +	const guid_t *inst1, *inst2;
> +
> +	WARN_ON(!mutex_is_locked(&vmbus_connection.channel_mutex));
> +
> +	/* Ignore sub-channel offers. */
> +	if (offer->offer.sub_channel_index != 0)
> +		return NULL;
> +
> +	list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
> +		inst1 = &channel->offermsg.offer.if_instance;
> +		inst2 = &offer->offer.if_instance;
> +
> +		if (guid_equal(inst1, inst2))
> +			return channel;
> +	}
> +
> +	return NULL;
> +}

Any particular reason this new function is in connection.c instead of
putting it in channel_mgmt.c where it is called?

> +
> +/*
>   * vmbus_on_event - Process a channel event notification
>   *
>   * For batched channels (default) optimize host to guest signaling
> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> index 9f7fb6d..c42b46d 100644
> --- a/drivers/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -310,6 +310,9 @@ int vmbus_add_channel_kobj(struct hv_device *device_obj,
> 
>  struct vmbus_channel *relid2channel(u32 relid);
> 
> +struct vmbus_channel *
> +find_primary_channel_by_offer(const struct vmbus_channel_offer_channel *offer);
> +
>  void vmbus_free_channels(void);
> 
>  /* Connection interface */
> --
> 1.8.3.1


^ permalink raw reply

* RE: [PATCH v3 06/12] Drivers: hv: vmbus: Add a helper function is_sub_channel()
From: Michael Kelley @ 2019-08-23 19:51 UTC (permalink / raw)
  To: Dexuan Cui, linux-hyperv@vger.kernel.org,
	gregkh@linuxfoundation.org, Stephen Hemminger, Sasha Levin,
	sashal@kernel.org, Haiyang Zhang, KY Srinivasan,
	tglx@linutronix.de
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <1566265863-21252-7-git-send-email-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com> Sent: Monday, August 19, 2019 6:52 PM
> 
> The existing method of telling if a channel is sub-channel in
> vmbus_process_offer() is cumbersome. This new simple helper function
> is preferred in future.
> 
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  include/linux/hyperv.h | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index 6256cc3..2d39248 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -245,7 +245,10 @@ struct vmbus_channel_offer {
>  		} pipe;
>  	} u;
>  	/*
> -	 * The sub_channel_index is defined in win8.
> +	 * The sub_channel_index is defined in Win8: a value of zero means a
> +	 * primary channel and a value of non-zero means a sub-channel.
> +	 *
> +	 * Before Win8, the field is reserved, meaning it's always zero.
>  	 */
>  	u16 sub_channel_index;
>  	u16 reserved3;
> @@ -934,6 +937,11 @@ static inline bool is_hvsock_channel(const struct vmbus_channel
> *c)
>  		  VMBUS_CHANNEL_TLNPI_PROVIDER_OFFER);
>  }
> 
> +static inline bool is_sub_channel(const struct vmbus_channel *c)
> +{
> +	return c->offermsg.offer.sub_channel_index != 0;
> +}
> +
>  static inline void set_channel_affinity_state(struct vmbus_channel *c,
>  					      enum hv_numa_policy policy)
>  {
> --
> 1.8.3.1

Reviewed-by: Michael Kelley <mikelley@microsoft.com>


^ permalink raw reply

* RE: [PATCH v3 02/12] x86/hyper-v: Implement hv_is_hibernation_supported()
From: Michael Kelley @ 2019-08-23 19:50 UTC (permalink / raw)
  To: Dexuan Cui, linux-hyperv@vger.kernel.org,
	gregkh@linuxfoundation.org, Stephen Hemminger, Sasha Levin,
	sashal@kernel.org, Haiyang Zhang, KY Srinivasan,
	tglx@linutronix.de
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <1566265863-21252-3-git-send-email-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com> Sent: Monday, August 19, 2019 6:52 PM
> 
> When a Linux VM runs on Hyper-V and hibernates, it must disable the
> memory hot-add/remove and balloon up/down capabilities in the hv_balloon
> driver.

I'm unclear on the above statement.  I think the requirement is that
ballooning must not be active when hibernation is initiated.  Is hibernation
blocked in that case?  If not, what happens?

> 
> By default, Hyper-V does not enable the virtual ACPI S4 state for a VM;
> on recent Hyper-V hosts, the administrator is able to enable the virtual
> ACPI S4 state for a VM, so we hope to use the presence of the virtual ACPI

"we hope" sounds very indefinite. :-(    Does ACPI S4 have to be enabled for
hibernation to be initiated?  Goes back to my first question ....

> S4 state as a hint for hv_balloon to disable the aforementioned
> capabilities.
> 
> The new API will be used by hv_balloon.
> 
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  arch/x86/hyperv/hv_init.c      | 7 +++++++
>  include/asm-generic/mshyperv.h | 2 ++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 78e53d9..6735e45 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -7,6 +7,7 @@
>   * Author : K. Y. Srinivasan <kys@microsoft.com>
>   */
> 
> +#include <linux/acpi.h>
>  #include <linux/efi.h>
>  #include <linux/types.h>
>  #include <asm/apic.h>
> @@ -453,3 +454,9 @@ bool hv_is_hyperv_initialized(void)
>  	return hypercall_msr.enable;
>  }
>  EXPORT_SYMBOL_GPL(hv_is_hyperv_initialized);
> +
> +bool hv_is_hibernation_supported(void)
> +{
> +	return acpi_sleep_state_supported(ACPI_STATE_S4);
> +}
> +EXPORT_SYMBOL_GPL(hv_is_hibernation_supported);
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index 0becb7d..1cb4001 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -166,9 +166,11 @@ static inline int cpumask_to_vpset(struct hv_vpset *vpset,
>  void hyperv_report_panic(struct pt_regs *regs, long err);
>  void hyperv_report_panic_msg(phys_addr_t pa, size_t size);
>  bool hv_is_hyperv_initialized(void);
> +bool hv_is_hibernation_supported(void);
>  void hyperv_cleanup(void);
>  #else /* CONFIG_HYPERV */
>  static inline bool hv_is_hyperv_initialized(void) { return false; }
> +static inline bool hv_is_hibernation_supported(void) { return false; }
>  static inline void hyperv_cleanup(void) {}
>  #endif /* CONFIG_HYPERV */
> 
> --
> 1.8.3.1


^ permalink raw reply

* Re: [PATCH v2 1/3] drivers: hv: vmbus: Introduce latency testing
From: Branden Bonaby @ 2019-08-23 17:36 UTC (permalink / raw)
  To: Michael Kelley
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	sashal@kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <DM5PR21MB01379300AF2B441D0B53692DD7A40@DM5PR21MB0137.namprd21.prod.outlook.com>

On Fri, Aug 23, 2019 at 04:44:09PM +0000, Michael Kelley wrote:
> From: Branden Bonaby <brandonbonaby94@gmail.com> Sent: Thursday, August 22, 2019 8:39 PM
> > > > diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> > > > index 09829e15d4a0..c9c63a4033cd 100644
> > > > --- a/drivers/hv/connection.c
> > > > +++ b/drivers/hv/connection.c
> > > > @@ -357,6 +357,9 @@ void vmbus_on_event(unsigned long data)
> > > >
> > > >  	trace_vmbus_on_event(channel);
> > > >
> > > > +#ifdef CONFIG_HYPERV_TESTING
> > > > +	hv_debug_delay_test(channel, INTERRUPT_DELAY);
> > > > +#endif /* CONFIG_HYPERV_TESTING */
> > >
> > > You are following Vitaly's suggestion to use #ifdef's so no code is
> > > generated when HYPERV_TESTING is not enabled.  However, this
> > > direct approach to using #ifdef's really clutters the code and makes
> > > it harder to read and follow.  The better approach is to use the
> > > #ifdef in the include file where the functions are defined.  If
> > > HYPERV_TESTING is not enabled, provide a #else that defines
> > > the function with an empty implementation for which the compiler
> > > will generate no code.   An as example, see the function definition
> > > for hyperv_init() in arch/x86/include/asm/mshyperv.h.  There are
> > > several functions treated similarly in that include file.
> > >
> > 
> > I checked out the code in arch/x86/include/asm/mshyperv.h, after
> > thinking about it, I'm wondering if it would be better just to have
> > two files one called hv_debugfs.c and the other hyperv_debugfs.h.
> > I could put the code definitions in hv_debugfs.c and at the top
> > include the hyperv_debugfs.h file which would house the declarations
> > of these functions under the ifdef. Then like you alluded too use
> > an #else statement that would have the null implementations of the
> > above functions. Then put an #include "hyperv_debugfs.h" in the
> > hyperv_vmbus.h file. I figured instead of putting the code directly
> > into the vmbus_drv.c file it might be best to put them in a seperate
> > file like hv_debugfs.c. This way when we start adding more tests we
> > don't bloat the vmbus_drv.c file unnecessarily. The hv_debugfs.c
> > file would have the #ifdef CONFIG_HYPERV_TESTING at the top so if
> > its not enabled  those null implementations in "hyperv_debugfs.h"
> > woud kick in anywhere that included the hyperv_vmbus.h file which
> > is what we want.
> > 
> > what do you think?
> > 
> 
> I'll preface my comments by saying that how code gets structured
> into files is always a bit of a judgment call.  The goal is to group code
> into sensible chunks to make it easy to understand and to make it
> easy to modify and extend later.  The latter is a prediction about the
> future, which may or may not be accurate.   For the former, what's
> "easy to understand," is often in the eye of the beholder.  So you may
> get different opinions from different reviewers.
> 
> That said, I like the idea of a separate hv_debugfs.c file to contain
> the implementation of the various functions you have added to
> provide the fuzzing capability.   I'm less convinced about the value
> of a separate hyperv_debugfs.h file.   I think you have one big
> #ifdef CONFIG_HYPERV_TESTING followed by the declarations of
> the functions in hv_debugfs.c, followed by #else and null
> implementations of those functions.  This is 20 lines of code or so,
> and could easily go in hyperv_vmbus.h.
> 
> For the new hv_debugfs.c, you can avoid the need for
> #ifdef CONFIG_HYPERV_TESTING by modifying the Makefile in
> drivers/hv so that hv_debugfs.o is built only if CONFIG_HYPERV_TESTING
> is defined.  Look at the current Makefile to see how this is done
> with CONFIG_HYPERV_UTILS and CONFIG_HYPERV_BALLOON.
> 
> Michael
>

I see, that does make sense, I'll go ahead and add these changes.

thanks

branden bonaby

^ permalink raw reply

* RE: [PATCH v2 1/3] drivers: hv: vmbus: Introduce latency testing
From: Michael Kelley @ 2019-08-23 16:44 UTC (permalink / raw)
  To: brandonbonaby94
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	sashal@kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20190823033850.GA41496@Test-Virtual-Machine>

From: Branden Bonaby <brandonbonaby94@gmail.com> Sent: Thursday, August 22, 2019 8:39 PM
> > > diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> > > index 09829e15d4a0..c9c63a4033cd 100644
> > > --- a/drivers/hv/connection.c
> > > +++ b/drivers/hv/connection.c
> > > @@ -357,6 +357,9 @@ void vmbus_on_event(unsigned long data)
> > >
> > >  	trace_vmbus_on_event(channel);
> > >
> > > +#ifdef CONFIG_HYPERV_TESTING
> > > +	hv_debug_delay_test(channel, INTERRUPT_DELAY);
> > > +#endif /* CONFIG_HYPERV_TESTING */
> >
> > You are following Vitaly's suggestion to use #ifdef's so no code is
> > generated when HYPERV_TESTING is not enabled.  However, this
> > direct approach to using #ifdef's really clutters the code and makes
> > it harder to read and follow.  The better approach is to use the
> > #ifdef in the include file where the functions are defined.  If
> > HYPERV_TESTING is not enabled, provide a #else that defines
> > the function with an empty implementation for which the compiler
> > will generate no code.   An as example, see the function definition
> > for hyperv_init() in arch/x86/include/asm/mshyperv.h.  There are
> > several functions treated similarly in that include file.
> >
> 
> I checked out the code in arch/x86/include/asm/mshyperv.h, after
> thinking about it, I'm wondering if it would be better just to have
> two files one called hv_debugfs.c and the other hyperv_debugfs.h.
> I could put the code definitions in hv_debugfs.c and at the top
> include the hyperv_debugfs.h file which would house the declarations
> of these functions under the ifdef. Then like you alluded too use
> an #else statement that would have the null implementations of the
> above functions. Then put an #include "hyperv_debugfs.h" in the
> hyperv_vmbus.h file. I figured instead of putting the code directly
> into the vmbus_drv.c file it might be best to put them in a seperate
> file like hv_debugfs.c. This way when we start adding more tests we
> don't bloat the vmbus_drv.c file unnecessarily. The hv_debugfs.c
> file would have the #ifdef CONFIG_HYPERV_TESTING at the top so if
> its not enabled  those null implementations in "hyperv_debugfs.h"
> woud kick in anywhere that included the hyperv_vmbus.h file which
> is what we want.
> 
> what do you think?
> 

I'll preface my comments by saying that how code gets structured
into files is always a bit of a judgment call.  The goal is to group code
into sensible chunks to make it easy to understand and to make it
easy to modify and extend later.  The latter is a prediction about the
future, which may or may not be accurate.   For the former, what's
"easy to understand," is often in the eye of the beholder.  So you may
get different opinions from different reviewers.

That said, I like the idea of a separate hv_debugfs.c file to contain
the implementation of the various functions you have added to
provide the fuzzing capability.   I'm less convinced about the value
of a separate hyperv_debugfs.h file.   I think you have one big
#ifdef CONFIG_HYPERV_TESTING followed by the declarations of
the functions in hv_debugfs.c, followed by #else and null
implementations of those functions.  This is 20 lines of code or so,
and could easily go in hyperv_vmbus.h.

For the new hv_debugfs.c, you can avoid the need for
#ifdef CONFIG_HYPERV_TESTING by modifying the Makefile in
drivers/hv so that hv_debugfs.o is built only if CONFIG_HYPERV_TESTING
is defined.  Look at the current Makefile to see how this is done
with CONFIG_HYPERV_UTILS and CONFIG_HYPERV_BALLOON.

Michael

^ permalink raw reply

* Re: [PATCH net-next,v5, 0/6] Add software backchannel and mlx5e HV VHCA stats
From: David Miller @ 2019-08-23  6:11 UTC (permalink / raw)
  To: saeedm
  Cc: haiyangz, kys, sthemmin, lorenzo.pieralisi, linux-kernel, eranbe,
	netdev, linux-pci, leon, sashal, bhelgaas, linux-hyperv
In-Reply-To: <f7a0ce8822e197ace496a348a14ac6939313d8f6.camel@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Fri, 23 Aug 2019 05:29:48 +0000

> On Thu, 2019-08-22 at 15:39 -0700, David Miller wrote:
>> From: Haiyang Zhang <haiyangz@microsoft.com>
>> Date: Thu, 22 Aug 2019 22:37:13 +0000
>> 
>> > The v5 is pretty much the same as v4, except Eran had a fix to
>> patch #3 in response to
>> > Leon Romanovsky <leon@kernel.org>.
>> 
>> Well you now have to send me a patch relative to v4 in order to fix
>> that.
>> 
>> When I say "applied", the series is in my tree and is therefore
>> permanent.
>> It is therefore never appropriate to then post a new version of the
>> series.
> 
> Dave, I think you didn't reply back to v4 that the series was applied.
> So that might have created some confusion for Haiyang.

I thought I did, sorry, my bad.

^ permalink raw reply

* Re: [PATCH net-next,v5, 0/6] Add software backchannel and mlx5e HV VHCA stats
From: Saeed Mahameed @ 2019-08-23  5:29 UTC (permalink / raw)
  To: davem@davemloft.net, haiyangz
  Cc: kys, sthemmin@microsoft.com, lorenzo.pieralisi@arm.com,
	linux-kernel@vger.kernel.org, Eran Ben Elisha,
	netdev@vger.kernel.org, linux-pci@vger.kernel.org,
	leon@kernel.org, sashal@kernel.org, bhelgaas@google.com,
	linux-hyperv@vger.kernel.org
In-Reply-To: <20190822.153912.2269276523787180347.davem@davemloft.net>

On Thu, 2019-08-22 at 15:39 -0700, David Miller wrote:
> From: Haiyang Zhang <haiyangz@microsoft.com>
> Date: Thu, 22 Aug 2019 22:37:13 +0000
> 
> > The v5 is pretty much the same as v4, except Eran had a fix to
> patch #3 in response to
> > Leon Romanovsky <leon@kernel.org>.
> 
> Well you now have to send me a patch relative to v4 in order to fix
> that.
> 
> When I say "applied", the series is in my tree and is therefore
> permanent.
> It is therefore never appropriate to then post a new version of the
> series.

Dave, I think you didn't reply back to v4 that the series was applied.
So that might have created some confusion for Haiyang.



^ permalink raw reply

* Re: [PATCH net-next,v5, 0/6] Add software backchannel and mlx5e HV VHCA stats
From: Eran Ben Elisha @ 2019-08-23  5:05 UTC (permalink / raw)
  To: haiyangz, David Miller
  Cc: sashal@kernel.org, Saeed Mahameed, leon@kernel.org,
	lorenzo.pieralisi@arm.com, bhelgaas@google.com,
	linux-pci@vger.kernel.org, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, kys, Stephen Hemminger,
	linux-kernel@vger.kernel.org
In-Reply-To: <DM6PR21MB133778F0890449A5D58DD9D5CAA50@DM6PR21MB1337.namprd21.prod.outlook.com>



On 8/23/2019 1:43 AM, Haiyang Zhang wrote:
> 
> 
>> -----Original Message-----
>> From: David Miller <davem@davemloft.net>
>> Sent: Thursday, August 22, 2019 3:39 PM
>> To: Haiyang Zhang <haiyangz@microsoft.com>
>> Cc: sashal@kernel.org; saeedm@mellanox.com; leon@kernel.org;
>> eranbe@mellanox.com; lorenzo.pieralisi@arm.com; bhelgaas@google.com;
>> linux-pci@vger.kernel.org; linux-hyperv@vger.kernel.org;
>> netdev@vger.kernel.org; KY Srinivasan <kys@microsoft.com>; Stephen
>> Hemminger <sthemmin@microsoft.com>; linux-kernel@vger.kernel.org
>> Subject: Re: [PATCH net-next,v5, 0/6] Add software backchannel and mlx5e
>> HV VHCA stats
>>
>> From: Haiyang Zhang <haiyangz@microsoft.com>
>> Date: Thu, 22 Aug 2019 22:37:13 +0000
>>
>>> The v5 is pretty much the same as v4, except Eran had a fix to patch #3 in
>> response to
>>> Leon Romanovsky <leon@kernel.org>.
>>
>> Well you now have to send me a patch relative to v4 in order to fix that.
>>
>> When I say "applied", the series is in my tree and is therefore permanent.
>> It is therefore never appropriate to then post a new version of the series.
> 
> Thanks.
> 
> Eran, could you submit another patch for the fix to patch #3?

Sure, will prepare and send later today.

> 
> - Haiyang
> 

^ permalink raw reply

* Re: [PATCH v2 1/3] drivers: hv: vmbus: Introduce latency testing
From: Branden Bonaby @ 2019-08-23  3:38 UTC (permalink / raw)
  To: Michael Kelley
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	sashal@kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <DM5PR21MB01375C24AE9ECD93DBE856C2D7AA0@DM5PR21MB0137.namprd21.prod.outlook.com>

> >  endmenu
> > diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> > index 09829e15d4a0..c9c63a4033cd 100644
> > --- a/drivers/hv/connection.c
> > +++ b/drivers/hv/connection.c
> > @@ -357,6 +357,9 @@ void vmbus_on_event(unsigned long data)
> > 
> >  	trace_vmbus_on_event(channel);
> > 
> > +#ifdef CONFIG_HYPERV_TESTING
> > +	hv_debug_delay_test(channel, INTERRUPT_DELAY);
> > +#endif /* CONFIG_HYPERV_TESTING */
> 
> You are following Vitaly's suggestion to use #ifdef's so no code is
> generated when HYPERV_TESTING is not enabled.  However, this
> direct approach to using #ifdef's really clutters the code and makes
> it harder to read and follow.  The better approach is to use the
> #ifdef in the include file where the functions are defined.  If
> HYPERV_TESTING is not enabled, provide a #else that defines
> the function with an empty implementation for which the compiler
> will generate no code.   An as example, see the function definition
> for hyperv_init() in arch/x86/include/asm/mshyperv.h.  There are
> several functions treated similarly in that include file.
> 

I checked out the code in arch/x86/include/asm/mshyperv.h, after
thinking about it, I'm wondering if it would be better just to have
two files one called hv_debugfs.c and the other hyperv_debugfs.h.
I could put the code definitions in hv_debugfs.c and at the top
include the hyperv_debugfs.h file which would house the declarations
of these functions under the ifdef. Then like you alluded too use
an #else statement that would have the null implementations of the
above functions. Then put an #include "hyperv_debugfs.h" in the
hyperv_vmbus.h file. I figured instead of putting the code directly
into the vmbus_drv.c file it might be best to put them in a seperate
file like hv_debugfs.c. This way when we start adding more tests we
don't bloat the vmbus_drv.c file unnecessarily. The hv_debugfs.c
file would have the #ifdef CONFIG_HYPERV_TESTING at the top so if
its not enabled  those null implementations in "hyperv_debugfs.h"
woud kick in anywhere that included the hyperv_vmbus.h file which
is what we want.

what do you think?

> 
> >  	do {
> >  		void (*callback_fn)(void *);
> > 
> > diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> > index 362e70e9d145..edf14f596d8c 100644
> > --- a/drivers/hv/hyperv_vmbus.h
> > +++ b/drivers/hv/hyperv_vmbus.h
> > @@ -357,4 +357,24 @@ enum hvutil_device_state {
> >  	HVUTIL_DEVICE_DYING,     /* driver unload is in progress */
> >  };
> > 
> > +#ifdef CONFIG_HYPERV_TESTING
> > +#include <linux/debugfs.h>
> > +#include <linux/delay.h>
> > +#include <linux/err.h>
> 
> Generally #include files should go at the top of the file, even if they
> are only needed conditionally.
>

I see , will change

> > +#define TESTING "hyperv"
> 
> I'm not seeing what this line is for, or how it is used.

I used it as the top level name for the dentry that
would appear in debugfs but now I realize its actually
not needed, so i'll remove this.

> > --- a/include/linux/hyperv.h
> > +++ b/include/linux/hyperv.h
> > @@ -926,6 +926,21 @@ struct vmbus_channel {
> >  	 * full outbound ring buffer.
> >  	 */
> >  	u64 out_full_first;
> > +
> > +#ifdef CONFIG_HYPERV_TESTING
> > +	/* enabling/disabling fuzz testing on the channel (default is false)*/
> > +	bool fuzz_testing_state;
> > +
> > +	/* Interrupt delay will delay the guest from emptying the ring buffer
> > +	 * for a specific amount of time. The delay is in microseconds and will
> > +	 * be between 1 to a maximum of 1000, its default is 0 (no delay).
> > +	 * The  Message delay will delay guest reading on a per message basis
> > +	 * in microseconds between 1 to 1000 with the default being 0
> > +	 * (no delay).
> > +	 */
> > +	u32 fuzz_testing_interrupt_delay;
> > +	u32 fuzz_testing_message_delay;
> > +#endif /* CONFIG_HYPERV_TESTING */
> 
> For fields in a data structure like this, you don't have much choice
> but to put the #ifdef directly inline.  However, for small fields like this
> and where the data structure isn't size sensitive, you could consider
> omitting the #ifdef and just always including the fields even when
> HYPERV_TESTING is not enabled.  I don't have a strong preference
> either way.
>

I'll take the ifdefs out since the fields aren't too big


^ permalink raw reply

* RE: [PATCH net-next,v5, 0/6] Add software backchannel and mlx5e HV VHCA stats
From: Haiyang Zhang @ 2019-08-22 22:43 UTC (permalink / raw)
  To: David Miller, eranbe@mellanox.com
  Cc: sashal@kernel.org, saeedm@mellanox.com, leon@kernel.org,
	lorenzo.pieralisi@arm.com, bhelgaas@google.com,
	linux-pci@vger.kernel.org, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, KY Srinivasan, Stephen Hemminger,
	linux-kernel@vger.kernel.org
In-Reply-To: <20190822.153912.2269276523787180347.davem@davemloft.net>



> -----Original Message-----
> From: David Miller <davem@davemloft.net>
> Sent: Thursday, August 22, 2019 3:39 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: sashal@kernel.org; saeedm@mellanox.com; leon@kernel.org;
> eranbe@mellanox.com; lorenzo.pieralisi@arm.com; bhelgaas@google.com;
> linux-pci@vger.kernel.org; linux-hyperv@vger.kernel.org;
> netdev@vger.kernel.org; KY Srinivasan <kys@microsoft.com>; Stephen
> Hemminger <sthemmin@microsoft.com>; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH net-next,v5, 0/6] Add software backchannel and mlx5e
> HV VHCA stats
> 
> From: Haiyang Zhang <haiyangz@microsoft.com>
> Date: Thu, 22 Aug 2019 22:37:13 +0000
> 
> > The v5 is pretty much the same as v4, except Eran had a fix to patch #3 in
> response to
> > Leon Romanovsky <leon@kernel.org>.
> 
> Well you now have to send me a patch relative to v4 in order to fix that.
> 
> When I say "applied", the series is in my tree and is therefore permanent.
> It is therefore never appropriate to then post a new version of the series.

Thanks.

Eran, could you submit another patch for the fix to patch #3? 

- Haiyang


^ permalink raw reply

* Re: [PATCH net-next,v5, 0/6] Add software backchannel and mlx5e HV VHCA stats
From: David Miller @ 2019-08-22 22:39 UTC (permalink / raw)
  To: haiyangz
  Cc: sashal, saeedm, leon, eranbe, lorenzo.pieralisi, bhelgaas,
	linux-pci, linux-hyperv, netdev, kys, sthemmin, linux-kernel
In-Reply-To: <DM6PR21MB133743FB2006A28AE10A170CCAA50@DM6PR21MB1337.namprd21.prod.outlook.com>

From: Haiyang Zhang <haiyangz@microsoft.com>
Date: Thu, 22 Aug 2019 22:37:13 +0000

> The v5 is pretty much the same as v4, except Eran had a fix to patch #3 in response to
> Leon Romanovsky <leon@kernel.org>.

Well you now have to send me a patch relative to v4 in order to fix that.

When I say "applied", the series is in my tree and is therefore permanent.
It is therefore never appropriate to then post a new version of the series.

^ permalink raw reply

* RE: [PATCH net-next,v5, 0/6] Add software backchannel and mlx5e HV VHCA stats
From: Haiyang Zhang @ 2019-08-22 22:37 UTC (permalink / raw)
  To: David Miller
  Cc: sashal@kernel.org, saeedm@mellanox.com, leon@kernel.org,
	eranbe@mellanox.com, lorenzo.pieralisi@arm.com,
	bhelgaas@google.com, linux-pci@vger.kernel.org,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
	KY Srinivasan, Stephen Hemminger, linux-kernel@vger.kernel.org
In-Reply-To: <20190822.153315.1245817410062415025.davem@davemloft.net>



> -----Original Message-----
> From: David Miller <davem@davemloft.net>
> Sent: Thursday, August 22, 2019 3:33 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: sashal@kernel.org; saeedm@mellanox.com; leon@kernel.org;
> eranbe@mellanox.com; lorenzo.pieralisi@arm.com; bhelgaas@google.com;
> linux-pci@vger.kernel.org; linux-hyperv@vger.kernel.org;
> netdev@vger.kernel.org; KY Srinivasan <kys@microsoft.com>; Stephen
> Hemminger <sthemmin@microsoft.com>; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH net-next,v5, 0/6] Add software backchannel and mlx5e
> HV VHCA stats
> 
> 
> I applied this patch series already to net-next, what are you doing?

The v5 is pretty much the same as v4, except Eran had a fix to patch #3 in response to
Leon Romanovsky <leon@kernel.org>.

Thanks,
- Haiyang


^ permalink raw reply

* Re: [PATCH net-next,v5, 0/6] Add software backchannel and mlx5e HV VHCA stats
From: David Miller @ 2019-08-22 22:33 UTC (permalink / raw)
  To: haiyangz
  Cc: sashal, saeedm, leon, eranbe, lorenzo.pieralisi, bhelgaas,
	linux-pci, linux-hyperv, netdev, kys, sthemmin, linux-kernel
In-Reply-To: <1566512708-13785-1-git-send-email-haiyangz@microsoft.com>


I applied this patch series already to net-next, what are you doing?

^ permalink raw reply

* RE: [Patch v2] storvsc: setup 1:1 mapping between hardware queue and CPU queue
From: Long Li @ 2019-08-22 22:29 UTC (permalink / raw)
  To: Michael Kelley, longli@linuxonhyperv.com, KY Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Sasha Levin,
	James E.J. Bottomley, Martin K. Petersen,
	linux-hyperv@vger.kernel.org, linux-scsi@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <CY4PR21MB0741D566F71F0D1E8C5B9970CEA50@CY4PR21MB0741.namprd21.prod.outlook.com>

>>>Subject: RE: [Patch v2] storvsc: setup 1:1 mapping between hardware
>>>queue and CPU queue
>>>
>>>>>>Subject: RE: [Patch v2] storvsc: setup 1:1 mapping between hardware
>>>>>>queue and CPU queue
>>>>>>
>>>>>>From: Long Li <longli@linuxonhyperv.com> Sent: Thursday, August 22,
>>>>>>2019
>>>>>>1:42 PM
>>>>>>>
>>>>>>> storvsc doesn't use a dedicated hardware queue for a given CPU
>>>queue.
>>>>>>> When issuing I/O, it selects returning CPU (hardware queue)
>>>>>>> dynamically based on vmbus channel usage across all channels.
>>>>>>>
>>>>>>> This patch advertises num_possible_cpus() as number of hardware
>>>>>>> queues. This will have upper layer setup 1:1 mapping between
>>>>>>> hardware queue and CPU queue and avoid unnecessary locking when
>>>issuing I/O.
>>>>>>>
>>>>>>> Changes:
>>>>>>> v2: rely on default upper layer function to map queues. (suggested
>>>>>>> by Ming Lei
>>>>>>> <tom.leiming@gmail.com>)
>>>>>>>
>>>>>>> Signed-off-by: Long Li <longli@microsoft.com>
>>>>>>> ---
>>>>>>>  drivers/scsi/storvsc_drv.c | 3 +--
>>>>>>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/scsi/storvsc_drv.c
>>>>>>> b/drivers/scsi/storvsc_drv.c index b89269120a2d..dfd3b76a4f89
>>>>>>> 100644
>>>>>>> --- a/drivers/scsi/storvsc_drv.c
>>>>>>> +++ b/drivers/scsi/storvsc_drv.c
>>>>>>> @@ -1836,8 +1836,7 @@ static int storvsc_probe(struct hv_device
>>>>>>*device,
>>>>>>>  	/*
>>>>>>>  	 * Set the number of HW queues we are supporting.
>>>>>>>  	 */
>>>>>>> -	if (stor_device->num_sc != 0)
>>>>>>> -		host->nr_hw_queues = stor_device->num_sc + 1;
>>>>>>> +	host->nr_hw_queues = num_possible_cpus();
>>>>>>
>>>>>>For a lot of the VM sizes in Azure, num_possible_cpus() is 128, even
>>>>>>if the VM has only 4 or 8 or some other smaller number of vCPUs.
>>>>>>So I'm wondering if you really want num_present_cpus() here instead,
>>>>>>which would include only the vCPUs that actually exist in the VM.
>>>
>>>I think reporting num_possible_cpus() doesn't do more harm or take more
>>>resources. Because block layer allocates map for all the possible CPUs.
>>>
>>>The actual mapping is done in blk_mq_map_queues(), and it iterates all the
>>>possible CPUs. If we report num_present_cpus(), the rest of the CPUs also
>>>need to be mapped.

Actually I get your point, reporting num_present_cpus() will get less number of struct blk_mq_hw_ctx created. So it saves memory.

If we don't plan to support adding/onlining CPUs, we should use num_present_cpus().

>>>
>>>>>>
>>>>>>Michael
>>>>>>
>>>>>>>
>>>>>>>  	/*
>>>>>>>  	 * Set the error handler work queue.
>>>>>>> --
>>>>>>> 2.17.1


^ permalink raw reply

* [PATCH net-next,v5, 1/6] PCI: hv: Add a paravirtual backchannel in software
From: Haiyang Zhang @ 2019-08-22 22:26 UTC (permalink / raw)
  To: sashal@kernel.org, davem@davemloft.net, saeedm@mellanox.com,
	leon@kernel.org, eranbe@mellanox.com, lorenzo.pieralisi@arm.com,
	bhelgaas@google.com, linux-pci@vger.kernel.org,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org
  Cc: Haiyang Zhang, KY Srinivasan, Stephen Hemminger,
	linux-kernel@vger.kernel.org, Dexuan Cui, Jake Oshins
In-Reply-To: <1566512708-13785-1-git-send-email-haiyangz@microsoft.com>

From: Dexuan Cui <decui@microsoft.com>

Windows SR-IOV provides a backchannel mechanism in software for communication
between a VF driver and a PF driver.  These "configuration blocks" are
similar in concept to PCI configuration space, but instead of doing reads and
writes in 32-bit chunks through a very slow path, packets of up to 128 bytes
can be sent or received asynchronously.

Nearly every SR-IOV device contains just such a communications channel in
hardware, so using this one in software is usually optional.  Using the
software channel, however, allows driver implementers to leverage software
tools that fuzz the communications channel looking for vulnerabilities.

The usage model for these packets puts the responsibility for reading or
writing on the VF driver.  The VF driver sends a read or a write packet,
indicating which "block" is being referred to by number.

If the PF driver wishes to initiate communication, it can "invalidate" one or
more of the first 64 blocks.  This invalidation is delivered via a callback
supplied by the VF driver by this driver.

No protocol is implied, except that supplied by the PF and VF drivers.

Signed-off-by: Jake Oshins <jakeo@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/pci/controller/pci-hyperv.c | 302 ++++++++++++++++++++++++++++++++++++
 include/linux/hyperv.h              |  15 ++
 2 files changed, 317 insertions(+)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 40b6254..57adeca 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -365,6 +365,39 @@ struct pci_delete_interrupt {
 	struct tran_int_desc int_desc;
 } __packed;
 
+/*
+ * Note: the VM must pass a valid block id, wslot and bytes_requested.
+ */
+struct pci_read_block {
+	struct pci_message message_type;
+	u32 block_id;
+	union win_slot_encoding wslot;
+	u32 bytes_requested;
+} __packed;
+
+struct pci_read_block_response {
+	struct vmpacket_descriptor hdr;
+	u32 status;
+	u8 bytes[HV_CONFIG_BLOCK_SIZE_MAX];
+} __packed;
+
+/*
+ * Note: the VM must pass a valid block id, wslot and byte_count.
+ */
+struct pci_write_block {
+	struct pci_message message_type;
+	u32 block_id;
+	union win_slot_encoding wslot;
+	u32 byte_count;
+	u8 bytes[HV_CONFIG_BLOCK_SIZE_MAX];
+} __packed;
+
+struct pci_dev_inval_block {
+	struct pci_incoming_message incoming;
+	union win_slot_encoding wslot;
+	u64 block_mask;
+} __packed;
+
 struct pci_dev_incoming {
 	struct pci_incoming_message incoming;
 	union win_slot_encoding wslot;
@@ -499,6 +532,9 @@ struct hv_pci_dev {
 	struct hv_pcibus_device *hbus;
 	struct work_struct wrk;
 
+	void (*block_invalidate)(void *context, u64 block_mask);
+	void *invalidate_context;
+
 	/*
 	 * What would be observed if one wrote 0xFFFFFFFF to a BAR and then
 	 * read it back, for each of the BAR offsets within config space.
@@ -817,6 +853,256 @@ static int hv_pcifront_write_config(struct pci_bus *bus, unsigned int devfn,
 	.write = hv_pcifront_write_config,
 };
 
+/*
+ * Paravirtual backchannel
+ *
+ * Hyper-V SR-IOV provides a backchannel mechanism in software for
+ * communication between a VF driver and a PF driver.  These
+ * "configuration blocks" are similar in concept to PCI configuration space,
+ * but instead of doing reads and writes in 32-bit chunks through a very slow
+ * path, packets of up to 128 bytes can be sent or received asynchronously.
+ *
+ * Nearly every SR-IOV device contains just such a communications channel in
+ * hardware, so using this one in software is usually optional.  Using the
+ * software channel, however, allows driver implementers to leverage software
+ * tools that fuzz the communications channel looking for vulnerabilities.
+ *
+ * The usage model for these packets puts the responsibility for reading or
+ * writing on the VF driver.  The VF driver sends a read or a write packet,
+ * indicating which "block" is being referred to by number.
+ *
+ * If the PF driver wishes to initiate communication, it can "invalidate" one or
+ * more of the first 64 blocks.  This invalidation is delivered via a callback
+ * supplied by the VF driver by this driver.
+ *
+ * No protocol is implied, except that supplied by the PF and VF drivers.
+ */
+
+struct hv_read_config_compl {
+	struct hv_pci_compl comp_pkt;
+	void *buf;
+	unsigned int len;
+	unsigned int bytes_returned;
+};
+
+/**
+ * hv_pci_read_config_compl() - Invoked when a response packet
+ * for a read config block operation arrives.
+ * @context:		Identifies the read config operation
+ * @resp:		The response packet itself
+ * @resp_packet_size:	Size in bytes of the response packet
+ */
+static void hv_pci_read_config_compl(void *context, struct pci_response *resp,
+				     int resp_packet_size)
+{
+	struct hv_read_config_compl *comp = context;
+	struct pci_read_block_response *read_resp =
+		(struct pci_read_block_response *)resp;
+	unsigned int data_len, hdr_len;
+
+	hdr_len = offsetof(struct pci_read_block_response, bytes);
+	if (resp_packet_size < hdr_len) {
+		comp->comp_pkt.completion_status = -1;
+		goto out;
+	}
+
+	data_len = resp_packet_size - hdr_len;
+	if (data_len > 0 && read_resp->status == 0) {
+		comp->bytes_returned = min(comp->len, data_len);
+		memcpy(comp->buf, read_resp->bytes, comp->bytes_returned);
+	} else {
+		comp->bytes_returned = 0;
+	}
+
+	comp->comp_pkt.completion_status = read_resp->status;
+out:
+	complete(&comp->comp_pkt.host_event);
+}
+
+/**
+ * hv_read_config_block() - Sends a read config block request to
+ * the back-end driver running in the Hyper-V parent partition.
+ * @pdev:		The PCI driver's representation for this device.
+ * @buf:		Buffer into which the config block will be copied.
+ * @len:		Size in bytes of buf.
+ * @block_id:		Identifies the config block which has been requested.
+ * @bytes_returned:	Size which came back from the back-end driver.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+int hv_read_config_block(struct pci_dev *pdev, void *buf, unsigned int len,
+			 unsigned int block_id, unsigned int *bytes_returned)
+{
+	struct hv_pcibus_device *hbus =
+		container_of(pdev->bus->sysdata, struct hv_pcibus_device,
+			     sysdata);
+	struct {
+		struct pci_packet pkt;
+		char buf[sizeof(struct pci_read_block)];
+	} pkt;
+	struct hv_read_config_compl comp_pkt;
+	struct pci_read_block *read_blk;
+	int ret;
+
+	if (len == 0 || len > HV_CONFIG_BLOCK_SIZE_MAX)
+		return -EINVAL;
+
+	init_completion(&comp_pkt.comp_pkt.host_event);
+	comp_pkt.buf = buf;
+	comp_pkt.len = len;
+
+	memset(&pkt, 0, sizeof(pkt));
+	pkt.pkt.completion_func = hv_pci_read_config_compl;
+	pkt.pkt.compl_ctxt = &comp_pkt;
+	read_blk = (struct pci_read_block *)&pkt.pkt.message;
+	read_blk->message_type.type = PCI_READ_BLOCK;
+	read_blk->wslot.slot = devfn_to_wslot(pdev->devfn);
+	read_blk->block_id = block_id;
+	read_blk->bytes_requested = len;
+
+	ret = vmbus_sendpacket(hbus->hdev->channel, read_blk,
+			       sizeof(*read_blk), (unsigned long)&pkt.pkt,
+			       VM_PKT_DATA_INBAND,
+			       VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
+	if (ret)
+		return ret;
+
+	ret = wait_for_response(hbus->hdev, &comp_pkt.comp_pkt.host_event);
+	if (ret)
+		return ret;
+
+	if (comp_pkt.comp_pkt.completion_status != 0 ||
+	    comp_pkt.bytes_returned == 0) {
+		dev_err(&hbus->hdev->device,
+			"Read Config Block failed: 0x%x, bytes_returned=%d\n",
+			comp_pkt.comp_pkt.completion_status,
+			comp_pkt.bytes_returned);
+		return -EIO;
+	}
+
+	*bytes_returned = comp_pkt.bytes_returned;
+	return 0;
+}
+EXPORT_SYMBOL(hv_read_config_block);
+
+/**
+ * hv_pci_write_config_compl() - Invoked when a response packet for a write
+ * config block operation arrives.
+ * @context:		Identifies the write config operation
+ * @resp:		The response packet itself
+ * @resp_packet_size:	Size in bytes of the response packet
+ */
+static void hv_pci_write_config_compl(void *context, struct pci_response *resp,
+				      int resp_packet_size)
+{
+	struct hv_pci_compl *comp_pkt = context;
+
+	comp_pkt->completion_status = resp->status;
+	complete(&comp_pkt->host_event);
+}
+
+/**
+ * hv_write_config_block() - Sends a write config block request to the
+ * back-end driver running in the Hyper-V parent partition.
+ * @pdev:		The PCI driver's representation for this device.
+ * @buf:		Buffer from which the config block will	be copied.
+ * @len:		Size in bytes of buf.
+ * @block_id:		Identifies the config block which is being written.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+int hv_write_config_block(struct pci_dev *pdev, void *buf, unsigned int len,
+			  unsigned int block_id)
+{
+	struct hv_pcibus_device *hbus =
+		container_of(pdev->bus->sysdata, struct hv_pcibus_device,
+			     sysdata);
+	struct {
+		struct pci_packet pkt;
+		char buf[sizeof(struct pci_write_block)];
+		u32 reserved;
+	} pkt;
+	struct hv_pci_compl comp_pkt;
+	struct pci_write_block *write_blk;
+	u32 pkt_size;
+	int ret;
+
+	if (len == 0 || len > HV_CONFIG_BLOCK_SIZE_MAX)
+		return -EINVAL;
+
+	init_completion(&comp_pkt.host_event);
+
+	memset(&pkt, 0, sizeof(pkt));
+	pkt.pkt.completion_func = hv_pci_write_config_compl;
+	pkt.pkt.compl_ctxt = &comp_pkt;
+	write_blk = (struct pci_write_block *)&pkt.pkt.message;
+	write_blk->message_type.type = PCI_WRITE_BLOCK;
+	write_blk->wslot.slot = devfn_to_wslot(pdev->devfn);
+	write_blk->block_id = block_id;
+	write_blk->byte_count = len;
+	memcpy(write_blk->bytes, buf, len);
+	pkt_size = offsetof(struct pci_write_block, bytes) + len;
+	/*
+	 * This quirk is required on some hosts shipped around 2018, because
+	 * these hosts don't check the pkt_size correctly (new hosts have been
+	 * fixed since early 2019). The quirk is also safe on very old hosts
+	 * and new hosts, because, on them, what really matters is the length
+	 * specified in write_blk->byte_count.
+	 */
+	pkt_size += sizeof(pkt.reserved);
+
+	ret = vmbus_sendpacket(hbus->hdev->channel, write_blk, pkt_size,
+			       (unsigned long)&pkt.pkt, VM_PKT_DATA_INBAND,
+			       VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
+	if (ret)
+		return ret;
+
+	ret = wait_for_response(hbus->hdev, &comp_pkt.host_event);
+	if (ret)
+		return ret;
+
+	if (comp_pkt.completion_status != 0) {
+		dev_err(&hbus->hdev->device,
+			"Write Config Block failed: 0x%x\n",
+			comp_pkt.completion_status);
+		return -EIO;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(hv_write_config_block);
+
+/**
+ * hv_register_block_invalidate() - Invoked when a config block invalidation
+ * arrives from the back-end driver.
+ * @pdev:		The PCI driver's representation for this device.
+ * @context:		Identifies the device.
+ * @block_invalidate:	Identifies all of the blocks being invalidated.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+int hv_register_block_invalidate(struct pci_dev *pdev, void *context,
+				 void (*block_invalidate)(void *context,
+							  u64 block_mask))
+{
+	struct hv_pcibus_device *hbus =
+		container_of(pdev->bus->sysdata, struct hv_pcibus_device,
+			     sysdata);
+	struct hv_pci_dev *hpdev;
+
+	hpdev = get_pcichild_wslot(hbus, devfn_to_wslot(pdev->devfn));
+	if (!hpdev)
+		return -ENODEV;
+
+	hpdev->block_invalidate = block_invalidate;
+	hpdev->invalidate_context = context;
+
+	put_pcichild(hpdev);
+	return 0;
+
+}
+EXPORT_SYMBOL(hv_register_block_invalidate);
+
 /* Interrupt management hooks */
 static void hv_int_desc_free(struct hv_pci_dev *hpdev,
 			     struct tran_int_desc *int_desc)
@@ -1968,6 +2254,7 @@ static void hv_pci_onchannelcallback(void *context)
 	struct pci_response *response;
 	struct pci_incoming_message *new_message;
 	struct pci_bus_relations *bus_rel;
+	struct pci_dev_inval_block *inval;
 	struct pci_dev_incoming *dev_message;
 	struct hv_pci_dev *hpdev;
 
@@ -2045,6 +2332,21 @@ static void hv_pci_onchannelcallback(void *context)
 				}
 				break;
 
+			case PCI_INVALIDATE_BLOCK:
+
+				inval = (struct pci_dev_inval_block *)buffer;
+				hpdev = get_pcichild_wslot(hbus,
+							   inval->wslot.slot);
+				if (hpdev) {
+					if (hpdev->block_invalidate) {
+						hpdev->block_invalidate(
+						    hpdev->invalidate_context,
+						    inval->block_mask);
+					}
+					put_pcichild(hpdev);
+				}
+				break;
+
 			default:
 				dev_warn(&hbus->hdev->device,
 					"Unimplemented protocol message %x\n",
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 6256cc3..9d37f8c 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1578,4 +1578,19 @@ struct vmpacket_descriptor *
 	for (pkt = hv_pkt_iter_first(channel); pkt; \
 	    pkt = hv_pkt_iter_next(channel, pkt))
 
+/*
+ * Functions for passing data between SR-IOV PF and VF drivers.  The VF driver
+ * sends requests to read and write blocks. Each block must be 128 bytes or
+ * smaller. Optionally, the VF driver can register a callback function which
+ * will be invoked when the host says that one or more of the first 64 block
+ * IDs is "invalid" which means that the VF driver should reread them.
+ */
+#define HV_CONFIG_BLOCK_SIZE_MAX 128
+int hv_read_config_block(struct pci_dev *dev, void *buf, unsigned int buf_len,
+			 unsigned int block_id, unsigned int *bytes_returned);
+int hv_write_config_block(struct pci_dev *dev, void *buf, unsigned int len,
+			  unsigned int block_id);
+int hv_register_block_invalidate(struct pci_dev *dev, void *context,
+				 void (*block_invalidate)(void *context,
+							  u64 block_mask));
 #endif /* _HYPERV_H */
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH net-next,v5, 2/6] PCI: hv: Add a Hyper-V PCI interface driver for software backchannel interface
From: Haiyang Zhang @ 2019-08-22 22:26 UTC (permalink / raw)
  To: sashal@kernel.org, davem@davemloft.net, saeedm@mellanox.com,
	leon@kernel.org, eranbe@mellanox.com, lorenzo.pieralisi@arm.com,
	bhelgaas@google.com, linux-pci@vger.kernel.org,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org
  Cc: Haiyang Zhang, KY Srinivasan, Stephen Hemminger,
	linux-kernel@vger.kernel.org
In-Reply-To: <1566512708-13785-1-git-send-email-haiyangz@microsoft.com>

This interface driver is a helper driver allows other drivers to
have a common interface with the Hyper-V PCI frontend driver.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 MAINTAINERS                              |  1 +
 drivers/pci/Kconfig                      |  1 +
 drivers/pci/controller/Kconfig           |  7 ++++
 drivers/pci/controller/Makefile          |  1 +
 drivers/pci/controller/pci-hyperv-intf.c | 67 ++++++++++++++++++++++++++++++++
 drivers/pci/controller/pci-hyperv.c      | 12 ++++--
 include/linux/hyperv.h                   | 30 ++++++++++----
 7 files changed, 108 insertions(+), 11 deletions(-)
 create mode 100644 drivers/pci/controller/pci-hyperv-intf.c

diff --git a/MAINTAINERS b/MAINTAINERS
index a406947..9860853 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7469,6 +7469,7 @@ F:	drivers/hid/hid-hyperv.c
 F:	drivers/hv/
 F:	drivers/input/serio/hyperv-keyboard.c
 F:	drivers/pci/controller/pci-hyperv.c
+F:	drivers/pci/controller/pci-hyperv-intf.c
 F:	drivers/net/hyperv/
 F:	drivers/scsi/storvsc_drv.c
 F:	drivers/uio/uio_hv_generic.c
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 2ab9240..c313de9 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -182,6 +182,7 @@ config PCI_LABEL
 config PCI_HYPERV
         tristate "Hyper-V PCI Frontend"
         depends on X86 && HYPERV && PCI_MSI && PCI_MSI_IRQ_DOMAIN && X86_64
+	select PCI_HYPERV_INTERFACE
         help
           The PCI device frontend driver allows the kernel to import arbitrary
           PCI devices from a PCI backend to support PCI driver domains.
diff --git a/drivers/pci/controller/Kconfig b/drivers/pci/controller/Kconfig
index fe9f9f1..70e0782 100644
--- a/drivers/pci/controller/Kconfig
+++ b/drivers/pci/controller/Kconfig
@@ -281,5 +281,12 @@ config VMD
 	  To compile this driver as a module, choose M here: the
 	  module will be called vmd.
 
+config PCI_HYPERV_INTERFACE
+	tristate "Hyper-V PCI Interface"
+	depends on X86 && HYPERV && PCI_MSI && PCI_MSI_IRQ_DOMAIN && X86_64
+	help
+	  The Hyper-V PCI Interface is a helper driver allows other drivers to
+	  have a common interface with the Hyper-V PCI frontend driver.
+
 source "drivers/pci/controller/dwc/Kconfig"
 endmenu
diff --git a/drivers/pci/controller/Makefile b/drivers/pci/controller/Makefile
index d56a507..a2a22c9 100644
--- a/drivers/pci/controller/Makefile
+++ b/drivers/pci/controller/Makefile
@@ -4,6 +4,7 @@ obj-$(CONFIG_PCIE_CADENCE_HOST) += pcie-cadence-host.o
 obj-$(CONFIG_PCIE_CADENCE_EP) += pcie-cadence-ep.o
 obj-$(CONFIG_PCI_FTPCI100) += pci-ftpci100.o
 obj-$(CONFIG_PCI_HYPERV) += pci-hyperv.o
+obj-$(CONFIG_PCI_HYPERV_INTERFACE) += pci-hyperv-intf.o
 obj-$(CONFIG_PCI_MVEBU) += pci-mvebu.o
 obj-$(CONFIG_PCI_AARDVARK) += pci-aardvark.o
 obj-$(CONFIG_PCI_TEGRA) += pci-tegra.o
diff --git a/drivers/pci/controller/pci-hyperv-intf.c b/drivers/pci/controller/pci-hyperv-intf.c
new file mode 100644
index 0000000..cc96be4
--- /dev/null
+++ b/drivers/pci/controller/pci-hyperv-intf.c
@@ -0,0 +1,67 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) Microsoft Corporation.
+ *
+ * Author:
+ *   Haiyang Zhang <haiyangz@microsoft.com>
+ *
+ * This small module is a helper driver allows other drivers to
+ * have a common interface with the Hyper-V PCI frontend driver.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/hyperv.h>
+
+struct hyperv_pci_block_ops hvpci_block_ops;
+EXPORT_SYMBOL_GPL(hvpci_block_ops);
+
+int hyperv_read_cfg_blk(struct pci_dev *dev, void *buf, unsigned int buf_len,
+			unsigned int block_id, unsigned int *bytes_returned)
+{
+	if (!hvpci_block_ops.read_block)
+		return -EOPNOTSUPP;
+
+	return hvpci_block_ops.read_block(dev, buf, buf_len, block_id,
+					  bytes_returned);
+}
+EXPORT_SYMBOL_GPL(hyperv_read_cfg_blk);
+
+int hyperv_write_cfg_blk(struct pci_dev *dev, void *buf, unsigned int len,
+			 unsigned int block_id)
+{
+	if (!hvpci_block_ops.write_block)
+		return -EOPNOTSUPP;
+
+	return hvpci_block_ops.write_block(dev, buf, len, block_id);
+}
+EXPORT_SYMBOL_GPL(hyperv_write_cfg_blk);
+
+int hyperv_reg_block_invalidate(struct pci_dev *dev, void *context,
+				void (*block_invalidate)(void *context,
+							 u64 block_mask))
+{
+	if (!hvpci_block_ops.reg_blk_invalidate)
+		return -EOPNOTSUPP;
+
+	return hvpci_block_ops.reg_blk_invalidate(dev, context,
+						  block_invalidate);
+}
+EXPORT_SYMBOL_GPL(hyperv_reg_block_invalidate);
+
+static void __exit exit_hv_pci_intf(void)
+{
+}
+
+static int __init init_hv_pci_intf(void)
+{
+	return 0;
+}
+
+module_init(init_hv_pci_intf);
+module_exit(exit_hv_pci_intf);
+
+MODULE_DESCRIPTION("Hyper-V PCI Interface");
+MODULE_LICENSE("GPL v2");
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 57adeca..9c93ac2 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -983,7 +983,6 @@ int hv_read_config_block(struct pci_dev *pdev, void *buf, unsigned int len,
 	*bytes_returned = comp_pkt.bytes_returned;
 	return 0;
 }
-EXPORT_SYMBOL(hv_read_config_block);
 
 /**
  * hv_pci_write_config_compl() - Invoked when a response packet for a write
@@ -1070,7 +1069,6 @@ int hv_write_config_block(struct pci_dev *pdev, void *buf, unsigned int len,
 
 	return 0;
 }
-EXPORT_SYMBOL(hv_write_config_block);
 
 /**
  * hv_register_block_invalidate() - Invoked when a config block invalidation
@@ -1101,7 +1099,6 @@ int hv_register_block_invalidate(struct pci_dev *pdev, void *context,
 	return 0;
 
 }
-EXPORT_SYMBOL(hv_register_block_invalidate);
 
 /* Interrupt management hooks */
 static void hv_int_desc_free(struct hv_pci_dev *hpdev,
@@ -3045,10 +3042,19 @@ static int hv_pci_remove(struct hv_device *hdev)
 static void __exit exit_hv_pci_drv(void)
 {
 	vmbus_driver_unregister(&hv_pci_drv);
+
+	hvpci_block_ops.read_block = NULL;
+	hvpci_block_ops.write_block = NULL;
+	hvpci_block_ops.reg_blk_invalidate = NULL;
 }
 
 static int __init init_hv_pci_drv(void)
 {
+	/* Initialize PCI block r/w interface */
+	hvpci_block_ops.read_block = hv_read_config_block;
+	hvpci_block_ops.write_block = hv_write_config_block;
+	hvpci_block_ops.reg_blk_invalidate = hv_register_block_invalidate;
+
 	return vmbus_driver_register(&hv_pci_drv);
 }
 
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 9d37f8c..2afe6fd 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1579,18 +1579,32 @@ struct vmpacket_descriptor *
 	    pkt = hv_pkt_iter_next(channel, pkt))
 
 /*
- * Functions for passing data between SR-IOV PF and VF drivers.  The VF driver
+ * Interface for passing data between SR-IOV PF and VF drivers. The VF driver
  * sends requests to read and write blocks. Each block must be 128 bytes or
  * smaller. Optionally, the VF driver can register a callback function which
  * will be invoked when the host says that one or more of the first 64 block
  * IDs is "invalid" which means that the VF driver should reread them.
  */
 #define HV_CONFIG_BLOCK_SIZE_MAX 128
-int hv_read_config_block(struct pci_dev *dev, void *buf, unsigned int buf_len,
-			 unsigned int block_id, unsigned int *bytes_returned);
-int hv_write_config_block(struct pci_dev *dev, void *buf, unsigned int len,
-			  unsigned int block_id);
-int hv_register_block_invalidate(struct pci_dev *dev, void *context,
-				 void (*block_invalidate)(void *context,
-							  u64 block_mask));
+
+int hyperv_read_cfg_blk(struct pci_dev *dev, void *buf, unsigned int buf_len,
+			unsigned int block_id, unsigned int *bytes_returned);
+int hyperv_write_cfg_blk(struct pci_dev *dev, void *buf, unsigned int len,
+			 unsigned int block_id);
+int hyperv_reg_block_invalidate(struct pci_dev *dev, void *context,
+				void (*block_invalidate)(void *context,
+							 u64 block_mask));
+
+struct hyperv_pci_block_ops {
+	int (*read_block)(struct pci_dev *dev, void *buf, unsigned int buf_len,
+			  unsigned int block_id, unsigned int *bytes_returned);
+	int (*write_block)(struct pci_dev *dev, void *buf, unsigned int len,
+			   unsigned int block_id);
+	int (*reg_blk_invalidate)(struct pci_dev *dev, void *context,
+				  void (*block_invalidate)(void *context,
+							   u64 block_mask));
+};
+
+extern struct hyperv_pci_block_ops hvpci_block_ops;
+
 #endif /* _HYPERV_H */
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH net-next,v5, 3/6] net/mlx5: Add wrappers for HyperV PCIe operations
From: Haiyang Zhang @ 2019-08-22 22:26 UTC (permalink / raw)
  To: sashal@kernel.org, davem@davemloft.net, saeedm@mellanox.com,
	leon@kernel.org, eranbe@mellanox.com, lorenzo.pieralisi@arm.com,
	bhelgaas@google.com, linux-pci@vger.kernel.org,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org
  Cc: Haiyang Zhang, KY Srinivasan, Stephen Hemminger,
	linux-kernel@vger.kernel.org
In-Reply-To: <1566512708-13785-1-git-send-email-haiyangz@microsoft.com>

From: Eran Ben Elisha <eranbe@mellanox.com>

Add wrapper functions for HyperV PCIe read / write /
block_invalidate_register operations.  This will be used as an
infrastructure in the downstream patch for software communication.

This will be enabled by default if CONFIG_PCI_HYPERV_INTERFACE is set.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile |  1 +
 drivers/net/ethernet/mellanox/mlx5/core/lib/hv.c | 64 ++++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/lib/hv.h | 22 ++++++++
 3 files changed, 87 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/hv.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/hv.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index bcf3655..fd32a5b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -45,6 +45,7 @@ mlx5_core-$(CONFIG_MLX5_ESWITCH)   += eswitch.o eswitch_offloads.o eswitch_offlo
 mlx5_core-$(CONFIG_MLX5_MPFS)      += lib/mpfs.o
 mlx5_core-$(CONFIG_VXLAN)          += lib/vxlan.o
 mlx5_core-$(CONFIG_PTP_1588_CLOCK) += lib/clock.o
+mlx5_core-$(CONFIG_PCI_HYPERV_INTERFACE) += lib/hv.o
 
 #
 # Ipoib netdev
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv.c
new file mode 100644
index 0000000..fb19008
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv.c
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+// Copyright (c) 2018 Mellanox Technologies
+
+#include <linux/hyperv.h>
+#include "mlx5_core.h"
+#include "lib/hv.h"
+
+static int mlx5_hv_config_common(struct mlx5_core_dev *dev, void *buf, int len,
+				 int offset, bool read)
+{
+	int rc = -EOPNOTSUPP;
+	int bytes_returned;
+	int block_id;
+
+	if (offset % HV_CONFIG_BLOCK_SIZE_MAX || len % HV_CONFIG_BLOCK_SIZE_MAX)
+		return -EINVAL;
+
+	block_id = offset / HV_CONFIG_BLOCK_SIZE_MAX;
+
+	rc = read ?
+	     hyperv_read_cfg_blk(dev->pdev, buf,
+				 HV_CONFIG_BLOCK_SIZE_MAX, block_id,
+				 &bytes_returned) :
+	     hyperv_write_cfg_blk(dev->pdev, buf,
+				  HV_CONFIG_BLOCK_SIZE_MAX, block_id);
+
+	/* Make sure len bytes were read successfully  */
+	if (read && !rc && len != bytes_returned)
+		rc = -EIO;
+
+	if (rc) {
+		mlx5_core_err(dev, "Failed to %s hv config, err = %d, len = %d, offset = %d\n",
+			      read ? "read" : "write", rc, len,
+			      offset);
+		return rc;
+	}
+
+	return 0;
+}
+
+int mlx5_hv_read_config(struct mlx5_core_dev *dev, void *buf, int len,
+			int offset)
+{
+	return mlx5_hv_config_common(dev, buf, len, offset, true);
+}
+
+int mlx5_hv_write_config(struct mlx5_core_dev *dev, void *buf, int len,
+			 int offset)
+{
+	return mlx5_hv_config_common(dev, buf, len, offset, false);
+}
+
+int mlx5_hv_register_invalidate(struct mlx5_core_dev *dev, void *context,
+				void (*block_invalidate)(void *context,
+							 u64 block_mask))
+{
+	return hyperv_reg_block_invalidate(dev->pdev, context,
+					   block_invalidate);
+}
+
+void mlx5_hv_unregister_invalidate(struct mlx5_core_dev *dev)
+{
+	hyperv_reg_block_invalidate(dev->pdev, NULL, NULL);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv.h
new file mode 100644
index 0000000..f9a4557
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#ifndef __LIB_HV_H__
+#define __LIB_HV_H__
+
+#if IS_ENABLED(CONFIG_PCI_HYPERV_INTERFACE)
+
+#include <linux/hyperv.h>
+#include <linux/mlx5/driver.h>
+
+int mlx5_hv_read_config(struct mlx5_core_dev *dev, void *buf, int len,
+			int offset);
+int mlx5_hv_write_config(struct mlx5_core_dev *dev, void *buf, int len,
+			 int offset);
+int mlx5_hv_register_invalidate(struct mlx5_core_dev *dev, void *context,
+				void (*block_invalidate)(void *context,
+							 u64 block_mask));
+void mlx5_hv_unregister_invalidate(struct mlx5_core_dev *dev);
+#endif
+
+#endif /* __LIB_HV_H__ */
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH net-next,v5, 4/6] net/mlx5: Add HV VHCA infrastructure
From: Haiyang Zhang @ 2019-08-22 22:26 UTC (permalink / raw)
  To: sashal@kernel.org, davem@davemloft.net, saeedm@mellanox.com,
	leon@kernel.org, eranbe@mellanox.com, lorenzo.pieralisi@arm.com,
	bhelgaas@google.com, linux-pci@vger.kernel.org,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org
  Cc: Haiyang Zhang, KY Srinivasan, Stephen Hemminger,
	linux-kernel@vger.kernel.org
In-Reply-To: <1566512708-13785-1-git-send-email-haiyangz@microsoft.com>

From: Eran Ben Elisha <eranbe@mellanox.com>

HV VHCA is a layer which provides PF to VF communication channel based on
HyperV PCI config channel. It implements Mellanox's Inter VHCA control
communication protocol. The protocol contains control block in order to
pass messages between the PF and VF drivers, and data blocks in order to
pass actual data.

The infrastructure is agent based. Each agent will be responsible of
contiguous buffer blocks in the VHCA config space. This infrastructure will
bind agents to their blocks, and those agents can only access read/write
the buffer blocks assigned to them. Each agent will provide three
callbacks (control, invalidate, cleanup). Control will be invoked when
block-0 is invalidated with a command that concerns this agent. Invalidate
callback will be invoked if one of the blocks assigned to this agent was
invalidated. Cleanup will be invoked before the agent is being freed in
order to clean all of its open resources or deferred works.

Block-0 serves as the control block. All execution commands from the PF
will be written by the PF over this block. VF will ack on those by
writing on block-0 as well. Its format is described by struct
mlx5_hv_vhca_control_block layout.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   2 +-
 .../net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c  | 253 +++++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h  | 102 +++++++++
 drivers/net/ethernet/mellanox/mlx5/core/main.c     |   7 +
 include/linux/mlx5/driver.h                        |   2 +
 5 files changed, 365 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index fd32a5b..8d443fc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -45,7 +45,7 @@ mlx5_core-$(CONFIG_MLX5_ESWITCH)   += eswitch.o eswitch_offloads.o eswitch_offlo
 mlx5_core-$(CONFIG_MLX5_MPFS)      += lib/mpfs.o
 mlx5_core-$(CONFIG_VXLAN)          += lib/vxlan.o
 mlx5_core-$(CONFIG_PTP_1588_CLOCK) += lib/clock.o
-mlx5_core-$(CONFIG_PCI_HYPERV_INTERFACE) += lib/hv.o
+mlx5_core-$(CONFIG_PCI_HYPERV_INTERFACE) += lib/hv.o lib/hv_vhca.o
 
 #
 # Ipoib netdev
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c
new file mode 100644
index 0000000..84d1d75
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c
@@ -0,0 +1,253 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+// Copyright (c) 2018 Mellanox Technologies
+
+#include <linux/hyperv.h>
+#include "mlx5_core.h"
+#include "lib/hv.h"
+#include "lib/hv_vhca.h"
+
+struct mlx5_hv_vhca {
+	struct mlx5_core_dev       *dev;
+	struct workqueue_struct    *work_queue;
+	struct mlx5_hv_vhca_agent  *agents[MLX5_HV_VHCA_AGENT_MAX];
+	struct mutex                agents_lock; /* Protect agents array */
+};
+
+struct mlx5_hv_vhca_work {
+	struct work_struct     invalidate_work;
+	struct mlx5_hv_vhca   *hv_vhca;
+	u64                    block_mask;
+};
+
+struct mlx5_hv_vhca_data_block {
+	u16     sequence;
+	u16     offset;
+	u8      reserved[4];
+	u64     data[15];
+};
+
+struct mlx5_hv_vhca_agent {
+	enum mlx5_hv_vhca_agent_type	 type;
+	struct mlx5_hv_vhca		*hv_vhca;
+	void				*priv;
+	u16                              seq;
+	void (*control)(struct mlx5_hv_vhca_agent *agent,
+			struct mlx5_hv_vhca_control_block *block);
+	void (*invalidate)(struct mlx5_hv_vhca_agent *agent,
+			   u64 block_mask);
+	void (*cleanup)(struct mlx5_hv_vhca_agent *agent);
+};
+
+struct mlx5_hv_vhca *mlx5_hv_vhca_create(struct mlx5_core_dev *dev)
+{
+	struct mlx5_hv_vhca *hv_vhca = NULL;
+
+	hv_vhca = kzalloc(sizeof(*hv_vhca), GFP_KERNEL);
+	if (!hv_vhca)
+		return ERR_PTR(-ENOMEM);
+
+	hv_vhca->work_queue = create_singlethread_workqueue("mlx5_hv_vhca");
+	if (!hv_vhca->work_queue) {
+		kfree(hv_vhca);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	hv_vhca->dev = dev;
+	mutex_init(&hv_vhca->agents_lock);
+
+	return hv_vhca;
+}
+
+void mlx5_hv_vhca_destroy(struct mlx5_hv_vhca *hv_vhca)
+{
+	if (IS_ERR_OR_NULL(hv_vhca))
+		return;
+
+	destroy_workqueue(hv_vhca->work_queue);
+	kfree(hv_vhca);
+}
+
+static void mlx5_hv_vhca_invalidate_work(struct work_struct *work)
+{
+	struct mlx5_hv_vhca_work *hwork;
+	struct mlx5_hv_vhca *hv_vhca;
+	int i;
+
+	hwork = container_of(work, struct mlx5_hv_vhca_work, invalidate_work);
+	hv_vhca = hwork->hv_vhca;
+
+	mutex_lock(&hv_vhca->agents_lock);
+	for (i = 0; i < MLX5_HV_VHCA_AGENT_MAX; i++) {
+		struct mlx5_hv_vhca_agent *agent = hv_vhca->agents[i];
+
+		if (!agent || !agent->invalidate)
+			continue;
+
+		if (!(BIT(agent->type) & hwork->block_mask))
+			continue;
+
+		agent->invalidate(agent, hwork->block_mask);
+	}
+	mutex_unlock(&hv_vhca->agents_lock);
+
+	kfree(hwork);
+}
+
+void mlx5_hv_vhca_invalidate(void *context, u64 block_mask)
+{
+	struct mlx5_hv_vhca *hv_vhca = (struct mlx5_hv_vhca *)context;
+	struct mlx5_hv_vhca_work *work;
+
+	work = kzalloc(sizeof(*work), GFP_ATOMIC);
+	if (!work)
+		return;
+
+	INIT_WORK(&work->invalidate_work, mlx5_hv_vhca_invalidate_work);
+	work->hv_vhca    = hv_vhca;
+	work->block_mask = block_mask;
+
+	queue_work(hv_vhca->work_queue, &work->invalidate_work);
+}
+
+int mlx5_hv_vhca_init(struct mlx5_hv_vhca *hv_vhca)
+{
+	if (IS_ERR_OR_NULL(hv_vhca))
+		return IS_ERR_OR_NULL(hv_vhca);
+
+	return mlx5_hv_register_invalidate(hv_vhca->dev, hv_vhca,
+					   mlx5_hv_vhca_invalidate);
+}
+
+void mlx5_hv_vhca_cleanup(struct mlx5_hv_vhca *hv_vhca)
+{
+	int i;
+
+	if (IS_ERR_OR_NULL(hv_vhca))
+		return;
+
+	mutex_lock(&hv_vhca->agents_lock);
+	for (i = 0; i < MLX5_HV_VHCA_AGENT_MAX; i++)
+		WARN_ON(hv_vhca->agents[i]);
+
+	mutex_unlock(&hv_vhca->agents_lock);
+
+	mlx5_hv_unregister_invalidate(hv_vhca->dev);
+}
+
+struct mlx5_hv_vhca_agent *
+mlx5_hv_vhca_agent_create(struct mlx5_hv_vhca *hv_vhca,
+			  enum mlx5_hv_vhca_agent_type type,
+			  void (*control)(struct mlx5_hv_vhca_agent*,
+					  struct mlx5_hv_vhca_control_block *block),
+			  void (*invalidate)(struct mlx5_hv_vhca_agent*,
+					     u64 block_mask),
+			  void (*cleaup)(struct mlx5_hv_vhca_agent *agent),
+			  void *priv)
+{
+	struct mlx5_hv_vhca_agent *agent;
+
+	if (IS_ERR_OR_NULL(hv_vhca))
+		return ERR_PTR(-ENOMEM);
+
+	if (type >= MLX5_HV_VHCA_AGENT_MAX)
+		return ERR_PTR(-EINVAL);
+
+	mutex_lock(&hv_vhca->agents_lock);
+	if (hv_vhca->agents[type]) {
+		mutex_unlock(&hv_vhca->agents_lock);
+		return ERR_PTR(-EINVAL);
+	}
+	mutex_unlock(&hv_vhca->agents_lock);
+
+	agent = kzalloc(sizeof(*agent), GFP_KERNEL);
+	if (!agent)
+		return ERR_PTR(-ENOMEM);
+
+	agent->type      = type;
+	agent->hv_vhca   = hv_vhca;
+	agent->priv      = priv;
+	agent->control   = control;
+	agent->invalidate = invalidate;
+	agent->cleanup   = cleaup;
+
+	mutex_lock(&hv_vhca->agents_lock);
+	hv_vhca->agents[type] = agent;
+	mutex_unlock(&hv_vhca->agents_lock);
+
+	return agent;
+}
+
+void mlx5_hv_vhca_agent_destroy(struct mlx5_hv_vhca_agent *agent)
+{
+	struct mlx5_hv_vhca *hv_vhca = agent->hv_vhca;
+
+	mutex_lock(&hv_vhca->agents_lock);
+
+	if (WARN_ON(agent != hv_vhca->agents[agent->type])) {
+		mutex_unlock(&hv_vhca->agents_lock);
+		return;
+	}
+
+	hv_vhca->agents[agent->type] = NULL;
+	mutex_unlock(&hv_vhca->agents_lock);
+
+	if (agent->cleanup)
+		agent->cleanup(agent);
+
+	kfree(agent);
+}
+
+static int mlx5_hv_vhca_data_block_prepare(struct mlx5_hv_vhca_agent *agent,
+					   struct mlx5_hv_vhca_data_block *data_block,
+					   void *src, int len, int *offset)
+{
+	int bytes = min_t(int, (int)sizeof(data_block->data), len);
+
+	data_block->sequence = agent->seq;
+	data_block->offset   = (*offset)++;
+	memcpy(data_block->data, src, bytes);
+
+	return bytes;
+}
+
+static void mlx5_hv_vhca_agent_seq_update(struct mlx5_hv_vhca_agent *agent)
+{
+	agent->seq++;
+}
+
+int mlx5_hv_vhca_agent_write(struct mlx5_hv_vhca_agent *agent,
+			     void *buf, int len)
+{
+	int offset = agent->type * HV_CONFIG_BLOCK_SIZE_MAX;
+	int block_offset = 0;
+	int total = 0;
+	int err;
+
+	while (len) {
+		struct mlx5_hv_vhca_data_block data_block = {0};
+		int bytes;
+
+		bytes = mlx5_hv_vhca_data_block_prepare(agent, &data_block,
+							buf + total,
+							len, &block_offset);
+		if (!bytes)
+			return -ENOMEM;
+
+		err = mlx5_hv_write_config(agent->hv_vhca->dev, &data_block,
+					   sizeof(data_block), offset);
+		if (err)
+			return err;
+
+		total += bytes;
+		len   -= bytes;
+	}
+
+	mlx5_hv_vhca_agent_seq_update(agent);
+
+	return 0;
+}
+
+void *mlx5_hv_vhca_agent_priv(struct mlx5_hv_vhca_agent *agent)
+{
+	return agent->priv;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h
new file mode 100644
index 0000000..cdf1303
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h
@@ -0,0 +1,102 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#ifndef __LIB_HV_VHCA_H__
+#define __LIB_HV_VHCA_H__
+
+#include "en.h"
+#include "lib/hv.h"
+
+struct mlx5_hv_vhca_agent;
+struct mlx5_hv_vhca;
+struct mlx5_hv_vhca_control_block;
+
+enum mlx5_hv_vhca_agent_type {
+	MLX5_HV_VHCA_AGENT_MAX = 32,
+};
+
+#if IS_ENABLED(CONFIG_PCI_HYPERV_INTERFACE)
+
+struct mlx5_hv_vhca_control_block {
+	u32     capabilities;
+	u32     control;
+	u16     command;
+	u16     command_ack;
+	u16     version;
+	u16     rings;
+	u32     reserved1[28];
+};
+
+struct mlx5_hv_vhca *mlx5_hv_vhca_create(struct mlx5_core_dev *dev);
+void mlx5_hv_vhca_destroy(struct mlx5_hv_vhca *hv_vhca);
+int mlx5_hv_vhca_init(struct mlx5_hv_vhca *hv_vhca);
+void mlx5_hv_vhca_cleanup(struct mlx5_hv_vhca *hv_vhca);
+void mlx5_hv_vhca_invalidate(void *context, u64 block_mask);
+
+struct mlx5_hv_vhca_agent *
+mlx5_hv_vhca_agent_create(struct mlx5_hv_vhca *hv_vhca,
+			  enum mlx5_hv_vhca_agent_type type,
+			  void (*control)(struct mlx5_hv_vhca_agent*,
+					  struct mlx5_hv_vhca_control_block *block),
+			  void (*invalidate)(struct mlx5_hv_vhca_agent*,
+					     u64 block_mask),
+			  void (*cleanup)(struct mlx5_hv_vhca_agent *agent),
+			  void *context);
+
+void mlx5_hv_vhca_agent_destroy(struct mlx5_hv_vhca_agent *agent);
+int mlx5_hv_vhca_agent_write(struct mlx5_hv_vhca_agent *agent,
+			     void *buf, int len);
+void *mlx5_hv_vhca_agent_priv(struct mlx5_hv_vhca_agent *agent);
+
+#else
+
+static inline struct mlx5_hv_vhca *
+mlx5_hv_vhca_create(struct mlx5_core_dev *dev)
+{
+	return NULL;
+}
+
+static inline void mlx5_hv_vhca_destroy(struct mlx5_hv_vhca *hv_vhca)
+{
+}
+
+static inline int mlx5_hv_vhca_init(struct mlx5_hv_vhca *hv_vhca)
+{
+	return 0;
+}
+
+static inline void mlx5_hv_vhca_cleanup(struct mlx5_hv_vhca *hv_vhca)
+{
+}
+
+static inline void mlx5_hv_vhca_invalidate(void *context,
+					   u64 block_mask)
+{
+}
+
+static inline struct mlx5_hv_vhca_agent *
+mlx5_hv_vhca_agent_create(struct mlx5_hv_vhca *hv_vhca,
+			  enum mlx5_hv_vhca_agent_type type,
+			  void (*control)(struct mlx5_hv_vhca_agent*,
+					  struct mlx5_hv_vhca_control_block *block),
+			  void (*invalidate)(struct mlx5_hv_vhca_agent*,
+					     u64 block_mask),
+			  void (*cleanup)(struct mlx5_hv_vhca_agent *agent),
+			  void *context)
+{
+	return NULL;
+}
+
+static inline void mlx5_hv_vhca_agent_destroy(struct mlx5_hv_vhca_agent *agent)
+{
+}
+
+static inline int
+mlx5_hv_vhca_write_agent(struct mlx5_hv_vhca_agent *agent,
+			 void *buf, int len)
+{
+	return 0;
+}
+#endif
+
+#endif /* __LIB_HV_VHCA_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 0b70b1d..61388ca 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -69,6 +69,7 @@
 #include "lib/pci_vsc.h"
 #include "diag/fw_tracer.h"
 #include "ecpf.h"
+#include "lib/hv_vhca.h"
 
 MODULE_AUTHOR("Eli Cohen <eli@mellanox.com>");
 MODULE_DESCRIPTION("Mellanox 5th generation network adapters (ConnectX series) core driver");
@@ -870,6 +871,7 @@ static int mlx5_init_once(struct mlx5_core_dev *dev)
 	}
 
 	dev->tracer = mlx5_fw_tracer_create(dev);
+	dev->hv_vhca = mlx5_hv_vhca_create(dev);
 
 	return 0;
 
@@ -900,6 +902,7 @@ static int mlx5_init_once(struct mlx5_core_dev *dev)
 
 static void mlx5_cleanup_once(struct mlx5_core_dev *dev)
 {
+	mlx5_hv_vhca_destroy(dev->hv_vhca);
 	mlx5_fw_tracer_destroy(dev->tracer);
 	mlx5_fpga_cleanup(dev);
 	mlx5_eswitch_cleanup(dev->priv.eswitch);
@@ -1067,6 +1070,8 @@ static int mlx5_load(struct mlx5_core_dev *dev)
 		goto err_fw_tracer;
 	}
 
+	mlx5_hv_vhca_init(dev->hv_vhca);
+
 	err = mlx5_fpga_device_start(dev);
 	if (err) {
 		mlx5_core_err(dev, "fpga device start failed %d\n", err);
@@ -1122,6 +1127,7 @@ static int mlx5_load(struct mlx5_core_dev *dev)
 err_ipsec_start:
 	mlx5_fpga_device_stop(dev);
 err_fpga_start:
+	mlx5_hv_vhca_cleanup(dev->hv_vhca);
 	mlx5_fw_tracer_cleanup(dev->tracer);
 err_fw_tracer:
 	mlx5_eq_table_destroy(dev);
@@ -1142,6 +1148,7 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
 	mlx5_accel_ipsec_cleanup(dev);
 	mlx5_accel_tls_cleanup(dev);
 	mlx5_fpga_device_stop(dev);
+	mlx5_hv_vhca_cleanup(dev->hv_vhca);
 	mlx5_fw_tracer_cleanup(dev->tracer);
 	mlx5_eq_table_destroy(dev);
 	mlx5_irq_table_destroy(dev);
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index df23f17..13b4cf2 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -659,6 +659,7 @@ struct mlx5_clock {
 struct mlx5_fw_tracer;
 struct mlx5_vxlan;
 struct mlx5_geneve;
+struct mlx5_hv_vhca;
 
 struct mlx5_core_dev {
 	struct device *device;
@@ -706,6 +707,7 @@ struct mlx5_core_dev {
 	struct mlx5_ib_clock_info  *clock_info;
 	struct mlx5_fw_tracer   *tracer;
 	u32                      vsc_addr;
+	struct mlx5_hv_vhca	*hv_vhca;
 };
 
 struct mlx5_db {
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH net-next,v5, 5/6] net/mlx5: Add HV VHCA control agent
From: Haiyang Zhang @ 2019-08-22 22:26 UTC (permalink / raw)
  To: sashal@kernel.org, davem@davemloft.net, saeedm@mellanox.com,
	leon@kernel.org, eranbe@mellanox.com, lorenzo.pieralisi@arm.com,
	bhelgaas@google.com, linux-pci@vger.kernel.org,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org
  Cc: Haiyang Zhang, KY Srinivasan, Stephen Hemminger,
	linux-kernel@vger.kernel.org
In-Reply-To: <1566512708-13785-1-git-send-email-haiyangz@microsoft.com>

From: Eran Ben Elisha <eranbe@mellanox.com>

Control agent is responsible over of the control block (ID 0). It should
update the PF via this block about every capability change. In addition,
upon block 0 invalidate, it should activate all other supported agents
with data requests from the PF.

Upon agent create/destroy, the invalidate callback of the control agent
is being called in order to update the PF driver about this change.

The control agent is an integral part of HV VHCA and will be created
and destroy as part of the HV VHCA init/cleanup flow.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 .../net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c  | 122 ++++++++++++++++++++-
 .../net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h  |   1 +
 2 files changed, 121 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c
index 84d1d75..4047629 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c
@@ -109,22 +109,131 @@ void mlx5_hv_vhca_invalidate(void *context, u64 block_mask)
 	queue_work(hv_vhca->work_queue, &work->invalidate_work);
 }
 
+#define AGENT_MASK(type) (type ? BIT(type - 1) : 0 /* control */)
+
+static void mlx5_hv_vhca_agents_control(struct mlx5_hv_vhca *hv_vhca,
+					struct mlx5_hv_vhca_control_block *block)
+{
+	int i;
+
+	for (i = 0; i < MLX5_HV_VHCA_AGENT_MAX; i++) {
+		struct mlx5_hv_vhca_agent *agent = hv_vhca->agents[i];
+
+		if (!agent || !agent->control)
+			continue;
+
+		if (!(AGENT_MASK(agent->type) & block->control))
+			continue;
+
+		agent->control(agent, block);
+	}
+}
+
+static void mlx5_hv_vhca_capabilities(struct mlx5_hv_vhca *hv_vhca,
+				      u32 *capabilities)
+{
+	int i;
+
+	for (i = 0; i < MLX5_HV_VHCA_AGENT_MAX; i++) {
+		struct mlx5_hv_vhca_agent *agent = hv_vhca->agents[i];
+
+		if (agent)
+			*capabilities |= AGENT_MASK(agent->type);
+	}
+}
+
+static void
+mlx5_hv_vhca_control_agent_invalidate(struct mlx5_hv_vhca_agent *agent,
+				      u64 block_mask)
+{
+	struct mlx5_hv_vhca *hv_vhca = agent->hv_vhca;
+	struct mlx5_core_dev *dev = hv_vhca->dev;
+	struct mlx5_hv_vhca_control_block *block;
+	u32 capabilities = 0;
+	int err;
+
+	block = kzalloc(sizeof(*block), GFP_KERNEL);
+	if (!block)
+		return;
+
+	err = mlx5_hv_read_config(dev, block, sizeof(*block), 0);
+	if (err)
+		goto free_block;
+
+	mlx5_hv_vhca_capabilities(hv_vhca, &capabilities);
+
+	/* In case no capabilities, send empty block in return */
+	if (!capabilities) {
+		memset(block, 0, sizeof(*block));
+		goto write;
+	}
+
+	if (block->capabilities != capabilities)
+		block->capabilities = capabilities;
+
+	if (block->control & ~capabilities)
+		goto free_block;
+
+	mlx5_hv_vhca_agents_control(hv_vhca, block);
+	block->command_ack = block->command;
+
+write:
+	mlx5_hv_write_config(dev, block, sizeof(*block), 0);
+
+free_block:
+	kfree(block);
+}
+
+static struct mlx5_hv_vhca_agent *
+mlx5_hv_vhca_control_agent_create(struct mlx5_hv_vhca *hv_vhca)
+{
+	return mlx5_hv_vhca_agent_create(hv_vhca, MLX5_HV_VHCA_AGENT_CONTROL,
+					 NULL,
+					 mlx5_hv_vhca_control_agent_invalidate,
+					 NULL, NULL);
+}
+
+static void mlx5_hv_vhca_control_agent_destroy(struct mlx5_hv_vhca_agent *agent)
+{
+	mlx5_hv_vhca_agent_destroy(agent);
+}
+
 int mlx5_hv_vhca_init(struct mlx5_hv_vhca *hv_vhca)
 {
+	struct mlx5_hv_vhca_agent *agent;
+	int err;
+
 	if (IS_ERR_OR_NULL(hv_vhca))
 		return IS_ERR_OR_NULL(hv_vhca);
 
-	return mlx5_hv_register_invalidate(hv_vhca->dev, hv_vhca,
-					   mlx5_hv_vhca_invalidate);
+	err = mlx5_hv_register_invalidate(hv_vhca->dev, hv_vhca,
+					  mlx5_hv_vhca_invalidate);
+	if (err)
+		return err;
+
+	agent = mlx5_hv_vhca_control_agent_create(hv_vhca);
+	if (IS_ERR_OR_NULL(agent)) {
+		mlx5_hv_unregister_invalidate(hv_vhca->dev);
+		return IS_ERR_OR_NULL(agent);
+	}
+
+	hv_vhca->agents[MLX5_HV_VHCA_AGENT_CONTROL] = agent;
+
+	return 0;
 }
 
 void mlx5_hv_vhca_cleanup(struct mlx5_hv_vhca *hv_vhca)
 {
+	struct mlx5_hv_vhca_agent *agent;
 	int i;
 
 	if (IS_ERR_OR_NULL(hv_vhca))
 		return;
 
+	agent = hv_vhca->agents[MLX5_HV_VHCA_AGENT_CONTROL];
+	if (agent)
+		mlx5_hv_vhca_control_agent_destroy(agent);
+
 	mutex_lock(&hv_vhca->agents_lock);
 	for (i = 0; i < MLX5_HV_VHCA_AGENT_MAX; i++)
 		WARN_ON(hv_vhca->agents[i]);
@@ -134,6 +243,11 @@ void mlx5_hv_vhca_cleanup(struct mlx5_hv_vhca *hv_vhca)
 	mlx5_hv_unregister_invalidate(hv_vhca->dev);
 }
 
+static void mlx5_hv_vhca_agents_update(struct mlx5_hv_vhca *hv_vhca)
+{
+	mlx5_hv_vhca_invalidate(hv_vhca, BIT(MLX5_HV_VHCA_AGENT_CONTROL));
+}
+
 struct mlx5_hv_vhca_agent *
 mlx5_hv_vhca_agent_create(struct mlx5_hv_vhca *hv_vhca,
 			  enum mlx5_hv_vhca_agent_type type,
@@ -174,6 +288,8 @@ struct mlx5_hv_vhca_agent *
 	hv_vhca->agents[type] = agent;
 	mutex_unlock(&hv_vhca->agents_lock);
 
+	mlx5_hv_vhca_agents_update(hv_vhca);
+
 	return agent;
 }
 
@@ -195,6 +311,8 @@ void mlx5_hv_vhca_agent_destroy(struct mlx5_hv_vhca_agent *agent)
 		agent->cleanup(agent);
 
 	kfree(agent);
+
+	mlx5_hv_vhca_agents_update(hv_vhca);
 }
 
 static int mlx5_hv_vhca_data_block_prepare(struct mlx5_hv_vhca_agent *agent,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h
index cdf1303..984e7ad 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h
@@ -12,6 +12,7 @@
 struct mlx5_hv_vhca_control_block;
 
 enum mlx5_hv_vhca_agent_type {
+	MLX5_HV_VHCA_AGENT_CONTROL = 0,
 	MLX5_HV_VHCA_AGENT_MAX = 32,
 };
 
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH net-next,v5, 6/6] net/mlx5e: Add mlx5e HV VHCA stats agent
From: Haiyang Zhang @ 2019-08-22 22:26 UTC (permalink / raw)
  To: sashal@kernel.org, davem@davemloft.net, saeedm@mellanox.com,
	leon@kernel.org, eranbe@mellanox.com, lorenzo.pieralisi@arm.com,
	bhelgaas@google.com, linux-pci@vger.kernel.org,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org
  Cc: Haiyang Zhang, KY Srinivasan, Stephen Hemminger,
	linux-kernel@vger.kernel.org
In-Reply-To: <1566512708-13785-1-git-send-email-haiyangz@microsoft.com>

From: Eran Ben Elisha <eranbe@mellanox.com>

HV VHCA stats agent is responsible on running a preiodic rx/tx
packets/bytes stats update. Currently the supported format is version
MLX5_HV_VHCA_STATS_VERSION. Block ID 1 is dedicated for statistics data
transfer from the VF to the PF.

The reporter fetch the statistics data from all opened channels, fill it
in a buffer and send it to mlx5_hv_vhca_write_agent.

As the stats layer should include some metadata per block (sequence and
offset), the HV VHCA layer shall modify the buffer before actually send it
over block 1.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  13 ++
 .../ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c | 162 +++++++++++++++++++++
 .../ethernet/mellanox/mlx5/core/en/hv_vhca_stats.h |  25 ++++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   3 +
 .../net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h  |   1 +
 6 files changed, 205 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 8d443fc..f4de9cc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -36,6 +36,7 @@ mlx5_core-$(CONFIG_MLX5_CORE_EN_DCB) += en_dcbnl.o en/port_buffer.o
 mlx5_core-$(CONFIG_MLX5_ESWITCH)     += en_rep.o en_tc.o en/tc_tun.o lib/port_tun.o lag_mp.o \
 					lib/geneve.o en/tc_tun_vxlan.o en/tc_tun_gre.o \
 					en/tc_tun_geneve.o diag/en_tc_tracepoint.o
+mlx5_core-$(CONFIG_PCI_HYPERV_INTERFACE) += en/hv_vhca_stats.o
 
 #
 # Core extra
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 7316571..4467927 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -54,6 +54,7 @@
 #include "mlx5_core.h"
 #include "en_stats.h"
 #include "en/fs.h"
+#include "lib/hv_vhca.h"
 
 extern const struct net_device_ops mlx5e_netdev_ops;
 struct page_pool;
@@ -782,6 +783,15 @@ struct mlx5e_modify_sq_param {
 	int rl_index;
 };
 
+#if IS_ENABLED(CONFIG_PCI_HYPERV_INTERFACE)
+struct mlx5e_hv_vhca_stats_agent {
+	struct mlx5_hv_vhca_agent *agent;
+	struct delayed_work        work;
+	u16                        delay;
+	void                      *buf;
+};
+#endif
+
 struct mlx5e_xsk {
 	/* UMEMs are stored separately from channels, because we don't want to
 	 * lose them when channels are recreated. The kernel also stores UMEMs,
@@ -853,6 +863,9 @@ struct mlx5e_priv {
 	struct devlink_health_reporter *tx_reporter;
 	struct devlink_health_reporter *rx_reporter;
 	struct mlx5e_xsk           xsk;
+#if IS_ENABLED(CONFIG_PCI_HYPERV_INTERFACE)
+	struct mlx5e_hv_vhca_stats_agent stats_agent;
+#endif
 };
 
 struct mlx5e_profile {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
new file mode 100644
index 0000000..c37b4ac
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
@@ -0,0 +1,162 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+// Copyright (c) 2018 Mellanox Technologies
+
+#include "en.h"
+#include "en/hv_vhca_stats.h"
+#include "lib/hv_vhca.h"
+#include "lib/hv.h"
+
+struct mlx5e_hv_vhca_per_ring_stats {
+	u64     rx_packets;
+	u64     rx_bytes;
+	u64     tx_packets;
+	u64     tx_bytes;
+};
+
+static void
+mlx5e_hv_vhca_fill_ring_stats(struct mlx5e_priv *priv, int ch,
+			      struct mlx5e_hv_vhca_per_ring_stats *data)
+{
+	struct mlx5e_channel_stats *stats;
+	int tc;
+
+	stats = &priv->channel_stats[ch];
+	data->rx_packets = stats->rq.packets;
+	data->rx_bytes   = stats->rq.bytes;
+
+	for (tc = 0; tc < priv->max_opened_tc; tc++) {
+		data->tx_packets += stats->sq[tc].packets;
+		data->tx_bytes   += stats->sq[tc].bytes;
+	}
+}
+
+static void mlx5e_hv_vhca_fill_stats(struct mlx5e_priv *priv, u64 *data,
+				     int buf_len)
+{
+	int ch, i = 0;
+
+	for (ch = 0; ch < priv->max_nch; ch++) {
+		u64 *buf = data + i;
+
+		if (WARN_ON_ONCE(buf +
+				 sizeof(struct mlx5e_hv_vhca_per_ring_stats) >
+				 data + buf_len))
+			return;
+
+		mlx5e_hv_vhca_fill_ring_stats(priv, ch,
+					      (struct mlx5e_hv_vhca_per_ring_stats *)buf);
+		i += sizeof(struct mlx5e_hv_vhca_per_ring_stats) / sizeof(u64);
+	}
+}
+
+static int mlx5e_hv_vhca_stats_buf_size(struct mlx5e_priv *priv)
+{
+	return (sizeof(struct mlx5e_hv_vhca_per_ring_stats) *
+		priv->max_nch);
+}
+
+static void mlx5e_hv_vhca_stats_work(struct work_struct *work)
+{
+	struct mlx5e_hv_vhca_stats_agent *sagent;
+	struct mlx5_hv_vhca_agent *agent;
+	struct delayed_work *dwork;
+	struct mlx5e_priv *priv;
+	int buf_len, rc;
+	void *buf;
+
+	dwork = to_delayed_work(work);
+	sagent = container_of(dwork, struct mlx5e_hv_vhca_stats_agent, work);
+	priv = container_of(sagent, struct mlx5e_priv, stats_agent);
+	buf_len = mlx5e_hv_vhca_stats_buf_size(priv);
+	agent = sagent->agent;
+	buf = sagent->buf;
+
+	memset(buf, 0, buf_len);
+	mlx5e_hv_vhca_fill_stats(priv, buf, buf_len);
+
+	rc = mlx5_hv_vhca_agent_write(agent, buf, buf_len);
+	if (rc) {
+		mlx5_core_err(priv->mdev,
+			      "%s: Failed to write stats, err = %d\n",
+			      __func__, rc);
+		return;
+	}
+
+	if (sagent->delay)
+		queue_delayed_work(priv->wq, &sagent->work, sagent->delay);
+}
+
+enum {
+	MLX5_HV_VHCA_STATS_VERSION     = 1,
+	MLX5_HV_VHCA_STATS_UPDATE_ONCE = 0xFFFF,
+};
+
+static void mlx5e_hv_vhca_stats_control(struct mlx5_hv_vhca_agent *agent,
+					struct mlx5_hv_vhca_control_block *block)
+{
+	struct mlx5e_hv_vhca_stats_agent *sagent;
+	struct mlx5e_priv *priv;
+
+	priv = mlx5_hv_vhca_agent_priv(agent);
+	sagent = &priv->stats_agent;
+
+	block->version = MLX5_HV_VHCA_STATS_VERSION;
+	block->rings   = priv->max_nch;
+
+	if (!block->command) {
+		cancel_delayed_work_sync(&priv->stats_agent.work);
+		return;
+	}
+
+	sagent->delay = block->command == MLX5_HV_VHCA_STATS_UPDATE_ONCE ? 0 :
+			msecs_to_jiffies(block->command * 100);
+
+	queue_delayed_work(priv->wq, &sagent->work, sagent->delay);
+}
+
+static void mlx5e_hv_vhca_stats_cleanup(struct mlx5_hv_vhca_agent *agent)
+{
+	struct mlx5e_priv *priv = mlx5_hv_vhca_agent_priv(agent);
+
+	cancel_delayed_work_sync(&priv->stats_agent.work);
+}
+
+int mlx5e_hv_vhca_stats_create(struct mlx5e_priv *priv)
+{
+	int buf_len = mlx5e_hv_vhca_stats_buf_size(priv);
+	struct mlx5_hv_vhca_agent *agent;
+
+	priv->stats_agent.buf = kvzalloc(buf_len, GFP_KERNEL);
+	if (!priv->stats_agent.buf)
+		return -ENOMEM;
+
+	agent = mlx5_hv_vhca_agent_create(priv->mdev->hv_vhca,
+					  MLX5_HV_VHCA_AGENT_STATS,
+					  mlx5e_hv_vhca_stats_control, NULL,
+					  mlx5e_hv_vhca_stats_cleanup,
+					  priv);
+
+	if (IS_ERR_OR_NULL(agent)) {
+		if (IS_ERR(agent))
+			netdev_warn(priv->netdev,
+				    "Failed to create hv vhca stats agent, err = %ld\n",
+				    PTR_ERR(agent));
+
+		kfree(priv->stats_agent.buf);
+		return IS_ERR_OR_NULL(agent);
+	}
+
+	priv->stats_agent.agent = agent;
+	INIT_DELAYED_WORK(&priv->stats_agent.work, mlx5e_hv_vhca_stats_work);
+
+	return 0;
+}
+
+void mlx5e_hv_vhca_stats_destroy(struct mlx5e_priv *priv)
+{
+	if (IS_ERR_OR_NULL(priv->stats_agent.agent))
+		return;
+
+	mlx5_hv_vhca_agent_destroy(priv->stats_agent.agent);
+	kfree(priv->stats_agent.buf);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.h
new file mode 100644
index 0000000..664463f
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#ifndef __MLX5_EN_STATS_VHCA_H__
+#define __MLX5_EN_STATS_VHCA_H__
+#include "en.h"
+
+#if IS_ENABLED(CONFIG_PCI_HYPERV_INTERFACE)
+
+int mlx5e_hv_vhca_stats_create(struct mlx5e_priv *priv);
+void mlx5e_hv_vhca_stats_destroy(struct mlx5e_priv *priv);
+
+#else
+
+static inline int mlx5e_hv_vhca_stats_create(struct mlx5e_priv *priv)
+{
+	return 0;
+}
+
+static inline void mlx5e_hv_vhca_stats_destroy(struct mlx5e_priv *priv)
+{
+}
+#endif
+
+#endif /* __MLX5_EN_STATS_VHCA_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 7fdea64..fa4bf2d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -62,6 +62,7 @@
 #include "en/xsk/setup.h"
 #include "en/xsk/rx.h"
 #include "en/xsk/tx.h"
+#include "en/hv_vhca_stats.h"
 
 
 bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
@@ -5109,6 +5110,7 @@ static void mlx5e_nic_enable(struct mlx5e_priv *priv)
 	if (mlx5e_monitor_counter_supported(priv))
 		mlx5e_monitor_counter_init(priv);
 
+	mlx5e_hv_vhca_stats_create(priv);
 	if (netdev->reg_state != NETREG_REGISTERED)
 		return;
 #ifdef CONFIG_MLX5_CORE_EN_DCB
@@ -5141,6 +5143,7 @@ static void mlx5e_nic_disable(struct mlx5e_priv *priv)
 
 	queue_work(priv->wq, &priv->set_rx_mode_work);
 
+	mlx5e_hv_vhca_stats_destroy(priv);
 	if (mlx5e_monitor_counter_supported(priv))
 		mlx5e_monitor_counter_cleanup(priv);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h
index 984e7ad..4bad6a5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h
@@ -13,6 +13,7 @@
 
 enum mlx5_hv_vhca_agent_type {
 	MLX5_HV_VHCA_AGENT_CONTROL = 0,
+	MLX5_HV_VHCA_AGENT_STATS   = 1,
 	MLX5_HV_VHCA_AGENT_MAX = 32,
 };
 
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH net-next,v5, 0/6] Add software backchannel and mlx5e HV VHCA stats
From: Haiyang Zhang @ 2019-08-22 22:26 UTC (permalink / raw)
  To: sashal@kernel.org, davem@davemloft.net, saeedm@mellanox.com,
	leon@kernel.org, eranbe@mellanox.com, lorenzo.pieralisi@arm.com,
	bhelgaas@google.com, linux-pci@vger.kernel.org,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org
  Cc: Haiyang Zhang, KY Srinivasan, Stephen Hemminger,
	linux-kernel@vger.kernel.org

This patch set adds paravirtual backchannel in software in pci_hyperv,
which is required by the mlx5e driver HV VHCA stats agent.

The stats agent is responsible on running a periodic rx/tx packets/bytes
stats update.

Dexuan Cui (1):
  PCI: hv: Add a paravirtual backchannel in software

Eran Ben Elisha (4):
  net/mlx5: Add wrappers for HyperV PCIe operations
  net/mlx5: Add HV VHCA infrastructure
  net/mlx5: Add HV VHCA control agent
  net/mlx5e: Add mlx5e HV VHCA stats agent

Haiyang Zhang (1):
  PCI: hv: Add a Hyper-V PCI interface driver for software backchannel interface

 MAINTAINERS                                        |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  13 +
 .../ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c | 162 +++++++++
 .../ethernet/mellanox/mlx5/core/en/hv_vhca_stats.h |  25 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   3 +
 drivers/net/ethernet/mellanox/mlx5/core/lib/hv.c   |  64 ++++
 drivers/net/ethernet/mellanox/mlx5/core/lib/hv.h   |  22 ++
 .../net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c  | 371 +++++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h  | 104 ++++++
 drivers/net/ethernet/mellanox/mlx5/core/main.c     |   7 +
 drivers/pci/Kconfig                                |   1 +
 drivers/pci/controller/Kconfig                     |   7 +
 drivers/pci/controller/Makefile                    |   1 +
 drivers/pci/controller/pci-hyperv-intf.c           |  67 ++++
 drivers/pci/controller/pci-hyperv.c                | 308 +++++++++++++++++
 include/linux/hyperv.h                             |  29 ++
 include/linux/mlx5/driver.h                        |   2 +
 18 files changed, 1189 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/hv.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/hv.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h
 create mode 100644 drivers/pci/controller/pci-hyperv-intf.c

-- 
1.8.3.1


^ permalink raw reply

* RE: [Patch v2] storvsc: setup 1:1 mapping between hardware queue and CPU queue
From: Long Li @ 2019-08-22 22:21 UTC (permalink / raw)
  To: Michael Kelley, longli@linuxonhyperv.com, KY Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Sasha Levin,
	James E.J. Bottomley, Martin K. Petersen,
	linux-hyperv@vger.kernel.org, linux-scsi@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <DM5PR21MB01372B7A717EAC7F0BCF826AD7A50@DM5PR21MB0137.namprd21.prod.outlook.com>

>>>Subject: RE: [Patch v2] storvsc: setup 1:1 mapping between hardware
>>>queue and CPU queue
>>>
>>>From: Long Li <longli@linuxonhyperv.com> Sent: Thursday, August 22, 2019
>>>1:42 PM
>>>>
>>>> storvsc doesn't use a dedicated hardware queue for a given CPU queue.
>>>> When issuing I/O, it selects returning CPU (hardware queue)
>>>> dynamically based on vmbus channel usage across all channels.
>>>>
>>>> This patch advertises num_possible_cpus() as number of hardware
>>>> queues. This will have upper layer setup 1:1 mapping between hardware
>>>> queue and CPU queue and avoid unnecessary locking when issuing I/O.
>>>>
>>>> Changes:
>>>> v2: rely on default upper layer function to map queues. (suggested by
>>>> Ming Lei
>>>> <tom.leiming@gmail.com>)
>>>>
>>>> Signed-off-by: Long Li <longli@microsoft.com>
>>>> ---
>>>>  drivers/scsi/storvsc_drv.c | 3 +--
>>>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
>>>> index b89269120a2d..dfd3b76a4f89 100644
>>>> --- a/drivers/scsi/storvsc_drv.c
>>>> +++ b/drivers/scsi/storvsc_drv.c
>>>> @@ -1836,8 +1836,7 @@ static int storvsc_probe(struct hv_device
>>>*device,
>>>>  	/*
>>>>  	 * Set the number of HW queues we are supporting.
>>>>  	 */
>>>> -	if (stor_device->num_sc != 0)
>>>> -		host->nr_hw_queues = stor_device->num_sc + 1;
>>>> +	host->nr_hw_queues = num_possible_cpus();
>>>
>>>For a lot of the VM sizes in Azure, num_possible_cpus() is 128, even if the
>>>VM has only 4 or 8 or some other smaller number of vCPUs.
>>>So I'm wondering if you really want num_present_cpus() here instead,
>>>which would include only the vCPUs that actually exist in the VM.

I think reporting num_possible_cpus() doesn't do more harm or take more resources. Because block layer allocates map for all the possible CPUs.

The actual mapping is done in blk_mq_map_queues(), and it iterates all the possible CPUs. If we report num_present_cpus(), the rest of the CPUs also need to be mapped.

>>>
>>>Michael
>>>
>>>>
>>>>  	/*
>>>>  	 * Set the error handler work queue.
>>>> --
>>>> 2.17.1


^ permalink raw reply

* RE: [Patch v2] storvsc: setup 1:1 mapping between hardware queue and CPU queue
From: Michael Kelley @ 2019-08-22 21:01 UTC (permalink / raw)
  To: longli@linuxonhyperv.com, KY Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, James E.J. Bottomley,
	Martin K. Petersen, linux-hyperv@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org
  Cc: Long Li
In-Reply-To: <1566506543-1090-1-git-send-email-longli@linuxonhyperv.com>

From: Long Li <longli@linuxonhyperv.com> Sent: Thursday, August 22, 2019 1:42 PM
> 
> storvsc doesn't use a dedicated hardware queue for a given CPU queue. When
> issuing I/O, it selects returning CPU (hardware queue) dynamically based on
> vmbus channel usage across all channels.
> 
> This patch advertises num_possible_cpus() as number of hardware queues. This
> will have upper layer setup 1:1 mapping between hardware queue and CPU queue
> and avoid unnecessary locking when issuing I/O.
> 
> Changes:
> v2: rely on default upper layer function to map queues. (suggested by Ming Lei
> <tom.leiming@gmail.com>)
> 
> Signed-off-by: Long Li <longli@microsoft.com>
> ---
>  drivers/scsi/storvsc_drv.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index b89269120a2d..dfd3b76a4f89 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -1836,8 +1836,7 @@ static int storvsc_probe(struct hv_device *device,
>  	/*
>  	 * Set the number of HW queues we are supporting.
>  	 */
> -	if (stor_device->num_sc != 0)
> -		host->nr_hw_queues = stor_device->num_sc + 1;
> +	host->nr_hw_queues = num_possible_cpus();

For a lot of the VM sizes in Azure, num_possible_cpus() is 128, even if
the VM has only 4 or 8 or some other smaller number of vCPUs.
So I'm wondering if you really want num_present_cpus() here instead,
which would include only the vCPUs that actually exist in the VM.

Michael

> 
>  	/*
>  	 * Set the error handler work queue.
> --
> 2.17.1


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox