Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* Re: [EXTERNAL] Re: [PATCH V7 net] net: mana: Fix MANA VF unload when hardware is
From: Souradeep Chakrabarti @ 2023-08-02  5:37 UTC (permalink / raw)
  To: Kalesh Anakkur Purayil
  Cc: Souradeep Chakrabarti, Simon Horman, KY Srinivasan, Haiyang Zhang,
	wei.liu@kernel.org, Dexuan Cui, davem@davemloft.net,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com, Long Li,
	Ajay Sharma, leon@kernel.org, cai.huoqing@linux.dev,
	ssengar@linux.microsoft.com, vkuznets, tglx@linutronix.de,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
	stable@vger.kernel.org
In-Reply-To: <CAH-L+nPsuoJfCQcJnpMWk5DPGev8f+YWi0K4V+fU=5-bxP5GVw@mail.gmail.com>

On Wed, Aug 02, 2023 at 10:57:52AM +0530, Kalesh Anakkur Purayil wrote:
> Hi Souradeep,
> 
> It looks like the subject line is not complete. I could see "net: mana: Fix
> MANA VF unload when hardware is".
> 
> Is that correct?
> 
> Regards,
> Kalesh
>
Yes, it got truncated. Will fix it in next version. 
> On Wed, Aug 2, 2023 at 12:29 AM Souradeep Chakrabarti <
> schakrabarti@microsoft.com> wrote:
> 
> >
> >
> > >-----Original Message-----
> > >From: Simon Horman <horms@kernel.org>
> > >Sent: Tuesday, August 1, 2023 9:01 PM
> > >To: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> > >Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> > ><haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> > ><decui@microsoft.com>; davem@davemloft.net; edumazet@google.com;
> > >kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay
> > >Sharma <sharmaajay@microsoft.com>; leon@kernel.org;
> > >cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets
> > ><vkuznets@redhat.com>; tglx@linutronix.de; linux-hyperv@vger.kernel.org;
> > >netdev@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> > >rdma@vger.kernel.org; Souradeep Chakrabarti
> > ><schakrabarti@microsoft.com>; stable@vger.kernel.org
> > >Subject: [EXTERNAL] Re: [PATCH V7 net] net: mana: Fix MANA VF unload when
> > >hardware is
> > >
> > >On Tue, Aug 01, 2023 at 05:29:13AM -0700, Souradeep Chakrabarti wrote:
> > >
> > >...
> > >
> > >Hi Souradeep,
> > >
> > >
> > >> +    for (i = 0; i < apc->num_queues; i++) {
> > >> +            txq = &apc->tx_qp[i].txq;
> > >> +            while (skb = skb_dequeue(&txq->pending_skbs)) {
> > >
> > >W=1 builds with both clang-16 and gcc-12 complain that they would like an
> > >extra set of parentheses around an assignment used as a truth value.
> > Thanks for letting me know. I will fix it in next version.
> > >
> > >> +                    mana_unmap_skb(skb, apc);
> > >> +                    dev_consume_skb_any(skb);
> > >> +            }
> > >> +            atomic_set(&txq->pending_sends, 0);
> > >> +    }
> > >>      /* We're 100% sure the queues can no longer be woken up, because
> > >>       * we're sure now mana_poll_tx_cq() can't be running.
> > >>       */
> > >> --
> > >> 2.34.1
> > >>
> > >>
> >
> >
> 
> -- 
> Regards,
> Kalesh A P



^ permalink raw reply

* [PATCH] hv_netvsc: fix netvsc_send_completion to avoid multiple message length checks
From: Sonia Sharma @ 2023-08-02  5:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-hyperv, sosha, kys, mikelley, haiyangz, wei.liu, decui,
	longli, davem, edumazet, kuba, pabeni

From: Sonia Sharma <sonia.sharma@linux.microsoft.com>

switch statement in netvsc_send_completion() is incorrectly validating
the length of incoming network packets by falling through next case.
Avoid the fallthrough, instead break after a case match and then process
the complete() call.

Signed-off-by: Sonia Sharma <sonia.sharma@linux.microsoft.com>
---
 drivers/net/hyperv/netvsc.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 82e9796c8f5e..347688dd2eb9 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -851,7 +851,7 @@ static void netvsc_send_completion(struct net_device *ndev,
 				   msglen);
 			return;
 		}
-		fallthrough;
+		break;
 
 	case NVSP_MSG1_TYPE_SEND_RECV_BUF_COMPLETE:
 		if (msglen < sizeof(struct nvsp_message_header) +
@@ -860,7 +860,7 @@ static void netvsc_send_completion(struct net_device *ndev,
 				   msglen);
 			return;
 		}
-		fallthrough;
+		break;
 
 	case NVSP_MSG1_TYPE_SEND_SEND_BUF_COMPLETE:
 		if (msglen < sizeof(struct nvsp_message_header) +
@@ -869,7 +869,7 @@ static void netvsc_send_completion(struct net_device *ndev,
 				   msglen);
 			return;
 		}
-		fallthrough;
+		break;
 
 	case NVSP_MSG5_TYPE_SUBCHANNEL:
 		if (msglen < sizeof(struct nvsp_message_header) +
@@ -878,10 +878,6 @@ static void netvsc_send_completion(struct net_device *ndev,
 				   msglen);
 			return;
 		}
-		/* Copy the response back */
-		memcpy(&net_device->channel_init_pkt, nvsp_packet,
-		       sizeof(struct nvsp_message));
-		complete(&net_device->channel_init_wait);
 		break;
 
 	case NVSP_MSG1_TYPE_SEND_RNDIS_PKT_COMPLETE:
@@ -904,13 +900,18 @@ static void netvsc_send_completion(struct net_device *ndev,
 
 		netvsc_send_tx_complete(ndev, net_device, incoming_channel,
 					desc, budget);
-		break;
+		return;
 
 	default:
 		netdev_err(ndev,
 			   "Unknown send completion type %d received!!\n",
 			   nvsp_packet->hdr.msg_type);
 	}
+
+	/* Copy the response back */
+	memcpy(&net_device->channel_init_pkt, nvsp_packet,
+			sizeof(struct nvsp_message));
+	complete(&net_device->channel_init_wait);
 }
 
 static u32 netvsc_get_next_send_section(struct netvsc_device *net_device)
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH V7 net] net: mana: Fix MANA VF unload when hardware is
From: kernel test robot @ 2023-08-02  7:50 UTC (permalink / raw)
  To: Souradeep Chakrabarti, kys, haiyangz, wei.liu, decui, davem,
	edumazet, kuba, pabeni, longli, sharmaajay, leon, cai.huoqing,
	ssengar, vkuznets, tglx, linux-hyperv, netdev, linux-kernel,
	linux-rdma
  Cc: oe-kbuild-all, schakrabarti, Souradeep Chakrabarti, stable
In-Reply-To: <1690892953-25201-1-git-send-email-schakrabarti@linux.microsoft.com>

Hi Souradeep,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Souradeep-Chakrabarti/net-mana-Fix-MANA-VF-unload-when-hardware-is/20230801-203141
base:   net/main
patch link:    https://lore.kernel.org/r/1690892953-25201-1-git-send-email-schakrabarti%40linux.microsoft.com
patch subject: [PATCH V7 net] net: mana: Fix MANA VF unload when hardware is
config: x86_64-allyesconfig (https://download.01.org/0day-ci/archive/20230802/202308021532.8iYkExDh-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce: (https://download.01.org/0day-ci/archive/20230802/202308021532.8iYkExDh-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308021532.8iYkExDh-lkp@intel.com/

All warnings (new ones prefixed by >>):

   drivers/net/ethernet/microsoft/mana/mana_en.c: In function 'mana_dealloc_queues':
>> drivers/net/ethernet/microsoft/mana/mana_en.c:2398:24: warning: suggest parentheses around assignment used as truth value [-Wparentheses]
    2398 |                 while (skb = skb_dequeue(&txq->pending_skbs)) {
         |                        ^~~


vim +2398 drivers/net/ethernet/microsoft/mana/mana_en.c

  2345	
  2346	static int mana_dealloc_queues(struct net_device *ndev)
  2347	{
  2348		struct mana_port_context *apc = netdev_priv(ndev);
  2349		unsigned long timeout = jiffies + 120 * HZ;
  2350		struct gdma_dev *gd = apc->ac->gdma_dev;
  2351		struct mana_txq *txq;
  2352		struct sk_buff *skb;
  2353		int i, err;
  2354		u32 tsleep;
  2355	
  2356		if (apc->port_is_up)
  2357			return -EINVAL;
  2358	
  2359		mana_chn_setxdp(apc, NULL);
  2360	
  2361		if (gd->gdma_context->is_pf)
  2362			mana_pf_deregister_filter(apc);
  2363	
  2364		/* No packet can be transmitted now since apc->port_is_up is false.
  2365		 * There is still a tiny chance that mana_poll_tx_cq() can re-enable
  2366		 * a txq because it may not timely see apc->port_is_up being cleared
  2367		 * to false, but it doesn't matter since mana_start_xmit() drops any
  2368		 * new packets due to apc->port_is_up being false.
  2369		 *
  2370		 * Drain all the in-flight TX packets.
  2371		 * A timeout of 120 seconds for all the queues is used.
  2372		 * This will break the while loop when h/w is not responding.
  2373		 * This value of 120 has been decided here considering max
  2374		 * number of queues.
  2375		 */
  2376	
  2377		for (i = 0; i < apc->num_queues; i++) {
  2378			txq = &apc->tx_qp[i].txq;
  2379			tsleep = 1000;
  2380			while (atomic_read(&txq->pending_sends) > 0 &&
  2381			       time_before(jiffies, timeout)) {
  2382				usleep_range(tsleep, tsleep + 1000);
  2383				tsleep <<= 1;
  2384			}
  2385			if (atomic_read(&txq->pending_sends)) {
  2386				err = pcie_flr(to_pci_dev(gd->gdma_context->dev));
  2387				if (err) {
  2388					netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
  2389						   err, atomic_read(&txq->pending_sends),
  2390						   txq->gdma_txq_id);
  2391				}
  2392				break;
  2393			}
  2394		}
  2395	
  2396		for (i = 0; i < apc->num_queues; i++) {
  2397			txq = &apc->tx_qp[i].txq;
> 2398			while (skb = skb_dequeue(&txq->pending_skbs)) {
  2399				mana_unmap_skb(skb, apc);
  2400				dev_consume_skb_any(skb);
  2401			}
  2402			atomic_set(&txq->pending_sends, 0);
  2403		}
  2404		/* We're 100% sure the queues can no longer be woken up, because
  2405		 * we're sure now mana_poll_tx_cq() can't be running.
  2406		 */
  2407	
  2408		apc->rss_state = TRI_STATE_FALSE;
  2409		err = mana_config_rss(apc, TRI_STATE_FALSE, false, false);
  2410		if (err) {
  2411			netdev_err(ndev, "Failed to disable vPort: %d\n", err);
  2412			return err;
  2413		}
  2414	
  2415		mana_destroy_vport(apc);
  2416	
  2417		return 0;
  2418	}
  2419	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* [PATCH V6 net-next] net: mana: Configure hwc timeout from hardware
From: Souradeep Chakrabarti @ 2023-08-02 11:07 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
	longli, sharmaajay, leon, cai.huoqing, ssengar, vkuznets, tglx,
	linux-hyperv, netdev, linux-kernel, linux-rdma
  Cc: schakrabarti, Souradeep Chakrabarti

At present hwc timeout value is a fixed value. This patch sets the hwc
timeout from the hardware. It now uses a new hardware capability
GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG to query and set the value
in hwc_timeout.

Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
---
V5 -> V6:
* used msecs_to_jiffies() in wait_for_completion_timeout.
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 30 ++++++++++++++++++-
 .../net/ethernet/microsoft/mana/hw_channel.c  | 24 ++++++++++++++-
 include/net/mana/gdma.h                       | 20 ++++++++++++-
 include/net/mana/hw_channel.h                 |  5 ++++
 4 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 8f3f78b68592..2e17ee3acfda 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -106,6 +106,25 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
 	return 0;
 }
 
+static int mana_gd_query_hwc_timeout(struct pci_dev *pdev, u32 *timeout_val)
+{
+	struct gdma_context *gc = pci_get_drvdata(pdev);
+	struct gdma_query_hwc_timeout_resp resp = {};
+	struct gdma_query_hwc_timeout_req req = {};
+	int err;
+
+	mana_gd_init_req_hdr(&req.hdr, GDMA_QUERY_HWC_TIMEOUT,
+			     sizeof(req), sizeof(resp));
+	req.timeout_ms = *timeout_val;
+	err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
+	if (err || resp.hdr.status)
+		return err ? err : -EPROTO;
+
+	*timeout_val = resp.timeout_ms;
+
+	return 0;
+}
+
 static int mana_gd_detect_devices(struct pci_dev *pdev)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
@@ -879,8 +898,10 @@ int mana_gd_verify_vf_version(struct pci_dev *pdev)
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 	struct gdma_verify_ver_resp resp = {};
 	struct gdma_verify_ver_req req = {};
+	struct hw_channel_context *hwc;
 	int err;
 
+	hwc = gc->hwc.driver_data;
 	mana_gd_init_req_hdr(&req.hdr, GDMA_VERIFY_VF_DRIVER_VERSION,
 			     sizeof(req), sizeof(resp));
 
@@ -907,7 +928,14 @@ int mana_gd_verify_vf_version(struct pci_dev *pdev)
 			err, resp.hdr.status);
 		return err ? err : -EPROTO;
 	}
-
+	if (resp.pf_cap_flags1 & GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG) {
+		err = mana_gd_query_hwc_timeout(pdev, &hwc->hwc_timeout);
+		if (err) {
+			dev_err(gc->dev, "Failed to set the hwc timeout %d\n", err);
+			return err;
+		}
+		dev_dbg(gc->dev, "set the hwc timeout to %u\n", hwc->hwc_timeout);
+	}
 	return 0;
 }
 
diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c
index 2bd1d74021f7..9d1cd3bfcf66 100644
--- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
+++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
@@ -174,7 +174,25 @@ static void mana_hwc_init_event_handler(void *ctx, struct gdma_queue *q_self,
 		complete(&hwc->hwc_init_eqe_comp);
 		break;
 
+	case GDMA_EQE_HWC_SOC_RECONFIG_DATA:
+		type_data.as_uint32 = event->details[0];
+		type = type_data.type;
+		val = type_data.value;
+
+		switch (type) {
+		case HWC_DATA_CFG_HWC_TIMEOUT:
+			hwc->hwc_timeout = val;
+			break;
+
+		default:
+			dev_warn(hwc->dev, "Received unknown reconfig type %u\n", type);
+			break;
+		}
+
+		break;
+
 	default:
+		dev_warn(hwc->dev, "Received unknown gdma event %u\n", event->type);
 		/* Ignore unknown events, which should never happen. */
 		break;
 	}
@@ -696,6 +714,7 @@ int mana_hwc_create_channel(struct gdma_context *gc)
 	gd->driver_data = hwc;
 	hwc->gdma_dev = gd;
 	hwc->dev = gc->dev;
+	hwc->hwc_timeout = HW_CHANNEL_WAIT_RESOURCE_TIMEOUT_MS;
 
 	/* HWC's instance number is always 0. */
 	gd->dev_id.as_uint32 = 0;
@@ -770,6 +789,8 @@ void mana_hwc_destroy_channel(struct gdma_context *gc)
 	hwc->gdma_dev->doorbell = INVALID_DOORBELL;
 	hwc->gdma_dev->pdid = INVALID_PDID;
 
+	hwc->hwc_timeout = 0;
+
 	kfree(hwc);
 	gc->hwc.driver_data = NULL;
 	gc->hwc.gdma_context = NULL;
@@ -825,7 +846,8 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
 		goto out;
 	}
 
-	if (!wait_for_completion_timeout(&ctx->comp_event, 30 * HZ)) {
+	if (!wait_for_completion_timeout(&ctx->comp_event,
+					 (msecs_to_jiffies(hwc->hwc_timeout) * HZ))) {
 		dev_err(hwc->dev, "HWC: Request timed out!\n");
 		err = -ETIMEDOUT;
 		goto out;
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 96c120160f15..88b6ef7ce1a6 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -33,6 +33,7 @@ enum gdma_request_type {
 	GDMA_DESTROY_PD			= 30,
 	GDMA_CREATE_MR			= 31,
 	GDMA_DESTROY_MR			= 32,
+	GDMA_QUERY_HWC_TIMEOUT		= 84, /* 0x54 */
 };
 
 #define GDMA_RESOURCE_DOORBELL_PAGE	27
@@ -57,6 +58,8 @@ enum gdma_eqe_type {
 	GDMA_EQE_HWC_INIT_EQ_ID_DB	= 129,
 	GDMA_EQE_HWC_INIT_DATA		= 130,
 	GDMA_EQE_HWC_INIT_DONE		= 131,
+	GDMA_EQE_HWC_SOC_RECONFIG	= 132,
+	GDMA_EQE_HWC_SOC_RECONFIG_DATA	= 133,
 };
 
 enum {
@@ -531,10 +534,12 @@ enum {
  * so the driver is able to reliably support features like busy_poll.
  */
 #define GDMA_DRV_CAP_FLAG_1_NAPI_WKDONE_FIX BIT(2)
+#define GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG BIT(3)
 
 #define GDMA_DRV_CAP_FLAGS1 \
 	(GDMA_DRV_CAP_FLAG_1_EQ_SHARING_MULTI_VPORT | \
-	 GDMA_DRV_CAP_FLAG_1_NAPI_WKDONE_FIX)
+	 GDMA_DRV_CAP_FLAG_1_NAPI_WKDONE_FIX | \
+	 GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG)
 
 #define GDMA_DRV_CAP_FLAGS2 0
 
@@ -664,6 +669,19 @@ struct gdma_disable_queue_req {
 	u32 alloc_res_id_on_creation;
 }; /* HW DATA */
 
+/* GDMA_QUERY_HWC_TIMEOUT */
+struct gdma_query_hwc_timeout_req {
+	struct gdma_req_hdr hdr;
+	u32 timeout_ms;
+	u32 reserved;
+};
+
+struct gdma_query_hwc_timeout_resp {
+	struct gdma_resp_hdr hdr;
+	u32 timeout_ms;
+	u32 reserved;
+};
+
 enum atb_page_size {
 	ATB_PAGE_SIZE_4K,
 	ATB_PAGE_SIZE_8K,
diff --git a/include/net/mana/hw_channel.h b/include/net/mana/hw_channel.h
index 6a757a6e2732..3d3b5c881bc1 100644
--- a/include/net/mana/hw_channel.h
+++ b/include/net/mana/hw_channel.h
@@ -23,6 +23,10 @@
 #define HWC_INIT_DATA_PF_DEST_RQ_ID	10
 #define HWC_INIT_DATA_PF_DEST_CQ_ID	11
 
+#define HWC_DATA_CFG_HWC_TIMEOUT 1
+
+#define HW_CHANNEL_WAIT_RESOURCE_TIMEOUT_MS 30000
+
 /* Structures labeled with "HW DATA" are exchanged with the hardware. All of
  * them are naturally aligned and hence don't need __packed.
  */
@@ -182,6 +186,7 @@ struct hw_channel_context {
 
 	u32 pf_dest_vrq_id;
 	u32 pf_dest_vrcq_id;
+	u32 hwc_timeout;
 
 	struct hwc_caller_ctx *caller_ctx;
 };
-- 
2.34.1


^ permalink raw reply related

* Re: [EXTERNAL] Re: [PATCH v4 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Jason Gunthorpe @ 2023-08-02 13:08 UTC (permalink / raw)
  To: Ajay Sharma
  Cc: Long Li, Wei Hu, netdev@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-rdma@vger.kernel.org,
	leon@kernel.org, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, vkuznets,
	ssengar@linux.microsoft.com, shradhagupta@linux.microsoft.com
In-Reply-To: <F17A4152-0715-4E73-B276-508354553413@microsoft.com>

On Wed, Aug 02, 2023 at 04:11:18AM +0000, Ajay Sharma wrote:
> 
> 
> > On Aug 1, 2023, at 6:46 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > 
> > On Tue, Aug 01, 2023 at 07:06:57PM +0000, Long Li wrote:
> > 
> >> The driver interrupt code limits the CPU processing time of each EQ
> >> by reading a small batch of EQEs in this interrupt. It guarantees
> >> all the EQs are checked on this CPU, and limits the interrupt
> >> processing time for any given EQ. In this way, a bad EQ (which is
> >> stormed by a bad user doing unreasonable re-arming on the CQ) can't
> >> storm other EQs on this CPU.
> > 
> > Of course it can, the bad use just creates a million EQs and pushes a
> > bit of work through them constantly. How is that really any different
> > from pushing more EQEs into a single EQ?
> > 
> > And how does your EQ multiplexing work anyhow? Do you poll every EQ on
> > every interrupt? That itself is a DOS vector.
> 
> User does not create eqs directly . EQ creation is by product of
> opening device ie allocating context. 

Which is done directly by the user.

> I am not sure if the same
> process is allowed to open device multiple times 

Of course it can.

> of lock implemented. So million eqs are probably far fetched .

Uh, how do you conclude that?

>  As for how the eq servicing is done - only those eq’s for which the
> interrupt is raised are checked. And each eq is tied only once and
> only to a single interrupt.

So you iterate over a list of EQs in every interrupt?

Allowing userspace to increase the number of EQs on an interrupt is a
direct DOS vector, no special fussing required.

If you want this to work properly you need to have your HW arrange
things so there is only ever one EQE in the EQ for a given CQ at any
time. Another EQE cannot be stuffed by the HW until the kernel reads
the first EQE and acks it back.

You have almost got this right, the mistake is that userspace is the
thing that allows the HW to generate a new EQE. If you care about DOS
then this is the wrong design, the kernel and only the kernel must be
able to trigger a new EQE for the CQ.

In effect you need two CQ doorbells, a userspace one that re-arms the
CQ, and a kernel one that allows a CQ that triggered on ARM to
generate an EQE.

Thus the kernel can strictly limit the flow of EQEs through the EQs
such that an EQ can never overflow and a CQ can never consume more
than one EQE.

You cannot really fix this hardware problem with a software
solution. You will always have a DOS at some point.

Jason

^ permalink raw reply

* RE: [patch v2 21/38] x86/cpu: Provide cpu_init/parse_topology()
From: Michael Kelley (LINUX) @ 2023-08-02 14:43 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra
  Cc: LKML, x86@kernel.org, Tom Lendacky, Andrew Cooper,
	Arjan van de Ven, James E.J. Bottomley, Dick Kennedy, James Smart,
	Martin K. Petersen, linux-scsi@vger.kernel.org, Guenter Roeck,
	linux-hwmon@vger.kernel.org, Jean Delvare, Huang Rui,
	Juergen Gross, Steve Wahl, Mike Travis, Dimitri Sivanich,
	Russ Anderson, linux-hyperv@vger.kernel.org, Linus Torvalds,
	Greg Kroah-Hartman
In-Reply-To: <87r0omjt8c.ffs@tglx>

From: Thomas Gleixner <tglx@linutronix.de> Sent: Tuesday, August 1, 2023 3:25 PM
> 
> Michael!
> 
> On Tue, Aug 01 2023 at 00:12, Thomas Gleixner wrote:
> > On Mon, Jul 31 2023 at 21:27, Michael Kelley wrote:
> > Clearly the hyper-v BIOS people put a lot of thoughts into this:
> >
> >>    x2APIC features / processor topology (0xb):
> >>       extended APIC ID                      = 0
> >>       --- level 0 ---
> >>       level number                          = 0x0 (0)
> >>       level type                            = thread (1)
> >>       bit width of level                    = 0x1 (1)
> >>       number of logical processors at level = 0x2 (2)
> >>       --- level 1 ---
> >>       level number                          = 0x1 (1)
> >>       level type                            = core (2)
> >>       bit width of level                    = 0x6 (6)
> >>       number of logical processors at level = 0x40 (64)
> >
> > FAIL:                                           ^^^^^
> >
> > While that field is not meant for topology evaluation it is at least
> > expected to tell the actual number of logical processors at that level
> > which are actually available.
> >
> > The CPUID APIC ID aka initial_apicid clearly tells that the topology has
> > 36 logical CPUs in package 0 and 36 in package 1 according to your
> > table.
> >
> > On real hardware this looks like this:
> >
> >       --- level 1 ---
> >       level number                          = 0x1 (1)
> >       level type                            = core (2)
> >       bit width of level                    = 0x6 (6)
> >       number of logical processors at level = 0x38 (56)
> >
> > Which corresponds to reality and is consistent. But sure, consistency is
> > overrated.
> 
> So I looked really hard to find some hint how to detect this situation
> on the boot CPU, which allows us to mitigate it, but there is none at
> all.
> 
> So we are caught between a rock and a hard place, which provides us two
> mutually exclusive options to chose from:
> 
>   1) Have a sane topology evaluation mechanism which solves the known
>      problems of hybrid systems, wrong sizing estimates and other
>      unpleasantries.
> 
>   2) Support the Hyper-V BIOS trainwreck forever.
> 
> Unsurprisingly #2 is not really an option as #1 is a crucial issue for
> the kernel and we need it resolved urgently as of yesterday.
> 
> So while I'm definitely a strong supporter of no-regression policy, I
> have to make an argument here why this particular issue is _not_
> covered:
> 
>  1) Hyper-V BIOS/firmware violates the firmware specification and
>     requirements which are clearly spelled out in the SDM.
> 
>  2) This violatation is reported on every boot with one promiment
>     message per brought up AP where the initial APIC ID as provided by
>     CPUID leaf 0xB deviates from the APIC ID read from "hardware", which is
>     also provided by MADT starting with CPU 36 in the provided example:
> 
>     "[FIRMWARE BUG] CPU36: APIC id mismatch. Firmware: 40 APIC: 24"
> 
>     repeating itself up to CPU71 with the relevant diverging APIC IDs.
> 
>     At least that's what the upstream kernel produces according to
>     validate_apic_and_package_id() in such an situation.
> 
>  3) This is known for years and the Hyper-V Linux team tried to get this
>     resolved, but obviously their arguments fell on deaf ears.
> 
>     IOW, the firmware BUG message has been ignored willfully for years
>     due to "works for me, why should I care?" attitude.
> 
> Seriously, kernel development cannot be held hostage forever by the
> wilful ignorance of a BIOS team, which refuses to adhere to
> specifications and defines their own world order.
> 
> The x86 maintainer team is chosing the lesser of two evils and lets
> those who created the problem and refused to resolve it deal with the
> outcome.

Fair enough.  I don't have any basis to argue otherwise.  I'm in
discussions with the Hyper-V team about getting it fully fixed in
Hyper-V, and it looks like there's some movement to make it happen.

> 
> Just to clarify. This is not preventing affected guests from booting.
> The worst consequence is a slight performance regression because the
> firmware provided topology information is not matching reality and
> therefore the scheduler placement vs. L3 affinity sucks. That's clearly
> not a kernel problem.

Yes, if Linux will still boots and runs, that helps.  Then it really is up the
(virtual) firmware in Hyper-V to provide the correct topology information
so performance is as expected.

> 
> I'm happy to aid accelerating this thought process by elevating the
> existing pr_err(FW_BUG....) to a solid WARN_ON_ONCE(). See below.

Your choice.  In this particular case, it won't make a difference either
way.

Michael

> 
> Thanks,
> 
>         tglx
> ---
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1688,7 +1688,7 @@ static void validate_apic_and_package_id
> 
>  	apicid = apic->cpu_present_to_apicid(cpu);
> 
> -	if (apicid != c->topo.apicid) {
> +	if (WARN_ON_ONCE(apicid != c->topo.apicid)) {
>  		pr_err(FW_BUG "CPU%u: APIC id mismatch. Firmware: %x APIC: %x\n",
>  		       cpu, apicid, c->topo.initial_apicid);
>  	}

^ permalink raw reply

* Re: [PATCH bpf-next 1/3] eth: add missing xdp.h includes in drivers
From: Alexander Lobakin @ 2023-08-02 16:00 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: ast, netdev, bpf, hawk, amritha.nambiar, j.vosburgh, andy,
	shayagr, akiyano, ioana.ciornei, claudiu.manoil, vladimir.oltean,
	wei.fang, shenwei.wang, xiaoning.wang, linux-imx, dmichail,
	jeroendb, pkaligineedi, shailend, jesse.brandeburg,
	anthony.l.nguyen, horatiu.vultur, UNGLinuxDriver, kys, haiyangz,
	wei.liu, decui, peppe.cavallaro, alexandre.torgue, joabreu,
	mcoquelin.stm32, grygorii.strashko, longli, sharmaajay, daniel,
	john.fastabend, gerhard, simon.horman, leon, linux-hyperv
In-Reply-To: <20230802003246.2153774-2-kuba@kernel.org>

From: Jakub Kicinski <kuba@kernel.org>
Date: Tue,  1 Aug 2023 17:32:44 -0700

> Handful of drivers currently expect to get xdp.h by virtue
> of including netdevice.h. This will soon no longer be the case
> so add explicit includes.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> CC: j.vosburgh@gmail.com
> CC: andy@greyhouse.net
> CC: shayagr@amazon.com
> CC: akiyano@amazon.com
> CC: ioana.ciornei@nxp.com
> CC: claudiu.manoil@nxp.com
> CC: vladimir.oltean@nxp.com
> CC: wei.fang@nxp.com
> CC: shenwei.wang@nxp.com
> CC: xiaoning.wang@nxp.com
> CC: linux-imx@nxp.com
> CC: dmichail@fungible.com
> CC: jeroendb@google.com
> CC: pkaligineedi@google.com
> CC: shailend@google.com
> CC: jesse.brandeburg@intel.com
> CC: anthony.l.nguyen@intel.com
> CC: horatiu.vultur@microchip.com
> CC: UNGLinuxDriver@microchip.com
> CC: kys@microsoft.com
> CC: haiyangz@microsoft.com
> CC: wei.liu@kernel.org
> CC: decui@microsoft.com
> CC: peppe.cavallaro@st.com
> CC: alexandre.torgue@foss.st.com
> CC: joabreu@synopsys.com
> CC: mcoquelin.stm32@gmail.com
> CC: grygorii.strashko@ti.com
> CC: longli@microsoft.com
> CC: sharmaajay@microsoft.com
> CC: daniel@iogearbox.net
> CC: hawk@kernel.org
> CC: john.fastabend@gmail.com
> CC: gerhard@engleder-embedded.com
> CC: simon.horman@corigine.com
> CC: leon@kernel.org
> CC: linux-hyperv@vger.kernel.org
> CC: bpf@vger.kernel.org
> ---
>  drivers/net/bonding/bond_main.c                       | 1 +
>  drivers/net/ethernet/amazon/ena/ena_netdev.h          | 1 +
>  drivers/net/ethernet/engleder/tsnep.h                 | 1 +
>  drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h      | 1 +
>  drivers/net/ethernet/freescale/enetc/enetc.h          | 1 +
>  drivers/net/ethernet/freescale/fec.h                  | 1 +
>  drivers/net/ethernet/fungible/funeth/funeth_txrx.h    | 1 +
>  drivers/net/ethernet/google/gve/gve.h                 | 1 +
>  drivers/net/ethernet/intel/igc/igc.h                  | 1 +
>  drivers/net/ethernet/microchip/lan966x/lan966x_main.h | 1 +
>  drivers/net/ethernet/microsoft/mana/mana_en.c         | 1 +
>  drivers/net/ethernet/stmicro/stmmac/stmmac.h          | 1 +
>  drivers/net/ethernet/ti/cpsw_priv.h                   | 1 +
>  drivers/net/hyperv/hyperv_net.h                       | 1 +
>  drivers/net/tap.c                                     | 1 +
>  include/net/mana/mana.h                               | 2 ++
>  16 files changed, 17 insertions(+)
> 
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 7a0f25301f7e..2f21cca4fdaf 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -90,6 +90,7 @@
>  #include <net/tls.h>
>  #endif
>  #include <net/ip6_route.h>
> +#include <net/xdp.h>
>  
>  #include "bonding_priv.h"
>  
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h b/drivers/net/ethernet/amazon/ena/ena_netdev.h
> index 248b715b4d68..a1134152ecce 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
> @@ -15,6 +15,7 @@
>  #include <linux/netdevice.h>
>  #include <linux/skbuff.h>
>  #include <uapi/linux/bpf.h>
> +#include <net/xdp.h>

Alphabetical sorting? :>
(for the entire patch)

>  
>  #include "ena_com.h"
>  #include "ena_eth_com.h"

[...]

Thanks,
Olek

^ permalink raw reply

* RE: [PATCH] hv_netvsc: fix netvsc_send_completion to avoid multiple message length checks
From: Michael Kelley (LINUX) @ 2023-08-02 17:14 UTC (permalink / raw)
  To: Sonia Sharma, linux-kernel@vger.kernel.org
  Cc: linux-hyperv@vger.kernel.org, Sonia Sharma, KY Srinivasan,
	Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui, Long Li,
	davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com
In-Reply-To: <1690955450-5994-1-git-send-email-sosha@linux.microsoft.com>

From: Sonia Sharma <sosha@linux.microsoft.com> Sent: Tuesday, August 1, 2023 10:51 PM
> 

The Subject line for networking patches should be tagged
with "net" for fixes to the release that's currently in progress,
or "net-next" for the next release.  Bug fixes like this one can
be "net".   Look through the mailing list archives for examples.

> switch statement in netvsc_send_completion() is incorrectly validating

s/switch/The switch/

> the length of incoming network packets by falling through next case.

s/through next case/through to the next case/

> Avoid the fallthrough, instead break after a case match and then process

s/fallthrough, instead/fallthrough.  Instead,/

> the complete() call.
> 
> Signed-off-by: Sonia Sharma <sonia.sharma@linux.microsoft.com>
> ---
>  drivers/net/hyperv/netvsc.c | 17 +++++++++--------
>  1 file changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 82e9796c8f5e..347688dd2eb9 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -851,7 +851,7 @@ static void netvsc_send_completion(struct net_device *ndev,
>  				   msglen);
>  			return;
>  		}
> -		fallthrough;
> +		break;
> 
>  	case NVSP_MSG1_TYPE_SEND_RECV_BUF_COMPLETE:
>  		if (msglen < sizeof(struct nvsp_message_header) +
> @@ -860,7 +860,7 @@ static void netvsc_send_completion(struct net_device *ndev,
>  				   msglen);
>  			return;
>  		}
> -		fallthrough;
> +		break;
> 
>  	case NVSP_MSG1_TYPE_SEND_SEND_BUF_COMPLETE:
>  		if (msglen < sizeof(struct nvsp_message_header) +
> @@ -869,7 +869,7 @@ static void netvsc_send_completion(struct net_device *ndev,
>  				   msglen);
>  			return;
>  		}
> -		fallthrough;
> +		break;
> 
>  	case NVSP_MSG5_TYPE_SUBCHANNEL:
>  		if (msglen < sizeof(struct nvsp_message_header) +
> @@ -878,10 +878,6 @@ static void netvsc_send_completion(struct net_device *ndev,
>  				   msglen);
>  			return;
>  		}
> -		/* Copy the response back */
> -		memcpy(&net_device->channel_init_pkt, nvsp_packet,
> -		       sizeof(struct nvsp_message));
> -		complete(&net_device->channel_init_wait);
>  		break;
> 
>  	case NVSP_MSG1_TYPE_SEND_RNDIS_PKT_COMPLETE:
> @@ -904,13 +900,18 @@ static void netvsc_send_completion(struct net_device *ndev,
> 
>  		netvsc_send_tx_complete(ndev, net_device, incoming_channel,
>  					desc, budget);
> -		break;
> +		return;
> 
>  	default:
>  		netdev_err(ndev,
>  			   "Unknown send completion type %d received!!\n",
>  			   nvsp_packet->hdr.msg_type);

Need to add a "return" statement here so that the error case doesn't
try to do the memcpy() and complete().

>  	}
> +
> +	/* Copy the response back */
> +	memcpy(&net_device->channel_init_pkt, nvsp_packet,
> +			sizeof(struct nvsp_message));
> +	complete(&net_device->channel_init_wait);
>  }
> 
>  static u32 netvsc_get_next_send_section(struct netvsc_device *net_device)
> --
> 2.25.1


^ permalink raw reply

* RE: [PATCH] hv_balloon: Update the balloon driver to use the SBRM API
From: Michael Kelley (LINUX) @ 2023-08-02 17:47 UTC (permalink / raw)
  To: levymitchell0@gmail.com, KY Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	mikelly@microsoft.com, peterz@infradead.org, Boqun Feng
In-Reply-To: <20230726-master-v1-1-b2ce6a4538db@gmail.com>

From: Mitchell Levy via B4 Relay <devnull+levymitchell0.gmail.com@kernel.org> Sent: Tuesday, July 25, 2023 5:24 PM
>
> This patch is intended as a proof-of-concept for the new SBRM
> machinery[1]. For some brief background, the idea behind SBRM is using
> the __cleanup__ attribute to automatically unlock locks (or otherwise
> release resources) when they go out of scope, similar to C++ style RAII.
> This promises some benefits such as making code simpler (particularly
> where you have lots of goto fail; type constructs) as well as reducing
> the surface area for certain kinds of bugs.
> 
> The changes in this patch should not result in any difference in how the
> code actually runs (i.e., it's purely an exercise in this new syntax
> sugar). In one instance SBRM was not appropriate, so I left that part
> alone, but all other locking/unlocking is handled automatically in this
> patch.
> 
> Link: https://lore.kernel.org/all/20230626125726.GU4253@hirez.programming.kicks-ass.net/ [1]

I haven't previously seen the "[1]" footnote-style identifier used with the
Link: tag.  Usually the "[1]" goes at the beginning of the line with the
additional information, but that conflicts with the Link: tag.  Maybe I'm
wrong, but you might either omit the footnote-style identifier, or the Link:
tag, instead of trying to use them together.

Separately, have you built a kernel for ARM64 with these changes in
place?  The Hyper-V balloon driver is used on both x86 and ARM64
architectures.  There's nothing obviously architecture specific here,
but given that SBRM is new, it might be wise to verify that all is good
when building and running on ARM64.

> 
> Suggested-by: Boqun Feng <boqun.feng@gmail.com>
> Signed-off-by: "Mitchell Levy (Microsoft)" <levymitchell0@gmail.com>
> ---
>  drivers/hv/hv_balloon.c | 82 +++++++++++++++++++++++--------------------------
>  1 file changed, 38 insertions(+), 44 deletions(-)
> 
> diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
> index dffcc894f117..2812601e84da 100644
> --- a/drivers/hv/hv_balloon.c
> +++ b/drivers/hv/hv_balloon.c
> @@ -8,6 +8,7 @@
> 
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> 
> +#include <linux/cleanup.h>
>  #include <linux/kernel.h>
>  #include <linux/jiffies.h>
>  #include <linux/mman.h>
> @@ -646,7 +647,7 @@ static int hv_memory_notifier(struct notifier_block *nb, unsigned long val,
>  			      void *v)
>  {
>  	struct memory_notify *mem = (struct memory_notify *)v;
> -	unsigned long flags, pfn_count;
> +	unsigned long pfn_count;
> 
>  	switch (val) {
>  	case MEM_ONLINE:
> @@ -655,21 +656,22 @@ static int hv_memory_notifier(struct notifier_block *nb, unsigned long val,
>  		break;
> 
>  	case MEM_OFFLINE:
> -		spin_lock_irqsave(&dm_device.ha_lock, flags);
> -		pfn_count = hv_page_offline_check(mem->start_pfn,
> -						  mem->nr_pages);
> -		if (pfn_count <= dm_device.num_pages_onlined) {
> -			dm_device.num_pages_onlined -= pfn_count;
> -		} else {
> -			/*
> -			 * We're offlining more pages than we managed to online.
> -			 * This is unexpected. In any case don't let
> -			 * num_pages_onlined wrap around zero.
> -			 */
> -			WARN_ON_ONCE(1);
> -			dm_device.num_pages_onlined = 0;
> +		scoped_guard(spinlock_irqsave, &dm_device.ha_lock) {
> +			pfn_count = hv_page_offline_check(mem->start_pfn,
> +							  mem->nr_pages);
> +			if (pfn_count <= dm_device.num_pages_onlined) {
> +				dm_device.num_pages_onlined -= pfn_count;
> +			} else {
> +				/*
> +				 * We're offlining more pages than we
> +				 * managed to online. This is
> +				 * unexpected. In any case don't let
> +				 * num_pages_onlined wrap around zero.
> +				 */
> +				WARN_ON_ONCE(1);
> +				dm_device.num_pages_onlined = 0;
> +			}
>  		}
> -		spin_unlock_irqrestore(&dm_device.ha_lock, flags);
>  		break;
>  	case MEM_GOING_ONLINE:
>  	case MEM_GOING_OFFLINE:
> @@ -721,24 +723,23 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
>  	unsigned long start_pfn;
>  	unsigned long processed_pfn;
>  	unsigned long total_pfn = pfn_count;
> -	unsigned long flags;
> 
>  	for (i = 0; i < (size/HA_CHUNK); i++) {
>  		start_pfn = start + (i * HA_CHUNK);
> 
> -		spin_lock_irqsave(&dm_device.ha_lock, flags);
> -		has->ha_end_pfn +=  HA_CHUNK;
> +		scoped_guard(spinlock_irqsave, &dm_device.ha_lock) {
> +			has->ha_end_pfn +=  HA_CHUNK;
> 
> -		if (total_pfn > HA_CHUNK) {
> -			processed_pfn = HA_CHUNK;
> -			total_pfn -= HA_CHUNK;
> -		} else {
> -			processed_pfn = total_pfn;
> -			total_pfn = 0;
> -		}
> +			if (total_pfn > HA_CHUNK) {
> +				processed_pfn = HA_CHUNK;
> +				total_pfn -= HA_CHUNK;
> +			} else {
> +				processed_pfn = total_pfn;
> +				total_pfn = 0;
> +			}
> 
> -		has->covered_end_pfn +=  processed_pfn;
> -		spin_unlock_irqrestore(&dm_device.ha_lock, flags);
> +			has->covered_end_pfn +=  processed_pfn;
> +		}
> 
>  		reinit_completion(&dm_device.ol_waitevent);
> 
> @@ -758,10 +759,10 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
>  				 */
>  				do_hot_add = false;
>  			}
> -			spin_lock_irqsave(&dm_device.ha_lock, flags);
> -			has->ha_end_pfn -= HA_CHUNK;
> -			has->covered_end_pfn -=  processed_pfn;
> -			spin_unlock_irqrestore(&dm_device.ha_lock, flags);
> +			scoped_guard(spinlock_irqsave, &dm_device.ha_lock) {
> +				has->ha_end_pfn -= HA_CHUNK;
> +				has->covered_end_pfn -=  processed_pfn;
> +			}
>  			break;
>  		}
> 
> @@ -781,10 +782,9 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
>  static void hv_online_page(struct page *pg, unsigned int order)
>  {
>  	struct hv_hotadd_state *has;
> -	unsigned long flags;
>  	unsigned long pfn = page_to_pfn(pg);
> 
> -	spin_lock_irqsave(&dm_device.ha_lock, flags);
> +	guard(spinlock_irqsave)(&dm_device.ha_lock);
>  	list_for_each_entry(has, &dm_device.ha_region_list, list) {
>  		/* The page belongs to a different HAS. */
>  		if ((pfn < has->start_pfn) ||
> @@ -794,7 +794,6 @@ static void hv_online_page(struct page *pg, unsigned int order)
>  		hv_bring_pgs_online(has, pfn, 1UL << order);
>  		break;
>  	}
> -	spin_unlock_irqrestore(&dm_device.ha_lock, flags);
>  }
> 
>  static int pfn_covered(unsigned long start_pfn, unsigned long pfn_cnt)
> @@ -803,9 +802,8 @@ static int pfn_covered(unsigned long start_pfn, unsigned long pfn_cnt)
>  	struct hv_hotadd_gap *gap;
>  	unsigned long residual, new_inc;
>  	int ret = 0;
> -	unsigned long flags;
> 
> -	spin_lock_irqsave(&dm_device.ha_lock, flags);
> +	guard(spinlock_irqsave)(&dm_device.ha_lock);
>  	list_for_each_entry(has, &dm_device.ha_region_list, list) {
>  		/*
>  		 * If the pfn range we are dealing with is not in the current
> @@ -852,7 +850,6 @@ static int pfn_covered(unsigned long start_pfn, unsigned long pfn_cnt)
>  		ret = 1;
>  		break;
>  	}
> -	spin_unlock_irqrestore(&dm_device.ha_lock, flags);
> 
>  	return ret;
>  }
> @@ -947,7 +944,6 @@ static unsigned long process_hot_add(unsigned long pg_start,
>  {
>  	struct hv_hotadd_state *ha_region = NULL;
>  	int covered;
> -	unsigned long flags;
> 
>  	if (pfn_cnt == 0)
>  		return 0;
> @@ -979,9 +975,9 @@ static unsigned long process_hot_add(unsigned long pg_start,
>  		ha_region->covered_end_pfn = pg_start;
>  		ha_region->end_pfn = rg_start + rg_size;
> 
> -		spin_lock_irqsave(&dm_device.ha_lock, flags);
> -		list_add_tail(&ha_region->list, &dm_device.ha_region_list);
> -		spin_unlock_irqrestore(&dm_device.ha_lock, flags);
> +		scoped_guard(spinlock_irqsave, &dm_device.ha_lock) {
> +			list_add_tail(&ha_region->list, &dm_device.ha_region_list);
> +		}
>  	}
> 
>  do_pg_range:
> @@ -2047,7 +2043,6 @@ static void balloon_remove(struct hv_device *dev)
>  	struct hv_dynmem_device *dm = hv_get_drvdata(dev);
>  	struct hv_hotadd_state *has, *tmp;
>  	struct hv_hotadd_gap *gap, *tmp_gap;
> -	unsigned long flags;
> 
>  	if (dm->num_pages_ballooned != 0)
>  		pr_warn("Ballooned pages: %d\n", dm->num_pages_ballooned);
> @@ -2073,7 +2068,7 @@ static void balloon_remove(struct hv_device *dev)
>  #endif
>  	}
> 
> -	spin_lock_irqsave(&dm_device.ha_lock, flags);
> +	guard(spinlock_irqsave)(&dm_device.ha_lock);
>  	list_for_each_entry_safe(has, tmp, &dm->ha_region_list, list) {
>  		list_for_each_entry_safe(gap, tmp_gap, &has->gap_list, list) {
>  			list_del(&gap->list);
> @@ -2082,7 +2077,6 @@ static void balloon_remove(struct hv_device *dev)
>  		list_del(&has->list);
>  		kfree(has);
>  	}
> -	spin_unlock_irqrestore(&dm_device.ha_lock, flags);
>  }
> 
>  static int balloon_suspend(struct hv_device *hv_dev)
> 
> ---
> base-commit: 3f01e9fed8454dcd89727016c3e5b2fbb8f8e50c
> change-id: 20230725-master-bbcd9205758b
> 
> Best regards,
> --
> Mitchell Levy <levymitchell0@gmail.com>

These lines at the end of the patch look spurious.  But Boqun has
already commented on that.

Michael

^ permalink raw reply

* [PATCH V5,net-next] net: mana: Add page pool for RX buffers
From: Haiyang Zhang @ 2023-08-02 18:07 UTC (permalink / raw)
  To: linux-hyperv, netdev
  Cc: haiyangz, decui, kys, paulros, olaf, vkuznets, davem, wei.liu,
	edumazet, kuba, pabeni, leon, longli, ssengar, linux-rdma, daniel,
	john.fastabend, bpf, ast, sharmaajay, hawk, tglx, shradhagupta,
	linux-kernel

Add page pool for RX buffers for faster buffer cycle and reduce CPU
usage.

The standard page pool API is used.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
V5:
In err path, set page_pool_put_full_page(..., false) as suggested by
Jakub Kicinski
V4:
Add nid setting, remove page_pool_nid_changed(), as suggested by
Jesper Dangaard Brouer
V3:
Update xdp mem model, pool param, alloc as suggested by Jakub Kicinski
V2:
Use the standard page pool API as suggested by Jesper Dangaard Brouer
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 90 +++++++++++++++----
 include/net/mana/mana.h                       |  3 +
 2 files changed, 77 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index ac2acc9aca9d..1a4ac1c8736e 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1414,8 +1414,8 @@ static struct sk_buff *mana_build_skb(struct mana_rxq *rxq, void *buf_va,
 	return skb;
 }
 
-static void mana_rx_skb(void *buf_va, struct mana_rxcomp_oob *cqe,
-			struct mana_rxq *rxq)
+static void mana_rx_skb(void *buf_va, bool from_pool,
+			struct mana_rxcomp_oob *cqe, struct mana_rxq *rxq)
 {
 	struct mana_stats_rx *rx_stats = &rxq->stats;
 	struct net_device *ndev = rxq->ndev;
@@ -1448,6 +1448,9 @@ static void mana_rx_skb(void *buf_va, struct mana_rxcomp_oob *cqe,
 	if (!skb)
 		goto drop;
 
+	if (from_pool)
+		skb_mark_for_recycle(skb);
+
 	skb->dev = napi->dev;
 
 	skb->protocol = eth_type_trans(skb, ndev);
@@ -1498,9 +1501,14 @@ static void mana_rx_skb(void *buf_va, struct mana_rxcomp_oob *cqe,
 	u64_stats_update_end(&rx_stats->syncp);
 
 drop:
-	WARN_ON_ONCE(rxq->xdp_save_va);
-	/* Save for reuse */
-	rxq->xdp_save_va = buf_va;
+	if (from_pool) {
+		page_pool_recycle_direct(rxq->page_pool,
+					 virt_to_head_page(buf_va));
+	} else {
+		WARN_ON_ONCE(rxq->xdp_save_va);
+		/* Save for reuse */
+		rxq->xdp_save_va = buf_va;
+	}
 
 	++ndev->stats.rx_dropped;
 
@@ -1508,11 +1516,13 @@ static void mana_rx_skb(void *buf_va, struct mana_rxcomp_oob *cqe,
 }
 
 static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
-			     dma_addr_t *da, bool is_napi)
+			     dma_addr_t *da, bool *from_pool, bool is_napi)
 {
 	struct page *page;
 	void *va;
 
+	*from_pool = false;
+
 	/* Reuse XDP dropped page if available */
 	if (rxq->xdp_save_va) {
 		va = rxq->xdp_save_va;
@@ -1533,17 +1543,22 @@ static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
 			return NULL;
 		}
 	} else {
-		page = dev_alloc_page();
+		page = page_pool_dev_alloc_pages(rxq->page_pool);
 		if (!page)
 			return NULL;
 
+		*from_pool = true;
 		va = page_to_virt(page);
 	}
 
 	*da = dma_map_single(dev, va + rxq->headroom, rxq->datasize,
 			     DMA_FROM_DEVICE);
 	if (dma_mapping_error(dev, *da)) {
-		put_page(virt_to_head_page(va));
+		if (*from_pool)
+			page_pool_put_full_page(rxq->page_pool, page, false);
+		else
+			put_page(virt_to_head_page(va));
+
 		return NULL;
 	}
 
@@ -1552,21 +1567,25 @@ static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
 
 /* Allocate frag for rx buffer, and save the old buf */
 static void mana_refill_rx_oob(struct device *dev, struct mana_rxq *rxq,
-			       struct mana_recv_buf_oob *rxoob, void **old_buf)
+			       struct mana_recv_buf_oob *rxoob, void **old_buf,
+			       bool *old_fp)
 {
+	bool from_pool;
 	dma_addr_t da;
 	void *va;
 
-	va = mana_get_rxfrag(rxq, dev, &da, true);
+	va = mana_get_rxfrag(rxq, dev, &da, &from_pool, true);
 	if (!va)
 		return;
 
 	dma_unmap_single(dev, rxoob->sgl[0].address, rxq->datasize,
 			 DMA_FROM_DEVICE);
 	*old_buf = rxoob->buf_va;
+	*old_fp = rxoob->from_pool;
 
 	rxoob->buf_va = va;
 	rxoob->sgl[0].address = da;
+	rxoob->from_pool = from_pool;
 }
 
 static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
@@ -1580,6 +1599,7 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 	struct device *dev = gc->dev;
 	void *old_buf = NULL;
 	u32 curr, pktlen;
+	bool old_fp;
 
 	apc = netdev_priv(ndev);
 
@@ -1622,12 +1642,12 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 	rxbuf_oob = &rxq->rx_oobs[curr];
 	WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
 
-	mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf);
+	mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
 
 	/* Unsuccessful refill will have old_buf == NULL.
 	 * In this case, mana_rx_skb() will drop the packet.
 	 */
-	mana_rx_skb(old_buf, oob, rxq);
+	mana_rx_skb(old_buf, old_fp, oob, rxq);
 
 drop:
 	mana_move_wq_tail(rxq->gdma_rq, rxbuf_oob->wqe_inf.wqe_size_in_bu);
@@ -1887,6 +1907,7 @@ static void mana_destroy_rxq(struct mana_port_context *apc,
 	struct mana_recv_buf_oob *rx_oob;
 	struct device *dev = gc->dev;
 	struct napi_struct *napi;
+	struct page *page;
 	int i;
 
 	if (!rxq)
@@ -1919,10 +1940,18 @@ static void mana_destroy_rxq(struct mana_port_context *apc,
 		dma_unmap_single(dev, rx_oob->sgl[0].address,
 				 rx_oob->sgl[0].size, DMA_FROM_DEVICE);
 
-		put_page(virt_to_head_page(rx_oob->buf_va));
+		page = virt_to_head_page(rx_oob->buf_va);
+
+		if (rx_oob->from_pool)
+			page_pool_put_full_page(rxq->page_pool, page, false);
+		else
+			put_page(page);
+
 		rx_oob->buf_va = NULL;
 	}
 
+	page_pool_destroy(rxq->page_pool);
+
 	if (rxq->gdma_rq)
 		mana_gd_destroy_queue(gc, rxq->gdma_rq);
 
@@ -1933,18 +1962,20 @@ static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
 			    struct mana_rxq *rxq, struct device *dev)
 {
 	struct mana_port_context *mpc = netdev_priv(rxq->ndev);
+	bool from_pool = false;
 	dma_addr_t da;
 	void *va;
 
 	if (mpc->rxbufs_pre)
 		va = mana_get_rxbuf_pre(rxq, &da);
 	else
-		va = mana_get_rxfrag(rxq, dev, &da, false);
+		va = mana_get_rxfrag(rxq, dev, &da, &from_pool, false);
 
 	if (!va)
 		return -ENOMEM;
 
 	rx_oob->buf_va = va;
+	rx_oob->from_pool = from_pool;
 
 	rx_oob->sgl[0].address = da;
 	rx_oob->sgl[0].size = rxq->datasize;
@@ -2014,6 +2045,26 @@ static int mana_push_wqe(struct mana_rxq *rxq)
 	return 0;
 }
 
+static int mana_create_page_pool(struct mana_rxq *rxq, struct gdma_context *gc)
+{
+	struct page_pool_params pprm = {};
+	int ret;
+
+	pprm.pool_size = RX_BUFFERS_PER_QUEUE;
+	pprm.nid = gc->numa_node;
+	pprm.napi = &rxq->rx_cq.napi;
+
+	rxq->page_pool = page_pool_create(&pprm);
+
+	if (IS_ERR(rxq->page_pool)) {
+		ret = PTR_ERR(rxq->page_pool);
+		rxq->page_pool = NULL;
+		return ret;
+	}
+
+	return 0;
+}
+
 static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 					u32 rxq_idx, struct mana_eq *eq,
 					struct net_device *ndev)
@@ -2043,6 +2094,13 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 	mana_get_rxbuf_cfg(ndev->mtu, &rxq->datasize, &rxq->alloc_size,
 			   &rxq->headroom);
 
+	/* Create page pool for RX queue */
+	err = mana_create_page_pool(rxq, gc);
+	if (err) {
+		netdev_err(ndev, "Create page pool err:%d\n", err);
+		goto out;
+	}
+
 	err = mana_alloc_rx_wqe(apc, rxq, &rq_size, &cq_size);
 	if (err)
 		goto out;
@@ -2114,8 +2172,8 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 
 	WARN_ON(xdp_rxq_info_reg(&rxq->xdp_rxq, ndev, rxq_idx,
 				 cq->napi.napi_id));
-	WARN_ON(xdp_rxq_info_reg_mem_model(&rxq->xdp_rxq,
-					   MEM_TYPE_PAGE_SHARED, NULL));
+	WARN_ON(xdp_rxq_info_reg_mem_model(&rxq->xdp_rxq, MEM_TYPE_PAGE_POOL,
+					   rxq->page_pool));
 
 	napi_enable(&cq->napi);
 
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 024ad8ddb27e..b12859511839 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -280,6 +280,7 @@ struct mana_recv_buf_oob {
 	struct gdma_wqe_request wqe_req;
 
 	void *buf_va;
+	bool from_pool; /* allocated from a page pool */
 
 	/* SGL of the buffer going to be sent has part of the work request. */
 	u32 num_sge;
@@ -330,6 +331,8 @@ struct mana_rxq {
 	bool xdp_flush;
 	int xdp_rc; /* XDP redirect return code */
 
+	struct page_pool *page_pool;
+
 	/* MUST BE THE LAST MEMBER:
 	 * Each receive buffer has an associated mana_recv_buf_oob.
 	 */
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH RFC net-next v5 11/14] vhost/vsock: implement datagram support
From: Bobby Eshleman @ 2023-08-02 19:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Bobby Eshleman, linux-hyperv, Stefan Hajnoczi, kvm,
	VMware PV-Drivers Reviewers, Simon Horman, virtualization,
	Eric Dumazet, Dan Carpenter, Xuan Zhuo, Wei Liu, Dexuan Cui,
	Bryan Tan, Jakub Kicinski, Paolo Abeni, Haiyang Zhang,
	Krasnov Arseniy, Vishnu Dasa, netdev, linux-kernel, bpf,
	David S. Miller
In-Reply-To: <20230726143850-mutt-send-email-mst@kernel.org>

On Wed, Jul 26, 2023 at 02:40:22PM -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 19, 2023 at 12:50:15AM +0000, Bobby Eshleman wrote:
> > This commit implements datagram support for vhost/vsock by teaching
> > vhost to use the common virtio transport datagram functions.
> > 
> > If the virtio RX buffer is too small, then the transmission is
> > abandoned, the packet dropped, and EHOSTUNREACH is added to the socket's
> > error queue.
> > 
> > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> 
> EHOSTUNREACH?
> 
> 
> > ---
> >  drivers/vhost/vsock.c    | 62 +++++++++++++++++++++++++++++++++++++++++++++---
> >  net/vmw_vsock/af_vsock.c |  5 +++-
> >  2 files changed, 63 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> > index d5d6a3c3f273..da14260c6654 100644
> > --- a/drivers/vhost/vsock.c
> > +++ b/drivers/vhost/vsock.c
> > @@ -8,6 +8,7 @@
> >   */
> >  #include <linux/miscdevice.h>
> >  #include <linux/atomic.h>
> > +#include <linux/errqueue.h>
> >  #include <linux/module.h>
> >  #include <linux/mutex.h>
> >  #include <linux/vmalloc.h>
> > @@ -32,7 +33,8 @@
> >  enum {
> >  	VHOST_VSOCK_FEATURES = VHOST_FEATURES |
> >  			       (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
> > -			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET)
> > +			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET) |
> > +			       (1ULL << VIRTIO_VSOCK_F_DGRAM)
> >  };
> >  
> >  enum {
> > @@ -56,6 +58,7 @@ struct vhost_vsock {
> >  	atomic_t queued_replies;
> >  
> >  	u32 guest_cid;
> > +	bool dgram_allow;
> >  	bool seqpacket_allow;
> >  };
> >  
> > @@ -86,6 +89,32 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
> >  	return NULL;
> >  }
> >  
> > +/* Claims ownership of the skb, do not free the skb after calling! */
> > +static void
> > +vhost_transport_error(struct sk_buff *skb, int err)
> > +{
> > +	struct sock_exterr_skb *serr;
> > +	struct sock *sk = skb->sk;
> > +	struct sk_buff *clone;
> > +
> > +	serr = SKB_EXT_ERR(skb);
> > +	memset(serr, 0, sizeof(*serr));
> > +	serr->ee.ee_errno = err;
> > +	serr->ee.ee_origin = SO_EE_ORIGIN_NONE;
> > +
> > +	clone = skb_clone(skb, GFP_KERNEL);
> > +	if (!clone)
> > +		return;
> > +
> > +	if (sock_queue_err_skb(sk, clone))
> > +		kfree_skb(clone);
> > +
> > +	sk->sk_err = err;
> > +	sk_error_report(sk);
> > +
> > +	kfree_skb(skb);
> > +}
> > +
> >  static void
> >  vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> >  			    struct vhost_virtqueue *vq)
> > @@ -160,9 +189,15 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> >  		hdr = virtio_vsock_hdr(skb);
> >  
> >  		/* If the packet is greater than the space available in the
> > -		 * buffer, we split it using multiple buffers.
> > +		 * buffer, we split it using multiple buffers for connectible
> > +		 * sockets and drop the packet for datagram sockets.
> >  		 */
> 
> won't this break things like recently proposed zerocopy?
> I think splitup has to be supported for all types.
> 

Could you elaborate? Is there something about zerocopy that would
prohibit the transport from dropping a datagram?

> 
> >  		if (payload_len > iov_len - sizeof(*hdr)) {
> > +			if (le16_to_cpu(hdr->type) == VIRTIO_VSOCK_TYPE_DGRAM) {
> > +				vhost_transport_error(skb, EHOSTUNREACH);
> > +				continue;
> > +			}
> > +
> >  			payload_len = iov_len - sizeof(*hdr);
> >  
> >  			/* As we are copying pieces of large packet's buffer to
> > @@ -394,6 +429,7 @@ static bool vhost_vsock_more_replies(struct vhost_vsock *vsock)
> >  	return val < vq->num;
> >  }
> >  
> > +static bool vhost_transport_dgram_allow(u32 cid, u32 port);
> >  static bool vhost_transport_seqpacket_allow(u32 remote_cid);
> >  
> >  static struct virtio_transport vhost_transport = {
> > @@ -410,7 +446,8 @@ static struct virtio_transport vhost_transport = {
> >  		.cancel_pkt               = vhost_transport_cancel_pkt,
> >  
> >  		.dgram_enqueue            = virtio_transport_dgram_enqueue,
> > -		.dgram_allow              = virtio_transport_dgram_allow,
> > +		.dgram_allow              = vhost_transport_dgram_allow,
> > +		.dgram_addr_init          = virtio_transport_dgram_addr_init,
> >  
> >  		.stream_enqueue           = virtio_transport_stream_enqueue,
> >  		.stream_dequeue           = virtio_transport_stream_dequeue,
> > @@ -443,6 +480,22 @@ static struct virtio_transport vhost_transport = {
> >  	.send_pkt = vhost_transport_send_pkt,
> >  };
> >  
> > +static bool vhost_transport_dgram_allow(u32 cid, u32 port)
> > +{
> > +	struct vhost_vsock *vsock;
> > +	bool dgram_allow = false;
> > +
> > +	rcu_read_lock();
> > +	vsock = vhost_vsock_get(cid);
> > +
> > +	if (vsock)
> > +		dgram_allow = vsock->dgram_allow;
> > +
> > +	rcu_read_unlock();
> > +
> > +	return dgram_allow;
> > +}
> > +
> >  static bool vhost_transport_seqpacket_allow(u32 remote_cid)
> >  {
> >  	struct vhost_vsock *vsock;
> > @@ -799,6 +852,9 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features)
> >  	if (features & (1ULL << VIRTIO_VSOCK_F_SEQPACKET))
> >  		vsock->seqpacket_allow = true;
> >  
> > +	if (features & (1ULL << VIRTIO_VSOCK_F_DGRAM))
> > +		vsock->dgram_allow = true;
> > +
> >  	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
> >  		vq = &vsock->vqs[i];
> >  		mutex_lock(&vq->mutex);
> > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> > index e73f3b2c52f1..449ed63ac2b0 100644
> > --- a/net/vmw_vsock/af_vsock.c
> > +++ b/net/vmw_vsock/af_vsock.c
> > @@ -1427,9 +1427,12 @@ int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
> >  		return prot->recvmsg(sk, msg, len, flags, NULL);
> >  #endif
> >  
> > -	if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
> > +	if (unlikely(flags & MSG_OOB))
> >  		return -EOPNOTSUPP;
> >  
> > +	if (unlikely(flags & MSG_ERRQUEUE))
> > +		return sock_recv_errqueue(sk, msg, len, SOL_VSOCK, 0);
> > +
> >  	transport = vsk->transport;
> >  
> >  	/* Retrieve the head sk_buff from the socket's receive queue. */
> > 
> > -- 
> > 2.30.2
> 
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH] hv_balloon: Update the balloon driver to use the SBRM API
From: Peter Zijlstra @ 2023-08-02 19:28 UTC (permalink / raw)
  To: Michael Kelley (LINUX)
  Cc: levymitchell0@gmail.com, KY Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, mikelly@microsoft.com, Boqun Feng
In-Reply-To: <BYAPR21MB16889DB462CEA394895BEBDBD70BA@BYAPR21MB1688.namprd21.prod.outlook.com>

On Wed, Aug 02, 2023 at 05:47:55PM +0000, Michael Kelley (LINUX) wrote:
> From: Mitchell Levy via B4 Relay <devnull+levymitchell0.gmail.com@kernel.org> Sent: Tuesday, July 25, 2023 5:24 PM
> >
> > This patch is intended as a proof-of-concept for the new SBRM
> > machinery[1]. For some brief background, the idea behind SBRM is using
> > the __cleanup__ attribute to automatically unlock locks (or otherwise
> > release resources) when they go out of scope, similar to C++ style RAII.
> > This promises some benefits such as making code simpler (particularly
> > where you have lots of goto fail; type constructs) as well as reducing
> > the surface area for certain kinds of bugs.
> > 
> > The changes in this patch should not result in any difference in how the
> > code actually runs (i.e., it's purely an exercise in this new syntax
> > sugar). In one instance SBRM was not appropriate, so I left that part
> > alone, but all other locking/unlocking is handled automatically in this
> > patch.
> > 
> > Link: https://lore.kernel.org/all/20230626125726.GU4253@hirez.programming.kicks-ass.net/ [1]
> 
> I haven't previously seen the "[1]" footnote-style identifier used with the
> Link: tag.  Usually the "[1]" goes at the beginning of the line with the
> additional information, but that conflicts with the Link: tag.  Maybe I'm
> wrong, but you might either omit the footnote-style identifier, or the Link:
> tag, instead of trying to use them together.
> 
> Separately, have you built a kernel for ARM64 with these changes in
> place?  The Hyper-V balloon driver is used on both x86 and ARM64
> architectures.  There's nothing obviously architecture specific here,
> but given that SBRM is new, it might be wise to verify that all is good
> when building and running on ARM64.

The only issue that has popped up so far is that __cleanup__ and
asm-goto don't interact nicely. GCC will silently mis-compile but clang
will issue a compile error/warning.

Specifically, GCC seems to have implemented asm-goto like the 'labels as
values' extention and loose the source context of the edge or something.
With result that the actual goto does not pass through the __cleanup__.

Other than that, it seems to work as expected across the platforms.

A brief look through the patch didn't show me anything odd, should be
ok. Although my primary purpose was to get rid of 'unlock' labels and
error handling, simple usage like this is perfectly fine too.

^ permalink raw reply

* Re: [PATCH bpf-next 1/3] eth: add missing xdp.h includes in drivers
From: Gerhard Engleder @ 2023-08-02 18:06 UTC (permalink / raw)
  To: Jakub Kicinski, ast
  Cc: netdev, bpf, hawk, amritha.nambiar, aleksander.lobakin,
	j.vosburgh, andy, shayagr, akiyano, ioana.ciornei, claudiu.manoil,
	vladimir.oltean, wei.fang, shenwei.wang, xiaoning.wang, linux-imx,
	dmichail, jeroendb, pkaligineedi, shailend, jesse.brandeburg,
	anthony.l.nguyen, horatiu.vultur, UNGLinuxDriver, kys, haiyangz,
	wei.liu, decui, peppe.cavallaro, alexandre.torgue, joabreu,
	mcoquelin.stm32, grygorii.strashko, longli, sharmaajay, daniel,
	john.fastabend, simon.horman, leon, linux-hyperv
In-Reply-To: <20230802003246.2153774-2-kuba@kernel.org>

On 02.08.23 02:32, Jakub Kicinski wrote:
> Handful of drivers currently expect to get xdp.h by virtue
> of including netdevice.h. This will soon no longer be the case
> so add explicit includes.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

(...)

> diff --git a/drivers/net/ethernet/engleder/tsnep.h b/drivers/net/ethernet/engleder/tsnep.h
> index 11b29f56aaf9..6e14c918e3fb 100644
> --- a/drivers/net/ethernet/engleder/tsnep.h
> +++ b/drivers/net/ethernet/engleder/tsnep.h
> @@ -14,6 +14,7 @@
>   #include <linux/net_tstamp.h>
>   #include <linux/ptp_clock_kernel.h>
>   #include <linux/miscdevice.h>
> +#include <net/xdp.h>

Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>

^ permalink raw reply

* Re: [PATCH RFC net-next v5 07/14] virtio/vsock: add common datagram send path
From: Bobby Eshleman @ 2023-08-02 20:08 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers, Dan Carpenter, Simon Horman, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv, bpf
In-Reply-To: <dbf36361-8b94-e2e3-8478-c643bab54e43@gmail.com>

On Thu, Jul 27, 2023 at 10:57:05AM +0300, Arseniy Krasnov wrote:
> 
> 
> On 26.07.2023 20:08, Bobby Eshleman wrote:
> > On Sat, Jul 22, 2023 at 11:16:05AM +0300, Arseniy Krasnov wrote:
> >>
> >>
> >> On 19.07.2023 03:50, Bobby Eshleman wrote:
> >>> This commit implements the common function
> >>> virtio_transport_dgram_enqueue for enqueueing datagrams. It does not add
> >>> usage in either vhost or virtio yet.
> >>>
> >>> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> >>> ---
> >>>  net/vmw_vsock/virtio_transport_common.c | 76 ++++++++++++++++++++++++++++++++-
> >>>  1 file changed, 75 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> >>> index ffcbdd77feaa..3bfaff758433 100644
> >>> --- a/net/vmw_vsock/virtio_transport_common.c
> >>> +++ b/net/vmw_vsock/virtio_transport_common.c
> >>> @@ -819,7 +819,81 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
> >>>  			       struct msghdr *msg,
> >>>  			       size_t dgram_len)
> >>>  {
> >>> -	return -EOPNOTSUPP;
> >>> +	/* Here we are only using the info struct to retain style uniformity
> >>> +	 * and to ease future refactoring and merging.
> >>> +	 */
> >>> +	struct virtio_vsock_pkt_info info_stack = {
> >>> +		.op = VIRTIO_VSOCK_OP_RW,
> >>> +		.msg = msg,
> >>> +		.vsk = vsk,
> >>> +		.type = VIRTIO_VSOCK_TYPE_DGRAM,
> >>> +	};
> >>> +	const struct virtio_transport *t_ops;
> >>> +	struct virtio_vsock_pkt_info *info;
> >>> +	struct sock *sk = sk_vsock(vsk);
> >>> +	struct virtio_vsock_hdr *hdr;
> >>> +	u32 src_cid, src_port;
> >>> +	struct sk_buff *skb;
> >>> +	void *payload;
> >>> +	int noblock;
> >>> +	int err;
> >>> +
> >>> +	info = &info_stack;
> >>
> >> I think 'info' assignment could be moved below, to the place where it is used
> >> first time.
> >>
> >>> +
> >>> +	if (dgram_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> >>> +		return -EMSGSIZE;
> >>> +
> >>> +	t_ops = virtio_transport_get_ops(vsk);
> >>> +	if (unlikely(!t_ops))
> >>> +		return -EFAULT;
> >>> +
> >>> +	/* Unlike some of our other sending functions, this function is not
> >>> +	 * intended for use without a msghdr.
> >>> +	 */
> >>> +	if (WARN_ONCE(!msg, "vsock dgram bug: no msghdr found for dgram enqueue\n"))
> >>> +		return -EFAULT;
> >>
> >> Sorry, but is that possible? I thought 'msg' is always provided by general socket layer (e.g. before
> >> af_vsock.c code) and can't be NULL for DGRAM. Please correct me if i'm wrong.
> >>
> >> Also I see, that in af_vsock.c , 'vsock_dgram_sendmsg()' dereferences 'msg' for checking MSG_OOB without any
> >> checks (before calling transport callback - this function in case of virtio). So I think if we want to keep
> >> this type of check - such check must be placed in af_vsock.c or somewhere before first dereference of this pointer.
> >>
> > 
> > There is some talk about dgram sockets adding additional messages types
> > in the future that help with congestion control. Those messages won't
> > come from the socket layer, so msghdr will be null. Since there is no
> > other function for sending datagrams, it seemed likely that this
> > function would be reworked for that purpose. I felt that adding this
> > check was a direct way to make it explicit that this function is
> > currently designed only for the socket-layer caller.
> > 
> > Perhaps a comment would suffice?
> 
> I see, thanks, it is for future usage. Sorry for dumb question: but if msg is NULL, how
> we will decide what to do in this call? Interface of this callback will be updated or
> some fields of 'vsock_sock' will contain type of such messages ?
> 
> Thanks, Arseniy
> 

Hey Arseniy, sorry about the delay I forgot about this chunk of the
thread.

This warning was intended to help by calling attention to the fact that
even though this function is the only way to send dgram packets, unlike
the connectible sending function virtio_transport_send_pkt_info() this
actually requires a non-NULL msg... it seems like it doesn't help and
just causes more confusion than anything. It is a wasted cycle on the
fastpath too, so I think I'll just drop it in the next rev.

> > 
> >>> +
> >>> +	noblock = msg->msg_flags & MSG_DONTWAIT;
> >>> +
> >>> +	/* Use sock_alloc_send_skb to throttle by sk_sndbuf. This helps avoid
> >>> +	 * triggering the OOM.
> >>> +	 */
> >>> +	skb = sock_alloc_send_skb(sk, dgram_len + VIRTIO_VSOCK_SKB_HEADROOM,
> >>> +				  noblock, &err);
> >>> +	if (!skb)
> >>> +		return err;
> >>> +
> >>> +	skb_reserve(skb, VIRTIO_VSOCK_SKB_HEADROOM);
> >>> +
> >>> +	src_cid = t_ops->transport.get_local_cid();
> >>> +	src_port = vsk->local_addr.svm_port;
> >>> +
> >>> +	hdr = virtio_vsock_hdr(skb);
> >>> +	hdr->type	= cpu_to_le16(info->type);
> >>> +	hdr->op		= cpu_to_le16(info->op);
> >>> +	hdr->src_cid	= cpu_to_le64(src_cid);
> >>> +	hdr->dst_cid	= cpu_to_le64(remote_addr->svm_cid);
> >>> +	hdr->src_port	= cpu_to_le32(src_port);
> >>> +	hdr->dst_port	= cpu_to_le32(remote_addr->svm_port);
> >>> +	hdr->flags	= cpu_to_le32(info->flags);
> >>> +	hdr->len	= cpu_to_le32(dgram_len);
> >>> +
> >>> +	skb_set_owner_w(skb, sk);
> >>> +
> >>> +	payload = skb_put(skb, dgram_len);
> >>> +	err = memcpy_from_msg(payload, msg, dgram_len);
> >>> +	if (err)
> >>> +		return err;
> >>
> >> Do we need free allocated skb here ?
> >>
> > 
> > Yep, thanks.
> > 
> >>> +
> >>> +	trace_virtio_transport_alloc_pkt(src_cid, src_port,
> >>> +					 remote_addr->svm_cid,
> >>> +					 remote_addr->svm_port,
> >>> +					 dgram_len,
> >>> +					 info->type,
> >>> +					 info->op,
> >>> +					 0);
> >>> +
> >>> +	return t_ops->send_pkt(skb);
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
> >>>  
> >>>
> >>
> >> Thanks, Arseniy
> > 
> > Thanks for the review!
> > 
> > Best,
> > Bobby

^ permalink raw reply

* RE: [PATCH v3 1/3] uio: Add hv_vmbus_client driver
From: Michael Kelley (LINUX) @ 2023-08-02 21:43 UTC (permalink / raw)
  To: Saurabh Sengar, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, gregkh@linuxfoundation.org, corbet@lwn.net,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-doc@vger.kernel.org
In-Reply-To: <1689330346-5374-2-git-send-email-ssengar@linux.microsoft.com>

From: Saurabh Sengar <ssengar@linux.microsoft.com> Sent: Friday, July 14, 2023 3:26 AM
> 
> Add a new UIO-based driver that generically supports low speed Hyper-V
> VMBus devices. This driver can be bound to VMBus devices by user space
> drivers that provide device-specific management.
> 
> The new driver provides the following core functionality, which is
> suitable for low speed devices:
> * A single VMBus channel for each device
> * Ability to specify the VMBus channel ring buffer size for each device
> * Host notification via a hypercall instead of monitor bits
> 
> Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
> ---
> [V3]
> - Removed ringbuffer sysfs entry and used uio framework for mmap
> - Remove ".id_table = NULL"
> - kasprintf -> devm_kasprintf
> - Change global variable ring_size to per device
> - More checks on value which can be set for ring_size
> - Remove driverctl, and used echo command instead for driver documentation
> - Remove unnecessary one time use macros
> - Change kernel version and date for sysfs documentation
> - Update documentation
> - Better commit message
> 
> [V2]
> - Update driver info in Documentation/driver-api/uio-howto.rst
> - Update ring_size sysfs info in Documentation/ABI/stable/sysfs-bus-vmbus
> - Remove DRIVER_VERSION
> - Remove refcnt
> - scnprintf -> sysfs_emit
> - sysfs_create_file -> ATTRIBUTE_GROUPS + ".driver.groups";
> - sysfs_create_bin_file -> device_create_bin_file
> - dev_notice -> dev_err
> - remove MODULE_VERSION
> 
>  Documentation/ABI/stable/sysfs-bus-vmbus |  10 ++
>  Documentation/driver-api/uio-howto.rst   |  54 ++++++
>  drivers/uio/Kconfig                      |  12 ++
>  drivers/uio/Makefile                     |   1 +
>  drivers/uio/uio_hv_vmbus_client.c        | 218 +++++++++++++++++++++++
>  5 files changed, 295 insertions(+)
>  create mode 100644 drivers/uio/uio_hv_vmbus_client.c
> 

Reviewed-by: Michael Kelley <mikelley@microsoft.com>

^ permalink raw reply

* RE: [PATCH v3 3/3] tools: hv: Add new fcopy application based on uio driver
From: Michael Kelley (LINUX) @ 2023-08-02 21:45 UTC (permalink / raw)
  To: Saurabh Sengar, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, gregkh@linuxfoundation.org, corbet@lwn.net,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-doc@vger.kernel.org
In-Reply-To: <1689330346-5374-4-git-send-email-ssengar@linux.microsoft.com>

From: Saurabh Sengar <ssengar@linux.microsoft.com> Sent: Friday, July 14, 2023 3:26 AM
> 
> Implement the file copy service for Linux guests on Hyper-V. This
> permits the host to copy a file (over VMBus) into the guest. This
> facility is part of "guest integration services" supported on the
> Hyper-V platform.
> 
> Here is a link that provides additional details on this functionality:
> 
> https://learn.microsoft.com/en-us/powershell/module/hyper-v/copy-vmfile?view=windowsserver2022-ps
> 
> This new fcopy application uses uio_hv_vmbus_client driver which
> makes the earlier hv_util based driver and application obsolete.
> 
> Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
> ---
> [V3]
> - Improve cover letter and commit messages
> - Improve debug prints
> - Instead of hardcoded instance id, query from class id sysfs
> - Set the ring_size value from application
> - Update the application to mmap /dev/uio instead of sysfs
> - new application compilation dependent on x86
> 
> [V2]
> - simpler sysfs path
> 
>  tools/hv/Build                 |   1 +
>  tools/hv/Makefile              |  10 +-
>  tools/hv/hv_fcopy_uio_daemon.c | 578 +++++++++++++++++++++++++++++++++
>  3 files changed, 588 insertions(+), 1 deletion(-)
>  create mode 100644 tools/hv/hv_fcopy_uio_daemon.c
> 
> diff --git a/tools/hv/Build b/tools/hv/Build
> index 2a667d3d94cb..efcbb74a0d23 100644
> --- a/tools/hv/Build
> +++ b/tools/hv/Build
> @@ -2,3 +2,4 @@ hv_kvp_daemon-y += hv_kvp_daemon.o
>  hv_vss_daemon-y += hv_vss_daemon.o
>  hv_fcopy_daemon-y += hv_fcopy_daemon.o
>  vmbus_bufring-y += vmbus_bufring.o
> +hv_fcopy_uio_daemon-y += hv_fcopy_uio_daemon.o
> diff --git a/tools/hv/Makefile b/tools/hv/Makefile
> index 33cf488fd20f..678c6c450a53 100644
> --- a/tools/hv/Makefile
> +++ b/tools/hv/Makefile
> @@ -21,8 +21,10 @@ override CFLAGS += -O2 -Wall -g -D_GNU_SOURCE -
> I$(OUTPUT)include
> 
>  ifeq ($(SRCARCH),x86)
>  ALL_LIBS := libvmbus_bufring.a
> -endif
> +ALL_TARGETS := hv_kvp_daemon hv_vss_daemon hv_fcopy_daemon
> hv_fcopy_uio_daemon
> +else
>  ALL_TARGETS := hv_kvp_daemon hv_vss_daemon hv_fcopy_daemon
> +endif
>  ALL_PROGRAMS := $(patsubst %,$(OUTPUT)%,$(ALL_TARGETS)) $(patsubst
> %,$(OUTPUT)%,$(ALL_LIBS))
> 
>  ALL_SCRIPTS := hv_get_dhcp_info.sh hv_get_dns_info.sh hv_set_ifconfig.sh
> @@ -56,6 +58,12 @@ $(HV_FCOPY_DAEMON_IN): FORCE
>  $(OUTPUT)hv_fcopy_daemon: $(HV_FCOPY_DAEMON_IN)
>  	$(QUIET_LINK)$(CC) $(CFLAGS) $(LDFLAGS) $< -o $@
> 
> +HV_FCOPY_UIO_DAEMON_IN := $(OUTPUT)hv_fcopy_uio_daemon-in.o
> +$(HV_FCOPY_UIO_DAEMON_IN): FORCE
> +	$(Q)$(MAKE) $(build)=hv_fcopy_uio_daemon
> +$(OUTPUT)hv_fcopy_uio_daemon: $(HV_FCOPY_UIO_DAEMON_IN)
> libvmbus_bufring.a
> +	$(QUIET_LINK)$(CC) -lm $< -L. -lvmbus_bufring -o $@
> +
>  clean:
>  	rm -f $(ALL_PROGRAMS)
>  	find $(or $(OUTPUT),.) -name '*.o' -delete -o -name '\.*.d' -delete
> diff --git a/tools/hv/hv_fcopy_uio_daemon.c b/tools/hv/hv_fcopy_uio_daemon.c
> new file mode 100644
> index 000000000000..e8618a30dc7e
> --- /dev/null
> +++ b/tools/hv/hv_fcopy_uio_daemon.c
> @@ -0,0 +1,578 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * An implementation of host to guest copy functionality for Linux.
> + *
> + * Copyright (C) 2023, Microsoft, Inc.
> + *
> + * Author : K. Y. Srinivasan <kys@microsoft.com>
> + * Author : Saurabh Sengar <ssengar@microsoft.com>
> + *
> + */
> +
> +#include <dirent.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <getopt.h>
> +#include <locale.h>
> +#include <stdbool.h>
> +#include <stddef.h>
> +#include <stdint.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <syslog.h>
> +#include <unistd.h>
> +#include <sys/mman.h>
> +#include <sys/stat.h>
> +#include <linux/hyperv.h>
> +#include "vmbus_bufring.h"
> +
> +#define ICMSGTYPE_NEGOTIATE	0
> +#define ICMSGTYPE_FCOPY		7
> +
> +#define WIN8_SRV_MAJOR		1
> +#define WIN8_SRV_MINOR		1
> +#define WIN8_SRV_VERSION	(WIN8_SRV_MAJOR << 16 | WIN8_SRV_MINOR)
> +
> +#define MAX_PATH_LEN		300
> +#define MAX_LINE_LEN		40
> +#define DEVICES_SYSFS		"/sys/bus/vmbus/devices"
> +#define FCOPY_CLASS_ID		"34d14be3-dee4-41c8-9ae7-6b174977c192"
> +
> +#define FCOPY_VER_COUNT		1
> +static const int fcopy_versions[] = {
> +	WIN8_SRV_VERSION
> +};
> +
> +#define FW_VER_COUNT		1
> +static const int fw_versions[] = {
> +	UTIL_FW_VERSION
> +};
> +
> +#define HV_RING_SIZE		(4 * 4096)
> +
> +unsigned char desc[HV_RING_SIZE];
> +
> +static int target_fd;
> +static char target_fname[PATH_MAX];
> +static unsigned long long filesize;
> +
> +static int hv_fcopy_create_file(char *file_name, char *path_name, __u32 flags)
> +{
> +	int error = HV_E_FAIL;
> +	char *q, *p;
> +
> +	filesize = 0;
> +	p = (char *)path_name;
> +	snprintf(target_fname, sizeof(target_fname), "%s/%s",
> +		 (char *)path_name, (char *)file_name);
> +
> +	/*
> +	 * Check to see if the path is already in place; if not,
> +	 * create if required.
> +	 */
> +	while ((q = strchr(p, '/')) != NULL) {
> +		if (q == p) {
> +			p++;
> +			continue;
> +		}
> +		*q = '\0';
> +		if (access(path_name, F_OK)) {
> +			if (flags & CREATE_PATH) {
> +				if (mkdir(path_name, 0755)) {
> +					syslog(LOG_ERR, "Failed to create %s",
> +					       path_name);
> +					goto done;
> +				}
> +			} else {
> +				syslog(LOG_ERR, "Invalid path: %s", path_name);
> +				goto done;
> +			}
> +		}
> +		p = q + 1;
> +		*q = '/';
> +	}
> +
> +	if (!access(target_fname, F_OK)) {
> +		syslog(LOG_INFO, "File: %s exists", target_fname);
> +		if (!(flags & OVER_WRITE)) {
> +			error = HV_ERROR_ALREADY_EXISTS;
> +			goto done;
> +		}
> +	}
> +
> +	target_fd = open(target_fname,
> +			 O_RDWR | O_CREAT | O_TRUNC | O_CLOEXEC, 0744);
> +	if (target_fd == -1) {
> +		syslog(LOG_INFO, "Open Failed: %s", strerror(errno));
> +		goto done;
> +	}
> +
> +	error = 0;
> +done:
> +	if (error)
> +		target_fname[0] = '\0';
> +	return error;
> +}
> +
> +static int hv_copy_data(struct hv_do_fcopy *cpmsg)
> +{
> +	ssize_t bytes_written;
> +	int ret = 0;
> +
> +	bytes_written = pwrite(target_fd, cpmsg->data, cpmsg->size,
> +			       cpmsg->offset);
> +
> +	filesize += cpmsg->size;
> +	if (bytes_written != cpmsg->size) {
> +		switch (errno) {
> +		case ENOSPC:
> +			ret = HV_ERROR_DISK_FULL;
> +			break;
> +		default:
> +			ret = HV_E_FAIL;
> +			break;
> +		}
> +		syslog(LOG_ERR, "pwrite failed to write %llu bytes: %ld (%s)",
> +		       filesize, (long)bytes_written, strerror(errno));
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * Reset target_fname to "" in the two below functions for hibernation: if
> + * the fcopy operation is aborted by hibernation, the daemon should remove the
> + * partially-copied file; to achieve this, the hv_utils driver always fakes a
> + * CANCEL_FCOPY message upon suspend, and later when the VM resumes back,
> + * the daemon calls hv_copy_cancel() to remove the file; if a file is copied
> + * successfully before suspend, hv_copy_finished() must reset target_fname to
> + * avoid that the file can be incorrectly removed upon resume, since the faked
> + * CANCEL_FCOPY message is spurious in this case.
> + */
> +static int hv_copy_finished(void)
> +{
> +	close(target_fd);
> +	target_fname[0] = '\0';
> +	return 0;
> +}
> +
> +static void print_usage(char *argv[])
> +{
> +	fprintf(stderr, "Usage: %s [options]\n"
> +		"Options are:\n"
> +		"  -n, --no-daemon        stay in foreground, don't daemonize\n"
> +		"  -h, --help             print this help\n", argv[0]);
> +}
> +
> +static bool vmbus_prep_negotiate_resp(struct icmsg_hdr *icmsghdrp, unsigned char *buf,
> +				      unsigned int buflen, const int *fw_version, int fw_vercnt,
> +				const int *srv_version, int srv_vercnt,
> +				int *nego_fw_version, int *nego_srv_version)
> +{
> +	int icframe_major, icframe_minor;
> +	int icmsg_major, icmsg_minor;
> +	int fw_major, fw_minor;
> +	int srv_major, srv_minor;
> +	int i, j;
> +	bool found_match = false;
> +	struct icmsg_negotiate *negop;
> +
> +	/* Check that there's enough space for icframe_vercnt, icmsg_vercnt */
> +	if (buflen < ICMSG_HDR + offsetof(struct icmsg_negotiate, reserved)) {
> +		syslog(LOG_ERR, "Invalid icmsg negotiate");
> +		return false;
> +	}
> +
> +	icmsghdrp->icmsgsize = 0x10;
> +	negop = (struct icmsg_negotiate *)&buf[ICMSG_HDR];
> +
> +	icframe_major = negop->icframe_vercnt;
> +	icframe_minor = 0;
> +
> +	icmsg_major = negop->icmsg_vercnt;
> +	icmsg_minor = 0;
> +
> +	/* Validate negop packet */
> +	if (icframe_major > IC_VERSION_NEGOTIATION_MAX_VER_COUNT ||
> +	    icmsg_major > IC_VERSION_NEGOTIATION_MAX_VER_COUNT ||
> +	    ICMSG_NEGOTIATE_PKT_SIZE(icframe_major, icmsg_major) > buflen) {
> +		syslog(LOG_ERR, "Invalid icmsg negotiate - icframe_major: %u, icmsg_major: %u\n",
> +		       icframe_major, icmsg_major);
> +		goto fw_error;
> +	}
> +
> +	/*
> +	 * Select the framework version number we will
> +	 * support.
> +	 */
> +
> +	for (i = 0; i < fw_vercnt; i++) {
> +		fw_major = (fw_version[i] >> 16);
> +		fw_minor = (fw_version[i] & 0xFFFF);
> +
> +		for (j = 0; j < negop->icframe_vercnt; j++) {
> +			if (negop->icversion_data[j].major == fw_major &&
> +			    negop->icversion_data[j].minor == fw_minor) {
> +				icframe_major = negop->icversion_data[j].major;
> +				icframe_minor = negop->icversion_data[j].minor;
> +				found_match = true;
> +				break;
> +			}
> +		}
> +
> +		if (found_match)
> +			break;
> +	}
> +
> +	if (!found_match)
> +		goto fw_error;
> +
> +	found_match = false;
> +
> +	for (i = 0; i < srv_vercnt; i++) {
> +		srv_major = (srv_version[i] >> 16);
> +		srv_minor = (srv_version[i] & 0xFFFF);
> +
> +		for (j = negop->icframe_vercnt;
> +			(j < negop->icframe_vercnt + negop->icmsg_vercnt);
> +			j++) {
> +			if (negop->icversion_data[j].major == srv_major &&
> +			    negop->icversion_data[j].minor == srv_minor) {
> +				icmsg_major = negop->icversion_data[j].major;
> +				icmsg_minor = negop->icversion_data[j].minor;
> +				found_match = true;
> +				break;
> +			}
> +		}
> +
> +		if (found_match)
> +			break;
> +	}
> +
> +	/*
> +	 * Respond with the framework and service
> +	 * version numbers we can support.
> +	 */
> +fw_error:
> +	if (!found_match) {
> +		negop->icframe_vercnt = 0;
> +		negop->icmsg_vercnt = 0;
> +	} else {
> +		negop->icframe_vercnt = 1;
> +		negop->icmsg_vercnt = 1;
> +	}
> +
> +	if (nego_fw_version)
> +		*nego_fw_version = (icframe_major << 16) | icframe_minor;
> +
> +	if (nego_srv_version)
> +		*nego_srv_version = (icmsg_major << 16) | icmsg_minor;
> +
> +	negop->icversion_data[0].major = icframe_major;
> +	negop->icversion_data[0].minor = icframe_minor;
> +	negop->icversion_data[1].major = icmsg_major;
> +	negop->icversion_data[1].minor = icmsg_minor;
> +
> +	return found_match;
> +}
> +
> +static void wcstoutf8(char *dest, const __u16 *src, size_t dest_size)
> +{
> +	size_t len = 0;
> +
> +	while (len < dest_size) {
> +		if (src[len] < 0x80)
> +			dest[len++] = (char)(*src++);
> +		else
> +			dest[len++] = 'X';
> +	}
> +
> +	dest[len] = '\0';
> +}
> +
> +static int hv_fcopy_start(struct hv_start_fcopy *smsg_in)
> +{
> +	setlocale(LC_ALL, "en_US.utf8");
> +	size_t file_size, path_size;
> +	char *file_name, *path_name;
> +	char *in_file_name = (char *)smsg_in->file_name;
> +	char *in_path_name = (char *)smsg_in->path_name;
> +
> +	file_size = wcstombs(NULL, (const wchar_t *restrict)in_file_name, 0) + 1;
> +	path_size = wcstombs(NULL, (const wchar_t *restrict)in_path_name, 0) + 1;
> +
> +	file_name = (char *)malloc(file_size * sizeof(char));
> +	path_name = (char *)malloc(path_size * sizeof(char));
> +
> +	wcstoutf8(file_name, (__u16 *)in_file_name, file_size);
> +	wcstoutf8(path_name, (__u16 *)in_path_name, path_size);
> +
> +	return hv_fcopy_create_file(file_name, path_name, smsg_in->copy_flags);
> +}
> +
> +static int hv_fcopy_send_data(struct hv_fcopy_hdr *fcopy_msg, int recvlen)
> +{
> +	int operation = fcopy_msg->operation;
> +
> +	/*
> +	 * The  strings sent from the host are encoded in
> +	 * utf16; convert it to utf8 strings.
> +	 * The host assures us that the utf16 strings will not exceed
> +	 * the max lengths specified. We will however, reserve room
> +	 * for the string terminating character - in the utf16s_utf8s()
> +	 * function we limit the size of the buffer where the converted
> +	 * string is placed to W_MAX_PATH -1 to guarantee
> +	 * that the strings can be properly terminated!
> +	 */
> +
> +	switch (operation) {
> +	case START_FILE_COPY:
> +		return hv_fcopy_start((struct hv_start_fcopy *)fcopy_msg);
> +	case WRITE_TO_FILE:
> +		return hv_copy_data((struct hv_do_fcopy *)fcopy_msg);
> +	case COMPLETE_FCOPY:
> +		return hv_copy_finished();
> +	}
> +
> +	return HV_E_FAIL;
> +}
> +
> +/* process the packet recv from host */
> +static int fcopy_pkt_process(struct vmbus_br *txbr)
> +{
> +	int ret, offset, pktlen;
> +	int fcopy_srv_version;
> +	const struct vmbus_chanpkt_hdr *pkt;
> +	struct hv_fcopy_hdr *fcopy_msg;
> +	struct icmsg_hdr *icmsghdr;
> +
> +	pkt = (const struct vmbus_chanpkt_hdr *)desc;
> +	offset = pkt->hlen << 3;
> +	pktlen = (pkt->tlen << 3) - offset;
> +	icmsghdr = (struct icmsg_hdr *)&desc[offset + sizeof(struct vmbuspipe_hdr)];
> +	icmsghdr->status = HV_E_FAIL;
> +
> +	if (icmsghdr->icmsgtype == ICMSGTYPE_NEGOTIATE) {
> +		if (vmbus_prep_negotiate_resp(icmsghdr, desc + offset, pktlen, fw_versions,
> +					      FW_VER_COUNT, fcopy_versions, FCOPY_VER_COUNT,
> +					      NULL, &fcopy_srv_version)) {
> +			syslog(LOG_INFO, "FCopy IC version %d.%d",
> +			       fcopy_srv_version >> 16, fcopy_srv_version & 0xFFFF);
> +			icmsghdr->status = 0;
> +		}
> +	} else if (icmsghdr->icmsgtype == ICMSGTYPE_FCOPY) {
> +		/* Ensure recvlen is big enough to contain hv_fcopy_hdr */
> +		if (pktlen < ICMSG_HDR + sizeof(struct hv_fcopy_hdr)) {
> +			syslog(LOG_ERR, "Invalid Fcopy hdr. Packet length too small: %u",
> +			       pktlen);
> +			return -ENOBUFS;
> +		}
> +
> +		fcopy_msg = (struct hv_fcopy_hdr *)&desc[offset + ICMSG_HDR];
> +		icmsghdr->status = hv_fcopy_send_data(fcopy_msg, pktlen);
> +	}
> +
> +	icmsghdr->icflags = ICMSGHDRFLAG_TRANSACTION | ICMSGHDRFLAG_RESPONSE;
> +	ret = rte_vmbus_chan_send(txbr, 0x6, desc + offset, pktlen, 0);
> +	if (ret) {
> +		syslog(LOG_ERR, "Write to ringbuffer failed err: %d", ret);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static void fcopy_get_first_folder(char *path, char *chan_no)
> +{
> +	DIR *dir = opendir(path);
> +	struct dirent *entry;
> +
> +	if (!dir) {
> +		syslog(LOG_ERR, "Failed to open directory (errno=%s).\n", strerror(errno));
> +		return;
> +	}
> +
> +	while ((entry = readdir(dir)) != NULL) {
> +		if (entry->d_type == DT_DIR && strcmp(entry->d_name, ".") != 0 &&
> +		    strcmp(entry->d_name, "..") != 0) {
> +			strcpy(chan_no, entry->d_name);
> +			break;
> +		}
> +	}
> +
> +	closedir(dir);
> +}
> +
> +static void fcopy_set_ring_size(char *path, char *inst, int size)
> +{
> +	char ring_size_path[MAX_PATH_LEN] = {0};
> +	FILE *fd;
> +
> +	snprintf(ring_size_path, sizeof(ring_size_path), "%s/%s/%s", path, inst, "ring_size");
> +	fd = fopen(ring_size_path, "w");
> +	if (!fd) {
> +		syslog(LOG_WARNING, "Failed to open ring_size file (errno=%s).\n", strerror(errno));
> +		return;
> +	}
> +	fprintf(fd, "%d", size);

Check for and log an error if the new value isn't accepted by the kernel driver?
The code is using a ring size value that should be accepted by the kernel driver,
but weird stuff happens and it's probably better to know about it.

> +	fclose(fd);
> +}
> +
> +static char *fcopy_read_sysfs(char *path, char *buf, int len)
> +{
> +	FILE *fd;
> +	char *ret;
> +
> +	fd = fopen(path, "r");
> +	if (!fd)
> +		return NULL;
> +
> +	ret = fgets(buf, len, fd);
> +	fclose(fd);
> +
> +	return ret;
> +}
> +
> +static int fcopy_get_instance_id(char *path, char *class_id, char *inst)
> +{
> +	DIR *dir = opendir(path);
> +	struct dirent *entry;
> +	char tmp_path[MAX_PATH_LEN] = {0};
> +	char line[MAX_LINE_LEN];
> +
> +	if (!dir) {
> +		syslog(LOG_ERR, "Failed to open directory (errno=%s).\n", strerror(errno));
> +		return -EINVAL;
> +	}
> +
> +	while ((entry = readdir(dir)) != NULL) {
> +		if (entry->d_type == DT_LNK && strcmp(entry->d_name, ".") != 0 &&
> +		    strcmp(entry->d_name, "..") != 0) {
> +			/* search for the sysfs path with matching class_id */
> +			snprintf(tmp_path, sizeof(tmp_path), "%s/%s/%s",
> +				 path, entry->d_name, "class_id");
> +			if (!fcopy_read_sysfs(tmp_path, line, MAX_LINE_LEN))
> +				continue;
> +
> +			/* class id matches, now fetch the instance id from device_id */
> +			if (strstr(line, class_id)) {
> +				snprintf(tmp_path, sizeof(tmp_path), "%s/%s/%s",
> +					 path, entry->d_name, "device_id");
> +				if (!fcopy_read_sysfs(tmp_path, line, MAX_LINE_LEN))
> +					continue;
> +				/* remove braces */
> +				strncpy(inst, line + 1, strlen(line) - 3);
> +				break;
> +			}
> +		}
> +	}
> +
> +	closedir(dir);
> +	return 0;

If this function doesn't find a matching class_id, it appears that it
returns 0, but with the "inst" parameter unset.  The caller will then
proceed as if "inst" was set when it is actually an uninitialized stack
variable.  Probably need some better error detection and handling.

> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	int fcopy_fd = -1, tmp = 1;
> +	int daemonize = 1, long_index = 0, opt, ret = -EINVAL;
> +	struct vmbus_br txbr, rxbr;
> +	void *ring;
> +	uint32_t len = HV_RING_SIZE;
> +	char uio_name[10] = {0};
> +	char uio_dev_path[15] = {0};
> +	char uio_path[MAX_PATH_LEN] = {0};
> +	char inst[MAX_LINE_LEN] = {0};
> +
> +	static struct option long_options[] = {
> +		{"help",	no_argument,	   0,  'h' },
> +		{"no-daemon",	no_argument,	   0,  'n' },
> +		{0,		0,		   0,  0   }
> +	};
> +
> +	while ((opt = getopt_long(argc, argv, "hn", long_options,
> +				  &long_index)) != -1) {
> +		switch (opt) {
> +		case 'n':
> +			daemonize = 0;
> +			break;
> +		case 'h':
> +		default:
> +			print_usage(argv);
> +			exit(EXIT_FAILURE);
> +		}
> +	}
> +
> +	if (daemonize && daemon(1, 0)) {
> +		syslog(LOG_ERR, "daemon() failed; error: %s", strerror(errno));
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	openlog("HV_UIO_FCOPY", 0, LOG_USER);
> +	syslog(LOG_INFO, "starting; pid is:%d", getpid());
> +
> +	/* get instance id */
> +	if (fcopy_get_instance_id(DEVICES_SYSFS, FCOPY_CLASS_ID, inst))
> +		exit(EXIT_FAILURE);

Per above, need better error handling.  And since the syslog is now open,
any errors should be logged rather than having the process just
mysteriously exit.

> +
> +	/* set ring_size value */
> +	fcopy_set_ring_size(DEVICES_SYSFS, inst, HV_RING_SIZE);
> +
> +	/* get /dev/uioX dev path and open it */
> +	snprintf(uio_path, sizeof(uio_path), "%s/%s/%s", DEVICES_SYSFS, inst, "uio");
> +	fcopy_get_first_folder(uio_path, uio_name);
> +	snprintf(uio_dev_path, sizeof(uio_dev_path), "/dev/%s", uio_name);
> +	fcopy_fd = open(uio_dev_path, O_RDWR);
> +
> +	if (fcopy_fd < 0) {
> +		syslog(LOG_ERR, "open %s failed; error: %d %s",
> +		       uio_dev_path, errno, strerror(errno));
> +		syslog(LOG_ERR, "Please make sure module uio_hv_vmbus_client is loaded and" \
> +		       " device is not used by any other application\n");
> +		ret = fcopy_fd;
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	ring = mmap(NULL, 2 * HV_RING_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fcopy_fd, 0);
> +	if (ring == MAP_FAILED) {
> +		ret = errno;
> +		syslog(LOG_ERR, "mmap ringbuffer failed; error: %d %s", ret, strerror(ret));
> +		goto close;
> +	}
> +	vmbus_br_setup(&txbr, ring, HV_RING_SIZE);
> +	vmbus_br_setup(&rxbr, (char *)ring + HV_RING_SIZE, HV_RING_SIZE);
> +
> +	while (1) {
> +		/*
> +		 * In this loop we process fcopy messages after the
> +		 * handshake is complete.
> +		 */
> +		ret = pread(fcopy_fd, &tmp, sizeof(int), 0);
> +		if (ret < 0) {
> +			syslog(LOG_ERR, "pread failed: %s", strerror(errno));
> +			continue;
> +		}
> +
> +		len = HV_RING_SIZE;
> +		ret = rte_vmbus_chan_recv_raw(&rxbr, desc, &len);
> +		if (unlikely(ret <= 0)) {
> +			/* This indicates a failure to communicate (or worse) */
> +			syslog(LOG_ERR, "VMBus channel recv error: %d", ret);
> +		} else {
> +			ret = fcopy_pkt_process(&txbr);
> +			if (ret < 0)
> +				goto close;
> +
> +			/* Signal host */
> +			tmp = 1;
> +			if ((write(fcopy_fd, &tmp, sizeof(int))) != sizeof(int)) {
> +				ret = errno;
> +				syslog(LOG_ERR, "Registration failed: %s\n", strerror(ret));
> +				goto close;
> +			}
> +		}
> +	}
> +close:
> +	close(fcopy_fd);
> +	return ret;
> +}
> --
> 2.34.1


^ permalink raw reply

* RE: [PATCH v3 2/3] tools: hv: Add vmbus_bufring
From: Michael Kelley (LINUX) @ 2023-08-02 21:43 UTC (permalink / raw)
  To: Saurabh Sengar, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, gregkh@linuxfoundation.org, corbet@lwn.net,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-doc@vger.kernel.org
In-Reply-To: <1689330346-5374-3-git-send-email-ssengar@linux.microsoft.com>

From: Saurabh Sengar <ssengar@linux.microsoft.com> Sent: Friday, July 14, 2023 3:26 AM
> 
> Provide a userspace interface for userspace drivers or applications to
> read/write a VMBus ringbuffer. A significant part of this code is
> borrowed from DPDK[1]. Current library is supported exclusively for
> the x86 architecture.
> 
> To build this library:
> make -C tools/hv libvmbus_bufring.a
> 
> Applications using this library can include the vmbus_bufring.h header
> file and libvmbus_bufring.a statically.
> 
> [1] https://github.com/DPDK/dpdk/
> 
> Signed-off-by: Mary Hardy <maryhardy@microsoft.com>
> Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
> ---
> [V3]
> - Made ring buffer data offset depend on page size
> - remove rte_smp_rwmb macro and reused rte_compiler_barrier instead
> - Added legal counsel sign-off
> - Removed "Link:" tag
> - Improve commit messages
> - new library compilation dependent on x86
> - simplify mmap
> 
> [V2]
> - simpler sysfs path, less parsing
> 
>  tools/hv/Build           |   1 +
>  tools/hv/Makefile        |  13 +-
>  tools/hv/vmbus_bufring.c | 297 +++++++++++++++++++++++++++++++++++++++
>  tools/hv/vmbus_bufring.h | 154 ++++++++++++++++++++
>  4 files changed, 464 insertions(+), 1 deletion(-)
>  create mode 100644 tools/hv/vmbus_bufring.c
>  create mode 100644 tools/hv/vmbus_bufring.h
> 
> diff --git a/tools/hv/Build b/tools/hv/Build
> index 6cf51fa4b306..2a667d3d94cb 100644
> --- a/tools/hv/Build
> +++ b/tools/hv/Build
> @@ -1,3 +1,4 @@
>  hv_kvp_daemon-y += hv_kvp_daemon.o
>  hv_vss_daemon-y += hv_vss_daemon.o
>  hv_fcopy_daemon-y += hv_fcopy_daemon.o
> +vmbus_bufring-y += vmbus_bufring.o
> diff --git a/tools/hv/Makefile b/tools/hv/Makefile
> index fe770e679ae8..33cf488fd20f 100644
> --- a/tools/hv/Makefile
> +++ b/tools/hv/Makefile
> @@ -11,14 +11,19 @@ srctree := $(patsubst %/,%,$(dir $(CURDIR)))
>  srctree := $(patsubst %/,%,$(dir $(srctree)))
>  endif
> 
> +include $(srctree)/tools/scripts/Makefile.arch
> +
>  # Do not use make's built-in rules
>  # (this improves performance and avoids hard-to-debug behaviour);
>  MAKEFLAGS += -r
> 
>  override CFLAGS += -O2 -Wall -g -D_GNU_SOURCE -I$(OUTPUT)include
> 
> +ifeq ($(SRCARCH),x86)
> +ALL_LIBS := libvmbus_bufring.a
> +endif
>  ALL_TARGETS := hv_kvp_daemon hv_vss_daemon hv_fcopy_daemon
> -ALL_PROGRAMS := $(patsubst %,$(OUTPUT)%,$(ALL_TARGETS))
> +ALL_PROGRAMS := $(patsubst %,$(OUTPUT)%,$(ALL_TARGETS)) $(patsubst
> %,$(OUTPUT)%,$(ALL_LIBS))
> 
>  ALL_SCRIPTS := hv_get_dhcp_info.sh hv_get_dns_info.sh hv_set_ifconfig.sh
> 
> @@ -27,6 +32,12 @@ all: $(ALL_PROGRAMS)
>  export srctree OUTPUT CC LD CFLAGS
>  include $(srctree)/tools/build/Makefile.include
> 
> +HV_VMBUS_BUFRING_IN := $(OUTPUT)vmbus_bufring.o
> +$(HV_VMBUS_BUFRING_IN): FORCE
> +	$(Q)$(MAKE) $(build)=vmbus_bufring
> +$(OUTPUT)libvmbus_bufring.a : vmbus_bufring.o
> +	$(AR) rcs $@ $^
> +
>  HV_KVP_DAEMON_IN := $(OUTPUT)hv_kvp_daemon-in.o
>  $(HV_KVP_DAEMON_IN): FORCE
>  	$(Q)$(MAKE) $(build)=hv_kvp_daemon
> diff --git a/tools/hv/vmbus_bufring.c b/tools/hv/vmbus_bufring.c
> new file mode 100644
> index 000000000000..fb1f0489c625
> --- /dev/null
> +++ b/tools/hv/vmbus_bufring.c
> @@ -0,0 +1,297 @@
> +// SPDX-License-Identifier: BSD-3-Clause
> +/*
> + * Copyright (c) 2009-2012,2016,2023 Microsoft Corp.
> + * Copyright (c) 2012 NetApp Inc.
> + * Copyright (c) 2012 Citrix Inc.
> + * All rights reserved.
> + */
> +
> +#include <errno.h>
> +#include <emmintrin.h>
> +#include <stdio.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <sys/uio.h>
> +#include "vmbus_bufring.h"
> +
> +#define	rte_compiler_barrier()	({ asm volatile ("" : : : "memory"); })
> +#define RINGDATA_START_OFFSET	(getpagesize())
> +#define VMBUS_RQST_ERROR	0xFFFFFFFFFFFFFFFF
> +#define ALIGN(val, align)	((typeof(val))((val) & (~((typeof(val))((align) - 1)))))
> +
> +/* Increase bufring index by inc with wraparound */
> +static inline uint32_t vmbus_br_idxinc(uint32_t idx, uint32_t inc, uint32_t sz)
> +{
> +	idx += inc;
> +	if (idx >= sz)
> +		idx -= sz;
> +
> +	return idx;
> +}
> +
> +void vmbus_br_setup(struct vmbus_br *br, void *buf, unsigned int blen)
> +{
> +	br->vbr = buf;
> +	br->windex = br->vbr->windex;
> +	br->dsize = blen - RINGDATA_START_OFFSET;
> +}
> +
> +static inline __always_inline void
> +rte_smp_mb(void)
> +{
> +	asm volatile("lock addl $0, -128(%%rsp); " ::: "memory");
> +}
> +
> +static inline int
> +rte_atomic32_cmpset(volatile uint32_t *dst, uint32_t exp, uint32_t src)
> +{
> +	uint8_t res;
> +
> +	asm volatile("lock ; "
> +		     "cmpxchgl %[src], %[dst];"
> +		     "sete %[res];"
> +		     : [res] "=a" (res),     /* output */
> +		     [dst] "=m" (*dst)
> +		     : [src] "r" (src),      /* input */
> +		     "a" (exp),
> +		     "m" (*dst)
> +		     : "memory");            /* no-clobber list */
> +	return res;
> +}
> +
> +static inline uint32_t
> +vmbus_txbr_copyto(const struct vmbus_br *tbr, uint32_t windex,
> +		  const void *src0, uint32_t cplen)
> +{
> +	uint8_t *br_data = (uint8_t *)tbr->vbr + RINGDATA_START_OFFSET;
> +	uint32_t br_dsize = tbr->dsize;
> +	const uint8_t *src = src0;
> +
> +	if (cplen > br_dsize - windex) {
> +		uint32_t fraglen = br_dsize - windex;
> +
> +		/* Wrap-around detected */
> +		memcpy(br_data + windex, src, fraglen);
> +		memcpy(br_data, src + fraglen, cplen - fraglen);
> +	} else {
> +		memcpy(br_data + windex, src, cplen);
> +	}
> +
> +	return vmbus_br_idxinc(windex, cplen, br_dsize);
> +}
> +
> +/*
> + * Write scattered channel packet to TX bufring.
> + *
> + * The offset of this channel packet is written as a 64bits value
> + * immediately after this channel packet.
> + *
> + * The write goes through three stages:
> + *  1. Reserve space in ring buffer for the new data.
> + *     Writer atomically moves priv_write_index.
> + *  2. Copy the new data into the ring.
> + *  3. Update the tail of the ring (visible to host) that indicates
> + *     next read location. Writer updates write_index
> + */
> +static int
> +vmbus_txbr_write(struct vmbus_br *tbr, const struct iovec iov[], int iovlen,
> +		 bool *need_sig)
> +{
> +	struct vmbus_bufring *vbr = tbr->vbr;
> +	uint32_t ring_size = tbr->dsize;
> +	uint32_t old_windex, next_windex, windex, total;
> +	uint64_t save_windex;
> +	int i;
> +
> +	total = 0;
> +	for (i = 0; i < iovlen; i++)
> +		total += iov[i].iov_len;
> +	total += sizeof(save_windex);
> +
> +	/* Reserve space in ring */
> +	do {
> +		uint32_t avail;
> +
> +		/* Get current free location */
> +		old_windex = tbr->windex;
> +
> +		/* Prevent compiler reordering this with calculation */
> +		rte_compiler_barrier();
> +
> +		avail = vmbus_br_availwrite(tbr, old_windex);
> +
> +		/* If not enough space in ring, then tell caller. */
> +		if (avail <= total)
> +			return -EAGAIN;
> +
> +		next_windex = vmbus_br_idxinc(old_windex, total, ring_size);
> +
> +		/* Atomic update of next write_index for other threads */
> +	} while (!rte_atomic32_cmpset(&tbr->windex, old_windex, next_windex));
> +
> +	/* Space from old..new is now reserved */
> +	windex = old_windex;
> +	for (i = 0; i < iovlen; i++)
> +		windex = vmbus_txbr_copyto(tbr, windex, iov[i].iov_base, iov[i].iov_len);
> +
> +	/* Set the offset of the current channel packet. */
> +	save_windex = ((uint64_t)old_windex) << 32;
> +	windex = vmbus_txbr_copyto(tbr, windex, &save_windex,
> +				   sizeof(save_windex));
> +
> +	/* The region reserved should match region used */
> +	if (windex != next_windex)
> +		return -EINVAL;
> +
> +	/* Ensure that data is available before updating host index */
> +	rte_compiler_barrier();
> +
> +	/* Checkin for our reservation. wait for our turn to update host */
> +	while (!rte_atomic32_cmpset(&vbr->windex, old_windex, next_windex))
> +		_mm_pause();
> +
> +	return 0;
> +}
> +
> +int rte_vmbus_chan_send(struct vmbus_br *txbr, uint16_t type, void *data,
> +			uint32_t dlen, uint32_t flags)
> +{
> +	struct vmbus_chanpkt pkt;
> +	unsigned int pktlen, pad_pktlen;
> +	const uint32_t hlen = sizeof(pkt);
> +	bool send_evt = false;
> +	uint64_t pad = 0;
> +	struct iovec iov[3];
> +	int error;
> +
> +	pktlen = hlen + dlen;
> +	pad_pktlen = ALIGN(pktlen, sizeof(uint64_t));

This ALIGN function rounds down.  So pad_pktlen could be
less than pktlen.

> +
> +	pkt.hdr.type = type;
> +	pkt.hdr.flags = flags;
> +	pkt.hdr.hlen = hlen >> VMBUS_CHANPKT_SIZE_SHIFT;
> +	pkt.hdr.tlen = pad_pktlen >> VMBUS_CHANPKT_SIZE_SHIFT;
> +	pkt.hdr.xactid = VMBUS_RQST_ERROR; /* doesn't support multiple requests at same time */
> +
> +	iov[0].iov_base = &pkt;
> +	iov[0].iov_len = hlen;
> +	iov[1].iov_base = data;
> +	iov[1].iov_len = dlen;
> +	iov[2].iov_base = &pad;
> +	iov[2].iov_len = pad_pktlen - pktlen;

Given the way your ALIGN function works, the above could
produce a negative value for iov[2].iov_len.  Then bad things
will happen. :-(

> +
> +	error = vmbus_txbr_write(txbr, iov, 3, &send_evt);
> +
> +	return error;
> +}
> +
> +static inline uint32_t
> +vmbus_rxbr_copyfrom(const struct vmbus_br *rbr, uint32_t rindex,
> +		    void *dst0, size_t cplen)
> +{
> +	const uint8_t *br_data = (uint8_t *)rbr->vbr + RINGDATA_START_OFFSET;
> +	uint32_t br_dsize = rbr->dsize;
> +	uint8_t *dst = dst0;
> +
> +	if (cplen > br_dsize - rindex) {
> +		uint32_t fraglen = br_dsize - rindex;
> +
> +		/* Wrap-around detected. */
> +		memcpy(dst, br_data + rindex, fraglen);
> +		memcpy(dst + fraglen, br_data, cplen - fraglen);
> +	} else {
> +		memcpy(dst, br_data + rindex, cplen);
> +	}
> +
> +	return vmbus_br_idxinc(rindex, cplen, br_dsize);
> +}
> +
> +/* Copy data from receive ring but don't change index */
> +static int
> +vmbus_rxbr_peek(const struct vmbus_br *rbr, void *data, size_t dlen)
> +{
> +	uint32_t avail;
> +
> +	/*
> +	 * The requested data and the 64bits channel packet
> +	 * offset should be there at least.
> +	 */
> +	avail = vmbus_br_availread(rbr);
> +	if (avail < dlen + sizeof(uint64_t))
> +		return -EAGAIN;
> +
> +	vmbus_rxbr_copyfrom(rbr, rbr->vbr->rindex, data, dlen);
> +	return 0;
> +}
> +
> +/*
> + * Copy data from receive ring and change index
> + * NOTE:
> + * We assume (dlen + skip) == sizeof(channel packet).
> + */
> +static int
> +vmbus_rxbr_read(struct vmbus_br *rbr, void *data, size_t dlen, size_t skip)
> +{
> +	struct vmbus_bufring *vbr = rbr->vbr;
> +	uint32_t br_dsize = rbr->dsize;
> +	uint32_t rindex;
> +
> +	if (vmbus_br_availread(rbr) < dlen + skip + sizeof(uint64_t))
> +		return -EAGAIN;
> +
> +	/* Record where host was when we started read (for debug) */
> +	rbr->windex = rbr->vbr->windex;
> +
> +	/*
> +	 * Copy channel packet from RX bufring.
> +	 */
> +	rindex = vmbus_br_idxinc(rbr->vbr->rindex, skip, br_dsize);
> +	rindex = vmbus_rxbr_copyfrom(rbr, rindex, data, dlen);
> +
> +	/*
> +	 * Discard this channel packet's 64bits offset, which is useless to us.
> +	 */
> +	rindex = vmbus_br_idxinc(rindex, sizeof(uint64_t), br_dsize);
> +
> +	/* Update the read index _after_ the channel packet is fetched.	 */
> +	rte_compiler_barrier();
> +
> +	vbr->rindex = rindex;
> +
> +	return 0;
> +}
> +
> +int rte_vmbus_chan_recv_raw(struct vmbus_br *rxbr,
> +			    void *data, uint32_t *len)
> +{
> +	struct vmbus_chanpkt_hdr pkt;
> +	uint32_t dlen, bufferlen = *len;
> +	int error;
> +
> +	error = vmbus_rxbr_peek(rxbr, &pkt, sizeof(pkt));
> +	if (error)
> +		return error;
> +
> +	if (unlikely(pkt.hlen < VMBUS_CHANPKT_HLEN_MIN))
> +		/* XXX this channel is dead actually. */
> +		return -EIO;
> +
> +	if (unlikely(pkt.hlen > pkt.tlen))
> +		return -EIO;
> +
> +	/* Length are in quad words */
> +	dlen = pkt.tlen << VMBUS_CHANPKT_SIZE_SHIFT;
> +	*len = dlen;
> +
> +	/* If caller buffer is not large enough */
> +	if (unlikely(dlen > bufferlen))
> +		return -ENOBUFS;
> +
> +	/* Read data and skip packet header */
> +	error = vmbus_rxbr_read(rxbr, data, dlen, 0);
> +	if (error)
> +		return error;
> +
> +	/* Return the number of bytes read */
> +	return dlen + sizeof(uint64_t);
> +}
> diff --git a/tools/hv/vmbus_bufring.h b/tools/hv/vmbus_bufring.h
> new file mode 100644
> index 000000000000..45ecc48e517f
> --- /dev/null
> +++ b/tools/hv/vmbus_bufring.h
> @@ -0,0 +1,154 @@
> +/* SPDX-License-Identifier: BSD-3-Clause */
> +
> +#ifndef _VMBUS_BUF_H_
> +#define _VMBUS_BUF_H_
> +
> +#include <stdbool.h>
> +#include <stdint.h>
> +
> +#define __packed   __attribute__((__packed__))
> +#define unlikely(x)	__builtin_expect(!!(x), 0)
> +
> +#define ICMSGHDRFLAG_TRANSACTION	1
> +#define ICMSGHDRFLAG_REQUEST		2
> +#define ICMSGHDRFLAG_RESPONSE		4
> +
> +#define IC_VERSION_NEGOTIATION_MAX_VER_COUNT 100
> +#define ICMSG_HDR (sizeof(struct vmbuspipe_hdr) + sizeof(struct icmsg_hdr))
> +#define ICMSG_NEGOTIATE_PKT_SIZE(icframe_vercnt, icmsg_vercnt) \
> +	(ICMSG_HDR + sizeof(struct icmsg_negotiate) + \
> +	 (((icframe_vercnt) + (icmsg_vercnt)) * sizeof(struct ic_version)))
> +
> +/*
> + * Channel packets
> + */
> +
> +/* Channel packet flags */
> +#define VMBUS_CHANPKT_TYPE_INBAND	0x0006
> +#define VMBUS_CHANPKT_TYPE_RXBUF	0x0007
> +#define VMBUS_CHANPKT_TYPE_GPA		0x0009
> +#define VMBUS_CHANPKT_TYPE_COMP		0x000b
> +
> +#define VMBUS_CHANPKT_FLAG_NONE		0
> +#define VMBUS_CHANPKT_FLAG_RC		0x0001  /* report completion */
> +
> +#define VMBUS_CHANPKT_SIZE_SHIFT	3
> +#define VMBUS_CHANPKT_SIZE_ALIGN	BIT(VMBUS_CHANPKT_SIZE_SHIFT)
> +#define VMBUS_CHANPKT_HLEN_MIN		\
> +	(sizeof(struct vmbus_chanpkt_hdr) >> VMBUS_CHANPKT_SIZE_SHIFT)
> +
> +/*
> + * Buffer ring
> + */
> +struct vmbus_bufring {
> +	volatile uint32_t windex;
> +	volatile uint32_t rindex;
> +
> +	/*
> +	 * Interrupt mask {0,1}
> +	 *
> +	 * For TX bufring, host set this to 1, when it is processing
> +	 * the TX bufring, so that we can safely skip the TX event
> +	 * notification to host.
> +	 *
> +	 * For RX bufring, once this is set to 1 by us, host will not
> +	 * further dispatch interrupts to us, even if there are data
> +	 * pending on the RX bufring.  This effectively disables the
> +	 * interrupt of the channel to which this RX bufring is attached.
> +	 */
> +	volatile uint32_t imask;
> +
> +	/*
> +	 * Win8 uses some of the reserved bits to implement
> +	 * interrupt driven flow management. On the send side
> +	 * we can request that the receiver interrupt the sender
> +	 * when the ring transitions from being full to being able
> +	 * to handle a message of size "pending_send_sz".
> +	 *
> +	 * Add necessary state for this enhancement.
> +	 */
> +	volatile uint32_t pending_send;
> +	uint32_t reserved1[12];
> +
> +	union {
> +		struct {
> +			uint32_t feat_pending_send_sz:1;
> +		};
> +		uint32_t value;
> +	} feature_bits;
> +
> +	/*
> +	 * Ring data starts here + RingDataStartOffset

This mention of RingDataStartOffset looks stale.  I could
not find it defined anywhere.

> +	 * !!! DO NOT place any fields below this !!!
> +	 */
> +	uint8_t data[];
> +} __packed;
> +
> +struct vmbus_br {
> +	struct vmbus_bufring *vbr;
> +	uint32_t	dsize;
> +	uint32_t	windex; /* next available location */
> +};
> +
> +struct vmbus_chanpkt_hdr {
> +	uint16_t	type;	/* VMBUS_CHANPKT_TYPE_ */
> +	uint16_t	hlen;	/* header len, in 8 bytes */
> +	uint16_t	tlen;	/* total len, in 8 bytes */
> +	uint16_t	flags;	/* VMBUS_CHANPKT_FLAG_ */
> +	uint64_t	xactid;
> +} __packed;
> +
> +struct vmbus_chanpkt {
> +	struct vmbus_chanpkt_hdr hdr;
> +} __packed;
> +
> +struct vmbuspipe_hdr {
> +	unsigned int flags;
> +	unsigned int msgsize;
> +} __packed;
> +
> +struct ic_version {
> +	unsigned short major;
> +	unsigned short minor;
> +} __packed;
> +
> +struct icmsg_negotiate {
> +	unsigned short icframe_vercnt;
> +	unsigned short icmsg_vercnt;
> +	unsigned int reserved;
> +	struct ic_version icversion_data[]; /* any size array */
> +} __packed;
> +
> +struct icmsg_hdr {
> +	struct ic_version icverframe;
> +	unsigned short icmsgtype;
> +	struct ic_version icvermsg;
> +	unsigned short icmsgsize;
> +	unsigned int status;
> +	unsigned char ictransaction_id;
> +	unsigned char icflags;
> +	unsigned char reserved[2];
> +} __packed;
> +
> +int rte_vmbus_chan_recv_raw(struct vmbus_br *rxbr, void *data, uint32_t *len);
> +int rte_vmbus_chan_send(struct vmbus_br *txbr, uint16_t type, void *data,
> +			uint32_t dlen, uint32_t flags);
> +void vmbus_br_setup(struct vmbus_br *br, void *buf, unsigned int blen);
> +
> +/* Amount of space available for write */
> +static inline uint32_t vmbus_br_availwrite(const struct vmbus_br *br, uint32_t
> windex)
> +{
> +	uint32_t rindex = br->vbr->rindex;
> +
> +	if (windex >= rindex)
> +		return br->dsize - (windex - rindex);
> +	else
> +		return rindex - windex;
> +}
> +
> +static inline uint32_t vmbus_br_availread(const struct vmbus_br *br)
> +{
> +	return br->dsize - vmbus_br_availwrite(br, br->vbr->windex);
> +}
> +
> +#endif	/* !_VMBUS_BUF_H_ */
> --
> 2.34.1


^ permalink raw reply

* [PATCH v2 net] net: hv_netvsc: fix netvsc_send_completion to avoid multiple message length checks
From: Sonia Sharma @ 2023-08-02 21:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-hyperv, netdev, sosha, kys, mikelley, haiyangz, wei.liu,
	decui, longli, davem, edumazet, kuba, pabeni

From: Sonia Sharma <sonia.sharma@linux.microsoft.com>

The switch statement in netvsc_send_completion() is incorrectly validating
the length of incoming network packets by falling through to the next case.
Avoid the fallthrough. Instead break after a case match and then process
the complete() call.

Signed-off-by: Sonia Sharma <sonia.sharma@linux.microsoft.com>
---
 drivers/net/hyperv/netvsc.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 82e9796c8f5e..347688dd2eb9 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -851,7 +851,7 @@ static void netvsc_send_completion(struct net_device *ndev,
 				   msglen);
 			return;
 		}
-		fallthrough;
+		break;
 
 	case NVSP_MSG1_TYPE_SEND_RECV_BUF_COMPLETE:
 		if (msglen < sizeof(struct nvsp_message_header) +
@@ -860,7 +860,7 @@ static void netvsc_send_completion(struct net_device *ndev,
 				   msglen);
 			return;
 		}
-		fallthrough;
+		break;
 
 	case NVSP_MSG1_TYPE_SEND_SEND_BUF_COMPLETE:
 		if (msglen < sizeof(struct nvsp_message_header) +
@@ -869,7 +869,7 @@ static void netvsc_send_completion(struct net_device *ndev,
 				   msglen);
 			return;
 		}
-		fallthrough;
+		break;
 
 	case NVSP_MSG5_TYPE_SUBCHANNEL:
 		if (msglen < sizeof(struct nvsp_message_header) +
@@ -878,10 +878,6 @@ static void netvsc_send_completion(struct net_device *ndev,
 				   msglen);
 			return;
 		}
-		/* Copy the response back */
-		memcpy(&net_device->channel_init_pkt, nvsp_packet,
-		       sizeof(struct nvsp_message));
-		complete(&net_device->channel_init_wait);
 		break;
 
 	case NVSP_MSG1_TYPE_SEND_RNDIS_PKT_COMPLETE:
@@ -904,13 +900,18 @@ static void netvsc_send_completion(struct net_device *ndev,
 
 		netvsc_send_tx_complete(ndev, net_device, incoming_channel,
 					desc, budget);
-		break;
+		return;
 
 	default:
 		netdev_err(ndev,
 			   "Unknown send completion type %d received!!\n",
 			   nvsp_packet->hdr.msg_type);
 	}
+
+	/* Copy the response back */
+	memcpy(&net_device->channel_init_pkt, nvsp_packet,
+			sizeof(struct nvsp_message));
+	complete(&net_device->channel_init_wait);
 }
 
 static u32 netvsc_get_next_send_section(struct netvsc_device *net_device)
-- 
2.25.1


^ permalink raw reply related

* Re: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
From: Tom Lendacky @ 2023-08-02 21:57 UTC (permalink / raw)
  To: Michael Kelley, kys, haiyangz, wei.liu, decui, tglx, mingo, bp,
	dave.hansen, hpa, luto, peterz, sathyanarayanan.kuppuswamy,
	kirill.shutemov, seanjc, rick.p.edgecombe, linux-kernel,
	linux-hyperv, x86
In-Reply-To: <1688661719-60329-1-git-send-email-mikelley@microsoft.com>

On 7/6/23 11:41, Michael Kelley wrote:
> In a CoCo VM when a page transitions from private to shared, or vice
> versa, attributes in the PTE must be updated *and* the hypervisor must
> be notified of the change. Because there are two separate steps, there's
> a window where the settings are inconsistent.  Normally the code that
> initiates the transition (via set_memory_decrypted() or
> set_memory_encrypted()) ensures that the memory is not being accessed
> during a transition, so the window of inconsistency is not a problem.
> However, the load_unaligned_zeropad() function can read arbitrary memory
> pages at arbitrary times, which could access a transitioning page during
> the window.  In such a case, CoCo VM specific exceptions are taken
> (depending on the CoCo architecture in use).  Current code in those
> exception handlers recovers and does "fixup" on the result returned by
> load_unaligned_zeropad().  Unfortunately, this exception handling and
> fixup code is tricky and somewhat fragile.  At the moment, it is
> broken for both TDX and SEV-SNP.
> 
> There's also a problem with the current code in paravisor scenarios:
> TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need
> to be forwarded from the paravisor to the Linux guest, but there
> are no architectural specs for how to do that.
> 
> To avoid these complexities of the CoCo exception handlers, change
> the core transition code in __set_memory_enc_pgtable() to do the
> following:
> 
> 1.  Remove aliasing mappings
> 2.  Remove the PRESENT bit from the PTEs of all transitioning pages
> 3.  Flush the TLB globally
> 4.  Flush the data cache if needed
> 5.  Set/clear the encryption attribute as appropriate
> 6.  Notify the hypervisor of the page status change
> 7.  Add back the PRESENT bit
> 
> With this approach, load_unaligned_zeropad() just takes its normal
> page-fault-based fixup path if it touches a page that is transitioning.
> As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> are completely decoupled.  CoCo VM page transitions can proceed
> without needing to handle architecture-specific exceptions and fix
> things up. This decoupling reduces the complexity due to separate
> TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> introduce new capabilities in future versions of the TDX and SEV-SNP
> architectures. Paravisor scenarios work properly without needing
> to forward exceptions.
> 
> This approach may make __set_memory_enc_pgtable() slightly slower
> because of touching the PTEs three times instead of just once. But
> the run time of this function is already dominated by the hypercall
> and the need to flush the TLB at least once and maybe twice. In any
> case, this function is only used for CoCo VM page transitions, and
> is already unsuitable for hot paths.
> 
> The architecture specific callback function for notifying the
> hypervisor typically must translate guest kernel virtual addresses
> into guest physical addresses to pass to the hypervisor.  Because
> the PTEs are invalid at the time of callback, the code for doing the
> translation needs updating.  virt_to_phys() or equivalent continues
> to work for direct map addresses.  But vmalloc addresses cannot use
> vmalloc_to_page() because that function requires the leaf PTE to be
> valid. Instead, slow_virt_to_phys() must be used. Both functions
> manually walk the page table hierarchy, so performance is the same.

This all seems reasonable if it allows the paravisor approach to run 
without issues.

> 
> Signed-off-by: Michael Kelley <mikelley@microsoft.com>
> ---
> 
> I'm assuming the TDX handling of the data cache flushing is the
> same with this new approach, and that it doesn't need to be paired
> with a TLB flush as in the current code.  If that's not a correct
> assumption, let me know.
> 
> I've left the two hypervisor callbacks, before and after Step 5
> above. If the PTEs are invalid, it seems like the order of Step 5
> and Step 6 shouldn't matter, so perhaps one of the callback could
> be dropped.  Let me know if there are reasons to do otherwise.
> 
> It may well be possible to optimize the new implementation of
> __set_memory_enc_pgtable(). The existing set_memory_np() and
> set_memory_p() functions do all the right things in a very clear
> fashion, but perhaps not as optimally as having all three PTE
> manipulations directly in the same function. It doesn't appear
> that optimizing the performance really matters here, but I'm open
> to suggestions.
> 
> I've tested this on TDX VMs and SEV-SNP + vTOM VMs.  I can also
> test on SEV-SNP VMs without vTOM. But I'll probably need help
> covering SEV and SEV-ES VMs.

I wouldn't think that SEV and SEV-ES VMs would have any issues. However, 
these types of VMs don't make hypercalls at the moment, but I don't know 
that any slow downs would be noticed.

> 
> This RFC patch does *not* remove code that would no longer be
> needed. If there's agreement to take this new approach, I'll
> add patches to remove the unneeded code.
> 
> This patch is built against linux-next20230704.


> 
>   arch/x86/hyperv/ivm.c        |  3 ++-
>   arch/x86/kernel/sev.c        |  2 +-
>   arch/x86/mm/pat/set_memory.c | 32 ++++++++++++--------------------
>   3 files changed, 15 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index 28be6df..2859ec3 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -308,7 +308,8 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
>   		return false;
>   
>   	for (i = 0, pfn = 0; i < pagecount; i++) {
> -		pfn_array[pfn] = virt_to_hvpfn((void *)kbuffer + i * HV_HYP_PAGE_SIZE);
> +		pfn_array[pfn] = slow_virt_to_phys((void *)kbuffer +
> +					i * HV_HYP_PAGE_SIZE) >> HV_HYP_PAGE_SHIFT;

Definitely needs a comment here (and below) that slow_virt_to_phys() is 
being used because of making the page not present.

>   		pfn++;
>   
>   		if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index 1ee7bed..59db55e 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -784,7 +784,7 @@ static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long
>   		hdr->end_entry = i;
>   
>   		if (is_vmalloc_addr((void *)vaddr)) {
> -			pfn = vmalloc_to_pfn((void *)vaddr);
> +			pfn = slow_virt_to_phys((void *)vaddr) >> PAGE_SHIFT;
>   			use_large_entry = false;
>   		} else {
>   			pfn = __pa(vaddr) >> PAGE_SHIFT;
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index bda9f12..8a194c7 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -2136,6 +2136,11 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>   	if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
>   		addr &= PAGE_MASK;
>   
> +	/* set_memory_np() removes aliasing mappings and flushes the TLB */

Is there any case where the TLB wouldn't be flushed when it should? Since, 
for SEV at least, the TLB flush being removed below was always performed.

Thanks,
Tom

> +	ret = set_memory_np(addr, numpages);
> +	if (ret)
> +		return ret;
> +
>   	memset(&cpa, 0, sizeof(cpa));
>   	cpa.vaddr = &addr;
>   	cpa.numpages = numpages;
> @@ -2143,36 +2148,23 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>   	cpa.mask_clr = enc ? pgprot_decrypted(empty) : pgprot_encrypted(empty);
>   	cpa.pgd = init_mm.pgd;
>   
> -	/* Must avoid aliasing mappings in the highmem code */
> -	kmap_flush_unused();
> -	vm_unmap_aliases();
> -
>   	/* Flush the caches as needed before changing the encryption attribute. */
> -	if (x86_platform.guest.enc_tlb_flush_required(enc))
> -		cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());
> +	if (x86_platform.guest.enc_cache_flush_required())
> +		cpa_flush(&cpa, 1);
>   
>   	/* Notify hypervisor that we are about to set/clr encryption attribute. */
>   	if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
>   		return -EIO;
>   
>   	ret = __change_page_attr_set_clr(&cpa, 1);
> -
> -	/*
> -	 * After changing the encryption attribute, we need to flush TLBs again
> -	 * in case any speculative TLB caching occurred (but no need to flush
> -	 * caches again).  We could just use cpa_flush_all(), but in case TLB
> -	 * flushing gets optimized in the cpa_flush() path use the same logic
> -	 * as above.
> -	 */
> -	cpa_flush(&cpa, 0);
> +	if (ret)
> +		return ret;
>   
>   	/* Notify hypervisor that we have successfully set/clr encryption attribute. */
> -	if (!ret) {
> -		if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> -			ret = -EIO;
> -	}
> +	if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> +		return -EIO;
>   
> -	return ret;
> +	return set_memory_p(&addr, numpages);
>   }
>   
>   static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)

^ permalink raw reply

* Re: [PATCH RFC net-next v5 11/14] vhost/vsock: implement datagram support
From: Bobby Eshleman @ 2023-08-02 21:23 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers, Dan Carpenter, Simon Horman, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv, bpf
In-Reply-To: <acd54194-d397-e721-28e4-73a69257a2a9@gmail.com>

On Thu, Jul 27, 2023 at 11:00:55AM +0300, Arseniy Krasnov wrote:
> 
> 
> On 26.07.2023 20:55, Bobby Eshleman wrote:
> > On Sat, Jul 22, 2023 at 11:42:38AM +0300, Arseniy Krasnov wrote:
> >>
> >>
> >> On 19.07.2023 03:50, Bobby Eshleman wrote:
> >>> This commit implements datagram support for vhost/vsock by teaching
> >>> vhost to use the common virtio transport datagram functions.
> >>>
> >>> If the virtio RX buffer is too small, then the transmission is
> >>> abandoned, the packet dropped, and EHOSTUNREACH is added to the socket's
> >>> error queue.
> >>>
> >>> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> >>> ---
> >>>  drivers/vhost/vsock.c    | 62 +++++++++++++++++++++++++++++++++++++++++++++---
> >>>  net/vmw_vsock/af_vsock.c |  5 +++-
> >>>  2 files changed, 63 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> >>> index d5d6a3c3f273..da14260c6654 100644
> >>> --- a/drivers/vhost/vsock.c
> >>> +++ b/drivers/vhost/vsock.c
> >>> @@ -8,6 +8,7 @@
> >>>   */
> >>>  #include <linux/miscdevice.h>
> >>>  #include <linux/atomic.h>
> >>> +#include <linux/errqueue.h>
> >>>  #include <linux/module.h>
> >>>  #include <linux/mutex.h>
> >>>  #include <linux/vmalloc.h>
> >>> @@ -32,7 +33,8 @@
> >>>  enum {
> >>>  	VHOST_VSOCK_FEATURES = VHOST_FEATURES |
> >>>  			       (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
> >>> -			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET)
> >>> +			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET) |
> >>> +			       (1ULL << VIRTIO_VSOCK_F_DGRAM)
> >>>  };
> >>>  
> >>>  enum {
> >>> @@ -56,6 +58,7 @@ struct vhost_vsock {
> >>>  	atomic_t queued_replies;
> >>>  
> >>>  	u32 guest_cid;
> >>> +	bool dgram_allow;
> >>>  	bool seqpacket_allow;
> >>>  };
> >>>  
> >>> @@ -86,6 +89,32 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
> >>>  	return NULL;
> >>>  }
> >>>  
> >>> +/* Claims ownership of the skb, do not free the skb after calling! */
> >>> +static void
> >>> +vhost_transport_error(struct sk_buff *skb, int err)
> >>> +{
> >>> +	struct sock_exterr_skb *serr;
> >>> +	struct sock *sk = skb->sk;
> >>> +	struct sk_buff *clone;
> >>> +
> >>> +	serr = SKB_EXT_ERR(skb);
> >>> +	memset(serr, 0, sizeof(*serr));
> >>> +	serr->ee.ee_errno = err;
> >>> +	serr->ee.ee_origin = SO_EE_ORIGIN_NONE;
> >>> +
> >>> +	clone = skb_clone(skb, GFP_KERNEL);
> >>
> >> May for skb which is error carrier we can use 'sock_omalloc()', not 'skb_clone()' ? TCP uses skb
> >> allocated by this function as carriers of error structure. I guess 'skb_clone()' also clones data of origin,
> >> but i think that there is no need in data as we insert it to error queue of the socket.
> >>
> >> What do You think?
> > 
> > IIUC skb_clone() is often used in this scenario so that the user can
> > retrieve the error-causing packet from the error queue.  Is there some
> > reason we shouldn't do this?
> > 
> > I'm seeing that the serr bits need to occur on the clone here, not the
> > original. I didn't realize the SKB_EXT_ERR() is a skb->cb cast. I'm not
> > actually sure how this passes the test case since ->cb isn't cloned.
> 
> Ah yes, sorry, You are right, I just confused this case with zerocopy completion
> handling - there we allocate "empty" skb which carries completion metadata in its
> 'cb' field.
> 
> Hm, but can't we just reinsert current skb (update it's 'cb' as 'sock_exterr_skb')
> to error queue of the socket without cloning it ?
> 
> Thanks, Arseniy
> 

I just assumed other socket types used skb_clone() for some reason
unknown to me and I didn't want to deviate.

If it is fine to just use the skb directly, then I am happy to make that
change.

Best,
Bobby

> > 
> >>
> >>> +	if (!clone)
> >>> +		return;
> >>
> >> What will happen here 'if (!clone)' ? skb will leak as it was removed from queue?
> >>
> > 
> > Ah yes, true.
> > 
> >>> +
> >>> +	if (sock_queue_err_skb(sk, clone))
> >>> +		kfree_skb(clone);
> >>> +
> >>> +	sk->sk_err = err;
> >>> +	sk_error_report(sk);
> >>> +
> >>> +	kfree_skb(skb);
> >>> +}
> >>> +
> >>>  static void
> >>>  vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> >>>  			    struct vhost_virtqueue *vq)
> >>> @@ -160,9 +189,15 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> >>>  		hdr = virtio_vsock_hdr(skb);
> >>>  
> >>>  		/* If the packet is greater than the space available in the
> >>> -		 * buffer, we split it using multiple buffers.
> >>> +		 * buffer, we split it using multiple buffers for connectible
> >>> +		 * sockets and drop the packet for datagram sockets.
> >>>  		 */
> >>>  		if (payload_len > iov_len - sizeof(*hdr)) {
> >>> +			if (le16_to_cpu(hdr->type) == VIRTIO_VSOCK_TYPE_DGRAM) {
> >>> +				vhost_transport_error(skb, EHOSTUNREACH);
> >>> +				continue;
> >>> +			}
> >>> +
> >>>  			payload_len = iov_len - sizeof(*hdr);
> >>>  
> >>>  			/* As we are copying pieces of large packet's buffer to
> >>> @@ -394,6 +429,7 @@ static bool vhost_vsock_more_replies(struct vhost_vsock *vsock)
> >>>  	return val < vq->num;
> >>>  }
> >>>  
> >>> +static bool vhost_transport_dgram_allow(u32 cid, u32 port);
> >>>  static bool vhost_transport_seqpacket_allow(u32 remote_cid);
> >>>  
> >>>  static struct virtio_transport vhost_transport = {
> >>> @@ -410,7 +446,8 @@ static struct virtio_transport vhost_transport = {
> >>>  		.cancel_pkt               = vhost_transport_cancel_pkt,
> >>>  
> >>>  		.dgram_enqueue            = virtio_transport_dgram_enqueue,
> >>> -		.dgram_allow              = virtio_transport_dgram_allow,
> >>> +		.dgram_allow              = vhost_transport_dgram_allow,
> >>> +		.dgram_addr_init          = virtio_transport_dgram_addr_init,
> >>>  
> >>>  		.stream_enqueue           = virtio_transport_stream_enqueue,
> >>>  		.stream_dequeue           = virtio_transport_stream_dequeue,
> >>> @@ -443,6 +480,22 @@ static struct virtio_transport vhost_transport = {
> >>>  	.send_pkt = vhost_transport_send_pkt,
> >>>  };
> >>>  
> >>> +static bool vhost_transport_dgram_allow(u32 cid, u32 port)
> >>> +{
> >>> +	struct vhost_vsock *vsock;
> >>> +	bool dgram_allow = false;
> >>> +
> >>> +	rcu_read_lock();
> >>> +	vsock = vhost_vsock_get(cid);
> >>> +
> >>> +	if (vsock)
> >>> +		dgram_allow = vsock->dgram_allow;
> >>> +
> >>> +	rcu_read_unlock();
> >>> +
> >>> +	return dgram_allow;
> >>> +}
> >>> +
> >>>  static bool vhost_transport_seqpacket_allow(u32 remote_cid)
> >>>  {
> >>>  	struct vhost_vsock *vsock;
> >>> @@ -799,6 +852,9 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features)
> >>>  	if (features & (1ULL << VIRTIO_VSOCK_F_SEQPACKET))
> >>>  		vsock->seqpacket_allow = true;
> >>>  
> >>> +	if (features & (1ULL << VIRTIO_VSOCK_F_DGRAM))
> >>> +		vsock->dgram_allow = true;
> >>> +
> >>>  	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
> >>>  		vq = &vsock->vqs[i];
> >>>  		mutex_lock(&vq->mutex);
> >>> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> >>> index e73f3b2c52f1..449ed63ac2b0 100644
> >>> --- a/net/vmw_vsock/af_vsock.c
> >>> +++ b/net/vmw_vsock/af_vsock.c
> >>> @@ -1427,9 +1427,12 @@ int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
> >>>  		return prot->recvmsg(sk, msg, len, flags, NULL);
> >>>  #endif
> >>>  
> >>> -	if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
> >>> +	if (unlikely(flags & MSG_OOB))
> >>>  		return -EOPNOTSUPP;
> >>>  
> >>> +	if (unlikely(flags & MSG_ERRQUEUE))
> >>> +		return sock_recv_errqueue(sk, msg, len, SOL_VSOCK, 0);
> >>> +
> >>
> >> Sorry, but I get build error here, because SOL_VSOCK in undefined. I think it should be added to
> >> include/linux/socket.h and to uapi files also for future use in userspace.
> >>
> > 
> > Strange, I built each patch individually without issue. My base is
> > netdev/main with your SOL_VSOCK patch applied. I will look today and see
> > if I'm missing something.
> > 
> >> Also Stefano Garzarella <sgarzare@redhat.com> suggested to add define something like VSOCK_RECVERR,
> >> in the same way as IP_RECVERR, and use it as last parameter of 'sock_recv_errqueue()'.
> >>
> > 
> > Got it, thanks.
> > 
> >>>  	transport = vsk->transport;
> >>>  
> >>>  	/* Retrieve the head sk_buff from the socket's receive queue. */
> >>>
> >>
> >> Thanks, Arseniy
> > 
> > Thanks,
> > Bobby

^ permalink raw reply

* Re: [PATCH V6 net] net: mana: Fix MANA VF unload when hardware is
From: Wei Liu @ 2023-08-02 22:50 UTC (permalink / raw)
  To: Souradeep Chakrabarti
  Cc: kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
	longli, sharmaajay, leon, cai.huoqing, ssengar, vkuznets, tglx,
	linux-hyperv, netdev, linux-kernel, linux-rdma, schakrabarti,
	stable
In-Reply-To: <1690377336-1353-1-git-send-email-schakrabarti@linux.microsoft.com>

On Wed, Jul 26, 2023 at 06:15:36AM -0700, Souradeep Chakrabarti wrote:
> When unloading the MANA driver, mana_dealloc_queues() waits for the MANA
> hardware to complete any inflight packets and set the pending send count
> to zero. But if the hardware has failed, mana_dealloc_queues()
> could wait forever.
> 
> Fix this by adding a timeout to the wait. Set the timeout to 120 seconds,
> which is a somewhat arbitrary value that is more than long enough for
> functional hardware to complete any sends.
> 
> Cc: stable@vger.kernel.org
> Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
> 
> Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>

Hi Souradeep. The subject line of this patch seems to be cut off half
way.

Thanks,
Wei.

^ permalink raw reply

* Re: [PATCH v3] x86/hyperv: add noop functions to x86_init mpparse functions
From: Wei Liu @ 2023-08-02 23:10 UTC (permalink / raw)
  To: Saurabh Sengar
  Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
	mikelley, linux-kernel, linux-hyperv, hpa
In-Reply-To: <1687537688-5397-1-git-send-email-ssengar@linux.microsoft.com>

On Fri, Jun 23, 2023 at 09:28:08AM -0700, Saurabh Sengar wrote:
> Hyper-V can run VMs at different privilege "levels" known as Virtual
> Trust Levels (VTL). Sometimes, it chooses to run two different VMs
> at different levels but they share some of their address space. In
> such setups VTL2 (higher level VM) has visibility of all of the
> VTL0 (level 0) memory space.
> 
> When the CONFIG_X86_MPPARSE is enabled for VTL2, the VTL2 kernel
> performs a search within the low memory to locate MP tables. However,
> in systems where VTL0 manages the low memory and may contain valid
> tables, this scanning can result in incorrect MP table information
> being provided to the VTL2 kernel, mistakenly considering VTL0's MP
> table as its own
> 
> Add noop functions to avoid MP parse scan by VTL2.
> 
> Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>

Applied to hyperv-fixes. Thanks.

^ permalink raw reply

* Re: [PATCH v2 -next] Drivers: hv: vmbus: Remove unused extern declaration vmbus_ontimer()
From: Wei Liu @ 2023-08-02 23:29 UTC (permalink / raw)
  To: YueHaibing
  Cc: kys, haiyangz, wei.liu, decui, mikelley, linux-hyperv,
	linux-kernel
In-Reply-To: <20230725142108.27280-1-yuehaibing@huawei.com>

On Tue, Jul 25, 2023 at 10:21:08PM +0800, YueHaibing wrote:
> Since commit 30fbee49b071 ("Staging: hv: vmbus: Get rid of the unused function vmbus_ontimer()")
> this is not used anymore, so can remove it.
> 
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
> Reviewed-by: Michael Kelley <mikelley@microsoft.com>

Applied to hyperv-fixes, thanks!

^ permalink raw reply

* Re: [PATCH 02/15] mshyperv: Introduce hv_get_hypervisor_version
From: Wei Liu @ 2023-08-02 23:39 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, linux-kernel, x86, linux-arm-kernel, linux-arch,
	mikelley, kys, wei.liu, haiyangz, decui, ssengar, mukeshrathor,
	stanislav.kinsburskiy, jinankjain, apais, Tianyu.Lan, vkuznets,
	tglx, mingo, bp, dave.hansen, hpa, will, catalin.marinas
In-Reply-To: <1690487690-2428-3-git-send-email-nunodasneves@linux.microsoft.com>

On Thu, Jul 27, 2023 at 12:54:37PM -0700, Nuno Das Neves wrote:
> x86_64 and arm64 implementations to get the hypervisor version
> information.
> Also introduce hv_hypervisor_version_info structure to simplify parsing
> the fields.
> Replace the existing parsing when printing the version numbers.

The line wrapping looks odd. When you resend, please fix it. If you
meant to use multiple paragraphs, please insert a blank line between
them.

x86 and ARM maintainers, I don't think there is anything particularly
interesting to you, but I guess you're CC'ed because some of the files
fall under the respective directories. Feel free to ignore this patch.

The code looks fine to me -- it is mostly just refactoring.

Thanks,
Wei.

> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  arch/arm64/hyperv/mshyperv.c      | 23 +++++++++++-------
>  arch/x86/kernel/cpu/mshyperv.c    | 40 ++++++++++++++++++++-----------
>  include/asm-generic/hyperv-tlfs.h | 23 ++++++++++++++++++
>  include/asm-generic/mshyperv.h    |  2 ++
>  4 files changed, 66 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
> index a406454578f0..d44c358ce45c 100644
> --- a/arch/arm64/hyperv/mshyperv.c
> +++ b/arch/arm64/hyperv/mshyperv.c
> @@ -19,10 +19,19 @@
>  
>  static bool hyperv_initialized;
>  
> +int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
> +{
> +	hv_get_vpreg_128(HV_REGISTER_HYPERVISOR_VERSION,
> +			 (struct hv_get_vp_registers_output *)info);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
> +
>  static int __init hyperv_init(void)
>  {
> -	struct hv_get_vp_registers_output	result;
> -	u32	a, b, c, d;
> +	struct hv_get_vp_registers_output result;
> +	union hv_hypervisor_version_info version;
>  	u64	guest_id;
>  	int	ret;
>  
> @@ -55,13 +64,11 @@ static int __init hyperv_init(void)
>  		ms_hyperv.misc_features);
>  
>  	/* Get information about the Hyper-V host version */
> -	hv_get_vpreg_128(HV_REGISTER_HYPERVISOR_VERSION, &result);
> -	a = result.as32.a;
> -	b = result.as32.b;
> -	c = result.as32.c;
> -	d = result.as32.d;
> +	hv_get_hypervisor_version(&version);
>  	pr_info("Hyper-V: Host Build %d.%d.%d.%d-%d-%d\n",
> -		b >> 16, b & 0xFFFF, a,	d & 0xFFFFFF, c, d >> 24);
> +		version.major_version, version.minor_version,
> +		version.build_number, version.service_number,
> +		version.service_pack, version.service_branch);
>  
>  	ret = hv_common_init();
>  	if (ret)
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 57f6a5879b30..e44291e902ae 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -328,13 +328,30 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
>  }
>  #endif
>  
> +int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
> +{
> +	unsigned int hv_max_functions;
> +
> +	hv_max_functions = cpuid_eax(HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS);
> +	if (hv_max_functions < HYPERV_CPUID_VERSION) {
> +		pr_err("%s: Could not detect Hyper-V version\n",
> +			__func__);
> +		return -ENODEV;
> +	}
> +
> +	info->eax = cpuid_eax(HYPERV_CPUID_VERSION);
> +	info->ebx = cpuid_ebx(HYPERV_CPUID_VERSION);
> +	info->ecx = cpuid_ecx(HYPERV_CPUID_VERSION);
> +	info->edx = cpuid_edx(HYPERV_CPUID_VERSION);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
> +
>  static void __init ms_hyperv_init_platform(void)
>  {
>  	int hv_max_functions_eax;
> -	int hv_host_info_eax;
> -	int hv_host_info_ebx;
> -	int hv_host_info_ecx;
> -	int hv_host_info_edx;
> +	union hv_hypervisor_version_info version;
>  
>  #ifdef CONFIG_PARAVIRT
>  	pv_info.name = "Hyper-V";
> @@ -388,16 +405,11 @@ static void __init ms_hyperv_init_platform(void)
>  	/*
>  	 * Extract host information.
>  	 */
> -	if (hv_max_functions_eax >= HYPERV_CPUID_VERSION) {
> -		hv_host_info_eax = cpuid_eax(HYPERV_CPUID_VERSION);
> -		hv_host_info_ebx = cpuid_ebx(HYPERV_CPUID_VERSION);
> -		hv_host_info_ecx = cpuid_ecx(HYPERV_CPUID_VERSION);
> -		hv_host_info_edx = cpuid_edx(HYPERV_CPUID_VERSION);
> -
> -		pr_info("Hyper-V: Host Build %d.%d.%d.%d-%d-%d\n",
> -			hv_host_info_ebx >> 16, hv_host_info_ebx & 0xFFFF,
> -			hv_host_info_eax, hv_host_info_edx & 0xFFFFFF,
> -			hv_host_info_ecx, hv_host_info_edx >> 24);
> +	if (hv_get_hypervisor_version(&version) == 0) {
> +		pr_info("Hyper-V Host Build:%d-%d.%d-%d-%d.%d\n",
> +			version.build_number, version.major_version,
> +			version.minor_version, version.service_pack,
> +			version.service_branch, version.service_number);
>  	}
>  
>  	if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index f4e4cc4f965f..373f26efa18a 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -786,6 +786,29 @@ struct hv_input_unmap_device_interrupt {
>  #define HV_SOURCE_SHADOW_NONE               0x0
>  #define HV_SOURCE_SHADOW_BRIDGE_BUS_RANGE   0x1
>  
> +/*
> + * Version info reported by hypervisor
> + */
> +union hv_hypervisor_version_info {
> +	struct {
> +		u32 build_number;
> +
> +		u32 minor_version : 16;
> +		u32 major_version : 16;
> +
> +		u32 service_pack;
> +
> +		u32 service_number : 24;
> +		u32 service_branch : 8;
> +	};
> +	struct {
> +		u32 eax;
> +		u32 ebx;
> +		u32 ecx;
> +		u32 edx;
> +	};
> +};
> +
>  /*
>   * The whole argument should fit in a page to be able to pass to the hypervisor
>   * in one hypercall.
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index 094c57320ed1..233c976344e5 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -153,6 +153,8 @@ static inline void vmbus_signal_eom(struct hv_message *msg, u32 old_msg_type)
>  	}
>  }
>  
> +int hv_get_hypervisor_version(union hv_hypervisor_version_info *info);
> +
>  void hv_setup_vmbus_handler(void (*handler)(void));
>  void hv_remove_vmbus_handler(void);
>  void hv_setup_stimer0_handler(void (*handler)(void));
> -- 
> 2.25.1
> 

^ permalink raw reply

* Re: [PATCH 03/15] mshyperv: Introduce numa_node_to_proximity_domain_info
From: Wei Liu @ 2023-08-02 23:47 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, linux-kernel, x86, linux-arm-kernel, linux-arch,
	mikelley, kys, wei.liu, haiyangz, decui, ssengar, mukeshrathor,
	stanislav.kinsburskiy, jinankjain, apais, Tianyu.Lan, vkuznets,
	tglx, mingo, bp, dave.hansen, hpa, will, catalin.marinas, rafael,
	lenb
In-Reply-To: <1690487690-2428-4-git-send-email-nunodasneves@linux.microsoft.com>

On Thu, Jul 27, 2023 at 12:54:38PM -0700, Nuno Das Neves wrote:
> Factor out logic for converting numa node to proximity domain info into
> a helper function, and export it.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>

Reviewed-by: Wei Liu <wei.liu@kernel.org>

> ---
>  arch/x86/hyperv/hv_proc.c      |  8 ++------
>  drivers/acpi/numa/srat.c       |  1 +
>  include/asm-generic/mshyperv.h | 18 ++++++++++++++++++
>  3 files changed, 21 insertions(+), 6 deletions(-)
> 
[...]
> diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
> index 1f4fc5f8a819..0cf9f0574495 100644
> --- a/drivers/acpi/numa/srat.c
> +++ b/drivers/acpi/numa/srat.c
> @@ -48,6 +48,7 @@ int node_to_pxm(int node)
>  		return PXM_INVAL;
>  	return node_to_pxm_map[node];
>  }
> +EXPORT_SYMBOL(node_to_pxm);

Rafael and Len, I would like to get an ACK from you on this one line
change. I see a lot of other functions in that file are already
exported, so I hope this is okay, too.

It's user is the function below numa_node_to_proximity_domain_info.

Thanks,
Wei.

>  
>  static void __acpi_map_pxm_to_node(int pxm, int node)
>  {
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index 233c976344e5..447e7ebe67ee 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -21,6 +21,7 @@
>  #include <linux/types.h>
>  #include <linux/atomic.h>
>  #include <linux/bitops.h>
> +#include <acpi/acpi_numa.h>
>  #include <linux/cpumask.h>
>  #include <linux/nmi.h>
>  #include <asm/ptrace.h>
> @@ -28,6 +29,23 @@
>  
>  #define VTPM_BASE_ADDRESS 0xfed40000
>  
> +static inline union hv_proximity_domain_info
> +numa_node_to_proximity_domain_info(int node)
> +{
> +	union hv_proximity_domain_info proximity_domain_info;
> +
> +	if (node != NUMA_NO_NODE) {
> +		proximity_domain_info.domain_id = node_to_pxm(node);
> +		proximity_domain_info.flags.reserved = 0;
> +		proximity_domain_info.flags.proximity_info_valid = 1;
> +		proximity_domain_info.flags.proximity_preferred = 1;
> +	} else {
> +		proximity_domain_info.as_uint64 = 0;
> +	}
> +
> +	return proximity_domain_info;
> +}
> +
>  struct ms_hyperv_info {
>  	u32 features;
>  	u32 priv_high;
> -- 
> 2.25.1
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox