Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v2] wifi: cfg80211: Fix UAF in ieee80211_scan_rx()
From: Siddh Raman Pant @ 2022-08-16  8:52 UTC (permalink / raw)
  To: Johannes Berg
  Cc: jakub kicinski, greg kh, david s. miller, eric dumazet,
	paolo abeni, netdev, linux-kernel-mentees
In-Reply-To: <18fd9b89d45aedc1504d0cbd299ffb289ae96438.camel@sipsolutions.net>

On Tue, 16 Aug 2022 01:28:24 +0530  Johannes Berg  wrote:
> Sorry everyone, I always thought "this looks odd" and then never got
> around to taking a closer look.
> 
> So yeah, I still think this looks odd - cfg80211 doesn't really know
> anything about how mac80211 might be doing something with RCU to protect
> the pointer.
> 
> The patch also leaves the NULL-ing in mac80211 (that is how we reach it)
> broken wrt. the kfree_rcu() since it doesn't happen _before_, and the
> pointer in rdev isn't how this is reached through RCU (it's not even
> __rcu annotated).
> 

Thanks for the critical review.

> I think it might be conceptually better, though not faster, to do
> something like https://p.sipsolutions.net/1d23837f455dc4c2.txt which
> ensures that from mac80211's POV it can no longer be reached before we
> call cfg80211_scan_done().
> 
> Yeah, that's slower, but scanning is still a relatively infrequent (and
> slow anyway) operation, and this way we can stick to "this is not used
> once you call cfg80211_scan_done()" which just makes much more sense?
> 
> johannes
> 

Agreed, that looks like a good way to go about.

Thanks,
Siddh

^ permalink raw reply

* Re: [PATCH net-next v4] net: skb: prevent the split of kfree_skb_reason() by gcc
From: Miguel Ojeda @ 2022-08-16  9:02 UTC (permalink / raw)
  To: menglong8.dong
  Cc: kuba, ojeda, ndesaulniers, davem, edumazet, pabeni, asml.silence,
	imagedong, luiz.von.dentz, vasily.averin, jk, linux-kernel,
	netdev, kernel test robot
In-Reply-To: <20220816032846.2579217-1-imagedong@tencent.com>

On Tue, Aug 16, 2022 at 5:29 AM <menglong8.dong@gmail.com> wrote:
>
> Reported-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>

Hmm... Why did you add me as a reporter?

> + * Optional: not supported by clang.
> + * Optional: not supported by icc.

Much better, thank you! Please add the links to GCC's docs, like in
most attributes (some newer attributes may have been added without
them -- I will fix that).

In any case, no need to send a new version just for this for the
moment, I would recommend waiting until others comment on whether they
want `__optimize__` used here as the workaround.

Cheers,
Miguel

^ permalink raw reply

* Re: [PATCH net-next 08/10] net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore
From: Tony Lu @ 2022-08-16  8:37 UTC (permalink / raw)
  To: D. Wythe; +Cc: kgraul, wenjia, kuba, davem, netdev, linux-s390, linux-rdma
In-Reply-To: <b4e23c1ef29d567661de46a79c00e48a01344366.1660152975.git.alibuda@linux.alibaba.com>

On Thu, Aug 11, 2022 at 01:47:39AM +0800, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> It's clear that rmbs_lock and sndbufs_lock are aims to protect the
> rmbs list or the sndbufs list.
> 
> During conenction establieshment, smc_buf_get_slot() will always

conenction -> connection

> be invoke, and it only performs read semantics in rmbs list and

invoke -> invoked.

> sndbufs list.
> 
> Based on the above considerations, we replace mutex with rw_semaphore.
> Only smc_buf_get_slot() use down_read() to allow smc_buf_get_slot()
> run concurrently, other part use down_write() to keep exclusive
> semantics.
> 
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
>  net/smc/smc_core.c | 55 +++++++++++++++++++++++++++---------------------------
>  net/smc/smc_core.h |  4 ++--
>  net/smc/smc_llc.c  | 16 ++++++++--------
>  3 files changed, 38 insertions(+), 37 deletions(-)
> 
> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> index 113804d..b90970a 100644
> --- a/net/smc/smc_core.c
> +++ b/net/smc/smc_core.c
> @@ -1138,8 +1138,8 @@ static int smc_lgr_create(struct smc_sock *smc, struct smc_init_info *ini)
>  	lgr->freeing = 0;
>  	lgr->vlan_id = ini->vlan_id;
>  	refcount_set(&lgr->refcnt, 1); /* set lgr refcnt to 1 */
> -	mutex_init(&lgr->sndbufs_lock);
> -	mutex_init(&lgr->rmbs_lock);
> +	init_rwsem(&lgr->sndbufs_lock);
> +	init_rwsem(&lgr->rmbs_lock);
>  	rwlock_init(&lgr->conns_lock);
>  	for (i = 0; i < SMC_RMBE_SIZES; i++) {
>  		INIT_LIST_HEAD(&lgr->sndbufs[i]);
> @@ -1380,7 +1380,7 @@ struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
>  static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
>  			   struct smc_link_group *lgr)
>  {
> -	struct mutex *lock;	/* lock buffer list */
> +	struct rw_semaphore *lock;	/* lock buffer list */
>  	int rc;
>  
>  	if (is_rmb && buf_desc->is_conf_rkey && !list_empty(&lgr->list)) {
> @@ -1400,9 +1400,9 @@ static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
>  		/* buf registration failed, reuse not possible */
>  		lock = is_rmb ? &lgr->rmbs_lock :
>  				&lgr->sndbufs_lock;
> -		mutex_lock(lock);
> +		down_write(lock);
>  		list_del(&buf_desc->list);
> -		mutex_unlock(lock);
> +		up_write(lock);
>  
>  		smc_buf_free(lgr, is_rmb, buf_desc);
>  	} else {
> @@ -1506,15 +1506,16 @@ static void smcr_buf_unmap_lgr(struct smc_link *lnk)
>  	int i;
>  
>  	for (i = 0; i < SMC_RMBE_SIZES; i++) {
> -		mutex_lock(&lgr->rmbs_lock);
> +		down_write(&lgr->rmbs_lock);
>  		list_for_each_entry_safe(buf_desc, bf, &lgr->rmbs[i], list)
>  			smcr_buf_unmap_link(buf_desc, true, lnk);
> -		mutex_unlock(&lgr->rmbs_lock);
> -		mutex_lock(&lgr->sndbufs_lock);
> +		up_write(&lgr->rmbs_lock);
> +
> +		down_write(&lgr->sndbufs_lock);
>  		list_for_each_entry_safe(buf_desc, bf, &lgr->sndbufs[i],
>  					 list)
>  			smcr_buf_unmap_link(buf_desc, false, lnk);
> -		mutex_unlock(&lgr->sndbufs_lock);
> +		up_write(&lgr->sndbufs_lock);
>  	}
>  }
>  
> @@ -2324,19 +2325,19 @@ int smc_uncompress_bufsize(u8 compressed)
>   * buffer size; if not available, return NULL
>   */
>  static struct smc_buf_desc *smc_buf_get_slot(int compressed_bufsize,
> -					     struct mutex *lock,
> +					     struct rw_semaphore *lock,
>  					     struct list_head *buf_list)
>  {
>  	struct smc_buf_desc *buf_slot;
>  
> -	mutex_lock(lock);
> +	down_read(lock);
>  	list_for_each_entry(buf_slot, buf_list, list) {
>  		if (cmpxchg(&buf_slot->used, 0, 1) == 0) {
> -			mutex_unlock(lock);
> +			up_read(lock);
>  			return buf_slot;
>  		}
>  	}
> -	mutex_unlock(lock);
> +	up_read(lock);
>  	return NULL;
>  }
>  
> @@ -2445,13 +2446,13 @@ int smcr_link_reg_buf(struct smc_link *link, struct smc_buf_desc *buf_desc)
>  	return 0;
>  }
>  
> -static int _smcr_buf_map_lgr(struct smc_link *lnk, struct mutex *lock,
> +static int _smcr_buf_map_lgr(struct smc_link *lnk, struct rw_semaphore *lock,
>  			     struct list_head *lst, bool is_rmb)
>  {
>  	struct smc_buf_desc *buf_desc, *bf;
>  	int rc = 0;
>  
> -	mutex_lock(lock);
> +	down_write(lock);
>  	list_for_each_entry_safe(buf_desc, bf, lst, list) {
>  		if (!buf_desc->used)
>  			continue;
> @@ -2460,7 +2461,7 @@ static int _smcr_buf_map_lgr(struct smc_link *lnk, struct mutex *lock,
>  			goto out;
>  	}
>  out:
> -	mutex_unlock(lock);
> +	up_write(lock);
>  	return rc;
>  }
>  
> @@ -2493,37 +2494,37 @@ int smcr_buf_reg_lgr(struct smc_link *lnk)
>  	int i, rc = 0;
>  
>  	/* reg all RMBs for a new link */
> -	mutex_lock(&lgr->rmbs_lock);
> +	down_write(&lgr->rmbs_lock);
>  	for (i = 0; i < SMC_RMBE_SIZES; i++) {
>  		list_for_each_entry_safe(buf_desc, bf, &lgr->rmbs[i], list) {
>  			if (!buf_desc->used)
>  				continue;
>  			rc = smcr_link_reg_buf(lnk, buf_desc);
>  			if (rc) {
> -				mutex_unlock(&lgr->rmbs_lock);
> +				up_write(&lgr->rmbs_lock);
>  				return rc;
>  			}
>  		}
>  	}
> -	mutex_unlock(&lgr->rmbs_lock);
> +	up_write(&lgr->rmbs_lock);
>  
>  	if (lgr->buf_type == SMCR_PHYS_CONT_BUFS)
>  		return rc;
>  
>  	/* reg all vzalloced sndbufs for a new link */
> -	mutex_lock(&lgr->sndbufs_lock);
> +	down_write(&lgr->sndbufs_lock);
>  	for (i = 0; i < SMC_RMBE_SIZES; i++) {
>  		list_for_each_entry_safe(buf_desc, bf, &lgr->sndbufs[i], list) {
>  			if (!buf_desc->used || !buf_desc->is_vm)
>  				continue;
>  			rc = smcr_link_reg_buf(lnk, buf_desc);
>  			if (rc) {
> -				mutex_unlock(&lgr->sndbufs_lock);
> +				up_write(&lgr->sndbufs_lock);
>  				return rc;
>  			}
>  		}
>  	}
> -	mutex_unlock(&lgr->sndbufs_lock);
> +	up_write(&lgr->sndbufs_lock);
>  	return rc;
>  }
>  
> @@ -2641,7 +2642,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
>  	struct list_head *buf_list;
>  	int bufsize, bufsize_short;
>  	bool is_dgraded = false;
> -	struct mutex *lock;	/* lock buffer list */
> +	struct rw_semaphore *lock;	/* lock buffer list */
>  	int sk_buf_size;
>  
>  	if (is_rmb)
> @@ -2689,9 +2690,9 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
>  		SMC_STAT_RMB_ALLOC(smc, is_smcd, is_rmb);
>  		SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, bufsize);
>  		buf_desc->used = 1;
> -		mutex_lock(lock);
> +		down_write(lock);
>  		list_add(&buf_desc->list, buf_list);
> -		mutex_unlock(lock);
> +		up_write(lock);
>  		break; /* found */
>  	}
>  
> @@ -2765,9 +2766,9 @@ int smc_buf_create(struct smc_sock *smc, bool is_smcd)
>  	/* create rmb */
>  	rc = __smc_buf_create(smc, is_smcd, true);
>  	if (rc) {
> -		mutex_lock(&smc->conn.lgr->sndbufs_lock);
> +		down_write(&smc->conn.lgr->sndbufs_lock);
>  		list_del(&smc->conn.sndbuf_desc->list);
> -		mutex_unlock(&smc->conn.lgr->sndbufs_lock);
> +		up_write(&smc->conn.lgr->sndbufs_lock);
>  		smc_buf_free(smc->conn.lgr, false, smc->conn.sndbuf_desc);
>  		smc->conn.sndbuf_desc = NULL;
>  	}
> diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
> index 559d330..008148c 100644
> --- a/net/smc/smc_core.h
> +++ b/net/smc/smc_core.h
> @@ -300,9 +300,9 @@ struct smc_link_group {
>  	unsigned short		vlan_id;	/* vlan id of link group */
>  
>  	struct list_head	sndbufs[SMC_RMBE_SIZES];/* tx buffers */
> -	struct mutex		sndbufs_lock;	/* protects tx buffers */
> +	struct rw_semaphore	sndbufs_lock;	/* protects tx buffers */
>  	struct list_head	rmbs[SMC_RMBE_SIZES];	/* rx buffers */
> -	struct mutex		rmbs_lock;	/* protects rx buffers */
> +	struct rw_semaphore	rmbs_lock;	/* protects rx buffers */
>  
>  	u8			id[SMC_LGR_ID_SIZE];	/* unique lgr id */
>  	struct delayed_work	free_work;	/* delayed freeing of an lgr */
> diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
> index d744937..76f9906 100644
> --- a/net/smc/smc_llc.c
> +++ b/net/smc/smc_llc.c
> @@ -642,7 +642,7 @@ static int smc_llc_fill_ext_v2(struct smc_llc_msg_add_link_v2_ext *ext,
>  
>  	prim_lnk_idx = link->link_idx;
>  	lnk_idx = link_new->link_idx;
> -	mutex_lock(&lgr->rmbs_lock);
> +	down_write(&lgr->rmbs_lock);
>  	ext->num_rkeys = lgr->conns_num;
>  	if (!ext->num_rkeys)
>  		goto out;
> @@ -662,7 +662,7 @@ static int smc_llc_fill_ext_v2(struct smc_llc_msg_add_link_v2_ext *ext,
>  	}
>  	len += i * sizeof(ext->rt[0]);
>  out:
> -	mutex_unlock(&lgr->rmbs_lock);
> +	up_write(&lgr->rmbs_lock);
>  	return len;
>  }
>  
> @@ -923,7 +923,7 @@ static int smc_llc_cli_rkey_exchange(struct smc_link *link,
>  	int rc = 0;
>  	int i;
>  
> -	mutex_lock(&lgr->rmbs_lock);
> +	down_write(&lgr->rmbs_lock);
>  	num_rkeys_send = lgr->conns_num;
>  	buf_pos = smc_llc_get_first_rmb(lgr, &buf_lst);
>  	do {
> @@ -950,7 +950,7 @@ static int smc_llc_cli_rkey_exchange(struct smc_link *link,
>  			break;
>  	} while (num_rkeys_send || num_rkeys_recv);
>  
> -	mutex_unlock(&lgr->rmbs_lock);
> +	up_write(&lgr->rmbs_lock);
>  	return rc;
>  }
>  
> @@ -1033,14 +1033,14 @@ static void smc_llc_save_add_link_rkeys(struct smc_link *link,
>  	ext = (struct smc_llc_msg_add_link_v2_ext *)((u8 *)lgr->wr_rx_buf_v2 +
>  						     SMC_WR_TX_SIZE);
>  	max = min_t(u8, ext->num_rkeys, SMC_LLC_RKEYS_PER_MSG_V2);
> -	mutex_lock(&lgr->rmbs_lock);
> +	down_write(&lgr->rmbs_lock);
>  	for (i = 0; i < max; i++) {
>  		smc_rtoken_set(lgr, link->link_idx, link_new->link_idx,
>  			       ext->rt[i].rmb_key,
>  			       ext->rt[i].rmb_vaddr_new,
>  			       ext->rt[i].rmb_key_new);
>  	}
> -	mutex_unlock(&lgr->rmbs_lock);
> +	up_write(&lgr->rmbs_lock);
>  }
>  
>  static void smc_llc_save_add_link_info(struct smc_link *link,
> @@ -1349,7 +1349,7 @@ static int smc_llc_srv_rkey_exchange(struct smc_link *link,
>  	int rc = 0;
>  	int i;
>  
> -	mutex_lock(&lgr->rmbs_lock);
> +	down_write(&lgr->rmbs_lock);
>  	num_rkeys_send = lgr->conns_num;
>  	buf_pos = smc_llc_get_first_rmb(lgr, &buf_lst);
>  	do {
> @@ -1374,7 +1374,7 @@ static int smc_llc_srv_rkey_exchange(struct smc_link *link,
>  		smc_llc_flow_qentry_del(&lgr->llc_flow_lcl);
>  	} while (num_rkeys_send || num_rkeys_recv);
>  out:
> -	mutex_unlock(&lgr->rmbs_lock);
> +	up_write(&lgr->rmbs_lock);
>  	return rc;
>  }
>  
> -- 
> 1.8.3.1

^ permalink raw reply

* Re: igc: missing HW timestamps at TX
From: Kurt Kanzenbach @ 2022-08-16  8:51 UTC (permalink / raw)
  To: Vinicius Costa Gomes, Vladimir Oltean
  Cc: Ferenc Fejes, netdev@vger.kernel.org, marton12050@gmail.com,
	peti.antal99@gmail.com
In-Reply-To: <87k079hzry.fsf@intel.com>

[-- Attachment #1: Type: text/plain, Size: 1192 bytes --]

Hi Vinicius,

On Mon Aug 15 2022, Vinicius Costa Gomes wrote:
> I think your question is more "why there's that workqueue on igc?"/"why
> don't you retrieve the TX timestamp 'inline' with the interrupt?", if I
> got that right, then, I don't have a good reason, apart from the feeling
> that reading all those (5-6?) registers may take too long for a
> interrupt handler. And it's something that's being done the same for
> most (all?) Intel drivers.

We do have one optimization for igb which attempts to read the Tx
timestamp directly from the ISR. If that's not ready *only* then we
schedule the worker. I do assume igb and igc have the same logic for
retrieving the timestamps here.

The problem with workqueues is that under heavy system load, it might be
deferred and timestamps will be lost. I guess that workqueue was added
because of something like this: 1f6e8178d685 ("igb: Prevent dropped Tx
timestamps via work items and interrupts.").

>
> I have a TODO to experiment with removing the workqueue, and retrieving
> the TX timestamp in the same context as the interrupt handler, but other
> things always come up.

Let me know if you have interest in that igb patch.

Thanks,
Kurt

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 861 bytes --]

^ permalink raw reply

* RE: [PATCH v2 1/2] fs: Export __receive_fd()
From: David Laight @ 2022-08-16  8:03 UTC (permalink / raw)
  To: 'Kirill Tkhai', Linux Kernel Network Developers
  Cc: davem@davemloft.net, edumazet@google.com, viro@zeniv.linux.org.uk
In-Reply-To: <3a8da760-d58b-04fe-e251-e0d143493df1@ya.ru>

From: Kirill Tkhai
> Sent: 15 August 2022 22:15
> 
> This is needed to make receive_fd_user() available in modules, and it will be used in next patch.
> 
> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
> ---
> v2: New
>  fs/file.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/file.c b/fs/file.c
> index 3bcc1ecc314a..e45d45f1dd45 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -1181,6 +1181,7 @@ int __receive_fd(struct file *file, int __user *ufd, unsigned int o_flags)
>  	__receive_sock(file);
>  	return new_fd;
>  }
> +EXPORT_SYMBOL_GPL(__receive_fd);

It doesn't seem right (to me) to be exporting a function
with a __ prefix.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply

* [PATCH xfrm-next v2 1/6] xfrm: add new full offload flag
From: Leon Romanovsky @ 2022-08-16  8:59 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S . Miller, Herbert Xu, netdev, Raed Salem,
	ipsec-devel
In-Reply-To: <cover.1660639789.git.leonro@nvidia.com>

From: Leon Romanovsky <leonro@nvidia.com>

In the next patches, the xfrm core code will be extended to support
new type of offload - full offload. In that mode, both policy and state
should be specially configured in order to perform whole offloaded data
path.

Full offload takes care of encryption, decryption, encapsulation and
other operations with headers.

As this mode is new for XFRM policy flow, we can "start fresh" with flag
bits and release first and second bit for future use.

Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/net/xfrm.h        | 7 +++++++
 include/uapi/linux/xfrm.h | 6 ++++++
 net/xfrm/xfrm_device.c    | 3 +++
 net/xfrm/xfrm_user.c      | 2 ++
 4 files changed, 18 insertions(+)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 6e8fa98f786f..b4d487053dfd 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -131,12 +131,19 @@ enum {
 	XFRM_DEV_OFFLOAD_OUT,
 };
 
+enum {
+	XFRM_DEV_OFFLOAD_UNSPECIFIED,
+	XFRM_DEV_OFFLOAD_CRYPTO,
+	XFRM_DEV_OFFLOAD_FULL,
+};
+
 struct xfrm_dev_offload {
 	struct net_device	*dev;
 	netdevice_tracker	dev_tracker;
 	struct net_device	*real_dev;
 	unsigned long		offload_handle;
 	u8			dir : 2;
+	u8			type : 2;
 };
 
 struct xfrm_mode {
diff --git a/include/uapi/linux/xfrm.h b/include/uapi/linux/xfrm.h
index b1f3e6a8f11a..3b084246d610 100644
--- a/include/uapi/linux/xfrm.h
+++ b/include/uapi/linux/xfrm.h
@@ -519,6 +519,12 @@ struct xfrm_user_offload {
  */
 #define XFRM_OFFLOAD_IPV6	1
 #define XFRM_OFFLOAD_INBOUND	2
+/* Two bits above are relevant for state path only, while
+ * offload is used for both policy and state flows.
+ *
+ * In policy offload mode, they are free and can be safely reused.
+ */
+#define XFRM_OFFLOAD_FULL	4
 
 struct xfrm_userpolicy_default {
 #define XFRM_USERPOLICY_UNSPEC	0
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 637ca8838436..6d1124eb1ec8 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -270,12 +270,15 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 	else
 		xso->dir = XFRM_DEV_OFFLOAD_OUT;
 
+	xso->type = XFRM_DEV_OFFLOAD_CRYPTO;
+
 	err = dev->xfrmdev_ops->xdo_dev_state_add(x);
 	if (err) {
 		xso->dev = NULL;
 		xso->dir = 0;
 		xso->real_dev = NULL;
 		netdev_put(dev, &xso->dev_tracker);
+		xso->type = XFRM_DEV_OFFLOAD_UNSPECIFIED;
 
 		if (err != -EOPNOTSUPP)
 			return err;
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 2ff017117730..9c0aef815730 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -854,6 +854,8 @@ static int copy_user_offload(struct xfrm_dev_offload *xso, struct sk_buff *skb)
 	xuo->ifindex = xso->dev->ifindex;
 	if (xso->dir == XFRM_DEV_OFFLOAD_IN)
 		xuo->flags = XFRM_OFFLOAD_INBOUND;
+	if (xso->type == XFRM_DEV_OFFLOAD_FULL)
+		xuo->flags |= XFRM_OFFLOAD_FULL;
 
 	return 0;
 }
-- 
2.37.2


^ permalink raw reply related

* [syzbot] upstream boot error: general protection fault in netdev_queue_update_kobjects
From: syzbot @ 2022-08-16  8:35 UTC (permalink / raw)
  To: atenart, davem, edumazet, kuba, linux-kernel, netdev, pabeni,
	syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    568035b01cfb Linux 6.0-rc1
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1562e2d3080000
kernel config:  https://syzkaller.appspot.com/x/.config?x=3b9175e0879a7749
dashboard link: https://syzkaller.appspot.com/bug?extid=519ff9362a883a7a4865
compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+519ff9362a883a7a4865@syzkaller.appspotmail.com

input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
ACPI: button: Sleep Button [SLPF]
ACPI: \_SB_.LNKC: Enabled at IRQ 11
virtio-pci 0000:00:03.0: virtio_pci: leaving for legacy driver
ACPI: \_SB_.LNKD: Enabled at IRQ 10
virtio-pci 0000:00:04.0: virtio_pci: leaving for legacy driver
ACPI: \_SB_.LNKB: Enabled at IRQ 10
virtio-pci 0000:00:06.0: virtio_pci: leaving for legacy driver
virtio-pci 0000:00:07.0: virtio_pci: leaving for legacy driver
N_HDLC line discipline registered with maxframe=4096
Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud = 115200) is a 16550A
00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud = 115200) is a 16550A
Non-volatile memory driver v1.3
Linux agpgart interface v0.103
ACPI: bus type drm_connector registered
[drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0
[drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1
Console: switching to colour frame buffer device 128x48
platform vkms: [drm] fb0: vkmsdrmfb frame buffer device
usbcore: registered new interface driver udl
brd: module loaded
loop: module loaded
zram: Added device: zram0
null_blk: disk nullb0 created
null_blk: module loaded
Guest personality initialized and is inactive
VMCI host device registered (name=vmci, major=10, minor=120)
Initialized host personality
usbcore: registered new interface driver rtsx_usb
usbcore: registered new interface driver viperboard
usbcore: registered new interface driver dln2
usbcore: registered new interface driver pn533_usb
nfcsim 0.2 initialized
usbcore: registered new interface driver port100
usbcore: registered new interface driver nfcmrvl
Loading iSCSI transport class v2.0-870.
scsi host0: Virtio SCSI HBA
st: Version 20160209, fixed bufsize 32768, s/g segs 256
Rounding down aligned max_sectors from 4294967295 to 4294967288
db_root: cannot open: /etc/target
slram: not enough parameters.
ftl_cs: FTL header not found.
wireguard: WireGuard 1.0.0 loaded. See www.wireguard.com for information.
wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
eql: Equalizer2002: Simon Janes (simon@ncm.com) and David S. Miller (davem@redhat.com)
MACsec IEEE 802.1AE
tun: Universal TUN/TAP device driver, 1.6
general protection fault, probably for non-canonical address 0xffff000000000800: 0000 [#1] PREEMPT SMP KASAN
KASAN: maybe wild-memory-access in range [0xfff8200000004000-0xfff8200000004007]
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.0.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/22/2022
RIP: 0010:freelist_dereference mm/slub.c:347 [inline]
RIP: 0010:get_freepointer mm/slub.c:354 [inline]
RIP: 0010:get_freepointer_safe mm/slub.c:368 [inline]
RIP: 0010:slab_alloc_node mm/slub.c:3211 [inline]
RIP: 0010:slab_alloc mm/slub.c:3251 [inline]
RIP: 0010:kmem_cache_alloc_trace+0x164/0x3e0 mm/slub.c:3282
Code: 8b 51 08 48 8b 01 48 83 79 10 00 48 89 44 24 08 0f 84 bf 01 00 00 48 85 c0 0f 84 b6 01 00 00 48 8b 7d 00 8b 4d 28 40 f6 c7 0f <48> 8b 1c 08 0f 85 c2 01 00 00 48 8d 4a 08 65 48 0f c7 0f 0f 94 c0
RSP: 0000:ffffc90000067810 EFLAGS: 00010246
RAX: ffff000000000000 RBX: 0000000000000000 RCX: 0000000000000800
RDX: 0000000000002e30 RSI: 0000000000000dc0 RDI: 000000000003dce0
RBP: ffff888011842140 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000dc0 R14: 0000000000000a20 R15: 0000000000000dc0
FS:  0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88823ffff000 CR3: 000000000bc8e000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 kmalloc include/linux/slab.h:600 [inline]
 kzalloc include/linux/slab.h:733 [inline]
 kobject_uevent_env+0x230/0x1640 lib/kobject_uevent.c:524
 netdev_queue_add_kobject net/core/net-sysfs.c:1677 [inline]
 netdev_queue_update_kobjects+0x3d1/0x4e0 net/core/net-sysfs.c:1718
 register_queue_kobjects net/core/net-sysfs.c:1779 [inline]
 netdev_register_kobject+0x330/0x400 net/core/net-sysfs.c:2019
 register_netdevice+0xe01/0x1680 net/core/dev.c:10070
 virtnet_probe+0x1378/0x2f30 drivers/net/virtio_net.c:3929
 virtio_dev_probe+0x577/0x870 drivers/virtio/virtio.c:305
 call_driver_probe drivers/base/dd.c:530 [inline]
 really_probe+0x249/0xb90 drivers/base/dd.c:609
 __driver_probe_device+0x1df/0x4d0 drivers/base/dd.c:748
 driver_probe_device+0x4c/0x1a0 drivers/base/dd.c:778
 __driver_attach+0x223/0x550 drivers/base/dd.c:1150
 bus_for_each_dev+0x147/0x1d0 drivers/base/bus.c:301
 bus_add_driver+0x4c9/0x640 drivers/base/bus.c:618
 driver_register+0x220/0x3a0 drivers/base/driver.c:240
 virtio_net_driver_init+0x93/0xd2 drivers/net/virtio_net.c:4108
 do_one_initcall+0xfe/0x650 init/main.c:1296
 do_initcall_level init/main.c:1369 [inline]
 do_initcalls init/main.c:1385 [inline]
 do_basic_setup init/main.c:1404 [inline]
 kernel_init_freeable+0x6b1/0x73a init/main.c:1611
 kernel_init+0x1a/0x1d0 init/main.c:1500
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
 </TASK>
Modules linked in:
----------------
Code disassembly (best guess):
   0:	8b 51 08             	mov    0x8(%rcx),%edx
   3:	48 8b 01             	mov    (%rcx),%rax
   6:	48 83 79 10 00       	cmpq   $0x0,0x10(%rcx)
   b:	48 89 44 24 08       	mov    %rax,0x8(%rsp)
  10:	0f 84 bf 01 00 00    	je     0x1d5
  16:	48 85 c0             	test   %rax,%rax
  19:	0f 84 b6 01 00 00    	je     0x1d5
  1f:	48 8b 7d 00          	mov    0x0(%rbp),%rdi
  23:	8b 4d 28             	mov    0x28(%rbp),%ecx
  26:	40 f6 c7 0f          	test   $0xf,%dil
* 2a:	48 8b 1c 08          	mov    (%rax,%rcx,1),%rbx <-- trapping instruction
  2e:	0f 85 c2 01 00 00    	jne    0x1f6
  34:	48 8d 4a 08          	lea    0x8(%rdx),%rcx
  38:	65 48 0f c7 0f       	cmpxchg16b %gs:(%rdi)
  3d:	0f 94 c0             	sete   %al


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

^ permalink raw reply

* [ RFC  net-next 3/3] net/sched: act_api: update hw stats for tc action list
From: Oz Shlomo @ 2022-08-16  9:23 UTC (permalink / raw)
  To: netdev
  Cc: Jiri Pirko, Jamal Hadi Salim, Simon Horman, Baowen Zheng,
	Vlad Buslov, Ido Schimmel, Roi Dayan, Oz Shlomo
In-Reply-To: <20220816092338.12613-1-ozsh@nvidia.com>

Currently action hw stats are updated during the tc filter dump sequence.
HW actions are also updated during the tc action dump sequence.
However, tc action dump does not update the hw stats for actions created
during filter instantiation.

Use the existing hw action api to update hw stats during tc action dump.

Signed-off-by: Oz Shlomo <ozsh@nvidia.com>

---
 net/sched/act_api.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 817065aa2833..5d7b6e438085 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -301,14 +301,11 @@ static int tcf_action_offload_add(struct tc_action *action,
 	return tcf_action_offload_add_ex(action, extack, NULL, NULL);
 }
 
-int tcf_action_update_hw_stats(struct tc_action *action)
+static int tcf_action_set_hw_stats(struct tc_action *action)
 {
 	struct flow_offload_action fl_act = {};
 	int err;
 
-	if (!tc_act_in_hw(action))
-		return -EOPNOTSUPP;
-
 	err = offload_action_init(&fl_act, action, FLOW_ACT_STATS, NULL);
 	if (err)
 		return err;
@@ -330,6 +327,14 @@ int tcf_action_update_hw_stats(struct tc_action *action)
 
 	return 0;
 }
+
+int tcf_action_update_hw_stats(struct tc_action *action)
+{
+	if (!tc_act_in_hw(action))
+		return -EOPNOTSUPP;
+
+	return tcf_action_set_hw_stats(action);
+}
 EXPORT_SYMBOL(tcf_action_update_hw_stats);
 
 static int tcf_action_offload_del_ex(struct tc_action *action,
@@ -543,6 +548,8 @@ static int tcf_dump_walker(struct tcf_idrinfo *idrinfo, struct sk_buff *skb,
 			index--;
 			goto nla_put_failure;
 		}
+		tcf_action_set_hw_stats(p);
+
 		err = (act_flags & TCA_ACT_FLAG_TERSE_DUMP) ?
 			tcf_action_dump_terse(skb, p, true) :
 			tcf_action_dump_1(skb, p, 0, 0);
-- 
1.8.3.1


^ permalink raw reply related

* [syzbot] upstream boot error: BUG: unable to handle kernel paging request in __rhashtable_lookup
From: syzbot @ 2022-08-16  8:41 UTC (permalink / raw)
  To: bpf, davem, edumazet, fw, harshit.m.mogalapalli, kuba,
	linux-kernel, netdev, pabeni, syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    4a9350597aff Merge tag 'sound-fix-6.0-rc1' of git://git.ke..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=16259e35080000
kernel config:  https://syzkaller.appspot.com/x/.config?x=5d647c9572405910
dashboard link: https://syzkaller.appspot.com/bug?extid=6fccf62d7c237e1b4c85
compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+6fccf62d7c237e1b4c85@syzkaller.appspotmail.com

8021q: adding VLAN 0 to HW filter on device bond0
eql: remember to turn off Van-Jacobson compression on your slave devices
BUG: unable to handle page fault for address: ffffdbffffffffc8
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 11825067 P4D 11825067 PUD 0 
Oops: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 3187 Comm: dhcpcd Not tainted 5.19.0-syzkaller-14090-g4a9350597aff #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/22/2022
RIP: 0010:netlink_compare net/netlink/af_netlink.c:500 [inline]
RIP: 0010:__rhashtable_lookup.constprop.0+0x2ae/0x5e0 include/linux/rhashtable.h:601
Code: 01 44 89 e6 e8 83 19 e8 f9 45 84 e4 0f 85 8f 01 00 00 e8 f5 1c e8 f9 4d 8d 24 1f 49 8d bc 24 00 05 00 00 48 89 f8 48 c1 e8 03 <42> 0f b6 04 30 84 c0 74 08 3c 03 0f 8e 9c 02 00 00 45 8b ac 24 00
RSP: 0018:ffffc9000351f818 EFLAGS: 00010a02
RAX: 1fffdfffffffffc8 RBX: fffffffffffff940 RCX: 0000000000000000
RDX: ffff88807cc7c140 RSI: ffffffff879318cb RDI: fffefffffffffe40
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: fffefffffffff940
R13: ffffffff9152ccc0 R14: dffffc0000000000 R15: ffff000000000000
FS:  00007fbe244b5740(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffdbffffffffc8 CR3: 0000000021e67000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 rhashtable_lookup include/linux/rhashtable.h:638 [inline]
 rhashtable_lookup_fast include/linux/rhashtable.h:664 [inline]
 __netlink_lookup net/netlink/af_netlink.c:518 [inline]
 netlink_lookup+0x130/0x470 net/netlink/af_netlink.c:538
 netlink_getsockbyportid net/netlink/af_netlink.c:1154 [inline]
 netlink_unicast+0x244/0x7f0 net/netlink/af_netlink.c:1339
 netlink_sendmsg+0x917/0xe10 net/netlink/af_netlink.c:1921
 sock_sendmsg_nosec net/socket.c:714 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:734
 ____sys_sendmsg+0x6eb/0x810 net/socket.c:2482
 ___sys_sendmsg+0x110/0x1b0 net/socket.c:2536
 __sys_sendmsg+0xf3/0x1c0 net/socket.c:2565
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fbe245ad163
Code: 64 89 02 48 c7 c0 ff ff ff ff eb b7 66 2e 0f 1f 84 00 00 00 00 00 90 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 89 54 24 1c 48
RSP: 002b:00007ffd4a7738f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007fbe244b56c8 RCX: 00007fbe245ad163
RDX: 0000000000000000 RSI: 00007ffd4a787aa8 RDI: 0000000000000005
RBP: 0000000000000005 R08: 0000000000000000 R09: 00007ffd4a787aa8
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
R13: 00007ffd4a787aa8 R14: 0000000000000030 R15: 0000000000000001
 </TASK>
Modules linked in:
CR2: ffffdbffffffffc8
---[ end trace 0000000000000000 ]---
RIP: 0010:netlink_compare net/netlink/af_netlink.c:500 [inline]
RIP: 0010:__rhashtable_lookup.constprop.0+0x2ae/0x5e0 include/linux/rhashtable.h:601
Code: 01 44 89 e6 e8 83 19 e8 f9 45 84 e4 0f 85 8f 01 00 00 e8 f5 1c e8 f9 4d 8d 24 1f 49 8d bc 24 00 05 00 00 48 89 f8 48 c1 e8 03 <42> 0f b6 04 30 84 c0 74 08 3c 03 0f 8e 9c 02 00 00 45 8b ac 24 00
RSP: 0018:ffffc9000351f818 EFLAGS: 00010a02
RAX: 1fffdfffffffffc8 RBX: fffffffffffff940 RCX: 0000000000000000
RDX: ffff88807cc7c140 RSI: ffffffff879318cb RDI: fffefffffffffe40
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: fffefffffffff940
R13: ffffffff9152ccc0 R14: dffffc0000000000 R15: ffff000000000000
FS:  00007fbe244b5740(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffdbffffffffc8 CR3: 0000000021e67000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
----------------
Code disassembly (best guess):
   0:	01 44 89 e6          	add    %eax,-0x1a(%rcx,%rcx,4)
   4:	e8 83 19 e8 f9       	callq  0xf9e8198c
   9:	45 84 e4             	test   %r12b,%r12b
   c:	0f 85 8f 01 00 00    	jne    0x1a1
  12:	e8 f5 1c e8 f9       	callq  0xf9e81d0c
  17:	4d 8d 24 1f          	lea    (%r15,%rbx,1),%r12
  1b:	49 8d bc 24 00 05 00 	lea    0x500(%r12),%rdi
  22:	00
  23:	48 89 f8             	mov    %rdi,%rax
  26:	48 c1 e8 03          	shr    $0x3,%rax
* 2a:	42 0f b6 04 30       	movzbl (%rax,%r14,1),%eax <-- trapping instruction
  2f:	84 c0                	test   %al,%al
  31:	74 08                	je     0x3b
  33:	3c 03                	cmp    $0x3,%al
  35:	0f 8e 9c 02 00 00    	jle    0x2d7
  3b:	45                   	rex.RB
  3c:	8b                   	.byte 0x8b
  3d:	ac                   	lods   %ds:(%rsi),%al
  3e:	24 00                	and    $0x0,%al


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

^ permalink raw reply

* Re: [PATCH net-next 03/10] net/smc: allow confirm/delete rkey response deliver multiplex
From: Tony Lu @ 2022-08-16  8:17 UTC (permalink / raw)
  To: D. Wythe; +Cc: kgraul, wenjia, kuba, davem, netdev, linux-s390, linux-rdma
In-Reply-To: <f04de6958ebe9539227ae459837993e8c3e6ede6.1660152975.git.alibuda@linux.alibaba.com>

On Thu, Aug 11, 2022 at 01:47:34AM +0800, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> We know that all flows except confirm_rkey and delete_rkey are exclusive,
> confirm/delete rkey flows can run concurrently (local and remote).
> 
> Although the protocol allows, all flows are actually mutually exclusive
> in implementation, dues to waiting for LLC messages is in serial.
> 
> This aggravates the time for establishing or destroying a SMC-R
> connections, connections have to be queued in smc_llc_wait.
> 
> We use rtokens or rkey to correlate a confirm/delete rkey message
> with its response.
> 
> Before sending a message, we put context with rtokens or rkey into
> wait queue. When a response message received, we wakeup the context
> which with the same rtokens or rkey against the response message.
> 
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
>  net/smc/smc_llc.c | 174 +++++++++++++++++++++++++++++++++++++++++-------------
>  net/smc/smc_wr.c  |  10 ----
>  net/smc/smc_wr.h  |  10 ++++
>  3 files changed, 143 insertions(+), 51 deletions(-)
> 
> diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
> index 8134c15..b026df2 100644
> --- a/net/smc/smc_llc.c
> +++ b/net/smc/smc_llc.c
> @@ -200,6 +200,7 @@ struct smc_llc_msg_delete_rkey_v2 {	/* type 0x29 */
>  struct smc_llc_qentry {
>  	struct list_head list;
>  	struct smc_link *link;
> +	void *private;
>  	union smc_llc_msg msg;
>  };
>  
> @@ -479,19 +480,17 @@ int smc_llc_send_confirm_link(struct smc_link *link,
>  	return rc;
>  }
>  
> -/* send LLC confirm rkey request */
> -static int smc_llc_send_confirm_rkey(struct smc_link *send_link,
> -				     struct smc_buf_desc *rmb_desc)
> +/* build LLC confirm rkey request */
> +static int smc_llc_build_confirm_rkey_request(struct smc_link *send_link,
> +					      struct smc_buf_desc *rmb_desc,
> +					      struct smc_wr_tx_pend_priv **priv)
>  {
>  	struct smc_llc_msg_confirm_rkey *rkeyllc;
> -	struct smc_wr_tx_pend_priv *pend;
>  	struct smc_wr_buf *wr_buf;
>  	struct smc_link *link;
>  	int i, rc, rtok_ix;
>  
> -	if (!smc_wr_tx_link_hold(send_link))
> -		return -ENOLINK;
> -	rc = smc_llc_add_pending_send(send_link, &wr_buf, &pend);
> +	rc = smc_llc_add_pending_send(send_link, &wr_buf, priv);
>  	if (rc)
>  		goto put_out;
>  	rkeyllc = (struct smc_llc_msg_confirm_rkey *)wr_buf;
> @@ -521,25 +520,20 @@ static int smc_llc_send_confirm_rkey(struct smc_link *send_link,
>  		cpu_to_be64((uintptr_t)rmb_desc->cpu_addr) :
>  		cpu_to_be64((u64)sg_dma_address
>  			    (rmb_desc->sgt[send_link->link_idx].sgl));
> -	/* send llc message */
> -	rc = smc_wr_tx_send(send_link, pend);
>  put_out:
> -	smc_wr_tx_link_put(send_link);
>  	return rc;
>  }
>  
> -/* send LLC delete rkey request */
> -static int smc_llc_send_delete_rkey(struct smc_link *link,
> -				    struct smc_buf_desc *rmb_desc)
> +/* build LLC delete rkey request */
> +static int smc_llc_build_delete_rkey_request(struct smc_link *link,
> +					     struct smc_buf_desc *rmb_desc,
> +					     struct smc_wr_tx_pend_priv **priv)
>  {
>  	struct smc_llc_msg_delete_rkey *rkeyllc;
> -	struct smc_wr_tx_pend_priv *pend;
>  	struct smc_wr_buf *wr_buf;
>  	int rc;
>  
> -	if (!smc_wr_tx_link_hold(link))
> -		return -ENOLINK;
> -	rc = smc_llc_add_pending_send(link, &wr_buf, &pend);
> +	rc = smc_llc_add_pending_send(link, &wr_buf, priv);
>  	if (rc)
>  		goto put_out;
>  	rkeyllc = (struct smc_llc_msg_delete_rkey *)wr_buf;
> @@ -548,10 +542,7 @@ static int smc_llc_send_delete_rkey(struct smc_link *link,
>  	smc_llc_init_msg_hdr(&rkeyllc->hd, link->lgr, sizeof(*rkeyllc));
>  	rkeyllc->num_rkeys = 1;
>  	rkeyllc->rkey[0] = htonl(rmb_desc->mr[link->link_idx]->rkey);
> -	/* send llc message */
> -	rc = smc_wr_tx_send(link, pend);
>  put_out:
> -	smc_wr_tx_link_put(link);
>  	return rc;
>  }
>  
> @@ -2023,7 +2014,8 @@ static void smc_llc_rx_response(struct smc_link *link,
>  	case SMC_LLC_DELETE_RKEY:
>  		if (flowtype != SMC_LLC_FLOW_RKEY || flow->qentry)
>  			break;	/* drop out-of-flow response */
> -		goto assign;
> +		__wake_up(&link->lgr->llc_msg_waiter, TASK_NORMAL, 1, qentry);
> +		goto free;
>  	case SMC_LLC_CONFIRM_RKEY_CONT:
>  		/* not used because max links is 3 */
>  		break;
> @@ -2032,6 +2024,7 @@ static void smc_llc_rx_response(struct smc_link *link,
>  					   qentry->msg.raw.hdr.common.type);
>  		break;
>  	}
> +free:
>  	kfree(qentry);
>  	return;
>  assign:
> @@ -2191,25 +2184,98 @@ void smc_llc_link_clear(struct smc_link *link, bool log)
>  	cancel_delayed_work_sync(&link->llc_testlink_wrk);
>  }
>  
> +static int smc_llc_rkey_response_wake_function(struct wait_queue_entry *wq_entry,
> +					       unsigned int mode, int sync, void *key)
> +{
> +	struct smc_llc_qentry *except, *incoming;
> +	u8 except_llc_type;
> +
> +	/* not a rkey response */
> +	if (!key)
> +		return 0;
> +
> +	except = wq_entry->private;
> +	incoming = key;
> +
> +	except_llc_type = except->msg.raw.hdr.common.llc_type;
> +
> +	/* except LLC MSG TYPE mismatch */
> +	if (except_llc_type != incoming->msg.raw.hdr.common.llc_type)
> +		return 0;
> +
> +	switch (except_llc_type) {
> +	case SMC_LLC_CONFIRM_RKEY:
> +		if (memcmp(except->msg.confirm_rkey.rtoken, incoming->msg.confirm_rkey.rtoken,
> +			   sizeof(struct smc_rmb_rtoken) *
> +			   except->msg.confirm_rkey.rtoken[0].num_rkeys))
> +			return 0;
> +		break;
> +	case SMC_LLC_DELETE_RKEY:
> +		if (memcmp(except->msg.delete_rkey.rkey, incoming->msg.delete_rkey.rkey,
> +			   sizeof(__be32) * except->msg.delete_rkey.num_rkeys))
> +			return 0;
> +		break;
> +	default:
> +		panic("invalid except llc msg %d", except_llc_type);

Replace panic with pr_warn?

> +		return 0;
> +	}
> +
> +	/* match, save hdr */
> +	memcpy(&except->msg.raw.hdr, &incoming->msg.raw.hdr, sizeof(except->msg.raw.hdr));
> +
> +	wq_entry->private = except->private;
> +	return woken_wake_function(wq_entry, mode, sync, NULL);
> +}
> +
>  /* register a new rtoken at the remote peer (for all links) */
>  int smc_llc_do_confirm_rkey(struct smc_link *send_link,
>  			    struct smc_buf_desc *rmb_desc)
>  {
> +	long timeout = SMC_LLC_WAIT_TIME;

Reverse Christmas trees.

>  	struct smc_link_group *lgr = send_link->lgr;
> -	struct smc_llc_qentry *qentry = NULL;
> -	int rc = 0;
> +	struct smc_llc_qentry qentry;
> +	struct smc_wr_tx_pend *pend;
> +	struct smc_wr_tx_pend_priv *priv;
> +	DEFINE_WAIT_FUNC(wait, smc_llc_rkey_response_wake_function);
> +	int rc = 0, flags;

Ditto.

>  
> -	rc = smc_llc_send_confirm_rkey(send_link, rmb_desc);
> +	if (!smc_wr_tx_link_hold(send_link))
> +		return -ENOLINK;
> +
> +	rc = smc_llc_build_confirm_rkey_request(send_link, rmb_desc, &priv);
>  	if (rc)
>  		goto out;
> -	/* receive CONFIRM RKEY response from server over RoCE fabric */
> -	qentry = smc_llc_wait(lgr, send_link, SMC_LLC_WAIT_TIME,
> -			      SMC_LLC_CONFIRM_RKEY);
> -	if (!qentry || (qentry->msg.raw.hdr.flags & SMC_LLC_FLAG_RKEY_NEG))
> +
> +	pend = container_of(priv, struct smc_wr_tx_pend, priv);
> +	/* make a copy of send msg */
> +	memcpy(&qentry.msg.confirm_rkey, send_link->wr_tx_bufs[pend->idx].raw,
> +	       sizeof(qentry.msg.confirm_rkey));
> +
> +	qentry.private = wait.private;
> +	wait.private = &qentry;
> +
> +	add_wait_queue(&lgr->llc_msg_waiter, &wait);
> +
> +	/* send llc message */
> +	rc = smc_wr_tx_send(send_link, priv);
> +	smc_wr_tx_link_put(send_link);
> +	if (rc) {
> +		remove_wait_queue(&lgr->llc_msg_waiter, &wait);
> +		goto out;
> +	}
> +
> +	while (!signal_pending(current) && timeout) {
> +		timeout = wait_woken(&wait, TASK_INTERRUPTIBLE, timeout);
> +		if (qentry.msg.raw.hdr.flags & SMC_LLC_FLAG_RESP)
> +			break;
> +	}
> +
> +	remove_wait_queue(&lgr->llc_msg_waiter, &wait);
> +	flags = qentry.msg.raw.hdr.flags;
> +
> +	if (!(flags & SMC_LLC_FLAG_RESP) || flags & SMC_LLC_FLAG_RKEY_NEG)
>  		rc = -EFAULT;
>  out:
> -	if (qentry)
> -		smc_llc_flow_qentry_del(&lgr->llc_flow_lcl);
>  	return rc;
>  }
>  
> @@ -2217,26 +2283,52 @@ int smc_llc_do_confirm_rkey(struct smc_link *send_link,
>  int smc_llc_do_delete_rkey(struct smc_link_group *lgr,
>  			   struct smc_buf_desc *rmb_desc)
>  {
> -	struct smc_llc_qentry *qentry = NULL;
> +	long timeout = SMC_LLC_WAIT_TIME;
> +	struct smc_llc_qentry qentry;
> +	struct smc_wr_tx_pend *pend;
>  	struct smc_link *send_link;
> -	int rc = 0;
> +	struct smc_wr_tx_pend_priv *priv;

Reverse Christmas trees.

> +	DEFINE_WAIT_FUNC(wait, smc_llc_rkey_response_wake_function);
> +	int rc = 0, flags;
>  
>  	send_link = smc_llc_usable_link(lgr);
> -	if (!send_link)
> +	if (!send_link || !smc_wr_tx_link_hold(send_link))
>  		return -ENOLINK;
>  
> -	/* protected by llc_flow control */
> -	rc = smc_llc_send_delete_rkey(send_link, rmb_desc);
> +	rc = smc_llc_build_delete_rkey_request(send_link, rmb_desc, &priv);
>  	if (rc)
>  		goto out;
> -	/* receive DELETE RKEY response from server over RoCE fabric */
> -	qentry = smc_llc_wait(lgr, send_link, SMC_LLC_WAIT_TIME,
> -			      SMC_LLC_DELETE_RKEY);
> -	if (!qentry || (qentry->msg.raw.hdr.flags & SMC_LLC_FLAG_RKEY_NEG))
> +
> +	pend = container_of(priv, struct smc_wr_tx_pend, priv);
> +	/* make a copy of send msg */
> +	memcpy(&qentry.msg.delete_link, send_link->wr_tx_bufs[pend->idx].raw,
> +	       sizeof(qentry.msg.delete_link));
> +
> +	qentry.private = wait.private;
> +	wait.private = &qentry;
> +
> +	add_wait_queue(&lgr->llc_msg_waiter, &wait);
> +
> +	/* send llc message */
> +	rc = smc_wr_tx_send(send_link, priv);
> +	smc_wr_tx_link_put(send_link);
> +	if (rc) {
> +		remove_wait_queue(&lgr->llc_msg_waiter, &wait);
> +		goto out;
> +	}
> +
> +	while (!signal_pending(current) && timeout) {
> +		timeout = wait_woken(&wait, TASK_INTERRUPTIBLE, timeout);
> +		if (qentry.msg.raw.hdr.flags & SMC_LLC_FLAG_RESP)
> +			break;
> +	}
> +
> +	remove_wait_queue(&lgr->llc_msg_waiter, &wait);
> +	flags = qentry.msg.raw.hdr.flags;
> +
> +	if (!(flags & SMC_LLC_FLAG_RESP) || flags & SMC_LLC_FLAG_RKEY_NEG)
>  		rc = -EFAULT;
>  out:
> -	if (qentry)
> -		smc_llc_flow_qentry_del(&lgr->llc_flow_lcl);
>  	return rc;
>  }
>  
> diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
> index 26f8f24..52af94f 100644
> --- a/net/smc/smc_wr.c
> +++ b/net/smc/smc_wr.c
> @@ -37,16 +37,6 @@
>  static DEFINE_HASHTABLE(smc_wr_rx_hash, SMC_WR_RX_HASH_BITS);
>  static DEFINE_SPINLOCK(smc_wr_rx_hash_lock);
>  
> -struct smc_wr_tx_pend {	/* control data for a pending send request */
> -	u64			wr_id;		/* work request id sent */
> -	smc_wr_tx_handler	handler;
> -	enum ib_wc_status	wc_status;	/* CQE status */
> -	struct smc_link		*link;
> -	u32			idx;
> -	struct smc_wr_tx_pend_priv priv;
> -	u8			compl_requested;
> -};
> -
>  /******************************** send queue *********************************/
>  
>  /*------------------------------- completion --------------------------------*/
> diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
> index a54e90a..9946ed5 100644
> --- a/net/smc/smc_wr.h
> +++ b/net/smc/smc_wr.h
> @@ -46,6 +46,16 @@ struct smc_wr_rx_handler {
>  	u8			type;
>  };
>  
> +struct smc_wr_tx_pend {	/* control data for a pending send request */
> +	u64			wr_id;		/* work request id sent */
> +	smc_wr_tx_handler	handler;
> +	enum ib_wc_status	wc_status;	/* CQE status */
> +	struct smc_link		*link;
> +	u32			idx;
> +	struct smc_wr_tx_pend_priv priv;
> +	u8			compl_requested;
> +};
> +
>  /* Only used by RDMA write WRs.
>   * All other WRs (CDC/LLC) use smc_wr_tx_send handling WR_ID implicitly
>   */
> -- 
> 1.8.3.1

^ permalink raw reply

* Re: [PATCH 2/2] vDPA: conditionally read fields in virtio-net dev
From: Si-Wei Liu @ 2022-08-16  7:58 UTC (permalink / raw)
  To: Zhu, Lingshan, jasowang, mst
  Cc: virtualization, netdev, kvm, parav, xieyongji, gautam.dawar
In-Reply-To: <20e92551-a639-ec13-3d9c-13bb215422e1@intel.com>



On 8/15/2022 6:58 PM, Zhu, Lingshan wrote:
>
>
> On 8/16/2022 7:32 AM, Si-Wei Liu wrote:
>>
>>
>> On 8/15/2022 2:26 AM, Zhu Lingshan wrote:
>>> Some fields of virtio-net device config space are
>>> conditional on the feature bits, the spec says:
>>>
>>> "The mac address field always exists
>>> (though is only valid if VIRTIO_NET_F_MAC is set)"
>>>
>>> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ
>>> or VIRTIO_NET_F_RSS is set"
>>>
>>> "mtu only exists if VIRTIO_NET_F_MTU is set"
>>>
>>> so we should read MTU, MAC and MQ in the device config
>>> space only when these feature bits are offered.
>>>
>>> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are
>>> not set, the virtio device should have
>>> one queue pair as default value, so when userspace querying queue 
>>> pair numbers,
>>> it should return mq=1 than zero.
>>>
>>> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read
>>> MTU from the device config sapce.
>>> RFC894 <A Standard for the Transmission of IP Datagrams over 
>>> Ethernet Networks>
>>> says:"The minimum length of the data field of a packet sent over an
>>> Ethernet is 1500 octets, thus the maximum length of an IP datagram
>>> sent over an Ethernet is 1500 octets.  Implementations are encouraged
>>> to support full-length packets"
>> Noted there's a typo in the above "The *maximum* length of the data 
>> field of a packet sent over an Ethernet is 1500 octets ..." and the 
>> RFC was written 1984.
> the spec RFC894 says it is 1500, see <a 
> href="https://urldefense.com/v3/__https://www.rfc-editor.org/rfc/rfc894.txt__;!!ACWV5N9M2RV99hQ!MdgxZjw5sp5Qz-GKfwT1IWcw_L4Jo1-UekuJPFz1UrG3YuqirKz7P9ksdJFh1vB6zHJ7z8Q04fpT0-9jWXCtlWM$">https://www.rfc-editor.org/rfc/rfc894.txt</a>
>>
>> Apparently that is no longer true with the introduction of Jumbo size 
>> frame later in the 2000s. I'm not sure what is the point of mention 
>> this ancient RFC. It doesn't say default MTU of any Ethernet 
>> NIC/switch should be 1500 in either  case.
> This could be a larger number for sure, we are trying to find out the 
> min value for Ethernet here, to support 1500 octets, MTU should be 
> 1500 at least, so I assume 1500 should be the default value for MTU
>>
>>>
>>> virtio spec says:"The virtio network device is a virtual ethernet 
>>> card",
>> Right,
>>> so the default MTU value should be 1500 for virtio-net.
>> ... but it doesn't say the default is 1500. At least, not in explicit 
>> way. Why it can't be 1492 or even lower? In practice, if the network 
>> backend has a MTU higher than 1500, there's nothing wrong for guest 
>> to configure default MTU more than 1500.
> same as above
>>
>>>
>>> For MAC, the spec says:"If the VIRTIO_NET_F_MAC feature bit is set,
>>> the configuration space mac entry indicates the “physical” address
>>> of the network card, otherwise the driver would typically
>>> generate a random local MAC address." So there is no
>>> default MAC address if VIRTIO_NET_F_MAC not set.
>>>
>>> This commits introduces functions vdpa_dev_net_mtu_config_fill()
>>> and vdpa_dev_net_mac_config_fill() to fill MTU and MAC.
>>> It also fixes vdpa_dev_net_mq_config_fill() to report correct
>>> MQ when _F_MQ is not present.
>>>
>>> These functions should check devices features than driver
>>> features, and struct vdpa_device is not needed as a parameter
>>>
>>> The test & userspace tool output:
>>>
>>> Feature bit VIRTIO_NET_F_MTU, VIRTIO_NET_F_RSS, VIRTIO_NET_F_MQ
>>> and VIRTIO_NET_F_MAC can be mask out by hardcode.
>>>
>>> However, it is challenging to "disable" the related fields
>>> in the HW device config space, so let's just assume the values
>>> are meaningless if the feature bits are not set.
>>>
>>> Before this change, when feature bits for RSS, MQ, MTU and MAC
>>> are not set, iproute2 output:
>>> $vdpa vdpa0: mac 00:e8:ca:11:be:05 link up link_announce false mtu 1500
>>>    negotiated_features
>>>
>>> without this commit, function vdpa_dev_net_config_fill()
>>> reads all config space fields unconditionally, so let's
>>> assume the MAC and MTU are meaningless, and it checks
>>> MQ with driver_features, so we don't see max_vq_pairs.
>>>
>>> After applying this commit, when feature bits for
>>> MQ, RSS, MAC and MTU are not set,iproute2 output:
>>> $vdpa dev config show vdpa0
>>> vdpa0: link up link_announce false max_vq_pairs 1 mtu 1500
>>>    negotiated_features
>>>
>>> As explained above:
>>> Here is no MAC, because VIRTIO_NET_F_MAC is not set,
>>> and there is no default value for MAC. It shows
>>> max_vq_paris = 1 because even without MQ feature,
>>> a functional virtio-net must have one queue pair.
>>> mtu = 1500 is the default value as ethernet
>>> required.
>>>
>>> This commit also add supplementary comments for
>>> __virtio16_to_cpu(true, xxx) operations in
>>> vdpa_dev_net_config_fill() and vdpa_fill_stats_rec()
>>>
>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>> ---
>>>   drivers/vdpa/vdpa.c | 60 
>>> +++++++++++++++++++++++++++++++++++----------
>>>   1 file changed, 47 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
>>> index efb55a06e961..a74660b98979 100644
>>> --- a/drivers/vdpa/vdpa.c
>>> +++ b/drivers/vdpa/vdpa.c
>>> @@ -801,19 +801,44 @@ static int vdpa_nl_cmd_dev_get_dumpit(struct 
>>> sk_buff *msg, struct netlink_callba
>>>       return msg->len;
>>>   }
>>>   -static int vdpa_dev_net_mq_config_fill(struct vdpa_device *vdev,
>>> -                       struct sk_buff *msg, u64 features,
>>> +static int vdpa_dev_net_mq_config_fill(struct sk_buff *msg, u64 
>>> features,
>>>                          const struct virtio_net_config *config)
>>>   {
>>>       u16 val_u16;
>>>   -    if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0)
>>> -        return 0;
>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0 &&
>>> +        (features & BIT_ULL(VIRTIO_NET_F_RSS)) == 0)
>>> +        val_u16 = 1;
>>> +    else
>>> +        val_u16 = __virtio16_to_cpu(true, 
>>> config->max_virtqueue_pairs);
>>>   -    val_u16 = le16_to_cpu(config->max_virtqueue_pairs);
>>>       return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, val_u16);
>>>   }
>>>   +static int vdpa_dev_net_mtu_config_fill(struct sk_buff *msg, u64 
>>> features,
>>> +                    const struct virtio_net_config *config)
>>> +{
>>> +    u16 val_u16;
>>> +
>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
>>> +        val_u16 = 1500;
>> As said, there's no virtio spec defined value for MTU. Please leave 
>> this field out if feature VIRTIO_NET_F_MTU is not negotiated.
> same as above
>>> +    else
>>> +        val_u16 = __virtio16_to_cpu(true, config->mtu);
>>> +
>>> +    return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16);
>>> +}
>>> +
>>> +static int vdpa_dev_net_mac_config_fill(struct sk_buff *msg, u64 
>>> features,
>>> +                    const struct virtio_net_config *config)
>>> +{
>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MAC)) == 0)
>>> +        return 0;
>>> +    else
>>> +        return  nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR,
>>> +                sizeof(config->mac), config->mac);
>>> +}
>>> +
>>> +
>>>   static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, 
>>> struct sk_buff *msg)
>>>   {
>>>       struct virtio_net_config config = {};
>>> @@ -822,18 +847,16 @@ static int vdpa_dev_net_config_fill(struct 
>>> vdpa_device *vdev, struct sk_buff *ms
>>>         vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>>>   -    if (nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR, 
>>> sizeof(config.mac),
>>> -            config.mac))
>>> -        return -EMSGSIZE;
>>> +    /*
>>> +     * Assume little endian for now, userspace can tweak this for
>>> +     * legacy guest support.
>> You can leave it as a TODO for kernel (vdpa core limitation), but 
>> AFAIK there's nothing userspace needs to do to infer the endianness. 
>> IMHO it's the kernel's job to provide an abstraction rather than rely 
>> on userspace guessing it.
> we have discussed it in another thread, and this comment is suggested 
> by MST.
Can you provide the context or link? It shouldn't work like this, 
otherwise it is breaking uABI. E.g. how will a legacy/BE supporting 
kernel/device be backward compatible with older vdpa tool (which has 
knowledge of this endianness implication/assumption from day one)?

-Siwei

>>
>>> +     */
>>> +    val_u16 = __virtio16_to_cpu(true, config.status);
>>>         val_u16 = __virtio16_to_cpu(true, config.status);
>>>       if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_STATUS, val_u16))
>>>           return -EMSGSIZE;
>>>   -    val_u16 = __virtio16_to_cpu(true, config.mtu);
>>> -    if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16))
>>> -        return -EMSGSIZE;
>>> -
>>>       features_driver = vdev->config->get_driver_features(vdev);
>>>       if (nla_put_u64_64bit(msg, VDPA_ATTR_DEV_NEGOTIATED_FEATURES, 
>>> features_driver,
>>>                     VDPA_ATTR_PAD))
>>> @@ -846,7 +869,13 @@ static int vdpa_dev_net_config_fill(struct 
>>> vdpa_device *vdev, struct sk_buff *ms
>>>                     VDPA_ATTR_PAD))
>>>           return -EMSGSIZE;
>>>   -    return vdpa_dev_net_mq_config_fill(vdev, msg, 
>>> features_driver, &config);
>>> +    if (vdpa_dev_net_mac_config_fill(msg, features_device, &config))
>>> +        return -EMSGSIZE;
>>> +
>>> +    if (vdpa_dev_net_mtu_config_fill(msg, features_device, &config))
>>> +        return -EMSGSIZE;
>>> +
>>> +    return vdpa_dev_net_mq_config_fill(msg, features_device, &config);
>>>   }
>>>     static int
>>> @@ -914,6 +943,11 @@ static int vdpa_fill_stats_rec(struct 
>>> vdpa_device *vdev, struct sk_buff *msg,
>>>       }
>>>       vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>>>   +    /*
>>> +     * Assume little endian for now, userspace can tweak this for
>>> +     * legacy guest support.
>>> +     */
>>> +
>> Ditto.
> same as above
>
> Thanks
>>
>> Thanks,
>> -Siwei
>>>       max_vqp = __virtio16_to_cpu(true, config.max_virtqueue_pairs);
>>>       if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, max_vqp))
>>>           return -EMSGSIZE;
>>
>


^ permalink raw reply

* Re: [PATCH net-next 02/10] net/smc: fix SMC_CLC_DECL_ERR_REGRMB without smc_server_lgr_pending
From: Tony Lu @ 2022-08-16  7:58 UTC (permalink / raw)
  To: D. Wythe; +Cc: kgraul, wenjia, kuba, davem, netdev, linux-s390, linux-rdma
In-Reply-To: <01105a98ac715b6df6d019c0b6a9916814fdcff4.1660152975.git.alibuda@linux.alibaba.com>

On Thu, Aug 11, 2022 at 01:47:33AM +0800, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> As commit "net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error cause
> by server" mentioned, it works only when all connection creations are

This is a format issue, it's better to use:

commit 4940a1fdf31c ("net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB
error cause by server").

> completely protected by smc_server_lgr_pending lock, since we already
> cancel the lock, we need to re-fix the issues.
> 
> Fixes: 4940a1fdf31c ("net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error cause by server")
> 
^^^ This blank line is unnecessary.

> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
>  net/smc/af_smc.c   |  2 ++
>  net/smc/smc_core.c | 11 ++++++++---
>  net/smc/smc_core.h | 21 +++++++++++++++++++++
>  3 files changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index af4b0aa..c0842a9 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -2413,6 +2413,7 @@ static void smc_listen_work(struct work_struct *work)
>  		if (rc)
>  			goto out_unlock;
>  	}
> +	smc_conn_leave_rtoken_pending(new_smc, ini);
>  	smc_conn_save_peer_info(new_smc, cclc);
>  	smc_listen_out_connected(new_smc);
>  	SMC_STAT_SERV_SUCC_INC(sock_net(newclcsock->sk), ini);
> @@ -2422,6 +2423,7 @@ static void smc_listen_work(struct work_struct *work)
>  	if (ini->is_smcd)
>  		mutex_unlock(&smc_server_lgr_pending);
>  out_decl:
> +	smc_conn_leave_rtoken_pending(new_smc, ini);
>  	smc_listen_decline(new_smc, rc, ini ? ini->first_contact_local : 0,
>  			   proposal_version);
>  out_free:
> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> index a3338cc..61a3854 100644
> --- a/net/smc/smc_core.c
> +++ b/net/smc/smc_core.c
> @@ -2190,14 +2190,19 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
>  		     lgr->vlan_id == ini->vlan_id) &&
>  		    (role == SMC_CLNT || ini->is_smcd ||
>  		    (lgr->conns_num < SMC_RMBS_PER_LGR_MAX &&
> -		      !bitmap_full(lgr->rtokens_used_mask, SMC_RMBS_PER_LGR_MAX)))) {
> +		    (SMC_RMBS_PER_LGR_MAX -
> +			bitmap_weight(lgr->rtokens_used_mask, SMC_RMBS_PER_LGR_MAX)
> +				> atomic_read(&lgr->rtoken_pendings))))) {
>  			/* link group found */
>  			ini->first_contact_local = 0;
>  			conn->lgr = lgr;
>  			rc = smc_lgr_register_conn(conn, false);
>  			write_unlock_bh(&lgr->conns_lock);
> -			if (!rc && delayed_work_pending(&lgr->free_work))
> -				cancel_delayed_work(&lgr->free_work);
> +			if (!rc) {
> +				smc_conn_enter_rtoken_pending(smc, ini);
> +				if (delayed_work_pending(&lgr->free_work))
> +					cancel_delayed_work(&lgr->free_work);
> +			}
>  			break;
>  		}
>  		write_unlock_bh(&lgr->conns_lock);
> diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
> index 199f533..acc2869 100644
> --- a/net/smc/smc_core.h
> +++ b/net/smc/smc_core.h
> @@ -293,6 +293,9 @@ struct smc_link_group {
>  	struct rb_root		conns_all;	/* connection tree */
>  	rwlock_t		conns_lock;	/* protects conns_all */
>  	unsigned int		conns_num;	/* current # of connections */
> +	atomic_t		rtoken_pendings;/* number of connection that
> +						 * lgr assigned but no rtoken got yet
> +						 */
>  	unsigned short		vlan_id;	/* vlan id of link group */
>  
>  	struct list_head	sndbufs[SMC_RMBE_SIZES];/* tx buffers */
> @@ -603,6 +606,24 @@ struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
>  int smcr_nl_get_link(struct sk_buff *skb, struct netlink_callback *cb);
>  int smcd_nl_get_lgr(struct sk_buff *skb, struct netlink_callback *cb);
>  
> +static inline void smc_conn_enter_rtoken_pending(struct smc_sock *smc, struct smc_init_info *ini)
> +{
> +	struct smc_link_group *lgr;

Consider this: struct smc_link_group *lgr = smc->conn.lgr ?

> +
> +	lgr = smc->conn.lgr;
> +	if (lgr && !ini->first_contact_local)
> +		atomic_inc(&lgr->rtoken_pendings);
> +}
> +
> +static inline void smc_conn_leave_rtoken_pending(struct smc_sock *smc, struct smc_init_info *ini)
> +{
> +	struct smc_link_group *lgr;

Ditto.

> +
> +	lgr = smc->conn.lgr;
> +	if (lgr && !ini->first_contact_local)
> +		atomic_dec(&lgr->rtoken_pendings);
> +}
> +
>  void smcr_lnk_cluster_on_lnk_state(struct smc_link *lnk);
>  
>  static inline struct smc_link_group *smc_get_lgr(struct smc_link *link)
> -- 
> 1.8.3.1

^ permalink raw reply

* [RFC PATCH net-next] net: ngbe: Add build support for ngbe
From: Mengyuan Lou @ 2022-08-16  7:53 UTC (permalink / raw)
  To: netdev; +Cc: jiawenwu, Mengyuan Lou

Add build options and guidance doc.
Initialize pci device access for Wangxun Gigabit Ethernet devices.

Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
---
 .../device_drivers/ethernet/index.rst         |   1 +
 .../device_drivers/ethernet/wangxun/ngbe.rst  |  21 +++
 MAINTAINERS                                   |   4 +-
 drivers/net/ethernet/wangxun/Kconfig          |  13 ++
 drivers/net/ethernet/wangxun/Makefile         |   1 +
 drivers/net/ethernet/wangxun/ngbe/Makefile    |   9 +
 drivers/net/ethernet/wangxun/ngbe/ngbe.h      |  24 +++
 drivers/net/ethernet/wangxun/ngbe/ngbe_main.c | 173 ++++++++++++++++++
 drivers/net/ethernet/wangxun/ngbe/ngbe_type.h |  50 +++++
 9 files changed, 295 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst
 create mode 100644 drivers/net/ethernet/wangxun/ngbe/Makefile
 create mode 100644 drivers/net/ethernet/wangxun/ngbe/ngbe.h
 create mode 100644 drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
 create mode 100644 drivers/net/ethernet/wangxun/ngbe/ngbe_type.h

diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst
index 7f1777173abb..5196905582c5 100644
--- a/Documentation/networking/device_drivers/ethernet/index.rst
+++ b/Documentation/networking/device_drivers/ethernet/index.rst
@@ -52,6 +52,7 @@ Contents:
    ti/tlan
    toshiba/spider_net
    wangxun/txgbe
+   wangxun/ngbe
 
 .. only::  subproject and html
 
diff --git a/Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst b/Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst
new file mode 100644
index 000000000000..411768cfb257
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst
@@ -0,0 +1,21 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================================================
+Linux Base Driver for WangXun(R) Gigabit PCI Express Adapters
+================================================================
+
+WangXun Gigabit Linux driver.
+Copyright (c) 2019 - 2022 Beijing WangXun Technology Co., Ltd.
+
+
+Contents
+========
+
+- Support
+
+
+Support
+=======
+ If you have problems with the software or hardware, please contact our
+ customer support team via email at nic-support@net-swift.com or check our website
+ at https://www.net-swift.com
diff --git a/MAINTAINERS b/MAINTAINERS
index 386178699ae7..13fcf6352138 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -21677,9 +21677,11 @@ F:	drivers/input/tablet/wacom_serial4.c
 
 WANGXUN ETHERNET DRIVER
 M:	Jiawen Wu <jiawenwu@trustnetic.com>
+M:	Mengyuan Lou <mengyuanlou@net-swift.com>
+W:	https://www.net-swift.com
 L:	netdev@vger.kernel.org
 S:	Maintained
-F:	Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst
+F:	Documentation/networking/device_drivers/ethernet/wangxun/*
 F:	drivers/net/ethernet/wangxun/
 
 WATCHDOG DEVICE DRIVERS
diff --git a/drivers/net/ethernet/wangxun/Kconfig b/drivers/net/ethernet/wangxun/Kconfig
index b4a4fa0a58f8..f5d43d8c9629 100644
--- a/drivers/net/ethernet/wangxun/Kconfig
+++ b/drivers/net/ethernet/wangxun/Kconfig
@@ -16,6 +16,19 @@ config NET_VENDOR_WANGXUN
 
 if NET_VENDOR_WANGXUN
 
+config NGBE
+	tristate "Wangxun(R) GbE PCI Express adapters support"
+	depends on PCI
+	help
+	  This driver supports Wangxun(R) GbE PCI Express family of
+	  adapters.
+
+	  More specific information on configuring the driver is in
+	  <file:Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst>.
+
+	  To compile this driver as a module, choose M here. The module
+	  will be called ngbe.
+
 config TXGBE
 	tristate "Wangxun(R) 10GbE PCI Express adapters support"
 	depends on PCI
diff --git a/drivers/net/ethernet/wangxun/Makefile b/drivers/net/ethernet/wangxun/Makefile
index c34db1bead25..1193b4f738b8 100644
--- a/drivers/net/ethernet/wangxun/Makefile
+++ b/drivers/net/ethernet/wangxun/Makefile
@@ -4,3 +4,4 @@
 #
 
 obj-$(CONFIG_TXGBE) += txgbe/
+obj-$(CONFIG_TXGBE) += ngbe/
diff --git a/drivers/net/ethernet/wangxun/ngbe/Makefile b/drivers/net/ethernet/wangxun/ngbe/Makefile
new file mode 100644
index 000000000000..0baf75907496
--- /dev/null
+++ b/drivers/net/ethernet/wangxun/ngbe/Makefile
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2019 - 2022 Beijing WangXun Technology Co., Ltd.
+#
+# Makefile for the Wangxun(R) GbE PCI Express ethernet driver
+#
+
+obj-$(CONFIG_NGBE) += ngbe.o
+
+ngbe-objs := ngbe_main.o
diff --git a/drivers/net/ethernet/wangxun/ngbe/ngbe.h b/drivers/net/ethernet/wangxun/ngbe/ngbe.h
new file mode 100644
index 000000000000..f5fa6e5238cc
--- /dev/null
+++ b/drivers/net/ethernet/wangxun/ngbe/ngbe.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2019 - 2022 Beijing WangXun Technology Co., Ltd. */
+
+#ifndef _NGBE_H_
+#define _NGBE_H_
+
+#include "ngbe_type.h"
+
+#define NGBE_MAX_FDIR_INDICES		7
+
+#define NGBE_MAX_RX_QUEUES		(NGBE_MAX_FDIR_INDICES + 1)
+#define NGBE_MAX_TX_QUEUES		(NGBE_MAX_FDIR_INDICES + 1)
+
+/* board specific private data structure */
+struct ngbe_adapter {
+	u8 __iomem *io_addr;    /* Mainly for iounmap use */
+	/* OS defined structs */
+	struct net_device *netdev;
+	struct pci_dev *pdev;
+};
+
+extern char ngbe_driver_name[];
+
+#endif /* _NGBE_H_ */
diff --git a/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c b/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
new file mode 100644
index 000000000000..fa388d579bf7
--- /dev/null
+++ b/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
@@ -0,0 +1,173 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 - 2022 Beijing WangXun Technology Co., Ltd. */
+
+#include <linux/types.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/netdevice.h>
+#include <linux/string.h>
+#include <linux/aer.h>
+#include <linux/etherdevice.h>
+
+#include "ngbe.h"
+char ngbe_driver_name[] = "ngbe";
+
+/* ngbe_pci_tbl - PCI Device ID Table
+ *
+ * Wildcard entries (PCI_ANY_ID) should come last
+ * Last entry must be all 0s
+ *
+ * { Vendor ID, Device ID, SubVendor ID, SubDevice ID,
+ *   Class, Class Mask, private data (not used) }
+ */
+static const struct pci_device_id ngbe_pci_tbl[] = {
+	{ PCI_VDEVICE(WANGXUN, NGBE_DEV_ID_EM_WX1860AL_W), 0},
+	{ PCI_VDEVICE(WANGXUN, NGBE_DEV_ID_EM_WX1860A2), 0},
+	{ PCI_VDEVICE(WANGXUN, NGBE_DEV_ID_EM_WX1860A2S), 0},
+	{ PCI_VDEVICE(WANGXUN, NGBE_DEV_ID_EM_WX1860A4), 0},
+	{ PCI_VDEVICE(WANGXUN, NGBE_DEV_ID_EM_WX1860A4S), 0},
+	{ PCI_VDEVICE(WANGXUN, NGBE_DEV_ID_EM_WX1860AL2), 0},
+	{ PCI_VDEVICE(WANGXUN, NGBE_DEV_ID_EM_WX1860AL2S), 0},
+	{ PCI_VDEVICE(WANGXUN, NGBE_DEV_ID_EM_WX1860AL4), 0},
+	{ PCI_VDEVICE(WANGXUN, NGBE_DEV_ID_EM_WX1860AL4S), 0},
+	{ PCI_VDEVICE(WANGXUN, NGBE_DEV_ID_EM_WX1860LC), 0},
+	{ PCI_VDEVICE(WANGXUN, NGBE_DEV_ID_EM_WX1860A1), 0},
+	{ PCI_VDEVICE(WANGXUN, NGBE_DEV_ID_EM_WX1860A1L), 0},
+	/* required last entry */
+	{ .device = 0 }
+};
+
+static void ngbe_dev_shutdown(struct pci_dev *pdev, bool *enable_wake)
+{
+	struct ngbe_adapter *adapter = pci_get_drvdata(pdev);
+	struct net_device *netdev = adapter->netdev;
+
+	netif_device_detach(netdev);
+
+	pci_disable_device(pdev);
+}
+
+static void ngbe_shutdown(struct pci_dev *pdev)
+{
+	bool wake;
+
+	ngbe_dev_shutdown(pdev, &wake);
+
+	if (system_state == SYSTEM_POWER_OFF) {
+		pci_wake_from_d3(pdev, wake);
+		pci_set_power_state(pdev, PCI_D3hot);
+	}
+}
+
+/**
+ * ngbe_probe - Device Initialization Routine
+ * @pdev: PCI device information struct
+ * @ent: entry in ngbe_pci_tbl
+ *
+ * Returns 0 on success, negative on failure
+ *
+ * ngbe_probe initializes an adapter identified by a pci_dev structure.
+ * The OS initialization, configuring of the adapter private structure,
+ * and a hardware reset occur.
+ **/
+static int ngbe_probe(struct pci_dev *pdev,
+		      const struct pci_device_id __always_unused *ent)
+{
+	struct ngbe_adapter *adapter = NULL;
+	struct net_device *netdev;
+	int err;
+
+	err = pci_enable_device_mem(pdev);
+	if (err)
+		return err;
+
+	err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
+	if (err) {
+		dev_err(&pdev->dev,
+			"No usable DMA configuration, aborting\n");
+		goto err_pci_disable_dev;
+	}
+
+	err = pci_request_selected_regions(pdev,
+					   pci_select_bars(pdev, IORESOURCE_MEM),
+					   ngbe_driver_name);
+	if (err) {
+		dev_err(&pdev->dev,
+			"pci_request_selected_regions failed 0x%x\n", err);
+		goto err_pci_disable_dev;
+	}
+
+	pci_enable_pcie_error_reporting(pdev);
+	pci_set_master(pdev);
+
+	netdev = devm_alloc_etherdev_mqs(&pdev->dev,
+					 sizeof(struct ngbe_adapter),
+					 NGBE_MAX_TX_QUEUES,
+					 NGBE_MAX_RX_QUEUES);
+	if (!netdev) {
+		err = -ENOMEM;
+		goto err_pci_release_regions;
+	}
+
+	SET_NETDEV_DEV(netdev, &pdev->dev);
+
+	adapter = netdev_priv(netdev);
+	adapter->netdev = netdev;
+	adapter->pdev = pdev;
+
+	adapter->io_addr = devm_ioremap(&pdev->dev,
+					pci_resource_start(pdev, 0),
+					pci_resource_len(pdev, 0));
+	if (!adapter->io_addr) {
+		err = -EIO;
+		goto err_pci_release_regions;
+	}
+
+	netdev->features |= NETIF_F_HIGHDMA;
+
+	pci_set_drvdata(pdev, adapter);
+
+	return 0;
+
+err_pci_release_regions:
+	pci_disable_pcie_error_reporting(pdev);
+	pci_release_selected_regions(pdev,
+				     pci_select_bars(pdev, IORESOURCE_MEM));
+err_pci_disable_dev:
+	pci_disable_device(pdev);
+	return err;
+}
+
+/**
+ * ngbe_remove - Device Removal Routine
+ * @pdev: PCI device information struct
+ *
+ * ngbe_remove is called by the PCI subsystem to alert the driver
+ * that it should release a PCI device.  The could be caused by a
+ * Hot-Plug event, or because the driver is going to be removed from
+ * memory.
+ **/
+static void ngbe_remove(struct pci_dev *pdev)
+{
+	pci_release_selected_regions(pdev,
+				     pci_select_bars(pdev, IORESOURCE_MEM));
+
+	pci_disable_pcie_error_reporting(pdev);
+
+	pci_disable_device(pdev);
+}
+
+static struct pci_driver ngbe_driver = {
+	.name     = ngbe_driver_name,
+	.id_table = ngbe_pci_tbl,
+	.probe    = ngbe_probe,
+	.remove   = ngbe_remove,
+	.shutdown = ngbe_shutdown,
+};
+
+module_pci_driver(ngbe_driver);
+
+MODULE_DEVICE_TABLE(pci, ngbe_pci_tbl);
+MODULE_AUTHOR("Beijing WangXun Technology Co., Ltd, <software@net-swift.com>");
+MODULE_DESCRIPTION("WangXun(R) Gigabit PCI Express Network Driver");
+MODULE_LICENSE("GPL");
diff --git a/drivers/net/ethernet/wangxun/ngbe/ngbe_type.h b/drivers/net/ethernet/wangxun/ngbe/ngbe_type.h
new file mode 100644
index 000000000000..26e776c3539a
--- /dev/null
+++ b/drivers/net/ethernet/wangxun/ngbe/ngbe_type.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2019 - 2022 Beijing WangXun Technology Co., Ltd. */
+
+#ifndef _NGBE_TYPE_H_
+#define _NGBE_TYPE_H_
+
+#include <linux/types.h>
+#include <linux/netdevice.h>
+
+/************ NGBE_register.h ************/
+/* Vendor ID */
+#ifndef PCI_VENDOR_ID_WANGXUN
+#define PCI_VENDOR_ID_WANGXUN			0x8088
+#endif
+
+/* Device IDs */
+#define NGBE_DEV_ID_EM_WX1860AL_W		0x0100
+#define NGBE_DEV_ID_EM_WX1860A2			0x0101
+#define NGBE_DEV_ID_EM_WX1860A2S		0x0102
+#define NGBE_DEV_ID_EM_WX1860A4			0x0103
+#define NGBE_DEV_ID_EM_WX1860A4S		0x0104
+#define NGBE_DEV_ID_EM_WX1860AL2		0x0105
+#define NGBE_DEV_ID_EM_WX1860AL2S		0x0106
+#define NGBE_DEV_ID_EM_WX1860AL4		0x0107
+#define NGBE_DEV_ID_EM_WX1860AL4S		0x0108
+#define NGBE_DEV_ID_EM_WX1860LC			0x0109
+#define NGBE_DEV_ID_EM_WX1860A1			0x010a
+#define NGBE_DEV_ID_EM_WX1860A1L		0x010b
+
+/* Subsystem ID */
+#define NGBE_SUBID_M88E1512_SFP			0x0003
+#define NGBE_SUBID_OCP_CARD			0x0040
+#define NGBE_SUBID_LY_M88E1512_SFP		0x0050
+#define NGBE_SUBID_M88E1512_RJ45		0x0051
+#define NGBE_SUBID_M88E1512_MIX			0x0052
+#define NGBE_SUBID_YT8521S_SFP			0x0060
+#define NGBE_SUBID_INTERNAL_YT8521S_SFP		0x0061
+#define NGBE_SUBID_YT8521S_SFP_GPIO		0x0062
+#define NGBE_SUBID_INTERNAL_YT8521S_SFP_GPIO	0x0064
+#define NGBE_SUBID_LY_YT8521S_SFP		0x0070
+#define NGBE_SUBID_RGMII_FPGA			0x0080
+
+#define NGBE_OEM_MASK				0x00FF
+
+#define NGBE_NCSI_SUP				0x8000
+#define NGBE_NCSI_MASK				0x8000
+#define NGBE_WOL_SUP				0x4000
+#define NGBE_WOL_MASK				0x4000
+
+#endif /* _NGBE_TYPE_H_ */
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH v4 net-next 3/6] drivers: net: dsa: add locked fdb entry flag to drivers
From: netdev @ 2022-08-16  7:51 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Vladimir Oltean, davem, kuba, netdev, Andrew Lunn, Vivien Didelot,
	Florian Fainelli, Eric Dumazet, Paolo Abeni, Jiri Pirko,
	Ivan Vecera, Roopa Prabhu, Nikolay Aleksandrov, Shuah Khan,
	Daniel Borkmann, linux-kernel, bridge, linux-kselftest
In-Reply-To: <YvNcitNnyFxTw8bs@shredder>

On 2022-08-10 09:21, Ido Schimmel wrote:
>> >
>> > 1. It discards packets with matching DMAC, regardless of ingress port. I
>> > read the document [1] you linked to in a different reply and could not
>> > find anything against this approach, so this might be fine or at least
>> > not very significant.
>> >
>> > Note that this means that "locked" entries need to be notified to device
>> > drivers so that they will install a matching entry in the HW FDB.
>> 

I just want to be completely sure as what should be done in version 5 
with locked entries from the bridge, as - if I should implement it so 
that they are sent to all the drivers, and the drivers then ignore them 
if they don't need to take action? (for the mv88e6xxx driver, it does 
not need them and can ignore but other drivers might need.)

^ permalink raw reply

* Re: [RFC net-next 1/1] net/smc: SMC for inter-VM communication
From: Tony Lu @ 2022-08-16  7:48 UTC (permalink / raw)
  To: Alexandra Winter
  Cc: Stephen Hemminger, Matthew Rosato, kgraul, wenjia, davem,
	edumazet, kuba, pabeni, netdev, linux-s390, zmlcc, hans,
	zhiyuan2048, herongguang
In-Reply-To: <85ee3db7-a755-d527-026e-d59c72a60010@linux.ibm.com>

On Thu, Aug 04, 2022 at 11:27:35AM +0200, Alexandra Winter wrote:
> 
> 
> On 04.08.22 01:41, Stephen Hemminger wrote:
> > On Wed, 3 Aug 2022 16:27:54 -0400
> > Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> > 
> >> On 7/20/22 1:00 PM, Tony Lu wrote:
> >>> Hi all,
> >>>
> >>> # Background
> >>>
> >>> We (Alibaba Cloud) have already used SMC in cloud environment to
> >>> transparently accelerate TCP applications with ERDMA [1]. Nowadays,
> >>> there is a common scenario that deploy containers (which runtime is
> >>> based on lightweight virtual machine) on ECS (Elastic Compute Service),
> >>> and the containers may want to be scheduled on the same host in order to
> >>> get higher performance of network, such as AI, big data or other
> >>> scenarios that are sensitive with bandwidth and latency. Currently, the
> >>> performance of inter-VM is poor and CPU resource is wasted (see
> >>> #Benchmark virtio). This scenario has been discussed many times, but a
> >>> solution for a common scenario for applications is missing [2] [3] [4].
> >>>
> >>> # Design
> >>>
> >>> In inter-VM scenario, we use ivshmem (Inter-VM shared memory device)
> >>> which is modeled by QEMU [5]. With it, multiple VMs can access one
> >>> shared memory. This shared memory device is statically created by host
> >>> and shared to desired guests. The device exposes as a PCI BAR, and can
> >>> interrupt its peers (ivshmem-doorbell).
> >>>
> >>> In order to use ivshmem in SMC, we write a draft device driver as a
> >>> bridge between SMC and ivshmem PCI device. To make it easier, this
> >>> driver acts like a SMC-D device in order to fit in SMC without modifying
> >>> the code, which is named ivpci (see patch #1).
> >>>
> >>>    ┌───────────────────────────────────────┐
> >>>    │  ┌───────────────┐ ┌───────────────┐  │
> >>>    │  │      VM1      │ │      VM2      │  │
> >>>    │  │┌─────────────┐│ │┌─────────────┐│  │
> >>>    │  ││ Application ││ ││ Application ││  │
> >>>    │  │├─────────────┤│ │├─────────────┤│  │
> >>>    │  ││     SMC     ││ ││     SMC     ││  │
> >>>    │  │├─────────────┤│ │├─────────────┤│  │
> >>>    │  ││    ivpci    ││ ││    ivpci    ││  │
> >>>    │  └└─────────────┘┘ └└─────────────┘┘  │
> >>>    │        x  *               x  *        │
> >>>    │        x  ****************x* *        │
> >>>    │        x  xxxxxxxxxxxxxxxxx* *        │
> >>>    │        x  x                * *        │
> >>>    │  ┌───────────────┐ ┌───────────────┐  │
> >>>    │  │shared memories│ │ivshmem-server │  │
> >>>    │  └───────────────┘ └───────────────┘  │
> >>>    │                HOST A                 │
> >>>    └───────────────────────────────────────┘
> >>>     *********** Control flow (interrupt)
> >>>     xxxxxxxxxxx Data flow (memory access)
> >>>
> >>> Inside ivpci driver, it implements almost all the operations of SMC-D
> >>> device. It can be divided into two parts:
> >>>
> >>> - control flow, most of it is same with SMC-D, use ivshmem trigger
> >>>    interruptions in ivpci and process CDC flow.
> >>>
> >>> - data flow, the shared memory of each connection is one large region
> >>>    and divided into two part for local and remote RMB. Every writer
> >>>    syscall copies data to sndbuf and calls ISM's move_data() to move data
> >>>    to remote RMB in ivshmem and interrupt remote. And reader then
> >>>    receives interruption and check CDC message, consume data if cursor is
> >>>    updated.
> >>>
> >>> # Benchmark
> >>>
> >>> Current POC of ivpci is unstable and only works for single SMC
> >>> connection. Here is the brief data:
> >>>
> >>> Items         Latency (pingpong)    Throughput (64KB)
> >>> TCP (virtio)   19.3 us                3794.185 MBps
> >>> TCP (SR-IOV)   13.2 us                3948.792 MBps
> >>> SMC (ivshmem)   6.3 us               11900.269 MBps
> >>>
> >>> Test environments:
> >>>
> >>> - CPU Intel Xeon Platinum 8 core, mem 32 GiB
> >>> - NIC Mellanox CX4 with 2 VFs in two different guests
> >>> - using virsh to setup virtio-net + vhost
> >>> - using sockperf and single connection
> >>> - SMC + ivshmem throughput uses one-copy (userspace -> kernel copy)
> >>>    with intrusive modification of SMC (see patch #1), latency (pingpong)
> >>>    use two-copy (user -> kernel and move_data() copy, patch version).
> >>>
> >>> With the comparison, SMC with ivshmem gets 3-4x bandwidth and a half
> >>> latency.
> >>>
> >>> TCP + virtio is the most usage solution for guest, it gains lower
> >>> performance. Moreover, it consumes extra thread with full CPU core
> >>> occupied in host to transfer data, wastes more CPU resource. If the host
> >>> is very busy, the performance will be worse.
> >>>   
> >>
> >> Hi Tony,
> >>
> >> Quite interesting!  FWIW for s390x we are also looking at passthrough of 
> >> host ISM devices to enable SMC-D in QEMU guests:
> >> https://lore.kernel.org/kvm/20220606203325.110625-1-mjrosato@linux.ibm.com/
> >> https://lore.kernel.org/kvm/20220606203614.110928-1-mjrosato@linux.ibm.com/
> >>
> >> But seems to me an 'emulated ISM' of sorts could still be interesting 
> >> even on s390x e.g. for scenarios where host device passthrough is not 
> >> possible/desired.
> >>
> >> Out of curiosity I tried this ivpci module on s390x but the device won't 
> >> probe -- This is possibly an issue with the s390x PCI emulation layer in 
> >> QEMU, I'll have to look into that.
> >>
> >>> # Discussion
> >>>
> >>> This RFC and solution is still in early stage, so we want to come it up
> >>> as soon as possible and fully discuss with IBM and community. We have
> >>> some topics putting on the table:
> >>>
> >>> 1. SMC officially supports this scenario.
> >>>
> >>> SMC + ivshmem shows huge improvement when communicating inter VMs. SMC-D
> >>> and mocking ISM device might not be the official solution, maybe another
> >>> extension for SMC besides SMC-R and SMC-D. So we are wondering if SMC
> >>> would accept this idea to fix this scenario? Are there any other
> >>> possibilities?  
> >>
> >> I am curious about ivshmem and its current state though -- e.g. looking 
> >> around I see mention of v2 which you also referenced but don't see any 
> >> activity on it for a few years?  And as far as v1 ivshmem -- server "not 
> >> for production use", etc.
> >>
> >> Thanks,
> >> Matt
> >>
> >>>
> >>> 2. Implementation of SMC for inter-VM.
> >>>
> >>> SMC is used in container and cloud environment, maybe we can propose a
> >>> new device and new protocol if possible in these new scenarios to solve
> >>> this problem.
> >>>
> >>> 3. Standardize this new protocol and device.
> >>>
> >>> SMC-R has an open RFC 7609, so can this new device or protocol like
> >>> SMC-D can be standardized. There is a possible option that proposing a
> >>> new device model in QEMU + virtio ecosystem and SMC supports this
> >>> standard virtio device, like [6].
> >>>
> >>> If there are any problems, please point them out.
> >>>
> >>> Hope to hear from you, thank you.
> >>>
> >>> [1] https://lwn.net/Articles/879373/
> >>> [2] https://projectacrn.github.io/latest/tutorials/enable_ivshmem.html
> >>> [3] https://dl.acm.org/doi/10.1145/2847562
> >>> [4] https://hal.archives-ouvertes.fr/hal-00368622/document
> >>> [5] https://github.com/qemu/qemu/blob/master/docs/specs/ivshmem-spec.txt
> >>> [6] https://github.com/siemens/jailhouse/blob/master/Documentation/ivshmem-v2-specification.md
> >>>
> >>> Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>  
> > 
> > 
> > Also looks a lot like existing VSOCK which has transports for Virtio, HyperV and VMWare already.
> 
> To have it documented in this thread:
> As Wenjia Zhang <wenjia@linux.ibm.com> mentioned in
> https://lore.kernel.org/netdev/Yt9Xfv0bN0AGMdGP@TonyMac-Alibaba/t/#mcfaa50f7142f923d2b570dc19b70c73ceddc1270
> we are working on some patches to cleanup the interface between the ism device driver and the SMC-D protocol
> layer. They may simplify a project like the one described in this RFC. Stay tuned.

Thanks, I'll keep watching.

Tony Lu

^ permalink raw reply

* Re: [RFC net-next 1/1] net/smc: SMC for inter-VM communication
From: Tony Lu @ 2022-08-16  7:44 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Matthew Rosato, kgraul, wenjia, davem, edumazet, kuba, pabeni,
	netdev, linux-s390, zmlcc, hans, zhiyuan2048, herongguang
In-Reply-To: <20220803164119.5955b442@hermes.local>

On Wed, Aug 03, 2022 at 04:41:19PM -0700, Stephen Hemminger wrote:
> On Wed, 3 Aug 2022 16:27:54 -0400
> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> 
> > On 7/20/22 1:00 PM, Tony Lu wrote:
> > > Hi all,
> > > 
> > > # Background
> > > 
> > > We (Alibaba Cloud) have already used SMC in cloud environment to
> > > transparently accelerate TCP applications with ERDMA [1]. Nowadays,
> > > there is a common scenario that deploy containers (which runtime is
> > > based on lightweight virtual machine) on ECS (Elastic Compute Service),
> > > and the containers may want to be scheduled on the same host in order to
> > > get higher performance of network, such as AI, big data or other
> > > scenarios that are sensitive with bandwidth and latency. Currently, the
> > > performance of inter-VM is poor and CPU resource is wasted (see
> > > #Benchmark virtio). This scenario has been discussed many times, but a
> > > solution for a common scenario for applications is missing [2] [3] [4].
> > > 
> > > # Design
> > > 
> > > In inter-VM scenario, we use ivshmem (Inter-VM shared memory device)
> > > which is modeled by QEMU [5]. With it, multiple VMs can access one
> > > shared memory. This shared memory device is statically created by host
> > > and shared to desired guests. The device exposes as a PCI BAR, and can
> > > interrupt its peers (ivshmem-doorbell).
> > > 
> > > In order to use ivshmem in SMC, we write a draft device driver as a
> > > bridge between SMC and ivshmem PCI device. To make it easier, this
> > > driver acts like a SMC-D device in order to fit in SMC without modifying
> > > the code, which is named ivpci (see patch #1).
> > > 
> > >    ┌───────────────────────────────────────┐
> > >    │  ┌───────────────┐ ┌───────────────┐  │
> > >    │  │      VM1      │ │      VM2      │  │
> > >    │  │┌─────────────┐│ │┌─────────────┐│  │
> > >    │  ││ Application ││ ││ Application ││  │
> > >    │  │├─────────────┤│ │├─────────────┤│  │
> > >    │  ││     SMC     ││ ││     SMC     ││  │
> > >    │  │├─────────────┤│ │├─────────────┤│  │
> > >    │  ││    ivpci    ││ ││    ivpci    ││  │
> > >    │  └└─────────────┘┘ └└─────────────┘┘  │
> > >    │        x  *               x  *        │
> > >    │        x  ****************x* *        │
> > >    │        x  xxxxxxxxxxxxxxxxx* *        │
> > >    │        x  x                * *        │
> > >    │  ┌───────────────┐ ┌───────────────┐  │
> > >    │  │shared memories│ │ivshmem-server │  │
> > >    │  └───────────────┘ └───────────────┘  │
> > >    │                HOST A                 │
> > >    └───────────────────────────────────────┘
> > >     *********** Control flow (interrupt)
> > >     xxxxxxxxxxx Data flow (memory access)
> > > 
> > > Inside ivpci driver, it implements almost all the operations of SMC-D
> > > device. It can be divided into two parts:
> > > 
> > > - control flow, most of it is same with SMC-D, use ivshmem trigger
> > >    interruptions in ivpci and process CDC flow.
> > > 
> > > - data flow, the shared memory of each connection is one large region
> > >    and divided into two part for local and remote RMB. Every writer
> > >    syscall copies data to sndbuf and calls ISM's move_data() to move data
> > >    to remote RMB in ivshmem and interrupt remote. And reader then
> > >    receives interruption and check CDC message, consume data if cursor is
> > >    updated.
> > > 
> > > # Benchmark
> > > 
> > > Current POC of ivpci is unstable and only works for single SMC
> > > connection. Here is the brief data:
> > > 
> > > Items         Latency (pingpong)    Throughput (64KB)
> > > TCP (virtio)   19.3 us                3794.185 MBps
> > > TCP (SR-IOV)   13.2 us                3948.792 MBps
> > > SMC (ivshmem)   6.3 us               11900.269 MBps
> > > 
> > > Test environments:
> > > 
> > > - CPU Intel Xeon Platinum 8 core, mem 32 GiB
> > > - NIC Mellanox CX4 with 2 VFs in two different guests
> > > - using virsh to setup virtio-net + vhost
> > > - using sockperf and single connection
> > > - SMC + ivshmem throughput uses one-copy (userspace -> kernel copy)
> > >    with intrusive modification of SMC (see patch #1), latency (pingpong)
> > >    use two-copy (user -> kernel and move_data() copy, patch version).
> > > 
> > > With the comparison, SMC with ivshmem gets 3-4x bandwidth and a half
> > > latency.
> > > 
> > > TCP + virtio is the most usage solution for guest, it gains lower
> > > performance. Moreover, it consumes extra thread with full CPU core
> > > occupied in host to transfer data, wastes more CPU resource. If the host
> > > is very busy, the performance will be worse.
> > >   
> > 
> > Hi Tony,
> > 
> > Quite interesting!  FWIW for s390x we are also looking at passthrough of 
> > host ISM devices to enable SMC-D in QEMU guests:
> > https://lore.kernel.org/kvm/20220606203325.110625-1-mjrosato@linux.ibm.com/
> > https://lore.kernel.org/kvm/20220606203614.110928-1-mjrosato@linux.ibm.com/
> > 
> > But seems to me an 'emulated ISM' of sorts could still be interesting 
> > even on s390x e.g. for scenarios where host device passthrough is not 
> > possible/desired.
> > 
> > Out of curiosity I tried this ivpci module on s390x but the device won't 
> > probe -- This is possibly an issue with the s390x PCI emulation layer in 
> > QEMU, I'll have to look into that.
> > 
> > > # Discussion
> > > 
> > > This RFC and solution is still in early stage, so we want to come it up
> > > as soon as possible and fully discuss with IBM and community. We have
> > > some topics putting on the table:
> > > 
> > > 1. SMC officially supports this scenario.
> > > 
> > > SMC + ivshmem shows huge improvement when communicating inter VMs. SMC-D
> > > and mocking ISM device might not be the official solution, maybe another
> > > extension for SMC besides SMC-R and SMC-D. So we are wondering if SMC
> > > would accept this idea to fix this scenario? Are there any other
> > > possibilities?  
> > 
> > I am curious about ivshmem and its current state though -- e.g. looking 
> > around I see mention of v2 which you also referenced but don't see any 
> > activity on it for a few years?  And as far as v1 ivshmem -- server "not 
> > for production use", etc.
> > 
> > Thanks,
> > Matt
> > 
> > > 
> > > 2. Implementation of SMC for inter-VM.
> > > 
> > > SMC is used in container and cloud environment, maybe we can propose a
> > > new device and new protocol if possible in these new scenarios to solve
> > > this problem.
> > > 
> > > 3. Standardize this new protocol and device.
> > > 
> > > SMC-R has an open RFC 7609, so can this new device or protocol like
> > > SMC-D can be standardized. There is a possible option that proposing a
> > > new device model in QEMU + virtio ecosystem and SMC supports this
> > > standard virtio device, like [6].
> > > 
> > > If there are any problems, please point them out.
> > > 
> > > Hope to hear from you, thank you.
> > > 
> > > [1] https://lwn.net/Articles/879373/
> > > [2] https://projectacrn.github.io/latest/tutorials/enable_ivshmem.html
> > > [3] https://dl.acm.org/doi/10.1145/2847562
> > > [4] https://hal.archives-ouvertes.fr/hal-00368622/document
> > > [5] https://github.com/qemu/qemu/blob/master/docs/specs/ivshmem-spec.txt
> > > [6] https://github.com/siemens/jailhouse/blob/master/Documentation/ivshmem-v2-specification.md
> > > 
> > > Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>  
> 
> 
> Also looks a lot like existing VSOCK which has transports for Virtio, HyperV and VMWare already.

Yes, VSOCK, ivshmem with SMC are both approaches to communicate
across VMs, and widely used.

The main idea that combines SMC and ivshmem is to let SMC cover this
case transparently and get better performance. SMC can choose the best
approach for peers communication, such as local shared memory, ISM, RDMA
or even fallback to TCP, and this is transparent for applications.

Tony Lu

^ permalink raw reply

* Re: [PATCH v4 1/3] dt-bindings: net: ti: k3-am654-cpsw-nuss: Update bindings for J7200 CPSW5G
From: Krzysztof Kozlowski @ 2022-08-16  7:44 UTC (permalink / raw)
  To: Siddharth Vadapalli, davem, edumazet, kuba, pabeni, robh+dt,
	krzysztof.kozlowski+dt, linux, vladimir.oltean, grygorii.strashko,
	vigneshr, nsekhar
  Cc: netdev, devicetree, linux-kernel, kishon
In-Reply-To: <20220816060139.111934-2-s-vadapalli@ti.com>

On 16/08/2022 09:01, Siddharth Vadapalli wrote:
> Update bindings for TI K3 J7200 SoC which contains 5 ports (4 external
> ports) CPSW5G module and add compatible for it.
> 
> Changes made:
>     - Add new compatible ti,j7200-cpswxg-nuss for CPSW5G.
>     - Extend pattern properties for new compatible.
>     - Change maximum number of CPSW ports to 4 for new compatible.
> 
> Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
> ---
>  .../bindings/net/ti,k3-am654-cpsw-nuss.yaml     | 17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml b/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml
> index b8281d8be940..5366a367c387 100644
> --- a/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml
> +++ b/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml
> @@ -57,6 +57,7 @@ properties:
>        - ti,am654-cpsw-nuss
>        - ti,j721e-cpsw-nuss
>        - ti,am642-cpsw-nuss
> +      - ti,j7200-cpswxg-nuss

Keep some order in the list, so maybe before j721e.

>  
>    reg:
>      maxItems: 1
> @@ -110,7 +111,7 @@ properties:
>          const: 0
>  
>      patternProperties:
> -      port@[1-2]:
> +      "^port@[1-4]$":
>          type: object
>          description: CPSWxG NUSS external ports
>  
> @@ -119,7 +120,7 @@ properties:
>          properties:
>            reg:
>              minimum: 1
> -            maximum: 2
> +            maximum: 4
>              description: CPSW port number
>  
>            phys:
> @@ -151,6 +152,18 @@ properties:
>  
>      additionalProperties: false
>  
> +if:

This goes under allOf just before unevaluated/additionalProperties:false


Best regards,
Krzysztof

^ permalink raw reply

* Re: [PATCH V5 4/6] vDPA: !FEATURES_OK should not block querying device config space
From: Si-Wei Liu @ 2022-08-16  7:41 UTC (permalink / raw)
  To: mst, Zhu Lingshan
  Cc: virtualization, netdev, kvm, parav, xieyongji, gautam.dawar,
	jasowang
In-Reply-To: <20220812104500.163625-5-lingshan.zhu@intel.com>

Hi Michael,

I just noticed this patch got pulled to linux-next prematurely without 
getting consensus on code review, am not sure why. Hope it was just an 
oversight.

Unfortunately this introduced functionality regression to at least two 
cases so far as I see:

1. (bogus) VDPA_ATTR_DEV_NEGOTIATED_FEATURES are inadvertently exposed 
and displayed in "vdpa dev config show" before feature negotiation is 
done. Noted the corresponding features name shown in vdpa tool is called 
"negotiated_features" rather than "driver_features". I see in no way the 
intended change of the patch should break this user level expectation 
regardless of any spec requirement. Do you agree on this point?

2. There was also another implicit assumption that is broken by this 
patch. There could be a vdpa tool query of config via 
vdpa_dev_net_config_fill()->vdpa_get_config_unlocked() that races with 
the first vdpa_set_features() call from VMM e.g. QEMU. Since the 
S_FEATURES_OK blocking condition is removed, if the vdpa tool query 
occurs earlier than the first set_driver_features() call from VMM, the 
following code will treat the guest as legacy and then trigger an 
erroneous vdpa_set_features_unlocked(... , 0) call to the vdpa driver:

  374         /*
  375          * Config accesses aren't supposed to trigger before 
features are set.
  376          * If it does happen we assume a legacy guest.
  377          */
  378         if (!vdev->features_valid)
  379                 vdpa_set_features_unlocked(vdev, 0);
  380         ops->get_config(vdev, offset, buf, len);

Depending on vendor driver's implementation, L380 may either return 
invalid config data (or invalid endianness if on BE) or only config 
fields that are valid in legacy layout. What's more severe is that, vdpa 
tool query in theory shouldn't affect feature negotiation at all by 
making confusing calls to the device, but now it is possible with the 
patch. Fixing this would require more delicate work on the other paths 
involving the cf_lock reader/write semaphore.

Not sure what you plan to do next, post the fixes for both issues and 
get the community review? Or simply revert the patch in question? Let us 
know.

Thanks,
-Siwei

On 8/12/2022 3:44 AM, Zhu Lingshan wrote:
> Users may want to query the config space of a vDPA device,
> to choose a appropriate one for a certain guest. This means the
> users need to read the config space before FEATURES_OK, and
> the existence of config space contents does not depend on
> FEATURES_OK.
>
> The spec says:
> The device MUST allow reading of any device-specific configuration
> field before FEATURES_OK is set by the driver. This includes
> fields which are conditional on feature bits, as long as those
> feature bits are offered by the device.
>
> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> ---
>   drivers/vdpa/vdpa.c | 8 --------
>   1 file changed, 8 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
> index 6eb3d972d802..bf312d9c59ab 100644
> --- a/drivers/vdpa/vdpa.c
> +++ b/drivers/vdpa/vdpa.c
> @@ -855,17 +855,9 @@ vdpa_dev_config_fill(struct vdpa_device *vdev, struct sk_buff *msg, u32 portid,
>   {
>   	u32 device_id;
>   	void *hdr;
> -	u8 status;
>   	int err;
>   
>   	down_read(&vdev->cf_lock);
> -	status = vdev->config->get_status(vdev);
> -	if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {
> -		NL_SET_ERR_MSG_MOD(extack, "Features negotiation not completed");
> -		err = -EAGAIN;
> -		goto out;
> -	}
> -
>   	hdr = genlmsg_put(msg, portid, seq, &vdpa_nl_family, flags,
>   			  VDPA_CMD_DEV_CONFIG_GET);
>   	if (!hdr) {

^ permalink raw reply

* Re: [RFC net-next 1/1] net/smc: SMC for inter-VM communication
From: Tony Lu @ 2022-08-16  7:34 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: kgraul, wenjia, davem, edumazet, kuba, pabeni, netdev, linux-s390,
	zmlcc, hans, zhiyuan2048, herongguang
In-Reply-To: <0ccf9cc6-4916-7815-9ce2-990dc7884849@linux.ibm.com>

On Wed, Aug 03, 2022 at 04:27:54PM -0400, Matthew Rosato wrote:
> On 7/20/22 1:00 PM, Tony Lu wrote:
> > Hi all,
> > 
> > # Background
> > 
> > We (Alibaba Cloud) have already used SMC in cloud environment to
> > transparently accelerate TCP applications with ERDMA [1]. Nowadays,
> > there is a common scenario that deploy containers (which runtime is
> > based on lightweight virtual machine) on ECS (Elastic Compute Service),
> > and the containers may want to be scheduled on the same host in order to
> > get higher performance of network, such as AI, big data or other
> > scenarios that are sensitive with bandwidth and latency. Currently, the
> > performance of inter-VM is poor and CPU resource is wasted (see
> > #Benchmark virtio). This scenario has been discussed many times, but a
> > solution for a common scenario for applications is missing [2] [3] [4].
> > 
> > # Design
> > 
> > In inter-VM scenario, we use ivshmem (Inter-VM shared memory device)
> > which is modeled by QEMU [5]. With it, multiple VMs can access one
> > shared memory. This shared memory device is statically created by host
> > and shared to desired guests. The device exposes as a PCI BAR, and can
> > interrupt its peers (ivshmem-doorbell).
> > 
> > In order to use ivshmem in SMC, we write a draft device driver as a
> > bridge between SMC and ivshmem PCI device. To make it easier, this
> > driver acts like a SMC-D device in order to fit in SMC without modifying
> > the code, which is named ivpci (see patch #1).
> > 
> >    ┌───────────────────────────────────────┐
> >    │  ┌───────────────┐ ┌───────────────┐  │
> >    │  │      VM1      │ │      VM2      │  │
> >    │  │┌─────────────┐│ │┌─────────────┐│  │
> >    │  ││ Application ││ ││ Application ││  │
> >    │  │├─────────────┤│ │├─────────────┤│  │
> >    │  ││     SMC     ││ ││     SMC     ││  │
> >    │  │├─────────────┤│ │├─────────────┤│  │
> >    │  ││    ivpci    ││ ││    ivpci    ││  │
> >    │  └└─────────────┘┘ └└─────────────┘┘  │
> >    │        x  *               x  *        │
> >    │        x  ****************x* *        │
> >    │        x  xxxxxxxxxxxxxxxxx* *        │
> >    │        x  x                * *        │
> >    │  ┌───────────────┐ ┌───────────────┐  │
> >    │  │shared memories│ │ivshmem-server │  │
> >    │  └───────────────┘ └───────────────┘  │
> >    │                HOST A                 │
> >    └───────────────────────────────────────┘
> >     *********** Control flow (interrupt)
> >     xxxxxxxxxxx Data flow (memory access)
> > 
> > Inside ivpci driver, it implements almost all the operations of SMC-D
> > device. It can be divided into two parts:
> > 
> > - control flow, most of it is same with SMC-D, use ivshmem trigger
> >    interruptions in ivpci and process CDC flow.
> > 
> > - data flow, the shared memory of each connection is one large region
> >    and divided into two part for local and remote RMB. Every writer
> >    syscall copies data to sndbuf and calls ISM's move_data() to move data
> >    to remote RMB in ivshmem and interrupt remote. And reader then
> >    receives interruption and check CDC message, consume data if cursor is
> >    updated.
> > 
> > # Benchmark
> > 
> > Current POC of ivpci is unstable and only works for single SMC
> > connection. Here is the brief data:
> > 
> > Items         Latency (pingpong)    Throughput (64KB)
> > TCP (virtio)   19.3 us                3794.185 MBps
> > TCP (SR-IOV)   13.2 us                3948.792 MBps
> > SMC (ivshmem)   6.3 us               11900.269 MBps
> > 
> > Test environments:
> > 
> > - CPU Intel Xeon Platinum 8 core, mem 32 GiB
> > - NIC Mellanox CX4 with 2 VFs in two different guests
> > - using virsh to setup virtio-net + vhost
> > - using sockperf and single connection
> > - SMC + ivshmem throughput uses one-copy (userspace -> kernel copy)
> >    with intrusive modification of SMC (see patch #1), latency (pingpong)
> >    use two-copy (user -> kernel and move_data() copy, patch version).
> > 
> > With the comparison, SMC with ivshmem gets 3-4x bandwidth and a half
> > latency.
> > 
> > TCP + virtio is the most usage solution for guest, it gains lower
> > performance. Moreover, it consumes extra thread with full CPU core
> > occupied in host to transfer data, wastes more CPU resource. If the host
> > is very busy, the performance will be worse.
> > 
> 
> Hi Tony,
> 
> Quite interesting!  FWIW for s390x we are also looking at passthrough of
> host ISM devices to enable SMC-D in QEMU guests:
> https://lore.kernel.org/kvm/20220606203325.110625-1-mjrosato@linux.ibm.com/
> https://lore.kernel.org/kvm/20220606203614.110928-1-mjrosato@linux.ibm.com/
> 
> But seems to me an 'emulated ISM' of sorts could still be interesting even
> on s390x e.g. for scenarios where host device passthrough is not
> possible/desired.
> 
> Out of curiosity I tried this ivpci module on s390x but the device won't
> probe -- This is possibly an issue with the s390x PCI emulation layer in
> QEMU, I'll have to look into that.
> 
> > # Discussion
> > 
> > This RFC and solution is still in early stage, so we want to come it up
> > as soon as possible and fully discuss with IBM and community. We have
> > some topics putting on the table:
> > 
> > 1. SMC officially supports this scenario.
> > 
> > SMC + ivshmem shows huge improvement when communicating inter VMs. SMC-D
> > and mocking ISM device might not be the official solution, maybe another
> > extension for SMC besides SMC-R and SMC-D. So we are wondering if SMC
> > would accept this idea to fix this scenario? Are there any other
> > possibilities?
> 
> I am curious about ivshmem and its current state though -- e.g. looking
> around I see mention of v2 which you also referenced but don't see any
> activity on it for a few years?  And as far as v1 ivshmem -- server "not for
> production use", etc.
> 
> Thanks,
> Matt
> 

Hi Matt,

Glad to hear that. And sorry for my late reply.

Current POC version of ivpci for 'emulated ISM' is unstable and buggy,
and depends on ivshmem [1] and a hypervisor process [2], maybe there are
some issues blocks that. Please point them out, and I will fix them.

The current state of ivshmem, yes, v1 is still unstable and v2 isn't
active for a long time. This patch uses ivshmem to act as backend of
emulated ISM device for prototype. And ivshmem doesn't suit for this
scenario as a product-level solution, such as interruption, shared
memory memory allocation and live migration etc.

So we are planning to come up with a new device and the corresponding
driver. We are trying to extend virtio for shared memory, so that it can
use shared memory in s390 or others scenarios. We are working on the
details and send out the proposal later. And what about your opinions?

[1] https://www.qemu.org/docs/master/system/devices/ivshmem.html
[2] https://github.com/qemu/qemu/tree/master/contrib/ivshmem-server

Cheers,
Tony Lu

^ permalink raw reply

* Re: [PATCH net v3 2/2] bonding: 802.3ad: fix no transmission of LACPDUs
From: Hangbin Liu @ 2022-08-16  7:29 UTC (permalink / raw)
  To: Jonathan Toppins
  Cc: netdev, Jay Vosburgh, Veaceslav Falico, Andy Gospodarek,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	linux-kernel
In-Reply-To: <0639f1e3d366c5098d561a947fd416fa5277e7f4.1660572700.git.jtoppins@redhat.com>

On Mon, Aug 15, 2022 at 11:08:35AM -0400, Jonathan Toppins wrote:
> This is caused by the global variable ad_ticks_per_sec being zero as
> demonstrated by the reproducer script discussed below. This causes
> all timer values in __ad_timer_to_ticks to be zero, resulting
> in the periodic timer to never fire.
> 
> To reproduce:
> Run the script in
> `tools/testing/selftests/drivers/net/bonding/bond-break-lacpdu-tx.sh` which
> puts bonding into a state where it never transmits LACPDUs.
> 
> line 44: ip link add fbond type bond mode 4 miimon 200 \
>             xmit_hash_policy 1 ad_actor_sys_prio 65535 lacp_rate fast
> setting bond param: ad_actor_sys_prio
> given:
>     params.ad_actor_system = 0
> call stack:
>     bond_option_ad_actor_sys_prio()
>     -> bond_3ad_update_ad_actor_settings()
>        -> set ad.system.sys_priority = bond->params.ad_actor_sys_prio
>        -> ad.system.sys_mac_addr = bond->dev->dev_addr; because
>             params.ad_actor_system == 0
> results:
>      ad.system.sys_mac_addr = bond->dev->dev_addr
> 
> line 48: ip link set fbond address 52:54:00:3B:7C:A6
> setting bond MAC addr
> call stack:
>     bond->dev->dev_addr = new_mac
> 
> line 52: ip link set fbond type bond ad_actor_sys_prio 65535
> setting bond param: ad_actor_sys_prio
> given:
>     params.ad_actor_system = 0
> call stack:
>     bond_option_ad_actor_sys_prio()
>     -> bond_3ad_update_ad_actor_settings()
>        -> set ad.system.sys_priority = bond->params.ad_actor_sys_prio
>        -> ad.system.sys_mac_addr = bond->dev->dev_addr; because
>             params.ad_actor_system == 0
> results:
>      ad.system.sys_mac_addr = bond->dev->dev_addr
> 
> line 60: ip link set veth1-bond down master fbond
> given:
>     params.ad_actor_system = 0
>     params.mode = BOND_MODE_8023AD
>     ad.system.sys_mac_addr == bond->dev->dev_addr
> call stack:
>     bond_enslave
>     -> bond_3ad_initialize(); because first slave
>        -> if ad.system.sys_mac_addr != bond->dev->dev_addr
>           return
> results:
>      Nothing is run in bond_3ad_initialize() because dev_add equals
>      sys_mac_addr leaving the global ad_ticks_per_sec zero as it is
>      never initialized anywhere else.
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
> ---
> 
> Notes:
>     v2:
>      * split this fix from the reproducer
>     v3:
>      * rebased to latest net/master
> 
>  drivers/net/bonding/bond_3ad.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
> index d7fb33c078e8..957d30db6f95 100644
> --- a/drivers/net/bonding/bond_3ad.c
> +++ b/drivers/net/bonding/bond_3ad.c
> @@ -84,7 +84,8 @@ enum ad_link_speed_type {
>  static const u8 null_mac_addr[ETH_ALEN + 2] __long_aligned = {
>  	0, 0, 0, 0, 0, 0
>  };
> -static u16 ad_ticks_per_sec;
> +
> +static u16 ad_ticks_per_sec = 1000 / AD_TIMER_INTERVAL;
>  static const int ad_delta_in_ticks = (AD_TIMER_INTERVAL * HZ) / 1000;
>  
>  static const u8 lacpdu_mcast_addr[ETH_ALEN + 2] __long_aligned =
> -- 
> 2.31.1
> 

Acked-by: Hangbin Liu <liuhangbin@gmail.com>

^ permalink raw reply

* Re: [PATCH net v3 1/2] selftests: include bonding tests into the kselftest infra
From: Hangbin Liu @ 2022-08-16  7:28 UTC (permalink / raw)
  To: Jonathan Toppins; +Cc: netdev, Shuah Khan, linux-kernel, linux-kselftest
In-Reply-To: <3cb3b4ce2b761a1e1ac56b0505b9ea63dbf9e075.1660572700.git.jtoppins@redhat.com>

On Mon, Aug 15, 2022 at 11:08:34AM -0400, Jonathan Toppins wrote:
> This creates a test collection in drivers/net/bonding for bonding
> specific kernel selftests.
> 
> The first test is a reproducer that provisions a bond and given the
> specific order in how the ip-link(8) commands are issued the bond never
> transmits an LACPDU frame on any of its slaves.
> 
> Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
> ---
> 
> Notes:
>     v2:
>      * fully integrated the test into the kselftests infrastructure
>      * moved the reproducer to under
>        tools/testing/selftests/drivers/net/bonding
>      * reduced the test to its minimial amount and used ip-link(8) for
>        all bond interface configuration
>     v3:
>      * rebase to latest net/master
>      * remove `#set -x` requested by Hangbin
> 
>  MAINTAINERS                                   |  1 +
>  tools/testing/selftests/Makefile              |  1 +
>  .../selftests/drivers/net/bonding/Makefile    |  6 ++
>  .../net/bonding/bond-break-lacpdu-tx.sh       | 81 +++++++++++++++++++
>  .../selftests/drivers/net/bonding/config      |  1 +
>  .../selftests/drivers/net/bonding/settings    |  1 +
>  6 files changed, 91 insertions(+)
>  create mode 100644 tools/testing/selftests/drivers/net/bonding/Makefile
>  create mode 100755 tools/testing/selftests/drivers/net/bonding/bond-break-lacpdu-tx.sh
>  create mode 100644 tools/testing/selftests/drivers/net/bonding/config
>  create mode 100644 tools/testing/selftests/drivers/net/bonding/settings
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index f2d64020399b..e5fb14dc302d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3672,6 +3672,7 @@ F:	Documentation/networking/bonding.rst
>  F:	drivers/net/bonding/
>  F:	include/net/bond*
>  F:	include/uapi/linux/if_bonding.h
> +F:	tools/testing/selftests/net/bonding/
>  
>  BOSCH SENSORTEC BMA400 ACCELEROMETER IIO DRIVER
>  M:	Dan Robertson <dan@dlrobertson.com>
> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
> index 10b34bb03bc1..c2064a35688b 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -12,6 +12,7 @@ TARGETS += cpu-hotplug
>  TARGETS += damon
>  TARGETS += drivers/dma-buf
>  TARGETS += drivers/s390x/uvdevice
> +TARGETS += drivers/net/bonding
>  TARGETS += efivarfs
>  TARGETS += exec
>  TARGETS += filesystems
> diff --git a/tools/testing/selftests/drivers/net/bonding/Makefile b/tools/testing/selftests/drivers/net/bonding/Makefile
> new file mode 100644
> index 000000000000..ab6c54b12098
> --- /dev/null
> +++ b/tools/testing/selftests/drivers/net/bonding/Makefile
> @@ -0,0 +1,6 @@
> +# SPDX-License-Identifier: GPL-2.0
> +# Makefile for net selftests
> +
> +TEST_PROGS := bond-break-lacpdu-tx.sh
> +
> +include ../../../lib.mk
> diff --git a/tools/testing/selftests/drivers/net/bonding/bond-break-lacpdu-tx.sh b/tools/testing/selftests/drivers/net/bonding/bond-break-lacpdu-tx.sh
> new file mode 100755
> index 000000000000..47ab90596acb
> --- /dev/null
> +++ b/tools/testing/selftests/drivers/net/bonding/bond-break-lacpdu-tx.sh
> @@ -0,0 +1,81 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0
> +
> +# Regression Test:
> +#   Verify LACPDUs get transmitted after setting the MAC address of
> +#   the bond.
> +#
> +# https://bugzilla.redhat.com/show_bug.cgi?id=2020773
> +#
> +#       +---------+
> +#       | fab-br0 |
> +#       +---------+
> +#            |
> +#       +---------+
> +#       |  fbond  |
> +#       +---------+
> +#        |       |
> +#    +------+ +------+
> +#    |veth1 | |veth2 |
> +#    +------+ +------+
> +#
> +# We use veths instead of physical interfaces
> +
> +set -e
> +tmp=$(mktemp -q dump.XXXXXX)
> +cleanup() {
> +	ip link del fab-br0 >/dev/null 2>&1 || :
> +	ip link del fbond  >/dev/null 2>&1 || :
> +	ip link del veth1-bond  >/dev/null 2>&1 || :
> +	ip link del veth2-bond  >/dev/null 2>&1 || :
> +	modprobe -r bonding  >/dev/null 2>&1 || :
> +	rm -f -- ${tmp}
> +}
> +
> +trap cleanup 0 1 2
> +cleanup
> +sleep 1
> +
> +# create the bridge
> +ip link add fab-br0 address 52:54:00:3B:7C:A6 mtu 1500 type bridge \
> +	forward_delay 15
> +
> +# create the bond
> +ip link add fbond type bond mode 4 miimon 200 xmit_hash_policy 1 \
> +	ad_actor_sys_prio 65535 lacp_rate fast
> +
> +# set bond address
> +ip link set fbond address 52:54:00:3B:7C:A6
> +ip link set fbond up
> +
> +# set again bond sysfs parameters
> +ip link set fbond type bond ad_actor_sys_prio 65535
> +
> +# create veths
> +ip link add name veth1-bond type veth peer name veth1-end
> +ip link add name veth2-bond type veth peer name veth2-end
> +
> +# add ports
> +ip link set fbond master fab-br0
> +ip link set veth1-bond down master fbond
> +ip link set veth2-bond down master fbond
> +
> +# bring up
> +ip link set veth1-end up
> +ip link set veth2-end up
> +ip link set fab-br0 up
> +ip link set fbond up
> +ip addr add dev fab-br0 10.0.0.3
> +
> +tcpdump -n -i veth1-end -e ether proto 0x8809 >${tmp} 2>&1 &
> +sleep 15
> +pkill tcpdump >/dev/null 2>&1
> +rc=0
> +num=$(grep "packets captured" ${tmp} | awk '{print $1}')
> +if test "$num" -gt 0; then
> +	echo "PASS, captured ${num}"
> +else
> +	echo "FAIL"
> +	rc=1
> +fi
> +exit $rc
> diff --git a/tools/testing/selftests/drivers/net/bonding/config b/tools/testing/selftests/drivers/net/bonding/config
> new file mode 100644
> index 000000000000..dc1c22de3c92
> --- /dev/null
> +++ b/tools/testing/selftests/drivers/net/bonding/config
> @@ -0,0 +1 @@
> +CONFIG_BONDING=y
> diff --git a/tools/testing/selftests/drivers/net/bonding/settings b/tools/testing/selftests/drivers/net/bonding/settings
> new file mode 100644
> index 000000000000..867e118223cd
> --- /dev/null
> +++ b/tools/testing/selftests/drivers/net/bonding/settings
> @@ -0,0 +1 @@
> +timeout=60
> -- 
> 2.31.1
> 

Acked-by: Hangbin Liu <liuhangbin@gmail.com>

^ permalink raw reply

* Re: [RFC net-next 2/4] ynl: add the schema for the schemas
From: Johannes Berg @ 2022-08-16  7:21 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, edumazet, pabeni, sdf, jacob.e.keller, vadfed,
	jiri, dsahern, stephen, fw, linux-doc
In-Reply-To: <20220815174742.32b3611e@kernel.org>

On Mon, 2022-08-15 at 17:47 -0700, Jakub Kicinski wrote:
> On Mon, 15 Aug 2022 22:09:11 +0200 Johannes Berg wrote:
> > On Wed, 2022-08-10 at 19:23 -0700, Jakub Kicinski wrote:
> > > 
> > > +        attributes:
> > > +          description: List of attributes in the space.
> > > +          type: array
> > > +          items:
> > > +            type: object
> > > +            required: [ name, type ]
> > > +            additionalProperties: False
> > > +            properties:
> > > +              name:
> > > +                type: string
> > > +              type: &attr-type
> > > +                enum: [ unused, flag, binary, u8, u16, u32, u64, s32, s64,
> > > +                        nul-string, multi-attr, nest, array-nest, nest-type-value ]  
> > 
> > nest-type-value?
> 
> It's the incredibly inventive nesting format used in genetlink policy
> dumps where the type of the sub-attr(s there are actually two levels)
> carry a value (index of the policy and attribute) rather than denoting
> a type :S :S :S

Hmm, OK, in the policy dump (not specific to genetlink, btw, can be used
for any policy, but is only generically hooked up for genetlink), we
have

[policy_idx] = {
  [attr_idx] = {
    [NL_POLICY_TYPE_ATTR_...] = ...
  }
}

Is that what you mean?

I guess I never really thought about this format much from a description
POV, no need to have a policy since you simply iterate (for_each_attr)
when reading it, and don't really need to care about the attribute
index, at least.

For future reference, how would you suggest to have done this instead?


> > > +              description:
> > > +                description: Documentation of the attribute.
> > > +                type: string
> > > +              type-value:
> > > +                description: Name of the value extracted from the type of a nest-type-value attribute.
> > > +                type: array
> > > +                items:
> > > +                  type: string
> > > +              len:
> > > +                oneOf: [ { type: string }, { type: integer }]
> > > +              sub-type: *attr-type
> > > +              nested-attributes:
> > > +                description: Name of the space (sub-space) used inside the attribute.
> > > +                type: string  
> > 
> > Maybe expand that description a bit, it's not really accurate for
> > "array-nest"?
> 
> Slightly guessing but I think I know what you mean -> the value of the
> array is a nest with index as the type and then inside that is the
> entry of the array with its attributes <- and that's where the space is
> applied, not at the first nest level?

Right.

> Right, I should probably put that in the docs rather than the schema,
> array-nests are expected to strip one layer of nesting and put the
> value taken from the type (:D) into an @idx member of the struct
> representing the values of the array. Or at least that's what I do in
> the C codegen.

Well mostly you're not supposed to care about the 'value'/'type', I
guess?

> Not that any of these beautiful, precious formats should be encouraged
> going forward. multi-attr all the way!

multi-attr?

> > Do you mean the "name of the enumeration" or the "name of the
> > enumeration constant"? (per C99 concepts) I'm a bit confused? I guess
> > you mean the "name of the enumeration constant" though I agree most
> > people probably don't know the names from C99 (I had to look them up too
> > for the sake of being precise here ...)
> 
> I meant the type. I think. When u32 carries values of an enum.
> Enumeration constant for the attribute type is constructed from
> it's name and the prefix/suffix kludge.
> 
Indeed, I confused myself too ...

johannes


^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH net-next] igc: add xdp frags support to ndo_xdp_xmit
From: naamax.meir @ 2022-08-16  7:14 UTC (permalink / raw)
  To: Lorenzo Bianconi, netdev
  Cc: intel-wired-lan, jesse.brandeburg, jbrouer, edumazet, kuba,
	pabeni, davem, magnus.karlsson
In-Reply-To: <d8e3744f060ee11d5069bfd0f581f02d0ecb5e08.1657093744.git.lorenzo@kernel.org>

On 7/6/2022 10:54, Lorenzo Bianconi wrote:
> Add the capability to map non-linear xdp frames in XDP_TX and
> ndo_xdp_xmit callback.
> 
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
> Please note this patch is only compiled tested since I do not have
> access to a igc NIC
> ---
>   drivers/net/ethernet/intel/igc/igc_main.c | 128 ++++++++++++++--------
>   1 file changed, 83 insertions(+), 45 deletions(-)
Tested-by: Naama Meir <naamax.meir@linux.intel.com>

^ permalink raw reply

* Re: [RFC net-next 1/4] ynl: add intro docs for the concept
From: Johannes Berg @ 2022-08-16  7:07 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, edumazet, pabeni, sdf, jacob.e.keller, vadfed,
	jiri, dsahern, stephen, fw, linux-doc
In-Reply-To: <20220815173254.1809b44a@kernel.org>

On Mon, 2022-08-15 at 17:32 -0700, Jakub Kicinski wrote:
> On Mon, 15 Aug 2022 22:09:29 +0200 Johannes Berg wrote:
> > On Wed, 2022-08-10 at 19:23 -0700, Jakub Kicinski wrote:
> > > 
> > > +Note that attribute spaces do not themselves nest, nested attributes refer to their internal
> > > +space via a ``nested-attributes`` property, so the YAML spec does not resemble the format
> > > +of the netlink messages directly.  
> > 
> > I find this a bit ... confusing.
> > 
> > I think reading the other patch I know what you mean, but if I think of
> > this I think more of the policy declarations than the message itself,
> > and there we do refer to another policy?
> > 
> > Maybe reword a bit and say
> > 
> >    Note that attribute spaces do not themselves nest, nested attributes
> >    refer to their internal space via a ``nested-attributes`` property
> >    (the name of another or the same attribute space).
> > 
> > or something?
> 
> I think I put the cart before the horse in this looong sentence. How
> about:
> 
>   Note that the YAML spec is "flattened" and is not meant to visually
>   resemble the format of the netlink messages (unlike certain ad-hoc documentation
>   formats seen in kernel comments). In the YAML spec subordinate attribute sets
>   are not defined inline as a nest, but defined in a separate attribute set
>   referred to with a ``nested-attributes`` property of the container.
> 
Yeah, that makes sense.

Like I said, I was already thinking of the policy structures (and the
policy advertisement to userspace) which is exactly the same way, so I
didn't see this as much different - but of course it _is_ different from
the message itself.

johannes

^ permalink raw reply

* Re: [PATCH bpf-next v7 4/8] bpf: Introduce cgroup iter
From: Hao Luo @ 2022-08-16  6:52 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Yosry Ahmed, Alexei Starovoitov, Linux Kernel Mailing List, bpf,
	Cgroups, Networking, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Tejun Heo, Zefan Li, KP Singh, Johannes Weiner, Michal Hocko,
	Benjamin Tissoires, John Fastabend, Michal Koutny, Roman Gushchin,
	David Rientjes, Stanislav Fomichev, Shakeel Butt
In-Reply-To: <CAEf4BzZTrsBOPpCTFouoWZJG9yXkz8LZgLQrqDREAY-XdGb7ew@mail.gmail.com>

On Mon, Aug 15, 2022 at 9:13 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Aug 11, 2022 at 7:10 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Wed, Aug 10, 2022 at 8:10 PM Hao Luo <haoluo@google.com> wrote:
> > >
> > > On Tue, Aug 9, 2022 at 11:38 AM Hao Luo <haoluo@google.com> wrote:
> > > >
> > > > On Tue, Aug 9, 2022 at 9:23 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Mon, Aug 08, 2022 at 05:56:57PM -0700, Hao Luo wrote:
> > > > > > On Mon, Aug 8, 2022 at 5:19 PM Andrii Nakryiko
> > > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, Aug 5, 2022 at 2:49 PM Hao Luo <haoluo@google.com> wrote:
> > > > > > > >
> > > > > > > > Cgroup_iter is a type of bpf_iter. It walks over cgroups in four modes:
> > > > > > > >
> > > > > > > >  - walking a cgroup's descendants in pre-order.
> > > > > > > >  - walking a cgroup's descendants in post-order.
> > > > > > > >  - walking a cgroup's ancestors.
> > > > > > > >  - process only the given cgroup.
> > > > > > > >
> > > > [...]
> > > > > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > > > > > index 59a217ca2dfd..4d758b2e70d6 100644
> > > > > > > > --- a/include/uapi/linux/bpf.h
> > > > > > > > +++ b/include/uapi/linux/bpf.h
> > > > > > > > @@ -87,10 +87,37 @@ struct bpf_cgroup_storage_key {
> > > > > > > >         __u32   attach_type;            /* program attach type (enum bpf_attach_type) */
> > > > > > > >  };
> > > > > > > >
> > > > > > > > +enum bpf_iter_order {
> > > > > > > > +       BPF_ITER_ORDER_DEFAULT = 0,     /* default order. */
> > > > > > >
> > > > > > > why is this default order necessary? It just adds confusion (I had to
> > > > > > > look up source code to know what is default order). I might have
> > > > > > > missed some discussion, so if there is some very good reason, then
> > > > > > > please document this in commit message. But I'd rather not do some
> > > > > > > magical default order instead. We can set 0 to mean invalid and error
> > > > > > > out, or just do SELF as the very first value (and if user forgot to
> > > > > > > specify more fancy mode, they hopefully will quickly discover this in
> > > > > > > their testing).
> > > > > > >
> > > > > >
> > > > > > PRE/POST/UP are tree-specific orders. SELF applies on all iters and
> > > > > > yields only a single object. How does task_iter express a non-self
> > > > > > order? By non-self, I mean something like "I don't care about the
> > > > > > order, just scan _all_ the objects". And this "don't care" order, IMO,
> > > > > > may be the common case. I don't think everyone cares about walking
> > > > > > order for tasks. The DEFAULT is intentionally put at the first value,
> > > > > > so that if users don't care about order, they don't have to specify
> > > > > > this field.
> > > > > >
> > > > > > If that sounds valid, maybe using "UNSPEC" instead of "DEFAULT" is better?
> > > > >
> > > > > I agree with Andrii.
> > > > > This:
> > > > > +       if (order == BPF_ITER_ORDER_DEFAULT)
> > > > > +               order = BPF_ITER_DESCENDANTS_PRE;
> > > > >
> > > > > looks like an arbitrary choice.
> > > > > imo
> > > > > BPF_ITER_DESCENDANTS_PRE = 0,
> > > > > would have been more obvious. No need to dig into definition of "default".
> > > > >
> > > > > UNSPEC = 0
> > > > > is fine too if we want user to always be conscious about the order
> > > > > and the kernel will error if that field is not initialized.
> > > > > That would be my preference, since it will match the rest of uapi/bpf.h
> > > > >
> > > >
> > > > Sounds good. In the next version, will use
> > > >
> > > > enum bpf_iter_order {
> > > >         BPF_ITER_ORDER_UNSPEC = 0,
> > > >         BPF_ITER_SELF_ONLY,             /* process only a single object. */
> > > >         BPF_ITER_DESCENDANTS_PRE,       /* walk descendants in pre-order. */
> > > >         BPF_ITER_DESCENDANTS_POST,      /* walk descendants in post-order. */
> > > >         BPF_ITER_ANCESTORS_UP,          /* walk ancestors upward. */
> > > > };
> > > >
> > >
> > > Sigh, I find that having UNSPEC=0 and erroring out when seeing UNSPEC
> > > doesn't work. Basically, if we have a non-iter prog and a cgroup_iter
> > > prog written in the same source file, I can't use
> > > bpf_object__attach_skeleton to attach them. Because the default
> > > prog_attach_fn for iter initializes `order` to 0 (that is, UNSPEC),
> > > which is going to be rejected by the kernel. In order to make
> > > bpf_object__attach_skeleton work on cgroup_iter, I think I need to use
> > > the following
> > >
> > > enum bpf_iter_order {
>
> so first of all, this can't be called "bpf_iter_order" as it doesn't
> apply to BPF iterators in general. I think this should be called
> bpf_iter_cgroup_order (or maybe bpf_cgroup_iter_order) and if/when we
> add ability to iterate tasks within cgroups then we'll just reuse enum
> bpf_iter_cgroup_order as an extra parameter for task iterator.
>
> And with that future case in mind I do think that we should have 0
> being "UNSPEC" case.
>

Ok.

> > >         BPF_ITER_DESCENDANTS_PRE,       /* walk descendants in pre-order. */
> > >         BPF_ITER_DESCENDANTS_POST,      /* walk descendants in post-order. */
> > >         BPF_ITER_ANCESTORS_UP,          /* walk ancestors upward. */
> > >         BPF_ITER_SELF_ONLY,             /* process only a single object. */
> > > };
> > >
> > > So that when calling bpf_object__attach_skeleton() on cgroup_iter, a
> > > link can be generated and the generated link defaults to pre-order
> > > walk on the whole hierarchy. Is there a better solution?
> > >
>
> I was actually surprised that we specify these additional parameters
> at attach (LINK_CREATE) time, and not at bpf_iter_create() call time.
> It seems more appropriate to allow to specify such runtime parameters
> very late, when we create a specific instance of seq_file. But I guess
> this was done because one of the initial motivations for iterators was
> to be pinned in BPFFS and read as a file, so it was more convenient to
> store such parameters upfront at link creation time to keep
> BPF_OBJ_PIN simpler. I guess it makes sense, worst case you'll need to
> create multiple bpf_link files, one for each cgroup hierarchy you'd
> like to query with the same single BPF program.
>

Right. That was the design from the beginning.

> But I digress.
>
> As for not being able to auto-attach cgroup iterator. I think that's
> sort of expected and is in line with not being able to auto-attach
> cgroup programs, as you need cgroup FD at runtime. So even if you had
> some reasonable default order, you still would need to specify target
> cgroup (either through FD or ID).
>
> So... either don't do skeleton auto-attach,

This is not okay IMHO. It would be very inconvenient to use.

> or let's teach libbpf code
> to not auto-attach some iter types?
>

I'm thinking of two options:

1. Maybe we could add libbpf APIs for disabling auto-attach just like
prog autoload. Like:

bpf_program__set_auto_attach()
bpf_program__get_auto_attach(...)

2. In auto-attach, if the program's link is already set, attach will
be skipped. So, we could just manually attach, which specifies the
order, and set the link in skeleton. This way, no change in libbpf is
needed. Does this sound good to you?

> Alternatively, we could teach libbpf to parse some sort of cgroup
> iterator spec, like:
>
> SEC("iter/cgroup:/path/to/cgroup:descendants_pre")
>
> But this approach won't work for a bunch of other parameterized
> iterators (e.g., task iter, or map elem iter), so I'm hesitant about
> adding this to libbpf as a generic functionality.
>

Agree. Let's explore other options first.

> >
> > I think this can be handled by userspace? We can attach the
> > cgroup_iter separately first (and maybe we will need to set prog->link
> > as well) so that bpf_object__attach_skeleton() doesn't try to attach
> > it? I am following this pattern in the selftest in the final patch,
> > although I think I might be missing setting prog->link, so I am
> > wondering why there are no issues in that selftest which has the same
> > scenario that you are talking about.
> >
> > I think such a pattern will need to be used anyway if the users need
> > to set any non-default arguments for the cgroup_iter prog (like the
> > selftest), right? The only case we are discussing here is the case
> > where the user wants to attach the cgroup_iter with all default
> > options (in which case the default order will fail).
> > I agree that it might be inconvenient if the default/uninitialized
> > options don't work for cgroup_iter, but Alexei pointed out that this
> > matches other bpf uapis.
> >
> > My concern is that in the future we try to reuse enum bpf_iter_order
> > to set ordering for other iterators, and then the
> > default/uninitialized value (BPF_ITER_DESCENDANTS_PRE) doesn't make
> > sense for that iterator (e.g. not a tree). In this case, the same
> > problem that we are avoiding for cgroup_iter here will show up for
> > that iterator, and we can't easily change it at this point because
> > it's uapi.
>
> Yep, valid concern, I agree.
>

Andrii, other than auto-attach, do you have any concern for the rest
of this patchset?

> >
> >
> > > > and explicitly list the values acceptable by cgroup_iter, error out if
> > > > UNSPEC is detected.
> > > >
> > > > Also, following Andrii's comments, will change BPF_ITER_SELF to
> > > > BPF_ITER_SELF_ONLY, which does seem a little bit explicit in
> > > > comparison.
> > > >
> > > > > I applied the first 3 patches to ease respin.
> > > >
> > > > Thanks! This helps!
> > > >
> > > > > Thanks!

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox