Netdev List


* [PATCH ipsec 2/2] xfrm interface: ifname may be wrong in logs
From: Nicolas Dichtel @ 2019-07-10  7:45 UTC (permalink / raw)
  To: steffen.klassert, davem; +Cc: netdev, Nicolas Dichtel
In-Reply-To: <20190710074536.7505-1-nicolas.dichtel@6wind.com>

The ifname is copied when the interface is created, but is never updated
later. In fact, this property is used only in one error message, where the
netdevice pointer is available, thus let's use it.
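
To illustrate why a name copied at creation time goes stale, here is a minimal userspace sketch. It is not the kernel code: struct fake_dev, struct fake_priv and the fake_*() / log_name_*() helpers are hypothetical stand-ins for struct net_device, the xfrm private data and the code paths discussed above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN 16	/* stand-in for IFNAMSIZ */

/* Hypothetical stand-ins for struct net_device and the xfrm private data. */
struct fake_dev {
	char name[NAME_LEN];
};

struct fake_priv {
	char name_copy[NAME_LEN];	/* snapshot taken at creation time */
	struct fake_dev *dev;		/* always points at the live device */
};

/* Creation: the name is copied once and never refreshed afterwards. */
static void fake_create(struct fake_priv *xi, struct fake_dev *dev)
{
	assert(xi != NULL && dev != NULL);
	snprintf(xi->name_copy, sizeof(xi->name_copy), "%s", dev->name);
	xi->dev = dev;
}

/* Rename: only the device is updated, the snapshot is left alone. */
static void fake_rename(struct fake_dev *dev, const char *new_name)
{
	snprintf(dev->name, sizeof(dev->name), "%s", new_name);
}

/* Name a log message would print from the snapshot. */
static const char *log_name_stale(const struct fake_priv *xi)
{
	return xi->name_copy;
}

/* Name a log message would print from the device pointer. */
static const char *log_name_live(const struct fake_priv *xi)
{
	return xi->dev->name;
}
```

After fake_create() followed by fake_rename(), log_name_stale() still returns the old name while log_name_live() follows the rename, which is the behaviour the patch relies on when it switches the warning to dev->name.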

Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/net/xfrm.h        |  1 -
 net/xfrm/xfrm_interface.c | 10 +---------
 2 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index a2907873ed56..287e39753d94 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -988,7 +988,6 @@ static inline void xfrm_dst_destroy(struct xfrm_dst *xdst)
 void xfrm_dst_ifdown(struct dst_entry *dst, struct net_device *dev);
 
 struct xfrm_if_parms {
-	char name[IFNAMSIZ];	/* name of XFRM device */
 	int link;		/* ifindex of underlying L2 interface */
 	u32 if_id;		/* interface identifyer */
 };
diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index dfa5aebdec57..a60d391f7ebe 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -145,8 +145,6 @@ static int xfrmi_create(struct net_device *dev)
 	if (err < 0)
 		goto out;
 
-	strcpy(xi->p.name, dev->name);
-
 	dev_hold(dev);
 	xfrmi_link(xfrmn, xi);
 
@@ -294,7 +292,7 @@ xfrmi_xmit2(struct sk_buff *skb, struct net_device *dev, struct flowi *fl)
 	if (tdev == dev) {
 		stats->collisions++;
 		net_warn_ratelimited("%s: Local routing loop detected!\n",
-				     xi->p.name);
+				     dev->name);
 		goto tx_err_dst_release;
 	}
 
@@ -638,12 +636,6 @@ static int xfrmi_newlink(struct net *src_net, struct net_device *dev,
 	int err;
 
 	xfrmi_netlink_parms(data, &p);
-
-	if (!tb[IFLA_IFNAME])
-		return -EINVAL;
-
-	nla_strlcpy(p.name, tb[IFLA_IFNAME], IFNAMSIZ);
-
 	xi = xfrmi_locate(net, &p);
 	if (xi)
 		return -EEXIST;
-- 
2.21.0



* [PATCH ipsec 1/2] xfrm interface: avoid corruption on changelink
From: Nicolas Dichtel @ 2019-07-10  7:45 UTC (permalink / raw)
  To: steffen.klassert, davem; +Cc: netdev, Nicolas Dichtel
In-Reply-To: <20190710074536.7505-1-nicolas.dichtel@6wind.com>

The new parameters must not be stored in the netdev_priv() before
validation, it may corrupt the interface. Note also that if data is NULL,
only a memset() is done.

$ ip link add xfrm1 type xfrm dev lo if_id 1
$ ip link add xfrm2 type xfrm dev lo if_id 2
$ ip link set xfrm1 type xfrm dev lo if_id 2
RTNETLINK answers: File exists
$ ip -d link list dev xfrm1
5: xfrm1@lo: <NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/none 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0 minmtu 68 maxmtu 1500
    xfrm if_id 0x2 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

=> "if_id 0x2"
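
The "validate first, commit last" shape of the fix can be sketched outside the kernel. This is a simplified illustration under stated assumptions, not the patched function: struct fake_parms, struct fake_if, locate_other() and the two changelink_*() helpers are hypothetical, and the lookup is reduced to a linear scan that skips the interface being changed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct xfrm_if_parms and struct xfrm_if. */
struct fake_parms {
	int link;
	unsigned int if_id;
};

struct fake_if {
	struct fake_parms p;
};

/* Return an interface other than 'self' that already uses p->if_id. */
static struct fake_if *locate_other(struct fake_if *tbl, size_t n,
				    const struct fake_if *self,
				    const struct fake_parms *p)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (&tbl[i] != self && tbl[i].p.if_id == p->if_id)
			return &tbl[i];
	return NULL;
}

/* Pre-patch shape: write into the live parameters, then validate. */
static int changelink_before(struct fake_if *tbl, size_t n,
			     struct fake_if *self, unsigned int new_if_id)
{
	assert(self != NULL);
	self->p.if_id = new_if_id;	/* live state modified too early */
	if (locate_other(tbl, n, self, &self->p))
		return -EEXIST;		/* error, but the change has stuck */
	return 0;
}

/* Post-patch shape: fill a local copy, validate it, commit last. */
static int changelink_after(struct fake_if *tbl, size_t n,
			    struct fake_if *self, unsigned int new_if_id)
{
	struct fake_parms p;

	assert(self != NULL);
	p = self->p;
	p.if_id = new_if_id;
	if (locate_other(tbl, n, self, &p))
		return -EEXIST;		/* live state untouched */
	self->p = p;
	return 0;
}
```

With two interfaces using if_id 1 and 2, asking the first one to take if_id 2 fails with -EEXIST in both variants, but only changelink_before() leaves the first interface reporting if_id 2 afterwards, matching the output above.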

Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/xfrm/xfrm_interface.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 7dbe0c608df5..dfa5aebdec57 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -671,12 +671,12 @@ static int xfrmi_changelink(struct net_device *dev, struct nlattr *tb[],
 			   struct nlattr *data[],
 			   struct netlink_ext_ack *extack)
 {
-	struct xfrm_if *xi = netdev_priv(dev);
 	struct net *net = dev_net(dev);
+	struct xfrm_if_parms p;
+	struct xfrm_if *xi;
 
-	xfrmi_netlink_parms(data, &xi->p);
-
-	xi = xfrmi_locate(net, &xi->p);
+	xfrmi_netlink_parms(data, &p);
+	xi = xfrmi_locate(net, &p);
 	if (!xi) {
 		xi = netdev_priv(dev);
 	} else {
@@ -684,7 +684,7 @@ static int xfrmi_changelink(struct net_device *dev, struct nlattr *tb[],
 			return -EEXIST;
 	}
 
-	return xfrmi_update(xi, &xi->p);
+	return xfrmi_update(xi, &p);
 }
 
 static size_t xfrmi_get_size(const struct net_device *dev)
-- 
2.21.0



* [PATCH ipsec 0/2] xfrm interface: bug fix on changelink
From: Nicolas Dichtel @ 2019-07-10  7:45 UTC (permalink / raw)
  To: steffen.klassert, davem; +Cc: netdev

Here are two bug fixes found by code review. The first one avoids corruption of
existing xfrm interfaces and the second is a minor fix to an error message.

 include/net/xfrm.h        |  1 -
 net/xfrm/xfrm_interface.c | 20 ++++++--------------
 2 files changed, 6 insertions(+), 15 deletions(-)

Regards,
Nicolas


* Re: [PATCH net-next,v4 05/12] net: flow_offload: add list handling functions
From: Jiri Pirko @ 2019-07-10  7:36 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netdev, davem, thomas.lendacky, f.fainelli, ariel.elior,
	michael.chan, madalin.bucur, yisen.zhuang, salil.mehta,
	jeffrey.t.kirsher, tariqt, saeedm, jiri, idosch, jakub.kicinski,
	peppe.cavallaro, grygorii.strashko, andrew, vivien.didelot,
	alexandre.torgue, joabreu, linux-net-drivers, ogerlitz,
	Manish.Chopra, marcelo.leitner, mkubecek, venkatkumar.duvvuru,
	maxime.chevallier, cphealy, phil, netfilter-devel
In-Reply-To: <20190709205550.3160-6-pablo@netfilter.org>

Tue, Jul 09, 2019 at 10:55:43PM CEST, pablo@netfilter.org wrote:

[...]


>@@ -176,6 +176,7 @@ struct flow_block_cb *flow_block_cb_alloc(struct net *net, tc_setup_cb_t *cb,
> 	if (!block_cb)
> 		return ERR_PTR(-ENOMEM);
> 
>+	block_cb->net = net;
> 	block_cb->cb = cb;
> 	block_cb->cb_ident = cb_ident;
> 	block_cb->cb_priv = cb_priv;
>@@ -194,6 +195,22 @@ void flow_block_cb_free(struct flow_block_cb *block_cb)
> }
> EXPORT_SYMBOL(flow_block_cb_free);
> 
>+struct flow_block_cb *flow_block_cb_lookup(struct flow_block_offload *f,
>+					   tc_setup_cb_t *cb, void *cb_ident)
>+{
>+	struct flow_block_cb *block_cb;
>+
>+	list_for_each_entry(block_cb, f->driver_block_list, driver_list) {
>+		if (block_cb->net == f->net &&

I don't understand why you need net for this. You should have a list of
cbs per subsystem (tc/nft) and go over it here.

The clash of 2 subsystems is prevented later on by
flow_block_cb_is_busy().

Am I missing something?
If not, could you please remove use of net from flow_block_cb_alloc()
and from here and replace it by some shared flow structure holding the
cb list that would be used by both tc and nft?



>+		    block_cb->cb == cb &&
>+		    block_cb->cb_ident == cb_ident)
>+			return block_cb;
>+	}
>+
>+	return NULL;
>+}
>+EXPORT_SYMBOL(flow_block_cb_lookup);
>+

[...]


* Re: [PATCH v6 rdma-next 0/6] RDMA/qedr: Use the doorbell overflow recovery mechanism for RDMA
From: Gal Pressman @ 2019-07-10  7:32 UTC (permalink / raw)
  To: Michal Kalderon, ariel.elior, jgg, dledford
  Cc: linux-rdma, davem, netdev, sleybo
In-Reply-To: <20190709141735.19193-1-michal.kalderon@marvell.com>

On 09/07/2019 17:17, Michal Kalderon wrote:
> This patch series uses the doorbell overflow recovery mechanism
> introduced in
> commit 36907cd5cd72 ("qed: Add doorbell overflow recovery mechanism")
> for rdma ( RoCE and iWARP )
> 
> The first three patches modify the core code to contain helper
> functions for managing mmap_xa inserting, getting and freeing
> entries. The code was taken almost as is from the efa driver.
> There is still an open discussion on whether we should take
> this even further and make the entire mmap generic. Until a
> decision is made, I only created the database API and modified
> the efa and qedr driver to use it. The doorbell recovery code will be based
> on the common code.
> 
> Efa driver was compile tested only.

For the whole series:
Tested-by: Gal Pressman <galpress@amazon.com>


* Re: [PATCH] ipvs: Delete some unused space characters in Kconfig
From: Simon Horman @ 2019-07-10  7:29 UTC (permalink / raw)
  To: xianfengting221, Pablo Neira Ayuso
  Cc: wensong, ja, pablo, kadlec, fw, davem, netdev, lvs-devel,
	linux-kernel
In-Reply-To: <1562473009-29726-1-git-send-email-xianfengting221@163.com>

On Sun, Jul 07, 2019 at 12:16:49PM +0800, xianfengting221@163.com wrote:
> From: Hu Haowen <xianfengting221@163.com>
> 
> The space characters at the end of lines are always unused and
> not easy to find. This patch deleted some of them I have found
> in Kconfig.
> 
> Signed-off-by: Hu Haowen <xianfengting221@163.com>
> ---
> 
> This is my first patch to the Linux kernel, so please forgive
> me if anything went wrong.

Acked-by: Simon Horman <horms+renesas@verge.net.au>

Thanks Hu,

this looks good to me.

Pablo, please consider this for inclusion in nf-next.

> 
>  net/netfilter/ipvs/Kconfig | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
> index f6f1a0d..54afad5 100644
> --- a/net/netfilter/ipvs/Kconfig
> +++ b/net/netfilter/ipvs/Kconfig
> @@ -120,7 +120,7 @@ config	IP_VS_RR
>  
>  	  If you want to compile it in kernel, say Y. To compile it as a
>  	  module, choose M here. If unsure, say N.
> - 
> +
>  config	IP_VS_WRR
>  	tristate "weighted round-robin scheduling"
>  	---help---
> @@ -138,7 +138,7 @@ config	IP_VS_LC
>          tristate "least-connection scheduling"
>  	---help---
>  	  The least-connection scheduling algorithm directs network
> -	  connections to the server with the least number of active 
> +	  connections to the server with the least number of active
>  	  connections.
>  
>  	  If you want to compile it in kernel, say Y. To compile it as a
> @@ -193,7 +193,7 @@ config  IP_VS_LBLCR
>  	tristate "locality-based least-connection with replication scheduling"
>  	---help---
>  	  The locality-based least-connection with replication scheduling
> -	  algorithm is also for destination IP load balancing. It is 
> +	  algorithm is also for destination IP load balancing. It is
>  	  usually used in cache cluster. It differs from the LBLC scheduling
>  	  as follows: the load balancer maintains mappings from a target
>  	  to a set of server nodes that can serve the target. Requests for
> @@ -250,8 +250,8 @@ config	IP_VS_SED
>  	tristate "shortest expected delay scheduling"
>  	---help---
>  	  The shortest expected delay scheduling algorithm assigns network
> -	  connections to the server with the shortest expected delay. The 
> -	  expected delay that the job will experience is (Ci + 1) / Ui if 
> +	  connections to the server with the shortest expected delay. The
> +	  expected delay that the job will experience is (Ci + 1) / Ui if
>  	  sent to the ith server, in which Ci is the number of connections
>  	  on the ith server and Ui is the fixed service rate (weight)
>  	  of the ith server.
> -- 
> 2.7.4
> 
> 


* Re: [RFC PATCH net-next 0/3] net: batched receive in GRO path
From: Paolo Abeni @ 2019-07-10  7:27 UTC (permalink / raw)
  To: Edward Cree, David Miller; +Cc: netdev, Eric Dumazet
In-Reply-To: <7920e85c-439e-0622-46f8-0602cf37e306@solarflare.com>

Hi,

On Tue, 2019-07-09 at 20:27 +0100, Edward Cree wrote:
> Where not specified (as batch=), net.core.gro_normal_batch was set to 8.
> The net-next baseline used for these tests was commit 7d30a7f6424e.
> TCP 4 streams, GRO on: all results line rate (9.415Gbps)
> net-next: 210.3% cpu
> after #1: 181.5% cpu (-13.7%, p=0.031 vs net-next)
> after #3: 191.7% cpu (- 8.9%, p=0.102 vs net-next)
> TCP 4 streams, GRO off:
> after #1: 7.785 Gbps
> after #3: 8.387 Gbps (+ 7.7%, p=0.215 vs #1, but note *)
> TCP 1 stream, GRO on: all results line rate & ~200% cpu.
> TCP 1 stream, GRO off:
> after #1: 6.444 Gbps
> after #3: 7.363 Gbps (+14.3%, p=0.003 vs #1)
> batch=16: 7.199 Gbps
> batch= 4: 7.354 Gbps
> batch= 0: 5.899 Gbps
> TCP 100 RR, GRO off:
> net-next: 995.083 us
> after #1: 969.167 us (- 2.6%, p=0.204 vs net-next)
> after #3: 976.433 us (- 1.9%, p=0.254 vs net-next)
> 
> (*) These tests produced a mixture of line-rate and below-line-rate results,
>  meaning that statistically speaking the results were 'censored' by the
>  upper bound, and were thus not normally distributed, making a Welch t-test
>  mathematically invalid.  I therefore also calculated estimators according
>  to [2], which gave the following:
> after #1: 8.155 Gbps
> after #3: 8.716 Gbps (+ 6.9%, p=0.291 vs #1)
> (though my procedure for determining ν wasn't mathematically well-founded
>  either, so take that p-value with a grain of salt).

I'm toying with a patch similar to your 3/3 (most relevant difference
being the lack of a limit to the batch size), on top of ixgbe (which
sends all the pkts to the GRO engine), and I'm observing more
controversial results (UDP only):

* when a single rx queue is running, I see a just-above-noise
performance delta
* when multiple rx queues are running, I observe measurable regressions
(note: I use small pkts, still well under line rate even with multiple
rx queues)

I'll try to test your patch in the following days.

Side note: I think that in patch 3/3, it's necessary to add a call to
gro_normal_list() also inside napi_busy_loop().
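
To make the side note concrete, here is a toy model of why every poll path needs a final flush once packets are batched. It is only a sketch under assumptions: struct fake_napi and the fake_*() helpers are hypothetical, with fake_flush() standing in for gro_normal_list() and the batch field for gro_normal_batch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-NAPI list of packets awaiting delivery. */
struct fake_napi {
	int pending;	/* packets queued but not yet handed to the stack */
	int delivered;	/* packets handed to the stack so far */
	int batch;	/* flush threshold */
};

/* Hand every queued packet to the stack in one go. */
static void fake_flush(struct fake_napi *n)
{
	n->delivered += n->pending;
	n->pending = 0;
}

/* Queue one packet and flush once the threshold is reached. */
static void fake_rx_one(struct fake_napi *n)
{
	assert(n != NULL);
	if (++n->pending >= n->batch)
		fake_flush(n);
}

/* A poll path that ends with a flush leaves nothing behind. */
static void fake_poll_with_flush(struct fake_napi *n, int pkts)
{
	while (pkts-- > 0)
		fake_rx_one(n);
	fake_flush(n);
}

/* A poll path without the final flush can strand a partial batch. */
static void fake_poll_without_flush(struct fake_napi *n, int pkts)
{
	while (pkts-- > 0)
		fake_rx_one(n);
}
```

With a threshold of 8 and 11 received packets, the path without the final flush delivers 8 and leaves 3 queued until some later event flushes them, which is the situation a busy-polling path would be in if it never called the flush.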

Cheers,

Paolo






* [PATCH iproute2-rc 4/8] rdma: Add rdma statistic counter per-port auto mode support
From: Leon Romanovsky @ 2019-07-10  7:24 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Leon Romanovsky, netdev, David Ahern, Mark Zhang,
	RDMA mailing list
In-Reply-To: <20190710072455.9125-1-leon@kernel.org>

From: Mark Zhang <markz@mellanox.com>

With per-QP statistic counter support, a user is allowed to monitor
specific QP categories, which are bound to/unbound from counters that are
dynamically allocated/deallocated.

In per-port "auto" mode, QPs are bound to counters automatically
according to common criteria. For example, with a per-"type" (qp type)
scheme, all QPs in a process that have the same qp type are bound
automatically to a single counter.
Currently only "type" (qp type) is supported. Examples:

$ rdma statistic qp set link mlx5_2/1 auto type on
$ rdma statistic qp set link mlx5_2/1 auto off
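
The two commands above boil down to sending either a zero mask or a mask with the "type" bit set. A minimal sketch of that mapping follows; it is not the code from this patch, sketch_auto_mask() is a hypothetical helper, and SKETCH_MASK_QP_TYPE is a placeholder value rather than the constant from the kernel uapi header.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Placeholder bit for the "type" criterion (assumed value, see above). */
#define SKETCH_MASK_QP_TYPE 0x1u

/*
 * Translate the words following "auto" into a mask:
 *   "off"        -> 0 (auto mode disabled)
 *   "type" "on"  -> SKETCH_MASK_QP_TYPE
 * Returns 0 on success, -EINVAL for anything else.
 */
static int sketch_auto_mask(int argc, const char *const argv[], uint32_t *mask)
{
	assert(mask != NULL);

	if (argc == 1 && strcmp(argv[0], "off") == 0) {
		*mask = 0;
		return 0;
	}
	if (argc == 2 && strcmp(argv[0], "type") == 0 &&
	    strcmp(argv[1], "on") == 0) {
		*mask = SKETCH_MASK_QP_TYPE;
		return 0;
	}
	return -EINVAL;
}
```

Unlike the command tables in the patch, which fall back to a help or mode query when a word is missing, this sketch simply rejects incomplete input; that is a simplification made for the example.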

Signed-off-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 rdma/stat.c  | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 rdma/utils.c |  1 +
 2 files changed, 88 insertions(+)

diff --git a/rdma/stat.c b/rdma/stat.c
index 0c239851..ad1cc063 100644
--- a/rdma/stat.c
+++ b/rdma/stat.c
@@ -14,12 +14,17 @@ static int stat_help(struct rd *rd)
 	pr_out("       %s statistic OBJECT show\n", rd->filename);
 	pr_out("       %s statistic OBJECT show link [ DEV/PORT_INDEX ] [ FILTER-NAME FILTER-VALUE ]\n", rd->filename);
 	pr_out("       %s statistic OBJECT mode\n", rd->filename);
+	pr_out("       %s statistic OBJECT set COUNTER_SCOPE [DEV/PORT_INDEX] auto {CRITERIA | off}\n", rd->filename);
 	pr_out("where  OBJECT: = { qp }\n");
+	pr_out("       CRITERIA : = { type }\n");
+	pr_out("       COUNTER_SCOPE: = { link | dev }\n");
 	pr_out("Examples:\n");
 	pr_out("       %s statistic qp show\n", rd->filename);
 	pr_out("       %s statistic qp show link mlx5_2/1\n", rd->filename);
 	pr_out("       %s statistic qp mode\n", rd->filename);
 	pr_out("       %s statistic qp mode link mlx5_0\n", rd->filename);
+	pr_out("       %s statistic qp set link mlx5_2/1 auto type on\n", rd->filename);
+	pr_out("       %s statistic qp set link mlx5_2/1 auto off\n", rd->filename);
 
 	return 0;
 }
@@ -381,6 +386,87 @@ static int stat_qp_show(struct rd *rd)
 	return rd_exec_cmd(rd, cmds, "parameter");
 }
 
+static int stat_qp_set_link_auto_sendmsg(struct rd *rd, uint32_t mask)
+{
+	uint32_t seq;
+
+	rd_prepare_msg(rd, RDMA_NLDEV_CMD_STAT_SET,
+		       &seq, (NLM_F_REQUEST | NLM_F_ACK));
+
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_DEV_INDEX, rd->dev_idx);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_PORT_INDEX, rd->port_idx);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_STAT_RES, RDMA_NLDEV_ATTR_RES_QP);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_STAT_MODE,
+			 RDMA_COUNTER_MODE_AUTO);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_STAT_AUTO_MODE_MASK, mask);
+
+	return rd_sendrecv_msg(rd, seq);
+}
+
+static int stat_one_qp_set_link_auto_off(struct rd *rd)
+{
+	return stat_qp_set_link_auto_sendmsg(rd, 0);
+}
+
+static int stat_one_qp_set_auto_type_on(struct rd *rd)
+{
+	return stat_qp_set_link_auto_sendmsg(rd, RDMA_COUNTER_MASK_QP_TYPE);
+}
+
+static int stat_one_qp_set_link_auto_type(struct rd *rd)
+{
+	const struct rd_cmd cmds[] = {
+		{ NULL,		stat_help },
+		{ "on",		stat_one_qp_set_auto_type_on },
+		{ 0 }
+	};
+
+	return rd_exec_cmd(rd, cmds, "parameter");
+}
+
+static int stat_one_qp_set_link_auto(struct rd *rd)
+{
+	const struct rd_cmd cmds[] = {
+		{ NULL,		stat_one_qp_link_get_mode },
+		{ "off",	stat_one_qp_set_link_auto_off },
+		{ "type",	stat_one_qp_set_link_auto_type },
+		{ 0 }
+	};
+
+	return rd_exec_cmd(rd, cmds, "parameter");
+}
+
+static int stat_one_qp_set_link(struct rd *rd)
+{
+	const struct rd_cmd cmds[] = {
+		{ NULL,		stat_one_qp_link_get_mode },
+		{ "auto",	stat_one_qp_set_link_auto },
+		{ 0 }
+	};
+
+	if (!rd->port_idx)
+		return 0;
+
+	return rd_exec_cmd(rd, cmds, "parameter");
+}
+
+static int stat_qp_set_link(struct rd *rd)
+{
+	return rd_exec_link(rd, stat_one_qp_set_link, false);
+}
+
+static int stat_qp_set(struct rd *rd)
+{
+	const struct rd_cmd cmds[] = {
+		{ NULL,		stat_help },
+		{ "link",	stat_qp_set_link },
+		{ "help",	stat_help },
+		{ 0 }
+	};
+
+	return rd_exec_cmd(rd, cmds, "parameter");
+}
+
 static int stat_qp(struct rd *rd)
 {
 	const struct rd_cmd cmds[] =  {
@@ -388,6 +474,7 @@ static int stat_qp(struct rd *rd)
 		{ "show",	stat_qp_show },
 		{ "list",	stat_qp_show },
 		{ "mode",	stat_qp_get_mode },
+		{ "set",	stat_qp_set },
 		{ "help",	stat_help },
 		{ 0 }
 	};
diff --git a/rdma/utils.c b/rdma/utils.c
index 9c885ad7..aed1a3d0 100644
--- a/rdma/utils.c
+++ b/rdma/utils.c
@@ -445,6 +445,7 @@ static const enum mnl_attr_data_type nldev_policy[RDMA_NLDEV_ATTR_MAX] = {
 	[RDMA_NLDEV_ATTR_STAT_HWCOUNTER_ENTRY_VALUE] = MNL_TYPE_U64,
 	[RDMA_NLDEV_ATTR_STAT_MODE] = MNL_TYPE_U32,
 	[RDMA_NLDEV_ATTR_STAT_RES] = MNL_TYPE_U32,
+	[RDMA_NLDEV_ATTR_STAT_AUTO_MODE_MASK] = MNL_TYPE_U32,
 };
 
 int rd_attr_check(const struct nlattr *attr, int *typep)
-- 
2.20.1



* [PATCH iproute2-rc 8/8] rdma: Document counter statistic
From: Leon Romanovsky @ 2019-07-10  7:24 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Leon Romanovsky, netdev, David Ahern, Mark Zhang,
	RDMA mailing list
In-Reply-To: <20190710072455.9125-1-leon@kernel.org>

From: Mark Zhang <markz@mellanox.com>

Add documentation for accessing the QP counters, including binding/unbinding
a QP to/from a counter manually or automatically, and dumping counter statistics.

Signed-off-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 man/man8/rdma-dev.8       |   1 +
 man/man8/rdma-link.8      |   1 +
 man/man8/rdma-resource.8  |   1 +
 man/man8/rdma-statistic.8 | 167 ++++++++++++++++++++++++++++++++++++++
 man/man8/rdma.8           |   7 +-
 5 files changed, 176 insertions(+), 1 deletion(-)
 create mode 100644 man/man8/rdma-statistic.8

diff --git a/man/man8/rdma-dev.8 b/man/man8/rdma-dev.8
index 38e34b3b..e77e7cd0 100644
--- a/man/man8/rdma-dev.8
+++ b/man/man8/rdma-dev.8
@@ -77,6 +77,7 @@ previously created using iproute2 ip command.
 .BR rdma-link (8),
 .BR rdma-resource (8),
 .BR rdma-system (8),
+.BR rdma-statistic (8),
 .br
 
 .SH AUTHOR
diff --git a/man/man8/rdma-link.8 b/man/man8/rdma-link.8
index b3b40de7..32f80228 100644
--- a/man/man8/rdma-link.8
+++ b/man/man8/rdma-link.8
@@ -97,6 +97,7 @@ Removes RXE link rxe_eth0
 .BR rdma (8),
 .BR rdma-dev (8),
 .BR rdma-resource (8),
+.BR rdma-statistic (8),
 .br
 
 .SH AUTHOR
diff --git a/man/man8/rdma-resource.8 b/man/man8/rdma-resource.8
index 40b073db..05030d0a 100644
--- a/man/man8/rdma-resource.8
+++ b/man/man8/rdma-resource.8
@@ -103,6 +103,7 @@ Show CQs belonging to pid 30489
 .BR rdma (8),
 .BR rdma-dev (8),
 .BR rdma-link (8),
+.BR rdma-statistic (8),
 .br
 
 .SH AUTHOR
diff --git a/man/man8/rdma-statistic.8 b/man/man8/rdma-statistic.8
new file mode 100644
index 00000000..2c31b08a
--- /dev/null
+++ b/man/man8/rdma-statistic.8
@@ -0,0 +1,167 @@
+.TH RDMA\-STATISTIC 8 "17 Mar 2019" "iproute2" "Linux"
+.SH NAME
+rdma-statistic \- RDMA statistic counter configuration
+.SH SYNOPSIS
+.sp
+.ad l
+.in +8
+.ti -8
+.B rdma
+.RI "[ " OPTIONS " ]"
+.B statistic
+.RI  " { " COMMAND " | "
+.BR help " }"
+.sp
+
+.ti -8
+.B rdma statistic
+.RI "[ " OBJECT " ]"
+.B show
+
+.ti -8
+.B rdma statistic
+.RI "[ " OBJECT " ]"
+.B show link
+.RI "[ " DEV/PORT_INDEX " ]"
+
+.ti -8
+.B rdma statistic
+.IR OBJECT
+.B mode
+
+.ti -8
+.B rdma statistic
+.IR OBJECT
+.B set
+.IR COUNTER_SCOPE
+.RI "[ " DEV/PORT_INDEX "]"
+.B auto
+.RI "{ " CRITERIA " | "
+.BR off " }"
+
+.ti -8
+.B rdma statistic
+.IR OBJECT
+.B bind
+.IR COUNTER_SCOPE
+.RI "[ " DEV/PORT_INDEX "]"
+.RI "[ " OBJECT-ID " ]"
+.RI "[ " COUNTER-ID " ]"
+
+.ti -8
+.B rdma statistic
+.IR OBJECT
+.B unbind
+.IR COUNTER_SCOPE
+.RI "[ " DEV/PORT_INDEX "]"
+.RI "[ " COUNTER-ID " ]"
+.RI "[ " OBJECT-ID " ]"
+
+.ti -8
+.IR COUNTER_SCOPE " := "
+.RB "{ " link " | " dev " }"
+
+.ti -8
+.IR OBJECT " := "
+.RB "{ " qp " }"
+
+.ti -8
+.IR CRITERIA " := "
+.RB "{ " type " }"
+
+.SH "DESCRIPTION"
+.SS rdma statistic [object] show - Queries the specified RDMA device for RDMA and driver-specific statistics. Show the default hw counters if object is not specified
+
+.PP
+.I "DEV"
+- specifies counters on this RDMA device to show.
+
+.I "PORT_INDEX"
+- specifies counters on this RDMA port to show.
+
+.SS rdma statistic <object> set - configure counter statistic auto-mode for a specific device/port
+In auto mode all objects belonging to one category are bound automatically to a single counter set.
+
+.SS rdma statistic <object> bind - manually bind an object (e.g., a qp) with a counter
+When bound the statistics of this object are available in this counter.
+
+.SS rdma statistic <object> unbind - manually unbind an object (e.g., a qp) from the counter previously bound
+When unbound the statistics of this object are no longer available in this counter. If the object id is not specified then all objects on this counter will be unbound.
+
+.I "COUNTER-ID"
+- specifies the id of the counter to be bound.
+If this argument is omitted then a new counter will be allocated.
+
+.SH "EXAMPLES"
+.PP
+rdma statistic show
+.RS 4
+Shows the state of the default counter of all RDMA devices on the system.
+.RE
+.PP
+rdma statistic show link mlx5_2/1
+.RS 4
+Shows the state of the default counter of specified RDMA port
+.RE
+.PP
+rdma statistic qp show
+.RS 4
+Shows the state of all qp counters of all RDMA devices on the system.
+.RE
+.PP
+rdma statistic qp show link mlx5_2/1
+.RS 4
+Shows the state of all qp counters of specified RDMA port.
+.RE
+.PP
+rdma statistic qp show link mlx5_2 pid 30489
+.RS 4
+Shows the state of all qp counters of specified RDMA port and belonging to pid 30489
+.RE
+.PP
+rdma statistic qp mode
+.RS 4
+List current counter mode on all devices
+.RE
+.PP
+rdma statistic qp mode link mlx5_2/1
+.RS 4
+List current counter mode of device mlx5_2 port 1
+.RE
+.PP
+rdma statistic qp set link mlx5_2/1 auto type on
+.RS 4
+On device mlx5_2 port 1, bind each new QP to a counter automatically, with one counter for the QPs of the same qp type in each process. Currently only "type" is supported.
+.RE
+.PP
+rdma statistic qp set link mlx5_2/1 auto off
+.RS 4
+Turn off auto mode on device mlx5_2 port 1. The allocated counters can be manually accessed.
+.RE
+.PP
+rdma statistic qp bind link mlx5_2/1 lqpn 178
+.RS 4
+On device mlx5_2 port 1, allocate a counter and bind the specified qp on it
+.RE
+.PP
+rdma statistic qp unbind link mlx5_2/1 cntn 4 lqpn 178
+.RS 4
+On device mlx5_2 port 1, unbind the specified qp from the specified counter
+.RE
+.PP
+rdma statistic qp unbind link mlx5_2/1 cntn 4
+.RS 4
+On device mlx5_2 port 1, unbind all QPs on the specified counter. After that this counter will be released automatically by the kernel.
+
+.RE
+.PP
+
+.SH SEE ALSO
+.BR rdma (8),
+.BR rdma-dev (8),
+.BR rdma-link (8),
+.BR rdma-resource (8),
+.br
+
+.SH AUTHOR
+Mark Zhang <markz@mellanox.com>
diff --git a/man/man8/rdma.8 b/man/man8/rdma.8
index 3ae33987..ef29b1c6 100644
--- a/man/man8/rdma.8
+++ b/man/man8/rdma.8
@@ -19,7 +19,7 @@ rdma \- RDMA tool
 
 .ti -8
 .IR OBJECT " := { "
-.BR dev " | " link " | " system " }"
+.BR dev " | " link " | " system " | " statistic " }"
 .sp
 
 .ti -8
@@ -74,6 +74,10 @@ Generate JSON output.
 .B sys
 - RDMA subsystem related.
 
+.TP
+.B statistic
+- RDMA counter statistic related.
+
 .PP
 The names of all objects may be written in full or
 abbreviated form, for example
@@ -112,6 +116,7 @@ Exit status is 0 if command was successful or a positive integer upon failure.
 .BR rdma-link (8),
 .BR rdma-resource (8),
 .BR rdma-system (8),
+.BR rdma-statistic (8),
 .br
 
 .SH REPORTING BUGS
-- 
2.20.1



* [PATCH iproute2-rc 7/8] rdma: Add default counter show support
From: Leon Romanovsky @ 2019-07-10  7:24 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Leon Romanovsky, netdev, David Ahern, Mark Zhang,
	RDMA mailing list
In-Reply-To: <20190710072455.9125-1-leon@kernel.org>

From: Mark Zhang <markz@mellanox.com>

Show default counter statistics, which are the same as those exposed through
the sysfs interface: /sys/class/infiniband/<dev>/ports/<port>/hw_counters/

Example:
$ rdma stat show link mlx5_2/1
link mlx5_2/1 rx_write_requests 8 rx_read_requests 4 rx_atomic_requests 0
out_of_buffer 0 out_of_sequence 0 duplicate_request 0 rnr_nak_retry_err 0
packet_seq_err 0 implied_nak_seq_err 0 local_ack_timeout_err 0
resp_local_length_error 0 resp_cqe_error 0 req_cqe_error 0
req_remote_invalid_request 0 req_remote_access_errors 0
resp_remote_access_errors 0 resp_cqe_flush_error 0 req_cqe_flush_error 0
rp_cnp_ignored 0 rp_cnp_handled 0 np_ecn_marked_roce_packets 0
np_cnp_sent 0 rx_icrc_encapsulated 0

Signed-off-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 rdma/stat.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 73 insertions(+), 1 deletion(-)

diff --git a/rdma/stat.c b/rdma/stat.c
index 942c1ac3..ef0bbcf1 100644
--- a/rdma/stat.c
+++ b/rdma/stat.c
@@ -17,6 +17,8 @@ static int stat_help(struct rd *rd)
 	pr_out("       %s statistic OBJECT set COUNTER_SCOPE [DEV/PORT_INDEX] auto {CRITERIA | off}\n", rd->filename);
 	pr_out("       %s statistic OBJECT bind COUNTER_SCOPE [DEV/PORT_INDEX] [OBJECT-ID] [COUNTER-ID]\n", rd->filename);
 	pr_out("       %s statistic OBJECT unbind COUNTER_SCOPE [DEV/PORT_INDEX] [COUNTER-ID]\n", rd->filename);
+	pr_out("       %s statistic show\n", rd->filename);
+	pr_out("       %s statistic show link [ DEV/PORT_INDEX ]\n", rd->filename);
 	pr_out("where  OBJECT: = { qp }\n");
 	pr_out("       CRITERIA : = { type }\n");
 	pr_out("       COUNTER_SCOPE: = { link | dev }\n");
@@ -31,6 +33,8 @@ static int stat_help(struct rd *rd)
 	pr_out("       %s statistic qp bind link mlx5_2/1 lqpn 178 cntn 4\n", rd->filename);
 	pr_out("       %s statistic qp unbind link mlx5_2/1 cntn 4\n", rd->filename);
 	pr_out("       %s statistic qp unbind link mlx5_2/1 cntn 4 lqpn 178\n", rd->filename);
+	pr_out("       %s statistic show\n", rd->filename);
+	pr_out("       %s statistic show link mlx5_2/1\n", rd->filename);
 
 	return 0;
 }
@@ -674,10 +678,78 @@ static int stat_qp(struct rd *rd)
 	return rd_exec_cmd(rd, cmds, "parameter");
 }
 
+static int stat_show_parse_cb(const struct nlmsghdr *nlh, void *data)
+{
+	struct nlattr *tb[RDMA_NLDEV_ATTR_MAX] = {};
+	struct rd *rd = data;
+	const char *name;
+	uint32_t port;
+	int ret;
+
+	mnl_attr_parse(nlh, 0, rd_attr_cb, tb);
+	if (!tb[RDMA_NLDEV_ATTR_DEV_INDEX] || !tb[RDMA_NLDEV_ATTR_DEV_NAME] ||
+	    !tb[RDMA_NLDEV_ATTR_PORT_INDEX] ||
+	    !tb[RDMA_NLDEV_ATTR_STAT_HWCOUNTERS])
+		return MNL_CB_ERROR;
+
+	name = mnl_attr_get_str(tb[RDMA_NLDEV_ATTR_DEV_NAME]);
+	port = mnl_attr_get_u32(tb[RDMA_NLDEV_ATTR_PORT_INDEX]);
+	if (rd->json_output) {
+		jsonw_string_field(rd->jw, "ifname", name);
+		jsonw_uint_field(rd->jw, "port", port);
+	} else {
+		pr_out("link %s/%u ", name, port);
+	}
+
+	ret = res_get_hwcounters(rd, tb[RDMA_NLDEV_ATTR_STAT_HWCOUNTERS], true);
+
+	if (!rd->json_output)
+		pr_out("\n");
+	return ret;
+}
+
+static int stat_show_one_link(struct rd *rd)
+{
+	int flags = NLM_F_REQUEST | NLM_F_ACK;
+	uint32_t seq;
+	int ret;
+
+	if (!rd->port_idx)
+		return 0;
+
+	rd_prepare_msg(rd, RDMA_NLDEV_CMD_STAT_GET, &seq,  flags);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_DEV_INDEX, rd->dev_idx);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_PORT_INDEX, rd->port_idx);
+	ret = rd_send_msg(rd);
+	if (ret)
+		return ret;
+
+	return rd_recv_msg(rd, stat_show_parse_cb, rd, seq);
+}
+
+static int stat_show_link(struct rd *rd)
+{
+	return rd_exec_link(rd, stat_show_one_link, false);
+}
+
+static int stat_show(struct rd *rd)
+{
+	const struct rd_cmd cmds[] = {
+		{ NULL,		stat_show_link },
+		{ "link",	stat_show_link },
+		{ "help",	stat_help },
+		{ 0 }
+	};
+
+	return rd_exec_cmd(rd, cmds, "parameter");
+}
+
 int cmd_stat(struct rd *rd)
 {
 	const struct rd_cmd cmds[] =  {
-		{ NULL,		stat_help },
+		{ NULL,		stat_show },
+		{ "show",	stat_show },
+		{ "list",	stat_show },
 		{ "help",	stat_help },
 		{ "qp",		stat_qp },
 		{ 0 }
-- 
2.20.1



* [PATCH iproute2-rc 6/8] rdma: Add stat manual mode support
From: Leon Romanovsky @ 2019-07-10  7:24 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Leon Romanovsky, netdev, David Ahern, Mark Zhang,
	RDMA mailing list
In-Reply-To: <20190710072455.9125-1-leon@kernel.org>

From: Mark Zhang <markz@mellanox.com>

In manual mode a QP can be manually bound to a counter. If the counter
id (cntn) is not specified then the kernel will allocate one. After a
successful bind, the cntn can be seen through "rdma statistic qp show".
On unbind, if lqpn is not specified then all QPs on this counter will
be unbound.
The manual and auto modes are mutually exclusive.

Examples:
$ rdma statistic qp bind link mlx5_2/1 lqpn 178
$ rdma statistic qp bind link mlx5_2/1 lqpn 178 cntn 4
$ rdma statistic qp unbind link mlx5_2/1 cntn 4
$ rdma statistic qp unbind link mlx5_2/1 cntn 4 lqpn 178
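
For reference, the "lqpn N [cntn M]" arguments in these examples could be parsed with explicit validation along the following lines. This is a sketch, not the parsing code from this patch: sketch_get_u32() is a hypothetical helper, and it uses an exact keyword match where the tool itself may accept abbreviations.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse "<key> <decimal>" starting at argv[*pos] and advance *pos past
 * both words. Returns 0 on success, -EINVAL if the key does not match,
 * the value is missing, or the value is not a plain decimal number that
 * fits in 32 bits. On failure *pos and *out are left unchanged.
 */
static int sketch_get_u32(int argc, const char *const argv[], int *pos,
			  const char *key, uint32_t *out)
{
	unsigned long v;
	const char *s;
	char *end;

	assert(pos != NULL && out != NULL);
	if (*pos + 1 >= argc || strcmp(argv[*pos], key) != 0)
		return -EINVAL;

	s = argv[*pos + 1];
	if (s[0] < '0' || s[0] > '9')	/* rejects "", "-1", "+1", " 1" */
		return -EINVAL;

	errno = 0;
	v = strtoul(s, &end, 10);
	if (errno || *end != '\0' || v > UINT32_MAX)
		return -EINVAL;

	*out = (uint32_t)v;
	*pos += 2;
	return 0;
}
```

Calling it once with "lqpn" and, if arguments remain, once with "cntn" covers both bind forms shown above; keeping the error code separate from the parsed value means a failed parse cannot be mistaken for a queue pair or counter number.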

Signed-off-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 rdma/stat.c | 192 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 192 insertions(+)

diff --git a/rdma/stat.c b/rdma/stat.c
index ad1cc063..942c1ac3 100644
--- a/rdma/stat.c
+++ b/rdma/stat.c
@@ -15,6 +15,8 @@ static int stat_help(struct rd *rd)
 	pr_out("       %s statistic OBJECT show link [ DEV/PORT_INDEX ] [ FILTER-NAME FILTER-VALUE ]\n", rd->filename);
 	pr_out("       %s statistic OBJECT mode\n", rd->filename);
 	pr_out("       %s statistic OBJECT set COUNTER_SCOPE [DEV/PORT_INDEX] auto {CRITERIA | off}\n", rd->filename);
+	pr_out("       %s statistic OBJECT bind COUNTER_SCOPE [DEV/PORT_INDEX] [OBJECT-ID] [COUNTER-ID]\n", rd->filename);
+	pr_out("       %s statistic OBJECT unbind COUNTER_SCOPE [DEV/PORT_INDEX] [COUNTER-ID]\n", rd->filename);
 	pr_out("where  OBJECT: = { qp }\n");
 	pr_out("       CRITERIA : = { type }\n");
 	pr_out("       COUNTER_SCOPE: = { link | dev }\n");
@@ -25,6 +27,10 @@ static int stat_help(struct rd *rd)
 	pr_out("       %s statistic qp mode link mlx5_0\n", rd->filename);
 	pr_out("       %s statistic qp set link mlx5_2/1 auto type on\n", rd->filename);
 	pr_out("       %s statistic qp set link mlx5_2/1 auto off\n", rd->filename);
+	pr_out("       %s statistic qp bind link mlx5_2/1 lqpn 178\n", rd->filename);
+	pr_out("       %s statistic qp bind link mlx5_2/1 lqpn 178 cntn 4\n", rd->filename);
+	pr_out("       %s statistic qp unbind link mlx5_2/1 cntn 4\n", rd->filename);
+	pr_out("       %s statistic qp unbind link mlx5_2/1 cntn 4 lqpn 178\n", rd->filename);
 
 	return 0;
 }
@@ -467,6 +473,190 @@ static int stat_qp_set(struct rd *rd)
 	return rd_exec_cmd(rd, cmds, "parameter");
 }
 
+static int stat_get_arg(struct rd *rd, const char *arg)
+{
+	int value = 0;
+	char *endp;
+
+	if (strcmpx(rd_argv(rd), arg) != 0)
+		return -EINVAL;
+
+	rd_arg_inc(rd);
+	value = strtol(rd_argv(rd), &endp, 10);
+	rd_arg_inc(rd);
+
+	return value;
+}
+
+static int stat_one_qp_bind(struct rd *rd)
+{
+	int lqpn = 0, cntn = 0, ret;
+	uint32_t seq;
+
+	if (rd_no_arg(rd)) {
+		stat_help(rd);
+		return -EINVAL;
+	}
+
+	ret = rd_build_filter(rd, stat_valid_filters);
+	if (ret)
+		return ret;
+
+	lqpn = stat_get_arg(rd, "lqpn");
+
+	rd_prepare_msg(rd, RDMA_NLDEV_CMD_STAT_SET,
+		       &seq, (NLM_F_REQUEST | NLM_F_ACK));
+
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_STAT_MODE,
+			 RDMA_COUNTER_MODE_MANUAL);
+
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_STAT_RES, RDMA_NLDEV_ATTR_RES_QP);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_DEV_INDEX, rd->dev_idx);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_PORT_INDEX, rd->port_idx);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_RES_LQPN, lqpn);
+
+	if (rd_argc(rd)) {
+		cntn = stat_get_arg(rd, "cntn");
+		mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_STAT_COUNTER_ID,
+				 cntn);
+	}
+
+	return rd_sendrecv_msg(rd, seq);
+}
+
+static int do_stat_qp_unbind_lqpn(struct rd *rd, uint32_t cntn, uint32_t lqpn)
+{
+	uint32_t seq;
+
+	rd_prepare_msg(rd, RDMA_NLDEV_CMD_STAT_DEL,
+		       &seq, (NLM_F_REQUEST | NLM_F_ACK));
+
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_STAT_MODE,
+			 RDMA_COUNTER_MODE_MANUAL);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_STAT_RES, RDMA_NLDEV_ATTR_RES_QP);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_DEV_INDEX, rd->dev_idx);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_PORT_INDEX, rd->port_idx);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_STAT_COUNTER_ID, cntn);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_RES_LQPN, lqpn);
+
+	return rd_sendrecv_msg(rd, seq);
+}
+
+static int stat_get_counter_parse_cb(const struct nlmsghdr *nlh, void *data)
+{
+	struct nlattr *tb[RDMA_NLDEV_ATTR_MAX] = {};
+	struct nlattr *nla_table, *nla_entry;
+	struct rd *rd = data;
+	uint32_t lqpn, cntn;
+	int err;
+
+	mnl_attr_parse(nlh, 0, rd_attr_cb, tb);
+
+	if (!tb[RDMA_NLDEV_ATTR_STAT_COUNTER_ID])
+		return MNL_CB_ERROR;
+	cntn = mnl_attr_get_u32(tb[RDMA_NLDEV_ATTR_STAT_COUNTER_ID]);
+
+	nla_table = tb[RDMA_NLDEV_ATTR_RES_QP];
+	if (!nla_table)
+		return MNL_CB_ERROR;
+
+	mnl_attr_for_each_nested(nla_entry, nla_table) {
+		struct nlattr *nla_line[RDMA_NLDEV_ATTR_MAX] = {};
+
+		err = mnl_attr_parse_nested(nla_entry, rd_attr_cb, nla_line);
+		if (err != MNL_CB_OK)
+			return -EINVAL;
+
+		if (!nla_line[RDMA_NLDEV_ATTR_RES_LQPN])
+			return -EINVAL;
+
+		lqpn = mnl_attr_get_u32(nla_line[RDMA_NLDEV_ATTR_RES_LQPN]);
+		err = do_stat_qp_unbind_lqpn(rd, cntn, lqpn);
+		if (err)
+			return MNL_CB_ERROR;
+	}
+
+	return MNL_CB_OK;
+}
+
+static int stat_one_qp_unbind(struct rd *rd)
+{
+	int flags = NLM_F_REQUEST | NLM_F_ACK, ret;
+	char buf[MNL_SOCKET_BUFFER_SIZE];
+	int lqpn = 0, cntn = 0;
+	unsigned int portid;
+	uint32_t seq;
+
+	ret = rd_build_filter(rd, stat_valid_filters);
+	if (ret)
+		return ret;
+
+	cntn = stat_get_arg(rd, "cntn");
+	if (rd_argc(rd)) {
+		lqpn = stat_get_arg(rd, "lqpn");
+		return do_stat_qp_unbind_lqpn(rd, cntn, lqpn);
+	}
+
+	rd_prepare_msg(rd, RDMA_NLDEV_CMD_STAT_GET, &seq, flags);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_DEV_INDEX, rd->dev_idx);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_PORT_INDEX, rd->port_idx);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_STAT_RES, RDMA_NLDEV_ATTR_RES_QP);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_STAT_COUNTER_ID, cntn);
+	ret = rd_send_msg(rd);
+	if (ret)
+		return ret;
+
+
+	/* Can't use rd_recv_msg() since the callback also calls it (recursively),
+	 * in which case rd_recv_msg() always returns -1 here
+	 */
+	portid = mnl_socket_get_portid(rd->nl);
+	ret = mnl_socket_recvfrom(rd->nl, buf, sizeof(buf));
+	if (ret <= 0)
+		return ret;
+
+	ret = mnl_cb_run(buf, ret, seq, portid, stat_get_counter_parse_cb, rd);
+	mnl_socket_close(rd->nl);
+	if (ret != MNL_CB_OK)
+		return ret;
+
+	return 0;
+}
+
+static int stat_qp_bind_link(struct rd *rd)
+{
+	return rd_exec_link(rd, stat_one_qp_bind, true);
+}
+
+static int stat_qp_bind(struct rd *rd)
+{
+	const struct rd_cmd cmds[] = {
+		{ NULL,		stat_help },
+		{ "link",	stat_qp_bind_link },
+		{ "help",	stat_help },
+		{ 0 },
+	};
+
+	return rd_exec_cmd(rd, cmds, "parameter");
+}
+
+static int stat_qp_unbind_link(struct rd *rd)
+{
+	return rd_exec_link(rd, stat_one_qp_unbind, true);
+}
+
+static int stat_qp_unbind(struct rd *rd)
+{
+	const struct rd_cmd cmds[] = {
+		{ NULL,		stat_help },
+		{ "link",	stat_qp_unbind_link },
+		{ "help",	stat_help },
+		{ 0 },
+	};
+
+	return rd_exec_cmd(rd, cmds, "parameter");
+}
+
 static int stat_qp(struct rd *rd)
 {
 	const struct rd_cmd cmds[] =  {
@@ -475,6 +665,8 @@ static int stat_qp(struct rd *rd)
 		{ "list",	stat_qp_show },
 		{ "mode",	stat_qp_get_mode },
 		{ "set",	stat_qp_set },
+		{ "bind",	stat_qp_bind },
+		{ "unbind",	stat_qp_unbind },
 		{ "help",	stat_help },
 		{ 0 }
 	};
-- 
2.20.1
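
A note on argument parsing in the patch above: stat_get_arg() returns the
parsed value and -EINVAL through the same int, never reads endp, and the
callers do not check for a negative result. Below is a minimal sketch of a
stricter variant, assuming a plain argc/argv interface instead of struct rd.
The name parse_keyed_u32() and its signature are hypothetical, and it uses
an exact strcmp() match rather than the tool's strcmpx().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the rdma tool: parse "<key> <number>"
 * from argv[0] and argv[1]. Returns 0 and stores the number in *out, or
 * -EINVAL if the keyword does not match or the value is not a complete
 * decimal number that fits in 32 bits.
 */
static int parse_keyed_u32(int argc, char **argv, const char *key,
			   uint32_t *out)
{
	unsigned long long v;
	char *endp;

	assert(key != NULL && out != NULL);

	if (argc < 2 || strcmp(argv[0], key) != 0)
		return -EINVAL;

	/* Reject empty strings, signs and whitespace up front. */
	if (argv[1][0] < '0' || argv[1][0] > '9')
		return -EINVAL;

	errno = 0;
	v = strtoull(argv[1], &endp, 10);
	/* Trailing garbage ("17x") or overflow is an error. */
	if (errno != 0 || *endp != '\0' || v > UINT32_MAX)
		return -EINVAL;

	*out = (uint32_t)v;
	return 0;
}
```

Keeping the error code and the value in separate channels lets a caller
tell a missing "lqpn" keyword apart from a valid QP number.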



* [PATCH iproute2-rc 5/8] rdma: Make get_port_from_argv() returns valid port in strict port mode
From: Leon Romanovsky @ 2019-07-10  7:24 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Leon Romanovsky, netdev, David Ahern, Mark Zhang,
	RDMA mailing list
In-Reply-To: <20190710072455.9125-1-leon@kernel.org>

From: Mark Zhang <markz@mellanox.com>

When strict_port is set, make get_port_from_argv() return failure if
no valid port is specified.

Signed-off-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 rdma/utils.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/rdma/utils.c b/rdma/utils.c
index aed1a3d0..95b669f3 100644
--- a/rdma/utils.c
+++ b/rdma/utils.c
@@ -56,7 +56,7 @@ bool rd_no_arg(struct rd *rd)
  * mlx5_1/1    | 1          | false
  * mlx5_1/-    | 0          | false
  *
- * In strict mode, /- will return error.
+ * In strict port mode, a non-0 port must be provided
  */
 static int get_port_from_argv(struct rd *rd, uint32_t *port,
 			      bool *is_dump_all, bool strict_port)
@@ -64,7 +64,7 @@ static int get_port_from_argv(struct rd *rd, uint32_t *port,
 	char *slash;
 
 	*port = 0;
-	*is_dump_all = true;
+	*is_dump_all = strict_port ? false : true;
 
 	slash = strchr(rd_argv(rd), '/');
 	/* if no port found, return 0 */
@@ -83,6 +83,9 @@ static int get_port_from_argv(struct rd *rd, uint32_t *port,
 		if (!*port && strlen(slash))
 			return -EINVAL;
 	}
+	if (strict_port && (*port == 0))
+		return -EINVAL;
+
 	return 0;
 }
 
-- 
2.20.1
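
For readers following the strict port change above, here is a standalone
model of the parsing rules. It is only a sketch: parse_dev_port() is a
hypothetical name that takes the argument string directly instead of
struct rd, and the handling of "/-" and of digits is an assumption about
the part of get_port_from_argv() that the hunks do not show.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Standalone model of DEV/PORT parsing (hypothetical name and signature).
 * "dev" gives port 0 and dump_all = true; "dev/N" gives port N and
 * dump_all = false. With strict set, any input that does not yield a
 * non-zero port fails with -EINVAL.
 */
static int parse_dev_port(const char *arg, bool strict,
			  uint32_t *port, bool *dump_all)
{
	const char *slash;

	assert(arg && port && dump_all);

	*port = 0;
	*dump_all = !strict;

	slash = strchr(arg, '/');
	if (slash) {
		slash++;
		if (*slash == '-') {
			/* "dev/-": explicitly no specific port */
			if (strict)
				return -EINVAL;
			*dump_all = false;
			return 0;
		}
		if (isdigit((unsigned char)*slash)) {
			*dump_all = false;
			*port = (uint32_t)strtoul(slash, NULL, 10);
		}
		/* Text after the slash that gave no port is an error. */
		if (!*port && strlen(slash))
			return -EINVAL;
	}

	/* The check added by this patch: strict mode needs a real port. */
	if (strict && *port == 0)
		return -EINVAL;

	return 0;
}
```

With this model, "mlx5_2/1" is accepted in both modes, while a bare
"mlx5_2" is accepted only when strict is false.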



* [PATCH iproute2-rc 3/8] rdma: Add get per-port counter mode support
From: Leon Romanovsky @ 2019-07-10  7:24 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Leon Romanovsky, netdev, David Ahern, Mark Zhang,
	RDMA mailing list
In-Reply-To: <20190710072455.9125-1-leon@kernel.org>

From: Mark Zhang <markz@mellanox.com>

Add an interface to show which mode is active. Two modes are supported:
- "auto": In this mode all QPs belonging to one category are bound
  automatically to a single counter set. Currently only "qp type" is
  supported;
- "manual": In this mode QPs are bound to a counter manually.

Examples:
$ rdma statistic qp mode
0/1: mlx5_0/1: qp auto off
1/1: mlx5_1/1: qp auto off
2/1: mlx5_2/1: qp auto type on
3/1: mlx5_3/1: qp auto off

$ rdma statistic qp mode link mlx5_0
0/1: mlx5_0/1: qp auto off

Signed-off-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 rdma/stat.c  | 140 +++++++++++++++++++++++++++++++++++++++++++++++++++
 rdma/utils.c |   2 +
 2 files changed, 142 insertions(+)

diff --git a/rdma/stat.c b/rdma/stat.c
index da35ef7d..0c239851 100644
--- a/rdma/stat.c
+++ b/rdma/stat.c
@@ -13,13 +13,152 @@ static int stat_help(struct rd *rd)
 	pr_out("Usage: %s [ OPTIONS ] statistic { COMMAND | help }\n", rd->filename);
 	pr_out("       %s statistic OBJECT show\n", rd->filename);
 	pr_out("       %s statistic OBJECT show link [ DEV/PORT_INDEX ] [ FILTER-NAME FILTER-VALUE ]\n", rd->filename);
+	pr_out("       %s statistic OBJECT mode\n", rd->filename);
+	pr_out("where  OBJECT: = { qp }\n");
 	pr_out("Examples:\n");
 	pr_out("       %s statistic qp show\n", rd->filename);
 	pr_out("       %s statistic qp show link mlx5_2/1\n", rd->filename);
+	pr_out("       %s statistic qp mode\n", rd->filename);
+	pr_out("       %s statistic qp mode link mlx5_0\n", rd->filename);
 
 	return 0;
 }
 
+struct counter_param {
+	char *name;
+	uint32_t attr;
+};
+
+static struct counter_param auto_params[] = {
+	{ "type", RDMA_COUNTER_MASK_QP_TYPE, },
+	{ NULL },
+};
+
+static int prepare_auto_mode_str(struct nlattr **tb, uint32_t mask,
+				 char *output, int len)
+{
+	char s[] = "qp auto";
+	int i, outlen = strlen(s);
+
+	memset(output, 0, len);
+	snprintf(output, len, "%s", s);
+
+	if (mask) {
+		for (i = 0; auto_params[i].name != NULL; i++) {
+			if (mask & auto_params[i].attr) {
+				outlen += strlen(auto_params[i].name) + 1;
+				if (outlen >= len)
+					return -EINVAL;
+				strcat(output, " ");
+				strcat(output, auto_params[i].name);
+			}
+		}
+
+		if (outlen + strlen(" on") >= len)
+			return -EINVAL;
+		strcat(output, " on");
+	} else {
+		if (outlen + strlen(" off") >= len)
+			return -EINVAL;
+		strcat(output, " off");
+	}
+
+	return 0;
+}
+
+static int qp_link_get_mode_parse_cb(const struct nlmsghdr *nlh, void *data)
+{
+	struct nlattr *tb[RDMA_NLDEV_ATTR_MAX] = {};
+	uint32_t mode = 0, mask = 0;
+	char output[128] = {};
+	struct rd *rd = data;
+	uint32_t idx, port;
+	const char *name;
+
+	mnl_attr_parse(nlh, 0, rd_attr_cb, tb);
+	if (!tb[RDMA_NLDEV_ATTR_DEV_INDEX] || !tb[RDMA_NLDEV_ATTR_DEV_NAME])
+		return MNL_CB_ERROR;
+
+	if (!tb[RDMA_NLDEV_ATTR_PORT_INDEX]) {
+		pr_err("This tool doesn't support switches yet\n");
+		return MNL_CB_ERROR;
+	}
+
+	idx = mnl_attr_get_u32(tb[RDMA_NLDEV_ATTR_DEV_INDEX]);
+	port = mnl_attr_get_u32(tb[RDMA_NLDEV_ATTR_PORT_INDEX]);
+	name = mnl_attr_get_str(tb[RDMA_NLDEV_ATTR_DEV_NAME]);
+	if (tb[RDMA_NLDEV_ATTR_STAT_MODE])
+		mode = mnl_attr_get_u32(tb[RDMA_NLDEV_ATTR_STAT_MODE]);
+
+	if (mode == RDMA_COUNTER_MODE_AUTO) {
+		if (!tb[RDMA_NLDEV_ATTR_STAT_AUTO_MODE_MASK])
+			return MNL_CB_ERROR;
+		mask = mnl_attr_get_u32(tb[RDMA_NLDEV_ATTR_STAT_AUTO_MODE_MASK]);
+		prepare_auto_mode_str(tb, mask, output, sizeof(output));
+	} else {
+		snprintf(output, sizeof(output), "qp auto off");
+	}
+
+	if (rd->json_output) {
+		jsonw_uint_field(rd->jw, "ifindex", idx);
+		jsonw_uint_field(rd->jw, "port", port);
+		jsonw_string_field(rd->jw, "mode", output);
+	} else {
+		pr_out("%u/%u: %s/%u: %s\n", idx, port, name, port, output);
+	}
+
+	return MNL_CB_OK;
+}
+
+static int stat_one_qp_link_get_mode(struct rd *rd)
+{
+	uint32_t seq;
+	int ret;
+
+	if (!rd->port_idx)
+		return 0;
+
+	rd_prepare_msg(rd, RDMA_NLDEV_CMD_STAT_GET,
+		       &seq, (NLM_F_REQUEST | NLM_F_ACK));
+
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_DEV_INDEX, rd->dev_idx);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_PORT_INDEX, rd->port_idx);
+	/* Make RDMA_NLDEV_ATTR_STAT_MODE valid so that the kernel knows
+	 * to return only the mode instead of all counters
+	 */
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_STAT_MODE,
+			 RDMA_COUNTER_MODE_MANUAL);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_STAT_RES, RDMA_NLDEV_ATTR_RES_QP);
+	ret = rd_send_msg(rd);
+	if (ret)
+		return ret;
+
+	if (rd->json_output)
+		jsonw_start_object(rd->jw);
+	ret = rd_recv_msg(rd, qp_link_get_mode_parse_cb, rd, seq);
+	if (rd->json_output)
+		jsonw_end_object(rd->jw);
+
+	return ret;
+}
+
+static int stat_qp_link_get_mode(struct rd *rd)
+{
+	return rd_exec_link(rd, stat_one_qp_link_get_mode, false);
+}
+
+static int stat_qp_get_mode(struct rd *rd)
+{
+	const struct rd_cmd cmds[] = {
+		{ NULL,		stat_qp_link_get_mode },
+		{ "link",	stat_qp_link_get_mode },
+		{ "help",	stat_help },
+		{ 0 }
+	};
+
+	return rd_exec_cmd(rd, cmds, "parameter");
+}
+
 static int res_get_hwcounters(struct rd *rd, struct nlattr *hwc_table, bool print)
 {
 	struct nlattr *nla_entry;
@@ -248,6 +387,7 @@ static int stat_qp(struct rd *rd)
 		{ NULL,		stat_qp_show },
 		{ "show",	stat_qp_show },
 		{ "list",	stat_qp_show },
+		{ "mode",	stat_qp_get_mode },
 		{ "help",	stat_help },
 		{ 0 }
 	};
diff --git a/rdma/utils.c b/rdma/utils.c
index 7bc0439a..9c885ad7 100644
--- a/rdma/utils.c
+++ b/rdma/utils.c
@@ -443,6 +443,8 @@ static const enum mnl_attr_data_type nldev_policy[RDMA_NLDEV_ATTR_MAX] = {
 	[RDMA_NLDEV_ATTR_STAT_HWCOUNTER_ENTRY] = MNL_TYPE_NESTED,
 	[RDMA_NLDEV_ATTR_STAT_HWCOUNTER_ENTRY_NAME] = MNL_TYPE_NUL_STRING,
 	[RDMA_NLDEV_ATTR_STAT_HWCOUNTER_ENTRY_VALUE] = MNL_TYPE_U64,
+	[RDMA_NLDEV_ATTR_STAT_MODE] = MNL_TYPE_U32,
+	[RDMA_NLDEV_ATTR_STAT_RES] = MNL_TYPE_U32,
 };
 
 int rd_attr_check(const struct nlattr *attr, int *typep)
-- 
2.20.1
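
The mode strings shown in the examples of this patch ("qp auto type on",
"qp auto off") are built from the auto-mode mask. Below is a minimal
sketch of that formatting which uses bounded snprintf() calls instead of
strcat(). All sketch_* names are hypothetical, and SKETCH_MASK_QP_TYPE is
a stand-in for RDMA_COUNTER_MASK_QP_TYPE so the example compiles without
the uapi header.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for RDMA_COUNTER_MASK_QP_TYPE from the uapi header. */
#define SKETCH_MASK_QP_TYPE 1u

struct sketch_param {
	const char *name;
	uint32_t bit;
};

static const struct sketch_param sketch_params[] = {
	{ "type", SKETCH_MASK_QP_TYPE },
	{ NULL, 0 },
};

/*
 * Build "qp auto <criteria...> on" or "qp auto off" into out.
 * Returns 0 on success, -EINVAL if the buffer is too small.
 */
static int sketch_mode_str(uint32_t mask, char *out, size_t len)
{
	size_t used;
	int n, i;

	assert(out != NULL);

	n = snprintf(out, len, "qp auto");
	if (n < 0 || (size_t)n >= len)
		return -EINVAL;
	used = (size_t)n;

	/* Append the name of every criterion whose bit is set. */
	for (i = 0; mask && sketch_params[i].name; i++) {
		if (!(mask & sketch_params[i].bit))
			continue;
		n = snprintf(out + used, len - used, " %s",
			     sketch_params[i].name);
		if (n < 0 || (size_t)n >= len - used)
			return -EINVAL;
		used += (size_t)n;
	}

	n = snprintf(out + used, len - used, " %s", mask ? "on" : "off");
	if (n < 0 || (size_t)n >= len - used)
		return -EINVAL;

	return 0;
}
```

Tracking the used length next to each snprintf() call keeps the bounds
check and the write in one place, so adding a criterion to the table does
not require a separate length calculation.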



* [PATCH iproute2-rc 2/8] rdma: Add "stat qp show" support
From: Leon Romanovsky @ 2019-07-10  7:24 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Leon Romanovsky, netdev, David Ahern, Mark Zhang,
	RDMA mailing list
In-Reply-To: <20190710072455.9125-1-leon@kernel.org>

From: Mark Zhang <markz@mellanox.com>

This patch shows the link, counter id, task name, lqpn, and all
sub-counters of a QP counter.
A QP counter is a dynamically allocated statistic counter that is
bound to one or more QPs. It has several sub-counters, each used
for a different purpose.

Examples:
$ rdma stat qp show
link mlx5_2/1 cntn 5 pid 31609 comm client.1 rx_write_requests 0
rx_read_requests 0 rx_atomic_requests 0 out_of_buffer 0 out_of_sequence 0
duplicate_request 0 rnr_nak_retry_err 0 packet_seq_err 0
implied_nak_seq_err 0 local_ack_timeout_err 0 resp_local_length_error 0
resp_cqe_error 0 req_cqe_error 0 req_remote_invalid_request 0
req_remote_access_errors 0 resp_remote_access_errors 0
resp_cqe_flush_error 0 req_cqe_flush_error 0
    LQPN: <178>
$ rdma stat show link rocep1s0f5/1
link rocep1s0f5/1 rx_write_requests 0 rx_read_requests 0 rx_atomic_requests 0 out_of_buffer 0 duplicate_request 0
rnr_nak_retry_err 0 packet_seq_err 0 implied_nak_seq_err 0 local_ack_timeout_err 0 resp_local_length_error 0 resp_cqe_error 0
req_cqe_error 0 req_remote_invalid_request 0 req_remote_access_errors 0 resp_remote_access_errors 0 resp_cqe_flush_error 0
req_cqe_flush_error 0 rp_cnp_ignored 0 rp_cnp_handled 0 np_ecn_marked_roce_packets 0 np_cnp_sent 0
$ rdma stat show link rocep1s0f5/1 -p
link rocep1s0f5/1
    rx_write_requests 0
    rx_read_requests 0
    rx_atomic_requests 0
    out_of_buffer 0
    duplicate_request 0
    rnr_nak_retry_err 0
    packet_seq_err 0
    implied_nak_seq_err 0
    local_ack_timeout_err 0
    resp_local_length_error 0
    resp_cqe_error 0
    req_cqe_error 0
    req_remote_invalid_request 0
    req_remote_access_errors 0
    resp_remote_access_errors 0
    resp_cqe_flush_error 0
    req_cqe_flush_error 0
    rp_cnp_ignored 0
    rp_cnp_handled 0
    np_ecn_marked_roce_packets 0
    np_cnp_sent 0

Signed-off-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 rdma/Makefile |   2 +-
 rdma/rdma.c   |   3 +-
 rdma/rdma.h   |   1 +
 rdma/stat.c   | 268 ++++++++++++++++++++++++++++++++++++++++++++++++++
 rdma/utils.c  |   7 ++
 5 files changed, 279 insertions(+), 2 deletions(-)
 create mode 100644 rdma/stat.c

diff --git a/rdma/Makefile b/rdma/Makefile
index 4847f27e..e3f550bf 100644
--- a/rdma/Makefile
+++ b/rdma/Makefile
@@ -7,7 +7,7 @@ ifeq ($(HAVE_MNL),y)
 CFLAGS += -I./include/uapi/
 
 RDMA_OBJ = rdma.o utils.o dev.o link.o res.o res-pd.o res-mr.o res-cq.o \
-	   res-cmid.o res-qp.o sys.o
+	   res-cmid.o res-qp.o sys.o stat.o
 
 TARGETS += rdma
 endif
diff --git a/rdma/rdma.c b/rdma/rdma.c
index e9f1b4bb..4e34da92 100644
--- a/rdma/rdma.c
+++ b/rdma/rdma.c
@@ -11,7 +11,7 @@ static void help(char *name)
 {
 	pr_out("Usage: %s [ OPTIONS ] OBJECT { COMMAND | help }\n"
 	       "       %s [ -f[orce] ] -b[atch] filename\n"
-	       "where  OBJECT := { dev | link | resource | system | help }\n"
+	       "where  OBJECT := { dev | link | resource | system | statistic | help }\n"
 	       "       OPTIONS := { -V[ersion] | -d[etails] | -j[son] | -p[retty]}\n", name, name);
 }
 
@@ -30,6 +30,7 @@ static int rd_cmd(struct rd *rd, int argc, char **argv)
 		{ "link",	cmd_link },
 		{ "resource",	cmd_res },
 		{ "system",	cmd_sys },
+		{ "statistic",	cmd_stat },
 		{ 0 }
 	};
 
diff --git a/rdma/rdma.h b/rdma/rdma.h
index 885a751e..23157743 100644
--- a/rdma/rdma.h
+++ b/rdma/rdma.h
@@ -94,6 +94,7 @@ int cmd_dev(struct rd *rd);
 int cmd_link(struct rd *rd);
 int cmd_res(struct rd *rd);
 int cmd_sys(struct rd *rd);
+int cmd_stat(struct rd *rd);
 int rd_exec_cmd(struct rd *rd, const struct rd_cmd *c, const char *str);
 int rd_exec_dev(struct rd *rd, int (*cb)(struct rd *rd));
 int rd_exec_require_dev(struct rd *rd, int (*cb)(struct rd *rd));
diff --git a/rdma/stat.c b/rdma/stat.c
new file mode 100644
index 00000000..da35ef7d
--- /dev/null
+++ b/rdma/stat.c
@@ -0,0 +1,268 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * rdma.c	RDMA tool
+ * Authors:     Mark Zhang <markz@mellanox.com>
+ */
+
+#include "rdma.h"
+#include "res.h"
+#include <inttypes.h>
+
+static int stat_help(struct rd *rd)
+{
+	pr_out("Usage: %s [ OPTIONS ] statistic { COMMAND | help }\n", rd->filename);
+	pr_out("       %s statistic OBJECT show\n", rd->filename);
+	pr_out("       %s statistic OBJECT show link [ DEV/PORT_INDEX ] [ FILTER-NAME FILTER-VALUE ]\n", rd->filename);
+	pr_out("Examples:\n");
+	pr_out("       %s statistic qp show\n", rd->filename);
+	pr_out("       %s statistic qp show link mlx5_2/1\n", rd->filename);
+
+	return 0;
+}
+
+static int res_get_hwcounters(struct rd *rd, struct nlattr *hwc_table, bool print)
+{
+	struct nlattr *nla_entry;
+	const char *nm;
+	uint64_t v;
+	int err;
+
+	mnl_attr_for_each_nested(nla_entry, hwc_table) {
+		struct nlattr *hw_line[RDMA_NLDEV_ATTR_MAX] = {};
+
+		err = mnl_attr_parse_nested(nla_entry, rd_attr_cb, hw_line);
+		if (err != MNL_CB_OK)
+			return -EINVAL;
+
+		if (!hw_line[RDMA_NLDEV_ATTR_STAT_HWCOUNTER_ENTRY_NAME] ||
+		    !hw_line[RDMA_NLDEV_ATTR_STAT_HWCOUNTER_ENTRY_VALUE]) {
+			return -EINVAL;
+		}
+
+		if (!print)
+			continue;
+
+		nm = mnl_attr_get_str(hw_line[RDMA_NLDEV_ATTR_STAT_HWCOUNTER_ENTRY_NAME]);
+		v = mnl_attr_get_u64(hw_line[RDMA_NLDEV_ATTR_STAT_HWCOUNTER_ENTRY_VALUE]);
+		if (rd->pretty_output && !rd->json_output)
+			newline_indent(rd);
+		res_print_uint(rd, nm, v, hw_line[RDMA_NLDEV_ATTR_STAT_HWCOUNTER_ENTRY_NAME]);
+	}
+
+	return MNL_CB_OK;
+}
+
+static int res_counter_line(struct rd *rd, const char *name, int index,
+		       struct nlattr **nla_line)
+{
+	uint32_t cntn, port = 0, pid = 0, qpn;
+	struct nlattr *hwc_table, *qp_table;
+	struct nlattr *nla_entry;
+	const char *comm = NULL;
+	bool isfirst;
+	int err;
+
+	if (nla_line[RDMA_NLDEV_ATTR_PORT_INDEX])
+		port = mnl_attr_get_u32(nla_line[RDMA_NLDEV_ATTR_PORT_INDEX]);
+
+	hwc_table = nla_line[RDMA_NLDEV_ATTR_STAT_HWCOUNTERS];
+	qp_table = nla_line[RDMA_NLDEV_ATTR_RES_QP];
+	if (!hwc_table || !qp_table ||
+	    !nla_line[RDMA_NLDEV_ATTR_STAT_COUNTER_ID])
+		return MNL_CB_ERROR;
+
+	cntn = mnl_attr_get_u32(nla_line[RDMA_NLDEV_ATTR_STAT_COUNTER_ID]);
+	if (rd_is_filtered_attr(rd, "cntn", cntn,
+				nla_line[RDMA_NLDEV_ATTR_STAT_COUNTER_ID]))
+		return MNL_CB_OK;
+
+	if (nla_line[RDMA_NLDEV_ATTR_RES_PID]) {
+		pid = mnl_attr_get_u32(nla_line[RDMA_NLDEV_ATTR_RES_PID]);
+		comm = get_task_name(pid);
+	}
+	if (rd_is_filtered_attr(rd, "pid", pid,
+				nla_line[RDMA_NLDEV_ATTR_RES_PID]))
+		return MNL_CB_OK;
+
+	if (nla_line[RDMA_NLDEV_ATTR_RES_KERN_NAME])
+		comm = (char *)mnl_attr_get_str(
+			nla_line[RDMA_NLDEV_ATTR_RES_KERN_NAME]);
+
+	mnl_attr_for_each_nested(nla_entry, qp_table) {
+		struct nlattr *qp_line[RDMA_NLDEV_ATTR_MAX] = {};
+
+		err = mnl_attr_parse_nested(nla_entry, rd_attr_cb, qp_line);
+		if (err != MNL_CB_OK)
+			return -EINVAL;
+
+		if (!qp_line[RDMA_NLDEV_ATTR_RES_LQPN])
+			return -EINVAL;
+
+		qpn = mnl_attr_get_u32(qp_line[RDMA_NLDEV_ATTR_RES_LQPN]);
+		if (rd_is_filtered_attr(rd, "lqpn", qpn,
+					qp_line[RDMA_NLDEV_ATTR_RES_LQPN]))
+			return MNL_CB_OK;
+	}
+
+	err = res_get_hwcounters(rd, hwc_table, false);
+	if (err != MNL_CB_OK)
+		return err;
+
+	if (rd->json_output) {
+		jsonw_string_field(rd->jw, "ifname", name);
+		if (port)
+			jsonw_uint_field(rd->jw, "port", port);
+		jsonw_uint_field(rd->jw, "cntn", cntn);
+	} else {
+		if (port)
+			pr_out("link %s/%u cntn %u ", name, port, cntn);
+		else
+			pr_out("dev %s cntn %u ", name, cntn);
+	}
+
+	res_print_uint(rd, "pid", pid, nla_line[RDMA_NLDEV_ATTR_RES_PID]);
+	print_comm(rd, comm, nla_line);
+
+	res_get_hwcounters(rd, hwc_table, true);
+
+	isfirst = true;
+	mnl_attr_for_each_nested(nla_entry, qp_table) {
+		struct nlattr *qp_line[RDMA_NLDEV_ATTR_MAX] = {};
+
+		if (isfirst && !rd->json_output)
+			pr_out("\n    LQPN: <");
+
+		err = mnl_attr_parse_nested(nla_entry, rd_attr_cb, qp_line);
+		if (err != MNL_CB_OK)
+			return -EINVAL;
+
+		if (!qp_line[RDMA_NLDEV_ATTR_RES_LQPN])
+			return -EINVAL;
+
+		qpn = mnl_attr_get_u32(qp_line[RDMA_NLDEV_ATTR_RES_LQPN]);
+		if (rd->json_output) {
+			jsonw_uint_field(rd->jw, "lqpn", qpn);
+		} else {
+			if (isfirst)
+				pr_out("%d", qpn);
+			else
+				pr_out(", %d", qpn);
+		}
+		isfirst = false;
+	}
+
+	if (!rd->json_output)
+		pr_out(">\n");
+	return MNL_CB_OK;
+}
+
+static int stat_qp_show_parse_cb(const struct nlmsghdr *nlh, void *data)
+{
+	struct nlattr *tb[RDMA_NLDEV_ATTR_MAX] = {};
+	struct nlattr *nla_table, *nla_entry;
+	struct rd *rd = data;
+	const char *name;
+	uint32_t idx;
+	int ret;
+
+	mnl_attr_parse(nlh, 0, rd_attr_cb, tb);
+	if (!tb[RDMA_NLDEV_ATTR_DEV_INDEX] || !tb[RDMA_NLDEV_ATTR_DEV_NAME] ||
+	    !tb[RDMA_NLDEV_ATTR_STAT_COUNTER])
+		return MNL_CB_ERROR;
+
+	name = mnl_attr_get_str(tb[RDMA_NLDEV_ATTR_DEV_NAME]);
+	idx = mnl_attr_get_u32(tb[RDMA_NLDEV_ATTR_DEV_INDEX]);
+	nla_table = tb[RDMA_NLDEV_ATTR_STAT_COUNTER];
+
+	mnl_attr_for_each_nested(nla_entry, nla_table) {
+		struct nlattr *nla_line[RDMA_NLDEV_ATTR_MAX] = {};
+
+		ret = mnl_attr_parse_nested(nla_entry, rd_attr_cb, nla_line);
+		if (ret != MNL_CB_OK)
+			break;
+
+		ret = res_counter_line(rd, name, idx, nla_line);
+		if (ret != MNL_CB_OK)
+			break;
+	}
+
+	return ret;
+}
+
+static const struct filters stat_valid_filters[MAX_NUMBER_OF_FILTERS] = {
+	{ .name = "cntn", .is_number = true },
+	{ .name = "lqpn", .is_number = true },
+	{ .name = "pid", .is_number = true },
+};
+
+static int stat_qp_show_one_link(struct rd *rd)
+{
+	int flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_DUMP;
+	uint32_t seq;
+	int ret;
+
+	if (!rd->port_idx)
+		return 0;
+
+	ret = rd_build_filter(rd, stat_valid_filters);
+	if (ret)
+		return ret;
+
+	rd_prepare_msg(rd, RDMA_NLDEV_CMD_STAT_GET, &seq, flags);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_DEV_INDEX, rd->dev_idx);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_PORT_INDEX, rd->port_idx);
+	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_STAT_RES, RDMA_NLDEV_ATTR_RES_QP);
+	ret = rd_send_msg(rd);
+	if (ret)
+		return ret;
+
+	if (rd->json_output)
+		jsonw_start_object(rd->jw);
+	ret = rd_recv_msg(rd, stat_qp_show_parse_cb, rd, seq);
+	if (rd->json_output)
+		jsonw_end_object(rd->jw);
+
+	return ret;
+}
+
+static int stat_qp_show_link(struct rd *rd)
+{
+	return rd_exec_link(rd, stat_qp_show_one_link, false);
+}
+
+static int stat_qp_show(struct rd *rd)
+{
+	const struct rd_cmd cmds[] = {
+		{ NULL,		stat_qp_show_link },
+		{ "link",	stat_qp_show_link },
+		{ "help",	stat_help },
+		{ 0 }
+	};
+
+	return rd_exec_cmd(rd, cmds, "parameter");
+}
+
+static int stat_qp(struct rd *rd)
+{
+	const struct rd_cmd cmds[] =  {
+		{ NULL,		stat_qp_show },
+		{ "show",	stat_qp_show },
+		{ "list",	stat_qp_show },
+		{ "help",	stat_help },
+		{ 0 }
+	};
+
+	return rd_exec_cmd(rd, cmds, "parameter");
+}
+
+int cmd_stat(struct rd *rd)
+{
+	const struct rd_cmd cmds[] =  {
+		{ NULL,		stat_help },
+		{ "help",	stat_help },
+		{ "qp",		stat_qp },
+		{ 0 }
+	};
+
+	return rd_exec_cmd(rd, cmds, "statistic command");
+}
diff --git a/rdma/utils.c b/rdma/utils.c
index 558d1c29..7bc0439a 100644
--- a/rdma/utils.c
+++ b/rdma/utils.c
@@ -436,6 +436,13 @@ static const enum mnl_attr_data_type nldev_policy[RDMA_NLDEV_ATTR_MAX] = {
 	[RDMA_NLDEV_ATTR_DRIVER_S64] = MNL_TYPE_U64,
 	[RDMA_NLDEV_ATTR_DRIVER_U64] = MNL_TYPE_U64,
 	[RDMA_NLDEV_SYS_ATTR_NETNS_MODE] = MNL_TYPE_U8,
+	[RDMA_NLDEV_ATTR_STAT_COUNTER] = MNL_TYPE_NESTED,
+	[RDMA_NLDEV_ATTR_STAT_COUNTER_ENTRY] = MNL_TYPE_NESTED,
+	[RDMA_NLDEV_ATTR_STAT_COUNTER_ID] = MNL_TYPE_U32,
+	[RDMA_NLDEV_ATTR_STAT_HWCOUNTERS] = MNL_TYPE_NESTED,
+	[RDMA_NLDEV_ATTR_STAT_HWCOUNTER_ENTRY] = MNL_TYPE_NESTED,
+	[RDMA_NLDEV_ATTR_STAT_HWCOUNTER_ENTRY_NAME] = MNL_TYPE_NUL_STRING,
+	[RDMA_NLDEV_ATTR_STAT_HWCOUNTER_ENTRY_VALUE] = MNL_TYPE_U64,
 };
 
 int rd_attr_check(const struct nlattr *attr, int *typep)
-- 
2.20.1
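
The "LQPN: <178>" line in the example output of this patch lists every QP
bound to the counter, separated by commas. Below is a minimal sketch of
that formatting into a caller-provided buffer. sketch_format_lqpn() is a
hypothetical helper, not the tool's code: the patch prints directly with
pr_out() and "%d", whereas this sketch uses PRIu32 because the QP numbers
are uint32_t.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a list of QP numbers as "LQPN: <a, b, c>" into out.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int sketch_format_lqpn(const uint32_t *qpn, size_t count,
			      char *out, size_t len)
{
	size_t used, i;
	int n;

	assert(out != NULL && (qpn != NULL || count == 0));

	n = snprintf(out, len, "LQPN: <");
	if (n < 0 || (size_t)n >= len)
		return -1;
	used = (size_t)n;

	for (i = 0; i < count; i++) {
		/* The first entry has no separator in front of it. */
		n = snprintf(out + used, len - used, "%s%" PRIu32,
			     i ? ", " : "", qpn[i]);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
	}

	n = snprintf(out + used, len - used, ">");
	if (n < 0 || (size_t)n >= len - used)
		return -1;

	return 0;
}
```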



* [PATCH iproute2-rc 1/8] rdma: Update uapi headers to add statistic counter support
From: Leon Romanovsky @ 2019-07-10  7:24 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Leon Romanovsky, netdev, David Ahern, Mark Zhang,
	RDMA mailing list
In-Reply-To: <20190710072455.9125-1-leon@kernel.org>

From: Mark Zhang <markz@mellanox.com>

Update rdma_netlink.h to kernel commit 6e7be47a5345 ("RDMA/nldev:
Allow get default counter statistics through RDMA netlink").

Signed-off-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 rdma/include/uapi/rdma/rdma_netlink.h | 82 +++++++++++++++++++++++++--
 1 file changed, 78 insertions(+), 4 deletions(-)

diff --git a/rdma/include/uapi/rdma/rdma_netlink.h b/rdma/include/uapi/rdma/rdma_netlink.h
index 41cfa84c..d42d6fb2 100644
--- a/rdma/include/uapi/rdma/rdma_netlink.h
+++ b/rdma/include/uapi/rdma/rdma_netlink.h
@@ -147,6 +147,18 @@ enum {
 	IWPM_NLA_HELLO_MAX
 };
 
+/* For RDMA_NLDEV_ATTR_DEV_NODE_TYPE */
+enum {
+	/* IB values map to NodeInfo:NodeType. */
+	RDMA_NODE_IB_CA = 1,
+	RDMA_NODE_IB_SWITCH,
+	RDMA_NODE_IB_ROUTER,
+	RDMA_NODE_RNIC,
+	RDMA_NODE_USNIC,
+	RDMA_NODE_USNIC_UDP,
+	RDMA_NODE_UNSPECIFIED,
+};
+
 /*
  * Local service operations:
  *   RESOLVE - The client requests the local service to resolve a path.
@@ -267,11 +279,15 @@ enum rdma_nldev_command {
 
 	RDMA_NLDEV_CMD_RES_PD_GET, /* can dump */
 
-	RDMA_NLDEV_NUM_OPS
-};
+	RDMA_NLDEV_CMD_GET_CHARDEV,
 
-enum {
-	RDMA_NLDEV_ATTR_ENTRY_STRLEN = 16,
+	RDMA_NLDEV_CMD_STAT_SET,
+
+	RDMA_NLDEV_CMD_STAT_GET, /* can dump */
+
+	RDMA_NLDEV_CMD_STAT_DEL,
+
+	RDMA_NLDEV_NUM_OPS
 };
 
 enum rdma_nldev_print_type {
@@ -478,10 +494,68 @@ enum rdma_nldev_attr {
 	 * File descriptor handle of the net namespace object
 	 */
 	RDMA_NLDEV_NET_NS_FD,			/* u32 */
+	/*
+	 * Information about a chardev.
+	 * CHARDEV_TYPE is the name of the chardev ABI (ie uverbs, umad, etc)
+	 * CHARDEV_ABI signals the ABI revision (historical)
+	 * CHARDEV_NAME is the kernel name for the /dev/ file (no directory)
+	 * CHARDEV is the 64 bit dev_t for the inode
+	 */
+	RDMA_NLDEV_ATTR_CHARDEV_TYPE,		/* string */
+	RDMA_NLDEV_ATTR_CHARDEV_NAME,		/* string */
+	RDMA_NLDEV_ATTR_CHARDEV_ABI,		/* u64 */
+	RDMA_NLDEV_ATTR_CHARDEV,		/* u64 */
+	RDMA_NLDEV_ATTR_UVERBS_DRIVER_ID,       /* u64 */
+	/*
+	 * Counter-specific attributes.
+	 */
+	RDMA_NLDEV_ATTR_STAT_MODE,		/* u32 */
+	RDMA_NLDEV_ATTR_STAT_RES,		/* u32 */
+	RDMA_NLDEV_ATTR_STAT_AUTO_MODE_MASK,	/* u32 */
+	RDMA_NLDEV_ATTR_STAT_COUNTER,		/* nested table */
+	RDMA_NLDEV_ATTR_STAT_COUNTER_ENTRY,	/* nested table */
+	RDMA_NLDEV_ATTR_STAT_COUNTER_ID,	/* u32 */
+	RDMA_NLDEV_ATTR_STAT_HWCOUNTERS,	/* nested table */
+	RDMA_NLDEV_ATTR_STAT_HWCOUNTER_ENTRY,	/* nested table */
+	RDMA_NLDEV_ATTR_STAT_HWCOUNTER_ENTRY_NAME,	/* string */
+	RDMA_NLDEV_ATTR_STAT_HWCOUNTER_ENTRY_VALUE,	/* u64 */
 
 	/*
 	 * Always the end
 	 */
 	RDMA_NLDEV_ATTR_MAX
 };
+
+/*
+ * Supported counter bind modes. All modes are mutual-exclusive.
+ */
+enum rdma_nl_counter_mode {
+	RDMA_COUNTER_MODE_NONE,
+
+	/*
+	 * A qp is bound with a counter automatically during initialization
+	 * based on the auto mode (e.g., qp type, ...)
+	 */
+	RDMA_COUNTER_MODE_AUTO,
+
+	/*
+	 * Which qp are bound with which counter is explicitly specified
+	 * by the user
+	 */
+	RDMA_COUNTER_MODE_MANUAL,
+
+	/*
+	 * Always the end
+	 */
+	RDMA_COUNTER_MODE_MAX,
+};
+
+/*
+ * Supported criteria in counter auto mode.
+ * Currently only "qp type" is supported
+ */
+enum rdma_nl_counter_mask {
+	RDMA_COUNTER_MASK_QP_TYPE = 1,
+};
+
 #endif /* _RDMA_NETLINK_H */
-- 
2.20.1



* [PATCH iproute2-rc 0/8] Statistics counter support
From: Leon Romanovsky @ 2019-07-10  7:24 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Leon Romanovsky, netdev, David Ahern, Mark Zhang,
	RDMA mailing list

From: Leon Romanovsky <leonro@mellanox.com>

Hi,

This is the supplementary part of the kernel series accepted to
rdma-next. That kernel series provided an option to get various
counters: global and per-object.

Currently, all counters are printed in a format similar to other
device/link properties, while the "-p" option prints them in a
table-like format.

[leonro@server ~]$ rdma stat show
link mlx5_0/1 rx_write_requests 0 rx_read_requests 0 rx_atomic_requests
0 out_of_buffer 0 duplicate_request 0 rnr_nak_retry_err 0 packet_seq_err
0 implied_nak_seq_err 0 local_ack_timeout_err 0 resp_local_length_error
0 resp_cqe_error 0 req_cqe_error 0 req_remote_invalid_request 0
req_remote_access_errors 0 resp_remote_access_errors 0
resp_cqe_flush_error 0 req_cqe_flush_error 0 rp_cnp_ignored 0
rp_cnp_handled 0 np_ecn_marked_roce_packets 0 np_cnp_sent 0

[leonro@server ~]$ rdma stat show -p
link mlx5_0/1
	rx_write_requests 0
	rx_read_requests 0
	rx_atomic_requests 0
	out_of_buffer 0
	duplicate_request 0
	rnr_nak_retry_err 0
	packet_seq_err 0
	implied_nak_seq_err 0
	local_ack_timeout_err 0
	resp_local_length_error 0
	resp_cqe_error 0
	req_cqe_error 0
	req_remote_invalid_request 0
	req_remote_access_errors 0
	resp_remote_access_errors 0
	resp_cqe_flush_error 0
	req_cqe_flush_error 0
	rp_cnp_ignored 0
	rp_cnp_handled 0
	np_ecn_marked_roce_packets 0
	np_cnp_sent 0

Thanks

Mark Zhang (8):
  rdma: Update uapi headers to add statistic counter support
  rdma: Add "stat qp show" support
  rdma: Add get per-port counter mode support
  rdma: Add rdma statistic counter per-port auto mode support
  rdma: Make get_port_from_argv() returns valid port in strict port mode
  rdma: Add stat manual mode support
  rdma: Add default counter show support
  rdma: Document counter statistic

 man/man8/rdma-dev.8                   |   1 +
 man/man8/rdma-link.8                  |   1 +
 man/man8/rdma-resource.8              |   1 +
 man/man8/rdma-statistic.8             | 167 ++++++
 man/man8/rdma.8                       |   7 +-
 rdma/Makefile                         |   2 +-
 rdma/include/uapi/rdma/rdma_netlink.h |  82 ++-
 rdma/rdma.c                           |   3 +-
 rdma/rdma.h                           |   1 +
 rdma/stat.c                           | 759 ++++++++++++++++++++++++++
 rdma/utils.c                          |  17 +-
 11 files changed, 1032 insertions(+), 9 deletions(-)
 create mode 100644 man/man8/rdma-statistic.8
 create mode 100644 rdma/stat.c

--
2.20.1



* Re: [RFC v2] vhost: introduce mdev based hardware vhost backend
From: Jason Wang @ 2019-07-10  7:22 UTC (permalink / raw)
  To: Tiwei Bie
  Cc: Alex Williamson, mst, maxime.coquelin, linux-kernel, kvm,
	virtualization, netdev, dan.daly, cunming.liang, zhihong.wang,
	idos, Rob Miller, Ariel Adam
In-Reply-To: <20190710062233.GA16212@___>


On 2019/7/10 下午2:22, Tiwei Bie wrote:
> On Wed, Jul 10, 2019 at 10:26:10AM +0800, Jason Wang wrote:
>> On 2019/7/9 2:33 PM, Tiwei Bie wrote:
>>> On Tue, Jul 09, 2019 at 10:50:38AM +0800, Jason Wang wrote:
>>>> On 2019/7/8 2:16 PM, Tiwei Bie wrote:
>>>>> On Fri, Jul 05, 2019 at 08:49:46AM -0600, Alex Williamson wrote:
>>>>>> On Thu, 4 Jul 2019 14:21:34 +0800
>>>>>> Tiwei Bie <tiwei.bie@intel.com> wrote:
>>>>>>> On Thu, Jul 04, 2019 at 12:31:48PM +0800, Jason Wang wrote:
>>>>>>>> On 2019/7/3 9:08 PM, Tiwei Bie wrote:
>>>>>>>>> On Wed, Jul 03, 2019 at 08:16:23PM +0800, Jason Wang wrote:
>>>>>>>>>> On 2019/7/3 7:52 PM, Tiwei Bie wrote:
>>>>>>>>>>> On Wed, Jul 03, 2019 at 06:09:51PM +0800, Jason Wang wrote:
>>>>>>>>>>>> On 2019/7/3 5:13 PM, Tiwei Bie wrote:
>>>>>>>>>>>>> Details about this can be found here:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://lwn.net/Articles/750770/
>>>>>>>>>>>>>
>>>>>>>>>>>>> What's new in this version
>>>>>>>>>>>>> ==========================
>>>>>>>>>>>>>
>>>>>>>>>>>>> A new VFIO device type is introduced - vfio-vhost. This addressed
>>>>>>>>>>>>> some comments from here:https://patchwork.ozlabs.org/cover/984763/
>>>>>>>>>>>>>
>>>>>>>>>>>>> Below is the updated device interface:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Currently, there are two regions of this device: 1) CONFIG_REGION
>>>>>>>>>>>>> (VFIO_VHOST_CONFIG_REGION_INDEX), which can be used to setup the
>>>>>>>>>>>>> device; 2) NOTIFY_REGION (VFIO_VHOST_NOTIFY_REGION_INDEX), which
>>>>>>>>>>>>> can be used to notify the device.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. CONFIG_REGION
>>>>>>>>>>>>>
>>>>>>>>>>>>> The region described by CONFIG_REGION is the main control interface.
>>>>>>>>>>>>> Messages will be written to or read from this region.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The message type is determined by the `request` field in message
>>>>>>>>>>>>> header. The message size is encoded in the message header too.
>>>>>>>>>>>>> The message format looks like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> struct vhost_vfio_op {
>>>>>>>>>>>>> 	__u64 request;
>>>>>>>>>>>>> 	__u32 flags;
>>>>>>>>>>>>> 	/* Flag values: */
>>>>>>>>>>>>>        #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
>>>>>>>>>>>>> 	__u32 size;
>>>>>>>>>>>>> 	union {
>>>>>>>>>>>>> 		__u64 u64;
>>>>>>>>>>>>> 		struct vhost_vring_state state;
>>>>>>>>>>>>> 		struct vhost_vring_addr addr;
>>>>>>>>>>>>> 	} payload;
>>>>>>>>>>>>> };
>>>>>>>>>>>>>
>>>>>>>>>>>>> The existing vhost-kernel ioctl cmds are reused as the message
>>>>>>>>>>>>> requests in above structure.
>>>>>>>>>>>> Still a comments like V1. What's the advantage of inventing a new protocol?
>>>>>>>>>>> I'm trying to make it work in VFIO's way..
>>>>>>>>>>>> I believe either of the following should be better:
>>>>>>>>>>>>
>>>>>>>>>>>> - using vhost ioctl,  we can start from SET_VRING_KICK/SET_VRING_CALL and
>>>>>>>>>>>> extend it with e.g notify region. The advantages is that all exist userspace
>>>>>>>>>>>> program could be reused without modification (or minimal modification). And
>>>>>>>>>>>> vhost API hides lots of details that is not necessary to be understood by
>>>>>>>>>>>> application (e.g in the case of container).
>>>>>>>>>>> Do you mean reusing vhost's ioctl on VFIO device fd directly,
>>>>>>>>>>> or introducing another mdev driver (i.e. vhost_mdev instead of
>>>>>>>>>>> using the existing vfio_mdev) for mdev device?
>>>>>>>>>> Can we simply add them into ioctl of mdev_parent_ops?
>>>>>>>>> Right, either way, these ioctls have to be and just need to be
>>>>>>>>> added in the ioctl of the mdev_parent_ops. But another thing we
>>>>>>>>> also need to consider is that which file descriptor the userspace
>>>>>>>>> will do the ioctl() on. So I'm wondering do you mean let the
>>>>>>>>> userspace do the ioctl() on the VFIO device fd of the mdev
>>>>>>>>> device?
>>>>>>>> Yes.
>>>>>>> Got it! I'm not sure what's Alex opinion on this. If we all
>>>>>>> agree with this, I can do it in this way.
>>>>>>>
>>>>>>>> Is there any other way btw?
>>>>>>> Just a quick thought.. Maybe totally a bad idea. I was thinking
>>>>>>> whether it would be odd to do non-VFIO's ioctls on VFIO's device
>>>>>>> fd. So I was wondering whether it's possible to allow binding
>>>>>>> another mdev driver (e.g. vhost_mdev) to the supported mdev
>>>>>>> devices. The new mdev driver, vhost_mdev, can provide similar
>>>>>>> ways to let userspace open the mdev device and do the vhost ioctls
>>>>>>> on it. To distinguish with the vfio_mdev compatible mdev devices,
>>>>>>> the device API of the new vhost_mdev compatible mdev devices
>>>>>>> might be e.g. "vhost-net" for net?
>>>>>>>
>>>>>>> So in VFIO case, the device will be for passthru directly. And
>>>>>>> in VHOST case, the device can be used to accelerate the existing
>>>>>>> virtualized devices.
>>>>>>>
>>>>>>> How do you think?
>>>>>> VFIO really can't prevent vendor specific ioctls on the device file
>>>>>> descriptor for mdevs, but a) we'd want to be sure the ioctl address
>>>>>> space can't collide with ioctls we'd use for vfio defined purposes and
>>>>>> b) maybe the VFIO user API isn't what you want in the first place if
>>>>>> you intend to mostly/entirely ignore the defined ioctl set and replace
>>>>>> them with your own.  In the case of the latter, you're also not getting
>>>>>> the advantages of the existing VFIO userspace code, so why expose a
>>>>>> VFIO device at all.
>>>>> Yeah, I totally agree.
>>>> I guess the original idea is to reuse the VFIO DMA/IOMMU API for this. Then
>>>> we have the chance to reuse vfio codes in qemu for dealing with e.g vIOMMU.
>>> Yeah, you are right. We have several choices here:
>>>
>>> #1. We expose a VFIO device, so we can reuse the VFIO container/group
>>>       based DMA API and potentially reuse a lot of VFIO code in QEMU.
>>>
>>>       But in this case, we have two choices for the VFIO device interface
>>>       (i.e. the interface on top of VFIO device fd):
>>>
>>>       A) we may invent a new vhost protocol (as demonstrated by the code
>>>          in this RFC) on VFIO device fd to make it work in VFIO's way,
>>>          i.e. regions and irqs.
>>>
>>>       B) Or as you proposed, instead of inventing a new vhost protocol,
>>>          we can reuse most existing vhost ioctls on the VFIO device fd
>>>          directly. There should be no conflicts between the VFIO ioctls
>>>          (type is 0x3B) and VHOST ioctls (type is 0xAF) currently.
>>>
>>> #2. Instead of exposing a VFIO device, we may expose a VHOST device.
>>>       And we will introduce a new mdev driver vhost-mdev to do this.
>>>       It would be natural to reuse the existing kernel vhost interface
>>>       (ioctls) on it as much as possible. But we will need to invent
>>>       some APIs for DMA programming (reusing VHOST_SET_MEM_TABLE is a
>>>       choice, but it's too heavy and doesn't support vIOMMU by itself).
>>>
>>> I'm not sure which one is the best choice we all want..
>>> Which one (#1/A, #1/B, or #2) would you prefer?
>>
>> #2 looks better. One concern is that we may end up with similar API as what
>> VFIO does.
> Yeah, that's a major concern. If it's true, is it something
> that's not acceptable?


I don't think it's unacceptable, but I don't know whether anyone else cares about this.


>
>> And I do see some new RFC for VFIO to add more DMA API.
> Is there any pointers?


I don't remember the details, but it should be something related to SVA
support in recent Intel IOMMUs.


>
>> Consider it was still in the stage of RFC, does it make sense if we try this
>> way with some sample parents?
> I think it makes sense.


Just one more thought: for sample parents, vhost-net should be much
easier in both implementation and testing.


>
>>
>>>>>> The mdev interface does provide a general interface for creating and
>>>>>> managing virtual devices, vfio-mdev is just one driver on the mdev
>>>>>> bus.  Parav (Mellanox) has been doing work on mdev-core to help clean
>>>>>> out vfio-isms from the interface, aiui, with the intent of implementing
>>>>>> another mdev bus driver for using the devices within the kernel.
>>>>> Great to know this! I found below series after some searching:
>>>>>
>>>>> https://lkml.org/lkml/2019/3/8/821
>>>>>
>>>>> In above series, the new mlx5_core mdev driver will do the probe
>>>>> by calling mlx5_get_core_dev() first on the parent device of the
>>>>> mdev device. In vhost_mdev, maybe we can also keep track of all
>>>>> the compatible mdev devices and use this info to do the probe.
>>>> I don't get why this is needed. My understanding is if we want to go this
>>>> way, there're actually two parts. 1) Vhost mdev that implements the device
>>>> managements and vhost ioctl. 2) Vhost it self, which can accept mdev fd as
>>>> it backend through VHOST_NET_SET_BACKEND.
>>> I think with vhost-mdev (or with vfio-mdev if we agree to do vhost
>>> ioctls on vfio device fd directly), we don't need to open /dev/vhost-net
>>> (and there is no VHOST_NET_SET_BACKEND needed) at all. Either way,
>>> after getting the fd of the mdev, we just need to do vhost ioctls
>>> on it directly.
>>
>> The reason I ask is that vhost-net is designed to not tied to any kind of
>> backend. So it's better to have a single place to deal with ioctl. But it's
>> not must.
> I think in vhost-mdev, there is a chance for us to have a
> unified interface in /dev for all vhost mediated devices
> (not limited to net) in the system (similar to the case of
> /dev/vfio/) instead of making it a backend of vhost-net.
>
> For the code organization, it's possible for us to refactor
> drivers/vhost/ and let it provide some APIs for parent devices
> to handle generic vhost ioctls.


Yes, and separate the current kthread based software dataplane out of 
the core APIs.

Thanks


>
> Thanks,
> Tiwei
>
>> Thanks
>>
>>
>>>>> But we also need a way to allow vfio_mdev driver to distinguish
>>>>> and reject the incompatible mdev devices.
>>>> One issue for this series is that it doesn't consider DMA isolation at all.
>>>>
>>>>
>>>>>> It
>>>>>> seems like this vhost-mdev driver might be similar, using mdev but not
>>>>>> necessarily vfio-mdev to expose devices.  Thanks,
>>>>> Yeah, I also think so!
>>>> I've cced some driver developers for their inputs. I think we need a sample
>>>> parent drivers in the next version for us to understand the full picture.
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>>> Thanks!
>>>>> Tiwei
>>>>>
>>>>>> Alex

^ permalink raw reply

* Re: [PATCH] r8169: add enable_aspm parameter
From: AceLan Kao @ 2019-07-10  7:05 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: Realtek linux nic maintainers, David S. Miller, netdev,
	Linux-Kernel@Vger. Kernel. Org
In-Reply-To: <CAFv23Q=mA9t0j2F4fKdOkgG6sao0m7rR_9-d9OvAmSerZf_=ew@mail.gmail.com>

Hi Heiner,

I've tried and verified your PCI ASPM patches, and they work well.
I've replied on the patch thread and hope this helps it make some progress.

BTW, do you think we can revert commit b75bb8a5b755 ("r8169: disable
ASPM again") once the PCI ASPM patches get merged?

Best regards,
AceLan Kao.

AceLan Kao <acelan.kao@canonical.com> wrote on Tue, Jul 9, 2019 at 11:19 AM:
>
> Heiner Kallweit <hkallweit1@gmail.com> wrote on Tue, Jul 9, 2019 at 2:27 AM:
> >
> > On 08.07.2019 08:37, AceLan Kao wrote:
> > > We have many commits in the driver which enable and then disable ASPM
> > > function over and over again.
> > >    commit b75bb8a5b755 ("r8169: disable ASPM again")
> > >    commit 0866cd15029b ("r8169: enable ASPM on RTL8106E")
> > >    commit 94235460f9ea ("r8169: Align ASPM/CLKREQ setting function with vendor driver")
> > >    commit aa1e7d2c31ef ("r8169: enable ASPM on RTL8168E-VL")
> > >    commit f37658da21aa ("r8169: align ASPM entry latency setting with vendor driver")
> > >    commit a99790bf5c7f ("r8169: Reinstate ASPM Support")
> > >    commit 671646c151d4 ("r8169: Don't disable ASPM in the driver")
> > >    commit 4521e1a94279 ("Revert "r8169: enable internal ASPM and clock request settings".")
> > >    commit d64ec841517a ("r8169: enable internal ASPM and clock request settings")
> > >
> > > This function is very important for production, and if we can't come out
> > > a solution to make both happy, I'd suggest we add a parameter in the
> > > driver to toggle it.
> > >
> > The usage of a module parameter to control ASPM is discouraged.
> > There have been more such attempts in the past that have been declined.
> >
> > Pending with the PCI maintainers is a series adding ASPM control
> > via sysfs, see here: https://www.spinics.net/lists/linux-pci/msg83228.html
> Cool, I'll try your patches and reply on that thread.
>
> >
> > Also more details than just stating "it's important for production"
> > would have been appreciated in the commit message, e.g. which
> > power-savings you can achieve with ASPM on which systems.
> I should use more specific wordings rather than "important for
> production", thanks.

^ permalink raw reply

* Re: Question about nf_conntrack_proto for IPsec
From: Naruto Nguyen @ 2019-07-10  6:55 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel, netdev, netfilter
In-Reply-To: <20190626111322.gks5qptax3iqrjao@breakpoint.cc>

Hi Florian,

Thanks a lot for your reply.

Could you please elaborate on how the generic tracker tracks ESP connections?

Brs,
Bao

On Wed, 26 Jun 2019 at 18:13, Florian Westphal <fw@strlen.de> wrote:
>
> Naruto Nguyen <narutonguyen2018@gmail.com> wrote:
> > In linux/latest/source/net/netfilter/ folder, I only see we have
> > nf_conntrack_proto_tcp.c, nf_conntrack_proto_udp.c and some other
> > conntrack implementations for other protocols but I do not see
> > nf_conntrack_proto for IPsec, so does it mean connection tracking
> > cannot track ESP or AH protocol as a connection. I mean when I use
> > "conntrack -L" command, I will not see ESP or AH  connection is saved
> > in conntrack list. Could you please help me to understand if conntrack
> > supports that and any reasons if it does not support?
>
> ESP/AH etc. use the generic tracker, i.e. only one ESP connection
> is tracked between each endpoint.

^ permalink raw reply

* Re: [PATCH v3 net-next 19/19] ionic: Add basic devlink interface
From: Jiri Pirko @ 2019-07-10  6:48 UTC (permalink / raw)
  To: Shannon Nelson; +Cc: netdev
In-Reply-To: <0ae90b8d-5c73-e60d-8e56-5f6f56331e1a@pensando.io>

Tue, Jul 09, 2019 at 09:13:53PM CEST, snelson@pensando.io wrote:
>On 7/8/19 11:56 PM, Jiri Pirko wrote:
>> Tue, Jul 09, 2019 at 12:58:00AM CEST, snelson@pensando.io wrote:
>> > On 7/8/19 1:03 PM, Jiri Pirko wrote:
>> > > Mon, Jul 08, 2019 at 09:58:09PM CEST, snelson@pensando.io wrote:
>> > > > On 7/8/19 12:34 PM, Jiri Pirko wrote:
>> > > > > Mon, Jul 08, 2019 at 09:25:32PM CEST, snelson@pensando.io wrote:
>> > > > > > +
>> > > > > > +static const struct devlink_ops ionic_dl_ops = {
>> > > > > > +	.info_get	= ionic_dl_info_get,
>> > > > > > +};
>> > > > > > +
>> > > > > > +int ionic_devlink_register(struct ionic *ionic)
>> > > > > > +{
>> > > > > > +	struct devlink *dl;
>> > > > > > +	struct ionic **ip;
>> > > > > > +	int err;
>> > > > > > +
>> > > > > > +	dl = devlink_alloc(&ionic_dl_ops, sizeof(struct ionic *));
>> > > > > Oups. Something is wrong with your flow. The devlink alloc is allocating
>> > > > > the structure that holds private data (per-device data) for you. This is
>> > > > > misuse :/
>> > > > > 
>> > > > > You are missing one parent device struct apparently.
>> > > > > 
>> > > > > Oh, I think I see something like it. The unused "struct ionic_devlink".
>> > > > If I'm not mistaken, the alloc is only allocating enough for a pointer, not
>> > > > the whole per device struct, and a few lines down from here the pointer to
>> > > > the new devlink struct is assigned to ionic->dl.  This was based on what I
>> > > > found in the qed driver's qed_devlink_register(), and it all seems to work.
>> > > I'm not saying your code won't work. What I say is that you should have
>> > > a struct for device that would be allocated by devlink_alloc()
>> > Is there a particular reason why?  I appreciate that devlink_alloc() can give
>> > you this device specific space, just as alloc_etherdev_mq() can, but is there
>> Yes. Devlink manipulates with the whole device. However,
>> alloc_etherdev_mq() allocates only net_device. These are 2 different
>> things. devlink port relates 1:1 to net_device. However, devlink
>> instance can have multiple ports. What I say is do it correctly.
>
>So what you are saying is that anyone who wants to add even the smallest
>devlink feature to their driver needs to rework their basic device memory
>setup to do it the devlink way.  I can see where some folks may have a
>problem with this.

It's just about having a structure to hold device data. You don't have
to rework anything, just add this small one.
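
As a rough illustration of the "small structure" idea, here is a userspace
model (not kernel code, and not the real devlink API). It only mimics the
pattern where the allocator hands back an object with a private area sized by
the caller, and the driver places a small per-device struct there that points
at its existing state. Every name prefixed with model_ is hypothetical, and
the field layout is an assumption made for this sketch.

```c
/* Userspace model only: this is NOT the kernel devlink API.
 * All model_* names are hypothetical and exist just for illustration.
 */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the core object; the caller-sized private area follows it. */
struct model_devlink {
	int registered;
	max_align_t priv[];	/* private area, suitably aligned for any type */
};

/* Allocate the core object plus priv_size bytes of zeroed private data. */
static struct model_devlink *model_devlink_alloc(size_t priv_size)
{
	assert(priv_size > 0);
	return calloc(1, sizeof(struct model_devlink) + priv_size);
}

/* Return the start of the private area. */
static void *model_devlink_priv(struct model_devlink *dl)
{
	return dl->priv;
}

/* The "small one": a per-device struct stored in the private area. */
struct model_ionic_dev {
	struct model_devlink *dl;	/* back-pointer to the owning object */
	void *ionic;			/* stand-in for the existing driver struct */
	unsigned int nports;		/* room for per-device state later on */
};

/* Create the per-device struct inside the private area. */
static struct model_ionic_dev *model_ionic_dev_create(void *ionic)
{
	struct model_devlink *dl;
	struct model_ionic_dev *idev;

	dl = model_devlink_alloc(sizeof(struct model_ionic_dev));
	if (!dl)
		return NULL;

	idev = model_devlink_priv(dl);
	idev->dl = dl;
	idev->ionic = ionic;
	return idev;
}

/* Free the whole allocation through the owning object. */
static void model_ionic_dev_destroy(struct model_ionic_dev *idev)
{
	if (idev)
		free(idev->dl);
}
```

In this model the existing driver state is left untouched; the only addition
is the small per-device struct, which is what the private area holds instead
of a bare pointer.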


>
>> 
>> 
>> > a specific reason why this should be used instead of setting up simply a
>> > pointer to a space that has already been allocated?  There are several
>> > drivers that are using it the way I've setup here, which happened to be the
>> > first examples I followed - are they doing something different that makes
>> > this valid for them?
>> Nope. I'll look at that and fix.
>> 
>> 
>> > > The ionic struct should be associated with devlink_port. That you are
>> > > missing too.
>> > We don't support any of devlink_port features at this point, just the simple
>> > device information.
>> No problem, you can still register devlink_port. You don't have to do
>> much in order to do so.
>
>Is there any write-up to help guide developers new to devlink in using the
>interface correctly?  I haven't found much yet, but perhaps I've missed
>something.  The manpages are somewhat useful in showing what the user might
>do, but they really don't help much in guiding the developer through these
>details.

That is not the job of a manpage. See the rest of the code to get inspired.

>
>sln
>

^ permalink raw reply

* Re: [PATCH net-next v6 0/5] devlink: Introduce PCI PF, VF ports and attributes
From: Jiri Pirko @ 2019-07-10  6:41 UTC (permalink / raw)
  To: David Miller; +Cc: jakub.kicinski, parav, netdev, jiri, saeedm
In-Reply-To: <20190709.120336.1987683013901804676.davem@davemloft.net>

Tue, Jul 09, 2019 at 09:03:36PM CEST, davem@davemloft.net wrote:
>From: Jakub Kicinski <jakub.kicinski@netronome.com>
>Date: Tue, 9 Jul 2019 11:20:58 -0700
>
>> On Tue, 9 Jul 2019 08:17:11 +0200, Jiri Pirko wrote:
>>> >But I'll leave it to Jiri and Dave to decide if its worth a respin :)
>>> >Functionally I think this is okay.
>>> 
>>> I'm happy with the set as it is right now. 
>> 
>> To be clear, I am happy enough as well. Hence the review tag.
>
>Series applied, thanks everyone.
>
>>> Anyway, if you want your concerns to be addresses, you should write
>>> them to the appropriate code. This list is hard to follow.
>> 
>> Sorry, I was trying to be concise.
>
>Jiri et al., if Jakub put forth the time and effort to make the list
>and give you feedback you can put forth the effort to go through the
>list and address his feedback with follow-up patches.  You cannot
>dictate how people give feedback to your changes, thank you.

I don't want to do such a thing. I'm just saying it's much easier to
follow the comments when they are provided inline with the actual code.
That's it. It's the usual way.

^ permalink raw reply

* [Patch net] hsr: switch ->dellink() to ->ndo_uninit()
From: Cong Wang @ 2019-07-10  6:24 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, syzbot+097ef84cdc95843fbaa8, Arvid Brodin

Switching from ->priv_destructor to dellink() has an unexpected
consequence: existing RCU readers, that is, hsr_port_get_hsr()
callers, may still be able to read the port list.

Instead of checking the return value of each hsr_port_get_hsr(),
we can just move it to ->ndo_uninit() which is called after
device unregister and synchronize_net(), and we still have RTNL
lock there.

Fixes: b9a1e627405d ("hsr: implement dellink to clean up resources")
Fixes: edf070a0fb45 ("hsr: fix a NULL pointer deref in hsr_dev_xmit()")
Reported-by: syzbot+097ef84cdc95843fbaa8@syzkaller.appspotmail.com
Cc: Arvid Brodin <arvid.brodin@alten.se>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/hsr/hsr_device.c  | 18 ++++++++----------
 net/hsr/hsr_device.h  |  1 -
 net/hsr/hsr_netlink.c |  7 -------
 3 files changed, 8 insertions(+), 18 deletions(-)

diff --git a/net/hsr/hsr_device.c b/net/hsr/hsr_device.c
index f0f9b493c47b..f509b495451a 100644
--- a/net/hsr/hsr_device.c
+++ b/net/hsr/hsr_device.c
@@ -227,13 +227,8 @@ static int hsr_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct hsr_port *master;
 
 	master = hsr_port_get_hsr(hsr, HSR_PT_MASTER);
-	if (master) {
-		skb->dev = master->dev;
-		hsr_forward_skb(skb, master);
-	} else {
-		atomic_long_inc(&dev->tx_dropped);
-		dev_kfree_skb_any(skb);
-	}
+	skb->dev = master->dev;
+	hsr_forward_skb(skb, master);
 	return NETDEV_TX_OK;
 }
 
@@ -348,7 +343,11 @@ static void hsr_announce(struct timer_list *t)
 	rcu_read_unlock();
 }
 
-void hsr_dev_destroy(struct net_device *hsr_dev)
+/* This has to be called after all the readers are gone.
+ * Otherwise we would have to check the return value of
+ * hsr_port_get_hsr().
+ */
+static void hsr_dev_destroy(struct net_device *hsr_dev)
 {
 	struct hsr_priv *hsr;
 	struct hsr_port *port;
@@ -364,8 +363,6 @@ void hsr_dev_destroy(struct net_device *hsr_dev)
 	del_timer_sync(&hsr->prune_timer);
 	del_timer_sync(&hsr->announce_timer);
 
-	synchronize_rcu();
-
 	hsr_del_self_node(&hsr->self_node_db);
 	hsr_del_nodes(&hsr->node_db);
 }
@@ -376,6 +373,7 @@ static const struct net_device_ops hsr_device_ops = {
 	.ndo_stop = hsr_dev_close,
 	.ndo_start_xmit = hsr_dev_xmit,
 	.ndo_fix_features = hsr_fix_features,
+	.ndo_uninit = hsr_dev_destroy,
 };
 
 static struct device_type hsr_type = {
diff --git a/net/hsr/hsr_device.h b/net/hsr/hsr_device.h
index d0fa6b0696d2..6d7759c4f5f9 100644
--- a/net/hsr/hsr_device.h
+++ b/net/hsr/hsr_device.h
@@ -14,7 +14,6 @@
 void hsr_dev_setup(struct net_device *dev);
 int hsr_dev_finalize(struct net_device *hsr_dev, struct net_device *slave[2],
 		     unsigned char multicast_spec, u8 protocol_version);
-void hsr_dev_destroy(struct net_device *hsr_dev);
 void hsr_check_carrier_and_operstate(struct hsr_priv *hsr);
 bool is_hsr_master(struct net_device *dev);
 int hsr_get_max_mtu(struct hsr_priv *hsr);
diff --git a/net/hsr/hsr_netlink.c b/net/hsr/hsr_netlink.c
index 160edd24de4e..8f8337f893ba 100644
--- a/net/hsr/hsr_netlink.c
+++ b/net/hsr/hsr_netlink.c
@@ -69,12 +69,6 @@ static int hsr_newlink(struct net *src_net, struct net_device *dev,
 	return hsr_dev_finalize(dev, link, multicast_spec, hsr_version);
 }
 
-static void hsr_dellink(struct net_device *hsr_dev, struct list_head *head)
-{
-	hsr_dev_destroy(hsr_dev);
-	unregister_netdevice_queue(hsr_dev, head);
-}
-
 static int hsr_fill_info(struct sk_buff *skb, const struct net_device *dev)
 {
 	struct hsr_priv *hsr;
@@ -119,7 +113,6 @@ static struct rtnl_link_ops hsr_link_ops __read_mostly = {
 	.priv_size	= sizeof(struct hsr_priv),
 	.setup		= hsr_dev_setup,
 	.newlink	= hsr_newlink,
-	.dellink	= hsr_dellink,
 	.fill_info	= hsr_fill_info,
 };
 
-- 
2.21.0


^ permalink raw reply related

* Re: [RFC v2] vhost: introduce mdev based hardware vhost backend
From: Tiwei Bie @ 2019-07-10  6:22 UTC (permalink / raw)
  To: Jason Wang
  Cc: Alex Williamson, mst, maxime.coquelin, linux-kernel, kvm,
	virtualization, netdev, dan.daly, cunming.liang, zhihong.wang,
	idos, Rob Miller, Ariel Adam
In-Reply-To: <9aafdc4d-0203-b96e-c205-043db132eb06@redhat.com>

On Wed, Jul 10, 2019 at 10:26:10AM +0800, Jason Wang wrote:
> On 2019/7/9 2:33 PM, Tiwei Bie wrote:
> > On Tue, Jul 09, 2019 at 10:50:38AM +0800, Jason Wang wrote:
> > > On 2019/7/8 2:16 PM, Tiwei Bie wrote:
> > > > On Fri, Jul 05, 2019 at 08:49:46AM -0600, Alex Williamson wrote:
> > > > > On Thu, 4 Jul 2019 14:21:34 +0800
> > > > > Tiwei Bie <tiwei.bie@intel.com> wrote:
> > > > > > On Thu, Jul 04, 2019 at 12:31:48PM +0800, Jason Wang wrote:
> > > > > > > On 2019/7/3 9:08 PM, Tiwei Bie wrote:
> > > > > > > > On Wed, Jul 03, 2019 at 08:16:23PM +0800, Jason Wang wrote:
> > > > > > > > > On 2019/7/3 7:52 PM, Tiwei Bie wrote:
> > > > > > > > > > On Wed, Jul 03, 2019 at 06:09:51PM +0800, Jason Wang wrote:
> > > > > > > > > > > On 2019/7/3 5:13 PM, Tiwei Bie wrote:
> > > > > > > > > > > > Details about this can be found here:
> > > > > > > > > > > > 
> > > > > > > > > > > > https://lwn.net/Articles/750770/
> > > > > > > > > > > > 
> > > > > > > > > > > > What's new in this version
> > > > > > > > > > > > ==========================
> > > > > > > > > > > > 
> > > > > > > > > > > > A new VFIO device type is introduced - vfio-vhost. This addressed
> > > > > > > > > > > > some comments from here:https://patchwork.ozlabs.org/cover/984763/
> > > > > > > > > > > > 
> > > > > > > > > > > > Below is the updated device interface:
> > > > > > > > > > > > 
> > > > > > > > > > > > Currently, there are two regions of this device: 1) CONFIG_REGION
> > > > > > > > > > > > (VFIO_VHOST_CONFIG_REGION_INDEX), which can be used to setup the
> > > > > > > > > > > > device; 2) NOTIFY_REGION (VFIO_VHOST_NOTIFY_REGION_INDEX), which
> > > > > > > > > > > > can be used to notify the device.
> > > > > > > > > > > > 
> > > > > > > > > > > > 1. CONFIG_REGION
> > > > > > > > > > > > 
> > > > > > > > > > > > The region described by CONFIG_REGION is the main control interface.
> > > > > > > > > > > > Messages will be written to or read from this region.
> > > > > > > > > > > > 
> > > > > > > > > > > > The message type is determined by the `request` field in message
> > > > > > > > > > > > header. The message size is encoded in the message header too.
> > > > > > > > > > > > The message format looks like this:
> > > > > > > > > > > > 
> > > > > > > > > > > > struct vhost_vfio_op {
> > > > > > > > > > > > 	__u64 request;
> > > > > > > > > > > > 	__u32 flags;
> > > > > > > > > > > > 	/* Flag values: */
> > > > > > > > > > > >       #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
> > > > > > > > > > > > 	__u32 size;
> > > > > > > > > > > > 	union {
> > > > > > > > > > > > 		__u64 u64;
> > > > > > > > > > > > 		struct vhost_vring_state state;
> > > > > > > > > > > > 		struct vhost_vring_addr addr;
> > > > > > > > > > > > 	} payload;
> > > > > > > > > > > > };
> > > > > > > > > > > > 
> > > > > > > > > > > > The existing vhost-kernel ioctl cmds are reused as the message
> > > > > > > > > > > > requests in above structure.
> > > > > > > > > > > Still a comments like V1. What's the advantage of inventing a new protocol?
> > > > > > > > > > I'm trying to make it work in VFIO's way..
> > > > > > > > > > > I believe either of the following should be better:
> > > > > > > > > > > 
> > > > > > > > > > > - using vhost ioctl,  we can start from SET_VRING_KICK/SET_VRING_CALL and
> > > > > > > > > > > extend it with e.g notify region. The advantages is that all exist userspace
> > > > > > > > > > > program could be reused without modification (or minimal modification). And
> > > > > > > > > > > vhost API hides lots of details that is not necessary to be understood by
> > > > > > > > > > > application (e.g in the case of container).
> > > > > > > > > > Do you mean reusing vhost's ioctl on VFIO device fd directly,
> > > > > > > > > > or introducing another mdev driver (i.e. vhost_mdev instead of
> > > > > > > > > > using the existing vfio_mdev) for mdev device?
> > > > > > > > > Can we simply add them into ioctl of mdev_parent_ops?
> > > > > > > > Right, either way, these ioctls have to be and just need to be
> > > > > > > > added in the ioctl of the mdev_parent_ops. But another thing we
> > > > > > > > also need to consider is that which file descriptor the userspace
> > > > > > > > will do the ioctl() on. So I'm wondering do you mean let the
> > > > > > > > userspace do the ioctl() on the VFIO device fd of the mdev
> > > > > > > > device?
> > > > > > > Yes.
> > > > > > Got it! I'm not sure what's Alex opinion on this. If we all
> > > > > > agree with this, I can do it in this way.
> > > > > > 
> > > > > > > Is there any other way btw?
> > > > > > Just a quick thought.. Maybe totally a bad idea. I was thinking
> > > > > > whether it would be odd to do non-VFIO's ioctls on VFIO's device
> > > > > > fd. So I was wondering whether it's possible to allow binding
> > > > > > another mdev driver (e.g. vhost_mdev) to the supported mdev
> > > > > > devices. The new mdev driver, vhost_mdev, can provide similar
> > > > > > ways to let userspace open the mdev device and do the vhost ioctls
> > > > > > on it. To distinguish with the vfio_mdev compatible mdev devices,
> > > > > > the device API of the new vhost_mdev compatible mdev devices
> > > > > > might be e.g. "vhost-net" for net?
> > > > > > 
> > > > > > So in VFIO case, the device will be for passthru directly. And
> > > > > > in VHOST case, the device can be used to accelerate the existing
> > > > > > virtualized devices.
> > > > > > 
> > > > > > How do you think?
> > > > > VFIO really can't prevent vendor specific ioctls on the device file
> > > > > descriptor for mdevs, but a) we'd want to be sure the ioctl address
> > > > > space can't collide with ioctls we'd use for vfio defined purposes and
> > > > > b) maybe the VFIO user API isn't what you want in the first place if
> > > > > you intend to mostly/entirely ignore the defined ioctl set and replace
> > > > > them with your own.  In the case of the latter, you're also not getting
> > > > > the advantages of the existing VFIO userspace code, so why expose a
> > > > > VFIO device at all.
> > > > Yeah, I totally agree.
> > > 
> > > I guess the original idea is to reuse the VFIO DMA/IOMMU API for this. Then
> > > we have the chance to reuse vfio codes in qemu for dealing with e.g vIOMMU.
> > Yeah, you are right. We have several choices here:
> > 
> > #1. We expose a VFIO device, so we can reuse the VFIO container/group
> >      based DMA API and potentially reuse a lot of VFIO code in QEMU.
> > 
> >      But in this case, we have two choices for the VFIO device interface
> >      (i.e. the interface on top of VFIO device fd):
> > 
> >      A) we may invent a new vhost protocol (as demonstrated by the code
> >         in this RFC) on VFIO device fd to make it work in VFIO's way,
> >         i.e. regions and irqs.
> > 
> >      B) Or as you proposed, instead of inventing a new vhost protocol,
> >         we can reuse most existing vhost ioctls on the VFIO device fd
> >         directly. There should be no conflicts between the VFIO ioctls
> >         (type is 0x3B) and VHOST ioctls (type is 0xAF) currently.
> > 
> > #2. Instead of exposing a VFIO device, we may expose a VHOST device.
> >      And we will introduce a new mdev driver vhost-mdev to do this.
> >      It would be natural to reuse the existing kernel vhost interface
> >      (ioctls) on it as much as possible. But we will need to invent
> >      some APIs for DMA programming (reusing VHOST_SET_MEM_TABLE is a
> >      choice, but it's too heavy and doesn't support vIOMMU by itself).
> > 
> > I'm not sure which one is the best choice we all want..
> > Which one (#1/A, #1/B, or #2) would you prefer?
> 
> 
> #2 looks better. One concern is that we may end up with similar API as what
> VFIO does.

Yeah, that's a major concern. If it's true, is it something
that's not acceptable?

> And I do see some new RFC for VFIO to add more DMA API.

Are there any pointers?

> 
> Consider it was still in the stage of RFC, does it make sense if we try this
> way with some sample parents?

I think it makes sense.

> 
> 
> > 
> > > 
> > > > > The mdev interface does provide a general interface for creating and
> > > > > managing virtual devices, vfio-mdev is just one driver on the mdev
> > > > > bus.  Parav (Mellanox) has been doing work on mdev-core to help clean
> > > > > out vfio-isms from the interface, aiui, with the intent of implementing
> > > > > another mdev bus driver for using the devices within the kernel.
> > > > Great to know this! I found below series after some searching:
> > > > 
> > > > https://lkml.org/lkml/2019/3/8/821
> > > > 
> > > > In the above series, the new mlx5_core mdev driver will do the probe
> > > > by calling mlx5_get_core_dev() first on the parent device of the
> > > > mdev device. In vhost_mdev, maybe we can also keep track of all
> > > > the compatible mdev devices and use this info to do the probe.
> > > 
> > > I don't get why this is needed. My understanding is that if we want to go this
> > > way, there are actually two parts: 1) vhost-mdev, which implements the device
> > > management and vhost ioctls; 2) vhost itself, which can accept an mdev fd as
> > > its backend through VHOST_NET_SET_BACKEND.
> > I think with vhost-mdev (or with vfio-mdev if we agree to do vhost
> > ioctls on vfio device fd directly), we don't need to open /dev/vhost-net
> > (and there is no VHOST_NET_SET_BACKEND needed) at all. Either way,
> > after getting the fd of the mdev, we just need to do vhost ioctls
> > on it directly.
> 
> 
> The reason I ask is that vhost-net is designed not to be tied to any kind of
> backend. So it's better to have a single place to deal with ioctls. But it's
> not a must.

I think in vhost-mdev, there is a chance for us to have a
unified interface in /dev for all vhost mediated devices
(not limited to net) in the system (similar to the case of
/dev/vfio/) instead of making it a backend of vhost-net.

For the code organization, it's possible for us to refactor
drivers/vhost/ and let it provide some APIs for parent devices
to handle generic vhost ioctls.

Thanks,
Tiwei

> 
> Thanks
> 
> 
> > 
> > > 
> > > > But we also need a way to allow vfio_mdev driver to distinguish
> > > > and reject the incompatible mdev devices.
> > > 
> > > One issue for this series is that it doesn't consider DMA isolation at all.
> > > 
> > > 
> > > > > It
> > > > > seems like this vhost-mdev driver might be similar, using mdev but not
> > > > > necessarily vfio-mdev to expose devices.  Thanks,
> > > > Yeah, I also think so!
> > > 
> > > I've cced some driver developers for their input. I think we need a sample
> > > parent driver in the next version for us to understand the full picture.
> > > 
> > > 
> > > Thanks
> > > 
> > > 
> > > > Thanks!
> > > > Tiwei
> > > > 
> > > > > Alex
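
Option #1/B in the thread above relies on the VFIO ioctl type (0x3B) and
the VHOST ioctl type (0xAF) being different, so that both families could
be issued on one device fd without colliding. A minimal sketch of that
idea, assuming the standard Linux ioctl number encoding from
<linux/ioctl.h>: the SKETCH_* names, the two request codes and the helper
functions are hypothetical and are not taken from the RFC; only the two
type bytes come from the discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <linux/ioctl.h>	/* _IO, _IOR, _IOC_TYPE */

/*
 * Type bytes as quoted in the thread. They are redefined here (instead of
 * including <linux/vfio.h> and <linux/vhost.h>) so the sketch builds on
 * its own.
 */
#define SKETCH_VFIO_TYPE	0x3B
#define SKETCH_VHOST_TYPE	0xAF

/* The whole scheme only works while the two type bytes differ. */
static_assert(SKETCH_VFIO_TYPE != SKETCH_VHOST_TYPE,
	      "VFIO and VHOST ioctl types must not overlap");

/* Hypothetical request codes, built the way UAPI headers build theirs. */
#define SKETCH_VFIO_REQ		_IO(SKETCH_VFIO_TYPE, 100)
#define SKETCH_VHOST_REQ	_IOR(SKETCH_VHOST_TYPE, 0x00, unsigned long long)

/* True if the request code carries the VHOST type byte. */
static bool sketch_is_vhost_ioctl(unsigned int cmd)
{
	return _IOC_TYPE(cmd) == SKETCH_VHOST_TYPE;
}

/* True if the request code carries the VFIO type byte. */
static bool sketch_is_vfio_ioctl(unsigned int cmd)
{
	return _IOC_TYPE(cmd) == SKETCH_VFIO_TYPE;
}
```

A device fd handler could use such a check to route each request either
to the VFIO path or to a generic vhost ioctl handler. The type byte only
separates the two families; it says nothing about whether a given vhost
ioctl makes sense for a particular mediated device.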


* [PATCH net-next v3] net/mlx5e: Convert single case statement switch statements into if statements
From: Nathan Chancellor @ 2019-07-10  6:06 UTC (permalink / raw)
  To: Saeed Mahameed, Leon Romanovsky
  Cc: David S. Miller, Boris Pismenny, netdev, linux-rdma, linux-kernel,
	clang-built-linux, Nathan Chancellor, Nick Desaulniers
In-Reply-To: <20190710044748.3924-1-natechancellor@gmail.com>

During the review of commit 1ff2f0fa450e ("net/mlx5e: Return in default
case statement in tx_post_resync_params"), Leon and Nick pointed out
that the switch statements can be converted to single if statements
that return early so that the code is easier to follow.

Suggested-by: Leon Romanovsky <leon@kernel.org>
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
---

v1 -> v2:

* Refactor switch statements into if statements

v2 -> v3:

* Rebase on net-next after v1 was already applied, patch just refactors
  switch statements to if statements.

 .../mellanox/mlx5/core/en_accel/ktls_tx.c     | 34 ++++++-------------
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
index 5c08891806f0..ea032f54197e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
@@ -25,23 +25,17 @@ static void
 fill_static_params_ctx(void *ctx, struct mlx5e_ktls_offload_context_tx *priv_tx)
 {
 	struct tls_crypto_info *crypto_info = priv_tx->crypto_info;
+	struct tls12_crypto_info_aes_gcm_128 *info;
 	char *initial_rn, *gcm_iv;
 	u16 salt_sz, rec_seq_sz;
 	char *salt, *rec_seq;
 	u8 tls_version;
 
-	switch (crypto_info->cipher_type) {
-	case TLS_CIPHER_AES_GCM_128: {
-		struct tls12_crypto_info_aes_gcm_128 *info =
-			(struct tls12_crypto_info_aes_gcm_128 *)crypto_info;
-
-		EXTRACT_INFO_FIELDS;
-		break;
-	}
-	default:
-		WARN_ON(1);
+	if (WARN_ON(crypto_info->cipher_type != TLS_CIPHER_AES_GCM_128))
 		return;
-	}
+
+	info = (struct tls12_crypto_info_aes_gcm_128 *)crypto_info;
+	EXTRACT_INFO_FIELDS;
 
 	gcm_iv      = MLX5_ADDR_OF(tls_static_params, ctx, gcm_iv);
 	initial_rn  = MLX5_ADDR_OF(tls_static_params, ctx, initial_record_number);
@@ -234,24 +228,18 @@ tx_post_resync_params(struct mlx5e_txqsq *sq,
 		      u64 rcd_sn)
 {
 	struct tls_crypto_info *crypto_info = priv_tx->crypto_info;
+	struct tls12_crypto_info_aes_gcm_128 *info;
 	__be64 rn_be = cpu_to_be64(rcd_sn);
 	bool skip_static_post;
 	u16 rec_seq_sz;
 	char *rec_seq;
 
-	switch (crypto_info->cipher_type) {
-	case TLS_CIPHER_AES_GCM_128: {
-		struct tls12_crypto_info_aes_gcm_128 *info =
-			(struct tls12_crypto_info_aes_gcm_128 *)crypto_info;
-
-		rec_seq = info->rec_seq;
-		rec_seq_sz = sizeof(info->rec_seq);
-		break;
-	}
-	default:
-		WARN_ON(1);
+	if (WARN_ON(crypto_info->cipher_type != TLS_CIPHER_AES_GCM_128))
 		return;
-	}
+
+	info = (struct tls12_crypto_info_aes_gcm_128 *)crypto_info;
+	rec_seq = info->rec_seq;
+	rec_seq_sz = sizeof(info->rec_seq);
 
 	skip_static_post = !memcmp(rec_seq, &rn_be, rec_seq_sz);
 	if (!skip_static_post)
-- 
2.22.0
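
The refactoring in the patch above can also be illustrated outside the
kernel. The following is a user-space sketch of the same pattern, an
early return on the unsupported case instead of a single-case switch. All
sketch_* names and types are hypothetical stand-ins for the TLS
structures, and WARN_ON() is left out because it is kernel-only.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the cipher types and crypto info structs. */
enum sketch_cipher {
	SKETCH_CIPHER_AES_GCM_128,
	SKETCH_CIPHER_OTHER,
};

struct sketch_crypto_info {
	int cipher_type;
};

struct sketch_aes_gcm_128 {
	struct sketch_crypto_info info;	/* first member, so the cast below is valid */
	unsigned char rec_seq[8];
};

/*
 * Returns 0 and fills the outputs for the one supported cipher.
 * Returns -1 for anything else and leaves the outputs untouched, so the
 * caller never reads values that were not set.
 */
static int sketch_get_rec_seq(struct sketch_crypto_info *crypto_info,
			      unsigned char **rec_seq, size_t *rec_seq_sz)
{
	struct sketch_aes_gcm_128 *info;

	if (crypto_info->cipher_type != SKETCH_CIPHER_AES_GCM_128)
		return -1;

	info = (struct sketch_aes_gcm_128 *)crypto_info;
	*rec_seq = info->rec_seq;
	*rec_seq_sz = sizeof(info->rec_seq);
	return 0;
}
```

With the guard first, the supported path reads top to bottom at a single
indentation level, and the compiler can see that the outputs are only
used after they have been assigned.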



* Re: [PATCH v2] net/mlx5e: Refactor switch statements to avoid using uninitialized variables
From: Nathan Chancellor @ 2019-07-10  5:57 UTC (permalink / raw)
  To: David Miller
  Cc: saeedm, leon, borisp, netdev, linux-rdma, linux-kernel,
	clang-built-linux, ndesaulniers
In-Reply-To: <20190709.223657.1108624224137142530.davem@davemloft.net>

On Tue, Jul 09, 2019 at 10:36:57PM -0700, David Miller wrote:
> 
> I applied your simpler addition of the return statement so that I could
> get the net-next pull request out tonight, just FYI...

Thanks for the heads up, I'll spin up a v3 just focusing on the
refactoring.

Cheers,
Nathan

