Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next 7/7] net: sched: call reoffload op on block callback reg
From: Jiri Pirko @ 2018-06-25 21:22 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, xiyou.wangcong, jhs, gerlitz.or, netdev, oss-drivers,
	John Hurley
In-Reply-To: <20180625141014.61501799@cakuba.netronome.com>

Mon, Jun 25, 2018 at 11:10:14PM CEST, jakub.kicinski@netronome.com wrote:
>On Mon, 25 Jun 2018 22:58:32 +0200, Jiri Pirko wrote:
>> Mon, Jun 25, 2018 at 06:34:31AM CEST, jakub.kicinski@netronome.com wrote:
>> >From: John Hurley <john.hurley@netronome.com>  
>> 
>> [...]	
>> 	
>> >+static int
>> >+tcf_block_playback_offloads(struct tcf_block *block, tc_setup_cb_t *cb,
>> >+			    void *cb_priv, bool add, bool offload_in_use,
>> >+			    struct netlink_ext_ack *extack)
>> >+{
>> >+	struct tcf_chain *chain;
>> >+	struct tcf_proto *tp;
>> >+	int err;
>> >+
>> >+	list_for_each_entry(chain, &block->chain_list, list) {
>> >+		for (tp = rtnl_dereference(chain->filter_chain); tp;
>> >+		     tp = rtnl_dereference(tp->next)) {
>> >+			if (tp->ops->reoffload) {
>> >+				err = tp->ops->reoffload(tp, add, cb, cb_priv,
>> >+							 extack);
>> >+				if (err && add)
>> >+					goto err_playback_remove;
>> >+			} else if (add && offload_in_use) {
>> >+				err = -EOPNOTSUPP;
>> >+				NL_SET_ERR_MSG(extack, "Filter replay failed - a filters doesn't support re-offloading");  
>> 
>> This msg sounds weird. Please fix it.
>
>Indeed..  How about: 
>
>"Filter HW offload failed - classifier without re-offloading support"

Sounds good.

>
>> Otherwise this looks very good to me! Thanks!
>
>Cool, thanks for the comments!  I will respin shortly.

^ permalink raw reply

* [PATCH net-next v2 0/7] net: sched: support replay of filter offload when binding to block
From: Jakub Kicinski @ 2018-06-25 21:30 UTC (permalink / raw)
  To: davem, jiri
  Cc: xiyou.wangcong, jhs, gerlitz.or, netdev, oss-drivers,
	Jakub Kicinski

Hi!

This series from John adds the ability to replay filter offload requests
when new offload callback is being registered on a TC block.  This is most
likely to take place for shared blocks today, when a block which already
has rules is bound to another interface.  Prior to this patch set if any
of the rules were offloaded the block bind would fail.

A new tcf_proto_op is added to generate a filter-specific offload request.
The new 'offload' op is supporting extack from day 0, hence we need to
propagate extack to .ndo_setup_tc TC_BLOCK_BIND/TC_BLOCK_UNBIND and
through tcf_block_cb_register() to tcf_block_playback_offloads().

The immediate use of this patch set is to simplify life of drivers which
require duplicating rules when sharing blocks.  Switch drivers (mlxsw)
can bind ports to rule lists dynamically, NIC drivers generally don't
have that ability and need the rules to be duplicated for each ingress
they match on.  In code terms this means that switch drivers don't
register multiple callbacks for each port.  NIC drivers do, and get a
separate request and hance rule per-port, as if the block was not shared.
The registration fails today, however, if some rules were already present.

As John notes in description of patch 7, drivers which register multiple
callbacks to shared blocks will likely need to flush the rules on block
unbind.  This set makes the core not only replay the the offload add
requests but also offload remove requests when callback is unregistered.

v2:
 - name parameters in patch 2;
 - use unsigned int instead of u32 for in_hw_coun;
 - improve extack message in patch 7.

John Hurley (7):
  net: sched: pass extack pointer to block binds and cb registration
  net: sched: add tcf_proto_op to offload a rule
  net: sched: cls_flower: implement offload tcf_proto_op
  net: sched: cls_matchall: implement offload tcf_proto_op
  net: sched: cls_u32: implement offload tcf_proto_op
  net: sched: cls_bpf: implement offload tcf_proto_op
  net: sched: call reoffload op on block callback reg

 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |   2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c |   2 +-
 .../net/ethernet/chelsio/cxgb4/cxgb4_main.c   |   2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   2 +-
 .../net/ethernet/intel/i40evf/i40evf_main.c   |   2 +-
 drivers/net/ethernet/intel/igb/igb_main.c     |   2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   2 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   2 +-
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |   2 +-
 .../net/ethernet/mellanox/mlxsw/spectrum.c    |  14 ++-
 drivers/net/ethernet/netronome/nfp/bpf/main.c |   2 +-
 .../ethernet/netronome/nfp/flower/offload.c   |   2 +-
 .../net/ethernet/stmicro/stmmac/stmmac_main.c |   2 +-
 drivers/net/netdevsim/netdev.c                |   2 +-
 include/net/act_api.h                         |   3 -
 include/net/pkt_cls.h                         |  17 ++-
 include/net/sch_generic.h                     |  21 ++++
 net/dsa/slave.c                               |   2 +-
 net/sched/cls_api.c                           |  79 ++++++++++---
 net/sched/cls_bpf.c                           |  39 ++++++
 net/sched/cls_flower.c                        |  44 +++++++
 net/sched/cls_matchall.c                      |  32 +++++
 net/sched/cls_u32.c                           | 111 ++++++++++++++++++
 23 files changed, 342 insertions(+), 46 deletions(-)

-- 
2.17.1

^ permalink raw reply

* [PATCH net-next v2 1/7] net: sched: pass extack pointer to block binds and cb registration
From: Jakub Kicinski @ 2018-06-25 21:30 UTC (permalink / raw)
  To: davem, jiri
  Cc: xiyou.wangcong, jhs, gerlitz.or, netdev, oss-drivers, John Hurley,
	Jakub Kicinski
In-Reply-To: <20180625213010.13266-1-jakub.kicinski@netronome.com>

From: John Hurley <john.hurley@netronome.com>

Pass the extact struct from a tc qdisc add to the block bind function and,
in turn, to the setup_tc ndo of binding device via the tc_block_offload
struct. Pass this back to any block callback registrations to allow
netlink logging of fails in the bind process.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |  2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c |  2 +-
 .../net/ethernet/chelsio/cxgb4/cxgb4_main.c   |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  2 +-
 .../net/ethernet/intel/i40evf/i40evf_main.c   |  2 +-
 drivers/net/ethernet/intel/igb/igb_main.c     |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |  2 +-
 .../net/ethernet/mellanox/mlxsw/spectrum.c    | 10 +++++---
 drivers/net/ethernet/netronome/nfp/bpf/main.c |  2 +-
 .../ethernet/netronome/nfp/flower/offload.c   |  2 +-
 .../net/ethernet/stmicro/stmmac/stmmac_main.c |  2 +-
 drivers/net/netdevsim/netdev.c                |  2 +-
 include/net/pkt_cls.h                         | 11 +++++---
 net/dsa/slave.c                               |  2 +-
 net/sched/cls_api.c                           | 25 ++++++++++++-------
 17 files changed, 43 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 176fc9f4d7de..b5fc6414a951 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -7984,7 +7984,7 @@ static int bnxt_setup_tc_block(struct net_device *dev,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, bnxt_setup_tc_block_cb,
-					     bp, bp);
+					     bp, bp, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, bnxt_setup_tc_block_cb, bp);
 		return 0;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
index 05d405905906..0745f2dfc80c 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
@@ -173,7 +173,7 @@ static int bnxt_vf_rep_setup_tc_block(struct net_device *dev,
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block,
 					     bnxt_vf_rep_setup_tc_block_cb,
-					     vf_rep, vf_rep);
+					     vf_rep, vf_rep, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block,
 					bnxt_vf_rep_setup_tc_block_cb, vf_rep);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index dd04a2f89ce6..84eca1d45ad1 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -3016,7 +3016,7 @@ static int cxgb_setup_tc_block(struct net_device *dev,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, cxgb_setup_tc_block_cb,
-					     pi, dev);
+					     pi, dev, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, cxgb_setup_tc_block_cb, pi);
 		return 0;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 7ad2b1b0b125..426b0ccb1fc6 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7554,7 +7554,7 @@ static int i40e_setup_tc_block(struct net_device *dev,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, i40e_setup_tc_block_cb,
-					     np, np);
+					     np, np, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, i40e_setup_tc_block_cb, np);
 		return 0;
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index dc56a8667495..5906c1c1d19d 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -2926,7 +2926,7 @@ static int i40evf_setup_tc_block(struct net_device *dev,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, i40evf_setup_tc_block_cb,
-					     adapter, adapter);
+					     adapter, adapter, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, i40evf_setup_tc_block_cb,
 					adapter);
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 6a78d8272eb2..f1e3397bd405 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2728,7 +2728,7 @@ static int igb_setup_tc_block(struct igb_adapter *adapter,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, igb_setup_tc_block_cb,
-					     adapter, adapter);
+					     adapter, adapter, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, igb_setup_tc_block_cb,
 					adapter);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 3e87dbbc9024..d29bd8fc3ff3 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -9325,7 +9325,7 @@ static int ixgbe_setup_tc_block(struct net_device *dev,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, ixgbe_setup_tc_block_cb,
-					     adapter, adapter);
+					     adapter, adapter, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, ixgbe_setup_tc_block_cb,
 					adapter);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 56c1b6f5593e..134f20a182b5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3371,7 +3371,7 @@ static int mlx5e_setup_tc_block(struct net_device *dev,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, mlx5e_setup_tc_block_cb,
-					     priv, priv);
+					     priv, priv, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, mlx5e_setup_tc_block_cb,
 					priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 57987f6546e8..3f2fe95e01d9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -797,7 +797,7 @@ static int mlx5e_rep_setup_tc_block(struct net_device *dev,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, mlx5e_rep_setup_tc_cb,
-					     priv, priv);
+					     priv, priv, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, mlx5e_rep_setup_tc_cb, priv);
 		return 0;
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 968b88af2ef5..d2bc335dda11 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -1503,7 +1503,8 @@ static int mlxsw_sp_setup_tc_block_cb_flower(enum tc_setup_type type,
 
 static int
 mlxsw_sp_setup_tc_block_flower_bind(struct mlxsw_sp_port *mlxsw_sp_port,
-				    struct tcf_block *block, bool ingress)
+				    struct tcf_block *block, bool ingress,
+				    struct netlink_ext_ack *extack)
 {
 	struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
 	struct mlxsw_sp_acl_block *acl_block;
@@ -1518,7 +1519,7 @@ mlxsw_sp_setup_tc_block_flower_bind(struct mlxsw_sp_port *mlxsw_sp_port,
 			return -ENOMEM;
 		block_cb = __tcf_block_cb_register(block,
 						   mlxsw_sp_setup_tc_block_cb_flower,
-						   mlxsw_sp, acl_block);
+						   mlxsw_sp, acl_block, extack);
 		if (IS_ERR(block_cb)) {
 			err = PTR_ERR(block_cb);
 			goto err_cb_register;
@@ -1596,11 +1597,12 @@ static int mlxsw_sp_setup_tc_block(struct mlxsw_sp_port *mlxsw_sp_port,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		err = tcf_block_cb_register(f->block, cb, mlxsw_sp_port,
-					    mlxsw_sp_port);
+					    mlxsw_sp_port, f->extack);
 		if (err)
 			return err;
 		err = mlxsw_sp_setup_tc_block_flower_bind(mlxsw_sp_port,
-							  f->block, ingress);
+							  f->block, ingress,
+							  f->extack);
 		if (err) {
 			tcf_block_cb_unregister(f->block, cb, mlxsw_sp_port);
 			return err;
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index fcdfb8e7fdea..bf46f7bff912 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -206,7 +206,7 @@ static int nfp_bpf_setup_tc_block(struct net_device *netdev,
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block,
 					     nfp_bpf_setup_tc_block_cb,
-					     nn, nn);
+					     nn, nn, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block,
 					nfp_bpf_setup_tc_block_cb,
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index c0e74aa4cb5e..a427dab4bf49 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -627,7 +627,7 @@ static int nfp_flower_setup_tc_block(struct net_device *netdev,
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block,
 					     nfp_flower_setup_tc_block_cb,
-					     repr, repr);
+					     repr, repr, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block,
 					nfp_flower_setup_tc_block_cb,
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index cba46b62a1cd..2354e30caa78 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -3776,7 +3776,7 @@ static int stmmac_setup_tc_block(struct stmmac_priv *priv,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, stmmac_setup_tc_block_cb,
-				priv, priv);
+				priv, priv, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, stmmac_setup_tc_block_cb, priv);
 		return 0;
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index ec68f38213d9..c9dacc6fcd59 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -260,7 +260,7 @@ nsim_setup_tc_block(struct net_device *dev, struct tc_block_offload *f)
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, nsim_setup_tc_block_cb,
-					     ns, ns);
+					     ns, ns, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, nsim_setup_tc_block_cb, ns);
 		return 0;
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index a3c1a2c47cd4..a2c6d35ba057 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -73,10 +73,11 @@ void tcf_block_cb_incref(struct tcf_block_cb *block_cb);
 unsigned int tcf_block_cb_decref(struct tcf_block_cb *block_cb);
 struct tcf_block_cb *__tcf_block_cb_register(struct tcf_block *block,
 					     tc_setup_cb_t *cb, void *cb_ident,
-					     void *cb_priv);
+					     void *cb_priv,
+					     struct netlink_ext_ack *extack);
 int tcf_block_cb_register(struct tcf_block *block,
 			  tc_setup_cb_t *cb, void *cb_ident,
-			  void *cb_priv);
+			  void *cb_priv, struct netlink_ext_ack *extack);
 void __tcf_block_cb_unregister(struct tcf_block_cb *block_cb);
 void tcf_block_cb_unregister(struct tcf_block *block,
 			     tc_setup_cb_t *cb, void *cb_ident);
@@ -161,7 +162,8 @@ unsigned int tcf_block_cb_decref(struct tcf_block_cb *block_cb)
 static inline
 struct tcf_block_cb *__tcf_block_cb_register(struct tcf_block *block,
 					     tc_setup_cb_t *cb, void *cb_ident,
-					     void *cb_priv)
+					     void *cb_priv,
+					     struct netlink_ext_ack *extack)
 {
 	return NULL;
 }
@@ -169,7 +171,7 @@ struct tcf_block_cb *__tcf_block_cb_register(struct tcf_block *block,
 static inline
 int tcf_block_cb_register(struct tcf_block *block,
 			  tc_setup_cb_t *cb, void *cb_ident,
-			  void *cb_priv)
+			  void *cb_priv, struct netlink_ext_ack *extack)
 {
 	return 0;
 }
@@ -596,6 +598,7 @@ struct tc_block_offload {
 	enum tc_block_command command;
 	enum tcf_block_binder_type binder_type;
 	struct tcf_block *block;
+	struct netlink_ext_ack *extack;
 };
 
 struct tc_cls_common_offload {
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 1e3b6a6d8a40..71536c435132 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -900,7 +900,7 @@ static int dsa_slave_setup_tc_block(struct net_device *dev,
 
 	switch (f->command) {
 	case TC_BLOCK_BIND:
-		return tcf_block_cb_register(f->block, cb, dev, dev);
+		return tcf_block_cb_register(f->block, cb, dev, dev, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, cb, dev);
 		return 0;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index cdc3c87c53e6..8c9fb4b827a1 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -277,18 +277,21 @@ static bool tcf_block_offload_in_use(struct tcf_block *block)
 static int tcf_block_offload_cmd(struct tcf_block *block,
 				 struct net_device *dev,
 				 struct tcf_block_ext_info *ei,
-				 enum tc_block_command command)
+				 enum tc_block_command command,
+				 struct netlink_ext_ack *extack)
 {
 	struct tc_block_offload bo = {};
 
 	bo.command = command;
 	bo.binder_type = ei->binder_type;
 	bo.block = block;
+	bo.extack = extack;
 	return dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_BLOCK, &bo);
 }
 
 static int tcf_block_offload_bind(struct tcf_block *block, struct Qdisc *q,
-				  struct tcf_block_ext_info *ei)
+				  struct tcf_block_ext_info *ei,
+				  struct netlink_ext_ack *extack)
 {
 	struct net_device *dev = q->dev_queue->dev;
 	int err;
@@ -299,10 +302,12 @@ static int tcf_block_offload_bind(struct tcf_block *block, struct Qdisc *q,
 	/* If tc offload feature is disabled and the block we try to bind
 	 * to already has some offloaded filters, forbid to bind.
 	 */
-	if (!tc_can_offload(dev) && tcf_block_offload_in_use(block))
+	if (!tc_can_offload(dev) && tcf_block_offload_in_use(block)) {
+		NL_SET_ERR_MSG(extack, "Bind to offloaded block failed as dev has offload disabled");
 		return -EOPNOTSUPP;
+	}
 
-	err = tcf_block_offload_cmd(block, dev, ei, TC_BLOCK_BIND);
+	err = tcf_block_offload_cmd(block, dev, ei, TC_BLOCK_BIND, extack);
 	if (err == -EOPNOTSUPP)
 		goto no_offload_dev_inc;
 	return err;
@@ -322,7 +327,7 @@ static void tcf_block_offload_unbind(struct tcf_block *block, struct Qdisc *q,
 
 	if (!dev->netdev_ops->ndo_setup_tc)
 		goto no_offload_dev_dec;
-	err = tcf_block_offload_cmd(block, dev, ei, TC_BLOCK_UNBIND);
+	err = tcf_block_offload_cmd(block, dev, ei, TC_BLOCK_UNBIND, NULL);
 	if (err == -EOPNOTSUPP)
 		goto no_offload_dev_dec;
 	return;
@@ -612,7 +617,7 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
 	if (err)
 		goto err_chain_head_change_cb_add;
 
-	err = tcf_block_offload_bind(block, q, ei);
+	err = tcf_block_offload_bind(block, q, ei, extack);
 	if (err)
 		goto err_block_offload_bind;
 
@@ -748,7 +753,8 @@ EXPORT_SYMBOL(tcf_block_cb_decref);
 
 struct tcf_block_cb *__tcf_block_cb_register(struct tcf_block *block,
 					     tc_setup_cb_t *cb, void *cb_ident,
-					     void *cb_priv)
+					     void *cb_priv,
+					     struct netlink_ext_ack *extack)
 {
 	struct tcf_block_cb *block_cb;
 
@@ -772,11 +778,12 @@ EXPORT_SYMBOL(__tcf_block_cb_register);
 
 int tcf_block_cb_register(struct tcf_block *block,
 			  tc_setup_cb_t *cb, void *cb_ident,
-			  void *cb_priv)
+			  void *cb_priv, struct netlink_ext_ack *extack)
 {
 	struct tcf_block_cb *block_cb;
 
-	block_cb = __tcf_block_cb_register(block, cb, cb_ident, cb_priv);
+	block_cb = __tcf_block_cb_register(block, cb, cb_ident, cb_priv,
+					   extack);
 	return IS_ERR(block_cb) ? PTR_ERR(block_cb) : 0;
 }
 EXPORT_SYMBOL(tcf_block_cb_register);
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v2 2/7] net: sched: add tcf_proto_op to offload a rule
From: Jakub Kicinski @ 2018-06-25 21:30 UTC (permalink / raw)
  To: davem, jiri
  Cc: xiyou.wangcong, jhs, gerlitz.or, netdev, oss-drivers, John Hurley,
	Jakub Kicinski
In-Reply-To: <20180625213010.13266-1-jakub.kicinski@netronome.com>

From: John Hurley <john.hurley@netronome.com>

Create a new tcf_proto_op called 'reoffload' that generates a new offload
message for each node in a tcf_proto. Pointers to the tcf_proto and
whether the offload request is to add or delete the node are included.
Also included is a callback function to send the offload message to and
the option of priv data to go with the cb.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
v2:
 - name parameters in header file.
---
 include/net/act_api.h     | 3 ---
 include/net/sch_generic.h | 6 ++++++
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 9e59ebfded62..5ff11adbe2a6 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -190,9 +190,6 @@ static inline void tcf_action_stats_update(struct tc_action *a, u64 bytes,
 #endif
 }
 
-typedef int tc_setup_cb_t(enum tc_setup_type type,
-			  void *type_data, void *cb_priv);
-
 #ifdef CONFIG_NET_CLS_ACT
 int tc_setup_cb_egdev_register(const struct net_device *dev,
 			       tc_setup_cb_t *cb, void *cb_priv);
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 6488daa32f82..18adc9142b18 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -20,6 +20,9 @@ struct qdisc_walker;
 struct tcf_walker;
 struct module;
 
+typedef int tc_setup_cb_t(enum tc_setup_type type,
+			  void *type_data, void *cb_priv);
+
 struct qdisc_rate_table {
 	struct tc_ratespec rate;
 	u32		data[256];
@@ -256,6 +259,9 @@ struct tcf_proto_ops {
 					  bool *last,
 					  struct netlink_ext_ack *);
 	void			(*walk)(struct tcf_proto*, struct tcf_walker *arg);
+	int			(*reoffload)(struct tcf_proto *tp, bool add,
+					     tc_setup_cb_t *cb, void *cb_priv,
+					     struct netlink_ext_ack *extack);
 	void			(*bind_class)(void *, u32, unsigned long);
 
 	/* rtnetlink specific */
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v2 3/7] net: sched: cls_flower: implement offload tcf_proto_op
From: Jakub Kicinski @ 2018-06-25 21:30 UTC (permalink / raw)
  To: davem, jiri
  Cc: xiyou.wangcong, jhs, gerlitz.or, netdev, oss-drivers, John Hurley,
	Jakub Kicinski
In-Reply-To: <20180625213010.13266-1-jakub.kicinski@netronome.com>

From: John Hurley <john.hurley@netronome.com>

Add the reoffload tcf_proto_op in flower to generate an offload message
for each filter in the given tcf_proto. Call the specified callback with
this new offload message. The function only returns an error if the
callback rejects adding a 'hardware only' rule.

A filter contains a flag to indicate if it is in hardware or not. To
ensure the reoffload function properly maintains this flag, keep a
reference counter for the number of instances of the filter that are in
hardware. Only update the flag when this counter changes from or to 0. Add
a generic helper function to implement this behaviour.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
--
v2:
 - use unsigned int for counting how many times filter was offloaded.
---
 include/net/sch_generic.h | 15 +++++++++++++
 net/sched/cls_flower.c    | 44 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 18adc9142b18..7432100027b7 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -336,6 +336,21 @@ static inline void tcf_block_offload_dec(struct tcf_block *block, u32 *flags)
 	block->offloadcnt--;
 }
 
+static inline void
+tc_cls_offload_cnt_update(struct tcf_block *block, unsigned int *cnt,
+			  u32 *flags, bool add)
+{
+	if (add) {
+		if (!*cnt)
+			tcf_block_offload_inc(block, flags);
+		(*cnt)++;
+	} else {
+		(*cnt)--;
+		if (!*cnt)
+			tcf_block_offload_dec(block, flags);
+	}
+}
+
 static inline void qdisc_cb_private_validate(const struct sk_buff *skb, int sz)
 {
 	struct qdisc_skb_cb *qcb;
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 9e8b26a80fb3..352876bb901b 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -87,6 +87,7 @@ struct cls_fl_filter {
 	struct list_head list;
 	u32 handle;
 	u32 flags;
+	unsigned int in_hw_count;
 	struct rcu_work rwork;
 	struct net_device *hw_dev;
 };
@@ -289,6 +290,7 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
 		fl_hw_destroy_filter(tp, f, NULL);
 		return err;
 	} else if (err > 0) {
+		f->in_hw_count = err;
 		tcf_block_offload_inc(block, &f->flags);
 	}
 
@@ -1087,6 +1089,47 @@ static void fl_walk(struct tcf_proto *tp, struct tcf_walker *arg)
 	}
 }
 
+static int fl_reoffload(struct tcf_proto *tp, bool add, tc_setup_cb_t *cb,
+			void *cb_priv, struct netlink_ext_ack *extack)
+{
+	struct cls_fl_head *head = rtnl_dereference(tp->root);
+	struct tc_cls_flower_offload cls_flower = {};
+	struct tcf_block *block = tp->chain->block;
+	struct fl_flow_mask *mask;
+	struct cls_fl_filter *f;
+	int err;
+
+	list_for_each_entry(mask, &head->masks, list) {
+		list_for_each_entry(f, &mask->filters, list) {
+			if (tc_skip_hw(f->flags))
+				continue;
+
+			tc_cls_common_offload_init(&cls_flower.common, tp,
+						   f->flags, extack);
+			cls_flower.command = add ?
+				TC_CLSFLOWER_REPLACE : TC_CLSFLOWER_DESTROY;
+			cls_flower.cookie = (unsigned long)f;
+			cls_flower.dissector = &mask->dissector;
+			cls_flower.mask = &f->mkey;
+			cls_flower.key = &f->key;
+			cls_flower.exts = &f->exts;
+			cls_flower.classid = f->res.classid;
+
+			err = cb(TC_SETUP_CLSFLOWER, &cls_flower, cb_priv);
+			if (err) {
+				if (add && tc_skip_sw(f->flags))
+					return err;
+				continue;
+			}
+
+			tc_cls_offload_cnt_update(block, &f->in_hw_count,
+						  &f->flags, add);
+		}
+	}
+
+	return 0;
+}
+
 static int fl_dump_key_val(struct sk_buff *skb,
 			   void *val, int val_type,
 			   void *mask, int mask_type, int len)
@@ -1438,6 +1481,7 @@ static struct tcf_proto_ops cls_fl_ops __read_mostly = {
 	.change		= fl_change,
 	.delete		= fl_delete,
 	.walk		= fl_walk,
+	.reoffload	= fl_reoffload,
 	.dump		= fl_dump,
 	.bind_class	= fl_bind_class,
 	.owner		= THIS_MODULE,
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v2 4/7] net: sched: cls_matchall: implement offload tcf_proto_op
From: Jakub Kicinski @ 2018-06-25 21:30 UTC (permalink / raw)
  To: davem, jiri
  Cc: xiyou.wangcong, jhs, gerlitz.or, netdev, oss-drivers, John Hurley,
	Jakub Kicinski
In-Reply-To: <20180625213010.13266-1-jakub.kicinski@netronome.com>

From: John Hurley <john.hurley@netronome.com>

Add the reoffload tcf_proto_op in matchall to generate an offload message
for each filter in the given tcf_proto. Call the specified callback with
this new offload message. The function only returns an error if the
callback rejects adding a 'hardware only' rule.

Ensure matchall flags correctly report if the rule is in hw by keeping a
reference counter for the number of instances of the rule offloaded. Only
update the flag when this counter changes from or to 0.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 net/sched/cls_matchall.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
index 47b207ef7762..af16f36ed578 100644
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -21,6 +21,7 @@ struct cls_mall_head {
 	struct tcf_result res;
 	u32 handle;
 	u32 flags;
+	unsigned int in_hw_count;
 	struct rcu_work rwork;
 };
 
@@ -95,6 +96,7 @@ static int mall_replace_hw_filter(struct tcf_proto *tp,
 		mall_destroy_hw_filter(tp, head, cookie, NULL);
 		return err;
 	} else if (err > 0) {
+		head->in_hw_count = err;
 		tcf_block_offload_inc(block, &head->flags);
 	}
 
@@ -235,6 +237,35 @@ static void mall_walk(struct tcf_proto *tp, struct tcf_walker *arg)
 	arg->count++;
 }
 
+static int mall_reoffload(struct tcf_proto *tp, bool add, tc_setup_cb_t *cb,
+			  void *cb_priv, struct netlink_ext_ack *extack)
+{
+	struct cls_mall_head *head = rtnl_dereference(tp->root);
+	struct tc_cls_matchall_offload cls_mall = {};
+	struct tcf_block *block = tp->chain->block;
+	int err;
+
+	if (tc_skip_hw(head->flags))
+		return 0;
+
+	tc_cls_common_offload_init(&cls_mall.common, tp, head->flags, extack);
+	cls_mall.command = add ?
+		TC_CLSMATCHALL_REPLACE : TC_CLSMATCHALL_DESTROY;
+	cls_mall.exts = &head->exts;
+	cls_mall.cookie = (unsigned long)head;
+
+	err = cb(TC_SETUP_CLSMATCHALL, &cls_mall, cb_priv);
+	if (err) {
+		if (add && tc_skip_sw(head->flags))
+			return err;
+		return 0;
+	}
+
+	tc_cls_offload_cnt_update(block, &head->in_hw_count, &head->flags, add);
+
+	return 0;
+}
+
 static int mall_dump(struct net *net, struct tcf_proto *tp, void *fh,
 		     struct sk_buff *skb, struct tcmsg *t)
 {
@@ -289,6 +320,7 @@ static struct tcf_proto_ops cls_mall_ops __read_mostly = {
 	.change		= mall_change,
 	.delete		= mall_delete,
 	.walk		= mall_walk,
+	.reoffload	= mall_reoffload,
 	.dump		= mall_dump,
 	.bind_class	= mall_bind_class,
 	.owner		= THIS_MODULE,
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v2 5/7] net: sched: cls_u32: implement offload tcf_proto_op
From: Jakub Kicinski @ 2018-06-25 21:30 UTC (permalink / raw)
  To: davem, jiri
  Cc: xiyou.wangcong, jhs, gerlitz.or, netdev, oss-drivers, John Hurley,
	Jakub Kicinski
In-Reply-To: <20180625213010.13266-1-jakub.kicinski@netronome.com>

From: John Hurley <john.hurley@netronome.com>

Add the offload tcf_proto_op in cls_u32 to generate an offload message for
each filter and the hashtable in the given tcf_proto. Call the specified
callback with this new offload message. The function only returns an error
if the callback rejects adding a 'hardware only' rule.

A filter contains a flag to indicate if it is in hardware or not. To
ensure the offload function properly maintains this flag, keep a reference
counter for the number of instances of the filter that are in hardware.
Only update the flag when this counter changes from or to 0.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 net/sched/cls_u32.c | 111 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 111 insertions(+)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index fb861f90fde6..d5d2a6dc3921 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -62,6 +62,7 @@ struct tc_u_knode {
 	struct tc_u32_pcnt __percpu *pf;
 #endif
 	u32			flags;
+	unsigned int		in_hw_count;
 #ifdef CONFIG_CLS_U32_MARK
 	u32			val;
 	u32			mask;
@@ -571,6 +572,7 @@ static int u32_replace_hw_knode(struct tcf_proto *tp, struct tc_u_knode *n,
 		u32_remove_hw_knode(tp, n, NULL);
 		return err;
 	} else if (err > 0) {
+		n->in_hw_count = err;
 		tcf_block_offload_inc(block, &n->flags);
 	}
 
@@ -1199,6 +1201,114 @@ static void u32_walk(struct tcf_proto *tp, struct tcf_walker *arg)
 	}
 }
 
+static int u32_reoffload_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht,
+			       bool add, tc_setup_cb_t *cb, void *cb_priv,
+			       struct netlink_ext_ack *extack)
+{
+	struct tc_cls_u32_offload cls_u32 = {};
+	int err;
+
+	tc_cls_common_offload_init(&cls_u32.common, tp, ht->flags, extack);
+	cls_u32.command = add ? TC_CLSU32_NEW_HNODE : TC_CLSU32_DELETE_HNODE;
+	cls_u32.hnode.divisor = ht->divisor;
+	cls_u32.hnode.handle = ht->handle;
+	cls_u32.hnode.prio = ht->prio;
+
+	err = cb(TC_SETUP_CLSU32, &cls_u32, cb_priv);
+	if (err && add && tc_skip_sw(ht->flags))
+		return err;
+
+	return 0;
+}
+
+static int u32_reoffload_knode(struct tcf_proto *tp, struct tc_u_knode *n,
+			       bool add, tc_setup_cb_t *cb, void *cb_priv,
+			       struct netlink_ext_ack *extack)
+{
+	struct tc_u_hnode *ht = rtnl_dereference(n->ht_down);
+	struct tcf_block *block = tp->chain->block;
+	struct tc_cls_u32_offload cls_u32 = {};
+	int err;
+
+	tc_cls_common_offload_init(&cls_u32.common, tp, n->flags, extack);
+	cls_u32.command = add ?
+		TC_CLSU32_REPLACE_KNODE : TC_CLSU32_DELETE_KNODE;
+	cls_u32.knode.handle = n->handle;
+
+	if (add) {
+		cls_u32.knode.fshift = n->fshift;
+#ifdef CONFIG_CLS_U32_MARK
+		cls_u32.knode.val = n->val;
+		cls_u32.knode.mask = n->mask;
+#else
+		cls_u32.knode.val = 0;
+		cls_u32.knode.mask = 0;
+#endif
+		cls_u32.knode.sel = &n->sel;
+		cls_u32.knode.exts = &n->exts;
+		if (n->ht_down)
+			cls_u32.knode.link_handle = ht->handle;
+	}
+
+	err = cb(TC_SETUP_CLSU32, &cls_u32, cb_priv);
+	if (err) {
+		if (add && tc_skip_sw(n->flags))
+			return err;
+		return 0;
+	}
+
+	tc_cls_offload_cnt_update(block, &n->in_hw_count, &n->flags, add);
+
+	return 0;
+}
+
+static int u32_reoffload(struct tcf_proto *tp, bool add, tc_setup_cb_t *cb,
+			 void *cb_priv, struct netlink_ext_ack *extack)
+{
+	struct tc_u_common *tp_c = tp->data;
+	struct tc_u_hnode *ht;
+	struct tc_u_knode *n;
+	unsigned int h;
+	int err;
+
+	for (ht = rtnl_dereference(tp_c->hlist);
+	     ht;
+	     ht = rtnl_dereference(ht->next)) {
+		if (ht->prio != tp->prio)
+			continue;
+
+		/* When adding filters to a new dev, try to offload the
+		 * hashtable first. When removing, do the filters before the
+		 * hashtable.
+		 */
+		if (add && !tc_skip_hw(ht->flags)) {
+			err = u32_reoffload_hnode(tp, ht, add, cb, cb_priv,
+						  extack);
+			if (err)
+				return err;
+		}
+
+		for (h = 0; h <= ht->divisor; h++) {
+			for (n = rtnl_dereference(ht->ht[h]);
+			     n;
+			     n = rtnl_dereference(n->next)) {
+				if (tc_skip_hw(n->flags))
+					continue;
+
+				err = u32_reoffload_knode(tp, n, add, cb,
+							  cb_priv, extack);
+				if (err)
+					return err;
+			}
+		}
+
+		if (!add && !tc_skip_hw(ht->flags))
+			u32_reoffload_hnode(tp, ht, add, cb, cb_priv, extack);
+	}
+
+	return 0;
+}
+
 static void u32_bind_class(void *fh, u32 classid, unsigned long cl)
 {
 	struct tc_u_knode *n = fh;
@@ -1336,6 +1446,7 @@ static struct tcf_proto_ops cls_u32_ops __read_mostly = {
 	.change		=	u32_change,
 	.delete		=	u32_delete,
 	.walk		=	u32_walk,
+	.reoffload	=	u32_reoffload,
 	.dump		=	u32_dump,
 	.bind_class	=	u32_bind_class,
 	.owner		=	THIS_MODULE,
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v2 6/7] net: sched: cls_bpf: implement offload tcf_proto_op
From: Jakub Kicinski @ 2018-06-25 21:30 UTC (permalink / raw)
  To: davem, jiri
  Cc: xiyou.wangcong, jhs, gerlitz.or, netdev, oss-drivers, John Hurley,
	Jakub Kicinski
In-Reply-To: <20180625213010.13266-1-jakub.kicinski@netronome.com>

From: John Hurley <john.hurley@netronome.com>

Add the offload tcf_proto_op in cls_bpf to generate an offload message for
each bpf prog in the given tcf_proto. Call the specified callback with
this new offload message. The function only returns an error if the
callback rejects adding a 'hardware only' prog.

A prog contains a flag to indicate if it is in hardware or not. To
ensure the offload function properly maintains this flag, keep a reference
counter for the number of instances of the prog that are in hardware. Only
update the flag when this counter changes from or to 0.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 net/sched/cls_bpf.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 1aa7f6511065..66e0ac9811f9 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -43,6 +43,7 @@ struct cls_bpf_prog {
 	struct tcf_result res;
 	bool exts_integrated;
 	u32 gen_flags;
+	unsigned int in_hw_count;
 	struct tcf_exts exts;
 	u32 handle;
 	u16 bpf_num_ops;
@@ -174,6 +175,7 @@ static int cls_bpf_offload_cmd(struct tcf_proto *tp, struct cls_bpf_prog *prog,
 			cls_bpf_offload_cmd(tp, oldprog, prog, extack);
 			return err;
 		} else if (err > 0) {
+			prog->in_hw_count = err;
 			tcf_block_offload_inc(block, &prog->gen_flags);
 		}
 	}
@@ -652,6 +654,42 @@ static void cls_bpf_walk(struct tcf_proto *tp, struct tcf_walker *arg)
 	}
 }
 
+static int cls_bpf_reoffload(struct tcf_proto *tp, bool add, tc_setup_cb_t *cb,
+			     void *cb_priv, struct netlink_ext_ack *extack)
+{
+	struct cls_bpf_head *head = rtnl_dereference(tp->root);
+	struct tcf_block *block = tp->chain->block;
+	struct tc_cls_bpf_offload cls_bpf = {};
+	struct cls_bpf_prog *prog;
+	int err;
+
+	list_for_each_entry(prog, &head->plist, link) {
+		if (tc_skip_hw(prog->gen_flags))
+			continue;
+
+		tc_cls_common_offload_init(&cls_bpf.common, tp, prog->gen_flags,
+					   extack);
+		cls_bpf.command = TC_CLSBPF_OFFLOAD;
+		cls_bpf.exts = &prog->exts;
+		cls_bpf.prog = add ? prog->filter : NULL;
+		cls_bpf.oldprog = add ? NULL : prog->filter;
+		cls_bpf.name = prog->bpf_name;
+		cls_bpf.exts_integrated = prog->exts_integrated;
+
+		err = cb(TC_SETUP_CLSBPF, &cls_bpf, cb_priv);
+		if (err) {
+			if (add && tc_skip_sw(prog->gen_flags))
+				return err;
+			continue;
+		}
+
+		tc_cls_offload_cnt_update(block, &prog->in_hw_count,
+					  &prog->gen_flags, add);
+	}
+
+	return 0;
+}
+
 static struct tcf_proto_ops cls_bpf_ops __read_mostly = {
 	.kind		=	"bpf",
 	.owner		=	THIS_MODULE,
@@ -662,6 +700,7 @@ static struct tcf_proto_ops cls_bpf_ops __read_mostly = {
 	.change		=	cls_bpf_change,
 	.delete		=	cls_bpf_delete,
 	.walk		=	cls_bpf_walk,
+	.reoffload	=	cls_bpf_reoffload,
 	.dump		=	cls_bpf_dump,
 	.bind_class	=	cls_bpf_bind_class,
 };
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v2 7/7] net: sched: call reoffload op on block callback reg
From: Jakub Kicinski @ 2018-06-25 21:30 UTC (permalink / raw)
  To: davem, jiri
  Cc: xiyou.wangcong, jhs, gerlitz.or, netdev, oss-drivers, John Hurley,
	Jakub Kicinski
In-Reply-To: <20180625213010.13266-1-jakub.kicinski@netronome.com>

From: John Hurley <john.hurley@netronome.com>

Call the reoffload tcf_proto_op on all tcf_proto nodes in all chains of a
block when a callback tries to register to a block that already has
offloaded rules. If all existing rules cannot be offloaded then the
registration is rejected. This replaces the previous policy of rejecting
such callback registration outright.

On unregistration of a callback, the rules are flushed for that given cb.
The implementation of block sharing in the NFP driver, for example,
duplicates shared rules to all devs bound to a block. This meant that
rules could still exist in hw even after a device is unbound from a block
(assuming the block still remains active).

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
v2:
 - improve extack message.
---
 .../net/ethernet/mellanox/mlxsw/spectrum.c    |  4 +-
 include/net/pkt_cls.h                         |  6 ++-
 net/sched/cls_api.c                           | 54 ++++++++++++++++---
 3 files changed, 52 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index d2bc335dda11..52437363766a 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -1542,7 +1542,7 @@ mlxsw_sp_setup_tc_block_flower_bind(struct mlxsw_sp_port *mlxsw_sp_port,
 
 err_block_bind:
 	if (!tcf_block_cb_decref(block_cb)) {
-		__tcf_block_cb_unregister(block_cb);
+		__tcf_block_cb_unregister(block, block_cb);
 err_cb_register:
 		mlxsw_sp_acl_block_destroy(acl_block);
 	}
@@ -1572,7 +1572,7 @@ mlxsw_sp_setup_tc_block_flower_unbind(struct mlxsw_sp_port *mlxsw_sp_port,
 	err = mlxsw_sp_acl_block_unbind(mlxsw_sp, acl_block,
 					mlxsw_sp_port, ingress);
 	if (!err && !tcf_block_cb_decref(block_cb)) {
-		__tcf_block_cb_unregister(block_cb);
+		__tcf_block_cb_unregister(block, block_cb);
 		mlxsw_sp_acl_block_destroy(acl_block);
 	}
 }
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index a2c6d35ba057..4070b8eb6d14 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -78,7 +78,8 @@ struct tcf_block_cb *__tcf_block_cb_register(struct tcf_block *block,
 int tcf_block_cb_register(struct tcf_block *block,
 			  tc_setup_cb_t *cb, void *cb_ident,
 			  void *cb_priv, struct netlink_ext_ack *extack);
-void __tcf_block_cb_unregister(struct tcf_block_cb *block_cb);
+void __tcf_block_cb_unregister(struct tcf_block *block,
+			       struct tcf_block_cb *block_cb);
 void tcf_block_cb_unregister(struct tcf_block *block,
 			     tc_setup_cb_t *cb, void *cb_ident);
 
@@ -177,7 +178,8 @@ int tcf_block_cb_register(struct tcf_block *block,
 }
 
 static inline
-void __tcf_block_cb_unregister(struct tcf_block_cb *block_cb)
+void __tcf_block_cb_unregister(struct tcf_block *block,
+			       struct tcf_block_cb *block_cb)
 {
 }
 
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 8c9fb4b827a1..bbf8dda96b0e 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -751,19 +751,53 @@ unsigned int tcf_block_cb_decref(struct tcf_block_cb *block_cb)
 }
 EXPORT_SYMBOL(tcf_block_cb_decref);
 
+static int
+tcf_block_playback_offloads(struct tcf_block *block, tc_setup_cb_t *cb,
+			    void *cb_priv, bool add, bool offload_in_use,
+			    struct netlink_ext_ack *extack)
+{
+	struct tcf_chain *chain;
+	struct tcf_proto *tp;
+	int err;
+
+	list_for_each_entry(chain, &block->chain_list, list) {
+		for (tp = rtnl_dereference(chain->filter_chain); tp;
+		     tp = rtnl_dereference(tp->next)) {
+			if (tp->ops->reoffload) {
+				err = tp->ops->reoffload(tp, add, cb, cb_priv,
+							 extack);
+				if (err && add)
+					goto err_playback_remove;
+			} else if (add && offload_in_use) {
+				err = -EOPNOTSUPP;
+				NL_SET_ERR_MSG(extack, "Filter HW offload failed - classifier without re-offloading support");
+				goto err_playback_remove;
+			}
+		}
+	}
+
+	return 0;
+
+err_playback_remove:
+	tcf_block_playback_offloads(block, cb, cb_priv, false, offload_in_use,
+				    extack);
+	return err;
+}
+
 struct tcf_block_cb *__tcf_block_cb_register(struct tcf_block *block,
 					     tc_setup_cb_t *cb, void *cb_ident,
 					     void *cb_priv,
 					     struct netlink_ext_ack *extack)
 {
 	struct tcf_block_cb *block_cb;
+	int err;
 
-	/* At this point, playback of previous block cb calls is not supported,
-	 * so forbid to register to block which already has some offloaded
-	 * filters present.
-	 */
-	if (tcf_block_offload_in_use(block))
-		return ERR_PTR(-EOPNOTSUPP);
+	/* Replay any already present rules */
+	err = tcf_block_playback_offloads(block, cb, cb_priv, true,
+					  tcf_block_offload_in_use(block),
+					  extack);
+	if (err)
+		return ERR_PTR(err);
 
 	block_cb = kzalloc(sizeof(*block_cb), GFP_KERNEL);
 	if (!block_cb)
@@ -788,8 +822,12 @@ int tcf_block_cb_register(struct tcf_block *block,
 }
 EXPORT_SYMBOL(tcf_block_cb_register);
 
-void __tcf_block_cb_unregister(struct tcf_block_cb *block_cb)
+void __tcf_block_cb_unregister(struct tcf_block *block,
+			       struct tcf_block_cb *block_cb)
 {
+	tcf_block_playback_offloads(block, block_cb->cb, block_cb->cb_priv,
+				    false, tcf_block_offload_in_use(block),
+				    NULL);
 	list_del(&block_cb->list);
 	kfree(block_cb);
 }
@@ -803,7 +841,7 @@ void tcf_block_cb_unregister(struct tcf_block *block,
 	block_cb = tcf_block_cb_lookup(block, cb, cb_ident);
 	if (!block_cb)
 		return;
-	__tcf_block_cb_unregister(block_cb);
+	__tcf_block_cb_unregister(block, block_cb);
 }
 EXPORT_SYMBOL(tcf_block_cb_unregister);
 
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH rdma-next 00/12] RDMA fixes 2018-06-24
From: Jason Gunthorpe @ 2018-06-25 21:34 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, Leon Romanovsky, RDMA mailing list, Matan Barak,
	Michael J Ruhl, Noa Osherovich, Raed Salem, Yishai Hadas,
	Saeed Mahameed, linux-netdev
In-Reply-To: <20180624082353.16138-1-leon@kernel.org>

On Sun, Jun 24, 2018 at 11:23:41AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@mellanox.com>
> 
> Hi,
> 
> This is bunch of patches trigged by running syzkaller internally.
> 
> I'm sending them based on rdma-next mainly for two reasons:
> 1, Most of the patches fix the old issues and it doesn't matter when
> they will hit the Linus's tree: now or later in a couple of weeks
> during merge window.
> 2. They interleave with code cleanup, mlx5-next patches and Michael's
> feedback on flow counters series.
> 
> Thanks
> 
> Leon Romanovsky (12):
>   RDMA/uverbs: Protect from attempts to create flows on unsupported QP
>   RDMA/uverbs: Fix slab-out-of-bounds in ib_uverbs_ex_create_flow

I applied these two to for-rc

>   RDMA/uverbs: Check existence of create_flow callback
>   RDMA/verbs: Drop kernel variant of create_flow
>   RDMA/verbs: Drop kernel variant of destroy_flow
>   net/mlx5: Rate limit errors in command interface
>   RDMA/uverbs: Don't overwrite NULL pointer with ZERO_SIZE_PTR
>   RDMA/umem: Don't check for negative return value of dma_map_sg_attrs()
>   RDMA/uverbs: Remove redundant check

These to for-next

>   overflow.h: Add arithmetic shift helper
>   RDMA/mlx5: Fix shift overflow in mlx5_ib_create_wq
>   RDMA/mlx5: Reuse existed shift_overlow helper

And these will have to be respun.

Thanks,
Jason

^ permalink raw reply

* Re: [PATCH net-next v2 1/7] net: sched: pass extack pointer to block binds and cb registration
From: Jiri Pirko @ 2018-06-25 21:35 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, xiyou.wangcong, jhs, gerlitz.or, netdev, oss-drivers,
	John Hurley
In-Reply-To: <20180625213010.13266-2-jakub.kicinski@netronome.com>

Mon, Jun 25, 2018 at 11:30:04PM CEST, jakub.kicinski@netronome.com wrote:
>From: John Hurley <john.hurley@netronome.com>
>
>Pass the extact struct from a tc qdisc add to the block bind function and,
>in turn, to the setup_tc ndo of binding device via the tc_block_offload
>struct. Pass this back to any block callback registrations to allow
>netlink logging of fails in the bind process.
>
>Signed-off-by: John Hurley <john.hurley@netronome.com>
>Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: [PATCH net-next v2 2/7] net: sched: add tcf_proto_op to offload a rule
From: Jiri Pirko @ 2018-06-25 21:35 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, xiyou.wangcong, jhs, gerlitz.or, netdev, oss-drivers,
	John Hurley
In-Reply-To: <20180625213010.13266-3-jakub.kicinski@netronome.com>

Mon, Jun 25, 2018 at 11:30:05PM CEST, jakub.kicinski@netronome.com wrote:
>From: John Hurley <john.hurley@netronome.com>
>
>Create a new tcf_proto_op called 'reoffload' that generates a new offload
>message for each node in a tcf_proto. Pointers to the tcf_proto and
>whether the offload request is to add or delete the node are included.
>Also included is a callback function to send the offload message to and
>the option of priv data to go with the cb.
>
>Signed-off-by: John Hurley <john.hurley@netronome.com>
>Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: [PATCH net-next v2 3/7] net: sched: cls_flower: implement offload tcf_proto_op
From: Jiri Pirko @ 2018-06-25 21:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, xiyou.wangcong, jhs, gerlitz.or, netdev, oss-drivers,
	John Hurley
In-Reply-To: <20180625213010.13266-4-jakub.kicinski@netronome.com>

Mon, Jun 25, 2018 at 11:30:06PM CEST, jakub.kicinski@netronome.com wrote:
>From: John Hurley <john.hurley@netronome.com>
>
>Add the reoffload tcf_proto_op in flower to generate an offload message
>for each filter in the given tcf_proto. Call the specified callback with
>this new offload message. The function only returns an error if the
>callback rejects adding a 'hardware only' rule.
>
>A filter contains a flag to indicate if it is in hardware or not. To
>ensure the reoffload function properly maintains this flag, keep a
>reference counter for the number of instances of the filter that are in
>hardware. Only update the flag when this counter changes from or to 0. Add
>a generic helper function to implement this behaviour.
>
>Signed-off-by: John Hurley <john.hurley@netronome.com>
>Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: [PATCH net-next v2 4/7] net: sched: cls_matchall: implement offload tcf_proto_op
From: Jiri Pirko @ 2018-06-25 21:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, xiyou.wangcong, jhs, gerlitz.or, netdev, oss-drivers,
	John Hurley
In-Reply-To: <20180625213010.13266-5-jakub.kicinski@netronome.com>

Mon, Jun 25, 2018 at 11:30:07PM CEST, jakub.kicinski@netronome.com wrote:
>From: John Hurley <john.hurley@netronome.com>
>
>Add the reoffload tcf_proto_op in matchall to generate an offload message
>for each filter in the given tcf_proto. Call the specified callback with
>this new offload message. The function only returns an error if the
>callback rejects adding a 'hardware only' rule.
>
>Ensure matchall flags correctly report if the rule is in hw by keeping a
>reference counter for the number of instances of the rule offloaded. Only
>update the flag when this counter changes from or to 0.
>
>Signed-off-by: John Hurley <john.hurley@netronome.com>
>Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: [PATCH net-next v2 7/7] net: sched: call reoffload op on block callback reg
From: Jiri Pirko @ 2018-06-25 21:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, xiyou.wangcong, jhs, gerlitz.or, netdev, oss-drivers,
	John Hurley
In-Reply-To: <20180625213010.13266-8-jakub.kicinski@netronome.com>

Mon, Jun 25, 2018 at 11:30:10PM CEST, jakub.kicinski@netronome.com wrote:
>From: John Hurley <john.hurley@netronome.com>
>
>Call the reoffload tcf_proto_op on all tcf_proto nodes in all chains of a
>block when a callback tries to register to a block that already has
>offloaded rules. If all existing rules cannot be offloaded then the
>registration is rejected. This replaces the previous policy of rejecting
>such callback registration outright.
>
>On unregistration of a callback, the rules are flushed for that given cb.
>The implementation of block sharing in the NFP driver, for example,
>duplicates shared rules to all devs bound to a block. This meant that
>rules could still exist in hw even after a device is unbound from a block
>(assuming the block still remains active).
>
>Signed-off-by: John Hurley <john.hurley@netronome.com>
>Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* [PATCH v3 net-next] mdio-mux-gpio: Remove VLA usage
From: Kees Cook @ 2018-06-25 22:09 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel, Andrew Lunn, netdev

In the quest to remove all stack VLA usage from the kernel[1], this
allocates the values buffer during the callback instead of putting it
on the stack.

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
---
v2: allocate array as part of structure (Andrew Lunn)
v3: resend to netdev with Reviewed-by
---
 drivers/net/phy/mdio-mux-gpio.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/net/phy/mdio-mux-gpio.c b/drivers/net/phy/mdio-mux-gpio.c
index 082ffef0dec4..6b55e0ddef63 100644
--- a/drivers/net/phy/mdio-mux-gpio.c
+++ b/drivers/net/phy/mdio-mux-gpio.c
@@ -20,23 +20,23 @@
 struct mdio_mux_gpio_state {
 	struct gpio_descs *gpios;
 	void *mux_handle;
+	int values[];
 };
 
 static int mdio_mux_gpio_switch_fn(int current_child, int desired_child,
 				   void *data)
 {
 	struct mdio_mux_gpio_state *s = data;
-	int values[s->gpios->ndescs];
 	unsigned int n;
 
 	if (current_child == desired_child)
 		return 0;
 
 	for (n = 0; n < s->gpios->ndescs; n++)
-		values[n] = (desired_child >> n) & 1;
+		s->values[n] = (desired_child >> n) & 1;
 
 	gpiod_set_array_value_cansleep(s->gpios->ndescs, s->gpios->desc,
-				       values);
+				       s->values);
 
 	return 0;
 }
@@ -44,15 +44,21 @@ static int mdio_mux_gpio_switch_fn(int current_child, int desired_child,
 static int mdio_mux_gpio_probe(struct platform_device *pdev)
 {
 	struct mdio_mux_gpio_state *s;
+	struct gpio_descs *gpios;
 	int r;
 
-	s = devm_kzalloc(&pdev->dev, sizeof(*s), GFP_KERNEL);
-	if (!s)
+	gpios = gpiod_get_array(&pdev->dev, NULL, GPIOD_OUT_LOW);
+	if (IS_ERR(gpios))
+		return PTR_ERR(gpios);
+
+	s = devm_kzalloc(&pdev->dev, sizeof(*s->values) * gpios->ndescs +
+				     sizeof(*s), GFP_KERNEL);
+	if (!s) {
+		gpiod_put_array(gpios);
 		return -ENOMEM;
+	}
 
-	s->gpios = gpiod_get_array(&pdev->dev, NULL, GPIOD_OUT_LOW);
-	if (IS_ERR(s->gpios))
-		return PTR_ERR(s->gpios);
+	s->gpios = gpios;
 
 	r = mdio_mux_init(&pdev->dev, pdev->dev.of_node,
 			  mdio_mux_gpio_switch_fn, &s->mux_handle, s, NULL);
-- 
2.17.1


-- 
Kees Cook
Pixel Security

^ permalink raw reply related

* Re: [PATCH v3 net-next] mdio-mux-gpio: Remove VLA usage
From: Joe Perches @ 2018-06-25 22:23 UTC (permalink / raw)
  To: Kees Cook, David S. Miller; +Cc: linux-kernel, Andrew Lunn, netdev
In-Reply-To: <20180625220917.GA25022@beast>

On Mon, 2018-06-25 at 15:09 -0700, Kees Cook wrote:
> In the quest to remove all stack VLA usage from the kernel[1], this
> allocates the values buffer during the callback instead of putting it
> on the stack.

[]

> diff --git a/drivers/net/phy/mdio-mux-gpio.c b/drivers/net/phy/mdio-mux-gpio.c
[]
> @@ -44,15 +44,21 @@ static int mdio_mux_gpio_switch_fn(int current_child, int desired_child,
>  static int mdio_mux_gpio_probe(struct platform_device *pdev)
>  {
[]
> +	s = devm_kzalloc(&pdev->dev, sizeof(*s->values) * gpios->ndescs +
> +				     sizeof(*s), GFP_KERNEL);

Isn't this supposed to use your new struct_size()

^ permalink raw reply

* Re: [PATCH net-next 3/4] netdevsim: add ipsec offload testing
From: Shannon Nelson @ 2018-06-25 22:37 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, netdev, anders.roxell, linux-kselftest
In-Reply-To: <20180622210708.43d50d15@cakuba.netronome.com>

On 6/22/2018 9:07 PM, Jakub Kicinski wrote:
> On Fri, 22 Jun 2018 17:31:37 -0700, Shannon Nelson wrote:
>> Implement the IPsec/XFRM offload API for testing.
>>
>> Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
> 
> Thanks for the patch!  Just a number of stylistic nit picks.
> 
>> diff --git a/drivers/net/netdevsim/ipsec.c b/drivers/net/netdevsim/ipsec.c
>> new file mode 100644
>> index 0000000..ad64266
>> --- /dev/null
>> +++ b/drivers/net/netdevsim/ipsec.c
>> @@ -0,0 +1,345 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright(c) 2018 Oracle and/or its affiliates. All rights reserved. */
>> +
>> +#include <net/xfrm.h>
>> +#include <crypto/aead.h>
>> +#include <linux/debugfs.h>
>> +#include "netdevsim.h"
> 
> Other files in the driver sort headers alphabetically and put an empty
> line between global and local headers.

Sure.

> 
>> +#define NSIM_IPSEC_AUTH_BITS	128
>> +
>> +/**
>> + * nsim_ipsec_dbg_read - read for ipsec data
>> + * @filp: the opened file
>> + * @buffer: where to write the data for the user to read
>> + * @count: the size of the user's buffer
>> + * @ppos: file position offset
>> + **/
>> +static ssize_t nsim_dbg_netdev_ops_read(struct file *filp,
> 
> Doesn't match the kdoc.  Please run
> 
> ./scripts/kernel-doc -none $file
> 
> if you want kdoc.  Although IMHO you may as well drop the kdoc, your
> code is quite self explanatory and local.

By adding -v to that I got a couple of warnings that I didn't include 
the Return information - is that what you were commenting on?  The rest 
seems acceptable to the script I'm using from the net-next tree.

> 
>> +					char __user *buffer,
>> +					size_t count, loff_t *ppos)
>> +{
>> +	struct netdevsim *ns = filp->private_data;
>> +	struct nsim_ipsec *ipsec = &ns->ipsec;
>> +	size_t bufsize;
>> +	char *buf, *p;
>> +	int len;
>> +	int i;
>> +
>> +	/* don't allow partial reads */
>> +	if (*ppos != 0)
>> +		return 0;
>> +
>> +	/* the buffer needed is
>> +	 * (num SAs * 3 lines each * ~60 bytes per line) + one more line
>> +	 */
>> +	bufsize = (ipsec->count * 4 * 60) + 60;
>> +	buf = kzalloc(bufsize, GFP_KERNEL);
>> +	if (!buf)
>> +		return -ENOMEM;
>> +
>> +	p = buf;
>> +	p += snprintf(p, bufsize - (p - buf),
>> +		      "SA count=%u tx=%u\n",
>> +		      ipsec->count, ipsec->tx);
>> +
>> +	for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) {
>> +		struct nsim_sa *sap = &ipsec->sa[i];
>> +
>> +		if (!sap->used)
>> +			continue;
>> +
>> +		p += snprintf(p, bufsize - (p - buf),
>> +			      "sa[%i] %cx ipaddr=0x%08x %08x %08x %08x\n",
>> +			      i, (sap->rx ? 'r' : 't'), sap->ipaddr[0],
>> +			      sap->ipaddr[1], sap->ipaddr[2], sap->ipaddr[3]);
>> +		p += snprintf(p, bufsize - (p - buf),
>> +			      "sa[%i]    spi=0x%08x proto=0x%x salt=0x%08x crypt=%d\n",
>> +			      i, be32_to_cpu(sap->xs->id.spi),
>> +			      sap->xs->id.proto, sap->salt, sap->crypt);
>> +		p += snprintf(p, bufsize - (p - buf),
>> +			      "sa[%i]    key=0x%08x %08x %08x %08x\n",
>> +			      i, sap->key[0], sap->key[1],
>> +			      sap->key[2], sap->key[3]);
>> +	}
>> +
>> +	len = simple_read_from_buffer(buffer, count, ppos, buf, p - buf);
> 
> Why not seq_file for this?

Why bother with more interface code?  This is useful enough to support 
the API testing needed.

> 
>> +	kfree(buf);
>> +	return len;
>> +}
>> +
>> +static const struct file_operations ipsec_dbg_fops = {
>> +	.owner = THIS_MODULE,
>> +	.open = simple_open,
>> +	.read = nsim_dbg_netdev_ops_read,
>> +};
>> +
>> +/**
>> + * nsim_ipsec_find_empty_idx - find the first unused security parameter index
>> + * @ipsec: pointer to ipsec struct
>> + **/
>> +static int nsim_ipsec_find_empty_idx(struct nsim_ipsec *ipsec)
>> +{
>> +	u32 i;
>> +
>> +	if (ipsec->count == NSIM_IPSEC_MAX_SA_COUNT)
>> +		return -ENOSPC;
>> +
>> +	/* search sa table */
>> +	for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) {
>> +		if (!ipsec->sa[i].used)
>> +			return i;
>> +	}
>> +
>> +	return -ENOSPC;
> 
> FWIW I personally find bitmaps and find_first_zero_bit() etc. nice and
> concise for a small ID allocator, but no objection to open coding.

Sure, we could add a parallel bitmap data structure to track usage of 
our array elements, and probably would for a much larger array so as to 
lessen the impact of a serial search.  But, since this is a short array 
for simple testing purposes, the search time is minimal so I think the 
simple code is fine.

> 
>> +}
>> +
>> +/**
>> + * nsim_ipsec_parse_proto_keys - find the key and salt based on the protocol
>> + * @xs: pointer to xfrm_state struct
>> + * @mykey: pointer to key array to populate
>> + * @mysalt: pointer to salt value to populate
>> + *
>> + * This copies the protocol keys and salt to our own data tables.  The
>> + * 82599 family only supports the one algorithm.
> 
> 82599 is a fine chip, it's not netdevsim tho? ;)

Yeah, guess where I hacked the code from...  Thanks, I missed this 
reference.

> 
>> + **/
>> +static int nsim_ipsec_parse_proto_keys(struct xfrm_state *xs,
>> +				       u32 *mykey, u32 *mysalt)
>> +{
>> +	struct net_device *dev = xs->xso.dev;
>> +	unsigned char *key_data;
>> +	char *alg_name = NULL;
>> +	const char aes_gcm_name[] = "rfc4106(gcm(aes))";
>> +	int key_len;
> 
> reverse xmas tree please

Yep, missed it here.

> 
>> +
>> +	if (!xs->aead) {
>> +		netdev_err(dev, "Unsupported IPsec algorithm\n");
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (xs->aead->alg_icv_len != NSIM_IPSEC_AUTH_BITS) {
>> +		netdev_err(dev, "IPsec offload requires %d bit authentication\n",
>> +			   NSIM_IPSEC_AUTH_BITS);
>> +		return -EINVAL;
>> +	}
>> +
>> +	key_data = &xs->aead->alg_key[0];
>> +	key_len = xs->aead->alg_key_len;
>> +	alg_name = xs->aead->alg_name;
>> +
>> +	if (strcmp(alg_name, aes_gcm_name)) {
>> +		netdev_err(dev, "Unsupported IPsec algorithm - please use %s\n",
>> +			   aes_gcm_name);
>> +		return -EINVAL;
>> +	}
>> +
>> +	/* The key bytes come down in a bigendian array of bytes, so
>> +	 * we don't need to do any byteswapping.
> 
> Why the mention of bigendian?  82599 needs big endian? -.^

Yep, another useless reference left over from the hack-n-slash - I'll 
remove it.

> 
>> +	 * 160 accounts for 16 byte key and 4 byte salt
>> +	 */
>> +	if (key_len > 128) {
> 
> s/128/NSIM_IPSEC_AUTH_BITS/ ?

Sure.

> 
>> +		*mysalt = ((u32 *)key_data)[4];
> 
> Is alignment guaranteed?  There are the unaligned helpers if you need
> them..

Since the key_data must be at least 128 bits in this implementation, and 
the key itself is 128 bits, anything after is salt, so we can assume 
that the salt data is aligned.

> 
>> +	} else if (key_len == 128) {
>> +		*mysalt = 0;
>> +	} else {
>> +		netdev_err(dev, "IPsec hw offload only supports 128 bit keys with optional 32 bit salt\n");
>> +		return -EINVAL;
>> +	}
>> +	memcpy(mykey, key_data, 16);
>> +
>> +	return 0;
>> +}
>> +
>> +/**
>> + * nsim_ipsec_add_sa - program device with a security association
>> + * @xs: pointer to transformer state struct
>> + **/
>> +static int nsim_ipsec_add_sa(struct xfrm_state *xs)
>> +{
>> +	struct net_device *dev = xs->xso.dev;
>> +	struct netdevsim *ns = netdev_priv(dev);
>> +	struct nsim_ipsec *ipsec = &ns->ipsec;
> 
> xmas tree again (initialize out of line if you have to)

This one is pretty much the way I've done in the past with no complaints 
and seems common enough in other net drivers, specifically when dealing 
with netdev and netdevpriv elements.  Only the first line is out of 
place, with the next lines dependent on it.

> 
>> +	struct nsim_sa sa;
>> +	u16 sa_idx;
>> +	int ret;
>> +
>> +	if (xs->id.proto != IPPROTO_ESP && xs->id.proto != IPPROTO_AH) {
>> +		netdev_err(dev, "Unsupported protocol 0x%04x for ipsec offload\n",
>> +			   xs->id.proto);
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (xs->calg) {
>> +		netdev_err(dev, "Compression offload not supported\n");
>> +		return -EINVAL;
>> +	}
>> +
>> +	/* find the first unused index */
>> +	ret = nsim_ipsec_find_empty_idx(ipsec);
>> +	if (ret < 0) {
>> +		netdev_err(dev, "No space for SA in Rx table!\n");
>> +		return ret;
>> +	}
>> +	sa_idx = (u16)ret;
>> +
>> +	memset(&sa, 0, sizeof(sa));
>> +	sa.used = true;
>> +	sa.xs = xs;
>> +
>> +	if (sa.xs->id.proto & IPPROTO_ESP)
>> +		sa.crypt = xs->ealg || xs->aead;
>> +
>> +	/* get the key and salt */
>> +	ret = nsim_ipsec_parse_proto_keys(xs, sa.key, &sa.salt);
>> +	if (ret) {
>> +		netdev_err(dev, "Failed to get key data for SA table\n");
>> +		return ret;
>> +	}
>> +
>> +	if (xs->xso.flags & XFRM_OFFLOAD_INBOUND) {
>> +		sa.rx = true;
>> +
>> +		if (xs->props.family == AF_INET6)
>> +			memcpy(sa.ipaddr, &xs->id.daddr.a6, 16);
>> +		else
>> +			memcpy(&sa.ipaddr[3], &xs->id.daddr.a4, 4);
>> +	}
>> +
>> +	/* the preparations worked, so save the info */
>> +	memcpy(&ipsec->sa[sa_idx], &sa, sizeof(sa));
>> +
>> +	/* the XFRM stack doesn't like offload_handle == 0,
>> +	 * so add a bitflag in case our array index is 0
>> +	 */
>> +	xs->xso.offload_handle = sa_idx | NSIM_IPSEC_VALID;
>> +	ipsec->count++;
>> +
>> +	return 0;
>> +}
>> +
>> +/**
>> + * nsim_ipsec_del_sa - clear out this specific SA
>> + * @xs: pointer to transformer state struct
>> + **/
>> +static void nsim_ipsec_del_sa(struct xfrm_state *xs)
>> +{
>> +	struct net_device *dev = xs->xso.dev;
>> +	struct netdevsim *ns = netdev_priv(dev);
>> +	struct nsim_ipsec *ipsec = &ns->ipsec;
>> +	u16 sa_idx;
>> +
>> +	sa_idx = xs->xso.offload_handle & ~NSIM_IPSEC_VALID;
>> +	if (!ipsec->sa[sa_idx].used) {
>> +		netdev_err(dev, "Invalid SA for delete sa_idx=%d\n", sa_idx);
>> +		return;
>> +	}
>> +
>> +	memset(&ipsec->sa[sa_idx], 0, sizeof(struct nsim_sa));
>> +	ipsec->count--;
>> +}
>> +
>> +/**
>> + * nsim_ipsec_offload_ok - can this packet use the xfrm hw offload
>> + * @skb: current data packet
>> + * @xs: pointer to transformer state struct
>> + **/
>> +static bool nsim_ipsec_offload_ok(struct sk_buff *skb, struct xfrm_state *xs)
>> +{
>> +	struct net_device *dev = xs->xso.dev;
>> +	struct netdevsim *ns = netdev_priv(dev);
>> +	struct nsim_ipsec *ipsec = &ns->ipsec;
>> +
>> +	ipsec->ok++;
>> +
>> +	return true;
>> +}
>> +
>> +static const struct xfrmdev_ops nsim_xfrmdev_ops = {
>> +	.xdo_dev_state_add = nsim_ipsec_add_sa,
>> +	.xdo_dev_state_delete = nsim_ipsec_del_sa,
>> +	.xdo_dev_offload_ok = nsim_ipsec_offload_ok,
> 
> Please align the initializers by adding tabs before '='.

Sure.

> 
>> +};
>> +
>> +/**
>> + * nsim_ipsec_tx - check Tx packet for ipsec offload
>> + * @ns: pointer to ns structure
>> + * @skb: current data packet
>> + **/
>> +int nsim_ipsec_tx(struct netdevsim *ns, struct sk_buff *skb)
>> +{
>> +	struct nsim_ipsec *ipsec = &ns->ipsec;
>> +	struct xfrm_state *xs;
>> +	struct nsim_sa *tsa;
>> +	u32 sa_idx;
>> +
>> +	/* do we even need to check this packet? */
>> +	if (!skb->sp)
>> +		return 1;
>> +
>> +	if (unlikely(!skb->sp->len)) {
>> +		netdev_err(ns->netdev, "%s: no xfrm state len = %d\n",
>> +			   __func__, skb->sp->len);
> 
> Hmm..  __func__ started appearing in errors?  Perhaps either always or
> never add it?
> 
> Also, I know this is not a real device, but please always use rate
> limited print functions on the data path.
> 
>> +		return 0;
>> +	}
>> +
>> +	xs = xfrm_input_state(skb);
>> +	if (unlikely(!xs)) {
>> +		netdev_err(ns->netdev, "%s: no xfrm_input_state() xs = %p\n",
>> +			   __func__, xs);
>> +		return 0;
>> +	}
>> +
>> +	sa_idx = xs->xso.offload_handle & ~NSIM_IPSEC_VALID;
>> +	if (unlikely(sa_idx > NSIM_IPSEC_MAX_SA_COUNT)) {
>> +		netdev_err(ns->netdev, "%s: bad sa_idx=%d max=%d\n",
>> +			   __func__, sa_idx, NSIM_IPSEC_MAX_SA_COUNT);
>> +		return 0;
>> +	}
>> +
>> +	tsa = &ipsec->sa[sa_idx];
>> +	if (unlikely(!tsa->used)) {
>> +		netdev_err(ns->netdev, "%s: unused sa_idx=%d\n",
>> +			   __func__, sa_idx);
>> +		return 0;
>> +	}
>> +
>> +	if (xs->id.proto != IPPROTO_ESP && xs->id.proto != IPPROTO_AH) {
>> +		netdev_err(ns->netdev, "%s: unexpected proto=%d\n",
>> +			   __func__, xs->id.proto);
>> +		return 0;
>> +	}
>> +
>> +	ipsec->tx++;
>> +
>> +	return 1;
>> +}
> 
> Looks like the function should return bool?

Sure.

> 
>> +
>> +/**
>> + * nsim_ipsec_init - initialize security registers for IPSec operation
>> + * @ns: board private structure
> 
> "board"?  Yes, the kdoc may be best removed ;)
> 
>> + **/
>> +void nsim_ipsec_init(struct netdevsim *ns)
>> +{
>> +	ns->netdev->xfrmdev_ops = &nsim_xfrmdev_ops;
>> +
>> +#define NSIM_ESP_FEATURES	(NETIF_F_HW_ESP | \
>> +				 NETIF_F_HW_ESP_TX_CSUM | \
>> +				 NETIF_F_GSO_ESP)
>> +
>> +	ns->netdev->features |= NSIM_ESP_FEATURES;
>> +	ns->netdev->hw_enc_features |= NSIM_ESP_FEATURES;
>> +
>> +	ns->ipsec.pfile = debugfs_create_file("ipsec", 0400, ns->ddir, ns,
>> +					      &ipsec_dbg_fops);
>> +}
>> +
>> +void nsim_ipsec_teardown(struct netdevsim *ns)
>> +{
>> +	struct nsim_ipsec *ipsec = &ns->ipsec;
>> +
>> +	if (ipsec->count)
>> +		netdev_err(ns->netdev, "%s: tearing down IPsec offload with %d SAs left\n",
>> +			   __func__, ipsec->count);
>> +	debugfs_remove_recursive(ipsec->pfile);
>> +}
>> diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
>> index ec68f38..6ce8604d 100644
>> --- a/drivers/net/netdevsim/netdev.c
>> +++ b/drivers/net/netdevsim/netdev.c
>> @@ -171,6 +171,8 @@ static int nsim_init(struct net_device *dev)
>>   	if (err)
>>   		goto err_unreg_dev;
>>   
>> +	nsim_ipsec_init(ns);
>> +
>>   	return 0;
>>   
>>   err_unreg_dev:
>> @@ -186,6 +188,7 @@ static void nsim_uninit(struct net_device *dev)
>>   {
>>   	struct netdevsim *ns = netdev_priv(dev);
>>   
>> +	nsim_ipsec_teardown(ns);
>>   	nsim_devlink_teardown(ns);
>>   	debugfs_remove_recursive(ns->ddir);
>>   	nsim_bpf_uninit(ns);
>> @@ -203,11 +206,15 @@ static netdev_tx_t nsim_start_xmit(struct sk_buff *skb, struct net_device *dev)
>>   {
>>   	struct netdevsim *ns = netdev_priv(dev);
>>   
>> +	if (!nsim_ipsec_tx(ns, skb))
>> +		goto out;
>> +
>>   	u64_stats_update_begin(&ns->syncp);
>>   	ns->tx_packets++;
>>   	ns->tx_bytes += skb->len;
>>   	u64_stats_update_end(&ns->syncp);
>>   
>> +out:
>>   	dev_kfree_skb(skb);
>>   
>>   	return NETDEV_TX_OK;
>> diff --git a/drivers/net/netdevsim/netdevsim.h b/drivers/net/netdevsim/netdevsim.h
>> index 3a8581a..1708dee 100644
>> --- a/drivers/net/netdevsim/netdevsim.h
>> +++ b/drivers/net/netdevsim/netdevsim.h
>> @@ -29,6 +29,29 @@ struct bpf_prog;
>>   struct dentry;
>>   struct nsim_vf_config;
>>   
>> +#if IS_ENABLED(CONFIG_XFRM_OFFLOAD)
>> +#define NSIM_IPSEC_MAX_SA_COUNT		33
> 
> 33 caught my eye - out of curiosity is it 2^5 + 1 to catch some type of
> bug or failure mode?

For test rigs, I often use something like this to help flush out any 
interesting power-of-two or alignment assumptions in the code.  I don't 
expect anything here, but it doesn't hurt.

> 
>> +#define NSIM_IPSEC_VALID		BIT(31)
>> +
>> +struct nsim_sa {
>> +	struct xfrm_state *xs;
>> +	__be32 ipaddr[4];
>> +	u32 key[4];
>> +	u32 salt;
>> +	bool used;
>> +	bool crypt;
>> +	bool rx;
>> +};
>> +
>> +struct nsim_ipsec {
>> +	struct nsim_sa sa[NSIM_IPSEC_MAX_SA_COUNT];
>> +	struct dentry *pfile;
>> +	u32 count;
>> +	u32 tx;
>> +	u32 ok;
>> +};
>> +#endif
> 
> No need to wrap struct definitions in #if/#endif.

I suppose this is a philosophical point... Since CONFIG_XFRM_OFFLOAD is 
not yet a common config setting, I'd like to keep it here to not break 
other folks' builds or dirty them up with unused struct definitions when 
they aren't playing with IPsec offload anyway.

> 
>>   struct netdevsim {
>>   	struct net_device *netdev;
>>   
>> @@ -67,6 +90,9 @@ struct netdevsim {
>>   #if IS_ENABLED(CONFIG_NET_DEVLINK)
>>   	struct devlink *devlink;
>>   #endif
>> +#if IS_ENABLED(CONFIG_XFRM_OFFLOAD)
>> +	struct nsim_ipsec ipsec;
>> +#endif
>>   };
>>   
>>   extern struct dentry *nsim_ddir;
>> @@ -147,6 +173,17 @@ static inline void nsim_devlink_exit(void)
>>   }
>>   #endif
>>   
>> +#if IS_ENABLED(CONFIG_XFRM_OFFLOAD)
>> +void nsim_ipsec_init(struct netdevsim *ns);
>> +void nsim_ipsec_teardown(struct netdevsim *ns);
>> +int nsim_ipsec_tx(struct netdevsim *ns, struct sk_buff *skb);
>> +#else
>> +static inline void nsim_ipsec_init(struct netdevsim *ns) {};
>> +static inline void nsim_ipsec_teardown(struct netdevsim *ns) {};
>> +static inline int nsim_ipsec_tx(struct netdevsim *ns, struct sk_buff *skb)
>> +								{ return 1; };
> 
> Please use the same formatting for static inlines as the rest of the
> file.  The ';' are also unnecessary.

Sure

> 
> Other than those formatting nit picks looks good to me :)
> 

Cheers,
sln

^ permalink raw reply

* Re: [PATCH v3 net-next] mdio-mux-gpio: Remove VLA usage
From: Kees Cook @ 2018-06-25 22:39 UTC (permalink / raw)
  To: Joe Perches; +Cc: David S. Miller, LKML, Andrew Lunn, Network Development
In-Reply-To: <c892c7e7b271c9a269c78887777e691dce12f5c8.camel@perches.com>

On Mon, Jun 25, 2018 at 3:23 PM, Joe Perches <joe@perches.com> wrote:
> On Mon, 2018-06-25 at 15:09 -0700, Kees Cook wrote:
>> In the quest to remove all stack VLA usage from the kernel[1], this
>> allocates the values buffer during the callback instead of putting it
>> on the stack.
>
> []
>
>> diff --git a/drivers/net/phy/mdio-mux-gpio.c b/drivers/net/phy/mdio-mux-gpio.c
> []
>> @@ -44,15 +44,21 @@ static int mdio_mux_gpio_switch_fn(int current_child, int desired_child,
>>  static int mdio_mux_gpio_probe(struct platform_device *pdev)
>>  {
> []
>> +     s = devm_kzalloc(&pdev->dev, sizeof(*s->values) * gpios->ndescs +
>> +                                  sizeof(*s), GFP_KERNEL);
>
> Isn't this supposed to use your new struct_size()

Why yes. Yes it is. :) When treewide changes meet in the night! :)
I'll send a v4.

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply

* [PATCH v4 net-next] mdio-mux-gpio: Remove VLA usage
From: Kees Cook @ 2018-06-25 22:49 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel, Joe Perches, Andrew Lunn, netdev

In the quest to remove all stack VLA usage from the kernel[1], this
allocates the values buffer during the callback instead of putting it
on the stack.

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
---
v4: use struct_size() helper for allocation (Joe Perches)
v3: resend to netdev with Reviewed-by
v2: allocate array as part of structure (Andrew Lunn)
---
 drivers/net/phy/mdio-mux-gpio.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/net/phy/mdio-mux-gpio.c b/drivers/net/phy/mdio-mux-gpio.c
index 082ffef0dec4..bc90764a8b8d 100644
--- a/drivers/net/phy/mdio-mux-gpio.c
+++ b/drivers/net/phy/mdio-mux-gpio.c
@@ -20,23 +20,23 @@
 struct mdio_mux_gpio_state {
 	struct gpio_descs *gpios;
 	void *mux_handle;
+	int values[];
 };
 
 static int mdio_mux_gpio_switch_fn(int current_child, int desired_child,
 				   void *data)
 {
 	struct mdio_mux_gpio_state *s = data;
-	int values[s->gpios->ndescs];
 	unsigned int n;
 
 	if (current_child == desired_child)
 		return 0;
 
 	for (n = 0; n < s->gpios->ndescs; n++)
-		values[n] = (desired_child >> n) & 1;
+		s->values[n] = (desired_child >> n) & 1;
 
 	gpiod_set_array_value_cansleep(s->gpios->ndescs, s->gpios->desc,
-				       values);
+				       s->values);
 
 	return 0;
 }
@@ -44,15 +44,21 @@ static int mdio_mux_gpio_switch_fn(int current_child, int desired_child,
 static int mdio_mux_gpio_probe(struct platform_device *pdev)
 {
 	struct mdio_mux_gpio_state *s;
+	struct gpio_descs *gpios;
 	int r;
 
-	s = devm_kzalloc(&pdev->dev, sizeof(*s), GFP_KERNEL);
-	if (!s)
+	gpios = gpiod_get_array(&pdev->dev, NULL, GPIOD_OUT_LOW);
+	if (IS_ERR(gpios))
+		return PTR_ERR(gpios);
+
+	s = devm_kzalloc(&pdev->dev, struct_size(s, values, gpios->ndescs),
+			 GFP_KERNEL);
+	if (!s) {
+		gpiod_put_array(gpios);
 		return -ENOMEM;
+	}
 
-	s->gpios = gpiod_get_array(&pdev->dev, NULL, GPIOD_OUT_LOW);
-	if (IS_ERR(s->gpios))
-		return PTR_ERR(s->gpios);
+	s->gpios = gpios;
 
 	r = mdio_mux_init(&pdev->dev, pdev->dev.of_node,
 			  mdio_mux_gpio_switch_fn, &s->mux_handle, s, NULL);
-- 
2.17.1


-- 
Kees Cook
Pixel Security

^ permalink raw reply related

* [net-next PATCH v4 0/7] Symmetric queue selection using XPS for Rx queues
From: Amritha Nambiar @ 2018-06-25 18:04 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom

This patch series implements support for Tx queue selection based on
Rx queue(s) map. This is done by configuring Rx queue(s) map per Tx-queue
using sysfs attribute. If the user configuration for Rx queues does
not apply, then the Tx queue selection falls back to XPS using CPUs and
finally to hashing.

XPS is refactored to support Tx queue selection based on either the
CPUs map or the Rx-queues map. The config option CONFIG_XPS needs to be
enabled. By default no receive queues are configured for the Tx queue.

- /sys/class/net/<dev>/queues/tx-*/xps_rxqs

A set of receive queues can be mapped to a set of transmit queues (many:many),
although the common use case is a 1:1 mapping. This will enable sending
packets on the same Tx-Rx queue association as this is useful for busy polling
multi-threaded workloads where it is not possible to pin the threads to
a CPU. This is a rework of Sridhar's patch for symmetric queueing via
socket option:
https://www.spinics.net/lists/netdev/msg453106.html

Testing Hints:
Kernel:  Linux 4.17.0-rc7+
Interface:
driver: ixgbe
version: 5.1.0-k
firmware-version: 0x00015e0b

Configuration:
ethtool -L $iface combined 16
ethtool -C $iface rx-usecs 1000
sysctl net.core.busy_poll=1000
ATR disabled:
ethtool -K $iface ntuple on

Workload:
Modified memcached that changes the thread selection policy to be based
on the incoming rx-queue of a connection using SO_INCOMING_NAPI_ID socket
option. The default is round-robin.

Default: No rxqs_map configured
Symmetric queues: Enable rxqs_map for all queues 1:1 mapped to Tx queue

System:
Architecture:          x86_64
CPU(s):                72
Model name:            Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

16 threads  400K requests/sec
=============================
-------------------------------------------------------------------------------
                                Default                 Symmetric queues
-------------------------------------------------------------------------------
RTT min/avg/max                 4/51/2215               2/30/5163
(usec)


intr/sec                        26655                   18606

contextswitch/sec               5145                    4044

insn per cycle                  0.43                    0.72

cache-misses                    6.919                   4.310
(% of all cache refs)

L1-dcache-load-                 4.49                    3.29
-misses
(% of all L1-dcache hits)

LLC-load-misses                 13.26                   8.96
(% of all LL-cache hits)

-------------------------------------------------------------------------------

32 threads  400K requests/sec
=============================
-------------------------------------------------------------------------------
                                Default                 Symmetric queues
-------------------------------------------------------------------------------
RTT min/avg/max                 10/112/5562             9/46/4637
(usec)


intr/sec                        30456                   27666

contextswitch/sec               7552                    5133

insn per cycle                  0.41                    0.49

cache-misses                    9.357                   2.769
(% of all cache refs)

L1-dcache-load-                 4.09                    3.98
-misses
(% of all L1-dcache hits)

LLC-load-misses                 12.96                   3.96
(% of all LL-cache hits)

-------------------------------------------------------------------------------

16 threads  800K requests/sec
=============================
-------------------------------------------------------------------------------
                                Default                 Symmetric queues
-------------------------------------------------------------------------------
RTT min/avg/max                  5/151/4989             9/69/2611
(usec)


intr/sec                        35686                   22907

contextswitch/sec               25522                   12281

insn per cycle                  0.67                    0.74

cache-misses                    8.652                   6.38
(% of all cache refs)

L1-dcache-load-                 3.19                    2.86
-misses
(% of all L1-dcache hits)

LLC-load-misses                 16.53                   11.99
(% of all LL-cache hits)

-------------------------------------------------------------------------------
32 threads  800K requests/sec
=============================
-------------------------------------------------------------------------------
                                Default                 Symmetric queues
-------------------------------------------------------------------------------
RTT min/avg/max                  6/163/6152             8/88/4209
(usec)


intr/sec                        47079                   26548

contextswitch/sec               42190                   39168

insn per cycle                  0.45                    0.54

cache-misses                    8.798                   4.668
(% of all cache refs)

L1-dcache-load-                 6.55                    6.29
-misses
(% of all L1-dcache hits)

LLC-load-misses                 13.91                   10.44
(% of all LL-cache hits)

-------------------------------------------------------------------------------

v4:
- Removed enum for map types and used boolean to identify rxqs_map vs cpus_map.
- Added comments for helper functions.
- Added another static_key for rxqs_map (xps_rxqs_needed).
- New patch to change tx_queue_mapping in sock_common to unsigned short.
- Separated marking receive queue number into a standalone patch.
- Changed wording in documentation (queue-pair to queue-association)

---

Amritha Nambiar (7):
      net: Refactor XPS for CPUs and Rx queues
      net: Use static_key for XPS maps
      net: sock: Change tx_queue_mapping in sock_common to unsigned short
      net: Record receive queue number for a connection
      net: Enable Tx queue selection based on Rx queues
      net-sysfs: Add interface for Rx queue(s) map per Tx queue
      Documentation: Add explanation for XPS using Rx-queue(s) map


 Documentation/ABI/testing/sysfs-class-net-queues |   11 +
 Documentation/networking/scaling.txt             |   57 ++++
 include/linux/cpumask.h                          |   11 +
 include/linux/netdevice.h                        |  100 ++++++++
 include/net/busy_poll.h                          |    1 
 include/net/sock.h                               |   28 ++
 net/core/dev.c                                   |  283 +++++++++++++++-------
 net/core/net-sysfs.c                             |   85 ++++++-
 net/core/sock.c                                  |    4 
 net/ipv4/tcp_input.c                             |    3 
 10 files changed, 474 insertions(+), 109 deletions(-)

^ permalink raw reply

* [net-next PATCH v4 1/7] net: Refactor XPS for CPUs and Rx queues
From: Amritha Nambiar @ 2018-06-25 18:04 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom
In-Reply-To: <152994950582.9733.3330634251364177102.stgit@anamhost.jf.intel.com>

Refactor XPS code to support Tx queue selection based on
CPU(s) map or Rx queue(s) map.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 include/linux/cpumask.h   |   11 ++
 include/linux/netdevice.h |  100 +++++++++++++++++++++
 net/core/dev.c            |  211 ++++++++++++++++++++++++++++++---------------
 net/core/net-sysfs.c      |    4 -
 4 files changed, 246 insertions(+), 80 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index bf53d89..57f20a0 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -115,12 +115,17 @@ extern struct cpumask __cpu_active_mask;
 #define cpu_active(cpu)		((cpu) == 0)
 #endif
 
-/* verify cpu argument to cpumask_* operators */
-static inline unsigned int cpumask_check(unsigned int cpu)
+static inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bits)
 {
 #ifdef CONFIG_DEBUG_PER_CPU_MAPS
-	WARN_ON_ONCE(cpu >= nr_cpumask_bits);
+	WARN_ON_ONCE(cpu >= bits);
 #endif /* CONFIG_DEBUG_PER_CPU_MAPS */
+}
+
+/* verify cpu argument to cpumask_* operators */
+static inline unsigned int cpumask_check(unsigned int cpu)
+{
+	cpu_max_bits_warn(cpu, nr_cpumask_bits);
 	return cpu;
 }
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3ec9850..c534f03 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -730,10 +730,15 @@ struct xps_map {
  */
 struct xps_dev_maps {
 	struct rcu_head rcu;
-	struct xps_map __rcu *cpu_map[0];
+	struct xps_map __rcu *attr_map[0]; /* Either CPUs map or RXQs map */
 };
-#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) +		\
+
+#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) +	\
 	(nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
+
+#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\
+	(_rxqs * (_tcs) * sizeof(struct xps_map *)))
+
 #endif /* CONFIG_XPS */
 
 #define TC_MAX_QUEUE	16
@@ -1909,7 +1914,8 @@ struct net_device {
 	int			watchdog_timeo;
 
 #ifdef CONFIG_XPS
-	struct xps_dev_maps __rcu *xps_maps;
+	struct xps_dev_maps __rcu *xps_cpus_map;
+	struct xps_dev_maps __rcu *xps_rxqs_map;
 #endif
 #ifdef CONFIG_NET_CLS_ACT
 	struct mini_Qdisc __rcu	*miniq_egress;
@@ -3258,6 +3264,94 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
 #ifdef CONFIG_XPS
 int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 			u16 index);
+int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
+			  u16 index, bool is_rxqs_map);
+
+/**
+ *	attr_test_mask - Test a CPU or Rx queue set in a cpumask/rx queues mask
+ *	@j: CPU/Rx queue index
+ *	@mask: bitmask of all cpus/rx queues
+ *	@nr_bits: number of bits in the bitmask
+ *
+ * Test if a CPU or Rx queue index is set in a mask of all CPU/Rx queues.
+ */
+static inline bool attr_test_mask(unsigned long j, const unsigned long *mask,
+				  unsigned int nr_bits)
+{
+	cpu_max_bits_warn(j, nr_bits);
+	return test_bit(j, mask);
+}
+
+/**
+ *	attr_test_online - Test for online CPU/Rx queue
+ *	@j: CPU/Rx queue index
+ *	@online_mask: bitmask for CPUs/Rx queues that are online
+ *	@nr_bits: number of bits in the bitmask
+ *
+ * Returns true if a CPU/Rx queue is online.
+ */
+static inline bool attr_test_online(unsigned long j,
+				    const unsigned long *online_mask,
+				    unsigned int nr_bits)
+{
+	cpu_max_bits_warn(j, nr_bits);
+
+	if (online_mask)
+		return test_bit(j, online_mask);
+
+	if (j >= 0 && j < nr_bits)
+		return true;
+
+	return false;
+}
+
+/**
+ *	attrmask_next - get the next CPU/Rx queue in a cpumask/Rx queues mask
+ *	@n: CPU/Rx queue index
+ *	@srcp: the cpumask/Rx queue mask pointer
+ *	@nr_bits: number of bits in the bitmask
+ *
+ * Returns >= nr_bits if no further CPUs/Rx queues set.
+ */
+static inline unsigned int attrmask_next(int n, const unsigned long *srcp,
+					 unsigned int nr_bits)
+{
+	/* -1 is a legal arg here. */
+	if (n != -1)
+		cpu_max_bits_warn(n, nr_bits);
+
+	if (srcp)
+		return find_next_bit(srcp, nr_bits, n + 1);
+
+	return n + 1;
+}
+
+/**
+ *	attrmask_next_and - get the next CPU/Rx queue in *src1p & *src2p
+ *	@n: CPU/Rx queue index
+ *	@src1p: the first CPUs/Rx queues mask pointer
+ *	@src2p: the second CPUs/Rx queues mask pointer
+ *	@nr_bits: number of bits in the bitmask
+ *
+ * Returns >= nr_bits if no further CPUs/Rx queues set in both.
+ */
+static inline int attrmask_next_and(int n, const unsigned long *src1p,
+				    const unsigned long *src2p,
+				    unsigned int nr_bits)
+{
+	/* -1 is a legal arg here. */
+	if (n != -1)
+		cpu_max_bits_warn(n, nr_bits);
+
+	if (src1p && src2p)
+		return find_next_and_bit(src1p, src2p, nr_bits, n + 1);
+	else if (src1p)
+		return find_next_bit(src1p, nr_bits, n + 1);
+	else if (src2p)
+		return find_next_bit(src2p, nr_bits, n + 1);
+
+	return n + 1;
+}
 #else
 static inline int netif_set_xps_queue(struct net_device *dev,
 				      const struct cpumask *mask,
diff --git a/net/core/dev.c b/net/core/dev.c
index a5aa1c7..2552556 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2092,7 +2092,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
 	int pos;
 
 	if (dev_maps)
-		map = xmap_dereference(dev_maps->cpu_map[tci]);
+		map = xmap_dereference(dev_maps->attr_map[tci]);
 	if (!map)
 		return false;
 
@@ -2105,7 +2105,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
 			break;
 		}
 
-		RCU_INIT_POINTER(dev_maps->cpu_map[tci], NULL);
+		RCU_INIT_POINTER(dev_maps->attr_map[tci], NULL);
 		kfree_rcu(map, rcu);
 		return false;
 	}
@@ -2135,31 +2135,58 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
 	return active;
 }
 
+static void clean_xps_maps(struct net_device *dev, const unsigned long *mask,
+			   struct xps_dev_maps *dev_maps, unsigned int nr_ids,
+			   u16 offset, u16 count, bool is_rxqs_map)
+{
+	bool active = false;
+	int i, j;
+
+	for (j = -1; j = attrmask_next(j, mask, nr_ids),
+	     j < nr_ids;)
+		active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
+					       count);
+	if (!active) {
+		if (is_rxqs_map) {
+			RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
+		} else {
+			RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
+
+			for (i = offset + (count - 1); count--; i--)
+				netdev_queue_numa_node_write(
+					netdev_get_tx_queue(dev, i),
+							NUMA_NO_NODE);
+		}
+		kfree_rcu(dev_maps, rcu);
+	}
+}
+
 static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
 				   u16 count)
 {
+	const unsigned long *possible_mask = NULL;
 	struct xps_dev_maps *dev_maps;
-	int cpu, i;
-	bool active = false;
+	unsigned int nr_ids;
 
 	mutex_lock(&xps_map_mutex);
-	dev_maps = xmap_dereference(dev->xps_maps);
 
-	if (!dev_maps)
-		goto out_no_maps;
-
-	for_each_possible_cpu(cpu)
-		active |= remove_xps_queue_cpu(dev, dev_maps, cpu,
-					       offset, count);
+	dev_maps = xmap_dereference(dev->xps_rxqs_map);
+	if (dev_maps) {
+		nr_ids = dev->num_rx_queues;
+		clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset,
+			       count, true);
 
-	if (!active) {
-		RCU_INIT_POINTER(dev->xps_maps, NULL);
-		kfree_rcu(dev_maps, rcu);
 	}
 
-	for (i = offset + (count - 1); count--; i--)
-		netdev_queue_numa_node_write(netdev_get_tx_queue(dev, i),
-					     NUMA_NO_NODE);
+	dev_maps = xmap_dereference(dev->xps_cpus_map);
+	if (!dev_maps)
+		goto out_no_maps;
+
+	if (num_possible_cpus() > 1)
+		possible_mask = cpumask_bits(cpu_possible_mask);
+	nr_ids = nr_cpu_ids;
+	clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset, count,
+		       false);
 
 out_no_maps:
 	mutex_unlock(&xps_map_mutex);
@@ -2170,8 +2197,8 @@ static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
 	netif_reset_xps_queues(dev, index, dev->num_tx_queues - index);
 }
 
-static struct xps_map *expand_xps_map(struct xps_map *map,
-				      int cpu, u16 index)
+static struct xps_map *expand_xps_map(struct xps_map *map, int attr_index,
+				      u16 index, bool is_rxqs_map)
 {
 	struct xps_map *new_map;
 	int alloc_len = XPS_MIN_MAP_ALLOC;
@@ -2183,7 +2210,7 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
 		return map;
 	}
 
-	/* Need to add queue to this CPU's existing map */
+	/* Need to add tx-queue to this CPU's/rx-queue's existing map */
 	if (map) {
 		if (pos < map->alloc_len)
 			return map;
@@ -2191,9 +2218,14 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
 		alloc_len = map->alloc_len * 2;
 	}
 
-	/* Need to allocate new map to store queue on this CPU's map */
-	new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
-			       cpu_to_node(cpu));
+	/* Need to allocate new map to store tx-queue on this CPU's/rx-queue's
+	 *  map
+	 */
+	if (is_rxqs_map)
+		new_map = kzalloc(XPS_MAP_SIZE(alloc_len), GFP_KERNEL);
+	else
+		new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
+				       cpu_to_node(attr_index));
 	if (!new_map)
 		return NULL;
 
@@ -2205,14 +2237,16 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
 	return new_map;
 }
 
-int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
-			u16 index)
+int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
+			  u16 index, bool is_rxqs_map)
 {
+	const unsigned long *online_mask = NULL, *possible_mask = NULL;
 	struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
-	int i, cpu, tci, numa_node_id = -2;
+	int i, j, tci, numa_node_id = -2;
 	int maps_sz, num_tc = 1, tc = 0;
 	struct xps_map *map, *new_map;
 	bool active = false;
+	unsigned int nr_ids;
 
 	if (dev->num_tc) {
 		num_tc = dev->num_tc;
@@ -2221,16 +2255,27 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 			return -EINVAL;
 	}
 
-	maps_sz = XPS_DEV_MAPS_SIZE(num_tc);
-	if (maps_sz < L1_CACHE_BYTES)
-		maps_sz = L1_CACHE_BYTES;
-
 	mutex_lock(&xps_map_mutex);
+	if (is_rxqs_map) {
+		maps_sz = XPS_RXQ_DEV_MAPS_SIZE(num_tc, dev->num_rx_queues);
+		dev_maps = xmap_dereference(dev->xps_rxqs_map);
+		nr_ids = dev->num_rx_queues;
+	} else {
+		maps_sz = XPS_CPU_DEV_MAPS_SIZE(num_tc);
+		if (num_possible_cpus() > 1) {
+			online_mask = cpumask_bits(cpu_online_mask);
+			possible_mask = cpumask_bits(cpu_possible_mask);
+		}
+		dev_maps = xmap_dereference(dev->xps_cpus_map);
+		nr_ids = nr_cpu_ids;
+	}
 
-	dev_maps = xmap_dereference(dev->xps_maps);
+	if (maps_sz < L1_CACHE_BYTES)
+		maps_sz = L1_CACHE_BYTES;
 
 	/* allocate memory for queue storage */
-	for_each_cpu_and(cpu, cpu_online_mask, mask) {
+	for (j = -1; j = attrmask_next_and(j, online_mask, mask, nr_ids),
+	     j < nr_ids;) {
 		if (!new_dev_maps)
 			new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
 		if (!new_dev_maps) {
@@ -2238,73 +2283,81 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 			return -ENOMEM;
 		}
 
-		tci = cpu * num_tc + tc;
-		map = dev_maps ? xmap_dereference(dev_maps->cpu_map[tci]) :
+		tci = j * num_tc + tc;
+		map = dev_maps ? xmap_dereference(dev_maps->attr_map[tci]) :
 				 NULL;
 
-		map = expand_xps_map(map, cpu, index);
+		map = expand_xps_map(map, j, index, is_rxqs_map);
 		if (!map)
 			goto error;
 
-		RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+		RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 	}
 
 	if (!new_dev_maps)
 		goto out_no_new_maps;
 
-	for_each_possible_cpu(cpu) {
+	for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
 		/* copy maps belonging to foreign traffic classes */
-		for (i = tc, tci = cpu * num_tc; dev_maps && i--; tci++) {
+		for (i = tc, tci = j * num_tc; dev_maps && i--; tci++) {
 			/* fill in the new device map from the old device map */
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
-			RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
+			RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 		}
 
 		/* We need to explicitly update tci as prevous loop
 		 * could break out early if dev_maps is NULL.
 		 */
-		tci = cpu * num_tc + tc;
+		tci = j * num_tc + tc;
 
-		if (cpumask_test_cpu(cpu, mask) && cpu_online(cpu)) {
-			/* add queue to CPU maps */
+		if (attr_test_mask(j, mask, nr_ids) &&
+		    attr_test_online(j, online_mask, nr_ids)) {
+			/* add tx-queue to CPU/rx-queue maps */
 			int pos = 0;
 
-			map = xmap_dereference(new_dev_maps->cpu_map[tci]);
+			map = xmap_dereference(new_dev_maps->attr_map[tci]);
 			while ((pos < map->len) && (map->queues[pos] != index))
 				pos++;
 
 			if (pos == map->len)
 				map->queues[map->len++] = index;
 #ifdef CONFIG_NUMA
-			if (numa_node_id == -2)
-				numa_node_id = cpu_to_node(cpu);
-			else if (numa_node_id != cpu_to_node(cpu))
-				numa_node_id = -1;
+			if (!is_rxqs_map) {
+				if (numa_node_id == -2)
+					numa_node_id = cpu_to_node(j);
+				else if (numa_node_id != cpu_to_node(j))
+					numa_node_id = -1;
+			}
 #endif
 		} else if (dev_maps) {
 			/* fill in the new device map from the old device map */
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
-			RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
+			RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 		}
 
 		/* copy maps belonging to foreign traffic classes */
 		for (i = num_tc - tc, tci++; dev_maps && --i; tci++) {
 			/* fill in the new device map from the old device map */
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
-			RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
+			RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 		}
 	}
 
-	rcu_assign_pointer(dev->xps_maps, new_dev_maps);
+	if (is_rxqs_map)
+		rcu_assign_pointer(dev->xps_rxqs_map, new_dev_maps);
+	else
+		rcu_assign_pointer(dev->xps_cpus_map, new_dev_maps);
 
 	/* Cleanup old maps */
 	if (!dev_maps)
 		goto out_no_old_maps;
 
-	for_each_possible_cpu(cpu) {
-		for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
-			new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
+	for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
+		for (i = num_tc, tci = j * num_tc; i--; tci++) {
+			new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
 			if (map && map != new_map)
 				kfree_rcu(map, rcu);
 		}
@@ -2317,19 +2370,23 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 	active = true;
 
 out_no_new_maps:
-	/* update Tx queue numa node */
-	netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
-				     (numa_node_id >= 0) ? numa_node_id :
-				     NUMA_NO_NODE);
+	if (!is_rxqs_map) {
+		/* update Tx queue numa node */
+		netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
+					     (numa_node_id >= 0) ?
+					     numa_node_id : NUMA_NO_NODE);
+	}
 
 	if (!dev_maps)
 		goto out_no_maps;
 
-	/* removes queue from unused CPUs */
-	for_each_possible_cpu(cpu) {
-		for (i = tc, tci = cpu * num_tc; i--; tci++)
+	/* removes tx-queue from unused CPUs/rx-queues */
+	for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
+		for (i = tc, tci = j * num_tc; i--; tci++)
 			active |= remove_xps_queue(dev_maps, tci, index);
-		if (!cpumask_test_cpu(cpu, mask) || !cpu_online(cpu))
+		if (!attr_test_mask(j, mask, nr_ids) ||
+		    !attr_test_online(j, online_mask, nr_ids))
 			active |= remove_xps_queue(dev_maps, tci, index);
 		for (i = num_tc - tc, tci++; --i; tci++)
 			active |= remove_xps_queue(dev_maps, tci, index);
@@ -2337,7 +2394,10 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 
 	/* free map if not active */
 	if (!active) {
-		RCU_INIT_POINTER(dev->xps_maps, NULL);
+		if (is_rxqs_map)
+			RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
+		else
+			RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
 		kfree_rcu(dev_maps, rcu);
 	}
 
@@ -2347,11 +2407,12 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 	return 0;
 error:
 	/* remove any maps that we added */
-	for_each_possible_cpu(cpu) {
-		for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
-			new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
+	for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
+		for (i = num_tc, tci = j * num_tc; i--; tci++) {
+			new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
 			map = dev_maps ?
-			      xmap_dereference(dev_maps->cpu_map[tci]) :
+			      xmap_dereference(dev_maps->attr_map[tci]) :
 			      NULL;
 			if (new_map && new_map != map)
 				kfree(new_map);
@@ -2363,6 +2424,12 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 	kfree(new_dev_maps);
 	return -ENOMEM;
 }
+
+int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
+			u16 index)
+{
+	return __netif_set_xps_queue(dev, cpumask_bits(mask), index, false);
+}
 EXPORT_SYMBOL(netif_set_xps_queue);
 
 #endif
@@ -3384,7 +3451,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	int queue_index = -1;
 
 	rcu_read_lock();
-	dev_maps = rcu_dereference(dev->xps_maps);
+	dev_maps = rcu_dereference(dev->xps_cpus_map);
 	if (dev_maps) {
 		unsigned int tci = skb->sender_cpu - 1;
 
@@ -3393,7 +3460,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 			tci += netdev_get_prio_tc_map(dev, skb->priority);
 		}
 
-		map = rcu_dereference(dev_maps->cpu_map[tci]);
+		map = rcu_dereference(dev_maps->attr_map[tci]);
 		if (map) {
 			if (map->len == 1)
 				queue_index = map->queues[0];
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index bb7e80f..b39987c 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1227,13 +1227,13 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 		return -ENOMEM;
 
 	rcu_read_lock();
-	dev_maps = rcu_dereference(dev->xps_maps);
+	dev_maps = rcu_dereference(dev->xps_cpus_map);
 	if (dev_maps) {
 		for_each_possible_cpu(cpu) {
 			int i, tci = cpu * num_tc + tc;
 			struct xps_map *map;
 
-			map = rcu_dereference(dev_maps->cpu_map[tci]);
+			map = rcu_dereference(dev_maps->attr_map[tci]);
 			if (!map)
 				continue;
 

^ permalink raw reply related

* [net-next PATCH v4 2/7] net: Use static_key for XPS maps
From: Amritha Nambiar @ 2018-06-25 18:04 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom
In-Reply-To: <152994950582.9733.3330634251364177102.stgit@anamhost.jf.intel.com>

Use static_key for XPS maps to reduce the cost of extra map checks,
similar to how it is used for RPS and RFS. This includes static_key
'xps_needed' for XPS and another for 'xps_rxqs_needed' for XPS using
Rx queues map.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 net/core/dev.c |   26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 2552556..df2a78d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2081,6 +2081,10 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned int txq)
 EXPORT_SYMBOL(netdev_txq_to_tc);
 
 #ifdef CONFIG_XPS
+struct static_key xps_needed __read_mostly;
+EXPORT_SYMBOL(xps_needed);
+struct static_key xps_rxqs_needed __read_mostly;
+EXPORT_SYMBOL(xps_rxqs_needed);
 static DEFINE_MUTEX(xps_map_mutex);
 #define xmap_dereference(P)		\
 	rcu_dereference_protected((P), lockdep_is_held(&xps_map_mutex))
@@ -2170,12 +2174,14 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
 
 	mutex_lock(&xps_map_mutex);
 
-	dev_maps = xmap_dereference(dev->xps_rxqs_map);
-	if (dev_maps) {
-		nr_ids = dev->num_rx_queues;
-		clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset,
-			       count, true);
-
+	if (static_key_false(&xps_rxqs_needed)) {
+		dev_maps = xmap_dereference(dev->xps_rxqs_map);
+		if (dev_maps) {
+			nr_ids = dev->num_rx_queues;
+			clean_xps_maps(dev, possible_mask, dev_maps, nr_ids,
+				       offset, count, true);
+		}
+		static_key_slow_dec(&xps_rxqs_needed);
 	}
 
 	dev_maps = xmap_dereference(dev->xps_cpus_map);
@@ -2189,6 +2195,7 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
 		       false);
 
 out_no_maps:
+	static_key_slow_dec(&xps_needed);
 	mutex_unlock(&xps_map_mutex);
 }
 
@@ -2297,6 +2304,10 @@ int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
 	if (!new_dev_maps)
 		goto out_no_new_maps;
 
+	static_key_slow_inc(&xps_needed);
+	if (is_rxqs_map)
+		static_key_slow_inc(&xps_rxqs_needed);
+
 	for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
 	     j < nr_ids;) {
 		/* copy maps belonging to foreign traffic classes */
@@ -3450,6 +3461,9 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	struct xps_map *map;
 	int queue_index = -1;
 
+	if (!static_key_false(&xps_needed))
+		return -1;
+
 	rcu_read_lock();
 	dev_maps = rcu_dereference(dev->xps_cpus_map);
 	if (dev_maps) {

^ permalink raw reply related

* [net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short
From: Amritha Nambiar @ 2018-06-25 18:04 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom
In-Reply-To: <152994950582.9733.3330634251364177102.stgit@anamhost.jf.intel.com>

Change 'skc_tx_queue_mapping' field in sock_common structure from
'int' to 'unsigned short' type with 0 indicating unset and
a positive queue value being set. This way it is consistent with
the queue_mapping field in the sk_buff. This will also accommodate
adding a new 'unsigned short' field in sock_common in the next
patch for rx_queue_mapping.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 include/net/sock.h |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index b3b7541..009fd30 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -214,7 +214,7 @@ struct sock_common {
 		struct hlist_node	skc_node;
 		struct hlist_nulls_node skc_nulls_node;
 	};
-	int			skc_tx_queue_mapping;
+	unsigned short		skc_tx_queue_mapping;
 	union {
 		int		skc_incoming_cpu;
 		u32		skc_rcv_wnd;
@@ -1681,17 +1681,19 @@ static inline int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 
 static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
 {
-	sk->sk_tx_queue_mapping = tx_queue;
+	/* sk_tx_queue_mapping accept only upto a 16-bit value */
+	WARN_ON((unsigned short)tx_queue > USHRT_MAX);
+	sk->sk_tx_queue_mapping = tx_queue + 1;
 }
 
 static inline void sk_tx_queue_clear(struct sock *sk)
 {
-	sk->sk_tx_queue_mapping = -1;
+	sk->sk_tx_queue_mapping = 0;
 }
 
 static inline int sk_tx_queue_get(const struct sock *sk)
 {
-	return sk ? sk->sk_tx_queue_mapping : -1;
+	return sk ? sk->sk_tx_queue_mapping - 1 : -1;
 }
 
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)

^ permalink raw reply related

* [net-next PATCH v4 4/7] net: Record receive queue number for a connection
From: Amritha Nambiar @ 2018-06-25 18:04 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom
In-Reply-To: <152994950582.9733.3330634251364177102.stgit@anamhost.jf.intel.com>

This patch adds a new field to sock_common 'skc_rx_queue_mapping'
which holds the receive queue number for the connection. The Rx queue
is marked in tcp_finish_connect() to allow a client app to do
SO_INCOMING_NAPI_ID after a connect() call to get the right queue
association for a socket. Rx queue is also marked in tcp_conn_request()
to allow syn-ack to go on the right tx-queue associated with
the queue on which syn is received.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 include/net/busy_poll.h |    1 +
 include/net/sock.h      |   14 ++++++++++++++
 net/core/sock.c         |    4 ++++
 net/ipv4/tcp_input.c    |    3 +++
 4 files changed, 22 insertions(+)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index c518743..9e36fda6 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -151,6 +151,7 @@ static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb)
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	sk->sk_napi_id = skb->napi_id;
 #endif
+	sk_rx_queue_set(sk, skb);
 }
 
 /* variant used for unconnected sockets */
diff --git a/include/net/sock.h b/include/net/sock.h
index 009fd30..0ff4416 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -139,6 +139,7 @@ typedef __u64 __bitwise __addrpair;
  *	@skc_node: main hash linkage for various protocol lookup tables
  *	@skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
  *	@skc_tx_queue_mapping: tx queue number for this connection
+ *	@skc_rx_queue_mapping: rx queue number for this connection
  *	@skc_flags: place holder for sk_flags
  *		%SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
  *		%SO_OOBINLINE settings, %SO_TIMESTAMPING settings
@@ -215,6 +216,9 @@ struct sock_common {
 		struct hlist_nulls_node skc_nulls_node;
 	};
 	unsigned short		skc_tx_queue_mapping;
+#ifdef CONFIG_XPS
+	unsigned short		skc_rx_queue_mapping;
+#endif
 	union {
 		int		skc_incoming_cpu;
 		u32		skc_rcv_wnd;
@@ -326,6 +330,9 @@ struct sock {
 #define sk_nulls_node		__sk_common.skc_nulls_node
 #define sk_refcnt		__sk_common.skc_refcnt
 #define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping
+#ifdef CONFIG_XPS
+#define sk_rx_queue_mapping	__sk_common.skc_rx_queue_mapping
+#endif
 
 #define sk_dontcopy_begin	__sk_common.skc_dontcopy_begin
 #define sk_dontcopy_end		__sk_common.skc_dontcopy_end
@@ -1696,6 +1703,13 @@ static inline int sk_tx_queue_get(const struct sock *sk)
 	return sk ? sk->sk_tx_queue_mapping - 1 : -1;
 }
 
+static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb)
+{
+#ifdef CONFIG_XPS
+	sk->sk_rx_queue_mapping = skb_get_rx_queue(skb) + 1;
+#endif
+}
+
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
 	sk_tx_queue_clear(sk);
diff --git a/net/core/sock.c b/net/core/sock.c
index bcc4182..5e4715b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2818,6 +2818,10 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_pacing_rate = ~0U;
 	sk->sk_pacing_shift = 10;
 	sk->sk_incoming_cpu = -1;
+
+#ifdef CONFIG_XPS
+	sk->sk_rx_queue_mapping = ~0;
+#endif
 	/*
 	 * Before updating sk_refcnt, we must commit prior changes to memory
 	 * (Documentation/RCU/rculist_nulls.txt for details)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 76ca88f..c404c53 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -78,6 +78,7 @@
 #include <linux/errqueue.h>
 #include <trace/events/tcp.h>
 #include <linux/static_key.h>
+#include <net/busy_poll.h>
 
 int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 
@@ -5584,6 +5585,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
 	if (skb) {
 		icsk->icsk_af_ops->sk_rx_dst_set(sk, skb);
 		security_inet_conn_established(sk, skb);
+		sk_mark_napi_id(sk, skb);
 	}
 
 	tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
@@ -6412,6 +6414,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	tcp_rsk(req)->snt_isn = isn;
 	tcp_rsk(req)->txhash = net_tx_rndhash();
 	tcp_openreq_init_rwin(req, sk, dst);
+	sk_rx_queue_set(req_to_sk(req), skb);
 	if (!want_cookie) {
 		tcp_reqsk_record_syn(sk, req, skb);
 		fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc, dst);

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox