Netdev List


* Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c
From: Yuchung Cheng @ 2017-09-28 23:36 UTC
  To: Oleksandr Natalenko
  Cc: Roman Gushchin, Hideaki YOSHIFUJI, Alexey Kuznetsov, netdev,
	linux-kernel@vger.kernel.org
In-Reply-To: <2325466.Xo6SG5M5hd@natalenko.name>

On Thu, Sep 28, 2017 at 1:14 AM, Oleksandr Natalenko
<oleksandr@natalenko.name> wrote:
> Hi.
>
> Won't tell about panic in tcp_sacktag_walk() since I cannot trigger it
> intentionally, but setting net.ipv4.tcp_retrans_collapse to 0 *does not* fix
> warning in tcp_fastretrans_alert() for me.

Hi Oleksandr: no, retrans_collapse should not matter for that warning
in tcp_fastretrans_alert(). The warning, as I explained earlier, is
likely false. Neal and I are more concerned about the panic in
tcp_sacktag_walk(). This was just a blind shot, but thanks for retrying.

We can submit a one-liner to remove the fast retrans warning but want
to nail the bigger issue first.

>
> On Wednesday, 27 September 2017 2:18:32 CEST Yuchung Cheng wrote:
>> On Tue, Sep 26, 2017 at 5:12 PM, Yuchung Cheng <ycheng@google.com> wrote:
>> > On Tue, Sep 26, 2017 at 6:10 AM, Roman Gushchin <guro@fb.com> wrote:
>> >>> On Wed, Sep 20, 2017 at 6:46 PM, Roman Gushchin <guro@fb.com> wrote:
>> >>> > > Hello.
>> >>> > >
>> >>> > > Since, IIRC, v4.11, there is some regression in TCP stack resulting
>> >>> > > in the
>> >>> > > warning shown below. Most of the time it is harmless, but rarely it
>> >>> > > just
>> >>> > > causes either freeze or (I believe, this is related too) panic in
>> >>> > > tcp_sacktag_walk() (because sk_buff passed to this function is
>> >>> > > NULL).
>> >>> > > Unfortunately, I still do not have proper stacktrace from panic, but
>> >>> > > will try to capture it if possible.
>> >>> > >
>> >>> > > Also, I have custom settings regarding TCP stack, shown below as
>> >>> > > well. ifb is used to shape traffic with tc.
>> >>> > >
>> >>> > > Please note this regression was already reported as BZ [1] and as a
>> >>> > > letter to ML [2], but got neither attention nor resolution. It is
>> >>> > > reproducible for (not only) me on my home router since v4.11 till
>> >>> > > v4.13.1 incl.
>> >>> > >
>> >>> > > Please advise on how to deal with it. I'll provide any additional
>> >>> > > info if
>> >>> > > necessary, also ready to test patches if any.
>> >>> > >
>> >>> > > Thanks.
>> >>> > >
>> >>> > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=195835
>> >>> > > [2] https://www.spinics.net/lists/netdev/msg436158.html
>> >>> >
>> >>> > We're experiencing the same problems on some machines in our fleet.
>> >>> > Exactly the same symptoms: tcp_fastretrans_alert() warnings and
>> >>> > sometimes panics in tcp_sacktag_walk().
>> >>
>> >>> > Here is an example of a backtrace with the panic log:
>> >> Hi Yuchung!
>> >>
>> >>> do you still see the panics if you disable RACK?
>> >>> sysctl net.ipv4.tcp_recovery=0?
>> >>
>> >> No, we haven't seen any crash since that.
>> >
>> > I am out of ideas how RACK can potentially cause tcp_sacktag_walk to
>> > take an empty skb :-( Do you have a stack trace or any hint on which call
>> > to tcp_sacktag_walk triggered the panic? Internally at Google we never
>> > see that.
>>
>> hmm something just struck me: could you try
>> sysctl net.ipv4.tcp_recovery=1 net.ipv4.tcp_retrans_collapse=0
>> and see if kernel still panics on sack processing?
>>
>> >>> Also, have you experienced any SACK reneging? Could you post the output of
>> >>> 'nstat | grep -i TCP'? Thanks.
>> >>
>> >> hostname        TcpActiveOpens                  2289680            0.0
>> >> hostname        TcpPassiveOpens                 3592758            0.0
>> >> hostname        TcpAttemptFails                 746910             0.0
>> >> hostname        TcpEstabResets                  154988             0.0
>> >> hostname        TcpInSegs                       16258678255        0.0
>> >> hostname        TcpOutSegs                      46967011611        0.0
>> >> hostname        TcpRetransSegs                  13724310           0.0
>> >> hostname        TcpInErrs                       2                  0.0
>> >> hostname        TcpOutRsts                      9418798            0.0
>> >> hostname        TcpExtEmbryonicRsts             2303               0.0
>> >> hostname        TcpExtPruneCalled               90192              0.0
>> >> hostname        TcpExtOfoPruned                 57274              0.0
>> >> hostname        TcpExtOutOfWindowIcmps          3                  0.0
>> >> hostname        TcpExtTW                        1164705            0.0
>> >> hostname        TcpExtTWRecycled                2                  0.0
>> >> hostname        TcpExtPAWSEstab                 159                0.0
>> >> hostname        TcpExtDelayedACKs               209207209          0.0
>> >> hostname        TcpExtDelayedACKLocked          508571             0.0
>> >> hostname        TcpExtDelayedACKLost            1713248            0.0
>> >> hostname        TcpExtListenOverflows           625                0.0
>> >> hostname        TcpExtListenDrops               625                0.0
>> >> hostname        TcpExtTCPHPHits                 9341188489         0.0
>> >> hostname        TcpExtTCPPureAcks               1434646465         0.0
>> >> hostname        TcpExtTCPHPAcks                 5733614672         0.0
>> >> hostname        TcpExtTCPSackRecovery           3261698            0.0
>> >> hostname        TcpExtTCPSACKReneging           12203              0.0
>> >> hostname        TcpExtTCPSACKReorder            433189             0.0
>> >> hostname        TcpExtTCPTSReorder              22694              0.0
>> >> hostname        TcpExtTCPFullUndo               45092              0.0
>> >> hostname        TcpExtTCPPartialUndo            22016              0.0
>> >> hostname        TcpExtTCPLossUndo               2150040            0.0
>> >> hostname        TcpExtTCPLostRetransmit         60119              0.0
>> >> hostname        TcpExtTCPSackFailures           2626782            0.0
>> >> hostname        TcpExtTCPLossFailures           182999             0.0
>> >> hostname        TcpExtTCPFastRetrans            4334275            0.0
>> >> hostname        TcpExtTCPSlowStartRetrans       3453348            0.0
>> >> hostname        TcpExtTCPTimeouts               1070997            0.0
>> >> hostname        TcpExtTCPLossProbes             2633545            0.0
>> >> hostname        TcpExtTCPLossProbeRecovery      941647             0.0
>> >> hostname        TcpExtTCPSackRecoveryFail       336302             0.0
>> >> hostname        TcpExtTCPRcvCollapsed           461354             0.0
>> >> hostname        TcpExtTCPAbortOnData            349196             0.0
>> >> hostname        TcpExtTCPAbortOnClose           3395               0.0
>> >> hostname        TcpExtTCPAbortOnTimeout         51201              0.0
>> >> hostname        TcpExtTCPMemoryPressures        2                  0.0
>> >> hostname        TcpExtTCPSpuriousRTOs           2120503            0.0
>> >> hostname        TcpExtTCPSackShifted            2613736            0.0
>> >> hostname        TcpExtTCPSackMerged             21358743           0.0
>> >> hostname        TcpExtTCPSackShiftFallback      8769387            0.0
>> >> hostname        TcpExtTCPBacklogDrop            5                  0.0
>> >> hostname        TcpExtTCPRetransFail            843                0.0
>> >> hostname        TcpExtTCPRcvCoalesce            949068035          0.0
>> >> hostname        TcpExtTCPOFOQueue               470118             0.0
>> >> hostname        TcpExtTCPOFODrop                9915               0.0
>> >> hostname        TcpExtTCPOFOMerge               9                  0.0
>> >> hostname        TcpExtTCPChallengeACK           90                 0.0
>> >> hostname        TcpExtTCPSYNChallenge           3                  0.0
>> >> hostname        TcpExtTCPFastOpenActive         2089               0.0
>> >> hostname        TcpExtTCPSpuriousRtxHostQueues  896596             0.0
>> >> hostname        TcpExtTCPAutoCorking            547386735          0.0
>> >> hostname        TcpExtTCPFromZeroWindowAdv      28757              0.0
>> >> hostname        TcpExtTCPToZeroWindowAdv        28761              0.0
>> >> hostname        TcpExtTCPWantZeroWindowAdv      322431             0.0
>> >> hostname        TcpExtTCPSynRetrans             3026               0.0
>> >> hostname        TcpExtTCPOrigDataSent           40976870977        0.0
>> >> hostname        TcpExtTCPHystartTrainDetect     453920             0.0
>> >> hostname        TcpExtTCPHystartTrainCwnd       11586273           0.0
>> >> hostname        TcpExtTCPHystartDelayDetect     10943              0.0
>> >> hostname        TcpExtTCPHystartDelayCwnd       763554             0.0
>> >> hostname        TcpExtTCPACKSkippedPAWS         30                 0.0
>> >> hostname        TcpExtTCPACKSkippedSeq          218                0.0
>> >> hostname        TcpExtTCPWinProbe               2408               0.0
>> >> hostname        TcpExtTCPKeepAlive              213768             0.0
>> >> hostname        TcpExtTCPMTUPFail               69                 0.0
>> >> hostname        TcpExtTCPMTUPSuccess            8811               0.0
>> >>
>> >> Thanks!
>
>


* [PATCH v4 net-next 0/8] flow_dissector: Protocol specific flow dissector offload
From: Tom Herbert @ 2017-09-28 23:52 UTC
  To: davem; +Cc: netdev, rohit, Tom Herbert

This patch set adds a new offload type to perform flow dissection for
specific protocols (either by EtherType or by IP protocol). This is
primarily useful to crack open UDP encapsulations (like VXLAN, GUE) for
the purposes of parsing the encapsulated packet.

Items in this patch set:
- Create new protocol case in __skb_flow_dissect for ETH_P_TEB. This is
  based on the code in the GRE dissect function, and the special handling
  in GRE can now be removed (it sets protocol to ETH_P_TEB and returns so
  that goto proto_again is done)
- Add infrastructure for protocol specific flow dissection offload
- Add infrastructure to perform UDP flow dissection. Uses same model of
  GRO where a flow_dissect callback can be associated with a UDP
  socket
- Use the infrastructure to support flow dissection of VXLAN and GUE
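
To illustrate the idea of per-protocol dissection callbacks described in
the items above, here is a minimal userspace sketch in C. It is not the
kernel code from this series: struct pkt_view, register_dissector(),
skip_udp_header() and dissect_l4() are hypothetical names, and the handler
only skips a fixed 8-byte UDP header rather than doing a socket lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical; they only model the dispatch idea. */
enum dissect_ret { DISSECT_OK, DISSECT_BAD, DISSECT_NO_HANDLER };

struct pkt_view {
	const uint8_t *data;
	size_t len;	/* total bytes available */
	size_t nhoff;	/* offset of the next header to parse */
};

typedef enum dissect_ret (*dissect_fn)(struct pkt_view *p);

/* One optional callback per IP protocol number. */
static dissect_fn dissect_by_ipproto[256];

static void register_dissector(uint8_t ipproto, dissect_fn fn)
{
	dissect_by_ipproto[ipproto] = fn;
}

/* Toy handler: step over an 8-byte UDP header if it fits. */
static enum dissect_ret skip_udp_header(struct pkt_view *p)
{
	assert(p->nhoff <= p->len);
	if (p->len - p->nhoff < 8)
		return DISSECT_BAD;
	p->nhoff += 8;
	return DISSECT_OK;
}

/* Generic path: call the protocol-specific handler when one is registered. */
static enum dissect_ret dissect_l4(struct pkt_view *p, uint8_t ipproto)
{
	dissect_fn fn = dissect_by_ipproto[ipproto];

	return fn ? fn(p) : DISSECT_NO_HANDLER;
}
```

A caller would register skip_udp_header for protocol 17 and then invoke
dissect_l4() with the packet view. Protocols without a registered handler
are simply left alone by the generic code.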

Tested:

Forced RPS to call flow dissection for VXLAN, FOU, and GUE. Observed
that inner packet was being properly dissected.

v2: Add signed off

v3:
   - Make skb argument of flow dissector to be non const
   - Change UDP GRO to only do something if encap_needed static
     key is set
   - don't reference inet6_offloads or inet_offloads, get to
     them through ptype

v4:
   - skb argument to ndo_rx_flow_steer also needs to become
     non constant

Tom Herbert (8):
  flow_dissector: Change skbuf argument to be non const
  flow_dissector: Move ETH_P_TEB processing to main switch
  udp: Check static key udp_encap_needed in udp_gro_receive
  flow_dissector: Add protocol specific flow dissection offload
  ip: Add callbacks to flow dissection by IP protocol
  udp: flow dissector offload
  fou: Support flow dissection
  vxlan: support flow dissect

 drivers/net/ethernet/broadcom/bnxt/bnxt.c         |  2 +-
 drivers/net/ethernet/cisco/enic/enic_clsf.c       |  2 +-
 drivers/net/ethernet/cisco/enic/enic_clsf.h       |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c    |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c |  2 +-
 drivers/net/ethernet/qlogic/qede/qede.h           |  2 +-
 drivers/net/ethernet/qlogic/qede/qede_filter.c    |  2 +-
 drivers/net/ethernet/sfc/efx.h                    |  2 +-
 drivers/net/ethernet/sfc/falcon/efx.h             |  2 +-
 drivers/net/ethernet/sfc/falcon/rx.c              |  2 +-
 drivers/net/ethernet/sfc/rx.c                     |  2 +-
 drivers/net/vxlan.c                               | 40 +++++++++++++
 include/linux/netdevice.h                         | 31 +++++++++-
 include/linux/skbuff.h                            | 12 ++--
 include/linux/udp.h                               |  8 +++
 include/net/flow_dissector.h                      |  1 +
 include/net/ip_fib.h                              |  4 +-
 include/net/route.h                               |  4 +-
 include/net/udp.h                                 | 10 ++++
 include/net/udp_tunnel.h                          |  8 +++
 net/core/dev.c                                    | 65 +++++++++++++++++++++
 net/core/flow_dissector.c                         | 71 ++++++++++++++---------
 net/ipv4/af_inet.c                                | 27 +++++++++
 net/ipv4/fib_semantics.c                          |  2 +-
 net/ipv4/fou.c                                    | 63 ++++++++++++++++++++
 net/ipv4/route.c                                  | 10 ++--
 net/ipv4/udp.c                                    |  4 +-
 net/ipv4/udp_offload.c                            | 55 ++++++++++++++++++
 net/ipv4/udp_tunnel.c                             |  1 +
 net/ipv6/ip6_offload.c                            | 27 +++++++++
 net/ipv6/udp_offload.c                            | 23 ++++++++
 net/sched/sch_sfq.c                               |  2 +-
 33 files changed, 433 insertions(+), 59 deletions(-)

-- 
2.11.0


* [PATCH v4 net-next 1/8] flow_dissector: Change skbuf argument to be non const
From: Tom Herbert @ 2017-09-28 23:52 UTC
  To: davem; +Cc: netdev, rohit, Tom Herbert
In-Reply-To: <20170928235230.22158-1-tom@quantonium.net>

Change the skbuf argument of __skb_flow_dissect to be non constant so
that the function can call functions that take non constant skbuf
arguments. This is needed if we are to call socket lookup or BPF in the
flow dissector path.

The changes include unraveling the call chain into __skb_flow_dissect so
that those also use non constant skbufs.
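
As a rough illustration of why the whole call chain has to drop const,
consider the following userspace sketch. The names (struct buf,
lookup_and_mark, dissect, get_hash) are hypothetical and only stand in for
the skb, a socket lookup or BPF call, and the dissector entry points.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an skb. */
struct buf {
	uint32_t hash;
	int lookups;
};

/* Models a helper that needs a mutable buffer (e.g. a lookup that
 * updates state on the buffer). */
static uint32_t lookup_and_mark(struct buf *b)
{
	assert(b);
	b->lookups++;
	return b->hash ^ 0x9e3779b9u;
}

/* If this took `const struct buf *b`, passing b to lookup_and_mark()
 * would discard the qualifier and draw a compiler diagnostic, so the
 * parameter has to be non-const here ... */
static uint32_t dissect(struct buf *b)
{
	return lookup_and_mark(b);
}

/* ... and, for the same reason, in every caller up the chain. */
static uint32_t get_hash(struct buf *b)
{
	return dissect(b);
}
```

The same reasoning applies one level at a time: once the innermost
function needs a mutable pointer, each caller that merely forwards the
pointer must stop promising const as well.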

Signed-off-by: Tom Herbert <tom@quantonium.net>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c         |  2 +-
 drivers/net/ethernet/cisco/enic/enic_clsf.c       |  2 +-
 drivers/net/ethernet/cisco/enic/enic_clsf.h       |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c    |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c |  2 +-
 drivers/net/ethernet/qlogic/qede/qede.h           |  2 +-
 drivers/net/ethernet/qlogic/qede/qede_filter.c    |  2 +-
 drivers/net/ethernet/sfc/efx.h                    |  2 +-
 drivers/net/ethernet/sfc/falcon/efx.h             |  2 +-
 drivers/net/ethernet/sfc/falcon/rx.c              |  2 +-
 drivers/net/ethernet/sfc/rx.c                     |  2 +-
 include/linux/netdevice.h                         |  4 ++--
 include/linux/skbuff.h                            | 12 ++++++------
 include/net/ip_fib.h                              |  4 ++--
 include/net/route.h                               |  4 ++--
 net/core/flow_dissector.c                         | 10 +++++-----
 net/ipv4/fib_semantics.c                          |  2 +-
 net/ipv4/route.c                                  |  6 +++---
 net/sched/sch_sfq.c                               |  2 +-
 20 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 5ba49938ba55..29f5cf6bea4a 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -7344,7 +7344,7 @@ static bool bnxt_fltr_match(struct bnxt_ntuple_filter *f1,
 	return false;
 }
 
-static int bnxt_rx_flow_steer(struct net_device *dev, const struct sk_buff *skb,
+static int bnxt_rx_flow_steer(struct net_device *dev, struct sk_buff *skb,
 			      u16 rxq_index, u32 flow_id)
 {
 	struct bnxt *bp = netdev_priv(dev);
diff --git a/drivers/net/ethernet/cisco/enic/enic_clsf.c b/drivers/net/ethernet/cisco/enic/enic_clsf.c
index 3c677ed3c29e..7ee2aa1c3184 100644
--- a/drivers/net/ethernet/cisco/enic/enic_clsf.c
+++ b/drivers/net/ethernet/cisco/enic/enic_clsf.c
@@ -167,7 +167,7 @@ static struct enic_rfs_fltr_node *htbl_key_search(struct hlist_head *h,
 	return NULL;
 }
 
-int enic_rx_flow_steer(struct net_device *dev, const struct sk_buff *skb,
+int enic_rx_flow_steer(struct net_device *dev, struct sk_buff *skb,
 		       u16 rxq_index, u32 flow_id)
 {
 	struct flow_keys keys;
diff --git a/drivers/net/ethernet/cisco/enic/enic_clsf.h b/drivers/net/ethernet/cisco/enic/enic_clsf.h
index 4bfbf25f9ddc..0e7f533f81b9 100644
--- a/drivers/net/ethernet/cisco/enic/enic_clsf.h
+++ b/drivers/net/ethernet/cisco/enic/enic_clsf.h
@@ -13,7 +13,7 @@ void enic_rfs_flw_tbl_free(struct enic *enic);
 struct enic_rfs_fltr_node *htbl_fltr_search(struct enic *enic, u16 fltr_id);
 
 #ifdef CONFIG_RFS_ACCEL
-int enic_rx_flow_steer(struct net_device *dev, const struct sk_buff *skb,
+int enic_rx_flow_steer(struct net_device *dev, struct sk_buff *skb,
 		       u16 rxq_index, u32 flow_id);
 void enic_flow_may_expire(unsigned long data);
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 9c218f1cfc6c..9f7afbfb09f9 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -348,7 +348,7 @@ mlx4_en_filter_find(struct mlx4_en_priv *priv, __be32 src_ip, __be32 dst_ip,
 }
 
 static int
-mlx4_en_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
+mlx4_en_filter_rfs(struct net_device *net_dev, struct sk_buff *skb,
 		   u16 rxq_index, u32 flow_id)
 {
 	struct mlx4_en_priv *priv = netdev_priv(net_dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index cc13d3dbd366..897c9d46702c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -1017,7 +1017,7 @@ int mlx5e_arfs_create_tables(struct mlx5e_priv *priv);
 void mlx5e_arfs_destroy_tables(struct mlx5e_priv *priv);
 int mlx5e_arfs_enable(struct mlx5e_priv *priv);
 int mlx5e_arfs_disable(struct mlx5e_priv *priv);
-int mlx5e_rx_flow_steer(struct net_device *dev, const struct sk_buff *skb,
+int mlx5e_rx_flow_steer(struct net_device *dev, struct sk_buff *skb,
 			u16 rxq_index, u32 flow_id);
 #endif
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c b/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
index 12d3ced61114..f5e182bd613d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
@@ -699,7 +699,7 @@ static struct arfs_rule *arfs_find_rule(struct arfs_table *arfs_t,
 	return NULL;
 }
 
-int mlx5e_rx_flow_steer(struct net_device *dev, const struct sk_buff *skb,
+int mlx5e_rx_flow_steer(struct net_device *dev, struct sk_buff *skb,
 			u16 rxq_index, u32 flow_id)
 {
 	struct mlx5e_priv *priv = netdev_priv(dev);
diff --git a/drivers/net/ethernet/qlogic/qede/qede.h b/drivers/net/ethernet/qlogic/qede/qede.h
index adb700512baa..56c364811929 100644
--- a/drivers/net/ethernet/qlogic/qede/qede.h
+++ b/drivers/net/ethernet/qlogic/qede/qede.h
@@ -445,7 +445,7 @@ struct qede_fastpath {
 #define QEDE_SP_RX_MODE			1
 
 #ifdef CONFIG_RFS_ACCEL
-int qede_rx_flow_steer(struct net_device *dev, const struct sk_buff *skb,
+int qede_rx_flow_steer(struct net_device *dev, struct sk_buff *skb,
 		       u16 rxq_index, u32 flow_id);
 #define QEDE_SP_ARFS_CONFIG	4
 #define QEDE_SP_TASK_POLL_DELAY	(5 * HZ)
diff --git a/drivers/net/ethernet/qlogic/qede/qede_filter.c b/drivers/net/ethernet/qlogic/qede/qede_filter.c
index f79e36e4060a..2d2b473fbff8 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_filter.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_filter.c
@@ -411,7 +411,7 @@ qede_alloc_filter(struct qede_dev *edev, int min_hlen)
 	return n;
 }
 
-int qede_rx_flow_steer(struct net_device *dev, const struct sk_buff *skb,
+int qede_rx_flow_steer(struct net_device *dev, struct sk_buff *skb,
 		       u16 rxq_index, u32 flow_id)
 {
 	struct qede_dev *edev = netdev_priv(dev);
diff --git a/drivers/net/ethernet/sfc/efx.h b/drivers/net/ethernet/sfc/efx.h
index d407adf59610..805c7880df8d 100644
--- a/drivers/net/ethernet/sfc/efx.h
+++ b/drivers/net/ethernet/sfc/efx.h
@@ -171,7 +171,7 @@ static inline s32 efx_filter_get_rx_ids(struct efx_nic *efx,
 	return efx->type->filter_get_rx_ids(efx, priority, buf, size);
 }
 #ifdef CONFIG_RFS_ACCEL
-int efx_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
+int efx_filter_rfs(struct net_device *net_dev, struct sk_buff *skb,
 		   u16 rxq_index, u32 flow_id);
 bool __efx_filter_rfs_expire(struct efx_nic *efx, unsigned quota);
 static inline void efx_filter_rfs_expire(struct efx_channel *channel)
diff --git a/drivers/net/ethernet/sfc/falcon/efx.h b/drivers/net/ethernet/sfc/falcon/efx.h
index 4f3bb30661ea..e3b9b7cbbb39 100644
--- a/drivers/net/ethernet/sfc/falcon/efx.h
+++ b/drivers/net/ethernet/sfc/falcon/efx.h
@@ -164,7 +164,7 @@ static inline s32 ef4_filter_get_rx_ids(struct ef4_nic *efx,
 	return efx->type->filter_get_rx_ids(efx, priority, buf, size);
 }
 #ifdef CONFIG_RFS_ACCEL
-int ef4_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
+int ef4_filter_rfs(struct net_device *net_dev, struct sk_buff *skb,
 		   u16 rxq_index, u32 flow_id);
 bool __ef4_filter_rfs_expire(struct ef4_nic *efx, unsigned quota);
 static inline void ef4_filter_rfs_expire(struct ef4_channel *channel)
diff --git a/drivers/net/ethernet/sfc/falcon/rx.c b/drivers/net/ethernet/sfc/falcon/rx.c
index 6a8406dc0c2b..d5d2816b30dd 100644
--- a/drivers/net/ethernet/sfc/falcon/rx.c
+++ b/drivers/net/ethernet/sfc/falcon/rx.c
@@ -833,7 +833,7 @@ MODULE_PARM_DESC(rx_refill_threshold,
 
 #ifdef CONFIG_RFS_ACCEL
 
-int ef4_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
+int ef4_filter_rfs(struct net_device *net_dev, struct sk_buff *skb,
 		   u16 rxq_index, u32 flow_id)
 {
 	struct ef4_nic *efx = netdev_priv(net_dev);
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index 42443f434569..35898054aced 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -827,7 +827,7 @@ MODULE_PARM_DESC(rx_refill_threshold,
 
 #ifdef CONFIG_RFS_ACCEL
 
-int efx_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
+int efx_filter_rfs(struct net_device *net_dev, struct sk_buff *skb,
 		   u16 rxq_index, u32 flow_id)
 {
 	struct efx_nic *efx = netdev_priv(net_dev);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f535779d9dc1..06b173200e23 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1010,7 +1010,7 @@ struct xfrmdev_ops {
  *	protocol stack to use.
  *
  *	RFS acceleration.
- * int (*ndo_rx_flow_steer)(struct net_device *dev, const struct sk_buff *skb,
+ * int (*ndo_rx_flow_steer)(struct net_device *dev, struct sk_buff *skb,
  *			    u16 rxq_index, u32 flow_id);
  *	Set hardware filter for RFS.  rxq_index is the target queue index;
  *	flow_id is a flow ID to be passed to rps_may_expire_flow() later.
@@ -1236,7 +1236,7 @@ struct net_device_ops {
 
 #ifdef CONFIG_RFS_ACCEL
 	int			(*ndo_rx_flow_steer)(struct net_device *dev,
-						     const struct sk_buff *skb,
+						     struct sk_buff *skb,
 						     u16 rxq_index,
 						     u32 flow_id);
 #endif
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 19e64bfb1a66..5a6e765e120f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1155,8 +1155,8 @@ __skb_set_sw_hash(struct sk_buff *skb, __u32 hash, bool is_l4)
 }
 
 void __skb_get_hash(struct sk_buff *skb);
-u32 __skb_get_hash_symmetric(const struct sk_buff *skb);
-u32 skb_get_poff(const struct sk_buff *skb);
+u32 __skb_get_hash_symmetric(struct sk_buff *skb);
+u32 skb_get_poff(struct sk_buff *skb);
 u32 __skb_get_poff(const struct sk_buff *skb, void *data,
 		   const struct flow_keys *keys, int hlen);
 __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
@@ -1172,13 +1172,13 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
 			     const struct flow_dissector_key *key,
 			     unsigned int key_count);
 
-bool __skb_flow_dissect(const struct sk_buff *skb,
+bool __skb_flow_dissect(struct sk_buff *skb,
 			struct flow_dissector *flow_dissector,
 			void *target_container,
 			void *data, __be16 proto, int nhoff, int hlen,
 			unsigned int flags);
 
-static inline bool skb_flow_dissect(const struct sk_buff *skb,
+static inline bool skb_flow_dissect(struct sk_buff *skb,
 				    struct flow_dissector *flow_dissector,
 				    void *target_container, unsigned int flags)
 {
@@ -1186,7 +1186,7 @@ static inline bool skb_flow_dissect(const struct sk_buff *skb,
 				  NULL, 0, 0, 0, flags);
 }
 
-static inline bool skb_flow_dissect_flow_keys(const struct sk_buff *skb,
+static inline bool skb_flow_dissect_flow_keys(struct sk_buff *skb,
 					      struct flow_keys *flow,
 					      unsigned int flags)
 {
@@ -1225,7 +1225,7 @@ static inline __u32 skb_get_hash_flowi6(struct sk_buff *skb, const struct flowi6
 	return skb->hash;
 }
 
-__u32 skb_get_hash_perturb(const struct sk_buff *skb, u32 perturb);
+__u32 skb_get_hash_perturb(struct sk_buff *skb, u32 perturb);
 
 static inline __u32 skb_get_hash_raw(const struct sk_buff *skb)
 {
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 1a7f7e424320..a376dfe1ad44 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -374,11 +374,11 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 int fib_multipath_hash(const struct fib_info *fi, const struct flowi4 *fl4,
-		       const struct sk_buff *skb);
+		       struct sk_buff *skb);
 #endif
 void fib_select_multipath(struct fib_result *res, int hash);
 void fib_select_path(struct net *net, struct fib_result *res,
-		     struct flowi4 *fl4, const struct sk_buff *skb);
+		     struct flowi4 *fl4, struct sk_buff *skb);
 
 /* Exported by fib_trie.c */
 void fib_trie_init(void);
diff --git a/include/net/route.h b/include/net/route.h
index 57dfc6850d37..cb95b79f0117 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -114,10 +114,10 @@ int ip_rt_init(void);
 void rt_cache_flush(struct net *net);
 void rt_flush_dev(struct net_device *dev);
 struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *flp,
-					const struct sk_buff *skb);
+					struct sk_buff *skb);
 struct rtable *ip_route_output_key_hash_rcu(struct net *net, struct flowi4 *flp,
 					    struct fib_result *res,
-					    const struct sk_buff *skb);
+					    struct sk_buff *skb);
 
 static inline struct rtable *__ip_route_output_key(struct net *net,
 						   struct flowi4 *flp)
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 0a977373d003..76f5e5bc3177 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -424,7 +424,7 @@ static bool skb_flow_dissect_allowed(int *num_hdrs)
  *
  * Caller must take care of zeroing target container memory.
  */
-bool __skb_flow_dissect(const struct sk_buff *skb,
+bool __skb_flow_dissect(struct sk_buff *skb,
 			struct flow_dissector *flow_dissector,
 			void *target_container,
 			void *data, __be16 proto, int nhoff, int hlen,
@@ -1015,7 +1015,7 @@ u32 flow_hash_from_keys(struct flow_keys *keys)
 }
 EXPORT_SYMBOL(flow_hash_from_keys);
 
-static inline u32 ___skb_get_hash(const struct sk_buff *skb,
+static inline u32 ___skb_get_hash(struct sk_buff *skb,
 				  struct flow_keys *keys, u32 keyval)
 {
 	skb_flow_dissect_flow_keys(skb, keys,
@@ -1053,7 +1053,7 @@ EXPORT_SYMBOL(make_flow_keys_digest);
 
 static struct flow_dissector flow_keys_dissector_symmetric __read_mostly;
 
-u32 __skb_get_hash_symmetric(const struct sk_buff *skb)
+u32 __skb_get_hash_symmetric(struct sk_buff *skb)
 {
 	struct flow_keys keys;
 
@@ -1090,7 +1090,7 @@ void __skb_get_hash(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(__skb_get_hash);
 
-__u32 skb_get_hash_perturb(const struct sk_buff *skb, u32 perturb)
+__u32 skb_get_hash_perturb(struct sk_buff *skb, u32 perturb)
 {
 	struct flow_keys keys;
 
@@ -1158,7 +1158,7 @@ u32 __skb_get_poff(const struct sk_buff *skb, void *data,
  * truncate packets without needing to push actual payload to the user
  * space and can analyze headers only, instead.
  */
-u32 skb_get_poff(const struct sk_buff *skb)
+u32 skb_get_poff(struct sk_buff *skb)
 {
 	struct flow_keys keys;
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 57a5d48acee8..dc610646bc4c 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1759,7 +1759,7 @@ void fib_select_multipath(struct fib_result *res, int hash)
 #endif
 
 void fib_select_path(struct net *net, struct fib_result *res,
-		     struct flowi4 *fl4, const struct sk_buff *skb)
+		     struct flowi4 *fl4, struct sk_buff *skb)
 {
 	bool oif_check;
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 94d4cd2d5ea4..94c5b81d8f2b 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1791,7 +1791,7 @@ static void ip_multipath_l3_keys(const struct sk_buff *skb,
 
 /* if skb is set it will be used and fl4 can be NULL */
 int fib_multipath_hash(const struct fib_info *fi, const struct flowi4 *fl4,
-		       const struct sk_buff *skb)
+		       struct sk_buff *skb)
 {
 	struct net *net = fi->fib_net;
 	struct flow_keys hash_keys;
@@ -2270,7 +2270,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
  */
 
 struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
-					const struct sk_buff *skb)
+					struct sk_buff *skb)
 {
 	__u8 tos = RT_FL_TOS(fl4);
 	struct fib_result res;
@@ -2295,7 +2295,7 @@ EXPORT_SYMBOL_GPL(ip_route_output_key_hash);
 
 struct rtable *ip_route_output_key_hash_rcu(struct net *net, struct flowi4 *fl4,
 					    struct fib_result *res,
-					    const struct sk_buff *skb)
+					    struct sk_buff *skb)
 {
 	struct net_device *dev_out = NULL;
 	int orig_oif = fl4->flowi4_oif;
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 74ea863b8240..0d2d3a8d03f0 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -158,7 +158,7 @@ static inline struct sfq_head *sfq_dep_head(struct sfq_sched_data *q, sfq_index
 }
 
 static unsigned int sfq_hash(const struct sfq_sched_data *q,
-			     const struct sk_buff *skb)
+			     struct sk_buff *skb)
 {
 	return skb_get_hash_perturb(skb, q->perturbation) & (q->divisor - 1);
 }
-- 
2.11.0


* [PATCH v4 net-next 2/8] flow_dissector: Move ETH_P_TEB processing to main switch
From: Tom Herbert @ 2017-09-28 23:52 UTC
  To: davem; +Cc: netdev, rohit, Tom Herbert
In-Reply-To: <20170928235230.22158-1-tom@quantonium.net>

Support for processing TEB is currently in GRE flow dissection as a
special case. This can be moved to be a case in the main proto switch in
__skb_flow_dissect.
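
The effect of handling TEB in the main switch can be sketched in userspace
C as below. This is a simplified model, not the kernel function:
resolve_proto() and the PROTO_* / ETH_HLEN_SKETCH constants are
hypothetical names, protocol values are kept in host byte order for
brevity, and the loop stands in for the proto_again re-dispatch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN_SKETCH	14	/* dst MAC + src MAC + EtherType */
#define PROTO_TEB	0x6558	/* Transparent Ethernet Bridging */
#define PROTO_IPV4	0x0800

/* Follow inner Ethernet headers until the protocol is no longer TEB.
 * Returns the resolved protocol, or 0 if the buffer is truncated;
 * *nhoff is advanced past every inner Ethernet header consumed. */
static uint16_t resolve_proto(const uint8_t *data, size_t len,
			      uint16_t proto, size_t *nhoff)
{
	assert(data && nhoff);

	for (;;) {	/* plays the role of "goto proto_again" */
		switch (proto) {
		case PROTO_TEB:
			if (*nhoff > len || len - *nhoff < ETH_HLEN_SKETCH)
				return 0;
			/* EtherType is the last two bytes of the header. */
			proto = (uint16_t)(data[*nhoff + 12] << 8 |
					   data[*nhoff + 13]);
			*nhoff += ETH_HLEN_SKETCH;
			continue;
		default:
			return proto;
		}
	}
}
```

Because TEB is just another case, any encapsulation that ends in an inner
Ethernet frame can set the protocol to TEB and re-enter the switch,
instead of duplicating the Ethernet parsing as GRE did.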

Signed-off-by: Tom Herbert <tom@quantonium.net>
---
 net/core/flow_dissector.c | 45 ++++++++++++++++++++++++---------------------
 1 file changed, 24 insertions(+), 21 deletions(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 76f5e5bc3177..c15b41f96cbe 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -282,27 +282,8 @@ __skb_flow_dissect_gre(const struct sk_buff *skb,
 	if (hdr->flags & GRE_SEQ)
 		offset += sizeof(((struct pptp_gre_header *) 0)->seq);
 
-	if (gre_ver == 0) {
-		if (*p_proto == htons(ETH_P_TEB)) {
-			const struct ethhdr *eth;
-			struct ethhdr _eth;
-
-			eth = __skb_header_pointer(skb, *p_nhoff + offset,
-						   sizeof(_eth),
-						   data, *p_hlen, &_eth);
-			if (!eth)
-				return FLOW_DISSECT_RET_OUT_BAD;
-			*p_proto = eth->h_proto;
-			offset += sizeof(*eth);
-
-			/* Cap headers that we access via pointers at the
-			 * end of the Ethernet header as our maximum alignment
-			 * at that point is only 2 bytes.
-			 */
-			if (NET_IP_ALIGN)
-				*p_hlen = *p_nhoff + offset;
-		}
-	} else { /* version 1, must be PPTP */
+	/* version 1, must be PPTP */
+	if (gre_ver == 1) {
 		u8 _ppp_hdr[PPP_HDRLEN];
 		u8 *ppp_hdr;
 
@@ -595,6 +576,28 @@ bool __skb_flow_dissect(struct sk_buff *skb,
 
 		break;
 	}
+	case htons(ETH_P_TEB): {
+		const struct ethhdr *eth;
+		struct ethhdr _eth;
+
+		eth = __skb_header_pointer(skb, nhoff, sizeof(_eth),
+					   data, hlen, &_eth);
+		if (!eth)
+			goto out_bad;
+
+		proto = eth->h_proto;
+		nhoff += sizeof(*eth);
+
+		/* Cap headers that we access via pointers at the
+		 * end of the Ethernet header as our maximum alignment
+		 * at that point is only 2 bytes.
+		 */
+		if (NET_IP_ALIGN)
+			hlen = nhoff;
+
+		fdret = FLOW_DISSECT_RET_PROTO_AGAIN;
+		break;
+	}
 	case htons(ETH_P_8021AD):
 	case htons(ETH_P_8021Q): {
 		const struct vlan_hdr *vlan;
-- 
2.11.0


* [PATCH v4 net-next 3/8] udp: Check static key udp_encap_needed in udp_gro_receive
From: Tom Herbert @ 2017-09-28 23:52 UTC (permalink / raw)
  To: davem; +Cc: netdev, rohit, Tom Herbert
In-Reply-To: <20170928235230.22158-1-tom@quantonium.net>

Currently, the only support for UDP GRO is provided by UDP encapsulation
protocols. Since they always set udp_encap_needed we can check that in
udp_gro_receive functions before performing a socket lookup.

Signed-off-by: Tom Herbert <tom@quantonium.net>
---
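The idea can be sketched in plain C as a cheap flag test placed in front
of an expensive lookup. Everything below is hypothetical: a bool stands
in for the static key, and gsk_socket_lookup only counts calls so the
saving is visible.

```c
#include <stdbool.h>

static bool gsk_encap_needed;	/* stands in for the udp_encap_needed key */
static unsigned int gsk_lookups;	/* how often the costly lookup ran */

static void gsk_encap_enable(void)
{
	gsk_encap_needed = true;
}

/* Pretend lookup: only port 4789 has an encapsulation socket. */
static int gsk_socket_lookup(unsigned short dport)
{
	gsk_lookups++;
	return dport == 4789;
}

/* Returns 1 if the packet would be handed to a GRO handler, 0 if it
 * would be flushed.
 */
static int gsk_gro_receive(unsigned short dport)
{
	if (!gsk_encap_needed)
		return 0;	/* flush without touching the socket table */

	return gsk_socket_lookup(dport);
}
```

Until gsk_encap_enable() has been called, gsk_gro_receive() never
reaches the lookup, which is the effect the check in the patch is
after.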
 include/net/udp.h      | 2 ++
 net/ipv4/udp.c         | 4 +++-
 net/ipv4/udp_offload.c | 7 +++++++
 net/ipv6/udp_offload.c | 7 +++++++
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 12dfbfe2e2d7..c6b1c5d8d3c9 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -97,6 +97,8 @@ static inline struct udp_hslot *udp_hashslot2(struct udp_table *table,
 
 extern struct proto udp_prot;
 
+extern struct static_key udp_encap_needed;
+
 extern atomic_long_t udp_memory_allocated;
 
 /* sysctl variables for udp */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 784ced0b9150..2788843e8eb2 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1813,7 +1813,9 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	return 0;
 }
 
-static struct static_key udp_encap_needed __read_mostly;
+struct static_key udp_encap_needed __read_mostly;
+EXPORT_SYMBOL(udp_encap_needed);
+
 void udp_encap_enable(void)
 {
 	static_key_enable(&udp_encap_needed);
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 97658bfc1b58..a744bb515455 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -261,6 +261,13 @@ static struct sk_buff **udp4_gro_receive(struct sk_buff **head,
 {
 	struct udphdr *uh = udp_gro_udphdr(skb);
 
+	if (!static_key_false(&udp_encap_needed)) {
+		/* Currently udp_gro_receive only does something if
+		 * a UDP encapsulation has been set.
+		 */
+		goto flush;
+	}
+
 	if (unlikely(!uh))
 		goto flush;
 
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index 455fd4e39333..111b026e4f03 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -34,6 +34,13 @@ static struct sk_buff **udp6_gro_receive(struct sk_buff **head,
 {
 	struct udphdr *uh = udp_gro_udphdr(skb);
 
+	if (!static_key_false(&udp_encap_needed)) {
+		/* Currently udp_gro_receive only does something if
+		 * a UDP encapsulation has been set.
+		 */
+		goto flush;
+	}
+
 	if (unlikely(!uh))
 		goto flush;
 
-- 
2.11.0


* [PATCH v4 net-next 4/8] flow_dissector: Add protocol specific flow dissection offload
From: Tom Herbert @ 2017-09-28 23:52 UTC (permalink / raw)
  To: davem; +Cc: netdev, rohit, Tom Herbert
In-Reply-To: <20170928235230.22158-1-tom@quantonium.net>

Add offload capability for performing protocol specific flow dissection
(either by EtherType or IP protocol).

Specifically:

- Add flow_dissect to offload callbacks
- Move flow_dissect_ret enum to flow_dissector.h, clean up names and add a
  couple of values
- Unify handling of functions that return flow_dissect_ret enum
- In __skb_flow_dissect, add default case for switch(proto) as well as
  switch(ip_proto) that looks up and calls protocol specific flow
  dissection

Signed-off-by: Tom Herbert <tom@quantonium.net>
---
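A minimal userspace sketch of the by-type lookup, assuming a plain array
in place of the RCU-protected offload list. All names here (sk_*,
skip_four) are invented for the note, and the callback signature is cut
down to the arguments the example needs.

```c
#include <stddef.h>
#include <stdint.h>

enum sk_ret { SK_CONTINUE, SK_OUT_BAD, SK_PROTO_AGAIN };

typedef enum sk_ret (*sk_dissect_fn)(const uint8_t *data, size_t len,
				     size_t *nhoff);

struct sk_offload {
	uint16_t type;			/* EtherType this entry handles */
	sk_dissect_fn flow_dissect;	/* NULL if there is no dissector */
};

/* Example handler: pretend the protocol has a fixed 4-byte header. */
static enum sk_ret skip_four(const uint8_t *data, size_t len, size_t *nhoff)
{
	(void)data;
	if (*nhoff > len || len - *nhoff < 4)
		return SK_OUT_BAD;
	*nhoff += 4;
	return SK_PROTO_AGAIN;
}

static const struct sk_offload sk_offloads[] = {
	{ 0x0800, NULL },	/* registered, but no flow_dissect callback */
	{ 0x88b5, skip_four },	/* 0x88b5: local experimental EtherType */
};

/* The first entry with a matching type and a callback wins; if nothing
 * matches, the caller is told to carry on as before.
 */
static enum sk_ret sk_dissect_by_type(uint16_t type, const uint8_t *data,
				      size_t len, size_t *nhoff)
{
	size_t n = sizeof(sk_offloads) / sizeof(sk_offloads[0]);
	size_t i;

	for (i = 0; i < n; i++) {
		if (sk_offloads[i].type != type ||
		    !sk_offloads[i].flow_dissect)
			continue;
		return sk_offloads[i].flow_dissect(data, len, nhoff);
	}

	return SK_CONTINUE;
}
```

Defaulting to "continue" means an unknown EtherType is no longer
treated as a hard failure by the lookup itself; whether that is the
right default is for the caller to decide.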
 include/linux/netdevice.h    | 27 ++++++++++++++++++
 include/net/flow_dissector.h |  1 +
 net/core/dev.c               | 65 ++++++++++++++++++++++++++++++++++++++++++++
 net/core/flow_dissector.c    | 16 +++++++++--
 net/ipv4/route.c             |  4 ++-
 5 files changed, 110 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 06b173200e23..f186b6ab480a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2207,12 +2207,25 @@ struct offload_callbacks {
 	struct sk_buff		**(*gro_receive)(struct sk_buff **head,
 						 struct sk_buff *skb);
 	int			(*gro_complete)(struct sk_buff *skb, int nhoff);
+	enum flow_dissect_ret (*flow_dissect)(struct sk_buff *skb,
+			struct flow_dissector_key_control *key_control,
+			struct flow_dissector *flow_dissector,
+			void *target_container, void *data,
+			__be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+			int *p_hlen, unsigned int flags);
 };
 
 struct packet_offload {
 	__be16			 type;	/* This is really htons(ether_type). */
 	u16			 priority;
 	struct offload_callbacks callbacks;
+	enum flow_dissect_ret (*proto_flow_dissect)(struct sk_buff *skb,
+			u8 proto,
+			struct flow_dissector_key_control *key_control,
+			struct flow_dissector *flow_dissector,
+			void *target_container, void *data,
+			__be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+			int *p_hlen, unsigned int flags);
 	struct list_head	 list;
 };
 
@@ -3252,6 +3265,20 @@ struct sk_buff *napi_get_frags(struct napi_struct *napi);
 gro_result_t napi_gro_frags(struct napi_struct *napi);
 struct packet_offload *gro_find_receive_by_type(__be16 type);
 struct packet_offload *gro_find_complete_by_type(__be16 type);
+enum flow_dissect_ret flow_dissect_by_type(struct sk_buff *skb,
+			__be16 type,
+			struct flow_dissector_key_control *key_control,
+			struct flow_dissector *flow_dissector,
+			void *target_container, void *data,
+			__be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+			int *p_hlen, unsigned int flags);
+enum flow_dissect_ret flow_dissect_by_type_proto(struct sk_buff *skb,
+			__be16 type, u8 proto,
+			struct flow_dissector_key_control *key_control,
+			struct flow_dissector *flow_dissector,
+			void *target_container, void *data,
+			__be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+			int *p_hlen, unsigned int flags);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index fc3dce730a6b..ad75bbfd1c9c 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -213,6 +213,7 @@ enum flow_dissector_key_id {
 #define FLOW_DISSECTOR_F_STOP_AT_L3		BIT(1)
 #define FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL	BIT(2)
 #define FLOW_DISSECTOR_F_STOP_AT_ENCAP		BIT(3)
+#define FLOW_DISSECTOR_F_STOP_AT_L4		BIT(4)
 
 struct flow_dissector_key {
 	enum flow_dissector_key_id key_id;
diff --git a/net/core/dev.c b/net/core/dev.c
index e350c768d4b5..f3cd884bd04b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -104,6 +104,7 @@
 #include <linux/stat.h>
 #include <net/dst.h>
 #include <net/dst_metadata.h>
+#include <net/flow_dissector.h>
 #include <net/pkt_sched.h>
 #include <net/pkt_cls.h>
 #include <net/checksum.h>
@@ -4907,6 +4908,70 @@ struct packet_offload *gro_find_complete_by_type(__be16 type)
 }
 EXPORT_SYMBOL(gro_find_complete_by_type);
 
+enum flow_dissect_ret flow_dissect_by_type(struct sk_buff *skb,
+			__be16 type,
+			struct flow_dissector_key_control *key_control,
+			struct flow_dissector *flow_dissector,
+			void *target_container, void *data,
+			__be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+			int *p_hlen, unsigned int flags)
+{
+	enum flow_dissect_ret ret = FLOW_DISSECT_RET_CONTINUE;
+	struct list_head *offload_head = &offload_base;
+	struct packet_offload *ptype;
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(ptype, offload_head, list) {
+		if (ptype->type != type || !ptype->callbacks.flow_dissect)
+			continue;
+		ret = ptype->callbacks.flow_dissect(skb, key_control,
+						    flow_dissector,
+						    target_container,
+						    data, p_proto,
+						    p_ip_proto, p_nhoff,
+						    p_hlen, flags);
+		break;
+	}
+
+	rcu_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL(flow_dissect_by_type);
+
+enum flow_dissect_ret flow_dissect_by_type_proto(struct sk_buff *skb,
+			__be16 type, u8 proto,
+			struct flow_dissector_key_control *key_control,
+			struct flow_dissector *flow_dissector,
+			void *target_container, void *data,
+			__be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+			int *p_hlen, unsigned int flags)
+{
+	enum flow_dissect_ret ret = FLOW_DISSECT_RET_CONTINUE;
+	struct list_head *offload_head = &offload_base;
+	struct packet_offload *ptype;
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(ptype, offload_head, list) {
+		if (ptype->type != type || !ptype->proto_flow_dissect)
+			continue;
+		ret = ptype->proto_flow_dissect(skb, proto, key_control,
+						    flow_dissector,
+						    target_container,
+						    data, p_proto,
+						    p_ip_proto, p_nhoff,
+						    p_hlen, flags);
+		break;
+	}
+
+	rcu_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL(flow_dissect_by_type_proto);
+
 static void napi_skb_free_stolen_head(struct sk_buff *skb)
 {
 	skb_dst_drop(skb);
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index c15b41f96cbe..84b8eb1f6664 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -9,6 +9,7 @@
 #include <net/ipv6.h>
 #include <net/gre.h>
 #include <net/pptp.h>
+#include <net/protocol.h>
 #include <linux/igmp.h>
 #include <linux/icmp.h>
 #include <linux/sctp.h>
@@ -721,7 +722,11 @@ bool __skb_flow_dissect(struct sk_buff *skb,
 		break;
 
 	default:
-		fdret = FLOW_DISSECT_RET_OUT_BAD;
+		fdret = flow_dissect_by_type(skb, proto, key_control,
+					     flow_dissector,
+					     target_container,
+					     data, &proto, &ip_proto, &nhoff,
+					     &hlen, flags);
 		break;
 	}
 
@@ -838,6 +843,12 @@ bool __skb_flow_dissect(struct sk_buff *skb,
 		break;
 
 	default:
+		fdret = flow_dissect_by_type_proto(skb, proto,
+						ip_proto, key_control,
+						flow_dissector,
+						target_container,
+						data, &proto, &ip_proto, &nhoff,
+						&hlen, flags);
 		break;
 	}
 
@@ -1022,7 +1033,8 @@ static inline u32 ___skb_get_hash(struct sk_buff *skb,
 				  struct flow_keys *keys, u32 keyval)
 {
 	skb_flow_dissect_flow_keys(skb, keys,
-				   FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
+				   FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL |
+				   FLOW_DISSECTOR_F_STOP_AT_L4);
 
 	return __flow_hash_from_keys(keys, keyval);
 }
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 94c5b81d8f2b..69d6ce7dfa18 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1811,7 +1811,9 @@ int fib_multipath_hash(const struct fib_info *fi, const struct flowi4 *fl4,
 	case 1:
 		/* skb is currently provided only when forwarding */
 		if (skb) {
-			unsigned int flag = FLOW_DISSECTOR_F_STOP_AT_ENCAP;
+			unsigned int flag = FLOW_DISSECTOR_F_STOP_AT_ENCAP |
+					    FLOW_DISSECTOR_F_STOP_AT_L4;
+;
 			struct flow_keys keys;
 
 			/* short-circuit if we already have L4 hash present */
-- 
2.11.0


* [PATCH v4 net-next 5/8] ip: Add callbacks to flow dissection by IP protocol
From: Tom Herbert @ 2017-09-28 23:52 UTC (permalink / raw)
  To: davem; +Cc: netdev, rohit, Tom Herbert
In-Reply-To: <20170928235230.22158-1-tom@quantonium.net>

Populate the proto_flow_dissect function for IPv4 and IPv6 packet
offloads. This allows the caller to flow dissect a packet starting
at the given IP protocol (as parsed to that point by flow dissector
for instance).

Signed-off-by: Tom Herbert <tom@quantonium.net>
---
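Illustration only: the per-protocol step amounts to indexing a
256-entry table by IP protocol number and calling the slot if it is
populated. The sketch below uses an ordinary array instead of
inet_offloads and a reduced callback signature; the ipsk_* names are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum ipsk_ret { IPSK_CONTINUE, IPSK_IPPROTO_AGAIN };

typedef enum ipsk_ret (*ipsk_dissect_fn)(size_t *nhoff, uint8_t *ip_proto);

/* Example handler for protocol 17 (UDP): step over the 8-byte UDP
 * header and pretend the payload is TCP (6). Purely illustrative.
 */
static enum ipsk_ret ipsk_udp(size_t *nhoff, uint8_t *ip_proto)
{
	*nhoff += 8;
	*ip_proto = 6;
	return IPSK_IPPROTO_AGAIN;
}

/* One slot per possible IP protocol value; unset slots stay NULL. */
static const ipsk_dissect_fn ipsk_table[256] = {
	[17] = ipsk_udp,
};

static enum ipsk_ret ipsk_proto_dissect(uint8_t proto, size_t *nhoff,
					uint8_t *ip_proto)
{
	ipsk_dissect_fn fn = ipsk_table[proto];

	if (!fn)
		return IPSK_CONTINUE;	/* nothing registered */

	return fn(nhoff, ip_proto);
}
```

Because the index is a u8, every possible value lands inside the table
and no range check is needed before the lookup.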
 net/ipv4/af_inet.c     | 27 +++++++++++++++++++++++++++
 net/ipv6/ip6_offload.c | 27 +++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e31108e5ef79..18c1d884999a 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1440,6 +1440,32 @@ static struct sk_buff **ipip_gro_receive(struct sk_buff **head,
 	return inet_gro_receive(head, skb);
 }
 
+static enum flow_dissect_ret inet_proto_flow_dissect(struct sk_buff *skb,
+			u8 proto,
+			struct flow_dissector_key_control *key_control,
+			struct flow_dissector *flow_dissector,
+			void *target_container, void *data,
+			__be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+			int *p_hlen, unsigned int flags)
+{
+	enum flow_dissect_ret ret = FLOW_DISSECT_RET_CONTINUE;
+	const struct net_offload *ops;
+
+	rcu_read_lock();
+
+	ops = rcu_dereference(inet_offloads[proto]);
+	if (ops && ops->callbacks.flow_dissect)
+		ret =  ops->callbacks.flow_dissect(skb, key_control,
+						   flow_dissector,
+						   target_container,
+						   data, p_proto, p_ip_proto,
+						   p_nhoff, p_hlen, flags);
+
+	rcu_read_unlock();
+
+	return ret;
+}
+
 #define SECONDS_PER_DAY	86400
 
 /* inet_current_timestamp - Return IP network timestamp
@@ -1763,6 +1789,7 @@ static int ipv4_proc_init(void);
 
 static struct packet_offload ip_packet_offload __read_mostly = {
 	.type = cpu_to_be16(ETH_P_IP),
+	.proto_flow_dissect = inet_proto_flow_dissect,
 	.callbacks = {
 		.gso_segment = inet_gso_segment,
 		.gro_receive = inet_gro_receive,
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index cdb3728faca7..a33a2b40b3d6 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -339,8 +339,35 @@ static int ip4ip6_gro_complete(struct sk_buff *skb, int nhoff)
 	return inet_gro_complete(skb, nhoff);
 }
 
+static enum flow_dissect_ret inet6_proto_flow_dissect(struct sk_buff *skb,
+			u8 proto,
+			struct flow_dissector_key_control *key_control,
+			struct flow_dissector *flow_dissector,
+			void *target_container, void *data,
+			__be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+			int *p_hlen, unsigned int flags)
+{
+	enum flow_dissect_ret ret = FLOW_DISSECT_RET_CONTINUE;
+	const struct net_offload *ops;
+
+	rcu_read_lock();
+
+	ops = rcu_dereference(inet6_offloads[proto]);
+	if (ops && ops->callbacks.flow_dissect)
+		ret =  ops->callbacks.flow_dissect(skb, key_control,
+						   flow_dissector,
+						   target_container, data,
+						   p_proto, p_ip_proto, p_nhoff,
+						   p_hlen, flags);
+
+	rcu_read_unlock();
+
+	return ret;
+}
+
 static struct packet_offload ipv6_packet_offload __read_mostly = {
 	.type = cpu_to_be16(ETH_P_IPV6),
+	.proto_flow_dissect = inet6_proto_flow_dissect,
 	.callbacks = {
 		.gso_segment = ipv6_gso_segment,
 		.gro_receive = ipv6_gro_receive,
-- 
2.11.0


* [PATCH v4 net-next 6/8] udp: flow dissector offload
From: Tom Herbert @ 2017-09-28 23:52 UTC (permalink / raw)
  To: davem; +Cc: netdev, rohit, Tom Herbert
In-Reply-To: <20170928235230.22158-1-tom@quantonium.net>

Add support to perform UDP specific flow dissection. This is
primarily intended for dissecting encapsulated packets in UDP
encapsulation.

This patch adds a flow_dissect offload for UDP4 and UDP6. The backend
function performs a socket lookup and calls the flow_dissect function
if a socket is found.

Signed-off-by: Tom Herbert <tom@quantonium.net>
---
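To make the flow concrete, here is a small userspace model of the
backend: read the UDP header, look up a socket by destination port, and
call that socket's dissector if it has one. The socket table, the usk_*
names and the reduced callback signature are all assumptions made for
this note; the real lookup also uses addresses and the source port.

```c
#include <stddef.h>
#include <stdint.h>

enum usk_ret { USK_CONTINUE, USK_OUT_BAD, USK_IPPROTO_AGAIN };

struct usk_sock {
	uint16_t port;	/* bound destination port, host order */
	enum usk_ret (*flow_dissect)(size_t *nhoff, uint8_t *ip_proto);
};

/* Example per-socket callback: a FOU-like tunnel carrying GRE (47). */
static enum usk_ret usk_fou_like(size_t *nhoff, uint8_t *ip_proto)
{
	*nhoff += 8;	/* step over the UDP header */
	*ip_proto = 47;
	return USK_IPPROTO_AGAIN;
}

static const struct usk_sock usk_socks[] = {
	{ 5555, usk_fou_like },	/* encapsulation socket */
	{ 53, NULL },		/* ordinary socket, no dissector */
};

static enum usk_ret usk_udp_dissect(const uint8_t *data, size_t len,
				    size_t *nhoff, uint8_t *ip_proto)
{
	size_t n = sizeof(usk_socks) / sizeof(usk_socks[0]);
	uint16_t dport;
	size_t i;

	if (*nhoff > len || len - *nhoff < 8)
		return USK_OUT_BAD;	/* truncated UDP header */

	/* Destination port is bytes 2..3 of the UDP header, big-endian */
	dport = (uint16_t)((data[*nhoff + 2] << 8) | data[*nhoff + 3]);

	for (i = 0; i < n; i++) {
		if (usk_socks[i].port != dport)
			continue;
		if (usk_socks[i].flow_dissect)
			return usk_socks[i].flow_dissect(nhoff, ip_proto);
		break;
	}

	return USK_CONTINUE;
}
```

A packet for a socket without a dissector, or for no socket at all,
leaves the offset and protocol untouched and simply continues.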
 include/linux/udp.h      |  8 ++++++++
 include/net/udp.h        |  8 ++++++++
 include/net/udp_tunnel.h |  8 ++++++++
 net/ipv4/udp_offload.c   | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/udp_tunnel.c    |  1 +
 net/ipv6/udp_offload.c   | 16 ++++++++++++++++
 6 files changed, 89 insertions(+)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index eaea63bc79bb..2e90b189ef6a 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -79,6 +79,14 @@ struct udp_sock {
 	int			(*gro_complete)(struct sock *sk,
 						struct sk_buff *skb,
 						int nhoff);
+	/* Flow dissector function for a UDP socket */
+	enum flow_dissect_ret (*flow_dissect)(struct sock *sk,
+			const struct sk_buff *skb,
+			struct flow_dissector_key_control *key_control,
+			struct flow_dissector *flow_dissector,
+			void *target_container, void *data,
+			__be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+			int *p_hlen, unsigned int flags);
 
 	/* udp_recvmsg try to use this before splicing sk_receive_queue */
 	struct sk_buff_head	reader_queue ____cacheline_aligned_in_smp;
diff --git a/include/net/udp.h b/include/net/udp.h
index c6b1c5d8d3c9..4867f329538c 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -176,6 +176,14 @@ struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff *skb,
 				 struct udphdr *uh, udp_lookup_t lookup);
 int udp_gro_complete(struct sk_buff *skb, int nhoff, udp_lookup_t lookup);
 
+enum flow_dissect_ret udp_flow_dissect(struct sk_buff *skb,
+			udp_lookup_t lookup,
+			struct flow_dissector_key_control *key_control,
+			struct flow_dissector *flow_dissector,
+			void *target_container, void *data,
+			__be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+			int *p_hlen, unsigned int flags);
+
 static inline struct udphdr *udp_gro_udphdr(struct sk_buff *skb)
 {
 	struct udphdr *uh;
diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index 10cce0dd4450..b7102e0f41a9 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -69,6 +69,13 @@ typedef struct sk_buff **(*udp_tunnel_gro_receive_t)(struct sock *sk,
 						     struct sk_buff *skb);
 typedef int (*udp_tunnel_gro_complete_t)(struct sock *sk, struct sk_buff *skb,
 					 int nhoff);
+typedef enum flow_dissect_ret (*udp_tunnel_flow_dissect_t)(struct sock *sk,
+			const struct sk_buff *skb,
+			struct flow_dissector_key_control *key_control,
+			struct flow_dissector *flow_dissector,
+			void *target_container, void *data,
+			__be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+			int *p_hlen, unsigned int flags);
 
 struct udp_tunnel_sock_cfg {
 	void *sk_user_data;     /* user data used by encap_rcv call back */
@@ -78,6 +85,7 @@ struct udp_tunnel_sock_cfg {
 	udp_tunnel_encap_destroy_t encap_destroy;
 	udp_tunnel_gro_receive_t gro_receive;
 	udp_tunnel_gro_complete_t gro_complete;
+	udp_tunnel_flow_dissect_t flow_dissect;
 };
 
 /* Setup the given (UDP) sock to receive UDP encapsulated packets */
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index a744bb515455..fddf923ef433 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -335,11 +335,59 @@ static int udp4_gro_complete(struct sk_buff *skb, int nhoff)
 	return udp_gro_complete(skb, nhoff, udp4_lib_lookup_skb);
 }
 
+enum flow_dissect_ret udp_flow_dissect(struct sk_buff *skb,
+			udp_lookup_t lookup,
+			struct flow_dissector_key_control *key_control,
+			struct flow_dissector *flow_dissector,
+			void *target_container, void *data,
+			__be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+			int *p_hlen, unsigned int flags)
+{
+	enum flow_dissect_ret ret = FLOW_DISSECT_RET_CONTINUE;
+	struct udphdr *uh, _uh;
+	struct sock *sk;
+
+	uh = __skb_header_pointer(skb, *p_nhoff, sizeof(_uh), data,
+				  *p_hlen, &_uh);
+	if (!uh)
+		return FLOW_DISSECT_RET_OUT_BAD;
+
+	rcu_read_lock();
+
+	sk = (*lookup)(skb, uh->source, uh->dest);
+
+	if (sk && udp_sk(sk)->flow_dissect)
+		ret = udp_sk(sk)->flow_dissect(sk, skb, key_control,
+					       flow_dissector, target_container,
+					       data, p_proto, p_ip_proto,
+					       p_nhoff, p_hlen, flags);
+	rcu_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL(udp_flow_dissect);
+
+static enum flow_dissect_ret udp4_flow_dissect(struct sk_buff *skb,
+			struct flow_dissector_key_control *key_control,
+			struct flow_dissector *flow_dissector,
+			void *target_container, void *data,
+			__be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+			int *p_hlen, unsigned int flags)
+{
+	if (!static_key_false(&udp_encap_needed))
+		return FLOW_DISSECT_RET_CONTINUE;
+
+	return udp_flow_dissect(skb, udp4_lib_lookup_skb, key_control,
+				flow_dissector, target_container, data,
+				p_proto, p_ip_proto, p_nhoff, p_hlen, flags);
+}
+
 static const struct net_offload udpv4_offload = {
 	.callbacks = {
 		.gso_segment = udp4_tunnel_segment,
 		.gro_receive  =	udp4_gro_receive,
 		.gro_complete =	udp4_gro_complete,
+		.flow_dissect = udp4_flow_dissect,
 	},
 };
 
diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
index 6539ff15e9a3..a4eec2a044d2 100644
--- a/net/ipv4/udp_tunnel.c
+++ b/net/ipv4/udp_tunnel.c
@@ -71,6 +71,7 @@ void setup_udp_tunnel_sock(struct net *net, struct socket *sock,
 	udp_sk(sk)->encap_destroy = cfg->encap_destroy;
 	udp_sk(sk)->gro_receive = cfg->gro_receive;
 	udp_sk(sk)->gro_complete = cfg->gro_complete;
+	udp_sk(sk)->flow_dissect = cfg->flow_dissect;
 
 	udp_tunnel_encap_enable(sock);
 }
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index 111b026e4f03..45b77f92d77d 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -80,11 +80,27 @@ static int udp6_gro_complete(struct sk_buff *skb, int nhoff)
 	return udp_gro_complete(skb, nhoff, udp6_lib_lookup_skb);
 }
 
+static enum flow_dissect_ret udp6_flow_dissect(struct sk_buff *skb,
+			struct flow_dissector_key_control *key_control,
+			struct flow_dissector *flow_dissector,
+			void *target_container, void *data,
+			__be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+			int *p_hlen, unsigned int flags)
+{
+	if (!static_key_false(&udp_encap_needed))
+		return FLOW_DISSECT_RET_CONTINUE;
+
+	return udp_flow_dissect(skb, udp6_lib_lookup_skb, key_control,
+				flow_dissector, target_container, data,
+				p_proto, p_ip_proto, p_nhoff, p_hlen, flags);
+}
+
 static const struct net_offload udpv6_offload = {
 	.callbacks = {
 		.gso_segment	=	udp6_tunnel_segment,
 		.gro_receive	=	udp6_gro_receive,
 		.gro_complete	=	udp6_gro_complete,
+		.flow_dissect	=	udp6_flow_dissect,
 	},
 };
 
-- 
2.11.0


* [PATCH v4 net-next 7/8] fou: Support flow dissection
From: Tom Herbert @ 2017-09-28 23:52 UTC (permalink / raw)
  To: davem; +Cc: netdev, rohit, Tom Herbert
In-Reply-To: <20170928235230.22158-1-tom@quantonium.net>

Populate offload flow_dissect callback appropriately for fou and gue.

Signed-off-by: Tom Herbert <tom@quantonium.net>
---
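For readers less familiar with GUE, the header handling can be sketched
in userspace C as below. This assumes the layout from the GUE draft as
I understand it (first byte: 2-bit version, 1-bit control, 5-bit hlen
in 32-bit words; second byte: proto/ctype), and the gsk_* names and
return codes are made up for this note.

```c
#include <stddef.h>
#include <stdint.h>

#define GSK_UDP_HLEN	8
#define GSK_GUE_HLEN	4	/* base GUE header, without options */
#define GSK_IPPROTO_IPIP 4
#define GSK_IPPROTO_IPV6 41

/* Returns 1 to re-dispatch on *ip_proto at *nhoff, 0 to carry on
 * without changes, -1 if the header is truncated.
 */
static int gsk_step(const uint8_t *data, size_t len,
		    size_t *nhoff, uint8_t *ip_proto)
{
	size_t off = *nhoff + GSK_UDP_HLEN;
	uint8_t b0;

	if (off > len || len - off < GSK_GUE_HLEN)
		return -1;

	b0 = data[off];

	switch (b0 >> 6) {		/* GUE version */
	case 0:
		if (b0 & 0x20)		/* control message */
			return 0;
		*ip_proto = data[off + 1];
		/* skip base header plus hlen 32-bit words of options */
		*nhoff = off + GSK_GUE_HLEN + ((size_t)(b0 & 0x1f) << 2);
		return 1;
	case 1:
		/* Direct IP encapsulation: the first nibble is the IP
		 * version and the IP header starts right after UDP.
		 */
		switch (b0 >> 4) {
		case 4:
			*ip_proto = GSK_IPPROTO_IPIP;
			break;
		case 6:
			*ip_proto = GSK_IPPROTO_IPV6;
			break;
		default:
			return 0;
		}
		*nhoff = off;
		return 1;
	default:
		return 0;
	}
}
```

Version 1 works because the top two bits of an IPv4 or IPv6 version
nibble (0100 and 0110) are both 01, so the same byte can be read either
as a GUE version or as the start of an IP header.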
 net/ipv4/fou.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 1540db65241a..a831dd49fb28 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -282,6 +282,20 @@ static int fou_gro_complete(struct sock *sk, struct sk_buff *skb,
 	return err;
 }
 
+static enum flow_dissect_ret fou_flow_dissect(struct sock *sk,
+			const struct sk_buff *skb,
+			struct flow_dissector_key_control *key_control,
+			struct flow_dissector *flow_dissector,
+			void *target_container, void *data,
+			__be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+			int *p_hlen, unsigned int flags)
+{
+	*p_ip_proto = fou_from_sock(sk)->protocol;
+	*p_nhoff += sizeof(struct udphdr);
+
+	return FLOW_DISSECT_RET_IPPROTO_AGAIN;
+}
+
 static struct guehdr *gue_gro_remcsum(struct sk_buff *skb, unsigned int off,
 				      struct guehdr *guehdr, void *data,
 				      size_t hdrlen, struct gro_remcsum *grc,
@@ -500,6 +514,53 @@ static int gue_gro_complete(struct sock *sk, struct sk_buff *skb, int nhoff)
 	return err;
 }
 
+static enum flow_dissect_ret gue_flow_dissect(struct sock *sk,
+			const struct sk_buff *skb,
+			struct flow_dissector_key_control *key_control,
+			struct flow_dissector *flow_dissector,
+			void *target_container, void *data,
+			__be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+			int *p_hlen, unsigned int flags)
+{
+	struct guehdr *guehdr, _guehdr;
+
+	guehdr = __skb_header_pointer(skb, *p_nhoff + sizeof(struct udphdr),
+				      sizeof(_guehdr), data, *p_hlen, &_guehdr);
+	if (!guehdr)
+		return FLOW_DISSECT_RET_OUT_BAD;
+
+	switch (guehdr->version) {
+	case 0:
+		if (unlikely(guehdr->control))
+			return FLOW_DISSECT_RET_CONTINUE;
+
+		*p_ip_proto = guehdr->proto_ctype;
+		*p_nhoff += sizeof(struct udphdr) +
+		    sizeof(*guehdr) + (guehdr->hlen << 2);
+
+		break;
+	case 1:
+		switch (((struct iphdr *)guehdr)->version) {
+		case 4:
+			*p_ip_proto = IPPROTO_IPIP;
+			break;
+		case 6:
+			*p_ip_proto = IPPROTO_IPV6;
+			break;
+		default:
+			return FLOW_DISSECT_RET_CONTINUE;
+		}
+
+		*p_nhoff += sizeof(struct udphdr);
+
+		break;
+	default:
+		return FLOW_DISSECT_RET_CONTINUE;
+	}
+
+	return FLOW_DISSECT_RET_IPPROTO_AGAIN;
+}
+
 static int fou_add_to_port_list(struct net *net, struct fou *fou)
 {
 	struct fou_net *fn = net_generic(net, fou_net_id);
@@ -570,12 +631,14 @@ static int fou_create(struct net *net, struct fou_cfg *cfg,
 		tunnel_cfg.encap_rcv = fou_udp_recv;
 		tunnel_cfg.gro_receive = fou_gro_receive;
 		tunnel_cfg.gro_complete = fou_gro_complete;
+		tunnel_cfg.flow_dissect = fou_flow_dissect;
 		fou->protocol = cfg->protocol;
 		break;
 	case FOU_ENCAP_GUE:
 		tunnel_cfg.encap_rcv = gue_udp_recv;
 		tunnel_cfg.gro_receive = gue_gro_receive;
 		tunnel_cfg.gro_complete = gue_gro_complete;
+		tunnel_cfg.flow_dissect = gue_flow_dissect;
 		break;
 	default:
 		err = -EINVAL;
-- 
2.11.0


* [PATCH v4 net-next 8/8] vxlan: support flow dissect
From: Tom Herbert @ 2017-09-28 23:52 UTC (permalink / raw)
  To: davem; +Cc: netdev, rohit, Tom Herbert
In-Reply-To: <20170928235230.22158-1-tom@quantonium.net>

Populate offload flow_dissect callback appropriately for VXLAN and
VXLAN-GPE.

Signed-off-by: Tom Herbert <tom@quantonium.net>
---
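A userspace sketch of the same decision, for illustration only. It
assumes the VXLAN-GPE draft layout as I understand it (version in bits
0x30 of the first byte, P bit 0x04, O bit 0x01, Next Protocol in the
fourth byte with 1 = IPv4, 2 = IPv6, 3 = Ethernet). The vxsk_* names
and return codes are invented, and EtherTypes are kept in host order.

```c
#include <stddef.h>
#include <stdint.h>

#define VXSK_UDP_HLEN	8
#define VXSK_VXLAN_HLEN	8
#define VXSK_ETH_P_IP	0x0800
#define VXSK_ETH_P_IPV6	0x86DD
#define VXSK_ETH_P_TEB	0x6558

#define VXSK_GPE_VER_MASK 0x30
#define VXSK_GPE_NP	0x04	/* Next Protocol field is valid */
#define VXSK_GPE_OAM	0x01

/* Hypothetical helper: GPE Next Protocol to EtherType, 0 if unknown. */
static uint16_t vxsk_np_to_ethertype(uint8_t np)
{
	switch (np) {
	case 1:
		return VXSK_ETH_P_IP;
	case 2:
		return VXSK_ETH_P_IPV6;
	case 3:
		return VXSK_ETH_P_TEB;
	default:
		return 0;
	}
}

/* Returns 1 to re-dispatch on *proto at *nhoff, 0 to carry on without
 * changes, -1 if the VXLAN header is truncated.
 */
static int vxsk_step(const uint8_t *data, size_t len, int gpe_mode,
		     size_t *nhoff, uint16_t *proto)
{
	size_t off = *nhoff + VXSK_UDP_HLEN;
	uint16_t p = VXSK_ETH_P_TEB;	/* plain VXLAN carries Ethernet */

	if (off > len || len - off < VXSK_VXLAN_HLEN)
		return -1;

	if (gpe_mode) {
		uint8_t flags = data[off];

		if ((flags & VXSK_GPE_VER_MASK) || !(flags & VXSK_GPE_NP) ||
		    (flags & VXSK_GPE_OAM))
			return 0;

		p = vxsk_np_to_ethertype(data[off + 3]);
		if (!p)
			return 0;
	}

	*nhoff = off + VXSK_VXLAN_HLEN;
	*proto = p;

	return 1;
}
```

In non-GPE mode the result is always TEB, so the inner frame goes
through the ETH_P_TEB case added earlier in the series.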
 drivers/net/vxlan.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index d7c49cf1d5e9..80227050b2d4 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1327,6 +1327,45 @@ static bool vxlan_ecn_decapsulate(struct vxlan_sock *vs, void *oiph,
 	return err <= 1;
 }
 
+static enum flow_dissect_ret vxlan_flow_dissect(struct sock *sk,
+			const struct sk_buff *skb,
+			struct flow_dissector_key_control *key_control,
+			struct flow_dissector *flow_dissector,
+			void *target_container, void *data,
+			__be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+			int *p_hlen, unsigned int flags)
+{
+	__be16 protocol = htons(ETH_P_TEB);
+	struct vxlanhdr *vhdr, _vhdr;
+	struct vxlan_sock *vs;
+
+	vhdr = __skb_header_pointer(skb, *p_nhoff + sizeof(struct udphdr),
+				    sizeof(_vhdr), data, *p_hlen, &_vhdr);
+	if (!vhdr)
+		return FLOW_DISSECT_RET_OUT_BAD;
+
+	vs = rcu_dereference_sk_user_data(sk);
+	if (!vs)
+		return FLOW_DISSECT_RET_OUT_BAD;
+
+	if (vs->flags & VXLAN_F_GPE) {
+		struct vxlanhdr_gpe *gpe = (struct vxlanhdr_gpe *)vhdr;
+
+		/* Need to have Next Protocol set for interfaces in GPE mode. */
+		if (gpe->version != 0 || !gpe->np_applied || gpe->oam_flag)
+			return FLOW_DISSECT_RET_CONTINUE;
+
+		protocol = tun_p_from_eth_p(gpe->next_protocol);
+		if (!protocol)
+			return FLOW_DISSECT_RET_CONTINUE;
+	}
+
+	*p_nhoff += sizeof(struct udphdr) + sizeof(_vhdr);
+	*p_proto = protocol;
+
+	return FLOW_DISSECT_RET_PROTO_AGAIN;
+}
+
 /* Callback from net/ipv4/udp.c to receive packets */
 static int vxlan_rcv(struct sock *sk, struct sk_buff *skb)
 {
@@ -2846,6 +2885,7 @@ static struct vxlan_sock *vxlan_socket_create(struct net *net, bool ipv6,
 	tunnel_cfg.encap_destroy = NULL;
 	tunnel_cfg.gro_receive = vxlan_gro_receive;
 	tunnel_cfg.gro_complete = vxlan_gro_complete;
+	tunnel_cfg.flow_dissect = vxlan_flow_dissect;
 
 	setup_udp_tunnel_sock(net, sock, &tunnel_cfg);
 
-- 
2.11.0


* Re: [PATCH V2] r8152: add Linksys USB3GIGV1 id
From: Doug Anderson @ 2017-09-28 23:53 UTC (permalink / raw)
  To: Rustad, Mark D
  Cc: Grant Grundler, Oliver Neukum, David S . Miller,
	Greg Kroah-Hartman, Hayes Wang, LKML, linux-usb, netdev
In-Reply-To: <EE09EDA8-B6E9-4BEB-882C-C5435EDA8E41@intel.com>

Hi,

On Thu, Sep 28, 2017 at 3:28 PM, Rustad, Mark D <mark.d.rustad@intel.com> wrote:
>
>> On Sep 27, 2017, at 9:39 AM, Grant Grundler <grundler@chromium.org> wrote:
>>
>> On Wed, Sep 27, 2017 at 12:15 AM, Oliver Neukum <oneukum@suse.com> wrote:
>>> On Tuesday, 26.09.2017, 08:19 -0700, Doug Anderson wrote:
>>>>
>>>> I know that for at least some of the adapters in the CDC Ethernet
>>>> blacklist it was claimed that the CDC Ethernet support in the adapter
>>>> was kinda broken anyway so the blacklist made sense.  ...but for the
>>>> Linksys Gigabit adapter the CDC Ethernet driver seems to work OK, it's
>>>> just not quite as full featured / efficient as the R8152 driver.
>>>>
>>>> Is that not a concern?  I guess you could tell people in this
>>>> situation that they simply need to enable the R8152 driver to get
>>>> continued support for their Ethernet adapter?
>>>
>>> Hi,
>>>
>>> yes, it is a valid concern. An #ifdef will be needed.
>>
>> Good idea - I will post V3 shortly.
>>
>> I'm assuming you mean to add #ifdef CONFIG_USB_RTL8152 around the
>> blacklist entry in cdc_ether driver.
>
> Shouldn't that be an #if IS_ENABLED(...) test, since that seems to be the proper way to check configured drivers?

Yes, I had the same feedback on v3.  See my comments at
<https://patchwork.kernel.org/patch/9974485/>.  Grant has fixed it in
v4.  Please see <https://patchwork.kernel.org/patch/9976657/>.  :)
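For anyone following along, the difference matters when the driver is
built as a module: Kconfig then defines CONFIG_USB_RTL8152_MODULE
rather than CONFIG_USB_RTL8152, so a bare #ifdef on the latter misses
the =m case. The snippet below is a standalone demonstration, with a
simplified copy of the kconfig.h macro trick (the real macros differ in
detail) and two made-up functions that report which branch was
compiled:

```c
/* Simplified stand-ins for the helpers in include/linux/kconfig.h */
#define ARG_PLACEHOLDER_1 0,
#define TAKE_SECOND(ignored, val, ...) val
#define IS_DEFINED_3(arg1_or_junk) TAKE_SECOND(arg1_or_junk 1, 0, 0)
#define IS_DEFINED_2(val) IS_DEFINED_3(ARG_PLACEHOLDER_##val)
#define IS_DEFINED(x) IS_DEFINED_2(x)
#define IS_ENABLED(option) \
	(IS_DEFINED(option) || IS_DEFINED(option##_MODULE))

/* Pretend r8152 was configured as a module (=m). */
#define CONFIG_USB_RTL8152_MODULE 1

/* Returns 1 if the blacklist entry would be compiled in. */
static int blacklist_with_ifdef(void)
{
#ifdef CONFIG_USB_RTL8152
	return 1;
#else
	return 0;	/* taken for =m: the entry is silently dropped */
#endif
}

static int blacklist_with_is_enabled(void)
{
#if IS_ENABLED(CONFIG_USB_RTL8152)
	return 1;	/* taken for both =y and =m */
#else
	return 0;
#endif
}
```

With the module-style define above, blacklist_with_ifdef() returns 0
while blacklist_with_is_enabled() returns 1.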

-Doug


* Re: [PATCH V4] r8152: add Linksys USB3GIGV1 id
From: Doug Anderson @ 2017-09-28 23:57 UTC (permalink / raw)
  To: Grant Grundler
  Cc: Hayes Wang, Oliver Neukum, linux-usb, David S . Miller, LKML,
	netdev
In-Reply-To: <20170928183500.61199-1-grundler@chromium.org>

Grant,

On Thu, Sep 28, 2017 at 11:35 AM, Grant Grundler <grundler@chromium.org> wrote:
> This linksys dongle by default comes up in cdc_ether mode.
> This patch allows r8152 to claim the device:
>    Bus 002 Device 002: ID 13b1:0041 Linksys
>
> Signed-off-by: Grant Grundler <grundler@chromium.org>
> ---
>  drivers/net/usb/cdc_ether.c | 10 ++++++++++
>  drivers/net/usb/r8152.c     |  2 ++
>  2 files changed, 12 insertions(+)

This seems nice to me now.  Thanks for all the fixes!  I'm no expert
in this area, but as far as I know this is ready to go now, so FWIW:

Reviewed-by: Douglas Anderson <dianders@chromium.org>


* linux-next: build failure after merge of the net-next tree
From: Stephen Rothwell @ 2017-09-29  1:36 UTC (permalink / raw)
  To: David Miller, Networking
  Cc: Linux-Next Mailing List, Linux Kernel Mailing List,
	Vivien Didelot, Florian Fainelli

Hi all,

After merging the net-next tree, today's linux-next build (arm
multi_v7_defconfig) failed like this:

net/dsa/slave.c: In function 'dsa_slave_create':
net/dsa/slave.c:1191:18: error: 'struct dsa_slave_priv' has no member named 'phy'
  phy_disconnect(p->phy);
                  ^

Caused by commit

  0115dcd1787d ("net: dsa: use slave device phydev")

Interacting with commit

  e804441cfe0b ("net: dsa: Fix network device registration order")

from the net tree.

I applied the following merge fix patch (which I am not sure about):

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Fri, 29 Sep 2017 11:28:45 +1000
Subject: [PATCH] net: dsa: merge fix patch for removal of phy

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 net/dsa/slave.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 8869954485db..9191c929c6c8 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1188,7 +1188,7 @@ int dsa_slave_create(struct dsa_port *port, const char *name)
 	return 0;
 
 out_phy:
-	phy_disconnect(p->phy);
+	phy_disconnect(slave_dev->phydev);
 	if (of_phy_is_fixed_link(p->dp->dn))
 		of_phy_deregister_fixed_link(p->dp->dn);
 out_free:
-- 
2.14.1

-- 
Cheers,
Stephen Rothwell

^ permalink raw reply related

* Re: linux-next: build failure after merge of the net-next tree
From: Florian Fainelli @ 2017-09-29  2:07 UTC (permalink / raw)
  To: Stephen Rothwell, David Miller, Networking
  Cc: Linux-Next Mailing List, Linux Kernel Mailing List,
	Vivien Didelot
In-Reply-To: <20170929113635.3337c026@canb.auug.org.au>

On 09/28/17 at 18:36, Stephen Rothwell wrote:
> Hi all,
> 
> After merging the net-next tree, today's linux-next build (arm
> multi_v7_defconfig) failed like this:
> 
> net/dsa/slave.c: In function 'dsa_slave_create':
> net/dsa/slave.c:1191:18: error: 'struct dsa_slave_priv' has no member named 'phy'
>   phy_disconnect(p->phy);
>                   ^
> 
> Caused by commit
> 
>   0115dcd1787d ("net: dsa: use slave device phydev")
> 
> Interacting with commit
> 
>   e804441cfe0b ("net: dsa: Fix network device registration order")
> 
> from the net tree.
> 
> I applied the following merge fix patch (which I am not sure about):

Your resolution looks fine to me, thanks Stephen!

> 
> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Fri, 29 Sep 2017 11:28:45 +1000
> Subject: [PATCH] net: dsa: merge fix patch for removal of phy
> 
> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
> ---
>  net/dsa/slave.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> index 8869954485db..9191c929c6c8 100644
> --- a/net/dsa/slave.c
> +++ b/net/dsa/slave.c
> @@ -1188,7 +1188,7 @@ int dsa_slave_create(struct dsa_port *port, const char *name)
>  	return 0;
>  
>  out_phy:
> -	phy_disconnect(p->phy);
> +	phy_disconnect(slave_dev->phydev);
>  	if (of_phy_is_fixed_link(p->dp->dn))
>  		of_phy_deregister_fixed_link(p->dp->dn);
>  out_free:
> 


-- 
Florian

^ permalink raw reply

* Re: [lkp-robot] [mac80211] 31e9170bde: hwsim.sta_dynamic_down_up.fail
From: Xiang Gao @ 2017-09-29  2:21 UTC (permalink / raw)
  To: kernel test robot
  Cc: Herbert Xu, David S. Miller, Johannes Berg, linux-crypto,
	linux-kernel, linux-wireless, netdev, lkp
In-Reply-To: <20170928080614.GZ17200@yexl-desktop>

Thanks, I will look into it.
Xiang Gao


2017-09-28 4:06 GMT-04:00 kernel test robot <xiaolong.ye@intel.com>:
>
> FYI, we noticed the following commit:
>
> commit: 31e9170bdeb6ebe66426337b4e2b9924683a412b ("mac80211: aead api to reduce redundancy")
> url: https://github.com/0day-ci/linux/commits/Xiang-Gao/mac80211-aead-api-to-reduce-redundancy/20170926-053110
> base: https://git.kernel.org/cgit/linux/kernel/git/jberg/mac80211-next.git master
>
> in testcase: hwsim
> with following parameters:
>
>         group: hwsim-10
>
>
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu host -smp 2 -m 2G
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
> 2017-09-27 16:04:27     ./run-tests.py sta_dynamic_down_up
> DEV: wlan0: 02:00:00:00:00:00
> DEV: wlan1: 02:00:00:00:01:00
> DEV: wlan2: 02:00:00:00:02:00
> APDEV: wlan3
> APDEV: wlan4
> START sta_dynamic_down_up 1/1
> Test: Dynamically added wpa_supplicant interface down/up
> Starting AP wlan3
> Create a dynamic wpa_supplicant interface and connect
> Connect STA wlan5 to AP
> dev1->dev2 unicast data delivery failed
> Traceback (most recent call last):
>   File "./run-tests.py", line 453, in main
>     t(dev, apdev)
>   File "/lkp/benchmarks/hwsim/tests/hwsim/test_sta_dynamic.py", line 122, in test_sta_dynamic_down_up
>     hwsim_utils.test_connectivity(wpas, hapd)
>   File "/lkp/benchmarks/hwsim/tests/hwsim/hwsim_utils.py", line 165, in test_connectivity
>     raise Exception(last_err)
> Exception: dev1->dev2 unicast data delivery failed
> FAIL sta_dynamic_down_up 5.397413 2017-09-27 16:04:32.540689
> passed 0 test case(s)
> skipped 0 test case(s)
> failed tests: sta_dynamic_down_up
>
>
>
> To reproduce:
>
>         git clone https://github.com/intel/lkp-tests.git
>         cd lkp-tests
>         bin/lkp qemu -k <bzImage> job-script  # job-script is attached in this email
>
>
>
> Thanks,
> Xiaolong

^ permalink raw reply

* Re: [net-next PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP
From: Alexei Starovoitov @ 2017-09-29  3:21 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, jakub.kicinski, Michael S. Tsirkin, Jason Wang, mchan,
	John Fastabend, peter.waskiewicz.jr, Daniel Borkmann,
	Andy Gospodarek, hannes
In-Reply-To: <150660342793.2808.10838498581615265043.stgit@firesoul>

On Thu, Sep 28, 2017 at 02:57:08PM +0200, Jesper Dangaard Brouer wrote:
> The 'cpumap' is primary used as a backend map for XDP BPF helper
> call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
> 
> This patch implement the main part of the map.  It is not connected to
> the XDP redirect system yet, and no SKB allocation are done yet.
> 
> The main concern in this patch is to ensure the datapath can run
> without any locking.  This adds complexity to the setup and tear-down
> procedure, which assumptions are extra carefully documented in the
> code comments.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/linux/bpf_types.h      |    1 
>  include/uapi/linux/bpf.h       |    1 
>  kernel/bpf/Makefile            |    1 
>  kernel/bpf/cpumap.c            |  547 ++++++++++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c           |    8 +
>  tools/include/uapi/linux/bpf.h |    1 
>  6 files changed, 558 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/bpf/cpumap.c
> 
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 6f1a567667b8..814c1081a4a9 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -41,4 +41,5 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
>  #ifdef CONFIG_STREAM_PARSER
>  BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
>  #endif
> +BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
>  #endif
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index e43491ac4823..f14e15702533 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -111,6 +111,7 @@ enum bpf_map_type {
>  	BPF_MAP_TYPE_HASH_OF_MAPS,
>  	BPF_MAP_TYPE_DEVMAP,
>  	BPF_MAP_TYPE_SOCKMAP,
> +	BPF_MAP_TYPE_CPUMAP,
>  };
>  
>  enum bpf_prog_type {
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 897daa005b23..dba0bd33a43c 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -4,6 +4,7 @@ obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
>  obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
>  ifeq ($(CONFIG_NET),y)
>  obj-$(CONFIG_BPF_SYSCALL) += devmap.o
> +obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
>  ifeq ($(CONFIG_STREAM_PARSER),y)
>  obj-$(CONFIG_BPF_SYSCALL) += sockmap.o
>  endif
> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> new file mode 100644
> index 000000000000..f0948af82e65
> --- /dev/null
> +++ b/kernel/bpf/cpumap.c
> @@ -0,0 +1,547 @@
> +/* bpf/cpumap.c
> + *
> + * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
> + * Released under terms in GPL version 2.  See COPYING.
> + */
> +
> +/* The 'cpumap' is primary used as a backend map for XDP BPF helper
> + * call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
> + *
> + * Unlike devmap which redirect XDP frames out another NIC device,
> + * this map type redirect raw XDP frames to another CPU.  The remote
> + * CPU will do SKB-allocation and call the normal network stack.
> + *
> + * This is a scalability and isolation mechanism, that allow
> + * separating the early driver network XDP layer, from the rest of the
> + * netstack, and assigning dedicated CPUs for this stage.  This
> + * basically allows for 10G wirespeed pre-filtering via bpf.
> + */
> +#include <linux/bpf.h>
> +#include <linux/filter.h>
> +#include <linux/ptr_ring.h>
> +
> +#include <linux/sched.h>
> +#include <linux/workqueue.h>
> +#include <linux/kthread.h>
> +
> +/*
> + * General idea: XDP packets getting XDP redirected to another CPU,
> + * will maximum be stored/queued for one driver ->poll() call.  It is
> + * guaranteed that setting flush bit and flush operation happen on
> + * same CPU.  Thus, cpu_map_flush operation can deduct via this_cpu_ptr()
> + * which queue in bpf_cpu_map_entry contains packets.
> + */
> +
> +#define CPU_MAP_BULK_SIZE 8  /* 8 == one cacheline on 64-bit archs */
> +struct xdp_bulk_queue {
> +	void *q[CPU_MAP_BULK_SIZE];
> +	unsigned int count;
> +};
> +
> +/* Struct for every remote "destination" CPU in map */
> +struct bpf_cpu_map_entry {
> +	u32 cpu;    /* kthread CPU and map index */
> +	int map_id; /* Back reference to map */
> +	u32 qsize;  /* Redundant queue size for map lookup */
> +
> +	/* XDP can run multiple RX-ring queues, need __percpu enqueue store */
> +	struct xdp_bulk_queue __percpu *bulkq;
> +
> +	/* Queue with potential multi-producers, and single-consumer kthread */
> +	struct ptr_ring *queue;
> +	struct task_struct *kthread;
> +	struct work_struct kthread_stop_wq;
> +
> +	atomic_t refcnt; /* Control when this struct can be free'ed */
> +	struct rcu_head rcu;
> +};
> +
> +struct bpf_cpu_map {
> +	struct bpf_map map;
> +	/* Below members specific for map type */
> +	struct bpf_cpu_map_entry **cpu_map;
> +	unsigned long __percpu *flush_needed;
> +};
> +
> +static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
> +			     struct xdp_bulk_queue *bq);
> +
> +static u64 cpu_map_bitmap_size(const union bpf_attr *attr)
> +{
> +	return BITS_TO_LONGS(attr->max_entries) * sizeof(unsigned long);
> +}
> +
> +static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
> +{
> +	struct bpf_cpu_map *cmap;
> +	u64 cost;
> +	int err;
> +
> +	/* check sanity of attributes */
> +	if (attr->max_entries == 0 || attr->key_size != 4 ||
> +	    attr->value_size != 4 || attr->map_flags & ~BPF_F_NUMA_NODE)
> +		return ERR_PTR(-EINVAL);
> +
> +	cmap = kzalloc(sizeof(*cmap), GFP_USER);
> +	if (!cmap)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/* mandatory map attributes */
> +	cmap->map.map_type = attr->map_type;
> +	cmap->map.key_size = attr->key_size;
> +	cmap->map.value_size = attr->value_size;
> +	cmap->map.max_entries = attr->max_entries;
> +	cmap->map.map_flags = attr->map_flags;
> +	cmap->map.numa_node = bpf_map_attr_numa_node(attr);
> +
> +	/* make sure page count doesn't overflow */
> +	cost = (u64) cmap->map.max_entries * sizeof(struct bpf_cpu_map_entry *);
> +	cost += cpu_map_bitmap_size(attr) * num_possible_cpus();
> +	if (cost >= U32_MAX - PAGE_SIZE)
> +		goto free_cmap;
> +	cmap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
> +
> +	/* if map size is larger than memlock limit, reject it early */
> +	err = bpf_map_precharge_memlock(cmap->map.pages);
> +	if (err)
> +		goto free_cmap;
> +
> +	/* A per cpu bitfield with a bit per possible CPU in map  */
> +	cmap->flush_needed = __alloc_percpu(cpu_map_bitmap_size(attr),
> +					    __alignof__(unsigned long));
> +	if (!cmap->flush_needed)
> +		goto free_cmap;
> +
> +	/* Alloc array for possible remote "destination" CPUs */
> +	cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries *
> +					   sizeof(struct bpf_cpu_map_entry *),
> +					   cmap->map.numa_node);
> +	if (!cmap->cpu_map)
> +		goto free_cmap;
> +
> +	return &cmap->map;
> +free_cmap:
> +	free_percpu(cmap->flush_needed);
> +	kfree(cmap);
> +	return ERR_PTR(-ENOMEM);
> +}
> +
> +void __cpu_map_queue_destructor(void *ptr)
> +{
> +	/* For now, just catch this as an error */
> +	if (!ptr)
> +		return;
> +	pr_err("ERROR: %s() cpu_map queue was not empty\n", __func__);
> +	page_frag_free(ptr);
> +}
> +
> +static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
> +{
> +	if (atomic_dec_and_test(&rcpu->refcnt)) {
> +		/* The queue should be empty at this point */
> +		ptr_ring_cleanup(rcpu->queue, __cpu_map_queue_destructor);
> +		kfree(rcpu->queue);
> +		kfree(rcpu);
> +	}
> +}
> +
> +static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
> +{
> +	atomic_inc(&rcpu->refcnt);
> +}
> +
> +/* called from workqueue, to workaround syscall using preempt_disable */
> +static void cpu_map_kthread_stop(struct work_struct *work)
> +{
> +	struct bpf_cpu_map_entry *rcpu;
> +
> +	rcpu = container_of(work, struct bpf_cpu_map_entry, kthread_stop_wq);
> +	synchronize_rcu(); /* wait for flush in __cpu_map_entry_free() */
> +	kthread_stop(rcpu->kthread); /* calls put_cpu_map_entry */
> +}
> +
> +static int cpu_map_kthread_run(void *data)
> +{
> +	struct bpf_cpu_map_entry *rcpu = data;
> +
> +	set_current_state(TASK_INTERRUPTIBLE);
> +	while (!kthread_should_stop()) {
> +		struct xdp_pkt *xdp_pkt;
> +
> +		schedule();
> +		/* Do work */
> +		while ((xdp_pkt = ptr_ring_consume(rcpu->queue))) {
> +			/* For now just "refcnt-free" */
> +			page_frag_free(xdp_pkt);
> +		}
> +		__set_current_state(TASK_INTERRUPTIBLE);
> +	}
> +	put_cpu_map_entry(rcpu);
> +
> +	__set_current_state(TASK_RUNNING);
> +	return 0;
> +}
> +
> +struct bpf_cpu_map_entry *__cpu_map_entry_alloc(u32 qsize, u32 cpu, int map_id)
> +{
> +	gfp_t gfp = GFP_ATOMIC|__GFP_NOWARN;
> +	struct bpf_cpu_map_entry *rcpu;
> +	int numa, err;
> +
> +	/* Have map->numa_node, but choose node of redirect target CPU */
> +	numa = cpu_to_node(cpu);
> +
> +	rcpu = kzalloc_node(sizeof(*rcpu), gfp, numa);
> +	if (!rcpu)
> +		return NULL;
> +
> +	/* Alloc percpu bulkq */
> +	rcpu->bulkq = __alloc_percpu_gfp(sizeof(*rcpu->bulkq),
> +					 sizeof(void *), gfp);
> +	if (!rcpu->bulkq)
> +		goto fail;
> +
> +	/* Alloc queue */
> +	rcpu->queue = kzalloc_node(sizeof(*rcpu->queue), gfp, numa);
> +	if (!rcpu->queue)
> +		goto fail;
> +
> +	err = ptr_ring_init(rcpu->queue, qsize, gfp);
> +	if (err)
> +		goto fail;
> +	rcpu->qsize = qsize;
> +
> +	/* Setup kthread */
> +	rcpu->kthread = kthread_create_on_node(cpu_map_kthread_run, rcpu, numa,
> +					       "cpumap/%d/map:%d", cpu, map_id);
> +	if (IS_ERR(rcpu->kthread))
> +		goto fail;
> +
> +	/* Make sure kthread runs on a single CPU */
> +	kthread_bind(rcpu->kthread, cpu);

Is there a check that max_entries <= num_possible_cpus()? I couldn't find it.
Otherwise, won't it end up binding the kthread to an impossible CPU?
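
To make the question concrete, here is a minimal user-space sketch of the
kind of bound check being asked about. It is not taken from the patch: the
function name cpu_map_check_max_entries() and the nr_possible_cpus parameter
(standing in for the kernel's num_possible_cpus()) are hypothetical, and
which bound is the right one (num_possible_cpus(), nr_cpu_ids or NR_CPUS) is
left open here.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of a max_entries sanity check for a map whose index
 * is a CPU number. nr_possible_cpus is an assumption standing in for
 * num_possible_cpus(); it is passed in so the sketch runs in user space.
 */
static int cpu_map_check_max_entries(uint32_t max_entries,
				     uint32_t nr_possible_cpus)
{
	assert(nr_possible_cpus > 0);

	/* Same rejection the patch already applies to an empty map */
	if (max_entries == 0)
		return -EINVAL;

	/* An index at or above the CPU count could never be bound to a CPU */
	if (max_entries > nr_possible_cpus)
		return -E2BIG;

	return 0;
}
```

In this sketch the check would sit next to the other attribute sanity checks
in cpu_map_alloc(), so an oversized map is rejected before any kthread is
created or bound.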

> +	wake_up_process(rcpu->kthread);

In general the whole thing looks like the 'threaded NAPI' that Hannes was
proposing some time back. I liked it back then and I like it now.
I don't remember what the objections were back then.
Something scheduler related?
Adding Hannes.

I'm still curious about the questions I asked in the other thread
on what's causing it to be so much better than RPS.

^ permalink raw reply

* Re: [PATCH v4 2/2] ip_tunnel: add mpls over gre encapsulation
From: Tom Herbert @ 2017-09-29  4:11 UTC (permalink / raw)
  To: Amine Kherbouche; +Cc: Linux Kernel Network Developers, xeb, roopa, equinox
In-Reply-To: <2e611d0f6e0c39ff54bfe464cdf9cf6eeb7843e1.1506590878.git.amine.kherbouche@6wind.com>

On Thu, Sep 28, 2017 at 2:34 AM, Amine Kherbouche
<amine.kherbouche@6wind.com> wrote:
> This commit introduces the MPLSoGRE support (RFC 4023), using ip tunnel
> API.
>
> Encap:
>   - Add a new iptunnel type mpls.
>   - Share tx path: gre type mpls loaded from skb->protocol.
>
> Decap:
>   - pull gre hdr and call mpls_forward().
>
> Signed-off-by: Amine Kherbouche <amine.kherbouche@6wind.com>
> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
> ---
>  include/net/gre.h              |  1 +
>  include/uapi/linux/if_tunnel.h |  1 +
>  net/ipv4/gre_demux.c           | 27 +++++++++++++++++++++++++++
>  net/ipv4/ip_gre.c              |  3 +++
>  net/ipv6/ip6_gre.c             |  3 +++
>  net/mpls/af_mpls.c             | 36 ++++++++++++++++++++++++++++++++++++
>  6 files changed, 71 insertions(+)
>
> diff --git a/include/net/gre.h b/include/net/gre.h
> index d25d836..aa3c4d3 100644
> --- a/include/net/gre.h
> +++ b/include/net/gre.h
> @@ -35,6 +35,7 @@ struct net_device *gretap_fb_dev_create(struct net *net, const char *name,
>                                        u8 name_assign_type);
>  int gre_parse_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,
>                      bool *csum_err, __be16 proto, int nhs);
> +int mpls_gre_rcv(struct sk_buff *skb, int gre_hdr_len);
>
>  static inline int gre_calc_hlen(__be16 o_flags)
>  {
> diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
> index 2e52088..a2f48c0 100644
> --- a/include/uapi/linux/if_tunnel.h
> +++ b/include/uapi/linux/if_tunnel.h
> @@ -84,6 +84,7 @@ enum tunnel_encap_types {
>         TUNNEL_ENCAP_NONE,
>         TUNNEL_ENCAP_FOU,
>         TUNNEL_ENCAP_GUE,
> +       TUNNEL_ENCAP_MPLS,
>  };
>
>  #define TUNNEL_ENCAP_FLAG_CSUM         (1<<0)
> diff --git a/net/ipv4/gre_demux.c b/net/ipv4/gre_demux.c
> index b798862..40484a3 100644
> --- a/net/ipv4/gre_demux.c
> +++ b/net/ipv4/gre_demux.c
> @@ -23,6 +23,9 @@
>  #include <linux/netdevice.h>
>  #include <linux/if_tunnel.h>
>  #include <linux/spinlock.h>
> +#if IS_ENABLED(CONFIG_MPLS)
> +#include <linux/mpls.h>
> +#endif
>  #include <net/protocol.h>
>  #include <net/gre.h>
>
> @@ -122,6 +125,30 @@ int gre_parse_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,
>  }
>  EXPORT_SYMBOL(gre_parse_header);
>
> +#if IS_ENABLED(CONFIG_MPLS)
> +int mpls_gre_rcv(struct sk_buff *skb, int gre_hdr_len)
> +{
> +       if (unlikely(!pskb_may_pull(skb, gre_hdr_len)))
> +               goto drop;
> +
> +       /* Pop GRE hdr and reset the skb */
> +       skb_pull(skb, gre_hdr_len);
> +       skb_reset_network_header(skb);
> +

I don't see why MPLS/GRE needs to be a special case in gre_rcv. Can't
we just follow the normal processing path, which calls the proto ops
handler for the protocol in the GRE header? Also, if protocol-specific
code is added to the rcv function, that most likely means we need to
update the related offloads as well (granted, MPLS doesn't support
GRO, but it looks like it supports GSO). Additionally, we'd need to
consider whether the flow dissector needs a similar special case (I will
point out that my recently posted patches there eliminated TEB as the
one special case in GRE dissection).
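
As a rough illustration of "call the handler registered for the protocol"
as opposed to branching on one protocol, here is a self-contained user-space
sketch. It is not the kernel's GRE code: proto_register(), tunnel_rcv(),
mpls_rcv_stub() and the RX_* values are hypothetical names for this example,
and the protocol numbers are compared in host byte order for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_P_IP	0x0800	/* IPv4 ethertype */
#define ETH_P_MPLS_UC	0x8847	/* MPLS unicast ethertype */

#define RX_SUCCESS	0	/* hypothetical return codes */
#define RX_DROP		1

typedef int (*proto_rcv_t)(const uint8_t *payload, size_t len);

struct proto_handler {
	uint16_t proto;
	proto_rcv_t rcv;
};

#define MAX_HANDLERS 8
static struct proto_handler handlers[MAX_HANDLERS];
static size_t nr_handlers;

/* Register a receive handler for one inner protocol */
static int proto_register(uint16_t proto, proto_rcv_t rcv)
{
	assert(rcv != NULL);

	if (nr_handlers == MAX_HANDLERS)
		return -1;
	handlers[nr_handlers].proto = proto;
	handlers[nr_handlers].rcv = rcv;
	nr_handlers++;
	return 0;
}

/* Generic receive path: skip the tunnel header, then dispatch on the
 * inner protocol. There is no per-protocol branch in this function.
 */
static int tunnel_rcv(uint16_t proto, const uint8_t *pkt, size_t len,
		      size_t hdr_len)
{
	size_t i;

	if (len < hdr_len)
		return RX_DROP;

	for (i = 0; i < nr_handlers; i++) {
		if (handlers[i].proto == proto)
			return handlers[i].rcv(pkt + hdr_len, len - hdr_len);
	}
	return RX_DROP;	/* no handler registered for this protocol */
}

/* Stand-in for an MPLS receive function: needs one 4-byte label entry */
static int mpls_rcv_stub(const uint8_t *payload, size_t len)
{
	(void)payload;
	return len >= 4 ? RX_SUCCESS : RX_DROP;
}
```

With this shape, adding MPLS means one proto_register(ETH_P_MPLS_UC,
mpls_rcv_stub) call from the MPLS side, and the tunnel receive function
itself stays unchanged.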

Thanks,
Tom

> +       return mpls_forward(skb, skb->dev, NULL, NULL);
> +drop:
> +       kfree_skb(skb);
> +       return NET_RX_DROP;
> +}
> +#else
> +int mpls_gre_rcv(struct sk_buff *skb, int gre_hdr_len)
> +{
> +       kfree_skb(skb);
> +       return NET_RX_DROP;
> +}
> +#endif
> +EXPORT_SYMBOL(mpls_gre_rcv);
> +
>  static int gre_rcv(struct sk_buff *skb)
>  {
>         const struct gre_protocol *proto;
> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> index 9cee986..7a50e4f 100644
> --- a/net/ipv4/ip_gre.c
> +++ b/net/ipv4/ip_gre.c
> @@ -412,6 +412,9 @@ static int gre_rcv(struct sk_buff *skb)
>                         return 0;
>         }
>
> +       if (unlikely(tpi.proto == htons(ETH_P_MPLS_UC)))
> +               return mpls_gre_rcv(skb, hdr_len);
> +
>         if (ipgre_rcv(skb, &tpi, hdr_len) == PACKET_RCVD)
>                 return 0;
>
> diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
> index c82d41e..440efb1 100644
> --- a/net/ipv6/ip6_gre.c
> +++ b/net/ipv6/ip6_gre.c
> @@ -476,6 +476,9 @@ static int gre_rcv(struct sk_buff *skb)
>         if (hdr_len < 0)
>                 goto drop;
>
> +       if (unlikely(tpi.proto == htons(ETH_P_MPLS_UC)))
> +               return mpls_gre_rcv(skb, hdr_len);
> +
>         if (iptunnel_pull_header(skb, hdr_len, tpi.proto, false))
>                 goto drop;
>
> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
> index 36ea2ad..4274243 100644
> --- a/net/mpls/af_mpls.c
> +++ b/net/mpls/af_mpls.c
> @@ -16,6 +16,7 @@
>  #include <net/arp.h>
>  #include <net/ip_fib.h>
>  #include <net/netevent.h>
> +#include <net/ip_tunnels.h>
>  #include <net/netns/generic.h>
>  #if IS_ENABLED(CONFIG_IPV6)
>  #include <net/ipv6.h>
> @@ -39,6 +40,36 @@ static int one = 1;
>  static int label_limit = (1 << 20) - 1;
>  static int ttl_max = 255;
>
> +#if IS_ENABLED(CONFIG_NET_IP_TUNNEL)
> +size_t ipgre_mpls_encap_hlen(struct ip_tunnel_encap *e)
> +{
> +       return sizeof(struct mpls_shim_hdr);
> +}
> +
> +static const struct ip_tunnel_encap_ops mpls_iptun_ops = {
> +       .encap_hlen     = ipgre_mpls_encap_hlen,
> +};
> +
> +static int ipgre_tunnel_encap_add_mpls_ops(void)
> +{
> +       return ip_tunnel_encap_add_ops(&mpls_iptun_ops, TUNNEL_ENCAP_MPLS);
> +}
> +
> +static void ipgre_tunnel_encap_del_mpls_ops(void)
> +{
> +       ip_tunnel_encap_del_ops(&mpls_iptun_ops, TUNNEL_ENCAP_MPLS);
> +}
> +#else
> +static int ipgre_tunnel_encap_add_mpls_ops(void)
> +{
> +       return 0;
> +}
> +
> +static void ipgre_tunnel_encap_del_mpls_ops(void)
> +{
> +}
> +#endif
> +
>  static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
>                        struct nlmsghdr *nlh, struct net *net, u32 portid,
>                        unsigned int nlm_flags);
> @@ -2486,6 +2517,10 @@ static int __init mpls_init(void)
>                       0);
>         rtnl_register(PF_MPLS, RTM_GETNETCONF, mpls_netconf_get_devconf,
>                       mpls_netconf_dump_devconf, 0);
> +       err = ipgre_tunnel_encap_add_mpls_ops();
> +       if (err)
> +               pr_err("Can't add mpls over gre tunnel ops\n");
> +
>         err = 0;
>  out:
>         return err;
> @@ -2503,6 +2538,7 @@ static void __exit mpls_exit(void)
>         dev_remove_pack(&mpls_packet_type);
>         unregister_netdevice_notifier(&mpls_dev_notifier);
>         unregister_pernet_subsys(&mpls_net_ops);
> +       ipgre_tunnel_encap_del_mpls_ops();
>  }
>  module_exit(mpls_exit);
>
> --
> 2.1.4
>

^ permalink raw reply

* Re: [PATCH net-next] tcp: fix under-evaluated ssthresh in TCP Vegas
From: David Miller @ 2017-09-29  5:07 UTC (permalink / raw)
  To: tranviethoang.vn; +Cc: netdev, hoang.tran, kuznet, yoshfuji, linux-kernel
In-Reply-To: <1506529940-2143-1-git-send-email-hoang.tran@uclouvain.be>

From: Hoang Tran <tranviethoang.vn@gmail.com>
Date: Wed, 27 Sep 2017 18:30:58 +0200

> With the commit 76174004a0f19785 (tcp: do not slow start when cwnd equals
> ssthresh), the comparison to the reduced cwnd in tcp_vegas_ssthresh() would
> under-evaluate the ssthresh.
> 
> Signed-off-by: Hoang Tran <hoang.tran@uclouvain.be>

Applied, thank you.

^ permalink raw reply

* Re: [patch net-next 1/7] skbuff: Add the offload_mr_fwd_mark field
From: Jiri Pirko @ 2017-09-29  6:05 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, davem, yotamg, idosch, mlxsw, nikolay, dsa, edumazet,
	willemb, johannes.berg, dcaratti, pabeni, daniel, f.fainelli, fw,
	gfree.wind
In-Reply-To: <20170928174903.GE14940@lunn.ch>

Thu, Sep 28, 2017 at 07:49:03PM CEST, andrew@lunn.ch wrote:
>On Thu, Sep 28, 2017 at 07:34:09PM +0200, Jiri Pirko wrote:
>> From: Yotam Gigi <yotamg@mellanox.com>
>> 
>> Similarly to the offload_fwd_mark field, the offload_mr_fwd_mark field is
>> used to allow partial offloading of MFC multicast routes.
>
>> The reason why the already existing "offload_fwd_mark" bit cannot be used
>> is that a switchdev driver would want to make the distinction between a
>> packet that has already gone through L2 forwarding but did not go through
>> multicast forwarding, and a packet that has already gone through both L2
>> and multicast forwarding.
>
>Hi Jiri
>
>So we are talking about l2 vs l3. So why not call this
>offload_l3_fwd_mark?
>
>Is there anything really specific to multicast here?

Currently it is. I'm not sure whether it is going to be used for anything
else later on. If it is, it could be renamed very easily.


>
>   Thanks
>      Andrew

^ permalink raw reply

* Re: [RFC PATCH v3 7/7] i40e: Enable cloud filters via tc-flower
From: Jiri Pirko @ 2017-09-29  6:20 UTC (permalink / raw)
  To: Nambiar, Amritha
  Cc: intel-wired-lan, jeffrey.t.kirsher, alexander.h.duyck, netdev,
	mlxsw, alexander.duyck@gmail.com, Jamal Hadi Salim, Cong Wang
In-Reply-To: <dd18a4bd-f2fc-002b-2ef9-01de9a5a4162@intel.com>

Thu, Sep 28, 2017 at 09:22:15PM CEST, amritha.nambiar@intel.com wrote:
>On 9/14/2017 1:00 AM, Nambiar, Amritha wrote:
>> On 9/13/2017 6:26 AM, Jiri Pirko wrote:
>>> Wed, Sep 13, 2017 at 11:59:50AM CEST, amritha.nambiar@intel.com wrote:
>>>> This patch enables tc-flower based hardware offloads. tc flower
>>>> filter provided by the kernel is configured as driver specific
>>>> cloud filter. The patch implements functions and admin queue
>>>> commands needed to support cloud filters in the driver and
>>>> adds cloud filters to configure these tc-flower filters.
>>>>
>>>> The only action supported is to redirect packets to a traffic class
>>>> on the same device.
>>>
>>> So basically you are not doing redirect, you are just setting tclass for
>>> matched packets, right? Why you use mirred for this? I think that
>>> you might consider extending g_act for that:
>>>
>>> # tc filter add dev eth0 protocol ip ingress \
>>>   prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw \
>>>   action tclass 0
>>>
>> Yes, this doesn't work like a typical egress redirect, but is aimed at
>> forwarding the matched packets to a different queue-group/traffic class
>> on the same device, so some sort-of ingress redirect in the hardware. I
>> possibly may not need the mirred-redirect as you say, I'll look into the
>> g_act way of doing this with a new gact tc action.
>> 
>
>I was looking at introducing a new gact tclass action to TC. In the HW
>offload path, this sets a traffic class value for certain matched
>packets so they will be processed in a queue belonging to the traffic class.
>
># tc filter add dev eth0 protocol ip parent ffff:\
>  prio 2 flower dst_ip 192.168.3.5/32\
>  ip_proto udp dst_port 25 skip_sw\
>  action tclass 2
>
>But, I'm having trouble defining what this action means in the kernel
>datapath. For ingress, this action could just take the default path and
>do nothing and only have meaning in the HW offloaded path. For egress,

Sounds ok.


>certain qdiscs like 'multiq' and 'prio' could use this 'tclass' value
>for band selection, while the 'mqprio' qdisc selects the traffic class
>based on the skb priority in netdev_pick_tx(), so what would this action
>mean for the 'mqprio' qdisc?

I don't see why this action would have any special meaning for specific
qdiscs. The qdiscs already have mechanisms for band mapping. I don't see
why that should be mixed up with the tclass action.

Also, you can use the tclass action on qdisc clsact egress to do band
mapping. That would be symmetrical with ingress.


>
>It looks like the 'prio' qdisc uses band selection based on the
>'classid', so I was thinking of using the 'classid' through the cls
>flower filter and offload it to HW for the traffic class index, this way
>we would have the same behavior in HW offload and SW fallback and there
>would be no need for a separate tc action.
>
>In HW:
># tc filter add dev eth0 protocol ip parent ffff:\
>  prio 2 flower dst_ip 192.168.3.5/32\
>  ip_proto udp dst_port 25 skip_sw classid 1:2\
>
>filter pref 2 flower chain 0
>filter pref 2 flower chain 0 handle 0x1 classid 1:2
>  eth_type ipv4
>  ip_proto udp
>  dst_ip 192.168.3.5
>  dst_port 25
>  skip_sw
>  in_hw
>
>This will be used to route packets to traffic class 2.
>
>In SW:
># tc filter add dev eth0 protocol ip parent ffff:\
>  prio 2 flower dst_ip 192.168.3.5/32\
>  ip_proto udp dst_port 25 skip_hw classid 1:2
>
>filter pref 2 flower chain 0
>filter pref 2 flower chain 0 handle 0x1 classid 1:2
>  eth_type ipv4
>  ip_proto udp
>  dst_ip 192.168.3.5
>  dst_port 25
>  skip_hw
>  not_in_hw
>
>>>
>>>>
>>>> # tc qdisc add dev eth0 ingress
>>>> # ethtool -K eth0 hw-tc-offload on
>>>>
>>>> # tc filter add dev eth0 protocol ip parent ffff:\
>>>>  prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw\
>>>>  action mirred ingress redirect dev eth0 tclass 0
>>>>
>>>> # tc filter add dev eth0 protocol ip parent ffff:\
>>>>  prio 2 flower dst_ip 192.168.3.5/32\
>>>>  ip_proto udp dst_port 25 skip_sw\
>>>>  action mirred ingress redirect dev eth0 tclass 1
>>>>
>>>> # tc filter add dev eth0 protocol ipv6 parent ffff:\
>>>>  prio 3 flower dst_ip fe8::200:1\
>>>>  ip_proto udp dst_port 66 skip_sw\
>>>>  action mirred ingress redirect dev eth0 tclass 1
>>>>
>>>> Delete tc flower filter:
>>>> Example:
>>>>
>>>> # tc filter del dev eth0 parent ffff: prio 3 handle 0x1 flower
>>>> # tc filter del dev eth0 parent ffff:
>>>>
>>>> Flow Director Sideband is disabled while configuring cloud filters
>>>> via tc-flower and until any cloud filter exists.
>>>>
>>>> Unsupported matches when cloud filters are added using enhanced
>>>> big buffer cloud filter mode of underlying switch include:
>>>> 1. source port and source IP
>>>> 2. Combined MAC address and IP fields.
>>>> 3. Not specifying L4 port
>>>>
>>>> These filter matches can however be used to redirect traffic to
>>>> the main VSI (tc 0) which does not require the enhanced big buffer
>>>> cloud filter support.
>>>>
>>>> v3: Cleaned up some lengthy function names. Changed ipv6 address to
>>>> __be32 array instead of u8 array. Used macro for IP version. Minor
>>>> formatting changes.
>>>> v2:
>>>> 1. Moved I40E_SWITCH_MODE_MASK definition to i40e_type.h
>>>> 2. Moved dev_info for add/deleting cloud filters in else condition
>>>> 3. Fixed some format specifier in dev_err logs
>>>> 4. Refactored i40e_get_capabilities to take an additional
>>>>   list_type parameter and use it to query device and function
>>>>   level capabilities.
>>>> 5. Fixed parsing tc redirect action to check for the is_tcf_mirred_tc()
>>>>   to verify if redirect to a traffic class is supported.
>>>> 6. Added comments for Geneve fix in cloud filter big buffer AQ
>>>>   function definitions.
>>>> 7. Cleaned up setup_tc interface to rebase and work with Jiri's
>>>>   updates, separate function to process tc cls flower offloads.
>>>> 8. Changes to make Flow Director Sideband and Cloud filters mutually
>>>>   exclusive.
>>>>
>>>> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
>>>> Signed-off-by: Kiran Patil <kiran.patil@intel.com>
>>>> Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
>>>> Signed-off-by: Jingjing Wu <jingjing.wu@intel.com>
>>>> ---
>>>> drivers/net/ethernet/intel/i40e/i40e.h             |   49 +
>>>> drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h  |    3 
>>>> drivers/net/ethernet/intel/i40e/i40e_common.c      |  189 ++++
>>>> drivers/net/ethernet/intel/i40e/i40e_main.c        |  971 +++++++++++++++++++-
>>>> drivers/net/ethernet/intel/i40e/i40e_prototype.h   |   16 
>>>> drivers/net/ethernet/intel/i40e/i40e_type.h        |    1 
>>>> .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h    |    3 
>>>> 7 files changed, 1202 insertions(+), 30 deletions(-)
>>>>
>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
>>>> index 6018fb6..b110519 100644
>>>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>>>> @@ -55,6 +55,8 @@
>>>> #include <linux/net_tstamp.h>
>>>> #include <linux/ptp_clock_kernel.h>
>>>> #include <net/pkt_cls.h>
>>>> +#include <net/tc_act/tc_gact.h>
>>>> +#include <net/tc_act/tc_mirred.h>
>>>> #include "i40e_type.h"
>>>> #include "i40e_prototype.h"
>>>> #include "i40e_client.h"
>>>> @@ -252,9 +254,52 @@ struct i40e_fdir_filter {
>>>> 	u32 fd_id;
>>>> };
>>>>
>>>> +#define IPV4_VERSION 4
>>>> +#define IPV6_VERSION 6
>>>> +
>>>> +#define I40E_CLOUD_FIELD_OMAC	0x01
>>>> +#define I40E_CLOUD_FIELD_IMAC	0x02
>>>> +#define I40E_CLOUD_FIELD_IVLAN	0x04
>>>> +#define I40E_CLOUD_FIELD_TEN_ID	0x08
>>>> +#define I40E_CLOUD_FIELD_IIP	0x10
>>>> +
>>>> +#define I40E_CLOUD_FILTER_FLAGS_OMAC	I40E_CLOUD_FIELD_OMAC
>>>> +#define I40E_CLOUD_FILTER_FLAGS_IMAC	I40E_CLOUD_FIELD_IMAC
>>>> +#define I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN	(I40E_CLOUD_FIELD_IMAC | \
>>>> +						 I40E_CLOUD_FIELD_IVLAN)
>>>> +#define I40E_CLOUD_FILTER_FLAGS_IMAC_TEN_ID	(I40E_CLOUD_FIELD_IMAC | \
>>>> +						 I40E_CLOUD_FIELD_TEN_ID)
>>>> +#define I40E_CLOUD_FILTER_FLAGS_OMAC_TEN_ID_IMAC (I40E_CLOUD_FIELD_OMAC | \
>>>> +						  I40E_CLOUD_FIELD_IMAC | \
>>>> +						  I40E_CLOUD_FIELD_TEN_ID)
>>>> +#define I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN_TEN_ID (I40E_CLOUD_FIELD_IMAC | \
>>>> +						   I40E_CLOUD_FIELD_IVLAN | \
>>>> +						   I40E_CLOUD_FIELD_TEN_ID)
>>>> +#define I40E_CLOUD_FILTER_FLAGS_IIP	I40E_CLOUD_FIELD_IIP
>>>> +
>>>> struct i40e_cloud_filter {
>>>> 	struct hlist_node cloud_node;
>>>> 	unsigned long cookie;
>>>> +	/* cloud filter input set follows */
>>>> +	u8 dst_mac[ETH_ALEN];
>>>> +	u8 src_mac[ETH_ALEN];
>>>> +	__be16 vlan_id;
>>>> +	__be32 dst_ip;
>>>> +	__be32 src_ip;
>>>> +	__be32 dst_ipv6[4];
>>>> +	__be32 src_ipv6[4];
>>>> +	__be16 dst_port;
>>>> +	__be16 src_port;
>>>> +	u32 ip_version;
>>>> +	u8 ip_proto;	/* IPPROTO value */
>>>> +	/* L4 port type: src or destination port */
>>>> +#define I40E_CLOUD_FILTER_PORT_SRC	0x01
>>>> +#define I40E_CLOUD_FILTER_PORT_DEST	0x02
>>>> +	u8 port_type;
>>>> +	u32 tenant_id;
>>>> +	u8 flags;
>>>> +#define I40E_CLOUD_TNL_TYPE_NONE	0xff
>>>> +	u8 tunnel_type;
>>>> 	u16 seid;	/* filter control */
>>>> };
>>>>
>>>> @@ -491,6 +536,8 @@ struct i40e_pf {
>>>> #define I40E_FLAG_LINK_DOWN_ON_CLOSE_ENABLED	BIT(27)
>>>> #define I40E_FLAG_SOURCE_PRUNING_DISABLED	BIT(28)
>>>> #define I40E_FLAG_TC_MQPRIO			BIT(29)
>>>> +#define I40E_FLAG_FD_SB_INACTIVE		BIT(30)
>>>> +#define I40E_FLAG_FD_SB_TO_CLOUD_FILTER		BIT(31)
>>>>
>>>> 	struct i40e_client_instance *cinst;
>>>> 	bool stat_offsets_loaded;
>>>> @@ -573,6 +620,8 @@ struct i40e_pf {
>>>> 	u16 phy_led_val;
>>>>
>>>> 	u16 override_q_count;
>>>> +	u16 last_sw_conf_flags;
>>>> +	u16 last_sw_conf_valid_flags;
>>>> };
>>>>
>>>> /**
>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
>>>> index 2e567c2..feb3d42 100644
>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
>>>> @@ -1392,6 +1392,9 @@ struct i40e_aqc_cloud_filters_element_data {
>>>> 		struct {
>>>> 			u8 data[16];
>>>> 		} v6;
>>>> +		struct {
>>>> +			__le16 data[8];
>>>> +		} raw_v6;
>>>> 	} ipaddr;
>>>> 	__le16	flags;
>>>> #define I40E_AQC_ADD_CLOUD_FILTER_SHIFT			0
>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c b/drivers/net/ethernet/intel/i40e/i40e_common.c
>>>> index 9567702..d9c9665 100644
>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_common.c
>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
>>>> @@ -5434,5 +5434,194 @@ i40e_add_pinfo_to_list(struct i40e_hw *hw,
>>>>
>>>> 	status = i40e_aq_write_ppp(hw, (void *)sec, sec->data_end,
>>>> 				   track_id, &offset, &info, NULL);
>>>> +
>>>> +	return status;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_aq_add_cloud_filters
>>>> + * @hw: pointer to the hardware structure
>>>> + * @seid: VSI seid to add cloud filters to
>>>> + * @filters: Buffer which contains the filters to be added
>>>> + * @filter_count: number of filters contained in the buffer
>>>> + *
>>>> + * Set the cloud filters for a given VSI.  The contents of the
>>>> + * i40e_aqc_cloud_filters_element_data are filled in by the caller
>>>> + * of the function.
>>>> + *
>>>> + **/
>>>> +enum i40e_status_code
>>>> +i40e_aq_add_cloud_filters(struct i40e_hw *hw, u16 seid,
>>>> +			  struct i40e_aqc_cloud_filters_element_data *filters,
>>>> +			  u8 filter_count)
>>>> +{
>>>> +	struct i40e_aq_desc desc;
>>>> +	struct i40e_aqc_add_remove_cloud_filters *cmd =
>>>> +	(struct i40e_aqc_add_remove_cloud_filters *)&desc.params.raw;
>>>> +	enum i40e_status_code status;
>>>> +	u16 buff_len;
>>>> +
>>>> +	i40e_fill_default_direct_cmd_desc(&desc,
>>>> +					  i40e_aqc_opc_add_cloud_filters);
>>>> +
>>>> +	buff_len = filter_count * sizeof(*filters);
>>>> +	desc.datalen = cpu_to_le16(buff_len);
>>>> +	desc.flags |= cpu_to_le16((u16)(I40E_AQ_FLAG_BUF | I40E_AQ_FLAG_RD));
>>>> +	cmd->num_filters = filter_count;
>>>> +	cmd->seid = cpu_to_le16(seid);
>>>> +
>>>> +	status = i40e_asq_send_command(hw, &desc, filters, buff_len, NULL);
>>>> +
>>>> +	return status;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_aq_add_cloud_filters_bb
>>>> + * @hw: pointer to the hardware structure
>>>> + * @seid: VSI seid to add cloud filters to
>>>> + * @filters: Buffer which contains the filters in big buffer to be added
>>>> + * @filter_count: number of filters contained in the buffer
>>>> + *
>>>> + * Set the big buffer cloud filters for a given VSI.  The contents of the
>>>> + * i40e_aqc_cloud_filters_element_bb are filled in by the caller of the
>>>> + * function.
>>>> + *
>>>> + **/
>>>> +i40e_status
>>>> +i40e_aq_add_cloud_filters_bb(struct i40e_hw *hw, u16 seid,
>>>> +			     struct i40e_aqc_cloud_filters_element_bb *filters,
>>>> +			     u8 filter_count)
>>>> +{
>>>> +	struct i40e_aq_desc desc;
>>>> +	struct i40e_aqc_add_remove_cloud_filters *cmd =
>>>> +	(struct i40e_aqc_add_remove_cloud_filters *)&desc.params.raw;
>>>> +	i40e_status status;
>>>> +	u16 buff_len;
>>>> +	int i;
>>>> +
>>>> +	i40e_fill_default_direct_cmd_desc(&desc,
>>>> +					  i40e_aqc_opc_add_cloud_filters);
>>>> +
>>>> +	buff_len = filter_count * sizeof(*filters);
>>>> +	desc.datalen = cpu_to_le16(buff_len);
>>>> +	desc.flags |= cpu_to_le16((u16)(I40E_AQ_FLAG_BUF | I40E_AQ_FLAG_RD));
>>>> +	cmd->num_filters = filter_count;
>>>> +	cmd->seid = cpu_to_le16(seid);
>>>> +	cmd->big_buffer_flag = I40E_AQC_ADD_CLOUD_CMD_BB;
>>>> +
>>>> +	for (i = 0; i < filter_count; i++) {
>>>> +		u16 tnl_type;
>>>> +		u32 ti;
>>>> +
>>>> +		tnl_type = (le16_to_cpu(filters[i].element.flags) &
>>>> +			   I40E_AQC_ADD_CLOUD_TNL_TYPE_MASK) >>
>>>> +			   I40E_AQC_ADD_CLOUD_TNL_TYPE_SHIFT;
>>>> +
>>>> +		/* For Geneve, the VNI must be placed one byte higher than
>>>> +		 * the offset used for the Tenant ID of the other tunnel
>>>> +		 * types, hence the shift by 8 bits.
>>>> +		 */
>>>> +		if (tnl_type == I40E_AQC_ADD_CLOUD_TNL_TYPE_GENEVE) {
>>>> +			ti = le32_to_cpu(filters[i].element.tenant_id);
>>>> +			filters[i].element.tenant_id = cpu_to_le32(ti << 8);
>>>> +		}
>>>> +	}
>>>> +
>>>> +	status = i40e_asq_send_command(hw, &desc, filters, buff_len, NULL);
>>>> +
>>>> +	return status;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_aq_rem_cloud_filters
>>>> + * @hw: pointer to the hardware structure
>>>> + * @seid: VSI seid to remove cloud filters from
>>>> + * @filters: Buffer which contains the filters to be removed
>>>> + * @filter_count: number of filters contained in the buffer
>>>> + *
>>>> + * Remove the cloud filters for a given VSI.  The contents of the
>>>> + * i40e_aqc_cloud_filters_element_data are filled in by the caller
>>>> + * of the function.
>>>> + *
>>>> + **/
>>>> +enum i40e_status_code
>>>> +i40e_aq_rem_cloud_filters(struct i40e_hw *hw, u16 seid,
>>>> +			  struct i40e_aqc_cloud_filters_element_data *filters,
>>>> +			  u8 filter_count)
>>>> +{
>>>> +	struct i40e_aq_desc desc;
>>>> +	struct i40e_aqc_add_remove_cloud_filters *cmd =
>>>> +	(struct i40e_aqc_add_remove_cloud_filters *)&desc.params.raw;
>>>> +	enum i40e_status_code status;
>>>> +	u16 buff_len;
>>>> +
>>>> +	i40e_fill_default_direct_cmd_desc(&desc,
>>>> +					  i40e_aqc_opc_remove_cloud_filters);
>>>> +
>>>> +	buff_len = filter_count * sizeof(*filters);
>>>> +	desc.datalen = cpu_to_le16(buff_len);
>>>> +	desc.flags |= cpu_to_le16((u16)(I40E_AQ_FLAG_BUF | I40E_AQ_FLAG_RD));
>>>> +	cmd->num_filters = filter_count;
>>>> +	cmd->seid = cpu_to_le16(seid);
>>>> +
>>>> +	status = i40e_asq_send_command(hw, &desc, filters, buff_len, NULL);
>>>> +
>>>> +	return status;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_aq_rem_cloud_filters_bb
>>>> + * @hw: pointer to the hardware structure
>>>> + * @seid: VSI seid to remove cloud filters from
>>>> + * @filters: Buffer which contains the filters in big buffer to be removed
>>>> + * @filter_count: number of filters contained in the buffer
>>>> + *
>>>> + * Remove the big buffer cloud filters for a given VSI.  The contents of the
>>>> + * i40e_aqc_cloud_filters_element_bb are filled in by the caller of the
>>>> + * function.
>>>> + *
>>>> + **/
>>>> +i40e_status
>>>> +i40e_aq_rem_cloud_filters_bb(struct i40e_hw *hw, u16 seid,
>>>> +			     struct i40e_aqc_cloud_filters_element_bb *filters,
>>>> +			     u8 filter_count)
>>>> +{
>>>> +	struct i40e_aq_desc desc;
>>>> +	struct i40e_aqc_add_remove_cloud_filters *cmd =
>>>> +	(struct i40e_aqc_add_remove_cloud_filters *)&desc.params.raw;
>>>> +	i40e_status status;
>>>> +	u16 buff_len;
>>>> +	int i;
>>>> +
>>>> +	i40e_fill_default_direct_cmd_desc(&desc,
>>>> +					  i40e_aqc_opc_remove_cloud_filters);
>>>> +
>>>> +	buff_len = filter_count * sizeof(*filters);
>>>> +	desc.datalen = cpu_to_le16(buff_len);
>>>> +	desc.flags |= cpu_to_le16((u16)(I40E_AQ_FLAG_BUF | I40E_AQ_FLAG_RD));
>>>> +	cmd->num_filters = filter_count;
>>>> +	cmd->seid = cpu_to_le16(seid);
>>>> +	cmd->big_buffer_flag = I40E_AQC_ADD_CLOUD_CMD_BB;
>>>> +
>>>> +	for (i = 0; i < filter_count; i++) {
>>>> +		u16 tnl_type;
>>>> +		u32 ti;
>>>> +
>>>> +		tnl_type = (le16_to_cpu(filters[i].element.flags) &
>>>> +			   I40E_AQC_ADD_CLOUD_TNL_TYPE_MASK) >>
>>>> +			   I40E_AQC_ADD_CLOUD_TNL_TYPE_SHIFT;
>>>> +
>>>> +		/* For Geneve, the VNI must be placed one byte higher than
>>>> +		 * the offset used for the Tenant ID of the other tunnel
>>>> +		 * types, hence the shift by 8 bits.
>>>> +		 */
>>>> +		if (tnl_type == I40E_AQC_ADD_CLOUD_TNL_TYPE_GENEVE) {
>>>> +			ti = le32_to_cpu(filters[i].element.tenant_id);
>>>> +			filters[i].element.tenant_id = cpu_to_le32(ti << 8);
>>>> +		}
>>>> +	}
>>>> +
>>>> +	status = i40e_asq_send_command(hw, &desc, filters, buff_len, NULL);
>>>> +
>>>> 	return status;
>>>> }
>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
>>>> index afcf08a..96ee608 100644
>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>>>> @@ -69,6 +69,15 @@ static int i40e_reset(struct i40e_pf *pf);
>>>> static void i40e_rebuild(struct i40e_pf *pf, bool reinit, bool lock_acquired);
>>>> static void i40e_fdir_sb_setup(struct i40e_pf *pf);
>>>> static int i40e_veb_get_bw_info(struct i40e_veb *veb);
>>>> +static int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
>>>> +				     struct i40e_cloud_filter *filter,
>>>> +				     bool add);
>>>> +static int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
>>>> +					     struct i40e_cloud_filter *filter,
>>>> +					     bool add);
>>>> +static int i40e_get_capabilities(struct i40e_pf *pf,
>>>> +				 enum i40e_admin_queue_opc list_type);
>>>> +
>>>>
>>>> /* i40e_pci_tbl - PCI Device ID Table
>>>>  *
>>>> @@ -5478,7 +5487,11 @@ int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate)
>>>>  **/
>>>> static void i40e_remove_queue_channels(struct i40e_vsi *vsi)
>>>> {
>>>> +	enum i40e_admin_queue_err last_aq_status;
>>>> +	struct i40e_cloud_filter *cfilter;
>>>> 	struct i40e_channel *ch, *ch_tmp;
>>>> +	struct i40e_pf *pf = vsi->back;
>>>> +	struct hlist_node *node;
>>>> 	int ret, i;
>>>>
>>>> 	/* Reset rss size that was stored when reconfiguring rss for
>>>> @@ -5519,6 +5532,29 @@ static void i40e_remove_queue_channels(struct i40e_vsi *vsi)
>>>> 				 "Failed to reset tx rate for ch->seid %u\n",
>>>> 				 ch->seid);
>>>>
>>>> +		/* delete cloud filters associated with this channel */
>>>> +		hlist_for_each_entry_safe(cfilter, node,
>>>> +					  &pf->cloud_filter_list, cloud_node) {
>>>> +			if (cfilter->seid != ch->seid)
>>>> +				continue;
>>>> +
>>>> +			hash_del(&cfilter->cloud_node);
>>>> +			if (cfilter->dst_port)
>>>> +				ret = i40e_add_del_cloud_filter_big_buf(vsi,
>>>> +									cfilter,
>>>> +									false);
>>>> +			else
>>>> +				ret = i40e_add_del_cloud_filter(vsi, cfilter,
>>>> +								false);
>>>> +			last_aq_status = pf->hw.aq.asq_last_status;
>>>> +			if (ret)
>>>> +				dev_info(&pf->pdev->dev,
>>>> +					 "Failed to delete cloud filter, err %s aq_err %s\n",
>>>> +					 i40e_stat_str(&pf->hw, ret),
>>>> +					 i40e_aq_str(&pf->hw, last_aq_status));
>>>> +			kfree(cfilter);
>>>> +		}
>>>> +
>>>> 		/* delete VSI from FW */
>>>> 		ret = i40e_aq_delete_element(&vsi->back->hw, ch->seid,
>>>> 					     NULL);
>>>> @@ -5970,6 +6006,74 @@ static bool i40e_setup_channel(struct i40e_pf *pf, struct i40e_vsi *vsi,
>>>> }
>>>>
>>>> /**
>>>> + * i40e_validate_and_set_switch_mode - sets up switch mode correctly
>>>> + * @vsi: ptr to VSI which has PF backing
>>>> + * @l4type: true for TCP and false for UDP
>>>> + * @port_type: true if port is destination and false if port is source
>>>> + *
>>>> + * Sets up the switch mode if it needs to be changed, after checking
>>>> + * that the current mode is one of the allowed modes.
>>>> + **/
>>>> +static int i40e_validate_and_set_switch_mode(struct i40e_vsi *vsi, bool l4type,
>>>> +					     bool port_type)
>>>> +{
>>>> +	u8 mode;
>>>> +	struct i40e_pf *pf = vsi->back;
>>>> +	struct i40e_hw *hw = &pf->hw;
>>>> +	int ret;
>>>> +
>>>> +	ret = i40e_get_capabilities(pf, i40e_aqc_opc_list_dev_capabilities);
>>>> +	if (ret)
>>>> +		return -EINVAL;
>>>> +
>>>> +	if (hw->dev_caps.switch_mode) {
>>>> +		/* if switch mode is set, support mode2 (non-tunneled for
>>>> +		 * cloud filter) for now
>>>> +		 */
>>>> +		u32 switch_mode = hw->dev_caps.switch_mode &
>>>> +							I40E_SWITCH_MODE_MASK;
>>>> +		if (switch_mode >= I40E_NVM_IMAGE_TYPE_MODE1) {
>>>> +			if (switch_mode == I40E_NVM_IMAGE_TYPE_MODE2)
>>>> +				return 0;
>>>> +			dev_err(&pf->pdev->dev,
>>>> +				"Invalid switch_mode (%d), only non-tunneled mode for cloud filter is supported\n",
>>>> +				hw->dev_caps.switch_mode);
>>>> +			return -EINVAL;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	/* port_type: true for destination port and false for source port
>>>> +	 * For now, supports only destination port type
>>>> +	 */
>>>> +	if (!port_type) {
>>>> +		dev_err(&pf->pdev->dev, "src port type not supported\n");
>>>> +		return -EINVAL;
>>>> +	}
>>>> +
>>>> +	/* Set Bit 7 to be valid */
>>>> +	mode = I40E_AQ_SET_SWITCH_BIT7_VALID;
>>>> +
>>>> +	/* Set L4type to both TCP and UDP support */
>>>> +	mode |= I40E_AQ_SET_SWITCH_L4_TYPE_BOTH;
>>>> +
>>>> +	/* Set cloud filter mode */
>>>> +	mode |= I40E_AQ_SET_SWITCH_MODE_NON_TUNNEL;
>>>> +
>>>> +	/* Prep mode field for set_switch_config */
>>>> +	ret = i40e_aq_set_switch_config(hw, pf->last_sw_conf_flags,
>>>> +					pf->last_sw_conf_valid_flags,
>>>> +					mode, NULL);
>>>> +	if (ret && hw->aq.asq_last_status != I40E_AQ_RC_ESRCH)
>>>> +		dev_err(&pf->pdev->dev,
>>>> +			"couldn't set switch config bits, err %s aq_err %s\n",
>>>> +			i40e_stat_str(hw, ret),
>>>> +			i40e_aq_str(hw,
>>>> +				    hw->aq.asq_last_status));
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +/**
>>>>  * i40e_create_queue_channel - function to create channel
>>>>  * @vsi: VSI to be configured
>>>>  * @ch: ptr to channel (it contains channel specific params)
>>>> @@ -6735,13 +6839,726 @@ static int i40e_setup_tc(struct net_device *netdev, void *type_data)
>>>> 	return ret;
>>>> }
>>>>
>>>> +/**
>>>> + * i40e_set_cld_element - sets cloud filter element data
>>>> + * @filter: cloud filter rule
>>>> + * @cld: ptr to cloud filter element data
>>>> + *
>>>> + * This is a helper function to copy data into a cloud filter element
>>>> + **/
>>>> +static inline void
>>>> +i40e_set_cld_element(struct i40e_cloud_filter *filter,
>>>> +		     struct i40e_aqc_cloud_filters_element_data *cld)
>>>> +{
>>>> +	int i, j;
>>>> +	u32 ipa;
>>>> +
>>>> +	memset(cld, 0, sizeof(*cld));
>>>> +	ether_addr_copy(cld->outer_mac, filter->dst_mac);
>>>> +	ether_addr_copy(cld->inner_mac, filter->src_mac);
>>>> +
>>>> +	if (filter->ip_version == IPV6_VERSION) {
>>>> +#define IPV6_MAX_INDEX	(ARRAY_SIZE(filter->dst_ipv6) - 1)
>>>> +		for (i = 0, j = 0; i < 4; i++, j += 2) {
>>>> +			ipa = be32_to_cpu(filter->dst_ipv6[IPV6_MAX_INDEX - i]);
>>>> +			ipa = cpu_to_le32(ipa);
>>>> +			memcpy(&cld->ipaddr.raw_v6.data[j], &ipa, 4);
>>>> +		}
>>>> +	} else {
>>>> +		ipa = be32_to_cpu(filter->dst_ip);
>>>> +		memcpy(&cld->ipaddr.v4.data, &ipa, 4);
>>>> +	}
>>>> +
>>>> +	cld->inner_vlan = cpu_to_le16(ntohs(filter->vlan_id));
>>>> +
>>>> +	/* tenant_id is not supported by FW now, once the support is enabled
>>>> +	 * fill the cld->tenant_id with cpu_to_le32(filter->tenant_id)
>>>> +	 */
>>>> +	if (filter->tenant_id)
>>>> +		return;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_add_del_cloud_filter - Add/del cloud filter
>>>> + * @vsi: pointer to VSI
>>>> + * @filter: cloud filter rule
>>>> + * @add: if true, add, if false, delete
>>>> + *
>>>> + * Add or delete a cloud filter for a specific flow spec.
>>>> + * Returns 0 if the filter was successfully added or deleted.
>>>> + **/
>>>> +static int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
>>>> +				     struct i40e_cloud_filter *filter, bool add)
>>>> +{
>>>> +	struct i40e_aqc_cloud_filters_element_data cld_filter;
>>>> +	struct i40e_pf *pf = vsi->back;
>>>> +	int ret;
>>>> +	static const u16 flag_table[128] = {
>>>> +		[I40E_CLOUD_FILTER_FLAGS_OMAC]  =
>>>> +			I40E_AQC_ADD_CLOUD_FILTER_OMAC,
>>>> +		[I40E_CLOUD_FILTER_FLAGS_IMAC]  =
>>>> +			I40E_AQC_ADD_CLOUD_FILTER_IMAC,
>>>> +		[I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN]  =
>>>> +			I40E_AQC_ADD_CLOUD_FILTER_IMAC_IVLAN,
>>>> +		[I40E_CLOUD_FILTER_FLAGS_IMAC_TEN_ID] =
>>>> +			I40E_AQC_ADD_CLOUD_FILTER_IMAC_TEN_ID,
>>>> +		[I40E_CLOUD_FILTER_FLAGS_OMAC_TEN_ID_IMAC] =
>>>> +			I40E_AQC_ADD_CLOUD_FILTER_OMAC_TEN_ID_IMAC,
>>>> +		[I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN_TEN_ID] =
>>>> +			I40E_AQC_ADD_CLOUD_FILTER_IMAC_IVLAN_TEN_ID,
>>>> +		[I40E_CLOUD_FILTER_FLAGS_IIP] =
>>>> +			I40E_AQC_ADD_CLOUD_FILTER_IIP,
>>>> +	};
>>>> +
>>>> +	if (filter->flags >= ARRAY_SIZE(flag_table))
>>>> +		return I40E_ERR_CONFIG;
>>>> +
>>>> +	/* copy element needed to add cloud filter from filter */
>>>> +	i40e_set_cld_element(filter, &cld_filter);
>>>> +
>>>> +	if (filter->tunnel_type != I40E_CLOUD_TNL_TYPE_NONE)
>>>> +		cld_filter.flags = cpu_to_le16(filter->tunnel_type <<
>>>> +					     I40E_AQC_ADD_CLOUD_TNL_TYPE_SHIFT);
>>>> +
>>>> +	if (filter->ip_version == IPV6_VERSION)
>>>> +		cld_filter.flags |= cpu_to_le16(flag_table[filter->flags] |
>>>> +						I40E_AQC_ADD_CLOUD_FLAGS_IPV6);
>>>> +	else
>>>> +		cld_filter.flags |= cpu_to_le16(flag_table[filter->flags] |
>>>> +						I40E_AQC_ADD_CLOUD_FLAGS_IPV4);
>>>> +
>>>> +	if (add)
>>>> +		ret = i40e_aq_add_cloud_filters(&pf->hw, filter->seid,
>>>> +						&cld_filter, 1);
>>>> +	else
>>>> +		ret = i40e_aq_rem_cloud_filters(&pf->hw, filter->seid,
>>>> +						&cld_filter, 1);
>>>> +	if (ret)
>>>> +		dev_dbg(&pf->pdev->dev,
>>>> +			"Failed to %s cloud filter using l4 port %u, err %d aq_err %d\n",
>>>> +			add ? "add" : "delete", filter->dst_port, ret,
>>>> +			pf->hw.aq.asq_last_status);
>>>> +	else
>>>> +		dev_info(&pf->pdev->dev,
>>>> +			 "%s cloud filter for VSI: %d\n",
>>>> +			 add ? "Added" : "Deleted", filter->seid);
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_add_del_cloud_filter_big_buf - Add/del cloud filter using big_buf
>>>> + * @vsi: pointer to VSI
>>>> + * @filter: cloud filter rule
>>>> + * @add: if true, add, if false, delete
>>>> + *
>>>> + * Add or delete a cloud filter for a specific flow spec using big buffer.
>>>> + * Returns 0 if the filter was successfully added or deleted.
>>>> + **/
>>>> +static int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
>>>> +					     struct i40e_cloud_filter *filter,
>>>> +					     bool add)
>>>> +{
>>>> +	struct i40e_aqc_cloud_filters_element_bb cld_filter;
>>>> +	struct i40e_pf *pf = vsi->back;
>>>> +	int ret;
>>>> +
>>>> +	/* Both (Outer/Inner) valid mac_addr are not supported */
>>>> +	if (is_valid_ether_addr(filter->dst_mac) &&
>>>> +	    is_valid_ether_addr(filter->src_mac))
>>>> +		return -EINVAL;
>>>> +
>>>> +	/* Make sure the port is specified, otherwise bail out: a channel
>>>> +	 * specific cloud filter needs the 'L4 port' to be non-zero
>>>> +	 */
>>>> +	if (!filter->dst_port)
>>>> +		return -EINVAL;
>>>> +
>>>> +	/* adding filter using src_port/src_ip is not supported at this stage */
>>>> +	if (filter->src_port || filter->src_ip ||
>>>> +	    !ipv6_addr_any((struct in6_addr *)&filter->src_ipv6))
>>>> +		return -EINVAL;
>>>> +
>>>> +	/* copy element needed to add cloud filter from filter */
>>>> +	i40e_set_cld_element(filter, &cld_filter.element);
>>>> +
>>>> +	if (is_valid_ether_addr(filter->dst_mac) ||
>>>> +	    is_valid_ether_addr(filter->src_mac) ||
>>>> +	    is_multicast_ether_addr(filter->dst_mac) ||
>>>> +	    is_multicast_ether_addr(filter->src_mac)) {
>>>> +		/* MAC + IP : unsupported mode */
>>>> +		if (filter->dst_ip)
>>>> +			return -EINVAL;
>>>> +
>>>> +		/* since we validated that L4 port must be valid before
>>>> +		 * we get here, start with respective "flags" value
>>>> +		 * and update if vlan is present or not
>>>> +		 */
>>>> +		cld_filter.element.flags =
>>>> +			cpu_to_le16(I40E_AQC_ADD_CLOUD_FILTER_MAC_PORT);
>>>> +
>>>> +		if (filter->vlan_id) {
>>>> +			cld_filter.element.flags =
>>>> +			cpu_to_le16(I40E_AQC_ADD_CLOUD_FILTER_MAC_VLAN_PORT);
>>>> +		}
>>>> +
>>>> +	} else if (filter->dst_ip || filter->ip_version == IPV6_VERSION) {
>>>> +		cld_filter.element.flags =
>>>> +				cpu_to_le16(I40E_AQC_ADD_CLOUD_FILTER_IP_PORT);
>>>> +		if (filter->ip_version == IPV6_VERSION)
>>>> +			cld_filter.element.flags |=
>>>> +				cpu_to_le16(I40E_AQC_ADD_CLOUD_FLAGS_IPV6);
>>>> +		else
>>>> +			cld_filter.element.flags |=
>>>> +				cpu_to_le16(I40E_AQC_ADD_CLOUD_FLAGS_IPV4);
>>>> +	} else {
>>>> +		dev_err(&pf->pdev->dev,
>>>> +			"either mac or ip has to be valid for cloud filter\n");
>>>> +		return -EINVAL;
>>>> +	}
>>>> +
>>>> +	/* Now copy L4 port in Byte 6..7 in general fields */
>>>> +	cld_filter.general_fields[I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD0] =
>>>> +						be16_to_cpu(filter->dst_port);
>>>> +
>>>> +	if (add) {
>>>> +		bool proto_type, port_type;
>>>> +
>>>> +		proto_type = (filter->ip_proto == IPPROTO_TCP) ? true : false;
>>>> +		port_type = (filter->port_type & I40E_CLOUD_FILTER_PORT_DEST) ?
>>>> +			     true : false;
>>>> +
>>>> +		/* For now, src port based cloud filter for channel is not
>>>> +		 * supported
>>>> +		 */
>>>> +		if (!port_type) {
>>>> +			dev_err(&pf->pdev->dev,
>>>> +				"unsupported port type (src port)\n");
>>>> +			return -EOPNOTSUPP;
>>>> +		}
>>>> +
>>>> +		/* Validate current device switch mode, change if necessary */
>>>> +		ret = i40e_validate_and_set_switch_mode(vsi, proto_type,
>>>> +							port_type);
>>>> +		if (ret) {
>>>> +			dev_err(&pf->pdev->dev,
>>>> +				"failed to set switch mode, ret %d\n",
>>>> +				ret);
>>>> +			return ret;
>>>> +		}
>>>> +
>>>> +		ret = i40e_aq_add_cloud_filters_bb(&pf->hw, filter->seid,
>>>> +						   &cld_filter, 1);
>>>> +	} else {
>>>> +		ret = i40e_aq_rem_cloud_filters_bb(&pf->hw, filter->seid,
>>>> +						   &cld_filter, 1);
>>>> +	}
>>>> +
>>>> +	if (ret)
>>>> +		dev_dbg(&pf->pdev->dev,
>>>> +			"Failed to %s cloud filter(big buffer) err %d aq_err %d\n",
>>>> +			add ? "add" : "delete", ret, pf->hw.aq.asq_last_status);
>>>> +	else
>>>> +		dev_info(&pf->pdev->dev,
>>>> +			 "%s cloud filter for VSI: %d, L4 port: %d\n",
>>>> +			 add ? "Added" : "Deleted", filter->seid,
>>>> +			 ntohs(filter->dst_port));
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_parse_cls_flower - Parse tc flower filters provided by kernel
>>>> + * @vsi: Pointer to VSI
>>>> + * @f: Pointer to struct tc_cls_flower_offload
>>>> + * @filter: Pointer to cloud filter structure
>>>> + *
>>>> + **/
>>>> +static int i40e_parse_cls_flower(struct i40e_vsi *vsi,
>>>> +				 struct tc_cls_flower_offload *f,
>>>> +				 struct i40e_cloud_filter *filter)
>>>> +{
>>>> +	struct i40e_pf *pf = vsi->back;
>>>> +	u16 addr_type = 0;
>>>> +	u8 field_flags = 0;
>>>> +
>>>> +	if (f->dissector->used_keys &
>>>> +	    ~(BIT(FLOW_DISSECTOR_KEY_CONTROL) |
>>>> +	      BIT(FLOW_DISSECTOR_KEY_BASIC) |
>>>> +	      BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) |
>>>> +	      BIT(FLOW_DISSECTOR_KEY_VLAN) |
>>>> +	      BIT(FLOW_DISSECTOR_KEY_IPV4_ADDRS) |
>>>> +	      BIT(FLOW_DISSECTOR_KEY_IPV6_ADDRS) |
>>>> +	      BIT(FLOW_DISSECTOR_KEY_PORTS) |
>>>> +	      BIT(FLOW_DISSECTOR_KEY_ENC_KEYID))) {
>>>> +		dev_err(&pf->pdev->dev, "Unsupported key used: 0x%x\n",
>>>> +			f->dissector->used_keys);
>>>> +		return -EOPNOTSUPP;
>>>> +	}
>>>> +
>>>> +	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ENC_KEYID)) {
>>>> +		struct flow_dissector_key_keyid *key =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_ENC_KEYID,
>>>> +						  f->key);
>>>> +
>>>> +		struct flow_dissector_key_keyid *mask =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_ENC_KEYID,
>>>> +						  f->mask);
>>>> +
>>>> +		if (mask->keyid != 0)
>>>> +			field_flags |= I40E_CLOUD_FIELD_TEN_ID;
>>>> +
>>>> +		filter->tenant_id = be32_to_cpu(key->keyid);
>>>> +	}
>>>> +
>>>> +	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_BASIC)) {
>>>> +		struct flow_dissector_key_basic *key =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_BASIC,
>>>> +						  f->key);
>>>> +
>>>> +		filter->ip_proto = key->ip_proto;
>>>> +	}
>>>> +
>>>> +	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
>>>> +		struct flow_dissector_key_eth_addrs *key =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_ETH_ADDRS,
>>>> +						  f->key);
>>>> +
>>>> +		struct flow_dissector_key_eth_addrs *mask =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_ETH_ADDRS,
>>>> +						  f->mask);
>>>> +
>>>> +		/* use is_broadcast and is_zero to check for all 0xff or 0 */
>>>> +		if (!is_zero_ether_addr(mask->dst)) {
>>>> +			if (is_broadcast_ether_addr(mask->dst)) {
>>>> +				field_flags |= I40E_CLOUD_FIELD_OMAC;
>>>> +			} else {
>>>> +				dev_err(&pf->pdev->dev, "Bad ether dest mask %pM\n",
>>>> +					mask->dst);
>>>> +				return I40E_ERR_CONFIG;
>>>> +			}
>>>> +		}
>>>> +
>>>> +		if (!is_zero_ether_addr(mask->src)) {
>>>> +			if (is_broadcast_ether_addr(mask->src)) {
>>>> +				field_flags |= I40E_CLOUD_FIELD_IMAC;
>>>> +			} else {
>>>> +				dev_err(&pf->pdev->dev, "Bad ether src mask %pM\n",
>>>> +					mask->src);
>>>> +				return I40E_ERR_CONFIG;
>>>> +			}
>>>> +		}
>>>> +		ether_addr_copy(filter->dst_mac, key->dst);
>>>> +		ether_addr_copy(filter->src_mac, key->src);
>>>> +	}
>>>> +
>>>> +	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_VLAN)) {
>>>> +		struct flow_dissector_key_vlan *key =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_VLAN,
>>>> +						  f->key);
>>>> +		struct flow_dissector_key_vlan *mask =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_VLAN,
>>>> +						  f->mask);
>>>> +
>>>> +		if (mask->vlan_id) {
>>>> +			if (mask->vlan_id == VLAN_VID_MASK) {
>>>> +				field_flags |= I40E_CLOUD_FIELD_IVLAN;
>>>> +
>>>> +			} else {
>>>> +				dev_err(&pf->pdev->dev, "Bad vlan mask 0x%04x\n",
>>>> +					mask->vlan_id);
>>>> +				return I40E_ERR_CONFIG;
>>>> +			}
>>>> +		}
>>>> +
>>>> +		filter->vlan_id = cpu_to_be16(key->vlan_id);
>>>> +	}
>>>> +
>>>> +	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_CONTROL)) {
>>>> +		struct flow_dissector_key_control *key =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_CONTROL,
>>>> +						  f->key);
>>>> +
>>>> +		addr_type = key->addr_type;
>>>> +	}
>>>> +
>>>> +	if (addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS) {
>>>> +		struct flow_dissector_key_ipv4_addrs *key =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_IPV4_ADDRS,
>>>> +						  f->key);
>>>> +		struct flow_dissector_key_ipv4_addrs *mask =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_IPV4_ADDRS,
>>>> +						  f->mask);
>>>> +
>>>> +		if (mask->dst) {
>>>> +			if (mask->dst == cpu_to_be32(0xffffffff)) {
>>>> +				field_flags |= I40E_CLOUD_FIELD_IIP;
>>>> +			} else {
>>>> +				dev_err(&pf->pdev->dev, "Bad ip dst mask 0x%08x\n",
>>>> +					be32_to_cpu(mask->dst));
>>>> +				return I40E_ERR_CONFIG;
>>>> +			}
>>>> +		}
>>>> +
>>>> +		if (mask->src) {
>>>> +			if (mask->src == cpu_to_be32(0xffffffff)) {
>>>> +				field_flags |= I40E_CLOUD_FIELD_IIP;
>>>> +			} else {
>>>> +				dev_err(&pf->pdev->dev, "Bad ip src mask 0x%08x\n",
>>>> +					be32_to_cpu(mask->src));
>>>> +				return I40E_ERR_CONFIG;
>>>> +			}
>>>> +		}
>>>> +
>>>> +		if (field_flags & I40E_CLOUD_FIELD_TEN_ID) {
>>>> +			dev_err(&pf->pdev->dev, "Tenant id not allowed for ip filter\n");
>>>> +			return I40E_ERR_CONFIG;
>>>> +		}
>>>> +		filter->dst_ip = key->dst;
>>>> +		filter->src_ip = key->src;
>>>> +		filter->ip_version = IPV4_VERSION;
>>>> +	}
>>>> +
>>>> +	if (addr_type == FLOW_DISSECTOR_KEY_IPV6_ADDRS) {
>>>> +		struct flow_dissector_key_ipv6_addrs *key =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_IPV6_ADDRS,
>>>> +						  f->key);
>>>> +		struct flow_dissector_key_ipv6_addrs *mask =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_IPV6_ADDRS,
>>>> +						  f->mask);
>>>> +
>>>> +		/* src and dest IPV6 address should not be LOOPBACK
>>>> +		 * (0:0:0:0:0:0:0:1), which can be represented as ::1
>>>> +		 */
>>>> +		if (ipv6_addr_loopback(&key->dst) ||
>>>> +		    ipv6_addr_loopback(&key->src)) {
>>>> +			dev_err(&pf->pdev->dev,
>>>> +				"Bad ipv6, addr is LOOPBACK\n");
>>>> +			return I40E_ERR_CONFIG;
>>>> +		}
>>>> +		if (!ipv6_addr_any(&mask->dst) || !ipv6_addr_any(&mask->src))
>>>> +			field_flags |= I40E_CLOUD_FIELD_IIP;
>>>> +
>>>> +		memcpy(&filter->src_ipv6, &key->src.s6_addr32,
>>>> +		       sizeof(filter->src_ipv6));
>>>> +		memcpy(&filter->dst_ipv6, &key->dst.s6_addr32,
>>>> +		       sizeof(filter->dst_ipv6));
>>>> +
>>>> +		/* mark it as IPv6 filter, to be used later */
>>>> +		filter->ip_version = IPV6_VERSION;
>>>> +	}
>>>> +
>>>> +	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_PORTS)) {
>>>> +		struct flow_dissector_key_ports *key =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_PORTS,
>>>> +						  f->key);
>>>> +		struct flow_dissector_key_ports *mask =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_PORTS,
>>>> +						  f->mask);
>>>> +
>>>> +		if (mask->src) {
>>>> +			if (mask->src == cpu_to_be16(0xffff)) {
>>>> +				field_flags |= I40E_CLOUD_FIELD_IIP;
>>>> +			} else {
>>>> +				dev_err(&pf->pdev->dev, "Bad src port mask 0x%04x\n",
>>>> +					be16_to_cpu(mask->src));
>>>> +				return I40E_ERR_CONFIG;
>>>> +			}
>>>> +		}
>>>> +
>>>> +		if (mask->dst) {
>>>> +			if (mask->dst == cpu_to_be16(0xffff)) {
>>>> +				field_flags |= I40E_CLOUD_FIELD_IIP;
>>>> +			} else {
>>>> +				dev_err(&pf->pdev->dev, "Bad dst port mask 0x%04x\n",
>>>> +					be16_to_cpu(mask->dst));
>>>> +				return I40E_ERR_CONFIG;
>>>> +			}
>>>> +		}
>>>> +
>>>> +		filter->dst_port = key->dst;
>>>> +		filter->src_port = key->src;
>>>> +
>>>> +		/* For now, only supports destination port */
>>>> +		filter->port_type |= I40E_CLOUD_FILTER_PORT_DEST;
>>>> +
>>>> +		switch (filter->ip_proto) {
>>>> +		case IPPROTO_TCP:
>>>> +		case IPPROTO_UDP:
>>>> +			break;
>>>> +		default:
>>>> +			dev_err(&pf->pdev->dev,
>>>> +				"Only UDP and TCP transport are supported\n");
>>>> +			return -EINVAL;
>>>> +		}
>>>> +	}
>>>> +	filter->flags = field_flags;
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_handle_redirect_action: Forward to a traffic class on the device
>>>> + * @vsi: Pointer to VSI
>>>> + * @ifindex: ifindex of the device to forward to
>>>> + * @tc: traffic class index on the device
>>>> + * @filter: Pointer to cloud filter structure
>>>> + *
>>>> + **/
>>>> +static int i40e_handle_redirect_action(struct i40e_vsi *vsi, int ifindex, u8 tc,
>>>> +				       struct i40e_cloud_filter *filter)
>>>> +{
>>>> +	struct i40e_channel *ch, *ch_tmp;
>>>> +
>>>> +	/* redirect to a traffic class on the same device */
>>>> +	if (vsi->netdev->ifindex == ifindex) {
>>>> +		if (tc == 0) {
>>>> +			filter->seid = vsi->seid;
>>>> +			return 0;
>>>> +		} else if (vsi->tc_config.enabled_tc & BIT(tc)) {
>>>> +			if (!filter->dst_port) {
>>>> +				dev_err(&vsi->back->pdev->dev,
>>>> +					"Specify destination port to redirect to traffic class that is not default\n");
>>>> +				return -EINVAL;
>>>> +			}
>>>> +			if (list_empty(&vsi->ch_list))
>>>> +				return -EINVAL;
>>>> +			list_for_each_entry_safe(ch, ch_tmp, &vsi->ch_list,
>>>> +						 list) {
>>>> +				if (ch->seid == vsi->tc_seid_map[tc])
>>>> +					filter->seid = ch->seid;
>>>> +			}
>>>> +			return 0;
>>>> +		}
>>>> +	}
>>>> +	return -EINVAL;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_parse_tc_actions - Parse tc actions
>>>> + * @vsi: Pointer to VSI
>>>> + * @exts: Pointer to struct tcf_exts
>>>> + * @filter: Pointer to cloud filter structure
>>>> + *
>>>> + **/
>>>> +static int i40e_parse_tc_actions(struct i40e_vsi *vsi, struct tcf_exts *exts,
>>>> +				 struct i40e_cloud_filter *filter)
>>>> +{
>>>> +	const struct tc_action *a;
>>>> +	LIST_HEAD(actions);
>>>> +	int err;
>>>> +
>>>> +	if (!tcf_exts_has_actions(exts))
>>>> +		return -EINVAL;
>>>> +
>>>> +	tcf_exts_to_list(exts, &actions);
>>>> +	list_for_each_entry(a, &actions, list) {
>>>> +		/* Drop action */
>>>> +		if (is_tcf_gact_shot(a)) {
>>>> +			dev_err(&vsi->back->pdev->dev,
>>>> +				"Cloud filters do not support the drop action.\n");
>>>> +			return -EOPNOTSUPP;
>>>> +		}
>>>> +
>>>> +		/* Redirect to a traffic class on the same device */
>>>> +		if (!is_tcf_mirred_egress_redirect(a) && is_tcf_mirred_tc(a)) {
>>>> +			int ifindex = tcf_mirred_ifindex(a);
>>>> +			u8 tc = tcf_mirred_tc(a);
>>>> +
>>>> +			err = i40e_handle_redirect_action(vsi, ifindex, tc,
>>>> +							  filter);
>>>> +			if (err == 0)
>>>> +				return err;
>>>> +		}
>>>> +	}
>>>> +	return -EINVAL;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_configure_clsflower - Configure tc flower filters
>>>> + * @vsi: Pointer to VSI
>>>> + * @cls_flower: Pointer to struct tc_cls_flower_offload
>>>> + *
>>>> + **/
>>>> +static int i40e_configure_clsflower(struct i40e_vsi *vsi,
>>>> +				    struct tc_cls_flower_offload *cls_flower)
>>>> +{
>>>> +	struct i40e_cloud_filter *filter = NULL;
>>>> +	struct i40e_pf *pf = vsi->back;
>>>> +	int err = 0;
>>>> +
>>>> +	if (test_bit(__I40E_RESET_RECOVERY_PENDING, pf->state) ||
>>>> +	    test_bit(__I40E_RESET_INTR_RECEIVED, pf->state))
>>>> +		return -EBUSY;
>>>> +
>>>> +	if (pf->fdir_pf_active_filters ||
>>>> +	    (!hlist_empty(&pf->fdir_filter_list))) {
>>>> +		dev_err(&vsi->back->pdev->dev,
>>>> +			"Flow Director Sideband filters exists, turn ntuple off to configure cloud filters\n");
>>>> +		return -EINVAL;
>>>> +	}
>>>> +
>>>> +	if (vsi->back->flags & I40E_FLAG_FD_SB_ENABLED) {
>>>> +		dev_err(&vsi->back->pdev->dev,
>>>> +			"Disable Flow Director Sideband, configuring Cloud filters via tc-flower\n");
>>>> +		vsi->back->flags &= ~I40E_FLAG_FD_SB_ENABLED;
>>>> +		vsi->back->flags |= I40E_FLAG_FD_SB_TO_CLOUD_FILTER;
>>>> +	}
>>>> +
>>>> +	filter = kzalloc(sizeof(*filter), GFP_KERNEL);
>>>> +	if (!filter)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	filter->cookie = cls_flower->cookie;
>>>> +
>>>> +	err = i40e_parse_cls_flower(vsi, cls_flower, filter);
>>>> +	if (err < 0)
>>>> +		goto err;
>>>> +
>>>> +	err = i40e_parse_tc_actions(vsi, cls_flower->exts, filter);
>>>> +	if (err < 0)
>>>> +		goto err;
>>>> +
>>>> +	/* Add cloud filter */
>>>> +	if (filter->dst_port)
>>>> +		err = i40e_add_del_cloud_filter_big_buf(vsi, filter, true);
>>>> +	else
>>>> +		err = i40e_add_del_cloud_filter(vsi, filter, true);
>>>> +
>>>> +	if (err) {
>>>> +		dev_err(&pf->pdev->dev,
>>>> +			"Failed to add cloud filter, err %s\n",
>>>> +			i40e_stat_str(&pf->hw, err));
>>>> +		err = i40e_aq_rc_to_posix(err, pf->hw.aq.asq_last_status);
>>>> +		goto err;
>>>> +	}
>>>> +
>>>> +	/* add filter to the ordered list */
>>>> +	INIT_HLIST_NODE(&filter->cloud_node);
>>>> +
>>>> +	hlist_add_head(&filter->cloud_node, &pf->cloud_filter_list);
>>>> +
>>>> +	pf->num_cloud_filters++;
>>>> +
>>>> +	return err;
>>>> +err:
>>>> +	kfree(filter);
>>>> +	return err;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_find_cloud_filter - Find the cloud filter in the list
>>>> + * @vsi: Pointer to VSI
>>>> + * @cookie: filter specific cookie
>>>> + *
>>>> + **/
>>>> +static struct i40e_cloud_filter *i40e_find_cloud_filter(struct i40e_vsi *vsi,
>>>> +							unsigned long *cookie)
>>>> +{
>>>> +	struct i40e_cloud_filter *filter = NULL;
>>>> +	struct hlist_node *node2;
>>>> +
>>>> +	hlist_for_each_entry_safe(filter, node2,
>>>> +				  &vsi->back->cloud_filter_list, cloud_node)
>>>> +		if (!memcmp(cookie, &filter->cookie, sizeof(filter->cookie)))
>>>> +			return filter;
>>>> +	return NULL;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_delete_clsflower - Remove tc flower filters
>>>> + * @vsi: Pointer to VSI
>>>> + * @cls_flower: Pointer to struct tc_cls_flower_offload
>>>> + *
>>>> + **/
>>>> +static int i40e_delete_clsflower(struct i40e_vsi *vsi,
>>>> +				 struct tc_cls_flower_offload *cls_flower)
>>>> +{
>>>> +	struct i40e_cloud_filter *filter = NULL;
>>>> +	struct i40e_pf *pf = vsi->back;
>>>> +	int err = 0;
>>>> +
>>>> +	filter = i40e_find_cloud_filter(vsi, &cls_flower->cookie);
>>>> +
>>>> +	if (!filter)
>>>> +		return -EINVAL;
>>>> +
>>>> +	hash_del(&filter->cloud_node);
>>>> +
>>>> +	if (filter->dst_port)
>>>> +		err = i40e_add_del_cloud_filter_big_buf(vsi, filter, false);
>>>> +	else
>>>> +		err = i40e_add_del_cloud_filter(vsi, filter, false);
>>>> +	if (err) {
>>>> +		kfree(filter);
>>>> +		dev_err(&pf->pdev->dev,
>>>> +			"Failed to delete cloud filter, err %s\n",
>>>> +			i40e_stat_str(&pf->hw, err));
>>>> +		return i40e_aq_rc_to_posix(err, pf->hw.aq.asq_last_status);
>>>> +	}
>>>> +
>>>> +	kfree(filter);
>>>> +	pf->num_cloud_filters--;
>>>> +
>>>> +	if (!pf->num_cloud_filters)
>>>> +		if ((pf->flags & I40E_FLAG_FD_SB_TO_CLOUD_FILTER) &&
>>>> +		    !(pf->flags & I40E_FLAG_FD_SB_INACTIVE)) {
>>>> +			pf->flags |= I40E_FLAG_FD_SB_ENABLED;
>>>> +			pf->flags &= ~I40E_FLAG_FD_SB_TO_CLOUD_FILTER;
>>>> +			pf->flags &= ~I40E_FLAG_FD_SB_INACTIVE;
>>>> +		}
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_setup_tc_cls_flower - flower classifier offloads
>>>> + * @netdev: net device to configure
>>>> + * @type_data: offload data
>>>> + **/
>>>> +static int i40e_setup_tc_cls_flower(struct net_device *netdev,
>>>> +				    struct tc_cls_flower_offload *cls_flower)
>>>> +{
>>>> +	struct i40e_netdev_priv *np = netdev_priv(netdev);
>>>> +	struct i40e_vsi *vsi = np->vsi;
>>>> +
>>>> +	if (!is_classid_clsact_ingress(cls_flower->common.classid) ||
>>>> +	    cls_flower->common.chain_index)
>>>> +		return -EOPNOTSUPP;
>>>> +
>>>> +	switch (cls_flower->command) {
>>>> +	case TC_CLSFLOWER_REPLACE:
>>>> +		return i40e_configure_clsflower(vsi, cls_flower);
>>>> +	case TC_CLSFLOWER_DESTROY:
>>>> +		return i40e_delete_clsflower(vsi, cls_flower);
>>>> +	case TC_CLSFLOWER_STATS:
>>>> +		return -EOPNOTSUPP;
>>>> +	default:
>>>> +		return -EINVAL;
>>>> +	}
>>>> +}
>>>> +
>>>> static int __i40e_setup_tc(struct net_device *netdev, enum tc_setup_type type,
>>>> 			   void *type_data)
>>>> {
>>>> -	if (type != TC_SETUP_MQPRIO)
>>>> +	switch (type) {
>>>> +	case TC_SETUP_MQPRIO:
>>>> +		return i40e_setup_tc(netdev, type_data);
>>>> +	case TC_SETUP_CLSFLOWER:
>>>> +		return i40e_setup_tc_cls_flower(netdev, type_data);
>>>> +	default:
>>>> 		return -EOPNOTSUPP;
>>>> -
>>>> -	return i40e_setup_tc(netdev, type_data);
>>>> +	}
>>>> }
>>>>
>>>> /**
>>>> @@ -6939,6 +7756,13 @@ static void i40e_cloud_filter_exit(struct i40e_pf *pf)
>>>> 		kfree(cfilter);
>>>> 	}
>>>> 	pf->num_cloud_filters = 0;
>>>> +
>>>> +	if ((pf->flags & I40E_FLAG_FD_SB_TO_CLOUD_FILTER) &&
>>>> +	    !(pf->flags & I40E_FLAG_FD_SB_INACTIVE)) {
>>>> +		pf->flags |= I40E_FLAG_FD_SB_ENABLED;
>>>> +		pf->flags &= ~I40E_FLAG_FD_SB_TO_CLOUD_FILTER;
>>>> +		pf->flags &= ~I40E_FLAG_FD_SB_INACTIVE;
>>>> +	}
>>>> }
>>>>
>>>> /**
>>>> @@ -8046,7 +8870,8 @@ static int i40e_reconstitute_veb(struct i40e_veb *veb)
>>>>  * i40e_get_capabilities - get info about the HW
>>>>  * @pf: the PF struct
>>>>  **/
>>>> -static int i40e_get_capabilities(struct i40e_pf *pf)
>>>> +static int i40e_get_capabilities(struct i40e_pf *pf,
>>>> +				 enum i40e_admin_queue_opc list_type)
>>>> {
>>>> 	struct i40e_aqc_list_capabilities_element_resp *cap_buf;
>>>> 	u16 data_size;
>>>> @@ -8061,9 +8886,8 @@ static int i40e_get_capabilities(struct i40e_pf *pf)
>>>>
>>>> 		/* this loads the data into the hw struct for us */
>>>> 		err = i40e_aq_discover_capabilities(&pf->hw, cap_buf, buf_len,
>>>> -					    &data_size,
>>>> -					    i40e_aqc_opc_list_func_capabilities,
>>>> -					    NULL);
>>>> +						    &data_size, list_type,
>>>> +						    NULL);
>>>> 		/* data loaded, buffer no longer needed */
>>>> 		kfree(cap_buf);
>>>>
>>>> @@ -8080,26 +8904,44 @@ static int i40e_get_capabilities(struct i40e_pf *pf)
>>>> 		}
>>>> 	} while (err);
>>>>
>>>> -	if (pf->hw.debug_mask & I40E_DEBUG_USER)
>>>> -		dev_info(&pf->pdev->dev,
>>>> -			 "pf=%d, num_vfs=%d, msix_pf=%d, msix_vf=%d, fd_g=%d, fd_b=%d, pf_max_q=%d num_vsi=%d\n",
>>>> -			 pf->hw.pf_id, pf->hw.func_caps.num_vfs,
>>>> -			 pf->hw.func_caps.num_msix_vectors,
>>>> -			 pf->hw.func_caps.num_msix_vectors_vf,
>>>> -			 pf->hw.func_caps.fd_filters_guaranteed,
>>>> -			 pf->hw.func_caps.fd_filters_best_effort,
>>>> -			 pf->hw.func_caps.num_tx_qp,
>>>> -			 pf->hw.func_caps.num_vsis);
>>>> -
>>>> +	if (pf->hw.debug_mask & I40E_DEBUG_USER) {
>>>> +		if (list_type == i40e_aqc_opc_list_func_capabilities) {
>>>> +			dev_info(&pf->pdev->dev,
>>>> +				 "pf=%d, num_vfs=%d, msix_pf=%d, msix_vf=%d, fd_g=%d, fd_b=%d, pf_max_q=%d num_vsi=%d\n",
>>>> +				 pf->hw.pf_id, pf->hw.func_caps.num_vfs,
>>>> +				 pf->hw.func_caps.num_msix_vectors,
>>>> +				 pf->hw.func_caps.num_msix_vectors_vf,
>>>> +				 pf->hw.func_caps.fd_filters_guaranteed,
>>>> +				 pf->hw.func_caps.fd_filters_best_effort,
>>>> +				 pf->hw.func_caps.num_tx_qp,
>>>> +				 pf->hw.func_caps.num_vsis);
>>>> +		} else if (list_type == i40e_aqc_opc_list_dev_capabilities) {
>>>> +			dev_info(&pf->pdev->dev,
>>>> +				 "switch_mode=0x%04x, function_valid=0x%08x\n",
>>>> +				 pf->hw.dev_caps.switch_mode,
>>>> +				 pf->hw.dev_caps.valid_functions);
>>>> +			dev_info(&pf->pdev->dev,
>>>> +				 "SR-IOV=%d, num_vfs for all function=%u\n",
>>>> +				 pf->hw.dev_caps.sr_iov_1_1,
>>>> +				 pf->hw.dev_caps.num_vfs);
>>>> +			dev_info(&pf->pdev->dev,
>>>> +				 "num_vsis=%u, num_rx:%u, num_tx=%u\n",
>>>> +				 pf->hw.dev_caps.num_vsis,
>>>> +				 pf->hw.dev_caps.num_rx_qp,
>>>> +				 pf->hw.dev_caps.num_tx_qp);
>>>> +		}
>>>> +	}
>>>> +	if (list_type == i40e_aqc_opc_list_func_capabilities) {
>>>> #define DEF_NUM_VSI (1 + (pf->hw.func_caps.fcoe ? 1 : 0) \
>>>> 		       + pf->hw.func_caps.num_vfs)
>>>> -	if (pf->hw.revision_id == 0 && (DEF_NUM_VSI > pf->hw.func_caps.num_vsis)) {
>>>> -		dev_info(&pf->pdev->dev,
>>>> -			 "got num_vsis %d, setting num_vsis to %d\n",
>>>> -			 pf->hw.func_caps.num_vsis, DEF_NUM_VSI);
>>>> -		pf->hw.func_caps.num_vsis = DEF_NUM_VSI;
>>>> +		if (pf->hw.revision_id == 0 &&
>>>> +		    (pf->hw.func_caps.num_vsis < DEF_NUM_VSI)) {
>>>> +			dev_info(&pf->pdev->dev,
>>>> +				 "got num_vsis %d, setting num_vsis to %d\n",
>>>> +				 pf->hw.func_caps.num_vsis, DEF_NUM_VSI);
>>>> +			pf->hw.func_caps.num_vsis = DEF_NUM_VSI;
>>>> +		}
>>>> 	}
>>>> -
>>>> 	return 0;
>>>> }
>>>>
>>>> @@ -8141,6 +8983,7 @@ static void i40e_fdir_sb_setup(struct i40e_pf *pf)
>>>> 		if (!vsi) {
>>>> 			dev_info(&pf->pdev->dev, "Couldn't create FDir VSI\n");
>>>> 			pf->flags &= ~I40E_FLAG_FD_SB_ENABLED;
>>>> +			pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>>> 			return;
>>>> 		}
>>>> 	}
>>>> @@ -8163,6 +9006,48 @@ static void i40e_fdir_teardown(struct i40e_pf *pf)
>>>> }
>>>>
>>>> /**
>>>> + * i40e_rebuild_cloud_filters - Rebuilds cloud filters for VSIs
>>>> + * @vsi: PF main vsi
>>>> + * @seid: seid of main or channel VSIs
>>>> + *
>>>> + * Rebuilds cloud filters associated with main VSI and channel VSIs if they
>>>> + * existed before reset
>>>> + **/
>>>> +static int i40e_rebuild_cloud_filters(struct i40e_vsi *vsi, u16 seid)
>>>> +{
>>>> +	struct i40e_cloud_filter *cfilter;
>>>> +	struct i40e_pf *pf = vsi->back;
>>>> +	struct hlist_node *node;
>>>> +	i40e_status ret;
>>>> +
>>>> +	/* Add cloud filters back if they exist */
>>>> +	if (hlist_empty(&pf->cloud_filter_list))
>>>> +		return 0;
>>>> +
>>>> +	hlist_for_each_entry_safe(cfilter, node, &pf->cloud_filter_list,
>>>> +				  cloud_node) {
>>>> +		if (cfilter->seid != seid)
>>>> +			continue;
>>>> +
>>>> +		if (cfilter->dst_port)
>>>> +			ret = i40e_add_del_cloud_filter_big_buf(vsi, cfilter,
>>>> +								true);
>>>> +		else
>>>> +			ret = i40e_add_del_cloud_filter(vsi, cfilter, true);
>>>> +
>>>> +		if (ret) {
>>>> +			dev_dbg(&pf->pdev->dev,
>>>> +				"Failed to rebuild cloud filter, err %s aq_err %s\n",
>>>> +				i40e_stat_str(&pf->hw, ret),
>>>> +				i40e_aq_str(&pf->hw,
>>>> +					    pf->hw.aq.asq_last_status));
>>>> +			return ret;
>>>> +		}
>>>> +	}
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +/**
>>>>  * i40e_rebuild_channels - Rebuilds channel VSIs if they existed before reset
>>>>  * @vsi: PF main vsi
>>>>  *
>>>> @@ -8199,6 +9084,13 @@ static int i40e_rebuild_channels(struct i40e_vsi *vsi)
>>>> 						I40E_BW_CREDIT_DIVISOR,
>>>> 				ch->seid);
>>>> 		}
>>>> +		ret = i40e_rebuild_cloud_filters(vsi, ch->seid);
>>>> +		if (ret) {
>>>> +			dev_dbg(&vsi->back->pdev->dev,
>>>> +				"Failed to rebuild cloud filters for channel VSI %u\n",
>>>> +				ch->seid);
>>>> +			return ret;
>>>> +		}
>>>> 	}
>>>> 	return 0;
>>>> }
>>>> @@ -8365,7 +9257,7 @@ static void i40e_rebuild(struct i40e_pf *pf, bool reinit, bool lock_acquired)
>>>> 		i40e_verify_eeprom(pf);
>>>>
>>>> 	i40e_clear_pxe_mode(hw);
>>>> -	ret = i40e_get_capabilities(pf);
>>>> +	ret = i40e_get_capabilities(pf, i40e_aqc_opc_list_func_capabilities);
>>>> 	if (ret)
>>>> 		goto end_core_reset;
>>>>
>>>> @@ -8482,6 +9374,10 @@ static void i40e_rebuild(struct i40e_pf *pf, bool reinit, bool lock_acquired)
>>>> 			goto end_unlock;
>>>> 	}
>>>>
>>>> +	ret = i40e_rebuild_cloud_filters(vsi, vsi->seid);
>>>> +	if (ret)
>>>> +		goto end_unlock;
>>>> +
>>>> 	/* PF Main VSI is rebuild by now, go ahead and rebuild channel VSIs
>>>> 	 * for this main VSI if they exist
>>>> 	 */
>>>> @@ -9404,6 +10300,7 @@ static int i40e_init_msix(struct i40e_pf *pf)
>>>> 	    (pf->num_fdsb_msix == 0)) {
>>>> 		dev_info(&pf->pdev->dev, "Sideband Flowdir disabled, not enough MSI-X vectors\n");
>>>> 		pf->flags &= ~I40E_FLAG_FD_SB_ENABLED;
>>>> +		pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>>> 	}
>>>> 	if ((pf->flags & I40E_FLAG_VMDQ_ENABLED) &&
>>>> 	    (pf->num_vmdq_msix == 0)) {
>>>> @@ -9521,6 +10418,7 @@ static int i40e_init_interrupt_scheme(struct i40e_pf *pf)
>>>> 				       I40E_FLAG_FD_SB_ENABLED	|
>>>> 				       I40E_FLAG_FD_ATR_ENABLED	|
>>>> 				       I40E_FLAG_VMDQ_ENABLED);
>>>> +			pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>>>
>>>> 			/* rework the queue expectations without MSIX */
>>>> 			i40e_determine_queue_usage(pf);
>>>> @@ -10263,9 +11161,13 @@ bool i40e_set_ntuple(struct i40e_pf *pf, netdev_features_t features)
>>>> 		/* Enable filters and mark for reset */
>>>> 		if (!(pf->flags & I40E_FLAG_FD_SB_ENABLED))
>>>> 			need_reset = true;
>>>> -		/* enable FD_SB only if there is MSI-X vector */
>>>> -		if (pf->num_fdsb_msix > 0)
>>>> +		/* enable FD_SB only if there is MSI-X vector and no cloud
>>>> +		 * filters exist
>>>> +		 */
>>>> +		if (pf->num_fdsb_msix > 0 && !pf->num_cloud_filters) {
>>>> 			pf->flags |= I40E_FLAG_FD_SB_ENABLED;
>>>> +			pf->flags &= ~I40E_FLAG_FD_SB_INACTIVE;
>>>> +		}
>>>> 	} else {
>>>> 		/* turn off filters, mark for reset and clear SW filter list */
>>>> 		if (pf->flags & I40E_FLAG_FD_SB_ENABLED) {
>>>> @@ -10274,6 +11176,8 @@ bool i40e_set_ntuple(struct i40e_pf *pf, netdev_features_t features)
>>>> 		}
>>>> 		pf->flags &= ~(I40E_FLAG_FD_SB_ENABLED |
>>>> 			       I40E_FLAG_FD_SB_AUTO_DISABLED);
>>>> +		pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>>> +
>>>> 		/* reset fd counters */
>>>> 		pf->fd_add_err = 0;
>>>> 		pf->fd_atr_cnt = 0;
>>>> @@ -10857,7 +11761,8 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
>>>> 		netdev->hw_features |= NETIF_F_NTUPLE;
>>>> 	hw_features = hw_enc_features		|
>>>> 		      NETIF_F_HW_VLAN_CTAG_TX	|
>>>> -		      NETIF_F_HW_VLAN_CTAG_RX;
>>>> +		      NETIF_F_HW_VLAN_CTAG_RX	|
>>>> +		      NETIF_F_HW_TC;
>>>>
>>>> 	netdev->hw_features |= hw_features;
>>>>
>>>> @@ -12159,8 +13064,10 @@ static int i40e_setup_pf_switch(struct i40e_pf *pf, bool reinit)
>>>> 	*/
>>>>
>>>> 	if ((pf->hw.pf_id == 0) &&
>>>> -	    !(pf->flags & I40E_FLAG_TRUE_PROMISC_SUPPORT))
>>>> +	    !(pf->flags & I40E_FLAG_TRUE_PROMISC_SUPPORT)) {
>>>> 		flags = I40E_AQ_SET_SWITCH_CFG_PROMISC;
>>>> +		pf->last_sw_conf_flags = flags;
>>>> +	}
>>>>
>>>> 	if (pf->hw.pf_id == 0) {
>>>> 		u16 valid_flags;
>>>> @@ -12176,6 +13083,7 @@ static int i40e_setup_pf_switch(struct i40e_pf *pf, bool reinit)
>>>> 					     pf->hw.aq.asq_last_status));
>>>> 			/* not a fatal problem, just keep going */
>>>> 		}
>>>> +		pf->last_sw_conf_valid_flags = valid_flags;
>>>> 	}
>>>>
>>>> 	/* first time setup */
>>>> @@ -12273,6 +13181,7 @@ static void i40e_determine_queue_usage(struct i40e_pf *pf)
>>>> 			       I40E_FLAG_DCB_ENABLED	|
>>>> 			       I40E_FLAG_SRIOV_ENABLED	|
>>>> 			       I40E_FLAG_VMDQ_ENABLED);
>>>> +		pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>>> 	} else if (!(pf->flags & (I40E_FLAG_RSS_ENABLED |
>>>> 				  I40E_FLAG_FD_SB_ENABLED |
>>>> 				  I40E_FLAG_FD_ATR_ENABLED |
>>>> @@ -12287,6 +13196,7 @@ static void i40e_determine_queue_usage(struct i40e_pf *pf)
>>>> 			       I40E_FLAG_FD_ATR_ENABLED	|
>>>> 			       I40E_FLAG_DCB_ENABLED	|
>>>> 			       I40E_FLAG_VMDQ_ENABLED);
>>>> +		pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>>> 	} else {
>>>> 		/* Not enough queues for all TCs */
>>>> 		if ((pf->flags & I40E_FLAG_DCB_CAPABLE) &&
>>>> @@ -12310,6 +13220,7 @@ static void i40e_determine_queue_usage(struct i40e_pf *pf)
>>>> 			queues_left -= 1; /* save 1 queue for FD */
>>>> 		} else {
>>>> 			pf->flags &= ~I40E_FLAG_FD_SB_ENABLED;
>>>> +			pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>>> 			dev_info(&pf->pdev->dev, "not enough queues for Flow Director. Flow Director feature is disabled\n");
>>>> 		}
>>>> 	}
>>>> @@ -12613,7 +13524,7 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>>>> 		dev_warn(&pdev->dev, "This device is a pre-production adapter/LOM. Please be aware there may be issues with your hardware. If you are experiencing problems please contact your Intel or hardware representative who provided you with this hardware.\n");
>>>>
>>>> 	i40e_clear_pxe_mode(hw);
>>>> -	err = i40e_get_capabilities(pf);
>>>> +	err = i40e_get_capabilities(pf, i40e_aqc_opc_list_func_capabilities);
>>>> 	if (err)
>>>> 		goto err_adminq_setup;
>>>>
>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_prototype.h b/drivers/net/ethernet/intel/i40e/i40e_prototype.h
>>>> index 92869f5..3bb6659 100644
>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_prototype.h
>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_prototype.h
>>>> @@ -283,6 +283,22 @@ i40e_status i40e_aq_query_switch_comp_bw_config(struct i40e_hw *hw,
>>>> 		struct i40e_asq_cmd_details *cmd_details);
>>>> i40e_status i40e_aq_resume_port_tx(struct i40e_hw *hw,
>>>> 				   struct i40e_asq_cmd_details *cmd_details);
>>>> +i40e_status
>>>> +i40e_aq_add_cloud_filters_bb(struct i40e_hw *hw, u16 seid,
>>>> +			     struct i40e_aqc_cloud_filters_element_bb *filters,
>>>> +			     u8 filter_count);
>>>> +enum i40e_status_code
>>>> +i40e_aq_add_cloud_filters(struct i40e_hw *hw, u16 vsi,
>>>> +			  struct i40e_aqc_cloud_filters_element_data *filters,
>>>> +			  u8 filter_count);
>>>> +enum i40e_status_code
>>>> +i40e_aq_rem_cloud_filters(struct i40e_hw *hw, u16 vsi,
>>>> +			  struct i40e_aqc_cloud_filters_element_data *filters,
>>>> +			  u8 filter_count);
>>>> +i40e_status
>>>> +i40e_aq_rem_cloud_filters_bb(struct i40e_hw *hw, u16 seid,
>>>> +			     struct i40e_aqc_cloud_filters_element_bb *filters,
>>>> +			     u8 filter_count);
>>>> i40e_status i40e_read_lldp_cfg(struct i40e_hw *hw,
>>>> 			       struct i40e_lldp_variables *lldp_cfg);
>>>> /* i40e_common */
>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_type.h b/drivers/net/ethernet/intel/i40e/i40e_type.h
>>>> index c019f46..af38881 100644
>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_type.h
>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_type.h
>>>> @@ -287,6 +287,7 @@ struct i40e_hw_capabilities {
>>>> #define I40E_NVM_IMAGE_TYPE_MODE1	0x6
>>>> #define I40E_NVM_IMAGE_TYPE_MODE2	0x7
>>>> #define I40E_NVM_IMAGE_TYPE_MODE3	0x8
>>>> +#define I40E_SWITCH_MODE_MASK		0xF
>>>>
>>>> 	u32  management_mode;
>>>> 	u32  mng_protocols_over_mctp;
>>>> diff --git a/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h b/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
>>>> index b8c78bf..4fe27f0 100644
>>>> --- a/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
>>>> +++ b/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
>>>> @@ -1360,6 +1360,9 @@ struct i40e_aqc_cloud_filters_element_data {
>>>> 		struct {
>>>> 			u8 data[16];
>>>> 		} v6;
>>>> +		struct {
>>>> +			__le16 data[8];
>>>> +		} raw_v6;
>>>> 	} ipaddr;
>>>> 	__le16	flags;
>>>> #define I40E_AQC_ADD_CLOUD_FILTER_SHIFT			0
>>>>

^ permalink raw reply

* Re: [pull request][net 00/11] Mellanox, mlx5 fixes 2017-09-28
From: David Miller @ 2017-09-29  5:22 UTC (permalink / raw)
  To: saeedm; +Cc: netdev
In-Reply-To: <20170928044132.30940-1-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Thu, 28 Sep 2017 07:41:21 +0300

> This series provides misc fixes for the mlx5 driver.
> 
> Please pull and let me know if there's any problem.

Pulled.

> for -stable:
>   net/mlx5e: IPoIB, Fix access to invalid memory address (Kernels >= 4.12)

Queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net-next v9] openvswitch: enable NSH support
From: Yang, Yi @ 2017-09-29  6:40 UTC (permalink / raw)
  To: Pravin Shelar
  Cc: Jiri Benc, netdev@vger.kernel.org, dev@openvswitch.org, e@erig.me,
	davem@davemloft.net, Jan Scheurich
In-Reply-To: <CAOrHB_CQyokdTWeoj02RENPP5miq5Arx6goCNQ9ZPbUeTu_MeQ@mail.gmail.com>

On Fri, Sep 29, 2017 at 02:28:38AM +0800, Pravin Shelar wrote:
> On Tue, Sep 26, 2017 at 6:39 PM, Yang, Yi <yi.y.yang@intel.com> wrote:
> > On Tue, Sep 26, 2017 at 06:49:14PM +0800, Jiri Benc wrote:
> >> On Tue, 26 Sep 2017 12:55:39 +0800, Yang, Yi wrote:
> >> > After push_nsh, the packet won't be recirculated to the flow pipeline,
> >> > so key->eth.type must be set explicitly here. After pop_nsh, the packet
> >> > is recirculated and reparsed, so the packet parse function sets
> >> > key->eth.type and we needn't handle it in pop_nsh.
> >>
> >> This seems to be a very different approach than what we currently have.
> >> Looking at the code, the requirement after "destructive" actions such
> >> as pushing or popping headers is to recirculate.
> >
> > This is an optimization proposed by Jan Scheurich: recirculating after
> > push_nsh hurts performance, while recirculating after pop_nsh is
> > unavoidable. So also cc jan.scheurich@ericsson.com.
> >
> > Actually all the keys before push_nsh are still there after push_nsh, and
> > push_nsh has updated all the nsh keys, so recirculating remains avoidable.
> >
> 
> 
> We should keep the existing model for this patch. Later you can submit an
> optimization patch with specific use cases and performance
> improvements, so that we can evaluate code complexity and benefits.

Ok, I'll remove the below line in push_nsh and send out v11, thanks.

	key->eth.type = htons(ETH_P_NSH);

> 
> >>
> >> Setting key->eth.type to satisfy conditions in the output path without
> >> updating the rest of the key looks very hacky and fragile to me. There
> >> might be other conditions and dependencies that are not obvious.
> >> I don't think the code was written with such code path in mind.
> >>
> >> I'd like to hear what Pravin thinks about this.
> >>
> >>  Jiri

^ permalink raw reply

* Re: [Patch net-next] net_sched: use idr to allocate u32 filter handles
From: Simon Horman @ 2017-09-29  6:46 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers, Chris Mi, Jamal Hadi Salim
In-Reply-To: <CAM_iQpWbiA4QZb8j8fau7kMhzeLpRppEMcmB80byqTUryPheOw@mail.gmail.com>

On Thu, Sep 28, 2017 at 03:19:05PM -0700, Cong Wang wrote:
> On Thu, Sep 28, 2017 at 12:34 AM, Simon Horman
> <simon.horman@netronome.com> wrote:
> > Hi Cong,
> >
> > this looks like a nice enhancement to me. Did you measure any performance
> > benefit from it?  Perhaps it could be described in the changelog. I also
> > have a more detailed question below.
> 
> No, I was inspired by commit c15ab236d69d; I didn't measure it.

Perhaps it would be nice to note that in the changelog.

> >> ---
> >>  net/sched/cls_u32.c | 108 ++++++++++++++++++++++++++++++++--------------------
> >>  1 file changed, 67 insertions(+), 41 deletions(-)
> >>
> >> diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
> >> index 10b8d851fc6b..316b8a791b13 100644
> >> --- a/net/sched/cls_u32.c
> >> +++ b/net/sched/cls_u32.c
> >> @@ -46,6 +46,7 @@
> >
> > ...
> >
> >> @@ -937,22 +940,33 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
> >>                       return -EINVAL;
> >>               if (TC_U32_KEY(handle))
> >>                       return -EINVAL;
> >> -             if (handle == 0) {
> >> -                     handle = gen_new_htid(tp->data);
> >> -                     if (handle == 0)
> >> -                             return -ENOMEM;
> >> -             }
> >>               ht = kzalloc(sizeof(*ht) + divisor*sizeof(void *), GFP_KERNEL);
> >>               if (ht == NULL)
> >>                       return -ENOBUFS;
> >> +             if (handle == 0) {
> >> +                     handle = gen_new_htid(tp->data, ht);
> >> +                     if (handle == 0) {
> >> +                             kfree(ht);
> >> +                             return -ENOMEM;
> >> +                     }
> >> +             } else {
> >> +                     err = idr_alloc_ext(&tp_c->handle_idr, ht, NULL,
> >> +                                         handle, handle + 1, GFP_KERNEL);
> >> +                     if (err) {
> >> +                             kfree(ht);
> >> +                             return err;
> >> +                     }
> >
> > The above seems to check that handle is not already in use and mark it as
> > in use. But I don't see that logic in the code prior to this patch.
> > Am I missing something? If not perhaps this portion should be a separate
> > patch or described in the changelog.
> 
> The logic is in the upper layer, tc_ctl_tfilter(). It tries to get a
> filter by handle (if non-zero), and errors out if we are creating a new
> filter with the same handle.
> 
> At the point you quote above, 'n' is already NULL and 'handle' is non-zero,
> which means no existing filter has the same handle, so it is safe to just
> mark it as in-use.

Thanks for the clarification, that seems fine to me.
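
For anyone following along, here is a minimal userspace sketch of the
"reserve exactly this handle" idea behind the [handle, handle + 1)
range in the hunk above. It is only an illustration: toy_idr,
toy_idr_alloc() and toy_reserve_handle() are hypothetical stand-ins I
made up for this mail, not the kernel idr API, and the fixed-size array
is an assumption to keep it short.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_IDS 64u

/* Hypothetical userspace stand-in for an ID allocator; not the kernel idr. */
struct toy_idr {
	bool used[TOY_MAX_IDS];
};

/* Claim the lowest free ID in [start, end); return it, or -ENOSPC if none. */
static int toy_idr_alloc(struct toy_idr *idr, uint32_t start, uint32_t end)
{
	uint32_t id;

	if (end > TOY_MAX_IDS)
		end = TOY_MAX_IDS;
	for (id = start; id < end; id++) {
		if (!idr->used[id]) {
			idr->used[id] = true;
			return (int)id;
		}
	}
	return -ENOSPC;
}

/* Reserve exactly 'handle': the window [handle, handle + 1) holds one ID. */
static int toy_reserve_handle(struct toy_idr *idr, uint32_t handle)
{
	int id = toy_idr_alloc(idr, handle, handle + 1);

	return id < 0 ? id : 0;
}
```

With a one-element window the allocator can only hand back the
requested ID or fail, so a second reservation of the same handle is
rejected rather than silently getting a different ID.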

Reviewed-by: Simon Horman <simon.horman@netronome.com>

^ permalink raw reply

* Re: [net-next PATCH 0/5] New bpf cpumap type for XDP_REDIRECT
From: Jesper Dangaard Brouer @ 2017-09-29  6:53 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, jakub.kicinski, Michael S. Tsirkin, Jason Wang, mchan,
	John Fastabend, peter.waskiewicz.jr, Daniel Borkmann,
	Alexei Starovoitov, Andy Gospodarek, edumazet, brouer
In-Reply-To: <59CD7B94.8010103@iogearbox.net>

On Fri, 29 Sep 2017 00:45:40 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 09/28/2017 02:57 PM, Jesper Dangaard Brouer wrote:
> > Introducing a new way to redirect XDP frames.  Notice how no driver
> > changes are necessary given the design of XDP_REDIRECT.
> >
> > This redirect map type is called 'cpumap', as it allows redirecting
> > XDP frames to remote CPUs.  The remote CPU will do the SKB allocation
> > and start the network stack invocation on that CPU.
> >
> > This is a scalability and isolation mechanism that allows separating
> > the early driver network XDP layer from the rest of the netstack, and
> > assigning dedicated CPUs for this stage.  The sysadm controls/configures
> > the RX-CPU to NIC-RX queue mapping (as usual) via procfs smp_affinity, and
> > how many queues are configured via ethtool --set-channels.  Benchmarks
> > show that a single CPU can handle approx 11Mpps.  Thus, assigning only
> > two NIC RX-queues (and two CPUs) is sufficient for handling 10Gbit/s
> > wirespeed with smallest packets (14.88Mpps).  Reducing the number of
> > queues has the advantage that more packets are available in "bulk" per
> > hard interrupt[1].
> >
> > [1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf
> >
> > Use-cases:
> >
> > 1. End-host based pre-filtering for DDoS mitigation.  This is fast
> >     enough to allow software to see and filter all packets at wirespeed.
> >     Thus, no packets get silently dropped by hardware.
> >
> > 2. When NIC HW distributes packets unevenly across RX queues, this
> >     mechanism can be used to redistribute load across CPUs.  This
> >     usually happens when HW is unaware of a new protocol.  This
> >     resembles RPS (Receive Packet Steering), just faster, but with more
> >     responsibility placed on the BPF program for correct steering.
> >
> > 3. Auto-scaling or power saving by activating only the appropriate
> >     number of remote CPUs for handling the current load.  The cpumap
> >     tracepoints can function as a feedback loop for this purpose.
> 
> Interesting work, thanks! Still digesting the code a bit. I think
> it pretty much goes into the direction that Eric describes in his
> netdev paper quoted above; not on a generic level though but specific
> to XDP at least; theoretically XDP could just run transparently on
> the CPU doing the filtering, and raw buffers are handed to remote
> CPU with similar batching, but it would need some different config
> interface at minimum.

Good that you noticed this is implicitly implementing RX bulking, which
is where much of the performance gain comes from.

It is true, I am inspired by Eric's paper (I love it). Do notice that
this is not blocking or interfering with Eric's/others' continued work in
this area.  This implementation just shows that the idea from the "break
the pipe!" section works very well for XDP.

More on config knobs below.

> Shouldn't we take the CPU(s) running XDP on the RX queues out from
> the normal process scheduler, so that we have a guarantee that user
> space or unrelated kernel tasks cannot interfere with them anymore,
> and we could then turn them into busy polling eventually (e.g. as
> long as XDP is running there and once off could put them back into
> normal scheduling domain transparently)?

We should be careful not to invent networking config knobs that belong
to other parts of the kernel, like the scheduler.  We already have the
ability to control where IRQs land via procfs smp_affinity.  And if
you want CPU isolation, we can use the boot cmdline "isolcpus" (hint:
as DPDK recommends for zero-loss configs).  The userspace tool (or
sysadm) loading the XDP program is responsible for configuring the CPU
smp_affinity alignment.

Making NAPI busy-poll is out of scope for this patchset. Someone
should work on this separately.  It would just help/improve this kind
of scheme.

I actually think it would be more relevant to put the "remote" CPUs
in the 'cpumap' into a separate scheduler group, to implement stuff
like auto-scaling and power-saving.

> What about RPS/RFS in the sense that once you punt them to remote
> CPU, could we reuse application locality information so they'd end
> up on the right CPU in the first place (w/o backlog detour), or is
> the intent to rather disable it and have some own orchestration
> with relation to the CPU map?

An advanced bpf orchestration could basically implement what you
describe, combined with a userspace-side tool that pins applications
(taskset).  To know when a task can move between CPUs, you use the
tracepoints to see when the CPU queue is empty (hint, time_limit=true
and processed=0).

For now, I'm not targeting such advanced use-cases.  My main target is
a customer that has double-tagged VLANs; ixgbe cannot RSS-distribute
these, so they all end up on queue 0.  And as I demonstrated (in
another email) RPS is too slow to fix this.
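
To make the double-tagged VLAN case concrete, here is a rough userspace
sketch of the steering decision a BPF program could make before
redirecting into the cpumap.  It is not the BPF program itself and not
taken from the patchset: pick_cpu(), the STEER_* constants and the toy
hash over the IPv4 addresses are all assumptions chosen for
illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define STEER_ETH_HLEN   14u
#define STEER_VLAN_HLEN  4u
#define STEER_P_IP       0x0800u
#define STEER_P_8021Q    0x8100u
#define STEER_P_8021AD   0x88A8u

/* Read a big-endian 16-bit field from the frame. */
static uint16_t rd16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Read a big-endian 32-bit field from the frame. */
static uint32_t rd32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | p[3];
}

/*
 * Pick a destination CPU index for a frame, or return -1 meaning
 * "leave it on the receiving CPU" (not IPv4, or frame too short).
 */
static int pick_cpu(const uint8_t *frame, size_t len, uint32_t ncpus)
{
	size_t off = STEER_ETH_HLEN;
	uint16_t proto;
	uint32_t h;
	int tags;

	if (ncpus == 0 || len < STEER_ETH_HLEN)
		return -1;
	proto = rd16(frame + 12);

	/* Skip at most two VLAN tags (outer 802.1ad/802.1Q, inner 802.1Q). */
	for (tags = 0; tags < 2; tags++) {
		if (proto != STEER_P_8021Q && proto != STEER_P_8021AD)
			break;
		if (len < off + STEER_VLAN_HLEN)
			return -1;
		proto = rd16(frame + off + 2);
		off += STEER_VLAN_HLEN;
	}

	/* Need a full minimal IPv4 header to read the addresses. */
	if (proto != STEER_P_IP || len < off + 20)
		return -1;

	/* Toy hash over the IPv4 source and destination addresses. */
	h = rd32(frame + off + 12) ^ rd32(frame + off + 16);
	h ^= h >> 16;
	return (int)(h % ncpus);
}
```

The point is only that software can look past both tags and spread
flows over several CPUs, which is what the NIC's RSS fails to do here.
A real program would want a better hash and would also handle IPv6.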

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply
