Netdev List


* [pull request][net-next V2 00/18] Mellanox, mlx5 software managed steering
From: Saeed Mahameed @ 2019-09-03 20:04 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Alex Vesker, Erez Shitrit, Saeed Mahameed

Hi Dave,

This series adds the support for software (driver managed) flow steering.
For more information please see tag log below.

Please pull and let me know if there is any problem.

Please note that the series starts with a merge of mlx5-next branch,
to resolve and avoid dependency with rdma tree.

v2:
 - Improve return values transformation of the first patch.

Thanks,
Saeed.

---
The following changes since commit a06ebb8d953b4100236f3057be51d67640e06323:

  Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux (2019-09-02 00:16:05 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2019-09-01-v2

for you to fetch changes up to e890acd5ff18a0144967d0289869fe5f0415d399:

  net/mlx5: Add devlink flow_steering_mode parameter (2019-09-03 12:54:24 -0700)

----------------------------------------------------------------
mlx5-updates-2019-09-01  (Software steering support)

Abstract:
--------
Mellanox ConnectX devices support packet matching, packet modification and
redirection. These functionalities are also referred to as flow-steering.
To configure a steering rule, the rule is written to device-owned
memory; the device accesses and caches this memory when processing
a packet.
Steering rules are constructed from multiple steering entries (STE).

Rules are configured using the Firmware command interface. The Firmware
processes the given driver command and translates it to STEs, then
writes them to the device memory in the current steering tables.
This process is slow due to the architecture of the command interface and
the processing complexity of each rule.

The highlight of this patchset is to cut out the middleman (the firmware)
and program steering rules into the device directly from the driver, with
no firmware intervention whatsoever.

Motivation:
-----------
Software (driver managed) steering allows for high rule insertion rates
compared to the FW steering described above. This is achieved by using
internal RDMA writes to the device owned memory instead of the slow
command interface to program steering rules.

Software (driver managed) steering doesn't depend on new FW
for new steering functionality; new implementations can be done in the
driver, skipping the FW layer.

Performance:
------------
Using the new approach, a single core can program ~300K rules per second
(measured via a direct raw test of the new mlx5 sw steering layer,
without any kernel layer involved).

Test: TC L2 rules
33K/s with Software steering (this patchset).
5K/s  with FW and current driver.
This will improve OVS based solution performance.

Architecture and implementation details:
----------------------------------------
Software steering will be dynamically selected via devlink device
parameter. Example:
$ devlink dev param show pci/0000:06:00.0 name flow_steering_mode
          pci/0000:06:00.0:
          name flow_steering_mode type driver-specific
          values:
             cmode runtime value smfs

The mlx5 software steering module, a.k.a. DR (Direct Rule), is implemented
in the mlx5/core/steering directory and controlled by the
MLX5_SW_STEERING kconfig flag.

mlx5 core steering layer (fs_core) already provides a shim layer for
implementing different steering mechanisms, software steering will
leverage that as seen at the end of this series.

When Software Steering for a specific steering domain
(NIC/RDMA/Vport/ESwitch, etc ..) is supported, it will cause rules
targeting this domain to be created using SW steering instead of FW.

The implementation includes:
Domain - The steering domain is the object that all other objects reside
    in. It holds the memory allocator, send engine, locks and other shared
    data needed by lower objects such as table, matcher, rule, action.
    Each domain can contain multiple tables. Domain is equivalent to
    namespaces e.g. (NIC/RDMA/Vport/ESwitch, etc ..) as implemented
    currently in mlx5_core fs_core (flow steering core).

Table - Table objects are used for holding multiple matchers. Each table
    has a level used to prevent processing loops. Packets are directed
    to a table once it is set as the root table; this is done by fs_core
    using a FW command. A packet is processed inside the table matcher
    by matcher until a successful hit; otherwise the packet performs
    the default action.

Matcher - Matcher objects are used to specify the fields mask for
    matching when processing a packet. A matcher belongs to a table. Each
    matcher can hold multiple rules, each rule with different matching
    values corresponding to the matcher mask. Each matcher has a priority
    used for rule processing order inside the table.

Action - Action objects are created to specify different steering actions
    such as count, reformat (encapsulate, decapsulate, ...), modify
    header, forward to table and many other actions. When creating a rule
    a sequence of actions can be provided to be executed on a successful
    match.

Rule - Rule objects are used to specify a specific match on packets as
    well as the actions that should be executed. A rule belongs to a
    matcher.

STE - This layer is used to hold the specific STE format for the device
    and to convert the requested rule to STEs. Each rule is constructed of
    an STE chain; multiple rules construct a steering graph. Each node in
    the graph is a hash table containing multiple STEs. The index of each
    STE in the hash table is calculated using a CRC32 hash function.

Memory pool - Used for managing and caching device owned memory for rule
    insertion. The memory is allocated using the DM (device memory) API.

Communication with device - Layer for standard RDMA operations using an
    RC QP to configure the device steering.

Command utility - This module holds all of the FW commands that are
    required for SW steering to function.
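
The Matcher and STE descriptions above can be sketched roughly as follows.
This is a minimal userspace illustration, not the driver code: KEY_LEN,
crc32_sketch(), apply_mask() and ste_index() are hypothetical names, the
CRC32 variant (reflected polynomial 0xEDB88320) is an assumption since the
text only says "a CRC32 hash function", and the power-of-two table size is
assumed for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KEY_LEN 8 /* hypothetical match-key size in bytes */

/* Bitwise CRC32 using the common reflected polynomial 0xEDB88320.
 * Assumption: the device's actual polynomial and seed may differ.
 */
static uint32_t crc32_sketch(const uint8_t *data, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= data[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Matcher step: keep only the packet bits selected by the matcher mask. */
static void apply_mask(const uint8_t *pkt, const uint8_t *mask, uint8_t *out)
{
	for (size_t i = 0; i < KEY_LEN; i++)
		out[i] = pkt[i] & mask[i];
}

/* STE step: hash the masked key to a slot in a power-of-two hash table. */
static uint32_t ste_index(const uint8_t *pkt, const uint8_t *mask,
			  uint32_t table_size)
{
	uint8_t key[KEY_LEN];

	assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
	apply_mask(pkt, mask, key);
	return crc32_sketch(key, KEY_LEN) & (table_size - 1);
}
```

In this sketch, two packets that differ only in bytes the mask clears land
in the same hash slot, which is why the mask belongs to the matcher while
the values belong to each rule.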

Patch planning and files:
-------------------------
1) The first patch adds support for flow steering actions to the fs_cmd
shim layer.

2) The next 12 patches add one file per Software steering
functionality/module as described above. (See patches with title: DR, *)

3) Add CONFIG_MLX5_SW_STEERING for software steering support and enable
build with the new files

4) Next two patches will add the support for software steering in mlx5
steering shim layer
net/mlx5: Add API to set the namespace steering mode
net/mlx5: Add direct rule fs_cmd implementation

5) The last two patches add the new devlink parameter to select the mlx5
steering mode; it is valid only in switchdev mode for now.
Two modes are supported:
    1. DMFS - Device managed flow steering
    2. SMFS - Software/Driver managed flow steering.

    In the DMFS mode, the HW steering entities are created through the
    FW. In the SMFS mode these entities are created through the driver
    directly.

    The driver will use the devlink steering mode only if the steering
    domain supports it; for now SMFS will manage only the switchdev
    eswitch steering domain.

    User command examples:
    - Set SMFS flow steering mode::

        $ devlink dev param set pci/0000:06:00.0 name flow_steering_mode value "smfs" cmode runtime

    - Read device flow steering mode::

        $ devlink dev param show pci/0000:06:00.0 name flow_steering_mode
          pci/0000:06:00.0:
          name flow_steering_mode type driver-specific
          values:
             cmode runtime value smfs

----------------------------------------------------------------
Alex Vesker (13):
      net/mlx5: DR, Add the internal direct rule types definitions
      net/mlx5: DR, Add direct rule command utilities
      net/mlx5: DR, ICM pool memory allocator
      net/mlx5: DR, Expose an internal API to issue RDMA operations
      net/mlx5: DR, Add Steering entry (STE) utilities
      net/mlx5: DR, Expose steering domain functionality
      net/mlx5: DR, Expose steering table functionality
      net/mlx5: DR, Expose steering matcher functionality
      net/mlx5: DR, Expose steering action functionality
      net/mlx5: DR, Expose steering rule functionality
      net/mlx5: DR, Add required FW steering functionality
      net/mlx5: DR, Expose APIs for direct rule managing
      net/mlx5: DR, Add CONFIG_MLX5_SW_STEERING for software steering support

Maor Gottlieb (5):
      net/mlx5: Add flow steering actions to fs_cmd shim layer
      net/mlx5: Add direct rule fs_cmd implementation
      net/mlx5: Add API to set the namespace steering mode
      net/mlx5: Add support to use SMFS in switchdev mode
      net/mlx5: Add devlink flow_steering_mode parameter

 .../networking/device_drivers/mellanox/mlx5.rst    |   33 +
 drivers/infiniband/hw/mlx5/flow.c                  |   21 +-
 drivers/infiniband/hw/mlx5/main.c                  |    7 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h               |    5 +-
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig    |    7 +
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |    7 +
 drivers/net/ethernet/mellanox/mlx5/core/devlink.c  |  112 +-
 .../net/ethernet/mellanox/mlx5/core/en/tc_tun.c    |   27 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.h   |    2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c    |   46 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  |    7 +-
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c |   87 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c   |  116 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h   |   25 +
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  |  160 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h  |   39 +-
 .../ethernet/mellanox/mlx5/core/steering/Makefile  |    2 +
 .../mellanox/mlx5/core/steering/dr_action.c        | 1588 ++++++++++++++
 .../ethernet/mellanox/mlx5/core/steering/dr_cmd.c  |  480 ++++
 .../mellanox/mlx5/core/steering/dr_crc32.c         |   98 +
 .../mellanox/mlx5/core/steering/dr_domain.c        |  395 ++++
 .../ethernet/mellanox/mlx5/core/steering/dr_fw.c   |   93 +
 .../mellanox/mlx5/core/steering/dr_icm_pool.c      |  570 +++++
 .../mellanox/mlx5/core/steering/dr_matcher.c       |  770 +++++++
 .../ethernet/mellanox/mlx5/core/steering/dr_rule.c | 1243 +++++++++++
 .../ethernet/mellanox/mlx5/core/steering/dr_send.c |  976 +++++++++
 .../ethernet/mellanox/mlx5/core/steering/dr_ste.c  | 2308 ++++++++++++++++++++
 .../mellanox/mlx5/core/steering/dr_table.c         |  294 +++
 .../mellanox/mlx5/core/steering/dr_types.h         | 1060 +++++++++
 .../ethernet/mellanox/mlx5/core/steering/fs_dr.c   |  600 +++++
 .../ethernet/mellanox/mlx5/core/steering/fs_dr.h   |   60 +
 .../mellanox/mlx5/core/steering/mlx5_ifc_dr.h      |  604 +++++
 .../ethernet/mellanox/mlx5/core/steering/mlx5dr.h  |  212 ++
 include/linux/mlx5/fs.h                            |   33 +-
 34 files changed, 11967 insertions(+), 120 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/Makefile
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_action.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_cmd.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_crc32.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_domain.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_fw.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_icm_pool.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_matcher.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_rule.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_ste.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_table.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_types.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/fs_dr.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/fs_dr.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/mlx5_ifc_dr.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/mlx5dr.h


* Re: [PATCH v4 1/2] netfilter: Terminate rule eval if protocol=IPv6 and ipv6 module is disabled
From: Florian Westphal @ 2019-09-03 19:48 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Florian Westphal, Leonardo Bras, netfilter-devel, coreteam,
	bridge, netdev, linux-kernel, Jozsef Kadlecsik, Roopa Prabhu,
	Nikolay Aleksandrov, David S. Miller
In-Reply-To: <20190903193155.v74ws47zcn6zrwpr@salvia>

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > I was expecting we could find a way to handle this from br_netfilter
> > > alone itself.
> > 
> > We can't because we support ipv6 fib lookups from the netdev family
> > as well.
> > 
> > Alternative is to auto-accept ipv6 packets from the nf_tables eval loop,
> > but I think it's worse.
> 
> Could we add a restriction for nf_tables + br_netfilter + !ipv6? I
> mean, if this is an IPv6 packet, nf_tables is on and the IPv6 module is
> off, then drop this packet?

We could do that from nft_do_chain_netdev().


* Re: [PATCH] net: fixed_phy: Add forward declaration for struct gpio_desc;
From: Florian Fainelli @ 2019-09-03 19:46 UTC (permalink / raw)
  To: Moritz Fischer, netdev; +Cc: andrew, hkallweit1, linux-kernel, davem
In-Reply-To: <20190903184652.3148-1-mdf@kernel.org>

On 9/3/19 11:46 AM, Moritz Fischer wrote:
> Add forward declaration for struct gpio_desc in order to address
> the following:
> 
> ./include/linux/phy_fixed.h:48:17: error: 'struct gpio_desc' declared inside parameter list [-Werror]
> ./include/linux/phy_fixed.h:48:17: error: its scope is only this definition or declaration, which is probably not what you want [-Werror]
> 
> Fixes commit 71bd106d2567 ("net: fixed-phy: Add
> fixed_phy_register_with_gpiod() API")

There is a standardized tag for that which is:

Fixes: 71bd106d2567 ("net: fixed-phy: Add
fixed_phy_register_with_gpiod() API")

Other than that:

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

> Signed-off-by: Moritz Fischer <mdf@kernel.org>
> ---
>  include/linux/phy_fixed.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/phy_fixed.h b/include/linux/phy_fixed.h
> index 1e5d86ebdaeb..52bc8e487ef7 100644
> --- a/include/linux/phy_fixed.h
> +++ b/include/linux/phy_fixed.h
> @@ -11,6 +11,7 @@ struct fixed_phy_status {
>  };
>  
>  struct device_node;
> +struct gpio_desc;
>  
>  #if IS_ENABLED(CONFIG_FIXED_PHY)
>  extern int fixed_phy_change_carrier(struct net_device *dev, bool new_carrier);
> 


-- 
Florian
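
For reference, the warning quoted above comes from a struct tag whose first
mention is inside a prototype's parameter list, which gives it prototype
scope only. A minimal standalone sketch of the fix follows;
phy_register_sketch() and the struct body are hypothetical stand-ins, not
the real fixed_phy API.

```c
#include <stddef.h>

/* File-scope forward declaration. Without it, 'struct gpio_desc' would
 * first appear inside the parameter list below and be scoped to that
 * prototype only, which is what the compiler warning is about.
 */
struct gpio_desc;

/* Hypothetical stand-in for a prototype that takes a gpio descriptor. */
int phy_register_sketch(int irq, struct gpio_desc *gpiod);

/* A later full definition completes the same file-scope type. */
struct gpio_desc {
	int line;
};

int phy_register_sketch(int irq, struct gpio_desc *gpiod)
{
	/* Returns -1 when no descriptor is given, purely for illustration. */
	return gpiod ? irq + gpiod->line : -1;
}
```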


* Re: [PATCH] net-ipv6: fix excessive RTF_ADDRCONF flag on ::1/128 local route (and others)
From: David Ahern @ 2019-09-03 19:45 UTC (permalink / raw)
  To: Maciej Żenczykowski, Lorenzo Colitti; +Cc: David S . Miller, Linux NetDev
In-Reply-To: <CAHo-OoyQzJptNDcLe93o3-G10oRN+93ZZ35jKkLudSanvgn-2Q@mail.gmail.com>

On 9/3/19 6:17 AM, Maciej Żenczykowski wrote:
> Well, if you look at the commit my commit is fixing, ie.
>   commit c7a1ce397adacaf5d4bb2eab0a738b5f80dc3e43
> then you'll see this in the commit description:
>   "- dst_nocount is handled by the RTF_ADDRCONF flag"
> and the patch diff itself is from
>   "f6i->fib6_flags = RTF_UP | RTF_NONEXTHOP;
>    f6i->dst_nocount = true;"
> to
>   " .fc_flags = RTF_UP | RTF_ADDRCONF | RTF_NONEXTHOP,"
> 
> (and RTF_ANYCAST or RTF_LOCAL is later or'ed in in both versions of the code)
> 
> so I'm pretty sure that patch adds ADDRCONF unconditionally to that
> function, and my commit unconditionally removes it.
> 

exactly. It was shortsighted of me to add the ADDRCONF flag and removing
it reverts back to the previous behavior.

When I enable radvd, I do see the flag set when it should be and not for
other addresses. I believe the patch is correct.


* Re: [PATCH v4 1/2] netfilter: Terminate rule eval if protocol=IPv6 and ipv6 module is disabled
From: Pablo Neira Ayuso @ 2019-09-03 19:31 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Leonardo Bras, netfilter-devel, coreteam, bridge, netdev,
	linux-kernel, Jozsef Kadlecsik, Roopa Prabhu, Nikolay Aleksandrov,
	David S. Miller
In-Reply-To: <20190903170550.GA13660@breakpoint.cc>

On Tue, Sep 03, 2019 at 07:05:50PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > On Tue, Sep 03, 2019 at 01:46:50PM -0300, Leonardo Bras wrote:
> > > On Fri, 2019-08-30 at 22:58 +0200, Florian Westphal wrote:
> > > > Leonardo Bras <leonardo@linux.ibm.com> wrote:
> > > > > If IPv6 is disabled on boot (ipv6.disable=1), but nft_fib_inet ends up
> > > > > dealing with a IPv6 packet, it causes a kernel panic in
> > > > > fib6_node_lookup_1(), crashing in bad_page_fault.
> > > > > 
> > > > > The panic is caused by trying to dereference a very low address (0x38
> > > > > in ppc64le), due to ipv6.fib6_main_tbl = NULL.
> > > > > BUG: Kernel NULL pointer dereference at 0x00000038
> > > > > 
> > > > > The kernel panic was reproduced in a host that disabled IPv6 on boot and
> > > > > has to process guest packets (coming from a bridge) using its ip6tables.
> > > > > 
> > > > > Terminate rule evaluation when packet protocol is IPv6 but the ipv6 module
> > > > > is not loaded.
> > > > > 
> > > > > Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
> > > > 
> > > > Acked-by: Florian Westphal <fw@strlen.de>
> > > > 
> > > 
> > > Hello Pablo,
> > > 
> > > Any trouble with this patch? 
> > > I could see the other* one got applied, but not this one.
> > > *(The other did not get acked, so i released it alone as v5)
> > > 
> > > Is there any fix I need to do in this one?
> > 
> > Hm, I see, so this one:
> > 
> > https://patchwork.ozlabs.org/patch/1156100/
> > 
> > is not enough?
> 
> No, it's not.
> 
> > I was expecting we could find a way to handle this from br_netfilter
> > alone itself.
> 
> We can't because we support ipv6 fib lookups from the netdev family
> as well.
> 
> Alternative is to auto-accept ipv6 packets from the nf_tables eval loop,
> but I think it's worse.

Could we add a restriction for nf_tables + br_netfilter + !ipv6? I
mean, if this is an IPv6 packet, nf_tables is on and the IPv6 module is
off, then drop this packet?

By dropping the packet, the user could diagnose that their setup is
incomplete. I mean, if nf_tables fib ipv6 is used, then this setup is
really wrong and the user forgot to load the ipv6 module.


* RE: [PATCH net-next] net/ncsi: support unaligned payload size in NC-SI cmd handler
From: Justin.Lee1 @ 2019-09-03 19:21 UTC (permalink / raw)
  To: benwei, davem; +Cc: netdev, openbmc, sam, linux-kernel
In-Reply-To: <CH2PR15MB36869BE1AA44813CE293891BA3BE0@CH2PR15MB3686.namprd15.prod.outlook.com>

That is right. It is necessary to adjust the len for padding in both places.

Thanks,
Justin


> > > Update NC-SI command handler (both standard and OEM) to take into 
> > > account of payload paddings in allocating skb (in case of payload 
> > > size is not 32-bit aligned).
> > > 
> > > The checksum field follows payload field, without taking payload 
> > > padding into account can cause checksum being truncated, leading to 
> > > dropped packets.
> > > 
> > > Signed-off-by: Ben Wei <benwei@fb.com>
> >
> > If you have to align and add padding, I do not see where you are 
> > clearing out that padding memory to make sure it is initialized.
> >
> > You do comparisons with 'payload' but make adjustments to 'len'.
> >
> > The logic is very confusing.
> 
> Yes let me clarify a bit. 
> 
> In the code 'payload' is the exact NC-SI payload length, which goes into NC-SI packet header and needs to be actual unpadded payload length.
> 
> 'len' is used to allocate total NC-SI packet buffer (include padding). 
> 
> The original calculation of 'len' was done by summing up NCSI header + payload + checksum, without taking into account of possible padding, e.g.
> 
>         len += sizeof(struct ncsi_cmd_pkt_hdr) + 4;  /* 4 is the checksum size */
>         if (nca->payload < 26)
>                 len += 26;
>         else
>                 len += nca->payload;
>         /* Allocate skb */
>         skb = alloc_skb(len, GFP_ATOMIC);
> 
> This works today for all standard NC-SI commands (in spec v1.1) because all standard commands have payload size < 26, and packet size is then set to minimum of 46 (16 hdr + 26 payload + 4 cksum) bytes.
> 
> And mem clearing is done in each of the standard cmd handlers, e.g. 
> ncsi_cmd_handler_sp, ncsi_cmd_handler_ae.
> 
> 
> 
> The problem occurs if payload >= 26 and is unaligned.  This could happen on some OEM commands, and I see this happening when we carry PLDM traffic over NC-SI packet. 
> (PLDM header being 3 bytes and payload size can be large) 
> 
> The skb allocated would be too small, and later when checksum is calculated and written:
> 
> 	pchecksum = (__be32 *)((void *)h + sizeof(struct ncsi_pkt_hdr) +
> 		    ALIGN(nca->payload, 4));
> 	*pchecksum = htonl(checksum);
> 
> Part of the checksum would fall outside of our allocated buffer.
> 
> PLDM over NC-SI and OEM NC-SI commands are currently handled in
> 
> @@ -213,17 +213,22 @@ static int ncsi_cmd_handler_oem(struct sk_buff *skb,
> 
> So here I ensure the skb allocation takes padding into account, and do the initial mem clearing to set the padding bytes
> 
> +       unsigned short payload = ALIGN(nca->payload, 4);
> 
>         len = sizeof(struct ncsi_cmd_pkt_hdr) + 4;
> -       if (nca->payload < 26)
> +       if (payload < 26)
>                 len += 26;
>         else
> -               len += nca->payload;
> +               len += payload;
> 
>         cmd = skb_put_zero(skb, len);
>         memcpy(&cmd->mfr_id, nca->data, nca->payload);
> 
> So in this patch I updated both standard command handler (in case future spec updates adds commands with payload >= 26) and OEM/generic command handler to support unaligned payload size.  
> 
> Regards,
> -Ben
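
The length arithmetic discussed above can be checked with a small
standalone sketch. It assumes the sizes quoted in the thread (16-byte
header, 4-byte checksum, 26-byte minimum payload); the macro and function
names are hypothetical and this is not the ncsi-cmd.c code itself.

```c
#define NCSI_HDR_LEN     16u /* header size as quoted in the thread */
#define NCSI_CSUM_LEN     4u /* checksum size */
#define NCSI_MIN_PAYLOAD 26u /* minimum payload area */

/* Round up to a multiple of 4, like the kernel's ALIGN(x, 4). */
#define ALIGN4(x) (((x) + 3u) & ~3u)

/* Original sizing: the unpadded payload length is added to len. */
static unsigned int ncsi_len_unpadded(unsigned int payload)
{
	return NCSI_HDR_LEN + NCSI_CSUM_LEN +
	       (payload < NCSI_MIN_PAYLOAD ? NCSI_MIN_PAYLOAD : payload);
}

/* Patched sizing: the payload is padded to 32 bits before it is added. */
static unsigned int ncsi_len_padded(unsigned int payload)
{
	unsigned int padded = ALIGN4(payload);

	return NCSI_HDR_LEN + NCSI_CSUM_LEN +
	       (padded < NCSI_MIN_PAYLOAD ? NCSI_MIN_PAYLOAD : padded);
}

/* Offset one past the last byte written by the checksum store. */
static unsigned int ncsi_csum_end(unsigned int payload)
{
	return NCSI_HDR_LEN + ALIGN4(payload) + NCSI_CSUM_LEN;
}
```

With a 27-byte payload, the unpadded length is 47 bytes while the checksum
store ends at byte 48, so one byte of the checksum falls outside the
buffer; the padded length is 48 and the store fits.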


* [PATCH v2 net-next 2/2] i40e: Implement debug macro hw_dbg using dev_dbg
From: Mauro S. M. Rodrigues @ 2019-09-03 19:20 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: intel-wired-lan, netdev, davem, Andrew Bowers, Jakub Kicinski,
	maurosr
In-Reply-To: <20190903192021.25789-1-maurosr@linux.vnet.ibm.com>

There are several uses of hw_dbg in the code, producing no output. This
patch implements it using dev_dbg.

Initially the intention was to implement it using netdev_dbg, analogously
to what is done in ixgbe for instance. That approach was avoided due to
some early usages of hw_dbg, like i40e_pf_reset, before the vsi structure
initialization causing NULL pointer dereference during the driver probe if
the dbg messages were turned on as soon as the module is probed.

v2:
 - Use dev_dbg instead of pr_debug, and take advantage of dev_name
instead of crafting pretty much the same device name locally as suggested
by Jakub Kicinski.

Signed-off-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com>
---
 drivers/net/ethernet/intel/i40e/i40e_common.c | 1 +
 drivers/net/ethernet/intel/i40e/i40e_hmc.c    | 1 +
 drivers/net/ethernet/intel/i40e/i40e_osdep.h  | 5 ++++-
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c b/drivers/net/ethernet/intel/i40e/i40e_common.c
index 906cf68d3453..a51678166ff9 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright(c) 2013 - 2018 Intel Corporation. */
 
+#include "i40e.h"
 #include "i40e_type.h"
 #include "i40e_adminq.h"
 #include "i40e_prototype.h"
diff --git a/drivers/net/ethernet/intel/i40e/i40e_hmc.c b/drivers/net/ethernet/intel/i40e/i40e_hmc.c
index 19ce93d7fd0a..163ee8c6311c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_hmc.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_hmc.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright(c) 2013 - 2018 Intel Corporation. */
 
+#include "i40e.h"
 #include "i40e_osdep.h"
 #include "i40e_register.h"
 #include "i40e_status.h"
diff --git a/drivers/net/ethernet/intel/i40e/i40e_osdep.h b/drivers/net/ethernet/intel/i40e/i40e_osdep.h
index a07574bff550..c302ef2524f8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_osdep.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_osdep.h
@@ -18,7 +18,10 @@
  * actual OS primitives
  */
 
-#define hw_dbg(hw, S, A...)	do {} while (0)
+#define hw_dbg(hw, S, A...)							\
+do {										\
+	dev_dbg(&((struct i40e_pf *)hw->back)->pdev->dev, S, ##A);		\
+} while (0)
 
 #define wr32(a, reg, value)	writel((value), ((a)->hw_addr + (reg)))
 #define rd32(a, reg)		readl((a)->hw_addr + (reg))
-- 
2.21.0



* [PATCH v2 net-next 1/2] i40e: fix hw_dbg usage in i40e_hmc_get_object_va
From: Mauro S. M. Rodrigues @ 2019-09-03 19:20 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: intel-wired-lan, netdev, davem, Andrew Bowers, Jakub Kicinski,
	maurosr

The mentioned function references an i40e_hw attribute, as parameter for
hw_dbg, but it doesn't exist in the function scope.
Fix it by changing the parameter from i40e_hmc_info to i40e_hw, from which
the necessary i40e_hmc_info can be retrieved.

v2:
 - Fixed reverse xmas tree code style issue as suggested by Jakub Kicinski

Signed-off-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com>
---
 drivers/net/ethernet/intel/i40e/i40e_lan_hmc.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_lan_hmc.c b/drivers/net/ethernet/intel/i40e/i40e_lan_hmc.c
index 994011c38fb4..fd987f7ae6e5 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_lan_hmc.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_lan_hmc.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright(c) 2013 - 2018 Intel Corporation. */
 
+#include "i40e.h"
 #include "i40e_osdep.h"
 #include "i40e_register.h"
 #include "i40e_type.h"
@@ -963,7 +964,7 @@ static i40e_status i40e_set_hmc_context(u8 *context_bytes,
 
 /**
  * i40e_hmc_get_object_va - retrieves an object's virtual address
- * @hmc_info: pointer to i40e_hmc_info struct
+ * @hw: the hardware struct, from which we obtain the i40e_hmc_info pointer
  * @object_base: pointer to u64 to get the va
  * @rsrc_type: the hmc resource type
  * @obj_idx: hmc object index
@@ -972,16 +973,17 @@ static i40e_status i40e_set_hmc_context(u8 *context_bytes,
  * base pointer.  This function is used for LAN Queue contexts.
  **/
 static
-i40e_status i40e_hmc_get_object_va(struct i40e_hmc_info *hmc_info,
+i40e_status i40e_hmc_get_object_va(struct i40e_hw *hw,
 					u8 **object_base,
 					enum i40e_hmc_lan_rsrc_type rsrc_type,
 					u32 obj_idx)
 {
+	struct i40e_hmc_info *hmc_info = &hw->hmc;
 	u32 obj_offset_in_sd, obj_offset_in_pd;
-	i40e_status ret_code = 0;
 	struct i40e_hmc_sd_entry *sd_entry;
 	struct i40e_hmc_pd_entry *pd_entry;
 	u32 pd_idx, pd_lmt, rel_pd_idx;
+	i40e_status ret_code = 0;
 	u64 obj_offset_in_fpm;
 	u32 sd_idx, sd_lmt;
 
@@ -1047,7 +1049,7 @@ i40e_status i40e_clear_lan_tx_queue_context(struct i40e_hw *hw,
 	i40e_status err;
 	u8 *context_bytes;
 
-	err = i40e_hmc_get_object_va(&hw->hmc, &context_bytes,
+	err = i40e_hmc_get_object_va(hw, &context_bytes,
 				     I40E_HMC_LAN_TX, queue);
 	if (err < 0)
 		return err;
@@ -1068,7 +1070,7 @@ i40e_status i40e_set_lan_tx_queue_context(struct i40e_hw *hw,
 	i40e_status err;
 	u8 *context_bytes;
 
-	err = i40e_hmc_get_object_va(&hw->hmc, &context_bytes,
+	err = i40e_hmc_get_object_va(hw, &context_bytes,
 				     I40E_HMC_LAN_TX, queue);
 	if (err < 0)
 		return err;
@@ -1088,7 +1090,7 @@ i40e_status i40e_clear_lan_rx_queue_context(struct i40e_hw *hw,
 	i40e_status err;
 	u8 *context_bytes;
 
-	err = i40e_hmc_get_object_va(&hw->hmc, &context_bytes,
+	err = i40e_hmc_get_object_va(hw, &context_bytes,
 				     I40E_HMC_LAN_RX, queue);
 	if (err < 0)
 		return err;
@@ -1109,7 +1111,7 @@ i40e_status i40e_set_lan_rx_queue_context(struct i40e_hw *hw,
 	i40e_status err;
 	u8 *context_bytes;
 
-	err = i40e_hmc_get_object_va(&hw->hmc, &context_bytes,
+	err = i40e_hmc_get_object_va(hw, &context_bytes,
 				     I40E_HMC_LAN_RX, queue);
 	if (err < 0)
 		return err;
-- 
2.21.0



* BUG: using smp_processor_id() in preemptible [ADDR] code: mime_typevmnet0/NUM
From: syzbot @ 2019-09-03 19:04 UTC (permalink / raw)
  To: allison, davem, gregkh, linux-kernel, netdev, syzkaller-bugs,
	tglx

Hello,

syzbot found the following crash on:

HEAD commit:    eea17309 Merge branch 'i2c/for-current' of git://git.kerne..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=10b0622a600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=5cbaa3be0b36022f
dashboard link: https://syzkaller.appspot.com/bug?extid=55acd54b57bb4b3840a4
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+55acd54b57bb4b3840a4@syzkaller.appspotmail.com

Started in network mode
Own node identity ff030000000000000000000000000001, cluster identity 4711
BUG: using smp_processor_id() in preemptible [00000000] code: mime_typevmnet0/24282
caller is dst_cache_get+0x3d/0xb0 net/core/dst_cache.c:68
CPU: 0 PID: 24282 Comm: mime_typevmnet0 Not tainted 5.3.0-rc6+ #148
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  check_preemption_disabled lib/smp_processor_id.c:47 [inline]
  debug_smp_processor_id.cold+0x87/0x9d lib/smp_processor_id.c:57
  dst_cache_get+0x3d/0xb0 net/core/dst_cache.c:68
  tipc_udp_xmit.isra.0+0xc4/0xb80 net/tipc/udp_media.c:164
  tipc_udp_send_msg+0x3ea/0x490 net/tipc/udp_media.c:241
  tipc_bearer_xmit_skb+0x17e/0x370 net/tipc/bearer.c:503
  tipc_enable_bearer+0xacf/0xd30 net/tipc/bearer.c:328
  __tipc_nl_bearer_enable+0x2de/0x3a0 net/tipc/bearer.c:899
  tipc_nl_bearer_enable+0x23/0x40 net/tipc/bearer.c:907
  genl_family_rcv_msg+0x74b/0xf90 net/netlink/genetlink.c:629
  genl_rcv_msg+0xca/0x170 net/netlink/genetlink.c:654
  netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
  genl_rcv+0x29/0x40 net/netlink/genetlink.c:665
  netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
  netlink_unicast+0x531/0x710 net/netlink/af_netlink.c:1328
  netlink_sendmsg+0x8a5/0xd60 net/netlink/af_netlink.c:1917
  sock_sendmsg_nosec net/socket.c:637 [inline]
  sock_sendmsg+0xd7/0x130 net/socket.c:657
  ___sys_sendmsg+0x803/0x920 net/socket.c:2311
  __sys_sendmsg+0x105/0x1d0 net/socket.c:2356
  __do_sys_sendmsg net/socket.c:2365 [inline]
  __se_sys_sendmsg net/socket.c:2363 [inline]
  __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2363
  do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x459879
Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fe5fe7f1c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000459879
RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000003
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe5fe7f26d4
R13: 00000000004c753c R14: 00000000004dcce8 R15: 00000000ffffffff
Enabled bearer <udp:syz0>, priority 8


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.


* Re: [PATCH] net/skbuff: silence warnings under memory pressure
From: Michal Hocko @ 2019-09-03 18:53 UTC (permalink / raw)
  To: Qian Cai; +Cc: Eric Dumazet, davem, netdev, linux-mm, linux-kernel
In-Reply-To: <1567525342.5576.60.camel@lca.pw>

On Tue 03-09-19 11:42:22, Qian Cai wrote:
> On Tue, 2019-09-03 at 15:22 +0200, Michal Hocko wrote:
> > On Fri 30-08-19 18:15:22, Eric Dumazet wrote:
> > > If there is a risk of flooding the syslog, we should fix this generically
> > > in mm layer, not adding hundred of __GFP_NOWARN all over the places.
> > 
> > We do already ratelimit in warn_alloc. If it isn't sufficient then we
> > can think of different parameters. Or maybe it is the ratelimiting
> > which doesn't work here. Hard to tell and something to explore.
> 
> The time-based ratelimit won't work for skb_build() because when a system is
> under memory pressure, and the CPU is fast and IO is slow, it could take a long
> time to swap and trigger OOM.

I really do not understand what OOM and swapping have to do with
the ratelimiting here. The sole purpose of the ratelimit is to reduce
the amount of warnings to be printed. Slow IO might have an effect on
when the OOM killer is invoked but atomic allocations are not directly
dependent on IO.
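
For readers unfamiliar with time-based ratelimiting, here is a minimal
userspace sketch of the general interval/burst idea. It is not the code
behind warn_alloc; the struct, the function name and the time units are
all hypothetical, and the caller passes the current time explicitly.

```c
#include <stdbool.h>

/* Hypothetical userspace model of an interval/burst rate limiter. */
struct ratelimit {
	long interval;  /* window length, in arbitrary time units */
	int burst;      /* messages allowed per window */
	long begin;     /* start of the current window */
	int printed;    /* messages allowed so far in this window */
	int missed;     /* messages suppressed so far in this window */
	bool started;   /* false until the first call opens a window */
};

/* Return true if a warning issued at time `now` may be printed. */
static bool ratelimit_allow(struct ratelimit *rl, long now)
{
	if (!rl->started || now - rl->begin >= rl->interval) {
		/* Open a new window and reset the counters. */
		rl->started = true;
		rl->begin = now;
		rl->printed = 0;
		rl->missed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->missed++;
	return false;
}
```

In this model the number of printed warnings per window is bounded by
`burst` no matter how fast the failures arrive, which is the property
the paragraph above relies on.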

> I suppose what happens is those skb_build() allocations are from softirq, and
> once one of them failed, it calls printk() which generates more interrupts.
> Hence, the infinite loop.

Please elaborate more.

-- 
Michal Hocko
SUSE Labs


* [PATCH] net: fixed_phy: Add forward declaration for struct gpio_desc;
From: Moritz Fischer @ 2019-09-03 18:46 UTC (permalink / raw)
  To: netdev; +Cc: andrew, f.fainelli, hkallweit1, linux-kernel, davem,
	Moritz Fischer

Add forward declaration for struct gpio_desc in order to address
the following:

./include/linux/phy_fixed.h:48:17: error: 'struct gpio_desc' declared inside parameter list [-Werror]
./include/linux/phy_fixed.h:48:17: error: its scope is only this definition or declaration, which is probably not what you want [-Werror]

Fixes commit 71bd106d2567 ("net: fixed-phy: Add
fixed_phy_register_with_gpiod() API")
Signed-off-by: Moritz Fischer <mdf@kernel.org>
---
 include/linux/phy_fixed.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/phy_fixed.h b/include/linux/phy_fixed.h
index 1e5d86ebdaeb..52bc8e487ef7 100644
--- a/include/linux/phy_fixed.h
+++ b/include/linux/phy_fixed.h
@@ -11,6 +11,7 @@ struct fixed_phy_status {
 };
 
 struct device_node;
+struct gpio_desc;
 
 #if IS_ENABLED(CONFIG_FIXED_PHY)
 extern int fixed_phy_change_carrier(struct net_device *dev, bool new_carrier);
-- 
2.23.0.187.g17f5b7556c-goog
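
As an illustration of why the one-line forward declaration is enough,
here is a small standalone sketch. The function name
register_with_gpiod() is hypothetical and is not the kernel API; the
point is only that a prototype taking a `struct gpio_desc *` compiles
cleanly once the struct tag has been declared at file scope.

```c
#include <stddef.h>

/* Forward declarations: the types stay incomplete, which is all a pointer needs. */
struct device_node;
struct gpio_desc;

/* Hypothetical prototype; without the declarations above, the struct tags
 * would be scoped to this parameter list and trigger the quoted warning. */
int register_with_gpiod(struct device_node *np, struct gpio_desc *gpiod);

/* The definition only passes the pointers around and never dereferences them. */
int register_with_gpiod(struct device_node *np, struct gpio_desc *gpiod)
{
	(void)np;
	return gpiod != NULL;
}
```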



* Re: [PATCH bpf-next V9 1/3] bpf: new helper to obtain namespace data from current task
From: Carlos Antonio Neira Bustos @ 2019-09-03 18:45 UTC (permalink / raw)
  To: Yonghong Song
  Cc: netdev@vger.kernel.org, Eric Biederman, brouer@redhat.com,
	bpf@vger.kernel.org
In-Reply-To: <20190828210333.itwtyqa5w5egnrwm@ebpf-metal>

Hi Yonghong,

> > Yes, the samples/bpf test case can be removed.
> > Could you create a selftest with tracepoint net/netif_receive_skb, which
> > also uses the proposed helper? net/netif_receive_skb will happen in
> > interrupt context and it should catch the issue as well if
> > filename_lookup still gets called in interrupt context.
>
For this scenario I just created another selftest whose only difference
is that the tracepoint is net/netif_receive_skb, so it fails with -EPERM.
Is that enough?

I have added this comment in include/uapi/linux/bpf.h; maybe it is too terse?

struct bpf_pidns_info {
	__u32 dev;	/* dev_t from /proc/self/ns/pid inode */
	__u32 nsid;
	__u32 tgid;
	__u32 pid;
};

I only need these questions answered before I submit v11 of this patch.
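
For discussion, a userspace sketch of how the struct could be commented
and how the size check from the helper behaves. The struct name
pidns_info, the function fill_pidns_info() and the comments on nsid,
tgid and pid are assumptions made for illustration; only the comment on
dev comes from the patch.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for struct bpf_pidns_info. */
struct pidns_info {
	uint32_t dev;   /* dev_t from /proc/self/ns/pid inode */
	uint32_t nsid;  /* assumed: namespace id of the active pid namespace */
	uint32_t tgid;  /* assumed: tgid as seen from that namespace */
	uint32_t pid;   /* assumed: pid as seen from that namespace */
};

/* Reject any caller whose idea of the struct size differs from ours. */
static int fill_pidns_info(struct pidns_info *info, uint32_t size,
			   uint32_t tgid, uint32_t pid)
{
	if (size != sizeof(*info))
		return -EINVAL;

	memset(info, 0, sizeof(*info));
	info->tgid = tgid;
	info->pid = pid;
	return 0;
}
```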

Bests

On Wed, Aug 28, 2019 at 05:03:35PM -0400, Carlos Antonio Neira Bustos wrote:
> Thanks, I'll work on the net/netif_receive_skb selftest using this helper.
> I hope I could complete this work this week.
> 
> Bests.
> 
> On Wed, Aug 28, 2019 at 08:53:25PM +0000, Yonghong Song wrote:
> > 
> > 
> > On 8/28/19 1:39 PM, Carlos Antonio Neira Bustos wrote:
> > > Yonghong,
> > > 
> > > > Thanks for the pointer. I fixed this bug, but I found another one that is now
> > > > triggered by the test program I included in tools/testing/selftests/bpf/test_pidns.
> > > > It seems that fname was not correctly set up when passing it to filename_lookup.
> > > > This is fixed now and I'm doing some more testing.
> > > > I think I'll remove the tests in samples/bpf, as they mostly end in -EPERM as
> > > > the fix intended.
> > > > Is it OK to remove them and just focus on finishing the selftests code?
> > 
> > Yes, the samples/bpf test case can be removed.
> > > Could you create a selftest with tracepoint net/netif_receive_skb, which
> > > also uses the proposed helper? net/netif_receive_skb will happen in
> > > interrupt context and it should catch the issue as well if
> > > filename_lookup still gets called in interrupt context.
> > 
> > > 
> > > Bests
> > > 
> > > On Wed, Aug 14, 2019 at 01:25:06AM -0400, carlos antonio neira bustos wrote:
> > >> Thank you very much!
> > >>
> > >> Bests
> > >>
> > >> El mié., 14 de ago. de 2019 00:50, Yonghong Song <yhs@fb.com> escribió:
> > >>
> > >>>
> > >>>
> > >>> On 8/13/19 5:56 PM, Carlos Antonio Neira Bustos wrote:
> > >>>> On Tue, Aug 13, 2019 at 11:11:14PM +0000, Yonghong Song wrote:
> > >>>>>
> > >>>>>
> > >>>>> On 8/13/19 11:47 AM, Carlos Neira wrote:
> > >>>>>> From: Carlos <cneirabustos@gmail.com>
> > >>>>>>
> > >>>>>> New bpf helper bpf_get_current_pidns_info.
> > >>>>>> This helper obtains the active namespace from current and returns
> > >>>>>> pid, tgid, device and namespace id as seen from that namespace,
> > >>>>>> allowing to instrument a process inside a container.
> > >>>>>>
> > >>>>>> Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
> > >>>>>> ---
> > >>>>>>     fs/internal.h            |  2 --
> > >>>>>>     fs/namei.c               |  1 -
> > >>>>>>     include/linux/bpf.h      |  1 +
> > >>>>>>     include/linux/namei.h    |  4 +++
> > >>>>>>     include/uapi/linux/bpf.h | 31 ++++++++++++++++++++++-
> > >>>>>>     kernel/bpf/core.c        |  1 +
> > > >>>>>>     kernel/bpf/helpers.c     | 64 ++++++++++++++++++++++++++++++++++++++++++++++++
> > >>>>>>     kernel/trace/bpf_trace.c |  2 ++
> > >>>>>>     8 files changed, 102 insertions(+), 4 deletions(-)
> > >>>>>>
> > >>> [...]
> > >>>>>>
> > > >>>>>> +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
> > > >>>>>> +    size)
> > >>>>>> +{
> > >>>>>> +   const char *pidns_path = "/proc/self/ns/pid";
> > >>>>>> +   struct pid_namespace *pidns = NULL;
> > >>>>>> +   struct filename *tmp = NULL;
> > >>>>>> +   struct inode *inode;
> > >>>>>> +   struct path kp;
> > >>>>>> +   pid_t tgid = 0;
> > >>>>>> +   pid_t pid = 0;
> > >>>>>> +   int ret;
> > >>>>>> +   int len;
> > >>>>>
> > >>>>
> > >>>> Thank you very much for catching this!.
> > >>>> Could you share how to replicate this bug?.
> > >>>
> > >>> The config is attached. just run trace_ns_info and you
> > >>> can reproduce the issue.
> > >>>
> > >>>>
> > >>>>> I am running your sample program and get the following kernel bug:
> > >>>>>
> > >>>>> ...
> > > >>>>> [   26.414825] BUG: sleeping function called from invalid context at /data/users/yhs/work/net-next/fs/dcache.c:843
> > > >>>>> [   26.416314] in_atomic(): 1, irqs_disabled(): 0, pid: 1911, name: ping
> > > >>>>> [   26.417189] CPU: 0 PID: 1911 Comm: ping Tainted: G        W 5.3.0-rc1+ #280
> > > >>>>> [   26.418182] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014
> > >>>>> [   26.419393] Call Trace:
> > >>>>> [   26.419697]  <IRQ>
> > >>>>> [   26.419960]  dump_stack+0x46/0x5b
> > >>>>> [   26.420434]  ___might_sleep+0xe4/0x110
> > >>>>> [   26.420894]  dput+0x2a/0x200
> > >>>>> [   26.421265]  walk_component+0x10c/0x280
> > >>>>> [   26.421773]  link_path_walk+0x327/0x560
> > >>>>> [   26.422280]  ? proc_ns_dir_readdir+0x1a0/0x1a0
> > >>>>> [   26.422848]  ? path_init+0x232/0x330
> > >>>>> [   26.423364]  path_lookupat+0x88/0x200
> > >>>>> [   26.423808]  ? selinux_parse_skb.constprop.69+0x124/0x430
> > >>>>> [   26.424521]  filename_lookup+0xaf/0x190
> > >>>>> [   26.425031]  ? simple_attr_release+0x20/0x20
> > >>>>> [   26.425560]  bpf_get_current_pidns_info+0xfa/0x190
> > >>>>> [   26.426168]  bpf_prog_83627154cefed596+0xe66/0x1000
> > >>>>> [   26.426779]  trace_call_bpf+0xb5/0x160
> > >>>>> [   26.427317]  ? __netif_receive_skb_core+0x1/0xbb0
> > >>>>> [   26.427929]  ? __netif_receive_skb_core+0x1/0xbb0
> > >>>>> [   26.428496]  kprobe_perf_func+0x4d/0x280
> > >>>>> [   26.428986]  ? tracing_record_taskinfo_skip+0x1a/0x30
> > >>>>> [   26.429584]  ? tracing_record_taskinfo+0xe/0x80
> > >>>>> [   26.430152]  ? ttwu_do_wakeup.isra.114+0xcf/0xf0
> > >>>>> [   26.430737]  ? __netif_receive_skb_core+0x1/0xbb0
> > >>>>> [   26.431334]  ? __netif_receive_skb_core+0x5/0xbb0
> > >>>>> [   26.431930]  kprobe_ftrace_handler+0x90/0xf0
> > >>>>> [   26.432495]  ftrace_ops_assist_func+0x63/0x100
> > >>>>> [   26.433060]  0xffffffffc03180bf
> > >>>>> [   26.433471]  ? __netif_receive_skb_core+0x1/0xbb0
> > >>>>> ...
> > >>>>>
> > > >>>>> To prevent running in an arbitrary task (e.g., idle task)
> > > >>>>> context, which may introduce sleeping issues, the following
> > > >>>>> is probably appropriate:
> > >>>>>
> > >>>>>           if (in_nmi() || in_softirq())
> > >>>>>                   return -EPERM;
> > >>>>>
> > >>>>> Anyway, if in nmi or softirq, the namespace and pid/tgid
> > >>>>> we get may be just accidentally associated with the bpf running
> > >>>>> context, but it could be in a different context. So such info
> > > >>>>> is not reliable anyway.
> > >>>>>
> > >>>>>> +
> > >>>>>> +   if (unlikely(size != sizeof(struct bpf_pidns_info)))
> > >>>>>> +           return -EINVAL;
> > >>>>>> +   pidns = task_active_pid_ns(current);
> > >>> [...]
> > >>>


* Re: [BACKPORT 4.14.y 4/8] net: sctp: fix warning "NULL check before some freeing functions is not needed"
From: Greg KH @ 2019-09-03 18:33 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Baolin Wang, stable, vyasevich, nhorman, davem, hariprasad.kelam,
	linux-sctp, netdev, arnd, orsonzhai, vincent.guittot,
	linux-kernel
In-Reply-To: <20190903145206.GB3499@localhost.localdomain>

On Tue, Sep 03, 2019 at 11:52:06AM -0300, Marcelo Ricardo Leitner wrote:
> On Tue, Sep 03, 2019 at 02:58:16PM +0800, Baolin Wang wrote:
> > From: Hariprasad Kelam <hariprasad.kelam@gmail.com>
> > 
> > This patch removes NULL checks before calling kfree.
> > 
> > fixes below issues reported by coccicheck
> > net/sctp/sm_make_chunk.c:2586:3-8: WARNING: NULL check before some
> > freeing functions is not needed.
> > net/sctp/sm_make_chunk.c:2652:3-8: WARNING: NULL check before some
> > freeing functions is not needed.
> > net/sctp/sm_make_chunk.c:2667:3-8: WARNING: NULL check before some
> > freeing functions is not needed.
> > net/sctp/sm_make_chunk.c:2684:3-8: WARNING: NULL check before some
> > freeing functions is not needed.
> 
> Hi. This doesn't seem the kind of patch that should be backported to
> such old/stable releases. After all, it's just a cleanup.

I agree, this does not seem necessary _unless_ it is needed for a later
real fix.

thanks,

greg k-h


* Re: [PATCH 0/4] ethtool: implement Energy Detect Powerdown support via phy-tunable
From: John W. Linville @ 2019-09-03 17:50 UTC (permalink / raw)
  To: Alexandru Ardelean
  Cc: netdev, linux-kernel, andrew, f.fainelli, hkallweit1, davem
In-Reply-To: <20190903160626.7518-1-alexandru.ardelean@analog.com>

On Tue, Sep 03, 2019 at 07:06:22PM +0300, Alexandru Ardelean wrote:
> This patch series is actually 2 series in 1.
> 
> First 2 patches implement the kernel support for controlling Energy Detect
> Powerdown support via phy-tunable, and the next 2 patches implement the
> ethtool user-space control.
> Hopefully, this combination of 2 series is an acceptable approach; if not,
> I am fine to re-update it based on feedback.

I understand your reasoning, but do keep in mind that userland ethtool
and the kernel are managed in different git trees. Separate patchsets
would be preferable in general, although in some cases having an
initial userland implementation to show against proposed kernel
changes could be helpful.

It would not be unusual for someone to ask for changes on the kernel
patches. If that happens, just repost the kernel changes until you get
a final merge. Once that happens, then repost the userland patches as
a separate patchset. But I'll keep an eye here -- if Dave merges the
existing kernel patches as-is, I can take the already posted patches
(unless problems are found in code review).

John

> The `phy_tunable_id` has been named `ETHTOOL_PHY_EDPD` since it looks like
> this feature is common across other PHYs (like EEE), and defining
> `ETHTOOL_PHY_ENERGY_DETECT_POWER_DOWN` seems too long.
>     
> The way EDPD works, is that the RX block is put to a lower power mode,
> except for link-pulse detection circuits. The TX block is also put to low
> power mode, but the PHY wakes-up periodically to send link pulses, to avoid
> lock-ups in case the other side is also in EDPD mode.
>     
> Currently, there are 2 PHY drivers that look like they could use this new
> PHY tunable feature: the `adin` && `micrel` PHYs.
> 
> This series updates only the `adin` PHY driver to support this new feature,
> as this chip has been tested. A change for `micrel` can be proposed after a
> discussion of the PHY-tunable API is resolved.
> 
> -- 
> 2.20.1
> 
> 

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.


* Re: Is bug 200755 in anyone's queue??
From: Willem de Bruijn @ 2019-09-03 17:55 UTC (permalink / raw)
  Cc: Eric Dumazet, Steve Zabele, Network Development, shum,
	vladimir116, saifi.khan, Daniel Borkmann, on2k16nm,
	Stephen Hemminger
In-Reply-To: <CA+FuTSf4iLXh-+ADfBNxqcsw=u_vGm7Wsx7vchgwgwvGFYOA6w@mail.gmail.com>

On Fri, Aug 30, 2019 at 4:30 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Fri, Aug 30, 2019 at 4:54 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> >
> >
> > On 8/29/19 9:26 PM, Willem de Bruijn wrote:
> >
> > > SO_REUSEPORT was not intended to be used in this way. Opening
> > > multiple connected sockets with the same local port.
> > >
> > > But since the interface allowed connect after joining a group, and
> > > that is being used, I guess that point is moot. Still, I'm a bit
> > > surprised that it ever worked as described.
> > >
> > > Also note that the default distribution algorithm is not round robin
> > > assignment, but hash based. So multiple consecutive datagrams arriving
> > > at the same socket is not unexpected.
> > >
> > > I suspect that this quick hack might "work". It seemed to on the
> > > supplied .c file:
> > >
> > >                   score = compute_score(sk, net, saddr, sport,
> > >                                         daddr, hnum, dif, sdif);
> > >                   if (score > badness) {
> > >   -                       if (sk->sk_reuseport) {
> > >   +                       if (sk->sk_reuseport && !sk->sk_state !=
> > > TCP_ESTABLISHED) {
>
> This won't work for a mix of connected and connectionless sockets, of
> course (even ignoring the typo), as it only skips reuseport on the
> connected sockets.
>
> > >
> > > But a more robust approach, that also works on existing kernels, is to
> > > swap the default distribution algorithm with a custom BPF based one (
> > > SO_ATTACH_REUSEPORT_EBPF).
> > >
> >
> > Yes, I suspect that reuseport could still be used to load-balance incoming packets
> > targeting the same 4-tuple.
> >
> > So all sockets would have the same score, and we would select the first socket in
> > the list (if not applying reuseport hashing)
>
> Can you elaborate a bit?
>
> One option I see is to record in struct sock_reuseport if any port in
> the group is connected and, if so, don't return immediately on the
> first reuseport_select_sock hit, but continue the search for a higher
> scoring connected socket.
>
> Or do return immediately, but do this refined search in
> reuseport_select_sock itself, as it has a reference to all sockets in the
> group in sock_reuseport->socks[]. Instead of the straightforward hash.

That won't work, as reuseport_select_sock does not have access to
protocol specific data, notably inet_dport.

Unfortunately, what I've come up with so far is not concise and slows
down existing reuseport lookup in a busy port table slot. Note that it
is needed for both ipv4 and ipv6.

Do not break out of the port table slot early, but continue to search
for a higher scored match even after matching a reuseport:

"
   @@ -413,28 +413,39 @@ static struct sock *udp4_lib_lookup2(struct net *net,
                                     struct udp_hslot *hslot2,
                                     struct sk_buff *skb)
 {
+       struct sock *reuseport_result = NULL;
        struct sock *sk, *result;
+       int reuseport_score = 0;
        int score, badness;
        u32 hash = 0;

        result = NULL;
        badness = 0;
        udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
                score = compute_score(sk, net, saddr, sport,
                                      daddr, hnum, dif, sdif);
                if (score > badness) {
-                       if (sk->sk_reuseport) {
+                       if (sk->sk_reuseport &&
+                           sk->sk_state != TCP_ESTABLISHED &&
+                           !reuseport_result) {
                                hash = udp_ehashfn(net, daddr, hnum,
                                                   saddr, sport);
-                               result = reuseport_select_sock(sk, hash, skb,
+                               reuseport_result = reuseport_select_sock(sk, hash, skb,
                                                        sizeof(struct udphdr));
-                               if (result)
-                                       return result;
+                               if (reuseport_result)
+                                       reuseport_score = score;
+                               continue;
                        }
                        badness = score;
                        result = sk;
                }
        }
+
+       if (badness < reuseport_score)
+               result = reuseport_result;
+
        return result;
"

To break out after the first reuseport hit when it is safe, i.e., when
the group holds no connected sockets, requires adding this state to struct
sock_reuseport at __ip4_datagram_connect, and modifying
reuseport_select_sock to read it. At least, I have not found a more
elegant solution.
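
The control flow of the diff above can be modelled in plain userspace C.
This is only a sketch under simplifying assumptions: struct sock_model
and lookup2() are hypothetical names, the score is precomputed instead of
coming from compute_score(), and the reuseport group selection is reduced
to "pick the socket that matched".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields the lookup cares about. */
struct sock_model {
	int score;       /* precomputed match score; <= 0 means no match */
	bool reuseport;  /* socket is a member of a reuseport group */
	bool connected;  /* stands in for sk_state == TCP_ESTABLISHED */
};

/* Keep scanning the slot after a reuseport hit, so that a higher-scoring
 * connected socket can still win over the group selection. */
static const struct sock_model *lookup2(const struct sock_model *socks, size_t n)
{
	const struct sock_model *reuseport_result = NULL;
	const struct sock_model *result = NULL;
	int reuseport_score = 0;
	int badness = 0;

	for (size_t i = 0; i < n; i++) {
		const struct sock_model *sk = &socks[i];

		if (sk->score <= badness)
			continue;
		if (sk->reuseport && !sk->connected && !reuseport_result) {
			/* Model of the group pick: remember it, do not return yet. */
			reuseport_result = sk;
			reuseport_score = sk->score;
			continue;
		}
		badness = sk->score;
		result = sk;
	}

	/* Fall back to the group pick only if nothing scored higher. */
	if (badness < reuseport_score)
		result = reuseport_result;
	return result;
}
```

With one unconnected and one connected socket in the same group, the
connected one is returned when its score is higher; with only
unconnected sockets, the group pick is returned as before.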

> Steve, Re: your point on a scalable QUIC server. That is an
> interesting case certainly. Opening a connected socket per flow adds
> both memory and port table pressure. I once looked into an SO_TXONLY
> udp socket option that does not hash connected sockets into the port
> table. In effect receiving on a small set of listening sockets (e.g.,
> one per cpu) and sending over separate tx-only sockets. That still
> introduces unnecessary memory allocation. OTOH it amortizes some
> operations, such as route lookup.
>
> Anyway, that does not fix the immediate issue you reported when using
> SO_REUSEPORT as described.

As for the BPF program: good point on accessing the udp port when
skb->data is already beyond the header.

Programs of type sk_filter can use bpf_skb_load_bytes(_relative),
which I think will work, but I have not tested it.

As of kernel 4.19 programs of type BPF_PROG_TYPE_SK_REUSEPORT can be
attached (with CAP_SYS_ADMIN). See
tools/testing/selftests/bpf/progs/test_select_reuseport_kern.c for an
example that parses udp headers with bpf_skb_load_bytes.
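
As a rough userspace analogue of what such a program has to do once it
has loaded the header bytes, the sketch below reads the two port fields
from a raw UDP header. It is not BPF code and does not use any BPF
helper; struct udp_ports and parse_udp_ports() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Ports in host byte order after parsing. */
struct udp_ports {
	uint16_t source;
	uint16_t dest;
};

/* A UDP header is 8 bytes: source port, destination port, length and
 * checksum, each 16 bits wide and big-endian on the wire. */
static int parse_udp_ports(const uint8_t *hdr, size_t len, struct udp_ports *out)
{
	if (len < 8)
		return -1;  /* truncated header */

	out->source = (uint16_t)((hdr[0] << 8) | hdr[1]);
	out->dest = (uint16_t)((hdr[2] << 8) | hdr[3]);
	return 0;
}
```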


* Re: [PATCH net-next v3 2/3] dt-bindings: net: dsa: mt7530: Add support for port 5
From: Rob Herring @ 2019-09-03 17:28 UTC (permalink / raw)
  To: René van Dorst
  Cc: Sean Wang, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	David S . Miller, Matthias Brugger, netdev, linux-arm-kernel,
	linux-mediatek, Russell King, John Crispin, linux-mips,
	Frank Wunderlich, René van Dorst, devicetree
In-Reply-To: <20190902130226.26845-3-opensource@vdorst.com>

On Mon,  2 Sep 2019 15:02:25 +0200, René van Dorst wrote:
> MT7530 port 5 has many modes/configurations.
> Update the documentation how to use port 5.
> 
> Signed-off-by: René van Dorst <opensource@vdorst.com>
> Cc: devicetree@vger.kernel.org
> Cc: Rob Herring <robh@kernel.org>
> ---
> v2->v3:
> * Remove 'status = "okay";' lines, suggested by Rob Herring
> v1->v2:
> * Adding extra note about RGMII2 and gpio use.
> rfc->v1:
> * No change
> 
>  .../devicetree/bindings/net/dsa/mt7530.txt    | 214 ++++++++++++++++++
>  1 file changed, 214 insertions(+)
> 

Reviewed-by: Rob Herring <robh@kernel.org>


* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied
From: Willem de Bruijn @ 2019-09-03 17:24 UTC (permalink / raw)
  To: Shmulik Ladkani
  Cc: Willem de Bruijn, Daniel Borkmann, Eric Dumazet, netdev,
	Alexander Duyck, Alexei Starovoitov, Yonghong Song,
	Steffen Klassert, eyal
In-Reply-To: <20190903200312.7e0ec75e@pixies>

On Tue, Sep 3, 2019 at 1:03 PM Shmulik Ladkani <shmulik@metanetworks.com> wrote:
>
> On Tue, 3 Sep 2019 12:23:54 -0400
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
> > This is a lot more code change. Especially for stable fixes that need
> > to be backported, a smaller patch is preferable.
>
> Indeed. Thanks for the feedback.
>
> > My suggestion only tested the first frag_skb length. If a list can be
> > created where the first frag_skb is head_frag but a later one is not,
> > it will fail short. I kind of doubt that.
> >
> > By default skb_gro_receive builds GSO skbs that can be segmented
> > along the original gso_size boundaries. We have so far only observed
> > this issue when messing with gso_size.
>
> The rationale was based on inputs specified in 43170c4e0ba7, where a GRO
> skb has a fraglist with different amounts of payloads.
>
> > We can easily refine the test to fall back on to copying only if
> > skb_headlen(list_skb) != mss.
>
> I'm concerned this is too generic; innocent skbs may fall victim to our
> skb copy fallback. Probably those mentioned in 43170c4e0ba7.
>
> > Alternatively, only on SKB_GSO_DODGY is fine, too.
> >
> > I suggest we stick with the two-liner.
>
> OK.
> So let's refine your original condition, testing only the first
> frag_skb, but also ensuring SKB_GSO_DODGY *and* 'skb_headlen(list_skb) != mss'
> (we know existing code DOES work OK for unchanged gso_size, even if frags
> have linear, non head_frag, data).
>
> This hits the known, reproducible case of the mentioned BUG_ON, and is
> tightly scoped to that case.
>
> If that's agreed, I'll submit a proper patch.

Yep, that sounds good to me.


* Re: [PATCH v4 1/2] netfilter: Terminate rule eval if protocol=IPv6 and ipv6 module is disabled
From: Florian Westphal @ 2019-09-03 17:05 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Leonardo Bras, netfilter-devel, coreteam, bridge, netdev,
	linux-kernel, FlorianWestphal, Jozsef Kadlecsik, Roopa Prabhu,
	Nikolay Aleksandrov, David S. Miller
In-Reply-To: <20190903164948.kuvtpy7viqhcmp77@salvia>

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Tue, Sep 03, 2019 at 01:46:50PM -0300, Leonardo Bras wrote:
> > On Fri, 2019-08-30 at 22:58 +0200, Florian Westphal wrote:
> > > Leonardo Bras <leonardo@linux.ibm.com> wrote:
> > > > If IPv6 is disabled on boot (ipv6.disable=1), but nft_fib_inet ends up
> > > > dealing with a IPv6 packet, it causes a kernel panic in
> > > > fib6_node_lookup_1(), crashing in bad_page_fault.
> > > > 
> > > > The panic is caused by trying to dereference a very low address (0x38
> > > > in ppc64le), due to ipv6.fib6_main_tbl = NULL.
> > > > BUG: Kernel NULL pointer dereference at 0x00000038
> > > > 
> > > > The kernel panic was reproduced in a host that disabled IPv6 on boot and
> > > > has to process guest packets (coming from a bridge) using its ip6tables.
> > > > 
> > > > Terminate rule evaluation when packet protocol is IPv6 but the ipv6 module
> > > > is not loaded.
> > > > 
> > > > Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
> > > 
> > > Acked-by: Florian Westphal <fw@strlen.de>
> > > 
> > 
> > Hello Pablo,
> > 
> > Any trouble with this patch? 
> > I could see the other* one got applied, but not this one.
> > *(The other did not get acked, so i released it alone as v5)
> > 
> > Is there any fix I need to do in this one?
> 
> Hm, I see, so this one:
> 
> https://patchwork.ozlabs.org/patch/1156100/
> 
> is not enough?

No, it's not.

> I was expecting we could find a way to handle this from br_netfilter
> alone itself.

We can't because we support ipv6 fib lookups from the netdev family
as well.

The alternative is to auto-accept ipv6 packets from the nf_tables eval loop,
but I think it's worse.
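
To make the "terminate rule evaluation" behaviour concrete, here is a
tiny userspace model. The enums and fib_eval() are hypothetical and do
not mirror the nf_tables data structures; the flag ipv6_enabled stands
in for whatever ipv6_mod_enabled() would report.

```c
#include <stdbool.h>

/* Hypothetical verdicts: keep evaluating the rule, or stop evaluating it. */
enum verdict { VERDICT_CONTINUE, VERDICT_BREAK };
enum l3proto { PROTO_IPV4, PROTO_IPV6 };

/* Stop before any IPv6 fib lookup when the IPv6 stack is not available,
 * instead of dereferencing tables that were never set up. */
static enum verdict fib_eval(enum l3proto proto, bool ipv6_enabled)
{
	if (proto == PROTO_IPV6 && !ipv6_enabled)
		return VERDICT_BREAK;
	return VERDICT_CONTINUE;
}
```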


* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied
From: Shmulik Ladkani @ 2019-09-03 17:03 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Daniel Borkmann, Eric Dumazet, netdev, Alexander Duyck,
	Alexei Starovoitov, Yonghong Song, Steffen Klassert, eyal
In-Reply-To: <CA+FuTScE=pyopY=3f5E4JGx1zyGqT+XS+8ss13UN4if4TZ2NbA@mail.gmail.com>

On Tue, 3 Sep 2019 12:23:54 -0400
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> This is a lot more code change. Especially for stable fixes that need
> to be backported, a smaller patch is preferable.

Indeed. Thanks for the feedback.

> My suggestion only tested the first frag_skb length. If a list can be
> created where the first frag_skb is head_frag but a later one is not,
> it will fail short. I kind of doubt that.
> 
> By default skb_gro_receive builds GSO skbs that can be segmented
> along the original gso_size boundaries. We have so far only observed
> this issue when messing with gso_size.

The rationale was based on inputs specified in 43170c4e0ba7, where a GRO
skb has a fraglist with different amounts of payloads.

> We can easily refine the test to fall back on to copying only if
> skb_headlen(list_skb) != mss.

I'm concerned this is too generic; innocent skbs may fall victim to our
skb copy fallback. Probably those mentioned in 43170c4e0ba7.

> Alternatively, only on SKB_GSO_DODGY is fine, too.
> 
> I suggest we stick with the two-liner.

OK.
So let's refine your original condition, testing only the first
frag_skb, but also ensuring SKB_GSO_DODGY *and* 'skb_headlen(list_skb) != mss'
(we know existing code DOES work OK for unchanged gso_size, even if frags
have linear, non head_frag, data).

This hits the known, reproducible case of the mentioned BUG_ON, and is
tightly scoped to that case.

If that's agreed, I'll submit a proper patch.
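
A sketch of the refined test as a standalone predicate, for discussion
only. The names are hypothetical, GSO_DODGY_FLAG merely stands in for
SKB_GSO_DODGY, and the first clause (linear data that is not a
head_frag) is my reading of the "original condition", which is not
quoted in this mail.

```c
#include <stdbool.h>

#define GSO_DODGY_FLAG 0x2u  /* hypothetical stand-in for SKB_GSO_DODGY */

/* Hypothetical view of the first skb on the frag list. */
struct frag_model {
	unsigned int headlen;  /* linear bytes, as skb_headlen() would report */
	bool head_frag;        /* linear data lives in a page fragment */
};

/* Fall back to copying only for untrusted GSO skbs whose first frag_skb
 * carries linear, non head_frag data of a size other than mss. */
static bool needs_copy_fallback(unsigned int gso_type, unsigned int mss,
				const struct frag_model *first)
{
	return first->headlen > 0 && !first->head_frag &&
	       (gso_type & GSO_DODGY_FLAG) && first->headlen != mss;
}
```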

Best,
Shmulik


* Re: [PATCH v4 1/2] netfilter: Terminate rule eval if protocol=IPv6 and ipv6 module is disabled
From: Leonardo Bras @ 2019-09-03 16:56 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, coreteam, bridge, netdev, linux-kernel,
	FlorianWestphal, Jozsef Kadlecsik, Roopa Prabhu,
	Nikolay Aleksandrov, David S. Miller
In-Reply-To: <20190903164948.kuvtpy7viqhcmp77@salvia>


On Tue, 2019-09-03 at 18:49 +0200, Pablo Neira Ayuso wrote:
> On Tue, Sep 03, 2019 at 01:46:50PM -0300, Leonardo Bras wrote:
> > On Fri, 2019-08-30 at 22:58 +0200, Florian Westphal wrote:
> > Hello Pablo,
> > 
> > Any trouble with this patch? 
> > I could see the other* one got applied, but not this one.
> > *(The other did not get acked, so i released it alone as v5)
> > 
> > Is there any fix I need to do in this one?
> 
> Hm, I see, so this one:
> 
> https://patchwork.ozlabs.org/patch/1156100/
> 
> is not enough?

From what I could understand of Florian's e-mail, we would need both:

>> So, given I don't want to plaster ipv6_mod_enabled() everywhere, I
>> would suggest this course of action:
>>
>> 1. add a patch to BREAK in nft_fib_netdev.c for !ipv6_mod_enabled()
>> 2. change net/bridge/br_netfilter_hooks.c, br_nf_pre_routing() to
>>    make sure ipv6_mod_enabled() is true before doing the ipv6 stack
>>    "emulation".

Is that ok?

> 
> I was expecting we could find a way to handle this from br_netfilter
> alone itself.
> 
> Thanks.
Thank you!



* Re: [PATCH v4 1/2] netfilter: Terminate rule eval if protocol=IPv6 and ipv6 module is disabled
From: Pablo Neira Ayuso @ 2019-09-03 16:49 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: netfilter-devel, coreteam, bridge, netdev, linux-kernel,
	FlorianWestphal, Jozsef Kadlecsik, Roopa Prabhu,
	Nikolay Aleksandrov, David S. Miller
In-Reply-To: <99e3ef9c5ead1c95df697d49ab9cc83a95b0ac7c.camel@linux.ibm.com>

On Tue, Sep 03, 2019 at 01:46:50PM -0300, Leonardo Bras wrote:
> On Fri, 2019-08-30 at 22:58 +0200, Florian Westphal wrote:
> > Leonardo Bras <leonardo@linux.ibm.com> wrote:
> > > If IPv6 is disabled on boot (ipv6.disable=1), but nft_fib_inet ends up
> > > dealing with a IPv6 packet, it causes a kernel panic in
> > > fib6_node_lookup_1(), crashing in bad_page_fault.
> > > 
> > > The panic is caused by trying to dereference a very low address (0x38
> > > in ppc64le), due to ipv6.fib6_main_tbl = NULL.
> > > BUG: Kernel NULL pointer dereference at 0x00000038
> > > 
> > > The kernel panic was reproduced in a host that disabled IPv6 on boot and
> > > has to process guest packets (coming from a bridge) using its ip6tables.
> > > 
> > > Terminate rule evaluation when packet protocol is IPv6 but the ipv6 module
> > > is not loaded.
> > > 
> > > Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
> > 
> > Acked-by: Florian Westphal <fw@strlen.de>
> > 
> 
> Hello Pablo,
> 
> Any trouble with this patch? 
> I could see the other* one got applied, but not this one.
> *(The other did not get acked, so i released it alone as v5)
> 
> Is there any fix I need to do in this one?

Hm, I see, so this one:

https://patchwork.ozlabs.org/patch/1156100/

is not enough?

I was expecting we could find a way to handle this from br_netfilter
alone itself.

Thanks.


* Re: [PATCH v4 1/2] netfilter: Terminate rule eval if protocol=IPv6 and ipv6 module is disabled
From: Leonardo Bras @ 2019-09-03 16:46 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, coreteam, bridge, netdev, linux-kernel,
	FlorianWestphal, Jozsef Kadlecsik, Roopa Prabhu,
	Nikolay Aleksandrov, David S. Miller
In-Reply-To: <20190830205802.GS20113@breakpoint.cc>


On Fri, 2019-08-30 at 22:58 +0200, Florian Westphal wrote:
> Leonardo Bras <leonardo@linux.ibm.com> wrote:
> > If IPv6 is disabled on boot (ipv6.disable=1), but nft_fib_inet ends up
> > dealing with a IPv6 packet, it causes a kernel panic in
> > fib6_node_lookup_1(), crashing in bad_page_fault.
> > 
> > The panic is caused by trying to dereference a very low address (0x38
> > in ppc64le), due to ipv6.fib6_main_tbl = NULL.
> > BUG: Kernel NULL pointer dereference at 0x00000038
> > 
> > The kernel panic was reproduced in a host that disabled IPv6 on boot and
> > has to process guest packets (coming from a bridge) using its ip6tables.
> > 
> > Terminate rule evaluation when packet protocol is IPv6 but the ipv6 module
> > is not loaded.
> > 
> > Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
> 
> Acked-by: Florian Westphal <fw@strlen.de>
> 

Hello Pablo,

Any trouble with this patch? 
I could see the other* one got applied, but not this one.
*(The other did not get acked, so I released it alone as v5)

Is there any fix I need to do in this one?

Best regards,
Leonardo Bras



* [PATCH net-next,v2 3/4] net: flow_offload: mangle action at byte level
From: Pablo Neira Ayuso @ 2019-09-03 16:45 UTC (permalink / raw)
  To: netfilter-devel
  Cc: davem, netdev, jakub.kicinski, jiri, saeedm, vishal, vladbu
In-Reply-To: <20190903164513.15462-1-pablo@netfilter.org>

The flow mangle action was originally modeled after the tc pedit action,
which has a number of shortcomings:

1) The tc pedit offset must be set on 32-bit boundaries. Many protocol
   header field offsets are not aligned to 32 bits, e.g. destination
   port, source port and ethernet destination. This patch adjusts the
   offset accordingly and trims the length in these cases, so drivers
   get an exact offset and length for the header fields.

2) The maximum mangle length is one 32-bit word, hence you need up to
   four actions to mangle an IPv6 address. This patch coalesces
   consecutive tc pedit actions into a single action so drivers can
   configure the IPv6 mangling in one go. Ethernet address fields now
   require a single action instead of two, too.
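
As an illustration of the adjustment described in 1), the sketch below
trims the bytes of a 32-bit pedit word whose mask is zero and moves the
offset forward accordingly. It is a userspace approximation, not the code
in this patch: struct mangle_sketch, mangle_trim() and MANGLE_MAXLEN are
hypothetical names, and the mask is assumed to be already inverted (set
bits mark the bytes to rewrite) and laid out in network byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MANGLE_MAXLEN 16	/* assumed to mirror FLOW_ACTION_MANGLE_MAXLEN */

/* Hypothetical userspace stand-in for the byte-level mangle entry. */
struct mangle_sketch {
	uint32_t offset;
	uint8_t  len;
	uint8_t  mask[MANGLE_MAXLEN];
	uint8_t  val[MANGLE_MAXLEN];
};

/*
 * Drop leading and trailing bytes whose mask is zero, so that the
 * resulting offset and length cover only the bytes being rewritten.
 * Returns 0 on success, -1 if the input does not fit.
 */
static int mangle_trim(struct mangle_sketch *out, uint32_t offset,
		       const uint8_t *mask, const uint8_t *val, size_t len)
{
	size_t head = 0, tail = len;

	assert(out && mask && val);
	if (len > MANGLE_MAXLEN)
		return -1;

	/* Skip untouched bytes at the front. */
	while (head < len && mask[head] == 0)
		head++;
	/* Skip untouched bytes at the back. */
	while (tail > head && mask[tail - 1] == 0)
		tail--;

	memset(out, 0, sizeof(*out));
	out->offset = offset + (uint32_t)head;
	out->len = (uint8_t)(tail - head);
	memcpy(out->mask, mask + head, out->len);
	memcpy(out->val, val + head, out->len);
	return 0;
}
```

With this sketch, a TCP destination port rewrite that pedit expresses as
offset 0 with mask bytes 00 00 ff ff comes out as offset 2, length 2.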

The following drivers have been updated accordingly to use this new
mangle action layout:

1) The cxgb4 driver does not need to split protocol field matching
   larger than one 32-bit word into multiple definitions. Instead, a
   single definition per protocol field is enough. Checking for
   transport protocol ports is also simplified.

2) The mlx5 driver logic to disallow IPv4 ttl and IPv6 hoplimit fields
   becomes simpler too.

3) The nfp driver uses the nfp_fl_set_helper() function to configure the
   payload mangling. The memchr_inv() function is used to check for
   proper initialization of the value and mask. The driver has been
   updated to refer to the exact protocol header offsets too.
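
The byte-wise update and the value/mask sanity check mentioned in 3) can
be sketched in plain C as follows. This is a userspace approximation
under stated assumptions: all_zero() stands in for a NULL result from the
kernel's memchr_inv(buf, 0, len), and mangle_is_valid() and set_bytes()
are hypothetical names modeled on the nfp changes in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for memchr_inv(buf, 0, len) == NULL. */
static bool all_zero(const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (buf[i])
			return false;
	return true;
}

/* Reject a non-zero value that comes with an all-zero mask. */
static bool mangle_is_valid(const uint8_t *val, const uint8_t *mask,
			    size_t len)
{
	return !(!all_zero(val, len) && all_zero(mask, len));
}

/*
 * Merge a new value/mask pair into the accumulated one, byte by byte:
 * bytes selected by the mask are replaced, the others are kept.
 */
static void set_bytes(const uint8_t *val, const uint8_t *mask,
		      uint8_t *p_exact, uint8_t *p_mask, size_t len)
{
	assert(val && mask && p_exact && p_mask);

	for (size_t i = 0; i < len; i++) {
		p_exact[i] = (uint8_t)((p_exact[i] & ~mask[i]) | val[i]);
		p_mask[i] |= mask[i];
	}
}
```

Working on bytes avoids the unaligned 32-bit loads and stores that a
word-based helper needs, at the cost of a short loop.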

As a result, this patch reduces code complexity on the driver side at
the cost of adding ~100 LOC at the core to perform offset and length
adjustment and to coalesce consecutive actions.
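
The coalescing step can be pictured with the following userspace sketch.
It is not the code added to net/sched/cls_api.c: struct key32, struct run
and coalesce() are hypothetical names, and the sketch only assumes that
keys are merged while the header type matches, the offsets are contiguous
and the run stays within 16 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MANGLE_MAXLEN 16	/* assumed to mirror FLOW_ACTION_MANGLE_MAXLEN */

/* One 32-bit pedit-style key (hypothetical layout). */
struct key32 {
	uint8_t  htype;
	uint32_t offset;
	uint8_t  mask[4];
	uint8_t  val[4];
};

/* One coalesced byte-level action (hypothetical layout). */
struct run {
	uint8_t  htype;
	uint32_t offset;
	uint8_t  len;
	uint8_t  mask[MANGLE_MAXLEN];
	uint8_t  val[MANGLE_MAXLEN];
};

/* Merge contiguous keys into runs; returns the number of runs written. */
static size_t coalesce(const struct key32 *keys, size_t nkeys,
		       struct run *runs, size_t max_runs)
{
	size_t n = 0;

	assert(keys && runs);

	for (size_t k = 0; k < nkeys; k++) {
		struct run *cur = n ? &runs[n - 1] : NULL;

		/* Start a new run when the key cannot extend the current one. */
		if (!cur || cur->htype != keys[k].htype ||
		    cur->len + 4 > MANGLE_MAXLEN ||
		    cur->offset + cur->len != keys[k].offset) {
			if (n == max_runs)
				return n;
			cur = &runs[n++];
			memset(cur, 0, sizeof(*cur));
			cur->htype = keys[k].htype;
			cur->offset = keys[k].offset;
		}
		memcpy(&cur->mask[cur->len], keys[k].mask, 4);
		memcpy(&cur->val[cur->len], keys[k].val, 4);
		cur->len += 4;
	}
	return n;
}
```

Four keys at offsets 8, 12, 16 and 20 (an IPv6 source address) collapse
into one run of 16 bytes at offset 8, which is the single action a driver
would then see.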

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 .../net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c   | 162 +++++-----------
 .../net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.h   |  40 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c    |  90 +++------
 drivers/net/ethernet/netronome/nfp/flower/action.c | 203 ++++++++++-----------
 include/net/flow_offload.h                         |   7 +-
 net/sched/cls_api.c                                | 145 ++++++++++++---
 6 files changed, 309 insertions(+), 338 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
index 5afc15a60199..ba1ced08e41c 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
@@ -44,20 +44,12 @@
 #define STATS_CHECK_PERIOD (HZ / 2)
 
 static struct ch_tc_pedit_fields pedits[] = {
-	PEDIT_FIELDS(ETH_, DMAC_31_0, 4, dmac, 0),
-	PEDIT_FIELDS(ETH_, DMAC_47_32, 2, dmac, 4),
-	PEDIT_FIELDS(ETH_, SMAC_15_0, 2, smac, 0),
-	PEDIT_FIELDS(ETH_, SMAC_47_16, 4, smac, 2),
+	PEDIT_FIELDS(ETH_, DMAC, 6, dmac, 0),
+	PEDIT_FIELDS(ETH_, SMAC, 6, smac, 0),
 	PEDIT_FIELDS(IP4_, SRC, 4, nat_fip, 0),
 	PEDIT_FIELDS(IP4_, DST, 4, nat_lip, 0),
-	PEDIT_FIELDS(IP6_, SRC_31_0, 4, nat_fip, 0),
-	PEDIT_FIELDS(IP6_, SRC_63_32, 4, nat_fip, 4),
-	PEDIT_FIELDS(IP6_, SRC_95_64, 4, nat_fip, 8),
-	PEDIT_FIELDS(IP6_, SRC_127_96, 4, nat_fip, 12),
-	PEDIT_FIELDS(IP6_, DST_31_0, 4, nat_lip, 0),
-	PEDIT_FIELDS(IP6_, DST_63_32, 4, nat_lip, 4),
-	PEDIT_FIELDS(IP6_, DST_95_64, 4, nat_lip, 8),
-	PEDIT_FIELDS(IP6_, DST_127_96, 4, nat_lip, 12),
+	PEDIT_FIELDS(IP6_, SRC, 16, nat_fip, 0),
+	PEDIT_FIELDS(IP6_, DST, 16, nat_lip, 0),
 	PEDIT_FIELDS(TCP_, SPORT, 2, nat_fport, 0),
 	PEDIT_FIELDS(TCP_, DPORT, 2, nat_lport, 0),
 	PEDIT_FIELDS(UDP_, SPORT, 2, nat_fport, 0),
@@ -272,8 +264,8 @@ static int cxgb4_validate_flow_match(struct net_device *dev,
 	return 0;
 }
 
-static void offload_pedit(struct ch_filter_specification *fs, u32 val, u32 mask,
-			  u8 field)
+static void offload_pedit(struct ch_filter_specification *fs,
+			  struct flow_action_entry *act, u8 field)
 {
 	u32 offset = 0;
 	u8 size = 1;
@@ -286,92 +278,68 @@ static void offload_pedit(struct ch_filter_specification *fs, u32 val, u32 mask,
 			break;
 		}
 	}
-	memcpy((u8 *)fs + offset, &val, size);
+	memcpy((u8 *)fs + offset, act->mangle.val, size);
 }
 
-static void process_pedit_field(struct ch_filter_specification *fs, u32 val,
-				u32 mask, u32 offset, u8 htype)
+static void process_pedit_field(struct ch_filter_specification *fs,
+				struct flow_action_entry *act)
 {
+	u32 offset = act->mangle.offset;
+	u8 htype = act->mangle.htype;
+
 	switch (htype) {
 	case FLOW_ACT_MANGLE_HDR_TYPE_ETH:
 		switch (offset) {
-		case PEDIT_ETH_DMAC_31_0:
+		case PEDIT_ETH_DMAC:
 			fs->newdmac = 1;
-			offload_pedit(fs, val, mask, ETH_DMAC_31_0);
-			break;
-		case PEDIT_ETH_DMAC_47_32_SMAC_15_0:
-			if (mask & PEDIT_ETH_DMAC_MASK)
-				offload_pedit(fs, val, mask, ETH_DMAC_47_32);
-			else
-				offload_pedit(fs, val >> 16, mask >> 16,
-					      ETH_SMAC_15_0);
+			offload_pedit(fs, act, ETH_DMAC);
 			break;
-		case PEDIT_ETH_SMAC_47_16:
+		case PEDIT_ETH_SMAC:
 			fs->newsmac = 1;
-			offload_pedit(fs, val, mask, ETH_SMAC_47_16);
+			offload_pedit(fs, act, ETH_SMAC);
+			break;
 		}
 		break;
 	case FLOW_ACT_MANGLE_HDR_TYPE_IP4:
 		switch (offset) {
 		case PEDIT_IP4_SRC:
-			offload_pedit(fs, val, mask, IP4_SRC);
+			offload_pedit(fs, act, IP4_SRC);
 			break;
 		case PEDIT_IP4_DST:
-			offload_pedit(fs, val, mask, IP4_DST);
+			offload_pedit(fs, act, IP4_DST);
 		}
 		fs->nat_mode = NAT_MODE_ALL;
 		break;
 	case FLOW_ACT_MANGLE_HDR_TYPE_IP6:
 		switch (offset) {
-		case PEDIT_IP6_SRC_31_0:
-			offload_pedit(fs, val, mask, IP6_SRC_31_0);
-			break;
-		case PEDIT_IP6_SRC_63_32:
-			offload_pedit(fs, val, mask, IP6_SRC_63_32);
-			break;
-		case PEDIT_IP6_SRC_95_64:
-			offload_pedit(fs, val, mask, IP6_SRC_95_64);
-			break;
-		case PEDIT_IP6_SRC_127_96:
-			offload_pedit(fs, val, mask, IP6_SRC_127_96);
-			break;
-		case PEDIT_IP6_DST_31_0:
-			offload_pedit(fs, val, mask, IP6_DST_31_0);
+		case PEDIT_IP6_SRC:
+			offload_pedit(fs, act, IP6_SRC);
 			break;
-		case PEDIT_IP6_DST_63_32:
-			offload_pedit(fs, val, mask, IP6_DST_63_32);
+		case PEDIT_IP6_DST:
+			offload_pedit(fs, act, IP6_DST);
 			break;
-		case PEDIT_IP6_DST_95_64:
-			offload_pedit(fs, val, mask, IP6_DST_95_64);
-			break;
-		case PEDIT_IP6_DST_127_96:
-			offload_pedit(fs, val, mask, IP6_DST_127_96);
 		}
 		fs->nat_mode = NAT_MODE_ALL;
 		break;
 	case FLOW_ACT_MANGLE_HDR_TYPE_TCP:
 		switch (offset) {
-		case PEDIT_TCP_SPORT_DPORT:
-			if (mask & PEDIT_TCP_UDP_SPORT_MASK)
-				offload_pedit(fs, cpu_to_be32(val) >> 16,
-					      cpu_to_be32(mask) >> 16,
-					      TCP_SPORT);
-			else
-				offload_pedit(fs, cpu_to_be32(val),
-					      cpu_to_be32(mask), TCP_DPORT);
+		case PEDIT_TCP_SPORT:
+			offload_pedit(fs, act, TCP_SPORT);
+			break;
+		case PEDIT_TCP_DPORT:
+			offload_pedit(fs, act, TCP_DPORT);
+			break;
 		}
 		fs->nat_mode = NAT_MODE_ALL;
 		break;
 	case FLOW_ACT_MANGLE_HDR_TYPE_UDP:
 		switch (offset) {
-		case PEDIT_UDP_SPORT_DPORT:
-			if (mask & PEDIT_TCP_UDP_SPORT_MASK)
-				offload_pedit(fs, cpu_to_be32(val) >> 16,
-					      cpu_to_be32(mask) >> 16,
-					      UDP_SPORT);
-			else
-				offload_pedit(fs, cpu_to_be32(val),
-					      cpu_to_be32(mask), UDP_DPORT);
+		case PEDIT_UDP_SPORT:
+			offload_pedit(fs, act, UDP_SPORT);
+			break;
+		case PEDIT_UDP_DPORT:
+			offload_pedit(fs, act, UDP_DPORT);
+			break;
 		}
 		fs->nat_mode = NAT_MODE_ALL;
 	}
@@ -424,17 +392,8 @@ static void cxgb4_process_flow_actions(struct net_device *in,
 			}
 			}
 			break;
-		case FLOW_ACTION_MANGLE: {
-			u32 mask, val, offset;
-			u8 htype;
-
-			htype = act->mangle.htype;
-			mask = act->mangle.mask;
-			val = act->mangle.val;
-			offset = act->mangle.offset;
-
-			process_pedit_field(fs, val, mask, offset, htype);
-			}
+		case FLOW_ACTION_MANGLE:
+			process_pedit_field(fs, act);
 			break;
 		default:
 			break;
@@ -442,35 +401,20 @@ static void cxgb4_process_flow_actions(struct net_device *in,
 	}
 }
 
-static bool valid_l4_mask(u32 mask)
-{
-	u16 hi, lo;
-
-	/* Either the upper 16-bits (SPORT) OR the lower
-	 * 16-bits (DPORT) can be set, but NOT BOTH.
-	 */
-	hi = (mask >> 16) & 0xFFFF;
-	lo = mask & 0xFFFF;
-
-	return hi && lo ? false : true;
-}
-
 static bool valid_pedit_action(struct net_device *dev,
 			       const struct flow_action_entry *act)
 {
-	u32 mask, offset;
+	u32 offset;
 	u8 htype;
 
 	htype = act->mangle.htype;
-	mask = act->mangle.mask;
 	offset = act->mangle.offset;
 
 	switch (htype) {
 	case FLOW_ACT_MANGLE_HDR_TYPE_ETH:
 		switch (offset) {
-		case PEDIT_ETH_DMAC_31_0:
-		case PEDIT_ETH_DMAC_47_32_SMAC_15_0:
-		case PEDIT_ETH_SMAC_47_16:
+		case PEDIT_ETH_DMAC:
+		case PEDIT_ETH_SMAC:
 			break;
 		default:
 			netdev_err(dev, "%s: Unsupported pedit field\n",
@@ -491,14 +435,8 @@ static bool valid_pedit_action(struct net_device *dev,
 		break;
 	case FLOW_ACT_MANGLE_HDR_TYPE_IP6:
 		switch (offset) {
-		case PEDIT_IP6_SRC_31_0:
-		case PEDIT_IP6_SRC_63_32:
-		case PEDIT_IP6_SRC_95_64:
-		case PEDIT_IP6_SRC_127_96:
-		case PEDIT_IP6_DST_31_0:
-		case PEDIT_IP6_DST_63_32:
-		case PEDIT_IP6_DST_95_64:
-		case PEDIT_IP6_DST_127_96:
+		case PEDIT_IP6_SRC:
+		case PEDIT_IP6_DST:
 			break;
 		default:
 			netdev_err(dev, "%s: Unsupported pedit field\n",
@@ -508,12 +446,8 @@ static bool valid_pedit_action(struct net_device *dev,
 		break;
 	case FLOW_ACT_MANGLE_HDR_TYPE_TCP:
 		switch (offset) {
-		case PEDIT_TCP_SPORT_DPORT:
-			if (!valid_l4_mask(mask)) {
-				netdev_err(dev, "%s: Unsupported mask for TCP L4 ports\n",
-					   __func__);
-				return false;
-			}
+		case PEDIT_TCP_SPORT:
+		case PEDIT_TCP_DPORT:
 			break;
 		default:
 			netdev_err(dev, "%s: Unsupported pedit field\n",
@@ -523,12 +457,8 @@ static bool valid_pedit_action(struct net_device *dev,
 		break;
 	case FLOW_ACT_MANGLE_HDR_TYPE_UDP:
 		switch (offset) {
-		case PEDIT_UDP_SPORT_DPORT:
-			if (!valid_l4_mask(mask)) {
-				netdev_err(dev, "%s: Unsupported mask for UDP L4 ports\n",
-					   __func__);
-				return false;
-			}
+		case PEDIT_UDP_SPORT:
+		case PEDIT_UDP_DPORT:
 			break;
 		default:
 			netdev_err(dev, "%s: Unsupported pedit field\n",
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.h
index eb4c95248baf..03892755a18f 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.h
@@ -55,23 +55,14 @@ struct ch_tc_flower_entry {
 };
 
 enum {
-	ETH_DMAC_31_0,	/* dmac bits 0.. 31 */
-	ETH_DMAC_47_32,	/* dmac bits 32..47 */
-	ETH_SMAC_15_0,	/* smac bits 0.. 15 */
-	ETH_SMAC_47_16,	/* smac bits 16..47 */
+	ETH_DMAC,	/* 48-bits dmac bits */
+	ETH_SMAC,	/* 48-bits smac bits */
 
 	IP4_SRC,	/* 32-bit IPv4 src  */
 	IP4_DST,	/* 32-bit IPv4 dst  */
 
-	IP6_SRC_31_0,	/* src bits 0..  31 */
-	IP6_SRC_63_32,	/* src bits 63.. 32 */
-	IP6_SRC_95_64,	/* src bits 95.. 64 */
-	IP6_SRC_127_96,	/* src bits 127..96 */
-
-	IP6_DST_31_0,	/* dst bits 0..  31 */
-	IP6_DST_63_32,	/* dst bits 63.. 32 */
-	IP6_DST_95_64,	/* dst bits 95.. 64 */
-	IP6_DST_127_96,	/* dst bits 127..96 */
+	IP6_SRC,	/* 128-bit IPv6 src */
+	IP6_DST,	/* 128-bit IPv6 dst */
 
 	TCP_SPORT,	/* 16-bit TCP sport */
 	TCP_DPORT,	/* 16-bit TCP dport */
@@ -90,23 +81,16 @@ struct ch_tc_pedit_fields {
 	{ type## field, size, \
 		offsetof(struct ch_filter_specification, fs_field) + (offset) }
 
-#define PEDIT_ETH_DMAC_MASK		0xffff
-#define PEDIT_TCP_UDP_SPORT_MASK	0xffff
-#define PEDIT_ETH_DMAC_31_0		0x0
-#define PEDIT_ETH_DMAC_47_32_SMAC_15_0	0x4
-#define PEDIT_ETH_SMAC_47_16		0x8
+#define PEDIT_ETH_DMAC			0x0
+#define PEDIT_ETH_SMAC			0x6
+#define PEDIT_IP6_SRC			0x8
+#define PEDIT_IP6_DST			0x18
 #define PEDIT_IP4_SRC			0xC
 #define PEDIT_IP4_DST			0x10
-#define PEDIT_IP6_SRC_31_0		0x8
-#define PEDIT_IP6_SRC_63_32		0xC
-#define PEDIT_IP6_SRC_95_64		0x10
-#define PEDIT_IP6_SRC_127_96		0x14
-#define PEDIT_IP6_DST_31_0		0x18
-#define PEDIT_IP6_DST_63_32		0x1C
-#define PEDIT_IP6_DST_95_64		0x20
-#define PEDIT_IP6_DST_127_96		0x24
-#define PEDIT_TCP_SPORT_DPORT		0x0
-#define PEDIT_UDP_SPORT_DPORT		0x0
+#define PEDIT_TCP_SPORT			0x0
+#define PEDIT_TCP_DPORT			0x2
+#define PEDIT_UDP_SPORT			0x0
+#define PEDIT_UDP_DPORT			0x2
 
 int cxgb4_tc_flower_replace(struct net_device *dev,
 			    struct flow_cls_offload *cls);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index f29895b3a947..b7b88bc22cf7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2201,19 +2201,24 @@ static int pedit_header_offsets[] = {
 
 #define pedit_header(_ph, _htype) ((void *)(_ph) + pedit_header_offsets[_htype])
 
-static int set_pedit_val(u8 hdr_type, u32 mask, u32 val, u32 offset,
+static int set_pedit_val(u8 hdr_type, const struct flow_action_entry *act,
 			 struct pedit_headers_action *hdrs)
 {
-	u32 *curr_pmask, *curr_pval;
+	u32 offset = act->mangle.offset;
+	u8 *curr_pmask, *curr_pval;
+	int i;
 
-	curr_pmask = (u32 *)(pedit_header(&hdrs->masks, hdr_type) + offset);
-	curr_pval  = (u32 *)(pedit_header(&hdrs->vals, hdr_type) + offset);
+	curr_pmask = (u8 *)(pedit_header(&hdrs->masks, hdr_type) + offset);
+	curr_pval  = (u8 *)(pedit_header(&hdrs->vals, hdr_type) + offset);
 
-	if (*curr_pmask & mask)  /* disallow acting twice on the same location */
-		goto out_err;
+	for (i = 0; i < act->mangle.len; i++) {
+		/* disallow acting twice on the same location */
+		if (curr_pmask[i] & act->mangle.mask[i])
+			goto out_err;
 
-	*curr_pmask |= mask;
-	*curr_pval  |= val;
+		curr_pmask[i] |= act->mangle.mask[i];
+		curr_pval[i] |= act->mangle.val[i];
+	}
 
 	return 0;
 
@@ -2487,7 +2492,6 @@ static int parse_tc_pedit_action(struct mlx5e_priv *priv,
 {
 	u8 cmd = (act->id == FLOW_ACTION_MANGLE) ? 0 : 1;
 	int err = -EOPNOTSUPP;
-	u32 mask, val, offset;
 	u8 htype;
 
 	htype = act->mangle.htype;
@@ -2504,11 +2508,7 @@ static int parse_tc_pedit_action(struct mlx5e_priv *priv,
 		goto out_err;
 	}
 
-	mask = act->mangle.mask;
-	val = act->mangle.val;
-	offset = act->mangle.offset;
-
-	err = set_pedit_val(htype, mask, val, offset, &hdrs[cmd]);
+	err = set_pedit_val(htype, act, &hdrs[cmd]);
 	if (err)
 		goto out_err;
 
@@ -2589,50 +2589,18 @@ static bool csum_offload_supported(struct mlx5e_priv *priv,
 	return true;
 }
 
-struct ip_ttl_word {
-	__u8	ttl;
-	__u8	protocol;
-	__sum16	check;
-};
-
-struct ipv6_hoplimit_word {
-	__be16	payload_len;
-	__u8	nexthdr;
-	__u8	hop_limit;
-};
-
 static bool is_action_keys_supported(const struct flow_action_entry *act)
 {
-	u32 mask, offset;
-	u8 htype;
+	u32 offset = act->mangle.offset;
+	u8 htype = act->mangle.htype;
 
-	htype = act->mangle.htype;
-	offset = act->mangle.offset;
-	mask = act->mangle.mask;
-	/* For IPv4 & IPv6 header check 4 byte word,
-	 * to determine that modified fields
-	 * are NOT ttl & hop_limit only.
-	 */
-	if (htype == FLOW_ACT_MANGLE_HDR_TYPE_IP4) {
-		struct ip_ttl_word *ttl_word =
-			(struct ip_ttl_word *)&mask;
-
-		if (offset != offsetof(struct iphdr, ttl) ||
-		    ttl_word->protocol ||
-		    ttl_word->check) {
-			return true;
-		}
-	} else if (htype == FLOW_ACT_MANGLE_HDR_TYPE_IP6) {
-		struct ipv6_hoplimit_word *hoplimit_word =
-			(struct ipv6_hoplimit_word *)&mask;
-
-		if (offset != offsetof(struct ipv6hdr, payload_len) ||
-		    hoplimit_word->payload_len ||
-		    hoplimit_word->nexthdr) {
-			return true;
-		}
-	}
-	return false;
+	if ((htype == FLOW_ACT_MANGLE_HDR_TYPE_IP4 &&
+	     offset == offsetof(struct iphdr, ttl)) ||
+	    (htype == FLOW_ACT_MANGLE_HDR_TYPE_IP6 &&
+	     offset == offsetof(struct ipv6hdr, hop_limit)))
+		return false;
+
+	return true;
 }
 
 static bool modify_header_match_supported(struct mlx5_flow_spec *spec,
@@ -2726,19 +2694,21 @@ static int add_vlan_rewrite_action(struct mlx5e_priv *priv, int namespace,
 				   struct pedit_headers_action *hdrs,
 				   u32 *action, struct netlink_ext_ack *extack)
 {
-	u16 mask16 = VLAN_VID_MASK;
-	u16 val16 = act->vlan.vid & VLAN_VID_MASK;
-	const struct flow_action_entry pedit_act = {
+	__be16 mask16 = htons(VLAN_VID_MASK);
+	__be16 val16 = htons(act->vlan.vid & VLAN_VID_MASK);
+	struct flow_action_entry pedit_act = {
 		.id = FLOW_ACTION_MANGLE,
 		.mangle.htype = FLOW_ACT_MANGLE_HDR_TYPE_ETH,
 		.mangle.offset = offsetof(struct vlan_ethhdr, h_vlan_TCI),
-		.mangle.mask = (u32)be16_to_cpu(*(__be16 *)&mask16),
-		.mangle.val = (u32)be16_to_cpu(*(__be16 *)&val16),
+		.mangle.len = 2,
 	};
 	u8 match_prio_mask, match_prio_val;
 	void *headers_c, *headers_v;
 	int err;
 
+	memcpy(pedit_act.mangle.mask, &mask16, sizeof(__be16));
+	memcpy(pedit_act.mangle.val, &val16, sizeof(__be16));
+
 	headers_c = get_match_headers_criteria(*action, &parse_attr->spec);
 	headers_v = get_match_headers_value(*action, &parse_attr->spec);
 
diff --git a/drivers/net/ethernet/netronome/nfp/flower/action.c b/drivers/net/ethernet/netronome/nfp/flower/action.c
index 592c36ba9e3f..a525ec244b79 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/action.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/action.c
@@ -472,38 +472,44 @@ nfp_fl_set_ipv4_tun(struct nfp_app *app, struct nfp_fl_set_ipv4_tun *set_tun,
 	return 0;
 }
 
-static void nfp_fl_set_helper32(u32 value, u32 mask, u8 *p_exact, u8 *p_mask)
+static void nfp_fl_set_helper(const u8 *value, const u8 *mask,
+			      u8 *p_exact, u8 *p_mask, int len)
 {
-	u32 oldvalue = get_unaligned((u32 *)p_exact);
-	u32 oldmask = get_unaligned((u32 *)p_mask);
+	int i;
 
-	value |= oldvalue & ~mask;
-
-	put_unaligned(oldmask | mask, (u32 *)p_mask);
-	put_unaligned(value, (u32 *)p_exact);
+	for (i = 0; i < len; i++) {
+		p_exact[i] = (p_exact[i] & ~mask[i]) | value[i];
+		p_mask[i] |= mask[i];
+	}
 }
 
 static int
 nfp_fl_set_eth(const struct flow_action_entry *act, u32 off,
 	       struct nfp_fl_set_eth *set_eth, struct netlink_ext_ack *extack)
 {
-	u32 exact, mask;
+	int i;
 
-	if (off + 4 > ETH_ALEN * 2) {
+	switch (off) {
+	case offsetof(struct ethhdr, h_dest):
+		i = 0;
+		break;
+	case offsetof(struct ethhdr, h_source):
+		i = ETH_ALEN;
+		break;
+	default:
 		NL_SET_ERR_MSG_MOD(extack, "unsupported offload: invalid pedit ethernet action");
 		return -EOPNOTSUPP;
 	}
 
-	mask = act->mangle.mask;
-	exact = act->mangle.val;
-
-	if (exact & ~mask) {
+	if (memchr_inv(&act->mangle.val, 0, act->mangle.len) &&
+	    !memchr_inv(&act->mangle.mask, 0, act->mangle.len)) {
 		NL_SET_ERR_MSG_MOD(extack, "unsupported offload: invalid pedit ethernet action");
 		return -EOPNOTSUPP;
 	}
 
-	nfp_fl_set_helper32(exact, mask, &set_eth->eth_addr_val[off],
-			    &set_eth->eth_addr_mask[off]);
+	nfp_fl_set_helper(act->mangle.val, act->mangle.mask,
+			  &set_eth->eth_addr_val[i],
+			  &set_eth->eth_addr_mask[i], act->mangle.len);
 
 	set_eth->reserved = cpu_to_be16(0);
 	set_eth->head.jump_id = NFP_FL_ACTION_OPCODE_SET_ETHERNET;
@@ -524,68 +530,46 @@ nfp_fl_set_ip4(const struct flow_action_entry *act, u32 off,
 	       struct nfp_fl_set_ip4_ttl_tos *set_ip_ttl_tos,
 	       struct netlink_ext_ack *extack)
 {
-	struct ipv4_ttl_word *ttl_word_mask;
-	struct ipv4_ttl_word *ttl_word;
-	struct iphdr *tos_word_mask;
-	struct iphdr *tos_word;
-	__be32 exact, mask;
-
-	/* We are expecting tcf_pedit to return a big endian value */
-	mask = (__force __be32)act->mangle.mask;
-	exact = (__force __be32)act->mangle.val;
-
-	if (exact & ~mask) {
+	if (memchr_inv(act->mangle.val, 0, act->mangle.len) &&
+	    !memchr_inv(act->mangle.mask, 0, act->mangle.len)) {
 		NL_SET_ERR_MSG_MOD(extack, "unsupported offload: invalid pedit IPv4 action");
 		return -EOPNOTSUPP;
 	}
 
 	switch (off) {
 	case offsetof(struct iphdr, daddr):
-		set_ip_addr->ipv4_dst_mask |= mask;
-		set_ip_addr->ipv4_dst &= ~mask;
-		set_ip_addr->ipv4_dst |= exact;
+		nfp_fl_set_helper(act->mangle.val, act->mangle.mask,
+				  (u8 *)&set_ip_addr->ipv4_dst,
+				  (u8 *)&set_ip_addr->ipv4_dst_mask,
+				  act->mangle.len);
 		set_ip_addr->head.jump_id = NFP_FL_ACTION_OPCODE_SET_IPV4_ADDRS;
 		set_ip_addr->head.len_lw = sizeof(*set_ip_addr) >>
 					   NFP_FL_LW_SIZ;
 		break;
 	case offsetof(struct iphdr, saddr):
-		set_ip_addr->ipv4_src_mask |= mask;
-		set_ip_addr->ipv4_src &= ~mask;
-		set_ip_addr->ipv4_src |= exact;
+		nfp_fl_set_helper(act->mangle.val, act->mangle.mask,
+				  (u8 *)&set_ip_addr->ipv4_src,
+				  (u8 *)&set_ip_addr->ipv4_src_mask,
+				  act->mangle.len);
 		set_ip_addr->head.jump_id = NFP_FL_ACTION_OPCODE_SET_IPV4_ADDRS;
 		set_ip_addr->head.len_lw = sizeof(*set_ip_addr) >>
 					   NFP_FL_LW_SIZ;
 		break;
 	case offsetof(struct iphdr, ttl):
-		ttl_word_mask = (struct ipv4_ttl_word *)&mask;
-		ttl_word = (struct ipv4_ttl_word *)&exact;
-
-		if (ttl_word_mask->protocol || ttl_word_mask->check) {
-			NL_SET_ERR_MSG_MOD(extack, "unsupported offload: invalid pedit IPv4 ttl action");
-			return -EOPNOTSUPP;
-		}
-
-		set_ip_ttl_tos->ipv4_ttl_mask |= ttl_word_mask->ttl;
-		set_ip_ttl_tos->ipv4_ttl &= ~ttl_word_mask->ttl;
-		set_ip_ttl_tos->ipv4_ttl |= ttl_word->ttl & ttl_word_mask->ttl;
+		nfp_fl_set_helper(act->mangle.val, act->mangle.mask,
+				  (u8 *)&set_ip_ttl_tos->ipv4_ttl,
+				  (u8 *)&set_ip_ttl_tos->ipv4_ttl_mask,
+				  act->mangle.len);
 		set_ip_ttl_tos->head.jump_id =
 			NFP_FL_ACTION_OPCODE_SET_IPV4_TTL_TOS;
 		set_ip_ttl_tos->head.len_lw = sizeof(*set_ip_ttl_tos) >>
 					      NFP_FL_LW_SIZ;
 		break;
-	case round_down(offsetof(struct iphdr, tos), 4):
-		tos_word_mask = (struct iphdr *)&mask;
-		tos_word = (struct iphdr *)&exact;
-
-		if (tos_word_mask->version || tos_word_mask->ihl ||
-		    tos_word_mask->tot_len) {
-			NL_SET_ERR_MSG_MOD(extack, "unsupported offload: invalid pedit IPv4 tos action");
-			return -EOPNOTSUPP;
-		}
-
-		set_ip_ttl_tos->ipv4_tos_mask |= tos_word_mask->tos;
-		set_ip_ttl_tos->ipv4_tos &= ~tos_word_mask->tos;
-		set_ip_ttl_tos->ipv4_tos |= tos_word->tos & tos_word_mask->tos;
+	case offsetof(struct iphdr, tos):
+		nfp_fl_set_helper(act->mangle.val, act->mangle.mask,
+				  (u8 *)&set_ip_ttl_tos->ipv4_tos,
+				  (u8 *)&set_ip_ttl_tos->ipv4_tos_mask,
+				  act->mangle.len);
 		set_ip_ttl_tos->head.jump_id =
 			NFP_FL_ACTION_OPCODE_SET_IPV4_TTL_TOS;
 		set_ip_ttl_tos->head.len_lw = sizeof(*set_ip_ttl_tos) >>
@@ -600,12 +584,17 @@ nfp_fl_set_ip4(const struct flow_action_entry *act, u32 off,
 }
 
 static void
-nfp_fl_set_ip6_helper(int opcode_tag, u8 word, __be32 exact, __be32 mask,
-		      struct nfp_fl_set_ipv6_addr *ip6)
+nfp_fl_set_ip6_helper(int opcode_tag, const struct flow_action_entry *act,
+		      struct nfp_fl_set_ipv6_addr *ip6,
+		      struct netlink_ext_ack *extack)
 {
-	ip6->ipv6[word].mask |= mask;
-	ip6->ipv6[word].exact &= ~mask;
-	ip6->ipv6[word].exact |= exact;
+	int i, j;
+
+	for (i = 0, j = 0; i < ARRAY_SIZE(ip6->ipv6); i++, j += sizeof(u32)) {
+		nfp_fl_set_helper(&act->mangle.val[j], &act->mangle.mask[j],
+				  (u8 *)&ip6->ipv6[i].exact,
+				  (u8 *)&ip6->ipv6[i].mask, sizeof(u32));
+	}
 
 	ip6->reserved = cpu_to_be16(0);
 	ip6->head.jump_id = opcode_tag;
@@ -619,39 +608,34 @@ struct ipv6_hop_limit_word {
 };
 
 static int
-nfp_fl_set_ip6_hop_limit_flow_label(u32 off, __be32 exact, __be32 mask,
+nfp_fl_set_ip6_hop_limit_flow_label(u32 off,
+				    const struct flow_action_entry *act,
 				    struct nfp_fl_set_ipv6_tc_hl_fl *ip_hl_fl,
 				    struct netlink_ext_ack *extack)
 {
-	struct ipv6_hop_limit_word *fl_hl_mask;
-	struct ipv6_hop_limit_word *fl_hl;
-
 	switch (off) {
-	case offsetof(struct ipv6hdr, payload_len):
-		fl_hl_mask = (struct ipv6_hop_limit_word *)&mask;
-		fl_hl = (struct ipv6_hop_limit_word *)&exact;
-
-		if (fl_hl_mask->nexthdr || fl_hl_mask->payload_len) {
-			NL_SET_ERR_MSG_MOD(extack, "unsupported offload: invalid pedit IPv6 hop limit action");
-			return -EOPNOTSUPP;
-		}
-
-		ip_hl_fl->ipv6_hop_limit_mask |= fl_hl_mask->hop_limit;
-		ip_hl_fl->ipv6_hop_limit &= ~fl_hl_mask->hop_limit;
-		ip_hl_fl->ipv6_hop_limit |= fl_hl->hop_limit &
-					    fl_hl_mask->hop_limit;
+	case offsetof(struct ipv6hdr, hop_limit):
+		nfp_fl_set_helper(act->mangle.val, act->mangle.mask,
+				  &ip_hl_fl->ipv6_hop_limit,
+				  &ip_hl_fl->ipv6_hop_limit_mask,
+				  act->mangle.len);
 		break;
-	case round_down(offsetof(struct ipv6hdr, flow_lbl), 4):
-		if (mask & ~IPV6_FLOW_LABEL_MASK ||
-		    exact & ~IPV6_FLOW_LABEL_MASK) {
+	case 1: /* offsetof(struct ipv6hdr, flow_lbl) */
+		/* The initial 4-bits are part of the traffic class field. */
+		if (act->mangle.val[0] & 0xf0 ||
+		    act->mangle.mask[0] & 0xf0) {
 			NL_SET_ERR_MSG_MOD(extack, "unsupported offload: invalid pedit IPv6 flow label action");
 			return -EOPNOTSUPP;
 		}
 
-		ip_hl_fl->ipv6_label_mask |= mask;
-		ip_hl_fl->ipv6_label &= ~mask;
-		ip_hl_fl->ipv6_label |= exact;
+		nfp_fl_set_helper(act->mangle.val, act->mangle.mask,
+				  (u8 *)&ip_hl_fl->ipv6_label,
+				  (u8 *)&ip_hl_fl->ipv6_label_mask,
+				  act->mangle.len);
 		break;
+	default:
+		NL_SET_ERR_MSG_MOD(extack, "unsupported offload: invalid pedit IPv6 action");
+		return -EOPNOTSUPP;
 	}
 
 	ip_hl_fl->head.jump_id = NFP_FL_ACTION_OPCODE_SET_IPV6_TC_HL_FL;
@@ -667,31 +651,23 @@ nfp_fl_set_ip6(const struct flow_action_entry *act, u32 off,
 	       struct nfp_fl_set_ipv6_tc_hl_fl *ip_hl_fl,
 	       struct netlink_ext_ack *extack)
 {
-	__be32 exact, mask;
 	int err = 0;
-	u8 word;
 
-	/* We are expecting tcf_pedit to return a big endian value */
-	mask = (__force __be32)act->mangle.mask;
-	exact = (__force __be32)act->mangle.val;
-
-	if (exact & ~mask) {
+	if (memchr_inv(act->mangle.val, 0, act->mangle.len) &&
+	    !memchr_inv(act->mangle.mask, 0, act->mangle.len)) {
 		NL_SET_ERR_MSG_MOD(extack, "unsupported offload: invalid pedit IPv6 action");
 		return -EOPNOTSUPP;
 	}
 
 	if (off < offsetof(struct ipv6hdr, saddr)) {
-		err = nfp_fl_set_ip6_hop_limit_flow_label(off, exact, mask,
-							  ip_hl_fl, extack);
-	} else if (off < offsetof(struct ipv6hdr, daddr)) {
-		word = (off - offsetof(struct ipv6hdr, saddr)) / sizeof(exact);
-		nfp_fl_set_ip6_helper(NFP_FL_ACTION_OPCODE_SET_IPV6_SRC, word,
-				      exact, mask, ip_src);
-	} else if (off < offsetof(struct ipv6hdr, daddr) +
-		       sizeof(struct in6_addr)) {
-		word = (off - offsetof(struct ipv6hdr, daddr)) / sizeof(exact);
-		nfp_fl_set_ip6_helper(NFP_FL_ACTION_OPCODE_SET_IPV6_DST, word,
-				      exact, mask, ip_dst);
+		err = nfp_fl_set_ip6_hop_limit_flow_label(off, act, ip_hl_fl,
+							  extack);
+	} else if (off == offsetof(struct ipv6hdr, saddr)) {
+		nfp_fl_set_ip6_helper(NFP_FL_ACTION_OPCODE_SET_IPV6_SRC,
+				      act, ip_src, extack);
+	} else if (off == offsetof(struct ipv6hdr, daddr)) {
+		nfp_fl_set_ip6_helper(NFP_FL_ACTION_OPCODE_SET_IPV6_DST,
+				      act, ip_dst, extack);
 	} else {
 		NL_SET_ERR_MSG_MOD(extack, "unsupported offload: pedit on unsupported section of IPv6 header");
 		return -EOPNOTSUPP;
@@ -705,23 +681,30 @@ nfp_fl_set_tport(const struct flow_action_entry *act, u32 off,
 		 struct nfp_fl_set_tport *set_tport, int opcode,
 		 struct netlink_ext_ack *extack)
 {
-	u32 exact, mask;
+	int i;
 
-	if (off) {
+	switch (off) {
+	case offsetof(struct tcphdr, source):
+		i = 0;
+		break;
+	case offsetof(struct tcphdr, dest):
+		i = sizeof(u16);
+		break;
+	default:
 		NL_SET_ERR_MSG_MOD(extack, "unsupported offload: pedit on unsupported section of L4 header");
 		return -EOPNOTSUPP;
 	}
 
-	mask = act->mangle.mask;
-	exact = act->mangle.val;
-
-	if (exact & ~mask) {
+	if (memchr_inv(act->mangle.val, 0, act->mangle.len) &&
+	    !memchr_inv(act->mangle.mask, 0, act->mangle.len)) {
 		NL_SET_ERR_MSG_MOD(extack, "unsupported offload: invalid pedit L4 action");
 		return -EOPNOTSUPP;
 	}
 
-	nfp_fl_set_helper32(exact, mask, set_tport->tp_port_val,
-			    set_tport->tp_port_mask);
+	nfp_fl_set_helper(act->mangle.val, act->mangle.mask,
+			  &set_tport->tp_port_val[i],
+			  &set_tport->tp_port_mask[i],
+			  act->mangle.len);
 
 	set_tport->reserved = cpu_to_be16(0);
 	set_tport->head.jump_id = opcode;
diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index fc881875f856..a2853b37dded 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -154,6 +154,8 @@ enum flow_action_mangle_base {
 	FLOW_ACT_MANGLE_HDR_TYPE_UDP,
 };
 
+#define FLOW_ACTION_MANGLE_MAXLEN	16
+
 struct flow_action_entry {
 	enum flow_action_id		id;
 	union {
@@ -167,8 +169,9 @@ struct flow_action_entry {
 		struct {				/* FLOW_ACTION_PACKET_EDIT */
 			enum flow_action_mangle_base htype;
 			u32		offset;
-			u32		mask;
-			u32		val;
+			u8		mask[FLOW_ACTION_MANGLE_MAXLEN];
+			u8		val[FLOW_ACTION_MANGLE_MAXLEN];
+			u8		len;
 		} mangle;
 		const struct ip_tunnel_info *tunnel;	/* FLOW_ACTION_TUNNEL_ENCAP */
 		u32			csum_flags;	/* FLOW_ACTION_CSUM */
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index e30a151d8527..468db33292a9 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3289,11 +3289,128 @@ void tc_cleanup_flow_action(struct flow_action *flow_action)
 }
 EXPORT_SYMBOL(tc_cleanup_flow_action);
 
+static unsigned int flow_action_mangle_type(enum pedit_cmd cmd)
+{
+	switch (cmd) {
+	case TCA_PEDIT_KEY_EX_CMD_SET:
+		return FLOW_ACTION_MANGLE;
+	case TCA_PEDIT_KEY_EX_CMD_ADD:
+		return FLOW_ACTION_ADD;
+	default:
+		WARN_ON_ONCE(1);
+	}
+	return 0;
+}
+
+struct flow_action_mangle_ctx {
+	u8	cmd;
+	u8	offset;
+	u8	htype;
+	u8	idx;
+	u8	val[FLOW_ACTION_MANGLE_MAXLEN];
+	u8	mask[FLOW_ACTION_MANGLE_MAXLEN];
+	u32	first_word;
+	u32	last_word;
+};
+
+static void flow_action_mangle_entry(struct flow_action_entry *entry,
+				     const struct flow_action_mangle_ctx *ctx)
+{
+	u32 delta;
+
+	entry->id = ctx->cmd;
+	entry->mangle.htype = ctx->htype;
+	entry->mangle.offset = ctx->offset;
+	entry->mangle.len = ctx->idx;
+
+	/* Adjust offset. */
+	delta = sizeof(u32) - (fls(ctx->first_word) / BITS_PER_BYTE);
+	entry->mangle.offset += delta;
+
+	/* Adjust length. */
+	entry->mangle.len -= ((ffs(ctx->last_word) / BITS_PER_BYTE) + delta);
+
+	/* Copy value and mask from offset using the adjusted length. */
+	memcpy(entry->mangle.val, &ctx->val[delta], entry->mangle.len);
+	memcpy(entry->mangle.mask, &ctx->mask[delta], entry->mangle.len);
+}
+
+static void flow_action_mangle_ctx_update(struct flow_action_mangle_ctx *ctx,
+					  const struct tc_action *act, int k)
+{
+	u32 val, mask;
+
+	val = tcf_pedit_val(act, k);
+	mask = ~tcf_pedit_mask(act, k);
+
+	memcpy(&ctx->val[ctx->idx], &val, sizeof(u32));
+	memcpy(&ctx->mask[ctx->idx], &mask, sizeof(u32));
+	ctx->idx += sizeof(u32);
+}
+
+static void flow_action_mangle_ctx_init(struct flow_action_mangle_ctx *ctx,
+					const struct tc_action *act, int k)
+{
+	ctx->cmd = flow_action_mangle_type(tcf_pedit_cmd(act, k));
+	ctx->offset = tcf_pedit_offset(act, k);
+	ctx->htype = tcf_pedit_htype(act, k);
+	ctx->idx = 0;
+
+	ctx->first_word = ntohl(~tcf_pedit_mask(act, k));
+	ctx->last_word = ctx->first_word;
+
+	flow_action_mangle_ctx_update(ctx, act, k);
+}
+
+static int flow_action_mangle(struct flow_action *flow_action,
+			      struct flow_action_entry *entry,
+			      const struct tc_action *act, int j)
+{
+	struct flow_action_mangle_ctx ctx;
+	int k;
+
+	if (tcf_pedit_cmd(act, 0) > TCA_PEDIT_KEY_EX_CMD_ADD)
+		return -1;
+
+	flow_action_mangle_ctx_init(&ctx, act, 0);
+
+	/* Special case: one single 32-bits word. */
+	if (tcf_pedit_nkeys(act) == 1) {
+		flow_action_mangle_entry(entry, &ctx);
+		return j;
+	}
+
+	for (k = 1; k < tcf_pedit_nkeys(act); k++) {
+		if (tcf_pedit_cmd(act, k) > TCA_PEDIT_KEY_EX_CMD_ADD)
+			return -1;
+
+		/* Offset is contiguous and type is the same, coalesce. */
+		if (ctx.idx < FLOW_ACTION_MANGLE_MAXLEN &&
+		    ctx.offset + ctx.idx == tcf_pedit_offset(act, k) &&
+		    ctx.cmd == flow_action_mangle_type(tcf_pedit_cmd(act, k))) {
+			flow_action_mangle_ctx_update(&ctx, act, k);
+			continue;
+		}
+		ctx.last_word = ntohl(~tcf_pedit_mask(act, k - 1));
+
+		/* Cannot coalesce, set up this entry. */
+		flow_action_mangle_entry(entry, &ctx);
+
+		flow_action_mangle_ctx_init(&ctx, act, k);
+		entry = &flow_action->entries[++j];
+	}
+
+	ctx.last_word = ntohl(~tcf_pedit_mask(act, k - 1));
+	flow_action_mangle_entry(entry, &ctx);
+
+	return j;
+}
+
 int tc_setup_flow_action(struct flow_action *flow_action,
 			 const struct tcf_exts *exts, bool rtnl_held)
 {
 	const struct tc_action *act;
-	int i, j, k, err = 0;
+	int i, j, err = 0;
 
 	if (!exts)
 		return 0;
@@ -3366,25 +3483,9 @@ int tc_setup_flow_action(struct flow_action *flow_action,
 		} else if (is_tcf_tunnel_release(act)) {
 			entry->id = FLOW_ACTION_TUNNEL_DECAP;
 		} else if (is_tcf_pedit(act)) {
-			for (k = 0; k < tcf_pedit_nkeys(act); k++) {
-				switch (tcf_pedit_cmd(act, k)) {
-				case TCA_PEDIT_KEY_EX_CMD_SET:
-					entry->id = FLOW_ACTION_MANGLE;
-					break;
-				case TCA_PEDIT_KEY_EX_CMD_ADD:
-					entry->id = FLOW_ACTION_ADD;
-					break;
-				default:
-					err = -EOPNOTSUPP;
-					goto err_out;
-				}
-				entry->mangle.htype = tcf_pedit_htype(act, k);
-				entry->mangle.mask = ~tcf_pedit_mask(act, k);
-				entry->mangle.val = tcf_pedit_val(act, k) &
-							entry->mangle.mask;
-				entry->mangle.offset = tcf_pedit_offset(act, k);
-				entry = &flow_action->entries[++j];
-			}
+			j = flow_action_mangle(flow_action, entry, act, j);
+			if (j < 0)
+				goto err_out;
 		} else if (is_tcf_csum(act)) {
 			entry->id = FLOW_ACTION_CSUM;
 			entry->csum_flags = tcf_csum_update_flags(act);
@@ -3439,9 +3540,9 @@ int tc_setup_flow_action(struct flow_action *flow_action,
 			goto err_out;
 		}
 
-		if (!is_tcf_pedit(act))
-			j++;
+		j++;
 	}
+	flow_action->num_entries = j;
 
 err_out:
 	if (!rtnl_held)
-- 
2.11.0



* [PATCH net-next,v2 4/4] netfilter: nft_payload: packet mangling offload support
From: Pablo Neira Ayuso @ 2019-09-03 16:45 UTC (permalink / raw)
  To: netfilter-devel
  Cc: davem, netdev, jakub.kicinski, jiri, saeedm, vishal, vladbu
In-Reply-To: <20190903164513.15462-1-pablo@netfilter.org>

Allow packet fields to be mangled through the hardware offload
infrastructure by adding an .offload callback to the payload set
expression.
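
As a rough illustration of what the offload callback below builds, here
is a minimal userspace sketch, not the kernel code itself. It assumes a
byte-array val/mask layout like the one used by entry->mangle in this
series. The names sketch_mangle, SKETCH_MANGLE_MAXLEN, sketch_fill_mangle()
and sketch_apply() are hypothetical, and the length check is an addition
of the sketch that the patch does not perform.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MANGLE_MAXLEN 16	/* hypothetical bound, not from the patch */

/* Hypothetical stand-in for the mangle part of a flow action entry. */
struct sketch_mangle {
	uint32_t offset;			/* byte offset into the header */
	uint32_t len;				/* number of bytes to rewrite */
	uint8_t  val[SKETCH_MANGLE_MAXLEN];	/* new bytes */
	uint8_t  mask[SKETCH_MANGLE_MAXLEN];	/* set bits are rewritten */
};

/* Describe a full rewrite of len bytes: copy the value, set the mask to 0xff. */
static int sketch_fill_mangle(struct sketch_mangle *m, uint32_t offset,
			      const uint8_t *data, uint32_t len)
{
	if (len > SKETCH_MANGLE_MAXLEN)
		return -1;

	memset(m, 0, sizeof(*m));
	m->offset = offset;
	m->len = len;
	memcpy(m->val, data, len);
	memset(m->mask, 0xff, len);
	return 0;
}

/* What a device would conceptually do with the entry. */
static void sketch_apply(uint8_t *hdr, const struct sketch_mangle *m)
{
	uint32_t i;

	for (i = 0; i < m->len; i++) {
		uint8_t *p = &hdr[m->offset + i];

		*p = (uint8_t)((*p & ~m->mask[i]) | (m->val[i] & m->mask[i]));
	}
}
```

With a two-byte value at offset 2, only bytes 2 and 3 of the header
buffer change. Every other byte is left as it was.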

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nft_payload.c | 73 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/net/netfilter/nft_payload.c b/net/netfilter/nft_payload.c
index 22a80eb60222..0efa8bfd2b51 100644
--- a/net/netfilter/nft_payload.c
+++ b/net/netfilter/nft_payload.c
@@ -562,12 +562,85 @@ static int nft_payload_set_dump(struct sk_buff *skb, const struct nft_expr *expr
 	return -1;
 }
 
+static int nft_payload_offload_set_nh(struct nft_offload_ctx *ctx,
+				      struct nft_flow_rule *flow,
+				      const struct nft_payload_set *priv)
+{
+	int type = FLOW_ACT_MANGLE_UNSPEC;
+
+	switch (ctx->dep.l3num) {
+	case htons(ETH_P_IP):
+		type = FLOW_ACT_MANGLE_HDR_TYPE_IP4;
+		break;
+	case htons(ETH_P_IPV6):
+		type = FLOW_ACT_MANGLE_HDR_TYPE_IP6;
+		break;
+	}
+
+	return type;
+}
+
+static int nft_payload_offload_set_th(struct nft_offload_ctx *ctx,
+				      struct nft_flow_rule *flow,
+				      const struct nft_payload_set *priv)
+{
+	int type = FLOW_ACT_MANGLE_UNSPEC;
+
+	switch (ctx->dep.protonum) {
+	case IPPROTO_TCP:
+		type = FLOW_ACT_MANGLE_HDR_TYPE_TCP;
+		break;
+	case IPPROTO_UDP:
+		type = FLOW_ACT_MANGLE_HDR_TYPE_UDP;
+		break;
+	}
+
+	return type;
+}
+
+static int nft_payload_set_offload(struct nft_offload_ctx *ctx,
+				   struct nft_flow_rule *flow,
+				   const struct nft_expr *expr)
+{
+	const struct nft_payload_set *priv = nft_expr_priv(expr);
+	struct nft_offload_reg *sreg = &ctx->regs[priv->sreg];
+	int type = FLOW_ACT_MANGLE_UNSPEC;
+	struct flow_action_entry *entry;
+
+	switch (priv->base) {
+	case NFT_PAYLOAD_LL_HEADER:
+		type = FLOW_ACT_MANGLE_HDR_TYPE_ETH;
+		break;
+	case NFT_PAYLOAD_NETWORK_HEADER:
+		type = nft_payload_offload_set_nh(ctx, flow, priv);
+		break;
+	case NFT_PAYLOAD_TRANSPORT_HEADER:
+		type = nft_payload_offload_set_th(ctx, flow, priv);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		break;
+	}
+
+	entry = &flow->rule->action.entries[ctx->num_actions++];
+	entry->id		= FLOW_ACTION_MANGLE;
+	entry->mangle.htype	= type;
+	entry->mangle.offset	= priv->offset;
+	entry->mangle.len	= priv->len;
+
+	memcpy(entry->mangle.val, sreg->data.data, priv->len);
+	memset(entry->mangle.mask, 0xff, priv->len);
+
+	return type != FLOW_ACT_MANGLE_UNSPEC ? 0 : -EOPNOTSUPP;
+}
+
 static const struct nft_expr_ops nft_payload_set_ops = {
 	.type		= &nft_payload_type,
 	.size		= NFT_EXPR_SIZE(sizeof(struct nft_payload_set)),
 	.eval		= nft_payload_set_eval,
 	.init		= nft_payload_set_init,
 	.dump		= nft_payload_set_dump,
+	.offload	= nft_payload_set_offload,
 };
 
 static const struct nft_expr_ops *
-- 
2.11.0



* [PATCH net-next,v2 1/4] net: flow_offload: flip mangle action mask
From: Pablo Neira Ayuso @ 2019-09-03 16:45 UTC (permalink / raw)
  To: netfilter-devel
  Cc: davem, netdev, jakub.kicinski, jiri, saeedm, vishal, vladbu
In-Reply-To: <20190903164513.15462-1-pablo@netfilter.org>

The userspace tc pedit action performs a bitwise NOT on the mask, and
every in-tree driver undoes that operation. Have tc_setup_flow_action()
prepare the mangle mask in the form the drivers expect, so the drivers
no longer need to invert it themselves.
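
The two mask conventions can be sketched in plain C as follows. This is
an illustration only: pedit_to_mangle_mask() and apply_mangle() are
hypothetical helper names that exist in neither the kernel nor this
patch. It assumes the pedit mask marks the bits to keep, and the flipped
mangle mask marks the bits to rewrite.

```c
#include <assert.h>
#include <stdint.h>

/* tc pedit convention: set bits in the mask are the bits to KEEP. */
static uint32_t pedit_to_mangle_mask(uint32_t pedit_mask)
{
	/* After the flip, set bits are the bits to REWRITE. */
	return ~pedit_mask;
}

/* Rewrite only the bits selected by the (already flipped) mangle mask. */
static uint32_t apply_mangle(uint32_t old, uint32_t val, uint32_t mangle_mask)
{
	return (old & ~mangle_mask) | (val & mangle_mask);
}
```

For a pedit mask of 0xffff0000 the flipped mask is 0x0000ffff, so a
driver can use it directly to select the low 16 bits without applying
its own NOT.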

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c | 12 ++++++------
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c      |  6 +++---
 drivers/net/ethernet/netronome/nfp/flower/action.c   |  8 ++++----
 net/sched/cls_api.c                                  |  2 +-
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
index e447976bdd3e..2d26dbca701d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
@@ -275,7 +275,7 @@ static int cxgb4_validate_flow_match(struct net_device *dev,
 static void offload_pedit(struct ch_filter_specification *fs, u32 val, u32 mask,
 			  u8 field)
 {
-	u32 set_val = val & ~mask;
+	u32 set_val = val & mask;
 	u32 offset = 0;
 	u8 size = 1;
 	int i;
@@ -301,7 +301,7 @@ static void process_pedit_field(struct ch_filter_specification *fs, u32 val,
 			offload_pedit(fs, val, mask, ETH_DMAC_31_0);
 			break;
 		case PEDIT_ETH_DMAC_47_32_SMAC_15_0:
-			if (~mask & PEDIT_ETH_DMAC_MASK)
+			if (mask & PEDIT_ETH_DMAC_MASK)
 				offload_pedit(fs, val, mask, ETH_DMAC_47_32);
 			else
 				offload_pedit(fs, val >> 16, mask >> 16,
@@ -353,7 +353,7 @@ static void process_pedit_field(struct ch_filter_specification *fs, u32 val,
 	case FLOW_ACT_MANGLE_HDR_TYPE_TCP:
 		switch (offset) {
 		case PEDIT_TCP_SPORT_DPORT:
-			if (~mask & PEDIT_TCP_UDP_SPORT_MASK)
+			if (mask & PEDIT_TCP_UDP_SPORT_MASK)
 				offload_pedit(fs, cpu_to_be32(val) >> 16,
 					      cpu_to_be32(mask) >> 16,
 					      TCP_SPORT);
@@ -366,7 +366,7 @@ static void process_pedit_field(struct ch_filter_specification *fs, u32 val,
 	case FLOW_ACT_MANGLE_HDR_TYPE_UDP:
 		switch (offset) {
 		case PEDIT_UDP_SPORT_DPORT:
-			if (~mask & PEDIT_TCP_UDP_SPORT_MASK)
+			if (mask & PEDIT_TCP_UDP_SPORT_MASK)
 				offload_pedit(fs, cpu_to_be32(val) >> 16,
 					      cpu_to_be32(mask) >> 16,
 					      UDP_SPORT);
@@ -510,7 +510,7 @@ static bool valid_pedit_action(struct net_device *dev,
 	case FLOW_ACT_MANGLE_HDR_TYPE_TCP:
 		switch (offset) {
 		case PEDIT_TCP_SPORT_DPORT:
-			if (!valid_l4_mask(~mask)) {
+			if (!valid_l4_mask(mask)) {
 				netdev_err(dev, "%s: Unsupported mask for TCP L4 ports\n",
 					   __func__);
 				return false;
@@ -525,7 +525,7 @@ static bool valid_pedit_action(struct net_device *dev,
 	case FLOW_ACT_MANGLE_HDR_TYPE_UDP:
 		switch (offset) {
 		case PEDIT_UDP_SPORT_DPORT:
-			if (!valid_l4_mask(~mask)) {
+			if (!valid_l4_mask(mask)) {
 				netdev_err(dev, "%s: Unsupported mask for UDP L4 ports\n",
 					   __func__);
 				return false;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 5581a8045ede..67e82480b516 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2508,7 +2508,7 @@ static int parse_tc_pedit_action(struct mlx5e_priv *priv,
 	val = act->mangle.val;
 	offset = act->mangle.offset;
 
-	err = set_pedit_val(htype, ~mask, val, offset, &hdrs[cmd]);
+	err = set_pedit_val(htype, mask, val, offset, &hdrs[cmd]);
 	if (err)
 		goto out_err;
 
@@ -2608,7 +2608,7 @@ static bool is_action_keys_supported(const struct flow_action_entry *act)
 
 	htype = act->mangle.htype;
 	offset = act->mangle.offset;
-	mask = ~act->mangle.mask;
+	mask = act->mangle.mask;
 	/* For IPv4 & IPv6 header check 4 byte word,
 	 * to determine that modified fields
 	 * are NOT ttl & hop_limit only.
@@ -2732,7 +2732,7 @@ static int add_vlan_rewrite_action(struct mlx5e_priv *priv, int namespace,
 		.id = FLOW_ACTION_MANGLE,
 		.mangle.htype = FLOW_ACT_MANGLE_HDR_TYPE_ETH,
 		.mangle.offset = offsetof(struct vlan_ethhdr, h_vlan_TCI),
-		.mangle.mask = ~(u32)be16_to_cpu(*(__be16 *)&mask16),
+		.mangle.mask = (u32)be16_to_cpu(*(__be16 *)&mask16),
 		.mangle.val = (u32)be16_to_cpu(*(__be16 *)&val16),
 	};
 	u8 match_prio_mask, match_prio_val;
diff --git a/drivers/net/ethernet/netronome/nfp/flower/action.c b/drivers/net/ethernet/netronome/nfp/flower/action.c
index 1b019fdfcd97..ee0066a7ba87 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/action.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/action.c
@@ -495,7 +495,7 @@ nfp_fl_set_eth(const struct flow_action_entry *act, u32 off,
 		return -EOPNOTSUPP;
 	}
 
-	mask = ~act->mangle.mask;
+	mask = act->mangle.mask;
 	exact = act->mangle.val;
 
 	if (exact & ~mask) {
@@ -532,7 +532,7 @@ nfp_fl_set_ip4(const struct flow_action_entry *act, u32 off,
 	__be32 exact, mask;
 
 	/* We are expecting tcf_pedit to return a big endian value */
-	mask = (__force __be32)~act->mangle.mask;
+	mask = (__force __be32)act->mangle.mask;
 	exact = (__force __be32)act->mangle.val;
 
 	if (exact & ~mask) {
@@ -673,7 +673,7 @@ nfp_fl_set_ip6(const struct flow_action_entry *act, u32 off,
 	u8 word;
 
 	/* We are expecting tcf_pedit to return a big endian value */
-	mask = (__force __be32)~act->mangle.mask;
+	mask = (__force __be32)act->mangle.mask;
 	exact = (__force __be32)act->mangle.val;
 
 	if (exact & ~mask) {
@@ -713,7 +713,7 @@ nfp_fl_set_tport(const struct flow_action_entry *act, u32 off,
 		return -EOPNOTSUPP;
 	}
 
-	mask = ~act->mangle.mask;
+	mask = act->mangle.mask;
 	exact = act->mangle.val;
 
 	if (exact & ~mask) {
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 671ca905dbb5..fbab004d0075 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3379,7 +3379,7 @@ int tc_setup_flow_action(struct flow_action *flow_action,
 					goto err_out;
 				}
 				entry->mangle.htype = tcf_pedit_htype(act, k);
-				entry->mangle.mask = tcf_pedit_mask(act, k);
+				entry->mangle.mask = ~tcf_pedit_mask(act, k);
 				entry->mangle.val = tcf_pedit_val(act, k);
 				entry->mangle.offset = tcf_pedit_offset(act, k);
 				entry = &flow_action->entries[++j];
-- 
2.11.0


