* Re: [net,v2] net/packet: fix packet receive on L3 devices without visible hard header
From: Willem de Bruijn @ 2020-11-21 13:23 UTC
To: Jason A. Donenfeld
Cc: Eyal Birger, Willem de Bruijn, David Miller, Jakub Kicinski,
Network Development, Xie He
In-Reply-To: <CAHmME9rYRrWOs247vFJX-MAY+Zn3yUudOxVhqL13mWp8E+i0-A@mail.gmail.com>
On Sat, Nov 21, 2020 at 2:56 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> On 11/21/20, Eyal Birger <eyal.birger@gmail.com> wrote:
> > In the patchset merged by commit b9fcf0a0d826
> > ("Merge branch 'support-AF_PACKET-for-layer-3-devices'") L3 devices which
> > did not have header_ops were given one for the purpose of protocol parsing
> > on af_packet transmit path.
> >
> > That change made af_packet receive path regard these devices as having a
> > visible L3 header and therefore aligned incoming skb->data to point to the
> > skb's mac_header. Some devices, such as ipip, xfrmi, and others, do not
> > reset their mac_header prior to ingress and therefore their incoming
> > packets became malformed.
> >
> > Ideally these devices would reset their mac headers, or af_packet would be
> > able to rely on dev->hard_header_len being 0 for such cases, but it seems
> > this is not the case.
> >
> > Fix by changing af_packet RX ll visibility criteria to include the
> > existence of a '.create()' header operation, which is used when creating
> > a device hard header - via dev_hard_header() - by upper layers, and does
> > not exist in these L3 devices.
> >
> > As this predicate may be useful in other situations, add it as a common
> > dev_has_header() helper in netdevice.h.
> >
> > Fixes: b9fcf0a0d826 ("Merge branch
> > 'support-AF_PACKET-for-layer-3-devices'")
> > Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
> Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Willem de Bruijn <willemb@google.com>
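
For readers following along, the visibility predicate described in the commit message can be sketched in isolation. The structures below are simplified stand-ins for struct net_device and struct header_ops (assumptions for illustration only; the real definitions carry many more fields), and has_header() and demo_create() are hypothetical names. The idea is only that a device counts as having a visible hard header when it has header_ops and those ops include a create callback.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; illustration only. */
struct header_ops_model {
	int (*create)(void *frame, const void *daddr);  /* builds a link-layer header */
	int (*parse_protocol)(const void *frame);       /* used on the transmit path */
};

struct net_device_model {
	const struct header_ops_model *header_ops;
};

/* Hypothetical create() for an Ethernet-like device; returns a header length. */
static int demo_create(void *frame, const void *daddr)
{
	(void)frame;
	(void)daddr;
	return 14;
}

/* A device exposes a hard header only if it is able to create one. */
static bool has_header(const struct net_device_model *dev)
{
	return dev->header_ops && dev->header_ops->create;
}
```

An L3 device that was given header_ops purely for protocol parsing has no create callback, so under this check it is treated as having no visible header on receive.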
* Re: [PATCH 072/141] can: peak_usb: Fix fall-through warnings for Clang
From: Marc Kleine-Budde @ 2020-11-21 13:17 UTC
To: Gustavo A. R. Silva, Wolfgang Grandegger, David S. Miller,
Jakub Kicinski
Cc: linux-can, netdev, linux-kernel, linux-hardening
In-Reply-To: <aab7cf16bf43cc7c3e9c9930d2dae850c1d07a3c.1605896059.git.gustavoars@kernel.org>
On 11/20/20 7:34 PM, Gustavo A. R. Silva wrote:
> In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
> by explicitly adding a break statement instead of letting the code fall
> through to the next case.
>
> Link: https://github.com/KSPP/linux/issues/115
> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
> ---
> drivers/net/can/usb/peak_usb/pcan_usb_core.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/net/can/usb/peak_usb/pcan_usb_core.c b/drivers/net/can/usb/peak_usb/pcan_usb_core.c
> index c2764799f9ef..fd65a155be3b 100644
> --- a/drivers/net/can/usb/peak_usb/pcan_usb_core.c
> +++ b/drivers/net/can/usb/peak_usb/pcan_usb_core.c
> @@ -299,6 +299,8 @@ static void peak_usb_write_bulk_callback(struct urb *urb)
> if (net_ratelimit())
> netdev_err(netdev, "Tx urb aborted (%d)\n",
> urb->status);
> + break;
> +
> case -EPROTO:
> case -ENOENT:
> case -ECONNRESET:
>
What about moving the default to the end of the switch, which is more common anyway:
diff --git a/drivers/net/can/usb/peak_usb/pcan_usb_core.c b/drivers/net/can/usb/peak_usb/pcan_usb_core.c
index 204ccb27d6d9..e8977dd10902 100644
--- a/drivers/net/can/usb/peak_usb/pcan_usb_core.c
+++ b/drivers/net/can/usb/peak_usb/pcan_usb_core.c
@@ -295,16 +295,16 @@ static void peak_usb_write_bulk_callback(struct urb *urb)
netif_trans_update(netdev);
break;
- default:
- if (net_ratelimit())
- netdev_err(netdev, "Tx urb aborted (%d)\n",
- urb->status);
case -EPROTO:
case -ENOENT:
case -ECONNRESET:
case -ESHUTDOWN:
-
break;
+
+ default:
+ if (net_ratelimit())
+ netdev_err(netdev, "Tx urb aborted (%d)\n",
+ urb->status);
}
/* should always release echo skb and corresponding context */
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Marc
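
The shape of the proposed switch can be shown outside the driver. This is a minimal sketch, assuming a hypothetical helper tx_status_needs_log() that only classifies a status code; it is not the driver's callback. With the default label last, no labelled group runs into another one, so neither an extra break nor a fallthrough annotation is needed. ESHUTDOWN is assumed to be available from errno.h, as it is on Linux.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper: true when a completed URB status should be logged. */
static bool tx_status_needs_log(int status)
{
	switch (status) {
	case 0:
		/* transmission complete */
		return false;

	case -EPROTO:
	case -ENOENT:
	case -ECONNRESET:
	case -ESHUTDOWN:
		/* expected while the device goes away: stay quiet */
		return false;

	default:
		/* anything else is reported */
		return true;
	}
}
```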
--
Pengutronix e.K. | Marc Kleine-Budde |
Embedded Linux | https://www.pengutronix.de |
Vertretung West/Dortmund | Phone: +49-231-2826-924 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
* Re: [PATCH v3] dt-bindings: misc: convert fsl,qoriq-mc from txt to YAML
From: Rob Herring @ 2020-11-21 13:03 UTC
To: Laurentiu Tudor
Cc: leoyang.li, corbet, linux-arm-kernel, devicetree, linux-kernel,
netdev, linux-doc, davem, kuba, linuxppc-dev, ioana.ciornei,
Ionut-robert Aron
In-Reply-To: <20201112133254.7291-1-laurentiu.tudor@nxp.com>
On Thu, Nov 12, 2020 at 03:32:54PM +0200, Laurentiu Tudor wrote:
> From: Ionut-robert Aron <ionut-robert.aron@nxp.com>
>
> Convert fsl,qoriq-mc to YAML in order to automate the verification
> process of dts files. In addition, update MAINTAINERS accordingly
> and, while at it, add some missing files.
>
> Signed-off-by: Ionut-robert Aron <ionut-robert.aron@nxp.com>
> [laurentiu.tudor@nxp.com: update MAINTAINERS, updates & fixes in schema]
> Signed-off-by: Laurentiu Tudor <laurentiu.tudor@nxp.com>
> ---
> Changes in v3:
> - dropped duplicated "fsl,qoriq-mc-dpmac" schema and replaced with
> reference to it
> - fixed a dt_binding_check warning
> Changes in v2:
> - fixed errors reported by yamllint
> - dropped multiple unnecessary quotes
> - used schema instead of text in description
> - added constraints on dpmac reg property
>
> .../devicetree/bindings/misc/fsl,qoriq-mc.txt | 196 ------------------
> .../bindings/misc/fsl,qoriq-mc.yaml | 187 +++++++++++++++++
> .../ethernet/freescale/dpaa2/overview.rst | 5 +-
> MAINTAINERS | 4 +-
> 4 files changed, 194 insertions(+), 198 deletions(-)
> delete mode 100644 Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt
> create mode 100644 Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml
> diff --git a/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml b/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml
> new file mode 100644
> index 000000000000..1dda2ad29717
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml
> @@ -0,0 +1,187 @@
> +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
> +# Copyright 2020 NXP
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/misc/fsl,qoriq-mc.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +maintainers:
> + - Laurentiu Tudor <laurentiu.tudor@nxp.com>
> +
> +title: Freescale Management Complex
> +
> +description: |
> + The Freescale Management Complex (fsl-mc) is a hardware resource
> + manager that manages specialized hardware objects used in
> + network-oriented packet processing applications. After the fsl-mc
> + block is enabled, pools of hardware resources are available, such as
> + queues, buffer pools, I/O interfaces. These resources are building
> + blocks that can be used to create functional hardware objects/devices
> + such as network interfaces, crypto accelerator instances, L2 switches,
> + etc.
> +
> + For an overview of the DPAA2 architecture and fsl-mc bus see:
> + Documentation/networking/device_drivers/freescale/dpaa2/overview.rst
> +
> + As described in the above overview, all DPAA2 objects in a DPRC share the
> + same hardware "isolation context" and a 10-bit value called an ICID
> + (isolation context id) is expressed by the hardware to identify
> + the requester.
> +
> + The generic 'iommus' property is insufficient to describe the relationship
> + between ICIDs and IOMMUs, so an iommu-map property is used to define
> + the set of possible ICIDs under a root DPRC and how they map to
> + an IOMMU.
> +
> + For generic IOMMU bindings, see:
> + Documentation/devicetree/bindings/iommu/iommu.txt.
> +
> + For arm-smmu binding, see:
> + Documentation/devicetree/bindings/iommu/arm,smmu.yaml.
> +
> + MC firmware binary images can be found here:
> + https://github.com/NXP/qoriq-mc-binary
> +
> +properties:
> + compatible:
> + const: fsl,qoriq-mc
> + description:
> + A Freescale Management Complex compatible with this binding must have
> + Block Revision Registers BRR1 and BRR2 at offset 0x0BF8 and 0x0BFC in
> + the MC control register region.
> +
> + reg:
> + minItems: 1
> + items:
> + - description: the command portal for this machine
> + - description:
> + MC control registers. This region may not be present in some
> + scenarios, such as in the device tree presented to a virtual
> + machine.
> +
> + ranges:
> + description: |
> + A standard property. Defines the mapping between the child MC address
> + space and the parent system address space.
> +
> + The MC address space is defined by 3 components:
> + <region type> <offset hi> <offset lo>
> +
> + Valid values for region type are:
> + 0x0 - MC portals
> + 0x1 - QBMAN portals
> +
> + '#address-cells':
> + const: 3
> +
> + '#size-cells':
> + const: 1
> +
> + dpmacs:
> + type: object
> + description:
> + The fsl-mc node may optionally have dpmac sub-nodes that describe the
> + relationship between the Ethernet MACs which belong to the MC and the
> + Ethernet PHYs on the system board.
> +
> + properties:
> + '#address-cells':
> + const: 1
> +
> + '#size-cells':
> + const: 0
> +
> + patternProperties:
> + "^(dpmac@[0-9a-f]+)|(ethernet@[0-9a-f]+)$":
> + type: object
> +
> + description:
> + see Documentation/devicetree/bindings/net/fsl,qoriq-mc-dpmac.yaml
Use $ref here.
> +
> + iommu-map:
> + description: |
> + Maps an ICID to an IOMMU and associated iommu-specifier data.
> +
> + The property is an arbitrary number of tuples of
> + (icid-base, iommu, iommu-base, length).
> +
> + Any ICID i in the interval [icid-base, icid-base + length) is
> + associated with the listed IOMMU, with the iommu-specifier
> + (i - icid-base + iommu-base).
> +
> + msi-map:
> + description: |
> + Maps an ICID to a GIC ITS and associated msi-specifier data.
> +
> + The property is an arbitrary number of tuples of
> + (icid-base, gic-its, msi-base, length).
> +
> + Any ICID i in the interval [icid-base, icid-base + length) is
> + associated with the listed GIC ITS, with the msi-specifier
> + (i - icid-base + msi-base).
> +
> + msi-parent:
> + deprecated: true
> + description:
> + Points to the MSI controller node handling message interrupts for the MC.
> +
> +required:
> + - compatible
> + - reg
> + - iommu-map
> + - msi-map
> + - ranges
> + - '#address-cells'
> + - '#size-cells'
> +
> +additionalProperties: false
> +
> +examples:
> + - |
> + soc {
> + #address-cells = <2>;
> + #size-cells = <2>;
> +
> + smmu: iommu@5000000 {
> + compatible = "arm,mmu-500";
> + #global-interrupts = <1>;
> + #iommu-cells = <1>;
> + reg = <0 0x5000000 0 0x800000>;
> + stream-match-mask = <0x7c00>;
> + interrupts = <0 13 4>,
> + <0 146 4>, <0 147 4>,
> + <0 148 4>, <0 149 4>,
> + <0 150 4>, <0 151 4>,
> + <0 152 4>, <0 153 4>;
> + };
> +
> + fsl_mc: fsl-mc@80c000000 {
> + compatible = "fsl,qoriq-mc";
> + reg = <0x00000008 0x0c000000 0 0x40>, /* MC portal base */
> + <0x00000000 0x08340000 0 0x40000>; /* MC control reg */
> + /* define map for ICIDs 23-64 */
> + iommu-map = <23 &smmu 23 41>;
> + /* define msi map for ICIDs 23-64 */
> + msi-map = <23 &its 23 41>;
> + #address-cells = <3>;
> + #size-cells = <1>;
> +
> + /*
> + * Region type 0x0 - MC portals
> + * Region type 0x1 - QBMAN portals
> + */
> + ranges = <0x0 0x0 0x0 0x8 0x0c000000 0x4000000
> + 0x1 0x0 0x0 0x8 0x18000000 0x8000000>;
> +
> + dpmacs {
> + #address-cells = <1>;
> + #size-cells = <0>;
> +
> + ethernet@1 {
> + compatible = "fsl,qoriq-mc-dpmac";
> + reg = <1>;
> + phy-handle = <&mdio0_phy0>;
> + };
> + };
> + };
> + };
> diff --git a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
> index d638b5a8aadd..b3261c5871cc 100644
> --- a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
> +++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
> @@ -28,6 +28,9 @@ interfaces, an L2 switch, or accelerator instances.
> The MC provides memory-mapped I/O command interfaces (MC portals)
> which DPAA2 software drivers use to operate on DPAA2 objects.
>
> +MC firmware binary images can be found here:
> +https://github.com/NXP/qoriq-mc-binary
> +
> The diagram below shows an overview of the DPAA2 resource management
> architecture::
>
> @@ -338,7 +341,7 @@ Key functions include:
> a bind of the root DPRC to the DPRC driver
>
> The binding for the MC-bus device-tree node can be consulted at
> -*Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt*.
> +*Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml*.
> The sysfs bind/unbind interfaces for the MC-bus can be consulted at
> *Documentation/ABI/testing/sysfs-bus-fsl-mc*.
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index b516bb34a8d5..e0ce6e2b663c 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -14409,9 +14409,11 @@ M: Stuart Yoder <stuyoder@gmail.com>
> M: Laurentiu Tudor <laurentiu.tudor@nxp.com>
> L: linux-kernel@vger.kernel.org
> S: Maintained
> -F: Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt
> +F: Documentation/devicetree/bindings/misc/fsl,dpaa2-console.yaml
> +F: Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml
> F: Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
> F: drivers/bus/fsl-mc/
> +F: include/linux/fsl/mc.h
>
> QT1010 MEDIA DRIVER
> M: Antti Palosaari <crope@iki.fi>
> --
> 2.17.1
>
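
The translation rule stated in the iommu-map and msi-map descriptions can be written out as a small function. This is a sketch under stated assumptions: struct icid_map_entry and icid_translate() are hypothetical names, the phandle cell is left out, and a single tuple is handled rather than a list of them.

```c
#include <stdbool.h>
#include <stdint.h>

/* One (icid-base, <phandle>, out-base, length) tuple; the phandle is omitted. */
struct icid_map_entry {
	uint32_t icid_base;
	uint32_t out_base;   /* iommu-base or msi-base */
	uint32_t length;
};

/*
 * Translate an ICID to its specifier.
 * Returns false when the ICID lies outside [icid_base, icid_base + length).
 */
static bool icid_translate(const struct icid_map_entry *e, uint32_t icid,
			   uint32_t *out)
{
	if (icid < e->icid_base || icid - e->icid_base >= e->length)
		return false;
	*out = icid - e->icid_base + e->out_base;
	return true;
}
```

With the tuple from the binding example, <23 &smmu 23 41>, the half-open interval covers ICIDs 23 through 63 and each of them maps to the same numeric specifier.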
* Re: [EXT] Re: [RFC, net-next] net: qos: introduce a redundancy flow action
From: Jamal Hadi Salim @ 2020-11-21 12:43 UTC
To: Xiaoliang Yang, Joergen Andreasen
Cc: davem@davemloft.net, netdev@vger.kernel.org,
Arvid.Brodin@xdin.com, m-karicheri2@ti.com,
vinicius.gomes@intel.com, michael.chan@broadcom.com,
vishal@chelsio.com, saeedm@mellanox.com, jiri@mellanox.com,
idosch@mellanox.com, alexandre.belloni@bootlin.com,
UNGLinuxDriver@microchip.com, ivan.khoronzhuk@linaro.org,
andre.guedes@linux.intel.com, allan.nielsen@microchip.com, Po Liu,
Mingkai Hu, Claudiu Manoil, Vladimir Oltean, Leo Li, Cong Wang
In-Reply-To: <DB8PR04MB5785668533329D4B012545D1F0FF0@DB8PR04MB5785.eurprd04.prod.outlook.com>
Hi,
On 2020-11-20 2:32 a.m., Xiaoliang Yang wrote:
> Hi Jamal,
>
> On 2020-11-19 0:11, Jamal Hadi Salim wrote:
>> The 11/17/2020 14:30, Xiaoliang Yang wrote:
>>> EXTERNAL EMAIL: Do not click links or open attachments unless you
[..]
>> We already have mirroring + ability to add/pop tags.
>> Would the following not work?
>>
>> Example, generator mode:
>> tc filter add dev swp0 ingress protocol 802.1Q flower \
>> action vlan push id 789 protocol 802.1q \
>> action mirred egress mirror dev swp1 \
>> action mirred egress mirror dev swp2
>>
>> recovery mode:
>> tc filter add dev swp0 ingress protocol 802.1Q flower \
>> skip_hw dst_mac 00:01:02:03:04:06 vlan_id 1 \
>> action vlan pop
>>
>
> Action mirred only copy the packets and forward to different egress ports.
Also to ingress of ports (and can redirect as well, etc)
> 802.1CB need to add a redundancy tag before forward. Recovery mode need pop the
> redundancy tag and check sequence in R-TAG, drop the repeat frames. So I added
> a new action to do this.
>
Ok, when you said "tags" I thought you meant 802.1q tags. But seems like
different ethernet tags?
The preferred approach is to write an action to pop/push these R-TAGs.
generator mode:
tc filter add dev swp0 ingress protocol 802.1Q flower \
action rtag push \
action mirred egress mirror dev swp1 \
action mirred egress mirror dev swp2
recovery mode:
tc filter add dev swp0 ingress protocol 802.1Q flower \
action rtag pop
...
.....
We have a few exemplary actions which provide tags at L2
(IFE, MPLS, VLAN, etc).
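
To make the recovery side concrete, here is a deliberately simple duplicate filter keyed on the sequence number carried in the tag. It is a sketch under stated assumptions, not the recovery algorithm from 802.1CB and not code from the RFC patch: struct seq_recovery, seq_recovery_accept() and the window size SEQ_HISTORY are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEQ_HISTORY 32  /* hypothetical number of remembered sequence numbers */

struct seq_recovery {
	uint16_t seen[SEQ_HISTORY];   /* last sequence number stored per slot */
	bool     valid[SEQ_HISTORY];  /* whether the slot holds a number yet */
};

/* Returns true if the frame should be passed on, false if it is a repeat. */
static bool seq_recovery_accept(struct seq_recovery *r, uint16_t seq)
{
	unsigned int slot = seq % SEQ_HISTORY;

	if (r->valid[slot] && r->seen[slot] == seq)
		return false;   /* this sequence number was already delivered */

	r->seen[slot] = seq;
	r->valid[slot] = true;
	return true;
}
```

A pop action would strip the tag, feed its sequence number to a check like this, and drop the frame when the check returns false.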
cheers,
jamal
* [PATCH net 3/4] netfilter: ipset: prevent uninit-value in hash_ip6_add
From: Pablo Neira Ayuso @ 2020-11-21 12:36 UTC
To: netfilter-devel; +Cc: davem, netdev, kuba
In-Reply-To: <20201121123601.21733-1-pablo@netfilter.org>
From: Eric Dumazet <edumazet@google.com>
syzbot found that we are not validating user input properly
before copying 16 bytes [1].
Using NLA_BINARY in ipaddr_policy[] for IPv6 address is not correct,
since it ensures at most 16 bytes were provided.
We should instead make sure user provided exactly 16 bytes.
In old kernels (before v4.20), fix would be to remove the NLA_BINARY,
since NLA_POLICY_EXACT_LEN() was not yet available.
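
The difference between the two checks is small but it is the whole bug. The sketch below models only the length comparison, with hypothetical helper names; it does not use the netlink policy machinery itself.

```c
#include <stdbool.h>
#include <stddef.h>

#define IPV6_ADDR_LEN 16

/* "At most N bytes": the check the commit message attributes to NLA_BINARY. */
static bool len_ok_at_most(size_t attr_len, size_t max_len)
{
	return attr_len <= max_len;
}

/* "Exactly N bytes": what the fix asks for. */
static bool len_ok_exact(size_t attr_len, size_t want_len)
{
	return attr_len == want_len;
}
```

An 8-byte attribute passes the first check, so a later 16-byte copy reads 8 bytes the user never supplied. The second check rejects it up front.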
[1]
BUG: KMSAN: uninit-value in hash_ip6_add+0x1cba/0x3a50 net/netfilter/ipset/ip_set_hash_gen.h:892
CPU: 1 PID: 11611 Comm: syz-executor.0 Not tainted 5.10.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x21c/0x280 lib/dump_stack.c:118
kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:118
__msan_warning+0x5f/0xa0 mm/kmsan/kmsan_instr.c:197
hash_ip6_add+0x1cba/0x3a50 net/netfilter/ipset/ip_set_hash_gen.h:892
hash_ip6_uadt+0x976/0xbd0 net/netfilter/ipset/ip_set_hash_ip.c:267
call_ad+0x329/0xd00 net/netfilter/ipset/ip_set_core.c:1720
ip_set_ad+0x111f/0x1440 net/netfilter/ipset/ip_set_core.c:1808
ip_set_uadd+0xf6/0x110 net/netfilter/ipset/ip_set_core.c:1833
nfnetlink_rcv_msg+0xc7d/0xdf0 net/netfilter/nfnetlink.c:252
netlink_rcv_skb+0x70a/0x820 net/netlink/af_netlink.c:2494
nfnetlink_rcv+0x4f0/0x4380 net/netfilter/nfnetlink.c:600
netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
netlink_unicast+0x11da/0x14b0 net/netlink/af_netlink.c:1330
netlink_sendmsg+0x173c/0x1840 net/netlink/af_netlink.c:1919
sock_sendmsg_nosec net/socket.c:651 [inline]
sock_sendmsg net/socket.c:671 [inline]
____sys_sendmsg+0xc7a/0x1240 net/socket.c:2353
___sys_sendmsg net/socket.c:2407 [inline]
__sys_sendmsg+0x6d5/0x830 net/socket.c:2440
__do_sys_sendmsg net/socket.c:2449 [inline]
__se_sys_sendmsg+0x97/0xb0 net/socket.c:2447
__x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447
do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45deb9
Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fe2e503fc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000029ec0 RCX: 000000000045deb9
RDX: 0000000000000000 RSI: 0000000020000140 RDI: 0000000000000003
RBP: 000000000118bf60 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118bf2c
R13: 000000000169fb7f R14: 00007fe2e50409c0 R15: 000000000118bf2c
Uninit was stored to memory at:
kmsan_save_stack_with_flags mm/kmsan/kmsan.c:121 [inline]
kmsan_internal_chain_origin+0xad/0x130 mm/kmsan/kmsan.c:289
__msan_chain_origin+0x57/0xa0 mm/kmsan/kmsan_instr.c:147
ip6_netmask include/linux/netfilter/ipset/pfxlen.h:49 [inline]
hash_ip6_netmask net/netfilter/ipset/ip_set_hash_ip.c:185 [inline]
hash_ip6_uadt+0xb1c/0xbd0 net/netfilter/ipset/ip_set_hash_ip.c:263
call_ad+0x329/0xd00 net/netfilter/ipset/ip_set_core.c:1720
ip_set_ad+0x111f/0x1440 net/netfilter/ipset/ip_set_core.c:1808
ip_set_uadd+0xf6/0x110 net/netfilter/ipset/ip_set_core.c:1833
nfnetlink_rcv_msg+0xc7d/0xdf0 net/netfilter/nfnetlink.c:252
netlink_rcv_skb+0x70a/0x820 net/netlink/af_netlink.c:2494
nfnetlink_rcv+0x4f0/0x4380 net/netfilter/nfnetlink.c:600
netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
netlink_unicast+0x11da/0x14b0 net/netlink/af_netlink.c:1330
netlink_sendmsg+0x173c/0x1840 net/netlink/af_netlink.c:1919
sock_sendmsg_nosec net/socket.c:651 [inline]
sock_sendmsg net/socket.c:671 [inline]
____sys_sendmsg+0xc7a/0x1240 net/socket.c:2353
___sys_sendmsg net/socket.c:2407 [inline]
__sys_sendmsg+0x6d5/0x830 net/socket.c:2440
__do_sys_sendmsg net/socket.c:2449 [inline]
__se_sys_sendmsg+0x97/0xb0 net/socket.c:2447
__x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447
do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Uninit was stored to memory at:
kmsan_save_stack_with_flags mm/kmsan/kmsan.c:121 [inline]
kmsan_internal_chain_origin+0xad/0x130 mm/kmsan/kmsan.c:289
kmsan_memcpy_memmove_metadata+0x25e/0x2d0 mm/kmsan/kmsan.c:226
kmsan_memcpy_metadata+0xb/0x10 mm/kmsan/kmsan.c:246
__msan_memcpy+0x46/0x60 mm/kmsan/kmsan_instr.c:110
ip_set_get_ipaddr6+0x2cb/0x370 net/netfilter/ipset/ip_set_core.c:310
hash_ip6_uadt+0x439/0xbd0 net/netfilter/ipset/ip_set_hash_ip.c:255
call_ad+0x329/0xd00 net/netfilter/ipset/ip_set_core.c:1720
ip_set_ad+0x111f/0x1440 net/netfilter/ipset/ip_set_core.c:1808
ip_set_uadd+0xf6/0x110 net/netfilter/ipset/ip_set_core.c:1833
nfnetlink_rcv_msg+0xc7d/0xdf0 net/netfilter/nfnetlink.c:252
netlink_rcv_skb+0x70a/0x820 net/netlink/af_netlink.c:2494
nfnetlink_rcv+0x4f0/0x4380 net/netfilter/nfnetlink.c:600
netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
netlink_unicast+0x11da/0x14b0 net/netlink/af_netlink.c:1330
netlink_sendmsg+0x173c/0x1840 net/netlink/af_netlink.c:1919
sock_sendmsg_nosec net/socket.c:651 [inline]
sock_sendmsg net/socket.c:671 [inline]
____sys_sendmsg+0xc7a/0x1240 net/socket.c:2353
___sys_sendmsg net/socket.c:2407 [inline]
__sys_sendmsg+0x6d5/0x830 net/socket.c:2440
__do_sys_sendmsg net/socket.c:2449 [inline]
__se_sys_sendmsg+0x97/0xb0 net/socket.c:2447
__x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447
do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Uninit was created at:
kmsan_save_stack_with_flags mm/kmsan/kmsan.c:121 [inline]
kmsan_internal_poison_shadow+0x5c/0xf0 mm/kmsan/kmsan.c:104
kmsan_slab_alloc+0x8d/0xe0 mm/kmsan/kmsan_hooks.c:76
slab_alloc_node mm/slub.c:2906 [inline]
__kmalloc_node_track_caller+0xc61/0x15f0 mm/slub.c:4512
__kmalloc_reserve net/core/skbuff.c:142 [inline]
__alloc_skb+0x309/0xae0 net/core/skbuff.c:210
alloc_skb include/linux/skbuff.h:1094 [inline]
netlink_alloc_large_skb net/netlink/af_netlink.c:1176 [inline]
netlink_sendmsg+0xdb8/0x1840 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:651 [inline]
sock_sendmsg net/socket.c:671 [inline]
____sys_sendmsg+0xc7a/0x1240 net/socket.c:2353
___sys_sendmsg net/socket.c:2407 [inline]
__sys_sendmsg+0x6d5/0x830 net/socket.c:2440
__do_sys_sendmsg net/socket.c:2449 [inline]
__se_sys_sendmsg+0x97/0xb0 net/socket.c:2447
__x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447
do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: a7b4f989a629 ("netfilter: ipset: IP set core support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/ipset/ip_set_core.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index 7cff6e5e7445..2b19189a930f 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -271,8 +271,7 @@ flag_nested(const struct nlattr *nla)
static const struct nla_policy ipaddr_policy[IPSET_ATTR_IPADDR_MAX + 1] = {
[IPSET_ATTR_IPADDR_IPV4] = { .type = NLA_U32 },
- [IPSET_ATTR_IPADDR_IPV6] = { .type = NLA_BINARY,
- .len = sizeof(struct in6_addr) },
+ [IPSET_ATTR_IPADDR_IPV6] = NLA_POLICY_EXACT_LEN(sizeof(struct in6_addr)),
};
int
--
2.20.1
* [PATCH net 4/4] netfilter: nf_tables: avoid false-positive lockdep splat
From: Pablo Neira Ayuso @ 2020-11-21 12:36 UTC
To: netfilter-devel; +Cc: davem, netdev, kuba
In-Reply-To: <20201121123601.21733-1-pablo@netfilter.org>
From: Florian Westphal <fw@strlen.de>
There are reports wrt lockdep splat in nftables, e.g.:
------------[ cut here ]------------
WARNING: CPU: 2 PID: 31416 at net/netfilter/nf_tables_api.c:622
lockdep_nfnl_nft_mutex_not_held+0x28/0x38 [nf_tables]
...
These are caused by an earlier, unrelated bug such as an ABBA deadlock
in a different subsystem.
In such an event, lockdep is disabled and lockdep_is_held returns true
unconditionally. This then causes the WARN() in nf_tables.
Make the WARN conditional on lockdep still active to avoid this.
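
The interaction can be modelled in a few lines of userspace C. Everything here is a stand-in: model_debug_locks plays the role of the debug_locks flag, and model_lock_is_held() imitates a held-check that answers "held" once lockdep has switched itself off. The names are hypothetical and nothing below is kernel code.

```c
#include <stdbool.h>

/* Model of lockdep state: cleared once an earlier bug trips lockdep. */
static bool model_debug_locks = true;

/* Model of a held-check: with lockdep off it cannot know, so it says "held". */
static bool model_lock_is_held(bool really_held)
{
	if (!model_debug_locks)
		return true;
	return really_held;
}

/* Returns true when the "mutex must not be held" warning would fire. */
static bool warn_if_held(bool really_held)
{
	if (!model_debug_locks)
		return false;   /* the fix: stay silent once lockdep is off */
	return model_lock_is_held(really_held);
}
```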
Fixes: f102d66b335a417 ("netfilter: nf_tables: use dedicated mutex to guard transactions")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Link: https://lore.kernel.org/linux-kselftest/CA+G9fYvFUpODs+NkSYcnwKnXm62tmP=ksLeBPmB+KFrB2rvCtQ@mail.gmail.com/
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/nf_tables_api.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 0f58e98542be..23abf1578594 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -619,7 +619,8 @@ static int nft_request_module(struct net *net, const char *fmt, ...)
static void lockdep_nfnl_nft_mutex_not_held(void)
{
#ifdef CONFIG_PROVE_LOCKING
- WARN_ON_ONCE(lockdep_nfnl_is_held(NFNL_SUBSYS_NFTABLES));
+ if (debug_locks)
+ WARN_ON_ONCE(lockdep_nfnl_is_held(NFNL_SUBSYS_NFTABLES));
#endif
}
--
2.20.1
* [PATCH net 2/4] netfilter: nftables_offload: build mask based from the matching bytes
From: Pablo Neira Ayuso @ 2020-11-21 12:35 UTC
To: netfilter-devel; +Cc: davem, netdev, kuba
In-Reply-To: <20201121123601.21733-1-pablo@netfilter.org>
Userspace might match on prefix bytes of header fields if they are on
the byte boundary, this requires that the mask is adjusted accordingly.
Use NFT_OFFLOAD_MATCH_EXACT() for meta since prefix byte matching is not
allowed for this type of selector.
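
As an illustration of what "adjusting the mask" means, the sketch below builds a byte mask that covers only the leading bytes being matched. It is a byte-wise simplification with a hypothetical name, build_prefix_mask(), and is not the nft_payload_offload_mask() helper from this patch, which works on 32-bit words.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Fill mask[0..field_len) so that the first match_len bytes are compared
 * and the remaining bytes are ignored.
 * Returns false if the match is longer than the field.
 */
static bool build_prefix_mask(unsigned char *mask, size_t match_len,
			      size_t field_len)
{
	if (match_len > field_len)
		return false;

	memset(mask, 0xff, match_len);
	memset(mask + match_len, 0x00, field_len - match_len);
	return true;
}
```

Matching the first three bytes of a six-byte Ethernet address would then give the mask ff:ff:ff:00:00:00, while an exact match gives all ones.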
Fixes: c9626a2cbdb2 ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
include/net/netfilter/nf_tables_offload.h | 3 ++
net/netfilter/nft_cmp.c | 8 +--
net/netfilter/nft_meta.c | 16 +++---
net/netfilter/nft_payload.c | 66 +++++++++++++++++------
4 files changed, 64 insertions(+), 29 deletions(-)
diff --git a/include/net/netfilter/nf_tables_offload.h b/include/net/netfilter/nf_tables_offload.h
index bddd34c5bd79..1d34fe154fe0 100644
--- a/include/net/netfilter/nf_tables_offload.h
+++ b/include/net/netfilter/nf_tables_offload.h
@@ -78,6 +78,9 @@ int nft_flow_rule_offload_commit(struct net *net);
offsetof(struct nft_flow_key, __base.__field); \
(__reg)->len = __len; \
(__reg)->key = __key; \
+
+#define NFT_OFFLOAD_MATCH_EXACT(__key, __base, __field, __len, __reg) \
+ NFT_OFFLOAD_MATCH(__key, __base, __field, __len, __reg) \
memset(&(__reg)->mask, 0xff, (__reg)->len);
int nft_chain_offload_priority(struct nft_base_chain *basechain);
diff --git a/net/netfilter/nft_cmp.c b/net/netfilter/nft_cmp.c
index bc079d68a536..00e563a72d3d 100644
--- a/net/netfilter/nft_cmp.c
+++ b/net/netfilter/nft_cmp.c
@@ -123,11 +123,11 @@ static int __nft_cmp_offload(struct nft_offload_ctx *ctx,
u8 *mask = (u8 *)&flow->match.mask;
u8 *key = (u8 *)&flow->match.key;
- if (priv->op != NFT_CMP_EQ || reg->len != priv->len)
+ if (priv->op != NFT_CMP_EQ || priv->len > reg->len)
return -EOPNOTSUPP;
- memcpy(key + reg->offset, &priv->data, priv->len);
- memcpy(mask + reg->offset, &reg->mask, priv->len);
+ memcpy(key + reg->offset, &priv->data, reg->len);
+ memcpy(mask + reg->offset, &reg->mask, reg->len);
flow->match.dissector.used_keys |= BIT(reg->key);
flow->match.dissector.offset[reg->key] = reg->base_offset;
@@ -137,7 +137,7 @@ static int __nft_cmp_offload(struct nft_offload_ctx *ctx,
nft_reg_load16(priv->data.data) != ARPHRD_ETHER)
return -EOPNOTSUPP;
- nft_offload_update_dependency(ctx, &priv->data, priv->len);
+ nft_offload_update_dependency(ctx, &priv->data, reg->len);
return 0;
}
diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
index b37bd02448d8..bf4b3ad5314c 100644
--- a/net/netfilter/nft_meta.c
+++ b/net/netfilter/nft_meta.c
@@ -724,22 +724,22 @@ static int nft_meta_get_offload(struct nft_offload_ctx *ctx,
switch (priv->key) {
case NFT_META_PROTOCOL:
- NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_BASIC, basic, n_proto,
- sizeof(__u16), reg);
+ NFT_OFFLOAD_MATCH_EXACT(FLOW_DISSECTOR_KEY_BASIC, basic, n_proto,
+ sizeof(__u16), reg);
nft_offload_set_dependency(ctx, NFT_OFFLOAD_DEP_NETWORK);
break;
case NFT_META_L4PROTO:
- NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_BASIC, basic, ip_proto,
- sizeof(__u8), reg);
+ NFT_OFFLOAD_MATCH_EXACT(FLOW_DISSECTOR_KEY_BASIC, basic, ip_proto,
+ sizeof(__u8), reg);
nft_offload_set_dependency(ctx, NFT_OFFLOAD_DEP_TRANSPORT);
break;
case NFT_META_IIF:
- NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_META, meta,
- ingress_ifindex, sizeof(__u32), reg);
+ NFT_OFFLOAD_MATCH_EXACT(FLOW_DISSECTOR_KEY_META, meta,
+ ingress_ifindex, sizeof(__u32), reg);
break;
case NFT_META_IIFTYPE:
- NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_META, meta,
- ingress_iftype, sizeof(__u16), reg);
+ NFT_OFFLOAD_MATCH_EXACT(FLOW_DISSECTOR_KEY_META, meta,
+ ingress_iftype, sizeof(__u16), reg);
break;
default:
return -EOPNOTSUPP;
diff --git a/net/netfilter/nft_payload.c b/net/netfilter/nft_payload.c
index bbf811d030d5..47d4e0e21651 100644
--- a/net/netfilter/nft_payload.c
+++ b/net/netfilter/nft_payload.c
@@ -165,6 +165,34 @@ static int nft_payload_dump(struct sk_buff *skb, const struct nft_expr *expr)
return -1;
}
+static bool nft_payload_offload_mask(struct nft_offload_reg *reg,
+ u32 priv_len, u32 field_len)
+{
+ unsigned int remainder, delta, k;
+ struct nft_data mask = {};
+ __be32 remainder_mask;
+
+ if (priv_len == field_len) {
+ memset(&reg->mask, 0xff, priv_len);
+ return true;
+ } else if (priv_len > field_len) {
+ return false;
+ }
+
+ memset(&mask, 0xff, field_len);
+ remainder = priv_len % sizeof(u32);
+ if (remainder) {
+ k = priv_len / sizeof(u32);
+ delta = field_len - priv_len;
+ remainder_mask = htonl(~((1 << (delta * BITS_PER_BYTE)) - 1));
+ mask.data[k] = (__force u32)remainder_mask;
+ }
+
+ memcpy(&reg->mask, &mask, field_len);
+
+ return true;
+}
+
static int nft_payload_offload_ll(struct nft_offload_ctx *ctx,
struct nft_flow_rule *flow,
const struct nft_payload *priv)
@@ -173,21 +201,21 @@ static int nft_payload_offload_ll(struct nft_offload_ctx *ctx,
switch (priv->offset) {
case offsetof(struct ethhdr, h_source):
- if (priv->len != ETH_ALEN)
+ if (!nft_payload_offload_mask(reg, priv->len, ETH_ALEN))
return -EOPNOTSUPP;
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_ETH_ADDRS, eth_addrs,
src, ETH_ALEN, reg);
break;
case offsetof(struct ethhdr, h_dest):
- if (priv->len != ETH_ALEN)
+ if (!nft_payload_offload_mask(reg, priv->len, ETH_ALEN))
return -EOPNOTSUPP;
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_ETH_ADDRS, eth_addrs,
dst, ETH_ALEN, reg);
break;
case offsetof(struct ethhdr, h_proto):
- if (priv->len != sizeof(__be16))
+ if (!nft_payload_offload_mask(reg, priv->len, sizeof(__be16)))
return -EOPNOTSUPP;
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_BASIC, basic,
@@ -195,14 +223,14 @@ static int nft_payload_offload_ll(struct nft_offload_ctx *ctx,
nft_offload_set_dependency(ctx, NFT_OFFLOAD_DEP_NETWORK);
break;
case offsetof(struct vlan_ethhdr, h_vlan_TCI):
- if (priv->len != sizeof(__be16))
+ if (!nft_payload_offload_mask(reg, priv->len, sizeof(__be16)))
return -EOPNOTSUPP;
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_VLAN, vlan,
vlan_tci, sizeof(__be16), reg);
break;
case offsetof(struct vlan_ethhdr, h_vlan_encapsulated_proto):
- if (priv->len != sizeof(__be16))
+ if (!nft_payload_offload_mask(reg, priv->len, sizeof(__be16)))
return -EOPNOTSUPP;
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_VLAN, vlan,
@@ -210,7 +238,7 @@ static int nft_payload_offload_ll(struct nft_offload_ctx *ctx,
nft_offload_set_dependency(ctx, NFT_OFFLOAD_DEP_NETWORK);
break;
case offsetof(struct vlan_ethhdr, h_vlan_TCI) + sizeof(struct vlan_hdr):
- if (priv->len != sizeof(__be16))
+ if (!nft_payload_offload_mask(reg, priv->len, sizeof(__be16)))
return -EOPNOTSUPP;
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_CVLAN, vlan,
@@ -218,7 +246,7 @@ static int nft_payload_offload_ll(struct nft_offload_ctx *ctx,
break;
case offsetof(struct vlan_ethhdr, h_vlan_encapsulated_proto) +
sizeof(struct vlan_hdr):
- if (priv->len != sizeof(__be16))
+ if (!nft_payload_offload_mask(reg, priv->len, sizeof(__be16)))
return -EOPNOTSUPP;
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_CVLAN, vlan,
@@ -239,7 +267,8 @@ static int nft_payload_offload_ip(struct nft_offload_ctx *ctx,
switch (priv->offset) {
case offsetof(struct iphdr, saddr):
- if (priv->len != sizeof(struct in_addr))
+ if (!nft_payload_offload_mask(reg, priv->len,
+ sizeof(struct in_addr)))
return -EOPNOTSUPP;
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_IPV4_ADDRS, ipv4, src,
@@ -247,7 +276,8 @@ static int nft_payload_offload_ip(struct nft_offload_ctx *ctx,
nft_flow_rule_set_addr_type(flow, FLOW_DISSECTOR_KEY_IPV4_ADDRS);
break;
case offsetof(struct iphdr, daddr):
- if (priv->len != sizeof(struct in_addr))
+ if (!nft_payload_offload_mask(reg, priv->len,
+ sizeof(struct in_addr)))
return -EOPNOTSUPP;
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_IPV4_ADDRS, ipv4, dst,
@@ -255,7 +285,7 @@ static int nft_payload_offload_ip(struct nft_offload_ctx *ctx,
nft_flow_rule_set_addr_type(flow, FLOW_DISSECTOR_KEY_IPV4_ADDRS);
break;
case offsetof(struct iphdr, protocol):
- if (priv->len != sizeof(__u8))
+ if (!nft_payload_offload_mask(reg, priv->len, sizeof(__u8)))
return -EOPNOTSUPP;
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_BASIC, basic, ip_proto,
@@ -277,7 +307,8 @@ static int nft_payload_offload_ip6(struct nft_offload_ctx *ctx,
switch (priv->offset) {
case offsetof(struct ipv6hdr, saddr):
- if (priv->len != sizeof(struct in6_addr))
+ if (!nft_payload_offload_mask(reg, priv->len,
+ sizeof(struct in6_addr)))
return -EOPNOTSUPP;
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_IPV6_ADDRS, ipv6, src,
@@ -285,7 +316,8 @@ static int nft_payload_offload_ip6(struct nft_offload_ctx *ctx,
nft_flow_rule_set_addr_type(flow, FLOW_DISSECTOR_KEY_IPV6_ADDRS);
break;
case offsetof(struct ipv6hdr, daddr):
- if (priv->len != sizeof(struct in6_addr))
+ if (!nft_payload_offload_mask(reg, priv->len,
+ sizeof(struct in6_addr)))
return -EOPNOTSUPP;
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_IPV6_ADDRS, ipv6, dst,
@@ -293,7 +325,7 @@ static int nft_payload_offload_ip6(struct nft_offload_ctx *ctx,
nft_flow_rule_set_addr_type(flow, FLOW_DISSECTOR_KEY_IPV6_ADDRS);
break;
case offsetof(struct ipv6hdr, nexthdr):
- if (priv->len != sizeof(__u8))
+ if (!nft_payload_offload_mask(reg, priv->len, sizeof(__u8)))
return -EOPNOTSUPP;
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_BASIC, basic, ip_proto,
@@ -335,14 +367,14 @@ static int nft_payload_offload_tcp(struct nft_offload_ctx *ctx,
switch (priv->offset) {
case offsetof(struct tcphdr, source):
- if (priv->len != sizeof(__be16))
+ if (!nft_payload_offload_mask(reg, priv->len, sizeof(__be16)))
return -EOPNOTSUPP;
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_PORTS, tp, src,
sizeof(__be16), reg);
break;
case offsetof(struct tcphdr, dest):
- if (priv->len != sizeof(__be16))
+ if (!nft_payload_offload_mask(reg, priv->len, sizeof(__be16)))
return -EOPNOTSUPP;
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_PORTS, tp, dst,
@@ -363,14 +395,14 @@ static int nft_payload_offload_udp(struct nft_offload_ctx *ctx,
switch (priv->offset) {
case offsetof(struct udphdr, source):
- if (priv->len != sizeof(__be16))
+ if (!nft_payload_offload_mask(reg, priv->len, sizeof(__be16)))
return -EOPNOTSUPP;
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_PORTS, tp, src,
sizeof(__be16), reg);
break;
case offsetof(struct udphdr, dest):
- if (priv->len != sizeof(__be16))
+ if (!nft_payload_offload_mask(reg, priv->len, sizeof(__be16)))
return -EOPNOTSUPP;
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_PORTS, tp, dst,
--
2.20.1
^ permalink raw reply related
* [PATCH net 1/4] netfilter: nftables_offload: set address type in control dissector
From: Pablo Neira Ayuso @ 2020-11-21 12:35 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev, kuba
In-Reply-To: <20201121123601.21733-1-pablo@netfilter.org>
If the address type is missing through the control dissector, then
matching on IPv4 and IPv6 addresses does not work. Set it accordingly so
rules that specify an IP address successfully match on packets.
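The behaviour described above can be sketched in plain userspace C. This is only an illustration, not the kernel code: the sketch_* names and the reduced struct are hypothetical stand-ins for the nft_flow_match fields touched by the patch. It shows the address type being recorded once, with a full mask, and the control key being flagged as used.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's flow dissector key ids. */
enum sketch_key_id {
	SKETCH_KEY_CONTROL = 0,
	SKETCH_KEY_IPV4_ADDRS = 2,
	SKETCH_KEY_IPV6_ADDRS = 3,
};

/* Reduced, hypothetical view of a flow match: only the control part. */
struct sketch_match {
	uint32_t used_keys;      /* bitmap of keys present in the match */
	uint16_t key_addr_type;  /* address family the rule matches on */
	uint16_t mask_addr_type; /* 0xffff means "compare every bit" */
};

/* Record the address type once; later calls keep the first value. */
static void sketch_set_addr_type(struct sketch_match *m,
				 enum sketch_key_id type)
{
	if (m->used_keys & (1u << SKETCH_KEY_CONTROL))
		return;

	m->key_addr_type = (uint16_t)type;
	m->mask_addr_type = 0xffff;
	m->used_keys |= 1u << SKETCH_KEY_CONTROL;
}
```

The early return mirrors the check in the patch: a rule that matches on both saddr and daddr calls the helper twice, and the second call must not disturb what the first one set.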
Fixes: c9626a2cbdb2 ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
include/net/netfilter/nf_tables_offload.h | 4 ++++
net/netfilter/nf_tables_offload.c | 18 ++++++++++++++++++
net/netfilter/nft_payload.c | 4 ++++
3 files changed, 26 insertions(+)
diff --git a/include/net/netfilter/nf_tables_offload.h b/include/net/netfilter/nf_tables_offload.h
index ea7d1d78b92d..bddd34c5bd79 100644
--- a/include/net/netfilter/nf_tables_offload.h
+++ b/include/net/netfilter/nf_tables_offload.h
@@ -37,6 +37,7 @@ void nft_offload_update_dependency(struct nft_offload_ctx *ctx,
struct nft_flow_key {
struct flow_dissector_key_basic basic;
+ struct flow_dissector_key_control control;
union {
struct flow_dissector_key_ipv4_addrs ipv4;
struct flow_dissector_key_ipv6_addrs ipv6;
@@ -62,6 +63,9 @@ struct nft_flow_rule {
#define NFT_OFFLOAD_F_ACTION (1 << 0)
+void nft_flow_rule_set_addr_type(struct nft_flow_rule *flow,
+ enum flow_dissector_key_id addr_type);
+
struct nft_rule;
struct nft_flow_rule *nft_flow_rule_create(struct net *net, const struct nft_rule *rule);
void nft_flow_rule_destroy(struct nft_flow_rule *flow);
diff --git a/net/netfilter/nf_tables_offload.c b/net/netfilter/nf_tables_offload.c
index 9f625724a20f..9a3c5ac057b6 100644
--- a/net/netfilter/nf_tables_offload.c
+++ b/net/netfilter/nf_tables_offload.c
@@ -28,6 +28,24 @@ static struct nft_flow_rule *nft_flow_rule_alloc(int num_actions)
return flow;
}
+void nft_flow_rule_set_addr_type(struct nft_flow_rule *flow,
+ enum flow_dissector_key_id addr_type)
+{
+ struct nft_flow_match *match = &flow->match;
+ struct nft_flow_key *mask = &match->mask;
+ struct nft_flow_key *key = &match->key;
+
+ if (match->dissector.used_keys & BIT(FLOW_DISSECTOR_KEY_CONTROL))
+ return;
+
+ key->control.addr_type = addr_type;
+ mask->control.addr_type = 0xffff;
+ match->dissector.used_keys |= BIT(FLOW_DISSECTOR_KEY_CONTROL);
+ match->dissector.offset[FLOW_DISSECTOR_KEY_CONTROL] =
+ offsetof(struct nft_flow_key, control);
+}
+EXPORT_SYMBOL_GPL(nft_flow_rule_set_addr_type);
+
struct nft_flow_rule *nft_flow_rule_create(struct net *net,
const struct nft_rule *rule)
{
diff --git a/net/netfilter/nft_payload.c b/net/netfilter/nft_payload.c
index dcd3c7b8a367..bbf811d030d5 100644
--- a/net/netfilter/nft_payload.c
+++ b/net/netfilter/nft_payload.c
@@ -244,6 +244,7 @@ static int nft_payload_offload_ip(struct nft_offload_ctx *ctx,
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_IPV4_ADDRS, ipv4, src,
sizeof(struct in_addr), reg);
+ nft_flow_rule_set_addr_type(flow, FLOW_DISSECTOR_KEY_IPV4_ADDRS);
break;
case offsetof(struct iphdr, daddr):
if (priv->len != sizeof(struct in_addr))
@@ -251,6 +252,7 @@ static int nft_payload_offload_ip(struct nft_offload_ctx *ctx,
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_IPV4_ADDRS, ipv4, dst,
sizeof(struct in_addr), reg);
+ nft_flow_rule_set_addr_type(flow, FLOW_DISSECTOR_KEY_IPV4_ADDRS);
break;
case offsetof(struct iphdr, protocol):
if (priv->len != sizeof(__u8))
@@ -280,6 +282,7 @@ static int nft_payload_offload_ip6(struct nft_offload_ctx *ctx,
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_IPV6_ADDRS, ipv6, src,
sizeof(struct in6_addr), reg);
+ nft_flow_rule_set_addr_type(flow, FLOW_DISSECTOR_KEY_IPV6_ADDRS);
break;
case offsetof(struct ipv6hdr, daddr):
if (priv->len != sizeof(struct in6_addr))
@@ -287,6 +290,7 @@ static int nft_payload_offload_ip6(struct nft_offload_ctx *ctx,
NFT_OFFLOAD_MATCH(FLOW_DISSECTOR_KEY_IPV6_ADDRS, ipv6, dst,
sizeof(struct in6_addr), reg);
+ nft_flow_rule_set_addr_type(flow, FLOW_DISSECTOR_KEY_IPV6_ADDRS);
break;
case offsetof(struct ipv6hdr, nexthdr):
if (priv->len != sizeof(__u8))
--
2.20.1
^ permalink raw reply related
* [PATCH net 0/4] Netfilter fixes for net
From: Pablo Neira Ayuso @ 2020-11-21 12:35 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev, kuba
Hi,
The following patchset contains Netfilter fixes for net:
1) Fix missing control data in flow dissector, otherwise IP address
matching in hardware offload infra does not work.
2) Fix hardware offload match on prefix IP address when userspace
does not send a bitwise expression to represent the prefix.
3) Insufficient validation of IPSET_ATTR_IPADDR_IPV6 reported
by syzbot.
4) Remove spurious reports on nf_tables when lockdep gets disabled,
from Florian Westphal.
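As an illustration of item 2, here is a byte-wise sketch of building a mask from the number of matching bytes. It is a simplification under stated assumptions, not the kernel helper: sketch_prefix_mask is a hypothetical name, the mask is a plain byte array in network order, and rejecting a zero-length match is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Fill 'mask' (field_len bytes) so that the first match_len bytes are
 * compared and the remaining bytes are ignored. Returns false when the
 * requested match does not fit into the field.
 */
static bool sketch_prefix_mask(uint8_t *mask, size_t match_len,
			       size_t field_len)
{
	if (match_len == 0 || match_len > field_len)
		return false;

	memset(mask, 0xff, match_len);
	memset(mask + match_len, 0x00, field_len - match_len);
	return true;
}
```

With this, a rule that matches only the first three bytes of an IPv4 address would yield the mask ff ff ff 00, which corresponds to a /24 prefix, without userspace having to send a bitwise expression.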
Please, pull these changes from:
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git
Thank you.
----------------------------------------------------------------
The following changes since commit 849920c703392957f94023f77ec89ca6cf119d43:
devlink: Add missing genlmsg_cancel() in devlink_nl_sb_port_pool_fill() (2020-11-14 16:23:19 -0800)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git HEAD
for you to fetch changes up to 986fbd9842ba114c74b4fb61c4dc146d87a55316:
netfilter: nf_tables: avoid false-postive lockdep splat (2020-11-20 10:18:39 +0100)
----------------------------------------------------------------
Eric Dumazet (1):
netfilter: ipset: prevent uninit-value in hash_ip6_add
Florian Westphal (1):
netfilter: nf_tables: avoid false-postive lockdep splat
Pablo Neira Ayuso (2):
netfilter: nftables_offload: set address type in control dissector
netfilter: nftables_offload: build mask based from the matching bytes
include/net/netfilter/nf_tables_offload.h | 7 ++++
net/netfilter/ipset/ip_set_core.c | 3 +-
net/netfilter/nf_tables_api.c | 3 +-
net/netfilter/nf_tables_offload.c | 18 ++++++++
net/netfilter/nft_cmp.c | 8 ++--
net/netfilter/nft_meta.c | 16 +++----
net/netfilter/nft_payload.c | 70 +++++++++++++++++++++++--------
7 files changed, 93 insertions(+), 32 deletions(-)
^ permalink raw reply
* Re: [PATCH net-next,v3 0/9] netfilter: flowtable bridge and vlan enhancements
From: Pablo Neira Ayuso @ 2020-11-21 12:31 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Tobias Waldekranz, netfilter-devel, davem, netdev, razor, jeremy
In-Reply-To: <20201116225658.GA7247@salvia>
On Mon, Nov 16, 2020 at 11:56:58PM +0100, Pablo Neira Ayuso wrote:
> On Mon, Nov 16, 2020 at 02:45:21PM -0800, Jakub Kicinski wrote:
> > On Mon, 16 Nov 2020 23:36:15 +0100 Pablo Neira Ayuso wrote:
> > > > Are you saying A -> B traffic won't match so it will update the cache,
> > > > since conntrack flows are bi-directional?
> > >
> > > Yes, Traffic for A -> B won't match the flowtable entry, this will
> > > update the cache.
> >
> > That's assuming there will be A -> B traffic without B sending a
> > request which reaches A, first.
>
> B might send packets to A but this will not get anywhere. Assuming
> TCP, this will trigger retransmissions so B -> A will kick in to
> refresh the entry.
>
> Is this scenario that you describe a showstopper?
I have been discussing the topology update by tracking fdb updates
with the bridge maintainer. I'll be exploring extensions to the
existing fdb_notify() infrastructure to deal with the scenario you
describe. On my side, supporting this topology update scenario is not
a priority for this patchset, but it's feasible to support it
later on.
^ permalink raw reply
* Re: [PATCH net-next,v5 0/9] netfilter: flowtable bridge and vlan enhancements
From: Pablo Neira Ayuso @ 2020-11-21 11:58 UTC (permalink / raw)
To: Alexander Lobakin
Cc: netfilter-devel, davem, netdev, kuba, fw, razor, jeremy, tobias
In-Reply-To: <JbOm90Raei3ADlleQvsaCY9krt0lOkG1YFpbZEgylgU@cp4-web-014.plabs.ch>
Hi,
On Fri, Nov 20, 2020 at 03:09:37PM +0000, Alexander Lobakin wrote:
> From: Pablo Neira Ayuso <pablo@netfilter.org>
> Date: Fri, 20 Nov 2020 13:49:12 +0100
[...]
> > The following patchset augments the Netfilter flowtable fastpath to
> > support for network topologies that combine IP forwarding, bridge and
> > VLAN devices.
>
> I'm curious if this new infra can be expanded later to shortcut other
> VLAN-like virtual netdevs e.g. DSA-like switch slaves.
>
> I mean, usually we have port0...portX physical port representors
> and backing CPU port with ethX representor. When it comes to NAT,
> portX is set as destination. Flow offload calls dev_queue_xmit()
> on it, switch stack pushes CPU tag into the skb, change skb->dev
> to ethX and calls another dev_queue_xmit().
>
> If we could (using the new .ndo_fill_forward_path()) tell Netfilter
> that our real dest is ethX and push the CPU tag via dev_hard_header(),
> this will omit one more dev_queue_xmit() and a bunch of indirect calls
> and checks.
If the XMIT_DIRECT path can be used for this with minimal changes,
that would be good.
> This might require some sort of "custom" or "private" cookies for
> N-Tuple though to separate flows from/to different switch ports (as
> it's done for VLAN: proto + VID).
Probably VLAN proto + VID in the tuple can be reused for this too.
Maybe add some extra information to tell if this is a VLAN or DSA
frame. It should be just one extra check that skb->protocol equals DSA.
It looks like only minimal changes are needed to support this.
> If so, I'd like to try to implement and publish that idea for reviews
> after this one lands nf-next.
Exploring new extensions is fine.
I received another email from someone else who would like to extend
this to support PPPoE devices with PcEngines APU routers. In
general, adding .ndo_fill_forward_path for more device types is
possible.
^ permalink raw reply
* [PATCH net-next 1/1] net: dsa: hellcreek: Add TAPRIO offloading support
From: Kurt Kanzenbach @ 2020-11-21 11:57 UTC (permalink / raw)
To: Andrew Lunn, Vivien Didelot, Florian Fainelli, Vladimir Oltean
Cc: David S. Miller, Jakub Kicinski, Vinicius Costa Gomes, netdev,
Kurt Kanzenbach
In-Reply-To: <20201121115703.23221-1-kurt@linutronix.de>
The switch has support for the 802.1Qbv Time Aware Shaper (TAS). Traffic
schedules may be configured individually on each front port. Each port has eight
egress queues. The traffic is mapped to a traffic class respectively via the PCP
field of a VLAN tagged frame.
The TAPRIO Qdisc already implements that. Therefore, this interface can simply
be reused. Add .port_setup_tc() accordingly.
The activation of a schedule on a port is split into two parts:
* Programming the necessary gate control list (GCL)
* Setup delayed work for starting the schedule
The hardware supports starting a schedule up to eight seconds in the future. The
TAPRIO interface provides an absolute base time. Therefore, periodic delayed
work is leveraged to check whether a schedule may be started or not.
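The "startable" decision that the periodic work has to make can be sketched as a pure function. This is a minimal userspace illustration with hypothetical sketch_* names; it assumes both times are given in nanoseconds on the same (PTP) time base.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC        1000000000LL
#define SKETCH_MAX_START_AHEAD_SEC 8

/*
 * A schedule may be handed to the hardware once its base time is less
 * than eight seconds ahead of the current time. Until then, the caller
 * is expected to check again later.
 */
static bool sketch_schedule_startable(int64_t base_time_ns, int64_t now_ns)
{
	return base_time_ns - now_ns <
	       SKETCH_MAX_START_AHEAD_SEC * SKETCH_NSEC_PER_SEC;
}
```

Note that in this sketch a base time that already lies in the past also counts as startable, since the difference is then negative.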
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
---
drivers/net/dsa/hirschmann/hellcreek.c | 314 +++++++++++++++++++++++++
drivers/net/dsa/hirschmann/hellcreek.h | 22 ++
2 files changed, 336 insertions(+)
diff --git a/drivers/net/dsa/hirschmann/hellcreek.c b/drivers/net/dsa/hirschmann/hellcreek.c
index 6420b76ea37c..35514af1922a 100644
--- a/drivers/net/dsa/hirschmann/hellcreek.c
+++ b/drivers/net/dsa/hirschmann/hellcreek.c
@@ -23,6 +23,7 @@
#include <linux/mutex.h>
#include <linux/delay.h>
#include <net/dsa.h>
+#include <net/pkt_sched.h>
#include "hellcreek.h"
#include "hellcreek_ptp.h"
@@ -153,6 +154,13 @@ static void hellcreek_select_vlan(struct hellcreek *hellcreek, int vid,
hellcreek_write(hellcreek, val, HR_VIDCFG);
}
+static void hellcreek_select_tgd(struct hellcreek *hellcreek, int port)
+{
+ u16 val = port << TR_TGDSEL_TDGSEL_SHIFT;
+
+ hellcreek_write(hellcreek, val, TR_TGDSEL);
+}
+
static int hellcreek_wait_until_ready(struct hellcreek *hellcreek)
{
u16 val;
@@ -1135,6 +1143,308 @@ hellcreek_port_prechangeupper(struct dsa_switch *ds, int port,
return ret;
}
+static void hellcreek_setup_gcl(struct hellcreek *hellcreek, int port,
+ const struct hellcreek_schedule *schedule)
+{
+ size_t i;
+
+ for (i = 1; i <= schedule->num_entries; ++i) {
+ const struct hellcreek_gcl_entry *cur, *initial, *next;
+ u16 data;
+ u8 gates;
+
+ cur = &schedule->entries[i - 1];
+ initial = &schedule->entries[0];
+ next = &schedule->entries[i];
+
+ if (i == schedule->num_entries)
+ gates = initial->gate_states ^
+ cur->gate_states;
+ else
+ gates = next->gate_states ^
+ cur->gate_states;
+
+ data = gates;
+ if (cur->overrun_ignore)
+ data |= TR_GCLDAT_GCLOVRI;
+
+ if (i == schedule->num_entries)
+ data |= TR_GCLDAT_GCLWRLAST;
+
+ /* Gates states */
+ hellcreek_write(hellcreek, data, TR_GCLDAT);
+
+ /* Time interval */
+ hellcreek_write(hellcreek,
+ cur->interval & 0x0000ffff,
+ TR_GCLTIL);
+ hellcreek_write(hellcreek,
+ (cur->interval & 0xffff0000) >> 16,
+ TR_GCLTIH);
+
+ /* Commit entry */
+ data = ((i - 1) << TR_GCLCMD_GCLWRADR_SHIFT) |
+ (initial->gate_states <<
+ TR_GCLCMD_INIT_GATE_STATES_SHIFT);
+ hellcreek_write(hellcreek, data, TR_GCLCMD);
+ }
+}
+
+static void hellcreek_set_cycle_time(struct hellcreek *hellcreek,
+ const struct hellcreek_schedule *schedule)
+{
+ u32 cycle_time = schedule->cycle_time;
+
+ hellcreek_write(hellcreek, cycle_time & 0x0000ffff, TR_CTWRL);
+ hellcreek_write(hellcreek, (cycle_time & 0xffff0000) >> 16, TR_CTWRH);
+}
+
+static void hellcreek_switch_schedule(struct hellcreek *hellcreek,
+ ktime_t start_time)
+{
+ struct timespec64 ts = ktime_to_timespec64(start_time);
+
+ /* Start can be up to eight seconds in the future */
+ ts.tv_sec %= 8;
+
+ /* Start schedule at this point of time */
+ hellcreek_write(hellcreek, ts.tv_nsec & 0x0000ffff, TR_ESTWRL);
+ hellcreek_write(hellcreek, (ts.tv_nsec & 0xffff0000) >> 16, TR_ESTWRH);
+
+ /* Arm timer, set seconds and switch schedule */
+ hellcreek_write(hellcreek, TR_ESTCMD_ESTARM | TR_ESTCMD_ESTSWCFG |
+ ((ts.tv_sec & TR_ESTCMD_ESTSEC_MASK) <<
+ TR_ESTCMD_ESTSEC_SHIFT), TR_ESTCMD);
+}
+
+static struct hellcreek_schedule *
+hellcreek_taprio_to_schedule(struct tc_taprio_qopt_offload *taprio)
+{
+ struct hellcreek_schedule *schedule;
+ size_t i;
+
+ /* Allocate some memory first */
+ schedule = kzalloc(sizeof(*schedule), GFP_KERNEL);
+ if (!schedule)
+ return ERR_PTR(-ENOMEM);
+ schedule->entries = kcalloc(taprio->num_entries,
+ sizeof(*schedule->entries),
+ GFP_KERNEL);
+ if (!schedule->entries) {
+ kfree(schedule);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /* Construct hellcreek schedule */
+ schedule->num_entries = taprio->num_entries;
+ schedule->base_time = taprio->base_time;
+
+ for (i = 0; i < taprio->num_entries; ++i) {
+ const struct tc_taprio_sched_entry *t = &taprio->entries[i];
+ struct hellcreek_gcl_entry *k = &schedule->entries[i];
+
+ k->interval = t->interval;
+ k->gate_states = t->gate_mask;
+ k->overrun_ignore = 0;
+
+ /* Update complete cycle time */
+ schedule->cycle_time += t->interval;
+ }
+
+ return schedule;
+}
+
+static void hellcreek_free_schedule(struct hellcreek *hellcreek, int port)
+{
+ struct hellcreek_port *hellcreek_port = &hellcreek->ports[port];
+
+ kfree(hellcreek_port->current_schedule->entries);
+ kfree(hellcreek_port->current_schedule);
+ hellcreek_port->current_schedule = NULL;
+}
+
+static bool hellcreek_schedule_startable(struct hellcreek *hellcreek, int port)
+{
+ struct hellcreek_port *hellcreek_port = &hellcreek->ports[port];
+ s64 base_time_ns, current_ns;
+
+ /* The switch only allows a schedule to be started at most eight seconds
+ * in the future. Therefore, check the current PTP time to decide whether
+ * the schedule is startable.
+ */
+
+ /* Use the "cached" time. That should be alright, as it's updated quite
+ * frequently in the PTP code.
+ */
+ mutex_lock(&hellcreek->ptp_lock);
+ current_ns = hellcreek->seconds * NSEC_PER_SEC + hellcreek->last_ts;
+ mutex_unlock(&hellcreek->ptp_lock);
+
+ /* Calculate difference to admin base time */
+ base_time_ns = ktime_to_ns(hellcreek_port->current_schedule->base_time);
+
+ if (base_time_ns - current_ns < (s64)8 * NSEC_PER_SEC)
+ return true;
+
+ return false;
+}
+
+static void hellcreek_start_schedule(struct hellcreek *hellcreek, int port)
+{
+ struct hellcreek_port *hellcreek_port = &hellcreek->ports[port];
+
+ /* First select port */
+ hellcreek_select_tgd(hellcreek, port);
+
+ /* Set admin base time and switch schedule */
+ hellcreek_switch_schedule(hellcreek,
+ hellcreek_port->current_schedule->base_time);
+
+ hellcreek_free_schedule(hellcreek, port);
+
+ dev_dbg(hellcreek->dev, "ARMed EST timer for port %d\n",
+ hellcreek_port->port);
+}
+
+static void hellcreek_check_schedule(struct work_struct *work)
+{
+ struct delayed_work *dw = to_delayed_work(work);
+ struct hellcreek_port *hellcreek_port;
+ struct hellcreek *hellcreek;
+ bool startable;
+
+ hellcreek_port = dw_to_hellcreek_port(dw);
+ hellcreek = hellcreek_port->hellcreek;
+
+ mutex_lock(&hellcreek->reg_lock);
+
+ /* Check starting time */
+ startable = hellcreek_schedule_startable(hellcreek,
+ hellcreek_port->port);
+ if (startable) {
+ hellcreek_start_schedule(hellcreek, hellcreek_port->port);
+ mutex_unlock(&hellcreek->reg_lock);
+ return;
+ }
+
+ mutex_unlock(&hellcreek->reg_lock);
+
+ /* Reschedule */
+ schedule_delayed_work(&hellcreek_port->schedule_work,
+ HELLCREEK_SCHEDULE_PERIOD);
+}
+
+static int hellcreek_port_set_schedule(struct dsa_switch *ds, int port,
+ struct tc_taprio_qopt_offload *taprio)
+{
+ struct hellcreek *hellcreek = ds->priv;
+ struct hellcreek_port *hellcreek_port;
+ struct hellcreek_schedule *schedule;
+ bool startable;
+ u16 ctrl;
+
+ hellcreek_port = &hellcreek->ports[port];
+
+ /* Convert taprio data to hellcreek schedule */
+ schedule = hellcreek_taprio_to_schedule(taprio);
+ if (IS_ERR(schedule))
+ return PTR_ERR(schedule);
+
+ dev_dbg(hellcreek->dev, "Configure traffic schedule on port %d\n",
+ port);
+
+ /* First cancel delayed work */
+ cancel_delayed_work_sync(&hellcreek_port->schedule_work);
+
+ mutex_lock(&hellcreek->reg_lock);
+
+ if (hellcreek_port->current_schedule)
+ hellcreek_free_schedule(hellcreek, port);
+ hellcreek_port->current_schedule = schedule;
+
+ /* Then select port */
+ hellcreek_select_tgd(hellcreek, port);
+
+ /* Enable gating and set the admin state to forward everything in the
+ * mean time
+ */
+ ctrl = (0xff << TR_TGDCTRL_ADMINGATESTATES_SHIFT) | TR_TGDCTRL_GATE_EN;
+ hellcreek_write(hellcreek, ctrl, TR_TGDCTRL);
+
+ /* Cancel pending schedule */
+ hellcreek_write(hellcreek, 0x00, TR_ESTCMD);
+
+ /* Setup a new schedule */
+ hellcreek_setup_gcl(hellcreek, port, schedule);
+
+ /* Configure cycle time */
+ hellcreek_set_cycle_time(hellcreek, schedule);
+
+ /* Check starting time */
+ startable = hellcreek_schedule_startable(hellcreek, port);
+ if (startable) {
+ hellcreek_start_schedule(hellcreek, port);
+ mutex_unlock(&hellcreek->reg_lock);
+ return 0;
+ }
+
+ mutex_unlock(&hellcreek->reg_lock);
+
+ /* Schedule periodic schedule check */
+ schedule_delayed_work(&hellcreek_port->schedule_work,
+ HELLCREEK_SCHEDULE_PERIOD);
+
+ return 0;
+}
+
+static int hellcreek_port_del_schedule(struct dsa_switch *ds, int port)
+{
+ struct hellcreek *hellcreek = ds->priv;
+ struct hellcreek_port *hellcreek_port;
+
+ hellcreek_port = &hellcreek->ports[port];
+
+ dev_dbg(hellcreek->dev, "Remove traffic schedule on port %d\n", port);
+
+ /* First cancel delayed work */
+ cancel_delayed_work_sync(&hellcreek_port->schedule_work);
+
+ mutex_lock(&hellcreek->reg_lock);
+
+ if (hellcreek_port->current_schedule)
+ hellcreek_free_schedule(hellcreek, port);
+
+ /* Then select port */
+ hellcreek_select_tgd(hellcreek, port);
+
+ /* Disable gating and return to regular switching flow */
+ hellcreek_write(hellcreek, 0xff << TR_TGDCTRL_ADMINGATESTATES_SHIFT,
+ TR_TGDCTRL);
+
+ mutex_unlock(&hellcreek->reg_lock);
+
+ return 0;
+}
+
+static int hellcreek_port_setup_tc(struct dsa_switch *ds, int port,
+ enum tc_setup_type type, void *type_data)
+{
+ struct tc_taprio_qopt_offload *taprio = type_data;
+ struct hellcreek *hellcreek = ds->priv;
+
+ if (type != TC_SETUP_QDISC_TAPRIO)
+ return -EOPNOTSUPP;
+
+ /* Does this hellcreek version support Qbv in hardware? */
+ if (!hellcreek->pdata->qbv_support)
+ return -EOPNOTSUPP;
+
+ if (taprio->enable)
+ return hellcreek_port_set_schedule(ds, port, taprio);
+
+ return hellcreek_port_del_schedule(ds, port);
+}
+
static const struct dsa_switch_ops hellcreek_ds_ops = {
.get_ethtool_stats = hellcreek_get_ethtool_stats,
.get_sset_count = hellcreek_get_sset_count,
@@ -1153,6 +1463,7 @@ static const struct dsa_switch_ops hellcreek_ds_ops = {
.port_hwtstamp_get = hellcreek_port_hwtstamp_get,
.port_prechangeupper = hellcreek_port_prechangeupper,
.port_rxtstamp = hellcreek_port_rxtstamp,
+ .port_setup_tc = hellcreek_port_setup_tc,
.port_stp_state_set = hellcreek_port_stp_state_set,
.port_txtstamp = hellcreek_port_txtstamp,
.port_vlan_add = hellcreek_vlan_add,
@@ -1208,6 +1519,9 @@ static int hellcreek_probe(struct platform_device *pdev)
port->hellcreek = hellcreek;
port->port = i;
+
+ INIT_DELAYED_WORK(&port->schedule_work,
+ hellcreek_check_schedule);
}
mutex_init(&hellcreek->reg_lock);
diff --git a/drivers/net/dsa/hirschmann/hellcreek.h b/drivers/net/dsa/hirschmann/hellcreek.h
index e81781ebc31c..7ffb1b33ff72 100644
--- a/drivers/net/dsa/hirschmann/hellcreek.h
+++ b/drivers/net/dsa/hirschmann/hellcreek.h
@@ -213,6 +213,20 @@ struct hellcreek_counter {
const char *name;
};
+struct hellcreek_gcl_entry {
+ u32 interval;
+ u8 gate_states;
+ bool overrun_ignore;
+};
+
+struct hellcreek_schedule {
+ struct hellcreek_gcl_entry *entries;
+ size_t num_entries;
+ ktime_t base_time;
+ u32 cycle_time;
+ int port;
+};
+
struct hellcreek;
/* State flags for hellcreek_port_hwtstamp::state */
@@ -246,6 +260,10 @@ struct hellcreek_port {
/* Per-port timestamping resources */
struct hellcreek_port_hwtstamp port_hwtstamp;
+
+ /* Per-port Qbv schedule information */
+ struct hellcreek_schedule *current_schedule;
+ struct delayed_work schedule_work;
};
struct hellcreek_fdb_entry {
@@ -283,4 +301,8 @@ struct hellcreek {
size_t fdb_entries;
};
+#define HELLCREEK_SCHEDULE_PERIOD (2 * HZ)
+#define dw_to_hellcreek_port(dw) \
+ container_of(dw, struct hellcreek_port, schedule_work)
+
#endif /* _HELLCREEK_H_ */
--
2.20.1
^ permalink raw reply related
* [PATCH net-next 0/1] net: dsa: hellcreek: Add TAPRIO offloading
From: Kurt Kanzenbach @ 2020-11-21 11:57 UTC (permalink / raw)
To: Andrew Lunn, Vivien Didelot, Florian Fainelli, Vladimir Oltean
Cc: David S. Miller, Jakub Kicinski, Vinicius Costa Gomes, netdev,
Kurt Kanzenbach
Hi,
The switch has support for the 802.1Qbv Time Aware Shaper (TAS). Traffic
schedules may be configured individually on each front port. Each port has eight
egress queues. The traffic is mapped to a traffic class respectively via the PCP
field of a VLAN tagged frame.
This is a respin of the original patch with the discovered issues fixed:
* Drop TC <-> PCP mapping
=> This is handled in the TAPRIO core now.
* Don't depend on the system's time synchronized to the PTP clock
=> Drop hrtimers and use periodic delayed work instead. Also drop the
spinlocks as delayed work is executed in user context and mutexes can
be used which makes everything much simpler.
Thanks,
Kurt
Kurt Kanzenbach (1):
net: dsa: hellcreek: Add TAPRIO offloading support
drivers/net/dsa/hirschmann/hellcreek.c | 314 +++++++++++++++++++++++++
drivers/net/dsa/hirschmann/hellcreek.h | 22 ++
2 files changed, 336 insertions(+)
--
2.20.1
^ permalink raw reply
* [PATCH net-next 2/2] net: dsa: hellcreek: Don't print error message on defer
From: Kurt Kanzenbach @ 2020-11-21 11:44 UTC (permalink / raw)
To: Andrew Lunn, Vivien Didelot, Florian Fainelli, Vladimir Oltean
Cc: David S. Miller, Jakub Kicinski, netdev, Kurt Kanzenbach
In-Reply-To: <20201121114455.22422-1-kurt@linutronix.de>
If DSA is not yet loaded when the driver is probed, an error message is
printed. However, that is not really an error, just a probe deferral. Use
dev_err_probe() instead.
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
---
drivers/net/dsa/hirschmann/hellcreek.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/dsa/hirschmann/hellcreek.c b/drivers/net/dsa/hirschmann/hellcreek.c
index d42f40c76ba5..6420b76ea37c 100644
--- a/drivers/net/dsa/hirschmann/hellcreek.c
+++ b/drivers/net/dsa/hirschmann/hellcreek.c
@@ -1266,7 +1266,7 @@ static int hellcreek_probe(struct platform_device *pdev)
ret = dsa_register_switch(hellcreek->ds);
if (ret) {
- dev_err(dev, "Unable to register switch\n");
+ dev_err_probe(dev, ret, "Unable to register switch\n");
return ret;
}
--
2.20.1
^ permalink raw reply related
* [PATCH net-next 0/2] net: dsa: hellcreek: Minor cleanups
From: Kurt Kanzenbach @ 2020-11-21 11:44 UTC (permalink / raw)
To: Andrew Lunn, Vivien Didelot, Florian Fainelli, Vladimir Oltean
Cc: David S. Miller, Jakub Kicinski, netdev, Kurt Kanzenbach
Hi,
fix two minor issues in the hellcreek driver.
Thanks,
Kurt
Kurt Kanzenbach (2):
net: dsa: tag_hellcreek: Cleanup includes
net: dsa: hellcreek: Don't print error message on defer
drivers/net/dsa/hirschmann/hellcreek.c | 2 +-
net/dsa/tag_hellcreek.c | 4 +---
2 files changed, 2 insertions(+), 4 deletions(-)
--
2.20.1
^ permalink raw reply
* [PATCH net-next 1/2] net: dsa: tag_hellcreek: Cleanup includes
From: Kurt Kanzenbach @ 2020-11-21 11:44 UTC (permalink / raw)
To: Andrew Lunn, Vivien Didelot, Florian Fainelli, Vladimir Oltean
Cc: David S. Miller, Jakub Kicinski, netdev, Kurt Kanzenbach
In-Reply-To: <20201121114455.22422-1-kurt@linutronix.de>
Remove unused and add needed includes. No functional change.
Suggested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
---
net/dsa/tag_hellcreek.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/net/dsa/tag_hellcreek.c b/net/dsa/tag_hellcreek.c
index 2061de06eafb..a09805c8e1ab 100644
--- a/net/dsa/tag_hellcreek.c
+++ b/net/dsa/tag_hellcreek.c
@@ -8,9 +8,7 @@
* Based on tag_ksz.c.
*/
-#include <linux/etherdevice.h>
-#include <linux/list.h>
-#include <linux/slab.h>
+#include <linux/skbuff.h>
#include <net/dsa.h>
#include "dsa_priv.h"
--
2.20.1
^ permalink raw reply related
* [PATCH] e1000e: remove the redundant value assignment in e1000_update_nvm_checksum_spt
From: xiakaixu1987 @ 2020-11-21 10:17 UTC (permalink / raw)
To: jesse.brandeburg, anthony.l.nguyen, kuba, davem
Cc: intel-wired-lan, netdev, linux-kernel, Kaixu Xia
From: Kaixu Xia <kaixuxia@tencent.com>
Both statements assign a value to the variable act_offset.
The first assignment is overwritten by the second and is therefore useless.
Remove it.
Reported-by: Tosk Robot <tencent_os_robot@tencent.com>
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
---
drivers/net/ethernet/intel/e1000e/ich8lan.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index 9aa6fad8ed47..f05070ed18c9 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@ -3875,13 +3875,6 @@ static s32 e1000_update_nvm_checksum_spt(struct e1000_hw *hw)
if (ret_val)
goto release;
- /* And invalidate the previously valid segment by setting
- * its signature word (0x13) high_byte to 0b. This can be
- * done without an erase because flash erase sets all bits
- * to 1's. We can write 1's to 0's without an erase
- */
- act_offset = (old_bank_offset + E1000_ICH_NVM_SIG_WORD) * 2 + 1;
-
/* offset in words but we read dword */
act_offset = old_bank_offset + E1000_ICH_NVM_SIG_WORD - 1;
ret_val = e1000_read_flash_dword_ich8lan(hw, act_offset, &dword);
--
2.20.0
^ permalink raw reply related
* Re: [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT.
From: Kuniyuki Iwashima @ 2020-11-21 10:16 UTC (permalink / raw)
To: kafai
Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840, kuniyu,
linux-kernel, netdev
In-Reply-To: <20201120023157.immjndtw4hgcyz75@kafai-mbp.dhcp.thefacebook.com>
From: Martin KaFai Lau <kafai@fb.com>
Date: Thu, 19 Nov 2020 18:31:57 -0800
> On Fri, Nov 20, 2020 at 07:17:49AM +0900, Kuniyuki Iwashima wrote:
> > From: Martin KaFai Lau <kafai@fb.com>
> > Date: Wed, 18 Nov 2020 17:49:13 -0800
> > > On Tue, Nov 17, 2020 at 06:40:15PM +0900, Kuniyuki Iwashima wrote:
> > > > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > > > accept connections evenly. However, there is a defect in the current
> > > > implementation. When a SYN packet is received, the connection is tied to a
> > > > listening socket. Accordingly, when the listener is closed, in-flight
> > > > requests during the three-way handshake and child sockets in the accept
> > > > queue are dropped even if other listeners could accept such connections.
> > > >
> > > > This situation can happen when various server management tools restart
> > > > server (such as nginx) processes. For instance, when we change nginx
> > > > configurations and restart it, it spins up new workers that respect the new
> > > > configuration and closes all listeners on the old workers, resulting in
> > > > in-flight ACK of 3WHS is responded by RST.
> > > >
> > > > As a workaround for this issue, we can do connection draining by eBPF:
> > > >
> > > > 1. Before closing a listener, stop routing SYN packets to it.
> > > > 2. Wait enough time for requests to complete 3WHS.
> > > > 3. Accept connections until EAGAIN, then close the listener.
> > > >
> > > > Although this approach seems to work well, EAGAIN has nothing to do with
> > > > how many requests are still during 3WHS. Thus, we have to know the number
> > > It sounds like the application can already drain the established socket
> > > by accept()? To solve the problem that you have,
> > > does it mean migrating req_sk (the in-progress 3WHS) is enough?
> >
> > Ideally, the application needs to drain only the accepted sockets because
> > 3WHS and tying a connection to a listener are just kernel behaviour. Also,
> > there are some cases where we want to apply new configurations as soon as
> > possible such as replacing TLS certificates.
> >
> > It is possible to drain the established sockets by accept(), but the
> > sockets in the accept queue have not started application sessions yet. So,
> > if we do not drain such sockets (or if the kernel happened to select
> > another listener), we can (could) apply the new settings much earlier.
> >
> > Moreover, the established sockets may start long-standing connections so
> > that we cannot complete draining for a long time and may have to
> > force-close them (and they would have longer lifetime if they are migrated
> > to a new listener).
> >
> >
> > > Applications can already use the bpf prog to do (1) and divert
> > > the SYN to the newly started process.
> > >
> > > If the application cares about service disruption,
> > > it usually needs to drain the fd(s) that it already has and
> > > finishes serving the pending request (e.g. https) on them anyway.
> > > The time taking to finish those could already be longer than it takes
> > > to drain the accept queue or finish off the 3WHS in reasonable time.
> > > or the application that you have does not need to drain the fd(s)
> > > it already has and it can close them immediately?
> >
> > In the point of view of service disruption, I agree with you.
> >
> > However, I think that there are some situations where we want to apply new
> > configurations rather than to drain sockets with old configurations and
> > that if the kernel migrates sockets automatically, we can simplify user
> > programs.
> This configuration-update(/new-TLS-cert...etc) consideration will be useful
> if it is also included in the cover letter.
I will add this to the next cover letter.
> It sounds like the service that you have is draining the existing
> already-accepted fd(s) which are using the old configuration.
> Those existing fd(s) could also be long life. Potentially those
> existing fd(s) will be in a much higher number than the
> to-be-accepted fd(s)?
In many cases, yes.
> or you meant in some cases it wants to migrate to the new configuration
> ASAP (e.g. for security reason) even it has to close all the
> already-accepted fds() which are using the old configuration??
And sometimes, yes.
As you expected, for reasons including security, there are cases where we
have to prioritize closing connections over completing them.
For example, HTTP/1.1 is often short-lived, and we can complete draining
immediately. However, sometimes it can be long-lived by upgrading to
WebSocket. Then we may not be able to wait for draining to finish.
> In either cases, considering the already-accepted fd(s)
> is usually in a much more number, does the to-be-accepted
> connection make any difference percentage-wise?
It is difficult to drain all connections in every case, but we can reduce
the number of such aborted connections through migration. In that sense, I
think migration is always better than draining.
^ permalink raw reply
* Re: [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
From: Kuniyuki Iwashima @ 2020-11-21 10:13 UTC (permalink / raw)
To: kafai
Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840, kuniyu,
linux-kernel, netdev
In-Reply-To: <20201120015346.sokqxwx4uavmoctz@kafai-mbp.dhcp.thefacebook.com>
From: Martin KaFai Lau <kafai@fb.com>
Date: Thu, 19 Nov 2020 17:53:46 -0800
> On Fri, Nov 20, 2020 at 07:09:22AM +0900, Kuniyuki Iwashima wrote:
> > From: Martin KaFai Lau <kafai@fb.com>
> > Date: Wed, 18 Nov 2020 15:50:17 -0800
> > > On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> > > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > > which is used only by inet_unhash(). If it is not NULL,
> > > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > > sockets from the closing listener to the selected one.
> > > >
> > > > Listening sockets hold incoming connections as a linked list of struct
> > > > request_sock in the accept queue, and each request has reference to a full
> > > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the
> > > > requests from the closing listener's queue and relink them to the head of
> > > > the new listener's queue. We do not process each request, so the migration
> > > > completes in O(1) time complexity. However, in the case of TCP_SYN_RECV
> > > > sockets, we will take special care in the next commit.
> > > >
> > > > By default, we select the last element of socks[] as the new listener.
> > > > This behaviour is based on how the kernel moves sockets in socks[].
> > > >
> > > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > > first two by turns. The sockets move in socks[] like below. (See also [1])
> > > >
> > > > socks[0] : A <-. socks[0] : D socks[0] : D
> > > > socks[1] : B | => socks[1] : B <-. => socks[1] : C
> > > > socks[2] : C | socks[2] : C --'
> > > > socks[3] : D --'
> > > >
> > > > Then, if C and D have newer settings than A and B, and each socket has a
> > > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > > requests evenly to new listeners.
> > > I don't think it should emphasize/claim there is a specific way that
> > > the kernel-pick here can redistribute the requests evenly. It depends on
> > > how the application close/listen. The userspace can not expect the
> > > ordering of socks[] will behave in a certain way.
> >
> > I've expected replacing listeners by generations as a general use case.
> > But exactly. Users should not expect the undocumented kernel internal.
> >
> >
> > > The primary redistribution policy has to depend on BPF which is the
> > > policy defined by the user based on its application logic (e.g. how
> > > its binary restart work). The application (and bpf) knows which one
> > > is a dying process and can avoid distributing to it.
> > >
> > > The kernel-pick could be an optional fallback but not a must. If the bpf
> > > prog is attached, I would even go further to call bpf to redistribute
> > > regardless of the sysctl, so I think the sysctl is not necessary.
> >
> > I also think it is just an optional fallback, but to pick out a different
> > listener everytime, choosing the moved socket was reasonable. So the even
> > redistribution for a specific use case is a side effect of such socket
> > selection.
> >
> > But, users should decide to use either way:
> > (1) let the kernel select a new listener randomly
> > (2) select a particular listener by eBPF
> >
> > I will update the commit message like:
> > The kernel selects a new listener randomly, but as the side effect, it can
> > redistribute packets evenly for a specific case where an application
> > replaces listeners by generations.
> Since there is no feedback on sysctl, so may be something missed
> in the lines.
I'm sorry, I have missed this point while thinking about each reply...
> I don't think this migration logic should depend on a sysctl.
> At least not when a bpf prog is attached that is capable of doing
> migration, it is too fragile to ask user to remember to turn on
> the sysctl before attaching the bpf prog.
>
> Your use case is to primarily based on bpf prog to pick or only based
> on kernel to do a random pick?
I think we have to care about both cases.
I think we can always enable the migration feature if no eBPF prog is
attached. On the other hand, if a BPF_PROG_TYPE_SK_REUSEPORT prog is
attached to select a listener by some rules, then after a kernel update,
redistributing requests without the user's intention can break the
application. So, we need some way to confirm the user's intention, at
least when an eBPF prog is attached.
But honestly, I believe such eBPF users can follow this change and
implement migration eBPF prog if we introduce such a breaking change.
> Also, IIUC, this sysctl setting sticks at "*reuse", there is no way to
> change it until all the listening sockets are closed which is exactly
> the service disruption problem this series is trying to solve here.
Oh, exactly...
If we apply this series by live patching, we cannot enable the feature
without service disruption.
To enable the migration feature dynamically, how about this logic?
In this logic, we do not save the sysctl value; we check it each time.
1. no eBPF prog attached -> ON
2. eBPF prog attached and sysctl is 0 -> OFF
3. eBPF prog attached and sysctl is 1 -> ON
So, we would replace
if (reuse->migrate_req)
with
if (!reuse->prog || net->ipv4.sysctl_tcp_migrate_req)
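As a quick check that the proposed condition matches the three cases listed above, here is a small userspace sketch. The struct and function names (reuseport_stub, bpf_prog_stub, migration_enabled) are hypothetical stand-ins, not kernel definitions; only the boolean expression is taken from this mail.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an attached eBPF program. */
struct bpf_prog_stub {
	int unused;
};

/* Hypothetical stand-in for the reuseport group; only ->prog is modelled. */
struct reuseport_stub {
	const struct bpf_prog_stub *prog;	/* NULL: no eBPF prog attached */
};

/*
 * Proposed predicate: migrate when no prog is attached, or when a prog is
 * attached and the sysctl (read at call time, not cached) is non-zero.
 */
static bool migration_enabled(const struct reuseport_stub *reuse,
			      int sysctl_tcp_migrate_req)
{
	return !reuse->prog || sysctl_tcp_migrate_req;
}
```

With no prog attached the result is ON whatever the sysctl says (case 1); with a prog attached it follows the sysctl (cases 2 and 3). Because the sysctl is read on every call, toggling it takes effect without closing the existing listeners.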
^ permalink raw reply
* Re: [RFC PATCH] bpf: preload: Fix build error when O= is set
From: David Gow @ 2020-11-21 9:48 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Alexei Starovoitov, Daniel Borkmann,
Toke Høiland-Jørgensen, Networking, bpf, open list,
Brendan Higgins, Masami Hiramatsu, linux-um
In-Reply-To: <CAEf4BzY4i0fH34eO=-4WOzVpifgPmJ0ER5ipBJWB0_4Zdv0AQg@mail.gmail.com>
On Sat, Nov 21, 2020 at 3:38 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Nov 19, 2020 at 12:51 AM David Gow <davidgow@google.com> wrote:
> >
> > If BPF_PRELOAD is enabled, and an out-of-tree build is requested with
> > make O=<path>, compilation seems to fail with:
> >
> > tools/scripts/Makefile.include:4: *** O=.kunit does not exist. Stop.
> > make[4]: *** [../kernel/bpf/preload/Makefile:8: kernel/bpf/preload/libbpf.a] Error 2
> > make[3]: *** [../scripts/Makefile.build:500: kernel/bpf/preload] Error 2
> > make[2]: *** [../scripts/Makefile.build:500: kernel/bpf] Error 2
> > make[2]: *** Waiting for unfinished jobs....
> > make[1]: *** [.../Makefile:1799: kernel] Error 2
> > make[1]: *** Waiting for unfinished jobs....
> > make: *** [Makefile:185: __sub-make] Error 2
> >
> > By the looks of things, this is because the (relative path) O= passed on
> > the command line is being passed to the libbpf Makefile, which then
> > can't find the directory. Given OUTPUT= is being passed anyway, we can
> > work around this by explicitly setting an empty O=, which will be
> > ignored in favour of OUTPUT= in tools/scripts/Makefile.include.
>
> Strange, but I can't repro it. I use make O=<absolute path> all the
> time with no issues. I just tried specifically with a make O=.build,
> where .build is inside Linux repo, and it still worked fine. See also
> be40920fbf10 ("tools: Let O= makes handle a relative path with -C
> option") which was supposed to address such an issue. So I'm wondering
> what exactly is causing this problem.
>
[+ linux-um list]
Hmm... From a quick check, I can't reproduce this on x86, so it's
possibly a UML-specific issue.
The problem here seems to be that $PWD is, for whatever reason, equal
to the srcdir on x86, but not on UML. In general, $PWD behaves pretty
weirdly -- I don't fully understand it -- but if I add a tactical "PWD
:= $(shell pwd)" or use $(CURDIR) instead, the issue shows up on x86
as well. I guess this is because PWD only gets updated when set by a
shell or something, and UML does this somewhere?
Thoughts?
Cheers,
-- David
> >
> > Signed-off-by: David Gow <davidgow@google.com>
> > ---
> >
> > Hi all,
> >
> > I'm not 100% sure this is the correct fix here -- it seems to work for
> > me, and makes some sense, but let me know if there's a better way.
> >
> > One other thing worth noting is that I've been hitting this with
> > make allyesconfig on ARCH=um, but there's a comment in the Kconfig
> > suggesting that, because BPF_PRELOAD depends on !COMPILE_TEST, that
> > maybe it shouldn't be being built at all. I figured that it was worth
> > trying to fix this anyway.
> >
> > Cheers,
> > -- David
> >
> >
> > kernel/bpf/preload/Makefile | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/bpf/preload/Makefile b/kernel/bpf/preload/Makefile
> > index 23ee310b6eb4..39848d296097 100644
> > --- a/kernel/bpf/preload/Makefile
> > +++ b/kernel/bpf/preload/Makefile
> > @@ -5,7 +5,7 @@ LIBBPF_A = $(obj)/libbpf.a
> > LIBBPF_OUT = $(abspath $(obj))
> >
> > $(LIBBPF_A):
> > - $(Q)$(MAKE) -C $(LIBBPF_SRCS) OUTPUT=$(LIBBPF_OUT)/ $(LIBBPF_OUT)/libbpf.a
> > + $(Q)$(MAKE) -C $(LIBBPF_SRCS) O= OUTPUT=$(LIBBPF_OUT)/ $(LIBBPF_OUT)/libbpf.a
> >
> > userccflags += -I $(srctree)/tools/include/ -I $(srctree)/tools/include/uapi \
> > -I $(srctree)/tools/lib/ -Wno-unused-result
> > --
> > 2.29.2.454.gaff20da3a2-goog
> >
^ permalink raw reply
* [PATCH 2/2] net: pch_gbe: Use 'dma_free_coherent()' to undo 'dma_alloc_coherent()'
From: Christophe JAILLET @ 2020-11-21 9:03 UTC (permalink / raw)
To: davem, kuba, kaixuxia, mhabets, mst, luc.vanoostenryck,
jesse.brandeburg
Cc: netdev, linux-kernel, kernel-janitors, Christophe JAILLET
Memory allocations are done with 'dma_alloc_coherent()'. Be consistent
and use 'dma_free_coherent()' to free the corresponding memory.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
---
drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 8e4255efe829..140cee7c459d 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -1816,7 +1816,8 @@ void pch_gbe_free_tx_resources(struct pch_gbe_adapter *adapter,
pch_gbe_clean_tx_ring(adapter, tx_ring);
vfree(tx_ring->buffer_info);
tx_ring->buffer_info = NULL;
- pci_free_consistent(pdev, tx_ring->size, tx_ring->desc, tx_ring->dma);
+ dma_free_coherent(&pdev->dev, tx_ring->size, tx_ring->desc,
+ tx_ring->dma);
tx_ring->desc = NULL;
}
@@ -1833,7 +1834,8 @@ void pch_gbe_free_rx_resources(struct pch_gbe_adapter *adapter,
pch_gbe_clean_rx_ring(adapter, rx_ring);
vfree(rx_ring->buffer_info);
rx_ring->buffer_info = NULL;
- pci_free_consistent(pdev, rx_ring->size, rx_ring->desc, rx_ring->dma);
+ dma_free_coherent(&pdev->dev, rx_ring->size, rx_ring->desc,
+ rx_ring->dma);
rx_ring->desc = NULL;
}
@@ -1954,8 +1956,8 @@ void pch_gbe_down(struct pch_gbe_adapter *adapter)
pch_gbe_clean_tx_ring(adapter, adapter->tx_ring);
pch_gbe_clean_rx_ring(adapter, adapter->rx_ring);
- pci_free_consistent(adapter->pdev, rx_ring->rx_buff_pool_size,
- rx_ring->rx_buff_pool, rx_ring->rx_buff_pool_logic);
+ dma_free_coherent(&adapter->pdev->dev, rx_ring->rx_buff_pool_size,
+ rx_ring->rx_buff_pool, rx_ring->rx_buff_pool_logic);
rx_ring->rx_buff_pool_logic = 0;
rx_ring->rx_buff_pool_size = 0;
rx_ring->rx_buff_pool = NULL;
--
2.27.0
^ permalink raw reply related
* [PATCH 1/2] net: pch_gbe: Use dma_set_mask_and_coherent to simplify code
From: Christophe JAILLET @ 2020-11-21 9:03 UTC (permalink / raw)
To: davem, kuba, kaixuxia, mhabets, mst, luc.vanoostenryck,
jesse.brandeburg
Cc: netdev, linux-kernel, kernel-janitors, Christophe JAILLET
'pci_set_dma_mask()' + 'pci_set_consistent_dma_mask()' can be replaced by
an equivalent 'dma_set_mask_and_coherent()' which is much less verbose.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
---
.../net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c | 14 ++++----------
1 file changed, 4 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index b9e32e4478a8..8e4255efe829 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -2502,17 +2502,11 @@ static int pch_gbe_probe(struct pci_dev *pdev,
if (ret)
return ret;
- if (pci_set_dma_mask(pdev, DMA_BIT_MASK(64))
- || pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64))) {
- ret = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
+ if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64))) {
+ ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
if (ret) {
- ret = pci_set_consistent_dma_mask(pdev,
- DMA_BIT_MASK(32));
- if (ret) {
- dev_err(&pdev->dev, "ERR: No usable DMA "
- "configuration, aborting\n");
- return ret;
- }
+ dev_err(&pdev->dev, "ERR: No usable DMA configuration, aborting\n");
+ return ret;
}
}
--
2.27.0
^ permalink raw reply related
* [PATCH net-next] net: DSCP in IPv4 routing v2
From: Russell Strong @ 2020-11-21 8:24 UTC (permalink / raw)
To: netdev
From 2f27f92d5a6f4dd69ac4af32cdb51ba8d2083606 Mon Sep 17 00:00:00 2001
From: Russell Strong <russell@strong.id.au>
Date: Sat, 21 Nov 2020 18:12:43 +1000
Subject: [PATCH] DSCP in IPv4 routing v2
This patch allows the use of DSCP values in routing.
The following macros define the scope of IPv4 TOS
bits used in routing decisions:
IPTOS_TOS (include/uapi/linux/ip.h)
IPTOS_TOS_MASK (include/uapi/linux/ip.h)
IPTOS_RT_MASK (include/net/route.h)
RT_TOS (include/net/in_route.h)
To expand the number of bits out to the set of DSCP
bits, two macros are added:
(hash)define IP_DSCP_MASK 0xfc
(hash)define IP_DSCP(dscp) ((dscp)&IP_DSCP_MASK)
Uses of the TOS macros are replaced with the DSCP macros
wherever the change does not alter the user space API,
with one exception:
net/ipv4/fib_rules.c has been changed to accept a
wider range of values (DSCP values). Previously
this would have returned an error.
iproute2 already supports setting dscp values through
ip route add dsfield <dscp value> lookup ......
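To illustrate the effect of widening the mask, the sketch below compares the old and new masking on a few DS field values. IPTOS_TOS_MASK (0x1E) is the existing value from include/uapi/linux/ip.h and IP_DSCP_MASK (0xfc) is the value added by this patch; the helper names tos_routing_key() and dscp_routing_key() are hypothetical and exist only for this example.

```c
#include <stdint.h>

/* Existing TOS mask (include/uapi/linux/ip.h). */
#define IPTOS_TOS_MASK	0x1E
#define IPTOS_TOS(tos)	((tos) & IPTOS_TOS_MASK)

/* Macros added by this patch. */
#define IP_DSCP_MASK	0xfc
#define IP_DSCP(dscp)	((dscp) & IP_DSCP_MASK)

/* Hypothetical helper: bits of the DS field kept by the old TOS mask. */
static uint8_t tos_routing_key(uint8_t dsfield)
{
	return IPTOS_TOS(dsfield);
}

/* Hypothetical helper: bits of the DS field kept by the new DSCP mask. */
static uint8_t dscp_routing_key(uint8_t dsfield)
{
	return IP_DSCP(dsfield);
}
```

For example, AF11 (DS field 0x28) and AF41 (DS field 0x88) both reduce to 0x08 under the TOS mask, so they cannot be told apart in a routing decision, while the DSCP mask keeps them distinct as 0x28 and 0x88. The two ECN bits are always cleared by the DSCP mask.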
Signed-off-by: Russell Strong <russell@strong.id.au>
---
.../ethernet/mellanox/mlx5/core/en/tc_tun.c | 2 +-
drivers/net/geneve.c | 4 ++--
drivers/net/ipvlan/ipvlan_core.c | 2 +-
drivers/net/ppp/pptp.c | 2 +-
drivers/net/vrf.c | 2 +-
drivers/net/vxlan.c | 4 ++--
include/net/ip.h | 2 +-
include/net/route.h | 6 ++----
include/uapi/linux/ip.h | 2 ++
net/bridge/br_netfilter_hooks.c | 2 +-
net/core/filter.c | 4 ++--
net/core/lwt_bpf.c | 2 +-
net/ipv4/fib_frontend.c | 2 +-
net/ipv4/fib_rules.c | 2 +-
net/ipv4/icmp.c | 6 +++---
net/ipv4/ip_gre.c | 2 +-
net/ipv4/ip_output.c | 2 +-
net/ipv4/ip_tunnel.c | 6 +++---
net/ipv4/ipmr.c | 6 +++---
net/ipv4/netfilter.c | 2 +-
net/ipv4/netfilter/ipt_rpfilter.c | 2 +-
net/ipv4/netfilter/nf_dup_ipv4.c | 2 +-
net/ipv4/route.c | 20 +++++++++----------
net/ipv6/ip6_output.c | 2 +-
net/ipv6/ip6_tunnel.c | 4 ++--
net/ipv6/sit.c | 4 ++--
net/xfrm/xfrm_policy.c | 2 +-
27 files changed, 49 insertions(+), 49 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
index 90930e54b6f2..b0c766216a2c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
@@ -364,7 +364,7 @@ int mlx5e_tc_tun_create_header_ipv6(struct mlx5e_priv *priv,
ttl = tun_key->ttl;
- fl6.flowlabel = ip6_make_flowinfo(RT_TOS(tun_key->tos), tun_key->label);
+ fl6.flowlabel = ip6_make_flowinfo(IP_DSCP(tun_key->tos), tun_key->label);
fl6.daddr = tun_key->u.ipv6.dst;
fl6.saddr = tun_key->u.ipv6.src;
diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index a3c8ce6deb93..1c20acc649ef 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -797,7 +797,7 @@ static struct rtable *geneve_get_v4_rt(struct sk_buff *skb,
tos = ip_tunnel_get_dsfield(ip_hdr(skb), skb);
use_cache = false;
}
- fl4->flowi4_tos = RT_TOS(tos);
+ fl4->flowi4_tos = IP_DSCP(tos);
dst_cache = (struct dst_cache *)&info->dst_cache;
if (use_cache) {
@@ -851,7 +851,7 @@ static struct dst_entry *geneve_get_v6_dst(struct sk_buff *skb,
use_cache = false;
}
- fl6->flowlabel = ip6_make_flowinfo(RT_TOS(prio),
+ fl6->flowlabel = ip6_make_flowinfo(IP_DSCP(prio),
info->key.label);
dst_cache = (struct dst_cache *)&info->dst_cache;
if (use_cache) {
diff --git a/drivers/net/ipvlan/ipvlan_core.c b/drivers/net/ipvlan/ipvlan_core.c
index 8801d093135c..d50e4163d0e0 100644
--- a/drivers/net/ipvlan/ipvlan_core.c
+++ b/drivers/net/ipvlan/ipvlan_core.c
@@ -421,7 +421,7 @@ static int ipvlan_process_v4_outbound(struct sk_buff *skb)
int err, ret = NET_XMIT_DROP;
struct flowi4 fl4 = {
.flowi4_oif = dev->ifindex,
- .flowi4_tos = RT_TOS(ip4h->tos),
+ .flowi4_tos = IP_DSCP(ip4h->tos),
.flowi4_flags = FLOWI_FLAG_ANYSRC,
.flowi4_mark = skb->mark,
.daddr = ip4h->daddr,
diff --git a/drivers/net/ppp/pptp.c b/drivers/net/ppp/pptp.c
index ee5058445d06..3f29a1690955 100644
--- a/drivers/net/ppp/pptp.c
+++ b/drivers/net/ppp/pptp.c
@@ -155,7 +155,7 @@ static int pptp_xmit(struct ppp_channel *chan, struct sk_buff *skb)
opt->dst_addr.sin_addr.s_addr,
opt->src_addr.sin_addr.s_addr,
0, 0, IPPROTO_GRE,
- RT_TOS(0), sk->sk_bound_dev_if);
+ IP_DSCP(0), sk->sk_bound_dev_if);
if (IS_ERR(rt))
goto tx_error;
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index f2793ffde191..09f4058a2c52 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -534,7 +534,7 @@ static netdev_tx_t vrf_process_v4_outbound(struct sk_buff *skb,
/* needed to match OIF rule */
fl4.flowi4_oif = vrf_dev->ifindex;
fl4.flowi4_iif = LOOPBACK_IFINDEX;
- fl4.flowi4_tos = RT_TOS(ip4h->tos);
+ fl4.flowi4_tos = IP_DSCP(ip4h->tos);
fl4.flowi4_flags = FLOWI_FLAG_ANYSRC | FLOWI_FLAG_SKIP_NH_OIF;
fl4.flowi4_proto = ip4h->protocol;
fl4.daddr = ip4h->daddr;
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 236fcc55a5fd..59c4e7f466ab 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2412,7 +2412,7 @@ static struct rtable *vxlan_get_route(struct vxlan_dev *vxlan, struct net_device
memset(&fl4, 0, sizeof(fl4));
fl4.flowi4_oif = oif;
- fl4.flowi4_tos = RT_TOS(tos);
+ fl4.flowi4_tos = IP_DSCP(tos);
fl4.flowi4_mark = skb->mark;
fl4.flowi4_proto = IPPROTO_UDP;
fl4.daddr = daddr;
@@ -2469,7 +2469,7 @@ static struct dst_entry *vxlan6_get_route(struct vxlan_dev *vxlan,
fl6.flowi6_oif = oif;
fl6.daddr = *daddr;
fl6.saddr = *saddr;
- fl6.flowlabel = ip6_make_flowinfo(RT_TOS(tos), label);
+ fl6.flowlabel = ip6_make_flowinfo(IP_DSCP(tos), label);
fl6.flowi6_mark = skb->mark;
fl6.flowi6_proto = IPPROTO_UDP;
fl6.fl6_dport = dport;
diff --git a/include/net/ip.h b/include/net/ip.h
index e20874059f82..9df0734c7e29 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -241,7 +241,7 @@ static inline struct sk_buff *ip_finish_skb(struct sock *sk, struct flowi4 *fl4)
static inline __u8 get_rttos(struct ipcm_cookie* ipc, struct inet_sock *inet)
{
- return (ipc->tos != -1) ? RT_TOS(ipc->tos) : RT_TOS(inet->tos);
+ return (ipc->tos != -1) ? IP_DSCP(ipc->tos) : IP_DSCP(inet->tos);
}
static inline __u8 get_rtconn_flags(struct ipcm_cookie* ipc, struct sock* sk)
diff --git a/include/net/route.h b/include/net/route.h
index ff021cab657e..123d151ef47c 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -40,8 +40,8 @@
#define RTO_ONLINK 0x01
-#define RT_CONN_FLAGS(sk) (RT_TOS(inet_sk(sk)->tos) | sock_flag(sk, SOCK_LOCALROUTE))
-#define RT_CONN_FLAGS_TOS(sk,tos) (RT_TOS(tos) | sock_flag(sk, SOCK_LOCALROUTE))
+#define RT_CONN_FLAGS(sk) (IP_DSCP(inet_sk(sk)->tos) | sock_flag(sk, SOCK_LOCALROUTE))
+#define RT_CONN_FLAGS_TOS(sk,tos) (IP_DSCP(tos) | sock_flag(sk, SOCK_LOCALROUTE))
struct fib_nh;
struct fib_info;
@@ -255,8 +255,6 @@ static inline void ip_rt_put(struct rtable *rt)
dst_release(&rt->dst);
}
-#define IPTOS_RT_MASK (IPTOS_TOS_MASK & ~3)
-
extern const __u8 ip_tos2prio[16];
static inline char rt_tos2priority(u8 tos)
diff --git a/include/uapi/linux/ip.h b/include/uapi/linux/ip.h
index e42d13b55cf3..2519e779e9ad 100644
--- a/include/uapi/linux/ip.h
+++ b/include/uapi/linux/ip.h
@@ -38,6 +38,8 @@
#define IPTOS_PREC_PRIORITY 0x20
#define IPTOS_PREC_ROUTINE 0x00
+#define IP_DSCP_MASK 0xfc
+#define IP_DSCP(dscp) ((dscp)&IP_DSCP_MASK)
/* IP options */
#define IPOPT_COPY 0x80
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 04c3f9a82650..fea45a94125e 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -379,7 +379,7 @@ static int br_nf_pre_routing_finish(struct net *net, struct sock *sk, struct sk_
goto free_skb;
rt = ip_route_output(net, iph->daddr, 0,
- RT_TOS(iph->tos), 0);
+ IP_DSCP(iph->tos), 0);
if (!IS_ERR(rt)) {
/* - Bridged-and-DNAT'ed traffic doesn't
* require ip_forwarding. */
diff --git a/net/core/filter.c b/net/core/filter.c
index 2ca5eecebacf..83c3011326dd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2345,7 +2345,7 @@ static int __bpf_redirect_neigh_v4(struct sk_buff *skb, struct net_device *dev,
struct flowi4 fl4 = {
.flowi4_flags = FLOWI_FLAG_ANYSRC,
.flowi4_mark = skb->mark,
- .flowi4_tos = RT_TOS(ip4h->tos),
+ .flowi4_tos = IP_DSCP(ip4h->tos),
.flowi4_oif = dev->ifindex,
.flowi4_proto = ip4h->protocol,
.daddr = ip4h->daddr,
@@ -5309,7 +5309,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
fl4.flowi4_iif = params->ifindex;
fl4.flowi4_oif = 0;
}
- fl4.flowi4_tos = params->tos & IPTOS_RT_MASK;
+ fl4.flowi4_tos = params->tos & IP_DSCP_MASK;
fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
fl4.flowi4_flags = 0;
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 7d3438215f32..0757a36030b3 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -206,7 +206,7 @@ static int bpf_lwt_xmit_reroute(struct sk_buff *skb)
fl4.flowi4_oif = oif;
fl4.flowi4_mark = skb->mark;
fl4.flowi4_uid = sock_net_uid(net, sk);
- fl4.flowi4_tos = RT_TOS(iph->tos);
+ fl4.flowi4_tos = IP_DSCP(iph->tos);
fl4.flowi4_flags = FLOWI_FLAG_ANYSRC;
fl4.flowi4_proto = iph->protocol;
fl4.daddr = iph->daddr;
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 86a23e4a6a50..0f07f0f0bc17 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -292,7 +292,7 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
.flowi4_iif = LOOPBACK_IFINDEX,
.flowi4_oif = l3mdev_master_ifindex_rcu(dev),
.daddr = ip_hdr(skb)->saddr,
- .flowi4_tos = RT_TOS(ip_hdr(skb)->tos),
+ .flowi4_tos = IP_DSCP(ip_hdr(skb)->tos),
.flowi4_scope = scope,
.flowi4_mark = vmark ? skb->mark : 0,
};
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index ce54a30c2ef1..1e75bb3b2f25 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -229,7 +229,7 @@ static int fib4_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
int err = -EINVAL;
struct fib4_rule *rule4 = (struct fib4_rule *) rule;
- if (frh->tos & ~IPTOS_TOS_MASK) {
+ if (frh->tos & ~IP_DSCP_MASK) {
NL_SET_ERR_MSG(extack, "Invalid tos");
goto errout;
}
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 005faea415a4..3f6f7c64902f 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -444,7 +444,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
fl4.saddr = saddr;
fl4.flowi4_mark = mark;
fl4.flowi4_uid = sock_net_uid(net, NULL);
- fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
+ fl4.flowi4_tos = IP_DSCP(ip_hdr(skb)->tos);
fl4.flowi4_proto = IPPROTO_ICMP;
fl4.flowi4_oif = l3mdev_master_ifindex(skb->dev);
security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
@@ -496,7 +496,7 @@ static struct rtable *icmp_route_lookup(struct net *net,
fl4->saddr = saddr;
fl4->flowi4_mark = mark;
fl4->flowi4_uid = sock_net_uid(net, NULL);
- fl4->flowi4_tos = RT_TOS(tos);
+ fl4->flowi4_tos = IP_DSCP(tos);
fl4->flowi4_proto = IPPROTO_ICMP;
fl4->fl4_icmp_type = type;
fl4->fl4_icmp_code = code;
@@ -544,7 +544,7 @@ static struct rtable *icmp_route_lookup(struct net *net,
orefdst = skb_in->_skb_refdst; /* save old refdst */
skb_dst_set(skb_in, NULL);
err = ip_route_input(skb_in, fl4_dec.daddr, fl4_dec.saddr,
- RT_TOS(tos), rt2->dst.dev);
+ IP_DSCP(tos), rt2->dst.dev);
dst_release(&rt2->dst);
rt2 = skb_rtable(skb_in);
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index a68bf4c6fe9b..6bf61a994c19 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -882,7 +882,7 @@ static int ipgre_open(struct net_device *dev)
t->parms.iph.daddr,
t->parms.iph.saddr,
t->parms.o_key,
- RT_TOS(t->parms.iph.tos),
+ IP_DSCP(t->parms.iph.tos),
t->parms.link);
if (IS_ERR(rt))
return -EADDRNOTAVAIL;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 879b76ae4435..6a459283ef82 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1694,7 +1694,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
flowi4_init_output(&fl4, oif,
IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark,
- RT_TOS(arg->tos),
+ IP_DSCP(arg->tos),
RT_SCOPE_UNIVERSE, ip_hdr(skb)->protocol,
ip_reply_arg_flowi_flags(arg),
daddr, saddr,
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index ee65c9225178..2ca0a3f6c29c 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -294,7 +294,7 @@ static int ip_tunnel_bind_dev(struct net_device *dev)
ip_tunnel_init_flow(&fl4, iph->protocol, iph->daddr,
iph->saddr, tunnel->parms.o_key,
- RT_TOS(iph->tos), tunnel->parms.link,
+ IP_DSCP(iph->tos), tunnel->parms.link,
tunnel->fwmark, 0);
rt = ip_route_output_key(tunnel->net, &fl4);
@@ -565,7 +565,7 @@ void ip_md_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
tos = ipv6_get_dsfield((const struct ipv6hdr *)inner_iph);
}
ip_tunnel_init_flow(&fl4, proto, key->u.ipv4.dst, key->u.ipv4.src,
- tunnel_id_to_key32(key->tun_id), RT_TOS(tos),
+ tunnel_id_to_key32(key->tun_id), IP_DSCP(tos),
0, skb->mark, skb_get_hash(skb));
if (tunnel->encap.type != TUNNEL_ENCAP_NONE)
goto tx_error;
@@ -722,7 +722,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
}
ip_tunnel_init_flow(&fl4, protocol, dst, tnl_params->saddr,
- tunnel->parms.o_key, RT_TOS(tos), tunnel->parms.link,
+ tunnel->parms.o_key, IP_DSCP(tos), tunnel->parms.link,
tunnel->fwmark, skb_get_hash(skb));
if (ip_tunnel_encap(skb, tunnel, &protocol, &fl4) < 0)
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 939792a38814..7806b5c04970 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1840,7 +1840,7 @@ static void ipmr_queue_xmit(struct net *net, struct mr_table *mrt,
vif->remote, vif->local,
0, 0,
IPPROTO_IPIP,
- RT_TOS(iph->tos), vif->link);
+ IP_DSCP(iph->tos), vif->link);
if (IS_ERR(rt))
goto out_free;
encap = sizeof(struct iphdr);
@@ -1848,7 +1848,7 @@ static void ipmr_queue_xmit(struct net *net, struct mr_table *mrt,
rt = ip_route_output_ports(net, &fl4, NULL,
iph->daddr, 0, 0, 0,
IPPROTO_IPIP,
- RT_TOS(iph->tos), vif->link);
+ IP_DSCP(iph->tos), vif->link);
if (IS_ERR(rt))
goto out_free;
}
@@ -2048,7 +2048,7 @@ static struct mr_table *ipmr_rt_fib_lookup(struct net *net, struct sk_buff *skb)
struct flowi4 fl4 = {
.daddr = iph->daddr,
.saddr = iph->saddr,
- .flowi4_tos = RT_TOS(iph->tos),
+ .flowi4_tos = IP_DSCP(iph->tos),
.flowi4_oif = (rt_is_output_route(rt) ?
skb->dev->ifindex : 0),
.flowi4_iif = (rt_is_output_route(rt) ?
diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c
index 7c841037c533..aa9f5322a489 100644
--- a/net/ipv4/netfilter.c
+++ b/net/ipv4/netfilter.c
@@ -42,7 +42,7 @@ int ip_route_me_harder(struct net *net, struct sock *sk, struct sk_buff *skb, un
*/
fl4.daddr = iph->daddr;
fl4.saddr = saddr;
- fl4.flowi4_tos = RT_TOS(iph->tos);
+ fl4.flowi4_tos = IP_DSCP(iph->tos);
fl4.flowi4_oif = sk ? sk->sk_bound_dev_if : 0;
if (!fl4.flowi4_oif)
fl4.flowi4_oif = l3mdev_master_ifindex(dev);
diff --git a/net/ipv4/netfilter/ipt_rpfilter.c b/net/ipv4/netfilter/ipt_rpfilter.c
index cc23f1ce239c..5e952661a5ea 100644
--- a/net/ipv4/netfilter/ipt_rpfilter.c
+++ b/net/ipv4/netfilter/ipt_rpfilter.c
@@ -76,7 +76,7 @@ static bool rpfilter_mt(const struct sk_buff *skb, struct xt_action_param *par)
flow.daddr = iph->saddr;
flow.saddr = rpfilter_get_saddr(iph->daddr);
flow.flowi4_mark = info->flags & XT_RPFILTER_VALID_MARK ? skb->mark : 0;
- flow.flowi4_tos = RT_TOS(iph->tos);
+ flow.flowi4_tos = IP_DSCP(iph->tos);
flow.flowi4_scope = RT_SCOPE_UNIVERSE;
flow.flowi4_oif = l3mdev_master_ifindex_rcu(xt_in(par));
diff --git a/net/ipv4/netfilter/nf_dup_ipv4.c b/net/ipv4/netfilter/nf_dup_ipv4.c
index 6cc5743c553a..d2613828a0ec 100644
--- a/net/ipv4/netfilter/nf_dup_ipv4.c
+++ b/net/ipv4/netfilter/nf_dup_ipv4.c
@@ -32,7 +32,7 @@ static bool nf_dup_ipv4_route(struct net *net, struct sk_buff *skb,
fl4.flowi4_oif = oif;
fl4.daddr = gw->s_addr;
- fl4.flowi4_tos = RT_TOS(iph->tos);
+ fl4.flowi4_tos = IP_DSCP(iph->tos);
fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
fl4.flowi4_flags = FLOWI_FLAG_KNOWN_NH;
rt = ip_route_output_key(net, &fl4);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c962f0d96d8d..1ae7f5d668c9 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -113,7 +113,7 @@
#include "fib_lookup.h"
#define RT_FL_TOS(oldflp4) \
- ((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
+ ((oldflp4)->flowi4_tos & (IP_DSCP_MASK | RTO_ONLINK))
#define RT_GC_TIMEOUT (300*HZ)
@@ -549,7 +549,7 @@ static void build_skb_flow_key(struct flowi4 *fl4, const struct sk_buff *skb,
const struct net *net = dev_net(skb->dev);
const struct iphdr *iph = ip_hdr(skb);
int oif = skb->dev->ifindex;
- u8 tos = RT_TOS(iph->tos);
+ u8 tos = IP_DSCP(iph->tos);
u8 prot = iph->protocol;
u32 mark = skb->mark;
@@ -825,7 +825,7 @@ static void ip_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_buf
const struct iphdr *iph = (const struct iphdr *) skb->data;
struct net *net = dev_net(skb->dev);
int oif = skb->dev->ifindex;
- u8 tos = RT_TOS(iph->tos);
+ u8 tos = IP_DSCP(iph->tos);
u8 prot = iph->protocol;
u32 mark = skb->mark;
@@ -1073,7 +1073,7 @@ void ipv4_update_pmtu(struct sk_buff *skb, struct net *net, u32 mtu,
u32 mark = IP4_REPLY_MARK(net, skb->mark);
__build_flow_key(net, &fl4, NULL, iph, oif,
- RT_TOS(iph->tos), protocol, mark, 0);
+ IP_DSCP(iph->tos), protocol, mark, 0);
rt = __ip_route_output_key(net, &fl4);
if (!IS_ERR(rt)) {
__ip_rt_update_pmtu(rt, &fl4, mtu);
@@ -1162,7 +1162,7 @@ void ipv4_redirect(struct sk_buff *skb, struct net *net,
struct rtable *rt;
__build_flow_key(net, &fl4, NULL, iph, oif,
- RT_TOS(iph->tos), protocol, 0, 0);
+ IP_DSCP(iph->tos), protocol, 0, 0);
rt = __ip_route_output_key(net, &fl4);
if (!IS_ERR(rt)) {
__ip_do_redirect(rt, skb, &fl4, false);
@@ -1274,7 +1274,7 @@ void ip_rt_get_source(u8 *addr, struct sk_buff *skb, struct rtable *rt)
struct flowi4 fl4 = {
.daddr = iph->daddr,
.saddr = iph->saddr,
- .flowi4_tos = RT_TOS(iph->tos),
+ .flowi4_tos = IP_DSCP(iph->tos),
.flowi4_oif = rt->dst.dev->ifindex,
.flowi4_iif = skb->dev->ifindex,
.flowi4_mark = skb->mark,
@@ -2055,7 +2055,7 @@ int ip_route_use_hint(struct sk_buff *skb, __be32 daddr, __be32 saddr,
if (rt->rt_type != RTN_LOCAL)
goto skip_validate_source;
- tos &= IPTOS_RT_MASK;
+ tos &= IP_DSCP_MASK;
err = fib_validate_source(skb, saddr, daddr, tos, 0, dev, in_dev, &tag);
if (err < 0)
goto martian_source;
@@ -2298,7 +2298,7 @@ int ip_route_input_noref(struct sk_buff *skb, __be32 daddr, __be32 saddr,
struct fib_result res;
int err;
- tos &= IPTOS_RT_MASK;
+ tos &= IP_DSCP_MASK;
rcu_read_lock();
err = ip_route_input_rcu(skb, daddr, saddr, tos, dev, &res);
rcu_read_unlock();
@@ -2499,7 +2499,7 @@ struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
struct rtable *rth;
fl4->flowi4_iif = LOOPBACK_IFINDEX;
- fl4->flowi4_tos = tos & IPTOS_RT_MASK;
+ fl4->flowi4_tos = tos & IP_DSCP_MASK;
fl4->flowi4_scope = ((tos & RTO_ONLINK) ?
RT_SCOPE_LINK : RT_SCOPE_UNIVERSE);
@@ -2808,7 +2808,7 @@ struct rtable *ip_route_output_tunnel(struct sk_buff *skb,
fl4.daddr = info->key.u.ipv4.dst;
fl4.saddr = info->key.u.ipv4.src;
tos = info->key.tos;
- fl4.flowi4_tos = RT_TOS(tos);
+ fl4.flowi4_tos = IP_DSCP(tos);
rt = ip_route_output_key(net, &fl4);
if (IS_ERR(rt)) {
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 749ad72386b2..1cd6f7e7bc13 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1243,7 +1243,7 @@ struct dst_entry *ip6_dst_lookup_tunnel(struct sk_buff *skb,
fl6.daddr = info->key.u.ipv6.dst;
fl6.saddr = info->key.u.ipv6.src;
prio = info->key.tos;
- fl6.flowlabel = ip6_make_flowinfo(RT_TOS(prio),
+ fl6.flowlabel = ip6_make_flowinfo(IP_DSCP(prio),
info->key.label);
dst = ipv6_stub->ipv6_dst_lookup_flow(net, sock->sk, &fl6,
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index a7950baa05e5..ef1c880da186 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -612,7 +612,7 @@ ip4ip6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
/* Try to guess incoming interface */
rt = ip_route_output_ports(dev_net(skb->dev), &fl4, NULL, eiph->saddr,
- 0, 0, 0, IPPROTO_IPIP, RT_TOS(eiph->tos), 0);
+ 0, 0, 0, IPPROTO_IPIP, IP_DSCP(eiph->tos), 0);
if (IS_ERR(rt))
goto out;
@@ -623,7 +623,7 @@ ip4ip6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
if (rt->rt_flags & RTCF_LOCAL) {
rt = ip_route_output_ports(dev_net(skb->dev), &fl4, NULL,
eiph->daddr, eiph->saddr, 0, 0,
- IPPROTO_IPIP, RT_TOS(eiph->tos), 0);
+ IPPROTO_IPIP, IP_DSCP(eiph->tos), 0);
if (IS_ERR(rt) || rt->dst.dev->type != ARPHRD_TUNNEL6) {
if (!IS_ERR(rt))
ip_rt_put(rt);
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 2da0ee703779..5149ed121e6b 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -937,7 +937,7 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
}
flowi4_init_output(&fl4, tunnel->parms.link, tunnel->fwmark,
- RT_TOS(tos), RT_SCOPE_UNIVERSE, IPPROTO_IPV6,
+ IP_DSCP(tos), RT_SCOPE_UNIVERSE, IPPROTO_IPV6,
0, dst, tiph->saddr, 0, 0,
sock_net_uid(tunnel->net, NULL));
@@ -1112,7 +1112,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev)
iph->daddr, iph->saddr,
0, 0,
IPPROTO_IPV6,
- RT_TOS(iph->tos),
+ IP_DSCP(iph->tos),
tunnel->parms.link);
if (!IS_ERR(rt)) {
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index d622c2548d22..0425cc597a98 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2450,7 +2450,7 @@ xfrm_tmpl_resolve(struct xfrm_policy **pols, int npols, const struct flowi *fl,
static int xfrm_get_tos(const struct flowi *fl, int family)
{
if (family == AF_INET)
- return IPTOS_RT_MASK & fl->u.ip4.flowi4_tos;
+ return IP_DSCP_MASK & fl->u.ip4.flowi4_tos;
return 0;
}
--
2.26.2
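
For readers following the RT_TOS() to IP_DSCP() conversion above, here is a minimal user-space sketch of what the mask change means for a TOS byte. IPTOS_TOS_MASK, IPTOS_RT_MASK and RT_TOS() are reproduced as they appear in the kernel headers of this era. The value of IP_DSCP_MASK (0xFC, the upper six bits of the byte) is an assumption, since its definition is not part of the hunks shown here; old_route_tos() and new_route_tos() are hypothetical helper names used only for illustration.

```c
#include <stdint.h>

/* Legacy definitions (include/uapi/linux/ip.h, in_route.h, net/route.h). */
#define IPTOS_TOS_MASK	0x1E
#define IPTOS_RT_MASK	(IPTOS_TOS_MASK & ~3)
#define RT_TOS(tos)	((tos) & IPTOS_TOS_MASK)

/*
 * ASSUMPTION: the definition is not visible in this part of the patch.
 * 0xFC selects the six DSCP bits and drops the two ECN bits.
 */
#define IP_DSCP_MASK	0xFC
#define IP_DSCP(tos)	((tos) & IP_DSCP_MASK)

/* Hypothetical helper: value placed in flowi4_tos before the patch. */
static inline uint8_t old_route_tos(uint8_t tos)
{
	return RT_TOS(tos);
}

/* Hypothetical helper: value placed in flowi4_tos after the patch. */
static inline uint8_t new_route_tos(uint8_t tos)
{
	return IP_DSCP(tos);
}
```

Under that assumption, a TOS byte of 0xBB (DSCP EF with both ECN bits set) would be reduced to 0x1A by the old macro, which keeps one ECN bit and discards the three high DSCP bits, and to 0xB8 by the new one, which keeps the full DSCP field and no ECN bits.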
^ permalink raw reply related
* Re: [net,v2] net/packet: fix packet receive on L3 devices without visible hard header
From: Jason A. Donenfeld @ 2020-11-21 7:51 UTC (permalink / raw)
To: Eyal Birger; +Cc: willemdebruijn.kernel, davem, kuba, netdev, xie.he.0141
In-Reply-To: <20201121062817.3178900-1-eyal.birger@gmail.com>
On 11/21/20, Eyal Birger <eyal.birger@gmail.com> wrote:
> In the patchset merged by commit b9fcf0a0d826
> ("Merge branch 'support-AF_PACKET-for-layer-3-devices'") L3 devices which
> did not have header_ops were given one for the purpose of protocol parsing
> on af_packet transmit path.
>
> That change made af_packet receive path regard these devices as having a
> visible L3 header and therefore aligned incoming skb->data to point to the
> skb's mac_header. Some devices, such as ipip, xfrmi, and others, do not
> reset their mac_header prior to ingress and therefore their incoming
> packets became malformed.
>
> Ideally these devices would reset their mac headers, or af_packet would be
> able to rely on dev->hard_header_len being 0 for such cases, but it seems
> this is not the case.
>
> Fix by changing af_packet RX ll visibility criteria to include the
> existence of a '.create()' header operation, which is used when creating
> a device hard header - via dev_hard_header() - by upper layers, and does
> not exist in these L3 devices.
>
> As this predicate may be useful in other situations, add it as a common
> dev_has_header() helper in netdevice.h.
>
> Fixes: b9fcf0a0d826 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'")
> Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
>
> ---
>
> v2:
> - add common dev_has_header() helper as suggested by Willem de Bruijn
> ---
> include/linux/netdevice.h | 5 +++++
> net/packet/af_packet.c | 18 +++++++++---------
> 2 files changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 964b494b0e8d..fa275a054f46 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3137,6 +3137,11 @@ static inline bool dev_validate_header(const struct net_device *dev,
> return false;
> }
>
> +static inline bool dev_has_header(const struct net_device *dev)
> +{
> + return dev->header_ops && dev->header_ops->create;
> +}
> +
> typedef int gifconf_func_t(struct net_device * dev, char __user * bufptr,
> int len, int size);
> int register_gifconf(unsigned int family, gifconf_func_t *gifconf);
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index cefbd50c1090..7a18ffff8551 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -93,8 +93,8 @@
>
> /*
> Assumptions:
> - - If the device has no dev->header_ops, there is no LL header visible
> - above the device. In this case, its hard_header_len should be 0.
> + - If the device has no dev->header_ops->create, there is no LL header
> + visible above the device. In this case, its hard_header_len should be 0.
> The device may prepend its own header internally. In this case, its
> needed_headroom should be set to the space needed for it to add its
> internal header.
> @@ -108,26 +108,26 @@
> On receive:
> -----------
>
> -Incoming, dev->header_ops != NULL
> +Incoming, dev_has_header(dev) == true
> mac_header -> ll header
> data -> data
>
> -Outgoing, dev->header_ops != NULL
> +Outgoing, dev_has_header(dev) == true
> mac_header -> ll header
> data -> ll header
>
> -Incoming, dev->header_ops == NULL
> +Incoming, dev_has_header(dev) == false
> mac_header -> data
> However drivers often make it point to the ll header.
> This is incorrect because the ll header should be invisible to us.
> data -> data
>
> -Outgoing, dev->header_ops == NULL
> +Outgoing, dev_has_header(dev) == false
> mac_header -> data. ll header is invisible to us.
> data -> data
>
> Resume
> - If dev->header_ops == NULL we are unable to restore the ll header,
> + If dev_has_header(dev) == false we are unable to restore the ll header,
> because it is invisible to us.
>
>
> @@ -2069,7 +2069,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
>
> skb->dev = dev;
>
> - if (dev->header_ops) {
> + if (dev_has_header(dev)) {
> /* The device has an explicit notion of ll header,
> * exported to higher levels.
> *
> @@ -2198,7 +2198,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
> if (!net_eq(dev_net(dev), sock_net(sk)))
> goto drop;
>
> - if (dev->header_ops) {
> + if (dev_has_header(dev)) {
> if (sk->sk_type != SOCK_DGRAM)
> skb_push(skb, skb->data - skb_mac_header(skb));
> else if (skb->pkt_type == PACKET_OUTGOING) {
Thanks for fixing this. Patch seems correct to me.
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
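
To illustrate the predicate discussed above, here is a small user-space sketch, not kernel code. The mock_* types and the three example devices are hypothetical stand-ins for struct net_device and struct header_ops; only the body of mock_dev_has_header() mirrors the dev_has_header() helper quoted in the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct header_ops, reduced to two callbacks. */
struct mock_header_ops {
	int (*create)(void);		/* builds a link-layer header */
	int (*parse_protocol)(void);	/* protocol parsing on the TX path */
};

/* Hypothetical stand-in for struct net_device. */
struct mock_net_device {
	const struct mock_header_ops *header_ops;
};

/* Same logic as the dev_has_header() helper added by the patch. */
static inline bool mock_dev_has_header(const struct mock_net_device *dev)
{
	return dev->header_ops && dev->header_ops->create;
}

static inline int mock_create(void) { return 0; }
static inline int mock_parse_protocol(void) { return 0; }

/* Ethernet-like device: header_ops with a create() callback. */
static const struct mock_header_ops eth_like_ops = {
	.create = mock_create,
	.parse_protocol = mock_parse_protocol,
};

/* L3 tunnel after b9fcf0a0d826: header_ops present, but no create(). */
static const struct mock_header_ops l3_tunnel_ops = {
	.create = NULL,
	.parse_protocol = mock_parse_protocol,
};

static const struct mock_net_device eth_like_dev = { .header_ops = &eth_like_ops };
static const struct mock_net_device l3_tunnel_dev = { .header_ops = &l3_tunnel_ops };
static const struct mock_net_device bare_dev = { .header_ops = NULL };
```

In this sketch the old test (header_ops != NULL) would treat l3_tunnel_dev as having a visible LL header, while mock_dev_has_header() returns false for it and for bare_dev, and true only for eth_like_dev.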
^ permalink raw reply
* Re: [PATCH bpf-next V7 8/8] bpf/selftests: activating bpf_check_mtu BPF-helper
From: Andrii Nakryiko @ 2020-11-21 7:41 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: bpf, Networking, Daniel Borkmann, Alexei Starovoitov,
Maciej Żenczykowski, Lorenz Bauer, shaun, Lorenzo Bianconi,
Marek Majkowski, John Fastabend, Jakub Kicinski, eyal.birger,
colrack
In-Reply-To: <160588912738.2817268.9380466634324530673.stgit@firesoul>
On Fri, Nov 20, 2020 at 8:21 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> Adding selftest for BPF-helper bpf_check_mtu(). Making sure
> it can be used from both XDP and TC.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
> tools/testing/selftests/bpf/prog_tests/check_mtu.c | 37 ++++++++++++++++++++
> tools/testing/selftests/bpf/progs/test_check_mtu.c | 33 ++++++++++++++++++
> 2 files changed, 70 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c
> create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/check_mtu.c b/tools/testing/selftests/bpf/prog_tests/check_mtu.c
> new file mode 100644
> index 000000000000..09b8f986a17b
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/check_mtu.c
> @@ -0,0 +1,37 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2020 Red Hat */
> +#include <uapi/linux/bpf.h>
> +#include <linux/if_link.h>
> +#include <test_progs.h>
> +
> +#include "test_check_mtu.skel.h"
> +#define IFINDEX_LO 1
> +
> +void test_check_mtu_xdp(struct test_check_mtu *skel)
This should be a static function; otherwise it's treated as an independent selftest.
> +{
> + int err = 0;
> + int fd;
> +
> + fd = bpf_program__fd(skel->progs.xdp_use_helper);
> + err = bpf_set_link_xdp_fd(IFINDEX_LO, fd, XDP_FLAGS_SKB_MODE);
> + if (CHECK_FAIL(err))
Please use CHECK() or one of the ASSERT_xxx() helpers. CHECK_FAIL() should
be reserved for high-volume checks that are very unlikely to fail (i.e., used very rarely).
> + return;
> +
> + bpf_set_link_xdp_fd(IFINDEX_LO, -1, 0);
> +}
> +
> +void test_check_mtu(void)
> +{
> + struct test_check_mtu *skel;
> +
> + skel = test_check_mtu__open_and_load();
> + if (CHECK_FAIL(!skel)) {
> + perror("test_check_mtu__open_and_load");
> + return;
> + }
> +
> + if (test__start_subtest("bpf_check_mtu XDP-attach"))
> + test_check_mtu_xdp(skel);
> +
> + test_check_mtu__destroy(skel);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/test_check_mtu.c b/tools/testing/selftests/bpf/progs/test_check_mtu.c
> new file mode 100644
> index 000000000000..ab97ec925a32
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_check_mtu.c
> @@ -0,0 +1,33 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2020 Red Hat */
> +#include <linux/bpf.h>
> +#include <bpf/bpf_helpers.h>
> +
> +#include <stddef.h>
> +#include <stdint.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +SEC("xdp")
> +int xdp_use_helper(struct xdp_md *ctx)
> +{
> + uint32_t mtu_len = 0;
> + int delta = 20;
> +
> + if (bpf_check_mtu(ctx, 0, &mtu_len, delta, 0)) {
> + return XDP_ABORTED;
> + }
Nit: the {} are unnecessary for a single-line if; same below.
> + return XDP_PASS;
> +}
> +
> +SEC("classifier")
> +int tc_use_helper(struct __sk_buff *ctx)
> +{
> + uint32_t mtu_len = 0;
> + int delta = -20;
> +
> + if (bpf_check_mtu(ctx, 0, &mtu_len, delta, 0)) {
> + return BPF_DROP;
> + }
> + return BPF_OK;
> +}
>
>
^ permalink raw reply