Netdev List
 help / color / mirror / Atom feed
* [PATCH V4 1/3] dt-bindings: net: dsa: ksz: document Microchip KSZ87xx family switches
From: Marek Vasut @ 2019-07-29 17:49 UTC (permalink / raw)
  To: netdev
  Cc: Marek Vasut, Andrew Lunn, David S . Miller, Florian Fainelli,
	Rob Herring, Tristram Ha, Vivien Didelot, Woojung Huh, devicetree
In-Reply-To: <20190729174947.10103-1-marex@denx.de>

Document Microchip KSZ87xx family switches. These include
KSZ8765 - 5 port switch
KSZ8794 - 4 port switch
KSZ8795 - 5 port switch

Signed-off-by: Marek Vasut <marex@denx.de>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: David S. Miller <davem@davemloft.net>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Tristram Ha <Tristram.Ha@microchip.com>
Cc: Vivien Didelot <vivien.didelot@gmail.com>
Cc: Woojung Huh <woojung.huh@microchip.com>
Cc: devicetree@vger.kernel.org
---
V2: No change
V3: No change
V4: No change
---
 Documentation/devicetree/bindings/net/dsa/ksz.txt | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/dsa/ksz.txt b/Documentation/devicetree/bindings/net/dsa/ksz.txt
index 4ac21cef370e..5e8429b6f9ca 100644
--- a/Documentation/devicetree/bindings/net/dsa/ksz.txt
+++ b/Documentation/devicetree/bindings/net/dsa/ksz.txt
@@ -5,6 +5,9 @@ Required properties:
 
 - compatible: For external switch chips, compatible string must be exactly one
   of the following:
+  - "microchip,ksz8765"
+  - "microchip,ksz8794"
+  - "microchip,ksz8795"
   - "microchip,ksz9477"
   - "microchip,ksz9897"
   - "microchip,ksz9896"
-- 
2.20.1


^ permalink raw reply related

* Re: [patch net-next 1/3] net: devlink: allow to change namespaces
From: David Miller @ 2019-07-29 17:52 UTC (permalink / raw)
  To: jiri; +Cc: netdev, jakub.kicinski, sthemmin, dsahern, mlxsw
In-Reply-To: <20190727094459.26345-2-jiri@resnulli.us>

From: Jiri Pirko <jiri@resnulli.us>
Date: Sat, 27 Jul 2019 11:44:57 +0200

> +	if ((netns_pid_attr && (netns_fd_attr || netns_id_attr)) ||
> +	    (netns_fd_attr && (netns_pid_attr || netns_id_attr)) ||
> +	    (netns_id_attr && (netns_pid_attr || netns_fd_attr))) {
> +		NL_SET_ERR_MSG(info->extack, "multiple netns identifying attributes specified");
> +		return ERR_PTR(-EINVAL);
> +	}

How about:

	if (!!a + !!b + !!c > 1) {
	...

> +
> +	if (netns_pid_attr) {
> +		net = get_net_ns_by_pid(nla_get_u32(netns_pid_attr));
> +	} else if (netns_fd_attr) {
> +		net = get_net_ns_by_fd(nla_get_u32(netns_fd_attr));
> +	} else if (netns_id_attr) {
> +		net = get_net_ns_by_id(sock_net(skb->sk),
> +				       nla_get_u32(netns_id_attr));
> +		if (!net)
> +			net = ERR_PTR(-EINVAL);
> +	}
> +	if (IS_ERR(net)) {

I think this is going to be one of those cases where a compiler won't be able
to prove that 'net' is guaranteed to be initialized at this spot.  Please
rearrange this code somehow so that is unlikely to happen.

Thanks.

^ permalink raw reply

* Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
From: Ido Schimmel @ 2019-07-29 17:51 UTC (permalink / raw)
  To: Allan W. Nielsen
  Cc: Nikolay Aleksandrov, Horatiu Vultur, roopa, davem, bridge, netdev,
	linux-kernel
In-Reply-To: <20190729143508.tcyebbvleppa242d@lx-anielsen.microsemi.net>

On Mon, Jul 29, 2019 at 04:35:09PM +0200, Allan W. Nielsen wrote:
> The 07/29/2019 17:21, Nikolay Aleksandrov wrote:
> > On 29/07/2019 16:52, Allan W. Nielsen wrote:
> > > The 07/29/2019 15:50, Nikolay Aleksandrov wrote:
> > >> On 29/07/2019 15:22, Nikolay Aleksandrov wrote:
> > >>> Hi Allan,
> > >>> On 29/07/2019 15:14, Allan W. Nielsen wrote:
> > >>>> First of all, as mentioned further down in this thread, I realized that our
> > >>>> implementation of the multicast floodmasks does not align with the existing SW
> > >>>> implementation. We will change this, such that all multicast packets goes to the
> > >>>> SW bridge.
> > >>>>
> > >>>> This changes things a bit, not that much.
> > >>>>
> > >>>> I actually think you summarized the issue we have (after changing to multicast
> > >>>> flood-masks) right here:
> > >>>>
> > >>>> The 07/26/2019 12:26, Nikolay Aleksandrov wrote:
> > >>>>>>> Actually you mentioned non-IP traffic, so the querier stuff is not a problem. This
> > >>>>>>> traffic will always be flooded by the bridge (and also a copy will be locally sent up).
> > >>>>>>> Thus only the flooding may need to be controlled.
> > >>>>
> > >>>> This seems to be exactly what we need.
> > >>>>
> > >>>> Assuming we have a SW bridge (br0) with 4 slave interfaces (eth0-3). We use this
> > >>>> on a network where we want to limit the flooding of frames with dmac
> > >>>> 01:21:6C:00:00:01 (which is non IP traffic) to eth0 and eth1.
> > >>>>
> > >>>> One way of doing this could potentially be to support the following command:
> > >>>>
> > >>>> bridge fdb add    01:21:6C:00:00:01 port eth0
> > >>>> bridge fdb append 01:21:6C:00:00:01 port eth1
> > >> And the fdbs become linked lists?
> > > Yes, it will most likely become a linked list
> > > 
> > >> So we'll increase the complexity for something that is already supported by
> > >> ACLs (e.g. tc) and also bridge per-port multicast flood flag ?
> > > I do not think it can be supported with the facilities we have today in tc.
> > > 
> > > We can do half of it (copy more fraems to the CPU) with tc, but we can not limit
> > > the floodmask of a frame with tc (say we want it to flood to 2 out of 4 slave
> > > ports).
> > Why not ? You attach an egress filter for the ports and allow that dmac on only
> > 2 of the ports.
> Because we want a solution which we eventually can offload in HW. And the HW
> facilities we have is doing ingress processing (we have no egress ACLs in this
> design), and if we try to offload an egress rule, with an ingress HW facility,
> then we will run into other issues.

Can you please clarify what you're trying to achieve? I just read the
thread again and my impression is that you're trying to locally receive
packets with a certain link layer multicast address. Nik suggested
SIOCADDMULTI and I suggested a tc filter to get the packet to the CPU.

If you now want to limit the ports to which this packet is flooded, then
you can use tc filters in *software*:

# tc qdisc add dev eth2 clsact
# tc filter add dev eth2 egress pref 1 flower skip_hw \
	dst_mac 01:21:6C:00:00:01 action drop

If you want to forward the packet in hardware and locally receive it,
you can chain several mirred action and then a trap action.

Both options avoid HW egress ACLs which your design does not support.

^ permalink raw reply

* Re: [PATCH 2/2] staging/octeon: Allow test build on !MIPS
From: David Miller @ 2019-07-29 17:45 UTC (permalink / raw)
  To: willy; +Cc: netdev, aaro.koskinen, arnd, gregkh
In-Reply-To: <20190726174425.6845-3-willy@infradead.org>

From: Matthew Wilcox <willy@infradead.org>
Date: Fri, 26 Jul 2019 10:44:25 -0700

> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Add compile test support by moving all includes of files under
> asm/octeon into octeon-ethernet.h, and if we're not on MIPS,
> stub out all the calls into the octeon support code in octeon-stubs.h
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH 1/2] octeon: Fix typo
From: David Miller @ 2019-07-29 17:44 UTC (permalink / raw)
  To: willy; +Cc: netdev, aaro.koskinen, arnd, gregkh
In-Reply-To: <20190726174425.6845-2-willy@infradead.org>

From: Matthew Wilcox <willy@infradead.org>
Date: Fri, 26 Jul 2019 10:44:24 -0700

> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Compile fix from skb_frag_t conversion.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

As I mentioned I already fixed this.

^ permalink raw reply

* Re: [PATCH V3 2/3] net: dsa: ksz: Add KSZ8795 tag code
From: David Miller @ 2019-07-29 17:33 UTC (permalink / raw)
  To: marex; +Cc: netdev, Tristram.Ha, andrew, f.fainelli, vivien.didelot,
	woojung.huh
In-Reply-To: <20190726162308.16764-3-marex@denx.de>

From: Marek Vasut <marex@denx.de>
Date: Fri, 26 Jul 2019 18:23:07 +0200

> @@ -77,6 +77,13 @@ config NET_DSA_TAG_KSZ_COMMON
>  	tristate
>  	default n
>  
> +config NET_DSA_TAG_KSZ8795
> +	tristate "Tag driver for Microchip 8795 family of switches"
> +	select NET_DSA_TAG_KSZ_COMMON
> +	help
> +	  Say Y if you want to enable support for tagging frames for the
> +	  Microchip 8795 family of switches.

This Kconfig knob doesn't control the inclusion of code at all.

It's only use is in patch #3 where it is selected for the new
NET_DSA_MICROCHIP_KSZ8795.

Therefore it isn't doing anything.

So please either make it control code inclusion or remove it.

Thank you.

^ permalink raw reply

* Re: [PATCH net-next] net: dsa: mv88e6xxx: avoid some redundant vtu load/purge operations
From: David Miller @ 2019-07-29 17:29 UTC (permalink / raw)
  To: rasmus.villemoes
  Cc: andrew, vivien.didelot, f.fainelli, Rasmus.Villemoes, netdev,
	linux-kernel
In-Reply-To: <20190722233713.31396-1-rasmus.villemoes@prevas.dk>


Andrew, I really can't wait forever to get this patch reviewed.

It's been an entire week already :-(

^ permalink raw reply

* Re: [PATCH] arcnet: com20020-isa: Mark expected switch fall-throughs
From: David Miller @ 2019-07-29 17:24 UTC (permalink / raw)
  To: gustavo; +Cc: m.grzeschik, netdev, linux-kernel, geert, keescook
In-Reply-To: <20190729142503.GA7917@embeddedor>

From: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Date: Mon, 29 Jul 2019 09:25:03 -0500

> Mark switch cases where we are expecting to fall through.
> 
> This patch fixes the following warnings:
> 
> drivers/net/arcnet/com20020-isa.c: warning: this statement may fall
> through [-Wimplicit-fallthrough=]:  => 205:13, 203:10, 209:7, 201:11,
> 207:8
> 
> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

Applied.

^ permalink raw reply

* Re: [PATCH v3] net: sched: Fix a possible null-pointer dereference in dequeue_func()
From: Cong Wang @ 2019-07-29 17:23 UTC (permalink / raw)
  To: Jia-Ju Bai
  Cc: Jamal Hadi Salim, Jiri Pirko, David Miller,
	Linux Kernel Network Developers, LKML
In-Reply-To: <20190729082433.28981-1-baijiaju1990@gmail.com>

On Mon, Jul 29, 2019 at 1:24 AM Jia-Ju Bai <baijiaju1990@gmail.com> wrote:
>
> In dequeue_func(), there is an if statement on line 74 to check whether
> skb is NULL:
>     if (skb)
>
> When skb is NULL, it is used on line 77:
>     prefetch(&skb->end);
>
> Thus, a possible null-pointer dereference may occur.
>
> To fix this bug, skb->end is used when skb is not NULL.
>
> This bug is found by a static analysis tool STCheck written by us.

It doesn't dereference the pointer, it simply calculates the address
and passes it to gcc builtin prefetch. Both are fine when it is NULL,
as prefetching a NULL address should be okay for kernel.

So your changelog is misleading and it doesn't fix any bug,
although it does very slightly make the code better.

^ permalink raw reply

* RE: [PATCH net] hv_sock: Fix hang when a connection is closed
From: Sunil Muthuswamy @ 2019-07-29 17:21 UTC (permalink / raw)
  To: Dexuan Cui, David Miller, netdev@vger.kernel.org
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	sashal@kernel.org, Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, olaf@aepfle.de, apw@canonical.com,
	jasowang@redhat.com, vkuznets, marcelo.cerri@canonical.com
In-Reply-To: <PU1P153MB01690A7767ECDF420FF78D66BFC20@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>



> -----Original Message-----
> From: Dexuan Cui <decui@microsoft.com>
> Sent: Sunday, July 28, 2019 11:32 AM
> To: Sunil Muthuswamy <sunilmut@microsoft.com>; David Miller <davem@davemloft.net>; netdev@vger.kernel.org
> Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>; Stephen Hemminger
> <sthemmin@microsoft.com>; sashal@kernel.org; Michael Kelley <mikelley@microsoft.com>; linux-hyperv@vger.kernel.org; linux-
> kernel@vger.kernel.org; olaf@aepfle.de; apw@canonical.com; jasowang@redhat.com; vkuznets <vkuznets@redhat.com>;
> marcelo.cerri@canonical.com
> Subject: [PATCH net] hv_sock: Fix hang when a connection is closed
> 
> 
> hvs_do_close_lock_held() may decrease the reference count to 0 and free the
> sk struct completely, and then the following release_sock(sk) may hang.
> 
> Fixes: a9eeb998c28d ("hv_sock: Add support for delayed close")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Cc: stable@vger.kernel.org
> 
> ---
> With the proper kernel debugging options enabled, first a warning can
> appear:
> 
> kworker/1:0/4467 is freeing memory ..., with a lock still held there!
> stack backtrace:
> Workqueue: events vmbus_onmessage_work [hv_vmbus]
> Call Trace:
>  dump_stack+0x67/0x90
>  debug_check_no_locks_freed.cold.52+0x78/0x7d
>  slab_free_freelist_hook+0x85/0x140
>  kmem_cache_free+0xa5/0x380
>  __sk_destruct+0x150/0x260
>  hvs_close_connection+0x24/0x30 [hv_sock]
>  vmbus_onmessage_work+0x1d/0x30 [hv_vmbus]
>  process_one_work+0x241/0x600
>  worker_thread+0x3c/0x390
>  kthread+0x11b/0x140
>  ret_from_fork+0x24/0x30
> 
> and then the following release_sock(sk) can hang:
> 
> watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:4467]
> ...
> irq event stamp: 62890
> CPU: 1 PID: 4467 Comm: kworker/1:0 Tainted: G        W         5.2.0+ #39
> Workqueue: events vmbus_onmessage_work [hv_vmbus]
> RIP: 0010:queued_spin_lock_slowpath+0x2b/0x1e0
> ...
> Call Trace:
>  do_raw_spin_lock+0xab/0xb0
>  release_sock+0x19/0xb0
>  vmbus_onmessage_work+0x1d/0x30 [hv_vmbus]
>  process_one_work+0x241/0x600
>  worker_thread+0x3c/0x390
>  kthread+0x11b/0x140
>  ret_from_fork+0x24/0x30
> 
>  net/vmw_vsock/hyperv_transport.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> index f2084e3f7aa4..efbda8ef1eff 100644
> --- a/net/vmw_vsock/hyperv_transport.c
> +++ b/net/vmw_vsock/hyperv_transport.c
> @@ -309,9 +309,16 @@ static void hvs_close_connection(struct vmbus_channel *chan)
>  {
>  	struct sock *sk = get_per_channel_state(chan);
> 
> +	/* Grab an extra reference since hvs_do_close_lock_held() may decrease
> +	 * the reference count to 0 by calling sock_put(sk).
> +	 */
> +	sock_hold(sk);
> +

To me, it seems like when 'hvs_close_connection' is called, there should always be
an outstanding reference to the socket. The reference that is dropped by
' hvs_do_close_lock_held' is a legitimate reference that was taken by 'hvs_close_lock_held'.
Or, in other words, I think the right solution is to always maintain a reference to socket
until this routine is called and drop that here. That can be done by taking the reference to
the socket prior to ' vmbus_set_chn_rescind_callback(chan, hvs_close_connection)' and
dropping that reference at the end of 'hvs_close_connection'.

>  	lock_sock(sk);
>  	hvs_do_close_lock_held(vsock_sk(sk), true);
>  	release_sock(sk);
> +
> +	sock_put(sk);
>  }
> 
>  static void hvs_open_connection(struct vmbus_channel *chan)
> --
> 2.19.1

Thanks for taking a look at this. We should queue this fix and the other hvsocket fixes
for the stable branch.

^ permalink raw reply

* [PATCH 2/3 net-next] net: Use skb_frag_off accessors
From: Jonathan Lemon @ 2019-07-29 17:19 UTC (permalink / raw)
  To: willy, davem; +Cc: kernel-team, netdev
In-Reply-To: <20190729171941.250569-1-jonathan.lemon@gmail.com>

Use accessor functions for skb fragment's page_offset instead
of direct references, in preparation for bvec conversion.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 drivers/atm/eni.c                             |  2 +-
 drivers/hsi/clients/ssi_protocol.c            |  2 +-
 drivers/infiniband/hw/hfi1/vnic_sdma.c        |  2 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c       |  3 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |  2 +-
 .../ethernet/cavium/thunder/nicvf_queues.c    |  2 +-
 drivers/net/ethernet/chelsio/cxgb3/sge.c      |  2 +-
 drivers/net/ethernet/emulex/benet/be_main.c   | 12 ++---
 .../ethernet/freescale/fs_enet/fs_enet-main.c |  2 +-
 drivers/net/ethernet/ibm/ibmvnic.c            |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  2 +-
 drivers/net/ethernet/intel/iavf/iavf_txrx.c   |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  4 +-
 drivers/net/ethernet/jme.c                    |  4 +-
 drivers/net/ethernet/marvell/mv643xx_eth.c    |  2 +-
 .../net/ethernet/myricom/myri10ge/myri10ge.c  |  6 +--
 drivers/net/ethernet/sfc/tx.c                 |  2 +-
 drivers/net/ethernet/sun/cassini.c            |  8 +--
 drivers/net/ethernet/sun/niu.c                |  2 +-
 drivers/net/ethernet/sun/sunvnet_common.c     |  4 +-
 drivers/net/ethernet/ti/netcp_core.c          |  2 +-
 drivers/net/hyperv/netvsc_drv.c               |  4 +-
 drivers/net/thunderbolt.c                     |  2 +-
 drivers/net/usb/usbnet.c                      |  2 +-
 drivers/net/vmxnet3/vmxnet3_drv.c             |  2 +-
 drivers/net/xen-netback/netback.c             |  6 +--
 drivers/net/xen-netfront.c                    |  8 +--
 drivers/scsi/bnx2fc/bnx2fc_fcoe.c             |  2 +-
 drivers/scsi/fcoe/fcoe.c                      |  3 +-
 drivers/scsi/fcoe/fcoe_transport.c            |  2 +-
 drivers/scsi/qedf/qedf_main.c                 |  2 +-
 .../staging/unisys/visornic/visornic_main.c   |  2 +-
 drivers/target/iscsi/cxgbit/cxgbit_target.c   |  4 +-
 net/appletalk/ddp.c                           |  4 +-
 net/core/datagram.c                           |  6 +--
 net/core/dev.c                                |  2 +-
 net/core/pktgen.c                             |  2 +-
 net/core/skbuff.c                             | 54 ++++++++++---------
 net/ipv4/tcp.c                                |  6 +--
 net/ipv4/tcp_output.c                         |  2 +-
 net/kcm/kcmsock.c                             |  2 +-
 net/tls/tls_device.c                          |  8 +--
 net/tls/tls_device_fallback.c                 |  2 +-
 net/xfrm/xfrm_ipcomp.c                        |  2 +-
 44 files changed, 100 insertions(+), 98 deletions(-)

diff --git a/drivers/atm/eni.c b/drivers/atm/eni.c
index 79b718430cd1..b23d1e4bad33 100644
--- a/drivers/atm/eni.c
+++ b/drivers/atm/eni.c
@@ -1136,7 +1136,7 @@ DPRINTK("doing direct send\n"); /* @@@ well, this doesn't work anyway */
 			else
 				put_dma(tx->index,eni_dev->dma,&j,(unsigned long)
 				    skb_frag_page(&skb_shinfo(skb)->frags[i]) +
-					skb_shinfo(skb)->frags[i].page_offset,
+					skb_frag_off(&skb_shinfo(skb)->frags[i]),
 				    skb_frag_size(&skb_shinfo(skb)->frags[i]));
 	}
 	if (skb->len & 3) {
diff --git a/drivers/hsi/clients/ssi_protocol.c b/drivers/hsi/clients/ssi_protocol.c
index c9e3f928b93d..0253e76f1df2 100644
--- a/drivers/hsi/clients/ssi_protocol.c
+++ b/drivers/hsi/clients/ssi_protocol.c
@@ -182,7 +182,7 @@ static void ssip_skb_to_msg(struct sk_buff *skb, struct hsi_msg *msg)
 		BUG_ON(!sg);
 		frag = &skb_shinfo(skb)->frags[i];
 		sg_set_page(sg, skb_frag_page(frag), skb_frag_size(frag),
-				frag->page_offset);
+				skb_frag_off(frag));
 	}
 }
 
diff --git a/drivers/infiniband/hw/hfi1/vnic_sdma.c b/drivers/infiniband/hw/hfi1/vnic_sdma.c
index 05a140504a99..7d90b900131b 100644
--- a/drivers/infiniband/hw/hfi1/vnic_sdma.c
+++ b/drivers/infiniband/hw/hfi1/vnic_sdma.c
@@ -108,7 +108,7 @@ static noinline int build_vnic_ulp_payload(struct sdma_engine *sde,
 		ret = sdma_txadd_page(sde->dd,
 				      &tx->txreq,
 				      skb_frag_page(frag),
-				      frag->page_offset,
+				      skb_frag_off(frag),
 				      skb_frag_size(frag));
 		if (unlikely(ret))
 			goto bail_txadd;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 78fa777c87b1..c332b4761816 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -293,7 +293,8 @@ int ipoib_dma_map_tx(struct ib_device *ca, struct ipoib_tx_buf *tx_req)
 		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 		mapping[i + off] = ib_dma_map_page(ca,
 						 skb_frag_page(frag),
-						 frag->page_offset, skb_frag_size(frag),
+						 skb_frag_off(frag),
+						 skb_frag_size(frag),
 						 DMA_TO_DEVICE);
 		if (unlikely(ib_dma_mapping_error(ca, mapping[i + off])))
 			goto partial_error;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 1fedefd499ac..c7273cb796fa 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -919,7 +919,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
 
 	frag = &skb_shinfo(skb)->frags[0];
 	skb_frag_size_sub(frag, payload);
-	frag->page_offset += payload;
+	skb_frag_off_add(frag, payload);
 	skb->data_len -= payload;
 	skb->tail += payload;
 
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index c0266a87794c..4ab57d33a87e 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -1594,7 +1594,7 @@ int nicvf_sq_append_skb(struct nicvf *nic, struct snd_queue *sq,
 		size = skb_frag_size(frag);
 		dma_addr = dma_map_page_attrs(&nic->pdev->dev,
 					      skb_frag_page(frag),
-					      frag->page_offset, size,
+					      skb_frag_off(frag), size,
 					      DMA_TO_DEVICE,
 					      DMA_ATTR_SKIP_CPU_SYNC);
 		if (dma_mapping_error(&nic->pdev->dev, dma_addr)) {
diff --git a/drivers/net/ethernet/chelsio/cxgb3/sge.c b/drivers/net/ethernet/chelsio/cxgb3/sge.c
index 310a232e00f0..6dabbf1502c7 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb3/sge.c
@@ -2182,7 +2182,7 @@ static void lro_add_page(struct adapter *adap, struct sge_qset *qs,
 
 	rx_frag += nr_frags;
 	__skb_frag_set_page(rx_frag, sd->pg_chunk.page);
-	rx_frag->page_offset = sd->pg_chunk.offset + offset;
+	skb_frag_off_set(rx_frag, sd->pg_chunk.offset + offset);
 	skb_frag_size_set(rx_frag, len);
 
 	skb->len += len;
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index e00a94a03879..1c9883019767 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -2346,8 +2346,8 @@ static void skb_fill_rx_data(struct be_rx_obj *rxo, struct sk_buff *skb,
 		memcpy(skb->data, start, hdr_len);
 		skb_shinfo(skb)->nr_frags = 1;
 		skb_frag_set_page(skb, 0, page_info->page);
-		skb_shinfo(skb)->frags[0].page_offset =
-					page_info->page_offset + hdr_len;
+		skb_frag_off_set(&skb_shinfo(skb)->frags[0],
+				 page_info->page_offset + hdr_len);
 		skb_frag_size_set(&skb_shinfo(skb)->frags[0],
 				  curr_frag_len - hdr_len);
 		skb->data_len = curr_frag_len - hdr_len;
@@ -2372,8 +2372,8 @@ static void skb_fill_rx_data(struct be_rx_obj *rxo, struct sk_buff *skb,
 			/* Fresh page */
 			j++;
 			skb_frag_set_page(skb, j, page_info->page);
-			skb_shinfo(skb)->frags[j].page_offset =
-							page_info->page_offset;
+			skb_frag_off_set(&skb_shinfo(skb)->frags[j],
+					 page_info->page_offset);
 			skb_frag_size_set(&skb_shinfo(skb)->frags[j], 0);
 			skb_shinfo(skb)->nr_frags++;
 		} else {
@@ -2454,8 +2454,8 @@ static void be_rx_compl_process_gro(struct be_rx_obj *rxo,
 			/* First frag or Fresh page */
 			j++;
 			skb_frag_set_page(skb, j, page_info->page);
-			skb_shinfo(skb)->frags[j].page_offset =
-							page_info->page_offset;
+			skb_frag_off_set(&skb_shinfo(skb)->frags[j],
+					 page_info->page_offset);
 			skb_frag_size_set(&skb_shinfo(skb)->frags[j], 0);
 		} else {
 			put_page(page_info->page);
diff --git a/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c b/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c
index 5fad73b2e123..3981c06f082f 100644
--- a/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c
+++ b/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c
@@ -501,7 +501,7 @@ fs_enet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		nr_frags = skb_shinfo(skb)->nr_frags;
 		frag = skb_shinfo(skb)->frags;
 		for (i = 0; i < nr_frags; i++, frag++) {
-			if (!IS_ALIGNED(frag->page_offset, 4)) {
+			if (!IS_ALIGNED(skb_frag_off(frag), 4)) {
 				is_aligned = 0;
 				break;
 			}
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 3da680073265..81a05ea38237 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1485,7 +1485,7 @@ static netdev_tx_t ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
 
 			memcpy(dst + cur,
 			       page_address(skb_frag_page(frag)) +
-			       frag->page_offset, skb_frag_size(frag));
+			       skb_frag_off(frag), skb_frag_size(frag));
 			cur += skb_frag_size(frag);
 		}
 	} else {
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index f162252f01b5..e3f29dc8b290 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -3306,7 +3306,7 @@ bool __i40e_chk_linearize(struct sk_buff *skb)
 		 * descriptor associated with the fragment.
 		 */
 		if (stale_size > I40E_MAX_DATA_PER_TXD) {
-			int align_pad = -(stale->page_offset) &
+			int align_pad = -(skb_frag_off(stale)) &
 					(I40E_MAX_READ_REQ_SIZE - 1);
 
 			sum -= align_pad;
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index fae7cd1c618a..7a30d5d5ef53 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -2205,7 +2205,7 @@ bool __iavf_chk_linearize(struct sk_buff *skb)
 		 * descriptor associated with the fragment.
 		 */
 		if (stale_size > IAVF_MAX_DATA_PER_TXD) {
-			int align_pad = -(stale->page_offset) &
+			int align_pad = -(skb_frag_off(stale)) &
 					(IAVF_MAX_READ_REQ_SIZE - 1);
 
 			sum -= align_pad;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index e12d23d1fa64..dc7b128c780e 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1807,7 +1807,7 @@ static void ixgbe_pull_tail(struct ixgbe_ring *rx_ring,
 
 	/* update all of the pointers */
 	skb_frag_size_sub(frag, pull_len);
-	frag->page_offset += pull_len;
+	skb_frag_off_add(frag, pull_len);
 	skb->data_len -= pull_len;
 	skb->tail += pull_len;
 }
@@ -1844,7 +1844,7 @@ static void ixgbe_dma_sync_frag(struct ixgbe_ring *rx_ring,
 
 		dma_sync_single_range_for_cpu(rx_ring->dev,
 					      IXGBE_CB(skb)->dma,
-					      frag->page_offset,
+					      skb_frag_off(frag),
 					      skb_frag_size(frag),
 					      DMA_FROM_DEVICE);
 	}
diff --git a/drivers/net/ethernet/jme.c b/drivers/net/ethernet/jme.c
index 9c3ab00643bd..6d52cf5ce20e 100644
--- a/drivers/net/ethernet/jme.c
+++ b/drivers/net/ethernet/jme.c
@@ -2040,8 +2040,8 @@ jme_map_tx_skb(struct jme_adapter *jme, struct sk_buff *skb, int idx)
 		ctxbi = txbi + ((idx + i + 2) & (mask));
 
 		ret = jme_fill_tx_map(jme->pdev, ctxdesc, ctxbi,
-				skb_frag_page(frag),
-				frag->page_offset, skb_frag_size(frag), hidma);
+				      skb_frag_page(frag), skb_frag_off(frag),
+				      skb_frag_size(frag), hidma);
 		if (ret) {
 			jme_drop_tx_map(jme, idx, i);
 			goto out;
diff --git a/drivers/net/ethernet/marvell/mv643xx_eth.c b/drivers/net/ethernet/marvell/mv643xx_eth.c
index 88ea5ac83c93..82ea55ae5053 100644
--- a/drivers/net/ethernet/marvell/mv643xx_eth.c
+++ b/drivers/net/ethernet/marvell/mv643xx_eth.c
@@ -659,7 +659,7 @@ static inline unsigned int has_tiny_unaligned_frags(struct sk_buff *skb)
 	for (frag = 0; frag < skb_shinfo(skb)->nr_frags; frag++) {
 		const skb_frag_t *fragp = &skb_shinfo(skb)->frags[frag];
 
-		if (skb_frag_size(fragp) <= 8 && fragp->page_offset & 7)
+		if (skb_frag_size(fragp) <= 8 && skb_frag_off(fragp) & 7)
 			return 1;
 	}
 
diff --git a/drivers/net/ethernet/myricom/myri10ge/myri10ge.c b/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
index 9ead6ecb7586..99eaadba555f 100644
--- a/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
+++ b/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
@@ -1306,8 +1306,8 @@ myri10ge_vlan_rx(struct net_device *dev, void *addr, struct sk_buff *skb)
 		skb->len -= VLAN_HLEN;
 		skb->data_len -= VLAN_HLEN;
 		frag = skb_shinfo(skb)->frags;
-		frag->page_offset += VLAN_HLEN;
-		skb_frag_size_set(frag, skb_frag_size(frag) - VLAN_HLEN);
+		skb_frag_off_add(frag, VLAN_HLEN);
+		skb_frag_size_sub(frag, VLAN_HLEN);
 	}
 }
 
@@ -1364,7 +1364,7 @@ myri10ge_rx_done(struct myri10ge_slice_state *ss, int len, __wsum csum)
 	}
 
 	/* remove padding */
-	rx_frags[0].page_offset += MXGEFW_PAD;
+	skb_frag_off_add(&rx_frags[0], MXGEFW_PAD);
 	skb_frag_size_sub(&rx_frags[0], MXGEFW_PAD);
 	len -= MXGEFW_PAD;
 
diff --git a/drivers/net/ethernet/sfc/tx.c b/drivers/net/ethernet/sfc/tx.c
index 31ec56091a5d..65e81ec1b314 100644
--- a/drivers/net/ethernet/sfc/tx.c
+++ b/drivers/net/ethernet/sfc/tx.c
@@ -274,7 +274,7 @@ static void efx_skb_copy_bits_to_pio(struct efx_nic *efx, struct sk_buff *skb,
 
 		vaddr = kmap_atomic(skb_frag_page(f));
 
-		efx_memcpy_toio_aligned_cb(efx, piobuf, vaddr + f->page_offset,
+		efx_memcpy_toio_aligned_cb(efx, piobuf, vaddr + skb_frag_off(f),
 					   skb_frag_size(f), copy_buf);
 		kunmap_atomic(vaddr);
 	}
diff --git a/drivers/net/ethernet/sun/cassini.c b/drivers/net/ethernet/sun/cassini.c
index 6fc05c106afc..c91876f8c536 100644
--- a/drivers/net/ethernet/sun/cassini.c
+++ b/drivers/net/ethernet/sun/cassini.c
@@ -2034,7 +2034,7 @@ static int cas_rx_process_pkt(struct cas *cp, struct cas_rx_comp *rxc,
 
 		__skb_frag_set_page(frag, page->buffer);
 		__skb_frag_ref(frag);
-		frag->page_offset = off;
+		skb_frag_off_set(frag, off);
 		skb_frag_size_set(frag, hlen - swivel);
 
 		/* any more data? */
@@ -2058,7 +2058,7 @@ static int cas_rx_process_pkt(struct cas *cp, struct cas_rx_comp *rxc,
 
 			__skb_frag_set_page(frag, page->buffer);
 			__skb_frag_ref(frag);
-			frag->page_offset = 0;
+			skb_frag_off_set(frag, 0);
 			skb_frag_size_set(frag, hlen);
 			RX_USED_ADD(page, hlen + cp->crc_size);
 		}
@@ -2816,7 +2816,7 @@ static inline int cas_xmit_tx_ringN(struct cas *cp, int ring,
 		mapping = skb_frag_dma_map(&cp->pdev->dev, fragp, 0, len,
 					   DMA_TO_DEVICE);
 
-		tabort = cas_calc_tabort(cp, fragp->page_offset, len);
+		tabort = cas_calc_tabort(cp, skb_frag_off(fragp), len);
 		if (unlikely(tabort)) {
 			void *addr;
 
@@ -2827,7 +2827,7 @@ static inline int cas_xmit_tx_ringN(struct cas *cp, int ring,
 
 			addr = cas_page_map(skb_frag_page(fragp));
 			memcpy(tx_tiny_buf(cp, ring, entry),
-			       addr + fragp->page_offset + len - tabort,
+			       addr + skb_frag_off(fragp) + len - tabort,
 			       tabort);
 			cas_page_unmap(addr);
 			mapping = tx_tiny_map(cp, ring, entry, tentry);
diff --git a/drivers/net/ethernet/sun/niu.c b/drivers/net/ethernet/sun/niu.c
index 0bc5863bffeb..f5fd1f3c07cc 100644
--- a/drivers/net/ethernet/sun/niu.c
+++ b/drivers/net/ethernet/sun/niu.c
@@ -6695,7 +6695,7 @@ static netdev_tx_t niu_start_xmit(struct sk_buff *skb,
 
 		len = skb_frag_size(frag);
 		mapping = np->ops->map_page(np->device, skb_frag_page(frag),
-					    frag->page_offset, len,
+					    skb_frag_off(frag), len,
 					    DMA_TO_DEVICE);
 
 		rp->tx_buffs[prod].skb = NULL;
diff --git a/drivers/net/ethernet/sun/sunvnet_common.c b/drivers/net/ethernet/sun/sunvnet_common.c
index baa3088b475c..646e67236b65 100644
--- a/drivers/net/ethernet/sun/sunvnet_common.c
+++ b/drivers/net/ethernet/sun/sunvnet_common.c
@@ -1088,7 +1088,7 @@ static inline int vnet_skb_map(struct ldc_channel *lp, struct sk_buff *skb,
 			vaddr = kmap_atomic(skb_frag_page(f));
 			blen = skb_frag_size(f);
 			blen += 8 - (blen & 7);
-			err = ldc_map_single(lp, vaddr + f->page_offset,
+			err = ldc_map_single(lp, vaddr + skb_frag_off(f),
 					     blen, cookies + nc, ncookies - nc,
 					     map_perm);
 			kunmap_atomic(vaddr);
@@ -1124,7 +1124,7 @@ static inline struct sk_buff *vnet_skb_shape(struct sk_buff *skb, int ncookies)
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
 
-		docopy |= f->page_offset & 7;
+		docopy |= skb_frag_off(f) & 7;
 	}
 	if (((unsigned long)skb->data & 7) != VNET_PACKET_SKIP ||
 	    skb_tailroom(skb) < pad ||
diff --git a/drivers/net/ethernet/ti/netcp_core.c b/drivers/net/ethernet/ti/netcp_core.c
index 642843945031..1b2702f74455 100644
--- a/drivers/net/ethernet/ti/netcp_core.c
+++ b/drivers/net/ethernet/ti/netcp_core.c
@@ -1116,7 +1116,7 @@ netcp_tx_map_skb(struct sk_buff *skb, struct netcp_intf *netcp)
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 		struct page *page = skb_frag_page(frag);
-		u32 page_offset = frag->page_offset;
+		u32 page_offset = skb_frag_off(frag);
 		u32 buf_len = skb_frag_size(frag);
 		dma_addr_t desc_dma;
 		u32 desc_dma_32;
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 3544e1991579..86884c863013 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -435,7 +435,7 @@ static u32 init_page_array(void *hdr, u32 len, struct sk_buff *skb,
 		skb_frag_t *frag = skb_shinfo(skb)->frags + i;
 
 		slots_used += fill_pg_buf(skb_frag_page(frag),
-					frag->page_offset,
+					skb_frag_off(frag),
 					skb_frag_size(frag), &pb[slots_used]);
 	}
 	return slots_used;
@@ -449,7 +449,7 @@ static int count_skb_frag_slots(struct sk_buff *skb)
 	for (i = 0; i < frags; i++) {
 		skb_frag_t *frag = skb_shinfo(skb)->frags + i;
 		unsigned long size = skb_frag_size(frag);
-		unsigned long offset = frag->page_offset;
+		unsigned long offset = skb_frag_off(frag);
 
 		/* Skip unused frames from start of page */
 		offset &= ~PAGE_MASK;
diff --git a/drivers/net/thunderbolt.c b/drivers/net/thunderbolt.c
index fcf31335a8b6..dacb4f680fd4 100644
--- a/drivers/net/thunderbolt.c
+++ b/drivers/net/thunderbolt.c
@@ -1005,7 +1005,7 @@ static void *tbnet_kmap_frag(struct sk_buff *skb, unsigned int frag_num,
 	const skb_frag_t *frag = &skb_shinfo(skb)->frags[frag_num];
 
 	*len = skb_frag_size(frag);
-	return kmap_atomic(skb_frag_page(frag)) + frag->page_offset;
+	return kmap_atomic(skb_frag_page(frag)) + skb_frag_off(frag);
 }
 
 static netdev_tx_t tbnet_start_xmit(struct sk_buff *skb,
diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index ace7ffaf3913..58952a79b05f 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -1328,7 +1328,7 @@ static int build_dma_sg(const struct sk_buff *skb, struct urb *urb)
 
 		total_len += skb_frag_size(f);
 		sg_set_page(&urb->sg[i + s], skb_frag_page(f), skb_frag_size(f),
-				f->page_offset);
+			    skb_frag_off(f));
 	}
 	urb->transfer_buffer_length = total_len;
 
diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index 03feaeae89cd..216acf37ca7c 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -662,7 +662,7 @@ vmxnet3_append_frag(struct sk_buff *skb, struct Vmxnet3_RxCompDesc *rcd,
 	BUG_ON(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS);
 
 	__skb_frag_set_page(frag, rbi->page);
-	frag->page_offset = 0;
+	skb_frag_off_set(frag, 0);
 	skb_frag_size_set(frag, rcd->len);
 	skb->data_len += rcd->len;
 	skb->truesize += PAGE_SIZE;
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index a96c5c2a2c5a..3ef07b63613e 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -136,12 +136,12 @@ static inline struct xenvif_queue *ubuf_to_queue(const struct ubuf_info *ubuf)
 
 static u16 frag_get_pending_idx(skb_frag_t *frag)
 {
-	return (u16)frag->page_offset;
+	return (u16)skb_frag_off(frag);
 }
 
 static void frag_set_pending_idx(skb_frag_t *frag, u16 pending_idx)
 {
-	frag->page_offset = pending_idx;
+	skb_frag_off_set(frag, pending_idx);
 }
 
 static inline pending_ring_idx_t pending_index(unsigned i)
@@ -1068,7 +1068,7 @@ static int xenvif_handle_frag_list(struct xenvif_queue *queue, struct sk_buff *s
 
 		offset += len;
 		__skb_frag_set_page(&frags[i], page);
-		frags[i].page_offset = 0;
+		skb_frag_off_set(&frags[i], 0);
 		skb_frag_size_set(&frags[i], len);
 	}
 
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 8d33970a2950..b930d5f95222 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -531,7 +531,7 @@ static int xennet_count_skb_slots(struct sk_buff *skb)
 	for (i = 0; i < frags; i++) {
 		skb_frag_t *frag = skb_shinfo(skb)->frags + i;
 		unsigned long size = skb_frag_size(frag);
-		unsigned long offset = frag->page_offset;
+		unsigned long offset = skb_frag_off(frag);
 
 		/* Skip unused frames from start of page */
 		offset &= ~PAGE_MASK;
@@ -674,8 +674,8 @@ static netdev_tx_t xennet_start_xmit(struct sk_buff *skb, struct net_device *dev
 	/* Requests for all the frags. */
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-		tx = xennet_make_txreqs(queue, tx, skb,
-					skb_frag_page(frag), frag->page_offset,
+		tx = xennet_make_txreqs(queue, tx, skb, skb_frag_page(frag),
+					skb_frag_off(frag),
 					skb_frag_size(frag));
 	}
 
@@ -1040,7 +1040,7 @@ static int xennet_poll(struct napi_struct *napi, int budget)
 		if (NETFRONT_SKB_CB(skb)->pull_to > RX_COPY_THRESHOLD)
 			NETFRONT_SKB_CB(skb)->pull_to = RX_COPY_THRESHOLD;
 
-		skb_shinfo(skb)->frags[0].page_offset = rx->offset;
+		skb_frag_off_set(&skb_shinfo(skb)->frags[0], rx->offset);
 		skb_frag_size_set(&skb_shinfo(skb)->frags[0], rx->status);
 		skb->data_len = rx->status;
 		skb->len += rx->status;
diff --git a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
index 7796799bf04a..9ff9429395eb 100644
--- a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
+++ b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
@@ -346,7 +346,7 @@ static int bnx2fc_xmit(struct fc_lport *lport, struct fc_frame *fp)
 			return -ENOMEM;
 		}
 		frag = &skb_shinfo(skb)->frags[skb_shinfo(skb)->nr_frags - 1];
-		cp = kmap_atomic(skb_frag_page(frag)) + frag->page_offset;
+		cp = kmap_atomic(skb_frag_page(frag)) + skb_frag_off(frag);
 	} else {
 		cp = skb_put(skb, tlen);
 	}
diff --git a/drivers/scsi/fcoe/fcoe.c b/drivers/scsi/fcoe/fcoe.c
index 00dd47bcbb1e..587d4bbb7d22 100644
--- a/drivers/scsi/fcoe/fcoe.c
+++ b/drivers/scsi/fcoe/fcoe.c
@@ -1522,8 +1522,7 @@ static int fcoe_xmit(struct fc_lport *lport, struct fc_frame *fp)
 			return -ENOMEM;
 		}
 		frag = &skb_shinfo(skb)->frags[skb_shinfo(skb)->nr_frags - 1];
-		cp = kmap_atomic(skb_frag_page(frag))
-			+ frag->page_offset;
+		cp = kmap_atomic(skb_frag_page(frag)) + skb_frag_off(frag);
 	} else {
 		cp = skb_put(skb, tlen);
 	}
diff --git a/drivers/scsi/fcoe/fcoe_transport.c b/drivers/scsi/fcoe/fcoe_transport.c
index d0550384cc38..a20ddc301c89 100644
--- a/drivers/scsi/fcoe/fcoe_transport.c
+++ b/drivers/scsi/fcoe/fcoe_transport.c
@@ -318,7 +318,7 @@ u32 fcoe_fc_crc(struct fc_frame *fp)
 
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		frag = &skb_shinfo(skb)->frags[i];
-		off = frag->page_offset;
+		off = skb_frag_off(frag);
 		len = skb_frag_size(frag);
 		while (len > 0) {
 			clen = min(len, PAGE_SIZE - (off & ~PAGE_MASK));
diff --git a/drivers/scsi/qedf/qedf_main.c b/drivers/scsi/qedf/qedf_main.c
index a42babde036d..42542720962f 100644
--- a/drivers/scsi/qedf/qedf_main.c
+++ b/drivers/scsi/qedf/qedf_main.c
@@ -1077,7 +1077,7 @@ static int qedf_xmit(struct fc_lport *lport, struct fc_frame *fp)
 			return -ENOMEM;
 		}
 		frag = &skb_shinfo(skb)->frags[skb_shinfo(skb)->nr_frags - 1];
-		cp = kmap_atomic(skb_frag_page(frag)) + frag->page_offset;
+		cp = kmap_atomic(skb_frag_page(frag)) + skb_frag_off(frag);
 	} else {
 		cp = skb_put(skb, tlen);
 	}
diff --git a/drivers/staging/unisys/visornic/visornic_main.c b/drivers/staging/unisys/visornic/visornic_main.c
index b889b04a6e25..6fa7726185de 100644
--- a/drivers/staging/unisys/visornic/visornic_main.c
+++ b/drivers/staging/unisys/visornic/visornic_main.c
@@ -284,7 +284,7 @@ static int visor_copy_fragsinfo_from_skb(struct sk_buff *skb,
 		for (frag = 0; frag < numfrags; frag++) {
 			count = add_physinfo_entries(page_to_pfn(
 				  skb_frag_page(&skb_shinfo(skb)->frags[frag])),
-				  skb_shinfo(skb)->frags[frag].page_offset,
+				  skb_frag_off(&skb_shinfo(skb)->frags[frag]),
 				  skb_frag_size(&skb_shinfo(skb)->frags[frag]),
 				  count, frags_max, frags);
 			/* add_physinfo_entries only returns
diff --git a/drivers/target/iscsi/cxgbit/cxgbit_target.c b/drivers/target/iscsi/cxgbit/cxgbit_target.c
index c25315431ad0..fcdc4211e3c2 100644
--- a/drivers/target/iscsi/cxgbit/cxgbit_target.c
+++ b/drivers/target/iscsi/cxgbit/cxgbit_target.c
@@ -900,7 +900,7 @@ cxgbit_handle_immediate_data(struct iscsi_cmd *cmd, struct iscsi_scsi_req *hdr,
 
 		sg_init_table(&ccmd->sg, 1);
 		sg_set_page(&ccmd->sg, skb_frag_page(dfrag),
-				skb_frag_size(dfrag), dfrag->page_offset);
+				skb_frag_size(dfrag), skb_frag_off(dfrag));
 		get_page(skb_frag_page(dfrag));
 
 		cmd->se_cmd.t_data_sg = &ccmd->sg;
@@ -1403,7 +1403,7 @@ static void cxgbit_lro_skb_dump(struct sk_buff *skb)
 			pdu_cb->ddigest, pdu_cb->frags);
 	for (i = 0; i < ssi->nr_frags; i++)
 		pr_info("skb 0x%p, frag %d, off %u, sz %u.\n",
-			skb, i, ssi->frags[i].page_offset,
+			skb, i, skb_frag_off(&ssi->frags[i]),
 			skb_frag_size(&ssi->frags[i]));
 }
 
diff --git a/net/appletalk/ddp.c b/net/appletalk/ddp.c
index a8cb6b2e20c1..4072e9d394d6 100644
--- a/net/appletalk/ddp.c
+++ b/net/appletalk/ddp.c
@@ -953,8 +953,8 @@ static unsigned long atalk_sum_skb(const struct sk_buff *skb, int offset,
 			if (copy > len)
 				copy = len;
 			vaddr = kmap_atomic(skb_frag_page(frag));
-			sum = atalk_sum_partial(vaddr + frag->page_offset +
-						  offset - start, copy, sum);
+			sum = atalk_sum_partial(vaddr + skb_frag_off(frag) +
+						offset - start, copy, sum);
 			kunmap_atomic(vaddr);
 
 			if (!(len -= copy))
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 45a162ef5e02..4cc8dc5db2b7 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -442,8 +442,8 @@ static int __skb_datagram_iter(const struct sk_buff *skb, int offset,
 
 			if (copy > len)
 				copy = len;
-			n = cb(vaddr + frag->page_offset +
-				offset - start, copy, data, to);
+			n = cb(vaddr + skb_frag_off(frag) + offset - start,
+			       copy, data, to);
 			kunmap(page);
 			offset += n;
 			if (n != copy)
@@ -573,7 +573,7 @@ int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
 			if (copy > len)
 				copy = len;
 			copied = copy_page_from_iter(skb_frag_page(frag),
-					  frag->page_offset + offset - start,
+					  skb_frag_off(frag) + offset - start,
 					  copy, from);
 			if (copied != copy)
 				goto fault;
diff --git a/net/core/dev.c b/net/core/dev.c
index fc676b2610e3..e2a11c62197b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5481,7 +5481,7 @@ static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
 	skb->data_len -= grow;
 	skb->tail += grow;
 
-	pinfo->frags[0].page_offset += grow;
+	skb_frag_off_add(&pinfo->frags[0], grow);
 	skb_frag_size_sub(&pinfo->frags[0], grow);
 
 	if (unlikely(!skb_frag_size(&pinfo->frags[0]))) {
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index bb9915291644..c5dbdc87342a 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -2652,7 +2652,7 @@ static void pktgen_finalize_skb(struct pktgen_dev *pkt_dev, struct sk_buff *skb,
 			}
 			get_page(pkt_dev->page);
 			skb_frag_set_page(skb, i, pkt_dev->page);
-			skb_shinfo(skb)->frags[i].page_offset = 0;
+			skb_frag_off_set(&skb_shinfo(skb)->frags[i], 0);
 			/*last fragment, fill rest of data*/
 			if (i == (frags - 1))
 				skb_frag_size_set(&skb_shinfo(skb)->frags[i],
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0b788df5a75b..2c8a9e10c9d9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -785,7 +785,7 @@ void skb_dump(const char *level, const struct sk_buff *skb, bool full_pkt)
 		struct page *p;
 		u8 *vaddr;
 
-		skb_frag_foreach_page(frag, frag->page_offset,
+		skb_frag_foreach_page(frag, skb_frag_off(frag),
 				      skb_frag_size(frag), p, p_off, p_len,
 				      copied) {
 			seg_len = min_t(int, p_len, len);
@@ -1375,7 +1375,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 		struct page *p;
 		u8 *vaddr;
 
-		skb_frag_foreach_page(f, f->page_offset, skb_frag_size(f),
+		skb_frag_foreach_page(f, skb_frag_off(f), skb_frag_size(f),
 				      p, p_off, p_len, copied) {
 			u32 copy, done = 0;
 			vaddr = kmap_atomic(p);
@@ -2144,10 +2144,12 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta)
 			skb_frag_unref(skb, i);
 			eat -= size;
 		} else {
-			skb_shinfo(skb)->frags[k] = skb_shinfo(skb)->frags[i];
+			skb_frag_t *frag = &skb_shinfo(skb)->frags[k];
+
+			*frag = skb_shinfo(skb)->frags[i];
 			if (eat) {
-				skb_shinfo(skb)->frags[k].page_offset += eat;
-				skb_frag_size_sub(&skb_shinfo(skb)->frags[k], eat);
+				skb_frag_off_add(frag, eat);
+				skb_frag_size_sub(frag, eat);
 				if (!i)
 					goto end;
 				eat = 0;
@@ -2219,7 +2221,7 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len)
 				copy = len;
 
 			skb_frag_foreach_page(f,
-					      f->page_offset + offset - start,
+					      skb_frag_off(f) + offset - start,
 					      copy, p, p_off, p_len, copied) {
 				vaddr = kmap_atomic(p);
 				memcpy(to + copied, vaddr + p_off, p_len);
@@ -2395,7 +2397,7 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
 		const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
 
 		if (__splice_segment(skb_frag_page(f),
-				     f->page_offset, skb_frag_size(f),
+				     skb_frag_off(f), skb_frag_size(f),
 				     offset, len, spd, false, sk, pipe))
 			return true;
 	}
@@ -2498,7 +2500,7 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
 
 		while (slen) {
 			ret = kernel_sendpage_locked(sk, skb_frag_page(frag),
-						     frag->page_offset + offset,
+						     skb_frag_off(frag) + offset,
 						     slen, MSG_DONTWAIT);
 			if (ret <= 0)
 				goto error;
@@ -2580,7 +2582,7 @@ int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len)
 				copy = len;
 
 			skb_frag_foreach_page(frag,
-					      frag->page_offset + offset - start,
+					      skb_frag_off(frag) + offset - start,
 					      copy, p, p_off, p_len, copied) {
 				vaddr = kmap_atomic(p);
 				memcpy(vaddr + p_off, from + copied, p_len);
@@ -2660,7 +2662,7 @@ __wsum __skb_checksum(const struct sk_buff *skb, int offset, int len,
 				copy = len;
 
 			skb_frag_foreach_page(frag,
-					      frag->page_offset + offset - start,
+					      skb_frag_off(frag) + offset - start,
 					      copy, p, p_off, p_len, copied) {
 				vaddr = kmap_atomic(p);
 				csum2 = INDIRECT_CALL_1(ops->update,
@@ -2759,7 +2761,7 @@ __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset,
 				copy = len;
 
 			skb_frag_foreach_page(frag,
-					      frag->page_offset + offset - start,
+					      skb_frag_off(frag) + offset - start,
 					      copy, p, p_off, p_len, copied) {
 				vaddr = kmap_atomic(p);
 				csum2 = csum_partial_copy_nocheck(vaddr + p_off,
@@ -3234,7 +3236,7 @@ static inline void skb_split_no_header(struct sk_buff *skb,
 				 * 2. Split is accurately. We make this.
 				 */
 				skb_frag_ref(skb, i);
-				skb_shinfo(skb1)->frags[0].page_offset += len - pos;
+				skb_frag_off_add(&skb_shinfo(skb1)->frags[0], len - pos);
 				skb_frag_size_sub(&skb_shinfo(skb1)->frags[0], len - pos);
 				skb_frag_size_set(&skb_shinfo(skb)->frags[i], len - pos);
 				skb_shinfo(skb)->nr_frags++;
@@ -3316,7 +3318,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 	 */
 	if (!to ||
 	    !skb_can_coalesce(tgt, to, skb_frag_page(fragfrom),
-			      fragfrom->page_offset)) {
+			      skb_frag_off(fragfrom))) {
 		merge = -1;
 	} else {
 		merge = to - 1;
@@ -3333,7 +3335,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 
 			skb_frag_size_add(fragto, shiftlen);
 			skb_frag_size_sub(fragfrom, shiftlen);
-			fragfrom->page_offset += shiftlen;
+			skb_frag_off_add(fragfrom, shiftlen);
 
 			goto onlymerged;
 		}
@@ -3364,11 +3366,11 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 
 		} else {
 			__skb_frag_ref(fragfrom);
-			fragto->bv_page = fragfrom->bv_page;
-			fragto->page_offset = fragfrom->page_offset;
+			skb_frag_page_set_from(fragto, fragfrom);
+			skb_frag_off_set_from(fragto, fragfrom);
 			skb_frag_size_set(fragto, todo);
 
-			fragfrom->page_offset += todo;
+			skb_frag_off_add(fragfrom, todo);
 			skb_frag_size_sub(fragfrom, todo);
 			todo = 0;
 
@@ -3493,7 +3495,7 @@ unsigned int skb_seq_read(unsigned int consumed, const u8 **data,
 			if (!st->frag_data)
 				st->frag_data = kmap_atomic(skb_frag_page(frag));
 
-			*data = (u8 *) st->frag_data + frag->page_offset +
+			*data = (u8 *) st->frag_data + skb_frag_off(frag) +
 				(abs_offset - st->stepped_offset);
 
 			return block_limit - abs_offset;
@@ -3630,8 +3632,8 @@ static inline skb_frag_t skb_head_frag_to_page_desc(struct sk_buff *frag_skb)
 
 	page = virt_to_head_page(frag_skb->head);
 	__skb_frag_set_page(&head_frag, page);
-	head_frag.page_offset = frag_skb->data -
-		(unsigned char *)page_address(page);
+	skb_frag_off_set(&head_frag, frag_skb->data -
+			 (unsigned char *)page_address(page));
 	skb_frag_size_set(&head_frag, skb_headlen(frag_skb));
 	return head_frag;
 }
@@ -3875,7 +3877,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 			size = skb_frag_size(nskb_frag);
 
 			if (pos < offset) {
-				nskb_frag->page_offset += offset - pos;
+				skb_frag_off_add(nskb_frag, offset - pos);
 				skb_frag_size_sub(nskb_frag, offset - pos);
 			}
 
@@ -3996,7 +3998,7 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
 			*--frag = *--frag2;
 		} while (--i);
 
-		frag->page_offset += offset;
+		skb_frag_off_add(frag, offset);
 		skb_frag_size_sub(frag, offset);
 
 		/* all fragments truesize : remove (head size + sk_buff) */
@@ -4026,7 +4028,7 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
 		pinfo->nr_frags = nr_frags + 1 + skbinfo->nr_frags;
 
 		__skb_frag_set_page(frag, page);
-		frag->page_offset = first_offset;
+		skb_frag_off_set(frag, first_offset);
 		skb_frag_size_set(frag, first_size);
 
 		memcpy(frag + 1, skbinfo->frags, sizeof(*frag) * skbinfo->nr_frags);
@@ -4042,7 +4044,7 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
 	if (offset > headlen) {
 		unsigned int eat = offset - headlen;
 
-		skbinfo->frags[0].page_offset += eat;
+		skb_frag_off_add(&skbinfo->frags[0], eat);
 		skb_frag_size_sub(&skbinfo->frags[0], eat);
 		skb->data_len -= eat;
 		skb->len -= eat;
@@ -4167,7 +4169,7 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len,
 			if (copy > len)
 				copy = len;
 			sg_set_page(&sg[elt], skb_frag_page(frag), copy,
-					frag->page_offset+offset-start);
+				    skb_frag_off(frag) + offset - start);
 			elt++;
 			if (!(len -= copy))
 				return elt;
@@ -5838,7 +5840,7 @@ static int pskb_carve_inside_nonlinear(struct sk_buff *skb, const u32 off,
 				 *    where splitting is expensive.
 				 * 2. Split is accurately. We make this.
 				 */
-				shinfo->frags[0].page_offset += off - pos;
+				skb_frag_off_add(&shinfo->frags[0], off - pos);
 				skb_frag_size_sub(&shinfo->frags[0], off - pos);
 			}
 			skb_frag_ref(skb, i);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f62f0e7e3cdd..a0a66321c0ee 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1782,12 +1782,12 @@ static int tcp_zerocopy_receive(struct sock *sk,
 				frags++;
 			}
 		}
-		if (skb_frag_size(frags) != PAGE_SIZE || frags->page_offset) {
+		if (skb_frag_size(frags) != PAGE_SIZE || skb_frag_off(frags)) {
 			int remaining = zc->recv_skip_hint;
 			int size = skb_frag_size(frags);
 
 			while (remaining && (size != PAGE_SIZE ||
-					     frags->page_offset)) {
+					     skb_frag_off(frags))) {
 				remaining -= size;
 				frags++;
 				size = skb_frag_size(frags);
@@ -3784,7 +3784,7 @@ int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *hp,
 
 	for (i = 0; i < shi->nr_frags; ++i) {
 		const skb_frag_t *f = &shi->frags[i];
-		unsigned int offset = f->page_offset;
+		unsigned int offset = skb_frag_off(f);
 		struct page *page = skb_frag_page(f) + (offset >> PAGE_SHIFT);
 
 		sg_set_page(&sg, page, skb_frag_size(f),
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6e4afc48d7bb..e6d02e05bb1c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1402,7 +1402,7 @@ static int __pskb_trim_head(struct sk_buff *skb, int len)
 		} else {
 			shinfo->frags[k] = shinfo->frags[i];
 			if (eat) {
-				shinfo->frags[k].page_offset += eat;
+				skb_frag_off_add(&shinfo->frags[k], eat);
 				skb_frag_size_sub(&shinfo->frags[k], eat);
 				eat = 0;
 			}
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index 05f63c4300e9..4ff75c3a8d6e 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -642,7 +642,7 @@ static int kcm_write_msgs(struct kcm_sock *kcm)
 
 			ret = kernel_sendpage(psock->sk->sk_socket,
 					      skb_frag_page(frag),
-					      frag->page_offset + frag_offset,
+					      skb_frag_off(frag) + frag_offset,
 					      skb_frag_size(frag) - frag_offset,
 					      MSG_DONTWAIT);
 			if (ret <= 0) {
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 4ec8a06fa5d1..d184230665eb 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -244,12 +244,12 @@ static void tls_append_frag(struct tls_record_info *record,
 
 	frag = &record->frags[record->num_frags - 1];
 	if (skb_frag_page(frag) == pfrag->page &&
-	    frag->page_offset + skb_frag_size(frag) == pfrag->offset) {
+	    skb_frag_off(frag) + skb_frag_size(frag) == pfrag->offset) {
 		skb_frag_size_add(frag, size);
 	} else {
 		++frag;
 		__skb_frag_set_page(frag, pfrag->page);
-		frag->page_offset = pfrag->offset;
+		skb_frag_off_set(frag, pfrag->offset);
 		skb_frag_size_set(frag, size);
 		++record->num_frags;
 		get_page(pfrag->page);
@@ -301,7 +301,7 @@ static int tls_push_record(struct sock *sk,
 		frag = &record->frags[i];
 		sg_unmark_end(&offload_ctx->sg_tx_data[i]);
 		sg_set_page(&offload_ctx->sg_tx_data[i], skb_frag_page(frag),
-			    skb_frag_size(frag), frag->page_offset);
+			    skb_frag_size(frag), skb_frag_off(frag));
 		sk_mem_charge(sk, skb_frag_size(frag));
 		get_page(skb_frag_page(frag));
 	}
@@ -324,7 +324,7 @@ static int tls_create_new_record(struct tls_offload_context_tx *offload_ctx,
 
 	frag = &record->frags[0];
 	__skb_frag_set_page(frag, pfrag->page);
-	frag->page_offset = pfrag->offset;
+	skb_frag_off_set(frag, pfrag->offset);
 	skb_frag_size_set(frag, prepend_size);
 
 	get_page(pfrag->page);
diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c
index 9070d68a92a4..28895333701e 100644
--- a/net/tls/tls_device_fallback.c
+++ b/net/tls/tls_device_fallback.c
@@ -273,7 +273,7 @@ static int fill_sg_in(struct scatterlist *sg_in,
 
 		__skb_frag_ref(frag);
 		sg_set_page(sg_in + i, skb_frag_page(frag),
-			    skb_frag_size(frag), frag->page_offset);
+			    skb_frag_size(frag), skb_frag_off(frag));
 
 		remaining -= skb_frag_size(frag);
 
diff --git a/net/xfrm/xfrm_ipcomp.c b/net/xfrm/xfrm_ipcomp.c
index 32c364d3bfb3..4d422447aadc 100644
--- a/net/xfrm/xfrm_ipcomp.c
+++ b/net/xfrm/xfrm_ipcomp.c
@@ -85,7 +85,7 @@ static int ipcomp_decompress(struct xfrm_state *x, struct sk_buff *skb)
 		if (dlen < len)
 			len = dlen;
 
-		frag->page_offset = 0;
+		skb_frag_off_set(frag, 0);
 		skb_frag_size_set(frag, len);
 		memcpy(skb_frag_address(frag), scratch, len);
 
-- 
2.17.1


^ permalink raw reply related

* [PATCH 1/3 net-next] linux: Add skb_frag_t page_offset accessors
From: Jonathan Lemon @ 2019-07-29 17:19 UTC (permalink / raw)
  To: willy, davem; +Cc: kernel-team, netdev
In-Reply-To: <20190729171941.250569-1-jonathan.lemon@gmail.com>

Add skb_frag_off(), skb_frag_off_add(), skb_frag_off_set(),
and skb_frag_off_set_from() accessors for page_offset.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/skbuff.h | 61 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 56 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 718742b1c505..7d94a78067ee 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -331,7 +331,7 @@ static inline void skb_frag_size_set(skb_frag_t *frag, unsigned int size)
 }
 
 /**
- * skb_frag_size_add - Incrementes the size of a skb fragment by %delta
+ * skb_frag_size_add - Increments the size of a skb fragment by %delta
  * @frag: skb fragment
  * @delta: value to add
  */
@@ -2857,6 +2857,46 @@ static inline void skb_propagate_pfmemalloc(struct page *page,
 		skb->pfmemalloc = true;
 }
 
+/**
+ * skb_frag_off - Returns the offset of a skb fragment
+ * @frag: the paged fragment
+ */
+static inline unsigned int skb_frag_off(const skb_frag_t *frag)
+{
+	return frag->page_offset;
+}
+
+/**
+ * skb_frag_off_add - Increments the offset of a skb fragment by %delta
+ * @frag: skb fragment
+ * @delta: value to add
+ */
+static inline void skb_frag_off_add(skb_frag_t *frag, int delta)
+{
+	frag->page_offset += delta;
+}
+
+/**
+ * skb_frag_off_set - Sets the offset of a skb fragment
+ * @frag: skb fragment
+ * @offset: offset of fragment
+ */
+static inline void skb_frag_off_set(skb_frag_t *frag, unsigned int offset)
+{
+	frag->page_offset = offset;
+}
+
+/**
+ * skb_frag_off_set_from - Sets the offset of a skb fragment from another fragment
+ * @fragto: skb fragment where offset is set
+ * @fragfrom: skb fragment offset is copied from
+ */
+static inline void skb_frag_off_set_from(skb_frag_t *fragto,
+					 const skb_frag_t *fragfrom)
+{
+	fragto->page_offset = fragfrom->page_offset;
+}
+
 /**
  * skb_frag_page - retrieve the page referred to by a paged fragment
  * @frag: the paged fragment
@@ -2923,7 +2963,7 @@ static inline void skb_frag_unref(struct sk_buff *skb, int f)
  */
 static inline void *skb_frag_address(const skb_frag_t *frag)
 {
-	return page_address(skb_frag_page(frag)) + frag->page_offset;
+	return page_address(skb_frag_page(frag)) + skb_frag_off(frag);
 }
 
 /**
@@ -2939,7 +2979,18 @@ static inline void *skb_frag_address_safe(const skb_frag_t *frag)
 	if (unlikely(!ptr))
 		return NULL;
 
-	return ptr + frag->page_offset;
+	return ptr + skb_frag_off(frag);
+}
+
+/**
+ * skb_frag_page_set_from - sets the page in a fragment from another fragment
+ * @fragto: skb fragment where page is set
+ * @fragfrom: skb fragment page is copied from
+ */
+static inline void skb_frag_page_set_from(skb_frag_t *fragto,
+					  const skb_frag_t *fragfrom)
+{
+	fragto->bv_page = fragfrom->bv_page;
 }
 
 /**
@@ -2987,7 +3038,7 @@ static inline dma_addr_t skb_frag_dma_map(struct device *dev,
 					  enum dma_data_direction dir)
 {
 	return dma_map_page(dev, skb_frag_page(frag),
-			    frag->page_offset + offset, size, dir);
+			    skb_frag_off(frag) + offset, size, dir);
 }
 
 static inline struct sk_buff *pskb_copy(struct sk_buff *skb,
@@ -3157,7 +3208,7 @@ static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
 		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];
 
 		return page == skb_frag_page(frag) &&
-		       off == frag->page_offset + skb_frag_size(frag);
+		       off == skb_frag_off(frag) + skb_frag_size(frag);
 	}
 	return false;
 }
-- 
2.17.1


^ permalink raw reply related

* [PATCH 0/3 net-next] Finish conversion of skb_frag_t to bio_vec
From: Jonathan Lemon @ 2019-07-29 17:19 UTC (permalink / raw)
  To: willy, davem; +Cc: kernel-team, netdev

The recent conversion of skb_frag_t to bio_vec did not include
skb_frag's page_offset.  Add accessor functions for this field,
utilize them, and remove the union, restoring the original structure.

Jonathan Lemon (3):
  linux: Add page_offset accessors
  net: Use skb_frag_off accessors
  linux: Remove bvec page_offset, use bv_offset

 drivers/atm/eni.c                             |  2 +-
 drivers/hsi/clients/ssi_protocol.c            |  2 +-
 drivers/infiniband/hw/hfi1/vnic_sdma.c        |  2 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c       |  3 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |  2 +-
 .../ethernet/cavium/thunder/nicvf_queues.c    |  2 +-
 drivers/net/ethernet/chelsio/cxgb3/sge.c      |  2 +-
 drivers/net/ethernet/emulex/benet/be_main.c   | 12 ++--
 .../ethernet/freescale/fs_enet/fs_enet-main.c |  2 +-
 drivers/net/ethernet/ibm/ibmvnic.c            |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  2 +-
 drivers/net/ethernet/intel/iavf/iavf_txrx.c   |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  4 +-
 drivers/net/ethernet/jme.c                    |  4 +-
 drivers/net/ethernet/marvell/mv643xx_eth.c    |  2 +-
 .../net/ethernet/myricom/myri10ge/myri10ge.c  |  6 +-
 drivers/net/ethernet/sfc/tx.c                 |  2 +-
 drivers/net/ethernet/sun/cassini.c            |  8 +--
 drivers/net/ethernet/sun/niu.c                |  2 +-
 drivers/net/ethernet/sun/sunvnet_common.c     |  4 +-
 drivers/net/ethernet/ti/netcp_core.c          |  2 +-
 drivers/net/hyperv/netvsc_drv.c               |  4 +-
 drivers/net/thunderbolt.c                     |  2 +-
 drivers/net/usb/usbnet.c                      |  2 +-
 drivers/net/vmxnet3/vmxnet3_drv.c             |  2 +-
 drivers/net/xen-netback/netback.c             |  6 +-
 drivers/net/xen-netfront.c                    |  8 +--
 drivers/scsi/bnx2fc/bnx2fc_fcoe.c             |  2 +-
 drivers/scsi/fcoe/fcoe.c                      |  3 +-
 drivers/scsi/fcoe/fcoe_transport.c            |  2 +-
 drivers/scsi/qedf/qedf_main.c                 |  2 +-
 .../staging/unisys/visornic/visornic_main.c   |  2 +-
 drivers/target/iscsi/cxgbit/cxgbit_target.c   |  4 +-
 include/linux/bvec.h                          |  5 +-
 include/linux/skbuff.h                        | 63 +++++++++++++++++--
 net/appletalk/ddp.c                           |  4 +-
 net/core/datagram.c                           |  6 +-
 net/core/dev.c                                |  2 +-
 net/core/pktgen.c                             |  2 +-
 net/core/skbuff.c                             | 54 ++++++++--------
 net/ipv4/tcp.c                                |  6 +-
 net/ipv4/tcp_output.c                         |  2 +-
 net/kcm/kcmsock.c                             |  2 +-
 net/tls/tls_device.c                          |  8 +--
 net/tls/tls_device_fallback.c                 |  2 +-
 net/xfrm/xfrm_ipcomp.c                        |  2 +-
 46 files changed, 158 insertions(+), 108 deletions(-)

-- 
2.17.1


^ permalink raw reply

* [PATCH 3/3 net-next] linux: Remove bvec page_offset, use bv_offset
From: Jonathan Lemon @ 2019-07-29 17:19 UTC (permalink / raw)
  To: willy, davem; +Cc: kernel-team, netdev
In-Reply-To: <20190729171941.250569-1-jonathan.lemon@gmail.com>

Now that page_offset is referenced through accessors, remove
the union, and use bv_offset.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/bvec.h   |  5 +----
 include/linux/skbuff.h | 10 +++++-----
 2 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 7f2b2ea9399c..a032f01e928c 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -18,10 +18,7 @@
 struct bio_vec {
 	struct page	*bv_page;
 	unsigned int	bv_len;
-	union {
-		__u32		page_offset;
-		unsigned int	bv_offset;
-	};
+	unsigned int	bv_offset;
 };
 
 struct bvec_iter {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7d94a78067ee..68334e56a697 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2078,7 +2078,7 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 	 * on page_is_pfmemalloc doing the right thing(tm).
 	 */
 	frag->bv_page		  = page;
-	frag->page_offset	  = off;
+	frag->bv_offset		  = off;
 	skb_frag_size_set(frag, size);
 
 	page = compound_head(page);
@@ -2863,7 +2863,7 @@ static inline void skb_propagate_pfmemalloc(struct page *page,
  */
 static inline unsigned int skb_frag_off(const skb_frag_t *frag)
 {
-	return frag->page_offset;
+	return frag->bv_offset;
 }
 
 /**
@@ -2873,7 +2873,7 @@ static inline unsigned int skb_frag_off(const skb_frag_t *frag)
  */
 static inline void skb_frag_off_add(skb_frag_t *frag, int delta)
 {
-	frag->page_offset += delta;
+	frag->bv_offset += delta;
 }
 
 /**
@@ -2883,7 +2883,7 @@ static inline void skb_frag_off_add(skb_frag_t *frag, int delta)
  */
 static inline void skb_frag_off_set(skb_frag_t *frag, unsigned int offset)
 {
-	frag->page_offset = offset;
+	frag->bv_offset = offset;
 }
 
 /**
@@ -2894,7 +2894,7 @@ static inline void skb_frag_off_set(skb_frag_t *frag, unsigned int offset)
 static inline void skb_frag_off_set_from(skb_frag_t *fragto,
 					 const skb_frag_t *fragfrom)
 {
-	fragto->page_offset = fragfrom->page_offset;
+	fragto->bv_offset = fragfrom->bv_offset;
 }
 
 /**
-- 
2.17.1


^ permalink raw reply related

* [bpf-next,v2 3/6] bpf: add bpf_tcp_gen_syncookie helper
From: Petar Penkov @ 2019-07-29 16:59 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, edumazet, lmb, sdf, toke, Petar Penkov
In-Reply-To: <20190729165918.92933-1-ppenkov.kernel@gmail.com>

From: Petar Penkov <ppenkov@google.com>

This helper function allows BPF programs to try to generate SYN
cookies, given a reference to a listener socket. The function works
from XDP and with an skb context since bpf_skc_lookup_tcp can lookup a
socket in both cases.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
---
 include/uapi/linux/bpf.h | 30 ++++++++++++++++-
 net/core/filter.c        | 73 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 102 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e985f07a98ed..5a54f1011db8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2713,6 +2713,33 @@ union bpf_attr {
  *		**-EPERM** if no permission to send the *sig*.
  *
  *		**-EAGAIN** if bpf program can try again.
+ *
+ * s64 bpf_tcp_gen_syncookie(struct bpf_sock *sk, void *iph, u32 iph_len, struct tcphdr *th, u32 th_len)
+ *	Description
+ *		Try to issue a SYN cookie for the packet with corresponding
+ *		IP/TCP headers, *iph* and *th*, on the listening socket in *sk*.
+ *
+ *		*iph* points to the start of the IPv4 or IPv6 header, while
+ *		*iph_len* contains **sizeof**\ (**struct iphdr**) or
+ *		**sizeof**\ (**struct ip6hdr**).
+ *
+ *		*th* points to the start of the TCP header, while *th_len*
+ *		contains the length of the TCP header.
+ *
+ *	Return
+ *		On success, lower 32 bits hold the generated SYN cookie in
+ *		followed by 16 bits which hold the MSS value for that cookie,
+ *		and the top 16 bits are unused.
+ *
+ *		On failure, the returned value is one of the following:
+ *
+ *		**-EINVAL** SYN cookie cannot be issued due to error
+ *
+ *		**-ENOENT** SYN cookie should not be issued (no SYN flood)
+ *
+ *		**-EOPNOTSUPP** kernel configuration does not enable SYN cookies
+ *
+ *		**-EPROTONOSUPPORT** IP packet version is not 4 or 6
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2824,7 +2851,8 @@ union bpf_attr {
 	FN(strtoul),			\
 	FN(sk_storage_get),		\
 	FN(sk_storage_delete),		\
-	FN(send_signal),
+	FN(send_signal),		\
+	FN(tcp_gen_syncookie),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 3961437ccc50..d7848d6944ff 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5850,6 +5850,75 @@ static const struct bpf_func_proto bpf_tcp_check_syncookie_proto = {
 	.arg5_type	= ARG_CONST_SIZE,
 };
 
+BPF_CALL_5(bpf_tcp_gen_syncookie, struct sock *, sk, void *, iph, u32, iph_len,
+	   struct tcphdr *, th, u32, th_len)
+{
+#ifdef CONFIG_SYN_COOKIES
+	u32 cookie;
+	u16 mss;
+
+	if (unlikely(th_len < sizeof(*th) || th_len != th->doff * 4))
+		return -EINVAL;
+
+	if (sk->sk_protocol != IPPROTO_TCP || sk->sk_state != TCP_LISTEN)
+		return -EINVAL;
+
+	if (!sock_net(sk)->ipv4.sysctl_tcp_syncookies)
+		return -ENOENT;
+
+	if (!th->syn || th->ack || th->fin || th->rst)
+		return -EINVAL;
+
+	if (unlikely(iph_len < sizeof(struct iphdr)))
+		return -EINVAL;
+
+	/* Both struct iphdr and struct ipv6hdr have the version field at the
+	 * same offset so we can cast to the shorter header (struct iphdr).
+	 */
+	switch (((struct iphdr *)iph)->version) {
+	case 4:
+		if (sk->sk_family == AF_INET6 && sk->sk_ipv6only)
+			return -EINVAL;
+
+		mss = tcp_v4_get_syncookie(sk, iph, th, &cookie);
+		break;
+
+#if IS_BUILTIN(CONFIG_IPV6)
+	case 6:
+		if (unlikely(iph_len < sizeof(struct ipv6hdr)))
+			return -EINVAL;
+
+		if (sk->sk_family != AF_INET6)
+			return -EINVAL;
+
+		mss = tcp_v6_get_syncookie(sk, iph, th, &cookie);
+		break;
+#endif /* CONFIG_IPV6 */
+
+	default:
+		return -EPROTONOSUPPORT;
+	}
+	if (mss <= 0)
+		return -ENOENT;
+
+	return cookie | ((u64)mss << 32);
+#else
+	return -EOPNOTSUPP;
+#endif /* CONFIG_SYN_COOKIES */
+}
+
+static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
+	.func		= bpf_tcp_gen_syncookie,
+	.gpl_only	= true, /* __cookie_v*_init_sequence() is GPL */
+	.pkt_access	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_SOCK_COMMON,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type	= ARG_PTR_TO_MEM,
+	.arg5_type	= ARG_CONST_SIZE,
+};
+
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -6139,6 +6208,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_tcp_check_syncookie_proto;
 	case BPF_FUNC_skb_ecn_set_ce:
 		return &bpf_skb_ecn_set_ce_proto;
+	case BPF_FUNC_tcp_gen_syncookie:
+		return &bpf_tcp_gen_syncookie_proto;
 #endif
 	default:
 		return bpf_base_func_proto(func_id);
@@ -6178,6 +6249,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_skc_lookup_tcp_proto;
 	case BPF_FUNC_tcp_check_syncookie:
 		return &bpf_tcp_check_syncookie_proto;
+	case BPF_FUNC_tcp_gen_syncookie:
+		return &bpf_tcp_gen_syncookie_proto;
 #endif
 	default:
 		return bpf_base_func_proto(func_id);
-- 
2.22.0.709.g102302147b-goog


^ permalink raw reply related

* [bpf-next,v2 6/6] selftests/bpf: add test for bpf_tcp_gen_syncookie
From: Petar Penkov @ 2019-07-29 16:59 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, edumazet, lmb, sdf, toke, Petar Penkov
In-Reply-To: <20190729165918.92933-1-ppenkov.kernel@gmail.com>

From: Petar Penkov <ppenkov@google.com>

Modify the existing bpf_tcp_check_syncookie test to also generate a
SYN cookie, pass the packet to the kernel, and verify that the two
cookies are the same (and both valid). Since cloned SKBs are skipped
during generic XDP, this test does not issue a SYN cookie when run in
XDP mode. We therefore only check that a valid SYN cookie was issued at
the TC hook.

Additionally, verify that the MSS for that SYN cookie is within
expected range.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
---
 .../bpf/progs/test_tcp_check_syncookie_kern.c | 48 +++++++++++++--
 .../selftests/bpf/test_tcp_check_syncookie.sh |  3 +
 .../bpf/test_tcp_check_syncookie_user.c       | 61 ++++++++++++++++---
 3 files changed, 99 insertions(+), 13 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/test_tcp_check_syncookie_kern.c b/tools/testing/selftests/bpf/progs/test_tcp_check_syncookie_kern.c
index 1ab095bcacd8..d8803dfa8d32 100644
--- a/tools/testing/selftests/bpf/progs/test_tcp_check_syncookie_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tcp_check_syncookie_kern.c
@@ -19,10 +19,29 @@
 struct bpf_map_def SEC("maps") results = {
 	.type = BPF_MAP_TYPE_ARRAY,
 	.key_size = sizeof(__u32),
-	.value_size = sizeof(__u64),
-	.max_entries = 1,
+	.value_size = sizeof(__u32),
+	.max_entries = 3,
 };
 
+static __always_inline __s64 gen_syncookie(void *data_end, struct bpf_sock *sk,
+					   void *iph, __u32 ip_size,
+					   struct tcphdr *tcph)
+{
+	__u32 thlen = tcph->doff * 4;
+
+	if (tcph->syn && !tcph->ack) {
+		// packet should only have an MSS option
+		if (thlen != 24)
+			return 0;
+
+		if ((void *)tcph + thlen > data_end)
+			return 0;
+
+		return bpf_tcp_gen_syncookie(sk, iph, ip_size, tcph, thlen);
+	}
+	return 0;
+}
+
 static __always_inline void check_syncookie(void *ctx, void *data,
 					    void *data_end)
 {
@@ -33,8 +52,10 @@ static __always_inline void check_syncookie(void *ctx, void *data,
 	struct ipv6hdr *ipv6h;
 	struct tcphdr *tcph;
 	int ret;
+	__u32 key_mss = 2;
+	__u32 key_gen = 1;
 	__u32 key = 0;
-	__u64 value = 1;
+	__s64 seq_mss;
 
 	ethh = data;
 	if (ethh + 1 > data_end)
@@ -66,6 +87,9 @@ static __always_inline void check_syncookie(void *ctx, void *data,
 		if (sk->state != BPF_TCP_LISTEN)
 			goto release;
 
+		seq_mss = gen_syncookie(data_end, sk, ipv4h, sizeof(*ipv4h),
+					tcph);
+
 		ret = bpf_tcp_check_syncookie(sk, ipv4h, sizeof(*ipv4h),
 					      tcph, sizeof(*tcph));
 		break;
@@ -95,6 +119,9 @@ static __always_inline void check_syncookie(void *ctx, void *data,
 		if (sk->state != BPF_TCP_LISTEN)
 			goto release;
 
+		seq_mss = gen_syncookie(data_end, sk, ipv6h, sizeof(*ipv6h),
+					tcph);
+
 		ret = bpf_tcp_check_syncookie(sk, ipv6h, sizeof(*ipv6h),
 					      tcph, sizeof(*tcph));
 		break;
@@ -103,8 +130,19 @@ static __always_inline void check_syncookie(void *ctx, void *data,
 		return;
 	}
 
-	if (ret == 0)
-		bpf_map_update_elem(&results, &key, &value, 0);
+	if (seq_mss > 0) {
+		__u32 cookie = (__u32)seq_mss;
+		__u32 mss = seq_mss >> 32;
+
+		bpf_map_update_elem(&results, &key_gen, &cookie, 0);
+		bpf_map_update_elem(&results, &key_mss, &mss, 0);
+	}
+
+	if (ret == 0) {
+		__u32 cookie = bpf_ntohl(tcph->ack_seq) - 1;
+
+		bpf_map_update_elem(&results, &key, &cookie, 0);
+	}
 
 release:
 	bpf_sk_release(sk);
diff --git a/tools/testing/selftests/bpf/test_tcp_check_syncookie.sh b/tools/testing/selftests/bpf/test_tcp_check_syncookie.sh
index d48e51716d19..9b3617d770a5 100755
--- a/tools/testing/selftests/bpf/test_tcp_check_syncookie.sh
+++ b/tools/testing/selftests/bpf/test_tcp_check_syncookie.sh
@@ -37,6 +37,9 @@ setup()
 	ns1_exec ip link set lo up
 
 	ns1_exec sysctl -w net.ipv4.tcp_syncookies=2
+	ns1_exec sysctl -w net.ipv4.tcp_window_scaling=0
+	ns1_exec sysctl -w net.ipv4.tcp_timestamps=0
+	ns1_exec sysctl -w net.ipv4.tcp_sack=0
 
 	wait_for_ip 127.0.0.1
 	wait_for_ip ::1
diff --git a/tools/testing/selftests/bpf/test_tcp_check_syncookie_user.c b/tools/testing/selftests/bpf/test_tcp_check_syncookie_user.c
index 87829c86c746..b9e991d43155 100644
--- a/tools/testing/selftests/bpf/test_tcp_check_syncookie_user.c
+++ b/tools/testing/selftests/bpf/test_tcp_check_syncookie_user.c
@@ -2,6 +2,7 @@
 // Copyright (c) 2018 Facebook
 // Copyright (c) 2019 Cloudflare
 
+#include <limits.h>
 #include <string.h>
 #include <stdlib.h>
 #include <unistd.h>
@@ -77,7 +78,7 @@ static int connect_to_server(int server_fd)
 	return fd;
 }
 
-static int get_map_fd_by_prog_id(int prog_id)
+static int get_map_fd_by_prog_id(int prog_id, bool *xdp)
 {
 	struct bpf_prog_info info = {};
 	__u32 info_len = sizeof(info);
@@ -104,6 +105,8 @@ static int get_map_fd_by_prog_id(int prog_id)
 		goto err;
 	}
 
+	*xdp = info.type == BPF_PROG_TYPE_XDP;
+
 	map_fd = bpf_map_get_fd_by_id(map_ids[0]);
 	if (map_fd < 0)
 		log_err("Failed to get fd by map id %d", map_ids[0]);
@@ -113,18 +116,32 @@ static int get_map_fd_by_prog_id(int prog_id)
 	return map_fd;
 }
 
-static int run_test(int server_fd, int results_fd)
+static int run_test(int server_fd, int results_fd, bool xdp)
 {
 	int client = -1, srv_client = -1;
 	int ret = 0;
 	__u32 key = 0;
-	__u64 value = 0;
+	__u32 key_gen = 1;
+	__u32 key_mss = 2;
+	__u32 value = 0;
+	__u32 value_gen = 0;
+	__u32 value_mss = 0;
 
 	if (bpf_map_update_elem(results_fd, &key, &value, 0) < 0) {
 		log_err("Can't clear results");
 		goto err;
 	}
 
+	if (bpf_map_update_elem(results_fd, &key_gen, &value_gen, 0) < 0) {
+		log_err("Can't clear results");
+		goto err;
+	}
+
+	if (bpf_map_update_elem(results_fd, &key_mss, &value_mss, 0) < 0) {
+		log_err("Can't clear results");
+		goto err;
+	}
+
 	client = connect_to_server(server_fd);
 	if (client == -1)
 		goto err;
@@ -140,8 +157,35 @@ static int run_test(int server_fd, int results_fd)
 		goto err;
 	}
 
-	if (value != 1) {
-		log_err("Didn't match syncookie: %llu", value);
+	if (value == 0) {
+		log_err("Didn't match syncookie: %u", value);
+		goto err;
+	}
+
+	if (bpf_map_lookup_elem(results_fd, &key_gen, &value_gen) < 0) {
+		log_err("Can't lookup result");
+		goto err;
+	}
+
+	if (xdp && value_gen == 0) {
+		// SYN packets do not get passed through generic XDP, skip the
+		// rest of the test.
+		printf("Skipping XDP cookie check\n");
+		goto out;
+	}
+
+	if (bpf_map_lookup_elem(results_fd, &key_mss, &value_mss) < 0) {
+		log_err("Can't lookup result");
+		goto err;
+	}
+
+	if (value != value_gen) {
+		log_err("BPF generated cookie does not match kernel one");
+		goto err;
+	}
+
+	if (value_mss < 536 || value_mss > USHRT_MAX) {
+		log_err("Unexpected MSS retrieved");
 		goto err;
 	}
 
@@ -163,13 +207,14 @@ int main(int argc, char **argv)
 	int server_v6 = -1;
 	int results = -1;
 	int err = 0;
+	bool xdp;
 
 	if (argc < 2) {
 		fprintf(stderr, "Usage: %s prog_id\n", argv[0]);
 		exit(1);
 	}
 
-	results = get_map_fd_by_prog_id(atoi(argv[1]));
+	results = get_map_fd_by_prog_id(atoi(argv[1]), &xdp);
 	if (results < 0) {
 		log_err("Can't get map");
 		goto err;
@@ -194,10 +239,10 @@ int main(int argc, char **argv)
 	if (server_v6 == -1)
 		goto err;
 
-	if (run_test(server, results))
+	if (run_test(server, results, xdp))
 		goto err;
 
-	if (run_test(server_v6, results))
+	if (run_test(server_v6, results, xdp))
 		goto err;
 
 	printf("ok\n");
-- 
2.22.0.709.g102302147b-goog


^ permalink raw reply related

* [bpf-next,v2 4/6] bpf: sync bpf.h to tools/
From: Petar Penkov @ 2019-07-29 16:59 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, edumazet, lmb, sdf, toke, Petar Penkov
In-Reply-To: <20190729165918.92933-1-ppenkov.kernel@gmail.com>

From: Petar Penkov <ppenkov@google.com>

Sync updated documentation for bpf_redirect_map.

Sync the bpf_tcp_gen_syncookie helper function definition with the one
in tools/uapi.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
---
 tools/include/uapi/linux/bpf.h | 37 +++++++++++++++++++++++++++++++---
 1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 3d7fc67ec1b8..5a54f1011db8 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1571,8 +1571,11 @@ union bpf_attr {
  * 		but this is only implemented for native XDP (with driver
  * 		support) as of this writing).
  *
- * 		All values for *flags* are reserved for future usage, and must
- * 		be left at zero.
+ * 		The lower two bits of *flags* are used as the return code if
+ * 		the map lookup fails. This is so that the return value can be
+ * 		one of the XDP program return codes up to XDP_TX, as chosen by
+ * 		the caller. Any higher bits in the *flags* argument must be
+ * 		unset.
  *
  * 		When used to redirect packets to net devices, this helper
  * 		provides a high performance increase over **bpf_redirect**\ ().
@@ -2710,6 +2713,33 @@ union bpf_attr {
  *		**-EPERM** if no permission to send the *sig*.
  *
  *		**-EAGAIN** if bpf program can try again.
+ *
+ * s64 bpf_tcp_gen_syncookie(struct bpf_sock *sk, void *iph, u32 iph_len, struct tcphdr *th, u32 th_len)
+ *	Description
+ *		Try to issue a SYN cookie for the packet with corresponding
+ *		IP/TCP headers, *iph* and *th*, on the listening socket in *sk*.
+ *
+ *		*iph* points to the start of the IPv4 or IPv6 header, while
+ *		*iph_len* contains **sizeof**\ (**struct iphdr**) or
+ *		**sizeof**\ (**struct ip6hdr**).
+ *
+ *		*th* points to the start of the TCP header, while *th_len*
+ *		contains the length of the TCP header.
+ *
+ *	Return
+ *		On success, lower 32 bits hold the generated SYN cookie in
+ *		followed by 16 bits which hold the MSS value for that cookie,
+ *		and the top 16 bits are unused.
+ *
+ *		On failure, the returned value is one of the following:
+ *
+ *		**-EINVAL** SYN cookie cannot be issued due to error
+ *
+ *		**-ENOENT** SYN cookie should not be issued (no SYN flood)
+ *
+ *		**-EOPNOTSUPP** kernel configuration does not enable SYN cookies
+ *
+ *		**-EPROTONOSUPPORT** IP packet version is not 4 or 6
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2821,7 +2851,8 @@ union bpf_attr {
 	FN(strtoul),			\
 	FN(sk_storage_get),		\
 	FN(sk_storage_delete),		\
-	FN(send_signal),
+	FN(send_signal),		\
+	FN(tcp_gen_syncookie),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
-- 
2.22.0.709.g102302147b-goog


^ permalink raw reply related

* [bpf-next,v2 5/6] selftests/bpf: bpf_tcp_gen_syncookie->bpf_helpers
From: Petar Penkov @ 2019-07-29 16:59 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, edumazet, lmb, sdf, toke, Petar Penkov
In-Reply-To: <20190729165918.92933-1-ppenkov.kernel@gmail.com>

From: Petar Penkov <ppenkov@google.com>

Expose bpf_tcp_gen_syncookie to selftests.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
---
 tools/testing/selftests/bpf/bpf_helpers.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index f804f210244e..120aa86c58d3 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -228,6 +228,9 @@ static void *(*bpf_sk_storage_get)(void *map, struct bpf_sock *sk,
 static int (*bpf_sk_storage_delete)(void *map, struct bpf_sock *sk) =
 	(void *)BPF_FUNC_sk_storage_delete;
 static int (*bpf_send_signal)(unsigned sig) = (void *)BPF_FUNC_send_signal;
+static long long (*bpf_tcp_gen_syncookie)(struct bpf_sock *sk, void *ip,
+					  int ip_len, void *tcp, int tcp_len) =
+	(void *) BPF_FUNC_tcp_gen_syncookie;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.22.0.709.g102302147b-goog


^ permalink raw reply related

* [bpf-next,v2 2/6] tcp: add skb-less helpers to retrieve SYN cookie
From: Petar Penkov @ 2019-07-29 16:59 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, edumazet, lmb, sdf, toke, Petar Penkov
In-Reply-To: <20190729165918.92933-1-ppenkov.kernel@gmail.com>

From: Petar Penkov <ppenkov@google.com>

This patch allows generation of a SYN cookie before an SKB has been
allocated, as is the case at XDP.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
---
 include/net/tcp.h    | 10 ++++++
 net/ipv4/tcp_input.c | 73 ++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp_ipv4.c  | 15 +++++++++
 net/ipv6/tcp_ipv6.c  | 15 +++++++++
 4 files changed, 113 insertions(+)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e5cf514ba118..fb7e153aecc5 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -414,6 +414,16 @@ void tcp_parse_options(const struct net *net, const struct sk_buff *skb,
 		       int estab, struct tcp_fastopen_cookie *foc);
 const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
 
+/*
+ *	BPF SKB-less helpers
+ */
+u16 tcp_v4_get_syncookie(struct sock *sk, struct iphdr *iph,
+			 struct tcphdr *th, u32 *cookie);
+u16 tcp_v6_get_syncookie(struct sock *sk, struct ipv6hdr *iph,
+			 struct tcphdr *th, u32 *cookie);
+u16 tcp_get_syncookie_mss(struct request_sock_ops *rsk_ops,
+			  const struct tcp_request_sock_ops *af_ops,
+			  struct sock *sk, struct tcphdr *th);
 /*
  *	TCP v4 functions exported for the inet6 API
  */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 8892df6de1d4..706cbb3b2986 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3782,6 +3782,49 @@ static void smc_parse_options(const struct tcphdr *th,
 #endif
 }
 
+/* Try to parse the MSS option from the TCP header. Return 0 on failure, clamped
+ * value on success.
+ */
+static u16 tcp_parse_mss_option(const struct tcphdr *th, u16 user_mss)
+{
+	const unsigned char *ptr = (const unsigned char *)(th + 1);
+	int length = (th->doff * 4) - sizeof(struct tcphdr);
+	u16 mss = 0;
+
+	while (length > 0) {
+		int opcode = *ptr++;
+		int opsize;
+
+		switch (opcode) {
+		case TCPOPT_EOL:
+			return mss;
+		case TCPOPT_NOP:	/* Ref: RFC 793 section 3.1 */
+			length--;
+			continue;
+		default:
+			if (length < 2)
+				return mss;
+			opsize = *ptr++;
+			if (opsize < 2) /* "silly options" */
+				return mss;
+			if (opsize > length)
+				return mss;	/* fail on partial options */
+			if (opcode == TCPOPT_MSS && opsize == TCPOLEN_MSS) {
+				u16 in_mss = get_unaligned_be16(ptr);
+
+				if (in_mss) {
+					if (user_mss && user_mss < in_mss)
+						in_mss = user_mss;
+					mss = in_mss;
+				}
+			}
+			ptr += opsize - 2;
+			length -= opsize;
+		}
+	}
+	return mss;
+}
+
 /* Look for tcp options. Normally only called on SYN and SYNACK packets.
  * But, this can also be called on packets in the established flow when
  * the fast version below fails.
@@ -6464,6 +6507,36 @@ static void tcp_reqsk_record_syn(const struct sock *sk,
 	}
 }
 
+/* If a SYN cookie is required and supported, returns a clamped MSS value to be
+ * used for SYN cookie generation.
+ */
+u16 tcp_get_syncookie_mss(struct request_sock_ops *rsk_ops,
+			  const struct tcp_request_sock_ops *af_ops,
+			  struct sock *sk, struct tcphdr *th)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	u16 mss;
+
+	if (sock_net(sk)->ipv4.sysctl_tcp_syncookies != 2 &&
+	    !inet_csk_reqsk_queue_is_full(sk))
+		return 0;
+
+	if (!tcp_syn_flood_action(sk, rsk_ops->slab_name))
+		return 0;
+
+	if (sk_acceptq_is_full(sk)) {
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
+		return 0;
+	}
+
+	mss = tcp_parse_mss_option(th, tp->rx_opt.user_mss);
+	if (!mss)
+		mss = af_ops->mss_clamp;
+
+	return mss;
+}
+EXPORT_SYMBOL_GPL(tcp_get_syncookie_mss);
+
 int tcp_conn_request(struct request_sock_ops *rsk_ops,
 		     const struct tcp_request_sock_ops *af_ops,
 		     struct sock *sk, struct sk_buff *skb)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d57641cb3477..10217393cda6 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1515,6 +1515,21 @@ static struct sock *tcp_v4_cookie_check(struct sock *sk, struct sk_buff *skb)
 	return sk;
 }
 
+u16 tcp_v4_get_syncookie(struct sock *sk, struct iphdr *iph,
+			 struct tcphdr *th, u32 *cookie)
+{
+	u16 mss = 0;
+#ifdef CONFIG_SYN_COOKIES
+	mss = tcp_get_syncookie_mss(&tcp_request_sock_ops,
+				    &tcp_request_sock_ipv4_ops, sk, th);
+	if (mss) {
+		*cookie = __cookie_v4_init_sequence(iph, th, &mss);
+		tcp_synq_overflow(sk);
+	}
+#endif
+	return mss;
+}
+
 /* The socket must have it's spinlock held when we get
  * here, unless it is a TCP_LISTEN socket.
  *
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 5da069e91cac..87f44d3250ee 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1063,6 +1063,21 @@ static struct sock *tcp_v6_cookie_check(struct sock *sk, struct sk_buff *skb)
 	return sk;
 }
 
+u16 tcp_v6_get_syncookie(struct sock *sk, struct ipv6hdr *iph,
+			 struct tcphdr *th, u32 *cookie)
+{
+	u16 mss = 0;
+#ifdef CONFIG_SYN_COOKIES
+	mss = tcp_get_syncookie_mss(&tcp6_request_sock_ops,
+				    &tcp_request_sock_ipv6_ops, sk, th);
+	if (mss) {
+		*cookie = __cookie_v6_init_sequence(iph, th, &mss);
+		tcp_synq_overflow(sk);
+	}
+#endif
+	return mss;
+}
+
 static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 {
 	if (skb->protocol == htons(ETH_P_IP))
-- 
2.22.0.709.g102302147b-goog


^ permalink raw reply related

* [bpf-next,v2 1/6] tcp: tcp_syn_flood_action read port from socket
From: Petar Penkov @ 2019-07-29 16:59 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, edumazet, lmb, sdf, toke, Petar Penkov
In-Reply-To: <20190729165918.92933-1-ppenkov.kernel@gmail.com>

From: Petar Penkov <ppenkov@google.com>

This allows us to call this function before an SKB has been
allocated.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
---
 net/ipv4/tcp_input.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c21e8a22fb3b..8892df6de1d4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6422,9 +6422,7 @@ EXPORT_SYMBOL(inet_reqsk_alloc);
 /*
  * Return true if a syncookie should be sent
  */
-static bool tcp_syn_flood_action(const struct sock *sk,
-				 const struct sk_buff *skb,
-				 const char *proto)
+static bool tcp_syn_flood_action(const struct sock *sk, const char *proto)
 {
 	struct request_sock_queue *queue = &inet_csk(sk)->icsk_accept_queue;
 	const char *msg = "Dropping request";
@@ -6444,7 +6442,7 @@ static bool tcp_syn_flood_action(const struct sock *sk,
 	    net->ipv4.sysctl_tcp_syncookies != 2 &&
 	    xchg(&queue->synflood_warned, 1) == 0)
 		net_info_ratelimited("%s: Possible SYN flooding on port %d. %s.  Check SNMP counters.\n",
-				     proto, ntohs(tcp_hdr(skb)->dest), msg);
+				     proto, sk->sk_num, msg);
 
 	return want_cookie;
 }
@@ -6487,7 +6485,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	 */
 	if ((net->ipv4.sysctl_tcp_syncookies == 2 ||
 	     inet_csk_reqsk_queue_is_full(sk)) && !isn) {
-		want_cookie = tcp_syn_flood_action(sk, skb, rsk_ops->slab_name);
+		want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);
 		if (!want_cookie)
 			goto drop;
 	}
-- 
2.22.0.709.g102302147b-goog


^ permalink raw reply related

* [bpf-next,v2 0/6] Introduce a BPF helper to generate SYN cookies
From: Petar Penkov @ 2019-07-29 16:59 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, edumazet, lmb, sdf, toke, Petar Penkov

From: Petar Penkov <ppenkov@google.com>

This patch series introduces a BPF helper function that allows generating SYN
cookies from BPF. Currently, this helper is enabled at both the TC hook and the
XDP hook.

The first two patches in the series add/modify several TCP helper functions to
allow for SKB-less operation, as is the case at the XDP hook.

The third patch introduces the bpf_tcp_gen_syncookie helper function which
generates a SYN cookie for either XDP or TC programs. The return value of
this function contains both the MSS value, encoded in the cookie, and the
cookie itself.

The last three patches sync tools/ and add a test. 

Performance evaluation:
I sent 10Mpps to a fixed port on a host with 2 10G bonded Mellanox 4 NICs from
random IPv6 source addresses. Without XDP I observed 7.2Mpps (syn-acks) being
sent out if the IPv6 packets carry 20 bytes of TCP options or 7.6Mpps if they
carry no options. If I attached a simple program that checks if a packet is
IPv6/TCP/SYN, looks up the socket, issues a cookie, and sends it back out after
swapping src/dest, recomputing the checksum, and setting the ACK flag, I
observed 10Mpps being sent back out.

Changes since v1:
1/ Added performance numbers to the cover letter
2/ Patch 2: Refactored a bit to fix compilation issues
3/ Patch 3: Changed ENOTSUPP to EOPNOTSUPP at Toke's suggestion

Changes since RFC:
1/ Cookie is returned in host order at Alexei's suggestion
2/ If cookies are not enabled via a sysctl, the helper function returns
   -ENOENT instead of -EINVAL at Lorenz's suggestion
3/ Fixed documentation to properly reflect that MSS is 16 bits at
   Lorenz's suggestion
4/ BPF helper requires TCP length to match ->doff field, rather than to simply
   be no more than 20 bytes at Eric and Alexei's suggestion
5/ Packet type is looked up from the packet version field, rather than from the
   socket. v4 packets are rejected on v6-only sockets but should work with
   dual stack listeners at Eric's suggestion
6/ Removed unnecessary `net` argument from helper function in patch 2 at
   Lorenz's suggestion 
7/ Changed test to only pass MSS option so we can convince the verifier that the
   memory access is not out of bounds

Note that 7/ below illustrates the verifier might need to be extended to allow
passing a variable tcph->doff to the helper function like below:

__u32 thlen = tcph->doff * 4;
if (thlen < sizeof(*tcph))
	return;
__s64 cookie = bpf_tcp_gen_syncookie(sk, ipv4h, 20, tcph, thlen);

Petar Penkov (6):
  tcp: tcp_syn_flood_action read port from socket
  tcp: add skb-less helpers to retrieve SYN cookie
  bpf: add bpf_tcp_gen_syncookie helper
  bpf: sync bpf.h to tools/
  selftests/bpf: bpf_tcp_gen_syncookie->bpf_helpers
  selftests/bpf: add test for bpf_tcp_gen_syncookie

 include/net/tcp.h                             | 10 +++
 include/uapi/linux/bpf.h                      | 30 ++++++-
 net/core/filter.c                             | 73 +++++++++++++++++
 net/ipv4/tcp_input.c                          | 81 +++++++++++++++++--
 net/ipv4/tcp_ipv4.c                           | 15 ++++
 net/ipv6/tcp_ipv6.c                           | 15 ++++
 tools/include/uapi/linux/bpf.h                | 37 ++++++++-
 tools/testing/selftests/bpf/bpf_helpers.h     |  3 +
 .../bpf/progs/test_tcp_check_syncookie_kern.c | 48 +++++++++--
 .../selftests/bpf/test_tcp_check_syncookie.sh |  3 +
 .../bpf/test_tcp_check_syncookie_user.c       | 61 ++++++++++++--
 11 files changed, 354 insertions(+), 22 deletions(-)

-- 
2.22.0.709.g102302147b-goog


^ permalink raw reply

* Re: [PATCH net-next v4 2/3] flow_offload: Support get default block from tc immediately
From: Jakub Kicinski @ 2019-07-29 16:55 UTC (permalink / raw)
  To: wenxu; +Cc: pablo, fw, netfilter-devel, netdev
In-Reply-To: <449a5603-80e9-ad7d-5c02-bf57558f9603@ucloud.cn>

On Mon, 29 Jul 2019 15:18:03 +0800, wenxu wrote:
> On 7/29/2019 12:42 PM, Jakub Kicinski wrote:
> > On Mon, 29 Jul 2019 10:43:56 +0800, wenxu wrote:  
> >> On 7/29/2019 4:16 AM, Jakub Kicinski wrote:  
> >>> I don't know the nft code, but it seems unlikely it wouldn't have the
> >>> same problem/need..    
> >> nft don't have the same problem.  The offload rule can only attached
> >> to offload base chain.
> >>
> >> Th  offload base chain is created after the device driver loaded (the
> >> device exist).  
> > For indirect blocks the block is on the tunnel device and the offload
> > target is another device. E.g. you offload rules from a VXLAN device
> > onto the ASIC. The ASICs driver does not have to be loaded when VXLAN
> > device is created.
> >
> > So I feel like either the chain somehow directly references the offload
> > target (in which case the indirect infrastructure with hash lookup etc
> > is not needed for nft), or indirect infra is needed, and we need to take
> > care of replays.  
> 
> I think the nft is different with tc. 
> 
> In tc case we can create vxlan device add a ingress qdisc with a block success
> 
> Then the ASIC driver loaded,  then register the vxlan indr-dev and get the block
> adn replay it to hardware
> 
> But in the nft case,  The base chain flags with offload. Create an offload netdev
> base chain on vxlan device will fail if there is no indr-device to offload.

Can you show us the offload chain spec? Does it specify offload to the
vxlan device or the ASIC device?

Indir-devs can come and go, how do you handle a situation where offload
chain was installed with indir listener present, but then the ASIC
driver got removed?

^ permalink raw reply

* Re: [PATCH net-next v4 2/3] flow_offload: Support get default block from tc immediately
From: Jakub Kicinski @ 2019-07-29 16:51 UTC (permalink / raw)
  To: wenxu; +Cc: pablo, fw, netfilter-devel, netdev
In-Reply-To: <c405ff42-dfc5-6f3f-061c-7788e1204afa@ucloud.cn>

On Mon, 29 Jul 2019 15:05:34 +0800, wenxu wrote:
> On 7/29/2019 12:42 PM, Jakub Kicinski wrote:
> > On Mon, 29 Jul 2019 10:43:56 +0800, wenxu wrote:  
> >> On 7/29/2019 4:16 AM, Jakub Kicinski wrote:  
> >>> I don't know the nft code, but it seems unlikely it wouldn't have the
> >>> same problem/need..    
> >> nft don't have the same problem.  The offload rule can only attached
> >> to offload base chain.
> >>
> >> Th  offload base chain is created after the device driver loaded (the
> >> device exist).  
> > For indirect blocks the block is on the tunnel device and the offload
> > target is another device. E.g. you offload rules from a VXLAN device
> > onto the ASIC. The ASICs driver does not have to be loaded when VXLAN
> > device is created.
> >
> > So I feel like either the chain somehow directly references the offload
> > target (in which case the indirect infrastructure with hash lookup etc
> > is not needed for nft), or indirect infra is needed, and we need to take
> > care of replays.  
> 
> So you mean the case is there are two card A and B both can offload vxlan.
> 
> First vxlan device offload with A.  And then the B driver loaded, So the rules
> should replay to B device?

That'd be one example, yes.

^ permalink raw reply

* Re: [PATCH v4 1/5] vsock/virtio: limit the memory used per-socket
From: Stefano Garzarella @ 2019-07-29 16:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, Stefan Hajnoczi, David S. Miller,
	virtualization, Jason Wang, kvm
In-Reply-To: <20190729161903.yhaj5rfcvleexkhc@steredhat>

On Mon, Jul 29, 2019 at 06:19:03PM +0200, Stefano Garzarella wrote:
> On Mon, Jul 29, 2019 at 11:49:02AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jul 29, 2019 at 05:36:56PM +0200, Stefano Garzarella wrote:
> > > On Mon, Jul 29, 2019 at 10:04:29AM -0400, Michael S. Tsirkin wrote:
> > > > On Wed, Jul 17, 2019 at 01:30:26PM +0200, Stefano Garzarella wrote:
> > > > > Since virtio-vsock was introduced, the buffers filled by the host
> > > > > and pushed to the guest using the vring, are directly queued in
> > > > > a per-socket list. These buffers are preallocated by the guest
> > > > > with a fixed size (4 KB).
> > > > > 
> > > > > The maximum amount of memory used by each socket should be
> > > > > controlled by the credit mechanism.
> > > > > The default credit available per-socket is 256 KB, but if we use
> > > > > only 1 byte per packet, the guest can queue up to 262144 of 4 KB
> > > > > buffers, using up to 1 GB of memory per-socket. In addition, the
> > > > > guest will continue to fill the vring with new 4 KB free buffers
> > > > > to avoid starvation of other sockets.
> > > > > 
> > > > > This patch mitigates this issue copying the payload of small
> > > > > packets (< 128 bytes) into the buffer of last packet queued, in
> > > > > order to avoid wasting memory.
> > > > > 
> > > > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> > > > 
> > > > This is good enough for net-next, but for net I think we
> > > > should figure out how to address the issue completely.
> > > > Can we make the accounting precise? What happens to
> > > > performance if we do?
> > > > 
> > > 
> > > In order to do more precise accounting maybe we can use the buffer size,
> > > instead of payload size when we update the credit available.
> > > In this way, the credit available for each socket will reflect the memory
> > > actually used.
> > > 
> > > I should check better, because I'm not sure what happen if the peer sees
> > > 1KB of space available, then it sends 1KB of payload (using a 4KB
> > > buffer).
> > > 
> > > The other option is to copy each packet in a new buffer like I did in
> > > the v2 [2], but this forces us to make a copy for each packet that does
> > > not fill the entire buffer, perhaps too expensive.
> > > 
> > > [2] https://patchwork.kernel.org/patch/10938741/
> > > 
> > > 
> > > Thanks,
> > > Stefano
> > 
> > Interesting. You are right, and at some level the protocol forces copies.
> > 
> > We could try to detect that the actual memory is getting close to
> > admin limits and force copies on queued packets after the fact.
> > Is that practical?
> 
> Yes, I think it is doable!
> We can decrease the credit available with the buffer size queued, and
> when the buffer size of packet to queue is bigger than the credit
> available, we can copy it.
> 
> > 
> > And yes we can extend the credit accounting to include buffer size.
> > That's a protocol change but maybe it makes sense.
> 
> Since we send to the other peer the credit available, maybe this
> change can be backwards compatible (I'll check better this).

What I said was wrong.

We send a counter (increased when the user consumes the packets) and the
"buf_alloc" (the max memory allowed) to the other peer.
It makes a difference between a local counter (increased when the
packets are sent) and the remote counter to calculate the credit available:

    u32 virtio_transport_get_credit(struct virtio_vsock_sock *vvs, u32 credit)
    {
    	u32 ret;

    	spin_lock_bh(&vvs->tx_lock);
    	ret = vvs->peer_buf_alloc - (vvs->tx_cnt - vvs->peer_fwd_cnt);
    	if (ret > credit)
    		ret = credit;
    	vvs->tx_cnt += ret;
    	spin_unlock_bh(&vvs->tx_lock);

    	return ret;
    }

Maybe I can play with "buf_alloc" to take care of bytes queued but not
used.

Thanks,
Stefano

^ permalink raw reply

* Re: [PATCH net v2] net: bridge: delete local fdb on device init failure
From: David Miller @ 2019-07-29 16:50 UTC (permalink / raw)
  To: nikolay; +Cc: netdev, roopa, bridge, syzbot+88533dc8b582309bf3ee
In-Reply-To: <20190729092841.4260-1-nikolay@cumulusnetworks.com>

From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Date: Mon, 29 Jul 2019 12:28:41 +0300

> On initialization failure we have to delete the local fdb which was
> inserted due to the default pvid creation. This problem has been present
> since the inception of default_pvid. Note that currently there are 2 cases:
> 1) in br_dev_init() when br_multicast_init() fails
> 2) if register_netdevice() fails after calling ndo_init()
> 
> This patch takes care of both since br_vlan_flush() is called on both
> occasions. Also the new fdb delete would be a no-op on normal bridge
> device destruction since the local fdb would've been already flushed by
> br_dev_delete(). This is not an issue for ports since nbp_vlan_init() is
> called last when adding a port thus nothing can fail after it.
> 
> Reported-by: syzbot+88533dc8b582309bf3ee@syzkaller.appspotmail.com
> Fixes: 5be5a2df40f0 ("bridge: Add filtering support for default_pvid")
> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> ---
> v2: reworded the commit message and comment so they're not plural, we're
>     talking about a single bridge local fdb added on the init vlan creation
>     of the default pvid
> 
> Tested with the provided reproducer and can no longer trigger the leak.
> Also tested the br_multicast_init() failure manually by making it always
> return an error.

Applied and queued up for -stable, thank you.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox