Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v4 0/3] IB/ipoib: Use dev_port to disambiguate port numbers
From: Doug Ledford @ 2018-09-13 16:50 UTC (permalink / raw)
  To: Arseny Maslennikov, linux-rdma; +Cc: Jason Gunthorpe, netdev
In-Reply-To: <20180906145112.29245-1-ar@cs.msu.ru>

[-- Attachment #1: Type: text/plain, Size: 2027 bytes --]

On Thu, 2018-09-06 at 17:51 +0300, Arseny Maslennikov wrote:
> Pre-3.15 userspace had trouble distinguishing different ports
> of a NIC on a single PCI bus/device/function. To solve this,
> a sysfs field `dev_port' was introduced quite a while ago
> (commit v3.14-rc3-739-g3f85944fe207), and some relevant device
> drivers were fixed to use it, but not in case of IPoIB.
> 
> The convention for some reason never got documented in the kernel, but
> was immediately adopted by userspace (notably udev[1][2], biosdevname[3])
> 
> 1/3 documents the sysfs field — that's why I'm CC-ing netdev.
> 
> This series was tested on and applies to 4.19-rc2.
> 
> [1] https://lists.freedesktop.org/archives/systemd-devel/2014-June/020788.html
> [2] https://lists.freedesktop.org/archives/systemd-devel/2014-July/020804.html
> [3] https://github.com/CloudAutomationNTools/biosdevname/blob/c795d51dd93a5309652f0d635f12a3ecfabfaa72/src/eths.c#L38
> 
> v1->v2: replace a line instead of inserting and then removing.
> v2->v3: restore both attributes, output a notice of deprecation to kmsg.
> v3->v4: style adjustments, join the deprecation notice to single line.
> 
> Arseny Maslennikov (3):
>   Documentation/ABI: document /sys/class/net/*/dev_port
>   IB/ipoib: Use dev_port to expose network interface port numbers
>   IB/ipoib: Log sysfs 'dev_id' accesses from userspace
> 
>  Documentation/ABI/testing/sysfs-class-net | 18 +++++++++++++
>  drivers/infiniband/ulp/ipoib/ipoib_main.c | 33 +++++++++++++++++++++++
>  2 files changed, 51 insertions(+)
> 

Series applied to for-next.  But I think we should watch feedback from
people, and if people think the notification about using the wrong
variable is too noisy, then we might want to revert it or modify it to
only print out once per specific executable instead of once per run of
each executable.

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* What is the best forum (mailing list, irc etc) to ask questions about the usage of AF_XDP sockets.
From: Konrad Djimeli @ 2018-09-13 16:31 UTC (permalink / raw)
  To: netdev; +Cc: bjorn.topel, jakub.kicinski

Hello,

I have been working on trying to make use of AF_XDP sockets as part of a
project I working on, and I have been facing some issues but I am not
sure where to ask questions related to the usage of AF_XDP, since this
is a development mailing list.

Thanks
Konrad
www.djimeli.me

^ permalink raw reply

* Re: [Patch net] net_sched: notify filter deletion when deleting a chain
From: David Miller @ 2018-09-13 16:08 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, jiri
In-Reply-To: <20180911212223.11613-1-xiyou.wangcong@gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Tue, 11 Sep 2018 14:22:23 -0700

> When we delete a chain of filters, we need to notify
> user-space we are deleting each filters in this chain
> too.
> 
> Fixes: 32a4f5ecd738 ("net: sched: introduce chain object to uapi")
> Cc: Jiri Pirko <jiri@mellanox.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied, thanks Cong.

^ permalink raw reply

* Re: [Patch net-next] llc: avoid blocking in llc_sap_close()
From: David Miller @ 2018-09-13 16:05 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev
In-Reply-To: <20180911184206.8461-1-xiyou.wangcong@gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Tue, 11 Sep 2018 11:42:06 -0700

> llc_sap_close() is called by llc_sap_put() which
> could be called in BH context in llc_rcv(). We can't
> block in BH.
> 
> There is no reason to block it here, kfree_rcu() should
> be sufficient.
> 
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied, thanks Cong.

^ permalink raw reply

* Re: [PATCH 2/2] netlink: add ethernet address policy types
From: Michal Kubecek @ 2018-09-13 16:03 UTC (permalink / raw)
  To: Johannes Berg; +Cc: linux-wireless, netdev
In-Reply-To: <1536842777.4160.9.camel@sipsolutions.net>

On Thu, Sep 13, 2018 at 02:46:17PM +0200, Johannes Berg wrote:
> On Thu, 2018-09-13 at 14:24 +0200, Michal Kubecek wrote:
> > On Thu, Sep 13, 2018 at 02:16:06PM +0200, Johannes Berg wrote:
> > > On Thu, 2018-09-13 at 14:12 +0200, Michal Kubecek wrote:
> > > > Maybe rather
> > > > 
> > > >   #define NLA_ETH_ADDR NLA_BINARY_EXACT, .len = ETH_ALEN
> > > >   #define NLA_IP6_ADDR NLA_BINARY_EXACT, .len = sizeof(struct in6_addr)
> > > > 
> > > > so that one could write
> > > > 
> > > >   { .type = NLA_ETH_ADDR }
> > > 
> > > Yeah, that's possible. I considered it for a second, but it was slightly
> > > too magical for my taste :-)
> > > 
> > > Better pick a different "namespace", perhaps NLA_POLICY_ETH_ADDR or so?
> > 
> > Right, that sounds better. I'm afraid anything too tricky would
> > inevitably trick people into using it in an unexpected way and ending up
> > with very confusing error messages.
> 
> Right.
> 
> Then again though, we already have NLA_MSECS which is basically
> equivalent to NLA_U64 afaict, so why not have more types?
> 
> It doesn't really cost us that much, just a few bytes in the validation?

And one more branch in the switch in validate_nla().

I'm not really opposed to having special policy types for MAC/IPv4/IPv6
address, these types are quite common, at least in networking code which
is where netlink is used most oftena. I rather felt that technically the
only difference between MAC and IPv6 address is the size and as we have
.len already, it would be more useful to have generic policy allowing to
also handle other fixes size data types.

On the other hand, if there was a trend of adding special policies for
more "endemic" data types, it might be more appropriate to introduce
a generic policy where validation_data would be a "validator function"
doing specific checks (probably using a union to allow type check of the
callback). But that's not happening (yet).

> Also, with .type = NLA_ETH_ADDR_COMPAT we could get a warning, which is
> not true for just checking .len.

We could have three flavors of NLA_BINARY.

Michal Kubecek

^ permalink raw reply

* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
From: David Miller @ 2018-09-13 21:12 UTC (permalink / raw)
  To: fw
  Cc: linux, netdev, steffen.klassert, linux-kernel, torvalds,
	christophe.gouault
In-Reply-To: <20180913210325.5usfj2rorvuvtyc7@breakpoint.cc>

From: Florian Westphal <fw@strlen.de>
Date: Thu, 13 Sep 2018 23:03:25 +0200

> I am staring at b58555f1767c9f4e330fcf168e4e753d2d9196e0
> but can't figure out how to configure that away from the
> 'no hashing for prefixed policies' default or why we even have
> policy_inexact in first place :/
> 
> I'll look at this again tomorrow.

The inexact list exists to handle prefixed input keys.

At the time that I wrote all of the control plane hash table
stuff, configurations I could find consisted of:

1) Entires with non-prefixed keys, which are easy to hash.
   The number of entries could be large (e.g. cell phone
   network)

2) A very small number of prefixed policies.

So hashing, when possible, falling back to the linked list
for prefixed stuff.

Beforehand we only had the linked list :-)

^ permalink raw reply

* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
From: Florian Westphal @ 2018-09-13 21:03 UTC (permalink / raw)
  To: David Miller
  Cc: fw, linux, netdev, steffen.klassert, linux-kernel, torvalds,
	christophe.gouault
In-Reply-To: <20180913.102305.939671149040995911.davem@davemloft.net>

David Miller <davem@davemloft.net> wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Thu, 13 Sep 2018 18:38:48 +0200
> 
> > Wolfgang Walter <linux@stwm.de> wrote:
> >> What I can say is that it depends mainly on number of policy rules and SA.
> > 
> > Thats already a good hint, I guess we're hitting long hash chains in
> > xfrm_policy_lookup_bytype().
> 
> I don't really see how recent changes can influence that.

I don't think there is a recent change that did this.

Walter says < 4.14 is ok, so this is likely related to flow cache removal.

F.e. it looks like all prefixed policies end up in a linked list
(net->xfrm.policy_inexact) and are not even in a hash table.

I am staring at b58555f1767c9f4e330fcf168e4e753d2d9196e0
but can't figure out how to configure that away from the
'no hashing for prefixed policies' default or why we even have
policy_inexact in first place :/

I'll look at this again tomorrow.

^ permalink raw reply

* Re: [PATCH] Subject: [PATCH] rtl_bt: Add firmware and configuration files for the Bluetooth part of RTL8822CU
From: Josh Boyer @ 2018-09-13 15:48 UTC (permalink / raw)
  To: Larry Finger
  Cc: Linux Firmware, Linux Wireless, netdev, 陆朱伟
In-Reply-To: <20180905155146.28556-1-Larry.Finger@lwfinger.net>

On Wed, Sep 5, 2018 at 11:52 AM Larry Finger <Larry.Finger@lwfinger.net> wrote:
>
> These devices are new models from Realtek. Updates to driver btrtl will
> soon be submitted to the kernel.
>
> These files were provided by the Realtek developer.
>
> Signed-off-by: 陆朱伟 <alex_lu@realsil.com.cn>
> Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
> ---
>  WHENCE                  |   2 ++
>  rtl_bt/rtl8822cu_fw.bin | Bin 0 -> 22412 bytes
>  2 files changed, 2 insertions(+)
>  create mode 100644 rtl_bt/rtl8822cu_fw.bin
>

Applied (with fixed Subject), and pushed out.

josh

^ permalink raw reply

* Re: [PATCH v4 3/3] IB/ipoib: Log sysfs 'dev_id' accesses from userspace
From: Doug Ledford @ 2018-09-13 15:45 UTC (permalink / raw)
  To: Arseny Maslennikov, Jason Gunthorpe; +Cc: linux-rdma, netdev
In-Reply-To: <20180909205512.GA5918@cello.Dlink>

[-- Attachment #1: Type: text/plain, Size: 3657 bytes --]

On Sun, 2018-09-09 at 23:55 +0300, Arseny Maslennikov wrote:
> On Sun, Sep 09, 2018 at 09:11:46PM +0300, Arseny Maslennikov wrote:
> > On Fri, Sep 07, 2018 at 09:43:59AM -0600, Jason Gunthorpe wrote:
> > > On Thu, Sep 06, 2018 at 05:51:12PM +0300, Arseny Maslennikov wrote:
> > > > Some tools may currently be using only the deprecated attribute;
> > > > let's print an elaborate and clear deprecation notice to kmsg.
> > > > 
> > > > To do that, we have to replace the whole sysfs file, since we inherit
> > > > the original one from netdev.
> > > > 
> > > > Signed-off-by: Arseny Maslennikov <ar@cs.msu.ru>
> > > >  drivers/infiniband/ulp/ipoib/ipoib_main.c | 31 +++++++++++++++++++++++
> > > >  1 file changed, 31 insertions(+)
> > > > 
> > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > > > index 30f840f874b3..74732726ec6f 100644
> > > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > > > @@ -2386,6 +2386,35 @@ int ipoib_add_pkey_attr(struct net_device *dev)
> > > >  	return device_create_file(&dev->dev, &dev_attr_pkey);
> > > >  }
> > > >  
> > > > +/*
> > > > + * We erroneously exposed the iface's port number in the dev_id
> > > > + * sysfs field long after dev_port was introduced for that purpose[1],
> > > > + * and we need to stop everyone from relying on that.
> > > > + * Let's overload the shower routine for the dev_id file here
> > > > + * to gently bring the issue up.
> > > > + *
> > > > + * [1] https://www.spinics.net/lists/netdev/msg272123.html
> > > > + */
> > > > +static ssize_t dev_id_show(struct device *dev,
> > > > +			   struct device_attribute *attr, char *buf)
> > > > +{
> > > > +	struct net_device *ndev = to_net_dev(dev);
> > > > +
> > > > +	if (ndev->dev_id == ndev->dev_port)
> > > > +		netdev_info_once(ndev,
> > > > +			"\"%s\" wants to know my dev_id. Should it look at dev_port instead? See Documentation/ABI/testing/sysfs-class-net for more info.\n",
> > > > +			current->comm);
> > > > +
> > > > +	return sprintf(buf, "%#x\n", ndev->dev_id);
> > > > +}
> > > > +static DEVICE_ATTR_RO(dev_id);
> > > > +
> > > > +int ipoib_intercept_dev_id_attr(struct net_device *dev)
> > > > +{
> > > > +	device_remove_file(&dev->dev, &dev_attr_dev_id);
> > > > +	return device_create_file(&dev->dev, &dev_attr_dev_id);
> > > > +}
> > > 
> > > Isn't this racey with userspace? Ie what happens if udev is querying
> > > the dev_id right here?
> > 
> > udev in particular does not use dev_id at all since 2014, because "why
> > would we keep using dev_id if it is not the right thing to use?".
> > 
> > > 
> > > Do we know there is no userspace doing this?
> > > 
> > 
> > Not for sure.
> > 
> > If we move all the sysfs handling stuff we introduce in _add_port():
> >  - pkey
> >  - umcast
> >  - {create,delete}_child
> >  - connected/datagram mode
> > to _ndo_init(), which is called by register_netdev before it sends
> > the netlink message, would that suffice to eliminate the race?
> > (Sysfs files for {create,delete}_child go to _parent_init() then).
> > 
> 
> No, we can't, sorry for the noise. ndo_init() runs before the kobject
> becomes available.
> 
> Anyway, our sysfs attributes being racy is unrelated to the patch series
> subject, and I can't come up with any other ideas what to do with them
> that do not involve adjustments to register_netdev.

Agreed (that fixing the race issues is a different patch series).

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [net-next, PATCH 2/2, v2] net: socionext: add XDP support
From: Ilias Apalodimas @ 2018-09-13 15:36 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, jaswinder.singh, ard.biesheuvel, masami.hiramatsu, arnd,
	mykyta.iziumtsev, bjorn.topel, magnus.karlsson, daniel, ast
In-Reply-To: <20180913163206.4feed505@redhat.com>

On Thu, Sep 13, 2018 at 04:32:06PM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 12 Sep 2018 12:29:15 +0300
> Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> 
> > On Wed, Sep 12, 2018 at 11:25:24AM +0200, Jesper Dangaard Brouer wrote:
> > > On Wed, 12 Sep 2018 12:02:38 +0300
> > > Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> > >   
> > > >  static const struct net_device_ops netsec_netdev_ops = {
> > > >  	.ndo_init		= netsec_netdev_init,
> > > >  	.ndo_uninit		= netsec_netdev_uninit,
> > > > @@ -1430,6 +1627,7 @@ static const struct net_device_ops netsec_netdev_ops = {
> > > >  	.ndo_set_mac_address    = eth_mac_addr,
> > > >  	.ndo_validate_addr	= eth_validate_addr,
> > > >  	.ndo_do_ioctl		= netsec_netdev_ioctl,
> > > > +	.ndo_bpf		= netsec_xdp,
> > > >  };
> > > >    
> > > 
> > > You have not implemented ndo_xdp_xmit.
> > > 
> > > Thus, you have "only" implemented the RX side of XDP_REDIRECT.  Which
> > > allows you to do, cpumap and AF_XDP redirects, but not allowing other
> > > drivers to XDP send out this device.  
> >
> > Correct, that was the planning, is ndo_xdp_xmit() needed for the patch or
> > is the patch message just misleading and i should change that ?
> 
> Yes, I think you should ALSO implement ndo_xdp_xmit, maybe as a separate
> patch, but in the same series. (Our experience is that if we don't
> require this, people forget to complete this part of the XDP support).
Ok makes sense. Already started on that i should have something soon
> 
> Also you XDP_TX is not optimal, as it (looks like) you flush TX on
> every send.
Yes i do, the driver is queueing packet by packet (in it's default skb 
implemetation) so i just did the same. Agree it's far from optimal though
i'll see if i can change than on the next version
> 
> BTW, do you have any performance numbers?
Yes XDP_TX is doing ~330kpps and XDP_REDIRECT ~340kpps(dropping packets)
using 64b packets.  I am not really sure if this is a hardware limitation 
due to only using a single queue. I used ./samples/bpf/xdpsock for AF_XDP
and ./samples/bpf/xdp2 for XDP_TX. I hope i am doing the right tests

The default Rx path is doing ~220kpps with the improved memory allocation
scheme so we do have some improvement although we are far away from line 
rate

The default Tx seems to hang after some point with a txq full message so 
i don't have any precice numbers for that

This change on the driver started as an investigation of using AF_XDP
for Time Sensitive networking setups. The offloading seems to work wonders
there since the latency is reduced *A LOT* (more than 10x in my case) in 
Rx path

Another thing i did consider is that Bjorn is right. Since i only have 1 
shared txq i need locking to avoid race conditions. 

Once again thanks for reviewing this

/Ilias

^ permalink raw reply

* RE: [PATCH] hv_netvsc: fix schedule in RCU context
From: Haiyang Zhang @ 2018-09-13 15:33 UTC (permalink / raw)
  To: Stephen Hemminger, KY Srinivasan
  Cc: netdev@vger.kernel.org, Stephen Hemminger
In-Reply-To: <20180913150342.32101-1-sthemmin@microsoft.com>



> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Thursday, September 13, 2018 11:04 AM
> To: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>
> Cc: netdev@vger.kernel.org; Stephen Hemminger <sthemmin@microsoft.com>
> Subject: [PATCH] hv_netvsc: fix schedule in RCU context
> 
> When netvsc device is removed it can call reschedule in RCU context.
> This happens because canceling the subchannel setup work could (in theory)
> cause a reschedule when manipulating the timer.
> 
> To reproduce, run with lockdep enabled kernel and unbind
> a network device from hv_netvsc (via sysfs).
> 
> [  160.682011] WARNING: suspicious RCU usage
> [  160.707466] 4.19.0-rc3-uio+ #2 Not tainted
> [  160.709937] -----------------------------
> [  160.712352] ./include/linux/rcupdate.h:302 Illegal context switch in RCU
> read-side critical section!
> [  160.723691]
> [  160.723691] other info that might help us debug this:
> [  160.723691]
> [  160.730955]
> [  160.730955] rcu_scheduler_active = 2, debug_locks = 1
> [  160.762813] 5 locks held by rebind-eth.sh/1812:
> [  160.766851]  #0: 000000008befa37a (sb_writers#6){.+.+}, at:
> vfs_write+0x184/0x1b0
> [  160.773416]  #1: 00000000b097f236 (&of->mutex){+.+.}, at:
> kernfs_fop_write+0xe2/0x1a0
> [  160.783766]  #2: 0000000041ee6889 (kn->count#3){++++}, at:
> kernfs_fop_write+0xeb/0x1a0
> [  160.787465]  #3: 0000000056d92a74 (&dev->mutex){....}, at:
> device_release_driver_internal+0x39/0x250
> [  160.816987]  #4: 0000000030f6031e (rcu_read_lock){....}, at:
> netvsc_remove+0x1e/0x250 [hv_netvsc]
> [  160.828629]
> [  160.828629] stack backtrace:
> [  160.831966] CPU: 1 PID: 1812 Comm: rebind-eth.sh Not tainted 4.19.0-rc3-
> uio+ #2
> [  160.832952] Hardware name: Microsoft Corporation Virtual Machine/Virtual
> Machine, BIOS Hyper-V UEFI Release v1.0 11/26/2012
> [  160.832952] Call Trace:
> [  160.832952]  dump_stack+0x85/0xcb
> [  160.832952]  ___might_sleep+0x1a3/0x240
> [  160.832952]  __flush_work+0x57/0x2e0
> [  160.832952]  ? __mutex_lock+0x83/0x990
> [  160.832952]  ? __kernfs_remove+0x24f/0x2e0
> [  160.832952]  ? __kernfs_remove+0x1b2/0x2e0
> [  160.832952]  ? mark_held_locks+0x50/0x80
> [  160.832952]  ? get_work_pool+0x90/0x90
> [  160.832952]  __cancel_work_timer+0x13c/0x1e0
> [  160.832952]  ? netvsc_remove+0x1e/0x250 [hv_netvsc]
> [  160.832952]  ? __lock_is_held+0x55/0x90
> [  160.832952]  netvsc_remove+0x9a/0x250 [hv_netvsc]
> [  160.832952]  vmbus_remove+0x26/0x30
> [  160.832952]  device_release_driver_internal+0x18a/0x250
> [  160.832952]  unbind_store+0xb4/0x180
> [  160.832952]  kernfs_fop_write+0x113/0x1a0
> [  160.832952]  __vfs_write+0x36/0x1a0
> [  160.832952]  ? rcu_read_lock_sched_held+0x6b/0x80
> [  160.832952]  ? rcu_sync_lockdep_assert+0x2e/0x60
> [  160.832952]  ? __sb_start_write+0x141/0x1a0
> [  160.832952]  ? vfs_write+0x184/0x1b0
> [  160.832952]  vfs_write+0xbe/0x1b0
> [  160.832952]  ksys_write+0x55/0xc0
> [  160.832952]  do_syscall_64+0x60/0x1b0
> [  160.832952]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  160.832952] RIP: 0033:0x7fe48f4c8154
> 
> Resolve this by getting RTNL earlier. This is safe because the subchannel
> work queue does trylock on RTNL and will detect the race.
> 
> Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
> Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>

Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>

Thank you!

^ permalink raw reply

* Re: [PATCHv2 net] ipv6: use rt6_info members when dst is set in rt6_fill_node
From: David Miller @ 2018-09-13 15:21 UTC (permalink / raw)
  To: lucien.xin; +Cc: netdev, dsa, roopa
In-Reply-To: <84d6cd6a280394837623ddb49186e5c299eb7a03.1536647638.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Tue, 11 Sep 2018 14:33:58 +0800

> In inet6_rtm_getroute, since Commit 93531c674315 ("net/ipv6: separate
> handling of FIB entries from dst based routes"), it has used rt->from
> to dump route info instead of rt.
> 
> However for some route like cache, some of its information like flags
> or gateway is not the same as that of the 'from' one. It caused 'ip
> route get' to dump the wrong route information.
> 
> In Jianlin's testing, the output information even lost the expiration
> time for a pmtu route cache due to the wrong fib6_flags.
> 
> So change to use rt6_info members for dst addr, src addr, flags and
> gateway when it tries to dump a route entry without fibmatch set.
> 
> v1->v2:
>   - not use rt6i_prefsrc.
>   - also fix the gw dump issue.
> 
> Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes")
> Reported-by: Jianlin Shi <jishi@redhat.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Applied and queued up for -stable, thanks Xin.

^ permalink raw reply

* Re: [PATCH 2/2] net: qcom/emac: add shared mdio bus support
From: Wang, Dongsheng @ 2018-09-13 15:20 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: timur@kernel.org, davem@davemloft.net, Zheng, Joey,
	netdev@vger.kernel.org
In-Reply-To: <20180913124229.GD11702@lunn.ch>

On 9/13/2018 8:42 PM, Andrew Lunn wrote:
> On Thu, Sep 13, 2018 at 05:04:53PM +0800, Wang Dongsheng wrote:
>> Share the mii_bus for others MAC device because QDF2400 emac
>> include MDIO, and the motherboard has more than one PHY connected
>> to an MDIO bus.
>>
>> Tested: QDF2400 (ACPI), buildin/insmod/rmmod
>>
>> Signed-off-by: Wang Dongsheng <dongsheng.wang@hxt-semitech.com>
> Hi Wang
>
> This is a pretty big patch, and is hard to review. Could you try to
> break it up into a number of smaller patches. You could for example
> first refactor emacs_phy_config(), without making any functional
> changes. Then add the sharing. Maybe do OF an ACPI in different
> patches?
>
> Thanks
> 	 Andrew
>
Ok, thanks.


Cheers

Dongsheng

^ permalink raw reply

* Re: [PATCH iproute2] iproute2: fix use-after-free
From: Stephen Hemminger @ 2018-09-13 15:19 UTC (permalink / raw)
  To: Mahesh Bandewar (महेश बंडेवार)
  Cc: Mahesh Bandewar, netdev
In-Reply-To: <CAF2d9jgpN0T+BuMTk9GwNKVEcmH2V_yC8R=WvWYCZ95E_pkfwg@mail.gmail.com>

On Wed, 12 Sep 2018 23:07:20 -0700
Mahesh Bandewar (महेश बंडेवार) <maheshb@google.com> wrote:

> On Wed, Sep 12, 2018 at 5:33 PM, Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> >
> > On Wed, 12 Sep 2018 16:29:28 -0700
> > Mahesh Bandewar <mahesh@bandewar.net> wrote:
> >  
> > > From: Mahesh Bandewar <maheshb@google.com>
> > >
> > > A local program using iproute2 lib pointed out the issue and looking
> > > at the code it is pretty obvious -
> > >
> > >     a = (struct nlmsghdr *)b;
> > >     ...
> > >     free(b);
> > >     if (a->nlmsg_seq == seq)
> > >     ...
> > >
> > > Fixes: 86bf43c7c2fd ("lib/libnetlink: update rtnl_talk to support malloc buff at run time")
> > > Signed-off-by: Mahesh Bandewar <maheshb@google.com>  
> >
> > Yes, this is a real problem.
> >
> > Maybe a minimal patch like this would be enough:  
> actually that will leave the memory leak at the 'goto next' line (just
> few lines below) since that jump will overwrite the buf.

It looks like everytime in the while loop. a new buffer is allocated.
So yes, it looks like old, my patch, and your patch would leak there
was an error followed by other data in response.
Though I doubt kernel would ever do that.

The only user of iov style messages to the kernel is in tc batching.
My gut feeling is that if one message in batch has error, then
the netlink code should return that error and stop processing more.

^ permalink raw reply

* Re: [PATCH v3 net-next 0/6] Add support for Lantiq / Intel vrx200 network
From: David Miller @ 2018-09-13 15:15 UTC (permalink / raw)
  To: hauke
  Cc: netdev, andrew, vivien.didelot, f.fainelli, john, linux-mips, dev,
	hauke.mehrtens, devicetree
In-Reply-To: <20180909201647.32727-1-hauke@hauke-m.de>

From: Hauke Mehrtens <hauke@hauke-m.de>
Date: Sun,  9 Sep 2018 22:16:41 +0200

> This adds basic support for the GSWIP (Gigabit Switch) found in the
> VRX200 SoC.
> There are different versions of this IP core used in different SoCs, but
> this driver was currently only tested on the VRX200 SoC line, for other
> SoCs this driver probably need some adoptions to work.
> 
> I also plan to add Layer 2 offloading to the DSA driver and later also
> layer 3 offloading which is supported by the PPE HW block.
> 
> All these patches should go through the net-next tree.
> 
> This depends on the patch "MIPS: lantiq: dma: add dev pointer" which 
> should go into 4.19.

Series applied to net-next, thanks.

^ permalink raw reply

* Re: [RFC PATCH iproute2-next] man: Add devlink health man page
From: Andrew Lunn @ 2018-09-13 15:12 UTC (permalink / raw)
  To: Eran Ben Elisha
  Cc: netdev, Jiri Pirko, Andy Gospodarek, Michael Chan, Jakub Kicinski,
	Simon Horman, Alexander Duyck, Florian Fainelli, Tal Alon,
	Ariel Almog
In-Reply-To: <66584ca2-8efa-9a6d-c1f3-1cf81cb04259@mellanox.com>

> >>>>        devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
> >>>>            Sets TX_COMP_ERROR sensor parameters for a specific device.

> >>This is what I had in mind:
> >>1. command interface error
> >>2. command interface timeout
> >>3. stuck TX queue (like tx_timeout)
> >>4. stuck TX completion queue (driver did not process packets in a reasonable
> >>time period)
> >>5. stuck RX queue
> >>6. RX completion error
> >>7. TX completion error
> >>8. HW / FW catastrophic error report
> >>9. completion queue overrun

> Such issues do exist in production environment, and need to be handled even
> if root cause is a bug which will be fixed in latest release. My feature
> should help developers / administrator to control and recover their live
> systems, by auto correction and logging support.
> Goal is:
> - Provide alert debug information
> - Self healing
> - If problem needs vendor support, provide a way to gather all needed
> debugging information.

So maybe you have the wrong name for this. Health is nice in terms of
Marketing, but we are actually talking about bug recovery.

devlink bug sensor set pci/0000:01:00.0 name command_interface_error action reset off action dump on
devlink bug sensor set pci/0000:01:00.0 name command_interface_timeout action reset off action dump on
devlink bug sensor set pci/0000:01:00.0 name transmit_completion_error action reset off action dump on
devlink bug sensor set pci/0000:01:00.0 name completion_queue_overrun action reset off action dump on

seems a lot more understandable than:

devlink health set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on

	Andrew

^ permalink raw reply

* Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library
From: Ard Biesheuvel @ 2018-09-13 15:07 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Andrew Lutomirski, LKML, Netdev, David Miller, Greg Kroah-Hartman,
	Samuel Neves, Jean-Philippe Aumasson, Linux Crypto Mailing List
In-Reply-To: <CAHmME9pp8HoisxQWL_eELR-ziiBii-7mcA=oF9UF-WMucewX5w@mail.gmail.com>

On 13 September 2018 at 16:18, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> On Thu, Sep 13, 2018 at 1:45 AM Andy Lutomirski <luto@kernel.org> wrote:
>> I'm not convinced that there's any real need for *all* crypto
>> algorithms to move into lib/zinc or to move at all.  As I see it,
>> there are two classes of crypto algorithms in the kernel:
>>
>> a) Crypto that is used by code that chooses its algorithm statically
>> and wants synchronous operations.  These include everything in
>> drivers/char/random.c, but also a bunch of various networking things
>> that are hardcoded and basically everything that uses stack buffers.
>> (This means it includes all the code that I broke when I did
>> VMAP_STACK.  Sign.)
>
> Right, exactly. This is what will wind up using Zinc. I'm working on
> an example usage of this for v4 of the patch submission, which you can
> ogle in a preview here if you're curious:
>
> https://git.zx2c4.com/linux-dev/commit/?h=big_key_rewrite
>
> 28 insertions, 206 deletions :-D
>

I must say, that actually looks pretty good.

>> b) Crypto that is used dynamically.  This includes dm-crypt
>> (aes-xts-plain64, aes-cbc-essiv, etc), all the ALG_IF interfaces, a
>> lot of IPSEC stuff, possibly KCM, and probably many more.  These will
>> get comparatively little benefit from being converted to a zinc-like
>> interface.  For some of these cases, it wouldn't make any sense at all
>> to convert them.  Certainly the ones that do async hardware crypto
>> using DMA engines will never look at all like zinc, even under the
>> hood.
>
> Right, this is what the crypto API will continue to be used for.
>
>
>> I think that, as a short-term goal, it makes a lot of sense to have
>> implementations of the crypto that *new* kernel code (like Wireguard)
>> wants to use in style (a) that live in /lib, and it obviously makes
>> sense to consolidate their implementations with the crypto/
>> implementations in a timely manner.  As a medium-term goal, adding
>> more algorithms as needed for things that could use the simpler APIs
>> (Bluetooth, perhaps) would make sense.
>
> Agreed 100%. With regards to "consolidate their implementations" --
> I've actually already done this after your urging yesterday, and so
> that will be a part of v4.
>
>> But I see no reason at all that /lib should ever contain a grab-bag of
>> crypto implementations just for the heck of it.  They should have real
>> in-kernel users IMO.  And this means that there will probably always
>> be some crypto implementations in crypto/ for things like aes-xts.
>
> Right, precisely.
>
> Jason

^ permalink raw reply

* [PATCH] hv_netvsc: fix schedule in RCU context
From: Stephen Hemminger @ 2018-09-13 15:03 UTC (permalink / raw)
  To: kys, haiyangz; +Cc: netdev, Stephen Hemminger

When netvsc device is removed it can call reschedule in RCU context.
This happens because canceling the subchannel setup work could (in theory)
cause a reschedule when manipulating the timer.

To reproduce, run with lockdep enabled kernel and unbind
a network device from hv_netvsc (via sysfs).

[  160.682011] WARNING: suspicious RCU usage
[  160.707466] 4.19.0-rc3-uio+ #2 Not tainted
[  160.709937] -----------------------------
[  160.712352] ./include/linux/rcupdate.h:302 Illegal context switch in RCU read-side critical section!
[  160.723691]
[  160.723691] other info that might help us debug this:
[  160.723691]
[  160.730955]
[  160.730955] rcu_scheduler_active = 2, debug_locks = 1
[  160.762813] 5 locks held by rebind-eth.sh/1812:
[  160.766851]  #0: 000000008befa37a (sb_writers#6){.+.+}, at: vfs_write+0x184/0x1b0
[  160.773416]  #1: 00000000b097f236 (&of->mutex){+.+.}, at: kernfs_fop_write+0xe2/0x1a0
[  160.783766]  #2: 0000000041ee6889 (kn->count#3){++++}, at: kernfs_fop_write+0xeb/0x1a0
[  160.787465]  #3: 0000000056d92a74 (&dev->mutex){....}, at: device_release_driver_internal+0x39/0x250
[  160.816987]  #4: 0000000030f6031e (rcu_read_lock){....}, at: netvsc_remove+0x1e/0x250 [hv_netvsc]
[  160.828629]
[  160.828629] stack backtrace:
[  160.831966] CPU: 1 PID: 1812 Comm: rebind-eth.sh Not tainted 4.19.0-rc3-uio+ #2
[  160.832952] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v1.0 11/26/2012
[  160.832952] Call Trace:
[  160.832952]  dump_stack+0x85/0xcb
[  160.832952]  ___might_sleep+0x1a3/0x240
[  160.832952]  __flush_work+0x57/0x2e0
[  160.832952]  ? __mutex_lock+0x83/0x990
[  160.832952]  ? __kernfs_remove+0x24f/0x2e0
[  160.832952]  ? __kernfs_remove+0x1b2/0x2e0
[  160.832952]  ? mark_held_locks+0x50/0x80
[  160.832952]  ? get_work_pool+0x90/0x90
[  160.832952]  __cancel_work_timer+0x13c/0x1e0
[  160.832952]  ? netvsc_remove+0x1e/0x250 [hv_netvsc]
[  160.832952]  ? __lock_is_held+0x55/0x90
[  160.832952]  netvsc_remove+0x9a/0x250 [hv_netvsc]
[  160.832952]  vmbus_remove+0x26/0x30
[  160.832952]  device_release_driver_internal+0x18a/0x250
[  160.832952]  unbind_store+0xb4/0x180
[  160.832952]  kernfs_fop_write+0x113/0x1a0
[  160.832952]  __vfs_write+0x36/0x1a0
[  160.832952]  ? rcu_read_lock_sched_held+0x6b/0x80
[  160.832952]  ? rcu_sync_lockdep_assert+0x2e/0x60
[  160.832952]  ? __sb_start_write+0x141/0x1a0
[  160.832952]  ? vfs_write+0x184/0x1b0
[  160.832952]  vfs_write+0xbe/0x1b0
[  160.832952]  ksys_write+0x55/0xc0
[  160.832952]  do_syscall_64+0x60/0x1b0
[  160.832952]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  160.832952] RIP: 0033:0x7fe48f4c8154

Resolve this by getting RTNL earlier. This is safe because the subchannel
work queue does trylock on RTNL and will detect the race.

Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
---
 drivers/net/hyperv/netvsc_drv.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 70921bbe0e28..915fbd66a02b 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2272,17 +2272,15 @@ static int netvsc_remove(struct hv_device *dev)
 
 	cancel_delayed_work_sync(&ndev_ctx->dwork);
 
-	rcu_read_lock();
-	nvdev = rcu_dereference(ndev_ctx->nvdev);
-
-	if  (nvdev)
+	rtnl_lock();
+	nvdev = rtnl_dereference(ndev_ctx->nvdev);
+	if (nvdev)
 		cancel_work_sync(&nvdev->subchan_work);
 
 	/*
 	 * Call to the vsc driver to let it know that the device is being
 	 * removed. Also blocks mtu and channel changes.
 	 */
-	rtnl_lock();
 	vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
 	if (vf_netdev)
 		netvsc_unregister_vf(vf_netdev);
@@ -2294,7 +2292,6 @@ static int netvsc_remove(struct hv_device *dev)
 	list_del(&ndev_ctx->list);
 
 	rtnl_unlock();
-	rcu_read_unlock();
 
 	hv_set_drvdata(dev, NULL);
 
-- 
2.18.0

^ permalink raw reply related

* Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library
From: Ard Biesheuvel @ 2018-09-13 15:04 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: LKML, Netdev, David Miller, Greg Kroah-Hartman, Andrew Lutomirski,
	Samuel Neves, Jean-Philippe Aumasson, Linux Crypto Mailing List
In-Reply-To: <CAHmME9oYAEwFQAriVRm5zZKCo0Sh1=t5YheNZ+MtKQLQPoMWeg@mail.gmail.com>

On 13 September 2018 at 16:15, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hi Ard,
>
> On Thu, Sep 13, 2018 at 12:56 AM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>> In this series, you are dumping a huge volume of unannotated,
>> generated asm into the kernel which has been modified [by you] to
>> [among other things?] adhere to the kernel API (without documenting
>> what the changes are exactly). How does that live up to the promise of
>> better, peer reviewed code?
>
> The code still benefits from the review that's gone into OpenSSL. It's
> not modified in ways that would affect the cryptographic operations
> being done. It's modified to be suitable for kernel space.
>

So could we please at least have those changes as a separate patch then?

>> Then there is the performance claim. We know for instance that the
>> OpenSSL ARM NEON code for ChaCha20 is faster on cores that happen to
>> possess a micro-architectural property that ALU instructions are
>> essentially free when they are interleaved with SIMD instructions. But
>> we also know that a) Cortex-A7, which is a relevant target, is not one
>> of those cores, and b) that chip designers are not likely to optimize
>> for that particular usage pattern so relying on it in generic code is
>> unwise in general.
>
> That's interesting. I'll bring this up with AndyP. FWIW, if you think
> you have a real and compelling claim here, I'd be much more likely to
> accept a different ChaCha20 implementation than I would be to accept a
> different Poly1305 implementation. (It's a *lot* harder to screw up
> ChaCha20 than it is to screw up Poly1305.)
>

The question is really whether we want different implementations in
the crypto API and in zinc.

>> I am also concerned about your claim that all software algorithms will
>> be moved into this crypto library.
>
> I'll defer to Andy's response here, which I think is a correct one:
> https://lkml.org/lkml/2018/9/13/27
>
> The short answer is that Zinc is going to be adding the ciphers that
> people want to use for normal reasons from normal code. For example,
> after this merges, we'll next be working on moving the remaining
> non-optimized C code out of lib/ that's called by places (such as
> SHA2).
>

Excellent.

>> You are not specific about whose
>> responsibility it will be that this is going to happen in a timely
>> fashion.
>
> I thought I laid out the roadmap for this in the commit message. In
> case I wasn't clear: my plan is to tackle lib/ after merging, and I
> plan to do so in a timely manner. It's a pretty common tactic to keep
> layering on tasks, "what about X?", "what about Y?", "I won't agree
> unless Z!" -- when in reality kernel development and refactorings are
> done incrementally. I've been around on this list contributing code
> for long enough that you should have a decent amount of confidence
> that I'm not just going to disappear working on this or something
> insane like that. And neither are the two academic cryptographers CC'd
> on this thread. So, as Andy said, we're going to be porting to Zinc
> the primitives that are useful for the various applications of Zinc.
> This means yes, we'll have SHA2 in there.
>
>> chaining modes
>> What are the APIs
>> going to look like for block ciphers, taking chaining modes into
>> account?
>
> As mentioned in the commit message and numerous times, we're not
> trying to make a win32-like crypto API here or to remake the existing
> Linux crypto API. Rather we're providing libraries of specific
> functions that are useful for various circumstances. For example, if
> AES-GCM is desired at some point, then we'll have a similar API for
> that as we do for ChaPoly -- one that takes buffers and one that takes
> sg. Likewise, hash functions use the familiar init/update/final.
> "Generic" chaining modes aren't really part of the equation or design
> goals.
>
> Again, I realize you've spent a long time working on the existing
> crypto API, and so your questions and concerns are in the line of,
> "how are we going to make Zinc look like the existing crypto API in
> functionality?"

You are completely missing my point. I am not particularly invested in
the crypto API, and I share the concerns about its usability. That is
why I want to make sure that your solution actually results in a net
improvement for everybody, not just for WireGuard, in a maintainable
way.

> But that's not what we're up to here. We have a
> different and complementary design goal. I understand why you're
> squirming, but please recognize we're working on different things.
>
>> I'm sure it is rather simple to port the crypto API implementation of
>> ChaCha20 to use your library. I am more concerned about how your
>> library is going to expand to cover all other software algorithms that
>> we currently use in the kernel.
>
> The subset of algorithms we add will be developed with the same
> methodology as the present ones. There is nothing making this
> particularly difficult or even more difficult for other primitives
> than it was for ChaCha20. It's especially easy, in fact, since we're
> following similar design methodologies as the vast majority of other
> cryptography libraries that have been developed. Namely, we're
> creating simple things called "functions".
>
>> Of course. But please respond to all the concerns,
>> You have not
>> responded to that concern yet.
>
> Sorry, it's certainly not my intention. I've been on vacation with my
> family for the last several weeks, and only returned home
> sleep-deprived last night after 4 days of plane delays. I've now
> rested and will resume working on this full-time and I'll try my best
> to address concerns, and also go back through emails to find things I
> might have missed. (First, though, I'm going to deal with getting back
> the three suitcases the airline lost in transit...)
>
>> > Anyway, it sounds like this whole thing may have ruffled your feathers
>> > a bit. Will you be at Linux Plumbers Conference in November? I'm
>> > planning on attending, and perhaps we could find some time there to
>> > sit down and talk one on one a bit.
>>
>> That would be good, yes. I will be there.
>
> Looking forward to talking to you there, and hopefully we can put to
> rest any lingering concerns.
>
> Jason

^ permalink raw reply

* Re: [PATCH 2/2] net: qcom/emac: add shared mdio bus support
From: Timur Tabi @ 2018-09-13 15:03 UTC (permalink / raw)
  To: Andrew Lunn, Wang Dongsheng; +Cc: davem, yu.zheng, netdev
In-Reply-To: <20180913124229.GD11702@lunn.ch>

On 9/13/18 7:42 AM, Andrew Lunn wrote:
> This is a pretty big patch, and is hard to review. Could you try to
> break it up into a number of smaller patches. You could for example
> first refactor emacs_phy_config(), without making any functional
> changes. Then add the sharing. Maybe do OF an ACPI in different
> patches?

Yes, please.

^ permalink raw reply

* Re: [PATCH net-next RFC] virtio_net: ethtool tx napi configuration
From: Willem de Bruijn @ 2018-09-13 15:00 UTC (permalink / raw)
  To: me
  Cc: Network Development, Jason Wang, Michael S. Tsirkin, f.fainelli,
	Willem de Bruijn
In-Reply-To: <20180913100425.GF11198@eros>

> > +static u32 virtnet_get_priv_flags(struct net_device *dev)
> > +{
> > +     struct virtnet_info *vi = netdev_priv(dev);
> > +     int priv_flags = 0;
> > +
> > +     if (vi->sq[0].napi.weight)
> > +             priv_flags |= 0x1;
> > +
> > +     return priv_flags;
> > +}
>
> Why the use of priv_flags here?  Is there some reason that we don't want
> to use the more simple
>
>     static u32 virtnet_get_priv_flags(struct net_device *dev)
>     {
>             struct virtnet_info *vi = netdev_priv(dev);
>
>             if (vi->sq[0].napi.weight)
>                     return 1;
>
>             return 0;
>     }

Sure, that's fine, too.

I just wanted to make it explicit that this is one of possibly many
private flags,
and only acts on bit 0. If another private flag is added, the existing
code needs
little change, just add a branch on another bit. But either way works.

^ permalink raw reply

* [PATCH v3 30/30] ip: frags: fix crash in ip_do_fragment()
From: Stephen Hemminger @ 2018-09-13 14:59 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet, Taehee Yoo
In-Reply-To: <20180913145902.17531-1-sthemmin@microsoft.com>

From: Taehee Yoo <ap420073@gmail.com>

commit 5d407b071dc369c26a38398326ee2be53651cfe4 upstream

A kernel crash occurrs when defragmented packet is fragmented
in ip_do_fragment().
In defragment routine, skb_orphan() is called and
skb->ip_defrag_offset is set. but skb->sk and
skb->ip_defrag_offset are same union member. so that
frag->sk is not NULL.
Hence crash occurrs in skb->sk check routine in ip_do_fragment() when
defragmented packet is fragmented.

test commands:
   %iptables -t nat -I POSTROUTING -j MASQUERADE
   %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000

splat looks like:
[  261.069429] kernel BUG at net/ipv4/ip_output.c:636!
[  261.075753] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3
[  261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600
[  261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff <0f> 0b 0f 0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c
[  261.127015] RSP: 0018:ffff8801031cf2c0 EFLAGS: 00010202
[  261.134156] RAX: 1ffff1002297537b RBX: ffffed0020639e6e RCX: 0000000000000004
[  261.142156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880114ba9bd8
[  261.150157] RBP: ffff880114ba8a40 R08: ffffed0022975395 R09: ffffed0022975395
[  261.158157] R10: 0000000000000001 R11: ffffed0022975394 R12: ffff880114ba9ca4
[  261.166159] R13: 0000000000000010 R14: ffff880114ba9bc0 R15: dffffc0000000000
[  261.174169] FS:  00007fbae2199700(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000
[  261.183012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  261.189013] CR2: 00005579244fe000 CR3: 0000000119bf4000 CR4: 00000000001006e0
[  261.198158] Call Trace:
[  261.199018]  ? dst_output+0x180/0x180
[  261.205011]  ? save_trace+0x300/0x300
[  261.209018]  ? ip_copy_metadata+0xb00/0xb00
[  261.213034]  ? sched_clock_local+0xd4/0x140
[  261.218158]  ? kill_l4proto+0x120/0x120 [nf_conntrack]
[  261.223014]  ? rt_cpu_seq_stop+0x10/0x10
[  261.227014]  ? find_held_lock+0x39/0x1c0
[  261.233008]  ip_finish_output+0x51d/0xb50
[  261.237006]  ? ip_fragment.constprop.56+0x220/0x220
[  261.243011]  ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack]
[  261.250152]  ? rcu_is_watching+0x77/0x120
[  261.255010]  ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4]
[  261.261033]  ? nf_hook_slow+0xb1/0x160
[  261.265007]  ip_output+0x1c7/0x710
[  261.269005]  ? ip_mc_output+0x13f0/0x13f0
[  261.273002]  ? __local_bh_enable_ip+0xe9/0x1b0
[  261.278152]  ? ip_fragment.constprop.56+0x220/0x220
[  261.282996]  ? nf_hook_slow+0xb1/0x160
[  261.287007]  raw_sendmsg+0x21f9/0x4420
[  261.291008]  ? dst_output+0x180/0x180
[  261.297003]  ? sched_clock_cpu+0x126/0x170
[  261.301003]  ? find_held_lock+0x39/0x1c0
[  261.306155]  ? stop_critical_timings+0x420/0x420
[  261.311004]  ? check_flags.part.36+0x450/0x450
[  261.315005]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.320995]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.326142]  ? cyc2ns_read_end+0x10/0x10
[  261.330139]  ? raw_bind+0x280/0x280
[  261.334138]  ? sched_clock_cpu+0x126/0x170
[  261.338995]  ? check_flags.part.36+0x450/0x450
[  261.342991]  ? __lock_acquire+0x4500/0x4500
[  261.348994]  ? inet_sendmsg+0x11c/0x500
[  261.352989]  ? dst_output+0x180/0x180
[  261.357012]  inet_sendmsg+0x11c/0x500
[ ... ]

v2:
 - clear skb->sk at reassembly routine.(Eric Dumarzet)

Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/ip_fragment.c                  | 1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 88281fbce88c..e7227128df2c 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -599,6 +599,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 			nextp = &fp->next;
 			fp->prev = NULL;
 			memset(&fp->rbnode, 0, sizeof(fp->rbnode));
+			fp->sk = NULL;
 			head->data_len += fp->len;
 			head->len += fp->len;
 			if (head->ip_summed != fp->ip_summed)
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 82ce0d0f54bf..2ed8536e10b6 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -453,6 +453,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *prev,  struct net_devic
 		else if (head->ip_summed == CHECKSUM_COMPLETE)
 			head->csum = csum_add(head->csum, fp->csum);
 		head->truesize += fp->truesize;
+		fp->sk = NULL;
 	}
 	sub_frag_mem_limit(fq->q.net, head->truesize);
 
-- 
2.18.0

^ permalink raw reply related

* [PATCH v3 29/30] ip: process in-order fragments efficiently
From: Stephen Hemminger @ 2018-09-13 14:59 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet, Peter Oskolkov, Florian Westphal
In-Reply-To: <20180913145902.17531-1-sthemmin@microsoft.com>

From: Peter Oskolkov <posk@google.com>

This patch changes the runtime behavior of IP defrag queue:
incoming in-order fragments are added to the end of the current
list/"run" of in-order fragments at the tail.

On some workloads, UDP stream performance is substantially improved:

RX: ./udp_stream -F 10 -T 2 -l 60
TX: ./udp_stream -c -H <host> -F 10 -T 5 -l 60

with this patchset applied on a 10Gbps receiver:

  throughput=9524.18
  throughput_units=Mbit/s

upstream (net-next):

  throughput=4608.93
  throughput_units=Mbit/s

Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a4fd284a1f8fd4b6c59aa59db2185b1e17c5c11c)
---
 net/ipv4/inet_fragment.c |   2 +-
 net/ipv4/ip_fragment.c   | 110 ++++++++++++++++++++++++---------------
 2 files changed, 70 insertions(+), 42 deletions(-)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 6904cbb7de1a..f6764537148c 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -145,7 +145,7 @@ void inet_frag_destroy(struct inet_frag_queue *q)
 			fp = xp;
 		} while (fp);
 	} else {
-		sum_truesize = skb_rbtree_purge(&q->rb_fragments);
+		sum_truesize = inet_frag_rbtree_purge(&q->rb_fragments);
 	}
 	sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 26ace9d2d976..88281fbce88c 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -126,8 +126,8 @@ static u8 ip4_frag_ecn(u8 tos)
 
 static struct inet_frags ip4_frags;
 
-static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
-			 struct net_device *dev);
+static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
+			 struct sk_buff *prev_tail, struct net_device *dev);
 
 
 static void ip4_frag_init(struct inet_frag_queue *q, const void *a)
@@ -219,7 +219,12 @@ static void ip_expire(struct timer_list *t)
 		head = skb_rb_first(&qp->q.rb_fragments);
 		if (!head)
 			goto out;
-		rb_erase(&head->rbnode, &qp->q.rb_fragments);
+		if (FRAG_CB(head)->next_frag)
+			rb_replace_node(&head->rbnode,
+					&FRAG_CB(head)->next_frag->rbnode,
+					&qp->q.rb_fragments);
+		else
+			rb_erase(&head->rbnode, &qp->q.rb_fragments);
 		memset(&head->rbnode, 0, sizeof(head->rbnode));
 		barrier();
 	}
@@ -320,7 +325,7 @@ static int ip_frag_reinit(struct ipq *qp)
 		return -ETIMEDOUT;
 	}
 
-	sum_truesize = skb_rbtree_purge(&qp->q.rb_fragments);
+	sum_truesize = inet_frag_rbtree_purge(&qp->q.rb_fragments);
 	sub_frag_mem_limit(qp->q.net, sum_truesize);
 
 	qp->q.flags = 0;
@@ -329,6 +334,7 @@ static int ip_frag_reinit(struct ipq *qp)
 	qp->q.fragments = NULL;
 	qp->q.rb_fragments = RB_ROOT;
 	qp->q.fragments_tail = NULL;
+	qp->q.last_run_head = NULL;
 	qp->iif = 0;
 	qp->ecn = 0;
 
@@ -340,7 +346,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct rb_node **rbn, *parent;
-	struct sk_buff *skb1;
+	struct sk_buff *skb1, *prev_tail;
 	struct net_device *dev;
 	unsigned int fragsize;
 	int flags, offset;
@@ -418,38 +424,41 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	 */
 
 	/* Find out where to put this fragment.  */
-	skb1 = qp->q.fragments_tail;
-	if (!skb1) {
-		/* This is the first fragment we've received. */
-		rb_link_node(&skb->rbnode, NULL, &qp->q.rb_fragments.rb_node);
-		qp->q.fragments_tail = skb;
-	} else if ((skb1->ip_defrag_offset + skb1->len) < end) {
-		/* This is the common/special case: skb goes to the end. */
+	prev_tail = qp->q.fragments_tail;
+	if (!prev_tail)
+		ip4_frag_create_run(&qp->q, skb);  /* First fragment. */
+	else if (prev_tail->ip_defrag_offset + prev_tail->len < end) {
+		/* This is the common case: skb goes to the end. */
 		/* Detect and discard overlaps. */
-		if (offset < (skb1->ip_defrag_offset + skb1->len))
+		if (offset < prev_tail->ip_defrag_offset + prev_tail->len)
 			goto discard_qp;
-		/* Insert after skb1. */
-		rb_link_node(&skb->rbnode, &skb1->rbnode, &skb1->rbnode.rb_right);
-		qp->q.fragments_tail = skb;
+		if (offset == prev_tail->ip_defrag_offset + prev_tail->len)
+			ip4_frag_append_to_last_run(&qp->q, skb);
+		else
+			ip4_frag_create_run(&qp->q, skb);
 	} else {
-		/* Binary search. Note that skb can become the first fragment, but
-		 * not the last (covered above). */
+		/* Binary search. Note that skb can become the first fragment,
+		 * but not the last (covered above).
+		 */
 		rbn = &qp->q.rb_fragments.rb_node;
 		do {
 			parent = *rbn;
 			skb1 = rb_to_skb(parent);
 			if (end <= skb1->ip_defrag_offset)
 				rbn = &parent->rb_left;
-			else if (offset >= skb1->ip_defrag_offset + skb1->len)
+			else if (offset >= skb1->ip_defrag_offset +
+						FRAG_CB(skb1)->frag_run_len)
 				rbn = &parent->rb_right;
 			else /* Found an overlap with skb1. */
 				goto discard_qp;
 		} while (*rbn);
 		/* Here we have parent properly set, and rbn pointing to
-		 * one of its NULL left/right children. Insert skb. */
+		 * one of its NULL left/right children. Insert skb.
+		 */
+		ip4_frag_init_run(skb);
 		rb_link_node(&skb->rbnode, parent, rbn);
+		rb_insert_color(&skb->rbnode, &qp->q.rb_fragments);
 	}
-	rb_insert_color(&skb->rbnode, &qp->q.rb_fragments);
 
 	if (dev)
 		qp->iif = dev->ifindex;
@@ -476,7 +485,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		unsigned long orefdst = skb->_skb_refdst;
 
 		skb->_skb_refdst = 0UL;
-		err = ip_frag_reasm(qp, skb, dev);
+		err = ip_frag_reasm(qp, skb, prev_tail, dev);
 		skb->_skb_refdst = orefdst;
 		return err;
 	}
@@ -495,7 +504,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 
 /* Build a new IP datagram from all its fragments. */
 static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
-			 struct net_device *dev)
+			 struct sk_buff *prev_tail, struct net_device *dev)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct iphdr *iph;
@@ -519,10 +528,16 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 		fp = skb_clone(skb, GFP_ATOMIC);
 		if (!fp)
 			goto out_nomem;
-		rb_replace_node(&skb->rbnode, &fp->rbnode, &qp->q.rb_fragments);
+		FRAG_CB(fp)->next_frag = FRAG_CB(skb)->next_frag;
+		if (RB_EMPTY_NODE(&skb->rbnode))
+			FRAG_CB(prev_tail)->next_frag = fp;
+		else
+			rb_replace_node(&skb->rbnode, &fp->rbnode,
+					&qp->q.rb_fragments);
 		if (qp->q.fragments_tail == skb)
 			qp->q.fragments_tail = fp;
 		skb_morph(skb, head);
+		FRAG_CB(skb)->next_frag = FRAG_CB(head)->next_frag;
 		rb_replace_node(&head->rbnode, &skb->rbnode,
 				&qp->q.rb_fragments);
 		consume_skb(head);
@@ -558,7 +573,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 		for (i = 0; i < skb_shinfo(head)->nr_frags; i++)
 			plen += skb_frag_size(&skb_shinfo(head)->frags[i]);
 		clone->len = clone->data_len = head->data_len - plen;
-		skb->truesize += clone->truesize;
+		head->truesize += clone->truesize;
 		clone->csum = 0;
 		clone->ip_summed = head->ip_summed;
 		add_frag_mem_limit(qp->q.net, clone->truesize);
@@ -571,24 +586,36 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 	skb_push(head, head->data - skb_network_header(head));
 
 	/* Traverse the tree in order, to build frag_list. */
+	fp = FRAG_CB(head)->next_frag;
 	rbn = rb_next(&head->rbnode);
 	rb_erase(&head->rbnode, &qp->q.rb_fragments);
-	while (rbn) {
-		struct rb_node *rbnext = rb_next(rbn);
-		fp = rb_to_skb(rbn);
-		rb_erase(rbn, &qp->q.rb_fragments);
-		rbn = rbnext;
-		*nextp = fp;
-		nextp = &fp->next;
-		fp->prev = NULL;
-		memset(&fp->rbnode, 0, sizeof(fp->rbnode));
-		head->data_len += fp->len;
-		head->len += fp->len;
-		if (head->ip_summed != fp->ip_summed)
-			head->ip_summed = CHECKSUM_NONE;
-		else if (head->ip_summed == CHECKSUM_COMPLETE)
-			head->csum = csum_add(head->csum, fp->csum);
-		head->truesize += fp->truesize;
+	while (rbn || fp) {
+		/* fp points to the next sk_buff in the current run;
+		 * rbn points to the next run.
+		 */
+		/* Go through the current run. */
+		while (fp) {
+			*nextp = fp;
+			nextp = &fp->next;
+			fp->prev = NULL;
+			memset(&fp->rbnode, 0, sizeof(fp->rbnode));
+			head->data_len += fp->len;
+			head->len += fp->len;
+			if (head->ip_summed != fp->ip_summed)
+				head->ip_summed = CHECKSUM_NONE;
+			else if (head->ip_summed == CHECKSUM_COMPLETE)
+				head->csum = csum_add(head->csum, fp->csum);
+			head->truesize += fp->truesize;
+			fp = FRAG_CB(fp)->next_frag;
+		}
+		/* Move to the next run. */
+		if (rbn) {
+			struct rb_node *rbnext = rb_next(rbn);
+
+			fp = rb_to_skb(rbn);
+			rb_erase(rbn, &qp->q.rb_fragments);
+			rbn = rbnext;
+		}
 	}
 	sub_frag_mem_limit(qp->q.net, head->truesize);
 
@@ -624,6 +651,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 	qp->q.fragments = NULL;
 	qp->q.rb_fragments = RB_ROOT;
 	qp->q.fragments_tail = NULL;
+	qp->q.last_run_head = NULL;
 	return 0;
 
 out_nomem:
-- 
2.18.0

^ permalink raw reply related

* [PATCH v3 28/30] ip: add helpers to process in-order fragments faster.
From: Stephen Hemminger @ 2018-09-13 14:59 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet, Peter Oskolkov, Florian Westphal
In-Reply-To: <20180913145902.17531-1-sthemmin@microsoft.com>

From: Peter Oskolkov <posk@google.com>

This patch introduces several helper functions/macros that will be
used in the follow-up patch. No runtime changes yet.

The new logic (fully implemented in the second patch) is as follows:

* Nodes in the rb-tree will now contain not single fragments, but lists
  of consecutive fragments ("runs").

* At each point in time, the current "active" run at the tail is
  maintained/tracked. Fragments that arrive in-order, adjacent
  to the previous tail fragment, are added to this tail run without
  triggering the re-balancing of the rb-tree.

* If a fragment arrives out of order with the offset _before_ the tail run,
  it is inserted into the rb-tree as a single fragment.

* If a fragment arrives after the current tail fragment (with a gap),
  it starts a new "tail" run, as is inserted into the rb-tree
  at the end as the head of the new run.

skb->cb is used to store additional information
needed here (suggested by Eric Dumazet).

Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 353c9cb360874e737fb000545f783df756c06f9a)
---
 include/net/inet_frag.h |  6 ++++
 net/ipv4/ip_fragment.c  | 73 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index e4c71a7644be..335cf7851f12 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -57,7 +57,9 @@ struct frag_v6_compare_key {
  * @lock: spinlock protecting this frag
  * @refcnt: reference count of the queue
  * @fragments: received fragments head
+ * @rb_fragments: received fragments rb-tree root
  * @fragments_tail: received fragments tail
+ * @last_run_head: the head of the last "run". see ip_fragment.c
  * @stamp: timestamp of the last received fragment
  * @len: total length of the original datagram
  * @meat: length of received fragments so far
@@ -78,6 +80,7 @@ struct inet_frag_queue {
 	struct sk_buff		*fragments;  /* Used in IPv6. */
 	struct rb_root		rb_fragments; /* Used in IPv4. */
 	struct sk_buff		*fragments_tail;
+	struct sk_buff		*last_run_head;
 	ktime_t			stamp;
 	int			len;
 	int			meat;
@@ -113,6 +116,9 @@ void inet_frag_kill(struct inet_frag_queue *q);
 void inet_frag_destroy(struct inet_frag_queue *q);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key);
 
+/* Free all skbs in the queue; return the sum of their truesizes. */
+unsigned int inet_frag_rbtree_purge(struct rb_root *root);
+
 static inline void inet_frag_put(struct inet_frag_queue *q)
 {
 	if (refcount_dec_and_test(&q->refcnt))
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 7cb7ed761d8c..26ace9d2d976 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -57,6 +57,57 @@
  */
 static const char ip_frag_cache_name[] = "ip4-frags";
 
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
+	struct inet_skb_parm	h;
+	struct sk_buff		*next_frag;
+	int			frag_run_len;
+};
+
+#define FRAG_CB(skb)		((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void ip4_frag_init_run(struct sk_buff *skb)
+{
+	BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+
+	FRAG_CB(skb)->next_frag = NULL;
+	FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void ip4_frag_append_to_last_run(struct inet_frag_queue *q,
+					struct sk_buff *skb)
+{
+	RB_CLEAR_NODE(&skb->rbnode);
+	FRAG_CB(skb)->next_frag = NULL;
+
+	FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+	FRAG_CB(q->fragments_tail)->next_frag = skb;
+	q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void ip4_frag_create_run(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+	if (q->last_run_head)
+		rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
+			     &q->last_run_head->rbnode.rb_right);
+	else
+		rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
+	rb_insert_color(&skb->rbnode, &q->rb_fragments);
+
+	ip4_frag_init_run(skb);
+	q->fragments_tail = skb;
+	q->last_run_head = skb;
+}
+
 /* Describe an entry in the "incomplete datagrams" queue. */
 struct ipq {
 	struct inet_frag_queue q;
@@ -654,6 +705,28 @@ struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb, u32 user)
 }
 EXPORT_SYMBOL(ip_check_defrag);
 
+unsigned int inet_frag_rbtree_purge(struct rb_root *root)
+{
+	struct rb_node *p = rb_first(root);
+	unsigned int sum = 0;
+
+	while (p) {
+		struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
+
+		p = rb_next(p);
+		rb_erase(&skb->rbnode, root);
+		while (skb) {
+			struct sk_buff *next = FRAG_CB(skb)->next_frag;
+
+			sum += skb->truesize;
+			kfree_skb(skb);
+			skb = next;
+		}
+	}
+	return sum;
+}
+EXPORT_SYMBOL(inet_frag_rbtree_purge);
+
 #ifdef CONFIG_SYSCTL
 static int dist_min;
 
-- 
2.18.0

^ permalink raw reply related

* [PATCH v3 26/30] net: sk_buff rbnode reorg
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh
  Cc: netdev, stable, edumazet, Soheil Hassas Yeganeh, Wei Wang,
	Willem de Bruijn
In-Reply-To: <20180913145902.17531-1-sthemmin@microsoft.com>

From: Eric Dumazet <edumazet@google.com>

commit bffa72cf7f9df842f0016ba03586039296b4caaf upstream

skb->rbnode shares space with skb->next, skb->prev and skb->tstamp

Current uses (TCP receive ofo queue and netem) need to save/restore
tstamp, while skb->dev is either NULL (TCP) or a constant for a given
queue (netem).

Since we plan using an RB tree for TCP retransmit queue to speedup SACK
processing with large BDP, this patch exchanges skb->dev and
skb->tstamp.

This saves some overhead in both TCP and netem.

v2: removes the swtstamp field from struct tcp_skb_cb

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/linux/skbuff.h                  |  24 ++--
 include/net/inet_frag.h                 |   3 +-
 net/ipv4/inet_fragment.c                |  16 ++-
 net/ipv4/ip_fragment.c                  | 182 +++++++++++++-----------
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c                   |   1 +
 6 files changed, 129 insertions(+), 98 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2837e55df03e..f64e88444082 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -663,23 +663,27 @@ struct sk_buff {
 			struct sk_buff		*prev;
 
 			union {
-				ktime_t		tstamp;
-				u64		skb_mstamp;
+				struct net_device	*dev;
+				/* Some protocols might use this space to store information,
+				 * while device pointer would be NULL.
+				 * UDP receive path is one user.
+				 */
+				unsigned long		dev_scratch;
 			};
 		};
-		struct rb_node	rbnode; /* used in netem & tcp stack */
+		struct rb_node		rbnode; /* used in netem, ip4 defrag, and tcp stack */
+		struct list_head	list;
 	};
-	struct sock		*sk;
 
 	union {
-		struct net_device	*dev;
-		/* Some protocols might use this space to store information,
-		 * while device pointer would be NULL.
-		 * UDP receive path is one user.
-		 */
-		unsigned long		dev_scratch;
+		struct sock		*sk;
 		int			ip_defrag_offset;
 	};
+
+	union {
+		ktime_t		tstamp;
+		u64		skb_mstamp;
+	};
 	/*
 	 * This is the control buffer. It is free to use for every
 	 * layer. Please put your private variables there. If you
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index ed07e3786d98..e4c71a7644be 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -75,7 +75,8 @@ struct inet_frag_queue {
 	struct timer_list	timer;
 	spinlock_t		lock;
 	refcount_t		refcnt;
-	struct sk_buff		*fragments;
+	struct sk_buff		*fragments;  /* Used in IPv6. */
+	struct rb_root		rb_fragments; /* Used in IPv4. */
 	struct sk_buff		*fragments_tail;
 	ktime_t			stamp;
 	int			len;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index c9e35b81d093..6904cbb7de1a 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -136,12 +136,16 @@ void inet_frag_destroy(struct inet_frag_queue *q)
 	fp = q->fragments;
 	nf = q->net;
 	f = nf->f;
-	while (fp) {
-		struct sk_buff *xp = fp->next;
-
-		sum_truesize += fp->truesize;
-		kfree_skb(fp);
-		fp = xp;
+	if (fp) {
+		do {
+			struct sk_buff *xp = fp->next;
+
+			sum_truesize += fp->truesize;
+			kfree_skb(fp);
+			fp = xp;
+		} while (fp);
+	} else {
+		sum_truesize = skb_rbtree_purge(&q->rb_fragments);
 	}
 	sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 960bf5eab59f..0e8f8de77e71 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -136,7 +136,7 @@ static void ip_expire(struct timer_list *t)
 {
 	struct inet_frag_queue *frag = from_timer(frag, t, timer);
 	const struct iphdr *iph;
-	struct sk_buff *head;
+	struct sk_buff *head = NULL;
 	struct net *net;
 	struct ipq *qp;
 	int err;
@@ -152,14 +152,31 @@ static void ip_expire(struct timer_list *t)
 
 	ipq_kill(qp);
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
-
-	head = qp->q.fragments;
-
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-	if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
+	if (!qp->q.flags & INET_FRAG_FIRST_IN)
 		goto out;
 
+	/* sk_buff::dev and sk_buff::rbnode are unionized. So we
+	 * pull the head out of the tree in order to be able to
+	 * deal with head->dev.
+	 */
+	if (qp->q.fragments) {
+		head = qp->q.fragments;
+		qp->q.fragments = head->next;
+	} else {
+		head = skb_rb_first(&qp->q.rb_fragments);
+		if (!head)
+			goto out;
+		rb_erase(&head->rbnode, &qp->q.rb_fragments);
+		memset(&head->rbnode, 0, sizeof(head->rbnode));
+		barrier();
+	}
+	if (head == qp->q.fragments_tail)
+		qp->q.fragments_tail = NULL;
+
+	sub_frag_mem_limit(qp->q.net, head->truesize);
+
 	head->dev = dev_get_by_index_rcu(net, qp->iif);
 	if (!head->dev)
 		goto out;
@@ -179,16 +196,16 @@ static void ip_expire(struct timer_list *t)
 	    (skb_rtable(head)->rt_type != RTN_LOCAL))
 		goto out;
 
-	skb_get(head);
 	spin_unlock(&qp->q.lock);
 	icmp_send(head, ICMP_TIME_EXCEEDED, ICMP_EXC_FRAGTIME, 0);
-	kfree_skb(head);
 	goto out_rcu_unlock;
 
 out:
 	spin_unlock(&qp->q.lock);
 out_rcu_unlock:
 	rcu_read_unlock();
+	if (head)
+		kfree_skb(head);
 	ipq_put(qp);
 }
 
@@ -231,7 +248,7 @@ static int ip_frag_too_far(struct ipq *qp)
 	end = atomic_inc_return(&peer->rid);
 	qp->rid = end;
 
-	rc = qp->q.fragments && (end - start) > max;
+	rc = qp->q.fragments_tail && (end - start) > max;
 
 	if (rc) {
 		struct net *net;
@@ -245,7 +262,6 @@ static int ip_frag_too_far(struct ipq *qp)
 
 static int ip_frag_reinit(struct ipq *qp)
 {
-	struct sk_buff *fp;
 	unsigned int sum_truesize = 0;
 
 	if (!mod_timer(&qp->q.timer, jiffies + qp->q.net->timeout)) {
@@ -253,20 +269,14 @@ static int ip_frag_reinit(struct ipq *qp)
 		return -ETIMEDOUT;
 	}
 
-	fp = qp->q.fragments;
-	do {
-		struct sk_buff *xp = fp->next;
-
-		sum_truesize += fp->truesize;
-		kfree_skb(fp);
-		fp = xp;
-	} while (fp);
+	sum_truesize = skb_rbtree_purge(&qp->q.rb_fragments);
 	sub_frag_mem_limit(qp->q.net, sum_truesize);
 
 	qp->q.flags = 0;
 	qp->q.len = 0;
 	qp->q.meat = 0;
 	qp->q.fragments = NULL;
+	qp->q.rb_fragments = RB_ROOT;
 	qp->q.fragments_tail = NULL;
 	qp->iif = 0;
 	qp->ecn = 0;
@@ -278,7 +288,8 @@ static int ip_frag_reinit(struct ipq *qp)
 static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
-	struct sk_buff *prev, *next;
+	struct rb_node **rbn, *parent;
+	struct sk_buff *skb1;
 	struct net_device *dev;
 	unsigned int fragsize;
 	int flags, offset;
@@ -341,58 +352,58 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	if (err)
 		goto err;
 
-	/* Find out which fragments are in front and at the back of us
-	 * in the chain of fragments so far.  We must know where to put
-	 * this fragment, right?
-	 */
-	prev = qp->q.fragments_tail;
-	if (!prev || prev->ip_defrag_offset < offset) {
-		next = NULL;
-		goto found;
-	}
-	prev = NULL;
-	for (next = qp->q.fragments; next != NULL; next = next->next) {
-		if (next->ip_defrag_offset >= offset)
-			break;	/* bingo! */
-		prev = next;
-	}
+	/* Note : skb->rbnode and skb->dev share the same location. */
+	dev = skb->dev;
+	/* Makes sure compiler wont do silly aliasing games */
+	barrier();
 
-found:
 	/* RFC5722, Section 4, amended by Errata ID : 3089
 	 *                          When reassembling an IPv6 datagram, if
 	 *   one or more its constituent fragments is determined to be an
 	 *   overlapping fragment, the entire datagram (and any constituent
 	 *   fragments) MUST be silently discarded.
 	 *
-	 * We do the same here for IPv4.
+	 * We do the same here for IPv4 (and increment an snmp counter).
 	 */
 
-	/* Is there an overlap with the previous fragment? */
-	if (prev &&
-	    (prev->ip_defrag_offset + prev->len) > offset)
-		goto discard_qp;
-
-	/* Is there an overlap with the next fragment? */
-	if (next && next->ip_defrag_offset < end)
-		goto discard_qp;
+	/* Find out where to put this fragment.  */
+	skb1 = qp->q.fragments_tail;
+	if (!skb1) {
+		/* This is the first fragment we've received. */
+		rb_link_node(&skb->rbnode, NULL, &qp->q.rb_fragments.rb_node);
+		qp->q.fragments_tail = skb;
+	} else if ((skb1->ip_defrag_offset + skb1->len) < end) {
+		/* This is the common/special case: skb goes to the end. */
+		/* Detect and discard overlaps. */
+		if (offset < (skb1->ip_defrag_offset + skb1->len))
+			goto discard_qp;
+		/* Insert after skb1. */
+		rb_link_node(&skb->rbnode, &skb1->rbnode, &skb1->rbnode.rb_right);
+		qp->q.fragments_tail = skb;
+	} else {
+		/* Binary search. Note that skb can become the first fragment, but
+		 * not the last (covered above). */
+		rbn = &qp->q.rb_fragments.rb_node;
+		do {
+			parent = *rbn;
+			skb1 = rb_to_skb(parent);
+			if (end <= skb1->ip_defrag_offset)
+				rbn = &parent->rb_left;
+			else if (offset >= skb1->ip_defrag_offset + skb1->len)
+				rbn = &parent->rb_right;
+			else /* Found an overlap with skb1. */
+				goto discard_qp;
+		} while (*rbn);
+		/* Here we have parent properly set, and rbn pointing to
+		 * one of its NULL left/right children. Insert skb. */
+		rb_link_node(&skb->rbnode, parent, rbn);
+	}
+	rb_insert_color(&skb->rbnode, &qp->q.rb_fragments);
 
-	/* Note : skb->ip_defrag_offset and skb->dev share the same location */
-	dev = skb->dev;
 	if (dev)
 		qp->iif = dev->ifindex;
-	/* Makes sure compiler wont do silly aliasing games */
-	barrier();
 	skb->ip_defrag_offset = offset;
 
-	/* Insert this fragment in the chain of fragments. */
-	skb->next = next;
-	if (!next)
-		qp->q.fragments_tail = skb;
-	if (prev)
-		prev->next = skb;
-	else
-		qp->q.fragments = skb;
-
 	qp->q.stamp = skb->tstamp;
 	qp->q.meat += skb->len;
 	qp->ecn |= ecn;
@@ -414,7 +425,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		unsigned long orefdst = skb->_skb_refdst;
 
 		skb->_skb_refdst = 0UL;
-		err = ip_frag_reasm(qp, prev, dev);
+		err = ip_frag_reasm(qp, skb, dev);
 		skb->_skb_refdst = orefdst;
 		return err;
 	}
@@ -431,15 +442,15 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	return err;
 }
 
-
 /* Build a new IP datagram from all its fragments. */
-
-static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
+static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 			 struct net_device *dev)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct iphdr *iph;
-	struct sk_buff *fp, *head = qp->q.fragments;
+	struct sk_buff *fp, *head = skb_rb_first(&qp->q.rb_fragments);
+	struct sk_buff **nextp; /* To build frag_list. */
+	struct rb_node *rbn;
 	int len;
 	int ihlen;
 	int err;
@@ -453,25 +464,20 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 		goto out_fail;
 	}
 	/* Make the one we just received the head. */
-	if (prev) {
-		head = prev->next;
-		fp = skb_clone(head, GFP_ATOMIC);
+	if (head != skb) {
+		fp = skb_clone(skb, GFP_ATOMIC);
 		if (!fp)
 			goto out_nomem;
-
-		fp->next = head->next;
-		if (!fp->next)
+		rb_replace_node(&skb->rbnode, &fp->rbnode, &qp->q.rb_fragments);
+		if (qp->q.fragments_tail == skb)
 			qp->q.fragments_tail = fp;
-		prev->next = fp;
-
-		skb_morph(head, qp->q.fragments);
-		head->next = qp->q.fragments->next;
-
-		consume_skb(qp->q.fragments);
-		qp->q.fragments = head;
+		skb_morph(skb, head);
+		rb_replace_node(&head->rbnode, &skb->rbnode,
+				&qp->q.rb_fragments);
+		consume_skb(head);
+		head = skb;
 	}
 
-	WARN_ON(!head);
 	WARN_ON(head->ip_defrag_offset != 0);
 
 	/* Allocate a new buffer for the datagram. */
@@ -496,24 +502,35 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 		clone = alloc_skb(0, GFP_ATOMIC);
 		if (!clone)
 			goto out_nomem;
-		clone->next = head->next;
-		head->next = clone;
 		skb_shinfo(clone)->frag_list = skb_shinfo(head)->frag_list;
 		skb_frag_list_init(head);
 		for (i = 0; i < skb_shinfo(head)->nr_frags; i++)
 			plen += skb_frag_size(&skb_shinfo(head)->frags[i]);
 		clone->len = clone->data_len = head->data_len - plen;
-		head->data_len -= clone->len;
-		head->len -= clone->len;
+		skb->truesize += clone->truesize;
 		clone->csum = 0;
 		clone->ip_summed = head->ip_summed;
 		add_frag_mem_limit(qp->q.net, clone->truesize);
+		skb_shinfo(head)->frag_list = clone;
+		nextp = &clone->next;
+	} else {
+		nextp = &skb_shinfo(head)->frag_list;
 	}
 
-	skb_shinfo(head)->frag_list = head->next;
 	skb_push(head, head->data - skb_network_header(head));
 
-	for (fp=head->next; fp; fp = fp->next) {
+	/* Traverse the tree in order, to build frag_list. */
+	rbn = rb_next(&head->rbnode);
+	rb_erase(&head->rbnode, &qp->q.rb_fragments);
+	while (rbn) {
+		struct rb_node *rbnext = rb_next(rbn);
+		fp = rb_to_skb(rbn);
+		rb_erase(rbn, &qp->q.rb_fragments);
+		rbn = rbnext;
+		*nextp = fp;
+		nextp = &fp->next;
+		fp->prev = NULL;
+		memset(&fp->rbnode, 0, sizeof(fp->rbnode));
 		head->data_len += fp->len;
 		head->len += fp->len;
 		if (head->ip_summed != fp->ip_summed)
@@ -524,7 +541,9 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 	}
 	sub_frag_mem_limit(qp->q.net, head->truesize);
 
+	*nextp = NULL;
 	head->next = NULL;
+	head->prev = NULL;
 	head->dev = dev;
 	head->tstamp = qp->q.stamp;
 	IPCB(head)->frag_max_size = max(qp->max_df_size, qp->q.max_size);
@@ -552,6 +571,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMOKS);
 	qp->q.fragments = NULL;
+	qp->q.rb_fragments = RB_ROOT;
 	qp->q.fragments_tail = NULL;
 	return 0;
 
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 1d2f07cde01a..82ce0d0f54bf 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -471,6 +471,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *prev,  struct net_devic
 					  head->csum);
 
 	fq->q.fragments = NULL;
+	fq->q.rb_fragments = RB_ROOT;
 	fq->q.fragments_tail = NULL;
 
 	return true;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index afaad60dc2ac..ede0061b6f5d 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -472,6 +472,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
 	__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMOKS);
 	rcu_read_unlock();
 	fq->q.fragments = NULL;
+	fq->q.rb_fragments = RB_ROOT;
 	fq->q.fragments_tail = NULL;
 	return 1;
 
-- 
2.18.0

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox