Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 4.14.y stable] xfrm: policy: remove pcpu policy cache
From: Florian Westphal @ 2019-08-22 13:37 UTC (permalink / raw)
  To: Greg KH
  Cc: Florian Westphal, stable, vakul.garg, netdev, Kristian Evensen,
	Steffen Klassert
In-Reply-To: <20190822130941.GA15754@kroah.com>

Greg KH <greg@kroah.com> wrote:
> On Thu, Aug 22, 2019 at 01:21:09PM +0200, Florian Westphal wrote:
> > commit e4db5b61c572475bbbcf63e3c8a2606bfccf2c9d upstream.
> > 
> > Kristian Evensen says:
> >   In a project I am involved in, we are running ipsec (Strongswan) on
> >   different mt7621-based routers. Each router is configured as an
> >   initiator and has around ~30 tunnels to different responders (running
> >   on misc. devices). Before the flow cache was removed (kernel 4.9), we
> >   got a combined throughput of around 70Mbit/s for all tunnels on one
> >   router. However, we recently switched to kernel 4.14 (4.14.48), and
> >   the total throughput is somewhere around 57Mbit/s (best-case). I.e., a
> >   drop of around 20%. Reverting the flow cache removal restores, as
> >   expected, performance levels to that of kernel 4.9.
> > 
> > When pcpu xdst exists, it has to be validated first before it can be
> > used.
> > 
> > A negative hit thus increases cost vs. no-cache.
> > 
> > As number of tunnels increases, hit rate decreases so this pcpu caching
> > isn't a viable strategy.
> > 
> > Furthermore, the xdst cache also needs to run with BH off, so when
> > removing this the bh disable/enable pairs can be removed too.
> > 
> > Kristian tested a 4.14.y backport of this change and reported
> > increased performance:
> > 
> >   In our tests, the throughput reduction has been reduced from around -20%
> >   to -5%. We also see that the overall throughput is independent of the
> >   number of tunnels, while before the throughput was reduced as the number
> >   of tunnels increased.
> > 
> > Reported-by: Kristian Evensen <kristian.evensen@gmail.com>
> > Signed-off-by: Florian Westphal <fw@strlen.de>
> > Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
> > ---
> >  Vakul Garg reports traffic going via ipsec tunnels will cause the kernel
> >  to spin in an infinite loop due to xfrm policy reference count
> >  overflowing and becoming 0.
> >  The refcount leak is in the pcpu cache.  Instead of fixing this, just
> >  remove the pcpu cache -- its not present in any other stable release.
> >  Vakul reported that this patch fixes the problem.
> > 
> >  There are no major deviations from the upstream revert; conflicts
> >  were only due to context.
> 
> Now queued up, does 4.9.y also need this?

No, 4.14 was the first kernel with this thing and its already gone in
4.19.

^ permalink raw reply

* RE: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Parav Pandit @ 2019-08-22 13:33 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Alex Williamson, Jiri Pirko, David S . Miller, Kirti Wankhede,
	Cornelia Huck, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	cjia, netdev@vger.kernel.org
In-Reply-To: <20190822121936.GC2276@nanopsycho.orion>



> -----Original Message-----
> From: Jiri Pirko <jiri@resnulli.us>
> Sent: Thursday, August 22, 2019 5:50 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
> <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>; Kirti
> Wankhede <kwankhede@nvidia.com>; Cornelia Huck <cohuck@redhat.com>;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia <cjia@nvidia.com>;
> netdev@vger.kernel.org
> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> 
> Thu, Aug 22, 2019 at 12:04:02PM CEST, parav@mellanox.com wrote:
> >
> >
> >> -----Original Message-----
> >> From: Jiri Pirko <jiri@resnulli.us>
> >> Sent: Thursday, August 22, 2019 3:28 PM
> >> To: Parav Pandit <parav@mellanox.com>
> >> Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
> >> <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>; Kirti
> >> Wankhede <kwankhede@nvidia.com>; Cornelia Huck
> <cohuck@redhat.com>;
> >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
> >> <cjia@nvidia.com>; netdev@vger.kernel.org
> >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> >>
> >> Thu, Aug 22, 2019 at 11:42:13AM CEST, parav@mellanox.com wrote:
> >> >
> >> >
> >> >> -----Original Message-----
> >> >> From: Jiri Pirko <jiri@resnulli.us>
> >> >> Sent: Thursday, August 22, 2019 2:59 PM
> >> >> To: Parav Pandit <parav@mellanox.com>
> >> >> Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
> >> >> <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>; Kirti
> >> >> Wankhede <kwankhede@nvidia.com>; Cornelia Huck
> >> <cohuck@redhat.com>;
> >> >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
> >> >> <cjia@nvidia.com>; netdev@vger.kernel.org
> >> >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> >> >>
> >> >> Wed, Aug 21, 2019 at 08:23:17AM CEST, parav@mellanox.com wrote:
> >> >> >
> >> >> >
> >> >> >> -----Original Message-----
> >> >> >> From: Alex Williamson <alex.williamson@redhat.com>
> >> >> >> Sent: Wednesday, August 21, 2019 10:56 AM
> >> >> >> To: Parav Pandit <parav@mellanox.com>
> >> >> >> Cc: Jiri Pirko <jiri@mellanox.com>; David S . Miller
> >> >> >> <davem@davemloft.net>; Kirti Wankhede <kwankhede@nvidia.com>;
> >> >> >> Cornelia Huck <cohuck@redhat.com>; kvm@vger.kernel.org;
> >> >> >> linux-kernel@vger.kernel.org; cjia <cjia@nvidia.com>;
> >> >> >> netdev@vger.kernel.org
> >> >> >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> >> >> >>
> >> >> >> > > > > Just an example of the alias, not proposing how it's set.
> >> >> >> > > > > In fact, proposing that the user does not set it,
> >> >> >> > > > > mdev-core provides one
> >> >> >> > > automatically.
> >> >> >> > > > >
> >> >> >> > > > > > > Since there seems to be some prefix overhead, as I
> >> >> >> > > > > > > ask about above in how many characters we actually
> >> >> >> > > > > > > have to work with in IFNAMESZ, maybe we start with
> >> >> >> > > > > > > 8 characters (matching your "index" namespace) and
> >> >> >> > > > > > > expand as necessary for
> >> >> disambiguation.
> >> >> >> > > > > > > If we can eliminate overhead in IFNAMESZ, let's start with
> 12.
> >> >> >> > > > > > > Thanks,
> >> >> >> > > > > > >
> >> >> >> > > > > > If user is going to choose the alias, why does it
> >> >> >> > > > > > have to be limited to
> >> >> >> sha1?
> >> >> >> > > > > > Or you just told it as an example?
> >> >> >> > > > > >
> >> >> >> > > > > > It can be an alpha-numeric string.
> >> >> >> > > > >
> >> >> >> > > > > No, I'm proposing a different solution where mdev-core
> >> >> >> > > > > creates an alias based on an abbreviated sha1.  The
> >> >> >> > > > > user does not provide the
> >> >> >> alias.
> >> >> >> > > > >
> >> >> >> > > > > > Instead of mdev imposing number of characters on the
> >> >> >> > > > > > alias, it should be best
> >> >> >> > > > > left to the user.
> >> >> >> > > > > > Because in future if netdev improves on the naming
> >> >> >> > > > > > scheme, mdev will be
> >> >> >> > > > > limiting it, which is not right.
> >> >> >> > > > > > So not restricting alias size seems right to me.
> >> >> >> > > > > > User configuring mdev for networking devices in a
> >> >> >> > > > > > given kernel knows what
> >> >> >> > > > > user is doing.
> >> >> >> > > > > > So user can choose alias name size as it finds suitable.
> >> >> >> > > > >
> >> >> >> > > > > That's not what I'm proposing, please read again.
> >> >> >> > > > > Thanks,
> >> >> >> > > >
> >> >> >> > > > I understood your point. But mdev doesn't know how user
> >> >> >> > > > is going to use
> >> >> >> > > udev/systemd to name the netdev.
> >> >> >> > > > So even if mdev chose to pick 12 characters, it could
> >> >> >> > > > result in
> >> collision.
> >> >> >> > > > Hence the proposal to provide the alias by the user, as
> >> >> >> > > > user know the best
> >> >> >> > > policy for its use case in the environment its using.
> >> >> >> > > > So 12 character sha1 method will still work by user.
> >> >> >> > >
> >> >> >> > > Haven't you already provided examples where certain drivers
> >> >> >> > > or subsystems have unique netdev prefixes?  If mdev
> >> >> >> > > provides a unique alias within the subsystem, couldn't we
> >> >> >> > > simply define a netdev prefix for the mdev subsystem and
> >> >> >> > > avoid all other collisions?  I'm not in favor of the user
> >> >> >> > > providing both a uuid and an alias/instance.  Thanks,
> >> >> >> > >
> >> >> >> > For a given prefix, say ens2f0, can two UUID->sha1 first 9
> >> >> >> > characters have
> >> >> >> collision?
> >> >> >>
> >> >> >> I think it would be a mistake to waste so many chars on a
> >> >> >> prefix, but
> >> >> >> 9 characters of sha1 likely wouldn't have a collision before we
> >> >> >> have 10s of thousands of devices.  Thanks,
> >> >> >>
> >> >> >> Alex
> >> >> >
> >> >> >Jiri, Dave,
> >> >> >Are you ok with it for devlink/netdev part?
> >> >> >Mdev core will create an alias from a UUID.
> >> >> >
> >> >> >This will be supplied during devlink port attr set such as,
> >> >> >
> >> >> >devlink_port_attrs_mdev_set(struct devlink_port *port, const char
> >> >> >*mdev_alias);
> >> >> >
> >> >> >This alias is used to generate representor netdev's phys_port_name.
> >> >> >This alias from the mdev device's sysfs will be used by the
> >> >> >udev/systemd to
> >> >> generate predicable netdev's name.
> >> >> >Example: enm<mdev_alias_first_12_chars>
> >> >>
> >> >> What happens in unlikely case of 2 UUIDs collide?
> >> >>
> >> >Since users sees two devices with same phys_port_name, user should
> >> >destroy
> >> recently created mdev and recreate mdev with different UUID?
> >>
> >> Driver should make sure phys port name wont collide,
> >So when mdev creation is initiated, mdev core calculates the alias and if there
> is any other mdev with same alias exist, it returns -EEXIST error before
> progressing further.
> >This way user will get to know upfront in event of collision before the mdev
> device gets created.
> >How about that?
> 
> Sounds fine to me. Now the question is how many chars do we want to have.
> 
12 characters from Alex's suggestion similar to git?

> >
> >
> >> in this case that it does
> >> not provide 2 same attrs for 2 different ports.
> >> Hmm, so the order of creation matters. That is not good.
> >>
> >> >>
> >> >> >I took Ethernet mdev as an example.
> >> >> >New prefix 'm' stands for mediated device.
> >> >> >Remaining 12 characters are first 12 chars of the mdev alias.
> >> >>
> >> >> Does this resolve the identification of devlink port representor?
> >> >Not sure if I understood your question correctly, attemping to answer
> below.
> >> >phys_port_name of devlink port is defined by the first 12 characters
> >> >of mdev
> >> alias.
> >> >> I assume you want to use the same 12(or so) chars, don't you?
> >> >Mdev's netdev will also use the same mdev alias from the sysfs to
> >> >rename
> >> netdev name from ethX to enm<mdev_alias>, where en=Etherenet,
> m=mdev.
> >> >
> >> >So yes, same 12 characters are use for mdev's netdev and mdev
> >> >devlink port's
> >> phys_port_name.
> >> >
> >> >Is that what are you asking?
> >>
> >> Yes. Then you have 3 chars to handle the rest of the name (pci, pf)...

^ permalink raw reply

* Re: [PATCH net-next 04/10] net: dsa: mv88e6xxx: prefix hidden register macro names with MV88E6XXX_
From: Andrew Lunn @ 2019-08-22 13:12 UTC (permalink / raw)
  To: Marek Behún
  Cc: netdev, Vivien Didelot, Florian Fainelli, Vladimir Oltean
In-Reply-To: <20190821232724.1544-5-marek.behun@nic.cz>

On Thu, Aug 22, 2019 at 01:27:18AM +0200, Marek Behún wrote:
> In order to be uniform with the rest of the driver, prepend hidden
> register macro names with the MV88E6XXX_ prefix.
> 
> Signed-off-by: Marek Behún <marek.behun@nic.cz>

For the idea of the patch:

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

But it could be the actual patch is different if the code moves to
port.c/.h.

    Andrew

^ permalink raw reply

* Re: [PATCH net-next 03/10] net: dsa: mv88e6xxx: move hidden registers operations in own file
From: Andrew Lunn @ 2019-08-22 13:10 UTC (permalink / raw)
  To: Marek Behún
  Cc: netdev, Vivien Didelot, Florian Fainelli, Vladimir Oltean
In-Reply-To: <20190821232724.1544-4-marek.behun@nic.cz>

On Thu, Aug 22, 2019 at 01:27:17AM +0200, Marek Behún wrote:
> This patch moves the functions operating on the hidden debug registers
> into it's own file, hidden.c.

Humm, actually...

These are in the port register space. Maybe it would be better to move
them into port.c/port.h?

What you really need is that they have global scope within the driver
so you can call them. So add the functions definitions to port.h.

Vivien, what do you think?

	Andrew

^ permalink raw reply

* Re: [PATCH 4.14.y stable] xfrm: policy: remove pcpu policy cache
From: Greg KH @ 2019-08-22 13:09 UTC (permalink / raw)
  To: Florian Westphal
  Cc: stable, vakul.garg, netdev, Kristian Evensen, Steffen Klassert
In-Reply-To: <20190822112109.13269-1-fw@strlen.de>

On Thu, Aug 22, 2019 at 01:21:09PM +0200, Florian Westphal wrote:
> commit e4db5b61c572475bbbcf63e3c8a2606bfccf2c9d upstream.
> 
> Kristian Evensen says:
>   In a project I am involved in, we are running ipsec (Strongswan) on
>   different mt7621-based routers. Each router is configured as an
>   initiator and has around ~30 tunnels to different responders (running
>   on misc. devices). Before the flow cache was removed (kernel 4.9), we
>   got a combined throughput of around 70Mbit/s for all tunnels on one
>   router. However, we recently switched to kernel 4.14 (4.14.48), and
>   the total throughput is somewhere around 57Mbit/s (best-case). I.e., a
>   drop of around 20%. Reverting the flow cache removal restores, as
>   expected, performance levels to that of kernel 4.9.
> 
> When pcpu xdst exists, it has to be validated first before it can be
> used.
> 
> A negative hit thus increases cost vs. no-cache.
> 
> As number of tunnels increases, hit rate decreases so this pcpu caching
> isn't a viable strategy.
> 
> Furthermore, the xdst cache also needs to run with BH off, so when
> removing this the bh disable/enable pairs can be removed too.
> 
> Kristian tested a 4.14.y backport of this change and reported
> increased performance:
> 
>   In our tests, the throughput reduction has been reduced from around -20%
>   to -5%. We also see that the overall throughput is independent of the
>   number of tunnels, while before the throughput was reduced as the number
>   of tunnels increased.
> 
> Reported-by: Kristian Evensen <kristian.evensen@gmail.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
> ---
>  Vakul Garg reports traffic going via ipsec tunnels will cause the kernel
>  to spin in an infinite loop due to xfrm policy reference count
>  overflowing and becoming 0.
>  The refcount leak is in the pcpu cache.  Instead of fixing this, just
>  remove the pcpu cache -- its not present in any other stable release.
>  Vakul reported that this patch fixes the problem.
> 
>  There are no major deviations from the upstream revert; conflicts
>  were only due to context.

Now queued up, does 4.9.y also need this?

thanks,

greg k-h

^ permalink raw reply

* Re: WARNING in rollback_registered_many (2)
From: Andrey Konovalov @ 2019-08-22 13:07 UTC (permalink / raw)
  To: syzbot, USB list
  Cc: Larry Finger, avagin, David S. Miller, devel, Eric W . Biederman,
	Eric Dumazet, florian.c.schilhabel, Greg Kroah-Hartman,
	Kai Heng Feng, ktkhai, LKML, netdev, straube.linux,
	syzkaller-bugs, tyhicks, Matthew Wilcox, Oliver Neukum,
	Alan Stern
In-Reply-To: <CAAeHK+w+asSQ3axWymToQ+uzPfEAYS2QimVBL85GuJRBtxkjDA@mail.gmail.com>

On Wed, Aug 7, 2019 at 4:03 PM Andrey Konovalov <andreyknvl@google.com> wrote:
>
> On Fri, Apr 12, 2019 at 1:32 PM Andrey Konovalov <andreyknvl@google.com> wrote:
> >
> > On Fri, Apr 12, 2019 at 1:29 AM syzbot
> > <syzbot+40918e4d826fb2ff9b96@syzkaller.appspotmail.com> wrote:
> > >
> > > syzbot has found a reproducer for the following crash on:
> > >
> > > HEAD commit:    9a33b369 usb-fuzzer: main usb gadget fuzzer driver
> > > git tree:       https://github.com/google/kasan/tree/usb-fuzzer
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=10d552b7200000
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=23e37f59d94ddd15
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=40918e4d826fb2ff9b96
> > > compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> > > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17a4c1af200000
> > > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=121b274b200000
> > >
> > > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > > Reported-by: syzbot+40918e4d826fb2ff9b96@syzkaller.appspotmail.com
> > >
> > > usb 1-1: r8712u: MAC Address from efuse = 00:e0:4c:87:00:00
> > > usb 1-1: r8712u: Loading firmware from "rtlwifi/rtl8712u.bin"
> > > usb 1-1: USB disconnect, device number 2
> > > usb 1-1: Direct firmware load for rtlwifi/rtl8712u.bin failed with error -2
> > > usb 1-1: r8712u: Firmware request failed
> > > WARNING: CPU: 0 PID: 575 at net/core/dev.c:8152
> > > rollback_registered_many+0x1f3/0xe70 net/core/dev.c:8152
> > > Kernel panic - not syncing: panic_on_warn set ...
> > > CPU: 0 PID: 575 Comm: kworker/0:4 Not tainted 5.1.0-rc4-319354-g9a33b36 #3
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > > Google 01/01/2011
> > > Workqueue: usb_hub_wq hub_event
> > > Call Trace:
> > >   __dump_stack lib/dump_stack.c:77 [inline]
> > >   dump_stack+0xe8/0x16e lib/dump_stack.c:113
> > >   panic+0x29d/0x5f2 kernel/panic.c:214
> > >   __warn.cold+0x20/0x48 kernel/panic.c:571
> > >   report_bug+0x262/0x2a0 lib/bug.c:186
> > >   fixup_bug arch/x86/kernel/traps.c:179 [inline]
> > >   fixup_bug arch/x86/kernel/traps.c:174 [inline]
> > >   do_error_trap+0x130/0x1f0 arch/x86/kernel/traps.c:272
> > >   do_invalid_op+0x37/0x40 arch/x86/kernel/traps.c:291
> > >   invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
> > > RIP: 0010:rollback_registered_many+0x1f3/0xe70 net/core/dev.c:8152
> > > Code: 05 00 00 31 ff 44 89 fe e8 5a 15 f3 f4 45 84 ff 0f 85 49 ff ff ff e8
> > > 1c 14 f3 f4 0f 1f 44 00 00 e8 12 14 f3 f4 e8 0d 14 f3 f4 <0f> 0b 4c 89 e7
> > > e8 33 72 f2 f6 31 ff 41 89 c4 89 c6 e8 27 15 f3 f4
> > > RSP: 0018:ffff88809d087698 EFLAGS: 00010293
> > > RAX: ffff88809d058000 RBX: ffff888096240000 RCX: ffffffff8c7eb146
> > > RDX: 0000000000000000 RSI: ffffffff8c7eb163 RDI: 0000000000000001
> > > RBP: ffff88809d0877c8 R08: ffff88809d058000 R09: fffffbfff2708111
> > > R10: fffffbfff2708110 R11: ffffffff93840887 R12: ffff888096240070
> > > R13: dffffc0000000000 R14: ffff88809d087758 R15: 0000000000000000
> > >   rollback_registered+0xf7/0x1c0 net/core/dev.c:8228
> > >   unregister_netdevice_queue net/core/dev.c:9275 [inline]
> > >   unregister_netdevice_queue+0x1dc/0x2b0 net/core/dev.c:9268
> > >   unregister_netdevice include/linux/netdevice.h:2655 [inline]
> > >   unregister_netdev+0x1d/0x30 net/core/dev.c:9316
> > >   r871xu_dev_remove+0xe7/0x223 drivers/staging/rtl8712/usb_intf.c:604
> > >   usb_unbind_interface+0x1c9/0x980 drivers/usb/core/driver.c:423
> > >   __device_release_driver drivers/base/dd.c:1082 [inline]
> > >   device_release_driver_internal+0x436/0x4f0 drivers/base/dd.c:1113
> > >   bus_remove_device+0x302/0x5c0 drivers/base/bus.c:556
> > >   device_del+0x467/0xb90 drivers/base/core.c:2269
> > >   usb_disable_device+0x242/0x790 drivers/usb/core/message.c:1235
> > >   usb_disconnect+0x298/0x870 drivers/usb/core/hub.c:2197
> > >   hub_port_connect drivers/usb/core/hub.c:4940 [inline]
> > >   hub_port_connect_change drivers/usb/core/hub.c:5204 [inline]
> > >   port_event drivers/usb/core/hub.c:5350 [inline]
> > >   hub_event+0xcd2/0x3b00 drivers/usb/core/hub.c:5432
> > >   process_one_work+0x90f/0x1580 kernel/workqueue.c:2269
> > >   process_scheduled_works kernel/workqueue.c:2331 [inline]
> > >   worker_thread+0x7b0/0xe20 kernel/workqueue.c:2417
> > >   kthread+0x313/0x420 kernel/kthread.c:253
> > >   ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
> > > Kernel Offset: disabled
> > > Rebooting in 86400 seconds..
> > >
> >
> > +linux-usb mailing list
>
> This USB bug is the most frequently triggered one right now with over
> 27k kernel crashes.

OK, this report is confusing. It was initially reported on the
upstream instance a long time ago, but since then has stopped
happening, as it was probably fixed. Then when we launched the USB
fuzzing instance, it has started producing similarly named reports
(with a different root cause though), and they were bucketed into this
bug by syzkaller. I've improved parsing titles of such reports in
syzkaller, so I'm invalidating this one, and syzbot should send a
properly attributed USB report soon.

#syz invalid

^ permalink raw reply

* Re: [PATCH net-next 03/10] net: dsa: mv88e6xxx: move hidden registers operations in own file
From: Andrew Lunn @ 2019-08-22 13:04 UTC (permalink / raw)
  To: Marek Behún
  Cc: netdev, Vivien Didelot, Florian Fainelli, Vladimir Oltean
In-Reply-To: <20190821232724.1544-4-marek.behun@nic.cz>

On Thu, Aug 22, 2019 at 01:27:17AM +0200, Marek Behún wrote:
> This patch moves the functions operating on the hidden debug registers
> into it's own file, hidden.c.
> 
> Signed-off-by: Marek Behún <marek.behun@nic.cz>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH] nexthops: remove redundant assignment to variable err
From: David Ahern @ 2019-08-22 12:59 UTC (permalink / raw)
  To: Colin King, David Ahern, David S . Miller, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, netdev
  Cc: kernel-janitors, linux-kernel
In-Reply-To: <20190822125340.30783-1-colin.king@canonical.com>

On 8/22/19 8:53 AM, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
> 
> Variable err is initialized to a value that is never read and it is
> re-assigned later. The initialization is redundant and can be removed.
> 
> Addresses-Coverity: ("Unused Value")
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> ---
>  net/ipv4/nexthop.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

Reviewed-by: David Ahern <dsahern@gmail.com>

^ permalink raw reply

* Re: [PATCH net-next 02/10] net: dsa: mv88e6xxx: remove extra newline
From: Andrew Lunn @ 2019-08-22 12:55 UTC (permalink / raw)
  To: Marek Behún
  Cc: netdev, Vivien Didelot, Florian Fainelli, Vladimir Oltean
In-Reply-To: <20190821232724.1544-3-marek.behun@nic.cz>

On Thu, Aug 22, 2019 at 01:27:16AM +0200, Marek Behún wrote:
> There are two newlines separating mv88e6390_hidden_wait and
> mv88e6390_hidden_read. Remove one.
> 
> Signed-off-by: Marek Behún <marek.behun@nic.cz>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH v3] tun: fix use-after-free when register netdev failed
From: Yang Yingliang @ 2019-08-22 12:55 UTC (permalink / raw)
  To: Jason Wang, David Miller
  Cc: netdev, eric.dumazet, xiyou.wangcong, weiyongjun1
In-Reply-To: <5D5E3133.2070108@huawei.com>



On 2019/8/22 14:07, Yang Yingliang wrote:
>
>
> On 2019/8/22 10:13, Jason Wang wrote:
>>
>> On 2019/8/20 上午10:28, Jason Wang wrote:
>>>
>>> On 2019/8/20 上午9:25, David Miller wrote:
>>>> From: Yang Yingliang <yangyingliang@huawei.com>
>>>> Date: Mon, 19 Aug 2019 21:31:19 +0800
>>>>
>>>>> Call tun_attach() after register_netdevice() to make sure tfile->tun
>>>>> is not published until the netdevice is registered. So the read/write
>>>>> thread can not use the tun pointer that may freed by free_netdev().
>>>>> (The tun and dev pointer are allocated by alloc_netdev_mqs(), they 
>>>>> can
>>>>> be freed by netdev_freemem().)
>>>> register_netdevice() must always be the last operation in the order of
>>>> network device setup.
>>>>
>>>> At the point register_netdevice() is called, the device is visible 
>>>> globally
>>>> and therefore all of it's software state must be fully initialized and
>>>> ready for us.
>>>>
>>>> You're going to have to find another solution to these problems.
>>>
>>>
>>> The device is loosely coupled with sockets/queues. Each side is 
>>> allowed to be go away without caring the other side. So in this 
>>> case, there's a small window that network stack think the device has 
>>> one queue but actually not, the code can then safely drop them. 
>>> Maybe it's ok here with some comments?
>>>
>>> Or if not, we can try to hold the device before tun_attach and drop 
>>> it after register_netdevice().
>>
>>
>> Hi Yang:
>>
>> I think maybe we can try to hold refcnt instead of playing real num 
>> queues here. Do you want to post a V4?
> I think the refcnt can prevent freeing the memory in this case.
> When register_netdevice() failed, free_netdev() will be called directly,
> dev->pcpu_refcnt and dev are freed without checking refcnt of dev.
How about using patch-v1 that using a flag to check whether the device 
registered successfully.

>
>>
>> Thanks
>>
>>
>>>
>>> Thanks
>>>
>>
>> .
>>
>
>
>
> .
>



^ permalink raw reply

* [PATCH] nexthops: remove redundant assignment to variable err
From: Colin King @ 2019-08-22 12:53 UTC (permalink / raw)
  To: David Ahern, David S . Miller, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, netdev
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

Variable err is initialized to a value that is never read and it is
re-assigned later. The initialization is redundant and can be removed.

Addresses-Coverity: ("Unused Value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 net/ipv4/nexthop.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 5fe5a3981d43..fc34fd1668d6 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -1151,7 +1151,7 @@ static int nh_create_ipv4(struct net *net, struct nexthop *nh,
 		.fc_encap_type = cfg->nh_encap_type,
 	};
 	u32 tb_id = l3mdev_fib_table(cfg->dev);
-	int err = -EINVAL;
+	int err;
 
 	err = fib_nh_init(net, fib_nh, &fib_cfg, 1, extack);
 	if (err) {
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next 10/10] net: sched: flower: don't take rtnl lock for cls hw offloads API
From: Vlad Buslov @ 2019-08-22 12:43 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, davem, jakub.kicinski, pablo,
	Vlad Buslov, Jiri Pirko
In-Reply-To: <20190822124353.16902-1-vladbu@mellanox.com>

Don't manually take rtnl lock in flower classifier before calling cls
hardware offloads API. Instead, pass rtnl lock status via 'rtnl_held'
parameter.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 net/sched/cls_flower.c | 55 ++++++++++++------------------------------
 1 file changed, 16 insertions(+), 39 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 1cc68702f93d..786427997be6 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -412,18 +412,13 @@ static void fl_hw_destroy_filter(struct tcf_proto *tp, struct cls_fl_filter *f,
 	struct tcf_block *block = tp->chain->block;
 	struct flow_cls_offload cls_flower = {};
 
-	if (!rtnl_held)
-		rtnl_lock();
-
 	tc_cls_common_offload_init(&cls_flower.common, tp, f->flags, extack);
 	cls_flower.command = FLOW_CLS_DESTROY;
 	cls_flower.cookie = (unsigned long) f;
 
 	tc_setup_cb_destroy(block, tp, TC_SETUP_CLSFLOWER, &cls_flower, false,
-			    &f->flags, &f->in_hw_count, true);
+			    &f->flags, &f->in_hw_count, rtnl_held);
 
-	if (!rtnl_held)
-		rtnl_unlock();
 }
 
 static int fl_hw_replace_filter(struct tcf_proto *tp,
@@ -435,14 +430,9 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
 	bool skip_sw = tc_skip_sw(f->flags);
 	int err = 0;
 
-	if (!rtnl_held)
-		rtnl_lock();
-
 	cls_flower.rule = flow_rule_alloc(tcf_exts_num_actions(&f->exts));
-	if (!cls_flower.rule) {
-		err = -ENOMEM;
-		goto errout;
-	}
+	if (!cls_flower.rule)
+		return -ENOMEM;
 
 	tc_cls_common_offload_init(&cls_flower.common, tp, f->flags, extack);
 	cls_flower.command = FLOW_CLS_REPLACE;
@@ -453,38 +443,30 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
 	cls_flower.classid = f->res.classid;
 
 	err = tc_setup_flow_action(&cls_flower.rule->action, &f->exts,
-				   true);
+				   rtnl_held);
 	if (err) {
 		kfree(cls_flower.rule);
-		if (skip_sw)
+		if (skip_sw) {
 			NL_SET_ERR_MSG_MOD(extack, "Failed to setup flow action");
-		else
-			err = 0;
-		goto errout;
+			return err;
+		}
+		return 0;
 	}
 
 	err = tc_setup_cb_add(block, tp, TC_SETUP_CLSFLOWER, &cls_flower,
-			      skip_sw, &f->flags, &f->in_hw_count, true);
+			      skip_sw, &f->flags, &f->in_hw_count, rtnl_held);
 	tc_cleanup_flow_action(&cls_flower.rule->action);
 	kfree(cls_flower.rule);
 
 	if (err < 0) {
-		fl_hw_destroy_filter(tp, f, true, NULL);
-		goto errout;
-	} else if (err > 0) {
-		err = 0;
-	}
-
-	if (skip_sw && !(f->flags & TCA_CLS_FLAGS_IN_HW)) {
-		err = -EINVAL;
-		goto errout;
+		fl_hw_destroy_filter(tp, f, rtnl_held, NULL);
+		return err;
 	}
 
-errout:
-	if (!rtnl_held)
-		rtnl_unlock();
+	if (skip_sw && !(f->flags & TCA_CLS_FLAGS_IN_HW))
+		return -EINVAL;
 
-	return err;
+	return 0;
 }
 
 static void fl_hw_update_stats(struct tcf_proto *tp, struct cls_fl_filter *f,
@@ -493,22 +475,17 @@ static void fl_hw_update_stats(struct tcf_proto *tp, struct cls_fl_filter *f,
 	struct tcf_block *block = tp->chain->block;
 	struct flow_cls_offload cls_flower = {};
 
-	if (!rtnl_held)
-		rtnl_lock();
-
 	tc_cls_common_offload_init(&cls_flower.common, tp, f->flags, NULL);
 	cls_flower.command = FLOW_CLS_STATS;
 	cls_flower.cookie = (unsigned long) f;
 	cls_flower.classid = f->res.classid;
 
-	tc_setup_cb_call(block, TC_SETUP_CLSFLOWER, &cls_flower, false, true);
+	tc_setup_cb_call(block, TC_SETUP_CLSFLOWER, &cls_flower, false,
+			 rtnl_held);
 
 	tcf_exts_stats_update(&f->exts, cls_flower.stats.bytes,
 			      cls_flower.stats.pkts,
 			      cls_flower.stats.lastused);
-
-	if (!rtnl_held)
-		rtnl_unlock();
 }
 
 static void __fl_put(struct cls_fl_filter *f)
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 07/10] net: sched: take rtnl lock in tc_setup_flow_action()
From: Vlad Buslov @ 2019-08-22 12:43 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, davem, jakub.kicinski, pablo,
	Vlad Buslov, Jiri Pirko
In-Reply-To: <20190822124353.16902-1-vladbu@mellanox.com>

In order to allow using new flow_action infrastructure from unlocked
classifiers, modify tc_setup_flow_action() to accept new 'rtnl_held'
argument. Take rtnl lock before accessing tc_action data. This is necessary
to protect from concurrent action replace.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/pkt_cls.h    |  2 +-
 net/sched/cls_api.c      | 17 +++++++++++++----
 net/sched/cls_flower.c   |  6 ++++--
 net/sched/cls_matchall.c |  4 ++--
 4 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index a5937fb3c4b2..4d861426507c 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -504,7 +504,7 @@ tcf_match_indev(struct sk_buff *skb, int ifindex)
 }
 
 int tc_setup_flow_action(struct flow_action *flow_action,
-			 const struct tcf_exts *exts);
+			 const struct tcf_exts *exts, bool rtnl_held);
 int tc_setup_cb_call(struct tcf_block *block, enum tc_setup_type type,
 		     void *type_data, bool err_stop, bool rtnl_held);
 int tc_setup_cb_add(struct tcf_block *block, struct tcf_proto *tp,
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index bda42f1b5514..31680493b2b1 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3245,14 +3245,17 @@ int tc_setup_cb_destroy(struct tcf_block *block, struct tcf_proto *tp,
 EXPORT_SYMBOL(tc_setup_cb_destroy);
 
 int tc_setup_flow_action(struct flow_action *flow_action,
-			 const struct tcf_exts *exts)
+			 const struct tcf_exts *exts, bool rtnl_held)
 {
 	const struct tc_action *act;
-	int i, j, k;
+	int i, j, k, err = 0;
 
 	if (!exts)
 		return 0;
 
+	if (!rtnl_held)
+		rtnl_lock();
+
 	j = 0;
 	tcf_exts_for_each_action(i, act, exts) {
 		struct flow_action_entry *entry;
@@ -3297,6 +3300,7 @@ int tc_setup_flow_action(struct flow_action *flow_action,
 				entry->vlan.prio = tcf_vlan_push_prio(act);
 				break;
 			default:
+				err = -EOPNOTSUPP;
 				goto err_out;
 			}
 		} else if (is_tcf_tunnel_set(act)) {
@@ -3314,6 +3318,7 @@ int tc_setup_flow_action(struct flow_action *flow_action,
 					entry->id = FLOW_ACTION_ADD;
 					break;
 				default:
+					err = -EOPNOTSUPP;
 					goto err_out;
 				}
 				entry->mangle.htype = tcf_pedit_htype(act, k);
@@ -3372,15 +3377,19 @@ int tc_setup_flow_action(struct flow_action *flow_action,
 			entry->id = FLOW_ACTION_PTYPE;
 			entry->ptype = tcf_skbedit_ptype(act);
 		} else {
+			err = -EOPNOTSUPP;
 			goto err_out;
 		}
 
 		if (!is_tcf_pedit(act))
 			j++;
 	}
-	return 0;
+
 err_out:
-	return -EOPNOTSUPP;
+	if (!rtnl_held)
+		rtnl_unlock();
+
+	return err;
 }
 EXPORT_SYMBOL(tc_setup_flow_action);
 
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index cd1686a5abe7..006759b9c78a 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -452,7 +452,8 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
 	cls_flower.rule->match.key = &f->mkey;
 	cls_flower.classid = f->res.classid;
 
-	err = tc_setup_flow_action(&cls_flower.rule->action, &f->exts);
+	err = tc_setup_flow_action(&cls_flower.rule->action, &f->exts,
+				   true);
 	if (err) {
 		kfree(cls_flower.rule);
 		if (skip_sw)
@@ -1821,7 +1822,8 @@ static int fl_reoffload(struct tcf_proto *tp, bool add, flow_setup_cb_t *cb,
 		cls_flower.rule->match.mask = &f->mask->key;
 		cls_flower.rule->match.key = &f->mkey;
 
-		err = tc_setup_flow_action(&cls_flower.rule->action, &f->exts);
+		err = tc_setup_flow_action(&cls_flower.rule->action, &f->exts,
+					   true);
 		if (err) {
 			kfree(cls_flower.rule);
 			if (tc_skip_sw(f->flags)) {
diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
index d38e329d71bd..52e6ef7538b7 100644
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -97,7 +97,7 @@ static int mall_replace_hw_filter(struct tcf_proto *tp,
 	cls_mall.command = TC_CLSMATCHALL_REPLACE;
 	cls_mall.cookie = cookie;
 
-	err = tc_setup_flow_action(&cls_mall.rule->action, &head->exts);
+	err = tc_setup_flow_action(&cls_mall.rule->action, &head->exts, true);
 	if (err) {
 		kfree(cls_mall.rule);
 		mall_destroy_hw_filter(tp, head, cookie, NULL);
@@ -300,7 +300,7 @@ static int mall_reoffload(struct tcf_proto *tp, bool add, flow_setup_cb_t *cb,
 		TC_CLSMATCHALL_REPLACE : TC_CLSMATCHALL_DESTROY;
 	cls_mall.cookie = (unsigned long)head;
 
-	err = tc_setup_flow_action(&cls_mall.rule->action, &head->exts);
+	err = tc_setup_flow_action(&cls_mall.rule->action, &head->exts, true);
 	if (err) {
 		kfree(cls_mall.rule);
 		if (add && tc_skip_sw(head->flags)) {
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 02/10] net: sched: change tcf block offload counter type to atomic_t
From: Vlad Buslov @ 2019-08-22 12:43 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, davem, jakub.kicinski, pablo,
	Vlad Buslov, Jiri Pirko
In-Reply-To: <20190822124353.16902-1-vladbu@mellanox.com>

As a preparation for running proto ops functions without rtnl lock, change
offload counter type to atomic. This is necessary to allow updating the
counter by multiple concurrent users when offloading filters to hardware
from unlocked classifiers.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/sch_generic.h | 7 ++++---
 net/sched/cls_api.c       | 2 +-
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index a3eaf5f9d28f..d778c502decd 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -14,6 +14,7 @@
 #include <linux/workqueue.h>
 #include <linux/mutex.h>
 #include <linux/rwsem.h>
+#include <linux/atomic.h>
 #include <net/gen_stats.h>
 #include <net/rtnetlink.h>
 #include <net/flow_offload.h>
@@ -401,7 +402,7 @@ struct tcf_block {
 	struct flow_block flow_block;
 	struct list_head owner_list;
 	bool keep_dst;
-	unsigned int offloadcnt; /* Number of oddloaded filters */
+	atomic_t offloadcnt; /* Number of oddloaded filters */
 	unsigned int nooffloaddevcnt; /* Number of devs unable to do offload */
 	struct {
 		struct tcf_chain *chain;
@@ -443,7 +444,7 @@ static inline void tcf_block_offload_inc(struct tcf_block *block, u32 *flags)
 	if (*flags & TCA_CLS_FLAGS_IN_HW)
 		return;
 	*flags |= TCA_CLS_FLAGS_IN_HW;
-	block->offloadcnt++;
+	atomic_inc(&block->offloadcnt);
 }
 
 static inline void tcf_block_offload_dec(struct tcf_block *block, u32 *flags)
@@ -451,7 +452,7 @@ static inline void tcf_block_offload_dec(struct tcf_block *block, u32 *flags)
 	if (!(*flags & TCA_CLS_FLAGS_IN_HW))
 		return;
 	*flags &= ~TCA_CLS_FLAGS_IN_HW;
-	block->offloadcnt--;
+	atomic_dec(&block->offloadcnt);
 }
 
 static inline void
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index a3b07c6e3f53..8502bd006b37 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -629,7 +629,7 @@ static void tc_indr_block_call(struct tcf_block *block,
 
 static bool tcf_block_offload_in_use(struct tcf_block *block)
 {
-	return block->offloadcnt;
+	return atomic_read(&block->offloadcnt);
 }
 
 static int tcf_block_offload_cmd(struct tcf_block *block,
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 08/10] net: sched: take reference to action dev before calling offloads
From: Vlad Buslov @ 2019-08-22 12:43 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, davem, jakub.kicinski, pablo,
	Vlad Buslov, Jiri Pirko
In-Reply-To: <20190822124353.16902-1-vladbu@mellanox.com>

In order to remove dependency on rtnl lock when calling hardware offload
API, take reference to action mirred dev when initializing flow_action
structure in tc_setup_flow_action(). Implement function
tc_cleanup_flow_action(), use it to release the device after hardware
offload API is done using it.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/pkt_cls.h  |  2 ++
 net/sched/cls_api.c    | 32 ++++++++++++++++++++++++++++++++
 net/sched/cls_flower.c |  2 ++
 3 files changed, 36 insertions(+)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 4d861426507c..b20017793952 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -505,6 +505,8 @@ tcf_match_indev(struct sk_buff *skb, int ifindex)
 
 int tc_setup_flow_action(struct flow_action *flow_action,
 			 const struct tcf_exts *exts, bool rtnl_held);
+void tc_cleanup_flow_action(struct flow_action *flow_action);
+
 int tc_setup_cb_call(struct tcf_block *block, enum tc_setup_type type,
 		     void *type_data, bool err_stop, bool rtnl_held);
 int tc_setup_cb_add(struct tcf_block *block, struct tcf_proto *tp,
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 31680493b2b1..96adbff9e845 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3244,6 +3244,27 @@ int tc_setup_cb_destroy(struct tcf_block *block, struct tcf_proto *tp,
 }
 EXPORT_SYMBOL(tc_setup_cb_destroy);
 
+void tc_cleanup_flow_action(struct flow_action *flow_action)
+{
+	struct flow_action_entry *entry;
+	int i;
+
+	flow_action_for_each(i, entry, flow_action) {
+		switch (entry->id) {
+		case FLOW_ACTION_REDIRECT:
+		case FLOW_ACTION_MIRRED:
+		case FLOW_ACTION_REDIRECT_INGRESS:
+		case FLOW_ACTION_MIRRED_INGRESS:
+			if (entry->dev)
+				dev_put(entry->dev);
+			break;
+		default:
+			break;
+		}
+	}
+}
+EXPORT_SYMBOL(tc_cleanup_flow_action);
+
 int tc_setup_flow_action(struct flow_action *flow_action,
 			 const struct tcf_exts *exts, bool rtnl_held)
 {
@@ -3273,15 +3294,23 @@ int tc_setup_flow_action(struct flow_action *flow_action,
 		} else if (is_tcf_mirred_egress_redirect(act)) {
 			entry->id = FLOW_ACTION_REDIRECT;
 			entry->dev = tcf_mirred_dev(act);
+			if (entry->dev)
+				dev_hold(entry->dev);
 		} else if (is_tcf_mirred_egress_mirror(act)) {
 			entry->id = FLOW_ACTION_MIRRED;
 			entry->dev = tcf_mirred_dev(act);
+			if (entry->dev)
+				dev_hold(entry->dev);
 		} else if (is_tcf_mirred_ingress_redirect(act)) {
 			entry->id = FLOW_ACTION_REDIRECT_INGRESS;
 			entry->dev = tcf_mirred_dev(act);
+			if (entry->dev)
+				dev_hold(entry->dev);
 		} else if (is_tcf_mirred_ingress_mirror(act)) {
 			entry->id = FLOW_ACTION_MIRRED_INGRESS;
 			entry->dev = tcf_mirred_dev(act);
+			if (entry->dev)
+				dev_hold(entry->dev);
 		} else if (is_tcf_vlan(act)) {
 			switch (tcf_vlan_action(act)) {
 			case TCA_VLAN_ACT_PUSH:
@@ -3389,6 +3418,9 @@ int tc_setup_flow_action(struct flow_action *flow_action,
 	if (!rtnl_held)
 		rtnl_unlock();
 
+	if (err)
+		tc_cleanup_flow_action(flow_action);
+
 	return err;
 }
 EXPORT_SYMBOL(tc_setup_flow_action);
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 006759b9c78a..1cc68702f93d 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -465,6 +465,7 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
 
 	err = tc_setup_cb_add(block, tp, TC_SETUP_CLSFLOWER, &cls_flower,
 			      skip_sw, &f->flags, &f->in_hw_count, true);
+	tc_cleanup_flow_action(&cls_flower.rule->action);
 	kfree(cls_flower.rule);
 
 	if (err < 0) {
@@ -1837,6 +1838,7 @@ static int fl_reoffload(struct tcf_proto *tp, bool add, flow_setup_cb_t *cb,
 		cls_flower.classid = f->res.classid;
 
 		err = cb(TC_SETUP_CLSFLOWER, &cls_flower, cb_priv);
+		tc_cleanup_flow_action(&cls_flower.rule->action);
 		kfree(cls_flower.rule);
 
 		if (err) {
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 03/10] net: sched: refactor block offloads counter usage
From: Vlad Buslov @ 2019-08-22 12:43 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, davem, jakub.kicinski, pablo,
	Vlad Buslov, Jiri Pirko
In-Reply-To: <20190822124353.16902-1-vladbu@mellanox.com>

Without rtnl lock protection filters can no longer safely manage block
offloads counter themselves. Refactor cls API to protect block offloadcnt
with tcf_block->cb_lock that is already used to protect driver callback
list and nooffloaddevcnt counter. The counter can be modified by concurrent
tasks by new functions that execute block callbacks (which is safe with
previous patch that changed its type to atomic_t), however, block
bind/unbind code that checks the counter value takes cb_lock in write mode
to exclude any concurrent modifications. This approach prevents race
conditions between bind/unbind and callback execution code but allows for
concurrency for tc rule update path.

Move block offload counter, filter in hardware counter and filter flags
management from classifiers into cls hardware offloads API. Make functions
tcf_block_offload_inc() and tcf_block_offload_dec() to be cls API private.
Implement following new cls API to be used instead:

  tc_setup_cb_add() - non-destructive filter add. If filter that wasn't
  already in hardware is successfully offloaded, increment block offloads
  counter, set filter in hardware counter and flag. On failure, previously
  offloaded filter is considered to be intact and offloads counter is not
  decremented.

  tc_setup_cb_replace() - destructive filter replace. Release existing
  filter block offload counter and reset its in hardware counter and flag.
  Set new filter in hardware counter and flag. On failure, previously
  offloaded filter is considered to be destroyed and offload counter is
  decremented.

  tc_setup_cb_destroy() - filter destroy. Unconditionally decrement block
  offloads counter.

Refactor all offload-capable classifiers to atomically offload filters to
hardware, change block offload counter, and set filter in hardware counter
and flag by means of the new cls API functions.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/pkt_cls.h     |  13 +++-
 include/net/sch_generic.h |  32 +-------
 net/sched/cls_api.c       | 155 ++++++++++++++++++++++++++++++++++----
 net/sched/cls_bpf.c       |  22 +++---
 net/sched/cls_flower.c    |  23 +++---
 net/sched/cls_matchall.c  |  15 ++--
 net/sched/cls_u32.c       |  17 ++---
 7 files changed, 193 insertions(+), 84 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 64999ffcb486..a5937fb3c4b2 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -506,7 +506,18 @@ tcf_match_indev(struct sk_buff *skb, int ifindex)
 int tc_setup_flow_action(struct flow_action *flow_action,
 			 const struct tcf_exts *exts);
 int tc_setup_cb_call(struct tcf_block *block, enum tc_setup_type type,
-		     void *type_data, bool err_stop);
+		     void *type_data, bool err_stop, bool rtnl_held);
+int tc_setup_cb_add(struct tcf_block *block, struct tcf_proto *tp,
+		    enum tc_setup_type type, void *type_data, bool err_stop,
+		    u32 *flags, unsigned int *in_hw_count, bool rtnl_held);
+int tc_setup_cb_replace(struct tcf_block *block, struct tcf_proto *tp,
+			enum tc_setup_type type, void *type_data, bool err_stop,
+			u32 *old_flags, unsigned int *old_in_hw_count,
+			u32 *new_flags, unsigned int *new_in_hw_count,
+			bool rtnl_held);
+int tc_setup_cb_destroy(struct tcf_block *block, struct tcf_proto *tp,
+			enum tc_setup_type type, void *type_data, bool err_stop,
+			u32 *flags, unsigned int *in_hw_count, bool rtnl_held);
 unsigned int tcf_exts_num_actions(struct tcf_exts *exts);
 
 struct tc_cls_u32_knode {
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index d778c502decd..373f9476c1de 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -439,36 +439,8 @@ static inline bool lockdep_tcf_proto_is_locked(struct tcf_proto *tp)
 #define tcf_proto_dereference(p, tp)					\
 	rcu_dereference_protected(p, lockdep_tcf_proto_is_locked(tp))
 
-static inline void tcf_block_offload_inc(struct tcf_block *block, u32 *flags)
-{
-	if (*flags & TCA_CLS_FLAGS_IN_HW)
-		return;
-	*flags |= TCA_CLS_FLAGS_IN_HW;
-	atomic_inc(&block->offloadcnt);
-}
-
-static inline void tcf_block_offload_dec(struct tcf_block *block, u32 *flags)
-{
-	if (!(*flags & TCA_CLS_FLAGS_IN_HW))
-		return;
-	*flags &= ~TCA_CLS_FLAGS_IN_HW;
-	atomic_dec(&block->offloadcnt);
-}
-
-static inline void
-tc_cls_offload_cnt_update(struct tcf_block *block, u32 *cnt,
-			  u32 *flags, bool add)
-{
-	if (add) {
-		if (!*cnt)
-			tcf_block_offload_inc(block, flags);
-		(*cnt)++;
-	} else {
-		(*cnt)--;
-		if (!*cnt)
-			tcf_block_offload_dec(block, flags);
-	}
-}
+void tc_cls_offload_cnt_update(struct tcf_block *block, struct tcf_proto *tp,
+			       u32 *cnt, u32 *flags, u32 diff, bool add);
 
 static inline void qdisc_cb_private_validate(const struct sk_buff *skb, int sz)
 {
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 8502bd006b37..4215c849f4a3 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3000,13 +3000,97 @@ int tcf_exts_dump_stats(struct sk_buff *skb, struct tcf_exts *exts)
 }
 EXPORT_SYMBOL(tcf_exts_dump_stats);
 
-int tc_setup_cb_call(struct tcf_block *block, enum tc_setup_type type,
-		     void *type_data, bool err_stop)
+static void tcf_block_offload_inc(struct tcf_block *block, u32 *flags)
+{
+	if (*flags & TCA_CLS_FLAGS_IN_HW)
+		return;
+	*flags |= TCA_CLS_FLAGS_IN_HW;
+	atomic_inc(&block->offloadcnt);
+}
+
+static void tcf_block_offload_dec(struct tcf_block *block, u32 *flags)
+{
+	if (!(*flags & TCA_CLS_FLAGS_IN_HW))
+		return;
+	*flags &= ~TCA_CLS_FLAGS_IN_HW;
+	atomic_dec(&block->offloadcnt);
+}
+
+void tc_cls_offload_cnt_update(struct tcf_block *block, struct tcf_proto *tp,
+			       u32 *cnt, u32 *flags, u32 diff, bool add)
+{
+	lockdep_assert_held(&block->cb_lock);
+
+	spin_lock(&tp->lock);
+	if (add) {
+		if (!*cnt)
+			tcf_block_offload_inc(block, flags);
+		(*cnt) += diff;
+	} else {
+		(*cnt) -= diff;
+		if (!*cnt)
+			tcf_block_offload_dec(block, flags);
+	}
+	spin_unlock(&tp->lock);
+}
+EXPORT_SYMBOL(tc_cls_offload_cnt_update);
+
+static void
+tc_cls_offload_cnt_reset(struct tcf_block *block, struct tcf_proto *tp,
+			 u32 *cnt, u32 *flags)
+{
+	lockdep_assert_held(&block->cb_lock);
+
+	spin_lock(&tp->lock);
+	tcf_block_offload_dec(block, flags);
+	(*cnt) = 0;
+	spin_unlock(&tp->lock);
+}
+
+static int
+__tc_setup_cb_call(struct tcf_block *block, enum tc_setup_type type,
+		   void *type_data, bool err_stop)
 {
 	struct flow_block_cb *block_cb;
 	int ok_count = 0;
 	int err;
 
+	list_for_each_entry(block_cb, &block->flow_block.cb_list, list) {
+		err = block_cb->cb(type, type_data, block_cb->cb_priv);
+		if (err) {
+			if (err_stop)
+				return err;
+		} else {
+			ok_count++;
+		}
+	}
+	return ok_count;
+}
+
+int tc_setup_cb_call(struct tcf_block *block, enum tc_setup_type type,
+		     void *type_data, bool err_stop, bool rtnl_held)
+{
+	int ok_count;
+
+	down_read(&block->cb_lock);
+	ok_count = __tc_setup_cb_call(block, type, type_data, err_stop);
+	up_read(&block->cb_lock);
+	return ok_count;
+}
+EXPORT_SYMBOL(tc_setup_cb_call);
+
+/* Non-destructive filter add. If filter that wasn't already in hardware is
+ * successfully offloaded, increment block offloads counter. On failure,
+ * previously offloaded filter is considered to be intact and offloads counter
+ * is not decremented.
+ */
+
+int tc_setup_cb_add(struct tcf_block *block, struct tcf_proto *tp,
+		    enum tc_setup_type type, void *type_data, bool err_stop,
+		    u32 *flags, unsigned int *in_hw_count, bool rtnl_held)
+{
+	int ok_count;
+
 	down_read(&block->cb_lock);
 	/* Make sure all netdevs sharing this block are offload-capable. */
 	if (block->nooffloaddevcnt && err_stop) {
@@ -3014,22 +3098,67 @@ int tc_setup_cb_call(struct tcf_block *block, enum tc_setup_type type,
 		goto errout;
 	}
 
-	list_for_each_entry(block_cb, &block->flow_block.cb_list, list) {
-		err = block_cb->cb(type, type_data, block_cb->cb_priv);
-		if (err) {
-			if (err_stop) {
-				ok_count = err;
-				goto errout;
-			}
-		} else {
-			ok_count++;
-		}
+	ok_count = __tc_setup_cb_call(block, type, type_data, err_stop);
+	if (ok_count > 0)
+		tc_cls_offload_cnt_update(block, tp, in_hw_count, flags,
+					  ok_count, true);
+errout:
+	up_read(&block->cb_lock);
+	return ok_count;
+}
+EXPORT_SYMBOL(tc_setup_cb_add);
+
+/* Destructive filter replace. If filter that wasn't already in hardware is
+ * successfully offloaded, increment block offload counter. On failure,
+ * previously offloaded filter is considered to be destroyed and offload counter
+ * is decremented.
+ */
+
+int tc_setup_cb_replace(struct tcf_block *block, struct tcf_proto *tp,
+			enum tc_setup_type type, void *type_data, bool err_stop,
+			u32 *old_flags, unsigned int *old_in_hw_count,
+			u32 *new_flags, unsigned int *new_in_hw_count,
+			bool rtnl_held)
+{
+	int ok_count;
+
+	down_read(&block->cb_lock);
+	/* Make sure all netdevs sharing this block are offload-capable. */
+	if (block->nooffloaddevcnt && err_stop) {
+		ok_count = -EOPNOTSUPP;
+		goto errout;
 	}
+
+	tc_cls_offload_cnt_reset(block, tp, old_in_hw_count, old_flags);
+
+	ok_count = __tc_setup_cb_call(block, type, type_data, err_stop);
+	if (ok_count > 0)
+		tc_cls_offload_cnt_update(block, tp, new_in_hw_count, new_flags,
+					  ok_count, true);
 errout:
 	up_read(&block->cb_lock);
 	return ok_count;
 }
-EXPORT_SYMBOL(tc_setup_cb_call);
+EXPORT_SYMBOL(tc_setup_cb_replace);
+
+/* Destroy filter and decrement block offload counter, if filter was previously
+ * offloaded.
+ */
+
+int tc_setup_cb_destroy(struct tcf_block *block, struct tcf_proto *tp,
+			enum tc_setup_type type, void *type_data, bool err_stop,
+			u32 *flags, unsigned int *in_hw_count, bool rtnl_held)
+{
+	int ok_count;
+
+	down_read(&block->cb_lock);
+	ok_count = __tc_setup_cb_call(block, type, type_data, err_stop);
+
+	tc_cls_offload_cnt_reset(block, tp, in_hw_count, flags);
+	up_read(&block->cb_lock);
+	return ok_count;
+}
+EXPORT_SYMBOL(tc_setup_cb_destroy);
 
 int tc_setup_flow_action(struct flow_action *flow_action,
 			 const struct tcf_exts *exts)
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 3f7a9c02b70c..7f304db7e697 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -162,17 +162,21 @@ static int cls_bpf_offload_cmd(struct tcf_proto *tp, struct cls_bpf_prog *prog,
 	cls_bpf.name = obj->bpf_name;
 	cls_bpf.exts_integrated = obj->exts_integrated;
 
-	if (oldprog)
-		tcf_block_offload_dec(block, &oldprog->gen_flags);
+	if (cls_bpf.oldprog)
+		err = tc_setup_cb_replace(block, tp, TC_SETUP_CLSBPF, &cls_bpf,
+					  skip_sw, &oldprog->gen_flags,
+					  &oldprog->in_hw_count,
+					  &prog->gen_flags, &prog->in_hw_count,
+					  true);
+	else
+		err = tc_setup_cb_add(block, tp, TC_SETUP_CLSBPF, &cls_bpf,
+				      skip_sw, &prog->gen_flags,
+				      &prog->in_hw_count, true);
 
-	err = tc_setup_cb_call(block, TC_SETUP_CLSBPF, &cls_bpf, skip_sw);
 	if (prog) {
 		if (err < 0) {
 			cls_bpf_offload_cmd(tp, oldprog, prog, extack);
 			return err;
-		} else if (err > 0) {
-			prog->in_hw_count = err;
-			tcf_block_offload_inc(block, &prog->gen_flags);
 		}
 	}
 
@@ -230,7 +234,7 @@ static void cls_bpf_offload_update_stats(struct tcf_proto *tp,
 	cls_bpf.name = prog->bpf_name;
 	cls_bpf.exts_integrated = prog->exts_integrated;
 
-	tc_setup_cb_call(block, TC_SETUP_CLSBPF, &cls_bpf, false);
+	tc_setup_cb_call(block, TC_SETUP_CLSBPF, &cls_bpf, false, true);
 }
 
 static int cls_bpf_init(struct tcf_proto *tp)
@@ -680,8 +684,8 @@ static int cls_bpf_reoffload(struct tcf_proto *tp, bool add, flow_setup_cb_t *cb
 			continue;
 		}
 
-		tc_cls_offload_cnt_update(block, &prog->in_hw_count,
-					  &prog->gen_flags, add);
+		tc_cls_offload_cnt_update(block, tp, &prog->in_hw_count,
+					  &prog->gen_flags, 1, add);
 	}
 
 	return 0;
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 054123742e32..0001a933d48b 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -419,10 +419,10 @@ static void fl_hw_destroy_filter(struct tcf_proto *tp, struct cls_fl_filter *f,
 	cls_flower.command = FLOW_CLS_DESTROY;
 	cls_flower.cookie = (unsigned long) f;
 
-	tc_setup_cb_call(block, TC_SETUP_CLSFLOWER, &cls_flower, false);
+	tc_setup_cb_destroy(block, tp, TC_SETUP_CLSFLOWER, &cls_flower, false,
+			    &f->flags, &f->in_hw_count, true);
 	spin_lock(&tp->lock);
 	list_del_init(&f->hw_list);
-	tcf_block_offload_dec(block, &f->flags);
 	spin_unlock(&tp->lock);
 
 	if (!rtnl_held)
@@ -466,18 +466,15 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
 		goto errout;
 	}
 
-	err = tc_setup_cb_call(block, TC_SETUP_CLSFLOWER, &cls_flower, skip_sw);
+	err = tc_setup_cb_add(block, tp, TC_SETUP_CLSFLOWER, &cls_flower,
+			      skip_sw, &f->flags, &f->in_hw_count, true);
 	kfree(cls_flower.rule);
 
 	if (err < 0) {
 		fl_hw_destroy_filter(tp, f, true, NULL);
 		goto errout;
 	} else if (err > 0) {
-		f->in_hw_count = err;
 		err = 0;
-		spin_lock(&tp->lock);
-		tcf_block_offload_inc(block, &f->flags);
-		spin_unlock(&tp->lock);
 	}
 
 	if (skip_sw && !(f->flags & TCA_CLS_FLAGS_IN_HW)) {
@@ -509,7 +506,7 @@ static void fl_hw_update_stats(struct tcf_proto *tp, struct cls_fl_filter *f,
 	cls_flower.cookie = (unsigned long) f;
 	cls_flower.classid = f->res.classid;
 
-	tc_setup_cb_call(block, TC_SETUP_CLSFLOWER, &cls_flower, false);
+	tc_setup_cb_call(block, TC_SETUP_CLSFLOWER, &cls_flower, false, true);
 
 	tcf_exts_stats_update(&f->exts, cls_flower.stats.bytes,
 			      cls_flower.stats.pkts,
@@ -1855,10 +1852,8 @@ static int fl_reoffload(struct tcf_proto *tp, bool add, flow_setup_cb_t *cb,
 			goto next_flow;
 		}
 
-		spin_lock(&tp->lock);
-		tc_cls_offload_cnt_update(block, &f->in_hw_count, &f->flags,
-					  add);
-		spin_unlock(&tp->lock);
+		tc_cls_offload_cnt_update(block, tp, &f->in_hw_count, &f->flags,
+					  1, add);
 next_flow:
 		__fl_put(f);
 	}
@@ -1886,7 +1881,7 @@ static int fl_hw_create_tmplt(struct tcf_chain *chain,
 	/* We don't care if driver (any of them) fails to handle this
 	 * call. It serves just as a hint for it.
 	 */
-	tc_setup_cb_call(block, TC_SETUP_CLSFLOWER, &cls_flower, false);
+	tc_setup_cb_call(block, TC_SETUP_CLSFLOWER, &cls_flower, false, true);
 	kfree(cls_flower.rule);
 
 	return 0;
@@ -1902,7 +1897,7 @@ static void fl_hw_destroy_tmplt(struct tcf_chain *chain,
 	cls_flower.command = FLOW_CLS_TMPLT_DESTROY;
 	cls_flower.cookie = (unsigned long) tmplt;
 
-	tc_setup_cb_call(block, TC_SETUP_CLSFLOWER, &cls_flower, false);
+	tc_setup_cb_call(block, TC_SETUP_CLSFLOWER, &cls_flower, false, true);
 }
 
 static void *fl_tmplt_create(struct net *net, struct tcf_chain *chain,
diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
index 455ea2793f9b..d38e329d71bd 100644
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -75,8 +75,8 @@ static void mall_destroy_hw_filter(struct tcf_proto *tp,
 	cls_mall.command = TC_CLSMATCHALL_DESTROY;
 	cls_mall.cookie = cookie;
 
-	tc_setup_cb_call(block, TC_SETUP_CLSMATCHALL, &cls_mall, false);
-	tcf_block_offload_dec(block, &head->flags);
+	tc_setup_cb_call(block, TC_SETUP_CLSMATCHALL, &cls_mall, false, true);
+	head->flags &= ~TCA_CLS_FLAGS_IN_HW;
 }
 
 static int mall_replace_hw_filter(struct tcf_proto *tp,
@@ -109,15 +109,13 @@ static int mall_replace_hw_filter(struct tcf_proto *tp,
 		return err;
 	}
 
-	err = tc_setup_cb_call(block, TC_SETUP_CLSMATCHALL, &cls_mall, skip_sw);
+	err = tc_setup_cb_add(block, tp, TC_SETUP_CLSMATCHALL, &cls_mall,
+			      skip_sw, &head->flags, &head->in_hw_count, true);
 	kfree(cls_mall.rule);
 
 	if (err < 0) {
 		mall_destroy_hw_filter(tp, head, cookie, NULL);
 		return err;
-	} else if (err > 0) {
-		head->in_hw_count = err;
-		tcf_block_offload_inc(block, &head->flags);
 	}
 
 	if (skip_sw && !(head->flags & TCA_CLS_FLAGS_IN_HW))
@@ -321,7 +319,8 @@ static int mall_reoffload(struct tcf_proto *tp, bool add, flow_setup_cb_t *cb,
 		return 0;
 	}
 
-	tc_cls_offload_cnt_update(block, &head->in_hw_count, &head->flags, add);
+	tc_cls_offload_cnt_update(block, tp, &head->in_hw_count, &head->flags,
+				  1, add);
 
 	return 0;
 }
@@ -337,7 +336,7 @@ static void mall_stats_hw_filter(struct tcf_proto *tp,
 	cls_mall.command = TC_CLSMATCHALL_STATS;
 	cls_mall.cookie = cookie;
 
-	tc_setup_cb_call(block, TC_SETUP_CLSMATCHALL, &cls_mall, false);
+	tc_setup_cb_call(block, TC_SETUP_CLSMATCHALL, &cls_mall, false, true);
 
 	tcf_exts_stats_update(&head->exts, cls_mall.stats.bytes,
 			      cls_mall.stats.pkts, cls_mall.stats.lastused);
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 8614088edd1b..94c6eca191f9 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -480,7 +480,7 @@ static void u32_clear_hw_hnode(struct tcf_proto *tp, struct tc_u_hnode *h,
 	cls_u32.hnode.handle = h->handle;
 	cls_u32.hnode.prio = h->prio;
 
-	tc_setup_cb_call(block, TC_SETUP_CLSU32, &cls_u32, false);
+	tc_setup_cb_call(block, TC_SETUP_CLSU32, &cls_u32, false, true);
 }
 
 static int u32_replace_hw_hnode(struct tcf_proto *tp, struct tc_u_hnode *h,
@@ -498,7 +498,7 @@ static int u32_replace_hw_hnode(struct tcf_proto *tp, struct tc_u_hnode *h,
 	cls_u32.hnode.handle = h->handle;
 	cls_u32.hnode.prio = h->prio;
 
-	err = tc_setup_cb_call(block, TC_SETUP_CLSU32, &cls_u32, skip_sw);
+	err = tc_setup_cb_call(block, TC_SETUP_CLSU32, &cls_u32, skip_sw, true);
 	if (err < 0) {
 		u32_clear_hw_hnode(tp, h, NULL);
 		return err;
@@ -522,8 +522,8 @@ static void u32_remove_hw_knode(struct tcf_proto *tp, struct tc_u_knode *n,
 	cls_u32.command = TC_CLSU32_DELETE_KNODE;
 	cls_u32.knode.handle = n->handle;
 
-	tc_setup_cb_call(block, TC_SETUP_CLSU32, &cls_u32, false);
-	tcf_block_offload_dec(block, &n->flags);
+	tc_setup_cb_destroy(block, tp, TC_SETUP_CLSU32, &cls_u32, false,
+			    &n->flags, &n->in_hw_count, true);
 }
 
 static int u32_replace_hw_knode(struct tcf_proto *tp, struct tc_u_knode *n,
@@ -552,13 +552,11 @@ static int u32_replace_hw_knode(struct tcf_proto *tp, struct tc_u_knode *n,
 	if (n->ht_down)
 		cls_u32.knode.link_handle = ht->handle;
 
-	err = tc_setup_cb_call(block, TC_SETUP_CLSU32, &cls_u32, skip_sw);
+	err = tc_setup_cb_add(block, tp, TC_SETUP_CLSU32, &cls_u32, skip_sw,
+			      &n->flags, &n->in_hw_count, true);
 	if (err < 0) {
 		u32_remove_hw_knode(tp, n, NULL);
 		return err;
-	} else if (err > 0) {
-		n->in_hw_count = err;
-		tcf_block_offload_inc(block, &n->flags);
 	}
 
 	if (skip_sw && !(n->flags & TCA_CLS_FLAGS_IN_HW))
@@ -1208,7 +1206,8 @@ static int u32_reoffload_knode(struct tcf_proto *tp, struct tc_u_knode *n,
 		return 0;
 	}
 
-	tc_cls_offload_cnt_update(block, &n->in_hw_count, &n->flags, add);
+	tc_cls_offload_cnt_update(block, tp, &n->in_hw_count, &n->flags, 1,
+				  add);
 
 	return 0;
 }
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 01/10] net: sched: protect block offload-related fields with rw_semaphore
From: Vlad Buslov @ 2019-08-22 12:43 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, davem, jakub.kicinski, pablo,
	Vlad Buslov, Jiri Pirko
In-Reply-To: <20190822124353.16902-1-vladbu@mellanox.com>

In order to remove dependency on rtnl lock, extend tcf_block with 'cb_lock'
rwsem and use it to protect flow_block->cb_list and related counters from
concurrent modification. The lock is taken in read mode for read-only
traversal of cb_list in tc_setup_cb_call() and write mode in all other
cases. This approach ensures that:

- cb_list is not changed concurrently while filters is being offloaded on
  block.

- block->nooffloaddevcnt is checked while holding the lock in read mode,
  but is only changed by bind/unbind code when holding the cb_lock in write
  mode to prevent concurrent modification.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/sch_generic.h |  2 ++
 net/sched/cls_api.c       | 45 +++++++++++++++++++++++++++++++--------
 2 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index d9f359af0b93..a3eaf5f9d28f 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -13,6 +13,7 @@
 #include <linux/refcount.h>
 #include <linux/workqueue.h>
 #include <linux/mutex.h>
+#include <linux/rwsem.h>
 #include <net/gen_stats.h>
 #include <net/rtnetlink.h>
 #include <net/flow_offload.h>
@@ -396,6 +397,7 @@ struct tcf_block {
 	refcount_t refcnt;
 	struct net *net;
 	struct Qdisc *q;
+	struct rw_semaphore cb_lock; /* protects cb_list and offload counters */
 	struct flow_block flow_block;
 	struct list_head owner_list;
 	bool keep_dst;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index e0d8b456e9f5..a3b07c6e3f53 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -568,9 +568,11 @@ static void tc_indr_block_ing_cmd(struct net_device *dev,
 
 	bo.block = &block->flow_block;
 
+	down_write(&block->cb_lock);
 	cb(dev, cb_priv, TC_SETUP_BLOCK, &bo);
 
 	tcf_block_setup(block, &bo);
+	up_write(&block->cb_lock);
 }
 
 static struct tcf_block *tc_dev_ingress_block(struct net_device *dev)
@@ -661,6 +663,7 @@ static int tcf_block_offload_bind(struct tcf_block *block, struct Qdisc *q,
 	struct net_device *dev = q->dev_queue->dev;
 	int err;
 
+	down_write(&block->cb_lock);
 	if (!dev->netdev_ops->ndo_setup_tc)
 		goto no_offload_dev_inc;
 
@@ -669,24 +672,31 @@ static int tcf_block_offload_bind(struct tcf_block *block, struct Qdisc *q,
 	 */
 	if (!tc_can_offload(dev) && tcf_block_offload_in_use(block)) {
 		NL_SET_ERR_MSG(extack, "Bind to offloaded block failed as dev has offload disabled");
-		return -EOPNOTSUPP;
+		err = -EOPNOTSUPP;
+		goto errout;
 	}
 
 	err = tcf_block_offload_cmd(block, dev, ei, FLOW_BLOCK_BIND, extack);
 	if (err == -EOPNOTSUPP)
 		goto no_offload_dev_inc;
 	if (err)
-		return err;
+		goto errout;
 
 	tc_indr_block_call(block, dev, ei, FLOW_BLOCK_BIND, extack);
+	up_write(&block->cb_lock);
 	return 0;
 
 no_offload_dev_inc:
-	if (tcf_block_offload_in_use(block))
-		return -EOPNOTSUPP;
+	if (tcf_block_offload_in_use(block)) {
+		err = -EOPNOTSUPP;
+		goto errout;
+	}
+	err = 0;
 	block->nooffloaddevcnt++;
 	tc_indr_block_call(block, dev, ei, FLOW_BLOCK_BIND, extack);
-	return 0;
+errout:
+	up_write(&block->cb_lock);
+	return err;
 }
 
 static void tcf_block_offload_unbind(struct tcf_block *block, struct Qdisc *q,
@@ -695,6 +705,7 @@ static void tcf_block_offload_unbind(struct tcf_block *block, struct Qdisc *q,
 	struct net_device *dev = q->dev_queue->dev;
 	int err;
 
+	down_write(&block->cb_lock);
 	tc_indr_block_call(block, dev, ei, FLOW_BLOCK_UNBIND, NULL);
 
 	if (!dev->netdev_ops->ndo_setup_tc)
@@ -702,10 +713,12 @@ static void tcf_block_offload_unbind(struct tcf_block *block, struct Qdisc *q,
 	err = tcf_block_offload_cmd(block, dev, ei, FLOW_BLOCK_UNBIND, NULL);
 	if (err == -EOPNOTSUPP)
 		goto no_offload_dev_dec;
+	up_write(&block->cb_lock);
 	return;
 
 no_offload_dev_dec:
 	WARN_ON(block->nooffloaddevcnt-- == 0);
+	up_write(&block->cb_lock);
 }
 
 static int
@@ -820,6 +833,7 @@ static struct tcf_block *tcf_block_create(struct net *net, struct Qdisc *q,
 		return ERR_PTR(-ENOMEM);
 	}
 	mutex_init(&block->lock);
+	init_rwsem(&block->cb_lock);
 	flow_block_init(&block->flow_block);
 	INIT_LIST_HEAD(&block->chain_list);
 	INIT_LIST_HEAD(&block->owner_list);
@@ -1355,6 +1369,8 @@ tcf_block_playback_offloads(struct tcf_block *block, flow_setup_cb_t *cb,
 	struct tcf_proto *tp, *tp_prev;
 	int err;
 
+	lockdep_assert_held(&block->cb_lock);
+
 	for (chain = __tcf_get_next_chain(block, NULL);
 	     chain;
 	     chain_prev = chain,
@@ -1393,6 +1409,8 @@ static int tcf_block_bind(struct tcf_block *block,
 	struct flow_block_cb *block_cb, *next;
 	int err, i = 0;
 
+	lockdep_assert_held(&block->cb_lock);
+
 	list_for_each_entry(block_cb, &bo->cb_list, list) {
 		err = tcf_block_playback_offloads(block, block_cb->cb,
 						  block_cb->cb_priv, true,
@@ -1427,6 +1445,8 @@ static void tcf_block_unbind(struct tcf_block *block,
 {
 	struct flow_block_cb *block_cb, *next;
 
+	lockdep_assert_held(&block->cb_lock);
+
 	list_for_each_entry_safe(block_cb, next, &bo->cb_list, list) {
 		tcf_block_playback_offloads(block, block_cb->cb,
 					    block_cb->cb_priv, false,
@@ -2987,19 +3007,26 @@ int tc_setup_cb_call(struct tcf_block *block, enum tc_setup_type type,
 	int ok_count = 0;
 	int err;
 
+	down_read(&block->cb_lock);
 	/* Make sure all netdevs sharing this block are offload-capable. */
-	if (block->nooffloaddevcnt && err_stop)
-		return -EOPNOTSUPP;
+	if (block->nooffloaddevcnt && err_stop) {
+		ok_count = -EOPNOTSUPP;
+		goto errout;
+	}
 
 	list_for_each_entry(block_cb, &block->flow_block.cb_list, list) {
 		err = block_cb->cb(type, type_data, block_cb->cb_priv);
 		if (err) {
-			if (err_stop)
-				return err;
+			if (err_stop) {
+				ok_count = err;
+				goto errout;
+			}
 		} else {
 			ok_count++;
 		}
 	}
+errout:
+	up_read(&block->cb_lock);
 	return ok_count;
 }
 EXPORT_SYMBOL(tc_setup_cb_call);
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 09/10] net: sched: copy tunnel info when setting flow_action entry->tunnel
From: Vlad Buslov @ 2019-08-22 12:43 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, davem, jakub.kicinski, pablo,
	Vlad Buslov, Jiri Pirko
In-Reply-To: <20190822124353.16902-1-vladbu@mellanox.com>

In order to remove dependency on rtnl lock, modify tc_setup_flow_action()
to copy tunnel info, instead of just saving pointer to tunnel_key action
tunnel info. This is necessary to prevent concurrent action overwrite from
releasing tunnel info while it is being used by rtnl-unlocked driver.

Implement helper tcf_tunnel_info_copy() that is used to copy tunnel info
with all its options to dynamically allocated memory block. Modify
tc_cleanup_flow_action() to free dynamically allocated tunnel info.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/tc_act/tc_tunnel_key.h | 17 +++++++++++++++++
 net/sched/cls_api.c                |  9 ++++++++-
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/net/tc_act/tc_tunnel_key.h b/include/net/tc_act/tc_tunnel_key.h
index 7c3f777c168c..0689d9bcdf84 100644
--- a/include/net/tc_act/tc_tunnel_key.h
+++ b/include/net/tc_act/tc_tunnel_key.h
@@ -59,4 +59,21 @@ static inline struct ip_tunnel_info *tcf_tunnel_info(const struct tc_action *a)
 	return NULL;
 #endif
 }
+
+static inline struct ip_tunnel_info *
+tcf_tunnel_info_copy(const struct tc_action *a)
+{
+#ifdef CONFIG_NET_CLS_ACT
+	struct ip_tunnel_info *tun = tcf_tunnel_info(a);
+
+	if (tun) {
+		size_t tun_size = sizeof(*tun) + tun->options_len;
+		struct ip_tunnel_info *tun_copy = kmemdup(tun, tun_size,
+							  GFP_KERNEL);
+
+		return tun_copy;
+	}
+#endif
+	return NULL;
+}
 #endif /* __NET_TC_TUNNEL_KEY_H */
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 96adbff9e845..15423d3a89fd 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3258,6 +3258,9 @@ void tc_cleanup_flow_action(struct flow_action *flow_action)
 			if (entry->dev)
 				dev_put(entry->dev);
 			break;
+		case FLOW_ACTION_TUNNEL_ENCAP:
+			kfree(entry->tunnel);
+			break;
 		default:
 			break;
 		}
@@ -3334,7 +3337,11 @@ int tc_setup_flow_action(struct flow_action *flow_action,
 			}
 		} else if (is_tcf_tunnel_set(act)) {
 			entry->id = FLOW_ACTION_TUNNEL_ENCAP;
-			entry->tunnel = tcf_tunnel_info(act);
+			entry->tunnel = tcf_tunnel_info_copy(act);
+			if (!entry->tunnel) {
+				err = -ENOMEM;
+				goto err_out;
+			}
 		} else if (is_tcf_tunnel_release(act)) {
 			entry->id = FLOW_ACTION_TUNNEL_DECAP;
 		} else if (is_tcf_pedit(act)) {
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 00/10] Refactor cls hardware offload API to support rtnl-independent drivers
From: Vlad Buslov @ 2019-08-22 12:43 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, jakub.kicinski, pablo,
	Vlad Buslov

Currently, all cls API hardware offloads driver callbacks require caller
to hold rtnl lock when calling them. This patch set introduces new API
that allows drivers to register callbacks that are not dependent on rtnl
lock and unlocked classifiers to offload filters without obtaining rtnl
lock first, which is intended to allow offloading tc rules in parallel.

Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added.
TC rule update handlers (RTM_NEWTFILTER, RTM_DELTFILTER, etc.) are
already registered with this flag and only take rtnl lock when qdisc or
classifier requires it. Classifiers can indicate that their ops
callbacks don't require caller to hold rtnl lock by setting the
TCF_PROTO_OPS_DOIT_UNLOCKED flag. Unlocked implementation of flower
classifier is now upstreamed. However, this implementation still obtains
rtnl lock before calling hardware offloads API.

Implement following cls API changes:

- Introduce new "unlocked_driver_cb" flag to struct flow_block_offload
  to allow registering and unregistering block hardware offload
  callbacks that do not require caller to hold rtnl lock. Drivers that
  doesn't require users of its tc offload callbacks to hold rtnl lock
  sets the flag to true on block bind/unbind. Internally tcf_block is
  extended with additional lockeddevcnt counter that is used to count
  number of devices that require rtnl lock that block is bound to. When
  this counter is zero, tc_setup_cb_*() functions execute callbacks
  without obtaining rtnl lock.

- Extend cls API single hardware rule update tc_setup_cb_call() function
  with tc_setup_cb_add(), tc_setup_cb_replace() and
  tc_setup_cb_destroy() functions. These new APIs are needed to move
  management of block offload counter, filter in hardware counter and
  flag from classifier implementations to cls API, which is now
  responsible for managing them in concurrency-safe manner. Access to
  cb_list from callback execution code is synchronized by obtaining new
  'cb_lock' rw_semaphore in read mode, which allows executing callbacks
  in parallel, but excludes any modifications of data from
  register/unregister code. tcf_block offloads counter type is changed
  to atomic integer to allow updating the counter concurrently.

- Extend classifier ops with new ops->hw_add() and ops->hw_del()
  callbacks which are used to notify unlocked classifiers when filter is
  successfully added or deleted to hardware without releasing cb_lock.
  This is necessary to update classifier state atomically with callback
  list traversal and updating of all relevant counters and allows
  unlocked classifiers to synchronize with concurrent reoffload without
  requiring any changes to driver callback API implementations.

New tc flow_action infrastructure is also modified to allow its user to
execute without rtnl lock protection. Function tc_setup_flow_action() is
modified to conditionally obtain rtnl lock before accessing action
state. Action data that is accessed by reference is either copied or
reference counted to prevent concurrent action overwrite from
deallocating it. New function tc_cleanup_flow_action() is introduced to
cleanup/release all such data obtained by tc_setup_flow_action().

Flower classifier (only unlocked classifier at the moment) is modified
to use new cls hardware offloads API and no longer obtains rtnl lock
before calling it.

Vlad Buslov (10):
  net: sched: protect block offload-related fields with rw_semaphore
  net: sched: change tcf block offload counter type to atomic_t
  net: sched: refactor block offloads counter usage
  net: sched: notify classifier on successful offload add/delete
  net: sched: add API for registering unlocked offload block callbacks
  net: sched: conditionally obtain rtnl lock in cls hw offloads API
  net: sched: take rtnl lock in tc_setup_flow_action()
  net: sched: take reference to action dev before calling offloads
  net: sched: copy tunnel info when setting flow_action entry->tunnel
  net: sched: flower: don't take rtnl lock for cls hw offloads API

 .../net/ethernet/mellanox/mlx5/core/en_main.c |   2 +
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |   3 +
 include/net/flow_offload.h                    |   1 +
 include/net/pkt_cls.h                         |  17 +-
 include/net/sch_generic.h                     |  42 +--
 include/net/tc_act/tc_tunnel_key.h            |  17 +
 net/sched/cls_api.c                           | 322 +++++++++++++++++-
 net/sched/cls_bpf.c                           |  22 +-
 net/sched/cls_flower.c                        | 111 +++---
 net/sched/cls_matchall.c                      |  19 +-
 net/sched/cls_u32.c                           |  17 +-
 11 files changed, 437 insertions(+), 136 deletions(-)

-- 
2.21.0


^ permalink raw reply

* [PATCH net-next 06/10] net: sched: conditionally obtain rtnl lock in cls hw offloads API
From: Vlad Buslov @ 2019-08-22 12:43 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, davem, jakub.kicinski, pablo,
	Vlad Buslov, Jiri Pirko
In-Reply-To: <20190822124353.16902-1-vladbu@mellanox.com>

In order to remove dependency on rtnl lock from offloads code of
classifiers, take rtnl lock conditionally before executing driver
callbacks. Only obtain rtnl lock if block is bound to devices that require
it.

Block bind/unbind code is rtnl-locked and obtains block->cb_lock while
holding rtnl lock. Obtain locks in same order in tc_setup_cb_*() functions
to prevent deadlock.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 net/sched/cls_api.c | 65 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 02a547aa77c0..bda42f1b5514 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3076,11 +3076,28 @@ __tc_setup_cb_call(struct tcf_block *block, enum tc_setup_type type,
 int tc_setup_cb_call(struct tcf_block *block, enum tc_setup_type type,
 		     void *type_data, bool err_stop, bool rtnl_held)
 {
+	bool take_rtnl = false;
 	int ok_count;
 
+retry:
+	if (take_rtnl)
+		rtnl_lock();
 	down_read(&block->cb_lock);
+	/* Need to obtain rtnl lock if block is bound to devs that require it.
+	 * In block bind code cb_lock is obtained while holding rtnl, so we must
+	 * obtain the locks in same order here.
+	 */
+	if (!rtnl_held && !take_rtnl && block->lockeddevcnt) {
+		up_read(&block->cb_lock);
+		take_rtnl = true;
+		goto retry;
+	}
+
 	ok_count = __tc_setup_cb_call(block, type, type_data, err_stop);
+
 	up_read(&block->cb_lock);
+	if (take_rtnl)
+		rtnl_unlock();
 	return ok_count;
 }
 EXPORT_SYMBOL(tc_setup_cb_call);
@@ -3095,9 +3112,23 @@ int tc_setup_cb_add(struct tcf_block *block, struct tcf_proto *tp,
 		    enum tc_setup_type type, void *type_data, bool err_stop,
 		    u32 *flags, unsigned int *in_hw_count, bool rtnl_held)
 {
+	bool take_rtnl = false;
 	int ok_count;
 
+retry:
+	if (take_rtnl)
+		rtnl_lock();
 	down_read(&block->cb_lock);
+	/* Need to obtain rtnl lock if block is bound to devs that require it.
+	 * In block bind code cb_lock is obtained while holding rtnl, so we must
+	 * obtain the locks in same order here.
+	 */
+	if (!rtnl_held && !take_rtnl && block->lockeddevcnt) {
+		up_read(&block->cb_lock);
+		take_rtnl = true;
+		goto retry;
+	}
+
 	/* Make sure all netdevs sharing this block are offload-capable. */
 	if (block->nooffloaddevcnt && err_stop) {
 		ok_count = -EOPNOTSUPP;
@@ -3114,6 +3145,8 @@ int tc_setup_cb_add(struct tcf_block *block, struct tcf_proto *tp,
 	}
 errout:
 	up_read(&block->cb_lock);
+	if (take_rtnl)
+		rtnl_unlock();
 	return ok_count;
 }
 EXPORT_SYMBOL(tc_setup_cb_add);
@@ -3130,9 +3163,23 @@ int tc_setup_cb_replace(struct tcf_block *block, struct tcf_proto *tp,
 			u32 *new_flags, unsigned int *new_in_hw_count,
 			bool rtnl_held)
 {
+	bool take_rtnl = false;
 	int ok_count;
 
+retry:
+	if (take_rtnl)
+		rtnl_lock();
 	down_read(&block->cb_lock);
+	/* Need to obtain rtnl lock if block is bound to devs that require it.
+	 * In block bind code cb_lock is obtained while holding rtnl, so we must
+	 * obtain the locks in same order here.
+	 */
+	if (!rtnl_held && !take_rtnl && block->lockeddevcnt) {
+		up_read(&block->cb_lock);
+		take_rtnl = true;
+		goto retry;
+	}
+
 	/* Make sure all netdevs sharing this block are offload-capable. */
 	if (block->nooffloaddevcnt && err_stop) {
 		ok_count = -EOPNOTSUPP;
@@ -3153,6 +3200,8 @@ int tc_setup_cb_replace(struct tcf_block *block, struct tcf_proto *tp,
 	}
 errout:
 	up_read(&block->cb_lock);
+	if (take_rtnl)
+		rtnl_unlock();
 	return ok_count;
 }
 EXPORT_SYMBOL(tc_setup_cb_replace);
@@ -3165,9 +3214,23 @@ int tc_setup_cb_destroy(struct tcf_block *block, struct tcf_proto *tp,
 			enum tc_setup_type type, void *type_data, bool err_stop,
 			u32 *flags, unsigned int *in_hw_count, bool rtnl_held)
 {
+	bool take_rtnl = false;
 	int ok_count;
 
+retry:
+	if (take_rtnl)
+		rtnl_lock();
 	down_read(&block->cb_lock);
+	/* Need to obtain rtnl lock if block is bound to devs that require it.
+	 * In block bind code cb_lock is obtained while holding rtnl, so we must
+	 * obtain the locks in same order here.
+	 */
+	if (!rtnl_held && !take_rtnl && block->lockeddevcnt) {
+		up_read(&block->cb_lock);
+		take_rtnl = true;
+		goto retry;
+	}
+
 	ok_count = __tc_setup_cb_call(block, type, type_data, err_stop);
 
 	tc_cls_offload_cnt_reset(block, tp, in_hw_count, flags);
@@ -3175,6 +3238,8 @@ int tc_setup_cb_destroy(struct tcf_block *block, struct tcf_proto *tp,
 		tp->ops->hw_del(tp, type_data);
 
 	up_read(&block->cb_lock);
+	if (take_rtnl)
+		rtnl_unlock();
 	return ok_count;
 }
 EXPORT_SYMBOL(tc_setup_cb_destroy);
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 05/10] net: sched: add API for registering unlocked offload block callbacks
From: Vlad Buslov @ 2019-08-22 12:43 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, davem, jakub.kicinski, pablo,
	Vlad Buslov, Jiri Pirko
In-Reply-To: <20190822124353.16902-1-vladbu@mellanox.com>

Extend struct flow_block_offload with "unlocked_driver_cb" flag to allow
registering and unregistering block hardware offload callbacks that do not
require caller to hold rtnl lock. Extend tcf_block with additional
lockeddevcnt counter that is incremented for each non-unlocked driver
callback attached to device. This counter is necessary to conditionally
obtain rtnl lock before calling hardware callbacks in following patches.

Register mlx5 tc block offload callbacks as "unlocked".

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c  | 3 +++
 include/net/flow_offload.h                        | 1 +
 include/net/sch_generic.h                         | 1 +
 net/sched/cls_api.c                               | 6 ++++++
 5 files changed, 13 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 7fdea6479ff6..8abdef1f8470 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3469,10 +3469,12 @@ static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
 			  void *type_data)
 {
 	struct mlx5e_priv *priv = netdev_priv(dev);
+	struct flow_block_offload *f = type_data;
 
 	switch (type) {
 #ifdef CONFIG_MLX5_ESWITCH
 	case TC_SETUP_BLOCK:
+		f->unlocked_driver_cb = true;
 		return flow_block_cb_setup_simple(type_data,
 						  &mlx5e_block_cb_list,
 						  mlx5e_setup_tc_block_cb,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 3c0d36b2b91c..e7ac6233037d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -763,6 +763,7 @@ mlx5e_rep_indr_setup_tc_block(struct net_device *netdev,
 	if (f->binder_type != FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
 		return -EOPNOTSUPP;
 
+	f->unlocked_driver_cb = true;
 	f->driver_block_list = &mlx5e_block_cb_list;
 
 	switch (f->command) {
@@ -1245,9 +1246,11 @@ static int mlx5e_rep_setup_tc(struct net_device *dev, enum tc_setup_type type,
 			      void *type_data)
 {
 	struct mlx5e_priv *priv = netdev_priv(dev);
+	struct flow_block_offload *f = type_data;
 
 	switch (type) {
 	case TC_SETUP_BLOCK:
+		f->unlocked_driver_cb = true;
 		return flow_block_cb_setup_simple(type_data,
 						  &mlx5e_rep_block_cb_list,
 						  mlx5e_rep_setup_tc_cb,
diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index 757fa84de654..fc881875f856 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -284,6 +284,7 @@ struct flow_block_offload {
 	enum flow_block_command command;
 	enum flow_block_binder_type binder_type;
 	bool block_shared;
+	bool unlocked_driver_cb;
 	struct net *net;
 	struct flow_block *block;
 	struct list_head cb_list;
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 4ad43335cae5..d0c282e3630b 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -408,6 +408,7 @@ struct tcf_block {
 	bool keep_dst;
 	atomic_t offloadcnt; /* Number of oddloaded filters */
 	unsigned int nooffloaddevcnt; /* Number of devs unable to do offload */
+	unsigned int lockeddevcnt; /* Number of devs that require rtnl lock. */
 	struct {
 		struct tcf_chain *chain;
 		struct list_head filter_chain_list;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index d8ef7a9e6906..02a547aa77c0 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1418,6 +1418,8 @@ static int tcf_block_bind(struct tcf_block *block,
 						  bo->extack);
 		if (err)
 			goto err_unroll;
+		if (!bo->unlocked_driver_cb)
+			block->lockeddevcnt++;
 
 		i++;
 	}
@@ -1433,6 +1435,8 @@ static int tcf_block_bind(struct tcf_block *block,
 						    block_cb->cb_priv, false,
 						    tcf_block_offload_in_use(block),
 						    NULL);
+			if (!bo->unlocked_driver_cb)
+				block->lockeddevcnt--;
 		}
 		flow_block_cb_free(block_cb);
 	}
@@ -1454,6 +1458,8 @@ static void tcf_block_unbind(struct tcf_block *block,
 					    NULL);
 		list_del(&block_cb->list);
 		flow_block_cb_free(block_cb);
+		if (!bo->unlocked_driver_cb)
+			block->lockeddevcnt--;
 	}
 }
 
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 04/10] net: sched: notify classifier on successful offload add/delete
From: Vlad Buslov @ 2019-08-22 12:43 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, davem, jakub.kicinski, pablo,
	Vlad Buslov, Jiri Pirko
In-Reply-To: <20190822124353.16902-1-vladbu@mellanox.com>

To remove dependency on rtnl lock, extend classifier ops with new
ops->hw_add() and ops->hw_del() callbacks. Call them from cls API while
holding cb_lock every time filter if successfully added to or deleted from
hardware.

Implement the new API in flower classifier. Use it to manage hw_filters
list under cb_lock protection, instead of relying on rtnl lock to
synchronize with concurrent fl_reoffload() call.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/sch_generic.h |  4 ++++
 net/sched/cls_api.c       | 25 +++++++++++++++++++------
 net/sched/cls_flower.c    | 33 ++++++++++++++++++++++++++-------
 3 files changed, 49 insertions(+), 13 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 373f9476c1de..4ad43335cae5 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -312,6 +312,10 @@ struct tcf_proto_ops {
 	int			(*reoffload)(struct tcf_proto *tp, bool add,
 					     flow_setup_cb_t *cb, void *cb_priv,
 					     struct netlink_ext_ack *extack);
+	void			(*hw_add)(struct tcf_proto *tp,
+					  void *type_data);
+	void			(*hw_del)(struct tcf_proto *tp,
+					  void *type_data);
 	void			(*bind_class)(void *, u32, unsigned long);
 	void *			(*tmplt_create)(struct net *net,
 						struct tcf_chain *chain,
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 4215c849f4a3..d8ef7a9e6906 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3099,9 +3099,13 @@ int tc_setup_cb_add(struct tcf_block *block, struct tcf_proto *tp,
 	}
 
 	ok_count = __tc_setup_cb_call(block, type, type_data, err_stop);
-	if (ok_count > 0)
-		tc_cls_offload_cnt_update(block, tp, in_hw_count, flags,
-					  ok_count, true);
+	if (ok_count >= 0) {
+		if (tp->ops->hw_add)
+			tp->ops->hw_add(tp, type_data);
+		if (ok_count > 0)
+			tc_cls_offload_cnt_update(block, tp, in_hw_count, flags,
+						  ok_count, true);
+	}
 errout:
 	up_read(&block->cb_lock);
 	return ok_count;
@@ -3130,11 +3134,17 @@ int tc_setup_cb_replace(struct tcf_block *block, struct tcf_proto *tp,
 	}
 
 	tc_cls_offload_cnt_reset(block, tp, old_in_hw_count, old_flags);
+	if (tp->ops->hw_del)
+		tp->ops->hw_del(tp, type_data);
 
 	ok_count = __tc_setup_cb_call(block, type, type_data, err_stop);
-	if (ok_count > 0)
-		tc_cls_offload_cnt_update(block, tp, new_in_hw_count, new_flags,
-					  ok_count, true);
+	if (ok_count >= 0) {
+		if (tp->ops->hw_add)
+			tp->ops->hw_add(tp, type_data);
+		if (ok_count > 0)
+			tc_cls_offload_cnt_update(block, tp, new_in_hw_count,
+						  new_flags, ok_count, true);
+	}
 errout:
 	up_read(&block->cb_lock);
 	return ok_count;
@@ -3155,6 +3165,9 @@ int tc_setup_cb_destroy(struct tcf_block *block, struct tcf_proto *tp,
 	ok_count = __tc_setup_cb_call(block, type, type_data, err_stop);
 
 	tc_cls_offload_cnt_reset(block, tp, in_hw_count, flags);
+	if (tp->ops->hw_del)
+		tp->ops->hw_del(tp, type_data);
+
 	up_read(&block->cb_lock);
 	return ok_count;
 }
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 0001a933d48b..cd1686a5abe7 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -421,9 +421,6 @@ static void fl_hw_destroy_filter(struct tcf_proto *tp, struct cls_fl_filter *f,
 
 	tc_setup_cb_destroy(block, tp, TC_SETUP_CLSFLOWER, &cls_flower, false,
 			    &f->flags, &f->in_hw_count, true);
-	spin_lock(&tp->lock);
-	list_del_init(&f->hw_list);
-	spin_unlock(&tp->lock);
 
 	if (!rtnl_held)
 		rtnl_unlock();
@@ -433,7 +430,6 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
 				struct cls_fl_filter *f, bool rtnl_held,
 				struct netlink_ext_ack *extack)
 {
-	struct cls_fl_head *head = fl_head_dereference(tp);
 	struct tcf_block *block = tp->chain->block;
 	struct flow_cls_offload cls_flower = {};
 	bool skip_sw = tc_skip_sw(f->flags);
@@ -482,9 +478,6 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
 		goto errout;
 	}
 
-	spin_lock(&tp->lock);
-	list_add(&f->hw_list, &head->hw_filters);
-	spin_unlock(&tp->lock);
 errout:
 	if (!rtnl_held)
 		rtnl_unlock();
@@ -1861,6 +1854,30 @@ static int fl_reoffload(struct tcf_proto *tp, bool add, flow_setup_cb_t *cb,
 	return 0;
 }
 
+static void fl_hw_add(struct tcf_proto *tp, void *type_data)
+{
+	struct flow_cls_offload *cls_flower = type_data;
+	struct cls_fl_filter *f =
+		(struct cls_fl_filter *) cls_flower->cookie;
+	struct cls_fl_head *head = fl_head_dereference(tp);
+
+	spin_lock(&tp->lock);
+	list_add(&f->hw_list, &head->hw_filters);
+	spin_unlock(&tp->lock);
+}
+
+static void fl_hw_del(struct tcf_proto *tp, void *type_data)
+{
+	struct flow_cls_offload *cls_flower = type_data;
+	struct cls_fl_filter *f =
+		(struct cls_fl_filter *) cls_flower->cookie;
+
+	spin_lock(&tp->lock);
+	if (!list_empty(&f->hw_list))
+		list_del_init(&f->hw_list);
+	spin_unlock(&tp->lock);
+}
+
 static int fl_hw_create_tmplt(struct tcf_chain *chain,
 			      struct fl_flow_tmplt *tmplt)
 {
@@ -2521,6 +2538,8 @@ static struct tcf_proto_ops cls_fl_ops __read_mostly = {
 	.delete		= fl_delete,
 	.walk		= fl_walk,
 	.reoffload	= fl_reoffload,
+	.hw_add		= fl_hw_add,
+	.hw_del		= fl_hw_del,
 	.dump		= fl_dump,
 	.bind_class	= fl_bind_class,
 	.tmplt_create	= fl_tmplt_create,
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH] Staging: isdn/gigaset : Fix bare unsigned warnings and trailing lines errors
From: Dan Carpenter @ 2019-08-22 12:42 UTC (permalink / raw)
  To: Martin Tomes; +Cc: isdn, devel, gregkh, linux-kernel, netdev
In-Reply-To: <1566401259-16921-1-git-send-email-tomesm@gmail.com>

Hi Martin,

These drivers are ancient and going to be deleted soon.  We're not
accepting cleanups for them at this point.

On Wed, Aug 21, 2019 at 03:27:39PM +0000, Martin Tomes wrote:
> There are many bare use of unsigned warnings and trailing statements should be on next line errors from checkpatch.pl script.
> Change the code by adding 'unsigned int'. Move 'break' statement of 'switch' command to next line.

For future reference, this should be split up so each patch fixes
one kind of style issue.  And the commit message lines of text are too
long.  The limit is 75 characters for commit messages (like an email).

regards,
dan carpenter


^ permalink raw reply

* [PATCH] Staging: isdn/gigaset : Fix bare unsigned warnings and trailing lines errors
From: Martin Tomes @ 2019-08-21 15:27 UTC (permalink / raw)
  To: isdn; +Cc: gregkh, linux-kernel, devel, netdev, Martin Tomes

There are many bare use of unsigned warnings and trailing statements should be on next line errors from checkpatch.pl script.
Change the code by adding 'unsigned int'. Move 'break' statement of 'switch' command to next line.

Signed-off-by: Martin Tomes <tomesm@gmail.com>
---
 drivers/staging/isdn/gigaset/usb-gigaset.c | 52 ++++++++++++++++++------------
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/drivers/staging/isdn/gigaset/usb-gigaset.c b/drivers/staging/isdn/gigaset/usb-gigaset.c
index 1b9b436..d565242 100644
--- a/drivers/staging/isdn/gigaset/usb-gigaset.c
+++ b/drivers/staging/isdn/gigaset/usb-gigaset.c
@@ -143,16 +143,16 @@ struct usb_cardstate {
 	char			bchars[6];		/* for request 0x19 */
 };
 
-static inline unsigned tiocm_to_gigaset(unsigned state)
+static inline unsigned int tiocm_to_gigaset(unsigned int state)
 {
 	return ((state & TIOCM_DTR) ? 1 : 0) | ((state & TIOCM_RTS) ? 2 : 0);
 }
 
-static int gigaset_set_modem_ctrl(struct cardstate *cs, unsigned old_state,
-				  unsigned new_state)
+static int gigaset_set_modem_ctrl(struct cardstate *cs, unsigned int old_state,
+				  unsigned int new_state)
 {
 	struct usb_device *udev = cs->hw.usb->udev;
-	unsigned mask, val;
+	unsigned int mask, val;
 	int r;
 
 	mask = tiocm_to_gigaset(old_state ^ new_state);
@@ -178,7 +178,7 @@ static int set_value(struct cardstate *cs, u8 req, u16 val)
 	int r, r2;
 
 	gig_dbg(DEBUG_USBREQ, "request %02x (%04x)",
-		(unsigned)req, (unsigned)val);
+		(unsigned int)req, (unsigned int)val);
 	r = usb_control_msg(udev, usb_sndctrlpipe(udev, 0), 0x12, 0x41,
 			    0xf /*?*/, 0, NULL, 0, 2000 /*?*/);
 	/* no idea what this does */
@@ -191,7 +191,7 @@ static int set_value(struct cardstate *cs, u8 req, u16 val)
 			    val, 0, NULL, 0, 2000 /*?*/);
 	if (r < 0)
 		dev_err(&udev->dev, "error %d on request 0x%02x\n",
-			-r, (unsigned)req);
+			-r, (unsigned int)req);
 
 	r2 = usb_control_msg(udev, usb_sndctrlpipe(udev, 0), 0x19, 0x41,
 			     0, 0, cs->hw.usb->bchars, 6, 2000 /*?*/);
@@ -205,7 +205,7 @@ static int set_value(struct cardstate *cs, u8 req, u16 val)
  * set the baud rate on the internal serial adapter
  * using the undocumented parameter setting command
  */
-static int gigaset_baud_rate(struct cardstate *cs, unsigned cflag)
+static int gigaset_baud_rate(struct cardstate *cs, unsigned int cflag)
 {
 	u16 val;
 	u32 rate;
@@ -213,16 +213,26 @@ static int gigaset_baud_rate(struct cardstate *cs, unsigned cflag)
 	cflag &= CBAUD;
 
 	switch (cflag) {
-	case    B300: rate =     300; break;
-	case    B600: rate =     600; break;
-	case   B1200: rate =    1200; break;
-	case   B2400: rate =    2400; break;
-	case   B4800: rate =    4800; break;
-	case   B9600: rate =    9600; break;
-	case  B19200: rate =   19200; break;
-	case  B38400: rate =   38400; break;
-	case  B57600: rate =   57600; break;
-	case B115200: rate =  115200; break;
+	case    B300: rate =     300;
+		      break;
+	case    B600: rate =     600;
+		      break;
+	case   B1200: rate =    1200;
+		      break;
+	case   B2400: rate =    2400;
+		      break;
+	case   B4800: rate =    4800;
+		      break;
+	case   B9600: rate =    9600;
+		      break;
+	case  B19200: rate =   19200;
+		      break;
+	case  B38400: rate =   38400;
+		      break;
+	case  B57600: rate =   57600;
+		      break;
+	case B115200: rate =  115200;
+		      break;
 	default:
 		rate =  9600;
 		dev_err(cs->dev, "unsupported baudrate request 0x%x,"
@@ -345,7 +355,7 @@ static void gigaset_read_int_callback(struct urb *urb)
 	struct inbuf_t *inbuf = cs->inbuf;
 	int status = urb->status;
 	int r;
-	unsigned numbytes;
+	unsigned int numbytes;
 	unsigned char *src;
 	unsigned long flags;
 
@@ -357,7 +367,7 @@ static void gigaset_read_int_callback(struct urb *urb)
 			if (unlikely(*src))
 				dev_warn(cs->dev,
 					 "%s: There was no leading 0, but 0x%02x!\n",
-					 __func__, (unsigned) *src);
+					 __func__, (unsigned int) *src);
 			++src; /* skip leading 0x00 */
 			--numbytes;
 			if (gigaset_fill_inbuf(inbuf, src, numbytes)) {
@@ -517,7 +527,7 @@ static int gigaset_write_cmd(struct cardstate *cs, struct cmdbuf_t *cb)
 
 static int gigaset_write_room(struct cardstate *cs)
 {
-	unsigned bytes;
+	unsigned int bytes;
 
 	bytes = cs->cmdbytes;
 	return bytes < IF_WRITEBUF ? IF_WRITEBUF - bytes : 0;
@@ -611,7 +621,7 @@ static int write_modem(struct cardstate *cs)
 	}
 
 	/* Copy data to bulk out buffer and transmit data */
-	count = min(bcs->tx_skb->len, (unsigned) ucs->bulk_out_size);
+	count = min(bcs->tx_skb->len, (unsigned int) ucs->bulk_out_size);
 	skb_copy_from_linear_data(bcs->tx_skb, ucs->bulk_out_buffer, count);
 	skb_pull(bcs->tx_skb, count);
 	ucs->busy = 1;
-- 
1.8.3.1


^ permalink raw reply related

* Re: [RFC bpf-next 4/5] iproute2: Allow compiling against libbpf
From: Daniel Borkmann @ 2019-08-22 12:33 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Stephen Hemminger,
	Alexei Starovoitov
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, David Miller,
	Jesper Dangaard Brouer, netdev, bpf, andrii.nakryiko
In-Reply-To: <877e75pftb.fsf@toke.dk>

On 8/22/19 2:04 PM, Toke Høiland-Jørgensen wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:
>> On 8/22/19 12:43 PM, Toke Høiland-Jørgensen wrote:
>>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>>> On 8/20/19 1:47 PM, Toke Høiland-Jørgensen wrote:
>>>>> This adds a configure check for libbpf and renames functions to allow
>>>>> lib/bpf.c to be compiled with it present. This makes it possible to
>>>>> port functionality piecemeal to use libbpf.
>>>>>
>>>>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>>>>> ---
>>>>>     configure          | 16 ++++++++++++++++
>>>>>     include/bpf_util.h |  6 +++---
>>>>>     ip/ipvrf.c         |  4 ++--
>>>>>     lib/bpf.c          | 33 +++++++++++++++++++--------------
>>>>>     4 files changed, 40 insertions(+), 19 deletions(-)
>>>>>
>>>>> diff --git a/configure b/configure
>>>>> index 45fcffb6..5a89ee9f 100755
>>>>> --- a/configure
>>>>> +++ b/configure
>>>>> @@ -238,6 +238,19 @@ check_elf()
>>>>>         fi
>>>>>     }
>>>>>     
>>>>> +check_libbpf()
>>>>> +{
>>>>> +    if ${PKG_CONFIG} libbpf --exists; then
>>>>> +	echo "HAVE_LIBBPF:=y" >>$CONFIG
>>>>> +	echo "yes"
>>>>> +
>>>>> +	echo 'CFLAGS += -DHAVE_LIBBPF' `${PKG_CONFIG} libbpf --cflags` >> $CONFIG
>>>>> +	echo 'LDLIBS += ' `${PKG_CONFIG} libbpf --libs` >>$CONFIG
>>>>> +    else
>>>>> +	echo "no"
>>>>> +    fi
>>>>> +}
>>>>> +
>>>>>     check_selinux()
>>>>
>>>> More of an implementation detail at this point in time, but want to
>>>> make sure this doesn't get missed along the way: as discussed at
>>>> bpfconf [0] best for iproute2 to handle libbpf support would be the
>>>> same way of integration as pahole does, that is, to integrate it via
>>>> submodule [1] to allow kernel and libbpf features to be in sync with
>>>> iproute2 releases and therefore easily consume extensions we're adding
>>>> to libbpf to aide iproute2 integration.
>>>
>>> I can sorta see the point wrt keeping in sync with kernel features. But
>>> how will this work with distros that package libbpf as a regular
>>> library? Have you guys given up on regular library symbol versioning for
>>> libbpf?
>>
>> Not at all, and I hope you know that. ;-)
> 
> Good! Didn't really expect you had, just checking ;)
> 
>> The reason I added lib/bpf.c integration into iproute2 directly back
>> then was exactly such that users can start consuming BPF for tc and
>> XDP via iproute2 /everywhere/ with only a simple libelf dependency
>> which is also available on all distros since pretty much forever. If
>> it was an external library, we could have waited till hell freezes
>> over and initial distro adoption would have pretty much taken forever:
>> to pick one random example here wrt the pace of some downstream
>> distros [0]. The main rationale is pretty much the same as with added
>> kernel features that land complementary iproute2 patches for that
>> kernel release and as libbpf is developed alongside it is reasonable
>> to guarantee user expectations that iproute2 released for kernel
>> version x can make use of BPF features added to kernel x with same
>> loader support from x.
> 
> Well, for iproute2 I would expect this to be solved by version
> dependencies. I.e. iproute2 version X would depend on libbpf version Y+
> (like, I dunno, the version of libbpf included in the same kernel source
> tree as the kernel version iproute2 is targeting? :)).

This sounds nice in theory, but from what I've seen major (!) distros
already seem to have a hard time releasing kernel x along with iproute2
package x, concrete example was that distro kernel was on 4.13 and its
official iproute2 package on 4.9, adding yet another variable that needs
to be in sync with kernel is simply impractical especially for a _core_
package like iproute2 that should have as little dependencies as possible.
I also don't want to make a bet on whether libbpf will be available on
every distro that also ships iproute2. Hence approach that pahole (and
also bcc by the way) takes is most reasonable to have the best user
experience.

Thanks,
Daniel

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox