Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next v2] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Andrew Lunn @ 2019-08-02 14:50 UTC (permalink / raw)
  To: Tao Ren
  Cc: Florian Fainelli, Heiner Kallweit, David S . Miller,
	Arun Parameswaran, Justin Chen, Vladimir Oltean, netdev,
	linux-kernel, openbmc
In-Reply-To: <20190801235839.290689-1-taoren@fb.com>

> +static int bcm54616s_read_status(struct phy_device *phydev)
> +{
> +	int err;
> +
> +	err = genphy_read_status(phydev);
> +
> +	/* 1000Base-X register set doesn't provide speed fields: the
> +	 * link speed is always 1000 Mb/s as long as link is up.
> +	 */
> +	if (phydev->dev_flags & PHY_BCM_FLAGS_MODE_1000BX &&
> +	    phydev->link)
> +		phydev->speed = SPEED_1000;
> +
> +	return err;
> +}

This function is equivalent to bcm5482_read_status(). You should use
it, rather than add a new function.

    Andrew

^ permalink raw reply

* Re: [patch 1/1] drivers/net/ethernet/marvell/mvmdio.c: Fix non OF case
From: Andrew Lunn @ 2019-08-02 14:43 UTC (permalink / raw)
  To: Arnaud Patard; +Cc: netdev
In-Reply-To: <20190802083310.772136040@rtp-net.org>

On Fri, Aug 02, 2019 at 10:32:40AM +0200, Arnaud Patard wrote:
> Orion5.x systems are still using machine files and not device-tree.
> Commit 96cb4342382290c9 ("net: mvmdio: allow up to three clocks to be
> specified for orion-mdio") has replaced devm_clk_get() with of_clk_get(),
> leading to a oops at boot and not working network, as reported in 
> https://lists.debian.org/debian-arm/2019/07/msg00088.html and possibly in
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=908712.
> 
> Link: https://lists.debian.org/debian-arm/2019/07/msg00088.html
> Fixes: 96cb4342382290c9 ("net: mvmdio: allow up to three clocks to be specified for orion-mdio")
> Signed-off-by: Arnaud Patard <arnaud.patard@rtp-net.org>
> Index: linux-next/drivers/net/ethernet/marvell/mvmdio.c
> ===================================================================
> --- linux-next.orig/drivers/net/ethernet/marvell/mvmdio.c
> +++ linux-next/drivers/net/ethernet/marvell/mvmdio.c
> @@ -319,20 +319,33 @@ static int orion_mdio_probe(struct platf
>  
>  	init_waitqueue_head(&dev->smi_busy_wait);
>  
> -	for (i = 0; i < ARRAY_SIZE(dev->clk); i++) {
> -		dev->clk[i] = of_clk_get(pdev->dev.of_node, i);
> -		if (PTR_ERR(dev->clk[i]) == -EPROBE_DEFER) {
> +	if (pdev->dev.of_node) {
> +		for (i = 0; i < ARRAY_SIZE(dev->clk); i++) {
> +			dev->clk[i] = of_clk_get(pdev->dev.of_node, i);
> +			if (PTR_ERR(dev->clk[i]) == -EPROBE_DEFER) {
> +				ret = -EPROBE_DEFER;
> +				goto out_clk;
> +			}
> +			if (IS_ERR(dev->clk[i]))
> +				break;
> +			clk_prepare_enable(dev->clk[i]);
> +		}
> +
> +		if (!IS_ERR(of_clk_get(pdev->dev.of_node,
> +				       ARRAY_SIZE(dev->clk))))
> +			dev_warn(&pdev->dev,
> +				 "unsupported number of clocks, limiting to the first "
> +				 __stringify(ARRAY_SIZE(dev->clk)) "\n");
> +	} else {
> +		dev->clk[0] = clk_get(&pdev->dev, NULL);
> +		if (PTR_ERR(dev->clk[0]) == -EPROBE_DEFER) {
>  			ret = -EPROBE_DEFER;
>  			goto out_clk;
>  		}

Hi Arnaud

It is a long time since i looked at Orion5x. Is this else clause even
needed? If my memory is right, i don't think it needs to enable tclk?
It was kirkwood which first added gateable clocks. And all kirkwood
boards are not DT.

Thanks
	Andrew

^ permalink raw reply

* Re: [PATCH net v4] net: bridge: move default pvid init/deinit to NETDEV_REGISTER/UNREGISTER
From: Roopa Prabhu @ 2019-08-02 14:29 UTC (permalink / raw)
  To: Nikolay Aleksandrov; +Cc: netdev, David Miller, bridge, michael-dev
In-Reply-To: <20190802105736.26767-1-nikolay@cumulusnetworks.com>

On Fri, Aug 2, 2019 at 3:57 AM Nikolay Aleksandrov
<nikolay@cumulusnetworks.com> wrote:
>
> Most of the bridge device's vlan init bugs come from the fact that its
> default pvid is created at the wrong time, way too early in ndo_init()
> before the device is even assigned an ifindex. It introduces a bug when the
> bridge's dev_addr is added as fdb during the initial default pvid creation
> the notification has ifindex/NDA_MASTER both equal to 0 (see example below)
> which really makes no sense for user-space[0] and is wrong.
> Usually user-space software would ignore such entries, but they are
> actually valid and will eventually have all necessary attributes.
> It makes much more sense to send a notification *after* the device has
> registered and has a proper ifindex allocated rather than before when
> there's a chance that the registration might still fail or to receive
> it with ifindex/NDA_MASTER == 0. Note that we can remove the fdb flush
> from br_vlan_flush() since that case can no longer happen. At
> NETDEV_REGISTER br->default_pvid is always == 1 as it's initialized by
> br_vlan_init() before that and at NETDEV_UNREGISTER it can be anything
> depending why it was called (if called due to NETDEV_REGISTER error
> it'll still be == 1, otherwise it could be any value changed during the
> device life time).
>
> For the demonstration below a small change to iproute2 for printing all fdb
> notifications is added, because it contained a workaround not to show
> entries with ifindex == 0.
> Command executed while monitoring: $ ip l add br0 type bridge
> Before (both ifindex and master == 0):
> $ bridge monitor fdb
> 36:7e:8a:b3:56:ba dev * vlan 1 master * permanent
>
> After (proper br0 ifindex):
> $ bridge monitor fdb
> e6:2a:ae:7a:b7:48 dev br0 vlan 1 master br0 permanent
>
> v4: move only the default pvid init/deinit to NETDEV_REGISTER/UNREGISTER
> v3: send the correct v2 patch with all changes (stub should return 0)
> v2: on error in br_vlan_init set br->vlgrp to NULL and return 0 in
>     the br_vlan_bridge_event stub when bridge vlans are disabled
>
> [0] https://bugzilla.kernel.org/show_bug.cgi?id=204389
>
> Reported-by: michael-dev <michael-dev@fami-braun.de>
> Fixes: 5be5a2df40f0 ("bridge: Add filtering support for default_pvid")
> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>

^ permalink raw reply

* Re: [PATCH v2 1/2] cxgb4: sched: Use refcount_t for refcount
From: Chuhong Yuan @ 2019-08-02 14:27 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Vishal Kulkarni, David S . Miller, Network Development,
	linux-kernel
In-Reply-To: <CA+FuTSc8WBx2PCUhn-sLtYHQR-OROXm2pUN9SDj7P-Bd8432UQ@mail.gmail.com>

Willem de Bruijn <willemdebruijn.kernel@gmail.com> 于2019年8月2日周五 下午9:40写道：
>
> On Fri, Aug 2, 2019 at 4:36 AM Chuhong Yuan <hslester96@gmail.com> wrote:
> >
> > refcount_t is better for reference counters since its
> > implementation can prevent overflows.
> > So convert atomic_t ref counters to refcount_t.
> >
> > Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> > ---
> > Changes in v2:
> >   - Convert refcount from 0-base to 1-base.
>
> This changes the initial value from 0 to 1, but does not change the
> release condition. So this introduces an accounting bug?

I have noticed this problem and have checked other files which use refcount_t.
I find although the refcounts are 1-based, they still use
refcount_dec_and_test()
to check whether the resource should be released.
One example is drivers/char/mspec.c.
Therefore I think this is okay and do not change the release condition.

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Michael S. Tsirkin @ 2019-08-02 14:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, linux-mm
In-Reply-To: <20190802124613.GA11245@ziepe.ca>

On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
> > > This must be a proper barrier, like a spinlock, mutex, or
> > > synchronize_rcu.
> > 
> > 
> > I start with synchronize_rcu() but both you and Michael raise some
> > concern.
> 
> I've also idly wondered if calling synchronize_rcu() under the various
> mm locks is a deadlock situation.
> 
> > Then I try spinlock and mutex:
> > 
> > 1) spinlock: add lots of overhead on datapath, this leads 0 performance
> > improvement.
> 
> I think the topic here is correctness not performance improvement

The topic is whether we should revert
commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address")

or keep it in. The only reason to keep it is performance.

Now as long as all this code is disabled anyway, we can experiment a
bit.

I personally feel we would be best served by having two code paths:

- Access to VM memory directly mapped into kernel
- Access to userspace

Having it all cleanly split will allow a bunch of optimizations, for
example for years now we planned to be able to process an incoming short
packet directly on softirq path, or an outgoing on directly within
eventfd.

-- 
MST

^ permalink raw reply

* Re: [PATCH 00/34] put_user_pages(): miscellaneous call sites
From: Matthew Wilcox @ 2019-08-02 14:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: Michal Hocko, john.hubbard, Andrew Morton, Christoph Hellwig,
	Dan Williams, Dave Chinner, Dave Hansen, Ira Weiny,
	Jason Gunthorpe, Jérôme Glisse, LKML, amd-gfx,
	ceph-devel, devel, devel, dri-devel, intel-gfx, kvm,
	linux-arm-kernel, linux-block, linux-crypto, linux-fbdev,
	linux-fsdevel, linux-media, linux-mm, linux-nfs, linux-rdma,
	linux-rpi-kernel, linux-xfs, netdev, rds-devel, sparclinux, x86,
	xen-devel, John Hubbard
In-Reply-To: <20190802124146.GL25064@quack2.suse.cz>

On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote:
> On Fri 02-08-19 11:12:44, Michal Hocko wrote:
> > On Thu 01-08-19 19:19:31, john.hubbard@gmail.com wrote:
> > [...]
> > > 2) Convert all of the call sites for get_user_pages*(), to
> > > invoke put_user_page*(), instead of put_page(). This involves dozens of
> > > call sites, and will take some time.
> > 
> > How do we make sure this is the case and it will remain the case in the
> > future? There must be some automagic to enforce/check that. It is simply
> > not manageable to do it every now and then because then 3) will simply
> > be never safe.
> > 
> > Have you considered coccinele or some other scripted way to do the
> > transition? I have no idea how to deal with future changes that would
> > break the balance though.
> 
> Yeah, that's why I've been suggesting at LSF/MM that we may need to create
> a gup wrapper - say vaddr_pin_pages() - and track which sites dropping
> references got converted by using this wrapper instead of gup. The
> counterpart would then be more logically named as unpin_page() or whatever
> instead of put_user_page().  Sure this is not completely foolproof (you can
> create new callsite using vaddr_pin_pages() and then just drop refs using
> put_page()) but I suppose it would be a high enough barrier for missed
> conversions... Thoughts?

I think the API we really need is get_user_bvec() / put_user_bvec(),
and I know Christoph has been putting some work into that.  That avoids
doing refcount operations on hundreds of pages if the page in question is
a huge page.  Once people are switched over to that, they won't be tempted
to manually call put_page() on the individual constituent pages of a bvec.

^ permalink raw reply

* Re: [PATCH 26/34] mm/gup_benchmark.c: convert put_page() to put_user_page*()
From: Keith Busch @ 2019-08-02 14:19 UTC (permalink / raw)
  To: john.hubbard
  Cc: Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Dave Hansen, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Jérôme Glisse, LKML, amd-gfx, ceph-devel, devel, devel,
	dri-devel, intel-gfx, kvm, linux-arm-kernel, linux-block,
	linux-crypto, linux-fbdev, linux-fsdevel, linux-media, linux-mm,
	linux-nfs, linux-rdma, linux-rpi-kernel, linux-xfs, netdev,
	rds-devel, sparclinux, x86, xen-devel, John Hubbard,
	Dan Carpenter, Greg Kroah-Hartman, Kirill A . Shutemov,
	Michael S . Tsirkin, YueHaibing
In-Reply-To: <20190802022005.5117-27-jhubbard@nvidia.com>

On Thu, Aug 01, 2019 at 07:19:57PM -0700, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> For pages that were retained via get_user_pages*(), release those pages
> via the new put_user_page*() routines, instead of via put_page() or
> release_pages().
> 
> This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
> ("mm: introduce put_user_page*(), placeholder versions").
> 
> Cc: Dan Carpenter <dan.carpenter@oracle.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Keith Busch <keith.busch@intel.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: YueHaibing <yuehaibing@huawei.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

Looks fine.

Reviewed-by: Keith Busch <keith.busch@intel.com>

>  mm/gup_benchmark.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
> index 7dd602d7f8db..515ac8eeb6ee 100644
> --- a/mm/gup_benchmark.c
> +++ b/mm/gup_benchmark.c
> @@ -79,7 +79,7 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
>  	for (i = 0; i < nr_pages; i++) {
>  		if (!pages[i])
>  			break;
> -		put_page(pages[i]);
> +		put_user_page(pages[i]);
>  	}
>  	end_time = ktime_get();
>  	gup->put_delta_usec = ktime_us_delta(end_time, start_time);
> -- 

^ permalink raw reply

* Re: [net-next,rfc] net: bridge: mdb: Extend with multicast LLADDR
From: Allan W. Nielsen @ 2019-08-02 14:21 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Horatiu Vultur, idosch, andrew, davem, roopa, petrm, tglx, fw,
	netdev, linux-kernel, bridge
In-Reply-To: <fcb0e526-778f-5451-9934-e6c2421e4eb3@cumulusnetworks.com>

The 08/02/2019 17:16, Nikolay Aleksandrov wrote:
> On 02/08/2019 17:07, Allan W. Nielsen wrote:
> > The 08/01/2019 17:07, Nikolay Aleksandrov wrote:
> >>> To create a group for two of the front ports the following entries can
> >>> be added:
> >>> bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
> >>> bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
> >>>
> >>> Now the entries will be display as following:
> >>> dev br0 port eth0 grp 01:00:00:00:00:04 permanent offload vid 1
> >>> dev br0 port eth1 grp 01:00:00:00:00:04 permanent offload vid 1
> >>>
> >>> This requires changes to iproute2 as well, see the follogin patch for that.
> >>>
> >>> Now if frame with dmac '01:00:00:00:00:04' will arrive at one of the front
> >>> ports. If we have HW offload support, then the frame will be forwarded by
> >>> the switch, and need not to go to the CPU. In a pure SW world, the frame is
> >>> forwarded by the SW bridge, which will flooded it only the ports which are
> >>> part of the group.
> >>>
> >>> So far so good. This is an important part of the problem we wanted to solve.
> >>>
> >>> But, there is one drawback of this approach. If you want to add two of the
> >>> front ports and br0 to receive the frame then I can't see a way of doing it
> >>> with the bridge mdb command. To do that it requireds many more changes to
> >>> the existing code.
> >>>
> >>> Example:
> >>> bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
> >>> bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
> >>> bridge mdb add dev br0 port br0 grp 01:00:00:00:00:04 permanent vid 1 // This looks wrong.
> >>>
> >>> We believe we come a long way by re-using the facilities in MDB (thanks for
> >>> convincing us in doing this), but we are still not completely happy with
> >>> the result.
> >> Just add self argument for the bridge mdb command, no need to specify it twice.
> > Like this:
> > bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid self
> 
> What ?! No, that is not what I meant.
> bridge mdb add dev br0 grp 01:00:00:00:00:04 permanent vid self
> bridge mdb del dev br0 grp 01:00:00:00:00:04 permanent vid self
> 
> Similar to fdb. You don't need no-self..
> I don't mind a different approach, this was just a suggestion. But please
> without "no-self" :)

Good, then we are in sync on that one :-D

> > 
> > Then if I want to remove br0 rom the group, should I then have a no-self, and
> > then it becomes even more strange what to write in the port.
> > 
> > bridge mdb add dev br0 port ?? grp 01:00:00:00:00:04 permanent vid no-self
> >                             ^^
> > And, what if it is a group with only br0 (the traffic should go to br0 and
> > not any of the slave interfaces)?
> > 
> > Also, the 'self' keyword has different meanings in the 'bridge vlan' and the
> > 'bridge fdb' commands where it refres to if the offload rule should be install
> > in HW - or only in the SW bridge.
> 
> No, it shouldn't. Self means act on the device, in this case act on the bridge,
> otherwise master is assumed.
> 
> > 
> > The proposed does not look pretty bad, but at least it will be possible to
> > configured the different scenarios:
> > 
> > bridge mdb add dev br0 port br0 grp 01:00:00:00:00:04 permanent vid 1
> > bridge mdb del dev br0 port br0 grp 01:00:00:00:00:04 permanent vid 1
> > 
> 
> That works too, but the "port" part is redundant.
> 
> > The more I look at the "bridge mdb { add | del } dev DEV port PORT" command, the
> > less I understand why do we have both 'dev' and 'port'? The implementation will
> > only allow this if 'port' has become enslaved to the switch represented by
> > 'dev'. Anyway, what is done is done, and we need to stay backwards compatible,
> > but we could make it optional, and then it looks a bit less strange to allow the
> > port to specify a br0.
> > 
> > Like this:
> > 
> > bridge mdb { add | del } [dev DEV] port PORT grp GROUP [ permanent | temp ] [ vid VID ]
> > 
> > bridge mdb add port eth0 grp 01:00:00:00:00:04 permanent vid 1
> > bridge mdb add port eth1 grp 01:00:00:00:00:04 permanent vid 1
> > bridge mdb add port br0  grp 01:00:00:00:00:04 permanent vid 1 // Add br0 to the gruop
> > bridge mdb del port br0  grp 01:00:00:00:00:04 permanent vid 1 // Delete it again
> > 
> 
> br0 is not a port, thus the "self" or just dev or whatever you choose..
Ahh, now I understand what you meant.

> > Alternative we could also make the port optional:
> > 
> > bridge mdb { add | del } dev DEV [port PORT] grp GROUP [ permanent | temp ] [ vid VID ]
> > 
> > bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
> > bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
> > bridge mdb add dev br0           grp 01:00:00:00:00:04 permanent vid 1 // Add br0 to the gruop
> > bridge mdb del dev br0           grp 01:00:00:00:00:04 permanent vid 1 // Delete it again
> > 
> 
> Right. I read this one later. :)
> 
> > Any preferences?
> Not really, up to you. Any of the above seem fine to me.
Perfect, I like this one the most:

bridge mdb { add | del } dev DEV [ port PORT ] grp GROUP [ permanent | temp ] [ vid VID ]

bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
bridge mdb add dev br0           grp 01:00:00:00:00:04 permanent vid 1 // Add br0 to the gruop
bridge mdb del dev br0           grp 01:00:00:00:00:04 permanent vid 1 // Delete it again

/Allan

^ permalink raw reply

* Re: [net-next,rfc] net: bridge: mdb: Extend with multicast LLADDR
From: Nikolay Aleksandrov @ 2019-08-02 14:16 UTC (permalink / raw)
  To: Allan W. Nielsen
  Cc: Horatiu Vultur, idosch, andrew, davem, roopa, petrm, tglx, fw,
	netdev, linux-kernel, bridge
In-Reply-To: <20190802140655.ngbok2ubprhivlhy@lx-anielsen.microsemi.net>

On 02/08/2019 17:07, Allan W. Nielsen wrote:
> The 08/01/2019 17:07, Nikolay Aleksandrov wrote:
>>> To create a group for two of the front ports the following entries can
>>> be added:
>>> bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
>>> bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
>>>
>>> Now the entries will be display as following:
>>> dev br0 port eth0 grp 01:00:00:00:00:04 permanent offload vid 1
>>> dev br0 port eth1 grp 01:00:00:00:00:04 permanent offload vid 1
>>>
>>> This requires changes to iproute2 as well, see the follogin patch for that.
>>>
>>> Now if frame with dmac '01:00:00:00:00:04' will arrive at one of the front
>>> ports. If we have HW offload support, then the frame will be forwarded by
>>> the switch, and need not to go to the CPU. In a pure SW world, the frame is
>>> forwarded by the SW bridge, which will flooded it only the ports which are
>>> part of the group.
>>>
>>> So far so good. This is an important part of the problem we wanted to solve.
>>>
>>> But, there is one drawback of this approach. If you want to add two of the
>>> front ports and br0 to receive the frame then I can't see a way of doing it
>>> with the bridge mdb command. To do that it requireds many more changes to
>>> the existing code.
>>>
>>> Example:
>>> bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
>>> bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
>>> bridge mdb add dev br0 port br0 grp 01:00:00:00:00:04 permanent vid 1 // This looks wrong.
>>>
>>> We believe we come a long way by re-using the facilities in MDB (thanks for
>>> convincing us in doing this), but we are still not completely happy with
>>> the result.
>> Just add self argument for the bridge mdb command, no need to specify it twice.
> Like this:
> bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid self

What ?! No, that is not what I meant.
bridge mdb add dev br0 grp 01:00:00:00:00:04 permanent vid self
bridge mdb del dev br0 grp 01:00:00:00:00:04 permanent vid self

Similar to fdb. You don't need no-self..
I don't mind a different approach, this was just a suggestion. But please
without "no-self" :)

> 
> Then if I want to remove br0 rom the group, should I then have a no-self, and
> then it becomes even more strange what to write in the port.
> 
> bridge mdb add dev br0 port ?? grp 01:00:00:00:00:04 permanent vid no-self
>                             ^^
> And, what if it is a group with only br0 (the traffic should go to br0 and
> not any of the slave interfaces)?
> 
> Also, the 'self' keyword has different meanings in the 'bridge vlan' and the
> 'bridge fdb' commands where it refres to if the offload rule should be install
> in HW - or only in the SW bridge.

No, it shouldn't. Self means act on the device, in this case act on the bridge,
otherwise master is assumed.

> 
> The proposed does not look pretty bad, but at least it will be possible to
> configured the different scenarios:
> 
> bridge mdb add dev br0 port br0 grp 01:00:00:00:00:04 permanent vid 1
> bridge mdb del dev br0 port br0 grp 01:00:00:00:00:04 permanent vid 1
> 

That works too, but the "port" part is redundant.

> The more I look at the "bridge mdb { add | del } dev DEV port PORT" command, the
> less I understand why do we have both 'dev' and 'port'? The implementation will
> only allow this if 'port' has become enslaved to the switch represented by
> 'dev'. Anyway, what is done is done, and we need to stay backwards compatible,
> but we could make it optional, and then it looks a bit less strange to allow the
> port to specify a br0.
> 
> Like this:
> 
> bridge mdb { add | del } [dev DEV] port PORT grp GROUP [ permanent | temp ] [ vid VID ]
> 
> bridge mdb add port eth0 grp 01:00:00:00:00:04 permanent vid 1
> bridge mdb add port eth1 grp 01:00:00:00:00:04 permanent vid 1
> bridge mdb add port br0  grp 01:00:00:00:00:04 permanent vid 1 // Add br0 to the gruop
> bridge mdb del port br0  grp 01:00:00:00:00:04 permanent vid 1 // Delete it again
> 

br0 is not a port, thus the "self" or just dev or whatever you choose..

> Alternative we could also make the port optional:
> 
> bridge mdb { add | del } dev DEV [port PORT] grp GROUP [ permanent | temp ] [ vid VID ]
> 
> bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
> bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
> bridge mdb add dev br0           grp 01:00:00:00:00:04 permanent vid 1 // Add br0 to the gruop
> bridge mdb del dev br0           grp 01:00:00:00:00:04 permanent vid 1 // Delete it again
> 

Right. I read this one later. :)

> Any preferences?
> 

Not really, up to you. Any of the above seem fine to me.

> /Allan
> 
> 


^ permalink raw reply

* Re: [net-next,rfc] net: bridge: mdb: Extend with multicast LLADDR
From: Allan W. Nielsen @ 2019-08-02 14:07 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Horatiu Vultur, idosch, andrew, davem, roopa, petrm, tglx, fw,
	netdev, linux-kernel, bridge
In-Reply-To: <f758fdbf-4e0a-57b3-f13d-23e893ba7458@cumulusnetworks.com>

The 08/01/2019 17:07, Nikolay Aleksandrov wrote:
> > To create a group for two of the front ports the following entries can
> > be added:
> > bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
> > bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
> > 
> > Now the entries will be display as following:
> > dev br0 port eth0 grp 01:00:00:00:00:04 permanent offload vid 1
> > dev br0 port eth1 grp 01:00:00:00:00:04 permanent offload vid 1
> > 
> > This requires changes to iproute2 as well, see the follogin patch for that.
> > 
> > Now if frame with dmac '01:00:00:00:00:04' will arrive at one of the front
> > ports. If we have HW offload support, then the frame will be forwarded by
> > the switch, and need not to go to the CPU. In a pure SW world, the frame is
> > forwarded by the SW bridge, which will flooded it only the ports which are
> > part of the group.
> > 
> > So far so good. This is an important part of the problem we wanted to solve.
> > 
> > But, there is one drawback of this approach. If you want to add two of the
> > front ports and br0 to receive the frame then I can't see a way of doing it
> > with the bridge mdb command. To do that it requireds many more changes to
> > the existing code.
> > 
> > Example:
> > bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
> > bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
> > bridge mdb add dev br0 port br0 grp 01:00:00:00:00:04 permanent vid 1 // This looks wrong.
> > 
> > We believe we come a long way by re-using the facilities in MDB (thanks for
> > convincing us in doing this), but we are still not completely happy with
> > the result.
> Just add self argument for the bridge mdb command, no need to specify it twice.
Like this:
bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid self

Then if I want to remove br0 rom the group, should I then have a no-self, and
then it becomes even more strange what to write in the port.

bridge mdb add dev br0 port ?? grp 01:00:00:00:00:04 permanent vid no-self
                            ^^
And, what if it is a group with only br0 (the traffic should go to br0 and
not any of the slave interfaces)?

Also, the 'self' keyword has different meanings in the 'bridge vlan' and the
'bridge fdb' commands where it refres to if the offload rule should be install
in HW - or only in the SW bridge.

The proposed does not look pretty bad, but at least it will be possible to
configured the different scenarios:

bridge mdb add dev br0 port br0 grp 01:00:00:00:00:04 permanent vid 1
bridge mdb del dev br0 port br0 grp 01:00:00:00:00:04 permanent vid 1

The more I look at the "bridge mdb { add | del } dev DEV port PORT" command, the
less I understand why do we have both 'dev' and 'port'? The implementation will
only allow this if 'port' has become enslaved to the switch represented by
'dev'. Anyway, what is done is done, and we need to stay backwards compatible,
but we could make it optional, and then it looks a bit less strange to allow the
port to specify a br0.

Like this:

bridge mdb { add | del } [dev DEV] port PORT grp GROUP [ permanent | temp ] [ vid VID ]

bridge mdb add port eth0 grp 01:00:00:00:00:04 permanent vid 1
bridge mdb add port eth1 grp 01:00:00:00:00:04 permanent vid 1
bridge mdb add port br0  grp 01:00:00:00:00:04 permanent vid 1 // Add br0 to the gruop
bridge mdb del port br0  grp 01:00:00:00:00:04 permanent vid 1 // Delete it again

Alternative we could also make the port optional:

bridge mdb { add | del } dev DEV [port PORT] grp GROUP [ permanent | temp ] [ vid VID ]

bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
bridge mdb add dev br0           grp 01:00:00:00:00:04 permanent vid 1 // Add br0 to the gruop
bridge mdb del dev br0           grp 01:00:00:00:00:04 permanent vid 1 // Delete it again

Any preferences?

/Allan



^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Michael S. Tsirkin @ 2019-08-02 14:03 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jason Gunthorpe, kvm, virtualization, netdev, linux-kernel,
	linux-mm
In-Reply-To: <42ead87b-1749-4c73-cbe4-29dbeb945041@redhat.com>

On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
> Btw, I come up another idea, that is to disable preemption when vhost thread
> need to access the memory. Then register preempt notifier and if vhost
> thread is preempted, we're sure no one will access the memory and can do the
> cleanup.

Great, more notifiers :(

Maybe can live with
1- disable preemption while using the cached pointer
2- teach vhost to recover from memory access failures,
   by switching to regular from/to user path

So if you want to try that, fine since it's a step in
the right direction.

But I think fundamentally it's not what we want to do long term.

It's always been a fundamental problem with this patch series that only
metadata is accessed through a direct pointer.

The difference in ways you handle metadata and data is what is
now coming and messing everything up.

So if continuing the direct map approach,
what is needed is a cache of mapped VM memory, then on a cache miss
we'd queue work along the lines of 1-2 above.

That's one direction to take. Another one is to give up on that and
write our own version of uaccess macros.  Add a "high security" flag to
the vhost module and if not active use these for userspace memory
access.

-- 
MST

^ permalink raw reply

* Re: [PATCH 2/2] ixgbe: Use refcount_t for refcount
From: Willem de Bruijn @ 2019-08-02 13:46 UTC (permalink / raw)
  To: Chuhong Yuan
  Cc: Jeff Kirsher, David S . Miller, intel-wired-lan,
	Network Development, linux-kernel
In-Reply-To: <20190802105507.16650-1-hslester96@gmail.com>

On Fri, Aug 2, 2019 at 6:55 AM Chuhong Yuan <hslester96@gmail.com> wrote:
>
> refcount_t is better for reference counters since its
> implementation can prevent overflows.
> So convert atomic_t ref counters to refcount_t.
>
> Also convert refcount from 0-based to 1-based.
>
> This patch depends on PATCH 1/2.
>
> Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c | 6 +++---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.h | 2 +-
>  2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
> index 00710f43cfd2..d313d00065cd 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
> @@ -773,7 +773,7 @@ int ixgbe_setup_fcoe_ddp_resources(struct ixgbe_adapter *adapter)
>
>         fcoe->extra_ddp_buffer = buffer;
>         fcoe->extra_ddp_buffer_dma = dma;
> -       atomic_set(&fcoe->refcnt, 0);
> +       refcount_set(&fcoe->refcnt, 1);

Same point as in the cxgb4 driver patch: how can you just change the
initial value without modifying the condition for release?

This is not a suggestion to resubmit all these changes again with a
change to the release condition.

^ permalink raw reply

* Re: [PATCH v2 1/2] cxgb4: sched: Use refcount_t for refcount
From: Willem de Bruijn @ 2019-08-02 13:39 UTC (permalink / raw)
  To: Chuhong Yuan
  Cc: Vishal Kulkarni, David S . Miller, Network Development,
	linux-kernel
In-Reply-To: <20190802083541.12602-1-hslester96@gmail.com>

On Fri, Aug 2, 2019 at 4:36 AM Chuhong Yuan <hslester96@gmail.com> wrote:
>
> refcount_t is better for reference counters since its
> implementation can prevent overflows.
> So convert atomic_t ref counters to refcount_t.
>
> Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> ---
> Changes in v2:
>   - Convert refcount from 0-base to 1-base.

This changes the initial value from 0 to 1, but does not change the
release condition. So this introduces an accounting bug?

^ permalink raw reply

* [PATCH AUTOSEL 5.2 02/76] netfilter: nfnetlink: avoid deadlock due to synchronous request_module
From: Sasha Levin @ 2019-08-02 13:18 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Florian Westphal, Thomas Jarosch, Juliana Rodrigueiro,
	Pablo Neira Ayuso, Sasha Levin, netfilter-devel, coreteam, netdev
In-Reply-To: <20190802131951.11600-1-sashal@kernel.org>

From: Florian Westphal <fw@strlen.de>

[ Upstream commit 1b0890cd60829bd51455dc5ad689ed58c4408227 ]

Thomas and Juliana report a deadlock when running:

(rmmod nf_conntrack_netlink/xfrm_user)

  conntrack -e NEW -E &
  modprobe -v xfrm_user

They provided following analysis:

conntrack -e NEW -E
    netlink_bind()
        netlink_lock_table() -> increases "nl_table_users"
            nfnetlink_bind()
            # does not unlock the table as it's locked by netlink_bind()
                __request_module()
                    call_usermodehelper_exec()

This triggers "modprobe nf_conntrack_netlink" from kernel, netlink_bind()
won't return until modprobe process is done.

"modprobe xfrm_user":
    xfrm_user_init()
        register_pernet_subsys()
            -> grab pernet_ops_rwsem
                ..
                netlink_table_grab()
                    calls schedule() as "nl_table_users" is non-zero

so modprobe is blocked because netlink_bind() increased
nl_table_users while also holding pernet_ops_rwsem.

"modprobe nf_conntrack_netlink" runs and inits nf_conntrack_netlink:
    ctnetlink_init()
        register_pernet_subsys()
            -> blocks on "pernet_ops_rwsem" thanks to xfrm_user module

both modprobe processes wait on one another -- neither can make
progress.

Switch netlink_bind() to "nowait" modprobe -- this releases the netlink
table lock, which then allows both modprobe instances to complete.

Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Reported-by: Juliana Rodrigueiro <juliana.rodrigueiro@intra2net.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 net/netfilter/nfnetlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index 92077d4591090..4abbb452cf6c6 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -578,7 +578,7 @@ static int nfnetlink_bind(struct net *net, int group)
 	ss = nfnetlink_get_subsys(type << 8);
 	rcu_read_unlock();
 	if (!ss)
-		request_module("nfnetlink-subsys-%d", type);
+		request_module_nowait("nfnetlink-subsys-%d", type);
 	return 0;
 }
 #endif
-- 
2.20.1


^ permalink raw reply related

* [PATCH AUTOSEL 5.2 05/76] netfilter: Fix rpfilter dropping vrf packets by mistake
From: Sasha Levin @ 2019-08-02 13:18 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Miaohe Lin, Pablo Neira Ayuso, Sasha Levin, netfilter-devel,
	coreteam, netdev
In-Reply-To: <20190802131951.11600-1-sashal@kernel.org>

From: Miaohe Lin <linmiaohe@huawei.com>

[ Upstream commit b575b24b8eee37f10484e951b62ce2a31c579775 ]

When firewalld is enabled with ipv4/ipv6 rpfilter, vrf
ipv4/ipv6 packets will be dropped. Vrf device will pass
through netfilter hook twice. One with enslaved device
and another one with l3 master device. So in device may
dismatch witch out device because out device is always
enslaved device.So failed with the check of the rpfilter
and drop the packets by mistake.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 net/ipv4/netfilter/ipt_rpfilter.c  | 1 +
 net/ipv6/netfilter/ip6t_rpfilter.c | 8 ++++++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/netfilter/ipt_rpfilter.c b/net/ipv4/netfilter/ipt_rpfilter.c
index 59031670b16a0..cc23f1ce239c2 100644
--- a/net/ipv4/netfilter/ipt_rpfilter.c
+++ b/net/ipv4/netfilter/ipt_rpfilter.c
@@ -78,6 +78,7 @@ static bool rpfilter_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	flow.flowi4_mark = info->flags & XT_RPFILTER_VALID_MARK ? skb->mark : 0;
 	flow.flowi4_tos = RT_TOS(iph->tos);
 	flow.flowi4_scope = RT_SCOPE_UNIVERSE;
+	flow.flowi4_oif = l3mdev_master_ifindex_rcu(xt_in(par));
 
 	return rpfilter_lookup_reverse(xt_net(par), &flow, xt_in(par), info->flags) ^ invert;
 }
diff --git a/net/ipv6/netfilter/ip6t_rpfilter.c b/net/ipv6/netfilter/ip6t_rpfilter.c
index 6bcaf73571834..d800801a5dd27 100644
--- a/net/ipv6/netfilter/ip6t_rpfilter.c
+++ b/net/ipv6/netfilter/ip6t_rpfilter.c
@@ -55,7 +55,9 @@ static bool rpfilter_lookup_reverse6(struct net *net, const struct sk_buff *skb,
 	if (rpfilter_addr_linklocal(&iph->saddr)) {
 		lookup_flags |= RT6_LOOKUP_F_IFACE;
 		fl6.flowi6_oif = dev->ifindex;
-	} else if ((flags & XT_RPFILTER_LOOSE) == 0)
+	/* Set flowi6_oif for vrf devices to lookup route in l3mdev domain. */
+	} else if (netif_is_l3_master(dev) || netif_is_l3_slave(dev) ||
+		  (flags & XT_RPFILTER_LOOSE) == 0)
 		fl6.flowi6_oif = dev->ifindex;
 
 	rt = (void *)ip6_route_lookup(net, &fl6, skb, lookup_flags);
@@ -70,7 +72,9 @@ static bool rpfilter_lookup_reverse6(struct net *net, const struct sk_buff *skb,
 		goto out;
 	}
 
-	if (rt->rt6i_idev->dev == dev || (flags & XT_RPFILTER_LOOSE))
+	if (rt->rt6i_idev->dev == dev ||
+	    l3mdev_master_ifindex_rcu(rt->rt6i_idev->dev) == dev->ifindex ||
+	    (flags & XT_RPFILTER_LOOSE))
 		ret = true;
  out:
 	ip6_rt_put(rt);
-- 
2.20.1


^ permalink raw reply related

* [PATCH AUTOSEL 5.2 07/76] netfilter: conntrack: always store window size un-scaled
From: Sasha Levin @ 2019-08-02 13:18 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Florian Westphal, Jakub Jankowski, Jozsef Kadlecsik,
	Pablo Neira Ayuso, Sasha Levin, netfilter-devel, coreteam, netdev
In-Reply-To: <20190802131951.11600-1-sashal@kernel.org>

From: Florian Westphal <fw@strlen.de>

[ Upstream commit 959b69ef57db00cb33e9c4777400ae7183ebddd3 ]

Jakub Jankowski reported following oddity:

After 3 way handshake completes, timeout of new connection is set to
max_retrans (300s) instead of established (5 days).

shortened excerpt from pcap provided:
25.070622 IP (flags [DF], proto TCP (6), length 52)
10.8.5.4.1025 > 10.8.1.2.80: Flags [S], seq 11, win 64240, [wscale 8]
26.070462 IP (flags [DF], proto TCP (6), length 48)
10.8.1.2.80 > 10.8.5.4.1025: Flags [S.], seq 82, ack 12, win 65535, [wscale 3]
27.070449 IP (flags [DF], proto TCP (6), length 40)
10.8.5.4.1025 > 10.8.1.2.80: Flags [.], ack 83, win 512, length 0

Turns out the last_win is of u16 type, but we store the scaled value:
512 << 8 (== 0x20000) becomes 0 window.

The Fixes tag is not correct, as the bug has existed forever, but
without that change all that this causes might cause is to mistake a
window update (to-nonzero-from-zero) for a retransmit.

Fixes: fbcd253d2448b8 ("netfilter: conntrack: lower timeout to RETRANS seconds if window is 0")
Reported-by: Jakub Jankowski <shasta@toxcorp.com>
Tested-by: Jakub Jankowski <shasta@toxcorp.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 net/netfilter/nf_conntrack_proto_tcp.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index 1e2cc83ff5da8..ae1f8c6b3a974 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -472,6 +472,7 @@ static bool tcp_in_window(const struct nf_conn *ct,
 	struct ip_ct_tcp_state *receiver = &state->seen[!dir];
 	const struct nf_conntrack_tuple *tuple = &ct->tuplehash[dir].tuple;
 	__u32 seq, ack, sack, end, win, swin;
+	u16 win_raw;
 	s32 receiver_offset;
 	bool res, in_recv_win;
 
@@ -480,7 +481,8 @@ static bool tcp_in_window(const struct nf_conn *ct,
 	 */
 	seq = ntohl(tcph->seq);
 	ack = sack = ntohl(tcph->ack_seq);
-	win = ntohs(tcph->window);
+	win_raw = ntohs(tcph->window);
+	win = win_raw;
 	end = segment_seq_plus_len(seq, skb->len, dataoff, tcph);
 
 	if (receiver->flags & IP_CT_TCP_FLAG_SACK_PERM)
@@ -655,14 +657,14 @@ static bool tcp_in_window(const struct nf_conn *ct,
 			    && state->last_seq == seq
 			    && state->last_ack == ack
 			    && state->last_end == end
-			    && state->last_win == win)
+			    && state->last_win == win_raw)
 				state->retrans++;
 			else {
 				state->last_dir = dir;
 				state->last_seq = seq;
 				state->last_ack = ack;
 				state->last_end = end;
-				state->last_win = win;
+				state->last_win = win_raw;
 				state->retrans = 0;
 			}
 		}
-- 
2.20.1


^ permalink raw reply related

* [PATCH AUTOSEL 5.2 08/76] netfilter: nft_hash: fix symhash with modulus one
From: Sasha Levin @ 2019-08-02 13:18 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Laura Garcia Liebana, Pablo Neira Ayuso, Sasha Levin,
	netfilter-devel, coreteam, netdev
In-Reply-To: <20190802131951.11600-1-sashal@kernel.org>

From: Laura Garcia Liebana <nevola@gmail.com>

[ Upstream commit 28b1d6ef53e3303b90ca8924bb78f31fa527cafb ]

The rule below doesn't work as the kernel raises -ERANGE.

nft add rule netdev nftlb lb01 ip daddr set \
	symhash mod 1 map { 0 : 192.168.0.10 } fwd to "eth0"

This patch allows to use the symhash modulus with one
element, in the same way that the other types of hashes and
algorithms that uses the modulus parameter.

Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 net/netfilter/nft_hash.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
index fe93e731dc7fb..b836d550b9199 100644
--- a/net/netfilter/nft_hash.c
+++ b/net/netfilter/nft_hash.c
@@ -129,7 +129,7 @@ static int nft_symhash_init(const struct nft_ctx *ctx,
 	priv->dreg = nft_parse_register(tb[NFTA_HASH_DREG]);
 
 	priv->modulus = ntohl(nla_get_be32(tb[NFTA_HASH_MODULUS]));
-	if (priv->modulus <= 1)
+	if (priv->modulus < 1)
 		return -ERANGE;
 
 	if (priv->offset + priv->modulus - 1 < priv->offset)
-- 
2.20.1


^ permalink raw reply related

* [PATCH AUTOSEL 5.2 15/76] netfilter: nf_tables: Support auto-loading for inet nat
From: Sasha Levin @ 2019-08-02 13:18 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Phil Sutter, Pablo Neira Ayuso, Sasha Levin, netfilter-devel,
	coreteam, netdev
In-Reply-To: <20190802131951.11600-1-sashal@kernel.org>

From: Phil Sutter <phil@nwl.cc>

[ Upstream commit b4f1483cbfa5fafca4874e90063f75603edbc210 ]

Trying to create an inet family nat chain would not cause
nft_chain_nat.ko module to auto-load due to missing module alias. Add a
proper one with hard-coded family value 1 for the pseudo-family
NFPROTO_INET.

Fixes: d164385ec572 ("netfilter: nat: add inet family nat support")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 net/netfilter/nft_chain_nat.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/netfilter/nft_chain_nat.c b/net/netfilter/nft_chain_nat.c
index 2f89bde3c61cb..ff9ac8ae0031f 100644
--- a/net/netfilter/nft_chain_nat.c
+++ b/net/netfilter/nft_chain_nat.c
@@ -142,3 +142,6 @@ MODULE_ALIAS_NFT_CHAIN(AF_INET, "nat");
 #ifdef CONFIG_NF_TABLES_IPV6
 MODULE_ALIAS_NFT_CHAIN(AF_INET6, "nat");
 #endif
+#ifdef CONFIG_NF_TABLES_INET
+MODULE_ALIAS_NFT_CHAIN(1, "nat");	/* NFPROTO_INET */
+#endif
-- 
2.20.1


^ permalink raw reply related

* [PATCH AUTOSEL 5.2 28/76] mac80211: fix possible memory leak in ieee80211_assign_beacon
From: Sasha Levin @ 2019-08-02 13:19 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Lorenzo Bianconi, Johannes Berg, Sasha Levin, linux-wireless,
	netdev
In-Reply-To: <20190802131951.11600-1-sashal@kernel.org>

From: Lorenzo Bianconi <lorenzo@kernel.org>

[ Upstream commit bcc27fab8cc673ddc95452674373cce618ccb3a3 ]

Free new beacon_data in ieee80211_assign_beacon whenever
ieee80211_assign_beacon fails

Fixes: 8860020e0be1 ("cfg80211: restructure AP/GO mode API")
Fixes: bc847970f432 ("mac80211: support FTM responder configuration/statistic")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/770285772543c9fca33777bb4ad4760239e56256.1562105631.git.lorenzo@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 net/mac80211/cfg.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/mac80211/cfg.c b/net/mac80211/cfg.c
index a1973a26c7fc4..b8288125e05db 100644
--- a/net/mac80211/cfg.c
+++ b/net/mac80211/cfg.c
@@ -935,8 +935,10 @@ static int ieee80211_assign_beacon(struct ieee80211_sub_if_data *sdata,
 
 	err = ieee80211_set_probe_resp(sdata, params->probe_resp,
 				       params->probe_resp_len, csa);
-	if (err < 0)
+	if (err < 0) {
+		kfree(new);
 		return err;
+	}
 	if (err == 0)
 		changed |= BSS_CHANGED_AP_PROBE_RESP;
 
@@ -948,8 +950,10 @@ static int ieee80211_assign_beacon(struct ieee80211_sub_if_data *sdata,
 							 params->civicloc,
 							 params->civicloc_len);
 
-		if (err < 0)
+		if (err < 0) {
+			kfree(new);
 			return err;
+		}
 
 		changed |= BSS_CHANGED_FTM_RESPONDER;
 	}
-- 
2.20.1


^ permalink raw reply related

* [PATCH AUTOSEL 5.2 29/76] mac80211: don't warn about CW params when not using them
From: Sasha Levin @ 2019-08-02 13:19 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Brian Norris, Johannes Berg, Sasha Levin, linux-wireless, netdev
In-Reply-To: <20190802131951.11600-1-sashal@kernel.org>

From: Brian Norris <briannorris@chromium.org>

[ Upstream commit d2b3fe42bc629c2d4002f652b3abdfb2e72991c7 ]

ieee80211_set_wmm_default() normally sets up the initial CW min/max for
each queue, except that it skips doing this if the driver doesn't
support ->conf_tx. We still end up calling drv_conf_tx() in some cases
(e.g., ieee80211_reconfig()), which also still won't do anything
useful...except it complains here about the invalid CW parameters.

Let's just skip the WARN if we weren't going to do anything useful with
the parameters.

Signed-off-by: Brian Norris <briannorris@chromium.org>
Link: https://lore.kernel.org/r/20190718015712.197499-1-briannorris@chromium.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 net/mac80211/driver-ops.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/net/mac80211/driver-ops.c b/net/mac80211/driver-ops.c
index acd4afb4944b8..c9a8a2433e8ac 100644
--- a/net/mac80211/driver-ops.c
+++ b/net/mac80211/driver-ops.c
@@ -187,11 +187,16 @@ int drv_conf_tx(struct ieee80211_local *local,
 	if (!check_sdata_in_driver(sdata))
 		return -EIO;
 
-	if (WARN_ONCE(params->cw_min == 0 ||
-		      params->cw_min > params->cw_max,
-		      "%s: invalid CW_min/CW_max: %d/%d\n",
-		      sdata->name, params->cw_min, params->cw_max))
+	if (params->cw_min == 0 || params->cw_min > params->cw_max) {
+		/*
+		 * If we can't configure hardware anyway, don't warn. We may
+		 * never have initialized the CW parameters.
+		 */
+		WARN_ONCE(local->ops->conf_tx,
+			  "%s: invalid CW_min/CW_max: %d/%d\n",
+			  sdata->name, params->cw_min, params->cw_max);
 		return -EINVAL;
+	}
 
 	trace_drv_conf_tx(local, sdata, ac, params);
 	if (local->ops->conf_tx)
-- 
2.20.1


^ permalink raw reply related

* [PATCH AUTOSEL 4.19 01/42] netfilter: nfnetlink: avoid deadlock due to synchronous request_module
From: Sasha Levin @ 2019-08-02 13:22 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Florian Westphal, Thomas Jarosch, Juliana Rodrigueiro,
	Pablo Neira Ayuso, Sasha Levin, netfilter-devel, coreteam, netdev

From: Florian Westphal <fw@strlen.de>

[ Upstream commit 1b0890cd60829bd51455dc5ad689ed58c4408227 ]

Thomas and Juliana report a deadlock when running:

(rmmod nf_conntrack_netlink/xfrm_user)

  conntrack -e NEW -E &
  modprobe -v xfrm_user

They provided following analysis:

conntrack -e NEW -E
    netlink_bind()
        netlink_lock_table() -> increases "nl_table_users"
            nfnetlink_bind()
            # does not unlock the table as it's locked by netlink_bind()
                __request_module()
                    call_usermodehelper_exec()

This triggers "modprobe nf_conntrack_netlink" from kernel, netlink_bind()
won't return until modprobe process is done.

"modprobe xfrm_user":
    xfrm_user_init()
        register_pernet_subsys()
            -> grab pernet_ops_rwsem
                ..
                netlink_table_grab()
                    calls schedule() as "nl_table_users" is non-zero

so modprobe is blocked because netlink_bind() increased
nl_table_users while also holding pernet_ops_rwsem.

"modprobe nf_conntrack_netlink" runs and inits nf_conntrack_netlink:
    ctnetlink_init()
        register_pernet_subsys()
            -> blocks on "pernet_ops_rwsem" thanks to xfrm_user module

both modprobe processes wait on one another -- neither can make
progress.

Switch netlink_bind() to "nowait" modprobe -- this releases the netlink
table lock, which then allows both modprobe instances to complete.

Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Reported-by: Juliana Rodrigueiro <juliana.rodrigueiro@intra2net.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 net/netfilter/nfnetlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index 916913454624f..7f2c1915763f8 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -575,7 +575,7 @@ static int nfnetlink_bind(struct net *net, int group)
 	ss = nfnetlink_get_subsys(type << 8);
 	rcu_read_unlock();
 	if (!ss)
-		request_module("nfnetlink-subsys-%d", type);
+		request_module_nowait("nfnetlink-subsys-%d", type);
 	return 0;
 }
 #endif
-- 
2.20.1


^ permalink raw reply related

* [PATCH AUTOSEL 4.19 04/42] netfilter: conntrack: always store window size un-scaled
From: Sasha Levin @ 2019-08-02 13:22 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Florian Westphal, Jakub Jankowski, Jozsef Kadlecsik,
	Pablo Neira Ayuso, Sasha Levin, netfilter-devel, coreteam, netdev
In-Reply-To: <20190802132302.13537-1-sashal@kernel.org>

From: Florian Westphal <fw@strlen.de>

[ Upstream commit 959b69ef57db00cb33e9c4777400ae7183ebddd3 ]

Jakub Jankowski reported following oddity:

After 3 way handshake completes, timeout of new connection is set to
max_retrans (300s) instead of established (5 days).

shortened excerpt from pcap provided:
25.070622 IP (flags [DF], proto TCP (6), length 52)
10.8.5.4.1025 > 10.8.1.2.80: Flags [S], seq 11, win 64240, [wscale 8]
26.070462 IP (flags [DF], proto TCP (6), length 48)
10.8.1.2.80 > 10.8.5.4.1025: Flags [S.], seq 82, ack 12, win 65535, [wscale 3]
27.070449 IP (flags [DF], proto TCP (6), length 40)
10.8.5.4.1025 > 10.8.1.2.80: Flags [.], ack 83, win 512, length 0

Turns out the last_win is of u16 type, but we store the scaled value:
512 << 8 (== 0x20000) becomes 0 window.

The Fixes tag is not correct, as the bug has existed forever, but
without that change all that this causes might cause is to mistake a
window update (to-nonzero-from-zero) for a retransmit.

Fixes: fbcd253d2448b8 ("netfilter: conntrack: lower timeout to RETRANS seconds if window is 0")
Reported-by: Jakub Jankowski <shasta@toxcorp.com>
Tested-by: Jakub Jankowski <shasta@toxcorp.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 net/netfilter/nf_conntrack_proto_tcp.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index 842f3f86fb2e7..7011ab27c4371 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -480,6 +480,7 @@ static bool tcp_in_window(const struct nf_conn *ct,
 	struct ip_ct_tcp_state *receiver = &state->seen[!dir];
 	const struct nf_conntrack_tuple *tuple = &ct->tuplehash[dir].tuple;
 	__u32 seq, ack, sack, end, win, swin;
+	u16 win_raw;
 	s32 receiver_offset;
 	bool res, in_recv_win;
 
@@ -488,7 +489,8 @@ static bool tcp_in_window(const struct nf_conn *ct,
 	 */
 	seq = ntohl(tcph->seq);
 	ack = sack = ntohl(tcph->ack_seq);
-	win = ntohs(tcph->window);
+	win_raw = ntohs(tcph->window);
+	win = win_raw;
 	end = segment_seq_plus_len(seq, skb->len, dataoff, tcph);
 
 	if (receiver->flags & IP_CT_TCP_FLAG_SACK_PERM)
@@ -663,14 +665,14 @@ static bool tcp_in_window(const struct nf_conn *ct,
 			    && state->last_seq == seq
 			    && state->last_ack == ack
 			    && state->last_end == end
-			    && state->last_win == win)
+			    && state->last_win == win_raw)
 				state->retrans++;
 			else {
 				state->last_dir = dir;
 				state->last_seq = seq;
 				state->last_ack = ack;
 				state->last_end = end;
-				state->last_win = win;
+				state->last_win = win_raw;
 				state->retrans = 0;
 			}
 		}
-- 
2.20.1


^ permalink raw reply related

* [PATCH AUTOSEL 4.19 03/42] netfilter: Fix rpfilter dropping vrf packets by mistake
From: Sasha Levin @ 2019-08-02 13:22 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Miaohe Lin, Pablo Neira Ayuso, Sasha Levin, netfilter-devel,
	coreteam, netdev
In-Reply-To: <20190802132302.13537-1-sashal@kernel.org>

From: Miaohe Lin <linmiaohe@huawei.com>

[ Upstream commit b575b24b8eee37f10484e951b62ce2a31c579775 ]

When firewalld is enabled with ipv4/ipv6 rpfilter, vrf
ipv4/ipv6 packets will be dropped. Vrf device will pass
through netfilter hook twice. One with enslaved device
and another one with l3 master device. So in device may
dismatch witch out device because out device is always
enslaved device.So failed with the check of the rpfilter
and drop the packets by mistake.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 net/ipv4/netfilter/ipt_rpfilter.c  | 1 +
 net/ipv6/netfilter/ip6t_rpfilter.c | 8 ++++++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/netfilter/ipt_rpfilter.c b/net/ipv4/netfilter/ipt_rpfilter.c
index 12843c9ef1421..74b19a5c572e9 100644
--- a/net/ipv4/netfilter/ipt_rpfilter.c
+++ b/net/ipv4/netfilter/ipt_rpfilter.c
@@ -96,6 +96,7 @@ static bool rpfilter_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	flow.flowi4_mark = info->flags & XT_RPFILTER_VALID_MARK ? skb->mark : 0;
 	flow.flowi4_tos = RT_TOS(iph->tos);
 	flow.flowi4_scope = RT_SCOPE_UNIVERSE;
+	flow.flowi4_oif = l3mdev_master_ifindex_rcu(xt_in(par));
 
 	return rpfilter_lookup_reverse(xt_net(par), &flow, xt_in(par), info->flags) ^ invert;
 }
diff --git a/net/ipv6/netfilter/ip6t_rpfilter.c b/net/ipv6/netfilter/ip6t_rpfilter.c
index c3c6b09acdc4f..0f3407f2851ed 100644
--- a/net/ipv6/netfilter/ip6t_rpfilter.c
+++ b/net/ipv6/netfilter/ip6t_rpfilter.c
@@ -58,7 +58,9 @@ static bool rpfilter_lookup_reverse6(struct net *net, const struct sk_buff *skb,
 	if (rpfilter_addr_linklocal(&iph->saddr)) {
 		lookup_flags |= RT6_LOOKUP_F_IFACE;
 		fl6.flowi6_oif = dev->ifindex;
-	} else if ((flags & XT_RPFILTER_LOOSE) == 0)
+	/* Set flowi6_oif for vrf devices to lookup route in l3mdev domain. */
+	} else if (netif_is_l3_master(dev) || netif_is_l3_slave(dev) ||
+		  (flags & XT_RPFILTER_LOOSE) == 0)
 		fl6.flowi6_oif = dev->ifindex;
 
 	rt = (void *)ip6_route_lookup(net, &fl6, skb, lookup_flags);
@@ -73,7 +75,9 @@ static bool rpfilter_lookup_reverse6(struct net *net, const struct sk_buff *skb,
 		goto out;
 	}
 
-	if (rt->rt6i_idev->dev == dev || (flags & XT_RPFILTER_LOOSE))
+	if (rt->rt6i_idev->dev == dev ||
+	    l3mdev_master_ifindex_rcu(rt->rt6i_idev->dev) == dev->ifindex ||
+	    (flags & XT_RPFILTER_LOOSE))
 		ret = true;
  out:
 	ip6_rt_put(rt);
-- 
2.20.1


^ permalink raw reply related

* [PATCH AUTOSEL 4.19 05/42] netfilter: nft_hash: fix symhash with modulus one
From: Sasha Levin @ 2019-08-02 13:22 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Laura Garcia Liebana, Pablo Neira Ayuso, Sasha Levin,
	netfilter-devel, coreteam, netdev
In-Reply-To: <20190802132302.13537-1-sashal@kernel.org>

From: Laura Garcia Liebana <nevola@gmail.com>

[ Upstream commit 28b1d6ef53e3303b90ca8924bb78f31fa527cafb ]

The rule below doesn't work as the kernel raises -ERANGE.

nft add rule netdev nftlb lb01 ip daddr set \
	symhash mod 1 map { 0 : 192.168.0.10 } fwd to "eth0"

This patch allows to use the symhash modulus with one
element, in the same way that the other types of hashes and
algorithms that uses the modulus parameter.

Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 net/netfilter/nft_hash.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
index c2d237144f747..b8f23f75aea6c 100644
--- a/net/netfilter/nft_hash.c
+++ b/net/netfilter/nft_hash.c
@@ -196,7 +196,7 @@ static int nft_symhash_init(const struct nft_ctx *ctx,
 	priv->dreg = nft_parse_register(tb[NFTA_HASH_DREG]);
 
 	priv->modulus = ntohl(nla_get_be32(tb[NFTA_HASH_MODULUS]));
-	if (priv->modulus <= 1)
+	if (priv->modulus < 1)
 		return -ERANGE;
 
 	if (priv->offset + priv->modulus - 1 < priv->offset)
-- 
2.20.1


^ permalink raw reply related

* [PATCH AUTOSEL 4.19 14/42] mac80211: don't warn about CW params when not using them
From: Sasha Levin @ 2019-08-02 13:22 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Brian Norris, Johannes Berg, Sasha Levin, linux-wireless, netdev
In-Reply-To: <20190802132302.13537-1-sashal@kernel.org>

From: Brian Norris <briannorris@chromium.org>

[ Upstream commit d2b3fe42bc629c2d4002f652b3abdfb2e72991c7 ]

ieee80211_set_wmm_default() normally sets up the initial CW min/max for
each queue, except that it skips doing this if the driver doesn't
support ->conf_tx. We still end up calling drv_conf_tx() in some cases
(e.g., ieee80211_reconfig()), which also still won't do anything
useful...except it complains here about the invalid CW parameters.

Let's just skip the WARN if we weren't going to do anything useful with
the parameters.

Signed-off-by: Brian Norris <briannorris@chromium.org>
Link: https://lore.kernel.org/r/20190718015712.197499-1-briannorris@chromium.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 net/mac80211/driver-ops.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/net/mac80211/driver-ops.c b/net/mac80211/driver-ops.c
index bb886e7db47f1..f783d1377d9a8 100644
--- a/net/mac80211/driver-ops.c
+++ b/net/mac80211/driver-ops.c
@@ -169,11 +169,16 @@ int drv_conf_tx(struct ieee80211_local *local,
 	if (!check_sdata_in_driver(sdata))
 		return -EIO;
 
-	if (WARN_ONCE(params->cw_min == 0 ||
-		      params->cw_min > params->cw_max,
-		      "%s: invalid CW_min/CW_max: %d/%d\n",
-		      sdata->name, params->cw_min, params->cw_max))
+	if (params->cw_min == 0 || params->cw_min > params->cw_max) {
+		/*
+		 * If we can't configure hardware anyway, don't warn. We may
+		 * never have initialized the CW parameters.
+		 */
+		WARN_ONCE(local->ops->conf_tx,
+			  "%s: invalid CW_min/CW_max: %d/%d\n",
+			  sdata->name, params->cw_min, params->cw_max);
 		return -EINVAL;
+	}
 
 	trace_drv_conf_tx(local, sdata, ac, params);
 	if (local->ops->conf_tx)
-- 
2.20.1


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox