Netdev List
 help / color / mirror / Atom feed
* Re: [Patch net] ipv4: restore rt->fi for reference counting
From: Cong Wang @ 2017-05-12 21:13 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Eric Dumazet, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet
In-Reply-To: <CAM_iQpUdknc_4Hjk9ieC3mshrF20BPEhP8e=rKTRybpOadDg6w@mail.gmail.com>

On Fri, May 12, 2017 at 1:58 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Fri, May 12, 2017 at 10:27 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> Or maybe don't touch nh_oif but using a new flag in
>> nh_flags?? We only need to know if we should call
>> dev_put() in free_fib_info_rcu().
>>
>> Again, I am still not sure if it is any better than just
>> putting these fib_nh into a linked list.
>>
>
> I am afraid we have to use a linked list, because either a flag or
> nh_oif is per nh, but one nh could have multiple different nh_devs...

Scratch that. Each fib_nh has one nh_dev...
I am trying to use a flag.

^ permalink raw reply

* Re: [PATCH net] netvsc: fix net poll mode
From: Eric Dumazet @ 2017-05-12 21:15 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: davem, netdev, Stephen Hemminger
In-Reply-To: <20170512200331.14433-1-sthemmin@microsoft.com>

On Fri, 2017-05-12 at 13:03 -0700, Stephen Hemminger wrote:
> The ndo_poll_controller function needs to schedule NAPI to pick
> up arriving packets and send completions. Otherwise no data
> will ever be received.
> 
> Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
> ---
>  drivers/net/hyperv/netvsc_drv.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index 4421a6d00375..e487ccea251c 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -1158,11 +1158,20 @@ netvsc_get_rxnfc(struct net_device *dev, struct ethtool_rxnfc *info,
>  }
>  
>  #ifdef CONFIG_NET_POLL_CONTROLLER
> -static void netvsc_poll_controller(struct net_device *net)
> +static void netvsc_poll_controller(struct net_device *dev)
>  {
> -	/* As netvsc_start_xmit() works synchronous we don't have to
> -	 * trigger anything here.
> -	 */
> +	struct net_device_context *ndc = netdev_priv(dev);
> +	struct netvsc_device *ndev = rtnl_dereference(ndc->nvdev);


rtnl_dereference() can not possibly be used in an ndo_poll_controller()

You would certainly trigger a lockdep issue here if you compile a kernel
with enough debugging support.

CONFIG_PROVE_RCU=y
CONFIG_PROVE_RCU_REPEATEDLY=y
CONFIG_SPARSE_RCU_POINTER=y

^ permalink raw reply

* Re: [Patch net] ipv4: restore rt->fi for reference counting
From: Julian Anastasov @ 2017-05-12 21:27 UTC (permalink / raw)
  To: Cong Wang
  Cc: Eric Dumazet, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet
In-Reply-To: <CAM_iQpVQSMd_LYMqm3EauRRgJhfkW=zsdvKRf4-YcGWwSufuvQ@mail.gmail.com>


	Hello,

On Fri, 12 May 2017, Cong Wang wrote:

> On Thu, May 11, 2017 at 11:39 PM, Julian Anastasov <ja@ssi.bg> wrote:
> >
> >         fib_flush will unlink the FIB infos at NETDEV_UNREGISTER
> > time, so we can not see them in any hash tables later on
> > NETDEV_UNREGISTER_FINAL. fib_put_nh_devs() can not work
> > except if moving NHs in another hash table but that should
> > not be needed.
> 
> Ah, I did miss the hlist_del() in fib_table_flush(), so we just need some
> way to link those fib_nh together for NETDEV_UNREGISTER_FINAL,
> a linked list should be sufficient, but requires adding list_head to fib_nh.

	It could be a fib_info_devhash_dead hash table, similar
to fib_info_devhash where we can hash NHs for dead fib infos
but then we will need a fib_clntref reference or otherwise last
user can drop the fib info before NETDEV_UNREGISTER_FINAL is called.
It will add more code in fib_sync_down_dev to know when
to call fib_info_hold. But OTOH, it will reduce the work
needed for careful release of the references in fib_release_info.

	So, we have 2 possible cases to consider:

1. route deleted, fib_release_info called, fi held by socket,
NETDEV_UNREGISTER can not see the NHs because they are
unlinked in fib_release_info. dev unreg delayed for long time...

2. NETDEV_UNREGISTER called first, before the NHs are
unlinked.

	Now the main question: is FIB_LOOKUP_NOREF used
everywhere in IPv4? I guess so. If not, it means
someone can walk its res->fi NHs which is bad. I think,
this will delay the unregistration for long time and we
can not solve the problem.

	If yes, free_fib_info() should not use call_rcu.
Instead, fib_release_info() will start RCU callback to
drop everything via a common function for fib_release_info
and free_fib_info. As result, the last fib_info_put will
just need to free fi->fib_metrics and fi.

	Something like:

/* Long living data */
fib_info_destroy:
{
	if (fi->fib_dead == 0) {
		pr_warn("Freeing alive fib_info %p\n", fi);
		return;
	}

	if (fi->fib_metrics != (u32 *) dst_default_metrics)
		kfree(fi->fib_metrics);
	kfree(fi);
}

/* Do not imply RCU grace period anymore, last user is last */
fib_info_put():
{
	if (atomic_dec_and_test(&fi->fib_clntref))
		fib_info_destroy(fi);
}

/* Release everything except long living fields (fib_metrics) */
fib_info_release():
{
        change_nexthops(fi) {
                if (nexthop_nh->nh_dev)
                        dev_put(nexthop_nh->nh_dev);
                lwtstate_put(nexthop_nh->nh_lwtstate);
                free_nh_exceptions(nexthop_nh);
                rt_fibinfo_free_cpus(nexthop_nh->nh_pcpu_rth_output);
                rt_fibinfo_free(&nexthop_nh->nh_rth_input);
        } endfor_nexthops(fi);
}

/* RCU grace period after unlink */
fib_release_info_rcu():
{
	struct fib_info *fi = container_of(head, struct fib_info, rcu);

	fib_info_release(fi);
	fib_info_put(fi);
}

/* Called just on error */
free_fib_info():
{
	fib_info_release(fi);
	fib_info_put(fi);
}

fib_release_info():
{
	...
	fi->fib_dead = 1;
-	fib_info_put(fi);
+	call_rcu(&fi->rcu, fib_release_info_rcu);
}

> >         I'm thinking for the following solution which
> > is a bit hackish:
> >
> > - on NETDEV_UNREGISTER we want to put the nh_dev references,
> > so fib_release_info is a good place. But as fib_release_info
> > is not always called, we will have two places that put
> > references. We can use such hack:
> >
> >         - for example, use nh_oif to know if dev_put is
> >         already called
> >
> >         - fib_release_info() should set nh_oif to 0 because
> >         it will now call dev_put without clearing nh_dev
> 
> Are you sure we are safe to call dev_put() in fib_release_info()
> for _all_ paths, especially non-unregister paths? See:

	Yep, dev_put is safe there...

> commit e49cc0da7283088c5e03d475ffe2fdcb24a6d5b1
> Author: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> Date:   Wed May 23 15:39:45 2012 +0000
> 
>     ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

	...as long as we do not set nh_dev to NULL

> But, hmm, very interesting, I always thought dev_put() triggers a
> kfree() potentially, but here your suggestion actually leverages the fact
> that it is merely a pcpu counter decrease, so for unregister path,
> this is just giving refcnt back, which means it is safe as long as
> we don't have any parallel unregister?  We should because of RTNL.

	Yes, we are in the context of unregistration and
there is a rcu_barrier() that prevents device to be
destroyed while there are RCU users. Refcnt reaching 0 is
not enough to free dev, RCU users must be considered,
they do not get reference.

> I see why you say this is hackish, really it is. ;) We have to ensure
> the evil dev_put() is only called on unregister path.

	Or after a RCU grace period but not later...

> > - fib_dev/nh_dev != NULL checks are not needed, so this addresses
> > Eric's concerns. BTW, fib_route_seq_show actually checks
> > for fi->fib_dev, may be not in a safe way (lockless_dereference
> > needed?) but as we don't set nh_dev to NULL this is not
> > needed.
> >
> 
> I think Eric was complaining about the lack of rcu_dereference()
> there.

	Not needed if we don't set nh_dev to NULL.

> >         What more? What about nh_pcpu_rth_output and
> > nh_rth_input holding routes? We should think about
> > them too. I should think more if nh_oif trick can work
> > for them, may be not because nh_oif is optional...
> > May be we should purge them somehow?
> >
> 
> Or maybe don't touch nh_oif but using a new flag in
> nh_flags?? We only need to know if we should call
> dev_put() in free_fib_info_rcu().

	fib_dead is a better option if we decide to use
such solution: 1=not called by fib_release_info,
2=called by fib_release_info.

Regards

^ permalink raw reply

* Re: [PATCH] net: ipv6: Truncate single route when it doesn't fit into dump buffer.
From: Jan Moskyto Matejka @ 2017-05-12 21:34 UTC (permalink / raw)
  To: David Miller
  Cc: mq, netdev, linux-kernel, dsa, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20170512.112447.1421545301592945998.davem@davemloft.net>

On Fri, May 12, 2017 at 11:24:47AM -0400, David Miller wrote:
> From: Jan Moskyto Matejka <mq@ucw.cz>
> Date: Fri, 12 May 2017 13:15:10 +0200
> 
> > -int rt6_dump_route(struct rt6_info *rt, void *p_arg);
> > +int rt6_dump_route(struct rt6_info *rt, void *p_arg, int truncate);
> 
> Please use "bool" and "true"/"false" for boolean values.

Missed that, sorry. Will remember next time.

> What does ipv4 do in this situation?
> 
> I'm hesitant to be OK with adding a new nlmsg flag just for this case
> if we solve this problem differently and using existing mechanisms
> elsewhere.

IPv4 is broken the same way as IPv6, with the difference that
'ip route append' does not append a new multipath nexthop but
creates a whole new route.

It is probably impossible to create such a big route via iproute2 tool;
it is needed to open netlink socket by hand and write there the
attached file. (It is one nlmsg adding one huge multipath route.)

It should be also noted that 'ip monitor' gets the huge routes
truncated.

MQ

^ permalink raw reply

* Re: [PATCH] net: ipv6: Truncate single route when it doesn't fit into dump buffer.
From: Jan Moskyto Matejka @ 2017-05-12 21:41 UTC (permalink / raw)
  To: David Ahern
  Cc: David Miller, mq, netdev, linux-kernel, dsa, kuznet, jmorris,
	yoshfuji, kaber
In-Reply-To: <3828cf04-1bf8-92d2-dfc4-184bd615fe10@gmail.com>

On Fri, May 12, 2017 at 10:26:08AM -0700, David Ahern wrote:
> On 5/12/17 8:24 AM, David Miller wrote:
> > From: Jan Moskyto Matejka <mq@ucw.cz>
> > Date: Fri, 12 May 2017 13:15:10 +0200
> > 
> >> -int rt6_dump_route(struct rt6_info *rt, void *p_arg);
> >> +int rt6_dump_route(struct rt6_info *rt, void *p_arg, int truncate);
> > 
> > Please use "bool" and "true"/"false" for boolean values.
> > 
> > What does ipv4 do in this situation?
> > 
> > I'm hesitant to be OK with adding a new nlmsg flag just for this case
> > if we solve this problem differently and using existing mechanisms
> > elsewhere.
> > 
> 
> I'll take a look at this later today or this weekend; we can't just
> truncate the route returned to userspace.

Agreed. My favourite would be skb realloc somewhere inside the dump loop
... but I don't know whether it's feasible.

MQ

^ permalink raw reply

* Re: [PATCH] net: ipv6: Truncate single route when it doesn't fit into dump buffer.
From: Jan Moskyto Matejka @ 2017-05-12 21:43 UTC (permalink / raw)
  To: David Miller
  Cc: mq, netdev, linux-kernel, dsa, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20170512213404.rndcujbptjhhusww@lopatka.joja.cz>

[-- Attachment #1: Type: text/plain, Size: 1102 bytes --]

On Fri, May 12, 2017 at 11:34:04PM +0200, Jan Moskyto Matejka wrote:
> On Fri, May 12, 2017 at 11:24:47AM -0400, David Miller wrote:
> > From: Jan Moskyto Matejka <mq@ucw.cz>
> > Date: Fri, 12 May 2017 13:15:10 +0200
> > 
> > > -int rt6_dump_route(struct rt6_info *rt, void *p_arg);
> > > +int rt6_dump_route(struct rt6_info *rt, void *p_arg, int truncate);
> > 
> > Please use "bool" and "true"/"false" for boolean values.
> 
> Missed that, sorry. Will remember next time.
> 
> > What does ipv4 do in this situation?
> > 
> > I'm hesitant to be OK with adding a new nlmsg flag just for this case
> > if we solve this problem differently and using existing mechanisms
> > elsewhere.
> 
> IPv4 is broken the same way as IPv6, with the difference that
> 'ip route append' does not append a new multipath nexthop but
> creates a whole new route.
> 
> It is probably impossible to create such a big route via iproute2 tool;
> it is needed to open netlink socket by hand and write there the
> attached file. (It is one nlmsg adding one huge multipath route.)

Oops, it's late evening here. Attached now.
MQ

[-- Attachment #2: ipv4-huge-multipath --]
[-- Type: application/octet-stream, Size: 65560 bytes --]

^ permalink raw reply

* Re: [Problem] Broadcom BCM5762 Ethernet tg3 times out with stack trace
From: Michael Chan @ 2017-05-12 21:45 UTC (permalink / raw)
  To: Daniel Kim
  Cc: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan,
	Linux Netdev List
In-Reply-To: <CANeD+K_ojPPUS2QtHRg+1o4EZaCSFcaa=dDPfkR2tUYNwdGNiw@mail.gmail.com>

On Fri, May 12, 2017 at 12:10 PM, Daniel Kim <dkim@playmechanix.com> wrote:

> [ 4898.361530] tg3 0000:04:00.0 enp4s0: 0: Host status block
> [00000001:00000021:(0000:0486:0000):(0000:002c)]
> [ 4898.361535] tg3 0000:04:00.0 enp4s0: 0: NAPI info
> [00000021:00000021:(0135:002c:01ff):0000:(0568:0000:0000:0000)]

As you can see here, the driver has caught up with the hardware's
status tag of 0x21 and tx consumer of 0x2c.  However, the tx producer
is at 0x135, way ahead of the consumer.  The hardware is not
completing all these TX BDs from 0x2c to 0x135.

> [ 4898.360797] tg3 0000:04:00.0 enp4s0: 0x00000c00: 0x0000000a,
> 0x00000000, 0x00000003, 0x00000001
>
> [ 4898.360849] tg3 0000:04:00.0 enp4s0: 0x00001800: 0x00000016,
> 0x00000000, 0x00000135, 0x00000000
>
> [ 4898.361157] tg3 0000:04:00.0 enp4s0: 0x00004800: 0x380303fe,
> 0x00000000, 0x00000000, 0x00000100

I checked a few of the status registers related to the TX blocks and
the read DMA engine status and they are all ok.  The hardware sees the
tx producer of 0x135.

> [ 4898.361087] tg3 0000:04:00.0 enp4s0: 0x00003cc0: 0x00000135,
0x00000000, 0x00000000, 0x00000000

Even the internal host coalescing is seeing 0x135 as the tx index.
Somehow this is not DMA'ed to the host.

Do you have IOMMU enabled?  Does the same NIC used to work with a
different kernel?

^ permalink raw reply

* Re: [PATCH] mdio: mux: fix device_node_continue.cocci warnings
From: Florian Fainelli @ 2017-05-12 22:52 UTC (permalink / raw)
  To: David Miller, julia.lawall, jon.mason
  Cc: netdev, jon.mason, andrew, kbuild-all, linux-kernel
In-Reply-To: <20170512.122223.341607427763777325.davem@davemloft.net>

On 05/12/2017 09:22 AM, David Miller wrote:
> From: Julia Lawall <julia.lawall@lip6.fr>
> Date: Fri, 12 May 2017 22:54:23 +0800 (SGT)
> 
>> Device node iterators put the previous value of the index variable, so an
>> explicit put causes a double put.
>  ...
>> @@ -169,7 +169,6 @@ int mdio_mux_init(struct device *dev,
>>  		if (r) {
>>  			mdiobus_free(cb->mii_bus);
>>  			devm_kfree(dev, cb);
>> -			of_node_put(child_bus_node);
>>  		} else {
> 
> I think we're instead simply missing a break; statement here.

It's kind of questionable, if we have an error initializing one of our
child MDIO bus controller (child from the perspective of the MDIO mux,
boy this is getting complicated...), should we keep on going, or should
we abort entirely and rollback what we have successfully registered?

I don't think Julia's patch makes thing worse, in that if we had to
rollback, we would not be doing this correctly now anyway.

Jon, what do you think?
-- 
Florian

^ permalink raw reply

* [PATCH net v2] netvsc: fix net poll mode
From: Stephen Hemminger @ 2017-05-12 22:55 UTC (permalink / raw)
  To: davem; +Cc: netdev, Stephen Hemminger

The ndo_poll_controller function needs to schedule NAPI to pick
up arriving packets and send completions. Otherwise no data
will ever be received.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
---
v2 - use rcu to ensure lower data structure is present

 drivers/net/hyperv/netvsc_drv.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 4421a6d00375..c171f5bc41b9 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1158,11 +1158,22 @@ netvsc_get_rxnfc(struct net_device *dev, struct ethtool_rxnfc *info,
 }
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
-static void netvsc_poll_controller(struct net_device *net)
+static void netvsc_poll_controller(struct net_device *dev)
 {
-	/* As netvsc_start_xmit() works synchronous we don't have to
-	 * trigger anything here.
-	 */
+	struct net_device_context *ndc = netdev_priv(dev);
+	struct netvsc_device *ndev;
+	int i;
+
+	rcu_read_lock();
+	ndev = rcu_dereference(ndc->nvdev);
+	if (ndev) {
+		for (i = 0; i < ndev->num_chn; i++) {
+			struct netvsc_channel *nvchan = &ndev->chan_table[i];
+
+			napi_schedule(&nvchan->napi);
+		}
+	}
+	rcu_read_unlock();
 }
 #endif
 
-- 
2.11.0

^ permalink raw reply related

* Re: [Problem] Broadcom BCM5762 Ethernet tg3 times out with stack trace
From: Daniel Kim @ 2017-05-12 23:12 UTC (permalink / raw)
  To: Michael Chan
  Cc: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan,
	Linux Netdev List
In-Reply-To: <CACKFLi=762N5LzmcOE4dPS_1Op8EsJ8QJBRFqZySaSx7SNb1bg@mail.gmail.com>

I believe I do have IOMMU enabled. At least the dmesg output seems to
imply that I do:
[    1.141948] iommu: Adding device 0000:00:02.0 to group 0
[    1.142033] iommu: Adding device 0000:00:10.0 to group 1
[    1.142074] iommu: Adding device 0000:00:10.1 to group 1
[    1.142119] iommu: Adding device 0000:00:11.0 to group 2
[    1.142172] iommu: Adding device 0000:00:12.0 to group 3
[    1.142184] iommu: Adding device 0000:00:12.2 to group 3
[    1.142234] iommu: Adding device 0000:00:13.0 to group 4
[    1.142247] iommu: Adding device 0000:00:13.2 to group 4
[    1.142303] iommu: Adding device 0000:00:14.0 to group 5
[    1.142315] iommu: Adding device 0000:00:14.2 to group 5
[    1.142328] iommu: Adding device 0000:00:14.3 to group 5
[    1.142373] iommu: Adding device 0000:00:14.4 to group 6
[    1.142417] iommu: Adding device 0000:00:14.5 to group 7
[    1.142529] iommu: Adding device 0000:00:15.0 to group 8
[    1.142570] iommu: Adding device 0000:00:15.2 to group 8
[    1.142639] iommu: Adding device 0000:00:18.0 to group 9
[    1.142653] iommu: Adding device 0000:00:18.1 to group 9
[    1.142668] iommu: Adding device 0000:00:18.2 to group 9
[    1.142682] iommu: Adding device 0000:00:18.3 to group 9
[    1.142695] iommu: Adding device 0000:00:18.4 to group 9
[    1.142712] iommu: Adding device 0000:00:18.5 to group 9
[    1.142725] iommu: Adding device 0000:01:00.0 to group 0
[    1.142733] iommu: Adding device 0000:01:00.1 to group 0
[    1.142840] iommu: Adding device 0000:04:00.0 to group 8
[    1.143305] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40

The last kernel I used where the NIC mostly worked was 4.4.0 (x86-64).
Even when it worked, it had this issue we've experienced since v3.13
where the NIC would suddenly become unresponsive. It won't be
considered disconnected from the network, but RX/TX packets, errors,
dropped, etc. values displayed with ifconfig would show a number that
looked suspiciously close to 32-bit INT_MAX.

Since v4.8, I can trigger it much faster and it now prints out a call
stack and other messages into dmesg.

^ permalink raw reply

* [PATCH] kmod: don't load module unless req process has CAP_SYS_MODULE
From: Mahesh Bandewar @ 2017-05-12 23:22 UTC (permalink / raw)
  To: Ingo Molnar, Greg Kroah-Hartman, LKML, netdev
  Cc: Eric W . Biederman, Kees Cook, David Miller, Eric Dumazet,
	Mahesh Bandewar, Mahesh Bandewar

From: Mahesh Bandewar <maheshb@google.com>

A process inside random user-ns should not load a module, which is
currently possible. As demonstrated in following scenario -

  Create namespaces; especially a user-ns and become root inside.
  $ unshare -rfUp -- unshare -unm -- bash

  Try to load the bridge module. It should fail and this is expected!
  #  modprobe bridge
  WARNING: Error inserting stp (/lib/modules/4.11.0-smp-DEV/kernel/net/802/stp.ko): Operation not permitted
  FATAL: Error inserting bridge (/lib/modules/4.11.0-smp-DEV/kernel/net/bridge/bridge.ko): Operation not permitted

  Verify bridge module is not loaded.
  # lsmod | grep bridge
  #

  Now try to create a bridge inside this newly created net-ns which would
  mean bridge module need to be loaded.
  # ip link add br0 type bridge
  # echo $?
  0
  # lsmod | grep bridge
  bridge                110592  0
  stp                    16384  1 bridge
  llc                    16384  2 bridge,stp
  #

  After this patch -
  # ip link add br0 type bridge
  RTNETLINK answers: Operation not supported
  # echo $?
  2
  # lsmod | grep bridge
  #

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
 kernel/kmod.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/kmod.c b/kernel/kmod.c
index 563f97e2be36..ac30157169b7 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -133,6 +133,9 @@ int __request_module(bool wait, const char *fmt, ...)
 #define MAX_KMOD_CONCURRENT 50	/* Completely arbitrary value - KAO */
 	static int kmod_loop_msg;
 
+	if (!capable(CAP_SYS_MODULE))
+		return -EPERM;
+
 	/*
 	 * We don't allow synchronous module loading from async.  Module
 	 * init may invoke async_synchronize_full() which will end up
-- 
2.13.0.rc2.291.g57267f2277-goog

^ permalink raw reply related

* Re: [PATCH net v2] netvsc: fix net poll mode
From: Stephen Hemminger @ 2017-05-12 23:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, Stephen Hemminger
In-Reply-To: <20170512225530.2233-1-sthemmin@microsoft.com>

On Fri, 12 May 2017 15:55:30 -0700
Stephen Hemminger <stephen@networkplumber.org> wrote:

> The ndo_poll_controller function needs to schedule NAPI to pick
> up arriving packets and send completions. Otherwise no data
> will ever be received.
> 
> Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>

Hold off on this. I will send v3 with a couple other RCU related fixes.

^ permalink raw reply

* [PATCH net] ipv6: avoid dad-failures for addresses with NODAD
From: Mahesh Bandewar @ 2017-05-13  0:03 UTC (permalink / raw)
  To: Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, netdev, David Miller
  Cc: Eric Dumazet, Mahesh Bandewar, Mahesh Bandewar

From: Mahesh Bandewar <maheshb@google.com>

Every address gets added with TENTATIVE flag even for the addresses with
IFA_F_NODAD flag and dad-work is scheduled for them. During this DAD process
we realize it's an address with NODAD and complete the process without
sending any probe. However the TENTATIVE flags stays on the
address for sometime enough to cause misinterpretation when we receive a NS.
While processing NS, if the address has TENTATIVE flag, we mark it DADFAILED
and endup with an address that was originally configured as NODAD with
DADFAILED.

We can't avoid scheduling dad_work for addresses with NODAD but we can
avoid adding TENTATIVE flag to avoid this racy situation.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
 net/ipv6/addrconf.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index b09ac38d8dc4..53f2dc092023 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1022,7 +1022,10 @@ ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr,
 	INIT_HLIST_NODE(&ifa->addr_lst);
 	ifa->scope = scope;
 	ifa->prefix_len = pfxlen;
-	ifa->flags = flags | IFA_F_TENTATIVE;
+	ifa->flags = flags;
+	/* No need to add the TENTATIVE flag for addresses with NODAD */
+	if (!(flags & IFA_F_NODAD))
+		ifa->flags |= IFA_F_TENTATIVE;
 	ifa->valid_lft = valid_lft;
 	ifa->prefered_lft = prefered_lft;
 	ifa->cstamp = ifa->tstamp = jiffies;
-- 
2.13.0.rc2.291.g57267f2277-goog

^ permalink raw reply related

* Re: [PATCH v5 14/17] dt-bindings: slave-device: add current-speed property
From: Rob Herring @ 2017-05-13  0:05 UTC (permalink / raw)
  To: Stefan Wahren
  Cc: David S. Miller, Mark Rutland, Greg Kroah-Hartman, Jiri Slaby,
	linux-serial, linux-kernel, netdev, devicetree
In-Reply-To: <1494406408-31760-15-git-send-email-stefan.wahren@i2se.com>

On Wed, May 10, 2017 at 10:53:25AM +0200, Stefan Wahren wrote:
> This adds a new DT property to define the current baud rate of the
> slave device.
> 
> Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
> ---
>  Documentation/devicetree/bindings/serial/slave-device.txt | 9 +++++++++
>  1 file changed, 9 insertions(+)

Reviewed-by: Rob Herring <robh@kernel.org>

^ permalink raw reply

* [PATCH iproute2 -master 0/2] Two misc BPF updates
From: Daniel Borkmann @ 2017-05-13  0:32 UTC (permalink / raw)
  To: stephen; +Cc: alexei.starovoitov, netdev, Daniel Borkmann

Requires header rebase with -net.

Thanks!

Daniel Borkmann (2):
  bpf: update printing of generic xdp mode
  bpf: dump error to the user when retrieving pinned prog fails

 ip/iplink_xdp.c | 19 +++++++++++--------
 lib/bpf.c       | 12 +++++++++++-
 2 files changed, 22 insertions(+), 9 deletions(-)

-- 
1.9.3

^ permalink raw reply

* [PATCH iproute2 -master 2/2] bpf: dump error to the user when retrieving pinned prog fails
From: Daniel Borkmann @ 2017-05-13  0:32 UTC (permalink / raw)
  To: stephen; +Cc: alexei.starovoitov, netdev, Daniel Borkmann
In-Reply-To: <cover.1494635217.git.daniel@iogearbox.net>

I noticed we currently don't dump an error message when a pinned
program couldn't be retrieved, thus add a hint to the user.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 lib/bpf.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/lib/bpf.c b/lib/bpf.c
index 04ee1ab..ae4d97d 100644
--- a/lib/bpf.c
+++ b/lib/bpf.c
@@ -648,6 +648,16 @@ static int bpf_obj_get(const char *pathname, enum bpf_prog_type type)
 	return bpf(BPF_OBJ_GET, &attr, sizeof(attr));
 }
 
+static int bpf_obj_pinned(const char *pathname, enum bpf_prog_type type)
+{
+	int prog_fd = bpf_obj_get(pathname, type);
+
+	if (prog_fd < 0)
+		fprintf(stderr, "Couldn\'t retrieve pinned program \'%s\': %s\n",
+			pathname, strerror(errno));
+	return prog_fd;
+}
+
 enum bpf_mode {
 	CBPF_BYTECODE,
 	CBPF_FILE,
@@ -750,7 +760,7 @@ static int bpf_parse(enum bpf_prog_type *type, enum bpf_mode *mode,
 	else if (*mode == EBPF_OBJECT)
 		ret = bpf_obj_open(file, *type, section, verbose);
 	else if (*mode == EBPF_PINNED)
-		ret = bpf_obj_get(file, *type);
+		ret = bpf_obj_pinned(file, *type);
 	else
 		return -1;
 
-- 
1.9.3

^ permalink raw reply related

* [PATCH iproute2 -master 1/2] bpf: update printing of generic xdp mode
From: Daniel Borkmann @ 2017-05-13  0:32 UTC (permalink / raw)
  To: stephen; +Cc: alexei.starovoitov, netdev, Daniel Borkmann
In-Reply-To: <cover.1494635217.git.daniel@iogearbox.net>

Follow-up to d67b9cd28c1d ("xdp: refine xdp api with regards to
generic xdp") in order to update the XDP dumping part.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 ip/iplink_xdp.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/ip/iplink_xdp.c b/ip/iplink_xdp.c
index a1380ee..98503fa 100644
--- a/ip/iplink_xdp.c
+++ b/ip/iplink_xdp.c
@@ -79,17 +79,20 @@ int xdp_parse(int *argc, char ***argv, struct iplink_req *req, bool generic)
 void xdp_dump(FILE *fp, struct rtattr *xdp)
 {
 	struct rtattr *tb[IFLA_XDP_MAX + 1];
-	__u32 flags = 0;
+	__u8 mode;
 
 	parse_rtattr_nested(tb, IFLA_XDP_MAX, xdp);
 
-	if (!tb[IFLA_XDP_ATTACHED] ||
-	    !rta_getattr_u8(tb[IFLA_XDP_ATTACHED]))
+	if (!tb[IFLA_XDP_ATTACHED])
 		return;
 
-	if (tb[IFLA_XDP_FLAGS])
-		flags = rta_getattr_u32(tb[IFLA_XDP_FLAGS]);
-
-	fprintf(fp, "xdp%s ",
-		flags & XDP_FLAGS_SKB_MODE ? "generic" : "");
+	mode = rta_getattr_u8(tb[IFLA_XDP_ATTACHED]);
+	if (mode == XDP_ATTACHED_NONE)
+		return;
+	else if (mode == XDP_ATTACHED_DRV)
+		fprintf(fp, "xdp ");
+	else if (mode == XDP_ATTACHED_SKB)
+		fprintf(fp, "xdpgeneric ");
+	else
+		fprintf(fp, "xdp[%u] ", mode);
 }
-- 
1.9.3

^ permalink raw reply related

* Re: [PATCH] mdio: mux: fix device_node_continue.cocci warnings
From: Julia Lawall @ 2017-05-13  0:40 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: David Miller, jon.mason, netdev, andrew, kbuild-all, linux-kernel
In-Reply-To: <872f3980-9faa-718f-3260-9e4b22946140@gmail.com>



On Fri, 12 May 2017, Florian Fainelli wrote:

> On 05/12/2017 09:22 AM, David Miller wrote:
> > From: Julia Lawall <julia.lawall@lip6.fr>
> > Date: Fri, 12 May 2017 22:54:23 +0800 (SGT)
> >
> >> Device node iterators put the previous value of the index variable, so an
> >> explicit put causes a double put.
> >  ...
> >> @@ -169,7 +169,6 @@ int mdio_mux_init(struct device *dev,
> >>  		if (r) {
> >>  			mdiobus_free(cb->mii_bus);
> >>  			devm_kfree(dev, cb);
> >> -			of_node_put(child_bus_node);
> >>  		} else {
> >
> > I think we're instead simply missing a break; statement here.
>
> It's kind of questionable, if we have an error initializing one of our
> child MDIO bus controller (child from the perspective of the MDIO mux,
> boy this is getting complicated...), should we keep on going, or should
> we abort entirely and rollback what we have successfully registered?
>
> I don't think Julia's patch makes thing worse, in that if we had to
> rollback, we would not be doing this correctly now anyway.

Just to be clear, if you want the break instead, then you need to keep the
put.

julia

>
> Jon, what do you think?
> --
> Florian
>

^ permalink raw reply

* Re: [PATCH net-next] ipv6: Implement limits on hop by hop and destination options
From: Tom Herbert @ 2017-05-13  1:02 UTC (permalink / raw)
  To: David Miller; +Cc: Linux Kernel Network Developers, Martin Lau
In-Reply-To: <20170512.113847.1389407948502066394.davem@davemloft.net>

On Fri, May 12, 2017 at 8:38 AM, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <tom@herbertland.com>
> Date: Wed, 10 May 2017 14:33:19 -0700
>
>> RFC 2460 (IPv6) defines hop by hop options and destination options
>> extension headers. Both of these carry a list of TLVs which is
>> only limited by the maximum length of the extension header (2048
>> bytes). By the spec a host must process all the TLVs in these
>> options, however these could be used as a fairly obvious
>> denial of service attack. I think this could in fact be
>> a significant DOS vector on the Internet, one mitigating
>> factor might be that many FWs drop all packets with EH (and
>> obviously this is only IPv6) so an Internet wide attack might not
>> be so effective (yet!).
>>
>> By my calculation, the worse case packet with TLVs in a standard
>> 1500 byte MTU packet that would be processed by the stack contains
>> 1282 invidual TLVs (including pad TLVS) or 724 two byte TLVs. I
>> wrote a quick test program that floods a whole bunch of these
>> packets to a host and sure enough there is substantial time spent
>> in ip6_parse_tlv. These packets contain nothing but unknown TLVS
>> (that are ignored), TLV padding, and bogus UDP header with zero
>> payload length.
>  ...
>> Default values are set to 8 for options counts, and set to INT_MAX
>> for maximum length. Note the choice to limit options to 8 is an
>> arbitrary guess (roughly based on the fact that the stack supports
>> three HBH options and just one destination option).
>
> So the maximum number of TLVs we could process is some function of the
> option count limit, and the number of padding TLVs that can be stuffed
> alongside of those 8 non-padding options?

Yes, if you count the one byte pad TLVs then maximum TLVs processed (#
times through the loop) with the default of 8 is 8 + 9*7= 71. The
padding TLVs are less expensive to process at least.

Function is:
N(max_non_pad) = (max_non_pad + 1) * 7 + max_non_pad

^ permalink raw reply

* [PATCH net] i40e: Fix state flags for bit set and clean operations of PF
From: Mauro S. M. Rodrigues @ 2017-05-13  2:26 UTC (permalink / raw)
  To: netdev, intel-wired-lan
  Cc: jeffrey.t.kirsher, gpiccoli, Mauro S. M. Rodrigues

Commit 0da36b9774cc ("i40e: use DECLARE_BITMAP for state fields")
introduced changes in the way i40e works with state flags converting
them to bitmaps using kernel bitmap API. This change introduced a
regression due to a mistaken substitution using __I40E_VSI_DOWN instead
of __I40E_DOWN when testing state of a PF at i40e_reset_subtask()
function. This caused a flood in the kernel log with the follow message:

[49.013] i40e 0002:01:00.0: bad reset request 0x00000020

Commit d19cb64b9222 ("i40e: separate PF and VSI state flags")
also introduced some misuse of the VSI and PF flags, so both could be
considered as the offenders.

This patch simply fixes the flags where it makes sense by changing
__I40E_VSI_DOWN to __I40E_DOWN.

Fixes: 0da36b9774cc ("i40e: use DECLARE_BITMAP for state fields")
Fixes: d19cb64b9222 ("i40e: separate PF and VSI state flags")

Reviewed-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Signed-off-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 36 ++++++++++++++---------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index d5c9c9e..150caf6 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -295,7 +295,7 @@ struct i40e_vsi *i40e_find_vsi_from_id(struct i40e_pf *pf, u16 id)
  **/
 void i40e_service_event_schedule(struct i40e_pf *pf)
 {
-	if (!test_bit(__I40E_VSI_DOWN, pf->state) &&
+	if (!test_bit(__I40E_DOWN, pf->state) &&
 	    !test_bit(__I40E_RESET_RECOVERY_PENDING, pf->state))
 		queue_work(i40e_wq, &pf->service_task);
 }
@@ -3611,7 +3611,7 @@ static irqreturn_t i40e_intr(int irq, void *data)
 		 * this is not a performance path and napi_schedule()
 		 * can deal with rescheduling.
 		 */
-		if (!test_bit(__I40E_VSI_DOWN, pf->state))
+		if (!test_bit(__I40E_DOWN, pf->state))
 			napi_schedule_irqoff(&q_vector->napi);
 	}
 
@@ -3687,7 +3687,7 @@ static irqreturn_t i40e_intr(int irq, void *data)
 enable_intr:
 	/* re-enable interrupt causes */
 	wr32(hw, I40E_PFINT_ICR0_ENA, ena_mask);
-	if (!test_bit(__I40E_VSI_DOWN, pf->state)) {
+	if (!test_bit(__I40E_DOWN, pf->state)) {
 		i40e_service_event_schedule(pf);
 		i40e_irq_dynamic_enable_icr0(pf, false);
 	}
@@ -6203,7 +6203,7 @@ static void i40e_fdir_reinit_subtask(struct i40e_pf *pf)
 {
 
 	/* if interface is down do nothing */
-	if (test_bit(__I40E_VSI_DOWN, pf->state))
+	if (test_bit(__I40E_DOWN, pf->state))
 		return;
 
 	if (test_bit(__I40E_FD_FLUSH_REQUESTED, pf->state))
@@ -6344,7 +6344,7 @@ static void i40e_watchdog_subtask(struct i40e_pf *pf)
 	int i;
 
 	/* if interface is down do nothing */
-	if (test_bit(__I40E_VSI_DOWN, pf->state) ||
+	if (test_bit(__I40E_DOWN, pf->state) ||
 	    test_bit(__I40E_CONFIG_BUSY, pf->state))
 		return;
 
@@ -6399,9 +6399,9 @@ static void i40e_reset_subtask(struct i40e_pf *pf)
 		reset_flags |= BIT(__I40E_GLOBAL_RESET_REQUESTED);
 		clear_bit(__I40E_GLOBAL_RESET_REQUESTED, pf->state);
 	}
-	if (test_bit(__I40E_VSI_DOWN_REQUESTED, pf->state)) {
-		reset_flags |= BIT(__I40E_VSI_DOWN_REQUESTED);
-		clear_bit(__I40E_VSI_DOWN_REQUESTED, pf->state);
+	if (test_bit(__I40E_DOWN_REQUESTED, pf->state)) {
+		reset_flags |= BIT(__I40E_DOWN_REQUESTED);
+		clear_bit(__I40E_DOWN_REQUESTED, pf->state);
 	}
 
 	/* If there's a recovery already waiting, it takes
@@ -6415,7 +6415,7 @@ static void i40e_reset_subtask(struct i40e_pf *pf)
 
 	/* If we're already down or resetting, just bail */
 	if (reset_flags &&
-	    !test_bit(__I40E_VSI_DOWN, pf->state) &&
+	    !test_bit(__I40E_DOWN, pf->state) &&
 	    !test_bit(__I40E_CONFIG_BUSY, pf->state)) {
 		rtnl_lock();
 		i40e_do_reset(pf, reset_flags, true);
@@ -7002,7 +7002,7 @@ static void i40e_rebuild(struct i40e_pf *pf, bool reinit, bool lock_acquired)
 	u32 val;
 	int v;
 
-	if (test_bit(__I40E_VSI_DOWN, pf->state))
+	if (test_bit(__I40E_DOWN, pf->state))
 		goto clear_recovery;
 	dev_dbg(&pf->pdev->dev, "Rebuilding internal switch\n");
 
@@ -9767,7 +9767,7 @@ int i40e_vsi_release(struct i40e_vsi *vsi)
 		return -ENODEV;
 	}
 	if (vsi == pf->vsi[pf->lan_vsi] &&
-	    !test_bit(__I40E_VSI_DOWN, pf->state)) {
+	    !test_bit(__I40E_DOWN, pf->state)) {
 		dev_info(&pf->pdev->dev, "Can't remove PF VSI\n");
 		return -ENODEV;
 	}
@@ -11003,7 +11003,7 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	}
 	pf->next_vsi = 0;
 	pf->pdev = pdev;
-	set_bit(__I40E_VSI_DOWN, pf->state);
+	set_bit(__I40E_DOWN, pf->state);
 
 	hw = &pf->hw;
 	hw->back = pf;
@@ -11293,7 +11293,7 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	 * before setting up the misc vector or we get a race and the vector
 	 * ends up disabled forever.
 	 */
-	clear_bit(__I40E_VSI_DOWN, pf->state);
+	clear_bit(__I40E_DOWN, pf->state);
 
 	/* In case of MSIX we are going to setup the misc vector right here
 	 * to handle admin queue events etc. In case of legacy and MSI
@@ -11448,7 +11448,7 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	/* Unwind what we've done if something failed in the setup */
 err_vsis:
-	set_bit(__I40E_VSI_DOWN, pf->state);
+	set_bit(__I40E_DOWN, pf->state);
 	i40e_clear_interrupt_scheme(pf);
 	kfree(pf->vsi);
 err_switch_setup:
@@ -11500,7 +11500,7 @@ static void i40e_remove(struct pci_dev *pdev)
 
 	/* no more scheduling of any task */
 	set_bit(__I40E_SUSPENDED, pf->state);
-	set_bit(__I40E_VSI_DOWN, pf->state);
+	set_bit(__I40E_DOWN, pf->state);
 	if (pf->service_timer.data)
 		del_timer_sync(&pf->service_timer);
 	if (pf->service_task.func)
@@ -11740,7 +11740,7 @@ static void i40e_shutdown(struct pci_dev *pdev)
 	struct i40e_hw *hw = &pf->hw;
 
 	set_bit(__I40E_SUSPENDED, pf->state);
-	set_bit(__I40E_VSI_DOWN, pf->state);
+	set_bit(__I40E_DOWN, pf->state);
 	rtnl_lock();
 	i40e_prep_for_reset(pf, true);
 	rtnl_unlock();
@@ -11789,7 +11789,7 @@ static int i40e_suspend(struct pci_dev *pdev, pm_message_t state)
 	int retval = 0;
 
 	set_bit(__I40E_SUSPENDED, pf->state);
-	set_bit(__I40E_VSI_DOWN, pf->state);
+	set_bit(__I40E_DOWN, pf->state);
 
 	if (pf->wol_en && (pf->flags & I40E_FLAG_WOL_MC_MAGIC_PKT_WAKE))
 		i40e_enable_mc_magic_wake(pf);
@@ -11841,7 +11841,7 @@ static int i40e_resume(struct pci_dev *pdev)
 
 	/* handling the reset will rebuild the device state */
 	if (test_and_clear_bit(__I40E_SUSPENDED, pf->state)) {
-		clear_bit(__I40E_VSI_DOWN, pf->state);
+		clear_bit(__I40E_DOWN, pf->state);
 		rtnl_lock();
 		i40e_reset_and_rebuild(pf, false, true);
 		rtnl_unlock();
-- 
2.7.4

^ permalink raw reply related

* [PATCH 0/3] bpf: Track MAP pointer alignment
From: David Miller @ 2017-05-13  2:28 UTC (permalink / raw)
  To: daniel; +Cc: ast, alexei.starovoitov, netdev


This patch series updates the BPF verifier to properly track MAP
access alignment, just as we already do for packet pointers.

The two main elements of the implementation are putting the register
offset into a shared rather than reg type specific area of the
verifier register tracking strcutre, and also eliding packet pointer
specific checks when managing MAP pointer alignment.

More details of the implementation are in the commit message for patch
#2.

Patch #3 removes alignment restrictions from various test_verifier.c
tests which are no longer necessary with this MAP alignment
infrastructure in place.  Further improvements are possible, but
I didn't want to be overly ambitious here.

In the future we can add some MAP tests to test_align.c as well.

Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* [PATCH 1/3] bpf: Use 1<<16 as ceiling for immediate alignment in verifier.
From: David Miller @ 2017-05-13  2:28 UTC (permalink / raw)
  To: daniel; +Cc: ast, alexei.starovoitov, netdev


If we use 1<<31, then sequences like:

	R1 = 0
	R1 <<= 2

do silly things.  Examples of this actually exist in
the MAP tests of test_verifier.c

Update test_align.c expectation strings.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 kernel/bpf/verifier.c                    |  2 +-
 tools/testing/selftests/bpf/test_align.c | 12 ++++++------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 39f2dcb..0f33db0 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1715,7 +1715,7 @@ static void check_reg_overflow(struct bpf_reg_state *reg)
 static u32 calc_align(u32 imm)
 {
 	if (!imm)
-		return 1U << 31;
+		return 1U << 16;
 	return imm - ((imm - 1) & imm);
 }
 
diff --git a/tools/testing/selftests/bpf/test_align.c b/tools/testing/selftests/bpf/test_align.c
index 9644d4e..afaa297 100644
--- a/tools/testing/selftests/bpf/test_align.c
+++ b/tools/testing/selftests/bpf/test_align.c
@@ -235,12 +235,12 @@ static struct bpf_align_test tests[] = {
 		},
 		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
 		.matches = {
-			"4: R0=imm0,min_value=0,max_value=0,min_align=2147483648 R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end R5=pkt(id=0,off=0,r=0) R10=fp",
-			"5: R0=imm0,min_value=0,max_value=0,min_align=2147483648 R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end R5=pkt(id=0,off=14,r=0) R10=fp",
-			"6: R0=imm0,min_value=0,max_value=0,min_align=2147483648 R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end R4=pkt(id=0,off=14,r=0) R5=pkt(id=0,off=14,r=0) R10=fp",
-			"10: R0=imm0,min_value=0,max_value=0,min_align=2147483648 R1=ctx R2=pkt(id=0,off=0,r=18) R3=pkt_end R4=inv56 R5=pkt(id=0,off=14,r=18) R10=fp",
-			"14: R0=imm0,min_value=0,max_value=0,min_align=2147483648 R1=ctx R2=pkt(id=0,off=0,r=18) R3=pkt_end R4=inv48 R5=pkt(id=0,off=14,r=18) R10=fp",
-			"15: R0=imm0,min_value=0,max_value=0,min_align=2147483648 R1=ctx R2=pkt(id=0,off=0,r=18) R3=pkt_end R4=inv48 R5=pkt(id=0,off=14,r=18) R10=fp",
+			"4: R0=imm0,min_value=0,max_value=0,min_align=65536 R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end R5=pkt(id=0,off=0,r=0) R10=fp",
+			"5: R0=imm0,min_value=0,max_value=0,min_align=65536 R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end R5=pkt(id=0,off=14,r=0) R10=fp",
+			"6: R0=imm0,min_value=0,max_value=0,min_align=65536 R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end R4=pkt(id=0,off=14,r=0) R5=pkt(id=0,off=14,r=0) R10=fp",
+			"10: R0=imm0,min_value=0,max_value=0,min_align=65536 R1=ctx R2=pkt(id=0,off=0,r=18) R3=pkt_end R4=inv56 R5=pkt(id=0,off=14,r=18) R10=fp",
+			"14: R0=imm0,min_value=0,max_value=0,min_align=65536 R1=ctx R2=pkt(id=0,off=0,r=18) R3=pkt_end R4=inv48 R5=pkt(id=0,off=14,r=18) R10=fp",
+			"15: R0=imm0,min_value=0,max_value=0,min_align=65536 R1=ctx R2=pkt(id=0,off=0,r=18) R3=pkt_end R4=inv48 R5=pkt(id=0,off=14,r=18) R10=fp",
 		},
 	},
 	{
-- 
2.1.2.532.g19b5d50

^ permalink raw reply related

* [PATCH 2/3] bpf: Track alignment of MAP pointers in verifier.
From: David Miller @ 2017-05-13  2:28 UTC (permalink / raw)
  To: daniel; +Cc: ast, alexei.starovoitov, netdev


Just like packet pointers, track the known alignment of MAP pointers.

In order to facilitate the state tracking, move the register offset
field into where there is an unused 32-bit padding slot on 64-bit.

The check logic is the same as for packet pointers, except we do not
apply NET_IP_ALIGN to the calculations.

Also, there are several restrictions that apply to packet pointers
which we do not extend to MAP pointers.  For example, the
MAX_PACKET_OFF limitation and the "adding integer with < 48 upper zero
bits" thing.

When we add a variable to the MAP pointer, all of the state
transitions are identical except that we elide the reg->range clear
because it is a packet pointer specific piece of state.

This changes the string emitted when an unaligned access is trapped by
the verifier.  Therefore, we need to adjust the search string used by
test_verifier.c

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/linux/bpf_verifier.h                |   6 +-
 kernel/bpf/verifier.c                       | 112 +++++++++++++++++++---------
 tools/testing/selftests/bpf/test_verifier.c |   3 +-
 3 files changed, 80 insertions(+), 41 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index d5093b5..acf711b 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -18,15 +18,13 @@
 
 struct bpf_reg_state {
 	enum bpf_reg_type type;
+	u32 off;
 	union {
 		/* valid when type == CONST_IMM | PTR_TO_STACK | UNKNOWN_VALUE */
 		s64 imm;
 
 		/* valid when type == PTR_TO_PACKET* */
-		struct {
-			u16 off;
-			u16 range;
-		};
+		u32 range;
 
 		/* valid when type == CONST_PTR_TO_MAP | PTR_TO_MAP_VALUE |
 		 *   PTR_TO_MAP_VALUE_OR_NULL
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 0f33db0..b187815 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -231,10 +231,10 @@ static void print_verifier_state(struct bpf_verifier_state *state)
 		else if (t == CONST_PTR_TO_MAP || t == PTR_TO_MAP_VALUE ||
 			 t == PTR_TO_MAP_VALUE_OR_NULL ||
 			 t == PTR_TO_MAP_VALUE_ADJ)
-			verbose("(ks=%d,vs=%d,id=%u)",
+			verbose("(ks=%d,vs=%d,id=%u,off=%u)",
 				reg->map_ptr->key_size,
 				reg->map_ptr->value_size,
-				reg->id);
+				reg->id, reg->off);
 		if (reg->min_value != BPF_REGISTER_MIN_RANGE)
 			verbose(",min_value=%lld",
 				(long long)reg->min_value);
@@ -469,6 +469,7 @@ static void init_reg_state(struct bpf_reg_state *regs)
 
 	for (i = 0; i < MAX_BPF_REG; i++) {
 		regs[i].type = NOT_INIT;
+		regs[i].off = 0;
 		regs[i].imm = 0;
 		regs[i].min_value = BPF_REGISTER_MIN_RANGE;
 		regs[i].max_value = BPF_REGISTER_MAX_RANGE;
@@ -487,6 +488,7 @@ static void init_reg_state(struct bpf_reg_state *regs)
 static void __mark_reg_unknown_value(struct bpf_reg_state *regs, u32 regno)
 {
 	regs[regno].type = UNKNOWN_VALUE;
+	regs[regno].off = 0;
 	regs[regno].id = 0;
 	regs[regno].imm = 0;
 }
@@ -823,10 +825,27 @@ static int check_pkt_ptr_alignment(const struct bpf_reg_state *reg,
 }
 
 static int check_val_ptr_alignment(const struct bpf_reg_state *reg,
-				   int size, bool strict)
+				   int off, int size, bool strict)
 {
-	if (strict && size != 1) {
-		verbose("Unknown alignment. Only byte-sized access allowed in value access.\n");
+	int reg_off;
+
+	/* Byte size accesses are always allowed. */
+	if (!strict || size == 1)
+		return 0;
+
+	reg_off = reg->off;
+	if (reg->id) {
+		if (reg->aux_off_align % size) {
+			verbose("Value access is only %u byte aligned, %d byte access not allowed\n",
+				reg->aux_off_align, size);
+			return -EACCES;
+		}
+		reg_off += reg->aux_off;
+	}
+
+	if ((reg_off + off) % size != 0) {
+		verbose("misaligned value access off %d+%d size %d\n",
+			reg_off, off, size);
 		return -EACCES;
 	}
 
@@ -846,7 +865,7 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
 	case PTR_TO_PACKET:
 		return check_pkt_ptr_alignment(reg, off, size, strict);
 	case PTR_TO_MAP_VALUE_ADJ:
-		return check_val_ptr_alignment(reg, size, strict);
+		return check_val_ptr_alignment(reg, off, size, strict);
 	default:
 		if (off % size != 0) {
 			verbose("misaligned access off %d size %d\n",
@@ -1336,6 +1355,7 @@ static void clear_all_pkt_pointers(struct bpf_verifier_env *env)
 		    reg->type != PTR_TO_PACKET_END)
 			continue;
 		reg->type = UNKNOWN_VALUE;
+		reg->off = 0;
 		reg->imm = 0;
 	}
 }
@@ -1415,6 +1435,7 @@ static int check_call(struct bpf_verifier_env *env, int func_id, int insn_idx)
 	for (i = 0; i < CALLER_SAVED_REGS; i++) {
 		reg = regs + caller_saved[i];
 		reg->type = NOT_INIT;
+		reg->off = 0;
 		reg->imm = 0;
 	}
 
@@ -1458,8 +1479,8 @@ static int check_call(struct bpf_verifier_env *env, int func_id, int insn_idx)
 	return 0;
 }
 
-static int check_packet_ptr_add(struct bpf_verifier_env *env,
-				struct bpf_insn *insn)
+static int check_pointer_add(struct bpf_verifier_env *env,
+			     struct bpf_insn *insn, bool is_packet)
 {
 	struct bpf_reg_state *regs = env->cur_state.regs;
 	struct bpf_reg_state *dst_reg = &regs[insn->dst_reg];
@@ -1468,28 +1489,28 @@ static int check_packet_ptr_add(struct bpf_verifier_env *env,
 	s32 imm;
 
 	if (BPF_SRC(insn->code) == BPF_K) {
-		/* pkt_ptr += imm */
+		/* pointer += imm */
 		imm = insn->imm;
 
 add_imm:
-		if (imm < 0) {
-			verbose("addition of negative constant to packet pointer is not allowed\n");
-			return -EACCES;
-		}
-		if (imm >= MAX_PACKET_OFF ||
-		    imm + dst_reg->off >= MAX_PACKET_OFF) {
-			verbose("constant %d is too large to add to packet pointer\n",
-				imm);
-			return -EACCES;
+		if (is_packet) {
+			if (imm < 0) {
+				verbose("addition of negative constant to packet pointer is not allowed\n");
+				return -EACCES;
+			}
+			if (imm >= MAX_PACKET_OFF ||
+			    imm + dst_reg->off >= MAX_PACKET_OFF) {
+				verbose("constant %d is too large to add to packet pointer\n",
+					imm);
+				return -EACCES;
+			}
 		}
-		/* a constant was added to pkt_ptr.
+		/* a constant was added to the pointer.
 		 * Remember it while keeping the same 'id'
 		 */
 		dst_reg->off += imm;
 	} else {
-		bool had_id;
-
-		if (src_reg->type == PTR_TO_PACKET) {
+		if (is_packet && src_reg->type == PTR_TO_PACKET) {
 			/* R6=pkt(id=0,off=0,r=62) R7=imm22; r7 += r6 */
 			tmp_reg = *dst_reg;  /* save r7 state */
 			*dst_reg = *src_reg; /* copy pkt_ptr state r6 into r7 */
@@ -1503,46 +1524,62 @@ static int check_packet_ptr_add(struct bpf_verifier_env *env,
 		}
 
 		if (src_reg->type == CONST_IMM) {
-			/* pkt_ptr += reg where reg is known constant */
+			/* pointer += reg where reg is known constant */
 			imm = src_reg->imm;
 			goto add_imm;
 		}
-		/* disallow pkt_ptr += reg
+		/* disallow pointer += reg
 		 * if reg is not uknown_value with guaranteed zero upper bits
-		 * otherwise pkt_ptr may overflow and addition will become
+		 * otherwise pointer_ptr may overflow and addition will become
 		 * subtraction which is not allowed
 		 */
 		if (src_reg->type != UNKNOWN_VALUE) {
-			verbose("cannot add '%s' to ptr_to_packet\n",
+			verbose("cannot add '%s' to pointer\n",
 				reg_type_str[src_reg->type]);
 			return -EACCES;
 		}
-		if (src_reg->imm < 48) {
+		if (is_packet && src_reg->imm < 48) {
 			verbose("cannot add integer value with %lld upper zero bits to ptr_to_packet\n",
 				src_reg->imm);
 			return -EACCES;
 		}
 
-		had_id = (dst_reg->id != 0);
-
-		/* dst_reg stays as pkt_ptr type and since some positive
+		/* dst_reg stays as the same type and since some positive
 		 * integer value was added to the pointer, increment its 'id'
 		 */
 		dst_reg->id = ++env->id_gen;
 
-		/* something was added to pkt_ptr, set range to zero */
 		dst_reg->aux_off += dst_reg->off;
 		dst_reg->off = 0;
-		dst_reg->range = 0;
-		if (had_id)
+
+		if (is_packet) {
+			/* something was added to packet ptr, set range to zero */
+			dst_reg->range = 0;
+		}
+		if (dst_reg->aux_off_align) {
 			dst_reg->aux_off_align = min(dst_reg->aux_off_align,
 						     src_reg->min_align);
-		else
+		} else {
 			dst_reg->aux_off_align = src_reg->min_align;
+		}
+		if (!dst_reg->aux_off_align)
+			dst_reg->aux_off_align = 1;
 	}
 	return 0;
 }
 
+static int check_packet_ptr_add(struct bpf_verifier_env *env,
+				struct bpf_insn *insn)
+{
+	return check_pointer_add(env, insn, true);
+}
+
+static int check_map_ptr_add(struct bpf_verifier_env *env,
+			     struct bpf_insn *insn)
+{
+	return check_pointer_add(env, insn, false);
+}
+
 static int evaluate_reg_alu(struct bpf_verifier_env *env, struct bpf_insn *insn)
 {
 	struct bpf_reg_state *regs = env->cur_state.regs;
@@ -2056,10 +2093,12 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
 		if (env->allow_ptr_leaks &&
 		    BPF_CLASS(insn->code) == BPF_ALU64 && opcode == BPF_ADD &&
 		    (dst_reg->type == PTR_TO_MAP_VALUE ||
-		     dst_reg->type == PTR_TO_MAP_VALUE_ADJ))
+		     dst_reg->type == PTR_TO_MAP_VALUE_ADJ)) {
 			dst_reg->type = PTR_TO_MAP_VALUE_ADJ;
-		else
+			check_map_ptr_add(env, insn);
+		} else {
 			mark_reg_unknown_value(regs, insn->dst_reg);
+		}
 	}
 
 	return 0;
@@ -2480,6 +2519,7 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
 	for (i = 0; i < CALLER_SAVED_REGS; i++) {
 		reg = regs + caller_saved[i];
 		reg->type = NOT_INIT;
+		reg->off = 0;
 		reg->imm = 0;
 	}
 
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 3773562..889fb5a 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -5070,7 +5070,8 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
 
 	reject_from_alignment = fd_prog < 0 &&
 				(test->flags & F_NEEDS_EFFICIENT_UNALIGNED_ACCESS) &&
-				strstr(bpf_vlog, "Unknown alignment.");
+				(strstr(bpf_vlog, "misaligned value access") ||
+				 strstr(bpf_vlog, "Value access is only"));
 #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
 	if (reject_from_alignment) {
 		printf("FAIL\nFailed due to alignment despite having efficient unaligned access: '%s'!\n",
-- 
2.1.2.532.g19b5d50

^ permalink raw reply related

* [PATCH 3/3] bpf: Update MAP test_verifier.c tests wrt. alignment.
From: David Miller @ 2017-05-13  2:29 UTC (permalink / raw)
  To: daniel; +Cc: ast, alexei.starovoitov, netdev


Now that we perform proper alignment tracking of MAP accesses in the
verifier, most of the MAP tests in test_verifier.c no longer need the
special F_NEEDS_EFFICIENT_UNALIGNED_ACCESS flag or can trivially be
made to not need it.

struct test_val is an integer count followed by an array of integers.
So for most tests, we adjust the memory access to be 32-bit rather
than 64-bit.

The exception is for tests where the 64-bit access is an important
aspect for what the test is trying to validate.  For example,
the test "invalid map access into an array with a invalid max check"
is trying to do a 64-bit store at the final 4 bytes of the test
array which violates the valid access range.

We could fix this by arranging the test array size such that
the last entry is 8 byte aligned, but that is for future
consideration.

Two other tests which retain the flag are intentionally doing
unaligned accesses to a MAP element.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 tools/testing/selftests/bpf/test_verifier.c | 20 +++++---------------
 1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 889fb5a..73ac23a 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -3117,7 +3117,6 @@ static struct bpf_test tests[] = {
 		.errstr_unpriv = "R0 pointer arithmetic prohibited",
 		.result_unpriv = REJECT,
 		.result = ACCEPT,
-		.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
 	},
 	{
 		"valid map access into an array with a variable",
@@ -3133,7 +3132,7 @@ static struct bpf_test tests[] = {
 			BPF_JMP_IMM(BPF_JGE, BPF_REG_1, MAX_ENTRIES, 3),
 			BPF_ALU64_IMM(BPF_LSH, BPF_REG_1, 2),
 			BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
-			BPF_ST_MEM(BPF_DW, BPF_REG_0, 0,
+			BPF_ST_MEM(BPF_W, BPF_REG_0, 0,
 				   offsetof(struct test_val, foo)),
 			BPF_EXIT_INSN(),
 		},
@@ -3141,7 +3140,6 @@ static struct bpf_test tests[] = {
 		.errstr_unpriv = "R0 pointer arithmetic prohibited",
 		.result_unpriv = REJECT,
 		.result = ACCEPT,
-		.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
 	},
 	{
 		"valid map access into an array with a signed variable",
@@ -3161,7 +3159,7 @@ static struct bpf_test tests[] = {
 			BPF_MOV32_IMM(BPF_REG_1, 0),
 			BPF_ALU32_IMM(BPF_LSH, BPF_REG_1, 2),
 			BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
-			BPF_ST_MEM(BPF_DW, BPF_REG_0, 0,
+			BPF_ST_MEM(BPF_W, BPF_REG_0, 0,
 				   offsetof(struct test_val, foo)),
 			BPF_EXIT_INSN(),
 		},
@@ -3169,7 +3167,6 @@ static struct bpf_test tests[] = {
 		.errstr_unpriv = "R0 pointer arithmetic prohibited",
 		.result_unpriv = REJECT,
 		.result = ACCEPT,
-		.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
 	},
 	{
 		"invalid map access into an array with a constant",
@@ -3211,7 +3208,6 @@ static struct bpf_test tests[] = {
 		.errstr = "R0 min value is outside of the array range",
 		.result_unpriv = REJECT,
 		.result = REJECT,
-		.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
 	},
 	{
 		"invalid map access into an array with a variable",
@@ -3226,7 +3222,7 @@ static struct bpf_test tests[] = {
 			BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 0),
 			BPF_ALU64_IMM(BPF_LSH, BPF_REG_1, 2),
 			BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
-			BPF_ST_MEM(BPF_DW, BPF_REG_0, 0,
+			BPF_ST_MEM(BPF_W, BPF_REG_0, 0,
 				   offsetof(struct test_val, foo)),
 			BPF_EXIT_INSN(),
 		},
@@ -3235,7 +3231,6 @@ static struct bpf_test tests[] = {
 		.errstr = "R0 min value is negative, either use unsigned index or do a if (index >=0) check.",
 		.result_unpriv = REJECT,
 		.result = REJECT,
-		.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
 	},
 	{
 		"invalid map access into an array with no floor check",
@@ -3253,7 +3248,7 @@ static struct bpf_test tests[] = {
 			BPF_MOV32_IMM(BPF_REG_1, 0),
 			BPF_ALU32_IMM(BPF_LSH, BPF_REG_1, 2),
 			BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
-			BPF_ST_MEM(BPF_DW, BPF_REG_0, 0,
+			BPF_ST_MEM(BPF_W, BPF_REG_0, 0,
 				   offsetof(struct test_val, foo)),
 			BPF_EXIT_INSN(),
 		},
@@ -3262,7 +3257,6 @@ static struct bpf_test tests[] = {
 		.errstr = "R0 min value is negative, either use unsigned index or do a if (index >=0) check.",
 		.result_unpriv = REJECT,
 		.result = REJECT,
-		.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
 	},
 	{
 		"invalid map access into an array with a invalid max check",
@@ -3319,7 +3313,6 @@ static struct bpf_test tests[] = {
 		.errstr = "R0 min value is negative, either use unsigned index or do a if (index >=0) check.",
 		.result_unpriv = REJECT,
 		.result = REJECT,
-		.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
 	},
 	{
 		"multiple registers share map_lookup_elem result",
@@ -3435,7 +3428,7 @@ static struct bpf_test tests[] = {
 			BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 1),
 			BPF_ALU64_IMM(BPF_LSH, BPF_REG_1, 2),
 			BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
-			BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, offsetof(struct test_val, foo)),
+			BPF_ST_MEM(BPF_W, BPF_REG_0, 0, offsetof(struct test_val, foo)),
 			BPF_EXIT_INSN(),
 		},
 		.fixup_map2 = { 3 },
@@ -3443,7 +3436,6 @@ static struct bpf_test tests[] = {
 		.result = REJECT,
 		.errstr_unpriv = "R0 pointer arithmetic prohibited",
 		.result_unpriv = REJECT,
-		.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
 	},
 	{
 		"constant register |= constant should keep constant type",
@@ -4835,7 +4827,6 @@ static struct bpf_test tests[] = {
 		.errstr = "R0 min value is negative, either use unsigned index or do a if (index >=0) check.",
 		.result = REJECT,
 		.result_unpriv = REJECT,
-		.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
 	},
 	{
 		"invalid range check",
@@ -4867,7 +4858,6 @@ static struct bpf_test tests[] = {
 		.errstr = "R0 min value is negative, either use unsigned index or do a if (index >=0) check.",
 		.result = REJECT,
 		.result_unpriv = REJECT,
-		.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
 	},
 	{
 		"map in map access",
-- 
2.1.2.532.g19b5d50

^ permalink raw reply related

* Re: [PATCH] net: ipv6: Truncate single route when it doesn't fit into dump buffer.
From: David Ahern @ 2017-05-13  6:52 UTC (permalink / raw)
  To: Jan Moskyto Matejka; +Cc: David Miller, mq, netdev, roopa
In-Reply-To: <20170512214107.w3q7ycl6wm7phwox@lopatka.joja.cz>

On 5/12/17 3:41 PM, Jan Moskyto Matejka wrote:
> On Fri, May 12, 2017 at 10:26:08AM -0700, David Ahern wrote:
>> On 5/12/17 8:24 AM, David Miller wrote:
>>> From: Jan Moskyto Matejka <mq@ucw.cz>
>>> Date: Fri, 12 May 2017 13:15:10 +0200
>>>
>>>> -int rt6_dump_route(struct rt6_info *rt, void *p_arg);
>>>> +int rt6_dump_route(struct rt6_info *rt, void *p_arg, int truncate);
>>>
>>> Please use "bool" and "true"/"false" for boolean values.
>>>
>>> What does ipv4 do in this situation?
>>>
>>> I'm hesitant to be OK with adding a new nlmsg flag just for this case
>>> if we solve this problem differently and using existing mechanisms
>>> elsewhere.
>>>
>>
>> I'll take a look at this later today or this weekend; we can't just
>> truncate the route returned to userspace.
> 
> Agreed. My favourite would be skb realloc somewhere inside the dump loop
> ... but I don't know whether it's feasible.
> 
> MQ
> 

Here is what is happening: user initiates a route dump and specifies a
buffer size for receiving the message which becomes max_recvmsg_len.
This buffer size dictates the skb length allocated by netlink_dump.

The dump is interrupted when a route does not fit in the skb and
returns. Subsequent call picks up with the next route to be dumped - the
one that overflowed. All good and normal so far.

If the next route is larger than max_recvmsg_len, then the route can not
be put in the buffer, nothing is returned to the user which causes the
dump to abort showing the abbreviated output. This problem occurs with
IPv4 and IPv6. You can see this with modest size routes by just dropping
the buffer size to something really small (e.g., with iproute2, change
buf size in rtnl_dump_filter_l to say 2048).

I see 2 problems:
1. the kernel is not telling the user the supplied buffer is too small
(ie., if a single route does not fit in the skb then it should fail and
return an error code to the user),

2. multipath routes for IPv4 and IPv6 do not have a limit.

Should the kernel put a limit on the number of nexthops? I recently put
a cap on MPLS route size as 4096 bytes, but I think this should be
revisited in terms of a limit on number of nexthops to create a
consistent limit even if struct sizes change. And, the limit on the
number of nexthops should be consistent across address families (same
limit for IPv4, IPv6, and MPLS).

>From discussions I have had, 32 nexthops for a single route is on the
laughably high side, but some people do crazy things. How about a limit
of 256 nexthops?

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox