Netdev List


* [PATCH net-next v2 4/4] net: dsa: Utilize switchdev_port_bridge_getlink_deferred()
From: Florian Fainelli @ 2017-01-09 20:45 UTC (permalink / raw)
  To: netdev
  Cc: davem, vivien.didelot, andrew, jiri, marcelo.leitner,
	Florian Fainelli
In-Reply-To: <20170109204523.5843-1-f.fainelli@gmail.com>

Fixes the following sleeping in atomic splat:

[   69.008021] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:752
[   69.016523] in_atomic(): 1, irqs_disabled(): 0, pid: 1528, name: bridge
[   69.023167] INFO: lockdep is turned off.
[   69.027118] CPU: 1 PID: 1528 Comm: bridge Not tainted 4.10.0-rc2-00131-g719651624789-dirty #205
[   69.035840] Hardware name: Broadcom STB (Flattened Device Tree)
[   69.041796] [<c020fc40>] (unwind_backtrace) from [<c020ba28>] (show_stack+0x10/0x14)
[   69.049570] [<c020ba28>] (show_stack) from [<c04fb91c>] (dump_stack+0xb0/0xdc)
[   69.056823] [<c04fb91c>] (dump_stack) from [<c0244d00>] (___might_sleep+0x1a4/0x2a4)
[   69.064599] [<c0244d00>] (___might_sleep) from [<c08f3eb8>] (mutex_lock_nested+0x28/0x7d8)
[   69.072897] [<c08f3eb8>] (mutex_lock_nested) from [<c0674dbc>] (b53_vlan_dump+0x2c/0x104)
[   69.081105] [<c0674dbc>] (b53_vlan_dump) from [<c08eba88>] (switchdev_port_obj_dump_now+0x30/0x6c)
[   69.090094] [<c08eba88>] (switchdev_port_obj_dump_now) from [<c08ebb00>] (switchdev_port_obj_dump+0x3c/0x98)
[   69.099950] [<c08ebb00>] (switchdev_port_obj_dump) from [<c08ebc34>] (switchdev_port_vlan_fill+0x68/0x90)
[   69.109550] [<c08ebc34>] (switchdev_port_vlan_fill) from [<c07f13a4>] (ndo_dflt_bridge_getlink+0x28c/0x4fc)
[   69.119320] [<c07f13a4>] (ndo_dflt_bridge_getlink) from [<c08eb2c8>] (switchdev_port_bridge_getlink+0xc4/0xd4)
[   69.129350] [<c08eb2c8>] (switchdev_port_bridge_getlink) from [<c07f0b10>] (rtnl_bridge_getlink+0x12c/0x28c)
[   69.139206] [<c07f0b10>] (rtnl_bridge_getlink) from [<c0802e28>] (netlink_dump+0xe8/0x268)
[   69.147495] [<c0802e28>] (netlink_dump) from [<c0803864>] (__netlink_dump_start+0x12c/0x18c)
[   69.155958] [<c0803864>] (__netlink_dump_start) from [<c07f3c34>] (rtnetlink_rcv_msg+0x11c/0x228)
[   69.164857] [<c07f3c34>] (rtnetlink_rcv_msg) from [<c0805e3c>] (netlink_rcv_skb+0xc4/0xd8)
[   69.173145] [<c0805e3c>] (netlink_rcv_skb) from [<c07f1110>] (rtnetlink_rcv+0x28/0x30)
[   69.181087] [<c07f1110>] (rtnetlink_rcv) from [<c080577c>] (netlink_unicast+0x16c/0x238)
[   69.189201] [<c080577c>] (netlink_unicast) from [<c0805c3c>] (netlink_sendmsg+0x350/0x364)
[   69.197493] [<c0805c3c>] (netlink_sendmsg) from [<c07beab0>] (sock_sendmsg+0x14/0x24)
[   69.205350] [<c07beab0>] (sock_sendmsg) from [<c07bfdbc>] (SyS_sendto+0xb8/0xe0)
[   69.212769] [<c07bfdbc>] (SyS_sendto) from [<c07bfdfc>] (SyS_send+0x18/0x20)
[   69.219841] [<c07bfdfc>] (SyS_send) from [<c0208100>] (ret_fast_syscall+0x0/0x1c)

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 net/dsa/slave.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 5cd5b8137c08..b38536f951ea 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1038,7 +1038,7 @@ static const struct net_device_ops dsa_slave_netdev_ops = {
 	.ndo_netpoll_cleanup	= dsa_slave_netpoll_cleanup,
 	.ndo_poll_controller	= dsa_slave_poll_controller,
 #endif
-	.ndo_bridge_getlink	= switchdev_port_bridge_getlink,
+	.ndo_bridge_getlink	= switchdev_port_bridge_getlink_deferred,
 	.ndo_bridge_setlink	= switchdev_port_bridge_setlink,
 	.ndo_bridge_dellink	= switchdev_port_bridge_dellink,
 	.ndo_get_phys_port_id	= dsa_slave_get_phys_port_id,
-- 
2.9.3


* Re: [PATCH net-next 0/4] afs: Refcount afs_call struct
From: David Miller @ 2017-01-09 20:48 UTC (permalink / raw)
  To: dhowells; +Cc: netdev, linux-afs, linux-kernel
In-Reply-To: <148397356909.20445.15077871371099721338.stgit@warthog.procyon.org.uk>

From: David Howells <dhowells@redhat.com>
Date: Mon, 09 Jan 2017 14:52:49 +0000

> 
> These patches provide some tracepoints for AFS and fix a potential leak by
> adding refcounting to the afs_call struct.
> 
> The patches are:
> 
>  (1) Add some tracepoints for logging incoming calls and monitoring
>      notifications from AF_RXRPC and data reception.
> 
>  (2) Get rid of afs_wait_mode as it didn't turn out to be as useful as
>      initially expected.  It can be brought back later if needed.  This
>      clears some stuff out that I don't then need to fix up in (4).
> 
>  (3) Allow listen(..., 0) to be used to disable listening.  This makes
>      shutting down the AFS cache manager server in the kernel much easier
>      and the accounting simpler as we can then be sure that (a) all
>      preallocated afs_call structs are released and (b) no new incoming
>      calls are going to be started.
> 
>      For the moment, listening cannot be reenabled.
> 
>  (4) Add refcounting to the afs_call struct to fix a potential multiple
>      release detected by static checking and add a tracepoint to follow the
>      lifecycle of afs_call objects.
> 
> The patches can be found here also:
> 
> 	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite
> 
> Tagged thusly:
> 
> 	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
> 	rxrpc-rewrite-20170109

Pulled, thanks David.


* Re: [PATCH net-next 0/4] net: switchdev: Avoid sleep in atomic with DSA
From: Ido Schimmel @ 2017-01-09 20:48 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev, davem, vivien.didelot, andrew, jiri
In-Reply-To: <20170109194503.10713-1-f.fainelli@gmail.com>

Hi Florian,

On Mon, Jan 09, 2017 at 11:44:59AM -0800, Florian Fainelli wrote:
> Hi all,
> 
> This patch series is to resolve a sleeping function called in atomic context
> debug splat that we observe with DSA.
> 
> Let me know what you think, I was also wondering if we should just always
> make switchdev_port_vlan_fill() set SWITCHDEV_F_DEFER, but was afraid this
> could cause invalid contexts to be used for rocker, mlxsw, i40e etc.

Isn't this a bit of overkill? All the drivers you mention fill the VLAN
dump from their cache and don't require sleeping. Even b53 that you
mention in the last patch does that, but reads the PVID from the device,
which entails taking a mutex.

Can't you just cache the PVID as well? I think this will solve your
problem. Didn't look too much into the b53 code, so maybe I'm missing
something. Seems that mv88e6xxx has a similar problem.

Thanks!

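To illustrate the caching idea raised in the reply above, here is a minimal
userspace sketch in C. It is not b53 or DSA code: the struct sw_model type,
the sw_set_pvid() and sw_dump_pvid() helpers, and the hw_accesses counter are
all hypothetical names invented for the example. The assumption is that the
PVID only changes through a configuration path that is allowed to sleep, so
that path can refresh a software copy and the dump path never has to touch
the device.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PORTS 4

/* Hypothetical model of a switch; hw_pvid stands in for device registers. */
struct sw_model {
	uint16_t hw_pvid[NUM_PORTS];     /* "hardware" state, slow to reach */
	uint16_t cached_pvid[NUM_PORTS]; /* software copy kept by the driver */
	unsigned int hw_accesses;        /* counts simulated register accesses */
};

/* Configuration path: runs in process context, so it may sleep on the bus. */
static void sw_set_pvid(struct sw_model *sw, int port, uint16_t vid)
{
	assert(port >= 0 && port < NUM_PORTS);
	sw->hw_pvid[port] = vid;     /* simulated register write */
	sw->hw_accesses++;
	sw->cached_pvid[port] = vid; /* keep the cache in sync */
}

/* Dump path: may run in atomic context, so it only reads the cache. */
static uint16_t sw_dump_pvid(const struct sw_model *sw, int port)
{
	assert(port >= 0 && port < NUM_PORTS);
	return sw->cached_pvid[port];
}
```

With this split, a dump issued after sw_set_pvid(&sw, 1, 5) returns 5 without
incrementing hw_accesses. A real driver would also have to decide how the
cached value is published to concurrent readers, which this sketch ignores.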

* Re: [net-next PATCH 0/3] net: optimize ICMP-reply code path
From: David Miller @ 2017-01-09 20:49 UTC (permalink / raw)
  To: brouer; +Cc: netdev, eric.dumazet, xiyou.wangcong
In-Reply-To: <20170109150246.30215.63371.stgit@firesoul>

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Mon, 09 Jan 2017 16:03:59 +0100

> This patchset optimizes the ICMP-reply code path for ICMP packets
> that get rate limited. A remote party can easily trigger this code
> path by sending packets to a port number with no listening service.
> 
> Generally the patchset moves the sysctl_icmp_msgs_per_sec ratelimit
> checking to earlier in the code path and removes an allocation.
> 
> 
> Use-case: The specific case where I experienced this as a bottleneck is
> sending UDP packets to a port with no listener, which results in the
> kernel replying with ICMP Destination Unreachable (type:3), Port
> Unreachable (code:3), which causes the bottleneck.
> 
> After Eric and Paolo optimized the UDP socket code, the kernel's PPS
> processing capability is lower for no-listen ports than for normal UDP
> sockets.  This is bad for capacity planning when restarting a service.
> 
> UDP no-listen benchmark 8xCPUs using pktgen_sample04_many_flows.sh:
>  Baseline: 6.6 Mpps
>  Patch:   14.7 Mpps
> Driver mlx5 at 50Gbit/s.

Series applied, thanks Jesper!

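The cover letter quoted above describes moving the rate-limit check earlier
in the code path so that a rate-limited reply costs no allocation. The
following is a minimal userspace sketch of that ordering, not the kernel's
ICMP code: struct reply_limiter, limiter_allow() and send_reply() are
hypothetical names, and the limiter is a plain token counter with the refill
logic left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical limiter: a plain token counter, refill logic omitted. */
struct reply_limiter {
	unsigned int tokens;
};

static bool limiter_allow(struct reply_limiter *rl)
{
	assert(rl != NULL);
	if (rl->tokens == 0)
		return false;
	rl->tokens--;
	return true;
}

/*
 * Build one reply. The limiter is consulted first, so a rate-limited
 * request returns before any memory is allocated.
 */
static bool send_reply(struct reply_limiter *rl, unsigned int *allocs)
{
	char *buf;

	if (!limiter_allow(rl))
		return false;

	buf = malloc(256);
	if (!buf)
		return false;
	(*allocs)++;
	buf[0] = 3; /* stand-in for filling in the reply */
	free(buf);
	return true;
}
```

With two tokens available and five requests, only two allocations happen; the
other three requests exit at the limiter check.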

* Re: [PATCH net] net: dsa: Ensure validity of dst->ds[0]
From: Vivien Didelot @ 2017-01-09 20:50 UTC (permalink / raw)
  To: Florian Fainelli, netdev; +Cc: davem, andrew, Florian Fainelli
In-Reply-To: <20170109195834.11697-1-f.fainelli@gmail.com>

Hi Florian,

Florian Fainelli <f.fainelli@gmail.com> writes:

> It is perfectly possible to have non-zero-indexed switches present
> in a DSA switch tree; in such a case, we will be dereferencing a NULL
> pointer in dsa_cpu_port_ethtool_{setup,restore}. Be more defensive
> and ensure that dst->ds[0] is valid before doing anything with it.
>
> Fixes: 0c73c523cf73 ("net: dsa: Initialize CPU port ethtool ops per tree")
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

The patch is correct since we are already using dst->ds[0] here.

But we should stop using that and use dst->cpu_switch instead, because
the switch with ID 0 won't necessarily be the CPU switch. Now that the
Ethernet switch chips are true Linux devices, they are registered in
order depending on their bus/address. So in a setup like this:

       ,--MDIO--@4--------@2--
      |         |         |
    [CPU] <-> [swA] <-> [swB]

swB will have index 0 and swA will have index 1. Please correct me if
I'm wrong.

Thanks,

        Vivien

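The point made in the review above, that index 0 may be empty or may not be
the switch attached to the CPU, can be shown with a small userspace model in
C. The types and the tree_find_cpu_switch() helper below are hypothetical
stand-ins, not the DSA structures; the review itself suggests using
dst->cpu_switch in the kernel rather than searching.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SWITCHES 4

/* Hypothetical stand-ins for a switch chip and a switch tree. */
struct sw_chip {
	int has_cpu_port;
};

struct sw_tree {
	struct sw_chip *ds[MAX_SWITCHES]; /* slots may be NULL */
};

/* Return the switch that owns the CPU port, or NULL if none is present. */
static struct sw_chip *tree_find_cpu_switch(const struct sw_tree *t)
{
	size_t i;

	assert(t != NULL);
	for (i = 0; i < MAX_SWITCHES; i++) {
		/* Skip empty slots instead of dereferencing them. */
		if (t->ds[i] && t->ds[i]->has_cpu_port)
			return t->ds[i];
	}
	return NULL;
}
```

In the swA/swB layout drawn above, swB sits at index 0 and swA at index 1, so
a lookup that assumes ds[0] would pick the wrong chip, while the search
returns swA.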

* Re: [net-next PATCH 1/3] Revert "icmp: avoid allocating large struct on stack"
From: Jesper Dangaard Brouer @ 2017-01-09 20:53 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, xiyou.wangcong, netdev, brouer
In-Reply-To: <20170109.135259.988711786570465428.davem@davemloft.net>

On Mon, 09 Jan 2017 13:52:59 -0500 (EST)
David Miller <davem@davemloft.net> wrote:

> From: Eric Dumazet <eric.dumazet@gmail.com> 
> Date: Mon, 09 Jan 2017 10:07:04 -0800
> 
> > You really should come to netdev conferences so that you understand
> > goals and efforts, instead of living in your cave.  
> 
> I completely agree with Eric.
> 
> Cong we have a very serious problem with you exactly because you make
> quite vicious emotional statements targeted at other developers
> merely when they say something you disagree with.
> 
> This is completely unacceptable behavior, and you must stop doing
> this, now.

I agree, and it is even documented in:
 Documentation/process/code-of-conflict.rst

Quote: "As a reviewer of code, please strive to keep things civil and
focused on the technical issues involved. [...]"

https://www.kernel.org/doc/html/latest/process/code-of-conflict.html

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCHv3 0/6] sh_eth: add wake-on-lan support via magic packet
From: David Miller @ 2017-01-09 20:55 UTC (permalink / raw)
  To: niklas.soderlund+renesas
  Cc: sergei.shtylyov, horms+renesas, netdev, linux-renesas-soc, geert,
	linux-pm
In-Reply-To: <20170109153409.13956-1-niklas.soderlund+renesas@ragnatech.se>

From: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Date: Mon,  9 Jan 2017 16:34:03 +0100

> This series adds support for Wake-on-Lan using Magic Packet for a few
> models of the sh_eth driver. Patch 1/6 fixes a naming error, patch 2/6
> adds generic support to control and support WoL, while patches 3/6 - 6/6
> enable different models.
> 
> Based on top of net-next master.
 ...

Series applied, thanks.


* Re: [PATCH net-next 0/4] net: switchdev: Avoid sleep in atomic with DSA
From: Florian Fainelli @ 2017-01-09 20:56 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, davem, vivien.didelot, andrew, jiri
In-Reply-To: <20170109204849.GA28310@splinter>

On 01/09/2017 12:48 PM, Ido Schimmel wrote:
> Hi Florian,
> 
> On Mon, Jan 09, 2017 at 11:44:59AM -0800, Florian Fainelli wrote:
>> Hi all,
>>
>> This patch series is to resolve a sleeping function called in atomic context
>> debug splat that we observe with DSA.
>>
>> Let me know what you think, I was also wondering if we should just always
>> make switchdev_port_vlan_fill() set SWITCHDEV_F_DEFER, but was afraid this
>> could cause invalid contexts to be used for rocker, mlxsw, i40e etc.
> 
> Isn't this a bit of overkill? All the drivers you mention fill the VLAN
> dump from their cache and don't require sleeping. Even b53 that you
> mention in the last patch does that, but reads the PVID from the device,
> which entails taking a mutex.

Correct.

> 
> Can't you just cache the PVID as well? I think this will solve your
> problem. Didn't look too much into the b53 code, so maybe I'm missing
> something. Seems that mv88e6xxx has a similar problem.

I suppose we could indeed cache the PVID for b53, but for mv88e6xxx it
seems like we need to perform a bunch of VTU operations, and those
access HW registers, Andrew, Vivien, how do you want to solve that, do
we want to introduce a general VLAN cache somewhere in switchdev/DSA/driver?

Thanks Ido!
-- 
Florian


* Re: [PATCH net] net: dsa: Ensure validity of dst->ds[0]
From: Andrew Lunn @ 2017-01-09 21:01 UTC (permalink / raw)
  To: Vivien Didelot; +Cc: Florian Fainelli, netdev, davem
In-Reply-To: <8737gsc5zm.fsf@weeman.i-did-not-set--mail-host-address--so-tickle-me>

On Mon, Jan 09, 2017 at 03:50:53PM -0500, Vivien Didelot wrote:
> Hi Florian,
> 
> Florian Fainelli <f.fainelli@gmail.com> writes:
> 
> > It is perfectly possible to have non-zero-indexed switches present
> > in a DSA switch tree; in such a case, we will be dereferencing a NULL
> > pointer in dsa_cpu_port_ethtool_{setup,restore}. Be more defensive
> > and ensure that dst->ds[0] is valid before doing anything with it.
> >
> > Fixes: 0c73c523cf73 ("net: dsa: Initialize CPU port ethtool ops per tree")
> > Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
> 
> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
> 
> The patch is correct since we are already using dst->ds[0] here.
> 
> But we should stop using that and use dst->cpu_switch instead, because
> the switch with ID 0 won't necessarily be the CPU switch. Now that the
> Ethernet switch chips are true Linux devices, they are registered in
> order depending on their bus/address. So in a setup like this:
> 
>        ,--MDIO--@4--------@2--
>       |         |         |
>     [CPU] <-> [swA] <-> [swB]

> 
> swB will have index 0 and swA will have index 1. Please correct me if
> I'm wrong.

Correct, which DS has the CPU port is arbitrary.

It also gets messier when Johns finishes reworking my PoC patchset for
multiple CPU ports. You ideally want the correct CPU port for this
host interface.

     Andrew


* Re: [PATCH net-next 0/4] net: switchdev: Avoid sleep in atomic with DSA
From: Andrew Lunn @ 2017-01-09 21:05 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: Florian Fainelli, netdev, davem, vivien.didelot, jiri
In-Reply-To: <20170109204849.GA28310@splinter>

On Mon, Jan 09, 2017 at 10:48:49PM +0200, Ido Schimmel wrote:
> Hi Florian,
> 
> On Mon, Jan 09, 2017 at 11:44:59AM -0800, Florian Fainelli wrote:
> > Hi all,
> > 
> > This patch series is to resolve a sleeping function called in atomic context
> > debug splat that we observe with DSA.
> > 
> > Let me know what you think, I was also wondering if we should just always
> > make switchdev_port_vlan_fill() set SWITCHDEV_F_DEFER, but was afraid this
> > could cause invalid contexts to be used for rocker, mlxsw, i40e etc.
> 
> Isn't this a bit of overkill? All the drivers you mention fill the VLAN
> dump from their cache and don't require sleeping.

Hi Ido

DSA in general does not cache information. It always asks the hardware.
So for mv88e6xxx, this is going to trigger MDIO operations, which take
mutexes and do sleep.

	Andrew


* Re: [PATCH net-next 0/4] net: switchdev: Avoid sleep in atomic with DSA
From: Ido Schimmel @ 2017-01-09 21:14 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev, davem, vivien.didelot, andrew, jiri
In-Reply-To: <e28f29c8-04ed-1d71-7b13-27a5d58d111c@gmail.com>

On Mon, Jan 09, 2017 at 12:56:48PM -0800, Florian Fainelli wrote:
> On 01/09/2017 12:48 PM, Ido Schimmel wrote:
> > Hi Florian,
> > 
> > On Mon, Jan 09, 2017 at 11:44:59AM -0800, Florian Fainelli wrote:
> >> Hi all,
> >>
> >> This patch series is to resolve a sleeping function called in atomic context
> >> debug splat that we observe with DSA.
> >>
> >> Let me know what you think, I was also wondering if we should just always
> >> make switchdev_port_vlan_fill() set SWITCHDEV_F_DEFER, but was afraid this
> >> could cause invalid contexts to be used for rocker, mlxsw, i40e etc.
> > 
> > Isn't this a bit of overkill? All the drivers you mention fill the VLAN
> > dump from their cache and don't require sleeping. Even b53 that you
> > mention in the last patch does that, but reads the PVID from the device,
> > which entails taking a mutex.
> 
> Correct.
> 
> > 
> > Can't you just cache the PVID as well? I think this will solve your
> > problem. Didn't look too much into the b53 code, so maybe I'm missing
> > something. Seems that mv88e6xxx has a similar problem.
> 
> I suppose we could indeed cache the PVID for b53, but for mv88e6xxx it
> seems like we need to perform a bunch of VTU operations, and those
> access HW registers, Andrew, Vivien, how do you want to solve that, do
> we want to introduce a general VLAN cache somewhere in switchdev/DSA/driver?

Truth be told, I don't quite understand why switchdev infra even tries
to dump the VLANs from the device. Like, in which situations is this
going to be different from what the software bridge reports? Sure, you
can set the VLAN filters with SELF and skip the software bridge, but how
does that make sense in a model where you want to reflect the software
datapath?


* Re: [PATCH net-next 0/4] net: switchdev: Avoid sleep in atomic with DSA
From: Vivien Didelot @ 2017-01-09 21:19 UTC (permalink / raw)
  To: Florian Fainelli, Ido Schimmel; +Cc: netdev, davem, andrew, jiri
In-Reply-To: <e28f29c8-04ed-1d71-7b13-27a5d58d111c@gmail.com>

Hi Florian, Ido,

Florian Fainelli <f.fainelli@gmail.com> writes:

>> Can't you just cache the PVID as well? I think this will solve your
>> problem. Didn't look too much into the b53 code, so maybe I'm missing
>> something. Seems that mv88e6xxx has a similar problem.
>
> I suppose we could indeed cache the PVID for b53, but for mv88e6xxx it
> seems like we need to perform a bunch of VTU operations, and those
> access HW registers, Andrew, Vivien, how do you want to solve that, do
> we want to introduce a general VLAN cache somewhere in switchdev/DSA/driver?

Yes, mv88e6xxx does read the hardware registers to get the port default
VID and to read the hardware VLAN table.

DSA drivers should be dumb and simply implement switch operations. If
caching is required, it must be implemented in the DSA layer. Since
switchdev is stateless, I hardly see a way to implement it there.

Thanks,

        Vivien


* Re: [PATCH net-next 0/4] net: switchdev: Avoid sleep in atomic with DSA
From: Andrew Lunn @ 2017-01-09 21:23 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: Florian Fainelli, netdev, davem, vivien.didelot, jiri
In-Reply-To: <20170109211436.GB28310@splinter>

> Truth be told, I don't quite understand why switchdev infra even tries
> to dump the VLANs from the device. Like, in which situations is this
> going to be different from what the software bridge reports?

What happens when the hardware is out of resources and says sorry,
cannot do that? There have been a few discussions about what to do in
that situation: fall back to software, i.e. the software bridge does
it, or totally fail the operation. If you are falling back to
software, the states can be different.

    Andrew


* Re: [PATCH V4 net-next 00/15] net/smc: Shared Memory Communications - RDMA
From: David Miller @ 2017-01-09 21:23 UTC (permalink / raw)
  To: ubraun; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, utz.bacher
In-Reply-To: <20170109155526.10961-1-ubraun@linux.vnet.ibm.com>

From: Ursula Braun <ubraun@linux.vnet.ibm.com>
Date: Mon,  9 Jan 2017 16:55:11 +0100

> here is now V4 of the SMC-R patches having processed your feedback from end
> of November. The most important change is the replacement of sysfs by a
> generic netlink solution in patch 04. And I tried to get rid of the __packed
> attributes. There are still a few usages left due to SMC-R protocol defined
> structures.

Series applied.


* Re: [PATCH net-next] bridge: multicast to unicast
From: Linus Lüssing @ 2017-01-09 21:23 UTC (permalink / raw)
  To: M. Braun
  Cc: Johannes Berg, Felix Fietkau, netdev, David S . Miller,
	Stephen Hemminger, bridge, linux-kernel, linux-wireless
In-Reply-To: <6f5ec9f1-800a-2bc4-2f41-9d803343bb22@fami-braun.de>

On Mon, Jan 09, 2017 at 12:44:19PM +0100, M. Braun wrote:
> Am 09.01.2017 um 09:08 schrieb Johannes Berg:
> > Does it make sense to implement the two in separate layers though?
> > 
> > Clearly, this part needs to be implemented in the bridge layer due to
> > the snooping knowledge, but the code is very similar to what mac80211
> > has now.
> 
> Does the bridge always know about all stations connected?

The bridge does not always know about all stations, especially the
silent ones like in your DVB-T example.

However, concerning IP multicast, there is IGMP/MLD. So the bridge
does know about all stations which are interested in a specific IP
multicast stream.

(As long as there is a querier on the link, which periodically
queries for IGMP/MLD reports from any listener. If there is no
querier, then the bridge multicast snooping, including the bridge
multicast-to-unicast, will fall back to flooding.)

So if your television example uses IP multicast properly, it is
completely doable with the bridge multicast-to-unicast, thanks to
IGMP/MLD.

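The forwarding decision described in the reply above can be summarised in a
few lines of C. This is a simplified userspace model, not bridge code:
struct mcast_group, mcast_copies() and FLOOD_ALL_PORTS are hypothetical
names, and the model assumes that with an active querier the list of
reporting stations is complete, so a group with no listeners gets no copies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FLOOD_ALL_PORTS ((size_t)-1)

/* Hypothetical view of one multicast group as learned by snooping. */
struct mcast_group {
	size_t num_listeners; /* stations that sent IGMP/MLD reports */
};

/*
 * Returns the number of unicast copies to send, or FLOOD_ALL_PORTS when
 * the listener list cannot be trusted because no querier is active.
 */
static size_t mcast_copies(bool querier_present, const struct mcast_group *grp)
{
	assert(grp != NULL);
	if (!querier_present)
		return FLOOD_ALL_PORTS;
	return grp->num_listeners;
}
```

A group with three reporting stations yields three unicast copies when a
querier is present, and falls back to flooding when it is not.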

* Re: [PATCH net-next 0/4] net: switchdev: Avoid sleep in atomic with DSA
From: Ido Schimmel @ 2017-01-09 21:29 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Florian Fainelli, netdev, davem, vivien.didelot, jiri
In-Reply-To: <20170109212320.GB20958@lunn.ch>

On Mon, Jan 09, 2017 at 10:23:20PM +0100, Andrew Lunn wrote:
> > Truth be told, I don't quite understand why switchdev infra even tries
> > to dump the VLANs from the device. Like, in which situations is this
> > going to be different from what the software bridge reports?
> 
> What happens when the hardware is out of resources and says sorry,
> cannot do that? There have been a few discussions about what to do in
> that situation: fall back to software, i.e. the software bridge does
> it, or totally fail the operation. If you are falling back to
> software, the states can be different.

If the driver fails to set its VLAN filters then the operation is also
rolled back in the software bridge, in which case you're still in sync.

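The rollback behaviour described in the reply above can be modelled in
userspace C as follows. This is only a sketch under stated assumptions, not
the bridge or switchdev implementation: struct vlan_model, hw_vlan_add() and
vlan_add() are hypothetical names, and the "hardware" is an array with an
artificial capacity limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_VID 4096

struct vlan_model {
	bool sw[MAX_VID];   /* software bridge VLAN membership */
	bool hw[MAX_VID];   /* VLANs programmed into the simulated device */
	size_t hw_used;
	size_t hw_capacity; /* simulated hardware table size */
};

/* Simulated device programming; fails when the table is full. */
static int hw_vlan_add(struct vlan_model *m, unsigned int vid)
{
	if (m->hw_used >= m->hw_capacity)
		return -1; /* device out of resources */
	m->hw[vid] = true;
	m->hw_used++;
	return 0;
}

/* Add a VLAN in software first, then undo it if the device refuses. */
static int vlan_add(struct vlan_model *m, unsigned int vid)
{
	assert(vid < MAX_VID);
	if (m->sw[vid])
		return 0; /* already present */
	m->sw[vid] = true;
	if (hw_vlan_add(m, vid) != 0) {
		m->sw[vid] = false; /* roll back so both views agree */
		return -1;
	}
	return 0;
}
```

With a capacity of one entry, the second vlan_add() fails and leaves neither
the software nor the hardware view containing the new VID, so the two stay in
sync. A fallback-to-software policy would instead keep the software entry,
which is the case where the two views could diverge.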

* Re: [PATCH net-next] bridge: multicast to unicast
From: Stephen Hemminger @ 2017-01-09 21:30 UTC (permalink / raw)
  To: Linus Lüssing
  Cc: M. Braun, Johannes Berg, Felix Fietkau, netdev, David S . Miller,
	bridge, linux-kernel, linux-wireless
In-Reply-To: <20170109212345.GA5513@otheros>

On Mon, 9 Jan 2017 22:23:45 +0100
Linus Lüssing <linus.luessing@c0d3.blue> wrote:

> On Mon, Jan 09, 2017 at 12:44:19PM +0100, M. Braun wrote:
> > Am 09.01.2017 um 09:08 schrieb Johannes Berg:  
> > > Does it make sense to implement the two in separate layers though?
> > > 
> > > Clearly, this part needs to be implemented in the bridge layer due to
> > > the snooping knowledge, but the code is very similar to what mac80211
> > > has now.  
> > 
> > Does the bridge always know about all stations connected?  
> 
> The bridge does not always know about all stations, especially the
> silent ones like in your DVB-T example.
> 
> However, concerning IP multicast, there is IGMP/MLD. So the bridge
> does know about all stations which are interested in a specific IP
> multicast stream.
> 
> (As long as there is a querier on the link, which periodically
> queries for IGMP/MLD reports from any listener. If there is no
> querier, then the bridge multicast snooping, including the bridge
> multicast-to-unicast, will fall back to flooding.)
> 
> 
> So if your television example uses IP multicast properly, it is
> completely doable with the bridge multicast-to-unicast, thanks to
> IGMP/MLD.

I wonder if MAC80211 should be doing IGMP snooping and not bridge
in this environment.


* [PATCH net-next 0/7] net: ipv4: return matching route for GETROUTE request
From: David Ahern @ 2017-01-09 21:32 UTC (permalink / raw)
  To: netdev; +Cc: David Ahern

For complicated and highly populated route tables, matching the response
to an RTM_GETROUTE request with the route entry that was hit is like
reading an eye chart. This series solves that problem by returning the RIB
entry that was matched for a GETROUTE request as a new nested attribute,
RTA_ROUTE_GET, that contains the typical RTAs for a route spec.

Example:
    $ ip ro get 10.10.10.10
    10.10.10.10 via 172.16.20.21 dev virt01 src 172.16.20.20 uid 0
        cache

    Matching route:
    10.10.10.10  encap mpls  100 via 172.16.20.21 dev virt01

Patches 1-3 refactor the existing input and output route lookups, moving
the rcu read lock protected sections into standalone functions that take
the fib_result as an input argument. inet_rtm_getroute is then converted
to use the new functions while holding the rcu read lock. Doing so gives
inet_rtm_getroute access to the matching fib_info.

Patch 4 refactors fib_dump_info, moving the code that adds route
attributes to a response into a separate function.

Patch 5 adds the prefix for the matching trie entry to fib_result.

Patch 6 then adds the prefix and matching fib_info to the GETROUTE
response using the fib_dump_add_attrs_rcu from Patch 4.

Patch 7 removes the event arg from rt_fill_info simplifying its
argument list.

IPv6 will be converted to return the same in a follow on patch set.

David Ahern (7):
  net: ipv4: refactor __ip_route_output_key_hash
  net: ipv4: refactor ip_route_input_noref
  net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup
  net: ipv4: refactor fib_dump_info
  net: ipv4: Save trie prefix to fib lookup result
  net: ipv4: return route match in GETROUTE request
  net: ipv4: Remove event arg to rt_fill_info

 include/net/ip_fib.h           |   1 +
 include/net/route.h            |  12 ++-
 include/uapi/linux/rtnetlink.h |   2 +
 net/ipv4/fib_lookup.h          |   2 +
 net/ipv4/fib_semantics.c       |  17 +++-
 net/ipv4/fib_trie.c            |   1 +
 net/ipv4/icmp.c                |   4 +-
 net/ipv4/route.c               | 177 +++++++++++++++++++++++++++--------------
 8 files changed, 149 insertions(+), 67 deletions(-)

-- 
2.1.4


* [PATCH net-next 1/7] net: ipv4: refactor __ip_route_output_key_hash
From: David Ahern @ 2017-01-09 21:32 UTC (permalink / raw)
  To: netdev; +Cc: David Ahern
In-Reply-To: <1483997571-3964-1-git-send-email-dsa@cumulusnetworks.com>

A later patch wants access to the fib result on an output route lookup
with the rcu lock held. Refactor __ip_route_output_key_hash, pushing
the logic between rcu_read_lock ... rcu_read_unlock into a new helper
that takes the fib_result as an input arg.

To keep the name length under control remove the leading underscores
from the name. _rcu is added to the name of the new helper indicating
it is called with the rcu read lock held.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 include/net/route.h |  9 ++++++---
 net/ipv4/icmp.c     |  4 ++--
 net/ipv4/route.c    | 50 +++++++++++++++++++++++++++++---------------------
 3 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index c0874c87c173..bf4f7a98f753 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -113,13 +113,16 @@ struct in_device;
 int ip_rt_init(void);
 void rt_cache_flush(struct net *net);
 void rt_flush_dev(struct net_device *dev);
-struct rtable *__ip_route_output_key_hash(struct net *, struct flowi4 *flp,
-					  int mp_hash);
+struct rtable *ip_route_output_key_hash(struct net *, struct flowi4 *flp,
+					int mp_hash);
+struct rtable *ip_route_output_key_hash_rcu(struct net *, struct flowi4 *flp,
+					    struct fib_result *res,
+					    int mp_hash);
 
 static inline struct rtable *__ip_route_output_key(struct net *net,
 						   struct flowi4 *flp)
 {
-	return __ip_route_output_key_hash(net, flp, -1);
+	return ip_route_output_key_hash(net, flp, -1);
 }
 
 struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 0777ea949223..67ed57365f80 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -482,8 +482,8 @@ static struct rtable *icmp_route_lookup(struct net *net,
 	fl4->flowi4_oif = l3mdev_master_ifindex(skb_dst(skb_in)->dev);
 
 	security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
-	rt = __ip_route_output_key_hash(net, fl4,
-					icmp_multipath_hash_skb(skb_in));
+	rt = ip_route_output_key_hash(net, fl4,
+				      icmp_multipath_hash_skb(skb_in));
 	if (IS_ERR(rt))
 		return rt;
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 7144288371cf..effd7f8e31f9 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2181,29 +2181,39 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
  * Major route resolver routine.
  */
 
-struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
-					  int mp_hash)
+struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
+					int mp_hash)
 {
-	struct net_device *dev_out = NULL;
 	__u8 tos = RT_FL_TOS(fl4);
-	unsigned int flags = 0;
 	struct fib_result res;
 	struct rtable *rth;
-	int orig_oif;
-	int err = -ENETUNREACH;
 
 	res.tclassid	= 0;
 	res.fi		= NULL;
 	res.table	= NULL;
 
-	orig_oif = fl4->flowi4_oif;
-
 	fl4->flowi4_iif = LOOPBACK_IFINDEX;
 	fl4->flowi4_tos = tos & IPTOS_RT_MASK;
 	fl4->flowi4_scope = ((tos & RTO_ONLINK) ?
 			 RT_SCOPE_LINK : RT_SCOPE_UNIVERSE);
 
 	rcu_read_lock();
+	rth = ip_route_output_key_hash_rcu(net, fl4, &res, mp_hash);
+	rcu_read_unlock();
+
+	return rth;
+}
+EXPORT_SYMBOL_GPL(ip_route_output_key_hash);
+
+struct rtable *ip_route_output_key_hash_rcu(struct net *net, struct flowi4 *fl4,
+					    struct fib_result *res, int mp_hash)
+{
+	struct net_device *dev_out = NULL;
+	int orig_oif = fl4->flowi4_oif;
+	unsigned int flags = 0;
+	struct rtable *rth;
+	int err = -ENETUNREACH;
+
 	if (fl4->saddr) {
 		rth = ERR_PTR(-EINVAL);
 		if (ipv4_is_multicast(fl4->saddr) ||
@@ -2289,15 +2299,15 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
 			fl4->daddr = fl4->saddr = htonl(INADDR_LOOPBACK);
 		dev_out = net->loopback_dev;
 		fl4->flowi4_oif = LOOPBACK_IFINDEX;
-		res.type = RTN_LOCAL;
+		res->type = RTN_LOCAL;
 		flags |= RTCF_LOCAL;
 		goto make_route;
 	}
 
-	err = fib_lookup(net, fl4, &res, 0);
+	err = fib_lookup(net, fl4, res, 0);
 	if (err) {
-		res.fi = NULL;
-		res.table = NULL;
+		res->fi = NULL;
+		res->table = NULL;
 		if (fl4->flowi4_oif &&
 		    (ipv4_is_multicast(fl4->daddr) ||
 		    !netif_index_is_l3_master(net, fl4->flowi4_oif))) {
@@ -2322,17 +2332,17 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
 			if (fl4->saddr == 0)
 				fl4->saddr = inet_select_addr(dev_out, 0,
 							      RT_SCOPE_LINK);
-			res.type = RTN_UNICAST;
+			res->type = RTN_UNICAST;
 			goto make_route;
 		}
 		rth = ERR_PTR(err);
 		goto out;
 	}
 
-	if (res.type == RTN_LOCAL) {
+	if (res->type == RTN_LOCAL) {
 		if (!fl4->saddr) {
-			if (res.fi->fib_prefsrc)
-				fl4->saddr = res.fi->fib_prefsrc;
+			if (res->fi->fib_prefsrc)
+				fl4->saddr = res->fi->fib_prefsrc;
 			else
 				fl4->saddr = fl4->daddr;
 		}
@@ -2344,20 +2354,18 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
 		goto make_route;
 	}
 
-	fib_select_path(net, &res, fl4, mp_hash);
+	fib_select_path(net, res, fl4, mp_hash);
 
-	dev_out = FIB_RES_DEV(res);
+	dev_out = FIB_RES_DEV(*res);
 	fl4->flowi4_oif = dev_out->ifindex;
 
 
 make_route:
-	rth = __mkroute_output(&res, fl4, orig_oif, dev_out, flags);
+	rth = __mkroute_output(res, fl4, orig_oif, dev_out, flags);
 
 out:
-	rcu_read_unlock();
 	return rth;
 }
-EXPORT_SYMBOL_GPL(__ip_route_output_key_hash);
 
 static struct dst_entry *ipv4_blackhole_dst_check(struct dst_entry *dst, u32 cookie)
 {
-- 
2.1.4

* [PATCH net-next 2/7] net: ipv4: refactor ip_route_input_noref
From: David Ahern @ 2017-01-09 21:32 UTC (permalink / raw)
  To: netdev; +Cc: David Ahern
In-Reply-To: <1483997571-3964-1-git-send-email-dsa@cumulusnetworks.com>

A later patch needs access to the fib result of an input route lookup
while the rcu lock is still held. Refactor ip_route_input_noref, moving
the logic between rcu_read_lock ... rcu_read_unlock into a new helper,
ip_route_input_rcu, that takes the fib_result as an argument.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
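Note for readers: the wrapper/helper split described above can be
sketched in userspace C roughly as follows. This is only an
illustration under stated assumptions. The lock is a plain counter
standing in for rcu_read_lock/rcu_read_unlock, and every name here
(lookup_result, lookup_locked, lookup, lookup_uses_main_table) is
hypothetical, not a kernel symbol.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct fib_result. */
struct lookup_result {
	int type;
	const char *table;
};

/* Toy model of a read-side critical section: just a nesting counter. */
int read_depth;

void mock_read_lock(void)   { read_depth++; }
void mock_read_unlock(void) { assert(read_depth > 0); read_depth--; }

/* Helper: the caller must hold the read lock and owns *res. */
int lookup_locked(int key, struct lookup_result *res)
{
	assert(read_depth > 0);
	res->type = 0;
	res->table = NULL;
	if (key < 0)
		return -1;
	res->type = 1;
	res->table = "main";
	return 0;
}

/* Wrapper: keeps the old signature; the result never leaves the lock. */
int lookup(int key)
{
	struct lookup_result res;
	int err;

	mock_read_lock();
	err = lookup_locked(key, &res);
	mock_read_unlock();

	return err;
}

/* New-style caller: inspects the result while the lock is still held. */
int lookup_uses_main_table(int key)
{
	struct lookup_result res;
	int ret = 0;

	mock_read_lock();
	if (lookup_locked(key, &res) == 0 && res.table)
		ret = (res.table[0] == 'm');
	mock_read_unlock();

	return ret;
}
```

Existing callers keep calling lookup() unchanged, while a caller that
needs the result takes the lock itself and calls lookup_locked().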
 include/net/route.h |  3 +++
 net/ipv4/route.c    | 66 ++++++++++++++++++++++++++++++-----------------------
 2 files changed, 40 insertions(+), 29 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index bf4f7a98f753..4f3502a67203 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -178,6 +178,9 @@ static inline struct rtable *ip_route_output_gre(struct net *net, struct flowi4
 
 int ip_route_input_noref(struct sk_buff *skb, __be32 dst, __be32 src,
 			 u8 tos, struct net_device *devin);
+int ip_route_input_rcu(struct sk_buff *skb, __be32 dst, __be32 src,
+		       u8 tos, struct net_device *devin,
+		       struct fib_result *res);
 
 static inline int ip_route_input(struct sk_buff *skb, __be32 dst, __be32 src,
 				 u8 tos, struct net_device *devin)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index effd7f8e31f9..3142cd802e79 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1789,9 +1789,9 @@ static int ip_mkroute_input(struct sk_buff *skb,
  */
 
 static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
-			       u8 tos, struct net_device *dev)
+			       u8 tos, struct net_device *dev,
+			       struct fib_result *res)
 {
-	struct fib_result res;
 	struct in_device *in_dev = __in_dev_get_rcu(dev);
 	struct ip_tunnel_info *tun_info;
 	struct flowi4	fl4;
@@ -1821,8 +1821,8 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	if (ipv4_is_multicast(saddr) || ipv4_is_lbcast(saddr))
 		goto martian_source;
 
-	res.fi = NULL;
-	res.table = NULL;
+	res->fi = NULL;
+	res->table = NULL;
 	if (ipv4_is_lbcast(daddr) || (saddr == 0 && daddr == 0))
 		goto brd_input;
 
@@ -1857,17 +1857,17 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	fl4.flowi4_flags = 0;
 	fl4.daddr = daddr;
 	fl4.saddr = saddr;
-	err = fib_lookup(net, &fl4, &res, 0);
+	err = fib_lookup(net, &fl4, res, 0);
 	if (err != 0) {
 		if (!IN_DEV_FORWARD(in_dev))
 			err = -EHOSTUNREACH;
 		goto no_route;
 	}
 
-	if (res.type == RTN_BROADCAST)
+	if (res->type == RTN_BROADCAST)
 		goto brd_input;
 
-	if (res.type == RTN_LOCAL) {
+	if (res->type == RTN_LOCAL) {
 		err = fib_validate_source(skb, saddr, daddr, tos,
 					  0, dev, in_dev, &itag);
 		if (err < 0)
@@ -1879,10 +1879,10 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 		err = -EHOSTUNREACH;
 		goto no_route;
 	}
-	if (res.type != RTN_UNICAST)
+	if (res->type != RTN_UNICAST)
 		goto martian_destination;
 
-	err = ip_mkroute_input(skb, &res, in_dev, daddr, saddr, tos);
+	err = ip_mkroute_input(skb, res, in_dev, daddr, saddr, tos);
 out:	return err;
 
 brd_input:
@@ -1896,14 +1896,14 @@ out:	return err;
 			goto martian_source;
 	}
 	flags |= RTCF_BROADCAST;
-	res.type = RTN_BROADCAST;
+	res->type = RTN_BROADCAST;
 	RT_CACHE_STAT_INC(in_brd);
 
 local_input:
 	do_cache = false;
-	if (res.fi) {
+	if (res->fi) {
 		if (!itag) {
-			rth = rcu_dereference(FIB_RES_NH(res).nh_rth_input);
+			rth = rcu_dereference(FIB_RES_NH(*res).nh_rth_input);
 			if (rt_cache_valid(rth)) {
 				skb_dst_set_noref(skb, &rth->dst);
 				err = 0;
@@ -1914,7 +1914,7 @@ out:	return err;
 	}
 
 	rth = rt_dst_alloc(l3mdev_master_dev_rcu(dev) ? : net->loopback_dev,
-			   flags | RTCF_LOCAL, res.type,
+			   flags | RTCF_LOCAL, res->type,
 			   IN_DEV_CONF_GET(in_dev, NOPOLICY), false, do_cache);
 	if (!rth)
 		goto e_nobufs;
@@ -1924,18 +1924,18 @@ out:	return err;
 	rth->dst.tclassid = itag;
 #endif
 	rth->rt_is_input = 1;
-	if (res.table)
-		rth->rt_table_id = res.table->tb_id;
+	if (res->table)
+		rth->rt_table_id = res->table->tb_id;
 
 	RT_CACHE_STAT_INC(in_slow_tot);
-	if (res.type == RTN_UNREACHABLE) {
+	if (res->type == RTN_UNREACHABLE) {
 		rth->dst.input= ip_error;
 		rth->dst.error= -err;
 		rth->rt_flags 	&= ~RTCF_LOCAL;
 	}
 
 	if (do_cache) {
-		struct fib_nh *nh = &FIB_RES_NH(res);
+		struct fib_nh *nh = &FIB_RES_NH(*res);
 
 		rth->dst.lwtstate = lwtstate_get(nh->nh_lwtstate);
 		if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
@@ -1955,9 +1955,9 @@ out:	return err;
 
 no_route:
 	RT_CACHE_STAT_INC(in_no_route);
-	res.type = RTN_UNREACHABLE;
-	res.fi = NULL;
-	res.table = NULL;
+	res->type = RTN_UNREACHABLE;
+	res->fi = NULL;
+	res->table = NULL;
 	goto local_input;
 
 	/*
@@ -1987,10 +1987,21 @@ out:	return err;
 int ip_route_input_noref(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 			 u8 tos, struct net_device *dev)
 {
-	int res;
+	struct fib_result res;
+	int err;
 
 	rcu_read_lock();
+	err = ip_route_input_rcu(skb, daddr, saddr, tos, dev, &res);
+	rcu_read_unlock();
 
+	return err;
+}
+EXPORT_SYMBOL(ip_route_input_noref);
+
+/* called with rcu_read_lock held */
+int ip_route_input_rcu(struct sk_buff *skb, __be32 daddr, __be32 saddr,
+		       u8 tos, struct net_device *dev, struct fib_result *res)
+{
 	/* Multicast recognition logic is moved from route cache to here.
 	   The problem was that too many Ethernet cards have broken/missing
 	   hardware multicast filters :-( As result the host on multicasting
@@ -2005,6 +2016,7 @@ int ip_route_input_noref(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	if (ipv4_is_multicast(daddr)) {
 		struct in_device *in_dev = __in_dev_get_rcu(dev);
 		int our = 0;
+		int err = -EINVAL;
 
 		if (in_dev)
 			our = ip_check_mc_rcu(in_dev, daddr, saddr,
@@ -2020,7 +2032,6 @@ int ip_route_input_noref(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 						      ip_hdr(skb)->protocol);
 		}
 
-		res = -EINVAL;
 		if (our
 #ifdef CONFIG_IP_MROUTE
 			||
@@ -2028,17 +2039,14 @@ int ip_route_input_noref(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 		     IN_DEV_MFORWARD(in_dev))
 #endif
 		   ) {
-			res = ip_route_input_mc(skb, daddr, saddr,
+			err = ip_route_input_mc(skb, daddr, saddr,
 						tos, dev, our);
 		}
-		rcu_read_unlock();
-		return res;
+		return err;
 	}
-	res = ip_route_input_slow(skb, daddr, saddr, tos, dev);
-	rcu_read_unlock();
-	return res;
+
+	return ip_route_input_slow(skb, daddr, saddr, tos, dev, res);
 }
-EXPORT_SYMBOL(ip_route_input_noref);
 
 /* called with rcu_read_lock() */
 static struct rtable *__mkroute_output(const struct fib_result *res,
-- 
2.1.4

* [PATCH net-next 3/7] net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup
From: David Ahern @ 2017-01-09 21:32 UTC (permalink / raw)
  To: netdev; +Cc: David Ahern
In-Reply-To: <1483997571-3964-1-git-send-email-dsa@cumulusnetworks.com>

Convert inet_rtm_getroute to use ip_route_input_rcu and
ip_route_output_key_hash_rcu passing the fib_result arg to both.
The rcu lock is held through the creation of the response, so the
rtable/dst does not need to be attached to the skb and is passed
to rt_fill_info directly.

In converting from ip_route_output_key to ip_route_output_key_hash_rcu
the xfrm_lookup_route in ip_route_output_flow is dropped since
flowi4_proto is not set for a route get request. Also, the flow struct
adjustments from __ip_route_output_key_hash are added to make sure
the route request logic is not altered by the conversion.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
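Note for readers: the flow struct adjustment mentioned above (splitting
rtm_tos into a TOS value and a scope) can be tried in userspace with
the sketch below. The SK_* constants are local copies that are assumed
to match the kernel's IPTOS_TOS_MASK, IPTOS_RT_MASK, RTO_ONLINK and
RT_SCOPE_* values; flow_sketch and flow_from_rtm_tos are hypothetical
names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of kernel constants (assumed values, see note above). */
#define SK_IPTOS_TOS_MASK	0x1E
#define SK_IPTOS_RT_MASK	(SK_IPTOS_TOS_MASK & ~3)	/* 0x1C */
#define SK_RTO_ONLINK		0x01
#define SK_RT_SCOPE_UNIVERSE	0
#define SK_RT_SCOPE_LINK	253

/* Hypothetical subset of struct flowi4: only the two derived fields. */
struct flow_sketch {
	uint8_t tos;
	uint8_t scope;
};

/* Mirror the three assignments added to inet_rtm_getroute. */
struct flow_sketch flow_from_rtm_tos(uint8_t rtm_tos)
{
	struct flow_sketch fl;
	uint8_t tos = rtm_tos & (SK_IPTOS_RT_MASK | SK_RTO_ONLINK);

	fl.tos = tos & SK_IPTOS_RT_MASK;
	fl.scope = (tos & SK_RTO_ONLINK) ?
			SK_RT_SCOPE_LINK : SK_RT_SCOPE_UNIVERSE;
	return fl;
}
```

For example, an rtm_tos of 0x11 yields tos 0x10 with link scope, and
0x10 yields the same tos with universe scope.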
 net/ipv4/route.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 3142cd802e79..03ddc03c185a 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2467,11 +2467,11 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
 }
 EXPORT_SYMBOL_GPL(ip_route_output_flow);
 
+/* called with rcu_read_lock held */
 static int rt_fill_info(struct net *net,  __be32 dst, __be32 src, u32 table_id,
 			struct flowi4 *fl4, struct sk_buff *skb, u32 portid,
-			u32 seq, int event)
+			u32 seq, int event, struct rtable *rt)
 {
-	struct rtable *rt = skb_rtable(skb);
 	struct rtmsg *r;
 	struct nlmsghdr *nlh;
 	unsigned long expires = 0;
@@ -2585,10 +2585,12 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
 	struct net *net = sock_net(in_skb->sk);
 	struct rtmsg *rtm;
 	struct nlattr *tb[RTA_MAX+1];
+	struct fib_result res = {};
 	struct rtable *rt = NULL;
 	struct flowi4 fl4;
 	__be32 dst = 0;
 	__be32 src = 0;
+	__u8 tos;
 	u32 iif;
 	int err;
 	int mark;
@@ -2630,15 +2632,20 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
 	memset(&fl4, 0, sizeof(fl4));
 	fl4.daddr = dst;
 	fl4.saddr = src;
-	fl4.flowi4_tos = rtm->rtm_tos;
+	tos = rtm->rtm_tos & (IPTOS_RT_MASK | RTO_ONLINK);
+	fl4.flowi4_tos = tos & IPTOS_RT_MASK;
+	fl4.flowi4_scope = ((tos & RTO_ONLINK) ?
+				RT_SCOPE_LINK : RT_SCOPE_UNIVERSE);
 	fl4.flowi4_oif = tb[RTA_OIF] ? nla_get_u32(tb[RTA_OIF]) : 0;
 	fl4.flowi4_mark = mark;
 	fl4.flowi4_uid = uid;
 
+	rcu_read_lock();
+
 	if (iif) {
 		struct net_device *dev;
 
-		dev = __dev_get_by_index(net, iif);
+		dev = dev_get_by_index_rcu(net, iif);
 		if (!dev) {
 			err = -ENODEV;
 			goto errout_free;
@@ -2647,14 +2654,16 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
 		skb->protocol	= htons(ETH_P_IP);
 		skb->dev	= dev;
 		skb->mark	= mark;
-		err = ip_route_input(skb, dst, src, rtm->rtm_tos, dev);
+		err = ip_route_input_rcu(skb, dst, src, rtm->rtm_tos,
+					 dev, &res);
 
 		rt = skb_rtable(skb);
 		if (err == 0 && rt->dst.error)
 			err = -rt->dst.error;
 	} else {
-		rt = ip_route_output_key(net, &fl4);
+		fl4.flowi4_iif = LOOPBACK_IFINDEX;
 
+		rt = ip_route_output_key_hash_rcu(net, &fl4, &res, -1);
 		err = 0;
 		if (IS_ERR(rt))
 			err = PTR_ERR(rt);
@@ -2663,7 +2672,6 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
 	if (err)
 		goto errout_free;
 
-	skb_dst_set(skb, &rt->dst);
 	if (rtm->rtm_flags & RTM_F_NOTIFY)
 		rt->rt_flags |= RTCF_NOTIFY;
 
@@ -2672,15 +2680,18 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
 
 	err = rt_fill_info(net, dst, src, table_id, &fl4, skb,
 			   NETLINK_CB(in_skb).portid, nlh->nlmsg_seq,
-			   RTM_NEWROUTE);
+			   RTM_NEWROUTE, rt);
 	if (err < 0)
 		goto errout_free;
 
+	rcu_read_unlock();
+
 	err = rtnl_unicast(skb, net, NETLINK_CB(in_skb).portid);
 errout:
 	return err;
 
 errout_free:
+	rcu_read_unlock();
 	kfree_skb(skb);
 	goto errout;
 }
-- 
2.1.4

* [PATCH net-next 4/7] net: ipv4: refactor fib_dump_info
From: David Ahern @ 2017-01-09 21:32 UTC (permalink / raw)
  To: netdev; +Cc: David Ahern
In-Reply-To: <1483997571-3964-1-git-send-email-dsa@cumulusnetworks.com>

Pull code that adds attributes to the response from fib_dump_info into
a separate, stand-alone function. That function is used by a later patch
to add the matching route to a get route request.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
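Note for readers: the split described above leaves message framing
(start, end, cancel) with the caller and has the helper only append
attributes. A toy userspace sketch of that division of labour, assuming
a fixed-size byte buffer in place of an skb; msg_sketch, add_attrs and
dump_one are hypothetical names, not kernel or netlink APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy message buffer standing in for an skb. */
struct msg_sketch {
	unsigned char data[32];
	size_t len;
};

/* Helper: appends attributes only and never touches message framing. */
int add_attrs(struct msg_sketch *m, const unsigned char *attrs, size_t n)
{
	if (n > sizeof(m->data) - m->len)
		return -1;		/* the caller decides how to unwind */
	memcpy(m->data + m->len, attrs, n);
	m->len += n;
	return 0;
}

/* Owner of the framing: starts the message and rewinds on failure. */
int dump_one(struct msg_sketch *m, const unsigned char *attrs, size_t n)
{
	size_t start = m->len;		/* plays the role of the saved nlh */

	if (sizeof(m->data) - m->len < 1)
		return -1;
	m->data[m->len++] = 0xAA;	/* toy one-byte header */

	if (add_attrs(m, attrs, n)) {
		m->len = start;		/* analogous to nlmsg_cancel() */
		return -1;
	}
	return 0;			/* analogous to nlmsg_end() */
}
```

Because add_attrs() only reports failure, a second caller with its own
framing can reuse it and unwind in its own way.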
 net/ipv4/fib_lookup.h    |  2 ++
 net/ipv4/fib_semantics.c | 17 +++++++++++++++--
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_lookup.h b/net/ipv4/fib_lookup.h
index 9c02920725db..7d2c019c5257 100644
--- a/net/ipv4/fib_lookup.h
+++ b/net/ipv4/fib_lookup.h
@@ -33,6 +33,8 @@ int fib_nh_match(struct fib_config *cfg, struct fib_info *fi);
 int fib_dump_info(struct sk_buff *skb, u32 pid, u32 seq, int event, u32 tb_id,
 		  u8 type, __be32 dst, int dst_len, u8 tos, struct fib_info *fi,
 		  unsigned int);
+int fib_dump_add_attrs_rcu(struct sk_buff *skb, __be32 dst, struct rtmsg *rtm,
+			   struct fib_info *fi);
 void rtmsg_fib(int event, __be32 key, struct fib_alias *fa, int dst_len,
 	       u32 tb_id, const struct nl_info *info, unsigned int nlm_flags);
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 05c911d21782..c6f7223fcd08 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1247,6 +1247,21 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
 	rtm->rtm_scope = fi->fib_scope;
 	rtm->rtm_protocol = fi->fib_protocol;
 
+	if (fib_dump_add_attrs_rcu(skb, dst, rtm, fi))
+		goto nla_put_failure;
+
+	nlmsg_end(skb, nlh);
+	return 0;
+
+nla_put_failure:
+	nlmsg_cancel(skb, nlh);
+	return -EMSGSIZE;
+}
+
+/* called with rcu_read_lock held */
+int fib_dump_add_attrs_rcu(struct sk_buff *skb, __be32 dst, struct rtmsg *rtm,
+			   struct fib_info *fi)
+{
 	if (rtm->rtm_dst_len &&
 	    nla_put_in_addr(skb, RTA_DST, dst))
 		goto nla_put_failure;
@@ -1325,11 +1340,9 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
 		nla_nest_end(skb, mp);
 	}
 #endif
-	nlmsg_end(skb, nlh);
 	return 0;
 
 nla_put_failure:
-	nlmsg_cancel(skb, nlh);
 	return -EMSGSIZE;
 }
 
-- 
2.1.4

* [PATCH net-next 5/7] net: ipv4: Save trie prefix to fib lookup result
From: David Ahern @ 2017-01-09 21:32 UTC (permalink / raw)
  To: netdev; +Cc: David Ahern
In-Reply-To: <1483997571-3964-1-git-send-email-dsa@cumulusnetworks.com>

The prefix is needed to return the matching route spec in response to a
get route request.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
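Note for readers: with the prefix stored in network byte order next to
prefixlen, a route spec such as 10.1.2.0/24 can be rebuilt from the
lookup result alone. A userspace sketch, assuming a hypothetical
result_sketch struct that mirrors only those two fields;
format_route_spec is likewise a made-up name.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical mirror of the two fib_result fields used here. */
struct result_sketch {
	uint32_t prefix;		/* network byte order, like __be32 */
	unsigned char prefixlen;
};

/* Write "a.b.c.d/len" into buf; returns snprintf()'s result or -1. */
int format_route_spec(const struct result_sketch *res, char *buf, size_t len)
{
	char addr[INET_ADDRSTRLEN];
	struct in_addr in = { .s_addr = res->prefix };

	assert(res->prefixlen <= 32);
	if (!inet_ntop(AF_INET, &in, addr, sizeof(addr)))
		return -1;
	return snprintf(buf, len, "%s/%u", addr, (unsigned int)res->prefixlen);
}
```

A host-order key of 0x0A010200 converted with htonl() and a prefixlen
of 24 formats as 10.1.2.0/24.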
 include/net/ip_fib.h | 1 +
 net/ipv4/fib_trie.c  | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 57c2a863d0b2..f2cc345852d7 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -136,6 +136,7 @@ struct fib_rule;
 
 struct fib_table;
 struct fib_result {
+	__be32		prefix;
 	unsigned char	prefixlen;
 	unsigned char	nh_sel;
 	unsigned char	type;
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 2919d1a10cfd..2fc5793cce36 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1544,6 +1544,7 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
 			if (!(fib_flags & FIB_LOOKUP_NOREF))
 				atomic_inc(&fi->fib_clntref);
 
+			res->prefix = htonl(n->key);
 			res->prefixlen = KEYLENGTH - fa->fa_slen;
 			res->nh_sel = nhsel;
 			res->type = fa->fa_type;
-- 
2.1.4

* [PATCH net-next 6/7] net: ipv4: return route match in GETROUTE request
From: David Ahern @ 2017-01-09 21:32 UTC (permalink / raw)
  To: netdev; +Cc: David Ahern
In-Reply-To: <1483997571-3964-1-git-send-email-dsa@cumulusnetworks.com>

Add the matching route returned in fib_result as a new, nested attribute,
RTA_ROUTE_GET, to the GETROUTE response. The rtmsg struct for the matched
route is added inside the nest using a new attribute, RTA_ROUTE_GET_RTM.
These attributes allow userspace to show which route was matched for a
GETROUTE request.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
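Note for readers: a userspace consumer could pull the matched route's
rtmsg out of the nest roughly as sketched below. It uses the standard
RTA_OK/RTA_NEXT/RTA_DATA/RTA_PAYLOAD macros from <linux/rtnetlink.h>.
The attribute numbers are placeholders, since RTA_ROUTE_GET and
RTA_ROUTE_GET_RTM are not in released headers, and find_attr,
matched_prefixlen and sample_reply are hypothetical helpers.
sample_reply only builds a minimal buffer to exercise the parser.

```c
#include <assert.h>
#include <string.h>
#include <linux/rtnetlink.h>

/* Placeholder numbers for the attributes proposed in this patch. */
#define SK_RTA_ROUTE_GET	26
#define SK_RTA_ROUTE_GET_RTM	27

/* Return the first attribute of the given type, or NULL. */
const struct rtattr *find_attr(const struct rtattr *rta, int len,
			       unsigned short type)
{
	for (; RTA_OK(rta, len); rta = RTA_NEXT(rta, len))
		if (rta->rta_type == type)
			return rta;
	return NULL;
}

/* Prefix length of the matched route, or -1 if the nest is absent. */
int matched_prefixlen(const struct rtattr *attrs, int len)
{
	const struct rtattr *nest, *rtm_attr;
	struct rtmsg rtm;

	nest = find_attr(attrs, len, SK_RTA_ROUTE_GET);
	if (!nest)
		return -1;

	rtm_attr = find_attr(RTA_DATA(nest), RTA_PAYLOAD(nest),
			     SK_RTA_ROUTE_GET_RTM);
	if (!rtm_attr || (size_t)RTA_PAYLOAD(rtm_attr) < sizeof(rtm))
		return -1;

	memcpy(&rtm, RTA_DATA(rtm_attr), sizeof(rtm));
	return rtm.rtm_dst_len;
}

/* Test fixture: a nest holding only the rtmsg attribute. */
const struct rtattr *sample_reply(unsigned char prefixlen, int *len)
{
	static union {
		struct rtattr align;	/* forces rtattr alignment */
		unsigned char bytes[64];
	} buf;
	struct rtattr *nest = &buf.align;
	struct rtattr *inner;
	struct rtmsg rtm;

	memset(&buf, 0, sizeof(buf));
	memset(&rtm, 0, sizeof(rtm));
	rtm.rtm_family = 2;		/* AF_INET */
	rtm.rtm_dst_len = prefixlen;

	nest->rta_type = SK_RTA_ROUTE_GET;
	nest->rta_len = RTA_LENGTH(RTA_LENGTH(sizeof(rtm)));
	inner = RTA_DATA(nest);
	inner->rta_type = SK_RTA_ROUTE_GET_RTM;
	inner->rta_len = RTA_LENGTH(sizeof(rtm));
	memcpy(RTA_DATA(inner), &rtm, sizeof(rtm));

	*len = nest->rta_len;
	return nest;
}
```

In a real reply the nest would also carry RTA_TABLE and the attributes
added by fib_dump_add_attrs_rcu; find_attr() simply skips over them.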
 include/uapi/linux/rtnetlink.h |  2 ++
 net/ipv4/route.c               | 36 ++++++++++++++++++++++++++++++++++--
 2 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 8c93ad1ef9ab..471384b72cea 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -319,6 +319,8 @@ enum rtattr_type_t {
 	RTA_EXPIRES,
 	RTA_PAD,
 	RTA_UID,
+	RTA_ROUTE_GET,  /* nested attribute; route spec for RTM_GETROUTE */
+	RTA_ROUTE_GET_RTM, /* struct rtmsg for nested spec */
 	__RTA_MAX
 };
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 03ddc03c185a..9f44b869b8a6 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -113,6 +113,7 @@
 #include <net/secure_seq.h>
 #include <net/ip_tunnels.h>
 #include <net/l3mdev.h>
+#include "fib_lookup.h"
 
 #define RT_FL_TOS(oldflp4) \
 	((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
@@ -2470,7 +2471,8 @@ EXPORT_SYMBOL_GPL(ip_route_output_flow);
 /* called with rcu_read_lock held */
 static int rt_fill_info(struct net *net,  __be32 dst, __be32 src, u32 table_id,
 			struct flowi4 *fl4, struct sk_buff *skb, u32 portid,
-			u32 seq, int event, struct rtable *rt)
+			u32 seq, int event, struct rtable *rt,
+			struct fib_result *res)
 {
 	struct rtmsg *r;
 	struct nlmsghdr *nlh;
@@ -2572,6 +2574,36 @@ static int rt_fill_info(struct net *net,  __be32 dst, __be32 src, u32 table_id,
 	if (rtnl_put_cacheinfo(skb, &rt->dst, 0, expires, error) < 0)
 		goto nla_put_failure;
 
+	if (res->fi) {
+		struct nlattr *get_rt;
+		struct rtmsg r_match;
+
+		/* Add data for matching route */
+		get_rt = nla_nest_start(skb, RTA_ROUTE_GET);
+		if (!get_rt)
+			goto nla_put_failure;
+
+		r_match.rtm_family = AF_INET;
+		r_match.rtm_dst_len = res->prefixlen;
+		r_match.rtm_src_len = 0;
+		r_match.rtm_tos  = fl4->flowi4_tos;
+		r_match.rtm_type = rt->rt_type;
+		r_match.rtm_flags = res->fi->fib_flags;
+		r_match.rtm_scope = res->fi->fib_scope;
+		r_match.rtm_protocol = res->fi->fib_protocol;
+		r_match.rtm_table = table_id;
+		if (nla_put_u32(skb, RTA_TABLE, table_id))
+			goto nla_put_failure;
+
+		if (fib_dump_add_attrs_rcu(skb, res->prefix, &r_match, res->fi))
+			goto nla_put_failure;
+
+		if (nla_put(skb, RTA_ROUTE_GET_RTM, sizeof(r_match), &r_match))
+			goto nla_put_failure;
+
+		nla_nest_end(skb, get_rt);
+	}
+
 	nlmsg_end(skb, nlh);
 	return 0;
 
@@ -2680,7 +2712,7 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
 
 	err = rt_fill_info(net, dst, src, table_id, &fl4, skb,
 			   NETLINK_CB(in_skb).portid, nlh->nlmsg_seq,
-			   RTM_NEWROUTE, rt);
+			   RTM_NEWROUTE, rt, &res);
 	if (err < 0)
 		goto errout_free;
 
-- 
2.1.4

* [PATCH net-next 7/7] net: ipv4: Remove event arg to rt_fill_info
From: David Ahern @ 2017-01-09 21:32 UTC (permalink / raw)
  To: netdev; +Cc: David Ahern
In-Reply-To: <1483997571-3964-1-git-send-email-dsa@cumulusnetworks.com>

rt_fill_info has one caller, which always sets the event to RTM_NEWROUTE.
Remove the arg and use RTM_NEWROUTE directly in rt_fill_info.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 net/ipv4/route.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 9f44b869b8a6..b34f79ffb11d 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2471,8 +2471,7 @@ EXPORT_SYMBOL_GPL(ip_route_output_flow);
 /* called with rcu_read_lock held */
 static int rt_fill_info(struct net *net,  __be32 dst, __be32 src, u32 table_id,
 			struct flowi4 *fl4, struct sk_buff *skb, u32 portid,
-			u32 seq, int event, struct rtable *rt,
-			struct fib_result *res)
+			u32 seq, struct rtable *rt, struct fib_result *res)
 {
 	struct rtmsg *r;
 	struct nlmsghdr *nlh;
@@ -2480,7 +2479,7 @@ static int rt_fill_info(struct net *net,  __be32 dst, __be32 src, u32 table_id,
 	u32 error;
 	u32 metrics[RTAX_MAX];
 
-	nlh = nlmsg_put(skb, portid, seq, event, sizeof(*r), 0);
+	nlh = nlmsg_put(skb, portid, seq, RTM_NEWROUTE, sizeof(*r), 0);
 	if (!nlh)
 		return -EMSGSIZE;
 
@@ -2711,8 +2710,7 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
 		table_id = rt->rt_table_id;
 
 	err = rt_fill_info(net, dst, src, table_id, &fl4, skb,
-			   NETLINK_CB(in_skb).portid, nlh->nlmsg_seq,
-			   RTM_NEWROUTE, rt, &res);
+			   NETLINK_CB(in_skb).portid, nlh->nlmsg_seq, rt, &res);
 	if (err < 0)
 		goto errout_free;
 
-- 
2.1.4
