Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: Question on rhashtable in worst-case scenario.
From: Herbert Xu @ 2016-04-01  0:46 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Ben Greear, David Miller, linux-kernel, linux-wireless, netdev,
	tgraf
In-Reply-To: <1459438199.4576.26.camel@sipsolutions.net>

On Thu, Mar 31, 2016 at 05:29:59PM +0200, Johannes Berg wrote:
> 
> Does removing this completely disable the "-EEXIST" error? I can't say
> I fully understand the elasticity stuff in __rhashtable_insert_fast().

What EEXIST error are you talking about? The only one that can be
returned on insertion is if you're explicitly checking for dups
which clearly can't be the case for you.

If you're talking about the EEXIST error due to a rehash then it is
completely hidden from you by rhashtable_insert_rehash.

If you actually meant EBUSY then yes this should prevent it from
occurring, unless your chain-length exceeds 2^32.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Hannes Frederic Sowa @ 2016-04-01  0:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459469572.6473.239.camel@edumazet-glaptop3.roam.corp.google.com>

On 01.04.2016 02:12, Eric Dumazet wrote:
> Since this function does not take a reference on the dst, how could it
> be safe ?
>
> As soon as you exit the function, dst would possibly point to something
> obsolete/freed.
>
> This works because the caller owns the socket.
>
> If you believe one path needs to be fixed, tell me which one it is.
>
> Please ?

I fixed this one, I wait with review a bit and collapse some of the 
newer fixes into one and check and repost again tomorrow.

Thanks for reviewing!

Bye,
Hannes

^ permalink raw reply

* Re: [PATCH] sctp: avoid refreshing heartbeat timer too often
From: Marcelo Ricardo Leitner @ 2016-04-01  0:24 UTC (permalink / raw)
  To: David Laight
  Cc: netdev@vger.kernel.org, Neil Horman, Vlad Yasevich,
	linux-sctp@vger.kernel.org
In-Reply-To: <20160331212512.GJ3387@mrl.redhat.com>

On Thu, Mar 31, 2016 at 06:25:12PM -0300, 'Marcelo Ricardo Leitner' wrote:
> On Thu, Mar 31, 2016 at 11:16:52AM +0000, David Laight wrote:
> > From: Marcelo Ricardo Leitner
> > > Sent: 30 March 2016 13:13
> > > Em 30-03-2016 06:37, David Laight escreveu:
> > > > From: Marcelo Ricardo Leitner
> > > >> Sent: 29 March 2016 14:42
> > > >>
> > > >> Currently on high rate SCTP streams the heartbeat timer refresh can
> > > >> consume quite a lot of resources as timer updates are costly and it
> > > >> contains a random factor, which a) is also costly and b) invalidates
> > > >> mod_timer() optimization for not editing a timer to the same value.
> > > >> It may even cause the timer to be slightly advanced, for no good reason.
> > > >
> > > > Interesting thoughts:
> > > > 1) Is it necessary to use a different 'random factor' until the timer actually
> > > >     expires?
> > > 
> > > I don't understand you fully here, but we have to have a random factor
> > > on timer expire. As noted by Daniel Borkmann on his commit 8f61059a96c2
> > > ("net: sctp: improve timer slack calculation for transport HBs"):
> > 
> > When a HEARTBEAT chunk is sent determine the new interval, use that
> > interval until the timer actually expires when a new interval is
> > calculated. So the random number is only generated once per heartbeat.
> > 
> > >      RFC4960, section 8.3 says:
> > > 
> > >        On an idle destination address that is allowed to heartbeat,
> > >        it is recommended that a HEARTBEAT chunk is sent once per RTO
> > >        of that destination address plus the protocol parameter
> > >        'HB.interval', with jittering of +/- 50% of the RTO value,
> > >        and exponential backoff of the RTO if the previous HEARTBEAT
> > >        is unanswered.
> > > 
> > > Previous to his commit, it was using a random factor based on jiffies.
> > > 
> > > This patch then assumes that random_A+2 is just as random as random_B as
> > > long as it is within the allowed range, avoiding the unnecessary updates.
> > > 
> > > > 2) It might be better to allow the heartbeat timer to expire, on expiry work
> > > >     out the new interval based on when the last 'refresh' was done.
> > > 
> > > Cool, I thought about this too. It would introduce some extra complexity
> > > that is not really worth I think, specially because now we may be doing
> > > more timer updates even with this patch but it's not triggering any wake
> > > ups and we would need at least 2 wake ups then: one for the first
> > > timeout event, and then re-schedule the timer for the next updated one,
> > > and maybe again, and again.. less timer updates but more wake ups, one
> > > at every heartbeat interval even on a busy transport. Seems it's cheaper
> > > to just update the timer then.
> > 
> > One wakeup per heartbeat interval on a busy connection is probably noise.
> > Probably much less than the 1000s of timer updates that would otherwise happen.
> 
> I was thinking more on near-idle systems, as the overhead for this
> refresh looked rather small now even for busy transports if compared to
> other points, a worth trade-off for reducing wake ups, imho.
> 
> But then I checked tcp, and it does what you're suggesting.
> I'll rework the patch. Thanks

This is what I'm getting with the new approach. I splitted
sctp_transport_reset_timers into sctp_transport_reset_t3_rtx and
sctp_transport_reset_hb_timer, thus why sctp_transport_reset_t3_rtx in there
and it never updates the timer, only start if it's not running already
(as before). Ran netperf for 60 seconds now, to be sure that the timer
would expire twice (1st for initial path validation and 2nd for pure hb).

Samples: 230K of event 'cpu-clock', Event count (approx.): 57707250000
  Overhead  Command  Shared Object      Symbol
+    5,65%  netperf  [kernel.vmlinux]   [k] memcpy_erms
+    5,59%  netperf  [kernel.vmlinux]   [k] copy_user_enhanced_fast_string
-    5,05%  netperf  [kernel.vmlinux]   [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      + 49,89% __wake_up_sync_key
      + 45,68% sctp_ulpq_tail_event
      - 2,85% mod_timer
         + 76,51% sctp_transport_reset_t3_rtx
         + 23,49% sctp_do_sm
      + 1,55% del_timer
+    2,50%  netperf  [sctp]             [k] sctp_datamsg_from_user
+    2,26%  netperf  [sctp]             [k] sctp_sendmsg

Doesn't seem much different from v1, but ok.

Also I could do some more cleanups on heartbeat/timer code.

  Marcelo

^ permalink raw reply

* Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Hannes Frederic Sowa @ 2016-04-01  0:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459469572.6473.239.camel@edumazet-glaptop3.roam.corp.google.com>

On 01.04.2016 02:12, Eric Dumazet wrote:
> On Fri, 2016-04-01 at 02:01 +0200, Hannes Frederic Sowa wrote:
>> Hi Eric,
>>
>> On 01.04.2016 01:39, Eric Dumazet wrote:
>>> On Fri, 2016-04-01 at 01:29 +0200, Hannes Frederic Sowa wrote:
>>>> Various fixes which were discovered by this changeset. More probably
>>>> to come...
>>>
>>>
>>> Really ?
>>>
>>> Again, TCP stack locks the socket most of the time.
>>>
>>> The fact that lockdep does not understand this is not a reason to add
>>> this overhead.
>>
>> I don't see how lockdep does not understand this? I think I added the
>> appropriate helper to exactly verify if we have the socket lock with
>> lockdep, please have a look at lockdep_sock_is_held in patch #2.
>>
>> Some of the patches also just reorder the rcu_read_lock, which is anyway
>> mostly free.
>
> I strongly disagree. Adding rcu_read_lock() everywhere as you did gives
> a sense of 'security' by merely shutting down maybe good signals.

I did those patches by adding the first patches and running tcp tests, 
so I have splats for every single one of those. I just didn't bother 
them into the changelog. I certainly can do so.

> Your changelog explains nothing, and your patch makes absolutely no
> sense to me.
>
> Lets take following example :
>
>   struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
>   {
> -       struct dst_entry *dst = __sk_dst_get(sk);
> +       struct dst_entry *dst;
>
> +       rcu_read_lock();
> +       dst = __sk_dst_get(sk);
>          if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
>                  sk_tx_queue_clear(sk);
>                  RCU_INIT_POINTER(sk->sk_dst_cache, NULL);
>                  dst_release(dst);
> -               return NULL;
> +               dst = NULL;
>          }
> +       rcu_read_unlock();
>
>          return dst;
>   }
>
>
> Since this function does not take a reference on the dst, how could it
> be safe ?

This is stupid by me, I am sorry, thanks for pointing this out. I will
fix this. Just looking too long at all those lockdep reports.

I will find another fix for this.

> As soon as you exit the function, dst would possibly point to something
> obsolete/freed.
>
> This works because the caller owns the socket.
>
> If you believe one path needs to be fixed, tell me which one it is.
>
> Please ?

[   31.064029] ===============================
[   31.064030] [ INFO: suspicious RCU usage. ]
[   31.064032] 4.5.0+ #13 Not tainted
[   31.064033] -------------------------------
[   31.064034] include/net/sock.h:1594 suspicious 
rcu_dereference_check() usage!
[   31.064035]
                other info that might help us debug this:

[   31.064041]
                rcu_scheduler_active = 1, debug_locks = 1
[   31.064042] no locks held by ssh/817.
[   31.064043]
                stack backtrace:
[   31.064045] CPU: 0 PID: 817 Comm: ssh Not tainted 4.5.0+ #13
[   31.064046] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS 1.8.2-20150714_191134- 04/01/2014
[   31.064047]  0000000000000286 000000006730b46b ffff8800badf7bd0 
ffffffff81442b33
[   31.064050]  ffff8800b8c78000 0000000000000001 ffff8800badf7c00 
ffffffff8110ae75
[   31.064052]  ffff880035ea2f00 ffff8800b8e28000 0000000000000003 
00000000000004c4
[   31.064054] Call Trace:
[   31.064058]  [<ffffffff81442b33>] dump_stack+0x85/0xc2
[   31.064062]  [<ffffffff8110ae75>] lockdep_rcu_suspicious+0xc5/0x100
[   31.064064]  [<ffffffff8173bf57>] __sk_dst_check+0x77/0xb0
[   31.064066]  [<ffffffff8182e502>] inet6_sk_rebuild_header+0x52/0x300
[   31.064068]  [<ffffffff813bb61e>] ? selinux_skb_peerlbl_sid+0x5e/0xa0
[   31.064070]  [<ffffffff813bb69e>] ? 
selinux_inet_conn_established+0x3e/0x40
[   31.064072]  [<ffffffff817c2bad>] tcp_finish_connect+0x4d/0x270
[   31.064074]  [<ffffffff817c33f7>] tcp_rcv_state_process+0x627/0xe40
[   31.064076]  [<ffffffff81866584>] tcp_v6_do_rcv+0xd4/0x410
[   31.064078]  [<ffffffff8173bc65>] release_sock+0x85/0x1c0
[   31.064079]  [<ffffffff817e9983>] __inet_stream_connect+0x1c3/0x340
[   31.064081]  [<ffffffff8173b089>] ? lock_sock_nested+0x49/0xb0
[   31.064083]  [<ffffffff81100270>] ? abort_exclusive_wait+0xb0/0xb0
[   31.064084]  [<ffffffff817e9b38>] inet_stream_connect+0x38/0x50
[   31.064086]  [<ffffffff8173794f>] SYSC_connect+0xcf/0xf0
[   31.064088]  [<ffffffff8110d069>] ? trace_hardirqs_on_caller+0x129/0x1b0
[   31.064090]  [<ffffffff8100301b>] ? trace_hardirqs_on_thunk+0x1b/0x1d
[   31.064091]  [<ffffffff8173854e>] SyS_connect+0xe/0x10
[   31.064094]  [<ffffffff818a0e7c>] entry_SYSCALL_64_fastpath+0x1f/0xbd

Bye,
Hannes

^ permalink raw reply

* Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Eric Dumazet @ 2016-04-01  0:12 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: davem, netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <56FDBA6E.5030508@stressinduktion.org>

On Fri, 2016-04-01 at 02:01 +0200, Hannes Frederic Sowa wrote:
> Hi Eric,
> 
> On 01.04.2016 01:39, Eric Dumazet wrote:
> > On Fri, 2016-04-01 at 01:29 +0200, Hannes Frederic Sowa wrote:
> >> Various fixes which were discovered by this changeset. More probably
> >> to come...
> >
> >
> > Really ?
> >
> > Again, TCP stack locks the socket most of the time.
> >
> > The fact that lockdep does not understand this is not a reason to add
> > this overhead.
> 
> I don't see how lockdep does not understand this? I think I added the 
> appropriate helper to exactly verify if we have the socket lock with 
> lockdep, please have a look at lockdep_sock_is_held in patch #2.
> 
> Some of the patches also just reorder the rcu_read_lock, which is anyway 
> mostly free.

I strongly disagree. Adding rcu_read_lock() everywhere as you did gives
a sense of 'security' by merely shutting down maybe good signals.

Your changelog explains nothing, and your patch makes absolutely no
sense to me.

Lets take following example :

 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
 {
-       struct dst_entry *dst = __sk_dst_get(sk);
+       struct dst_entry *dst;
 
+       rcu_read_lock();
+       dst = __sk_dst_get(sk);
        if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
                sk_tx_queue_clear(sk);
                RCU_INIT_POINTER(sk->sk_dst_cache, NULL);
                dst_release(dst);
-               return NULL;
+               dst = NULL;
        }
+       rcu_read_unlock();
 
        return dst;
 }


Since this function does not take a reference on the dst, how could it
be safe ?

As soon as you exit the function, dst would possibly point to something
obsolete/freed.

This works because the caller owns the socket.

If you believe one path needs to be fixed, tell me which one it is.

Please ?

^ permalink raw reply

* Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Hannes Frederic Sowa @ 2016-04-01  0:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459467595.6473.233.camel@edumazet-glaptop3.roam.corp.google.com>

Hi Eric,

On 01.04.2016 01:39, Eric Dumazet wrote:
> On Fri, 2016-04-01 at 01:29 +0200, Hannes Frederic Sowa wrote:
>> Various fixes which were discovered by this changeset. More probably
>> to come...
>
>
> Really ?
>
> Again, TCP stack locks the socket most of the time.
>
> The fact that lockdep does not understand this is not a reason to add
> this overhead.

I don't see how lockdep does not understand this? I think I added the 
appropriate helper to exactly verify if we have the socket lock with 
lockdep, please have a look at lockdep_sock_is_held in patch #2.

Some of the patches also just reorder the rcu_read_lock, which is anyway 
mostly free.

Bye,
Hannes

^ permalink raw reply

* [PATCH net 6/4] tcp: fix __sk_dst_get usage in tcp_current_mss
From: Hannes Frederic Sowa @ 2016-03-31 23:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459466982-20432-1-git-send-email-hannes@stressinduktion.org>

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv4/tcp_output.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index ba3621834e7bfa..3f70582578ada0 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1426,21 +1426,21 @@ EXPORT_SYMBOL(tcp_sync_mss);
 unsigned int tcp_current_mss(struct sock *sk)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
-	const struct dst_entry *dst = __sk_dst_get(sk);
+	const struct dst_entry *dst;
 	u32 mss_now;
 	unsigned int header_len;
 	struct tcp_out_options opts;
 	struct tcp_md5sig_key *md5;
 
+	rcu_read_lock();
 	mss_now = tp->mss_cache;
-
+	dst = __sk_dst_get(sk);
 	if (dst) {
 		u32 mtu = dst_mtu(dst);
 		if (mtu != inet_csk(sk)->icsk_pmtu_cookie)
 			mss_now = tcp_sync_mss(sk, mtu);
 	}
 
-	rcu_read_lock();
 	header_len = tcp_established_options(sk, NULL, &opts, &md5) +
 		     sizeof(struct tcphdr);
 	rcu_read_unlock();
-- 
2.5.5

^ permalink raw reply related

* [PATCH net 5/4] tcp: fix rcu usage around __sk_dst_get in tcp_update_metrics
From: Hannes Frederic Sowa @ 2016-03-31 23:48 UTC (permalink / raw)
  To: davem; +Cc: netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459466982-20432-1-git-send-email-hannes@stressinduktion.org>

Already another one fix I overlooked.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv4/tcp_metrics.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 33a36648423e8b..196de79902819a 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -367,7 +367,7 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
 void tcp_update_metrics(struct sock *sk)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
-	struct dst_entry *dst = __sk_dst_get(sk);
+	struct dst_entry *dst;
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct net *net = sock_net(sk);
 	struct tcp_metrics_block *tm;
@@ -375,13 +375,14 @@ void tcp_update_metrics(struct sock *sk)
 	u32 val;
 	int m;
 
+	rcu_read_lock();
+	dst = __sk_dst_get(sk);
 	if (sysctl_tcp_nometrics_save || !dst)
-		return;
+		goto out_unlock;
 
 	if (dst->flags & DST_HOST)
 		dst_confirm(dst);
 
-	rcu_read_lock();
 	if (icsk->icsk_backoff || !tp->srtt_us) {
 		/* This session failed to estimate rtt. Why?
 		 * Probably, no packets returned in time.  Reset our
-- 
2.5.5

^ permalink raw reply related

* Re: qdisc spin lock
From: Michael Ma @ 2016-03-31 23:48 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers
In-Reply-To: <CAM_iQpVSHqLcGjrErAsh54+iD8uwGH5QNGwiJVOQOTsk=dwDbw@mail.gmail.com>

I didn't really know that multiple qdiscs can be isolated using MQ so
that each txq can be associated with a particular qdisc. Also we don't
really have multiple interfaces...

With this MQ solution we'll still need to assign transmit queues to
different classes by doing some math on the bandwidth limit if I
understand correctly, which seems to be less convenient compared with
a solution purely within HTB.

I assume that with this solution I can still share qdisc among
multiple transmit queues - please let me know if this is not the case.

2016-03-31 15:16 GMT-07:00 Cong Wang <xiyou.wangcong@gmail.com>:
> On Wed, Mar 30, 2016 at 12:20 AM, Michael Ma <make0818@gmail.com> wrote:
>> As far as I understand the design of TC is to simplify locking schema
>> and minimize the work in __qdisc_run so that throughput won’t be
>> affected, especially with large packets. However if the scenario is
>> that multiple classes in the queueing discipline only have the shaping
>> limit, there isn’t really a necessary correlation between different
>> classes. The only synchronization point should be when the packet is
>> dequeued from the qdisc queue and enqueued to the transmit queue of
>> the device. My question is – is it worth investing on avoiding the
>> locking contention by partitioning the queue/lock so that this
>> scenario is addressed with relatively smaller latency?
>
> If your HTB classes don't share bandwidth, why do you still make them
> under the same hierarchy? IOW, you can just isolate them either with some
> other qdisc or just separated interfaces.

^ permalink raw reply

* Re: [PATCH net-next 2/8] tcp: accept SOF_TIMESTAMPING_OPT_ID for passive TFO
From: Yuchung Cheng @ 2016-03-31 23:43 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Soheil Hassas Yeganeh, David Miller, Network Development,
	Willem de Bruijn, Eric Dumazet, Neal Cardwell, Martin KaFai Lau,
	Soheil Hassas Yeganeh
In-Reply-To: <CAF=yD-KZJM8KL5ncH2OK=PmKrXagqTdayPNwkuDiKu4MLkhztw@mail.gmail.com>

On Wed, Mar 30, 2016 at 8:35 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Wed, Mar 30, 2016 at 6:37 PM, Soheil Hassas Yeganeh
> <soheil.kdev@gmail.com> wrote:
>> From: Soheil Hassas Yeganeh <soheil@google.com>
>>
>> SOF_TIMESTAMPING_OPT_ID is set to get data-independent IDs
>> to associate timestamps with send calls. For TCP connections,
>> tp->snd_una is used as the starting point to calculate
>> relative IDs.
>>
>> This socket option will fail if set before the handshake on a
>> passive TCP fast open connection with data in SYN or SYN/ACK,
>> since setsockopt requires the connection to be in the
>> ESTABLISHED state.
>>
>> To address these, instead of limiting the option to the
>> ESTABLISHED state, accept the SOF_TIMESTAMPING_OPT_ID option as
>> long as the connection is not in LISTEN or CLOSE states.
>>
>> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
>
> Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>

^ permalink raw reply

* Re: [PATCH] fec: Do not access unexisting register in Coldfire
From: Greg Ungerer @ 2016-03-31 23:43 UTC (permalink / raw)
  To: Fabio Estevam, davem; +Cc: fugang.duan, troy.kisky, netdev, Fabio Estevam
In-Reply-To: <1459436717-12809-1-git-send-email-festevam@gmail.com>

Thanks for taking care of that Fabio.

Regards
Greg


On 01/04/16 01:05, Fabio Estevam wrote:
> From: Fabio Estevam <fabio.estevam@nxp.com>
> 
> Commit 55cd48c821de ("net: fec: stop the "rcv is not +last, " error
> messages") introduces a write to a register that does not exist in
> Coldfire.
> 
> Move the FEC_FTRL register access inside the FEC_QUIRK_HAS_RACC 'if' block,
> so that we guarantee it will not be used on Coldfire CPUs.
> 
> Reported-by: Greg Ungerer <gerg@uclinux.org>
> Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com>
> ---
>  drivers/net/ethernet/freescale/fec_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index 37c0815..08243c2 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -943,8 +943,8 @@ fec_restart(struct net_device *ndev)
>  		else
>  			val &= ~FEC_RACC_OPTIONS;
>  		writel(val, fep->hwp + FEC_RACC);
> +		writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
>  	}
> -	writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
>  #endif
>  
>  	/*
> 

^ permalink raw reply

* Re: qdisc spin lock
From: Michael Ma @ 2016-03-31 23:41 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev
In-Reply-To: <20160331211852.2d228976@redhat.com>

Thanks for the suggestion - I'll try the MQ solution out. It seems to
be able to solve the problem well with the assumption that bandwidth
can be statically partitioned.

2016-03-31 12:18 GMT-07:00 Jesper Dangaard Brouer <brouer@redhat.com>:
>
> On Wed, 30 Mar 2016 00:20:03 -0700 Michael Ma <make0818@gmail.com> wrote:
>
>> I know this might be an old topic so bare with me – what we are facing
>> is that applications are sending small packets using hundreds of
>> threads so the contention on spin lock in __dev_xmit_skb increases the
>> latency of dev_queue_xmit significantly. We’re building a network QoS
>> solution to avoid interference of different applications using HTB.
>
> Yes, as you have noticed with HTB there is a single qdisc lock, and
> congestion obviously happens :-)
>
> It is possible with different tricks to make it scale.  I believe
> Google is using a variant of HTB, and it scales for them.  They have
> not open source their modifications to HTB (which likely also involves
> a great deal of setup tricks).
>
> If your purpose it to limit traffic/bandwidth per "cloud" instance,
> then you can just use another TC setup structure.  Like using MQ and
> assigning a HTB per MQ queue (where the MQ queues are bound to each
> CPU/HW queue)... But you have to figure out this setup yourself...
>
>
>> But in this case when some applications send massive small packets in
>> parallel, the application to be protected will get its throughput
>> affected (because it’s doing synchronous network communication using
>> multiple threads and throughput is sensitive to the increased latency)
>>
>> Here is the profiling from perf:
>>
>> -  67.57%   iperf  [kernel.kallsyms]     [k] _spin_lock
>>   - 99.94% dev_queue_xmit
>>   -  96.91% _spin_lock
>>  - 2.62%  __qdisc_run
>>   - 98.98% sch_direct_xmit
>> - 99.98% _spin_lock
>>
>> As far as I understand the design of TC is to simplify locking schema
>> and minimize the work in __qdisc_run so that throughput won’t be
>> affected, especially with large packets. However if the scenario is
>> that multiple classes in the queueing discipline only have the shaping
>> limit, there isn’t really a necessary correlation between different
>> classes. The only synchronization point should be when the packet is
>> dequeued from the qdisc queue and enqueued to the transmit queue of
>> the device. My question is – is it worth investing on avoiding the
>> locking contention by partitioning the queue/lock so that this
>> scenario is addressed with relatively smaller latency?
>
> Yes, there is a lot go gain, but it is not easy ;-)
>
>> I must have oversimplified a lot of details since I’m not familiar
>> with the TC implementation at this point – just want to get your input
>> in terms of whether this is a worthwhile effort or there is something
>> fundamental that I’m not aware of. If this is just a matter of quite
>> some additional work, would also appreciate helping to outline the
>> required work here.
>>
>> Also would appreciate if there is any information about the latest
>> status of this work http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf
>
> This article seems to be very low quality... spelling errors, only 5
> pages, no real code, etc.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Eric Dumazet @ 2016-03-31 23:39 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: davem, netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459466982-20432-5-git-send-email-hannes@stressinduktion.org>

On Fri, 2016-04-01 at 01:29 +0200, Hannes Frederic Sowa wrote:
> Various fixes which were discovered by this changeset. More probably
> to come...


Really ?

Again, TCP stack locks the socket most of the time.

The fact that lockdep does not understand this is not a reason to add
this overhead.

^ permalink raw reply

* Re: Best way to reduce system call overhead for tun device I/O?
From: Stephen Hemminger @ 2016-03-31 23:39 UTC (permalink / raw)
  To: Guus Sliepen; +Cc: David Miller, tom, netdev
In-Reply-To: <20160331222857.GH3771@sliepen.org>

On Fri, 1 Apr 2016 00:28:57 +0200
Guus Sliepen <guus@tinc-vpn.org> wrote:

> On Thu, Mar 31, 2016 at 05:20:50PM -0400, David Miller wrote:
> 
> > >> I'm trying to reduce system call overhead when reading/writing to/from a
> > >> tun device in userspace. [...] What would be the right way to do this?
> > >>
> > > Personally I think tun could benefit greatly if it were implemented as
> > > a socket instead of character interface. One thing that could be much
> > > better is sending/receiving of meta data attached to skbuf. For
> > > instance GSO data could be in ancillary data in a socket instead of
> > > inline with packet data as tun seems to be doing now.
> > 
> > Agreed.
> 
> Ok. So how should the userspace API work? Creating an AF_PACKET socket
> and then using a tun ioctl to create a tun interface and bind it to the
> socket?
> 
> int fd = socket(AF_PACKET, ...)
> struct ifreq ifr = {...};
> ioctl(fd, TUNSETIFF, &ifr);
> 

Rather than bodge AF_PACKET onto TUN, why not just create a new device type
and control it from something modern like netlink.

^ permalink raw reply

* Re: [PULL REQUEST] Please pull rdma.git
From: Leon Romanovsky @ 2016-03-31 23:38 UTC (permalink / raw)
  To: David S. Miller
  Cc: Doug Ledford, Linus Torvalds, RDMA mailing list, linux-netdev,
	Matan Barak, Or Gerlitz, Leon Romanovsky
In-Reply-To: <56F29C32.305-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Wed, Mar 23, 2016 at 09:37:54AM -0400, Doug Ledford wrote:
> On 03/23/2016 06:57 AM, Leon Romanovsky wrote:
> > On Sat, Mar 19, 2016 at 02:37:08PM -0700, Linus Torvalds wrote:
> >> So the *best* situation would be:
> >>
> >>  - your two groups talk it over, and figure out what the common commits are
> >>
> >>  - you put those common commits as a "base" branch in git
> >>
> >>  - you ask the two upper-level maintainers to both pull that base branch
> >>
> >>  - you then make sure that you send the later patches (whether as
> >> emailed patches or as pull requests) based on top of that base branch.
> > 
> > Hi David and Doug,
> > 
> > Are you OK with the approach suggested by Linus?
> > We are eager to know it, so we will adopt it as soon
> > as possible in our development flow.
> > 
> > The original thread [1].
> > 
> > [1] http://thread.gmane.org/gmane.linux.drivers.rdma/34907
> > 
> > Thanks.
> > 
> 
> I'm fine with it.  Since I happen to use topic branches to build my
> for-next from anyway, I might need to be the one that Dave pulls from
> versus the other way around.

Resending to linux-netdev.

David,
Can you please express your opinion about Linus's suggestion to
eliminate merge conflicts in Mellanox related products?

Thanks

> 
> -- 
> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>               GPG KeyID: 0E572FDD
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net] tun, bpf: fix suspicious RCU usage in tun_{attach,detach}_filter
From: Hannes Frederic Sowa @ 2016-03-31 23:31 UTC (permalink / raw)
  To: Daniel Borkmann, David Miller
  Cc: eric.dumazet, alexei.starovoitov, mkubecek, sasha.levin, jslaby,
	mst, netdev
In-Reply-To: <56FD9C39.6040703@iogearbox.net>

Hi Daniel,

On 31.03.2016 23:52, Daniel Borkmann wrote:
> On 03/31/2016 09:48 PM, Hannes Frederic Sowa wrote:
> [...]
>> Tightest solution would probably be to combine both patches.
>>
>> bool called_by_tuntap;
>>
>> old_fp = rcu_dereference_protected(sk->sk_filter, called_by_tuntap ?
>> lockdep_rtnl_is_held() : lockdep_sock_is_held());
>
> If I understand you correctly with combining them, you mean you'd still
> need the API change to pass the bool 'called_by_tuntap' down, right?

I actually decided to simply lock the sockets. So that there will always 
be a clear lock owner during the updates. I think this is cleaner. What 
do you think?

> If so, your main difference is, after all, to replace the
> sock_owned_by_user()
> with the lockdep_sock_is_held() construction instead, correct?

I just didn't do that part because we hold socket lock now.

> But then, isn't it already sufficient when you pass the bool itself down
> that 'demuxes' in this case between the sock_owned_by_user() vs
> lockdep_rtnl_is_held() check?

I just send out the patches, please have a look.

Thanks,
Hannes

^ permalink raw reply

* [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Hannes Frederic Sowa @ 2016-03-31 23:29 UTC (permalink / raw)
  To: davem; +Cc: netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459466982-20432-1-git-send-email-hannes@stressinduktion.org>

Various fixes which were discovered by this changeset. More probably
to come...

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/net/tcp.h      |  5 ++++-
 net/core/sock.c        |  7 +++++--
 net/ipv4/tcp_input.c   | 18 ++++++++++++++----
 net/ipv4/tcp_metrics.c | 12 +++++-------
 net/ipv4/tcp_output.c  | 22 ++++++++++++++++++++--
 5 files changed, 48 insertions(+), 16 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b91370f61be64a..541c99bf633d4b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -647,11 +647,14 @@ static inline void tcp_fast_path_check(struct sock *sk)
 /* Compute the actual rto_min value */
 static inline u32 tcp_rto_min(struct sock *sk)
 {
-	const struct dst_entry *dst = __sk_dst_get(sk);
+	const struct dst_entry *dst;
 	u32 rto_min = TCP_RTO_MIN;
 
+	rcu_read_lock();
+	dst = __sk_dst_get(sk);
 	if (dst && dst_metric_locked(dst, RTAX_RTO_MIN))
 		rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN);
+	rcu_read_unlock();
 	return rto_min;
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index b67b9aedb230f9..963d0ba7aa4232 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -486,14 +486,17 @@ EXPORT_SYMBOL(sk_receive_skb);
 
 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
 {
-	struct dst_entry *dst = __sk_dst_get(sk);
+	struct dst_entry *dst;
 
+	rcu_read_lock();
+	dst = __sk_dst_get(sk);
 	if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
 		sk_tx_queue_clear(sk);
 		RCU_INIT_POINTER(sk->sk_dst_cache, NULL);
 		dst_release(dst);
-		return NULL;
+		dst = NULL;
 	}
+	rcu_read_unlock();
 
 	return dst;
 }
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e6e65f79ade821..8489e9e45f906c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -203,10 +203,16 @@ static void tcp_enter_quickack_mode(struct sock *sk)
 static bool tcp_in_quickack_mode(struct sock *sk)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
-	const struct dst_entry *dst = __sk_dst_get(sk);
+	const struct dst_entry *dst;
+	bool ret;
 
-	return (dst && dst_metric(dst, RTAX_QUICKACK)) ||
-		(icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);
+	rcu_read_lock();
+	dst = __sk_dst_get(sk);
+	ret = (dst && dst_metric(dst, RTAX_QUICKACK)) ||
+	      (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);
+	rcu_read_unlock();
+
+	return ret;
 }
 
 static void tcp_ecn_queue_cwr(struct tcp_sock *tp)
@@ -3674,9 +3680,13 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 		tcp_process_tlp_ack(sk, ack, flag);
 
 	if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP)) {
-		struct dst_entry *dst = __sk_dst_get(sk);
+		struct dst_entry *dst;
+
+		rcu_read_lock();
+		dst = __sk_dst_get(sk);
 		if (dst)
 			dst_confirm(dst);
+		rcu_read_unlock();
 	}
 
 	if (icsk->icsk_pending == ICSK_TIME_RETRANS)
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 7b7eec43990692..33a36648423e8b 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -488,22 +488,20 @@ out_unlock:
 
 void tcp_init_metrics(struct sock *sk)
 {
-	struct dst_entry *dst = __sk_dst_get(sk);
+	struct dst_entry *dst;
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct tcp_metrics_block *tm;
 	u32 val, crtt = 0; /* cached RTT scaled by 8 */
 
+	rcu_read_lock();
+	dst = __sk_dst_get(sk);
 	if (!dst)
 		goto reset;
-
 	dst_confirm(dst);
 
-	rcu_read_lock();
 	tm = tcp_get_metrics(sk, dst, true);
-	if (!tm) {
-		rcu_read_unlock();
+	if (!tm)
 		goto reset;
-	}
 
 	if (tcp_metric_locked(tm, TCP_METRIC_CWND))
 		tp->snd_cwnd_clamp = tcp_metric_get(tm, TCP_METRIC_CWND);
@@ -527,7 +525,6 @@ void tcp_init_metrics(struct sock *sk)
 	}
 
 	crtt = tcp_metric_get(tm, TCP_METRIC_RTT);
-	rcu_read_unlock();
 reset:
 	/* The initial RTT measurement from the SYN/SYN-ACK is not ideal
 	 * to seed the RTO for later data packets because SYN packets are
@@ -575,6 +572,7 @@ reset:
 	else
 		tp->snd_cwnd = tcp_init_cwnd(tp, dst);
 	tp->snd_cwnd_stamp = tcp_time_stamp;
+	rcu_read_unlock();
 }
 
 bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst,
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7d2dc015cd19a6..ba3621834e7bfa 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -548,6 +548,7 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 	struct tcp_fastopen_request *fastopen = tp->fastopen_req;
 
 #ifdef CONFIG_TCP_MD5SIG
+	rcu_read_lock();
 	*md5 = tp->af_specific->md5_lookup(sk, sk);
 	if (*md5) {
 		opts->options |= OPTION_MD5;
@@ -601,6 +602,10 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 		}
 	}
 
+#ifdef CONFIG_TCP_MD5SIG
+	rcu_read_unlock();
+#endif
+
 	return MAX_TCP_OPTION_SPACE - remaining;
 }
 
@@ -928,6 +933,10 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	tcb = TCP_SKB_CB(skb);
 	memset(&opts, 0, sizeof(opts));
 
+#ifdef CONFIG_TCP_MD5SIG
+	rcu_read_lock();
+#endif
+
 	if (unlikely(tcb->tcp_flags & TCPHDR_SYN))
 		tcp_options_size = tcp_syn_options(sk, skb, &opts, &md5);
 	else
@@ -996,6 +1005,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 		tp->af_specific->calc_md5_hash(opts.hash_location,
 					       md5, sk, skb);
 	}
+	rcu_read_unlock();
 #endif
 
 	icsk->icsk_af_ops->send_check(sk, skb);
@@ -1294,10 +1304,13 @@ static inline int __tcp_mtu_to_mss(struct sock *sk, int pmtu)
 
 	/* IPv6 adds a frag_hdr in case RTAX_FEATURE_ALLFRAG is set */
 	if (icsk->icsk_af_ops->net_frag_header_len) {
-		const struct dst_entry *dst = __sk_dst_get(sk);
+		const struct dst_entry *dst;
 
+		rcu_read_lock();
+		dst = __sk_dst_get(sk);
 		if (dst && dst_allfrag(dst))
 			mss_now -= icsk->icsk_af_ops->net_frag_header_len;
+		rcu_read_unlock();
 	}
 
 	/* Clamp it (mss_clamp does not include tcp options) */
@@ -1335,10 +1348,13 @@ int tcp_mss_to_mtu(struct sock *sk, int mss)
 
 	/* IPv6 adds a frag_hdr in case RTAX_FEATURE_ALLFRAG is set */
 	if (icsk->icsk_af_ops->net_frag_header_len) {
-		const struct dst_entry *dst = __sk_dst_get(sk);
+		const struct dst_entry *dst;
 
+		rcu_read_lock();
+		dst = __sk_dst_get(sk);
 		if (dst && dst_allfrag(dst))
 			mtu += icsk->icsk_af_ops->net_frag_header_len;
+		rcu_read_unlock();
 	}
 	return mtu;
 }
@@ -1424,8 +1440,10 @@ unsigned int tcp_current_mss(struct sock *sk)
 			mss_now = tcp_sync_mss(sk, mtu);
 	}
 
+	rcu_read_lock();
 	header_len = tcp_established_options(sk, NULL, &opts, &md5) +
 		     sizeof(struct tcphdr);
+	rcu_read_unlock();
 	/* The mss_cache is sized based on tp->tcp_header_len, which assumes
 	 * some common options. If this is an odd packet (because we have SACK
 	 * blocks etc) then our calculated header_len will be different, and
-- 
2.5.5

^ permalink raw reply related

* [PATCH net 3/4] sock: use lockdep_sock_is_held were appropriate
From: Hannes Frederic Sowa @ 2016-03-31 23:29 UTC (permalink / raw)
  To: davem; +Cc: netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459466982-20432-1-git-send-email-hannes@stressinduktion.org>

Also make lockdep_sock_is_held accept const arguments, so we don't need to
modify too many callers.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/net/sock.h       | 7 ++++---
 net/dccp/ipv4.c          | 2 +-
 net/dccp/ipv6.c          | 2 +-
 net/ipv4/af_inet.c       | 2 +-
 net/ipv4/cipso_ipv4.c    | 3 ++-
 net/ipv4/ip_sockglue.c   | 4 ++--
 net/ipv4/tcp_ipv4.c      | 8 +++-----
 net/ipv6/ipv6_sockglue.c | 6 ++++--
 net/ipv6/tcp_ipv6.c      | 2 +-
 net/socket.c             | 2 +-
 10 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 30f9b5ad0a82ef..bbea02fdaaa0fd 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1382,8 +1382,9 @@ static inline void unlock_sock_fast(struct sock *sk, bool slow)
 		spin_unlock_bh(&sk->sk_lock.slock);
 }
 
-static bool lockdep_sock_is_held(struct sock *sk)
+static bool lockdep_sock_is_held(const struct sock *csk)
 {
+	struct sock *sk = (struct sock *)csk;
 	return lockdep_is_held(&sk->sk_lock) ||
 	       lockdep_is_held(&sk->sk_lock.slock);
 }
@@ -1589,8 +1590,8 @@ static inline void sk_rethink_txhash(struct sock *sk)
 static inline struct dst_entry *
 __sk_dst_get(struct sock *sk)
 {
-	return rcu_dereference_check(sk->sk_dst_cache, sock_owned_by_user(sk) ||
-						       lockdep_is_held(&sk->sk_lock.slock));
+	return rcu_dereference_check(sk->sk_dst_cache,
+				     lockdep_sock_is_held(sk));
 }
 
 static inline struct dst_entry *
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 9c67a961ba5382..0ea298d849383f 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -62,7 +62,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	nexthop = daddr = usin->sin_addr.s_addr;
 
 	inet_opt = rcu_dereference_protected(inet->inet_opt,
-					     sock_owned_by_user(sk));
+					     lockdep_sock_is_held(sk));
 	if (inet_opt != NULL && inet_opt->opt.srr) {
 		if (daddr == 0)
 			return -EINVAL;
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 4663a01d503991..6ea214fa5499c5 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -865,7 +865,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	fl6.fl6_sport = inet->inet_sport;
 	security_sk_classify_flow(sk, flowi6_to_flowi(&fl6));
 
-	opt = rcu_dereference_protected(np->opt, sock_owned_by_user(sk));
+	opt = rcu_dereference_protected(np->opt, lockdep_sock_is_held(sk));
 	final_p = fl6_update_dst(&fl6, opt, &final);
 
 	dst = ip6_dst_lookup_flow(sk, &fl6, final_p);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9e481992dbaef2..7e37ebb5af396e 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1106,7 +1106,7 @@ static int inet_sk_reselect_saddr(struct sock *sk)
 	struct ip_options_rcu *inet_opt;
 
 	inet_opt = rcu_dereference_protected(inet->inet_opt,
-					     sock_owned_by_user(sk));
+					     lockdep_sock_is_held(sk));
 	if (inet_opt && inet_opt->opt.srr)
 		daddr = inet_opt->opt.faddr;
 
diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index bdb2a07ec363b7..40d6b87713a132 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -1933,7 +1933,8 @@ int cipso_v4_sock_setattr(struct sock *sk,
 
 	sk_inet = inet_sk(sk);
 
-	old = rcu_dereference_protected(sk_inet->inet_opt, sock_owned_by_user(sk));
+	old = rcu_dereference_protected(sk_inet->inet_opt,
+					lockdep_sock_is_held(sk));
 	if (sk_inet->is_icsk) {
 		sk_conn = inet_csk(sk);
 		if (old)
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 035ad645a8d9d8..cf073059192d99 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -635,7 +635,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 		if (err)
 			break;
 		old = rcu_dereference_protected(inet->inet_opt,
-						sock_owned_by_user(sk));
+						lockdep_sock_is_held(sk));
 		if (inet->is_icsk) {
 			struct inet_connection_sock *icsk = inet_csk(sk);
 #if IS_ENABLED(CONFIG_IPV6)
@@ -1295,7 +1295,7 @@ static int do_ip_getsockopt(struct sock *sk, int level, int optname,
 		struct ip_options_rcu *inet_opt;
 
 		inet_opt = rcu_dereference_protected(inet->inet_opt,
-						     sock_owned_by_user(sk));
+						     lockdep_sock_is_held(sk));
 		opt->optlen = 0;
 		if (inet_opt)
 			memcpy(optbuf, &inet_opt->opt,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ad450509029bce..17cc1840337756 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -157,7 +157,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 
 	nexthop = daddr = usin->sin_addr.s_addr;
 	inet_opt = rcu_dereference_protected(inet->inet_opt,
-					     sock_owned_by_user(sk));
+					     lockdep_sock_is_held(sk));
 	if (inet_opt && inet_opt->opt.srr) {
 		if (!daddr)
 			return -EINVAL;
@@ -882,8 +882,7 @@ struct tcp_md5sig_key *tcp_md5_do_lookup(const struct sock *sk,
 
 	/* caller either holds rcu_read_lock() or socket lock */
 	md5sig = rcu_dereference_check(tp->md5sig_info,
-				       sock_owned_by_user(sk) ||
-				       lockdep_is_held((spinlock_t *)&sk->sk_lock.slock));
+				       lockdep_sock_is_held(sk));
 	if (!md5sig)
 		return NULL;
 #if IS_ENABLED(CONFIG_IPV6)
@@ -928,8 +927,7 @@ int tcp_md5_do_add(struct sock *sk, const union tcp_md5_addr *addr,
 	}
 
 	md5sig = rcu_dereference_protected(tp->md5sig_info,
-					   sock_owned_by_user(sk) ||
-					   lockdep_is_held(&sk->sk_lock.slock));
+					   lockdep_sock_is_held(sk));
 	if (!md5sig) {
 		md5sig = kmalloc(sizeof(*md5sig), gfp);
 		if (!md5sig)
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 4449ad1f81147c..516b6a31c30f7b 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -407,7 +407,8 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 		if (optname != IPV6_RTHDR && !ns_capable(net->user_ns, CAP_NET_RAW))
 			break;
 
-		opt = rcu_dereference_protected(np->opt, sock_owned_by_user(sk));
+		opt = rcu_dereference_protected(np->opt,
+						lockdep_sock_is_held(sk));
 		opt = ipv6_renew_options(sk, opt, optname,
 					 (struct ipv6_opt_hdr __user *)optval,
 					 optlen);
@@ -1123,7 +1124,8 @@ static int do_ipv6_getsockopt(struct sock *sk, int level, int optname,
 		struct ipv6_txoptions *opt;
 
 		lock_sock(sk);
-		opt = rcu_dereference_protected(np->opt, sock_owned_by_user(sk));
+		opt = rcu_dereference_protected(np->opt,
+						lockdep_sock_is_held(sk));
 		len = ipv6_getsockopt_sticky(sk, opt, optname, optval, len);
 		release_sock(sk);
 		/* check if ipv6_getsockopt_sticky() returns err code */
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 711d209f912473..bd16dc4b6ba71b 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -234,7 +234,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	fl6.fl6_dport = usin->sin6_port;
 	fl6.fl6_sport = inet->inet_sport;
 
-	opt = rcu_dereference_protected(np->opt, sock_owned_by_user(sk));
+	opt = rcu_dereference_protected(np->opt, lockdep_sock_is_held(sk));
 	final_p = fl6_update_dst(&fl6, opt, &final);
 
 	security_sk_classify_flow(sk, flowi6_to_flowi(&fl6));
diff --git a/net/socket.c b/net/socket.c
index 5f77a8e93830bd..e3299cdfe9db39 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1046,7 +1046,7 @@ static int sock_fasync(int fd, struct file *filp, int on)
 		return -EINVAL;
 
 	lock_sock(sk);
-	wq = rcu_dereference_protected(sock->wq, sock_owned_by_user(sk));
+	wq = rcu_dereference_protected(sock->wq, lockdep_sock_is_held(sk));
 	fasync_helper(fd, filp, on, &wq->fasync_list);
 
 	if (!wq->fasync_list)
-- 
2.5.5

^ permalink raw reply related

* [PATCH net 2/4] net: proper check if we hold the socket lock during dereference
From: Hannes Frederic Sowa @ 2016-03-31 23:29 UTC (permalink / raw)
  To: davem; +Cc: netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459466982-20432-1-git-send-email-hannes@stressinduktion.org>

lockdep_sock_is_held makes sure that we currently own the
lock. sock_owned_by_user simply checks if a user holds the socket. This
could lead to non deterministic lock checks.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/net/sock.h | 5 +++++
 net/core/filter.c  | 6 +++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 255d3e03727b73..30f9b5ad0a82ef 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1382,6 +1382,11 @@ static inline void unlock_sock_fast(struct sock *sk, bool slow)
 		spin_unlock_bh(&sk->sk_lock.slock);
 }
 
+static bool lockdep_sock_is_held(struct sock *sk)
+{
+	return lockdep_is_held(&sk->sk_lock) ||
+	       lockdep_is_held(&sk->sk_lock.slock);
+}
 
 struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		      struct proto *prot, int kern);
diff --git a/net/core/filter.c b/net/core/filter.c
index 4b81b71171b4ce..e8486ba601eae7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1166,7 +1166,7 @@ static int __sk_attach_prog(struct bpf_prog *prog, struct sock *sk)
 	}
 
 	old_fp = rcu_dereference_protected(sk->sk_filter,
-					   sock_owned_by_user(sk));
+					   lockdep_sock_is_held(sk));
 	rcu_assign_pointer(sk->sk_filter, fp);
 
 	if (old_fp)
@@ -2259,7 +2259,7 @@ int sk_detach_filter(struct sock *sk)
 		return -EPERM;
 
 	filter = rcu_dereference_protected(sk->sk_filter,
-					   sock_owned_by_user(sk));
+					   lockdep_sock_is_held(sk));
 	if (filter) {
 		RCU_INIT_POINTER(sk->sk_filter, NULL);
 		sk_filter_uncharge(sk, filter);
@@ -2279,7 +2279,7 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf,
 
 	lock_sock(sk);
 	filter = rcu_dereference_protected(sk->sk_filter,
-					   sock_owned_by_user(sk));
+					   lockdep_sock_is_held(sk));
 	if (!filter)
 		goto out;
 
-- 
2.5.5

^ permalink raw reply related

* [PATCH net 1/4] tun: add socket locking around sk_{attach,detach}_filter
From: Hannes Frederic Sowa @ 2016-03-31 23:29 UTC (permalink / raw)
  To: davem; +Cc: netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459466982-20432-1-git-send-email-hannes@stressinduktion.org>

[Changelog stolen from Daniel Borkmann:]

Sasha Levin reported a suspicious rcu_dereference_protected() warning
found while fuzzing with trinity that is similar to this one:

  [   52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
  [   52.765688] other info that might help us debug this:
  [   52.765695] rcu_scheduler_active = 1, debug_locks = 1
  [   52.765701] 1 lock held by a.out/1525:
  [   52.765704]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff816a64b7>] rtnl_lock+0x17/0x20
  [   52.765721] stack backtrace:
  [   52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
  [...]
  [   52.765768] Call Trace:
  [   52.765775]  [<ffffffff813e488d>] dump_stack+0x85/0xc8
  [   52.765784]  [<ffffffff810f2fa5>] lockdep_rcu_suspicious+0xd5/0x110
  [   52.765792]  [<ffffffff816afdc2>] sk_detach_filter+0x82/0x90
  [   52.765801]  [<ffffffffa0883425>] tun_detach_filter+0x35/0x90 [tun]
  [   52.765810]  [<ffffffffa0884ed4>] __tun_chr_ioctl+0x354/0x1130 [tun]
  [   52.765818]  [<ffffffff8136fed0>] ? selinux_file_ioctl+0x130/0x210
  [   52.765827]  [<ffffffffa0885ce3>] tun_chr_ioctl+0x13/0x20 [tun]
  [   52.765834]  [<ffffffff81260ea6>] do_vfs_ioctl+0x96/0x690
  [   52.765843]  [<ffffffff81364af3>] ? security_file_ioctl+0x43/0x60
  [   52.765850]  [<ffffffff81261519>] SyS_ioctl+0x79/0x90
  [   52.765858]  [<ffffffff81003ba2>] do_syscall_64+0x62/0x140
  [   52.765866]  [<ffffffff817d563f>] entry_SYSCALL64_slow_path+0x25/0x25

Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH,
DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.

Since the fix in commit f91ff5b9ff52 ("net: sk_{detach|attach}_filter()
rcu fixes") sk_attach_filter()/sk_detach_filter() now dereferences the
filter with rcu_dereference_protected(), checking whether socket lock
is held in control path.

Since its introduction in commit 994051625981 ("tun: socket
filter support"), tap filters are managed under RTNL lock from
__tun_chr_ioctl(). Thus the sock_owned_by_user(sk) doesn't apply in this
specific case and therefore triggers the false positive.

Simply holding the locks during the filter updates will fix this problem.

Fixes: 994051625981 ("tun: socket filter support")
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 drivers/net/tun.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afdf950617c36e..dccbbacbbc7f02 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -622,7 +622,11 @@ static int tun_attach(struct tun_struct *tun, struct file *file, bool skip_filte
 
 	/* Re-attach the filter to persist device */
 	if (!skip_filter && (tun->filter_attached == true)) {
+		bool slow;
+
+		slow = lock_sock_fast(tfile->socket.sk);
 		err = sk_attach_filter(&tun->fprog, tfile->socket.sk);
+		unlock_sock_fast(tfile->socket.sk, slow);
 		if (!err)
 			goto out;
 	}
@@ -1821,8 +1825,12 @@ static void tun_detach_filter(struct tun_struct *tun, int n)
 	struct tun_file *tfile;
 
 	for (i = 0; i < n; i++) {
+		bool slow;
+
 		tfile = rtnl_dereference(tun->tfiles[i]);
+		slow = lock_sock_fast(tfile->socket.sk);
 		sk_detach_filter(tfile->socket.sk);
+		unlock_sock_fast(tfile->socket.sk, slow);
 	}
 
 	tun->filter_attached = false;
@@ -1834,8 +1842,12 @@ static int tun_attach_filter(struct tun_struct *tun)
 	struct tun_file *tfile;
 
 	for (i = 0; i < tun->numqueues; i++) {
+		bool slow;
+
 		tfile = rtnl_dereference(tun->tfiles[i]);
+		slow = lock_sock_fast(tfile->socket.sk);
 		ret = sk_attach_filter(&tun->fprog, tfile->socket.sk);
+		unlock_sock_fast(tfile->socket.sk, slow);
 		if (ret) {
 			tun_detach_filter(tun, i);
 			return ret;
-- 
2.5.5

^ permalink raw reply related

* [PATCH net 0/4] net: fix and tighten rcu dereference checks
From: Hannes Frederic Sowa @ 2016-03-31 23:29 UTC (permalink / raw)
  To: davem; +Cc: netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek

Only the first patch is really applicable for stable. It adds appropriate
socket locks so lockdep doesn't complain if tuntap's ioctls modify the
filters on the socket.

Rest of the patches tighten the rcu dereference socket lock checks.

Last patch fixes missing rcu_read_locks which were discovered by this
change. Certainly there are more to come.

Hannes Frederic Sowa (4):
  tun: add socket locking around sk_{attach,detach}_filter
  net: proper check if we hold the socket lock during dereference
  sock: use lockdep_sock_is_held were appropriate
  tcp: various missing rcu_read_lock around __sk_dst_get

 drivers/net/tun.c        | 12 ++++++++++++
 include/net/sock.h       | 10 ++++++++--
 include/net/tcp.h        |  5 ++++-
 net/core/filter.c        |  6 +++---
 net/core/sock.c          |  7 +++++--
 net/dccp/ipv4.c          |  2 +-
 net/dccp/ipv6.c          |  2 +-
 net/ipv4/af_inet.c       |  2 +-
 net/ipv4/cipso_ipv4.c    |  3 ++-
 net/ipv4/ip_sockglue.c   |  4 ++--
 net/ipv4/tcp_input.c     | 18 ++++++++++++++----
 net/ipv4/tcp_ipv4.c      |  8 +++-----
 net/ipv4/tcp_metrics.c   | 12 +++++-------
 net/ipv4/tcp_output.c    | 22 ++++++++++++++++++++--
 net/ipv6/ipv6_sockglue.c |  6 ++++--
 net/ipv6/tcp_ipv6.c      |  2 +-
 net/socket.c             |  2 +-
 17 files changed, 87 insertions(+), 36 deletions(-)

-- 
2.5.5

^ permalink raw reply

* Urgent Founds!!
From: Mr Tony Hope @ 2016-03-31 22:21 UTC (permalink / raw)
  To: DHS

Urgent Founds!!

We wish to inform you that your over due Inheritance funds which we agreed to pay you in cash is already sealed and package with a security proof box. The funds worth of $7.5 millions US Dollar,in the package will be conveyed to you by an Int'l diplomatic agent, Mr. Tony Hope.

He will be leaving for your country any time from now, therefore reach us with the details below.

Using a Diplomatic agent this time is because of the failure that were recorded in the other transfer options. Just try and give the diplomat your information and offer him all assistance he may need, especially directive assistance so that he will be able to get your consignment box to you in the couple of days.

Please contact him with your full information, such as your.

1) Full name: ......
2) Resident address:......
3) Phone number:.....
4) The name of your nearest airport:....
Send him the information above for him to locate your home with your package.

Below is some of his contact information:

Name: Mr Tony Hope
Email: (dipltonyhope@photofile.ru)Phone numbers:
+229-99 22 69 25

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

^ permalink raw reply

* Re: possible bug in latest network tree
From: Cong Wang @ 2016-03-31 22:34 UTC (permalink / raw)
  To: Light, John J; +Cc: David S. Miller, netdev@vger.kernel.org
In-Reply-To: <660C7E53FDDFB64C8741117E0D659AAAB8463B9F@ORSMSX116.amr.corp.intel.com>

On Wed, Mar 30, 2016 at 2:05 PM, Light, John J <john.j.light@intel.com> wrote:
>
> Maybe the problem is that the default shouldn't be to a TCP value, but should be a 'transport' value.
>

The code is fine because net->ipv4 is always there, it doesn't
depend on any CONFIG, so it is safe for dccp to use too, although
I agree it is a bit ugly.

^ permalink raw reply

* Re: Best way to reduce system call overhead for tun device I/O?
From: Guus Sliepen @ 2016-03-31 22:28 UTC (permalink / raw)
  To: David Miller; +Cc: tom, netdev
In-Reply-To: <20160331.172050.1423948167839485574.davem@davemloft.net>

On Thu, Mar 31, 2016 at 05:20:50PM -0400, David Miller wrote:

> >> I'm trying to reduce system call overhead when reading/writing to/from a
> >> tun device in userspace. [...] What would be the right way to do this?
> >>
> > Personally I think tun could benefit greatly if it were implemented as
> > a socket instead of character interface. One thing that could be much
> > better is sending/receiving of meta data attached to skbuf. For
> > instance GSO data could be in ancillary data in a socket instead of
> > inline with packet data as tun seems to be doing now.
> 
> Agreed.

Ok. So how should the userspace API work? Creating an AF_PACKET socket
and then using a tun ioctl to create a tun interface and bind it to the
socket?

int fd = socket(AF_PACKET, ...)
struct ifreq ifr = {...};
ioctl(fd, TUNSETIFF, &ifr);

-- 
Met vriendelijke groet / with kind regards,
     Guus Sliepen <guus@tinc-vpn.org>

^ permalink raw reply

* [PATCH v2 net-next] net: ipv4: Consider unreachable nexthops in multipath routes
From: David Ahern @ 2016-03-31 22:24 UTC (permalink / raw)
  To: netdev; +Cc: ja, David Ahern

Multipath route lookups should consider knowledge about next hops and not
select a hop that is known to be failed.

Example:

                     [h2]                   [h3]   15.0.0.5
                      |                      |
                     3|                     3|
                    [SP1]                  [SP2]--+
                     1  2                   1     2
                     |  |     /-------------+     |
                     |   \   /                    |
                     |     X                      |
                     |    / \                     |
                     |   /   \---------------\    |
                     1  2                     1   2
         12.0.0.2  [TOR1] 3-----------------3 [TOR2] 12.0.0.3
                     4                         4
                      \                       /
                        \                    /
                         \                  /
                          -------|   |-----/
                                 1   2
                                [TOR3]
                                  3|
                                   |
                                  [h1]  12.0.0.1

host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:

    root@h1:~# ip ro ls
    ...
    12.0.0.0/24 dev swp1  proto kernel  scope link  src 12.0.0.1
    15.0.0.0/16
            nexthop via 12.0.0.2  dev swp1 weight 1
            nexthop via 12.0.0.3  dev swp1 weight 1
    ...

If the link between tor3 and tor1 is down and the link between tor1
and tor2 then tor1 is effectively cut-off from h1. Yet the route lookups
in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
ssh 15.0.0.5 gets the other. Connections that attempt to use the
12.0.0.2 nexthop fail since that neighbor is not reachable:

    root@h1:~# ip neigh show
    ...
    12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
    12.0.0.2 dev swp1  FAILED
    ...

The failed path can be avoided by considering known neighbor information
when selecting next hops. If the neighbor lookup fails we have no
knowledge about the nexthop, so give it a shot. If there is an entry
then only select the nexthop if the state is sane. This is similar to
what fib_detect_death does.

To maintain backward compatibility use of the neighbor information is
based on a new sysctl, fib_multipath_use_neigh.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v2
- use rcu locking to avoid refcnts per Eric's suggestion
- only consider neighbor info for nh_scope == RT_SCOPE_LINK per Julian's
  comment
- drop the 'state == NUD_REACHABLE' from the state check since it is
  part of NUD_VALID (comment from Julian)
- wrapped the use of the neigh in a sysctl

 Documentation/networking/ip-sysctl.txt | 10 ++++++++++
 include/net/netns/ipv4.h               |  3 +++
 net/ipv4/fib_semantics.c               | 32 ++++++++++++++++++++++++++++++--
 net/ipv4/sysctl_net_ipv4.c             | 11 +++++++++++
 4 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index b183e2b606c8..5b316d33a23f 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -63,6 +63,16 @@ fwmark_reflect - BOOLEAN
 	fwmark of the packet they are replying to.
 	Default: 0
 
+fib_multipath_use_neigh - BOOLEAN
+	Use status of existing neighbor entry when determining nexthop for
+	multipath routes. If disabled neighbor information is not used then
+	packets could be directed to a dead nexthop. Only valid for kernels
+	built with CONFIG_IP_ROUTE_MULTIPATH enabled.
+	Default: 0 (disabled)
+	Possible values:
+	0 - disabled
+	1 - enabled
+
 route/max_size - INTEGER
 	Maximum number of routes allowed in the kernel.  Increase
 	this when using large numbers of interfaces and/or routes.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index a69cde3ce460..d061ffeb1e71 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -133,6 +133,9 @@ struct netns_ipv4 {
 	struct fib_rules_ops	*mr_rules_ops;
 #endif
 #endif
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+	int sysctl_fib_multipath_use_neigh;
+#endif
 	atomic_t	rt_genid;
 };
 #endif
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index d97268e8ff10..6d423faff0ce 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1559,17 +1559,45 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
 }
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
+static bool fib_good_nh(const struct fib_nh *nh, struct net_device *dev)
+{
+	struct neighbour *n = NULL;
+	int state = NUD_NONE;
+
+	if (nh->nh_scope == RT_SCOPE_LINK) {
+		rcu_read_lock_bh();
+
+		n = __neigh_lookup_noref(&arp_tbl, &nh->nh_gw, dev);
+		if (n)
+			state = n->nud_state;
+
+		rcu_read_unlock_bh();
+	}
+
+	/* outside of rcu locking using n only as a boolean
+	 * on whether a neighbor entry existed
+	 */
+	if (!n || (state & NUD_VALID))
+		return true;
+
+	return false;
+}
 
 void fib_select_multipath(struct fib_result *res, int hash)
 {
 	struct fib_info *fi = res->fi;
+	struct net_device *dev = fi->fib_dev;
+	struct net *net = fi->fib_net;
 
 	for_nexthops(fi) {
 		if (hash > atomic_read(&nh->nh_upper_bound))
 			continue;
 
-		res->nh_sel = nhsel;
-		return;
+		if (!net->ipv4.sysctl_fib_multipath_use_neigh ||
+		    fib_good_nh(nh, dev)) {
+			res->nh_sel = nhsel;
+			return;
+		}
 	} endfor_nexthops(fi);
 
 	/* Race condition: route has just become dead. */
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 1e1fe6086dd9..bb0419582b8d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -960,6 +960,17 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+	{
+		.procname	= "fib_multipath_use_neigh",
+		.data		= &init_net.ipv4.sysctl_fib_multipath_use_neigh,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
 	{ }
 };
 
-- 
1.9.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox