Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Alexei Starovoitov @ 2016-04-01  1:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Hannes Frederic Sowa, davem, netdev, sasha.levin, daniel,
	mkubecek
In-Reply-To: <1459473592.6473.243.camel@edumazet-glaptop3.roam.corp.google.com>

On Thu, Mar 31, 2016 at 06:19:52PM -0700, Eric Dumazet wrote:
> On Fri, 2016-04-01 at 02:21 +0200, Hannes Frederic Sowa wrote:
> 
> > 
> > [   31.064029] ===============================
> > [   31.064030] [ INFO: suspicious RCU usage. ]
> > [   31.064032] 4.5.0+ #13 Not tainted
> > [   31.064033] -------------------------------
> > [   31.064034] include/net/sock.h:1594 suspicious 
> > rcu_dereference_check() usage!
> > [   31.064035]
> >                 other info that might help us debug this:
> > 
> > [   31.064041]
> >                 rcu_scheduler_active = 1, debug_locks = 1
> > [   31.064042] no locks held by ssh/817.
> > [   31.064043]
> >                 stack backtrace:
> > [   31.064045] CPU: 0 PID: 817 Comm: ssh Not tainted 4.5.0+ #13
> > [   31.064046] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> > BIOS 1.8.2-20150714_191134- 04/01/2014
> > [   31.064047]  0000000000000286 000000006730b46b ffff8800badf7bd0 
> > ffffffff81442b33
> > [   31.064050]  ffff8800b8c78000 0000000000000001 ffff8800badf7c00 
> > ffffffff8110ae75
> > [   31.064052]  ffff880035ea2f00 ffff8800b8e28000 0000000000000003 
> > 00000000000004c4
> > [   31.064054] Call Trace:
> > [   31.064058]  [<ffffffff81442b33>] dump_stack+0x85/0xc2
> > [   31.064062]  [<ffffffff8110ae75>] lockdep_rcu_suspicious+0xc5/0x100
> > [   31.064064]  [<ffffffff8173bf57>] __sk_dst_check+0x77/0xb0
> > [   31.064066]  [<ffffffff8182e502>] inet6_sk_rebuild_header+0x52/0x300
> > [   31.064068]  [<ffffffff813bb61e>] ? selinux_skb_peerlbl_sid+0x5e/0xa0
> > [   31.064070]  [<ffffffff813bb69e>] ? 
> > selinux_inet_conn_established+0x3e/0x40
> > [   31.064072]  [<ffffffff817c2bad>] tcp_finish_connect+0x4d/0x270
> > [   31.064074]  [<ffffffff817c33f7>] tcp_rcv_state_process+0x627/0xe40
> > [   31.064076]  [<ffffffff81866584>] tcp_v6_do_rcv+0xd4/0x410
> > [   31.064078]  [<ffffffff8173bc65>] release_sock+0x85/0x1c0
> > [   31.064079]  [<ffffffff817e9983>] __inet_stream_connect+0x1c3/0x340
> > [   31.064081]  [<ffffffff8173b089>] ? lock_sock_nested+0x49/0xb0
> > [   31.064083]  [<ffffffff81100270>] ? abort_exclusive_wait+0xb0/0xb0
> > [   31.064084]  [<ffffffff817e9b38>] inet_stream_connect+0x38/0x50
> > [   31.064086]  [<ffffffff8173794f>] SYSC_connect+0xcf/0xf0
> > [   31.064088]  [<ffffffff8110d069>] ? trace_hardirqs_on_caller+0x129/0x1b0
> > [   31.064090]  [<ffffffff8100301b>] ? trace_hardirqs_on_thunk+0x1b/0x1d
> > [   31.064091]  [<ffffffff8173854e>] SyS_connect+0xe/0x10
> > [   31.064094]  [<ffffffff818a0e7c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
> > 
> > Bye,
> > Hannes
> 
> Thanks.
> 
> As you can see, release_sock() messes badly lockdep (once your other
> patches are in )
> 
> Once we properly fix release_sock() and/or __release_sock(), all these
> false positives disappear.

+1. Nice catch.

Eric, what's your take on Hannes's patch 2 ?
Is it more accurate to ask lockdep to check for actual lock
or lockdep can rely on owned flag?
Potentially there could be races between setting the flag and
actual lock... but that code is contained, so unlikely.
Will we find the real issues with this 'stronger' check or
just spend a ton of time adapting to new model like your other
patch for release_sock and whatever may need to come next...

^ permalink raw reply

* Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Eric Dumazet @ 2016-04-01  1:39 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: davem, netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459474618.3751123.565321858.30DF1858@webmail.messagingengine.com>

On Fri, 2016-04-01 at 03:36 +0200, Hannes Frederic Sowa wrote:
> On Fri, Apr 1, 2016, at 03:19, Eric Dumazet wrote:
> > Thanks.
> > 
> > As you can see, release_sock() messes badly lockdep (once your other
> > patches are in )
> > 
> > Once we properly fix release_sock() and/or __release_sock(), all these
> > false positives disappear.
> 
> This was a loopback connection. I need to study release_sock and
> __release_sock more as I cannot currently see an issue with the lockdep
> handling.

Okay, please try :

diff --git a/net/core/sock.c b/net/core/sock.c
index b67b9aedb230..570dcd91d64e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2429,10 +2429,6 @@ EXPORT_SYMBOL(lock_sock_nested);
 
 void release_sock(struct sock *sk)
 {
-	/*
-	 * The sk_lock has mutex_unlock() semantics:
-	 */
-	mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
 
 	spin_lock_bh(&sk->sk_lock.slock);
 	if (sk->sk_backlog.tail)
@@ -2445,6 +2441,10 @@ void release_sock(struct sock *sk)
 		sk->sk_prot->release_cb(sk);
 
 	sock_release_ownership(sk);
+	/*
+	 * The sk_lock has mutex_unlock() semantics:
+	 */
+	mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
 	if (waitqueue_active(&sk->sk_lock.wq))
 		wake_up(&sk->sk_lock.wq);
 	spin_unlock_bh(&sk->sk_lock.slock);

^ permalink raw reply related

* RE: [PATCH] net: fec: stop the "rcv is not +last, " error messages
From: Fugang Duan @ 2016-04-01  1:37 UTC (permalink / raw)
  To: Fabio Estevam; +Cc: Greg Ungerer, Troy Kisky, netdev@vger.kernel.org
In-Reply-To: <CAOMZO5AfQrhjRErp+cCmcpn2qqjuPp0oNN9gaJp65A0o=p0DUQ@mail.gmail.com>

From: Fabio Estevam <festevam@gmail.com> Sent: Thursday, March 31, 2016 6:57 PM
> To: Fugang Duan <fugang.duan@nxp.com>
> Cc: Greg Ungerer <gerg@uclinux.org>; Troy Kisky
> <troy.kisky@boundarydevices.com>; netdev@vger.kernel.org
> Subject: Re: [PATCH] net: fec: stop the "rcv is not +last, " error messages
> 
> Hi Andy,
> 
> On Wed, Mar 30, 2016 at 10:41 PM, Fugang Duan <fugang.duan@nxp.com>
> wrote:
> >
> > Fabio, we cannot do it like this that may cause confused for the quirk flag
> "FEC_QUIRK_HAS_RACC".
> 
> We can treat FEC_QUIRK_HAS_RACC flag as "this is a non-Coldfire SoC".
> 

FEC_QUIRK_HAS_RACC means the HW support "Receive Accelerator Function Configuration". It is really make somebody confused.

To save trouble,  you treat  FEC_QUIRK_HAS_RACC flag as "this is a non-Coldfire SoC",  you must add comment on the flag define.

> >
> >
> > Hi, Greg,
> >
> > The header file fec.h define the FEC_FTRL as below,  if ColdFire SoC has no this
> register,  we may remove the define in here and define the register according
> to SOC type. For example, it is ColdFire Soc, define it as 0xFFF. Is it  feasible ?
> >
> 
> This is even worse IMHO. We should not write to a 'fake' register offset of 0xFFF.

We can do it like this:

#if defined(CONFIG_ARM)
        writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
#endif

^ permalink raw reply

* Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Hannes Frederic Sowa @ 2016-04-01  1:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459473790.6473.247.camel@edumazet-glaptop3.roam.corp.google.com>



On Fri, Apr 1, 2016, at 03:23, Eric Dumazet wrote:
> On Fri, 2016-04-01 at 02:30 +0200, Hannes Frederic Sowa wrote:
> 
> > I fixed this one, I wait with review a bit and collapse some of the 
> > newer fixes into one and check and repost again tomorrow.
> 
> No change should be needed in TCP, once net/core/sock.c is fixed.
> 
> When someone fixes a lockdep issue, it should be mandatory to include in
> the changelog the lockdep splat, so that we can double check.

Yup, will do that. Every change will be a single patch with lockdep
report.

Thank you,
Hannes

^ permalink raw reply

* Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Hannes Frederic Sowa @ 2016-04-01  1:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459473592.6473.243.camel@edumazet-glaptop3.roam.corp.google.com>

On Fri, Apr 1, 2016, at 03:19, Eric Dumazet wrote:
> Thanks.
> 
> As you can see, release_sock() messes badly lockdep (once your other
> patches are in )
> 
> Once we properly fix release_sock() and/or __release_sock(), all these
> false positives disappear.

This was a loopback connection. I need to study release_sock and
__release_sock more as I cannot currently see an issue with the lockdep
handling.

^ permalink raw reply

* Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Eric Dumazet @ 2016-04-01  1:23 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: davem, netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <56FDC143.5060507@stressinduktion.org>

On Fri, 2016-04-01 at 02:30 +0200, Hannes Frederic Sowa wrote:

> I fixed this one, I wait with review a bit and collapse some of the 
> newer fixes into one and check and repost again tomorrow.

No change should be needed in TCP, once net/core/sock.c is fixed.

When someone fixes a lockdep issue, it should be mandatory to include in
the changelog the lockdep splat, so that we can double check.

Thanks.

^ permalink raw reply

* Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Eric Dumazet @ 2016-04-01  1:19 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: davem, netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <56FDBF0E.8020309@stressinduktion.org>

On Fri, 2016-04-01 at 02:21 +0200, Hannes Frederic Sowa wrote:

> 
> [   31.064029] ===============================
> [   31.064030] [ INFO: suspicious RCU usage. ]
> [   31.064032] 4.5.0+ #13 Not tainted
> [   31.064033] -------------------------------
> [   31.064034] include/net/sock.h:1594 suspicious 
> rcu_dereference_check() usage!
> [   31.064035]
>                 other info that might help us debug this:
> 
> [   31.064041]
>                 rcu_scheduler_active = 1, debug_locks = 1
> [   31.064042] no locks held by ssh/817.
> [   31.064043]
>                 stack backtrace:
> [   31.064045] CPU: 0 PID: 817 Comm: ssh Not tainted 4.5.0+ #13
> [   31.064046] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS 1.8.2-20150714_191134- 04/01/2014
> [   31.064047]  0000000000000286 000000006730b46b ffff8800badf7bd0 
> ffffffff81442b33
> [   31.064050]  ffff8800b8c78000 0000000000000001 ffff8800badf7c00 
> ffffffff8110ae75
> [   31.064052]  ffff880035ea2f00 ffff8800b8e28000 0000000000000003 
> 00000000000004c4
> [   31.064054] Call Trace:
> [   31.064058]  [<ffffffff81442b33>] dump_stack+0x85/0xc2
> [   31.064062]  [<ffffffff8110ae75>] lockdep_rcu_suspicious+0xc5/0x100
> [   31.064064]  [<ffffffff8173bf57>] __sk_dst_check+0x77/0xb0
> [   31.064066]  [<ffffffff8182e502>] inet6_sk_rebuild_header+0x52/0x300
> [   31.064068]  [<ffffffff813bb61e>] ? selinux_skb_peerlbl_sid+0x5e/0xa0
> [   31.064070]  [<ffffffff813bb69e>] ? 
> selinux_inet_conn_established+0x3e/0x40
> [   31.064072]  [<ffffffff817c2bad>] tcp_finish_connect+0x4d/0x270
> [   31.064074]  [<ffffffff817c33f7>] tcp_rcv_state_process+0x627/0xe40
> [   31.064076]  [<ffffffff81866584>] tcp_v6_do_rcv+0xd4/0x410
> [   31.064078]  [<ffffffff8173bc65>] release_sock+0x85/0x1c0
> [   31.064079]  [<ffffffff817e9983>] __inet_stream_connect+0x1c3/0x340
> [   31.064081]  [<ffffffff8173b089>] ? lock_sock_nested+0x49/0xb0
> [   31.064083]  [<ffffffff81100270>] ? abort_exclusive_wait+0xb0/0xb0
> [   31.064084]  [<ffffffff817e9b38>] inet_stream_connect+0x38/0x50
> [   31.064086]  [<ffffffff8173794f>] SYSC_connect+0xcf/0xf0
> [   31.064088]  [<ffffffff8110d069>] ? trace_hardirqs_on_caller+0x129/0x1b0
> [   31.064090]  [<ffffffff8100301b>] ? trace_hardirqs_on_thunk+0x1b/0x1d
> [   31.064091]  [<ffffffff8173854e>] SyS_connect+0xe/0x10
> [   31.064094]  [<ffffffff818a0e7c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
> 
> Bye,
> Hannes

Thanks.

As you can see, release_sock() messes badly lockdep (once your other
patches are in )

Once we properly fix release_sock() and/or __release_sock(), all these
false positives disappear.

^ permalink raw reply

* Re: Question on rhashtable in worst-case scenario.
From: Herbert Xu @ 2016-04-01  0:46 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Ben Greear, David Miller, linux-kernel, linux-wireless, netdev,
	tgraf
In-Reply-To: <1459438199.4576.26.camel@sipsolutions.net>

On Thu, Mar 31, 2016 at 05:29:59PM +0200, Johannes Berg wrote:
> 
> Does removing this completely disable the "-EEXIST" error? I can't say
> I fully understand the elasticity stuff in __rhashtable_insert_fast().

What EEXIST error are you talking about? The only one that can be
returned on insertion is if you're explicitly checking for dups
which clearly can't be the case for you.

If you're talking about the EEXIST error due to a rehash then it is
completely hidden from you by rhashtable_insert_rehash.

If you actually meant EBUSY then yes this should prevent it from
occurring, unless your chain-length exceeds 2^32.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Hannes Frederic Sowa @ 2016-04-01  0:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459469572.6473.239.camel@edumazet-glaptop3.roam.corp.google.com>

On 01.04.2016 02:12, Eric Dumazet wrote:
> Since this function does not take a reference on the dst, how could it
> be safe ?
>
> As soon as you exit the function, dst would possibly point to something
> obsolete/freed.
>
> This works because the caller owns the socket.
>
> If you believe one path needs to be fixed, tell me which one it is.
>
> Please ?

I fixed this one, I wait with review a bit and collapse some of the 
newer fixes into one and check and repost again tomorrow.

Thanks for reviewing!

Bye,
Hannes

^ permalink raw reply

* Re: [PATCH] sctp: avoid refreshing heartbeat timer too often
From: Marcelo Ricardo Leitner @ 2016-04-01  0:24 UTC (permalink / raw)
  To: David Laight
  Cc: netdev@vger.kernel.org, Neil Horman, Vlad Yasevich,
	linux-sctp@vger.kernel.org
In-Reply-To: <20160331212512.GJ3387@mrl.redhat.com>

On Thu, Mar 31, 2016 at 06:25:12PM -0300, 'Marcelo Ricardo Leitner' wrote:
> On Thu, Mar 31, 2016 at 11:16:52AM +0000, David Laight wrote:
> > From: Marcelo Ricardo Leitner
> > > Sent: 30 March 2016 13:13
> > > Em 30-03-2016 06:37, David Laight escreveu:
> > > > From: Marcelo Ricardo Leitner
> > > >> Sent: 29 March 2016 14:42
> > > >>
> > > >> Currently on high rate SCTP streams the heartbeat timer refresh can
> > > >> consume quite a lot of resources as timer updates are costly and it
> > > >> contains a random factor, which a) is also costly and b) invalidates
> > > >> mod_timer() optimization for not editing a timer to the same value.
> > > >> It may even cause the timer to be slightly advanced, for no good reason.
> > > >
> > > > Interesting thoughts:
> > > > 1) Is it necessary to use a different 'random factor' until the timer actually
> > > >     expires?
> > > 
> > > I don't understand you fully here, but we have to have a random factor
> > > on timer expire. As noted by Daniel Borkmann on his commit 8f61059a96c2
> > > ("net: sctp: improve timer slack calculation for transport HBs"):
> > 
> > When a HEARTBEAT chunk is sent determine the new interval, use that
> > interval until the timer actually expires when a new interval is
> > calculated. So the random number is only generated once per heartbeat.
> > 
> > >      RFC4960, section 8.3 says:
> > > 
> > >        On an idle destination address that is allowed to heartbeat,
> > >        it is recommended that a HEARTBEAT chunk is sent once per RTO
> > >        of that destination address plus the protocol parameter
> > >        'HB.interval', with jittering of +/- 50% of the RTO value,
> > >        and exponential backoff of the RTO if the previous HEARTBEAT
> > >        is unanswered.
> > > 
> > > Previous to his commit, it was using a random factor based on jiffies.
> > > 
> > > This patch then assumes that random_A+2 is just as random as random_B as
> > > long as it is within the allowed range, avoiding the unnecessary updates.
> > > 
> > > > 2) It might be better to allow the heartbeat timer to expire, on expiry work
> > > >     out the new interval based on when the last 'refresh' was done.
> > > 
> > > Cool, I thought about this too. It would introduce some extra complexity
> > > that is not really worth I think, specially because now we may be doing
> > > more timer updates even with this patch but it's not triggering any wake
> > > ups and we would need at least 2 wake ups then: one for the first
> > > timeout event, and then re-schedule the timer for the next updated one,
> > > and maybe again, and again.. less timer updates but more wake ups, one
> > > at every heartbeat interval even on a busy transport. Seems it's cheaper
> > > to just update the timer then.
> > 
> > One wakeup per heartbeat interval on a busy connection is probably noise.
> > Probably much less than the 1000s of timer updates that would otherwise happen.
> 
> I was thinking more on near-idle systems, as the overhead for this
> refresh looked rather small now even for busy transports if compared to
> other points, a worth trade-off for reducing wake ups, imho.
> 
> But then I checked tcp, and it does what you're suggesting.
> I'll rework the patch. Thanks

This is what I'm getting with the new approach. I splitted
sctp_transport_reset_timers into sctp_transport_reset_t3_rtx and
sctp_transport_reset_hb_timer, thus why sctp_transport_reset_t3_rtx in there
and it never updates the timer, only start if it's not running already
(as before). Ran netperf for 60 seconds now, to be sure that the timer
would expire twice (1st for initial path validation and 2nd for pure hb).

Samples: 230K of event 'cpu-clock', Event count (approx.): 57707250000
  Overhead  Command  Shared Object      Symbol
+    5,65%  netperf  [kernel.vmlinux]   [k] memcpy_erms
+    5,59%  netperf  [kernel.vmlinux]   [k] copy_user_enhanced_fast_string
-    5,05%  netperf  [kernel.vmlinux]   [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      + 49,89% __wake_up_sync_key
      + 45,68% sctp_ulpq_tail_event
      - 2,85% mod_timer
         + 76,51% sctp_transport_reset_t3_rtx
         + 23,49% sctp_do_sm
      + 1,55% del_timer
+    2,50%  netperf  [sctp]             [k] sctp_datamsg_from_user
+    2,26%  netperf  [sctp]             [k] sctp_sendmsg

Doesn't seem much different from v1, but ok.

Also I could do some more cleanups on heartbeat/timer code.

  Marcelo

^ permalink raw reply

* Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Hannes Frederic Sowa @ 2016-04-01  0:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459469572.6473.239.camel@edumazet-glaptop3.roam.corp.google.com>

On 01.04.2016 02:12, Eric Dumazet wrote:
> On Fri, 2016-04-01 at 02:01 +0200, Hannes Frederic Sowa wrote:
>> Hi Eric,
>>
>> On 01.04.2016 01:39, Eric Dumazet wrote:
>>> On Fri, 2016-04-01 at 01:29 +0200, Hannes Frederic Sowa wrote:
>>>> Various fixes which were discovered by this changeset. More probably
>>>> to come...
>>>
>>>
>>> Really ?
>>>
>>> Again, TCP stack locks the socket most of the time.
>>>
>>> The fact that lockdep does not understand this is not a reason to add
>>> this overhead.
>>
>> I don't see how lockdep does not understand this? I think I added the
>> appropriate helper to exactly verify if we have the socket lock with
>> lockdep, please have a look at lockdep_sock_is_held in patch #2.
>>
>> Some of the patches also just reorder the rcu_read_lock, which is anyway
>> mostly free.
>
> I strongly disagree. Adding rcu_read_lock() everywhere as you did gives
> a sense of 'security' by merely shutting down maybe good signals.

I did those patches by adding the first patches and running tcp tests, 
so I have splats for every single one of those. I just didn't bother 
them into the changelog. I certainly can do so.

> Your changelog explains nothing, and your patch makes absolutely no
> sense to me.
>
> Lets take following example :
>
>   struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
>   {
> -       struct dst_entry *dst = __sk_dst_get(sk);
> +       struct dst_entry *dst;
>
> +       rcu_read_lock();
> +       dst = __sk_dst_get(sk);
>          if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
>                  sk_tx_queue_clear(sk);
>                  RCU_INIT_POINTER(sk->sk_dst_cache, NULL);
>                  dst_release(dst);
> -               return NULL;
> +               dst = NULL;
>          }
> +       rcu_read_unlock();
>
>          return dst;
>   }
>
>
> Since this function does not take a reference on the dst, how could it
> be safe ?

This is stupid by me, I am sorry, thanks for pointing this out. I will
fix this. Just looking too long at all those lockdep reports.

I will find another fix for this.

> As soon as you exit the function, dst would possibly point to something
> obsolete/freed.
>
> This works because the caller owns the socket.
>
> If you believe one path needs to be fixed, tell me which one it is.
>
> Please ?

[   31.064029] ===============================
[   31.064030] [ INFO: suspicious RCU usage. ]
[   31.064032] 4.5.0+ #13 Not tainted
[   31.064033] -------------------------------
[   31.064034] include/net/sock.h:1594 suspicious 
rcu_dereference_check() usage!
[   31.064035]
                other info that might help us debug this:

[   31.064041]
                rcu_scheduler_active = 1, debug_locks = 1
[   31.064042] no locks held by ssh/817.
[   31.064043]
                stack backtrace:
[   31.064045] CPU: 0 PID: 817 Comm: ssh Not tainted 4.5.0+ #13
[   31.064046] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS 1.8.2-20150714_191134- 04/01/2014
[   31.064047]  0000000000000286 000000006730b46b ffff8800badf7bd0 
ffffffff81442b33
[   31.064050]  ffff8800b8c78000 0000000000000001 ffff8800badf7c00 
ffffffff8110ae75
[   31.064052]  ffff880035ea2f00 ffff8800b8e28000 0000000000000003 
00000000000004c4
[   31.064054] Call Trace:
[   31.064058]  [<ffffffff81442b33>] dump_stack+0x85/0xc2
[   31.064062]  [<ffffffff8110ae75>] lockdep_rcu_suspicious+0xc5/0x100
[   31.064064]  [<ffffffff8173bf57>] __sk_dst_check+0x77/0xb0
[   31.064066]  [<ffffffff8182e502>] inet6_sk_rebuild_header+0x52/0x300
[   31.064068]  [<ffffffff813bb61e>] ? selinux_skb_peerlbl_sid+0x5e/0xa0
[   31.064070]  [<ffffffff813bb69e>] ? 
selinux_inet_conn_established+0x3e/0x40
[   31.064072]  [<ffffffff817c2bad>] tcp_finish_connect+0x4d/0x270
[   31.064074]  [<ffffffff817c33f7>] tcp_rcv_state_process+0x627/0xe40
[   31.064076]  [<ffffffff81866584>] tcp_v6_do_rcv+0xd4/0x410
[   31.064078]  [<ffffffff8173bc65>] release_sock+0x85/0x1c0
[   31.064079]  [<ffffffff817e9983>] __inet_stream_connect+0x1c3/0x340
[   31.064081]  [<ffffffff8173b089>] ? lock_sock_nested+0x49/0xb0
[   31.064083]  [<ffffffff81100270>] ? abort_exclusive_wait+0xb0/0xb0
[   31.064084]  [<ffffffff817e9b38>] inet_stream_connect+0x38/0x50
[   31.064086]  [<ffffffff8173794f>] SYSC_connect+0xcf/0xf0
[   31.064088]  [<ffffffff8110d069>] ? trace_hardirqs_on_caller+0x129/0x1b0
[   31.064090]  [<ffffffff8100301b>] ? trace_hardirqs_on_thunk+0x1b/0x1d
[   31.064091]  [<ffffffff8173854e>] SyS_connect+0xe/0x10
[   31.064094]  [<ffffffff818a0e7c>] entry_SYSCALL_64_fastpath+0x1f/0xbd

Bye,
Hannes

^ permalink raw reply

* Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Eric Dumazet @ 2016-04-01  0:12 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: davem, netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <56FDBA6E.5030508@stressinduktion.org>

On Fri, 2016-04-01 at 02:01 +0200, Hannes Frederic Sowa wrote:
> Hi Eric,
> 
> On 01.04.2016 01:39, Eric Dumazet wrote:
> > On Fri, 2016-04-01 at 01:29 +0200, Hannes Frederic Sowa wrote:
> >> Various fixes which were discovered by this changeset. More probably
> >> to come...
> >
> >
> > Really ?
> >
> > Again, TCP stack locks the socket most of the time.
> >
> > The fact that lockdep does not understand this is not a reason to add
> > this overhead.
> 
> I don't see how lockdep does not understand this? I think I added the 
> appropriate helper to exactly verify if we have the socket lock with 
> lockdep, please have a look at lockdep_sock_is_held in patch #2.
> 
> Some of the patches also just reorder the rcu_read_lock, which is anyway 
> mostly free.

I strongly disagree. Adding rcu_read_lock() everywhere as you did gives
a sense of 'security' by merely shutting down maybe good signals.

Your changelog explains nothing, and your patch makes absolutely no
sense to me.

Lets take following example :

 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
 {
-       struct dst_entry *dst = __sk_dst_get(sk);
+       struct dst_entry *dst;
 
+       rcu_read_lock();
+       dst = __sk_dst_get(sk);
        if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
                sk_tx_queue_clear(sk);
                RCU_INIT_POINTER(sk->sk_dst_cache, NULL);
                dst_release(dst);
-               return NULL;
+               dst = NULL;
        }
+       rcu_read_unlock();
 
        return dst;
 }


Since this function does not take a reference on the dst, how could it
be safe ?

As soon as you exit the function, dst would possibly point to something
obsolete/freed.

This works because the caller owns the socket.

If you believe one path needs to be fixed, tell me which one it is.

Please ?

^ permalink raw reply

* Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Hannes Frederic Sowa @ 2016-04-01  0:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459467595.6473.233.camel@edumazet-glaptop3.roam.corp.google.com>

Hi Eric,

On 01.04.2016 01:39, Eric Dumazet wrote:
> On Fri, 2016-04-01 at 01:29 +0200, Hannes Frederic Sowa wrote:
>> Various fixes which were discovered by this changeset. More probably
>> to come...
>
>
> Really ?
>
> Again, TCP stack locks the socket most of the time.
>
> The fact that lockdep does not understand this is not a reason to add
> this overhead.

I don't see how lockdep does not understand this? I think I added the 
appropriate helper to exactly verify if we have the socket lock with 
lockdep, please have a look at lockdep_sock_is_held in patch #2.

Some of the patches also just reorder the rcu_read_lock, which is anyway 
mostly free.

Bye,
Hannes

^ permalink raw reply

* [PATCH net 6/4] tcp: fix __sk_dst_get usage in tcp_current_mss
From: Hannes Frederic Sowa @ 2016-03-31 23:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459466982-20432-1-git-send-email-hannes@stressinduktion.org>

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv4/tcp_output.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index ba3621834e7bfa..3f70582578ada0 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1426,21 +1426,21 @@ EXPORT_SYMBOL(tcp_sync_mss);
 unsigned int tcp_current_mss(struct sock *sk)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
-	const struct dst_entry *dst = __sk_dst_get(sk);
+	const struct dst_entry *dst;
 	u32 mss_now;
 	unsigned int header_len;
 	struct tcp_out_options opts;
 	struct tcp_md5sig_key *md5;
 
+	rcu_read_lock();
 	mss_now = tp->mss_cache;
-
+	dst = __sk_dst_get(sk);
 	if (dst) {
 		u32 mtu = dst_mtu(dst);
 		if (mtu != inet_csk(sk)->icsk_pmtu_cookie)
 			mss_now = tcp_sync_mss(sk, mtu);
 	}
 
-	rcu_read_lock();
 	header_len = tcp_established_options(sk, NULL, &opts, &md5) +
 		     sizeof(struct tcphdr);
 	rcu_read_unlock();
-- 
2.5.5

^ permalink raw reply related

* [PATCH net 5/4] tcp: fix rcu usage around __sk_dst_get in tcp_update_metrics
From: Hannes Frederic Sowa @ 2016-03-31 23:48 UTC (permalink / raw)
  To: davem; +Cc: netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459466982-20432-1-git-send-email-hannes@stressinduktion.org>

Already another one fix I overlooked.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv4/tcp_metrics.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 33a36648423e8b..196de79902819a 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -367,7 +367,7 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
 void tcp_update_metrics(struct sock *sk)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
-	struct dst_entry *dst = __sk_dst_get(sk);
+	struct dst_entry *dst;
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct net *net = sock_net(sk);
 	struct tcp_metrics_block *tm;
@@ -375,13 +375,14 @@ void tcp_update_metrics(struct sock *sk)
 	u32 val;
 	int m;
 
+	rcu_read_lock();
+	dst = __sk_dst_get(sk);
 	if (sysctl_tcp_nometrics_save || !dst)
-		return;
+		goto out_unlock;
 
 	if (dst->flags & DST_HOST)
 		dst_confirm(dst);
 
-	rcu_read_lock();
 	if (icsk->icsk_backoff || !tp->srtt_us) {
 		/* This session failed to estimate rtt. Why?
 		 * Probably, no packets returned in time.  Reset our
-- 
2.5.5

^ permalink raw reply related

* Re: qdisc spin lock
From: Michael Ma @ 2016-03-31 23:48 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers
In-Reply-To: <CAM_iQpVSHqLcGjrErAsh54+iD8uwGH5QNGwiJVOQOTsk=dwDbw@mail.gmail.com>

I didn't really know that multiple qdiscs can be isolated using MQ so
that each txq can be associated with a particular qdisc. Also we don't
really have multiple interfaces...

With this MQ solution we'll still need to assign transmit queues to
different classes by doing some math on the bandwidth limit if I
understand correctly, which seems to be less convenient compared with
a solution purely within HTB.

I assume that with this solution I can still share qdisc among
multiple transmit queues - please let me know if this is not the case.

2016-03-31 15:16 GMT-07:00 Cong Wang <xiyou.wangcong@gmail.com>:
> On Wed, Mar 30, 2016 at 12:20 AM, Michael Ma <make0818@gmail.com> wrote:
>> As far as I understand the design of TC is to simplify locking schema
>> and minimize the work in __qdisc_run so that throughput won’t be
>> affected, especially with large packets. However if the scenario is
>> that multiple classes in the queueing discipline only have the shaping
>> limit, there isn’t really a necessary correlation between different
>> classes. The only synchronization point should be when the packet is
>> dequeued from the qdisc queue and enqueued to the transmit queue of
>> the device. My question is – is it worth investing on avoiding the
>> locking contention by partitioning the queue/lock so that this
>> scenario is addressed with relatively smaller latency?
>
> If your HTB classes don't share bandwidth, why do you still make them
> under the same hierarchy? IOW, you can just isolate them either with some
> other qdisc or just separated interfaces.

^ permalink raw reply

* Re: [PATCH net-next 2/8] tcp: accept SOF_TIMESTAMPING_OPT_ID for passive TFO
From: Yuchung Cheng @ 2016-03-31 23:43 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Soheil Hassas Yeganeh, David Miller, Network Development,
	Willem de Bruijn, Eric Dumazet, Neal Cardwell, Martin KaFai Lau,
	Soheil Hassas Yeganeh
In-Reply-To: <CAF=yD-KZJM8KL5ncH2OK=PmKrXagqTdayPNwkuDiKu4MLkhztw@mail.gmail.com>

On Wed, Mar 30, 2016 at 8:35 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Wed, Mar 30, 2016 at 6:37 PM, Soheil Hassas Yeganeh
> <soheil.kdev@gmail.com> wrote:
>> From: Soheil Hassas Yeganeh <soheil@google.com>
>>
>> SOF_TIMESTAMPING_OPT_ID is set to get data-independent IDs
>> to associate timestamps with send calls. For TCP connections,
>> tp->snd_una is used as the starting point to calculate
>> relative IDs.
>>
>> This socket option will fail if set before the handshake on a
>> passive TCP fast open connection with data in SYN or SYN/ACK,
>> since setsockopt requires the connection to be in the
>> ESTABLISHED state.
>>
>> To address these, instead of limiting the option to the
>> ESTABLISHED state, accept the SOF_TIMESTAMPING_OPT_ID option as
>> long as the connection is not in LISTEN or CLOSE states.
>>
>> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
>
> Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>

^ permalink raw reply

* Re: [PATCH] fec: Do not access unexisting register in Coldfire
From: Greg Ungerer @ 2016-03-31 23:43 UTC (permalink / raw)
  To: Fabio Estevam, davem; +Cc: fugang.duan, troy.kisky, netdev, Fabio Estevam
In-Reply-To: <1459436717-12809-1-git-send-email-festevam@gmail.com>

Thanks for taking care of that Fabio.

Regards
Greg


On 01/04/16 01:05, Fabio Estevam wrote:
> From: Fabio Estevam <fabio.estevam@nxp.com>
> 
> Commit 55cd48c821de ("net: fec: stop the "rcv is not +last, " error
> messages") introduces a write to a register that does not exist in
> Coldfire.
> 
> Move the FEC_FTRL register access inside the FEC_QUIRK_HAS_RACC 'if' block,
> so that we guarantee it will not be used on Coldfire CPUs.
> 
> Reported-by: Greg Ungerer <gerg@uclinux.org>
> Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com>
> ---
>  drivers/net/ethernet/freescale/fec_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index 37c0815..08243c2 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -943,8 +943,8 @@ fec_restart(struct net_device *ndev)
>  		else
>  			val &= ~FEC_RACC_OPTIONS;
>  		writel(val, fep->hwp + FEC_RACC);
> +		writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
>  	}
> -	writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
>  #endif
>  
>  	/*
> 

^ permalink raw reply

* Re: qdisc spin lock
From: Michael Ma @ 2016-03-31 23:41 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev
In-Reply-To: <20160331211852.2d228976@redhat.com>

Thanks for the suggestion - I'll try the MQ solution out. It seems to
be able to solve the problem well with the assumption that bandwidth
can be statically partitioned.

2016-03-31 12:18 GMT-07:00 Jesper Dangaard Brouer <brouer@redhat.com>:
>
> On Wed, 30 Mar 2016 00:20:03 -0700 Michael Ma <make0818@gmail.com> wrote:
>
>> I know this might be an old topic so bare with me – what we are facing
>> is that applications are sending small packets using hundreds of
>> threads so the contention on spin lock in __dev_xmit_skb increases the
>> latency of dev_queue_xmit significantly. We’re building a network QoS
>> solution to avoid interference of different applications using HTB.
>
> Yes, as you have noticed with HTB there is a single qdisc lock, and
> congestion obviously happens :-)
>
> It is possible with different tricks to make it scale.  I believe
> Google is using a variant of HTB, and it scales for them.  They have
> not open source their modifications to HTB (which likely also involves
> a great deal of setup tricks).
>
> If your purpose it to limit traffic/bandwidth per "cloud" instance,
> then you can just use another TC setup structure.  Like using MQ and
> assigning a HTB per MQ queue (where the MQ queues are bound to each
> CPU/HW queue)... But you have to figure out this setup yourself...
>
>
>> But in this case when some applications send massive small packets in
>> parallel, the application to be protected will get its throughput
>> affected (because it’s doing synchronous network communication using
>> multiple threads and throughput is sensitive to the increased latency)
>>
>> Here is the profiling from perf:
>>
>> -  67.57%   iperf  [kernel.kallsyms]     [k] _spin_lock
>>   - 99.94% dev_queue_xmit
>>   -  96.91% _spin_lock
>>  - 2.62%  __qdisc_run
>>   - 98.98% sch_direct_xmit
>> - 99.98% _spin_lock
>>
>> As far as I understand the design of TC is to simplify locking schema
>> and minimize the work in __qdisc_run so that throughput won’t be
>> affected, especially with large packets. However if the scenario is
>> that multiple classes in the queueing discipline only have the shaping
>> limit, there isn’t really a necessary correlation between different
>> classes. The only synchronization point should be when the packet is
>> dequeued from the qdisc queue and enqueued to the transmit queue of
>> the device. My question is – is it worth investing on avoiding the
>> locking contention by partitioning the queue/lock so that this
>> scenario is addressed with relatively smaller latency?
>
> Yes, there is a lot go gain, but it is not easy ;-)
>
>> I must have oversimplified a lot of details since I’m not familiar
>> with the TC implementation at this point – just want to get your input
>> in terms of whether this is a worthwhile effort or there is something
>> fundamental that I’m not aware of. If this is just a matter of quite
>> some additional work, would also appreciate helping to outline the
>> required work here.
>>
>> Also would appreciate if there is any information about the latest
>> status of this work http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf
>
> This article seems to be very low quality... spelling errors, only 5
> pages, no real code, etc.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Eric Dumazet @ 2016-03-31 23:39 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: davem, netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459466982-20432-5-git-send-email-hannes@stressinduktion.org>

On Fri, 2016-04-01 at 01:29 +0200, Hannes Frederic Sowa wrote:
> Various fixes which were discovered by this changeset. More probably
> to come...


Really ?

Again, TCP stack locks the socket most of the time.

The fact that lockdep does not understand this is not a reason to add
this overhead.

^ permalink raw reply

* Re: Best way to reduce system call overhead for tun device I/O?
From: Stephen Hemminger @ 2016-03-31 23:39 UTC (permalink / raw)
  To: Guus Sliepen; +Cc: David Miller, tom, netdev
In-Reply-To: <20160331222857.GH3771@sliepen.org>

On Fri, 1 Apr 2016 00:28:57 +0200
Guus Sliepen <guus@tinc-vpn.org> wrote:

> On Thu, Mar 31, 2016 at 05:20:50PM -0400, David Miller wrote:
> 
> > >> I'm trying to reduce system call overhead when reading/writing to/from a
> > >> tun device in userspace. [...] What would be the right way to do this?
> > >>
> > > Personally I think tun could benefit greatly if it were implemented as
> > > a socket instead of character interface. One thing that could be much
> > > better is sending/receiving of meta data attached to skbuf. For
> > > instance GSO data could be in ancillary data in a socket instead of
> > > inline with packet data as tun seems to be doing now.
> > 
> > Agreed.
> 
> Ok. So how should the userspace API work? Creating an AF_PACKET socket
> and then using a tun ioctl to create a tun interface and bind it to the
> socket?
> 
> int fd = socket(AF_PACKET, ...)
> struct ifreq ifr = {...};
> ioctl(fd, TUNSETIFF, &ifr);
> 

Rather than bodge AF_PACKET onto TUN, why not just create a new device type
and control it from something modern like netlink.

^ permalink raw reply

* Re: [PULL REQUEST] Please pull rdma.git
From: Leon Romanovsky @ 2016-03-31 23:38 UTC (permalink / raw)
  To: David S. Miller
  Cc: Doug Ledford, Linus Torvalds, RDMA mailing list, linux-netdev,
	Matan Barak, Or Gerlitz, Leon Romanovsky
In-Reply-To: <56F29C32.305-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Wed, Mar 23, 2016 at 09:37:54AM -0400, Doug Ledford wrote:
> On 03/23/2016 06:57 AM, Leon Romanovsky wrote:
> > On Sat, Mar 19, 2016 at 02:37:08PM -0700, Linus Torvalds wrote:
> >> So the *best* situation would be:
> >>
> >>  - your two groups talk it over, and figure out what the common commits are
> >>
> >>  - you put those common commits as a "base" branch in git
> >>
> >>  - you ask the two upper-level maintainers to both pull that base branch
> >>
> >>  - you then make sure that you send the later patches (whether as
> >> emailed patches or as pull requests) based on top of that base branch.
> > 
> > Hi David and Doug,
> > 
> > Are you OK with the approach suggested by Linus?
> > We are eager to know it, so we will adopt it as soon
> > as possible in our development flow.
> > 
> > The original thread [1].
> > 
> > [1] http://thread.gmane.org/gmane.linux.drivers.rdma/34907
> > 
> > Thanks.
> > 
> 
> I'm fine with it.  Since I happen to use topic branches to build my
> for-next from anyway, I might need to be the one that Dave pulls from
> versus the other way around.

Resending to linux-netdev.

David,
Can you please express your opinion about Linus's suggestion to
eliminate merge conflicts in Mellanox related products?

Thanks

> 
> -- 
> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>               GPG KeyID: 0E572FDD
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net] tun, bpf: fix suspicious RCU usage in tun_{attach,detach}_filter
From: Hannes Frederic Sowa @ 2016-03-31 23:31 UTC (permalink / raw)
  To: Daniel Borkmann, David Miller
  Cc: eric.dumazet, alexei.starovoitov, mkubecek, sasha.levin, jslaby,
	mst, netdev
In-Reply-To: <56FD9C39.6040703@iogearbox.net>

Hi Daniel,

On 31.03.2016 23:52, Daniel Borkmann wrote:
> On 03/31/2016 09:48 PM, Hannes Frederic Sowa wrote:
> [...]
>> Tightest solution would probably be to combine both patches.
>>
>> bool called_by_tuntap;
>>
>> old_fp = rcu_dereference_protected(sk->sk_filter, called_by_tuntap ?
>> lockdep_rtnl_is_held() : lockdep_sock_is_held());
>
> If I understand you correctly with combining them, you mean you'd still
> need the API change to pass the bool 'called_by_tuntap' down, right?

I actually decided to simply lock the sockets. So that there will always 
be a clear lock owner during the updates. I think this is cleaner. What 
do you think?

> If so, your main difference is, after all, to replace the
> sock_owned_by_user()
> with the lockdep_sock_is_held() construction instead, correct?

I just didn't do that part because we hold socket lock now.

> But then, isn't it already sufficient when you pass the bool itself down
> that 'demuxes' in this case between the sock_owned_by_user() vs
> lockdep_rtnl_is_held() check?

I just send out the patches, please have a look.

Thanks,
Hannes

^ permalink raw reply

* [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Hannes Frederic Sowa @ 2016-03-31 23:29 UTC (permalink / raw)
  To: davem; +Cc: netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459466982-20432-1-git-send-email-hannes@stressinduktion.org>

Various fixes which were discovered by this changeset. More probably
to come...

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/net/tcp.h      |  5 ++++-
 net/core/sock.c        |  7 +++++--
 net/ipv4/tcp_input.c   | 18 ++++++++++++++----
 net/ipv4/tcp_metrics.c | 12 +++++-------
 net/ipv4/tcp_output.c  | 22 ++++++++++++++++++++--
 5 files changed, 48 insertions(+), 16 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b91370f61be64a..541c99bf633d4b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -647,11 +647,14 @@ static inline void tcp_fast_path_check(struct sock *sk)
 /* Compute the actual rto_min value */
 static inline u32 tcp_rto_min(struct sock *sk)
 {
-	const struct dst_entry *dst = __sk_dst_get(sk);
+	const struct dst_entry *dst;
 	u32 rto_min = TCP_RTO_MIN;
 
+	rcu_read_lock();
+	dst = __sk_dst_get(sk);
 	if (dst && dst_metric_locked(dst, RTAX_RTO_MIN))
 		rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN);
+	rcu_read_unlock();
 	return rto_min;
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index b67b9aedb230f9..963d0ba7aa4232 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -486,14 +486,17 @@ EXPORT_SYMBOL(sk_receive_skb);
 
 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
 {
-	struct dst_entry *dst = __sk_dst_get(sk);
+	struct dst_entry *dst;
 
+	rcu_read_lock();
+	dst = __sk_dst_get(sk);
 	if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
 		sk_tx_queue_clear(sk);
 		RCU_INIT_POINTER(sk->sk_dst_cache, NULL);
 		dst_release(dst);
-		return NULL;
+		dst = NULL;
 	}
+	rcu_read_unlock();
 
 	return dst;
 }
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e6e65f79ade821..8489e9e45f906c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -203,10 +203,16 @@ static void tcp_enter_quickack_mode(struct sock *sk)
 static bool tcp_in_quickack_mode(struct sock *sk)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
-	const struct dst_entry *dst = __sk_dst_get(sk);
+	const struct dst_entry *dst;
+	bool ret;
 
-	return (dst && dst_metric(dst, RTAX_QUICKACK)) ||
-		(icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);
+	rcu_read_lock();
+	dst = __sk_dst_get(sk);
+	ret = (dst && dst_metric(dst, RTAX_QUICKACK)) ||
+	      (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);
+	rcu_read_unlock();
+
+	return ret;
 }
 
 static void tcp_ecn_queue_cwr(struct tcp_sock *tp)
@@ -3674,9 +3680,13 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 		tcp_process_tlp_ack(sk, ack, flag);
 
 	if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP)) {
-		struct dst_entry *dst = __sk_dst_get(sk);
+		struct dst_entry *dst;
+
+		rcu_read_lock();
+		dst = __sk_dst_get(sk);
 		if (dst)
 			dst_confirm(dst);
+		rcu_read_unlock();
 	}
 
 	if (icsk->icsk_pending == ICSK_TIME_RETRANS)
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 7b7eec43990692..33a36648423e8b 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -488,22 +488,20 @@ out_unlock:
 
 void tcp_init_metrics(struct sock *sk)
 {
-	struct dst_entry *dst = __sk_dst_get(sk);
+	struct dst_entry *dst;
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct tcp_metrics_block *tm;
 	u32 val, crtt = 0; /* cached RTT scaled by 8 */
 
+	rcu_read_lock();
+	dst = __sk_dst_get(sk);
 	if (!dst)
 		goto reset;
-
 	dst_confirm(dst);
 
-	rcu_read_lock();
 	tm = tcp_get_metrics(sk, dst, true);
-	if (!tm) {
-		rcu_read_unlock();
+	if (!tm)
 		goto reset;
-	}
 
 	if (tcp_metric_locked(tm, TCP_METRIC_CWND))
 		tp->snd_cwnd_clamp = tcp_metric_get(tm, TCP_METRIC_CWND);
@@ -527,7 +525,6 @@ void tcp_init_metrics(struct sock *sk)
 	}
 
 	crtt = tcp_metric_get(tm, TCP_METRIC_RTT);
-	rcu_read_unlock();
 reset:
 	/* The initial RTT measurement from the SYN/SYN-ACK is not ideal
 	 * to seed the RTO for later data packets because SYN packets are
@@ -575,6 +572,7 @@ reset:
 	else
 		tp->snd_cwnd = tcp_init_cwnd(tp, dst);
 	tp->snd_cwnd_stamp = tcp_time_stamp;
+	rcu_read_unlock();
 }
 
 bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst,
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7d2dc015cd19a6..ba3621834e7bfa 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -548,6 +548,7 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 	struct tcp_fastopen_request *fastopen = tp->fastopen_req;
 
 #ifdef CONFIG_TCP_MD5SIG
+	rcu_read_lock();
 	*md5 = tp->af_specific->md5_lookup(sk, sk);
 	if (*md5) {
 		opts->options |= OPTION_MD5;
@@ -601,6 +602,10 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 		}
 	}
 
+#ifdef CONFIG_TCP_MD5SIG
+	rcu_read_unlock();
+#endif
+
 	return MAX_TCP_OPTION_SPACE - remaining;
 }
 
@@ -928,6 +933,10 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	tcb = TCP_SKB_CB(skb);
 	memset(&opts, 0, sizeof(opts));
 
+#ifdef CONFIG_TCP_MD5SIG
+	rcu_read_lock();
+#endif
+
 	if (unlikely(tcb->tcp_flags & TCPHDR_SYN))
 		tcp_options_size = tcp_syn_options(sk, skb, &opts, &md5);
 	else
@@ -996,6 +1005,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 		tp->af_specific->calc_md5_hash(opts.hash_location,
 					       md5, sk, skb);
 	}
+	rcu_read_unlock();
 #endif
 
 	icsk->icsk_af_ops->send_check(sk, skb);
@@ -1294,10 +1304,13 @@ static inline int __tcp_mtu_to_mss(struct sock *sk, int pmtu)
 
 	/* IPv6 adds a frag_hdr in case RTAX_FEATURE_ALLFRAG is set */
 	if (icsk->icsk_af_ops->net_frag_header_len) {
-		const struct dst_entry *dst = __sk_dst_get(sk);
+		const struct dst_entry *dst;
 
+		rcu_read_lock();
+		dst = __sk_dst_get(sk);
 		if (dst && dst_allfrag(dst))
 			mss_now -= icsk->icsk_af_ops->net_frag_header_len;
+		rcu_read_unlock();
 	}
 
 	/* Clamp it (mss_clamp does not include tcp options) */
@@ -1335,10 +1348,13 @@ int tcp_mss_to_mtu(struct sock *sk, int mss)
 
 	/* IPv6 adds a frag_hdr in case RTAX_FEATURE_ALLFRAG is set */
 	if (icsk->icsk_af_ops->net_frag_header_len) {
-		const struct dst_entry *dst = __sk_dst_get(sk);
+		const struct dst_entry *dst;
 
+		rcu_read_lock();
+		dst = __sk_dst_get(sk);
 		if (dst && dst_allfrag(dst))
 			mtu += icsk->icsk_af_ops->net_frag_header_len;
+		rcu_read_unlock();
 	}
 	return mtu;
 }
@@ -1424,8 +1440,10 @@ unsigned int tcp_current_mss(struct sock *sk)
 			mss_now = tcp_sync_mss(sk, mtu);
 	}
 
+	rcu_read_lock();
 	header_len = tcp_established_options(sk, NULL, &opts, &md5) +
 		     sizeof(struct tcphdr);
+	rcu_read_unlock();
 	/* The mss_cache is sized based on tp->tcp_header_len, which assumes
 	 * some common options. If this is an odd packet (because we have SACK
 	 * blocks etc) then our calculated header_len will be different, and
-- 
2.5.5

^ permalink raw reply related

* [PATCH net 3/4] sock: use lockdep_sock_is_held were appropriate
From: Hannes Frederic Sowa @ 2016-03-31 23:29 UTC (permalink / raw)
  To: davem; +Cc: netdev, sasha.levin, daniel, alexei.starovoitov, mkubecek
In-Reply-To: <1459466982-20432-1-git-send-email-hannes@stressinduktion.org>

Also make lockdep_sock_is_held accept const arguments, so we don't need to
modify too many callers.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/net/sock.h       | 7 ++++---
 net/dccp/ipv4.c          | 2 +-
 net/dccp/ipv6.c          | 2 +-
 net/ipv4/af_inet.c       | 2 +-
 net/ipv4/cipso_ipv4.c    | 3 ++-
 net/ipv4/ip_sockglue.c   | 4 ++--
 net/ipv4/tcp_ipv4.c      | 8 +++-----
 net/ipv6/ipv6_sockglue.c | 6 ++++--
 net/ipv6/tcp_ipv6.c      | 2 +-
 net/socket.c             | 2 +-
 10 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 30f9b5ad0a82ef..bbea02fdaaa0fd 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1382,8 +1382,9 @@ static inline void unlock_sock_fast(struct sock *sk, bool slow)
 		spin_unlock_bh(&sk->sk_lock.slock);
 }
 
-static bool lockdep_sock_is_held(struct sock *sk)
+static bool lockdep_sock_is_held(const struct sock *csk)
 {
+	struct sock *sk = (struct sock *)csk;
 	return lockdep_is_held(&sk->sk_lock) ||
 	       lockdep_is_held(&sk->sk_lock.slock);
 }
@@ -1589,8 +1590,8 @@ static inline void sk_rethink_txhash(struct sock *sk)
 static inline struct dst_entry *
 __sk_dst_get(struct sock *sk)
 {
-	return rcu_dereference_check(sk->sk_dst_cache, sock_owned_by_user(sk) ||
-						       lockdep_is_held(&sk->sk_lock.slock));
+	return rcu_dereference_check(sk->sk_dst_cache,
+				     lockdep_sock_is_held(sk));
 }
 
 static inline struct dst_entry *
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 9c67a961ba5382..0ea298d849383f 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -62,7 +62,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	nexthop = daddr = usin->sin_addr.s_addr;
 
 	inet_opt = rcu_dereference_protected(inet->inet_opt,
-					     sock_owned_by_user(sk));
+					     lockdep_sock_is_held(sk));
 	if (inet_opt != NULL && inet_opt->opt.srr) {
 		if (daddr == 0)
 			return -EINVAL;
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 4663a01d503991..6ea214fa5499c5 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -865,7 +865,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	fl6.fl6_sport = inet->inet_sport;
 	security_sk_classify_flow(sk, flowi6_to_flowi(&fl6));
 
-	opt = rcu_dereference_protected(np->opt, sock_owned_by_user(sk));
+	opt = rcu_dereference_protected(np->opt, lockdep_sock_is_held(sk));
 	final_p = fl6_update_dst(&fl6, opt, &final);
 
 	dst = ip6_dst_lookup_flow(sk, &fl6, final_p);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9e481992dbaef2..7e37ebb5af396e 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1106,7 +1106,7 @@ static int inet_sk_reselect_saddr(struct sock *sk)
 	struct ip_options_rcu *inet_opt;
 
 	inet_opt = rcu_dereference_protected(inet->inet_opt,
-					     sock_owned_by_user(sk));
+					     lockdep_sock_is_held(sk));
 	if (inet_opt && inet_opt->opt.srr)
 		daddr = inet_opt->opt.faddr;
 
diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index bdb2a07ec363b7..40d6b87713a132 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -1933,7 +1933,8 @@ int cipso_v4_sock_setattr(struct sock *sk,
 
 	sk_inet = inet_sk(sk);
 
-	old = rcu_dereference_protected(sk_inet->inet_opt, sock_owned_by_user(sk));
+	old = rcu_dereference_protected(sk_inet->inet_opt,
+					lockdep_sock_is_held(sk));
 	if (sk_inet->is_icsk) {
 		sk_conn = inet_csk(sk);
 		if (old)
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 035ad645a8d9d8..cf073059192d99 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -635,7 +635,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 		if (err)
 			break;
 		old = rcu_dereference_protected(inet->inet_opt,
-						sock_owned_by_user(sk));
+						lockdep_sock_is_held(sk));
 		if (inet->is_icsk) {
 			struct inet_connection_sock *icsk = inet_csk(sk);
 #if IS_ENABLED(CONFIG_IPV6)
@@ -1295,7 +1295,7 @@ static int do_ip_getsockopt(struct sock *sk, int level, int optname,
 		struct ip_options_rcu *inet_opt;
 
 		inet_opt = rcu_dereference_protected(inet->inet_opt,
-						     sock_owned_by_user(sk));
+						     lockdep_sock_is_held(sk));
 		opt->optlen = 0;
 		if (inet_opt)
 			memcpy(optbuf, &inet_opt->opt,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ad450509029bce..17cc1840337756 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -157,7 +157,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 
 	nexthop = daddr = usin->sin_addr.s_addr;
 	inet_opt = rcu_dereference_protected(inet->inet_opt,
-					     sock_owned_by_user(sk));
+					     lockdep_sock_is_held(sk));
 	if (inet_opt && inet_opt->opt.srr) {
 		if (!daddr)
 			return -EINVAL;
@@ -882,8 +882,7 @@ struct tcp_md5sig_key *tcp_md5_do_lookup(const struct sock *sk,
 
 	/* caller either holds rcu_read_lock() or socket lock */
 	md5sig = rcu_dereference_check(tp->md5sig_info,
-				       sock_owned_by_user(sk) ||
-				       lockdep_is_held((spinlock_t *)&sk->sk_lock.slock));
+				       lockdep_sock_is_held(sk));
 	if (!md5sig)
 		return NULL;
 #if IS_ENABLED(CONFIG_IPV6)
@@ -928,8 +927,7 @@ int tcp_md5_do_add(struct sock *sk, const union tcp_md5_addr *addr,
 	}
 
 	md5sig = rcu_dereference_protected(tp->md5sig_info,
-					   sock_owned_by_user(sk) ||
-					   lockdep_is_held(&sk->sk_lock.slock));
+					   lockdep_sock_is_held(sk));
 	if (!md5sig) {
 		md5sig = kmalloc(sizeof(*md5sig), gfp);
 		if (!md5sig)
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 4449ad1f81147c..516b6a31c30f7b 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -407,7 +407,8 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 		if (optname != IPV6_RTHDR && !ns_capable(net->user_ns, CAP_NET_RAW))
 			break;
 
-		opt = rcu_dereference_protected(np->opt, sock_owned_by_user(sk));
+		opt = rcu_dereference_protected(np->opt,
+						lockdep_sock_is_held(sk));
 		opt = ipv6_renew_options(sk, opt, optname,
 					 (struct ipv6_opt_hdr __user *)optval,
 					 optlen);
@@ -1123,7 +1124,8 @@ static int do_ipv6_getsockopt(struct sock *sk, int level, int optname,
 		struct ipv6_txoptions *opt;
 
 		lock_sock(sk);
-		opt = rcu_dereference_protected(np->opt, sock_owned_by_user(sk));
+		opt = rcu_dereference_protected(np->opt,
+						lockdep_sock_is_held(sk));
 		len = ipv6_getsockopt_sticky(sk, opt, optname, optval, len);
 		release_sock(sk);
 		/* check if ipv6_getsockopt_sticky() returns err code */
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 711d209f912473..bd16dc4b6ba71b 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -234,7 +234,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	fl6.fl6_dport = usin->sin6_port;
 	fl6.fl6_sport = inet->inet_sport;
 
-	opt = rcu_dereference_protected(np->opt, sock_owned_by_user(sk));
+	opt = rcu_dereference_protected(np->opt, lockdep_sock_is_held(sk));
 	final_p = fl6_update_dst(&fl6, opt, &final);
 
 	security_sk_classify_flow(sk, flowi6_to_flowi(&fl6));
diff --git a/net/socket.c b/net/socket.c
index 5f77a8e93830bd..e3299cdfe9db39 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1046,7 +1046,7 @@ static int sock_fasync(int fd, struct file *filp, int on)
 		return -EINVAL;
 
 	lock_sock(sk);
-	wq = rcu_dereference_protected(sock->wq, sock_owned_by_user(sk));
+	wq = rcu_dereference_protected(sock->wq, lockdep_sock_is_held(sk));
 	fasync_helper(fd, filp, on, &wq->fasync_list);
 
 	if (!wq->fasync_list)
-- 
2.5.5

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox