From: Tejun Heo <htejun@gmail.com>
To: Jiri Olsa <jolsa@redhat.com>
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
fbl@redhat.com, nhorman@redhat.com, davem@redhat.com,
oleg@redhat.com, eric.dumazet@gmail.com
Subject: Re: [PATCH] net: fix race in the receive/select
Date: Fri, 26 Jun 2009 10:50:28 +0900
Message-ID: <4A442964.2010406@gmail.com>
In-Reply-To: <20090625122545.GA3625@jolsa.lab.eng.brq.redhat.com>
Hello,
Jiri Olsa wrote:
> Adding memory barrier to the __pollwait function paired with
> receive callbacks. The smp_mb__after_lock define is added,
> since {read|write|spin}_lock() on x86 are full memory barriers.
>
> The race fires when the following code paths meet and the tp->rcv_nxt and
> __add_wait_queue updates stay in the CPU caches.
>
>
>   CPU1                         CPU2
>
>   sys_select                   receive packet
>   ...                          ...
>   __add_wait_queue             update tp->rcv_nxt
>   ...                          ...
>   tp->rcv_nxt check            sock_def_readable
>   ...                          {
>   schedule                     ...
>                                if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
>                                        wake_up_interruptible(sk->sk_sleep)
>                                ...
>                                }
>
> If there were no caching the code would work ok, since the wait_queue and
> rcv_nxt accesses happen in opposite order on the two CPUs.
>
> This means that once tp->rcv_nxt is updated by CPU2, CPU1 either has already
> passed the tp->rcv_nxt check and sleeps, or it will see the new value of
> tp->rcv_nxt and return with a new data mask.
> In both cases the process (CPU1) has already been added to the wait queue, so
> the waitqueue_active() check on CPU2 cannot miss it and will wake up CPU1.
>
> The bad case is when the __add_wait_queue changes done by CPU1 stay in its
> cache, and so does the tp->rcv_nxt update on CPU2's side. CPU1 will then
> end up calling schedule and sleeping forever if no more data arrives on the
> socket.
So, the problem is the half-barrier semantics of spin_lock on CPU1's
side (the read for the tp->rcv_nxt check can creep above the wait
queue update) and the lack of any barrier in the waitqueue_active()
check on CPU2's side (the read of the wait queue list can creep above
the tp->rcv_nxt update). Am I understanding it right?
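To make sure we are talking about the same ordering, here is a
minimal sketch of the two sides (illustrative pseudo-code, not the
actual kernel functions; 'seen' stands for whatever sequence number
the poller last observed):

	/* CPU1 - select/poll side */
	add_wait_queue(sk->sk_sleep, &wait);	/* A: wait queue update */
	smp_mb();				/* pairs with the mb below */
	if (tp->rcv_nxt != seen)		/* B: rcv_nxt read */
		return mask;			/* new data, don't sleep */

	/* CPU2 - receive side */
	tp->rcv_nxt = new_seq;			/* C: rcv_nxt update */
	smp_mb();				/* pairs with the mb above */
	if (waitqueue_active(sk->sk_sleep))	/* D: wait queue read */
		wake_up_interruptible(sk->sk_sleep);

Without the two barriers, B can be reordered before A and D before C,
so each side can miss the other's update and the poller sleeps
forever.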
This is a little bit scary. The interfaces kind of suggest that they
have strong enough barrier semantics (well, I would assume that). I
wonder whether there are more places where this kind of race
condition exists.
> @@ -219,6 +219,10 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
> init_waitqueue_func_entry(&entry->wait, pollwake);
> entry->wait.private = pwq;
> add_wait_queue(wait_address, &entry->wait);
> +
> + /* This memory barrier is paired with the smp_mb__after_lock
> + * in the sk_has_sleeper. */
> + smp_mb();
I'm not entirely sure this is the correct place to do it while the mb
for the other side lives in network code. Wouldn't it be better to
move this memory barrier to the network select code? It's strange for
an API to provide only a single side of a barrier pair and leave the
other side to the API user.
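For reference, the other half of the pair in the patch looks roughly
like this (a sketch reconstructed from the quoted comment, so the
exact body may differ from the actual hunk):

	static inline int sk_has_sleeper(struct sock *sk)
	{
		/*
		 * Paired with the smp_mb() in __pollwait(): orders the
		 * rcv_nxt update (done under the socket lock) before
		 * the wait queue read below.
		 */
		smp_mb__after_lock();
		return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
	}

So one half of the pair ends up in fs/select.c and the other half in
net/ code, which is exactly the split that bothers me.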
Also, maybe we need smp_mb__after_unlock() too? Maybe
add_wait_queue_mb() possibly paired with wait_queue_active_mb() is
better? On x86, it wouldn't make any difference though.
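Something like the following is what I have in mind (hypothetical
helpers, they don't exist in the tree; names up for discussion):

	static inline void add_wait_queue_mb(wait_queue_head_t *q,
					     wait_queue_t *wait)
	{
		add_wait_queue(q, wait);
		smp_mb();	/* pairs with wait_queue_active_mb() */
	}

	static inline int wait_queue_active_mb(wait_queue_head_t *q)
	{
		smp_mb();	/* pairs with add_wait_queue_mb() */
		return waitqueue_active(q);
	}

That way both halves of the barrier pair live in the waitqueue API
itself.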
One more thing, can you please use fully winged style for multiline
comments?
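That is, something like:

	/*
	 * This memory barrier is paired with the
	 * smp_mb__after_lock() in sk_has_sleeper().
	 */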
Thanks.
--
tejun