Re: [RFC] tcp: race in receive part

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Eric Dumazet <eric.dumazet@gmail.com>
To: Jiri Olsa <jolsa@redhat.com>
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	fbl@redhat.com, nhorman@redhat.com, davem@redhat.com,
	oleg@redhat.com
Subject: Re: [RFC] tcp: race in receive part
Date: Thu, 18 Jun 2009 16:06:42 +0200	[thread overview]
Message-ID: <4A3A49F2.6060705@gmail.com> (raw)
In-Reply-To: <20090618102727.GC3782@jolsa.lab.eng.brq.redhat.com>

Jiri Olsa a écrit :
> Hi,
> 
> in RHEL4 we can see a race in the tcp layer. We were not able to reproduce 
> this on the upstream kernel, but since the issue occurs very rarelly
> (once per 8 days), we just might not be lucky.
> 
> I'm affraid this might be a long email, I'll try to structure it nicely.. :)
> 

Thanks for your mail and detailed analysis

> 
> 
> RACE DESCRIPTION
> ================
> 
> There's a nice pdf describing the issue (and sollution using locks) on
> https://bugzilla.redhat.com/attachment.cgi?id=345014

I could not reach this url unfortunatly 

--> "You are not authorized to access bug #494404. "

> 
> 
> The race fires, when following code paths meet, and the tp->rcv_nxt and
> __add_wait_queue updates stay in CPU caches.
> 
> CPU1                         CPU2
> 
> 
> sys_select                   receive packet
>   ...                        ...
>   __add_wait_queue           update tp->rcv_nxt
>   ...                        ...
>   tp->rcv_nxt check          sock_def_readable
>   ...                        {
>   schedule                      ...
>                                 if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
>                                         wake_up_interruptible(sk->sk_sleep)
>                                 ...
>                              }
> 
> If there were no cache the code would work ok, since the wait_queue and
> rcv_nxt are opposit to each other.
> 
> Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already
> passed the tp->rcv_nxt check and sleeps, or will get the new value for
> tp->rcv_nxt and will return with new data mask.  
> In both cases the process (CPU1) is being added to the wait queue, so the
> waitqueue_active (CPU2) call cannot miss and will wake up CPU1.
> 
> The bad case is when the __add_wait_queue changes done by CPU1 stay in its
> cache , and so does the tp->rcv_nxt update on CPU2 side.  The CPU1 will then
> endup calling schedule and sleep forever if there are no more data on the
> socket.
> 
> Adding smp_mb() calls before sock_def_readable call and after __add_wait_queue
> should prevent the above bad scenario.
> 
> The upstream patch is attached. It seems to prevent the issue.
> 
> 
> 
> CPU BUGS
> ========
> 
> The customer has been able to reproduce this problem only on one CPU model:
> Xeon E5345*2. They didn't reproduce on XEON MV, for example.

Is there an easy way to reproduce the problem ?

> 
> That CPU model happens to have 2 possible issues, that might cause the issue:
> (see errata http://www.intel.com/Assets/PDF/specupdate/315338.pdf)
> 
> AJ39 and AJ18. The first one can be workarounded by BIOS upgrade,
> the other one has following notes:

AJ18 only matters on unaligned accesses, tcp code doesnt do this.

> 
>       Software should ensure at least one of the following is true when
>       modifying shared data by multiple agents:
>              • The shared data is aligned
>              • Proper semaphores or barriers are used in order to
>                 prevent concurrent data accesses.
> 
> 
> 
> RFC
> ===
> 
> I'm aware that not having this issue reproduced on upstream lowers the odds
> having this checked in. However AFAICS the issue is present. I'd appreciate
> any comment/ideas.
> 
> 
> thanks,
> jirka
> 
> 
> Signed-off-by: Jiri Olsa <jolsa@redhat.com>
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 17b89c5..f5d9dbf 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -340,6 +340,11 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
>  	struct tcp_sock *tp = tcp_sk(sk);
>  
>  	poll_wait(file, sk->sk_sleep, wait);

poll_wait() calls add_wait_queue() which contains a 
spin_lock_irqsave()/spin_unlock_irqrestore() pair

Documentation/memory-barriers.txt states in line 1123 :

	Memory operations issued after the LOCK will be completed after the LOCK
	operation has completed.

and line 1131 states :

	Memory operations issued before the UNLOCK will be completed before the
	UNLOCK operation has completed.

So yes, there is no full smp_mb() in poll_wait()

> +
> +	/* Get in sync with tcp_data_queue, tcp_urg
> +	   and tcp_rcv_established function. */
> +	smp_mb();

If this barrier is really necessary, I guess it should be done in poll_wait() itself.

Documentation/memory-barriers.txt misses some information about poll_wait()




> +
>  	if (sk->sk_state == TCP_LISTEN)
>  		return inet_csk_listen_poll(sk);
>  
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 2bdb0da..0606e5e 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4362,8 +4362,11 @@ queue_and_out:
>  
>  		if (eaten > 0)
>  			__kfree_skb(skb);
> -		else if (!sock_flag(sk, SOCK_DEAD))
> +		else if (!sock_flag(sk, SOCK_DEAD)) {
> +			/* Get in sync with tcp_poll function. */
> +			smp_mb();
>  			sk->sk_data_ready(sk, 0);
> +		}
>  		return;
>

Oh well... if smp_mb() is needed, I believe it should be done
right before "if (waitqueue_active(sk->sk_sleep) ... "

 	read_lock(&sk->sk_callback_lock);
+	smp_mb();
 	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
 		wake_up_interruptible(sk->sk_sleep)

It would match other parts in kernel (see fs/splice.c, fs/aio.c, ...)

Strange thing is that read_lock() on x86 is a full memory barrier, as it uses
"lock subl $0x1,(%eax)"

Maybe we could define a smp_mb_after_read_lock()  (a compiler barrier() on x86)

next prev parent reply	other threads:[~2009-06-18 14:07 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-06-18 10:27 [RFC] tcp: race in receive part Jiri Olsa
2009-06-18 14:06 ` Eric Dumazet [this message]
2009-06-23  9:12   ` Jiri Olsa
2009-06-23 10:32     ` Eric Dumazet
2009-06-23 19:44       ` Oleg Nesterov
2009-06-24 10:20       ` Jiri Olsa
2009-06-24 11:04         ` Eric Dumazet
2009-06-24 16:21           ` Jiri Olsa
2009-06-24 16:30             ` Oleg Nesterov
2009-06-24 16:41               ` Oleg Nesterov
2009-06-25 10:51                 ` Eric Dumazet
2009-06-25 10:28             ` Eric Dumazet
2009-06-25 10:46               ` Eric Dumazet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4A3A49F2.6060705@gmail.com \
    --to=eric.dumazet@gmail.com \
    --cc=davem@redhat.com \
    --cc=fbl@redhat.com \
    --cc=jolsa@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=nhorman@redhat.com \
    --cc=oleg@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.