Re: [RFC] tcp: race in receive part

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Eric Dumazet <eric.dumazet@gmail.com>
To: Jiri Olsa <jolsa@redhat.com>
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	fbl@redhat.com, nhorman@redhat.com, davem@redhat.com,
	oleg@redhat.com
Subject: Re: [RFC] tcp: race in receive part
Date: Wed, 24 Jun 2009 13:04:13 +0200	[thread overview]
Message-ID: <4A42082D.9060402@gmail.com> (raw)
In-Reply-To: <20090624102038.GA5409@jolsa.lab.eng.brq.redhat.com>

Jiri Olsa a écrit :
> On Tue, Jun 23, 2009 at 12:32:10PM +0200, Eric Dumazet wrote:
>> Jiri Olsa a écrit :
>>> Hi,
>>>
>>> thanks for an answer, and sorry for my late reply,
>>> we needed the cust permission to disclose the debug data.
>>>
>> I see ! Now this is me with litle time as I am traveling right now.
>>
>>> On Thu, Jun 18, 2009 at 04:06:42PM +0200, Eric Dumazet wrote:
>>>> Jiri Olsa a écrit :
>>>>> Hi,
>>>>>
>>>>> in RHEL4 we can see a race in the tcp layer. We were not able to reproduce 
>>>>> this on the upstream kernel, but since the issue occurs very rarelly
>>>>> (once per 8 days), we just might not be lucky.
>>>>>
>>>>> I'm affraid this might be a long email, I'll try to structure it nicely.. :)
>>>>>
>>>> Thanks for your mail and detailed analysis
>>>>
>>>>> RACE DESCRIPTION
>>>>> ================
>>>>>
>>>>> There's a nice pdf describing the issue (and sollution using locks) on
>>>>> https://bugzilla.redhat.com/attachment.cgi?id=345014
>>>> I could not reach this url unfortunatly 
>>>>
>>>> --> "You are not authorized to access bug #494404. "
>>> please try it now, the bug should be accessible now
>>>
>> Thanks, this doc is indeed nice :)
>>
>> But adding an write_lock()/write_unlock() in tcp_poll() was overkill
>> It had an sm_mb() implied because of the nesting of locks.
>>
>>>>> The race fires, when following code paths meet, and the tp->rcv_nxt and
>>>>> __add_wait_queue updates stay in CPU caches.
>>>>>
>>>>> CPU1                         CPU2
>>>>>
>>>>>
>>>>> sys_select                   receive packet
>>>>>   ...                        ...
>>>>>   __add_wait_queue           update tp->rcv_nxt
>>>>>   ...                        ...
>>>>>   tp->rcv_nxt check          sock_def_readable
>>>>>   ...                        {
>>>>>   schedule                      ...
>>>>>                                 if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
>>>>>                                         wake_up_interruptible(sk->sk_sleep)
>>>>>                                 ...
>>>>>                              }
>>>>>
>>>>> If there were no cache the code would work ok, since the wait_queue and
>>>>> rcv_nxt are opposit to each other.
>>>>>
>>>>> Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already
>>>>> passed the tp->rcv_nxt check and sleeps, or will get the new value for
>>>>> tp->rcv_nxt and will return with new data mask.  
>>>>> In both cases the process (CPU1) is being added to the wait queue, so the
>>>>> waitqueue_active (CPU2) call cannot miss and will wake up CPU1.
>>>>>
>>>>> The bad case is when the __add_wait_queue changes done by CPU1 stay in its
>>>>> cache , and so does the tp->rcv_nxt update on CPU2 side.  The CPU1 will then
>>>>> endup calling schedule and sleep forever if there are no more data on the
>>>>> socket.
>>>>>
>>>>> Adding smp_mb() calls before sock_def_readable call and after __add_wait_queue
>>>>> should prevent the above bad scenario.
>>>>>
>>>>> The upstream patch is attached. It seems to prevent the issue.
>>>>>
>>>>>
>>>>>
>>>>> CPU BUGS
>>>>> ========
>>>>>
>>>>> The customer has been able to reproduce this problem only on one CPU model:
>>>>> Xeon E5345*2. They didn't reproduce on XEON MV, for example.
>>>> Is there an easy way to reproduce the problem ?
>>> there's a reproducer attached to the bug
>>>
>>> https://enterprise.redhat.com/issue-tracker/?module=download&fid=201560&key=f6f87caf6ac2dc1eb1173257c8a5ef78
>>>
>>> it is basically the client/server program. 
>>> They're passing messages to each other. When a message is sent,
>>> both of them expect message on the input before sending another message.
>>>
>>> Very rarely the code hits the place when the process which called select
>>> is not woken up by incomming data. Probably because of the memory cache
>>> incoherency. See backtrace in the 
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=494404#c1
>>>
>>>
>>>>> That CPU model happens to have 2 possible issues, that might cause the issue:
>>>>> (see errata http://www.intel.com/Assets/PDF/specupdate/315338.pdf)
>>>>>
>>>>> AJ39 and AJ18. The first one can be workarounded by BIOS upgrade,
>>>>> the other one has following notes:
>>>> AJ18 only matters on unaligned accesses, tcp code doesnt do this.
>>>>
>>>>>       Software should ensure at least one of the following is true when
>>>>>       modifying shared data by multiple agents:
>>>>>              • The shared data is aligned
>>>>>              • Proper semaphores or barriers are used in order to
>>>>>                 prevent concurrent data accesses.
>>>>>
>>>>>
>>>>>
>>>>> RFC
>>>>> ===
>>>>>
>>>>> I'm aware that not having this issue reproduced on upstream lowers the odds
>>>>> having this checked in. However AFAICS the issue is present. I'd appreciate
>>>>> any comment/ideas.
>>>>>
>>>>>
>>>>> thanks,
>>>>> jirka
>>>>>
>>>>>
>>>>> Signed-off-by: Jiri Olsa <jolsa@redhat.com>
>>>>>
>>>>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>>>>> index 17b89c5..f5d9dbf 100644
>>>>> --- a/net/ipv4/tcp.c
>>>>> +++ b/net/ipv4/tcp.c
>>>>> @@ -340,6 +340,11 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
>>>>>  	struct tcp_sock *tp = tcp_sk(sk);
>>>>>  
>>>>>  	poll_wait(file, sk->sk_sleep, wait);
>>>> poll_wait() calls add_wait_queue() which contains a 
>>>> spin_lock_irqsave()/spin_unlock_irqrestore() pair
>>>>
>>>> Documentation/memory-barriers.txt states in line 1123 :
>>>>
>>>> 	Memory operations issued after the LOCK will be completed after the LOCK
>>>> 	operation has completed.
>>>>
>>>> and line 1131 states :
>>>>
>>>> 	Memory operations issued before the UNLOCK will be completed before the
>>>> 	UNLOCK operation has completed.
>>>>
>>>> So yes, there is no full smp_mb() in poll_wait()
>>>>
>>>>> +
>>>>> +	/* Get in sync with tcp_data_queue, tcp_urg
>>>>> +	   and tcp_rcv_established function. */
>>>>> +	smp_mb();
>>>> If this barrier is really necessary, I guess it should be done in poll_wait() itself.
>>>>
>>>> Documentation/memory-barriers.txt misses some information about poll_wait()
>>>>
>>>>
>>>>
>>>>
>>>>> +
>>>>>  	if (sk->sk_state == TCP_LISTEN)
>>>>>  		return inet_csk_listen_poll(sk);
>>>>>  
>>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>>>> index 2bdb0da..0606e5e 100644
>>>>> --- a/net/ipv4/tcp_input.c
>>>>> +++ b/net/ipv4/tcp_input.c
>>>>> @@ -4362,8 +4362,11 @@ queue_and_out:
>>>>>  
>>>>>  		if (eaten > 0)
>>>>>  			__kfree_skb(skb);
>>>>> -		else if (!sock_flag(sk, SOCK_DEAD))
>>>>> +		else if (!sock_flag(sk, SOCK_DEAD)) {
>>>>> +			/* Get in sync with tcp_poll function. */
>>>>> +			smp_mb();
>>>>>  			sk->sk_data_ready(sk, 0);
>>>>> +		}
>>>>>  		return;
>>>>>
>>>> Oh well... if smp_mb() is needed, I believe it should be done
>>>> right before "if (waitqueue_active(sk->sk_sleep) ... "
>>>>
>>>>  	read_lock(&sk->sk_callback_lock);
>>>> +	smp_mb();
>>>>  	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
>>>>  		wake_up_interruptible(sk->sk_sleep)
>>>>
>>>> It would match other parts in kernel (see fs/splice.c, fs/aio.c, ...)
>>>>
>>>> Strange thing is that read_lock() on x86 is a full memory barrier, as it uses
>>>> "lock subl $0x1,(%eax)"
>>>>
>>>> Maybe we could define a smp_mb_after_read_lock()  (a compiler barrier() on x86)
>>>>
>>> First version of the patch was actually in this layer, see
>>> https://bugzilla.redhat.com/attachment.cgi?id=345886
>>>
>>> I was adviced this could be to invasive (it was in waitqueue_active actually), 
>>> so I moved the change to the TCP layer itself... 
>>>
>>> As far as I understand the problem there's need for 2 barriers to be
>>> sure, the memory will have correct data. One in the codepath calling the
>>> select (tcp_poll), and in the other one updating the available data status 
>>> (sock_def_readable), am I missing smth?
>>>
>> Hmm, I am not saying your patch doesnt fix the problem, I am saying it
>> is a partial fix of a general problem. We might have same problem(s) in other
>> parts of network stack. This is a very serious issue.
>>
>> Point 1 :
>>
>> You added an smp_mb() call in tcp_poll(). This one looks fine to solve
>> the problem for tcp sockets. What about other protocols ? Do we have
>> same problem ?
> 
> Looks like most of the protocols using the poll_wait. Also I assume
> that most of them will probably have the same scenario as the one
> described (CPU1 and CPU2 codepaths in the RACE DESCRIPTION).
> 
> So I moved the poll smp_mb() call to the __pollwait function, plz
> check the attached diff. This might be too invasive, so another
> place could be probably polling callbacks themselfs like 
> datagram_poll (used very often by protocols), tcp_poll, udp_poll...
> 
> I'm still looking which way would be more suitable, comments are very
> welcome :)
> 
>> Point 2 :
>>
>> You added several smp_mb() calls in tcp input path. In my first
>> reply, I said it was probably better to add smp_mb() in a single 
>> place, right before "if (waitqueue_active(sk->sk_sleep) ... ",
>> but in all paths (input path & output path).
>>
>> Point 3 :
>>
>> The optimization we could do, defining
>> a smp_mb_after_read_lock() that could be a nop on x86
>>
>> 	read_lock(&sk->sk_callback_lock); // "lock subl $0x1,(%eax)" on x86
>> 	smp_mb_after_read_lock(); /* compiler barrier() on x86 */
>> 	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
>>  		wake_up_interruptible(sk->sk_sleep);
>>
>> Am I missing something ?
>>
>> ;)
>>
> 
> not at all :) I'm the one behind..
> 
> Anyway I made modifications based on Point 2) and 3) and the diff is
> attached, please check.
> 
> thanks a lot,
> jirka
> 
> 
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index b7e5db8..570c0ff 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
>  #define _raw_read_relax(lock)	cpu_relax()
>  #define _raw_write_relax(lock)	cpu_relax()
>  
> +/* The read_lock() on x86 is a full memory barrier. */
> +#define smp_mb__after_read_lock() barrier()
> +
>  #endif /* _ASM_X86_SPINLOCK_H */
> diff --git a/fs/select.c b/fs/select.c
> index d870237..f9ebd45 100644
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -219,6 +219,10 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
>  	init_waitqueue_func_entry(&entry->wait, pollwake);
>  	entry->wait.private = pwq;
>  	add_wait_queue(wait_address, &entry->wait);
> +
> +	/* This memory barrier is paired with the smp_mb__after_read_lock
> +	 * in the sock_def_readable. */
> +	smp_mb();
>  }
>  
>  int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> index 252b245..dd28726 100644
> --- a/include/linux/spinlock.h
> +++ b/include/linux/spinlock.h
> @@ -132,6 +132,11 @@ do {								\
>  #endif /*__raw_spin_is_contended*/
>  #endif
>  
> +/* The read_lock does not imply full memory barrier. */
> +#ifndef smp_mb__after_read_lock
> +#define smp_mb__after_read_lock() smp_mb()
> +#endif
> +
>  /**
>   * spin_unlock_wait - wait until the spinlock gets unlocked
>   * @lock: the spinlock in question.
> diff --git a/net/core/sock.c b/net/core/sock.c
> index b0ba569..11e414f 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1732,6 +1732,7 @@ static void sock_def_error_report(struct sock *sk)
>  static void sock_def_readable(struct sock *sk, int len)
>  {
>  	read_lock(&sk->sk_callback_lock);
> +	smp_mb__after_read_lock();
>  	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
>  		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN |
>  						POLLRDNORM | POLLRDBAND);

I suspect we need to change all places where we do :


read_lock(&sk->sk_callback_lock);
...
if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))

to :

read_lock(&sk->sk_callback_lock);
...
smp_mb__after_read_lock();
if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))

I suggest you add a helper function like

static inline int sk_has_sleeper(struct sock *sk)
{
	/*
	 * some nice comment here why this barrier is needed
	 * (and where opposite one is located)
	 */
	smp_mb__after_read_lock();
	return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
}

and use it in net/atm/common.c : vcc_def_wakeup() & vcc_write_space()
net/dccp/output.c : dccp_write_space()
net/core/stream.c : sk_stream_write_space()
net/core/sock.c : sock_def_wakeup(), sock_def_error_report(),
		sock_def_readable(), sock_def_write_space()
net/iucv/af_iucv.c : iucv_sock_wake_msglim()

and several others as well in net tree... "find|grep" are your friends :)


Thanks

next prev parent reply	other threads:[~2009-06-24 11:05 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-06-18 10:27 [RFC] tcp: race in receive part Jiri Olsa
2009-06-18 14:06 ` Eric Dumazet
2009-06-23  9:12   ` Jiri Olsa
2009-06-23 10:32     ` Eric Dumazet
2009-06-23 19:44       ` Oleg Nesterov
2009-06-24 10:20       ` Jiri Olsa
2009-06-24 11:04         ` Eric Dumazet [this message]
2009-06-24 16:21           ` Jiri Olsa
2009-06-24 16:30             ` Oleg Nesterov
2009-06-24 16:41               ` Oleg Nesterov
2009-06-25 10:51                 ` Eric Dumazet
2009-06-25 10:28             ` Eric Dumazet
2009-06-25 10:46               ` Eric Dumazet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4A42082D.9060402@gmail.com \
    --to=eric.dumazet@gmail.com \
    --cc=davem@redhat.com \
    --cc=fbl@redhat.com \
    --cc=jolsa@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=nhorman@redhat.com \
    --cc=oleg@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.