From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eric Dumazet
Subject: Re: [RFC] tcp: race in receive part
Date: Thu, 18 Jun 2009 16:06:42 +0200
Message-ID: <4A3A49F2.6060705@gmail.com>
References: <20090618102727.GC3782@jolsa.lab.eng.brq.redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, fbl@redhat.com, nhorman@redhat.com, davem@redhat.com, oleg@redhat.com
To: Jiri Olsa
Return-path:
Received: from gw1.cosmosbay.com ([212.99.114.194]:59069 "EHLO gw1.cosmosbay.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758566AbZFROHA (ORCPT ); Thu, 18 Jun 2009 10:07:00 -0400
In-Reply-To: <20090618102727.GC3782@jolsa.lab.eng.brq.redhat.com>
Sender: netdev-owner@vger.kernel.org
List-ID:

Jiri Olsa wrote:
> Hi,
>
> in RHEL4 we can see a race in the TCP layer. We were not able to reproduce
> this on the upstream kernel, but since the issue occurs very rarely
> (once per 8 days), we just might not be lucky.
>
> I'm afraid this might be a long email, I'll try to structure it nicely.. :)
>

Thanks for your mail and detailed analysis.

>
> RACE DESCRIPTION
> ================
>
> There's a nice pdf describing the issue (and solution using locks) on
> https://bugzilla.redhat.com/attachment.cgi?id=345014

I could not reach this url unfortunately
--> "You are not authorized to access bug #494404."

>
> The race fires when the following code paths meet, and the tp->rcv_nxt and
> __add_wait_queue updates stay in CPU caches.
>
> CPU1                          CPU2
>
> sys_select                    receive packet
>   ...                           ...
>   __add_wait_queue              update tp->rcv_nxt
>   ...                           ...
>   tp->rcv_nxt check             sock_def_readable
>   ...                           {
>   schedule                        ...
>                                   if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
>                                           wake_up_interruptible(sk->sk_sleep)
>                                   ...
>                                 }
>
> If there were no cache the code would work ok, since the wait_queue and
> rcv_nxt updates are opposite to each other.
>
> Meaning that once tp->rcv_nxt is updated by CPU2, CPU1 either already
> passed the tp->rcv_nxt check and sleeps, or will get the new value for
> tp->rcv_nxt and will return with the new data mask.
> In both cases the process (CPU1) is being added to the wait queue, so the
> waitqueue_active (CPU2) call cannot miss and will wake up CPU1.
>
> The bad case is when the __add_wait_queue changes done by CPU1 stay in its
> cache, and so does the tp->rcv_nxt update on the CPU2 side. CPU1 will then
> end up calling schedule and sleep forever if there is no more data on the
> socket.
>
> Adding smp_mb() calls before the sock_def_readable call and after __add_wait_queue
> should prevent the above bad scenario.
>
> The upstream patch is attached. It seems to prevent the issue.
>
>
> CPU BUGS
> ========
>
> The customer has been able to reproduce this problem only on one CPU model:
> Xeon E5345*2. They didn't reproduce on XEON MV, for example.

Is there an easy way to reproduce the problem?

>
> That CPU model happens to have 2 possible issues that might cause the problem
> (see errata http://www.intel.com/Assets/PDF/specupdate/315338.pdf):
>
> AJ39 and AJ18. The first one can be worked around by a BIOS upgrade,
> the other one has the following notes:

AJ18 only matters for unaligned accesses; the TCP code doesn't do that.

>
>     Software should ensure at least one of the following is true when
>     modifying shared data by multiple agents:
>     • The shared data is aligned
>     • Proper semaphores or barriers are used in order to
>       prevent concurrent data accesses.
>
>
> RFC
> ===
>
> I'm aware that not having this issue reproduced on upstream lowers the odds
> of having this checked in. However AFAICS the issue is present. I'd appreciate
> any comments/ideas.
>
>
> thanks,
> jirka
>
>
> Signed-off-by: Jiri Olsa
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 17b89c5..f5d9dbf 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -340,6 +340,11 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
> 	struct tcp_sock *tp = tcp_sk(sk);
>
> 	poll_wait(file, sk->sk_sleep, wait);

poll_wait() calls add_wait_queue(), which contains a
spin_lock_irqsave()/spin_unlock_irqrestore() pair.

Documentation/memory-barriers.txt states in line 1123:

  Memory operations issued after the LOCK will be completed after the LOCK
  operation has completed.

and line 1131 states:

  Memory operations issued before the UNLOCK will be completed before the
  UNLOCK operation has completed.

So yes, there is no full smp_mb() in poll_wait().

> +
> +	/* Get in sync with tcp_data_queue, tcp_urg
> +	   and tcp_rcv_established function. */
> +	smp_mb();

If this barrier is really necessary, I guess it should be done in poll_wait()
itself. Documentation/memory-barriers.txt misses some information about poll_wait().

> +
> 	if (sk->sk_state == TCP_LISTEN)
> 		return inet_csk_listen_poll(sk);
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 2bdb0da..0606e5e 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4362,8 +4362,11 @@ queue_and_out:
>
> 	if (eaten > 0)
> 		__kfree_skb(skb);
> -	else if (!sock_flag(sk, SOCK_DEAD))
> +	else if (!sock_flag(sk, SOCK_DEAD)) {
> +		/* Get in sync with tcp_poll function. */
> +		smp_mb();
> 		sk->sk_data_ready(sk, 0);
> +	}
> 	return;
>

Oh well... if smp_mb() is needed, I believe it should be done right before
"if (waitqueue_active(sk->sk_sleep) ...":

	read_lock(&sk->sk_callback_lock);
+	smp_mb();
	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
		wake_up_interruptible(sk->sk_sleep);

It would match other parts of the kernel (see fs/splice.c, fs/aio.c, ...).
The strange thing is that read_lock() on x86 is a full memory barrier, as it
uses "lock subl $0x1,(%eax)".

Maybe we could define a smp_mb_after_read_lock() (a compiler barrier() on x86).