netdev.vger.kernel.org archive mirror
From: Jike Song <albcamus@gmail.com>
To: Eric Dumazet <eric.dumazet@gmail.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	netdev@vger.kernel.org, David Miller <davem@davemloft.net>,
	Parag Warudkar <parag.lkml
Subject: Re: [PATCH] net: Fix sock_wfree() race
Date: Wed, 9 Sep 2009 15:14:39 +0800	[thread overview]
Message-ID: <df9815e70909090014w27828827j860ecee4f1c3a9e2@mail.gmail.com> (raw)
In-Reply-To: <4AA6DF7B.7060105@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3580 bytes --]

On Wed, Sep 9, 2009 at 6:49 AM, Eric Dumazet<eric.dumazet@gmail.com> wrote:
> Eric Dumazet wrote:
>> Jike Song wrote:
>>> On Tue, Sep 8, 2009 at 3:38 PM, Eric Dumazet<eric.dumazet@gmail.com> wrote:
>>>> We decrement a refcnt on an object that has already been freed.
>>>>
>>>> (SLUB DEBUG poisons the zone with 0x6B pattern)
>>>>
>>>> You might add this patch to trigger a WARN_ON when refcnt >= 0x60000000U
>>>> in sk_free() : We'll see the path trying to delete an already freed sock
>>>>
>>>> diff --git a/net/core/sock.c b/net/core/sock.c
>>>> index 7633422..1cb85ff 100644
>>>> --- a/net/core/sock.c
>>>> +++ b/net/core/sock.c
>>>> @@ -1058,6 +1058,7 @@ static void __sk_free(struct sock *sk)
>>>>
>>>>  void sk_free(struct sock *sk)
>>>>  {
>>>> +       WARN_ON(atomic_read(&sk->sk_wmem_alloc) >= 0x60000000U);
>>>>        /*
>>>>         * We substract one from sk_wmem_alloc and can know if
>>>>        * some packets are still in some tx queue.
>>>>
>>>>
>>> The output of dmesg with this patch applied is attached.
>>>
>>>
>>
>> Unfortunately this WARN_ON was not triggered;
>> maybe the freeing comes from sock_wfree().
>>
>> Could you try this patch instead ?
>>
>> Thanks
>>
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 7633422..30469dc 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -1058,6 +1058,7 @@ static void __sk_free(struct sock *sk)
>>
>>  void sk_free(struct sock *sk)
>>  {
>> +     WARN_ON(atomic_read(&sk->sk_wmem_alloc) >= 0x60000000U);
>>       /*
>>        * We substract one from sk_wmem_alloc and can know if
>>       * some packets are still in some tx queue.
>> @@ -1220,6 +1221,7 @@ void sock_wfree(struct sk_buff *skb)
>>       struct sock *sk = skb->sk;
>>       int res;
>>
>> +     WARN_ON(atomic_read(&sk->sk_wmem_alloc) >= 0x60000000U);
>>       /* In case it might be waiting for more memory. */
>>       res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
>>       if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
>>
>
>
> David, I believe the problem could come from a race in sock_wfree().
>
> It used to involve two atomic ops:
>
> one doing the atomic_sub(skb->truesize, &sk->sk_wmem_alloc),
> then one sock_put() doing the atomic_dec_and_test(&sk->sk_refcnt).
>
> Now, if two CPUs run concurrently:
>
> CPU 1 calling sock_wfree(), and
> CPU 2 calling the 'final' sock_put(),
> then CPU 1's sock_wfree() might call sk->sk_write_space(sk)
> while CPU 2 is already freeing the socket.
>
>
> Please note I did not test this patch; it's very late here and I should get some sleep now...
>
> Thanks
>
> [PATCH] net: Fix sock_wfree() race
>
> Commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
> (net: No more expensive sock_hold()/sock_put() on each tx)
> opens a window in sock_wfree() where another cpu
> might free the socket we are working on.
>
> Fix is to call sk->sk_write_space(sk) only
> while still holding a reference on sk.
>
> Since this call is now made before the
> atomic_sub(truesize, &sk->sk_wmem_alloc), we should pass truesize as
> a bias for any sk_wmem_alloc evaluations.
>
> Reported-by: Jike Song <albcamus@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Eric, I'm unable to apply this patch cleanly.  I applied it by hand
and made the necessary changes. The patch I tested is attached.

With this patch applied, when I run vncviewer, the kerneloops service
still reports a kernel failure, but I can't see anything in the dmesg output.


-- 
Thanks,
Jike

[-- Attachment #2: my.patch --]
[-- Type: application/octet-stream, Size: 12139 bytes --]

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 42b6c63..d1040fe 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -843,11 +843,11 @@ static struct rtnl_link_ops tun_link_ops __read_mostly = {
 	.validate	= tun_validate,
 };
 
-static void tun_sock_write_space(struct sock *sk)
+static void tun_sock_write_space(struct sock *sk, unsigned int bias)
 {
 	struct tun_struct *tun;
 
-	if (!sock_writeable(sk))
+	if (!sock_writeable_bias(sk, bias))
 		return;
 
 	if (!test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags))
diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index 04dba23..f80ebff 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -23,7 +23,7 @@ struct svc_sock {
 	/* We keep the old state_change and data_ready CB's here */
 	void			(*sk_ostate)(struct sock *);
 	void			(*sk_odata)(struct sock *, int bytes);
-	void			(*sk_owspace)(struct sock *);
+	void			(*sk_owspace)(struct sock *, unsigned int bias);
 
 	/* private TCP part */
 	u32			sk_reclen;	/* length of record */
diff --git a/include/net/sock.h b/include/net/sock.h
index 950409d..5fee407 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -296,7 +296,7 @@ struct sock {
 	/* XXX 4 bytes hole on 64 bit */
 	void			(*sk_state_change)(struct sock *sk);
 	void			(*sk_data_ready)(struct sock *sk, int bytes);
-	void			(*sk_write_space)(struct sock *sk);
+	void			(*sk_write_space)(struct sock *sk, unsigned int bias);
 	void			(*sk_error_report)(struct sock *sk);
   	int			(*sk_backlog_rcv)(struct sock *sk,
 						  struct sk_buff *skb);  
@@ -554,7 +554,7 @@ static inline int sk_stream_wspace(struct sock *sk)
 	return sk->sk_sndbuf - sk->sk_wmem_queued;
 }
 
-extern void sk_stream_write_space(struct sock *sk);
+extern void sk_stream_write_space(struct sock *sk, unsigned int bias);
 
 static inline int sk_stream_memory_free(struct sock *sk)
 {
@@ -1433,6 +1433,11 @@ static inline int sock_writeable(const struct sock *sk)
 	return atomic_read(&sk->sk_wmem_alloc) < (sk->sk_sndbuf >> 1);
 }
 
+static inline int sock_writeable_bias(const struct sock *sk, unsigned int bias)
+{
+	return (atomic_read(&sk->sk_wmem_alloc) - bias) < (sk->sk_sndbuf >> 1);
+}
+
 static inline gfp_t gfp_any(void)
 {
 	return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
diff --git a/net/atm/raw.c b/net/atm/raw.c
index cbfcc71..ea14509 100644
--- a/net/atm/raw.c
+++ b/net/atm/raw.c
@@ -36,7 +36,7 @@ static void atm_pop_raw(struct atm_vcc *vcc,struct sk_buff *skb)
 		sk_wmem_alloc_get(sk), skb->truesize);
 	atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
 	dev_kfree_skb_any(skb);
-	sk->sk_write_space(sk);
+	sk->sk_write_space(sk, 0);
 }
 
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 7633422..b840c10 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -510,7 +510,7 @@ set_sndbuf:
 		 *	Wake up sending tasks if we
 		 *	upped the value.
 		 */
-		sk->sk_write_space(sk);
+		sk->sk_write_space(sk, 0);
 		break;
 
 	case SO_SNDBUFFORCE:
@@ -1220,10 +1220,10 @@ void sock_wfree(struct sk_buff *skb)
 	struct sock *sk = skb->sk;
 	int res;
 
-	/* In case it might be waiting for more memory. */
-	res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
 	if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
-		sk->sk_write_space(sk);
+		sk->sk_write_space(sk, skb->truesize);
+
+	res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
 	/*
 	 * if sk_wmem_alloc reached 0, we are last user and should
 	 * free this sock, as sk_free() call could not do it.
@@ -1766,20 +1766,20 @@ static void sock_def_readable(struct sock *sk, int len)
 	read_unlock(&sk->sk_callback_lock);
 }
 
-static void sock_def_write_space(struct sock *sk)
+static void sock_def_write_space(struct sock *sk, unsigned int bias)
 {
 	read_lock(&sk->sk_callback_lock);
 
 	/* Do not wake up a writer until he can make "significant"
 	 * progress.  --DaveM
 	 */
-	if ((atomic_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf) {
+	if (((atomic_read(&sk->sk_wmem_alloc) - bias) << 1) <= sk->sk_sndbuf) {
 		if (sk_has_sleeper(sk))
 			wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT |
 						POLLWRNORM | POLLWRBAND);
 
 		/* Should agree with poll, otherwise some programs break */
-		if (sock_writeable(sk))
+		if (sock_writeable_bias(sk, bias))
 			sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	}
 
diff --git a/net/core/stream.c b/net/core/stream.c
index a37debf..df720e9 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -25,7 +25,7 @@
  *
  * FIXME: write proper description
  */
-void sk_stream_write_space(struct sock *sk)
+void sk_stream_write_space(struct sock *sk, unsigned int bias)
 {
 	struct socket *sock = sk->sk_socket;
 
diff --git a/net/dccp/ccids/ccid3.c b/net/dccp/ccids/ccid3.c
index a27b7f4..bb9cf19 100644
--- a/net/dccp/ccids/ccid3.c
+++ b/net/dccp/ccids/ccid3.c
@@ -480,7 +480,7 @@ done_computing_x:
 	 * As we have calculated new ipi, delta, t_nom it is possible
 	 * that we now can send a packet, so wake up dccp_wait_for_ccid
 	 */
-	sk->sk_write_space(sk);
+	sk->sk_write_space(sk, 0);
 
 	/*
 	 * Update timeout interval for the nofeedback timer.
diff --git a/net/dccp/dccp.h b/net/dccp/dccp.h
index d6bc473..f32274f 100644
--- a/net/dccp/dccp.h
+++ b/net/dccp/dccp.h
@@ -235,7 +235,7 @@ extern void dccp_send_sync(struct sock *sk, const u64 seq,
 			   const enum dccp_pkt_type pkt_type);
 
 extern void dccp_write_xmit(struct sock *sk, int block);
-extern void dccp_write_space(struct sock *sk);
+extern void dccp_write_space(struct sock *sk, unsigned int);
 
 extern void dccp_init_xmit_timers(struct sock *sk);
 static inline void dccp_clear_xmit_timers(struct sock *sk)
diff --git a/net/dccp/output.c b/net/dccp/output.c
index c96119f..cf0635e 100644
--- a/net/dccp/output.c
+++ b/net/dccp/output.c
@@ -192,14 +192,14 @@ unsigned int dccp_sync_mss(struct sock *sk, u32 pmtu)
 
 EXPORT_SYMBOL_GPL(dccp_sync_mss);
 
-void dccp_write_space(struct sock *sk)
+void dccp_write_space(struct sock *sk, unsigned int bias)
 {
 	read_lock(&sk->sk_callback_lock);
 
 	if (sk_has_sleeper(sk))
 		wake_up_interruptible(sk->sk_sleep);
 	/* Should agree with poll, otherwise some programs break */
-	if (sock_writeable(sk))
+	if (sock_writeable_bias(sk, bias))
 		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
 
 	read_unlock(&sk->sk_callback_lock);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2bdb0da..9c24d07 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4819,7 +4819,7 @@ static void tcp_new_space(struct sock *sk)
 		tp->snd_cwnd_stamp = tcp_time_stamp;
 	}
 
-	sk->sk_write_space(sk);
+	sk->sk_write_space(sk, 0);
 }
 
 static void tcp_check_space(struct sock *sk)
diff --git a/net/phonet/pep-gprs.c b/net/phonet/pep-gprs.c
index 480839d..18ccc24 100644
--- a/net/phonet/pep-gprs.c
+++ b/net/phonet/pep-gprs.c
@@ -38,7 +38,7 @@ struct gprs_dev {
 	struct sock		*sk;
 	void			(*old_state_change)(struct sock *);
 	void			(*old_data_ready)(struct sock *, int);
-	void			(*old_write_space)(struct sock *);
+	void			(*old_write_space)(struct sock *, unsigned int);
 
 	struct net_device	*dev;
 };
@@ -157,7 +157,7 @@ static void gprs_data_ready(struct sock *sk, int len)
 	}
 }
 
-static void gprs_write_space(struct sock *sk)
+static void gprs_write_space(struct sock *sk, unsigned int bias)
 {
 	struct gprs_dev *gp = sk->sk_user_data;
 
diff --git a/net/phonet/pep.c b/net/phonet/pep.c
index eef833e..0d15822 100644
--- a/net/phonet/pep.c
+++ b/net/phonet/pep.c
@@ -268,7 +268,7 @@ static int pipe_rcv_status(struct sock *sk, struct sk_buff *skb)
 		return -EOPNOTSUPP;
 	}
 	if (wake)
-		sk->sk_write_space(sk);
+		sk->sk_write_space(sk, 0);
 	return 0;
 }
 
@@ -389,7 +389,7 @@ static int pipe_do_rcv(struct sock *sk, struct sk_buff *skb)
 	case PNS_PIPE_ENABLED_IND:
 		if (!pn_flow_safe(pn->tx_fc)) {
 			atomic_set(&pn->tx_credits, 1);
-			sk->sk_write_space(sk);
+			sk->sk_write_space(sk, 0);
 		}
 		if (sk->sk_state == TCP_ESTABLISHED)
 			break; /* Nothing to do */
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 23128ee..8c1642c 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -380,7 +380,7 @@ static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
 	sock->sk->sk_sndbuf = snd * 2;
 	sock->sk->sk_rcvbuf = rcv * 2;
 	sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
-	sock->sk->sk_write_space(sock->sk);
+	sock->sk->sk_write_space(sock->sk, 0);
 	release_sock(sock->sk);
 #endif
 }
@@ -405,7 +405,7 @@ static void svc_udp_data_ready(struct sock *sk, int count)
 /*
  * INET callback when space is newly available on the socket.
  */
-static void svc_write_space(struct sock *sk)
+static void svc_write_space(struct sock *sk, unsigned int bias)
 {
 	struct svc_sock	*svsk = (struct svc_sock *)(sk->sk_user_data);
 
@@ -422,13 +422,13 @@ static void svc_write_space(struct sock *sk)
 	}
 }
 
-static void svc_tcp_write_space(struct sock *sk)
+static void svc_tcp_write_space(struct sock *sk, unsigned int bias)
 {
 	struct socket *sock = sk->sk_socket;
 
 	if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) && sock)
 		clear_bit(SOCK_NOSPACE, &sock->flags);
-	svc_write_space(sk);
+	svc_write_space(sk, bias);
 }
 
 /*
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 83c73c4..11e4d35 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -262,7 +262,7 @@ struct sock_xprt {
 	 */
 	void			(*old_data_ready)(struct sock *, int);
 	void			(*old_state_change)(struct sock *);
-	void			(*old_write_space)(struct sock *);
+	void			(*old_write_space)(struct sock *, unsigned int);
 	void			(*old_error_report)(struct sock *);
 };
 
@@ -1491,12 +1491,12 @@ static void xs_write_space(struct sock *sk)
  * progress, otherwise we'll waste resources thrashing kernel_sendmsg
  * with a bunch of small requests.
  */
-static void xs_udp_write_space(struct sock *sk)
+static void xs_udp_write_space(struct sock *sk, unsigned int bias)
 {
 	read_lock(&sk->sk_callback_lock);
 
 	/* from net/core/sock.c:sock_def_write_space */
-	if (sock_writeable(sk))
+	if (sock_writeable_bias(sk, bias))
 		xs_write_space(sk);
 
 	read_unlock(&sk->sk_callback_lock);
@@ -1512,7 +1512,7 @@ static void xs_udp_write_space(struct sock *sk)
  * progress, otherwise we'll waste resources thrashing kernel_sendmsg
  * with a bunch of small requests.
  */
-static void xs_tcp_write_space(struct sock *sk)
+static void xs_tcp_write_space(struct sock *sk, unsigned int bias)
 {
 	read_lock(&sk->sk_callback_lock);
 
@@ -1535,7 +1535,7 @@ static void xs_udp_do_set_buffer_size(struct rpc_xprt *xprt)
 	if (transport->sndsize) {
 		sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
 		sk->sk_sndbuf = transport->sndsize * xprt->max_reqs * 2;
-		sk->sk_write_space(sk);
+		sk->sk_write_space(sk, 0);
 	}
 }
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index fc3ebb9..9f90ead 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -306,15 +306,15 @@ found:
 	return s;
 }
 
-static inline int unix_writable(struct sock *sk)
+static inline int unix_writable(struct sock *sk, unsigned int bias)
 {
-	return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
+	return ((atomic_read(&sk->sk_wmem_alloc) - bias) << 2) <= sk->sk_sndbuf;
 }
 
-static void unix_write_space(struct sock *sk)
+static void unix_write_space(struct sock *sk, unsigned int bias)
 {
 	read_lock(&sk->sk_callback_lock);
-	if (unix_writable(sk)) {
+	if (unix_writable(sk, bias)) {
 		if (sk_has_sleeper(sk))
 			wake_up_interruptible_sync(sk->sk_sleep);
 		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
@@ -2010,7 +2010,7 @@ static unsigned int unix_poll(struct file *file, struct socket *sock, poll_table
 	 * we set writable also when the other side has shut down the
 	 * connection. This prevents stuck sockets.
 	 */
-	if (unix_writable(sk))
+	if (unix_writable(sk, 0))
 		mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
 
 	return mask;
@@ -2048,7 +2048,7 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 	}
 
 	/* writable? */
-	writable = unix_writable(sk);
+	writable = unix_writable(sk, 0);
 	if (writable) {
 		other = unix_peer_get(sk);
 		if (other) {


Thread overview: 13+ messages
     [not found] <alpine.DEB.2.00.0909072323350.3940@parag-desktop>
2009-09-08  4:51 ` BUG UNIX: Poison overwritten with 2.6.31-rc6-00223-g6c30c53 Jike Song
2009-09-08  7:38   ` Eric Dumazet
2009-09-08  8:09     ` Jike Song
2009-09-08 12:12       ` Eric Dumazet
2009-09-08 22:49         ` [PATCH] net: Fix sock_wfree() race Eric Dumazet
2009-09-09  7:14           ` Jike Song [this message]
2009-09-09  9:18             ` Eric Dumazet
2009-09-11 18:43           ` David Miller
2009-09-11 19:52             ` David Miller
2009-09-23 13:44               ` Eric Dumazet
2009-09-24 20:07                 ` Jarek Poplawski
2009-09-24 20:49                   ` Eric Dumazet
2009-09-30 23:23                     ` David Miller
