From: Jike Song <albcamus@gmail.com>
To: Eric Dumazet <eric.dumazet@gmail.com>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
netdev@vger.kernel.org, David Miller <davem@davemloft.net>,
Parag Warudkar <parag.lkml
Subject: Re: [PATCH] net: Fix sock_wfree() race
Date: Wed, 9 Sep 2009 15:14:39 +0800 [thread overview]
Message-ID: <df9815e70909090014w27828827j860ecee4f1c3a9e2@mail.gmail.com> (raw)
In-Reply-To: <4AA6DF7B.7060105@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 3580 bytes --]
On Wed, Sep 9, 2009 at 6:49 AM, Eric Dumazet<eric.dumazet@gmail.com> wrote:
> Eric Dumazet a écrit :
>> Jike Song a écrit :
>>> On Tue, Sep 8, 2009 at 3:38 PM, Eric Dumazet<eric.dumazet@gmail.com> wrote:
>>>> We decrement a refcnt while object already freed.
>>>>
>>>> (SLUB DEBUG poisons the zone with 0x6B pattern)
>>>>
>>>> You might add this patch to trigger a WARN_ON when refcnt >= 0x60000000U
>>>> in sk_free() : We'll see the path trying to delete an already freed sock
>>>>
>>>> diff --git a/net/core/sock.c b/net/core/sock.c
>>>> index 7633422..1cb85ff 100644
>>>> --- a/net/core/sock.c
>>>> +++ b/net/core/sock.c
>>>> @@ -1058,6 +1058,7 @@ static void __sk_free(struct sock *sk)
>>>>
>>>> void sk_free(struct sock *sk)
>>>> {
>>>> + WARN_ON(atomic_read(&sk->sk_wmem_alloc) >= 0x60000000U);
>>>> /*
>>>> * We substract one from sk_wmem_alloc and can know if
>>>> * some packets are still in some tx queue.
>>>>
>>>>
>>> The output of dmesg with this patch appllied is attached.
>>>
>>>
>>
>> Unfortunatly this WARN_ON was not triggered,
>> maybe freeing comes from sock_wfree()
>>
>> Could you try this patch instead ?
>>
>> Thanks
>>
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 7633422..30469dc 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -1058,6 +1058,7 @@ static void __sk_free(struct sock *sk)
>>
>> void sk_free(struct sock *sk)
>> {
>> + WARN_ON(atomic_read(&sk->sk_wmem_alloc) >= 0x60000000U);
>> /*
>> * We substract one from sk_wmem_alloc and can know if
>> * some packets are still in some tx queue.
>> @@ -1220,6 +1221,7 @@ void sock_wfree(struct sk_buff *skb)
>> struct sock *sk = skb->sk;
>> int res;
>>
>> + WARN_ON(atomic_read(&sk->sk_wmem_alloc) >= 0x60000000U);
>> /* In case it might be waiting for more memory. */
>> res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
>> if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
>>
>
>
> David, I believe problem could come from a race in sock_wfree()
>
> It used to have two atomic ops.
>
> One doing the atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
> then one sock_put() doing the atomic_dec_and_test(&sk->sk_refcnt)
>
> Now, if two cpus are both :
>
> CPU 1 calling sock_wfree()
> CPU 2 calling the 'final' sock_put(),
> CPU 1 doing sock_wfree() might call sk->sk_write_space(sk)
> while CPU 2 is already freeing the socket.
>
>
> Please note I did not test this patch, its very late here and I should get some sleep now...
>
> Thanks
>
> [PATCH] net: Fix sock_wfree() race
>
> Commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
> (net: No more expensive sock_hold()/sock_put() on each tx)
> opens a window in sock_wfree() where another cpu
> might free the socket we are working on.
>
> Fix is to call sk->sk_write_space(sk) only
> while still holding a reference on sk.
>
> Since doing this call is done before the
> atomic_sub(truesize, &sk->sk_wmem_alloc), we should pass truesize as
> a bias for possible sk_wmem_alloc evaluations.
>
> Reported-by: Jike Song <albcamus@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Eric, I'm unable to apply this patch neatly. I applied it by hand,
and did some change necessary. This patch for test is attached.
With this patch applied, when run vncviewer, the kerneloops service
still reports kernel failure. But I can't see any in dmesg output.
--
Thanks,
Jike
[-- Attachment #2: my.patch --]
[-- Type: application/octet-stream, Size: 12139 bytes --]
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 42b6c63..d1040fe 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -843,11 +843,11 @@ static struct rtnl_link_ops tun_link_ops __read_mostly = {
.validate = tun_validate,
};
-static void tun_sock_write_space(struct sock *sk)
+static void tun_sock_write_space(struct sock *sk, unsigned int bias)
{
struct tun_struct *tun;
- if (!sock_writeable(sk))
+ if (!sock_writeable_bias(sk, bias))
return;
if (!test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags))
diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index 04dba23..f80ebff 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -23,7 +23,7 @@ struct svc_sock {
/* We keep the old state_change and data_ready CB's here */
void (*sk_ostate)(struct sock *);
void (*sk_odata)(struct sock *, int bytes);
- void (*sk_owspace)(struct sock *);
+ void (*sk_owspace)(struct sock *, unsigned int bias);
/* private TCP part */
u32 sk_reclen; /* length of record */
diff --git a/include/net/sock.h b/include/net/sock.h
index 950409d..5fee407 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -296,7 +296,7 @@ struct sock {
/* XXX 4 bytes hole on 64 bit */
void (*sk_state_change)(struct sock *sk);
void (*sk_data_ready)(struct sock *sk, int bytes);
- void (*sk_write_space)(struct sock *sk);
+ void (*sk_write_space)(struct sock *sk, unsigned int bias);
void (*sk_error_report)(struct sock *sk);
int (*sk_backlog_rcv)(struct sock *sk,
struct sk_buff *skb);
@@ -554,7 +554,7 @@ static inline int sk_stream_wspace(struct sock *sk)
return sk->sk_sndbuf - sk->sk_wmem_queued;
}
-extern void sk_stream_write_space(struct sock *sk);
+extern void sk_stream_write_space(struct sock *sk, unsigned int bias);
static inline int sk_stream_memory_free(struct sock *sk)
{
@@ -1433,6 +1433,11 @@ static inline int sock_writeable(const struct sock *sk)
return atomic_read(&sk->sk_wmem_alloc) < (sk->sk_sndbuf >> 1);
}
+static inline int sock_writeable_bias(const struct sock *sk, unsigned int bias)
+{
+ return (atomic_read(&sk->sk_wmem_alloc) - bias) < (sk->sk_sndbuf >> 1);
+}
+
static inline gfp_t gfp_any(void)
{
return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
diff --git a/net/atm/raw.c b/net/atm/raw.c
index cbfcc71..ea14509 100644
--- a/net/atm/raw.c
+++ b/net/atm/raw.c
@@ -36,7 +36,7 @@ static void atm_pop_raw(struct atm_vcc *vcc,struct sk_buff *skb)
sk_wmem_alloc_get(sk), skb->truesize);
atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
dev_kfree_skb_any(skb);
- sk->sk_write_space(sk);
+ sk->sk_write_space(sk, 0);
}
diff --git a/net/core/sock.c b/net/core/sock.c
index 7633422..b840c10 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -510,7 +510,7 @@ set_sndbuf:
* Wake up sending tasks if we
* upped the value.
*/
- sk->sk_write_space(sk);
+ sk->sk_write_space(sk, 0);
break;
case SO_SNDBUFFORCE:
@@ -1220,10 +1220,10 @@ void sock_wfree(struct sk_buff *skb)
struct sock *sk = skb->sk;
int res;
- /* In case it might be waiting for more memory. */
- res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
- sk->sk_write_space(sk);
+ sk->sk_write_space(sk, skb->truesize);
+
+ res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
/*
* if sk_wmem_alloc reached 0, we are last user and should
* free this sock, as sk_free() call could not do it.
@@ -1766,20 +1766,20 @@ static void sock_def_readable(struct sock *sk, int len)
read_unlock(&sk->sk_callback_lock);
}
-static void sock_def_write_space(struct sock *sk)
+static void sock_def_write_space(struct sock *sk, unsigned int bias)
{
read_lock(&sk->sk_callback_lock);
/* Do not wake up a writer until he can make "significant"
* progress. --DaveM
*/
- if ((atomic_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf) {
+ if (((atomic_read(&sk->sk_wmem_alloc) - bias) << 1) <= sk->sk_sndbuf) {
if (sk_has_sleeper(sk))
wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT |
POLLWRNORM | POLLWRBAND);
/* Should agree with poll, otherwise some programs break */
- if (sock_writeable(sk))
+ if (sock_writeable_bias(sk, bias))
sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
}
diff --git a/net/core/stream.c b/net/core/stream.c
index a37debf..df720e9 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -25,7 +25,7 @@
*
* FIXME: write proper description
*/
-void sk_stream_write_space(struct sock *sk)
+void sk_stream_write_space(struct sock *sk, unsigned int bias)
{
struct socket *sock = sk->sk_socket;
diff --git a/net/dccp/ccids/ccid3.c b/net/dccp/ccids/ccid3.c
index a27b7f4..bb9cf19 100644
--- a/net/dccp/ccids/ccid3.c
+++ b/net/dccp/ccids/ccid3.c
@@ -480,7 +480,7 @@ done_computing_x:
* As we have calculated new ipi, delta, t_nom it is possible
* that we now can send a packet, so wake up dccp_wait_for_ccid
*/
- sk->sk_write_space(sk);
+ sk->sk_write_space(sk, 0);
/*
* Update timeout interval for the nofeedback timer.
diff --git a/net/dccp/dccp.h b/net/dccp/dccp.h
index d6bc473..f32274f 100644
--- a/net/dccp/dccp.h
+++ b/net/dccp/dccp.h
@@ -235,7 +235,7 @@ extern void dccp_send_sync(struct sock *sk, const u64 seq,
const enum dccp_pkt_type pkt_type);
extern void dccp_write_xmit(struct sock *sk, int block);
-extern void dccp_write_space(struct sock *sk);
+extern void dccp_write_space(struct sock *sk, unsigned int);
extern void dccp_init_xmit_timers(struct sock *sk);
static inline void dccp_clear_xmit_timers(struct sock *sk)
diff --git a/net/dccp/output.c b/net/dccp/output.c
index c96119f..cf0635e 100644
--- a/net/dccp/output.c
+++ b/net/dccp/output.c
@@ -192,14 +192,14 @@ unsigned int dccp_sync_mss(struct sock *sk, u32 pmtu)
EXPORT_SYMBOL_GPL(dccp_sync_mss);
-void dccp_write_space(struct sock *sk)
+void dccp_write_space(struct sock *sk, unsigned int bias)
{
read_lock(&sk->sk_callback_lock);
if (sk_has_sleeper(sk))
wake_up_interruptible(sk->sk_sleep);
/* Should agree with poll, otherwise some programs break */
- if (sock_writeable(sk))
+ if (sock_writeable_bias(sk, bias))
sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
read_unlock(&sk->sk_callback_lock);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2bdb0da..9c24d07 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4819,7 +4819,7 @@ static void tcp_new_space(struct sock *sk)
tp->snd_cwnd_stamp = tcp_time_stamp;
}
- sk->sk_write_space(sk);
+ sk->sk_write_space(sk, 0);
}
static void tcp_check_space(struct sock *sk)
diff --git a/net/phonet/pep-gprs.c b/net/phonet/pep-gprs.c
index 480839d..18ccc24 100644
--- a/net/phonet/pep-gprs.c
+++ b/net/phonet/pep-gprs.c
@@ -38,7 +38,7 @@ struct gprs_dev {
struct sock *sk;
void (*old_state_change)(struct sock *);
void (*old_data_ready)(struct sock *, int);
- void (*old_write_space)(struct sock *);
+ void (*old_write_space)(struct sock *, unsigned int);
struct net_device *dev;
};
@@ -157,7 +157,7 @@ static void gprs_data_ready(struct sock *sk, int len)
}
}
-static void gprs_write_space(struct sock *sk)
+static void gprs_write_space(struct sock *sk, unsigned int bias)
{
struct gprs_dev *gp = sk->sk_user_data;
diff --git a/net/phonet/pep.c b/net/phonet/pep.c
index eef833e..0d15822 100644
--- a/net/phonet/pep.c
+++ b/net/phonet/pep.c
@@ -268,7 +268,7 @@ static int pipe_rcv_status(struct sock *sk, struct sk_buff *skb)
return -EOPNOTSUPP;
}
if (wake)
- sk->sk_write_space(sk);
+ sk->sk_write_space(sk, 0);
return 0;
}
@@ -389,7 +389,7 @@ static int pipe_do_rcv(struct sock *sk, struct sk_buff *skb)
case PNS_PIPE_ENABLED_IND:
if (!pn_flow_safe(pn->tx_fc)) {
atomic_set(&pn->tx_credits, 1);
- sk->sk_write_space(sk);
+ sk->sk_write_space(sk, 0);
}
if (sk->sk_state == TCP_ESTABLISHED)
break; /* Nothing to do */
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 23128ee..8c1642c 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -380,7 +380,7 @@ static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
sock->sk->sk_sndbuf = snd * 2;
sock->sk->sk_rcvbuf = rcv * 2;
sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
- sock->sk->sk_write_space(sock->sk);
+ sock->sk->sk_write_space(sock->sk, 0);
release_sock(sock->sk);
#endif
}
@@ -405,7 +405,7 @@ static void svc_udp_data_ready(struct sock *sk, int count)
/*
* INET callback when space is newly available on the socket.
*/
-static void svc_write_space(struct sock *sk)
+static void svc_write_space(struct sock *sk, unsigned int bias)
{
struct svc_sock *svsk = (struct svc_sock *)(sk->sk_user_data);
@@ -422,13 +422,13 @@ static void svc_write_space(struct sock *sk)
}
}
-static void svc_tcp_write_space(struct sock *sk)
+static void svc_tcp_write_space(struct sock *sk, unsigned int bias)
{
struct socket *sock = sk->sk_socket;
if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) && sock)
clear_bit(SOCK_NOSPACE, &sock->flags);
- svc_write_space(sk);
+ svc_write_space(sk, bias);
}
/*
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 83c73c4..11e4d35 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -262,7 +262,7 @@ struct sock_xprt {
*/
void (*old_data_ready)(struct sock *, int);
void (*old_state_change)(struct sock *);
- void (*old_write_space)(struct sock *);
+ void (*old_write_space)(struct sock *, unsigned int);
void (*old_error_report)(struct sock *);
};
@@ -1491,12 +1491,12 @@ static void xs_write_space(struct sock *sk)
* progress, otherwise we'll waste resources thrashing kernel_sendmsg
* with a bunch of small requests.
*/
-static void xs_udp_write_space(struct sock *sk)
+static void xs_udp_write_space(struct sock *sk, unsigned int bias)
{
read_lock(&sk->sk_callback_lock);
/* from net/core/sock.c:sock_def_write_space */
- if (sock_writeable(sk))
+ if (sock_writeable_bias(sk, bias))
xs_write_space(sk);
read_unlock(&sk->sk_callback_lock);
@@ -1512,7 +1512,7 @@ static void xs_udp_write_space(struct sock *sk)
* progress, otherwise we'll waste resources thrashing kernel_sendmsg
* with a bunch of small requests.
*/
-static void xs_tcp_write_space(struct sock *sk)
+static void xs_tcp_write_space(struct sock *sk, unsigned int bias)
{
read_lock(&sk->sk_callback_lock);
@@ -1535,7 +1535,7 @@ static void xs_udp_do_set_buffer_size(struct rpc_xprt *xprt)
if (transport->sndsize) {
sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
sk->sk_sndbuf = transport->sndsize * xprt->max_reqs * 2;
- sk->sk_write_space(sk);
+ sk->sk_write_space(sk, 0);
}
}
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index fc3ebb9..9f90ead 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -306,15 +306,15 @@ found:
return s;
}
-static inline int unix_writable(struct sock *sk)
+static inline int unix_writable(struct sock *sk, unsigned int bias)
{
- return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
+ return ((atomic_read(&sk->sk_wmem_alloc) - bias) << 2) <= sk->sk_sndbuf;
}
-static void unix_write_space(struct sock *sk)
+static void unix_write_space(struct sock *sk, unsigned int bias)
{
read_lock(&sk->sk_callback_lock);
- if (unix_writable(sk)) {
+ if (unix_writable(sk, bias)) {
if (sk_has_sleeper(sk))
wake_up_interruptible_sync(sk->sk_sleep);
sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
@@ -2010,7 +2010,7 @@ static unsigned int unix_poll(struct file *file, struct socket *sock, poll_table
* we set writable also when the other side has shut down the
* connection. This prevents stuck sockets.
*/
- if (unix_writable(sk))
+ if (unix_writable(sk, 0))
mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
return mask;
@@ -2048,7 +2048,7 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
}
/* writable? */
- writable = unix_writable(sk);
+ writable = unix_writable(sk, 0);
if (writable) {
other = unix_peer_get(sk);
if (other) {
next prev parent reply other threads:[~2009-09-09 7:14 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <alpine.DEB.2.00.0909072323350.3940@parag-desktop>
2009-09-08 4:51 ` BUG UNIX: Poison overwritten with 2.6.31-rc6-00223-g6c30c53 Jike Song
2009-09-08 7:38 ` Eric Dumazet
2009-09-08 8:09 ` Jike Song
2009-09-08 12:12 ` Eric Dumazet
2009-09-08 22:49 ` [PATCH] net: Fix sock_wfree() race Eric Dumazet
2009-09-09 7:14 ` Jike Song [this message]
2009-09-09 9:18 ` Eric Dumazet
2009-09-11 18:43 ` David Miller
2009-09-11 19:52 ` David Miller
2009-09-23 13:44 ` Eric Dumazet
2009-09-24 20:07 ` Jarek Poplawski
2009-09-24 20:49 ` Eric Dumazet
2009-09-30 23:23 ` David Miller
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=df9815e70909090014w27828827j860ecee4f1c3a9e2@mail.gmail.com \
--to=albcamus@gmail.com \
--cc=davem@davemloft.net \
--cc=eric.dumazet@gmail.com \
--cc=linux-kernel@vger.kernel.org \
--cc=netdev@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).