netdev.vger.kernel.org archive mirror
* WARNING: at include/net/sock.h:417 udp_lib_unhash
@ 2009-07-01 18:08 Tantilov, Emil S
  2009-07-02  6:10 ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Tantilov, Emil S @ 2009-07-01 18:08 UTC (permalink / raw)
  To: NetDev; +Cc: Brandeburg, Jesse, Kirsher, Jeffrey T

I see the following trace during netperf stress testing with mixed UDP/TCP IPv4/IPv6 traffic. This is on recent pulls from net-2.6 and net-next.

[45197.989163] ------------[ cut here ]------------
[45197.994309] WARNING: at include/net/sock.h:417 udp_lib_unhash+0x81/0xab()
[45197.994311] Hardware name: X7DA8
[45197.994314] Modules linked in: e1000 [last unloaded: e1000]
[45197.994326] Pid: 7110, comm: netserver Tainted: G        W  2.6.31-rc1-net-next-e1000-06250902 #8
[45197.994331] Call Trace:
[45197.994336]  [<ffffffff8135e0dc>] ? udp_lib_unhash+0x81/0xab
[45197.994344]  [<ffffffff8103cac9>] warn_slowpath_common+0x77/0x8f
[45197.994349]  [<ffffffff8103caf0>] warn_slowpath_null+0xf/0x11
[45197.994352]  [<ffffffff8135e0dc>] udp_lib_unhash+0x81/0xab
[45197.994357]  [<ffffffff81301acb>] sk_common_release+0x2f/0xb4
[45197.994364]  [<ffffffff813a0256>] udp_lib_close+0x9/0xb
[45197.994369]  [<ffffffff81364259>] inet_release+0x58/0x5f
[45197.994374]  [<ffffffff8138c8bd>] inet6_release+0x30/0x35
[45197.994383]  [<ffffffff812ff273>] sock_release+0x1a/0x6c
[45197.994386]  [<ffffffff812ff763>] sock_close+0x22/0x26
[45197.994392]  [<ffffffff810c69a0>] __fput+0xf0/0x18c
[45197.994395]  [<ffffffff810c6d00>] fput+0x15/0x19
[45197.994399]  [<ffffffff810c3c3e>] filp_close+0x5c/0x67
[45197.994404]  [<ffffffff810c3cc4>] sys_close+0x7b/0xb6
[45197.994412]  [<ffffffff8100baeb>] system_call_fastpath+0x16/0x1b
[45197.994418] ---[ end trace 5acab6fc0afdaaa3 ]---

Emil


* Re: WARNING: at include/net/sock.h:417 udp_lib_unhash
  2009-07-01 18:08 WARNING: at include/net/sock.h:417 udp_lib_unhash Tantilov, Emil S
@ 2009-07-02  6:10 ` Eric Dumazet
  2009-07-07  0:54   ` Emil S Tantilov
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2009-07-02  6:10 UTC (permalink / raw)
  To: Tantilov, Emil S
  Cc: NetDev, Brandeburg, Jesse, Kirsher, Jeffrey T, Jiri Olsa,
	David S. Miller

Tantilov, Emil S wrote:
> I see the following trace during netperf stress testing with mixed UDP/TCP IPv4/IPv6 traffic. This is on recent pulls from net-2.6 and net-next.
> 
> [kernel trace snipped]
> 
> Emil

Thanks for this report, Emil.

I could not find a recent change in this area in the latest kernels.
If struct sock is hashed (sk_hashed() true), then sk_refcnt was incremented
in sk_nulls_add_node_rcu(), so its value should be >= 2.

Maybe we have a missing memory barrier somewhere, or a list corruption.

1) Could you try CONFIG_DEBUG_LIST=y ?

2) Could you give the model of your CPU? This reminds me of the ongoing discussion raised by Jiri Olsa:

CPU1 does an atomic_inc(&sk->sk_refcnt) : refcnt changes from 1 to 2,
then CPU2 does an atomic_read(&sk->sk_refcnt) and reads 1 instead of 2.

David, maybe this test is not safe; if we really want to do a check
we need to use a stronger atomic function.

If you can reproduce this problem easily, could you try the following patch?

Thank you

diff --git a/include/net/sock.h b/include/net/sock.h
index 352f06b..96ab278 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -393,8 +393,9 @@ static __inline__ int sk_del_node_init(struct sock *sk)
 
 	if (rc) {
 		/* paranoid for a while -acme */
-		WARN_ON(atomic_read(&sk->sk_refcnt) == 1);
-		__sock_put(sk);
+		int res = atomic_dec_return(&sk->sk_refcnt);
+
+		WARN_ON(res <= 0);
 	}
 	return rc;
 }
@@ -413,9 +414,9 @@ static __inline__ int sk_nulls_del_node_init_rcu(struct sock *sk)
 	int rc = __sk_nulls_del_node_init_rcu(sk);
 
 	if (rc) {
-		/* paranoid for a while -acme */
-		WARN_ON(atomic_read(&sk->sk_refcnt) == 1);
-		__sock_put(sk);
+		int res = atomic_dec_return(&sk->sk_refcnt);
+
+		WARN_ON(res <= 0);
 	}
 	return rc;
 }


* Re: WARNING: at include/net/sock.h:417 udp_lib_unhash
  2009-07-02  6:10 ` Eric Dumazet
@ 2009-07-07  0:54   ` Emil S Tantilov
  2009-07-07  7:21     ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Emil S Tantilov @ 2009-07-07  0:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tantilov, Emil S, NetDev, Brandeburg, Jesse, Kirsher, Jeffrey T,
	Jiri Olsa, David S. Miller

On Wed, Jul 1, 2009 at 11:10 PM, Eric Dumazet<eric.dumazet@gmail.com> wrote:
> Tantilov, Emil S wrote:
>> I see the following trace during netperf stress testing with mixed UDP/TCP IPv4/IPv6 traffic. This is on recent pulls from net-2.6 and net-next.
>>
>> [kernel trace snipped]
>>
>> Emil
>
> Thanks for this report, Emil.
>
> I could not find a recent change in this area in the latest kernels.
> If struct sock is hashed (sk_hashed() true), then sk_refcnt was incremented
> in sk_nulls_add_node_rcu(), so its value should be >= 2.
>
> Maybe we have a missing memory barrier somewhere, or a list corruption.
>
> 1) Could you try CONFIG_DEBUG_LIST=y ?
I am running a test with this option now. Sorry for the delayed
response, I was out last week.

> 2) Could you give the model of your CPU? This reminds me of the ongoing discussion raised by Jiri Olsa:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz
stepping	: 6
cpu MHz		: 2999.790
cache size	: 6144 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor
ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow
vnmi flexpriority
bogomips	: 5999.58
clflush size	: 64

Two quad-core Xeons; I only included the output from the first CPU to reduce clutter.

> CPU1 does an atomic_inc(&sk->sk_refcnt) : refcnt changes from 1 to 2,
> then CPU2 does an atomic_read(&sk->sk_refcnt) and reads 1 instead of 2.
>
> David, maybe this test is not safe; if we really want to do a check
> we need to use a stronger atomic function.
>
> If you can reproduce this problem easily, could you try the following patch?

It varies from a few minutes to hours, but it reproduces
consistently in my tests so far. I will try your patch once I manage
to get a trace with CONFIG_DEBUG_LIST.

Thanks,
Emil


* Re: WARNING: at include/net/sock.h:417 udp_lib_unhash
  2009-07-07  0:54   ` Emil S Tantilov
@ 2009-07-07  7:21     ` Eric Dumazet
  2009-07-07  7:40       ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2009-07-07  7:21 UTC (permalink / raw)
  To: Emil S Tantilov
  Cc: Tantilov, Emil S, NetDev, Brandeburg, Jesse, Kirsher, Jeffrey T,
	Jiri Olsa, David S. Miller

Emil S Tantilov wrote:
> On Wed, Jul 1, 2009 at 11:10 PM, Eric Dumazet<eric.dumazet@gmail.com> wrote:
>> Tantilov, Emil S wrote:
>>> I see the following trace during netperf stress testing with mixed UDP/TCP IPv4/IPv6 traffic. This is on recent pulls from net-2.6 and net-next.
>>>
>>> [kernel trace snipped]
>>>
>>> Emil
>> Thanks for this report, Emil.
>>
>> I could not find a recent change in this area in the latest kernels.
>> If struct sock is hashed (sk_hashed() true), then sk_refcnt was incremented
>> in sk_nulls_add_node_rcu(), so its value should be >= 2.
>>
>> Maybe we have a missing memory barrier somewhere, or a list corruption.
>>
>> 1) Could you try CONFIG_DEBUG_LIST=y ?
> I am running a test with this option now. Sorry for the delayed
> response, I was out last week.
> 
>> 2) Could you give the model of your CPU? This reminds me of the ongoing discussion raised by Jiri Olsa:
> 
> [cpuinfo snipped]
> 
> Two quad-core Xeons; I only included the output from the first CPU to reduce clutter.
> 
>> CPU1 does an atomic_inc(&sk->sk_refcnt) : refcnt changes from 1 to 2,
>> then CPU2 does an atomic_read(&sk->sk_refcnt) and reads 1 instead of 2.
>>
>> David, maybe this test is not safe; if we really want to do a check
>> we need to use a stronger atomic function.
>>
>> If you can reproduce this problem easily, could you try the following patch?
> 
> It varies from a few minutes to hours, but it reproduces
> consistently in my tests so far. I will try your patch once I manage
> to get a trace with CONFIG_DEBUG_LIST.
> 


Could you also try the following quick-and-dirty debug patch?


diff --git a/include/net/sock.h b/include/net/sock.h
index 352f06b..548f822 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -376,6 +376,7 @@ static __inline__ int __sk_del_node_init(struct sock *sk)
 
 static inline void sock_hold(struct sock *sk)
 {
+	WARN_ON(atomic_read(&sk->sk_refcnt) == 0);
 	atomic_inc(&sk->sk_refcnt);
 }
 
@@ -385,6 +386,7 @@ static inline void sock_hold(struct sock *sk)
 static inline void __sock_put(struct sock *sk)
 {
 	atomic_dec(&sk->sk_refcnt);
+	WARN_ON(atomic_read(&sk->sk_refcnt) == 0);
 }
 
 static __inline__ int sk_del_node_init(struct sock *sk)



* Re: WARNING: at include/net/sock.h:417 udp_lib_unhash
  2009-07-07  7:21     ` Eric Dumazet
@ 2009-07-07  7:40       ` Eric Dumazet
  2009-07-07 16:14         ` [PATCH] net: sk_alloc() should not blindly overwrite memory Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2009-07-07  7:40 UTC (permalink / raw)
  Cc: Emil S Tantilov, Tantilov, Emil S, NetDev, Brandeburg, Jesse,
	Kirsher, Jeffrey T, Jiri Olsa, David S. Miller

Eric Dumazet wrote:
> Emil S Tantilov wrote:
>> On Wed, Jul 1, 2009 at 11:10 PM, Eric Dumazet<eric.dumazet@gmail.com> wrote:
>>> Tantilov, Emil S wrote:
>>>> I see the following trace during netperf stress testing with mixed UDP/TCP IPv4/IPv6 traffic. This is on recent pulls from net-2.6 and net-next.
>>>>
>>>> [kernel trace snipped]
>>>>
>>>> Emil
>>> Thanks for this report, Emil.
>>>
>>> I could not find a recent change in this area in the latest kernels.
>>> If struct sock is hashed (sk_hashed() true), then sk_refcnt was incremented
>>> in sk_nulls_add_node_rcu(), so its value should be >= 2.
>>>
>>> Maybe we have a missing memory barrier somewhere, or a list corruption.
>>>
>>> 1) Could you try CONFIG_DEBUG_LIST=y ?
>> I am running a test with this option now. Sorry for the delayed
>> response, I was out last week.
>>
>>> 2) Could you give the model of your CPU? This reminds me of the ongoing discussion raised by Jiri Olsa:
>> [cpuinfo snipped]
>>
>> Two quad-core Xeons; I only included the output from the first CPU to reduce clutter.
>>
>>> CPU1 does an atomic_inc(&sk->sk_refcnt) : refcnt changes from 1 to 2,
>>> then CPU2 does an atomic_read(&sk->sk_refcnt) and reads 1 instead of 2.
>>>
>>> David, maybe this test is not safe; if we really want to do a check
>>> we need to use a stronger atomic function.
>>>
>>> If you can reproduce this problem easily, could you try the following patch?
>> It varies from a few minutes to hours, but it reproduces
>> consistently in my tests so far. I will try your patch once I manage
>> to get a trace with CONFIG_DEBUG_LIST.
>>
> 
> 
> Could you also try the following quick-and-dirty debug patch?
> 
> 
> [debug patch snipped]

Hold on, I found the problem and will submit a patch (after testing it) in the following hours.

Thanks


* [PATCH] net: sk_alloc() should not blindly overwrite memory
  2009-07-07  7:40       ` Eric Dumazet
@ 2009-07-07 16:14         ` Eric Dumazet
  2009-07-07 18:33           ` Tantilov, Emil S
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2009-07-07 16:14 UTC (permalink / raw)
  To: David S. Miller
  Cc: Emil S Tantilov, Tantilov, Emil S, NetDev, Brandeburg, Jesse,
	Kirsher, Jeffrey T, Jiri Olsa

Eric Dumazet wrote:
>
> 
> Hold on, I found the problem and will submit a patch (after testing it) in the following hours.
> 

David, while looking at Emil's case, I found the following problem.

I am not sure this bug is responsible for the BUG hit by Emil,
but it seems a candidate for stable as well...

Thanks

[PATCH] net: sk_alloc() should not blindly overwrite memory

Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code relies on some
fields not being blindly overwritten, even with zeros.

These fields are sk->sk_refcnt and sk->sk_nulls_node.next.

The current sk_alloc() implementation doesn't respect this hypothesis,
calling kmem_cache_alloc() with __GFP_ZERO and setting sk_refcnt to 1 instead
of atomically incrementing it.

Reported-by: Emil S Tantilov <emils.tantilov@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sock.h            |   24 +++++++++++-------
 net/core/sock.c               |   41 ++++++++++++++++++++++++++++----
 net/ipv4/inet_timewait_sock.c |    2 -
 3 files changed, 52 insertions(+), 15 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 352f06b..3f6284e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -103,15 +103,15 @@ struct net;
 
 /**
  *	struct sock_common - minimal network layer representation of sockets
+ *	@skc_node: main hash linkage for various protocol lookup tables
+ *	@skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
+ *	@skc_refcnt: reference count
+ *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_family: network address family
  *	@skc_state: Connection state
  *	@skc_reuse: %SO_REUSEADDR setting
  *	@skc_bound_dev_if: bound device index if != 0
- *	@skc_node: main hash linkage for various protocol lookup tables
- *	@skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
  *	@skc_bind_node: bind hash linkage for various protocol lookup tables
- *	@skc_refcnt: reference count
- *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_prot: protocol handlers inside a network family
  *	@skc_net: reference to the network namespace of this socket
  *
@@ -119,17 +119,23 @@ struct net;
  *	for struct sock and struct inet_timewait_sock.
  */
 struct sock_common {
-	unsigned short		skc_family;
-	volatile unsigned char	skc_state;
-	unsigned char		skc_reuse;
-	int			skc_bound_dev_if;
+	/*
+	 * WARNING : skc_node, skc_refcnt must be first elements of this struct
+	 * (sk_prot_alloc() should not clear skc_node.next & skc_refcnt because
+	 *  of RCU lookups)
+	 */
 	union {
 		struct hlist_node	skc_node;
 		struct hlist_nulls_node skc_nulls_node;
 	};
-	struct hlist_node	skc_bind_node;
 	atomic_t		skc_refcnt;
+
 	unsigned int		skc_hash;
+	unsigned short		skc_family;
+	volatile unsigned char	skc_state;
+	unsigned char		skc_reuse;
+	int			skc_bound_dev_if;
+	struct hlist_node	skc_bind_node;
 	struct proto		*skc_prot;
 #ifdef CONFIG_NET_NS
 	struct net	 	*skc_net;
diff --git a/net/core/sock.c b/net/core/sock.c
index b0ba569..daadcc5 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -994,8 +994,25 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 {
 	struct sock *sk;
 
-	sk = sk_prot_alloc(prot, priority | __GFP_ZERO, family);
+	sk = sk_prot_alloc(prot, priority, family);
 	if (sk) {
+		/*
+		 * caches using SLAB_DESTROY_BY_RCU should let sk_node
+		 * and sk_refcnt untouched.
+		 *
+		 * Following makes sense only if sk first fields are
+		 * in following order : sk_node, refcnt, sk_hash
+		 */
+		BUILD_BUG_ON(offsetof(struct sock, sk_node) != 0);
+		BUILD_BUG_ON(offsetof(struct sock, sk_refcnt) !=
+			     sizeof(sk->sk_node));
+		BUILD_BUG_ON(offsetof(struct sock, sk_hash) !=
+			     sizeof(sk->sk_node) + sizeof(sk->sk_refcnt));
+		if (prot->slab_flags & SLAB_DESTROY_BY_RCU)
+			memset(&sk->sk_hash, 0,
+			       prot->obj_size - offsetof(struct sock, sk_hash));
+		else
+			memset(sk, 0, prot->obj_size);
 		sk->sk_family = family;
 		/*
 		 * See comment in struct sock definition to understand
@@ -1125,7 +1142,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 
 		newsk->sk_err	   = 0;
 		newsk->sk_priority = 0;
-		atomic_set(&newsk->sk_refcnt, 2);
+		atomic_add(2, &newsk->sk_refcnt);
 
 		/*
 		 * Increment the counter in the same struct proto as the master
@@ -1840,7 +1857,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 
 	sk->sk_stamp = ktime_set(-1L, 0);
 
-	atomic_set(&sk->sk_refcnt, 1);
+	atomic_inc(&sk->sk_refcnt);
 	atomic_set(&sk->sk_wmem_alloc, 1);
 	atomic_set(&sk->sk_drops, 0);
 }
@@ -2140,12 +2157,25 @@ static inline void release_proto_idx(struct proto *prot)
 }
 #endif
 
+/*
+ * We need to initialize skc->skc_refcnt to 0, and skc->skc_node.pprev to NULL
+ * but only on newly allocated objects (check sk_prot_alloc())
+ */
+static void sk_init_once(void *arg)
+{
+	struct sock_common *skc = arg;
+
+	atomic_set(&skc->skc_refcnt, 0);
+	skc->skc_node.pprev = NULL;
+}
+
 int proto_register(struct proto *prot, int alloc_slab)
 {
 	if (alloc_slab) {
 		prot->slab = kmem_cache_create(prot->name, prot->obj_size, 0,
 					SLAB_HWCACHE_ALIGN | prot->slab_flags,
-					NULL);
+					prot->slab_flags & SLAB_DESTROY_BY_RCU ?
+					    sk_init_once : NULL);
 
 		if (prot->slab == NULL) {
 			printk(KERN_CRIT "%s: Can't create sock SLAB cache!\n",
@@ -2187,7 +2217,8 @@ int proto_register(struct proto *prot, int alloc_slab)
 						  0,
 						  SLAB_HWCACHE_ALIGN |
 							prot->slab_flags,
-						  NULL);
+						  prot->slab_flags & SLAB_DESTROY_BY_RCU ?
+							sk_init_once : NULL);
 			if (prot->twsk_prot->twsk_slab == NULL)
 				goto out_free_timewait_sock_slab_name;
 		}
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 61283f9..a2997df 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -139,7 +139,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk, const int stat
 		tw->tw_transparent  = inet->transparent;
 		tw->tw_prot	    = sk->sk_prot_creator;
 		twsk_net_set(tw, hold_net(sock_net(sk)));
-		atomic_set(&tw->tw_refcnt, 1);
+		atomic_inc(&tw->tw_refcnt);
 		inet_twsk_dead_node_init(tw);
 		__module_get(tw->tw_prot->owner);
 	}


* RE: [PATCH] net: sk_alloc() should not blindly overwrite memory
  2009-07-07 16:14         ` [PATCH] net: sk_alloc() should not blindly overwrite memory Eric Dumazet
@ 2009-07-07 18:33           ` Tantilov, Emil S
  2009-07-07 22:33             ` [PATCH] net: sk_prot_alloc() " Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Tantilov, Emil S @ 2009-07-07 18:33 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller
  Cc: Emil S Tantilov, NetDev, Brandeburg, Jesse, Kirsher, Jeffrey T,
	Jiri Olsa

Eric Dumazet wrote:
> Eric Dumazet wrote:
>> 
>> 
>> Hold on, I found the problem and will submit a patch (after testing
>> it) in the following hours
>> 
> 
> David, while looking at Emil's case, I found the following problem.
> 
> I am not sure this bug is responsible for the BUG hit by Emil,
> but it seems a candidate for stable as well...
> 
> Thanks
> 
> [PATCH] net: sk_alloc() should not blindly overwrite memory
> 
> Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code rely that some
> fields should not be blindly overwritten, even with null.
> 
> These fields are sk->sk_refcnt and sk->sk_nulls_node.next
> 
> Current sk_alloc() implementation doesnt respect this hypothesis,
> calling kmem_alloc with __GFP_ZERO and setting sk_refcnt to 1 instead
> of atomically increment it.
> 
> Reported-by: Emil S Tantilov <emils.tantilov@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  include/net/sock.h            |   24 +++++++++++-------
>  net/core/sock.c               |   41 ++++++++++++++++++++++++++++----
>  net/ipv4/inet_timewait_sock.c |    2 -
>  3 files changed, 52 insertions(+), 15 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 352f06b..3f6284e 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -103,15 +103,15 @@ struct net;
> 
>  /**
>   *	struct sock_common - minimal network layer representation of
> sockets + *	@skc_node: main hash linkage for various protocol lookup
> tables + *	@skc_nulls_node: main hash linkage for UDP/UDP-Lite
> protocol + *	@skc_refcnt: reference count
> + *	@skc_hash: hash value used with various protocol lookup tables
>   *	@skc_family: network address family
>   *	@skc_state: Connection state
>   *	@skc_reuse: %SO_REUSEADDR setting
>   *	@skc_bound_dev_if: bound device index if != 0
> - *	@skc_node: main hash linkage for various protocol lookup tables
> - *	@skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
>   *	@skc_bind_node: bind hash linkage for various protocol lookup
> tables 
> - *	@skc_refcnt: reference count
> - *	@skc_hash: hash value used with various protocol lookup tables
>   *	@skc_prot: protocol handlers inside a network family
>   *	@skc_net: reference to the network namespace of this socket
>   *
> @@ -119,17 +119,23 @@ struct net;
>   *	for struct sock and struct inet_timewait_sock.
>   */
>  struct sock_common {
> -	unsigned short		skc_family;
> -	volatile unsigned char	skc_state;
> -	unsigned char		skc_reuse;
> -	int			skc_bound_dev_if;
> +	/*
> +	 * WARNING : skc_node, skc_refcnt must be first elements of this
> struct +	 * (sk_prot_alloc() should not clear skc_node.next &
> skc_refcnt because +	 *  of RCU lookups)
> +	 */
>  	union {
>  		struct hlist_node	skc_node;
>  		struct hlist_nulls_node skc_nulls_node;
>  	};
> -	struct hlist_node	skc_bind_node;
>  	atomic_t		skc_refcnt;
> +
>  	unsigned int		skc_hash;
> +	unsigned short		skc_family;
> +	volatile unsigned char	skc_state;
> +	unsigned char		skc_reuse;
> +	int			skc_bound_dev_if;
> +	struct hlist_node	skc_bind_node;
>  	struct proto		*skc_prot;
>  #ifdef CONFIG_NET_NS
>  	struct net	 	*skc_net;
> diff --git a/net/core/sock.c b/net/core/sock.c
> index b0ba569..daadcc5 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -994,8 +994,25 @@ struct sock *sk_alloc(struct net *net, int
>  family, gfp_t priority, {
>  	struct sock *sk;
> 
> -	sk = sk_prot_alloc(prot, priority | __GFP_ZERO, family);
> +	sk = sk_prot_alloc(prot, priority, family);
>  	if (sk) {
> +		/*
> +		 * caches using SLAB_DESTROY_BY_RCU should let sk_node
> +		 * and sk_refcnt untouched.
> +		 *
> +		 * Following makes sense only if sk first fields are
> +		 * in following order : sk_node, refcnt, sk_hash
> +		 */
> +		BUILD_BUG_ON(offsetof(struct sock, sk_node) != 0);
> +		BUILD_BUG_ON(offsetof(struct sock, sk_refcnt) !=
> +			     sizeof(sk->sk_node));
> +		BUILD_BUG_ON(offsetof(struct sock, sk_hash) !=
> +			     sizeof(sk->sk_node) + sizeof(sk->sk_refcnt));
> +		if (prot->slab_flags & SLAB_DESTROY_BY_RCU)
> +			memset(&sk->sk_hash, 0,
> +			       prot->obj_size - offsetof(struct sock, sk_hash));
> +		else
> +			memset(sk, 0, prot->obj_size);
>  		sk->sk_family = family;
>  		/*
>  		 * See comment in struct sock definition to understand
> @@ -1125,7 +1142,7 @@ struct sock *sk_clone(const struct sock *sk,
> const gfp_t priority) 
> 
>  		newsk->sk_err	   = 0;
>  		newsk->sk_priority = 0;
> -		atomic_set(&newsk->sk_refcnt, 2);
> +		atomic_add(2, &newsk->sk_refcnt);
> 
>  		/*
>  		 * Increment the counter in the same struct proto as the master
> @@ -1840,7 +1857,7 @@ void sock_init_data(struct socket *sock, struct
> sock *sk) 
> 
>  	sk->sk_stamp = ktime_set(-1L, 0);
> 
> -	atomic_set(&sk->sk_refcnt, 1);
> +	atomic_inc(&sk->sk_refcnt);
>  	atomic_set(&sk->sk_wmem_alloc, 1);
>  	atomic_set(&sk->sk_drops, 0);
>  }
> @@ -2140,12 +2157,25 @@ static inline void release_proto_idx(struct
>  proto *prot) }
>  #endif
> 
> +/*
> + * We need to initialize skc->skc_refcnt to 0, and
> skc->skc_node.pprev to NULL + * but only on newly allocated objects
> (check sk_prot_alloc()) + */
> +static void sk_init_once(void *arg)
> +{
> +	struct sock_common *skc = arg;
> +
> +	atomic_set(&skc->skc_refcnt, 0);
> +	skc->skc_node.pprev = NULL;
> +}
> +
>  int proto_register(struct proto *prot, int alloc_slab)
>  {
>  	if (alloc_slab) {
>  		prot->slab = kmem_cache_create(prot->name, prot->obj_size, 0,
>  					SLAB_HWCACHE_ALIGN | prot->slab_flags,
> -					NULL);
> +					prot->slab_flags & SLAB_DESTROY_BY_RCU ?
> +					    sk_init_once : NULL);
> 
>  		if (prot->slab == NULL) {
>  			printk(KERN_CRIT "%s: Can't create sock SLAB cache!\n",
> @@ -2187,7 +2217,8 @@ int proto_register(struct proto *prot, int alloc_slab)
>  						  0,
>  						  SLAB_HWCACHE_ALIGN |
>  							prot->slab_flags,
> -						  NULL);
> +						  prot->slab_flags & SLAB_DESTROY_BY_RCU ?
> +							sk_init_once : NULL);
>  			if (prot->twsk_prot->twsk_slab == NULL)
>  				goto out_free_timewait_sock_slab_name;
>  		}
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index 61283f9..a2997df 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -139,7 +139,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk, const int stat
>  		tw->tw_transparent  = inet->transparent;
>  		tw->tw_prot	    = sk->sk_prot_creator;
>  		twsk_net_set(tw, hold_net(sock_net(sk)));
> -		atomic_set(&tw->tw_refcnt, 1);
> +		atomic_inc(&tw->tw_refcnt);
>  		inet_twsk_dead_node_init(tw);
>  		__module_get(tw->tw_prot->owner);
>  	}

Eric,

With this patch applied, I get panic on boot:

[    5.334653] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[    5.344225] IP: [<ffffffff811ed364>] selinux_socket_post_create+0x78/0x95
[    5.352263] PGD 0 
[    5.354952] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
[    5.360599] last sysfs file: 
[    5.364292] CPU 0 
[    5.366985] Modules linked in:
[    5.370836] Pid: 1, comm: swapper Not tainted 2.6.31-rc1-net-2.6-igb-ed-patch-07071123 #2 S5520HC
[    5.381465] RIP: 0010:[<ffffffff811ed364>]  [<ffffffff811ed364>] selinux_socket_post_create+0x78/0x95
[    5.392551] RSP: 0018:ffff8801ef0a5d60  EFLAGS: 00010286
[    5.398871] RAX: 0000000000000001 RBX: ffff88036e5ace80 RCX: 0000000000000006
[    5.407227] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000002
[    5.415587] RBP: ffff8801ef0a5d80 R08: 0000000000000001 R09: 0000000000000000
[    5.423945] R10: ffff88036e000000 R11: ffff8801ef0a5c20 R12: ffff88036ec09c80
[    5.432300] R13: 0000000000000002 R14: 0000000000000002 R15: 0000000000000003
[    5.440662] FS:  0000000000000000(0000) GS:ffffc90000000000(0000) knlGS:0000000000000000
[    5.450386] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[    5.457189] CR2: 0000000000000010 CR3: 0000000001001000 CR4: 00000000000006b0
[    5.465545] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    5.473908] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    5.482267] Process swapper (pid: 1, threadinfo ffff8801ef0a4000, task ffff88036f040000)
[    5.491994] Stack:
[    5.494606]  01ffffff816d0040 00000000ffffff9f ffffffff817244a0 ffff88036ec09c80
[    5.503262] <0> ffff8801ef0a5d90 ffffffff811e5ef0 ffff8801ef0a5df0 ffffffff8138c041
[    5.512853] <0> ffffffff8138bf93 00000001ef0a5de0 ffff8801ef0a5e18 0000000681478fb3
[    5.523046] Call Trace:
[    5.526158]  [<ffffffff811e5ef0>] security_socket_post_create+0x11/0x13
[    5.533932]  [<ffffffff8138c041>] __sock_create+0x1a4/0x1ff
[    5.540541]  [<ffffffff8138bf93>] ? __sock_create+0xf6/0x1ff
[    5.547243]  [<ffffffff8184028c>] ? inet_init+0x0/0x204
[    5.553463]  [<ffffffff8138c0bb>] sock_create_kern+0x1f/0x21
[    5.560162]  [<ffffffff813edd5a>] inet_ctl_sock_create+0x29/0x5d
[    5.567255]  [<ffffffff8184028c>] ? inet_init+0x0/0x204
[    5.573473]  [<ffffffff8183fb60>] tcp_sk_init+0x25/0x27
[    5.579687]  [<ffffffff81396716>] register_pernet_operations+0x18/0x1a
[    5.587365]  [<ffffffff81396824>] register_pernet_subsys+0x29/0x3d
[    5.594657]  [<ffffffff8184028c>] ? inet_init+0x0/0x204
[    5.600882]  [<ffffffff8183fb27>] tcp_v4_init+0x1c/0x30
[    5.607106]  [<ffffffff818403cd>] inet_init+0x141/0x204
[    5.613333]  [<ffffffff8100905c>] do_one_initcall+0x56/0x130
[    5.620038]  [<ffffffff810a2a72>] ? register_irq_proc+0xae/0xca
[    5.627041]  [<ffffffff81130000>] ? d_add+0x16/0x1d
[    5.632875]  [<ffffffff81811915>] kernel_init+0x15e/0x1b4
[    5.639294]  [<ffffffff81028b9a>] child_rip+0xa/0x20
[    5.645214]  [<ffffffff8102857c>] ? restore_args+0x0/0x30
[    5.651622]  [<ffffffff818117b7>] ? kernel_init+0x0/0x1b4
[    5.658033]  [<ffffffff81028b90>] ? child_rip+0x0/0x20
[    5.664158] Code: ca 44 89 ef e8 d8 b6 ff ff c6 43 22 01 66 89 43 20 31 c0 49 8b 54 24 60 48 85 d2 74 22 8b 43 1c 48 8b 92 a8 03 00 00 41 0f b7 f5 <89> 42 10 8b 43 20 66 89 42 18 49 8b 7c 24 60 e8 de 4a 00 00 41 
[    5.690779] RIP  [<ffffffff811ed364>] selinux_socket_post_create+0x78/0x95
[    5.698921]  RSP <ffff8801ef0a5d60>
[    5.703202] CR2: 0000000000000010
[    5.707298] ---[ end trace a7919e7f17c0a725 ]---
[    5.712848] swapper used greatest stack depth: 4680 bytes left
[    5.727678] Kernel panic - not syncing: Attempted to kill init!
[    5.734672] Pid: 1, comm: swapper Tainted: G      D    2.6.31-rc1-net-2.6-igb-ed-patch-07071123 #2
[    5.745375] Call Trace:
[    5.748485]  [<ffffffff81057cfd>] panic+0xdb/0x190
[    5.754217]  [<ffffffff8105ae3b>] ? do_exit+0x35f/0x6d8
[    5.760439]  [<ffffffff8105adeb>] ? do_exit+0x30f/0x6d8
[    5.766659]  [<ffffffff8105adeb>] ? do_exit+0x30f/0x6d8
[    5.772875]  [<ffffffff8107bdbb>] ? trace_hardirqs_on+0xd/0xf
[    5.779688]  [<ffffffff8147a295>] ? _write_unlock_irq+0x2b/0x31
[    5.786690]  [<ffffffff8105ab55>] do_exit+0x79/0x6d8
[    5.792622]  [<ffffffff8147b07d>] oops_end+0xb2/0xba
[    5.798555]  [<ffffffff810407b4>] no_context+0x1ef/0x1fe
[    5.804875]  [<ffffffff8107ba3b>] ? mark_lock+0x22/0x1fb
[    5.811196]  [<ffffffff8107af82>] ? save_trace+0x3f/0x96
[    5.817516]  [<ffffffff81040a03>] __bad_area_nosemaphore+0x186/0x1a9
[    5.825009]  [<ffffffff813e5bbf>] ? raw_hash_sk+0x2a/0x73
[    5.831421]  [<ffffffff8147c354>] ? do_page_fault+0xbd/0x22e
[    5.838128]  [<ffffffff81040a9c>] bad_area_nosemaphore+0xe/0x10
[    5.845132]  [<ffffffff8147c3b6>] do_page_fault+0x11f/0x22e
[    5.851747]  [<ffffffff8147a4ef>] page_fault+0x1f/0x30
[    5.857871]  [<ffffffff811ed364>] ? selinux_socket_post_create+0x78/0x95
[    5.865754]  [<ffffffff811e5ef0>] security_socket_post_create+0x11/0x13
[    5.873536]  [<ffffffff8138c041>] __sock_create+0x1a4/0x1ff
[    5.880149]  [<ffffffff8138bf93>] ? __sock_create+0xf6/0x1ff
[    5.886859]  [<ffffffff8184028c>] ? inet_init+0x0/0x204
[    5.893072]  [<ffffffff8138c0bb>] sock_create_kern+0x1f/0x21
[    5.899774]  [<ffffffff813edd5a>] inet_ctl_sock_create+0x29/0x5d
[    5.906863]  [<ffffffff8184028c>] ? inet_init+0x0/0x204
[    5.913080]  [<ffffffff8183fb60>] tcp_sk_init+0x25/0x27
[    5.919299]  [<ffffffff81396716>] register_pernet_operations+0x18/0x1a
[    5.926986]  [<ffffffff81396824>] register_pernet_subsys+0x29/0x3d
[    5.934274]  [<ffffffff8184028c>] ? inet_init+0x0/0x204
[    5.940504]  [<ffffffff8183fb27>] tcp_v4_init+0x1c/0x30
[    5.946730]  [<ffffffff818403cd>] inet_init+0x141/0x204
[    5.952957]  [<ffffffff8100905c>] do_one_initcall+0x56/0x130
[    5.959668]  [<ffffffff810a2a72>] ? register_irq_proc+0xae/0xca
[    5.966666]  [<ffffffff81130000>] ? d_add+0x16/0x1d
[    5.972495]  [<ffffffff81811915>] kernel_init+0x15e/0x1b4
[    5.978909]  [<ffffffff81028b9a>] child_rip+0xa/0x20
[    5.984845]  [<ffffffff8102857c>] ? restore_args+0x0/0x30
[    5.991266]  [<ffffffff818117b7>] ? kernel_init+0x0/0x1b4
[    5.997686]  [<ffffffff81028b90>] ? child_rip+0x0/0x20

Emil

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
  2009-07-07 18:33           ` Tantilov, Emil S
@ 2009-07-07 22:33             ` Eric Dumazet
  2009-07-08  2:14               ` David Miller
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2009-07-07 22:33 UTC (permalink / raw)
  To: Tantilov, Emil S
  Cc: David S. Miller, Emil S Tantilov, NetDev, Brandeburg, Jesse,
	Kirsher, Jeffrey T, Jiri Olsa

Tantilov, Emil S wrote:
> Eric Dumazet wrote:
>> Eric Dumazet wrote:
>>>
>>> Hold on, I found the problem and will submit a patch (after testing
>>> it) in the following hours.
>>>
>> David, while looking at Emil case, I found following problem.
>>
>> I am not sure this bug is responsible for the BUG hit by Emil,
>> but it seems a candidate for stable as well...
>>
>> Thanks
>>
>> [PATCH] net: sk_alloc() should not blindly overwrite memory
>>
>> Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code relies on certain
>> fields not being blindly overwritten, even with zero.
>>
>> These fields are sk->sk_refcnt and sk->sk_nulls_node.next.
>>
>> The current sk_alloc() implementation doesn't respect this assumption,
>> calling kmem_cache_alloc() with __GFP_ZERO and setting sk_refcnt to 1
>> instead of atomically incrementing it.
>>
>> Reported-by: Emil S Tantilov <emils.tantilov@gmail.com>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Eric,
> 
> With this patch applied, I get panic on boot:

Oops, this stuff has to be done inside sk_prot_alloc(), or it breaks selinux.

Thanks for testing, Emil. Here is a second try!


[PATCH] net: sk_prot_alloc() should not blindly overwrite memory

Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code relies on certain
fields not being blindly overwritten, even with zero.

These fields are sk->sk_refcnt and sk->sk_nulls_node.next.

The current sk_prot_alloc() implementation doesn't respect this assumption,
calling kmem_cache_alloc() with __GFP_ZERO and setting sk_refcnt to 1
instead of atomically incrementing it.

Reported-by: Emil S Tantilov <emils.tantilov@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sock.h            |   24 +++++++++------
 net/core/sock.c               |   49 ++++++++++++++++++++++++++++----
 net/ipv4/inet_timewait_sock.c |    2 -
 3 files changed, 59 insertions(+), 16 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 352f06b..3f6284e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -103,15 +103,15 @@ struct net;
 
 /**
  *	struct sock_common - minimal network layer representation of sockets
+ *	@skc_node: main hash linkage for various protocol lookup tables
+ *	@skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
+ *	@skc_refcnt: reference count
+ *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_family: network address family
  *	@skc_state: Connection state
  *	@skc_reuse: %SO_REUSEADDR setting
  *	@skc_bound_dev_if: bound device index if != 0
- *	@skc_node: main hash linkage for various protocol lookup tables
- *	@skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
  *	@skc_bind_node: bind hash linkage for various protocol lookup tables
- *	@skc_refcnt: reference count
- *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_prot: protocol handlers inside a network family
  *	@skc_net: reference to the network namespace of this socket
  *
@@ -119,17 +119,23 @@ struct net;
  *	for struct sock and struct inet_timewait_sock.
  */
 struct sock_common {
-	unsigned short		skc_family;
-	volatile unsigned char	skc_state;
-	unsigned char		skc_reuse;
-	int			skc_bound_dev_if;
+	/*
+	 * WARNING : skc_node, skc_refcnt must be first elements of this struct
+	 * (sk_prot_alloc() should not clear skc_node.next & skc_refcnt because
+	 *  of RCU lookups)
+	 */
 	union {
 		struct hlist_node	skc_node;
 		struct hlist_nulls_node skc_nulls_node;
 	};
-	struct hlist_node	skc_bind_node;
 	atomic_t		skc_refcnt;
+
 	unsigned int		skc_hash;
+	unsigned short		skc_family;
+	volatile unsigned char	skc_state;
+	unsigned char		skc_reuse;
+	int			skc_bound_dev_if;
+	struct hlist_node	skc_bind_node;
 	struct proto		*skc_prot;
 #ifdef CONFIG_NET_NS
 	struct net	 	*skc_net;
diff --git a/net/core/sock.c b/net/core/sock.c
index b0ba569..cf43ff1 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -939,8 +939,31 @@ static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
 	struct kmem_cache *slab;
 
 	slab = prot->slab;
-	if (slab != NULL)
-		sk = kmem_cache_alloc(slab, priority);
+	if (slab != NULL) {
+		/*
+		 * caches using SLAB_DESTROY_BY_RCU should let sk_node
+		 * and sk_refcnt untouched.
+		 *
+		 * Following makes sense only if sk first fields are
+		 * in following order : sk_node, refcnt, sk_hash
+		 */
+		BUILD_BUG_ON(offsetof(struct sock, sk_node) != 0);
+		BUILD_BUG_ON(offsetof(struct sock, sk_refcnt) !=
+			     sizeof(sk->sk_node));
+		BUILD_BUG_ON(offsetof(struct sock, sk_hash) !=
+			     sizeof(sk->sk_node) + sizeof(sk->sk_refcnt));
+		sk = kmem_cache_alloc(slab, priority & ~__GFP_ZERO);
+		if (!sk)
+			return sk;
+		if (priority & __GFP_ZERO) {
+			if (prot->slab_flags & SLAB_DESTROY_BY_RCU)
+				memset(&sk->sk_hash, 0,
+				       prot->obj_size -
+				       offsetof(struct sock, sk_hash));
+			else
+				memset(sk, 0, prot->obj_size);
+		}
+	}
 	else
 		sk = kmalloc(prot->obj_size, priority);
 
@@ -1125,7 +1148,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 
 		newsk->sk_err	   = 0;
 		newsk->sk_priority = 0;
-		atomic_set(&newsk->sk_refcnt, 2);
+		atomic_add(2, &newsk->sk_refcnt);
 
 		/*
 		 * Increment the counter in the same struct proto as the master
@@ -1840,7 +1863,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 
 	sk->sk_stamp = ktime_set(-1L, 0);
 
-	atomic_set(&sk->sk_refcnt, 1);
+	atomic_inc(&sk->sk_refcnt);
 	atomic_set(&sk->sk_wmem_alloc, 1);
 	atomic_set(&sk->sk_drops, 0);
 }
@@ -2140,12 +2163,25 @@ static inline void release_proto_idx(struct proto *prot)
 }
 #endif
 
+/*
+ * We need to initialize skc->skc_refcnt to 0, and skc->skc_node.pprev to NULL
+ * but only on newly allocated objects (check sk_prot_alloc())
+ */
+static void sk_init_once(void *arg)
+{
+	struct sock_common *skc = arg;
+
+	atomic_set(&skc->skc_refcnt, 0);
+	skc->skc_node.pprev = NULL;
+}
+
 int proto_register(struct proto *prot, int alloc_slab)
 {
 	if (alloc_slab) {
 		prot->slab = kmem_cache_create(prot->name, prot->obj_size, 0,
 					SLAB_HWCACHE_ALIGN | prot->slab_flags,
-					NULL);
+					prot->slab_flags & SLAB_DESTROY_BY_RCU ?
+					    sk_init_once : NULL);
 
 		if (prot->slab == NULL) {
 			printk(KERN_CRIT "%s: Can't create sock SLAB cache!\n",
@@ -2187,7 +2223,8 @@ int proto_register(struct proto *prot, int alloc_slab)
 						  0,
 						  SLAB_HWCACHE_ALIGN |
 							prot->slab_flags,
-						  NULL);
+						  prot->slab_flags & SLAB_DESTROY_BY_RCU ?
+							sk_init_once : NULL);
 			if (prot->twsk_prot->twsk_slab == NULL)
 				goto out_free_timewait_sock_slab_name;
 		}
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 61283f9..a2997df 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -139,7 +139,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk, const int stat
 		tw->tw_transparent  = inet->transparent;
 		tw->tw_prot	    = sk->sk_prot_creator;
 		twsk_net_set(tw, hold_net(sock_net(sk)));
-		atomic_set(&tw->tw_refcnt, 1);
+		atomic_inc(&tw->tw_refcnt);
 		inet_twsk_dead_node_init(tw);
 		__module_get(tw->tw_prot->owner);
 	}

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
  2009-07-07 22:33             ` [PATCH] net: sk_prot_alloc() " Eric Dumazet
@ 2009-07-08  2:14               ` David Miller
  2009-07-08  6:50                 ` Eric Dumazet
  2009-07-08 17:02                 ` [PATCH] net: sk_prot_alloc() should not blindly overwrite memory Tantilov, Emil S
  0 siblings, 2 replies; 29+ messages in thread
From: David Miller @ 2009-07-08  2:14 UTC (permalink / raw)
  To: eric.dumazet
  Cc: emil.s.tantilov, emils.tantilov, netdev, jesse.brandeburg,
	jeffrey.t.kirsher, jolsa

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 08 Jul 2009 00:33:29 +0200

> [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
> 
> Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code relies on certain
> fields not being blindly overwritten, even with zero.
> 
> These fields are sk->sk_refcnt and sk->sk_nulls_node.next.
> 
> The current sk_prot_alloc() implementation doesn't respect this assumption,
> calling kmem_cache_alloc() with __GFP_ZERO and setting sk_refcnt to 1
> instead of atomically incrementing it.
> 
> Reported-by: Emil S Tantilov <emils.tantilov@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

I've applied this but will wait for some more testing before
I push it out for real to kernel.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
  2009-07-08  2:14               ` David Miller
@ 2009-07-08  6:50                 ` Eric Dumazet
  2009-07-09  5:36                   ` Eric Dumazet
  2009-07-08 17:02                 ` [PATCH] net: sk_prot_alloc() should not blindly overwrite memory Tantilov, Emil S
  1 sibling, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2009-07-08  6:50 UTC (permalink / raw)
  To: David Miller
  Cc: emil.s.tantilov, emils.tantilov, netdev, jesse.brandeburg,
	jeffrey.t.kirsher, jolsa, Patrick McHardy, Paul E. McKenney

David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 08 Jul 2009 00:33:29 +0200
> 
>> [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
>>
>> Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code relies on certain
>> fields not being blindly overwritten, even with zero.
>>
>> These fields are sk->sk_refcnt and sk->sk_nulls_node.next.
>>
>> The current sk_prot_alloc() implementation doesn't respect this assumption,
>> calling kmem_cache_alloc() with __GFP_ZERO and setting sk_refcnt to 1
>> instead of atomically incrementing it.
>>
>> Reported-by: Emil S Tantilov <emils.tantilov@gmail.com>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> I've applied this but will wait for some more testing before
> I push it out for real to kernel.org

Thanks David

I forgot to CC Paul and Patrick, so I'll ask them to look at this patch.

Patrick, a similar fix is needed in conntrack as well. We currently
use "ct = kmem_cache_zalloc(nf_conntrack_cachep, gfp);" and thus
overwrite the "struct hlist_nulls_node hnnode;" contained in
"struct nf_conntrack_tuple_hash", while lockless readers may still
need it. Setting hnnode.next to NULL is dangerous since its low bit
is not set (so it is not a nulls value): a reader could try to
dereference this NULL pointer and trap.


Here is the patch again so that Paul & Patrick can comment on it.

I am not sure about the refcnt thing (blindly setting it to 0 again
should be OK in fact, since no reader should/can do an
atomic_inc_not_zero() on it), but the nulls.next thing is problematic.

[PATCH] net: sk_prot_alloc() should not blindly overwrite memory

Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code relies on certain
fields not being blindly overwritten, even with zero.

These fields are sk->sk_refcnt and sk->sk_nulls_node.next.

The current sk_prot_alloc() implementation doesn't respect this assumption,
calling kmem_cache_alloc() with __GFP_ZERO and setting sk_refcnt to 1
instead of atomically incrementing it.

Reported-by: Emil S Tantilov <emils.tantilov@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sock.h            |   24 +++++++++------
 net/core/sock.c               |   49 ++++++++++++++++++++++++++++----
 net/ipv4/inet_timewait_sock.c |    2 -
 3 files changed, 59 insertions(+), 16 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 352f06b..3f6284e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -103,15 +103,15 @@ struct net;
 
 /**
  *	struct sock_common - minimal network layer representation of sockets
+ *	@skc_node: main hash linkage for various protocol lookup tables
+ *	@skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
+ *	@skc_refcnt: reference count
+ *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_family: network address family
  *	@skc_state: Connection state
  *	@skc_reuse: %SO_REUSEADDR setting
  *	@skc_bound_dev_if: bound device index if != 0
- *	@skc_node: main hash linkage for various protocol lookup tables
- *	@skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
  *	@skc_bind_node: bind hash linkage for various protocol lookup tables
- *	@skc_refcnt: reference count
- *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_prot: protocol handlers inside a network family
  *	@skc_net: reference to the network namespace of this socket
  *
@@ -119,17 +119,23 @@ struct net;
  *	for struct sock and struct inet_timewait_sock.
  */
 struct sock_common {
-	unsigned short		skc_family;
-	volatile unsigned char	skc_state;
-	unsigned char		skc_reuse;
-	int			skc_bound_dev_if;
+	/*
+	 * WARNING : skc_node, skc_refcnt must be first elements of this struct
+	 * (sk_prot_alloc() should not clear skc_node.next & skc_refcnt because
+	 *  of RCU lookups)
+	 */
 	union {
 		struct hlist_node	skc_node;
 		struct hlist_nulls_node skc_nulls_node;
 	};
-	struct hlist_node	skc_bind_node;
 	atomic_t		skc_refcnt;
+
 	unsigned int		skc_hash;
+	unsigned short		skc_family;
+	volatile unsigned char	skc_state;
+	unsigned char		skc_reuse;
+	int			skc_bound_dev_if;
+	struct hlist_node	skc_bind_node;
 	struct proto		*skc_prot;
 #ifdef CONFIG_NET_NS
 	struct net	 	*skc_net;
diff --git a/net/core/sock.c b/net/core/sock.c
index b0ba569..cf43ff1 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -939,8 +939,31 @@ static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
 	struct kmem_cache *slab;
 
 	slab = prot->slab;
-	if (slab != NULL)
-		sk = kmem_cache_alloc(slab, priority);
+	if (slab != NULL) {
+		/*
+		 * caches using SLAB_DESTROY_BY_RCU should let sk_node
+		 * and sk_refcnt untouched.
+		 *
+		 * Following makes sense only if sk first fields are
+		 * in following order : sk_node, refcnt, sk_hash
+		 */
+		BUILD_BUG_ON(offsetof(struct sock, sk_node) != 0);
+		BUILD_BUG_ON(offsetof(struct sock, sk_refcnt) !=
+			     sizeof(sk->sk_node));
+		BUILD_BUG_ON(offsetof(struct sock, sk_hash) !=
+			     sizeof(sk->sk_node) + sizeof(sk->sk_refcnt));
+		sk = kmem_cache_alloc(slab, priority & ~__GFP_ZERO);
+		if (!sk)
+			return sk;
+		if (priority & __GFP_ZERO) {
+			if (prot->slab_flags & SLAB_DESTROY_BY_RCU)
+				memset(&sk->sk_hash, 0,
+				       prot->obj_size -
+				       offsetof(struct sock, sk_hash));
+			else
+				memset(sk, 0, prot->obj_size);
+		}
+	}
 	else
 		sk = kmalloc(prot->obj_size, priority);
 
@@ -1125,7 +1148,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 
 		newsk->sk_err	   = 0;
 		newsk->sk_priority = 0;
-		atomic_set(&newsk->sk_refcnt, 2);
+		atomic_add(2, &newsk->sk_refcnt);
 
 		/*
 		 * Increment the counter in the same struct proto as the master
@@ -1840,7 +1863,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 
 	sk->sk_stamp = ktime_set(-1L, 0);
 
-	atomic_set(&sk->sk_refcnt, 1);
+	atomic_inc(&sk->sk_refcnt);
 	atomic_set(&sk->sk_wmem_alloc, 1);
 	atomic_set(&sk->sk_drops, 0);
 }
@@ -2140,12 +2163,25 @@ static inline void release_proto_idx(struct proto *prot)
 }
 #endif
 
+/*
+ * We need to initialize skc->skc_refcnt to 0, and skc->skc_node.pprev to NULL
+ * but only on newly allocated objects (check sk_prot_alloc())
+ */
+static void sk_init_once(void *arg)
+{
+	struct sock_common *skc = arg;
+
+	atomic_set(&skc->skc_refcnt, 0);
+	skc->skc_node.pprev = NULL;
+}
+
 int proto_register(struct proto *prot, int alloc_slab)
 {
 	if (alloc_slab) {
 		prot->slab = kmem_cache_create(prot->name, prot->obj_size, 0,
 					SLAB_HWCACHE_ALIGN | prot->slab_flags,
-					NULL);
+					prot->slab_flags & SLAB_DESTROY_BY_RCU ?
+					    sk_init_once : NULL);
 
 		if (prot->slab == NULL) {
 			printk(KERN_CRIT "%s: Can't create sock SLAB cache!\n",
@@ -2187,7 +2223,8 @@ int proto_register(struct proto *prot, int alloc_slab)
 						  0,
 						  SLAB_HWCACHE_ALIGN |
 							prot->slab_flags,
-						  NULL);
+						  prot->slab_flags & SLAB_DESTROY_BY_RCU ?
+							sk_init_once : NULL);
 			if (prot->twsk_prot->twsk_slab == NULL)
 				goto out_free_timewait_sock_slab_name;
 		}
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 61283f9..a2997df 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -139,7 +139,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk, const int stat
 		tw->tw_transparent  = inet->transparent;
 		tw->tw_prot	    = sk->sk_prot_creator;
 		twsk_net_set(tw, hold_net(sock_net(sk)));
-		atomic_set(&tw->tw_refcnt, 1);
+		atomic_inc(&tw->tw_refcnt);
 		inet_twsk_dead_node_init(tw);
 		__module_get(tw->tw_prot->owner);
 	}


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* RE: [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
  2009-07-08  2:14               ` David Miller
  2009-07-08  6:50                 ` Eric Dumazet
@ 2009-07-08 17:02                 ` Tantilov, Emil S
  2009-07-08 17:45                   ` David Miller
  1 sibling, 1 reply; 29+ messages in thread
From: Tantilov, Emil S @ 2009-07-08 17:02 UTC (permalink / raw)
  To: David Miller, eric.dumazet@gmail.com
  Cc: emils.tantilov@gmail.com, netdev@vger.kernel.org,
	Brandeburg, Jesse, Kirsher, Jeffrey T, jolsa@redhat.com

David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 08 Jul 2009 00:33:29 +0200
> 
>> [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
>> 
>> Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code relies on certain
>> fields not being blindly overwritten, even with zero.
>> 
>> These fields are sk->sk_refcnt and sk->sk_nulls_node.next.
>> 
>> The current sk_prot_alloc() implementation doesn't respect this
>> assumption, calling kmem_cache_alloc() with __GFP_ZERO and setting
>> sk_refcnt to 1 instead of atomically incrementing it.
>> 
>> Reported-by: Emil S Tantilov <emils.tantilov@gmail.com>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> I've applied this but will wait for some more testing before
> I push it out for real to kernel.org

Still seeing traces during the test even with this patch applied:

[ 1089.430093] ------------[ cut here ]------------
[ 1089.435667] WARNING: at include/net/sock.h:423 udp_lib_unhash+0x73/0xa0()
[ 1089.435670] Hardware name: S5520HC
[ 1089.435671] Modules linked in: igb dca mdio [last unloaded: ixgbe]
[ 1089.435678] Pid: 15545, comm: netserver Not tainted 2.6.31-rc1-net-2.6-igb-ed-07071641 #4
[ 1089.435681] Call Trace:
[ 1089.435686]  [<ffffffff813e8a2f>] ? udp_lib_unhash+0x73/0xa0
[ 1089.435691]  [<ffffffff81057b49>] warn_slowpath_common+0x77/0x8f
[ 1089.435696]  [<ffffffff81057b70>] warn_slowpath_null+0xf/0x11
[ 1089.435700]  [<ffffffff813e8a2f>] udp_lib_unhash+0x73/0xa0
[ 1089.435705]  [<ffffffff8138e616>] sk_common_release+0x2f/0xb4
[ 1089.435710]  [<ffffffff81429028>] udp_lib_close+0x9/0xb
[ 1089.435715]  [<ffffffff813ee62a>] inet_release+0x58/0x5f
[ 1089.435720]  [<ffffffff814158e5>] inet6_release+0x30/0x35
[ 1089.435725]  [<ffffffff8138be4b>] sock_release+0x1a/0x6c
[ 1089.435729]  [<ffffffff8138c366>] sock_close+0x22/0x26
[ 1089.435735]  [<ffffffff810ec923>] __fput+0xf0/0x18c
[ 1089.435739]  [<ffffffff810eccd1>] fput+0x15/0x18
[ 1089.435742]  [<ffffffff810e9bfa>] filp_close+0x5c/0x67
[ 1089.435746]  [<ffffffff810e9c80>] sys_close+0x7b/0xb6
[ 1089.435751]  [<ffffffff81027aab>] system_call_fastpath+0x16/0x1b
[ 1089.435755] ---[ end trace a79410bd00b8b1ac ]---

Emil

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
  2009-07-08 17:02                 ` [PATCH] net: sk_prot_alloc() should not blindly overwrite memory Tantilov, Emil S
@ 2009-07-08 17:45                   ` David Miller
  2009-07-08 23:21                     ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: David Miller @ 2009-07-08 17:45 UTC (permalink / raw)
  To: emil.s.tantilov
  Cc: eric.dumazet, emils.tantilov, netdev, jesse.brandeburg,
	jeffrey.t.kirsher, jolsa

From: "Tantilov, Emil S" <emil.s.tantilov@intel.com>
Date: Wed, 8 Jul 2009 11:02:22 -0600

> Still seeing traces during the test even with this patch applied:
> 
> [ 1089.430093] ------------[ cut here ]------------
> [ 1089.435667] WARNING: at include/net/sock.h:423 udp_lib_unhash+0x73/0xa0()
> [ 1089.435670] Hardware name: S5520HC

Ok I'll back this out for now, needs more investigation
obviously.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
  2009-07-08 17:45                   ` David Miller
@ 2009-07-08 23:21                     ` Eric Dumazet
  2009-07-08 23:35                       ` Tantilov, Emil S
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2009-07-08 23:21 UTC (permalink / raw)
  To: David Miller
  Cc: emil.s.tantilov, emils.tantilov, netdev, jesse.brandeburg,
	jeffrey.t.kirsher, jolsa

David Miller wrote:
> From: "Tantilov, Emil S" <emil.s.tantilov@intel.com>
> Date: Wed, 8 Jul 2009 11:02:22 -0600
> 
>> Still seeing traces during the test even with this patch applied:
>>
>> [ 1089.430093] ------------[ cut here ]------------
>> [ 1089.435667] WARNING: at include/net/sock.h:423 udp_lib_unhash+0x73/0xa0()
>> [ 1089.435670] Hardware name: S5520HC
> 
> Ok I'll back this out for now, needs more investigation
> obviously.

Hmm... I never said it was supposed to fix Emil's problem, just that
I discovered one potential problem by code inspection.

I have not yet found the sk_refcnt mismatch.
As we now do fewer atomic ops per packet than before, some old bug could
surface now...

Emil, is it easy to reproduce this problem, considering I have a platform
similar to yours (dual quad-core machine, E5450 CPUs @ 3GHz)?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
  2009-07-08 23:21                     ` Eric Dumazet
@ 2009-07-08 23:35                       ` Tantilov, Emil S
  2009-07-09  0:20                         ` [PATCH] net: ip_push_pending_frames() fix Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Tantilov, Emil S @ 2009-07-08 23:35 UTC (permalink / raw)
  To: Eric Dumazet, David Miller
  Cc: emils.tantilov@gmail.com, netdev@vger.kernel.org,
	Brandeburg, Jesse, Kirsher, Jeffrey T, jolsa@redhat.com

Eric Dumazet wrote:
> David Miller wrote:
>> From: "Tantilov, Emil S" <emil.s.tantilov@intel.com>
>> Date: Wed, 8 Jul 2009 11:02:22 -0600
>> 
>>> Still seeing traces during the test even with this patch applied:
>>> 
>>> [ 1089.430093] ------------[ cut here ]------------
>>> [ 1089.435667] WARNING: at include/net/sock.h:423
>>> udp_lib_unhash+0x73/0xa0() [ 1089.435670] Hardware name: S5520HC
>> 
>> Ok I'll back this out for now, needs more investigation
>> obviously.
> 
> Hmm... I never said it was supposed to fix Emil's problem, just that
> I discovered one potential problem by code inspection.
> 
> I have not yet found the sk_refcnt mismatch.
> As we now do fewer atomic ops per packet than before, some old bug
> could surface now...
> 
> Emil, is it easy to reproduce this problem, considering I have a
> platform similar to yours (dual quad-core machine, E5450 CPUs @ 3GHz)?

Eric,

It should be easy to reproduce. At least I have been able to reproduce it
consistently on several different systems with different drivers (e1000,
e1000e, igb).

The test I'm running is a mix of IPv4/IPv6 TCP/UDP traffic with netperf
(also mixing different test types: TCP/UDP_STREAM, TCP_MAERTS, TCP/UDP_RR,
etc.). How much this matters I don't know - it's possible that just UDP
traffic would do it. I also think it may have something to do with IPv6
because of the trace, but I am not sure.

If you need more information let me know.

Thanks,
Emil

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] net: ip_push_pending_frames() fix
  2009-07-08 23:35                       ` Tantilov, Emil S
@ 2009-07-09  0:20                         ` Eric Dumazet
  2009-07-09 14:32                           ` Tantilov, Emil S
  2009-07-12  3:27                           ` David Miller
  0 siblings, 2 replies; 29+ messages in thread
From: Eric Dumazet @ 2009-07-09  0:20 UTC (permalink / raw)
  To: Tantilov, Emil S
  Cc: David Miller, emils.tantilov@gmail.com, netdev@vger.kernel.org,
	Brandeburg, Jesse, Kirsher, Jeffrey T, jolsa@redhat.com

Tantilov, Emil S a écrit :
> Eric Dumazet wrote:
>> David Miller a écrit :
>>> From: "Tantilov, Emil S" <emil.s.tantilov@intel.com>
>>> Date: Wed, 8 Jul 2009 11:02:22 -0600
>>>
>>>> Still seeing traces during the test even with this patch applied:
>>>>
>>>> [ 1089.430093] ------------[ cut here ]------------
>>>> [ 1089.435667] WARNING: at include/net/sock.h:423
>>>> udp_lib_unhash+0x73/0xa0() [ 1089.435670] Hardware name: S5520HC
>>> Ok I'll back this out for now, needs more investigation
>>> obviously.
>> Hmm... I never said it was supposed to fix Emil problem, just that
>> I discovered one potential problem by code inspection.
>>
>> I could not find yet sk_refcnt mismatch.
>> As we do less atomic ops per packet than before, some old bug could
>> surface now...
>>
>> Emil, is it easy to reproduce this problem, considering I have a
>> similar platform than yours (dual quad core machine, E5450 cpus @
>> 3GHz) ? 
> 
> Eric,
> 
> It should be easy to reproduce. At least I have been able to consistently 
> reproduce it on several different systems with different drivers (e1000, e1000e, igb). 
> 
> The test I'm running is a mix of IPV4/6 TCP/UDP traffic with netperf (also mixing different types TCP/UDP_STREAM, TCP_MAERTS, TCP_UDP_RR etc). How much this matters I don't know - it's possible that just UDP traffic would do it. I also think it may have something to do with IPv6
> because of the trace, but I am not sure.
> 
> If you need more information let me know.
> 

OK, thanks, this was helpful. Corking or not corking, that is the question :)

I think ip6_push_pending_frames() & ip_push_pending_frames() 
have a problem after recent commit 2b85a34e911bf483c27cfdd124aeb1605145dc80 
(net: No more expensive sock_hold()/sock_put() on each tx)

[PATCH] net: ip_push_pending_frames() fix

After commit 2b85a34e911bf483c27cfdd124aeb1605145dc80 
(net: No more expensive sock_hold()/sock_put() on each tx)
we do not take any more references on sk->sk_refcnt on outgoing packets.

I forgot to delete two __sock_put() from ip_push_pending_frames()
and ip6_push_pending_frames().

Reported-by: Emil S Tantilov <emils.tantilov@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 2470262..7d08210 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1243,7 +1243,6 @@ int ip_push_pending_frames(struct sock *sk)
 		skb->len += tmp_skb->len;
 		skb->data_len += tmp_skb->len;
 		skb->truesize += tmp_skb->truesize;
-		__sock_put(tmp_skb->sk);
 		tmp_skb->destructor = NULL;
 		tmp_skb->sk = NULL;
 	}
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 7c76e3d..87f8419 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1484,7 +1484,6 @@ int ip6_push_pending_frames(struct sock *sk)
 		skb->len += tmp_skb->len;
 		skb->data_len += tmp_skb->len;
 		skb->truesize += tmp_skb->truesize;
-		__sock_put(tmp_skb->sk);
 		tmp_skb->destructor = NULL;
 		tmp_skb->sk = NULL;
 	}

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
  2009-07-08  6:50                 ` Eric Dumazet
@ 2009-07-09  5:36                   ` Eric Dumazet
  2009-07-09 17:13                     ` Paul E. McKenney
  2009-07-12  3:27                     ` David Miller
  0 siblings, 2 replies; 29+ messages in thread
From: Eric Dumazet @ 2009-07-09  5:36 UTC (permalink / raw)
  To: David Miller
  Cc: emil.s.tantilov, emils.tantilov, netdev, jesse.brandeburg,
	jeffrey.t.kirsher, jolsa, Patrick McHardy, Paul E. McKenney

Eric Dumazet a écrit :
> David Miller a écrit :
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Wed, 08 Jul 2009 00:33:29 +0200
>>
>>> [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
>>>
>>> Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code rely that some
>>> fields should not be blindly overwritten, even with null.
>>>
>>> These fields are sk->sk_refcnt and sk->sk_nulls_node.next
>>>
>>> Current sk_prot_alloc() implementation doesnt respect this hypothesis,
>>> calling kmem_cache_alloc() with __GFP_ZERO and setting sk_refcnt to 1
>>> instead of atomically increment it.
>>>
>>> Reported-by: Emil S Tantilov <emils.tantilov@gmail.com>
>>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>> I've applied this but will wait for some more testing before
>> I push it out for real to kernel.org
> 
> Thanks David
> 
> I forgot to CC Paul and Patrick, so I'll ask them to look at this patch.
> 
> Patrick, a similar fix is needed in conntrack as well, we currently
> uses "ct = kmem_cache_zalloc(nf_conntrack_cachep, gfp);" and thus
>  overwrite struct hlist_nulls_node hnnode; contained
> in "struct nf_conntrack_tuple_hash", while lockless readers still
> potentialy need them. Setting hnnode.next to NULL is dangerous
> since last bit is not set (not a nulls value), a reader could
> try to dereference this NULL pointer and trap.
> 
> 
> Here is the patch again so that Paul & Patrick can comment on it.
> 
> I am not sure about the refcnt thing (blindly setting it to 0 again
> should be OK in fact, since no reader should/can to the 
> atomic_inc_if_not_zero on it), but the nulls.next thing is problematic.

Here is an updated and much simpler patch, taking care that sk_node.next is not set to 0.

This patch applies to >= 2.6.29 kernels

[PATCH] net: sk_prot_alloc() should not blindly overwrite memory

Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code correctness
depends on sk->sk_nulls_node.next being always valid. A NULL
value is not allowed as it might fault a lockless reader.

Current sk_prot_alloc() implementation doesn't respect this assumption,
calling kmem_cache_alloc() with __GFP_ZERO. Just call memset() around
the forbidden field.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/core/sock.c b/net/core/sock.c
index b0ba569..7b87ec0 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -939,8 +939,23 @@ static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
 	struct kmem_cache *slab;
 
 	slab = prot->slab;
-	if (slab != NULL)
-		sk = kmem_cache_alloc(slab, priority);
+	if (slab != NULL) {
+		sk = kmem_cache_alloc(slab, priority & ~__GFP_ZERO);
+		if (!sk)
+			return sk;
+		if (priority & __GFP_ZERO) {
+			/*
+			 * caches using SLAB_DESTROY_BY_RCU should let
+			 * sk_node.next un-modified. Special care is taken
+			 * when initializing object to zero.
+			 */
+			if (offsetof(struct sock, sk_node.next) != 0)
+				memset(sk, 0, offsetof(struct sock, sk_node.next));
+			memset(&sk->sk_node.pprev, 0,
+			       prot->obj_size - offsetof(struct sock,
+							 sk_node.pprev));
+		}
+	}
 	else
 		sk = kmalloc(prot->obj_size, priority);
 

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* RE: [PATCH] net: ip_push_pending_frames() fix
  2009-07-09  0:20                         ` [PATCH] net: ip_push_pending_frames() fix Eric Dumazet
@ 2009-07-09 14:32                           ` Tantilov, Emil S
  2009-07-09 14:38                             ` Eric Dumazet
  2009-07-12  3:27                           ` David Miller
  1 sibling, 1 reply; 29+ messages in thread
From: Tantilov, Emil S @ 2009-07-09 14:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, emils.tantilov@gmail.com, netdev@vger.kernel.org,
	Brandeburg, Jesse, Kirsher, Jeffrey T, jolsa@redhat.com

Eric Dumazet wrote:
> Tantilov, Emil S a écrit :
>> Eric Dumazet wrote:
>>> David Miller a écrit :
>>>> From: "Tantilov, Emil S" <emil.s.tantilov@intel.com>
>>>> Date: Wed, 8 Jul 2009 11:02:22 -0600
>>>> 
>>>>> Still seeing traces during the test even with this patch applied:
>>>>> 
>>>>> [ 1089.430093] ------------[ cut here ]------------
>>>>> [ 1089.435667] WARNING: at include/net/sock.h:423
>>>>> udp_lib_unhash+0x73/0xa0() [ 1089.435670] Hardware name: S5520HC
>>>> Ok I'll back this out for now, needs more investigation
>>>> obviously.
>>> Hmm... I never said it was supposed to fix Emil problem, just that
>>> I discovered one potential problem by code inspection.
>>> 
>>> I could not find yet sk_refcnt mismatch.
>>> As we do less atomic ops per packet than before, some old bug could
>>> surface now... 
>>> 
>>> Emil, is it easy to reproduce this problem, considering I have a
>>> similar platform than yours (dual quad core machine, E5450 cpus @
>>> 3GHz) ?
>> 
>> Eric,
>> 
>> It should be easy to reproduce. At least I have been able to
>> consistently 
>> reproduce it on several different systems with different drivers
>> (e1000, e1000e, igb). 
>> 
>> The test I'm running is a mix of IPV4/6 TCP/UDP traffic with netperf
>> (also mixing different types TCP/UDP_STREAM, TCP_MAERTS, TCP_UDP_RR
>> etc). How much this matters I don't know - it's possible that just
>> UDP traffic would do it. I also think it may have something to do
>> with IPv6 because of the trace, but I am not sure.   
>> 
>> If you need more information let me know.
>> 
> 
> OK thanks, this was helpful, corking or not corking, that is the
> question :) 
> 
> I think ip6_push_pending_frames() & ip_push_pending_frames
> have a problem after recent commit
> 2b85a34e911bf483c27cfdd124aeb1605145dc80 (net: No more expensive
> sock_hold()/sock_put() on each tx) 
> 
> [PATCH] net: ip_push_pending_frames() fix
> 
> After commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
> (net: No more expensive sock_hold()/sock_put() on each tx)
> we do not take any more references on sk->sk_refcnt on outgoing
> packets. 
> 
> I forgot to delete two __sock_put() from ip_push_pending_frames()
> and ip6_push_pending_frames().
> 
> Reported-by: Emil S Tantilov <emils.tantilov@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 2470262..7d08210 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1243,7 +1243,6 @@ int ip_push_pending_frames(struct sock *sk)
>  		skb->len += tmp_skb->len;
>  		skb->data_len += tmp_skb->len;
>  		skb->truesize += tmp_skb->truesize;
> -		__sock_put(tmp_skb->sk);
>  		tmp_skb->destructor = NULL;
>  		tmp_skb->sk = NULL;
>  	}
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 7c76e3d..87f8419 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -1484,7 +1484,6 @@ int ip6_push_pending_frames(struct sock *sk)
>  		skb->len += tmp_skb->len;
>  		skb->data_len += tmp_skb->len;
>  		skb->truesize += tmp_skb->truesize;
> -		__sock_put(tmp_skb->sk);
>  		tmp_skb->destructor = NULL;
>  		tmp_skb->sk = NULL;
>  	}

Thanks Eric,

With this patch the test ran all night without issues.

Emil

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] net: ip_push_pending_frames() fix
  2009-07-09 14:32                           ` Tantilov, Emil S
@ 2009-07-09 14:38                             ` Eric Dumazet
  0 siblings, 0 replies; 29+ messages in thread
From: Eric Dumazet @ 2009-07-09 14:38 UTC (permalink / raw)
  To: Tantilov, Emil S
  Cc: David Miller, emils.tantilov@gmail.com, netdev@vger.kernel.org,
	Brandeburg, Jesse, Kirsher, Jeffrey T, jolsa@redhat.com

Tantilov, Emil S a écrit :
> Eric Dumazet wrote:
>> Tantilov, Emil S a écrit :
>>> Eric Dumazet wrote:
>>>> David Miller a écrit :
>>>>> From: "Tantilov, Emil S" <emil.s.tantilov@intel.com>
>>>>> Date: Wed, 8 Jul 2009 11:02:22 -0600
>>>>>
>>>>>> Still seeing traces during the test even with this patch applied:
>>>>>>
>>>>>> [ 1089.430093] ------------[ cut here ]------------
>>>>>> [ 1089.435667] WARNING: at include/net/sock.h:423
>>>>>> udp_lib_unhash+0x73/0xa0() [ 1089.435670] Hardware name: S5520HC
>>>>> Ok I'll back this out for now, needs more investigation
>>>>> obviously.
>>>> Hmm... I never said it was supposed to fix Emil problem, just that
>>>> I discovered one potential problem by code inspection.
>>>>
>>>> I could not find yet sk_refcnt mismatch.
>>>> As we do less atomic ops per packet than before, some old bug could
>>>> surface now... 
>>>>
>>>> Emil, is it easy to reproduce this problem, considering I have a
>>>> similar platform than yours (dual quad core machine, E5450 cpus @
>>>> 3GHz) ?
>>> Eric,
>>>
>>> It should be easy to reproduce. At least I have been able to
>>> consistently 
>>> reproduce it on several different systems with different drivers
>>> (e1000, e1000e, igb). 
>>>
>>> The test I'm running is a mix of IPV4/6 TCP/UDP traffic with netperf
>>> (also mixing different types TCP/UDP_STREAM, TCP_MAERTS, TCP_UDP_RR
>>> etc). How much this matters I don't know - it's possible that just
>>> UDP traffic would do it. I also think it may have something to do
>>> with IPv6 because of the trace, but I am not sure.   
>>>
>>> If you need more information let me know.
>>>
>> OK thanks, this was helpful, corking or not corking, that is the
>> question :) 
>>
>> I think ip6_push_pending_frames() & ip_push_pending_frames
>> have a problem after recent commit
>> 2b85a34e911bf483c27cfdd124aeb1605145dc80 (net: No more expensive
>> sock_hold()/sock_put() on each tx) 
>>
>> [PATCH] net: ip_push_pending_frames() fix
>>
>> After commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
>> (net: No more expensive sock_hold()/sock_put() on each tx)
>> we do not take any more references on sk->sk_refcnt on outgoing
>> packets. 
>>
>> I forgot to delete two __sock_put() from ip_push_pending_frames()
>> and ip6_push_pending_frames().
>>
>> Reported-by: Emil S Tantilov <emils.tantilov@gmail.com>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>> ---
>> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>> index 2470262..7d08210 100644
>> --- a/net/ipv4/ip_output.c
>> +++ b/net/ipv4/ip_output.c
>> @@ -1243,7 +1243,6 @@ int ip_push_pending_frames(struct sock *sk)
>>  		skb->len += tmp_skb->len;
>>  		skb->data_len += tmp_skb->len;
>>  		skb->truesize += tmp_skb->truesize;
>> -		__sock_put(tmp_skb->sk);
>>  		tmp_skb->destructor = NULL;
>>  		tmp_skb->sk = NULL;
>>  	}
>> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
>> index 7c76e3d..87f8419 100644
>> --- a/net/ipv6/ip6_output.c
>> +++ b/net/ipv6/ip6_output.c
>> @@ -1484,7 +1484,6 @@ int ip6_push_pending_frames(struct sock *sk)
>>  		skb->len += tmp_skb->len;
>>  		skb->data_len += tmp_skb->len;
>>  		skb->truesize += tmp_skb->truesize;
>> -		__sock_put(tmp_skb->sk);
>>  		tmp_skb->destructor = NULL;
>>  		tmp_skb->sk = NULL;
>>  	}
> 
> Thanks Eric,
> 
> With this patch the test ran all night without issues.
> 

Thanks a lot Emil for testing and your feedback.

David, could you please add another tag ?

Tested-by: Emil S Tantilov <emils.tantilov@gmail.com>

Thanks

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
  2009-07-09  5:36                   ` Eric Dumazet
@ 2009-07-09 17:13                     ` Paul E. McKenney
  2009-07-09 20:50                       ` Eric Dumazet
  2009-07-12  3:27                     ` David Miller
  1 sibling, 1 reply; 29+ messages in thread
From: Paul E. McKenney @ 2009-07-09 17:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, emil.s.tantilov, emils.tantilov, netdev,
	jesse.brandeburg, jeffrey.t.kirsher, jolsa, Patrick McHardy

On Thu, Jul 09, 2009 at 07:36:05AM +0200, Eric Dumazet wrote:
> Eric Dumazet a écrit :
> > David Miller a écrit :
> >> From: Eric Dumazet <eric.dumazet@gmail.com>
> >> Date: Wed, 08 Jul 2009 00:33:29 +0200
> >>
> >>> [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
> >>>
> >>> Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code rely that some
> >>> fields should not be blindly overwritten, even with null.
> >>>
> >>> These fields are sk->sk_refcnt and sk->sk_nulls_node.next
> >>>
> >>> Current sk_prot_alloc() implementation doesnt respect this hypothesis,
> >>> calling kmem_cache_alloc() with __GFP_ZERO and setting sk_refcnt to 1
> >>> instead of atomically increment it.
> >>>
> >>> Reported-by: Emil S Tantilov <emils.tantilov@gmail.com>
> >>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> >> I've applied this but will wait for some more testing before
> >> I push it out for real to kernel.org
> > 
> > Thanks David
> > 
> > I forgot to CC Paul and Patrick, so I'll ask them to look at this patch.
> > 
> > Patrick, a similar fix is needed in conntrack as well, we currently
> > uses "ct = kmem_cache_zalloc(nf_conntrack_cachep, gfp);" and thus
> >  overwrite struct hlist_nulls_node hnnode; contained
> > in "struct nf_conntrack_tuple_hash", while lockless readers still
> > potentialy need them. Setting hnnode.next to NULL is dangerous
> > since last bit is not set (not a nulls value), a reader could
> > try to dereference this NULL pointer and trap.
> > 
> > 
> > Here is the patch again so that Paul & Patrick can comment on it.
> > 
> > I am not sure about the refcnt thing (blindly setting it to 0 again
> > should be OK in fact, since no reader should/can to the 
> > atomic_inc_if_not_zero on it), but the nulls.next thing is problematic.
> 
> Here is an updated and much simpler patch, taking care of sk_node.next being not set to 0
> 
> This patch applies to >= 2.6.29 kernels

Does this one also need the rearrangement of struct elements in the
earlier patch?  (And apologies about being slow to get to that one.)

						Thanx, Paul

> [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
> 
> Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code correctness
> depends on sk->sk_nulls_node.next being always valid. A NULL
> value is not allowed as it might fault a lockless reader.
> 
> Current sk_prot_alloc() implementation doesnt respect this hypothesis,
> calling kmem_cache_alloc() with __GFP_ZERO. Just call memset() around
> the forbidden field.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
> diff --git a/net/core/sock.c b/net/core/sock.c
> index b0ba569..7b87ec0 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -939,8 +939,23 @@ static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
>  	struct kmem_cache *slab;
> 
>  	slab = prot->slab;
> -	if (slab != NULL)
> -		sk = kmem_cache_alloc(slab, priority);
> +	if (slab != NULL) {
> +		sk = kmem_cache_alloc(slab, priority & ~__GFP_ZERO);
> +		if (!sk)
> +			return sk;
> +		if (priority & __GFP_ZERO) {
> +			/*
> +			 * caches using SLAB_DESTROY_BY_RCU should let
> +			 * sk_node.next un-modified. Special care is taken
> +			 * when initializing object to zero.
> +			 */
> +			if (offsetof(struct sock, sk_node.next) != 0)
> +				memset(sk, 0, offsetof(struct sock, sk_node.next));
> +			memset(&sk->sk_node.pprev, 0,
> +			       prot->obj_size - offsetof(struct sock,
> +							 sk_node.pprev));
> +		}
> +	}
>  	else
>  		sk = kmalloc(prot->obj_size, priority);
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
  2009-07-09 17:13                     ` Paul E. McKenney
@ 2009-07-09 20:50                       ` Eric Dumazet
  0 siblings, 0 replies; 29+ messages in thread
From: Eric Dumazet @ 2009-07-09 20:50 UTC (permalink / raw)
  To: paulmck
  Cc: David Miller, emil.s.tantilov, emils.tantilov, netdev,
	jesse.brandeburg, jeffrey.t.kirsher, jolsa, Patrick McHardy

Paul E. McKenney a écrit :
> On Thu, Jul 09, 2009 at 07:36:05AM +0200, Eric Dumazet wrote:
>> Eric Dumazet a écrit :
>>> David Miller a écrit :
>>>> From: Eric Dumazet <eric.dumazet@gmail.com>
>>>> Date: Wed, 08 Jul 2009 00:33:29 +0200
>>>>
>>>>> [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
>>>>>
>>>>> Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code rely that some
>>>>> fields should not be blindly overwritten, even with null.
>>>>>
>>>>> These fields are sk->sk_refcnt and sk->sk_nulls_node.next
>>>>>
>>>>> Current sk_prot_alloc() implementation doesnt respect this hypothesis,
>>>>> calling kmem_cache_alloc() with __GFP_ZERO and setting sk_refcnt to 1
>>>>> instead of atomically increment it.
>>>>>
>>>>> Reported-by: Emil S Tantilov <emils.tantilov@gmail.com>
>>>>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>>>> I've applied this but will wait for some more testing before
>>>> I push it out for real to kernel.org
>>> Thanks David
>>>
>>> I forgot to CC Paul and Patrick, so I'll ask them to look at this patch.
>>>
>>> Patrick, a similar fix is needed in conntrack as well, we currently
>>> uses "ct = kmem_cache_zalloc(nf_conntrack_cachep, gfp);" and thus
>>>  overwrite struct hlist_nulls_node hnnode; contained
>>> in "struct nf_conntrack_tuple_hash", while lockless readers still
>>> potentialy need them. Setting hnnode.next to NULL is dangerous
>>> since last bit is not set (not a nulls value), a reader could
>>> try to dereference this NULL pointer and trap.
>>>
>>>
>>> Here is the patch again so that Paul & Patrick can comment on it.
>>>
>>> I am not sure about the refcnt thing (blindly setting it to 0 again
>>> should be OK in fact, since no reader should/can to the 
>>> atomic_inc_if_not_zero on it), but the nulls.next thing is problematic.
>> Here is an updated and much simpler patch, taking care of sk_node.next being not set to 0
>>
>> This patch applies to >= 2.6.29 kernels
> 
> Does this one also need the rearrangement of struct elements in the
> earlier patch?  (And apologies about being slow to get to that one.)
> 

No: since only one field (sk_node.next) needs special attention, I felt
it was not really necessary to reorder fields. The first memset is inlined
because its size is constant, so the cost is small.

Thanks

> 						Thanx, Paul
> 
>> [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
>>
>> Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code correctness
>> depends on sk->sk_nulls_node.next being always valid. A NULL
>> value is not allowed as it might fault a lockless reader.
>>
>> Current sk_prot_alloc() implementation doesnt respect this hypothesis,
>> calling kmem_cache_alloc() with __GFP_ZERO. Just call memset() around
>> the forbidden field.
>>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>> ---
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index b0ba569..7b87ec0 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -939,8 +939,23 @@ static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
>>  	struct kmem_cache *slab;
>>
>>  	slab = prot->slab;
>> -	if (slab != NULL)
>> -		sk = kmem_cache_alloc(slab, priority);
>> +	if (slab != NULL) {
>> +		sk = kmem_cache_alloc(slab, priority & ~__GFP_ZERO);
>> +		if (!sk)
>> +			return sk;
>> +		if (priority & __GFP_ZERO) {
>> +			/*
>> +			 * caches using SLAB_DESTROY_BY_RCU should let
>> +			 * sk_node.next un-modified. Special care is taken
>> +			 * when initializing object to zero.
>> +			 */
>> +			if (offsetof(struct sock, sk_node.next) != 0)
>> +				memset(sk, 0, offsetof(struct sock, sk_node.next));
>> +			memset(&sk->sk_node.pprev, 0,
>> +			       prot->obj_size - offsetof(struct sock,
>> +							 sk_node.pprev));
>> +		}
>> +	}
>>  	else
>>  		sk = kmalloc(prot->obj_size, priority);
>>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] net: ip_push_pending_frames() fix
  2009-07-09  0:20                         ` [PATCH] net: ip_push_pending_frames() fix Eric Dumazet
  2009-07-09 14:32                           ` Tantilov, Emil S
@ 2009-07-12  3:27                           ` David Miller
  1 sibling, 0 replies; 29+ messages in thread
From: David Miller @ 2009-07-12  3:27 UTC (permalink / raw)
  To: eric.dumazet
  Cc: emil.s.tantilov, emils.tantilov, netdev, jesse.brandeburg,
	jeffrey.t.kirsher, jolsa

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 09 Jul 2009 02:20:42 +0200

> [PATCH] net: ip_push_pending_frames() fix
> 
> After commit 2b85a34e911bf483c27cfdd124aeb1605145dc80 
> (net: No more expensive sock_hold()/sock_put() on each tx)
> we do not take any more references on sk->sk_refcnt on outgoing packets.
> 
> I forgot to delete two __sock_put() from ip_push_pending_frames()
> and ip6_push_pending_frames().
> 
> Reported-by: Emil S Tantilov <emils.tantilov@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, with the Tested-by marker added.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
  2009-07-09  5:36                   ` Eric Dumazet
  2009-07-09 17:13                     ` Paul E. McKenney
@ 2009-07-12  3:27                     ` David Miller
  2009-07-12  7:07                       ` Eric Dumazet
  1 sibling, 1 reply; 29+ messages in thread
From: David Miller @ 2009-07-12  3:27 UTC (permalink / raw)
  To: eric.dumazet
  Cc: emil.s.tantilov, emils.tantilov, netdev, jesse.brandeburg,
	jeffrey.t.kirsher, jolsa, kaber, paulmck

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 09 Jul 2009 07:36:05 +0200

> [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
> 
> Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code correctness
> depends on sk->sk_nulls_node.next being always valid. A NULL
> value is not allowed as it might fault a lockless reader.
> 
> Current sk_prot_alloc() implementation doesnt respect this hypothesis,
> calling kmem_cache_alloc() with __GFP_ZERO. Just call memset() around
> the forbidden field.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied and queued up for -stable.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
  2009-07-12  3:27                     ` David Miller
@ 2009-07-12  7:07                       ` Eric Dumazet
  2009-07-15 12:28                         ` [PATCH] net: nf_conntrack_alloc() should not use kmem_cache_zalloc() Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2009-07-12  7:07 UTC (permalink / raw)
  To: David Miller
  Cc: emil.s.tantilov, emils.tantilov, netdev, jesse.brandeburg,
	jeffrey.t.kirsher, jolsa, kaber, paulmck

David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 09 Jul 2009 07:36:05 +0200
> 
>> [PATCH] net: sk_prot_alloc() should not blindly overwrite memory
>>
>> Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code correctness
>> depends on sk->sk_nulls_node.next being always valid. A NULL
>> value is not allowed as it might fault a lockless reader.
>>
>> Current sk_prot_alloc() implementation doesnt respect this hypothesis,
>> calling kmem_cache_alloc() with __GFP_ZERO. Just call memset() around
>> the forbidden field.
>>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> APplied and queued up for -stable.

I'll try to find some time to fix netfilter conntrack as well.

Thanks

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] net: nf_conntrack_alloc() should not use kmem_cache_zalloc()
  2009-07-12  7:07                       ` Eric Dumazet
@ 2009-07-15 12:28                         ` Eric Dumazet
  2009-07-15 15:28                           ` Patrick McHardy
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2009-07-15 12:28 UTC (permalink / raw)
  To: David Miller, kaber; +Cc: netdev, paulmck

David, Patrick

Here is a fix for nf_conntrack, candidate for linux-2.6.31 and stable (linux-2.6.30)

Thank you

[PATCH] net: nf_conntrack_alloc() should not use kmem_cache_zalloc()

When a slab cache uses SLAB_DESTROY_BY_RCU, we must be careful when allocating
objects, since slab allocator could give a freed object still used by lockless
readers.

In particular, nf_conntrack RCU lookups rely on ct->tuplehash[xxx].hnnode.next
always being valid (i.e. containing a valid 'nulls' value, or a valid pointer to
the next object in the hash chain).

kmem_cache_zalloc() sets up the object with NULL values, but a NULL value is not
valid for ct->tuplehash[xxx].hnnode.next.

The fix is to call kmem_cache_alloc() and do the zeroing ourselves.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 7508f11..23feafa 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -561,17 +561,28 @@ struct nf_conn *nf_conntrack_alloc(struct net *net,
 		}
 	}
 
-	ct = kmem_cache_zalloc(nf_conntrack_cachep, gfp);
+	/*
+	 * Do not use kmem_cache_zalloc(), as this cache uses
+	 * SLAB_DESTROY_BY_RCU.
+	 */
+	ct = kmem_cache_alloc(nf_conntrack_cachep, gfp);
 	if (ct == NULL) {
 		pr_debug("nf_conntrack_alloc: Can't alloc conntrack.\n");
 		atomic_dec(&net->ct.count);
 		return ERR_PTR(-ENOMEM);
 	}
-
+	/*
+	 * Let ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.next
+	 * and ct->tuplehash[IP_CT_DIR_REPLY].hnnode.next unchanged.
+	 */
+	memset(&ct->tuplehash[IP_CT_DIR_MAX], 0,
+	       sizeof(*ct) - offsetof(struct nf_conn, tuplehash[IP_CT_DIR_MAX]));
 	spin_lock_init(&ct->lock);
 	atomic_set(&ct->ct_general.use, 1);
 	ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple = *orig;
+	ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.pprev = NULL;
 	ct->tuplehash[IP_CT_DIR_REPLY].tuple = *repl;
+	ct->tuplehash[IP_CT_DIR_REPLY].hnnode.pprev = NULL;
 	/* Don't set timer yet: wait for confirmation */
 	setup_timer(&ct->timeout, death_by_timeout, (unsigned long)ct);
 #ifdef CONFIG_NET_NS


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH] net: nf_conntrack_alloc() should not use kmem_cache_zalloc()
  2009-07-15 12:28                         ` [PATCH] net: nf_conntrack_alloc() should not use kmem_cache_zalloc() Eric Dumazet
@ 2009-07-15 15:28                           ` Patrick McHardy
  2009-07-15 19:54                             ` [PATCH] net: nf_conntrack_alloc() fixes Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Patrick McHardy @ 2009-07-15 15:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, paulmck

Eric Dumazet wrote:
> [PATCH] net: nf_conntrack_alloc() should not use kmem_cache_zalloc()
> 
> When a slab cache uses SLAB_DESTROY_BY_RCU, we must be careful when allocating
> objects, since slab allocator could give a freed object still used by lockless
> readers.
> 
> In particular, nf_conntrack RCU lookups rely on ct->tuplehash[xxx].hnnode.next
> being always valid (ie containing a valid 'nulls' value, or a valid pointer to next
> object in hash chain.)
> 
> kmem_cache_zalloc() setups object with NULL values, but a NULL value is not valid
> for ct->tuplehash[xxx].hnnode.next.
> 
> Fix is to call kmem_cache_alloc() and do the zeroing ourself.

I think this is still racy, please see below:

> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 7508f11..23feafa 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -561,17 +561,28 @@ struct nf_conn *nf_conntrack_alloc(struct net *net,
>  		}
>  	}
>  
> -	ct = kmem_cache_zalloc(nf_conntrack_cachep, gfp);
> +	/*
> +	 * Do not use kmem_cache_zalloc(), as this cache uses
> +	 * SLAB_DESTROY_BY_RCU.
> +	 */
> +	ct = kmem_cache_alloc(nf_conntrack_cachep, gfp);
>  	if (ct == NULL) {
>  		pr_debug("nf_conntrack_alloc: Can't alloc conntrack.\n");
>  		atomic_dec(&net->ct.count);
>  		return ERR_PTR(-ENOMEM);
>  	}
> -

__nf_conntrack_find() on another CPU finds the entry at this point.

> +	/*
> +	 * Let ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.next
> +	 * and ct->tuplehash[IP_CT_DIR_REPLY].hnnode.next unchanged.
> +	 */
> +	memset(&ct->tuplehash[IP_CT_DIR_MAX], 0,
> +	       sizeof(*ct) - offsetof(struct nf_conn, tuplehash[IP_CT_DIR_MAX]));
>  	spin_lock_init(&ct->lock);
>  	atomic_set(&ct->ct_general.use, 1);

nf_conntrack_find_get() successfully tries to atomic_inc_not_zero()
at this point, following by another tuple comparison which is also
successful.

Am I missing something? I think we need to make sure the reference
count is not increased until the new tuples are visible.

>  	ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple = *orig;
> +	ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.pprev = NULL;
>  	ct->tuplehash[IP_CT_DIR_REPLY].tuple = *repl;
> +	ct->tuplehash[IP_CT_DIR_REPLY].hnnode.pprev = NULL;
>  	/* Don't set timer yet: wait for confirmation */
>  	setup_timer(&ct->timeout, death_by_timeout, (unsigned long)ct);
>  #ifdef CONFIG_NET_NS
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] net: nf_conntrack_alloc() fixes
  2009-07-15 15:28                           ` Patrick McHardy
@ 2009-07-15 19:54                             ` Eric Dumazet
  2009-07-16  9:13                               ` [PATCH] net: sock_copy() fixes Eric Dumazet
  2009-07-16 12:05                               ` [PATCH] net: nf_conntrack_alloc() fixes Patrick McHardy
  0 siblings, 2 replies; 29+ messages in thread
From: Eric Dumazet @ 2009-07-15 19:54 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: David Miller, netdev, paulmck

Patrick McHardy wrote:
> Eric Dumazet wrote:
>> [PATCH] net: nf_conntrack_alloc() should not use kmem_cache_zalloc()
>>
>> When a slab cache uses SLAB_DESTROY_BY_RCU, we must be careful when allocating
>> objects, since slab allocator could give a freed object still used by lockless
>> readers.
>>
>> In particular, nf_conntrack RCU lookups rely on ct->tuplehash[xxx].hnnode.next
>> being always valid (ie containing a valid 'nulls' value, or a valid pointer to next
>> object in hash chain.)
>>
>> kmem_cache_zalloc() setups object with NULL values, but a NULL value is not valid
>> for ct->tuplehash[xxx].hnnode.next.
>>
>> Fix is to call kmem_cache_alloc() and do the zeroing ourself.
> 
> I think this is still racy, please see below:

Nice catch indeed !

> 
>> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
>> index 7508f11..23feafa 100644
>> --- a/net/netfilter/nf_conntrack_core.c
>> +++ b/net/netfilter/nf_conntrack_core.c
>> @@ -561,17 +561,28 @@ struct nf_conn *nf_conntrack_alloc(struct net *net,
>>  		}
>>  	}
>>  
>> -	ct = kmem_cache_zalloc(nf_conntrack_cachep, gfp);
>> +	/*
>> +	 * Do not use kmem_cache_zalloc(), as this cache uses
>> +	 * SLAB_DESTROY_BY_RCU.
>> +	 */
>> +	ct = kmem_cache_alloc(nf_conntrack_cachep, gfp);
>>  	if (ct == NULL) {
>>  		pr_debug("nf_conntrack_alloc: Can't alloc conntrack.\n");
>>  		atomic_dec(&net->ct.count);
>>  		return ERR_PTR(-ENOMEM);
>>  	}
>> -
> 
> __nf_conntrack_find() on another CPU finds the entry at this point.
> 
>> +	/*
>> +	 * Let ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.next
>> +	 * and ct->tuplehash[IP_CT_DIR_REPLY].hnnode.next unchanged.
>> +	 */
>> +	memset(&ct->tuplehash[IP_CT_DIR_MAX], 0,
>> +	       sizeof(*ct) - offsetof(struct nf_conn, tuplehash[IP_CT_DIR_MAX]));
>>  	spin_lock_init(&ct->lock);
>>  	atomic_set(&ct->ct_general.use, 1);
> 
> nf_conntrack_find_get() successfully tries to atomic_inc_not_zero()
> at this point, following by another tuple comparison which is also
> successful.
> 
> Am I missing something? I think we need to make sure the reference
> count is not increased until the new tuples are visible.

Yes, you are right, and Documentation/RCU/rculist_nulls.txt should be
updated to reflect this as well (in the insert algorithm, we must use
smp_wmb() between the obj->key assignment and the refcnt assignment).

We'll have to change socket allocation too; this will be addressed
by a follow-up patch.

Thanks Patrick !

[PATCH] net: nf_conntrack_alloc() fixes

When a slab cache uses SLAB_DESTROY_BY_RCU, we must be careful when allocating
objects, since the slab allocator can hand out a freed object that is still in
use by lockless readers.

In particular, nf_conntrack RCU lookups rely on ct->tuplehash[xxx].hnnode.next
always being valid (i.e. containing a valid 'nulls' value, or a valid pointer
to the next object in the hash chain).

kmem_cache_zalloc() sets up the object with NULL values, but NULL is not valid
for ct->tuplehash[xxx].hnnode.next.

The fix is to call kmem_cache_alloc() and do the zeroing ourselves.

As spotted by Patrick, we also need to make sure lookup keys are committed to
memory before setting the refcount to 1, or a lockless reader could get a
reference on the old version of the object. Its key re-check could then pass
the barrier.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 Documentation/RCU/rculist_nulls.txt |    7 ++++++-
 net/netfilter/nf_conntrack_core.c   |   21 ++++++++++++++++++---
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/Documentation/RCU/rculist_nulls.txt b/Documentation/RCU/rculist_nulls.txt
index 93cb28d..18f9651 100644
--- a/Documentation/RCU/rculist_nulls.txt
+++ b/Documentation/RCU/rculist_nulls.txt
@@ -83,11 +83,12 @@ not detect it missed following items in original chain.
 obj = kmem_cache_alloc(...);
 lock_chain(); // typically a spin_lock()
 obj->key = key;
-atomic_inc(&obj->refcnt);
 /*
  * we need to make sure obj->key is updated before obj->next
+ * or obj->refcnt
  */
 smp_wmb();
+atomic_set(&obj->refcnt, 1);
 hlist_add_head_rcu(&obj->obj_node, list);
 unlock_chain(); // typically a spin_unlock()
 
@@ -159,6 +160,10 @@ out:
 obj = kmem_cache_alloc(cachep);
 lock_chain(); // typically a spin_lock()
 obj->key = key;
+/*
+ * changes to obj->key must be visible before refcnt one
+ */
+smp_wmb();
 atomic_set(&obj->refcnt, 1);
 /*
  * insert obj in RCU way (readers might be traversing chain)
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 7508f11..b5869b9 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -561,23 +561,38 @@ struct nf_conn *nf_conntrack_alloc(struct net *net,
 		}
 	}
 
-	ct = kmem_cache_zalloc(nf_conntrack_cachep, gfp);
+	/*
+	 * Do not use kmem_cache_zalloc(), as this cache uses
+	 * SLAB_DESTROY_BY_RCU.
+	 */
+	ct = kmem_cache_alloc(nf_conntrack_cachep, gfp);
 	if (ct == NULL) {
 		pr_debug("nf_conntrack_alloc: Can't alloc conntrack.\n");
 		atomic_dec(&net->ct.count);
 		return ERR_PTR(-ENOMEM);
 	}
-
+	/*
+	 * Let ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.next
+	 * and ct->tuplehash[IP_CT_DIR_REPLY].hnnode.next unchanged.
+	 */
+	memset(&ct->tuplehash[IP_CT_DIR_MAX], 0,
+	       sizeof(*ct) - offsetof(struct nf_conn, tuplehash[IP_CT_DIR_MAX]));
 	spin_lock_init(&ct->lock);
-	atomic_set(&ct->ct_general.use, 1);
 	ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple = *orig;
+	ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.pprev = NULL;
 	ct->tuplehash[IP_CT_DIR_REPLY].tuple = *repl;
+	ct->tuplehash[IP_CT_DIR_REPLY].hnnode.pprev = NULL;
 	/* Don't set timer yet: wait for confirmation */
 	setup_timer(&ct->timeout, death_by_timeout, (unsigned long)ct);
 #ifdef CONFIG_NET_NS
 	ct->ct_net = net;
 #endif
 
+	/*
+	 * changes to lookup keys must be done before setting refcnt to 1
+	 */
+	smp_wmb();
+	atomic_set(&ct->ct_general.use, 1);
 	return ct;
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_alloc);

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH] net: sock_copy() fixes
  2009-07-15 19:54                             ` [PATCH] net: nf_conntrack_alloc() fixes Eric Dumazet
@ 2009-07-16  9:13                               ` Eric Dumazet
  2009-07-17  1:09                                 ` David Miller
  2009-07-16 12:05                               ` [PATCH] net: nf_conntrack_alloc() fixes Patrick McHardy
  1 sibling, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2009-07-16  9:13 UTC (permalink / raw)
  To: David Miller; +Cc: Patrick McHardy, netdev, paulmck

David, I believe the following patch could be a 2.6.31 candidate
(and a -stable candidate (2.6.29/2.6.30) as well).

Note: because we clear all sock fields in sk_prot_alloc(),
setting sk_refcnt to 1 in sock_init_data() is safe, even if
the sock is not yet added to a hash table (with correct keys).
Lockless readers have to re-check their key after getting a
reference on the found object anyway, and their re-check
cannot match stale key values.

The next problem to fix (but it can wait for 2.6.32, I believe) is
to delay sk_refcnt initialization until *after* the sock lookup keys
are set up for the new socket, not right after the fields are cleared.

A lockless reader could:
- find an old sock, with old keys
      (at this point, another cpu re-allocates this sock, clears its fields,
       and sets sk_refcnt to 1)
- succeed in incrementing sk_refcnt
      (at this point, the other cpu is filling in the lookup fields (remote address/port, local address...))
- re-check the keys and find a match, while the socket is still being initialized by the other cpu.

The only way to fix this is to set sk_refcnt to a non-null value only at the
point where it is safe for a reader to find the socket and use it.

Thanks

[PATCH] net: sock_copy() fixes

Commit e912b1142be8f1e2c71c71001dc992c6e5eb2ec1
(net: sk_prot_alloc() should not blindly overwrite memory)
took care of not zeroing the whole new socket at allocation time.

sock_copy() is another spot where we should be very careful.
We must not set the refcnt to a non-null value until
we are sure the other fields are correctly set up, or
a lockless reader could catch this socket by mistake
while it is not yet fully (re)initialized.

This patch puts sk_node & sk_refcnt at the very beginning
of struct sock to ease the job of sock_copy() & sk_prot_alloc().

We add an appropriate smp_wmb() before each sk_refcnt initialization
to match our RCU requirements (changes to the sock keys must
be committed to memory before sk_refcnt is set).

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sock.h |   32 +++++++++++++++++++-------------
 net/core/sock.c    |   20 ++++++++++++++++++--
 2 files changed, 37 insertions(+), 15 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 2c0da92..ae0906e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -104,15 +104,15 @@ struct net;
 
 /**
  *	struct sock_common - minimal network layer representation of sockets
+ *	@skc_node: main hash linkage for various protocol lookup tables
+ *	@skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
+ *	@skc_refcnt: reference count
+ *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_family: network address family
  *	@skc_state: Connection state
  *	@skc_reuse: %SO_REUSEADDR setting
  *	@skc_bound_dev_if: bound device index if != 0
- *	@skc_node: main hash linkage for various protocol lookup tables
- *	@skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
  *	@skc_bind_node: bind hash linkage for various protocol lookup tables
- *	@skc_refcnt: reference count
- *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_prot: protocol handlers inside a network family
  *	@skc_net: reference to the network namespace of this socket
  *
@@ -120,17 +120,21 @@ struct net;
  *	for struct sock and struct inet_timewait_sock.
  */
 struct sock_common {
-	unsigned short		skc_family;
-	volatile unsigned char	skc_state;
-	unsigned char		skc_reuse;
-	int			skc_bound_dev_if;
+	/*
+	 * first fields are not copied in sock_copy()
+	 */
 	union {
 		struct hlist_node	skc_node;
 		struct hlist_nulls_node skc_nulls_node;
 	};
-	struct hlist_node	skc_bind_node;
 	atomic_t		skc_refcnt;
+
 	unsigned int		skc_hash;
+	unsigned short		skc_family;
+	volatile unsigned char	skc_state;
+	unsigned char		skc_reuse;
+	int			skc_bound_dev_if;
+	struct hlist_node	skc_bind_node;
 	struct proto		*skc_prot;
 #ifdef CONFIG_NET_NS
 	struct net	 	*skc_net;
@@ -208,15 +212,17 @@ struct sock {
 	 * don't add nothing before this first member (__sk_common) --acme
 	 */
 	struct sock_common	__sk_common;
+#define sk_node			__sk_common.skc_node
+#define sk_nulls_node		__sk_common.skc_nulls_node
+#define sk_refcnt		__sk_common.skc_refcnt
+
+#define sk_copy_start		__sk_common.skc_hash
+#define sk_hash			__sk_common.skc_hash
 #define sk_family		__sk_common.skc_family
 #define sk_state		__sk_common.skc_state
 #define sk_reuse		__sk_common.skc_reuse
 #define sk_bound_dev_if		__sk_common.skc_bound_dev_if
-#define sk_node			__sk_common.skc_node
-#define sk_nulls_node		__sk_common.skc_nulls_node
 #define sk_bind_node		__sk_common.skc_bind_node
-#define sk_refcnt		__sk_common.skc_refcnt
-#define sk_hash			__sk_common.skc_hash
 #define sk_prot			__sk_common.skc_prot
 #define sk_net			__sk_common.skc_net
 	kmemcheck_bitfield_begin(flags);
diff --git a/net/core/sock.c b/net/core/sock.c
index ba5d211..d9eec15 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -919,13 +919,19 @@ static inline void sock_lock_init(struct sock *sk)
 			af_family_keys + sk->sk_family);
 }
 
+/*
+ * Copy all fields from osk to nsk but nsk->sk_refcnt must not change yet,
+ * even temporarily, because of RCU lookups. sk_node should also be left as is.
+ */
 static void sock_copy(struct sock *nsk, const struct sock *osk)
 {
 #ifdef CONFIG_SECURITY_NETWORK
 	void *sptr = nsk->sk_security;
 #endif
-
-	memcpy(nsk, osk, osk->sk_prot->obj_size);
+	BUILD_BUG_ON(offsetof(struct sock, sk_copy_start) !=
+		     sizeof(osk->sk_node) + sizeof(osk->sk_refcnt));
+	memcpy(&nsk->sk_copy_start, &osk->sk_copy_start,
+	       osk->sk_prot->obj_size - offsetof(struct sock, sk_copy_start));
 #ifdef CONFIG_SECURITY_NETWORK
 	nsk->sk_security = sptr;
 	security_sk_clone(osk, nsk);
@@ -1140,6 +1146,11 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 
 		newsk->sk_err	   = 0;
 		newsk->sk_priority = 0;
+		/*
+		 * Before updating sk_refcnt, we must commit prior changes to memory
+		 * (Documentation/RCU/rculist_nulls.txt for details)
+		 */
+		smp_wmb();
 		atomic_set(&newsk->sk_refcnt, 2);
 
 		/*
@@ -1855,6 +1866,11 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 
 	sk->sk_stamp = ktime_set(-1L, 0);
 
+	/*
+	 * Before updating sk_refcnt, we must commit prior changes to memory
+	 * (Documentation/RCU/rculist_nulls.txt for details)
+	 */
+	smp_wmb();
 	atomic_set(&sk->sk_refcnt, 1);
 	atomic_set(&sk->sk_wmem_alloc, 1);
 	atomic_set(&sk->sk_drops, 0);

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH] net: nf_conntrack_alloc() fixes
  2009-07-15 19:54                             ` [PATCH] net: nf_conntrack_alloc() fixes Eric Dumazet
  2009-07-16  9:13                               ` [PATCH] net: sock_copy() fixes Eric Dumazet
@ 2009-07-16 12:05                               ` Patrick McHardy
  1 sibling, 0 replies; 29+ messages in thread
From: Patrick McHardy @ 2009-07-16 12:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, paulmck

Eric Dumazet wrote:
> [PATCH] net: nf_conntrack_alloc() fixes
> 
> When a slab cache uses SLAB_DESTROY_BY_RCU, we must be careful when allocating
> objects, since slab allocator could give a freed object still used by lockless
> readers.
> 
> In particular, nf_conntrack RCU lookups rely on ct->tuplehash[xxx].hnnode.next
> being always valid (ie containing a valid 'nulls' value, or a valid pointer to next
> object in hash chain.)
> 
> kmem_cache_zalloc() setups object with NULL values, but a NULL value is not valid
> for ct->tuplehash[xxx].hnnode.next.
> 
> Fix is to call kmem_cache_alloc() and do the zeroing ourself.
> 
> As spotted by Patrick, we also need to make sure lookup keys are committed to
> memory before setting refcount to 1, or a lockless reader could get a reference
> on the old version of the object. Its key re-check could then pass the barrier. 

Looks good to me. Applied, thanks Eric. I'll push it to -stable
with the other fixes in a couple of days.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] net: sock_copy() fixes
  2009-07-16  9:13                               ` [PATCH] net: sock_copy() fixes Eric Dumazet
@ 2009-07-17  1:09                                 ` David Miller
  0 siblings, 0 replies; 29+ messages in thread
From: David Miller @ 2009-07-17  1:09 UTC (permalink / raw)
  To: eric.dumazet; +Cc: kaber, netdev, paulmck

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 16 Jul 2009 11:13:10 +0200

> [PATCH] net: sock_copy() fixes
> 
> Commit e912b1142be8f1e2c71c71001dc992c6e5eb2ec1
> (net: sk_prot_alloc() should not blindly overwrite memory)
> took care of not zeroing whole new socket at allocation time.
> 
> sock_copy() is another spot where we should be very careful.
> We should not set refcnt to a non null value, until
> we are sure other fields are correctly setup, or
> a lockless reader could catch this socket by mistake,
> while not fully (re)initialized.
> 
> This patch puts sk_node & sk_refcnt to the very beginning
> of struct sock to ease sock_copy() & sk_prot_alloc() job.
> 
> We add appropriate smp_wmb() before sk_refcnt initializations
> to match our RCU requirements (changes to sock keys should
> be committed to memory before sk_refcnt setting)
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied and queued for -stable, thanks!

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2009-07-17  1:09 UTC | newest]

Thread overview: 29+ messages
2009-07-01 18:08 WARNING: at include/net/sock.h:417 udp_lib_unhash Tantilov, Emil S
2009-07-02  6:10 ` Eric Dumazet
2009-07-07  0:54   ` Emil S Tantilov
2009-07-07  7:21     ` Eric Dumazet
2009-07-07  7:40       ` Eric Dumazet
2009-07-07 16:14         ` [PATCH] net: sk_alloc() should not blindly overwrite memory Eric Dumazet
2009-07-07 18:33           ` Tantilov, Emil S
2009-07-07 22:33             ` [PATCH] net: sk_prot_alloc() " Eric Dumazet
2009-07-08  2:14               ` David Miller
2009-07-08  6:50                 ` Eric Dumazet
2009-07-09  5:36                   ` Eric Dumazet
2009-07-09 17:13                     ` Paul E. McKenney
2009-07-09 20:50                       ` Eric Dumazet
2009-07-12  3:27                     ` David Miller
2009-07-12  7:07                       ` Eric Dumazet
2009-07-15 12:28                         ` [PATCH] net: nf_conntrack_alloc() should not use kmem_cache_zalloc() Eric Dumazet
2009-07-15 15:28                           ` Patrick McHardy
2009-07-15 19:54                             ` [PATCH] net: nf_conntrack_alloc() fixes Eric Dumazet
2009-07-16  9:13                               ` [PATCH] net: sock_copy() fixes Eric Dumazet
2009-07-17  1:09                                 ` David Miller
2009-07-16 12:05                               ` [PATCH] net: nf_conntrack_alloc() fixes Patrick McHardy
2009-07-08 17:02                 ` [PATCH] net: sk_prot_alloc() should not blindly overwrite memory Tantilov, Emil S
2009-07-08 17:45                   ` David Miller
2009-07-08 23:21                     ` Eric Dumazet
2009-07-08 23:35                       ` Tantilov, Emil S
2009-07-09  0:20                         ` [PATCH] net: ip_push_pending_frames() fix Eric Dumazet
2009-07-09 14:32                           ` Tantilov, Emil S
2009-07-09 14:38                             ` Eric Dumazet
2009-07-12  3:27                           ` David Miller
