* Double free of dst_entry in ipv4_dst_destroy()
@ 2015-12-06 18:26 dwilder
2015-12-06 21:03 ` Eric Dumazet
0 siblings, 1 reply; 10+ messages in thread
From: dwilder @ 2015-12-06 18:26 UTC (permalink / raw)
To: netdev; +Cc: predeep, mjtarsel
Hi-
I am seeing a crash on a distro V4.2.3 kernel caused by a double
release of a dst_entry. In ipv4_dst_destroy() the call to
list_empty() finds a poisoned next pointer, indicating the dst_entry
has already been removed from the list and freed. The crash occurs
18 to 24 hours into a run of a network stress exerciser.
[172135.963496] Unable to handle kernel paging request for data at
address 0x00100108 << poison value
[172135.963913] Faulting instruction address: 0xc00000000097f5ac
[172135.964184] Oops: Kernel access of bad area, sig: 11 [#1]
[172135.964327] SMP NR_CPUS=2048 NUMA PowerNV
[172135.964649] Modules linked in: iptable_filter ip_tables x_tables
bnx2x cxlflash cxl mdio ses libcrc32c enclosure uio_pdrv_genirq uio
powernv_rng sunrpc autofs4 ipr
[172135.965403] CPU: 51 PID: 65452 Comm: hxecpu Tainted: G W
4.2.3 #2
[172135.965482] task: c000001d8a370ff0 ti: c000001e2d57c000 task.ti:
c000001e2d57c000
[172135.965564] NIP: c00000000097f5ac LR: c00000000097f59c CTR:
0000000000000000
[172135.965664] REGS: c000001e2d57f8e0 TRAP: 0300 Tainted: G
W (4.2.3)
[172135.965782] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR:
22322442 XER: 00000000
[172135.966033] CFAR: c000000000008468 DAR: 0000000000100108 DSISR:
42000000 SOFTE: 1
GPR00: c00000000097f59c c000001e2d57fb60
c00000000151ad00 c000001e504de300
GPR04: 0000000000000001 0000000000000000
c000001fff8af370 c00000000141ad00
GPR08: c0000000016aad00 0000000000200200
0000000000100100 3d7473696c5f6465
GPR12: 6531303030303063 c000000007b5e480
0000000000000012 0000000000000001
GPR16: 0000000000000000 c000000001431280
c000000000ad38f0 7fffffffffffffff
GPR20: 0000000000000000 c000001e28caf000
c000001e2d57c000 c000000001429b80
GPR24: 000000000000000a 0000000000000000
c000001e504ddb30 0000000000000000
GPR28: 0000000000000000 c000001e2d57c000
c000001e504de300 c000001e28caf000 <<< pointer to dst
[172135.967772] NIP [c00000000097f5ac] ipv4_dst_destroy+0x8c/0xc0
[172135.967921] LR [c00000000097f59c] ipv4_dst_destroy+0x7c/0xc0
[172135.968076] Call Trace:
[172135.968133] [c000001e2d57fb60] [c00000000097f59c]
ipv4_dst_destroy+0x7c/0xc0 (unreliable)
[172135.968306] [c000001e2d57fbd0] [c00000000093b8e0] dst_destroy+0xf0/0x1a0
[172135.968452] [c000001e2d57fc10] [c00000000093bc68]
dst_destroy_rcu+0x28/0x50
[172135.968600] [c000001e2d57fc40] [c0000000001397a0]
rcu_process_callbacks+0x340/0x6f0
[172135.968768] [c000001e2d57fcf0] [c0000000000ba7d8] __do_softirq+0x188/0x3a0
[172135.968913] [c000001e2d57fde0] [c0000000000bac68] irq_exit+0xc8/0x100
[172135.969056] [c000001e2d57fe00] [c00000000001f734]
timer_interrupt+0xa4/0xe0
[172135.969223] [c000001e2d57fe30] [c000000000002714]
decrementer_common+0x114/0x180
[172135.969395] Instruction dump:
[172135.969492] 7fe5fb78 38842968 7fc6f378 3863e580 4811c989 60000000
7fc3f378 481154c1
[172135.969748] 60000000 e93f00b8 e95f00b0 7fc3f378 <f92a0008>
f9490000 3d200010 61290100
[172135.970009] ---[ end trace 34f3693ddc2d5aea ]---
----------------------------------------------------------------------------------------------------------
I added a call to dump_stack() into dst_release() in an attempt to
catch the two calls to dst_release().
-- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -307,6 +307,12 @@ void dst_release(struct dst_entry *dst)
net_warn_ratelimited("%s: dst:%p refcnt:%d\n",
__func__, dst, newrefcnt);
if (!newrefcnt && unlikely(dst->flags & DST_NOCACHE))
+
+ if (!list_empty(&rt->rt_uncached)) {
+ printk("%s: dst=%p\n",__func__,dst);
+ dump_stack();
+ }
+
call_rcu(&dst->rcu_head, dst_destroy_rcu);
}
}
I got lucky and caught the following stack traces on my next run.
[26211.699357] tcp_v4_do_rcv: sk=c000001d10a00000 skb=c000001d22c61d00
dst=c000001d22c62500 sk->sk_rx_dst=c000001d22c62500
[26211.699527] dst_release: dst=c000001d22c62500
[26211.699626] CPU: 51 PID: 23317 Comm: hxecom Tainted: G W
4.2.3 #4
[26211.699632] Call Trace:
[26211.699678] [c000001cf0387440] [c000000000a9dcd4]
dump_stack+0x90/0xbc (unreliable)
[26211.699829] [c000001cf0387470] [c00000000093bf80] dst_release+0x110/0x120
[26211.699875] [c000001cf03874e0] [c0000000009b4024] tcp_v4_do_rcv+0x4d4/0x4f0
[26211.699979] [c000001cf0387580] [c0000000009b7834] tcp_v4_rcv+0xb74/0xb90
[26211.700027] [c000001cf0387660] [c000000000984bb8]
ip_local_deliver_finish+0x178/0x350
[26211.700123] [c000001cf03876b0] [c0000000009853bc]
ip_local_deliver+0x4c/0x120
[26211.700181] [c000001cf0387720] [c000000000984eb4] ip_rcv_finish+0x124/0x420
[26211.700255] [c000001cf03877a0] [c0000000009857a4] ip_rcv+0x314/0x440
[26211.700312] [c000001cf0387830] [c00000000092b094]
__netif_receive_skb_core+0xa14/0xd60
[26211.700334] [c000001cf0387910] [c00000000092e924]
netif_receive_skb_internal+0x34/0xd0
[26211.700370] [c000001cf0387950] [c00000000092fa5c]
napi_gro_receive+0xec/0x1b0
[26211.700449] [c000001cf0387990] [d0000000165a59f0]
bnx2x_rx_int+0x1450/0x1700 [bnx2x]
[26211.700524] [c000001cf0387b30] [d0000000165a6580]
bnx2x_poll+0x310/0x440 [bnx2x]
[26211.700551] [c000001cf0387be0] [c00000000092f0dc] net_rx_action+0x2dc/0x470
[26211.700565] [c000001cf0387cf0] [c0000000000ba7d8] __do_softirq+0x188/0x3a0
[26211.700629] [c000001cf0387de0] [c0000000000bac68] irq_exit+0xc8/0x100
[26211.700657] [c000001cf0387e00] [c00000000001f734] timer_interrupt+0xa4/0xe0
[26211.700664] [c000001cf0387e30] [c000000000002714]
decrementer_common+0x114/0x180
-----------------------------------------------------------------------------------------------------------------------------------
[26211.700709] tcp_v4_do_rcv: sk=c000001d10a00000 skb=c000001d22c61d00
dst=c000001d22c62500 sk->sk_rx_dst=c000001d22c62500
[26211.700713] dst_release: dst=c000001d22c62500
[26211.700719] CPU: 34 PID: 23342 Comm: hxecom Tainted: G W
4.2.3 #4
[26211.700726] Call Trace:
[26211.700735] [c000001cf4a97920] [c000000000a9dcd4]
dump_stack+0x90/0xbc (unreliable)
[26211.700742] [c000001cf4a97950] [c00000000093bf80] dst_release+0x110/0x120
[26211.700749] [c000001cf4a979c0] [c0000000009b4024] tcp_v4_do_rcv+0x4d4/0x4f0
[26211.700759] [c000001cf4a97a60] [c00000000090c72c] release_sock+0xec/0x1e0
[26211.700772] [c000001cf4a97ab0] [c000000000998b98] tcp_recvmsg+0x3f8/0xce0
[26211.700778] [c000001cf4a97bd0] [c0000000009d31dc] inet_recvmsg+0x9c/0x110
[26211.700785] [c000001cf4a97c30] [c000000000906d24] sock_recvmsg+0x84/0xb0
[26211.700794] [c000001cf4a97c70] [c0000000009088ac] SyS_recvfrom+0xdc/0x1a0
[26211.700802] [c000001cf4a97dc0] [c000000000909a38]
SyS_socketcall+0x2d8/0x430
[26211.700809] [c000001cf4a97e30] [c000000000009204] system_call+0x38/0xb4
(Corresponding Oops message cut out for brevity, the double free
occurred on dst_entry c000001d22c62500, same entry as in both the
above traces.)
This is what I think is happening:
In the first thread tcp_v4_do_rcv() calls dst_release(), the reference
count becomes zero scheduling a rcu callback of dst_destroy_rcu() and
setting sk->sk_rx_dst = NULL.
tcp_rcv_established() is called from tcp_v4_do_rcv(), finding
sk->sk_rx_dst = NULL it re-referencing the same dst_entry from
skb->_skb_refdst taking a reference, the reference count is now 1 .
At this point dst_destroy_rcu() is scheduled to run against a
dst_entry that has a reference count of 1!
The second thread (a user read) follows the same path calling
dst_release() and scheduling a second callback for the same dst_entry
eventually causing the crash when the callback runs.
To test my theory I added the following check in tcp_v4_do_rcv().
---------------
If the dst released is the same dst as in skb->refdst, skip the call
to dst_release().
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index db00343..7f058f9 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1377,8 +1377,10 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
if (dst) {
if (inet_sk(sk)->rx_dst_ifindex != skb->skb_iif ||
!dst->ops->check(dst, 0)) {
- dst_release(dst);
- sk->sk_rx_dst = NULL;
+ if (dst != skb_dst(skb)) {
+ dst_release(dst);
+ sk->sk_rx_dst = NULL;
+ }
}
}
tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len);
With this change the test has run for 48 hours with out a crash. I
don't fully understand the conditions that must occur to set this up,
therefor this may be a workaround not a fix. What are your thoughts?
David Wilder
dwilder@us.ibm.com
^ permalink raw reply related [flat|nested] 10+ messages in thread* Re: Double free of dst_entry in ipv4_dst_destroy()
2015-12-06 18:26 Double free of dst_entry in ipv4_dst_destroy() dwilder
@ 2015-12-06 21:03 ` Eric Dumazet
2015-12-07 1:58 ` Eric Dumazet
0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2015-12-06 21:03 UTC (permalink / raw)
To: dwilder; +Cc: netdev, predeep, mjtarsel
On Sun, 2015-12-06 at 13:26 -0500, dwilder@us.ibm.com wrote:
> Hi-
>
> I am seeing a crash on a distro V4.2.3 kernel caused by a double
> release of a dst_entry. In ipv4_dst_destroy() the call to
> list_empty() finds a poisoned next pointer, indicating the dst_entry
> has already been removed from the list and freed. The crash occurs
> 18 to 24 hours into a run of a network stress exerciser.
>
> [172135.963496] Unable to handle kernel paging request for data at
> address 0x00100108 << poison value
> [172135.963913] Faulting instruction address: 0xc00000000097f5ac
> [172135.964184] Oops: Kernel access of bad area, sig: 11 [#1]
> [172135.964327] SMP NR_CPUS=2048 NUMA PowerNV
> [172135.964649] Modules linked in: iptable_filter ip_tables x_tables
> bnx2x cxlflash cxl mdio ses libcrc32c enclosure uio_pdrv_genirq uio
> powernv_rng sunrpc autofs4 ipr
> [172135.965403] CPU: 51 PID: 65452 Comm: hxecpu Tainted: G W
> 4.2.3 #2
> [172135.965482] task: c000001d8a370ff0 ti: c000001e2d57c000 task.ti:
> c000001e2d57c000
> [172135.965564] NIP: c00000000097f5ac LR: c00000000097f59c CTR:
> 0000000000000000
> [172135.965664] REGS: c000001e2d57f8e0 TRAP: 0300 Tainted: G
> W (4.2.3)
> [172135.965782] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR:
> 22322442 XER: 00000000
> [172135.966033] CFAR: c000000000008468 DAR: 0000000000100108 DSISR:
> 42000000 SOFTE: 1
> GPR00: c00000000097f59c c000001e2d57fb60
> c00000000151ad00 c000001e504de300
> GPR04: 0000000000000001 0000000000000000
> c000001fff8af370 c00000000141ad00
> GPR08: c0000000016aad00 0000000000200200
> 0000000000100100 3d7473696c5f6465
> GPR12: 6531303030303063 c000000007b5e480
> 0000000000000012 0000000000000001
> GPR16: 0000000000000000 c000000001431280
> c000000000ad38f0 7fffffffffffffff
> GPR20: 0000000000000000 c000001e28caf000
> c000001e2d57c000 c000000001429b80
> GPR24: 000000000000000a 0000000000000000
> c000001e504ddb30 0000000000000000
> GPR28: 0000000000000000 c000001e2d57c000
> c000001e504de300 c000001e28caf000 <<< pointer to dst
> [172135.967772] NIP [c00000000097f5ac] ipv4_dst_destroy+0x8c/0xc0
> [172135.967921] LR [c00000000097f59c] ipv4_dst_destroy+0x7c/0xc0
> [172135.968076] Call Trace:
> [172135.968133] [c000001e2d57fb60] [c00000000097f59c]
> ipv4_dst_destroy+0x7c/0xc0 (unreliable)
> [172135.968306] [c000001e2d57fbd0] [c00000000093b8e0] dst_destroy+0xf0/0x1a0
> [172135.968452] [c000001e2d57fc10] [c00000000093bc68]
> dst_destroy_rcu+0x28/0x50
> [172135.968600] [c000001e2d57fc40] [c0000000001397a0]
> rcu_process_callbacks+0x340/0x6f0
> [172135.968768] [c000001e2d57fcf0] [c0000000000ba7d8] __do_softirq+0x188/0x3a0
> [172135.968913] [c000001e2d57fde0] [c0000000000bac68] irq_exit+0xc8/0x100
> [172135.969056] [c000001e2d57fe00] [c00000000001f734]
> timer_interrupt+0xa4/0xe0
> [172135.969223] [c000001e2d57fe30] [c000000000002714]
> decrementer_common+0x114/0x180
> [172135.969395] Instruction dump:
> [172135.969492] 7fe5fb78 38842968 7fc6f378 3863e580 4811c989 60000000
> 7fc3f378 481154c1
> [172135.969748] 60000000 e93f00b8 e95f00b0 7fc3f378 <f92a0008>
> f9490000 3d200010 61290100
> [172135.970009] ---[ end trace 34f3693ddc2d5aea ]---
> ----------------------------------------------------------------------------------------------------------
>
> I added a call to dump_stack() into dst_release() in an attempt to
> catch the two calls to dst_release().
>
> -- a/net/core/dst.c
> +++ b/net/core/dst.c
> @@ -307,6 +307,12 @@ void dst_release(struct dst_entry *dst)
> net_warn_ratelimited("%s: dst:%p refcnt:%d\n",
> __func__, dst, newrefcnt);
> if (!newrefcnt && unlikely(dst->flags & DST_NOCACHE))
> +
> + if (!list_empty(&rt->rt_uncached)) {
> + printk("%s: dst=%p\n",__func__,dst);
> + dump_stack();
> + }
> +
> call_rcu(&dst->rcu_head, dst_destroy_rcu);
> }
> }
>
> I got lucky and caught the following stack traces on my next run.
>
> [26211.699357] tcp_v4_do_rcv: sk=c000001d10a00000 skb=c000001d22c61d00
> dst=c000001d22c62500 sk->sk_rx_dst=c000001d22c62500
> [26211.699527] dst_release: dst=c000001d22c62500
> [26211.699626] CPU: 51 PID: 23317 Comm: hxecom Tainted: G W
> 4.2.3 #4
> [26211.699632] Call Trace:
> [26211.699678] [c000001cf0387440] [c000000000a9dcd4]
> dump_stack+0x90/0xbc (unreliable)
> [26211.699829] [c000001cf0387470] [c00000000093bf80] dst_release+0x110/0x120
> [26211.699875] [c000001cf03874e0] [c0000000009b4024] tcp_v4_do_rcv+0x4d4/0x4f0
> [26211.699979] [c000001cf0387580] [c0000000009b7834] tcp_v4_rcv+0xb74/0xb90
> [26211.700027] [c000001cf0387660] [c000000000984bb8]
> ip_local_deliver_finish+0x178/0x350
> [26211.700123] [c000001cf03876b0] [c0000000009853bc]
> ip_local_deliver+0x4c/0x120
> [26211.700181] [c000001cf0387720] [c000000000984eb4] ip_rcv_finish+0x124/0x420
> [26211.700255] [c000001cf03877a0] [c0000000009857a4] ip_rcv+0x314/0x440
> [26211.700312] [c000001cf0387830] [c00000000092b094]
> __netif_receive_skb_core+0xa14/0xd60
> [26211.700334] [c000001cf0387910] [c00000000092e924]
> netif_receive_skb_internal+0x34/0xd0
> [26211.700370] [c000001cf0387950] [c00000000092fa5c]
> napi_gro_receive+0xec/0x1b0
> [26211.700449] [c000001cf0387990] [d0000000165a59f0]
> bnx2x_rx_int+0x1450/0x1700 [bnx2x]
> [26211.700524] [c000001cf0387b30] [d0000000165a6580]
> bnx2x_poll+0x310/0x440 [bnx2x]
> [26211.700551] [c000001cf0387be0] [c00000000092f0dc] net_rx_action+0x2dc/0x470
> [26211.700565] [c000001cf0387cf0] [c0000000000ba7d8] __do_softirq+0x188/0x3a0
> [26211.700629] [c000001cf0387de0] [c0000000000bac68] irq_exit+0xc8/0x100
> [26211.700657] [c000001cf0387e00] [c00000000001f734] timer_interrupt+0xa4/0xe0
> [26211.700664] [c000001cf0387e30] [c000000000002714]
> decrementer_common+0x114/0x180
> -----------------------------------------------------------------------------------------------------------------------------------
> [26211.700709] tcp_v4_do_rcv: sk=c000001d10a00000 skb=c000001d22c61d00
> dst=c000001d22c62500 sk->sk_rx_dst=c000001d22c62500
> [26211.700713] dst_release: dst=c000001d22c62500
> [26211.700719] CPU: 34 PID: 23342 Comm: hxecom Tainted: G W
> 4.2.3 #4
> [26211.700726] Call Trace:
> [26211.700735] [c000001cf4a97920] [c000000000a9dcd4]
> dump_stack+0x90/0xbc (unreliable)
> [26211.700742] [c000001cf4a97950] [c00000000093bf80] dst_release+0x110/0x120
> [26211.700749] [c000001cf4a979c0] [c0000000009b4024] tcp_v4_do_rcv+0x4d4/0x4f0
> [26211.700759] [c000001cf4a97a60] [c00000000090c72c] release_sock+0xec/0x1e0
> [26211.700772] [c000001cf4a97ab0] [c000000000998b98] tcp_recvmsg+0x3f8/0xce0
> [26211.700778] [c000001cf4a97bd0] [c0000000009d31dc] inet_recvmsg+0x9c/0x110
> [26211.700785] [c000001cf4a97c30] [c000000000906d24] sock_recvmsg+0x84/0xb0
> [26211.700794] [c000001cf4a97c70] [c0000000009088ac] SyS_recvfrom+0xdc/0x1a0
> [26211.700802] [c000001cf4a97dc0] [c000000000909a38]
> SyS_socketcall+0x2d8/0x430
> [26211.700809] [c000001cf4a97e30] [c000000000009204] system_call+0x38/0xb4
>
> (Corresponding Oops message cut out for brevity, the double free
> occurred on dst_entry c000001d22c62500, same entry as in both the
> above traces.)
>
> This is what I think is happening:
>
> In the first thread tcp_v4_do_rcv() calls dst_release(), the reference
> count becomes zero scheduling a rcu callback of dst_destroy_rcu() and
> setting sk->sk_rx_dst = NULL.
>
> tcp_rcv_established() is called from tcp_v4_do_rcv(), finding
> sk->sk_rx_dst = NULL it re-referencing the same dst_entry from
> skb->_skb_refdst taking a reference, the reference count is now 1 .
> At this point dst_destroy_rcu() is scheduled to run against a
> dst_entry that has a reference count of 1!
>
demux code (tcp_v4_early_demux()) only works if we can always
set skb->dst to a (previously cached) dst, even with refcnt==0
Otherwise, we would need to always increment dst refcnt (and only if
refcnt is not 0), that would be quite expensive (2 additional atomic op
per input packet)
This means that dst_check() fails to notice the dst should not be reused
(once it is scheduled for destruction)
But then when later we promote a skb->dst to a refctounted one
(skb_dst_force(), we might make sure we abort the operation if __refcnt
== 0 ( and DST_NOCACHE is in dst->flags)
> The second thread (a user read) follows the same path calling
> dst_release() and scheduling a second callback for the same dst_entry
> eventually causing the crash when the callback runs.
>
> To test my theory I added the following check in tcp_v4_do_rcv().
>
> ---------------
> If the dst released is the same dst as in skb->refdst, skip the call
> to dst_release().
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index db00343..7f058f9 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1377,8 +1377,10 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
> if (dst) {
> if (inet_sk(sk)->rx_dst_ifindex != skb->skb_iif ||
> !dst->ops->check(dst, 0)) {
> - dst_release(dst);
> - sk->sk_rx_dst = NULL;
> + if (dst != skb_dst(skb)) {
> + dst_release(dst);
> + sk->sk_rx_dst = NULL;
> + }
> }
> }
> tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len);
>
>
> With this change the test has run for 48 hours with out a crash. I
> don't fully understand the conditions that must occur to set this up,
> therefor this may be a workaround not a fix. What are your thoughts?
Well, sure, not doing this avoid to trigger the fault, but does not find
root cause.
Keeping the obsolete dst in socket cache is wrong.
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: Double free of dst_entry in ipv4_dst_destroy()
2015-12-06 21:03 ` Eric Dumazet
@ 2015-12-07 1:58 ` Eric Dumazet
2015-12-11 14:23 ` Eric Dumazet
0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2015-12-07 1:58 UTC (permalink / raw)
To: dwilder; +Cc: netdev, predeep, mjtarsel
On Sun, 2015-12-06 at 13:03 -0800, Eric Dumazet wrote:
> But then when later we promote a skb->dst to a refctounted one
> (skb_dst_force(), we might make sure we abort the operation if __refcnt
> == 0 ( and DST_NOCACHE is in dst->flags)
>
Minimum patch would be :
diff --git a/include/net/dst.h b/include/net/dst.h
index 1279f9b09791..b9a3239f4296 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -322,6 +322,24 @@ static inline void skb_dst_force(struct sk_buff *skb)
}
}
+/**
+ * skb_dst_force_safe - makes sure skb dst is refcounted
+ * @skb: buffer
+ *
+ * If dst is not yet refcounted and not destroyed, grab a ref on it.
+ */
+static inline void skb_dst_force_safe(struct sk_buff *skb)
+{
+ if (skb_dst_is_noref(skb)) {
+ struct dst_entry *dst = skb_dst(skb);
+
+ if (!atomic_inc_not_zero(&dst->__refcnt))
+ dst = NULL;
+
+ skb->_skb_refdst = (unsigned long)dst;
+ }
+}
+
/**
* __skb_tunnel_rx - prepare skb for rx reinsert
diff --git a/include/net/sock.h b/include/net/sock.h
index b1d475b5db68..6367d1112c3d 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -816,7 +816,7 @@ void sk_stream_write_space(struct sock *sk);
static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
{
/* dont let skb dst not refcounted, we are going to leave rcu lock */
- skb_dst_force(skb);
+ skb_dst_force_safe(skb);
if (!sk->sk_backlog.tail)
sk->sk_backlog.head = skb;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index db003438aaf5..a72688a288f5 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1493,7 +1493,7 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
if (likely(sk->sk_rx_dst))
skb_dst_drop(skb);
else
- skb_dst_force(skb);
+ skb_dst_force_safe(skb);
__skb_queue_tail(&tp->ucopy.prequeue, skb);
tp->ucopy.memory += skb->truesize;
^ permalink raw reply related [flat|nested] 10+ messages in thread* Re: Double free of dst_entry in ipv4_dst_destroy()
2015-12-07 1:58 ` Eric Dumazet
@ 2015-12-11 14:23 ` Eric Dumazet
2015-12-11 15:48 ` Eric Dumazet
0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2015-12-11 14:23 UTC (permalink / raw)
To: dwilder; +Cc: netdev, predeep, mjtarsel
On Sun, 2015-12-06 at 17:58 -0800, Eric Dumazet wrote:
> On Sun, 2015-12-06 at 13:03 -0800, Eric Dumazet wrote:
>
> > But then when later we promote a skb->dst to a refctounted one
> > (skb_dst_force(), we might make sure we abort the operation if __refcnt
> > == 0 ( and DST_NOCACHE is in dst->flags)
> >
>
> Minimum patch would be :
>
Here is a more complete patch, it should fix the issue I think :
diff --git a/include/net/dst.h b/include/net/dst.h
index 1279f9b09791..b9a3239f4296 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -322,6 +322,24 @@ static inline void skb_dst_force(struct sk_buff *skb)
}
}
+/**
+ * skb_dst_force_safe - makes sure skb dst is refcounted
+ * @skb: buffer
+ *
+ * If dst is not yet refcounted and not destroyed, grab a ref on it.
+ */
+static inline void skb_dst_force_safe(struct sk_buff *skb)
+{
+ if (skb_dst_is_noref(skb)) {
+ struct dst_entry *dst = skb_dst(skb);
+
+ if (!atomic_inc_not_zero(&dst->__refcnt))
+ dst = NULL;
+
+ skb->_skb_refdst = (unsigned long)dst;
+ }
+}
+
/**
* __skb_tunnel_rx - prepare skb for rx reinsert
diff --git a/include/net/sock.h b/include/net/sock.h
index b1d475b5db68..6367d1112c3d 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -816,7 +816,7 @@ void sk_stream_write_space(struct sock *sk);
static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
{
/* dont let skb dst not refcounted, we are going to leave rcu lock */
- skb_dst_force(skb);
+ skb_dst_force_safe(skb);
if (!sk->sk_backlog.tail)
sk->sk_backlog.head = skb;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index db003438aaf5..00f21ba7435d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1493,7 +1493,7 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
if (likely(sk->sk_rx_dst))
skb_dst_drop(skb);
else
- skb_dst_force(skb);
+ skb_dst_force_safe(skb);
__skb_queue_tail(&tp->ucopy.prequeue, skb);
tp->ucopy.memory += skb->truesize;
@@ -1721,8 +1721,7 @@ void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
{
struct dst_entry *dst = skb_dst(skb);
- if (dst) {
- dst_hold(dst);
+ if (dst && atomic_inc_not_zero(&dst->__refcnt)) {
sk->sk_rx_dst = dst;
inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
}
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index e7aab561b7b4..4be69d177440 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -93,10 +93,9 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
{
struct dst_entry *dst = skb_dst(skb);
- if (dst) {
+ if (dst && atomic_inc_not_zero(&dst->__refcnt)) {
const struct rt6_info *rt = (const struct rt6_info *)dst;
- dst_hold(dst);
sk->sk_rx_dst = dst;
inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
inet6_sk(sk)->rx_dst_cookie = rt6_get_cookie(rt);
^ permalink raw reply related [flat|nested] 10+ messages in thread* Re: Double free of dst_entry in ipv4_dst_destroy()
2015-12-11 14:23 ` Eric Dumazet
@ 2015-12-11 15:48 ` Eric Dumazet
2015-12-11 16:47 ` Eric Dumazet
0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2015-12-11 15:48 UTC (permalink / raw)
To: dwilder; +Cc: netdev, predeep, mjtarsel
On Fri, 2015-12-11 at 06:23 -0800, Eric Dumazet wrote:
> On Sun, 2015-12-06 at 17:58 -0800, Eric Dumazet wrote:
> > On Sun, 2015-12-06 at 13:03 -0800, Eric Dumazet wrote:
> >
> > > But then when later we promote a skb->dst to a refctounted one
> > > (skb_dst_force(), we might make sure we abort the operation if __refcnt
> > > == 0 ( and DST_NOCACHE is in dst->flags)
> > >
> >
> > Minimum patch would be :
> >
>
> Here is a more complete patch, it should fix the issue I think :
Hmm, I'll send a v3, I forgot to test DST_NOCACHE properly.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Double free of dst_entry in ipv4_dst_destroy()
2015-12-11 15:48 ` Eric Dumazet
@ 2015-12-11 16:47 ` Eric Dumazet
2015-12-14 16:28 ` dwilder
0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2015-12-11 16:47 UTC (permalink / raw)
To: dwilder; +Cc: netdev, predeep, mjtarsel
On Fri, 2015-12-11 at 07:48 -0800, Eric Dumazet wrote:
> On Fri, 2015-12-11 at 06:23 -0800, Eric Dumazet wrote:
> > On Sun, 2015-12-06 at 17:58 -0800, Eric Dumazet wrote:
> > > On Sun, 2015-12-06 at 13:03 -0800, Eric Dumazet wrote:
> > >
> > > > But then when later we promote a skb->dst to a refctounted one
> > > > (skb_dst_force(), we might make sure we abort the operation if __refcnt
> > > > == 0 ( and DST_NOCACHE is in dst->flags)
> > > >
> > >
> > > Minimum patch would be :
> > >
> >
> > Here is a more complete patch, it should fix the issue I think :
>
> Hmm, I'll send a v3, I forgot to test DST_NOCACHE properly.
>
David, please test the following patch, thanks !
include/net/dst.h | 33 +++++++++++++++++++++++++++++++++
include/net/sock.h | 2 +-
net/ipv4/tcp_ipv4.c | 5 ++---
net/ipv6/tcp_ipv6.c | 3 +--
4 files changed, 37 insertions(+), 6 deletions(-)
diff --git a/include/net/dst.h b/include/net/dst.h
index 1279f9b09791..c7329dcd90cc 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -322,6 +322,39 @@ static inline void skb_dst_force(struct sk_buff *skb)
}
}
+/**
+ * dst_hold_safe - Take a reference on a dst if possible
+ * @dst: pointer to dst entry
+ *
+ * This helper returns false if it could not safely
+ * take a reference on a dst.
+ */
+static inline bool dst_hold_safe(struct dst_entry *dst)
+{
+ if (dst->flags & DST_NOCACHE)
+ return atomic_inc_not_zero(&dst->__refcnt);
+ dst_hold(dst);
+ return true;
+}
+
+/**
+ * skb_dst_force_safe - makes sure skb dst is refcounted
+ * @skb: buffer
+ *
+ * If dst is not yet refcounted and not destroyed, grab a ref on it.
+ */
+static inline void skb_dst_force_safe(struct sk_buff *skb)
+{
+ if (skb_dst_is_noref(skb)) {
+ struct dst_entry *dst = skb_dst(skb);
+
+ if (!dst_hold_safe(dst))
+ dst = NULL;
+
+ skb->_skb_refdst = (unsigned long)dst;
+ }
+}
+
/**
* __skb_tunnel_rx - prepare skb for rx reinsert
diff --git a/include/net/sock.h b/include/net/sock.h
index eaef41433d7a..18322bded064 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -816,7 +816,7 @@ void sk_stream_write_space(struct sock *sk);
static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
{
/* dont let skb dst not refcounted, we are going to leave rcu lock */
- skb_dst_force(skb);
+ skb_dst_force_safe(skb);
if (!sk->sk_backlog.tail)
sk->sk_backlog.head = skb;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index db003438aaf5..d8841a2f1569 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1493,7 +1493,7 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
if (likely(sk->sk_rx_dst))
skb_dst_drop(skb);
else
- skb_dst_force(skb);
+ skb_dst_force_safe(skb);
__skb_queue_tail(&tp->ucopy.prequeue, skb);
tp->ucopy.memory += skb->truesize;
@@ -1721,8 +1721,7 @@ void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
{
struct dst_entry *dst = skb_dst(skb);
- if (dst) {
- dst_hold(dst);
+ if (dst && dst_hold_safe(dst)) {
sk->sk_rx_dst = dst;
inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
}
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index e7aab561b7b4..6b8a8a9091fa 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -93,10 +93,9 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
{
struct dst_entry *dst = skb_dst(skb);
- if (dst) {
+ if (dst && dst_hold_safe(dst)) {
const struct rt6_info *rt = (const struct rt6_info *)dst;
- dst_hold(dst);
sk->sk_rx_dst = dst;
inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
inet6_sk(sk)->rx_dst_cookie = rt6_get_cookie(rt);
^ permalink raw reply related [flat|nested] 10+ messages in thread* Re: Double free of dst_entry in ipv4_dst_destroy()
2015-12-11 16:47 ` Eric Dumazet
@ 2015-12-14 16:28 ` dwilder
2015-12-14 16:36 ` Eric Dumazet
0 siblings, 1 reply; 10+ messages in thread
From: dwilder @ 2015-12-14 16:28 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, predeep, mjtarsel
Eric -
With this patch applied the test ran clean for 2 days.
Thanks for your help.
Quoting Eric Dumazet <eric.dumazet@gmail.com>:
> On Fri, 2015-12-11 at 07:48 -0800, Eric Dumazet wrote:
>> On Fri, 2015-12-11 at 06:23 -0800, Eric Dumazet wrote:
>> > On Sun, 2015-12-06 at 17:58 -0800, Eric Dumazet wrote:
>> > > On Sun, 2015-12-06 at 13:03 -0800, Eric Dumazet wrote:
>> > >
>> > > > But then when later we promote a skb->dst to a refctounted one
>> > > > (skb_dst_force(), we might make sure we abort the operation
>> if __refcnt
>> > > > == 0 ( and DST_NOCACHE is in dst->flags)
>> > > >
>> > >
>> > > Minimum patch would be :
>> > >
>> >
>> > Here is a more complete patch, it should fix the issue I think :
>>
>> Hmm, I'll send a v3, I forgot to test DST_NOCACHE properly.
>>
>
> David, please test the following patch, thanks !
>
> include/net/dst.h | 33 +++++++++++++++++++++++++++++++++
> include/net/sock.h | 2 +-
> net/ipv4/tcp_ipv4.c | 5 ++---
> net/ipv6/tcp_ipv6.c | 3 +--
> 4 files changed, 37 insertions(+), 6 deletions(-)
>
> diff --git a/include/net/dst.h b/include/net/dst.h
> index 1279f9b09791..c7329dcd90cc 100644
> --- a/include/net/dst.h
> +++ b/include/net/dst.h
> @@ -322,6 +322,39 @@ static inline void skb_dst_force(struct sk_buff *skb)
> }
> }
>
> +/**
> + * dst_hold_safe - Take a reference on a dst if possible
> + * @dst: pointer to dst entry
> + *
> + * This helper returns false if it could not safely
> + * take a reference on a dst.
> + */
> +static inline bool dst_hold_safe(struct dst_entry *dst)
> +{
> + if (dst->flags & DST_NOCACHE)
> + return atomic_inc_not_zero(&dst->__refcnt);
> + dst_hold(dst);
> + return true;
> +}
> +
> +/**
> + * skb_dst_force_safe - makes sure skb dst is refcounted
> + * @skb: buffer
> + *
> + * If dst is not yet refcounted and not destroyed, grab a ref on it.
> + */
> +static inline void skb_dst_force_safe(struct sk_buff *skb)
> +{
> + if (skb_dst_is_noref(skb)) {
> + struct dst_entry *dst = skb_dst(skb);
> +
> + if (!dst_hold_safe(dst))
> + dst = NULL;
> +
> + skb->_skb_refdst = (unsigned long)dst;
> + }
> +}
> +
>
> /**
> * __skb_tunnel_rx - prepare skb for rx reinsert
> diff --git a/include/net/sock.h b/include/net/sock.h
> index eaef41433d7a..18322bded064 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -816,7 +816,7 @@ void sk_stream_write_space(struct sock *sk);
> static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
> {
> /* dont let skb dst not refcounted, we are going to leave rcu lock */
> - skb_dst_force(skb);
> + skb_dst_force_safe(skb);
>
> if (!sk->sk_backlog.tail)
> sk->sk_backlog.head = skb;
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index db003438aaf5..d8841a2f1569 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1493,7 +1493,7 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
> if (likely(sk->sk_rx_dst))
> skb_dst_drop(skb);
> else
> - skb_dst_force(skb);
> + skb_dst_force_safe(skb);
>
> __skb_queue_tail(&tp->ucopy.prequeue, skb);
> tp->ucopy.memory += skb->truesize;
> @@ -1721,8 +1721,7 @@ void inet_sk_rx_dst_set(struct sock *sk, const
> struct sk_buff *skb)
> {
> struct dst_entry *dst = skb_dst(skb);
>
> - if (dst) {
> - dst_hold(dst);
> + if (dst && dst_hold_safe(dst)) {
> sk->sk_rx_dst = dst;
> inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
> }
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index e7aab561b7b4..6b8a8a9091fa 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -93,10 +93,9 @@ static void inet6_sk_rx_dst_set(struct sock *sk,
> const struct sk_buff *skb)
> {
> struct dst_entry *dst = skb_dst(skb);
>
> - if (dst) {
> + if (dst && dst_hold_safe(dst)) {
> const struct rt6_info *rt = (const struct rt6_info *)dst;
>
> - dst_hold(dst);
> sk->sk_rx_dst = dst;
> inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
> inet6_sk(sk)->rx_dst_cookie = rt6_get_cookie(rt);
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: Double free of dst_entry in ipv4_dst_destroy()
2015-12-14 16:28 ` dwilder
@ 2015-12-14 16:36 ` Eric Dumazet
2015-12-14 22:08 ` [PATCH net] net: fix IP early demux races Eric Dumazet
0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2015-12-14 16:36 UTC (permalink / raw)
To: dwilder; +Cc: netdev, predeep, mjtarsel
On Mon, 2015-12-14 at 11:28 -0500, dwilder@us.ibm.com wrote:
> Eric -
> With this patch applied the test ran clean for 2 days.
>
> Thanks for your help.
Excellent !
Thanks a lot David, I will submit it formally with your 'Reported-by'
and 'Tested-by'
I have no idea why this took so long to discover this race.
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH net] net: fix IP early demux races
2015-12-14 16:36 ` Eric Dumazet
@ 2015-12-14 22:08 ` Eric Dumazet
2015-12-15 4:52 ` David Miller
0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2015-12-14 22:08 UTC (permalink / raw)
To: dwilder, David Miller; +Cc: netdev, predeep, mjtarsel
From: Eric Dumazet <edumazet@google.com>
David Wilder reported crashes caused by dst reuse.
<quote David>
I am seeing a crash on a distro V4.2.3 kernel caused by a double
release of a dst_entry. In ipv4_dst_destroy() the call to
list_empty() finds a poisoned next pointer, indicating the dst_entry
has already been removed from the list and freed. The crash occurs
18 to 24 hours into a run of a network stress exerciser.
</quote>
Thanks to his detailed report and analysis, we were able to understand
the core issue.
IP early demux can associate a dst to skb, after a lookup in TCP/UDP
sockets.
When socket cache is not properly set, we want to store into
sk->sk_dst_cache the dst for future IP early demux lookups,
by acquiring a stable refcount on the dst.
Problem is this acquisition is simply using an atomic_inc(),
which works well, unless the dst was queued for destruction from
dst_release() noticing dst refcount went to zero, if DST_NOCACHE
was set on dst.
We need to make sure current refcount is not zero before incrementing
it, or risk double free as David reported.
This patch, being a stable candidate, adds two new helpers, and use
them only from IP early demux problematic paths.
It might be possible to merge in net-next skb_dst_force() and
skb_dst_force_safe(), but I prefer having the smallest patch for stable
kernels : Maybe some skb_dst_force() callers do not expect skb->dst
can suddenly be cleared.
Can probably be backported back to linux-3.6 kernels
Reported-by: David J. Wilder <dwilder@us.ibm.com>
Tested-by: David J. Wilder <dwilder@us.ibm.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/dst.h | 33 +++++++++++++++++++++++++++++++++
include/net/sock.h | 2 +-
net/ipv4/tcp_ipv4.c | 5 ++---
net/ipv6/tcp_ipv6.c | 3 +--
4 files changed, 37 insertions(+), 6 deletions(-)
diff --git a/include/net/dst.h b/include/net/dst.h
index 1279f9b09791..c7329dcd90cc 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -322,6 +322,39 @@ static inline void skb_dst_force(struct sk_buff *skb)
}
}
+/**
+ * dst_hold_safe - Take a reference on a dst if possible
+ * @dst: pointer to dst entry
+ *
+ * This helper returns false if it could not safely
+ * take a reference on a dst.
+ */
+static inline bool dst_hold_safe(struct dst_entry *dst)
+{
+ if (dst->flags & DST_NOCACHE)
+ return atomic_inc_not_zero(&dst->__refcnt);
+ dst_hold(dst);
+ return true;
+}
+
+/**
+ * skb_dst_force_safe - makes sure skb dst is refcounted
+ * @skb: buffer
+ *
+ * If dst is not yet refcounted and not destroyed, grab a ref on it.
+ */
+static inline void skb_dst_force_safe(struct sk_buff *skb)
+{
+ if (skb_dst_is_noref(skb)) {
+ struct dst_entry *dst = skb_dst(skb);
+
+ if (!dst_hold_safe(dst))
+ dst = NULL;
+
+ skb->_skb_refdst = (unsigned long)dst;
+ }
+}
+
/**
* __skb_tunnel_rx - prepare skb for rx reinsert
diff --git a/include/net/sock.h b/include/net/sock.h
index c4205e0a3a2d..28790fe18206 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -817,7 +817,7 @@ void sk_stream_write_space(struct sock *sk);
static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
{
/* dont let skb dst not refcounted, we are going to leave rcu lock */
- skb_dst_force(skb);
+ skb_dst_force_safe(skb);
if (!sk->sk_backlog.tail)
sk->sk_backlog.head = skb;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index db003438aaf5..d8841a2f1569 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1493,7 +1493,7 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
if (likely(sk->sk_rx_dst))
skb_dst_drop(skb);
else
- skb_dst_force(skb);
+ skb_dst_force_safe(skb);
__skb_queue_tail(&tp->ucopy.prequeue, skb);
tp->ucopy.memory += skb->truesize;
@@ -1721,8 +1721,7 @@ void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
{
struct dst_entry *dst = skb_dst(skb);
- if (dst) {
- dst_hold(dst);
+ if (dst && dst_hold_safe(dst)) {
sk->sk_rx_dst = dst;
inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
}
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index e7aab561b7b4..6b8a8a9091fa 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -93,10 +93,9 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
{
struct dst_entry *dst = skb_dst(skb);
- if (dst) {
+ if (dst && dst_hold_safe(dst)) {
const struct rt6_info *rt = (const struct rt6_info *)dst;
- dst_hold(dst);
sk->sk_rx_dst = dst;
inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
inet6_sk(sk)->rx_dst_cookie = rt6_get_cookie(rt);
^ permalink raw reply related [flat|nested] 10+ messages in thread* Re: [PATCH net] net: fix IP early demux races
2015-12-14 22:08 ` [PATCH net] net: fix IP early demux races Eric Dumazet
@ 2015-12-15 4:52 ` David Miller
0 siblings, 0 replies; 10+ messages in thread
From: David Miller @ 2015-12-15 4:52 UTC (permalink / raw)
To: eric.dumazet; +Cc: dwilder, netdev, predeep, mjtarsel
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 14 Dec 2015 14:08:53 -0800
> From: Eric Dumazet <edumazet@google.com>
>
> David Wilder reported crashes caused by dst reuse.
>
> <quote David>
> I am seeing a crash on a distro V4.2.3 kernel caused by a double
> release of a dst_entry. In ipv4_dst_destroy() the call to
> list_empty() finds a poisoned next pointer, indicating the dst_entry
> has already been removed from the list and freed. The crash occurs
> 18 to 24 hours into a run of a network stress exerciser.
> </quote>
>
> Thanks to his detailed report and analysis, we were able to understand
> the core issue.
>
> IP early demux can associate a dst to skb, after a lookup in TCP/UDP
> sockets.
>
> When socket cache is not properly set, we want to store into
> sk->sk_dst_cache the dst for future IP early demux lookups,
> by acquiring a stable refcount on the dst.
>
> Problem is this acquisition is simply using an atomic_inc(),
> which works well, unless the dst was queued for destruction from
> dst_release() noticing dst refcount went to zero, if DST_NOCACHE
> was set on dst.
>
> We need to make sure current refcount is not zero before incrementing
> it, or risk double free as David reported.
>
> This patch, being a stable candidate, adds two new helpers, and use
> them only from IP early demux problematic paths.
>
> It might be possible to merge in net-next skb_dst_force() and
> skb_dst_force_safe(), but I prefer having the smallest patch for stable
> kernels : Maybe some skb_dst_force() callers do not expect skb->dst
> can suddenly be cleared.
>
> Can probably be backported back to linux-3.6 kernels
>
> Reported-by: David J. Wilder <dwilder@us.ibm.com>
> Tested-by: David J. Wilder <dwilder@us.ibm.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Applied and queued up for -stable, thanks Eric.
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2015-12-15 4:52 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-12-06 18:26 Double free of dst_entry in ipv4_dst_destroy() dwilder
2015-12-06 21:03 ` Eric Dumazet
2015-12-07 1:58 ` Eric Dumazet
2015-12-11 14:23 ` Eric Dumazet
2015-12-11 15:48 ` Eric Dumazet
2015-12-11 16:47 ` Eric Dumazet
2015-12-14 16:28 ` dwilder
2015-12-14 16:36 ` Eric Dumazet
2015-12-14 22:08 ` [PATCH net] net: fix IP early demux races Eric Dumazet
2015-12-15 4:52 ` David Miller
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).