* ppp/pppoe, still panic 4.15.3 in ppp_push
@ 2018-02-14 13:17 Denys Fedoryshchenko
2018-02-14 16:07 ` Guillaume Nault
0 siblings, 1 reply; 37+ messages in thread
From: Denys Fedoryshchenko @ 2018-02-14 13:17 UTC
To: Linux Kernel Network Developers
Hi,
I upgraded the kernel to 4.15.3, but it still crashes after a while (several
hours; I cannot bisect, as this is a production server).
dev ppp # gdb ppp_generic.o
GNU gdb (Gentoo 7.12.1 vanilla) 7.12.1
<<skipped>>
Reading symbols from ppp_generic.o...done.
(gdb) list *ppp_push+0x73
0x681 is in ppp_push (drivers/net/ppp/ppp_generic.c:1663).
1658            list = list->next;
1659            pch = list_entry(list, struct channel, clist);
1660
1661            spin_lock(&pch->downl);
1662            if (pch->chan) {
1663                    if (pch->chan->ops->start_xmit(pch->chan, skb))
1664                            ppp->xmit_pending = NULL;
1665            } else {
1666                    /* channel got unregistered */
1667                    kfree_skb(skb);
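For readers unfamiliar with the idiom on line 1659: list_entry() recovers the enclosing structure from a pointer to its embedded list node by pure pointer arithmetic, so a stale or corrupted node yields an equally bogus pch. A minimal user-space sketch follows; struct demo_channel and demo_from_node() are hypothetical names for illustration and do not reflect the real layout of struct channel.

```c
#include <stddef.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Simplified container_of (the kernel version adds type checking). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)

/* Hypothetical structure; NOT the real layout of struct channel. */
struct demo_channel {
    int id;
    void *chan;
    struct list_head clist;   /* embedded list node */
};

/* Recover the enclosing object from a pointer to its embedded node. */
static struct demo_channel *demo_from_node(struct list_head *node)
{
    return list_entry(node, struct demo_channel, clist);
}
```

No validation happens in this step: whatever memory sits at the computed address is treated as the object, which is why a dangling list node can surface later as a wild pointer dereference.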
Feb 14 08:32:00 <srv> [17937.863304] general protection fault: 0000 [#1] SMP
Feb 14 08:32:00 <srv> [17937.863638] Modules linked in: pppoe pppox ppp_generic slhc netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4 xt_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4 xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
Feb 14 08:32:00 <srv> [17937.865619] CPU: 6 PID: 12543 Comm: accel-pppd Not tainted 4.15.3-build-0134 #4
Feb 14 08:32:00 <srv> [17937.866211] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 04/02/2015
Feb 14 08:32:00 <srv> [17937.866542] RIP: 0010:ppp_push+0x73/0x4ec [ppp_generic]
Feb 14 08:32:00 <srv> [17937.866865] RSP: 0018:ffffc90001fa7d50 EFLAGS: 00010282
Feb 14 08:32:00 <srv> [17937.867191] RAX: 0fd54d16ec030000 RBX: ffff8803eeb207b8 RCX: 0000000000000101
Feb 14 08:32:00 <srv> [17937.867517] RDX: 0000000000000000 RSI: ffff8803f9fb5000 RDI: ffff8803eed1e443
Feb 14 08:32:00 <srv> [17937.867844] RBP: ffff8803f9fb5000 R08: 0000000000000001 R09: 00000000ffffffff
Feb 14 08:32:00 <srv> [17937.868171] R10: 00007f0a75fba758 R11: 0000000000000293 R12: 0000000000008021
Feb 14 08:32:00 <srv> [17937.868499] R13: ffff8804144c7880 R14: 0000000000008021 R15: ffff8804144c7800
Feb 14 08:32:00 <srv> [17937.868824] FS: 00007f0a7ecd8700(0000) GS:ffff880434180000(0000) knlGS:0000000000000000
Feb 14 08:32:00 <srv> [17937.869408] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 14 08:32:00 <srv> [17937.869729] CR2: 00007fa87a187978 CR3: 000000042a6cd005 CR4: 00000000001606e0
Feb 14 08:32:00 <srv> [17937.870053] Call Trace:
Feb 14 08:32:00 <srv> [17937.870375] ? __kmalloc_node_track_caller+0xb5/0xd6
Feb 14 08:32:00 <srv> [17937.870700] __ppp_xmit_process+0x35/0x4c6 [ppp_generic]
Feb 14 08:32:00 <srv> [17937.871025] ppp_xmit_process+0x35/0x88 [ppp_generic]
Feb 14 08:32:00 <srv> [17937.871350] ppp_write+0xb1/0xbb [ppp_generic]
Feb 14 08:32:00 <srv> [17937.871678] __vfs_write+0x1c/0x118
Feb 14 08:32:00 <srv> [17937.872003] ? SyS_epoll_ctl+0x399/0x871
Feb 14 08:32:00 <srv> [17937.872328] vfs_write+0xc6/0x169
Feb 14 08:32:00 <srv> [17937.872654] SyS_write+0x48/0x81
Feb 14 08:32:00 <srv> [17937.872980] do_syscall_64+0x5f/0xea
Feb 14 08:32:00 <srv> [17937.873310] entry_SYSCALL_64_after_hwframe+0x21/0x86
Feb 14 08:32:00 <srv> [17937.873638] RIP: 0033:0x7f0a7e4bfb2d
Feb 14 08:32:00 <srv> [17937.873963] RSP: 002b:00007f0a7ecd7b00 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
Feb 14 08:32:00 <srv> [17937.874554] RAX: ffffffffffffffda RBX: 00007f0a7d00b1e3 RCX: 00007f0a7e4bfb2d
Feb 14 08:32:00 <srv> [17937.874881] RDX: 000000000000000c RSI: 00007f0a74175c80 RDI: 0000000000003ef8
Feb 14 08:32:00 <srv> [17937.875207] RBP: 00007f0a7ecd7b30 R08: 0000000000000000 R09: 000055776e7a5e40
Feb 14 08:32:00 <srv> [17937.875536] R10: 00007f0a75fba758 R11: 0000000000000293 R12: 00007f0a7550dd18
Feb 14 08:32:00 <srv> [17937.875863] R13: 00007ffd4c941eaf R14: 0000000000000000 R15: 00007f0a7ecd8700
Feb 14 08:32:00 <srv> [17937.876190] Code: 94 00 00 00 49 89 ff 0f ba e0 0a 72 43 48 8b 5f 68 48 8d 7b e8 e8 88 4f 84 e1 48 8b 7b b8 48 85 ff 74 10 48 8b 47 08 48 8b 34 24 <ff> 10 85 c0 75 0b eb 14 48 8b 3c 24 e8 d8 6c 76 e1 49 c7 87 c8
Feb 14 08:32:00 <srv> [17937.877071] RIP: ppp_push+0x73/0x4ec [ppp_generic] RSP: ffffc90001fa7d50
Feb 14 08:32:00 <srv> [17937.877435] ---[ end trace 30a3cc6a49109783 ]---
Feb 14 08:32:00 <srv> [17937.878370] Kernel panic - not syncing: Fatal exception in interrupt
Feb 14 08:32:00 <srv> [17937.878715] Kernel Offset: disabled
Feb 14 08:32:00 <srv> [17937.879771] Rebooting in 5 seconds..
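A note on reading the oops above: the marked bytes "<ff> 10" in the Code: line decode to an indirect call through RAX, which fits the start_xmit() call on line 1663, and RAX holds 0fd54d16ec030000. On x86-64, dereferencing a non-canonical address raises a general protection fault instead of a page fault, which would explain the fault type reported here. The helper below is a small sketch of that check; is_canonical_48() is a hypothetical name, and it assumes 48-bit virtual addresses (4-level paging).

```c
#include <stdint.h>

/*
 * Return 1 if addr is canonical under 48-bit virtual addressing,
 * i.e. bits 63..47 are all equal, and 0 otherwise.
 * Assumption: 4-level paging (not 5-level / LA57).
 */
static int is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* the upper 17 bits */

    return top == 0 || top == 0x1ffff;
}
```

With the register values from the dump, RAX (0x0fd54d16ec030000) fails this check, while RBX (0xffff8803eeb207b8) passes it as an ordinary kernel address.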
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-14 13:17 ppp/pppoe, still panic 4.15.3 in ppp_push Denys Fedoryshchenko
@ 2018-02-14 16:07 ` Guillaume Nault
2018-02-14 16:29 ` Denys Fedoryshchenko
0 siblings, 1 reply; 37+ messages in thread
From: Guillaume Nault @ 2018-02-14 16:07 UTC
To: Denys Fedoryshchenko; +Cc: Linux Kernel Network Developers
On Wed, Feb 14, 2018 at 03:17:23PM +0200, Denys Fedoryshchenko wrote:
> Hi,
>
> Upgraded kernel to 4.15.3, still it crashes after while (several hours,
> cannot do bisect, as it is production server).
>
> dev ppp # gdb ppp_generic.o
> GNU gdb (Gentoo 7.12.1 vanilla) 7.12.1
> <<skipped>>
> Reading symbols from ppp_generic.o...done.
> (gdb) list *ppp_push+0x73
> 0x681 is in ppp_push (drivers/net/ppp/ppp_generic.c:1663).
> 1658 list = list->next;
> 1659 pch = list_entry(list, struct channel, clist);
> 1660
> 1661 spin_lock(&pch->downl);
> 1662 if (pch->chan) {
> 1663 if (pch->chan->ops->start_xmit(pch->chan, skb))
> 1664 ppp->xmit_pending = NULL;
> 1665 } else {
> 1666 /* channel got unregistered */
> 1667 kfree_skb(skb);
>
>
I suspect memory corruption. Are you able to run with KASAN, by any
chance?
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-14 16:07 ` Guillaume Nault
@ 2018-02-14 16:29 ` Denys Fedoryshchenko
2018-02-14 16:47 ` Guillaume Nault
0 siblings, 1 reply; 37+ messages in thread
From: Denys Fedoryshchenko @ 2018-02-14 16:29 UTC
To: Guillaume Nault; +Cc: Linux Kernel Network Developers
On 2018-02-14 18:07, Guillaume Nault wrote:
> On Wed, Feb 14, 2018 at 03:17:23PM +0200, Denys Fedoryshchenko wrote:
>> Hi,
>>
>> Upgraded kernel to 4.15.3, still it crashes after while (several
>> hours,
>> cannot do bisect, as it is production server).
>>
>> dev ppp # gdb ppp_generic.o
>> GNU gdb (Gentoo 7.12.1 vanilla) 7.12.1
>> <<skipped>>
>> Reading symbols from ppp_generic.o...done.
>> (gdb) list *ppp_push+0x73
>> 0x681 is in ppp_push (drivers/net/ppp/ppp_generic.c:1663).
>> 1658 list = list->next;
>> 1659 pch = list_entry(list, struct channel, clist);
>> 1660
>> 1661 spin_lock(&pch->downl);
>> 1662 if (pch->chan) {
>> 1663 if (pch->chan->ops->start_xmit(pch->chan, skb))
>> 1664 ppp->xmit_pending = NULL;
>> 1665 } else {
>> 1666 /* channel got unregistered */
>> 1667 kfree_skb(skb);
>>
>>
> I expect a memory corruption. Do you have the possibility to run with
> KASAN by any chance?
I will try to enable it tonight. For now I have reverted "drivers, net, ppp:
convert ppp_file.refcnt from atomic_t to refcount_t" as a test.
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-14 16:29 ` Denys Fedoryshchenko
@ 2018-02-14 16:47 ` Guillaume Nault
2018-02-14 16:49 ` Denys Fedoryshchenko
0 siblings, 1 reply; 37+ messages in thread
From: Guillaume Nault @ 2018-02-14 16:47 UTC
To: Denys Fedoryshchenko; +Cc: Linux Kernel Network Developers
On Wed, Feb 14, 2018 at 06:29:34PM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-14 18:07, Guillaume Nault wrote:
> > On Wed, Feb 14, 2018 at 03:17:23PM +0200, Denys Fedoryshchenko wrote:
> > > Hi,
> > >
> > > Upgraded kernel to 4.15.3, still it crashes after while (several
> > > hours,
> > > cannot do bisect, as it is production server).
> > >
> > > dev ppp # gdb ppp_generic.o
> > > GNU gdb (Gentoo 7.12.1 vanilla) 7.12.1
> > > <<skipped>>
> > > Reading symbols from ppp_generic.o...done.
> > > (gdb) list *ppp_push+0x73
> > > 0x681 is in ppp_push (drivers/net/ppp/ppp_generic.c:1663).
> > > 1658 list = list->next;
> > > 1659 pch = list_entry(list, struct channel, clist);
> > > 1660
> > > 1661 spin_lock(&pch->downl);
> > > 1662 if (pch->chan) {
> > > 1663 if (pch->chan->ops->start_xmit(pch->chan, skb))
> > > 1664 ppp->xmit_pending = NULL;
> > > 1665 } else {
> > > 1666 /* channel got unregistered */
> > > 1667 kfree_skb(skb);
> > >
> > >
> > I expect a memory corruption. Do you have the possibility to run with
> > KASAN by any chance?
> I will try to enable it tonight. For now i reverted "drivers, net, ppp:
> convert ppp_file.refcnt from atomic_t to refcount_t" for test.
>
This commit looks good to me. Do you have doubts about it because it's
new in 4.15? Does it mean that your last known-good kernel is 4.14?
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-14 16:47 ` Guillaume Nault
@ 2018-02-14 16:49 ` Denys Fedoryshchenko
2018-02-14 17:25 ` Guillaume Nault
2018-02-15 19:20 ` Guillaume Nault
0 siblings, 2 replies; 37+ messages in thread
From: Denys Fedoryshchenko @ 2018-02-14 16:49 UTC
To: Guillaume Nault; +Cc: Linux Kernel Network Developers
On 2018-02-14 18:47, Guillaume Nault wrote:
> On Wed, Feb 14, 2018 at 06:29:34PM +0200, Denys Fedoryshchenko wrote:
>> On 2018-02-14 18:07, Guillaume Nault wrote:
>> > On Wed, Feb 14, 2018 at 03:17:23PM +0200, Denys Fedoryshchenko wrote:
>> > > Hi,
>> > >
>> > > Upgraded kernel to 4.15.3, still it crashes after while (several
>> > > hours,
>> > > cannot do bisect, as it is production server).
>> > >
>> > > dev ppp # gdb ppp_generic.o
>> > > GNU gdb (Gentoo 7.12.1 vanilla) 7.12.1
>> > > <<skipped>>
>> > > Reading symbols from ppp_generic.o...done.
>> > > (gdb) list *ppp_push+0x73
>> > > 0x681 is in ppp_push (drivers/net/ppp/ppp_generic.c:1663).
>> > > 1658 list = list->next;
>> > > 1659 pch = list_entry(list, struct channel, clist);
>> > > 1660
>> > > 1661 spin_lock(&pch->downl);
>> > > 1662 if (pch->chan) {
>> > > 1663 if (pch->chan->ops->start_xmit(pch->chan, skb))
>> > > 1664 ppp->xmit_pending = NULL;
>> > > 1665 } else {
>> > > 1666 /* channel got unregistered */
>> > > 1667 kfree_skb(skb);
>> > >
>> > >
>> > I expect a memory corruption. Do you have the possibility to run with
>> > KASAN by any chance?
>> I will try to enable it tonight. For now i reverted "drivers, net,
>> ppp:
>> convert ppp_file.refcnt from atomic_t to refcount_t" for test.
>>
> This commit looks good to me. Do you have doubts about it because it's
> new in 4.15? Does it mean that your last known-good kernel is 4.14?
I am just doing a "manual" bisect: checking all possibilities and picking
patches to revert at random.
Yes, correct, my known-good kernel is 4.14.2.
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-14 16:49 ` Denys Fedoryshchenko
@ 2018-02-14 17:25 ` Guillaume Nault
2018-02-15 10:19 ` Denys Fedoryshchenko
2018-02-15 19:20 ` Guillaume Nault
1 sibling, 1 reply; 37+ messages in thread
From: Guillaume Nault @ 2018-02-14 17:25 UTC
To: Denys Fedoryshchenko; +Cc: Linux Kernel Network Developers
On Wed, Feb 14, 2018 at 06:49:19PM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-14 18:47, Guillaume Nault wrote:
> > On Wed, Feb 14, 2018 at 06:29:34PM +0200, Denys Fedoryshchenko wrote:
> > > On 2018-02-14 18:07, Guillaume Nault wrote:
> > > > On Wed, Feb 14, 2018 at 03:17:23PM +0200, Denys Fedoryshchenko wrote:
> > > > > Hi,
> > > > >
> > > > > Upgraded kernel to 4.15.3, still it crashes after while (several
> > > > > hours,
> > > > > cannot do bisect, as it is production server).
> > > > >
> > > > > dev ppp # gdb ppp_generic.o
> > > > > GNU gdb (Gentoo 7.12.1 vanilla) 7.12.1
> > > > > <<skipped>>
> > > > > Reading symbols from ppp_generic.o...done.
> > > > > (gdb) list *ppp_push+0x73
> > > > > 0x681 is in ppp_push (drivers/net/ppp/ppp_generic.c:1663).
> > > > > 1658 list = list->next;
> > > > > 1659 pch = list_entry(list, struct channel, clist);
> > > > > 1660
> > > > > 1661 spin_lock(&pch->downl);
> > > > > 1662 if (pch->chan) {
> > > > > 1663 if (pch->chan->ops->start_xmit(pch->chan, skb))
> > > > > 1664 ppp->xmit_pending = NULL;
> > > > > 1665 } else {
> > > > > 1666 /* channel got unregistered */
> > > > > 1667 kfree_skb(skb);
> > > > >
> > > > >
> > > > I expect a memory corruption. Do you have the possibility to run with
> > > > KASAN by any chance?
> > > I will try to enable it tonight. For now i reverted "drivers, net,
> > > ppp:
> > > convert ppp_file.refcnt from atomic_t to refcount_t" for test.
> > >
> > This commit looks good to me. Do you have doubts about it because it's
> > new in 4.15? Does it mean that your last known-good kernel is 4.14?
>
> I am just doing "manual" bisect, checking all possibilities, and picking
> patch to revert randomly.
>
Must be a painful process. Are all of your networking modules required?
With luck, you might be able to isolate a faulty module in fewer steps.
> Yes, correct, my known-good is 4.14.2.
>
Good to know.
Let me know if you can get a KASAN trace.
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-14 17:25 ` Guillaume Nault
@ 2018-02-15 10:19 ` Denys Fedoryshchenko
2018-02-15 15:55 ` Guillaume Nault
0 siblings, 1 reply; 37+ messages in thread
From: Denys Fedoryshchenko @ 2018-02-15 10:19 UTC
To: Guillaume Nault; +Cc: Linux Kernel Network Developers
On 2018-02-14 19:25, Guillaume Nault wrote:
> On Wed, Feb 14, 2018 at 06:49:19PM +0200, Denys Fedoryshchenko wrote:
>> On 2018-02-14 18:47, Guillaume Nault wrote:
>> > On Wed, Feb 14, 2018 at 06:29:34PM +0200, Denys Fedoryshchenko wrote:
>> > > On 2018-02-14 18:07, Guillaume Nault wrote:
>> > > > On Wed, Feb 14, 2018 at 03:17:23PM +0200, Denys Fedoryshchenko wrote:
>> > > > > Hi,
>> > > > >
>> > > > > Upgraded kernel to 4.15.3, still it crashes after while (several
>> > > > > hours,
>> > > > > cannot do bisect, as it is production server).
>> > > > >
>> > > > > dev ppp # gdb ppp_generic.o
>> > > > > GNU gdb (Gentoo 7.12.1 vanilla) 7.12.1
>> > > > > <<skipped>>
>> > > > > Reading symbols from ppp_generic.o...done.
>> > > > > (gdb) list *ppp_push+0x73
>> > > > > 0x681 is in ppp_push (drivers/net/ppp/ppp_generic.c:1663).
>> > > > > 1658 list = list->next;
>> > > > > 1659 pch = list_entry(list, struct channel, clist);
>> > > > > 1660
>> > > > > 1661 spin_lock(&pch->downl);
>> > > > > 1662 if (pch->chan) {
>> > > > > 1663 if (pch->chan->ops->start_xmit(pch->chan, skb))
>> > > > > 1664 ppp->xmit_pending = NULL;
>> > > > > 1665 } else {
>> > > > > 1666 /* channel got unregistered */
>> > > > > 1667 kfree_skb(skb);
>> > > > >
>> > > > >
>> > > > I expect a memory corruption. Do you have the possibility to run with
>> > > > KASAN by any chance?
>> > > I will try to enable it tonight. For now i reverted "drivers, net,
>> > > ppp:
>> > > convert ppp_file.refcnt from atomic_t to refcount_t" for test.
>> > >
>> > This commit looks good to me. Do you have doubts about it because it's
>> > new in 4.15? Does it mean that your last known-good kernel is 4.14?
>>
>> I am just doing "manual" bisect, checking all possibilities, and
>> picking
>> patch to revert randomly.
>>
> Must be a painful process. Are all of your networking modules required?
> With luck, you might be able to isolate a faulty module in fewer steps.
>
>> Yes, correct, my known-good is 4.14.2.
>>
> Good to know.
>
> Let me know if you can get a KASAN trace.
Here we go:
<srv> [24558.921549] ==================================================================
<srv> [24558.922167] BUG: KASAN: use-after-free in ppp_ioctl+0xa6a/0x1522 [ppp_generic]
<srv> [24558.922776] Write of size 8 at addr ffff8803d35bf3f8 by task accel-pppd/12622
<srv> [24558.923113]
<srv> [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G W 4.15.3-build-0134 #1
<srv> [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 04/02/2015
<srv> [24558.924406] Call Trace:
<srv> [24558.924753] dump_stack+0x46/0x59
<srv> [24558.925103] print_address_description+0x6b/0x23b
<srv> [24558.925451] ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
<srv> [24558.925797] kasan_report+0x21b/0x241
<srv> [24558.926136] ppp_ioctl+0xa6a/0x1522 [ppp_generic]
<srv> [24558.926479] ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
<srv> [24558.926829] ? sock_sendmsg+0x89/0x99
<srv> [24558.927176] ? __vfs_write+0xd9/0x4ad
<srv> [24558.927523] ? kernel_read+0xed/0xed
<srv> [24558.927872] ? SyS_getpeername+0x18c/0x18c
<srv> [24558.928213] ? bit_waitqueue+0x2a/0x2a
<srv> [24558.928561] ? wake_atomic_t_function+0x115/0x115
<srv> [24558.928898] vfs_ioctl+0x6e/0x81
<srv> [24558.929228] do_vfs_ioctl+0xa00/0xb10
<srv> [24558.929571] ? sigprocmask+0x1a6/0x1d0
<srv> [24558.929907] ? sigsuspend+0x13e/0x13e
<srv> [24558.930239] ? ioctl_preallocate+0x14e/0x14e
<srv> [24558.930568] ? SyS_rt_sigprocmask+0xf1/0x142
<srv> [24558.930904] ? sigprocmask+0x1d0/0x1d0
<srv> [24558.931252] SyS_ioctl+0x39/0x55
<srv> [24558.931595] ? do_vfs_ioctl+0xb10/0xb10
<srv> [24558.931942] do_syscall_64+0x1b1/0x31f
<srv> [24558.932288] entry_SYSCALL_64_after_hwframe+0x21/0x86
<srv> [24558.932627] RIP: 0033:0x7f302849d8a7
<srv> [24558.932965] RSP: 002b:00007f3029a52af8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
<srv> [24558.933578] RAX: ffffffffffffffda RBX: 00007f3027d861e3 RCX: 00007f302849d8a7
<srv> [24558.933927] RDX: 00007f3023f49468 RSI: 000000004004743a RDI: 0000000000003a67
<srv> [24558.934266] RBP: 00007f3029a52b20 R08: 0000000000000000 R09: 000055c8308d8e40
<srv> [24558.934607] R10: 0000000000000008 R11: 0000000000000206 R12: 00007f3023f49358
<srv> [24558.934947] R13: 00007ffe86e5723f R14: 0000000000000000 R15: 00007f3029a53700
<srv> [24558.935288]
<srv> [24558.935626] Allocated by task 12622:
<srv> [24558.935972] ppp_register_net_channel+0x5f/0x5c6 [ppp_generic]
<srv> [24558.936306] pppoe_connect+0xab7/0xc71 [pppoe]
<srv> [24558.936640] SyS_connect+0x14b/0x1b7
<srv> [24558.936975] do_syscall_64+0x1b1/0x31f
<srv> [24558.937319] entry_SYSCALL_64_after_hwframe+0x21/0x86
<srv> [24558.937655]
<srv> [24558.937993] Freed by task 12622:
<srv> [24558.938321] kfree+0xb0/0x11d
<srv> [24558.938658] ppp_release+0x111/0x120 [ppp_generic]
<srv> [24558.938994] __fput+0x2ba/0x51a
<srv> [24558.939332] task_work_run+0x11c/0x13d
<srv> [24558.939676] exit_to_usermode_loop+0x7c/0xaf
<srv> [24558.940022] do_syscall_64+0x2ea/0x31f
<srv> [24558.940368] entry_SYSCALL_64_after_hwframe+0x21/0x86
<srv> [24558.947099]
<srv> [24558.947443] The buggy address belongs to the object at ffff8803d35bf340
[24558.947443] which belongs to the cache kmalloc-256 of size 256
<srv> [24558.948064] The buggy address is located 184 bytes inside of
[24558.948064] 256-byte region [ffff8803d35bf340, ffff8803d35bf440)
<srv> [24558.948676] The buggy address belongs to the page:
<srv> [24558.949019] page:ffffea000f4d6f00 count:1 mapcount:0 mapping: (null) index:0xffff8803d35bfc00 compound_mapcount: 0
<srv> [24558.949633] flags: 0x17ffe00000008100(slab|head)
<srv> [24558.949980] raw: 17ffe00000008100 0000000000000000 ffff8803d35bfc00 000000010033002e
<srv> [24558.950597] raw: ffffea000d98c020 ffffea000df57d20 ffff8803f1c0f480 0000000000000000
<srv> [24558.951209] page dumped because: kasan: bad access detected
<srv> [24558.951546]
<srv> [24558.951880] Memory state around the buggy address:
<srv> [24558.952217] ffff8803d35bf280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
<srv> [24558.952818] ffff8803d35bf300: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
<srv> [24558.953415] >ffff8803d35bf380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
<srv> [24558.954021] ^
<srv> [24558.954365] ffff8803d35bf400: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
<srv> [24558.954969] ffff8803d35bf480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
<srv> [24558.955572] ==================================================================
<srv> [24558.956169] Disabling lock debugging due to kernel taint
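The numbers in the report above can be cross-checked by hand: the faulting address minus the object start gives the "184 bytes inside" figure, and the address lies within the 256-byte region that KASAN prints. The sketch below does that arithmetic; offset_in_object() and in_region() are hypothetical helper names, not kernel functions.

```c
#include <stdint.h>

/* Byte offset of a reported address inside its slab object. */
static uint64_t offset_in_object(uint64_t addr, uint64_t obj_start)
{
    return addr - obj_start;
}

/* 1 if addr lies inside the half-open range [start, end), else 0. */
static int in_region(uint64_t addr, uint64_t start, uint64_t end)
{
    return addr >= start && addr < end;
}
```

For this trace, 0xffff8803d35bf3f8 - 0xffff8803d35bf340 = 0xb8 = 184, so the 8-byte write touches bytes 184..191 of an object that ppp_release() had already freed. Since the object was allocated in ppp_register_net_channel(), it is presumably the struct channel, though the trace alone does not say which member sits at that offset.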
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-15 10:19 ` Denys Fedoryshchenko
@ 2018-02-15 15:55 ` Guillaume Nault
2018-02-15 16:01 ` Denys Fedoryshchenko
0 siblings, 1 reply; 37+ messages in thread
From: Guillaume Nault @ 2018-02-15 15:55 UTC
To: Denys Fedoryshchenko; +Cc: Linux Kernel Network Developers
On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
> Here we go:
>
> <srv> [24558.921549]
> ==================================================================
> <srv> [24558.922167] BUG: KASAN: use-after-free in ppp_ioctl+0xa6a/0x1522
> [ppp_generic]
> <srv> [24558.922776] Write of size 8 at addr ffff8803d35bf3f8 by task
> accel-pppd/12622
> <srv> [24558.923113]
> <srv> [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G W
> 4.15.3-build-0134 #1
> <srv> [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
> 04/02/2015
> <srv> [24558.924406] Call Trace:
> <srv> [24558.924753] dump_stack+0x46/0x59
> <srv> [24558.925103] print_address_description+0x6b/0x23b
> <srv> [24558.925451] ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> <srv> [24558.925797] kasan_report+0x21b/0x241
> <srv> [24558.926136] ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> <srv> [24558.926479] ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
> <srv> [24558.926829] ? sock_sendmsg+0x89/0x99
> <srv> [24558.927176] ? __vfs_write+0xd9/0x4ad
> <srv> [24558.927523] ? kernel_read+0xed/0xed
> <srv> [24558.927872] ? SyS_getpeername+0x18c/0x18c
> <srv> [24558.928213] ? bit_waitqueue+0x2a/0x2a
> <srv> [24558.928561] ? wake_atomic_t_function+0x115/0x115
> <srv> [24558.928898] vfs_ioctl+0x6e/0x81
> <srv> [24558.929228] do_vfs_ioctl+0xa00/0xb10
> <srv> [24558.929571] ? sigprocmask+0x1a6/0x1d0
> <srv> [24558.929907] ? sigsuspend+0x13e/0x13e
> <srv> [24558.930239] ? ioctl_preallocate+0x14e/0x14e
> <srv> [24558.930568] ? SyS_rt_sigprocmask+0xf1/0x142
> <srv> [24558.930904] ? sigprocmask+0x1d0/0x1d0
> <srv> [24558.931252] SyS_ioctl+0x39/0x55
> <srv> [24558.931595] ? do_vfs_ioctl+0xb10/0xb10
> <srv> [24558.931942] do_syscall_64+0x1b1/0x31f
> <srv> [24558.932288] entry_SYSCALL_64_after_hwframe+0x21/0x86
> <srv> [24558.932627] RIP: 0033:0x7f302849d8a7
> <srv> [24558.932965] RSP: 002b:00007f3029a52af8 EFLAGS: 00000206 ORIG_RAX:
> 0000000000000010
> <srv> [24558.933578] RAX: ffffffffffffffda RBX: 00007f3027d861e3 RCX:
> 00007f302849d8a7
> <srv> [24558.933927] RDX: 00007f3023f49468 RSI: 000000004004743a RDI:
> 0000000000003a67
> <srv> [24558.934266] RBP: 00007f3029a52b20 R08: 0000000000000000 R09:
> 000055c8308d8e40
> <srv> [24558.934607] R10: 0000000000000008 R11: 0000000000000206 R12:
> 00007f3023f49358
> <srv> [24558.934947] R13: 00007ffe86e5723f R14: 0000000000000000 R15:
> 00007f3029a53700
> <srv> [24558.935288]
> <srv> [24558.935626] Allocated by task 12622:
> <srv> [24558.935972] ppp_register_net_channel+0x5f/0x5c6 [ppp_generic]
> <srv> [24558.936306] pppoe_connect+0xab7/0xc71 [pppoe]
> <srv> [24558.936640] SyS_connect+0x14b/0x1b7
> <srv> [24558.936975] do_syscall_64+0x1b1/0x31f
> <srv> [24558.937319] entry_SYSCALL_64_after_hwframe+0x21/0x86
> <srv> [24558.937655]
> <srv> [24558.937993] Freed by task 12622:
> <srv> [24558.938321] kfree+0xb0/0x11d
> <srv> [24558.938658] ppp_release+0x111/0x120 [ppp_generic]
> <srv> [24558.938994] __fput+0x2ba/0x51a
> <srv> [24558.939332] task_work_run+0x11c/0x13d
> <srv> [24558.939676] exit_to_usermode_loop+0x7c/0xaf
> <srv> [24558.940022] do_syscall_64+0x2ea/0x31f
> <srv> [24558.940368] entry_SYSCALL_64_after_hwframe+0x21/0x86
> <srv> [24558.947099]
Your first guess was right. It looks like we have an issue with
reference counting on the channels. Can you send me your ppp_generic.o?
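To illustrate the class of bug suspected here, the following is a minimal user-space model of a reference-counted object, under the assumption that one path drops a reference it never took. All names (demo_obj, demo_get, demo_put) are hypothetical, the counter is deliberately non-atomic, and a flag stands in for kfree() so the model itself stays memory-safe; it is not the ppp_generic code.

```c
#include <stdbool.h>

/* Hypothetical object; a user-space model, not the kernel's struct channel. */
struct demo_obj {
    int refcnt;   /* number of live references */
    bool freed;   /* set where real code would call kfree() */
};

static void demo_init(struct demo_obj *o)
{
    o->refcnt = 1;        /* the creator holds the first reference */
    o->freed = false;
}

/* Take an additional reference. */
static void demo_get(struct demo_obj *o)
{
    o->refcnt++;
}

/* Drop one reference; returns true when this put released the object. */
static bool demo_put(struct demo_obj *o)
{
    if (--o->refcnt == 0) {
        o->freed = true;  /* stand-in for kfree(o) */
        return true;
    }
    return false;
}

/* Any access while this returns true corresponds to a use-after-free. */
static bool demo_access_is_uaf(const struct demo_obj *o)
{
    return o->freed;
}
```

With balanced calls (init, get, put, put) the object is released only by the final put. If a second user keeps a pointer without calling demo_get(), the creator's single put releases the object while that pointer is still in use, which is the pattern a KASAN use-after-free report points at.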
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-15 15:55 ` Guillaume Nault
@ 2018-02-15 16:01 ` Denys Fedoryshchenko
2018-02-15 19:31 ` Guillaume Nault
0 siblings, 1 reply; 37+ messages in thread
From: Denys Fedoryshchenko @ 2018-02-15 16:01 UTC
To: Guillaume Nault; +Cc: Linux Kernel Network Developers, netdev-owner
On 2018-02-15 17:55, Guillaume Nault wrote:
> On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
>> Here we go:
>>
>> <srv> [24558.921549]
>> ==================================================================
>> <srv> [24558.922167] BUG: KASAN: use-after-free in
>> ppp_ioctl+0xa6a/0x1522
>> [ppp_generic]
>> <srv> [24558.922776] Write of size 8 at addr ffff8803d35bf3f8 by task
>> accel-pppd/12622
>> <srv> [24558.923113]
>> <srv> [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
>> W
>> 4.15.3-build-0134 #1
>> <srv> [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS
>> P80
>> 04/02/2015
>> <srv> [24558.924406] Call Trace:
>> <srv> [24558.924753] dump_stack+0x46/0x59
>> <srv> [24558.925103] print_address_description+0x6b/0x23b
>> <srv> [24558.925451] ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> <srv> [24558.925797] kasan_report+0x21b/0x241
>> <srv> [24558.926136] ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> <srv> [24558.926479] ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
>> <srv> [24558.926829] ? sock_sendmsg+0x89/0x99
>> <srv> [24558.927176] ? __vfs_write+0xd9/0x4ad
>> <srv> [24558.927523] ? kernel_read+0xed/0xed
>> <srv> [24558.927872] ? SyS_getpeername+0x18c/0x18c
>> <srv> [24558.928213] ? bit_waitqueue+0x2a/0x2a
>> <srv> [24558.928561] ? wake_atomic_t_function+0x115/0x115
>> <srv> [24558.928898] vfs_ioctl+0x6e/0x81
>> <srv> [24558.929228] do_vfs_ioctl+0xa00/0xb10
>> <srv> [24558.929571] ? sigprocmask+0x1a6/0x1d0
>> <srv> [24558.929907] ? sigsuspend+0x13e/0x13e
>> <srv> [24558.930239] ? ioctl_preallocate+0x14e/0x14e
>> <srv> [24558.930568] ? SyS_rt_sigprocmask+0xf1/0x142
>> <srv> [24558.930904] ? sigprocmask+0x1d0/0x1d0
>> <srv> [24558.931252] SyS_ioctl+0x39/0x55
>> <srv> [24558.931595] ? do_vfs_ioctl+0xb10/0xb10
>> <srv> [24558.931942] do_syscall_64+0x1b1/0x31f
>> <srv> [24558.932288] entry_SYSCALL_64_after_hwframe+0x21/0x86
>> <srv> [24558.932627] RIP: 0033:0x7f302849d8a7
>> <srv> [24558.932965] RSP: 002b:00007f3029a52af8 EFLAGS: 00000206
>> ORIG_RAX:
>> 0000000000000010
>> <srv> [24558.933578] RAX: ffffffffffffffda RBX: 00007f3027d861e3 RCX:
>> 00007f302849d8a7
>> <srv> [24558.933927] RDX: 00007f3023f49468 RSI: 000000004004743a RDI:
>> 0000000000003a67
>> <srv> [24558.934266] RBP: 00007f3029a52b20 R08: 0000000000000000 R09:
>> 000055c8308d8e40
>> <srv> [24558.934607] R10: 0000000000000008 R11: 0000000000000206 R12:
>> 00007f3023f49358
>> <srv> [24558.934947] R13: 00007ffe86e5723f R14: 0000000000000000 R15:
>> 00007f3029a53700
>> <srv> [24558.935288]
>> <srv> [24558.935626] Allocated by task 12622:
>> <srv> [24558.935972] ppp_register_net_channel+0x5f/0x5c6
>> [ppp_generic]
>> <srv> [24558.936306] pppoe_connect+0xab7/0xc71 [pppoe]
>> <srv> [24558.936640] SyS_connect+0x14b/0x1b7
>> <srv> [24558.936975] do_syscall_64+0x1b1/0x31f
>> <srv> [24558.937319] entry_SYSCALL_64_after_hwframe+0x21/0x86
>> <srv> [24558.937655]
>> <srv> [24558.937993] Freed by task 12622:
>> <srv> [24558.938321] kfree+0xb0/0x11d
>> <srv> [24558.938658] ppp_release+0x111/0x120 [ppp_generic]
>> <srv> [24558.938994] __fput+0x2ba/0x51a
>> <srv> [24558.939332] task_work_run+0x11c/0x13d
>> <srv> [24558.939676] exit_to_usermode_loop+0x7c/0xaf
>> <srv> [24558.940022] do_syscall_64+0x2ea/0x31f
>> <srv> [24558.940368] entry_SYSCALL_64_after_hwframe+0x21/0x86
>> <srv> [24558.947099]
>
> Your first guess was right. It looks like we have an issue with
> reference counting on the channels. Can you send me your ppp_generic.o?
http://nuclearcat.com/ppp_generic.o
Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-14 16:49 ` Denys Fedoryshchenko
2018-02-14 17:25 ` Guillaume Nault
@ 2018-02-15 19:20 ` Guillaume Nault
1 sibling, 0 replies; 37+ messages in thread
From: Guillaume Nault @ 2018-02-15 19:20 UTC
To: Denys Fedoryshchenko; +Cc: Linux Kernel Network Developers
On Wed, Feb 14, 2018 at 06:49:19PM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-14 18:47, Guillaume Nault wrote:
> > On Wed, Feb 14, 2018 at 06:29:34PM +0200, Denys Fedoryshchenko wrote:
> > > On 2018-02-14 18:07, Guillaume Nault wrote:
> > > > On Wed, Feb 14, 2018 at 03:17:23PM +0200, Denys Fedoryshchenko wrote:
> > > > > Hi,
> > > > >
> > > > > Upgraded kernel to 4.15.3, still it crashes after while (several
> > > > > hours,
> > > > > cannot do bisect, as it is production server).
> > > > >
> > > > > dev ppp # gdb ppp_generic.o
> > > > > GNU gdb (Gentoo 7.12.1 vanilla) 7.12.1
> > > > > <<skipped>>
> > > > > Reading symbols from ppp_generic.o...done.
> > > > > (gdb) list *ppp_push+0x73
> > > > > 0x681 is in ppp_push (drivers/net/ppp/ppp_generic.c:1663).
> > > > > 1658 list = list->next;
> > > > > 1659 pch = list_entry(list, struct channel, clist);
> > > > > 1660
> > > > > 1661 spin_lock(&pch->downl);
> > > > > 1662 if (pch->chan) {
> > > > > 1663 if (pch->chan->ops->start_xmit(pch->chan, skb))
> > > > > 1664 ppp->xmit_pending = NULL;
> > > > > 1665 } else {
> > > > > 1666 /* channel got unregistered */
> > > > > 1667 kfree_skb(skb);
> > > > >
> > > > >
> > > > I expect a memory corruption. Do you have the possibility to run with
> > > > KASAN by any chance?
> > > I will try to enable it tonight. For now i reverted "drivers, net,
> > > ppp:
> > > convert ppp_file.refcnt from atomic_t to refcount_t" for test.
> > >
> > This commit looks good to me. Do you have doubts about it because it's
> > new in 4.15? Does it mean that your last known-good kernel is 4.14?
>
> I am just doing "manual" bisect, checking all possibilities, and picking
> patch to revert randomly.
> Yes, correct, my known-good is 4.14.2.
>
Then maybe try reverting commit 0171c4183559 ("ppp: unlock all_ppp_mutex before registering device").
I can't see how it could lead to the bug you observed, but the other
ppp_generic patches introduced since 4.14 were rather trivial.
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-15 16:01 ` Denys Fedoryshchenko
@ 2018-02-15 19:31 ` Guillaume Nault
2018-02-15 19:34 ` Denys Fedoryshchenko
2018-02-21 20:04 ` Cong Wang
0 siblings, 2 replies; 37+ messages in thread
From: Guillaume Nault @ 2018-02-15 19:31 UTC
To: Denys Fedoryshchenko; +Cc: Linux Kernel Network Developers
On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-15 17:55, Guillaume Nault wrote:
> > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
> > > Here we go:
> > >
> > > <srv> [24558.921549]
> > > ==================================================================
> > > <srv> [24558.922167] BUG: KASAN: use-after-free in
> > > ppp_ioctl+0xa6a/0x1522
> > > [ppp_generic]
> > > <srv> [24558.922776] Write of size 8 at addr ffff8803d35bf3f8 by task
> > > accel-pppd/12622
> > > <srv> [24558.923113]
> > > <srv> [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
> > > W
> > > 4.15.3-build-0134 #1
> > > <srv> [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
> > > BIOS P80
> > > 04/02/2015
> > > <srv> [24558.924406] Call Trace:
> > > <srv> [24558.924753] dump_stack+0x46/0x59
> > > <srv> [24558.925103] print_address_description+0x6b/0x23b
> > > <srv> [24558.925451] ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> > > <srv> [24558.925797] kasan_report+0x21b/0x241
> > > <srv> [24558.926136] ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> > > <srv> [24558.926479] ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
> > > <srv> [24558.926829] ? sock_sendmsg+0x89/0x99
> > > <srv> [24558.927176] ? __vfs_write+0xd9/0x4ad
> > > <srv> [24558.927523] ? kernel_read+0xed/0xed
> > > <srv> [24558.927872] ? SyS_getpeername+0x18c/0x18c
> > > <srv> [24558.928213] ? bit_waitqueue+0x2a/0x2a
> > > <srv> [24558.928561] ? wake_atomic_t_function+0x115/0x115
> > > <srv> [24558.928898] vfs_ioctl+0x6e/0x81
> > > <srv> [24558.929228] do_vfs_ioctl+0xa00/0xb10
> > > <srv> [24558.929571] ? sigprocmask+0x1a6/0x1d0
> > > <srv> [24558.929907] ? sigsuspend+0x13e/0x13e
> > > <srv> [24558.930239] ? ioctl_preallocate+0x14e/0x14e
> > > <srv> [24558.930568] ? SyS_rt_sigprocmask+0xf1/0x142
> > > <srv> [24558.930904] ? sigprocmask+0x1d0/0x1d0
> > > <srv> [24558.931252] SyS_ioctl+0x39/0x55
> > > <srv> [24558.931595] ? do_vfs_ioctl+0xb10/0xb10
> > > <srv> [24558.931942] do_syscall_64+0x1b1/0x31f
> > > <srv> [24558.932288] entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > <srv> [24558.932627] RIP: 0033:0x7f302849d8a7
> > > <srv> [24558.932965] RSP: 002b:00007f3029a52af8 EFLAGS: 00000206
> > > ORIG_RAX:
> > > 0000000000000010
> > > <srv> [24558.933578] RAX: ffffffffffffffda RBX: 00007f3027d861e3 RCX:
> > > 00007f302849d8a7
> > > <srv> [24558.933927] RDX: 00007f3023f49468 RSI: 000000004004743a RDI:
> > > 0000000000003a67
> > > <srv> [24558.934266] RBP: 00007f3029a52b20 R08: 0000000000000000 R09:
> > > 000055c8308d8e40
> > > <srv> [24558.934607] R10: 0000000000000008 R11: 0000000000000206 R12:
> > > 00007f3023f49358
> > > <srv> [24558.934947] R13: 00007ffe86e5723f R14: 0000000000000000 R15:
> > > 00007f3029a53700
> > > <srv> [24558.935288]
> > > <srv> [24558.935626] Allocated by task 12622:
> > > <srv> [24558.935972] ppp_register_net_channel+0x5f/0x5c6
> > > [ppp_generic]
> > > <srv> [24558.936306] pppoe_connect+0xab7/0xc71 [pppoe]
> > > <srv> [24558.936640] SyS_connect+0x14b/0x1b7
> > > <srv> [24558.936975] do_syscall_64+0x1b1/0x31f
> > > <srv> [24558.937319] entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > <srv> [24558.937655]
> > > <srv> [24558.937993] Freed by task 12622:
> > > <srv> [24558.938321] kfree+0xb0/0x11d
> > > <srv> [24558.938658] ppp_release+0x111/0x120 [ppp_generic]
> > > <srv> [24558.938994] __fput+0x2ba/0x51a
> > > <srv> [24558.939332] task_work_run+0x11c/0x13d
> > > <srv> [24558.939676] exit_to_usermode_loop+0x7c/0xaf
> > > <srv> [24558.940022] do_syscall_64+0x2ea/0x31f
> > > <srv> [24558.940368] entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > <srv> [24558.947099]
> >
> > Your first guess was right. It looks like we have an issue with
> > reference counting on the channels. Can you send me your ppp_generic.o?
> http://nuclearcat.com/ppp_generic.o
> Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
>
From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
concurrently on the same ppp_file. Even if this ppp_file was pointed at
by two different file descriptors, I can't see how this could defeat
the reference counting mechanism. I'm going to think more about it.
Can you test with CONFIG_REFCOUNT_FULL? (and keep
d780cd44e3ce ("drivers, net, ppp: convert ppp_file.refcnt from atomic_t to refcount_t")).
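As a cross-check of the PPPIOCCONNECT reading, the RSI value in the quoted
trace (000000004004743a, the second syscall argument on x86-64) can be
decoded by hand. The sketch below assumes the generic Linux _IOC bit layout
(8-bit number, 8-bit type, 14-bit size, 2-bit direction); decode_ioctl is a
hypothetical helper name, not something from the thread.

```python
# Decode a Linux ioctl request number.
# Assumption: generic _IOC layout as used on x86-64:
#   bits 0-7 nr, bits 8-15 type, bits 16-29 size, bits 30-31 direction.

def decode_ioctl(request):  # hypothetical helper, for illustration only
    direction = (request >> 30) & 0x3
    return {
        "dir": {0: "none", 1: "write", 2: "read", 3: "read/write"}[direction],
        "size": (request >> 16) & 0x3FFF,   # size of the argument in bytes
        "type": chr((request >> 8) & 0xFF),  # ioctl "magic" character
        "nr": request & 0xFF,                # command number within the type
    }

# RSI from the KASAN report quoted above
decoded = decode_ioctl(0x4004743A)
print(decoded)
```

The result (write direction, 4-byte argument, type 't', number 58)
corresponds to _IOW('t', 58, int), which is how linux/ppp-ioctl.h defines
PPPIOCCONNECT.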
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-15 19:31 ` Guillaume Nault
@ 2018-02-15 19:34 ` Denys Fedoryshchenko
2018-02-15 19:42 ` Guillaume Nault
2018-02-21 20:04 ` Cong Wang
1 sibling, 1 reply; 37+ messages in thread
From: Denys Fedoryshchenko @ 2018-02-15 19:34 UTC (permalink / raw)
To: Guillaume Nault; +Cc: Linux Kernel Network Developers
On 2018-02-15 21:31, Guillaume Nault wrote:
> On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
>> On 2018-02-15 17:55, Guillaume Nault wrote:
>> > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
>> > > Here we go:
>> > >
>> > > <srv> [24558.921549]
>> > > ==================================================================
>> > > <srv> [24558.922167] BUG: KASAN: use-after-free in
>> > > ppp_ioctl+0xa6a/0x1522
>> > > [ppp_generic]
>> > > <srv> [24558.922776] Write of size 8 at addr ffff8803d35bf3f8 by task
>> > > accel-pppd/12622
>> > > <srv> [24558.923113]
>> > > <srv> [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
>> > > W
>> > > 4.15.3-build-0134 #1
>> > > <srv> [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
>> > > BIOS P80
>> > > 04/02/2015
>> > > <srv> [24558.924406] Call Trace:
>> > > <srv> [24558.924753] dump_stack+0x46/0x59
>> > > <srv> [24558.925103] print_address_description+0x6b/0x23b
>> > > <srv> [24558.925451] ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> > > <srv> [24558.925797] kasan_report+0x21b/0x241
>> > > <srv> [24558.926136] ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> > > <srv> [24558.926479] ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
>> > > <srv> [24558.926829] ? sock_sendmsg+0x89/0x99
>> > > <srv> [24558.927176] ? __vfs_write+0xd9/0x4ad
>> > > <srv> [24558.927523] ? kernel_read+0xed/0xed
>> > > <srv> [24558.927872] ? SyS_getpeername+0x18c/0x18c
>> > > <srv> [24558.928213] ? bit_waitqueue+0x2a/0x2a
>> > > <srv> [24558.928561] ? wake_atomic_t_function+0x115/0x115
>> > > <srv> [24558.928898] vfs_ioctl+0x6e/0x81
>> > > <srv> [24558.929228] do_vfs_ioctl+0xa00/0xb10
>> > > <srv> [24558.929571] ? sigprocmask+0x1a6/0x1d0
>> > > <srv> [24558.929907] ? sigsuspend+0x13e/0x13e
>> > > <srv> [24558.930239] ? ioctl_preallocate+0x14e/0x14e
>> > > <srv> [24558.930568] ? SyS_rt_sigprocmask+0xf1/0x142
>> > > <srv> [24558.930904] ? sigprocmask+0x1d0/0x1d0
>> > > <srv> [24558.931252] SyS_ioctl+0x39/0x55
>> > > <srv> [24558.931595] ? do_vfs_ioctl+0xb10/0xb10
>> > > <srv> [24558.931942] do_syscall_64+0x1b1/0x31f
>> > > <srv> [24558.932288] entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > > <srv> [24558.932627] RIP: 0033:0x7f302849d8a7
>> > > <srv> [24558.932965] RSP: 002b:00007f3029a52af8 EFLAGS: 00000206
>> > > ORIG_RAX:
>> > > 0000000000000010
>> > > <srv> [24558.933578] RAX: ffffffffffffffda RBX: 00007f3027d861e3 RCX:
>> > > 00007f302849d8a7
>> > > <srv> [24558.933927] RDX: 00007f3023f49468 RSI: 000000004004743a RDI:
>> > > 0000000000003a67
>> > > <srv> [24558.934266] RBP: 00007f3029a52b20 R08: 0000000000000000 R09:
>> > > 000055c8308d8e40
>> > > <srv> [24558.934607] R10: 0000000000000008 R11: 0000000000000206 R12:
>> > > 00007f3023f49358
>> > > <srv> [24558.934947] R13: 00007ffe86e5723f R14: 0000000000000000 R15:
>> > > 00007f3029a53700
>> > > <srv> [24558.935288]
>> > > <srv> [24558.935626] Allocated by task 12622:
>> > > <srv> [24558.935972] ppp_register_net_channel+0x5f/0x5c6
>> > > [ppp_generic]
>> > > <srv> [24558.936306] pppoe_connect+0xab7/0xc71 [pppoe]
>> > > <srv> [24558.936640] SyS_connect+0x14b/0x1b7
>> > > <srv> [24558.936975] do_syscall_64+0x1b1/0x31f
>> > > <srv> [24558.937319] entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > > <srv> [24558.937655]
>> > > <srv> [24558.937993] Freed by task 12622:
>> > > <srv> [24558.938321] kfree+0xb0/0x11d
>> > > <srv> [24558.938658] ppp_release+0x111/0x120 [ppp_generic]
>> > > <srv> [24558.938994] __fput+0x2ba/0x51a
>> > > <srv> [24558.939332] task_work_run+0x11c/0x13d
>> > > <srv> [24558.939676] exit_to_usermode_loop+0x7c/0xaf
>> > > <srv> [24558.940022] do_syscall_64+0x2ea/0x31f
>> > > <srv> [24558.940368] entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > > <srv> [24558.947099]
>> >
>> > Your first guess was right. It looks like we have an issue with
>> > reference counting on the channels. Can you send me your ppp_generic.o?
>> http://nuclearcat.com/ppp_generic.o
>> Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
>>
> From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
> concurrently on the same ppp_file. Even if this ppp_file was pointed at
> by two different file descriptors, I can't see how this could defeat
> the reference counting mechanism. I'm going to think more about it.
>
> Can you test with CONFIG_REFCOUNT_FULL? (and keep
> d780cd44e3ce ("drivers, net, ppp: convert ppp_file.refcnt from
> atomic_t to refcount_t")).
OK, I will try that tonight. On a vanilla kernel, or with the patch
mentioned in the previous email reverted?
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-15 19:34 ` Denys Fedoryshchenko
@ 2018-02-15 19:42 ` Guillaume Nault
2018-02-16 11:13 ` Denys Fedoryshchenko
0 siblings, 1 reply; 37+ messages in thread
From: Guillaume Nault @ 2018-02-15 19:42 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: Linux Kernel Network Developers
On Thu, Feb 15, 2018 at 09:34:42PM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-15 21:31, Guillaume Nault wrote:
> > On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
> > > On 2018-02-15 17:55, Guillaume Nault wrote:
> > > > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
> > > > > Here we go:
> > > > >
> > > > > <srv> [24558.921549]
> > > > > ==================================================================
> > > > > <srv> [24558.922167] BUG: KASAN: use-after-free in
> > > > > ppp_ioctl+0xa6a/0x1522
> > > > > [ppp_generic]
> > > > > <srv> [24558.922776] Write of size 8 at addr ffff8803d35bf3f8 by task
> > > > > accel-pppd/12622
> > > > > <srv> [24558.923113]
> > > > > <srv> [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
> > > > > W
> > > > > 4.15.3-build-0134 #1
> > > > > <srv> [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
> > > > > BIOS P80
> > > > > 04/02/2015
> > > > > <srv> [24558.924406] Call Trace:
> > > > > <srv> [24558.924753] dump_stack+0x46/0x59
> > > > > <srv> [24558.925103] print_address_description+0x6b/0x23b
> > > > > <srv> [24558.925451] ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> > > > > <srv> [24558.925797] kasan_report+0x21b/0x241
> > > > > <srv> [24558.926136] ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> > > > > <srv> [24558.926479] ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
> > > > > <srv> [24558.926829] ? sock_sendmsg+0x89/0x99
> > > > > <srv> [24558.927176] ? __vfs_write+0xd9/0x4ad
> > > > > <srv> [24558.927523] ? kernel_read+0xed/0xed
> > > > > <srv> [24558.927872] ? SyS_getpeername+0x18c/0x18c
> > > > > <srv> [24558.928213] ? bit_waitqueue+0x2a/0x2a
> > > > > <srv> [24558.928561] ? wake_atomic_t_function+0x115/0x115
> > > > > <srv> [24558.928898] vfs_ioctl+0x6e/0x81
> > > > > <srv> [24558.929228] do_vfs_ioctl+0xa00/0xb10
> > > > > <srv> [24558.929571] ? sigprocmask+0x1a6/0x1d0
> > > > > <srv> [24558.929907] ? sigsuspend+0x13e/0x13e
> > > > > <srv> [24558.930239] ? ioctl_preallocate+0x14e/0x14e
> > > > > <srv> [24558.930568] ? SyS_rt_sigprocmask+0xf1/0x142
> > > > > <srv> [24558.930904] ? sigprocmask+0x1d0/0x1d0
> > > > > <srv> [24558.931252] SyS_ioctl+0x39/0x55
> > > > > <srv> [24558.931595] ? do_vfs_ioctl+0xb10/0xb10
> > > > > <srv> [24558.931942] do_syscall_64+0x1b1/0x31f
> > > > > <srv> [24558.932288] entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > > > <srv> [24558.932627] RIP: 0033:0x7f302849d8a7
> > > > > <srv> [24558.932965] RSP: 002b:00007f3029a52af8 EFLAGS: 00000206
> > > > > ORIG_RAX:
> > > > > 0000000000000010
> > > > > <srv> [24558.933578] RAX: ffffffffffffffda RBX: 00007f3027d861e3 RCX:
> > > > > 00007f302849d8a7
> > > > > <srv> [24558.933927] RDX: 00007f3023f49468 RSI: 000000004004743a RDI:
> > > > > 0000000000003a67
> > > > > <srv> [24558.934266] RBP: 00007f3029a52b20 R08: 0000000000000000 R09:
> > > > > 000055c8308d8e40
> > > > > <srv> [24558.934607] R10: 0000000000000008 R11: 0000000000000206 R12:
> > > > > 00007f3023f49358
> > > > > <srv> [24558.934947] R13: 00007ffe86e5723f R14: 0000000000000000 R15:
> > > > > 00007f3029a53700
> > > > > <srv> [24558.935288]
> > > > > <srv> [24558.935626] Allocated by task 12622:
> > > > > <srv> [24558.935972] ppp_register_net_channel+0x5f/0x5c6
> > > > > [ppp_generic]
> > > > > <srv> [24558.936306] pppoe_connect+0xab7/0xc71 [pppoe]
> > > > > <srv> [24558.936640] SyS_connect+0x14b/0x1b7
> > > > > <srv> [24558.936975] do_syscall_64+0x1b1/0x31f
> > > > > <srv> [24558.937319] entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > > > <srv> [24558.937655]
> > > > > <srv> [24558.937993] Freed by task 12622:
> > > > > <srv> [24558.938321] kfree+0xb0/0x11d
> > > > > <srv> [24558.938658] ppp_release+0x111/0x120 [ppp_generic]
> > > > > <srv> [24558.938994] __fput+0x2ba/0x51a
> > > > > <srv> [24558.939332] task_work_run+0x11c/0x13d
> > > > > <srv> [24558.939676] exit_to_usermode_loop+0x7c/0xaf
> > > > > <srv> [24558.940022] do_syscall_64+0x2ea/0x31f
> > > > > <srv> [24558.940368] entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > > > <srv> [24558.947099]
> > > >
> > > > Your first guess was right. It looks like we have an issue with
> > > > reference counting on the channels. Can you send me your ppp_generic.o?
> > > http://nuclearcat.com/ppp_generic.o
> > > Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
> > >
> > From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
> > concurrently on the same ppp_file. Even if this ppp_file was pointed at
> > by two different file descriptors, I can't see how this could defeat
> > the reference counting mechanism. I'm going to think more about it.
> >
> > Can you test with CONFIG_REFCOUNT_FULL? (and keep
> > d780cd44e3ce ("drivers, net, ppp: convert ppp_file.refcnt from
> > atomic_t to refcount_t")).
> OK, I will try that tonight. On a vanilla kernel, or with the patch
> mentioned in the previous email reverted?
On vanilla kernel. The other is really a shot in the dark.
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-15 19:42 ` Guillaume Nault
@ 2018-02-16 11:13 ` Denys Fedoryshchenko
2018-02-16 18:48 ` Guillaume Nault
0 siblings, 1 reply; 37+ messages in thread
From: Denys Fedoryshchenko @ 2018-02-16 11:13 UTC (permalink / raw)
To: Guillaume Nault; +Cc: Linux Kernel Network Developers
On 2018-02-15 21:42, Guillaume Nault wrote:
> On Thu, Feb 15, 2018 at 09:34:42PM +0200, Denys Fedoryshchenko wrote:
>> On 2018-02-15 21:31, Guillaume Nault wrote:
>> > On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
>> > > On 2018-02-15 17:55, Guillaume Nault wrote:
>> > > > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
>> > > > > Here we go:
>> > > > >
>> > > > > <srv> [24558.921549]
>> > > > > ==================================================================
>> > > > > <srv> [24558.922167] BUG: KASAN: use-after-free in
>> > > > > ppp_ioctl+0xa6a/0x1522
>> > > > > [ppp_generic]
>> > > > > <srv> [24558.922776] Write of size 8 at addr ffff8803d35bf3f8 by task
>> > > > > accel-pppd/12622
>> > > > > <srv> [24558.923113]
>> > > > > <srv> [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
>> > > > > W
>> > > > > 4.15.3-build-0134 #1
>> > > > > <srv> [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
>> > > > > BIOS P80
>> > > > > 04/02/2015
>> > > > > <srv> [24558.924406] Call Trace:
>> > > > > <srv> [24558.924753] dump_stack+0x46/0x59
>> > > > > <srv> [24558.925103] print_address_description+0x6b/0x23b
>> > > > > <srv> [24558.925451] ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> > > > > <srv> [24558.925797] kasan_report+0x21b/0x241
>> > > > > <srv> [24558.926136] ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> > > > > <srv> [24558.926479] ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
>> > > > > <srv> [24558.926829] ? sock_sendmsg+0x89/0x99
>> > > > > <srv> [24558.927176] ? __vfs_write+0xd9/0x4ad
>> > > > > <srv> [24558.927523] ? kernel_read+0xed/0xed
>> > > > > <srv> [24558.927872] ? SyS_getpeername+0x18c/0x18c
>> > > > > <srv> [24558.928213] ? bit_waitqueue+0x2a/0x2a
>> > > > > <srv> [24558.928561] ? wake_atomic_t_function+0x115/0x115
>> > > > > <srv> [24558.928898] vfs_ioctl+0x6e/0x81
>> > > > > <srv> [24558.929228] do_vfs_ioctl+0xa00/0xb10
>> > > > > <srv> [24558.929571] ? sigprocmask+0x1a6/0x1d0
>> > > > > <srv> [24558.929907] ? sigsuspend+0x13e/0x13e
>> > > > > <srv> [24558.930239] ? ioctl_preallocate+0x14e/0x14e
>> > > > > <srv> [24558.930568] ? SyS_rt_sigprocmask+0xf1/0x142
>> > > > > <srv> [24558.930904] ? sigprocmask+0x1d0/0x1d0
>> > > > > <srv> [24558.931252] SyS_ioctl+0x39/0x55
>> > > > > <srv> [24558.931595] ? do_vfs_ioctl+0xb10/0xb10
>> > > > > <srv> [24558.931942] do_syscall_64+0x1b1/0x31f
>> > > > > <srv> [24558.932288] entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > > > > <srv> [24558.932627] RIP: 0033:0x7f302849d8a7
>> > > > > <srv> [24558.932965] RSP: 002b:00007f3029a52af8 EFLAGS: 00000206
>> > > > > ORIG_RAX:
>> > > > > 0000000000000010
>> > > > > <srv> [24558.933578] RAX: ffffffffffffffda RBX: 00007f3027d861e3 RCX:
>> > > > > 00007f302849d8a7
>> > > > > <srv> [24558.933927] RDX: 00007f3023f49468 RSI: 000000004004743a RDI:
>> > > > > 0000000000003a67
>> > > > > <srv> [24558.934266] RBP: 00007f3029a52b20 R08: 0000000000000000 R09:
>> > > > > 000055c8308d8e40
>> > > > > <srv> [24558.934607] R10: 0000000000000008 R11: 0000000000000206 R12:
>> > > > > 00007f3023f49358
>> > > > > <srv> [24558.934947] R13: 00007ffe86e5723f R14: 0000000000000000 R15:
>> > > > > 00007f3029a53700
>> > > > > <srv> [24558.935288]
>> > > > > <srv> [24558.935626] Allocated by task 12622:
>> > > > > <srv> [24558.935972] ppp_register_net_channel+0x5f/0x5c6
>> > > > > [ppp_generic]
>> > > > > <srv> [24558.936306] pppoe_connect+0xab7/0xc71 [pppoe]
>> > > > > <srv> [24558.936640] SyS_connect+0x14b/0x1b7
>> > > > > <srv> [24558.936975] do_syscall_64+0x1b1/0x31f
>> > > > > <srv> [24558.937319] entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > > > > <srv> [24558.937655]
>> > > > > <srv> [24558.937993] Freed by task 12622:
>> > > > > <srv> [24558.938321] kfree+0xb0/0x11d
>> > > > > <srv> [24558.938658] ppp_release+0x111/0x120 [ppp_generic]
>> > > > > <srv> [24558.938994] __fput+0x2ba/0x51a
>> > > > > <srv> [24558.939332] task_work_run+0x11c/0x13d
>> > > > > <srv> [24558.939676] exit_to_usermode_loop+0x7c/0xaf
>> > > > > <srv> [24558.940022] do_syscall_64+0x2ea/0x31f
>> > > > > <srv> [24558.940368] entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > > > > <srv> [24558.947099]
>> > > >
>> > > > Your first guess was right. It looks like we have an issue with
>> > > > reference counting on the channels. Can you send me your ppp_generic.o?
>> > > http://nuclearcat.com/ppp_generic.o
>> > > Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
>> > >
>> > From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
>> > concurrently on the same ppp_file. Even if this ppp_file was pointed at
>> > by two different file descriptors, I can't see how this could defeat
>> > the reference counting mechanism. I'm going to think more about it.
>> >
>> > Can you test with CONFIG_REFCOUNT_FULL? (and keep
>> > d780cd44e3ce ("drivers, net, ppp: convert ppp_file.refcnt from
>> > atomic_t to refcount_t")).
>> OK, I will try that tonight. On a vanilla kernel, or with the patch
>> mentioned in the previous email reverted?
> On vanilla kernel. The other is really a shot in the dark.
As far as I can see, only KASAN triggered again (and the server
rebooted shortly after that); there is nothing else in the log:
[ 1848.527234]
==================================================================
[ 1848.527863] BUG: KASAN: use-after-free in ppp_ioctl+0xa68/0x14e7
[ppp_generic]
[ 1848.528468] Write of size 8 at addr ffff880354d3fa38 by task
accel-pppd/12626
[ 1848.528807]
[ 1848.529143] CPU: 5 PID: 12626 Comm: accel-pppd Tainted: G W
4.15.3-build-0134 #2
[ 1848.529755] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
04/02/2015
[ 1848.530100] Call Trace:
[ 1848.530448] dump_stack+0x46/0x59
[ 1848.530793] print_address_description+0x6b/0x23b
[ 1848.531143] ? ppp_ioctl+0xa68/0x14e7 [ppp_generic]
[ 1848.531488] kasan_report+0x21b/0x241
[ 1848.538248] ppp_ioctl+0xa68/0x14e7 [ppp_generic]
[ 1848.538604] ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
[ 1848.538949] ? fsnotify+0x770/0x7fe
[ 1848.539295] ? __fsnotify_inode_delete+0xc/0xc
[ 1848.539648] ? kernel_read+0xed/0xed
[ 1848.539993] ? memcpy+0x34/0x46
[ 1848.540332] vfs_ioctl+0x6e/0x81
[ 1848.540679] do_vfs_ioctl+0xa00/0xb10
[ 1848.541029] ? __fsnotify_inode_delete+0xc/0xc
[ 1848.541375] ? fcntl_setlk+0x6b8/0x6b8
[ 1848.541721] ? ioctl_preallocate+0x14e/0x14e
[ 1848.542064] ? rcu_is_watching+0x5/0x9
[ 1848.542404] ? dput+0x4a/0x432
[ 1848.542739] ? __fput+0x4e4/0x51a
[ 1848.543080] SyS_ioctl+0x39/0x55
[ 1848.543418] ? do_vfs_ioctl+0xb10/0xb10
[ 1848.543759] do_syscall_64+0x1b1/0x31f
[ 1848.544100] entry_SYSCALL_64_after_hwframe+0x21/0x86
[ 1848.544439] RIP: 0033:0x7f752170a8a7
[ 1848.544780] RSP: 002b:00007f7522cbfaf8 EFLAGS: 00000206 ORIG_RAX:
0000000000000010
[ 1848.545403] RAX: ffffffffffffffda RBX: 00007f7520ff31e3 RCX:
00007f752170a8a7
[ 1848.545752] RDX: 00007f7518cba868 RSI: 000000004004743a RDI:
0000000000007c31
[ 1848.546097] RBP: 00007f7522cbfb20 R08: 0000000000000000 R09:
000055ba01b29e40
[ 1848.546448] R10: 0000000073646172 R11: 0000000000000206 R12:
00007f7518cba758
[ 1848.546802] R13: 00007ffe84df1a0f R14: 0000000000000000 R15:
00007f7522cc0700
[ 1848.547148]
[ 1848.547490] Allocated by task 12626:
[ 1848.547834] ppp_register_net_channel+0x5f/0x5c6 [ppp_generic]
[ 1848.548181] pppoe_connect+0xab7/0xc71 [pppoe]
[ 1848.548535] SyS_connect+0x14b/0x1b7
[ 1848.548878] do_syscall_64+0x1b1/0x31f
[ 1848.549219] entry_SYSCALL_64_after_hwframe+0x21/0x86
[ 1848.549562]
[ 1848.549901] Freed by task 12626:
[ 1848.550249] kfree+0xb0/0x11d
[ 1848.550599] ppp_release+0x112/0x121 [ppp_generic]
[ 1848.550941] __fput+0x2ba/0x51a
[ 1848.551284] task_work_run+0x11c/0x13d
[ 1848.551626] exit_to_usermode_loop+0x7c/0xaf
[ 1848.551972] do_syscall_64+0x2ea/0x31f
[ 1848.552322] entry_SYSCALL_64_after_hwframe+0x21/0x86
[ 1848.552666]
[ 1848.553012] The buggy address belongs to the object at
ffff880354d3f980
[ 1848.553012] which belongs to the cache kmalloc-256 of size 256
[ 1848.553632] The buggy address is located 184 bytes inside of
[ 1848.553632] 256-byte region [ffff880354d3f980, ffff880354d3fa80)
[ 1848.554248] The buggy address belongs to the page:
[ 1848.554591] page:ffffea000d534f00 count:1 mapcount:0 mapping:
(null) index:0xffff880354d3f980 compound_mapcount: 0
[ 1848.555211] flags: 0x17ffe00000008100(slab|head)
[ 1848.555562] raw: 17ffe00000008100 0000000000000000 ffff880354d3f980
000000010033002c
[ 1848.556174] raw: ffffea000d0f7420 ffffea000cf84d20 ffff8803f1c0f480
0000000000000000
[ 1848.556783] page dumped because: kasan: bad access detected
[ 1848.557123]
[ 1848.557457] Memory state around the buggy address:
[ 1848.557800] ffff880354d3f900: fb fb fb fb fb fb fb fb fc fc fc fc fc
fc fc fc
[ 1848.558416] ffff880354d3f980: fb fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb
[ 1848.559028] >ffff880354d3fa00: fb fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb
[ 1848.559643] ^
[ 1848.559989] ffff880354d3fa80: fc fc fc fc fc fc fc fc 00 00 00 00 00
00 00 00
[ 1848.560603] ffff880354d3fb00: 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00
[ 1848.561217]
==================================================================
[ 1848.561830] Disabling lock debugging due to kernel taint
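The address arithmetic in this report can be checked directly. A minimal
sketch, assuming generic KASAN's scale of 8 bytes of memory per shadow byte;
all variable names are illustrative, and the values are copied from the log
above.

```python
# Assumption: generic KASAN, where one shadow byte describes 8 bytes of memory.
KASAN_SHADOW_SCALE = 8

bad_addr = 0xFFFF880354D3FA38   # "Write of size 8 at addr ..."
obj_start = 0xFFFF880354D3F980  # start of the kmalloc-256 object
obj_end = 0xFFFF880354D3FA80    # end of the 256-byte region
row_start = 0xFFFF880354D3FA00  # shadow dump row marked with '>'

# Shadow bytes printed on the marked row (16 bytes cover 128 bytes of memory)
row = "fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb".split()

offset = bad_addr - obj_start
size = obj_end - obj_start
shadow_index = (bad_addr - row_start) // KASAN_SHADOW_SCALE

print(offset, size, shadow_index, row[shadow_index])
```

The offset and size match the "184 bytes inside of 256-byte region" line.
The shadow byte covering the faulting address is fb, which, to my
understanding, generic KASAN uses for a freed kmalloc object (fc marks a
redzone), consistent with the use-after-free classification.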
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-16 11:13 ` Denys Fedoryshchenko
@ 2018-02-16 18:48 ` Guillaume Nault
2018-02-18 10:01 ` Denys Fedoryshchenko
` (2 more replies)
0 siblings, 3 replies; 37+ messages in thread
From: Guillaume Nault @ 2018-02-16 18:48 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: Linux Kernel Network Developers
On Fri, Feb 16, 2018 at 01:13:18PM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-15 21:42, Guillaume Nault wrote:
> > On Thu, Feb 15, 2018 at 09:34:42PM +0200, Denys Fedoryshchenko wrote:
> > > On 2018-02-15 21:31, Guillaume Nault wrote:
> > > > On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
> > > > > On 2018-02-15 17:55, Guillaume Nault wrote:
> > > > > > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
> > > > > > > Here we go:
> > > > > > >
> > > > > > > <srv> [24558.921549]
> > > > > > > ==================================================================
> > > > > > > <srv> [24558.922167] BUG: KASAN: use-after-free in
> > > > > > > ppp_ioctl+0xa6a/0x1522
> > > > > > > [ppp_generic]
> > > > > > > <srv> [24558.922776] Write of size 8 at addr ffff8803d35bf3f8 by task
> > > > > > > accel-pppd/12622
> > > > > > > <srv> [24558.923113]
> > > > > > > <srv> [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
> > > > > > > W
> > > > > > > 4.15.3-build-0134 #1
> > > > > > > <srv> [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
> > > > > > > BIOS P80
> > > > > > > 04/02/2015
> > > > > > > <srv> [24558.924406] Call Trace:
> > > > > > > <srv> [24558.924753] dump_stack+0x46/0x59
> > > > > > > <srv> [24558.925103] print_address_description+0x6b/0x23b
> > > > > > > <srv> [24558.925451] ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> > > > > > > <srv> [24558.925797] kasan_report+0x21b/0x241
> > > > > > > <srv> [24558.926136] ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> > > > > > > <srv> [24558.926479] ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
> > > > > > > <srv> [24558.926829] ? sock_sendmsg+0x89/0x99
> > > > > > > <srv> [24558.927176] ? __vfs_write+0xd9/0x4ad
> > > > > > > <srv> [24558.927523] ? kernel_read+0xed/0xed
> > > > > > > <srv> [24558.927872] ? SyS_getpeername+0x18c/0x18c
> > > > > > > <srv> [24558.928213] ? bit_waitqueue+0x2a/0x2a
> > > > > > > <srv> [24558.928561] ? wake_atomic_t_function+0x115/0x115
> > > > > > > <srv> [24558.928898] vfs_ioctl+0x6e/0x81
> > > > > > > <srv> [24558.929228] do_vfs_ioctl+0xa00/0xb10
> > > > > > > <srv> [24558.929571] ? sigprocmask+0x1a6/0x1d0
> > > > > > > <srv> [24558.929907] ? sigsuspend+0x13e/0x13e
> > > > > > > <srv> [24558.930239] ? ioctl_preallocate+0x14e/0x14e
> > > > > > > <srv> [24558.930568] ? SyS_rt_sigprocmask+0xf1/0x142
> > > > > > > <srv> [24558.930904] ? sigprocmask+0x1d0/0x1d0
> > > > > > > <srv> [24558.931252] SyS_ioctl+0x39/0x55
> > > > > > > <srv> [24558.931595] ? do_vfs_ioctl+0xb10/0xb10
> > > > > > > <srv> [24558.931942] do_syscall_64+0x1b1/0x31f
> > > > > > > <srv> [24558.932288] entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > > > > > <srv> [24558.932627] RIP: 0033:0x7f302849d8a7
> > > > > > > <srv> [24558.932965] RSP: 002b:00007f3029a52af8 EFLAGS: 00000206
> > > > > > > ORIG_RAX:
> > > > > > > 0000000000000010
> > > > > > > <srv> [24558.933578] RAX: ffffffffffffffda RBX: 00007f3027d861e3 RCX:
> > > > > > > 00007f302849d8a7
> > > > > > > <srv> [24558.933927] RDX: 00007f3023f49468 RSI: 000000004004743a RDI:
> > > > > > > 0000000000003a67
> > > > > > > <srv> [24558.934266] RBP: 00007f3029a52b20 R08: 0000000000000000 R09:
> > > > > > > 000055c8308d8e40
> > > > > > > <srv> [24558.934607] R10: 0000000000000008 R11: 0000000000000206 R12:
> > > > > > > 00007f3023f49358
> > > > > > > <srv> [24558.934947] R13: 00007ffe86e5723f R14: 0000000000000000 R15:
> > > > > > > 00007f3029a53700
> > > > > > > <srv> [24558.935288]
> > > > > > > <srv> [24558.935626] Allocated by task 12622:
> > > > > > > <srv> [24558.935972] ppp_register_net_channel+0x5f/0x5c6
> > > > > > > [ppp_generic]
> > > > > > > <srv> [24558.936306] pppoe_connect+0xab7/0xc71 [pppoe]
> > > > > > > <srv> [24558.936640] SyS_connect+0x14b/0x1b7
> > > > > > > <srv> [24558.936975] do_syscall_64+0x1b1/0x31f
> > > > > > > <srv> [24558.937319] entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > > > > > <srv> [24558.937655]
> > > > > > > <srv> [24558.937993] Freed by task 12622:
> > > > > > > <srv> [24558.938321] kfree+0xb0/0x11d
> > > > > > > <srv> [24558.938658] ppp_release+0x111/0x120 [ppp_generic]
> > > > > > > <srv> [24558.938994] __fput+0x2ba/0x51a
> > > > > > > <srv> [24558.939332] task_work_run+0x11c/0x13d
> > > > > > > <srv> [24558.939676] exit_to_usermode_loop+0x7c/0xaf
> > > > > > > <srv> [24558.940022] do_syscall_64+0x2ea/0x31f
> > > > > > > <srv> [24558.940368] entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > > > > > <srv> [24558.947099]
> > > > > >
> > > > > > Your first guess was right. It looks like we have an issue with
> > > > > > reference counting on the channels. Can you send me your ppp_generic.o?
> > > > > http://nuclearcat.com/ppp_generic.o
> > > > > Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
> > > > >
> > > > From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
> > > > concurrently on the same ppp_file. Even if this ppp_file was pointed at
> > > > by two different file descriptors, I can't see how this could defeat
> > > > the reference counting mechanism. I'm going to think more about it.
> > > >
> > > > Can you test with CONFIG_REFCOUNT_FULL? (and keep
> > > > d780cd44e3ce ("drivers, net, ppp: convert ppp_file.refcnt from
> > > > atomic_t to refcount_t")).
> > > OK, I will try that tonight. On a vanilla kernel, or with the patch
> > > mentioned in the previous email reverted?
> > On vanilla kernel. The other is really a shot in the dark.
>
> As far as I can see, only KASAN triggered again (and the server rebooted
> shortly after that); there is nothing else in the log:
>
Ok, so no refcount failure detected. Not what I expected... but that's
still useful information. It's getting even harder to find a ppp scenario
that could lead to such symptoms.
If that's acceptable for you, you can try reverting the few commits
that went in after 4.14:
02612bb05e51df8489db5e94d0cf8d1c81f87b0c pppoe: take ->needed_headroom of lower device into account on xmit
0171c41835591e9aa2e384b703ef9a6ae367c610 ppp: unlock all_ppp_mutex before registering device
e6675000f9a404f7651724c0b2e2e71f7247d3a1 ppp: exit_net cleanup checks added
f02b2320b27c16b644691267ee3b5c110846f49e ppp: Destroy the mutex when cleanup
90e229ef61fad240554f5899eb122fbe44990f78 ppp: allow usage in namespaces
709c89b45b874b2f81a074b8802a736009873f48 drivers, net, ppp: convert syncppp.refcnt from atomic_t to refcount_t
d780cd44e3cea119a3346e6d7c04d35b9c50d54b drivers, net, ppp: convert ppp_file.refcnt from atomic_t to refcount_t
313a912155c78ed87ad6fca175dc56b75fd00a58 drivers, net, ppp: convert asyncppp.refcnt from atomic_t to refcount_t
Sorry, but I have nothing better to propose for now. At least that
should help narrow down the problem space.
I'm going to stress test ppp_generic and pppoe on my side.
> [ 1848.527234]
> ==================================================================
> [ 1848.527863] BUG: KASAN: use-after-free in ppp_ioctl+0xa68/0x14e7
> [ppp_generic]
> [ 1848.528468] Write of size 8 at addr ffff880354d3fa38 by task
> accel-pppd/12626
> [ 1848.528807]
> [ 1848.529143] CPU: 5 PID: 12626 Comm: accel-pppd Tainted: G W
> 4.15.3-build-0134 #2
> [ 1848.529755] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
> 04/02/2015
> [ 1848.530100] Call Trace:
> [ 1848.530448] dump_stack+0x46/0x59
> [ 1848.530793] print_address_description+0x6b/0x23b
> [ 1848.531143] ? ppp_ioctl+0xa68/0x14e7 [ppp_generic]
> [ 1848.531488] kasan_report+0x21b/0x241
> [ 1848.538248] ppp_ioctl+0xa68/0x14e7 [ppp_generic]
> [ 1848.538604] ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
> [ 1848.538949] ? fsnotify+0x770/0x7fe
> [ 1848.539295] ? __fsnotify_inode_delete+0xc/0xc
> [ 1848.539648] ? kernel_read+0xed/0xed
> [ 1848.539993] ? memcpy+0x34/0x46
> [ 1848.540332] vfs_ioctl+0x6e/0x81
> [ 1848.540679] do_vfs_ioctl+0xa00/0xb10
> [ 1848.541029] ? __fsnotify_inode_delete+0xc/0xc
> [ 1848.541375] ? fcntl_setlk+0x6b8/0x6b8
> [ 1848.541721] ? ioctl_preallocate+0x14e/0x14e
> [ 1848.542064] ? rcu_is_watching+0x5/0x9
> [ 1848.542404] ? dput+0x4a/0x432
> [ 1848.542739] ? __fput+0x4e4/0x51a
> [ 1848.543080] SyS_ioctl+0x39/0x55
> [ 1848.543418] ? do_vfs_ioctl+0xb10/0xb10
> [ 1848.543759] do_syscall_64+0x1b1/0x31f
> [ 1848.544100] entry_SYSCALL_64_after_hwframe+0x21/0x86
> [ 1848.544439] RIP: 0033:0x7f752170a8a7
> [ 1848.544780] RSP: 002b:00007f7522cbfaf8 EFLAGS: 00000206 ORIG_RAX:
> 0000000000000010
> [ 1848.545403] RAX: ffffffffffffffda RBX: 00007f7520ff31e3 RCX:
> 00007f752170a8a7
> [ 1848.545752] RDX: 00007f7518cba868 RSI: 000000004004743a RDI:
> 0000000000007c31
> [ 1848.546097] RBP: 00007f7522cbfb20 R08: 0000000000000000 R09:
> 000055ba01b29e40
> [ 1848.546448] R10: 0000000073646172 R11: 0000000000000206 R12:
> 00007f7518cba758
> [ 1848.546802] R13: 00007ffe84df1a0f R14: 0000000000000000 R15:
> 00007f7522cc0700
> [ 1848.547148]
> [ 1848.547490] Allocated by task 12626:
> [ 1848.547834] ppp_register_net_channel+0x5f/0x5c6 [ppp_generic]
> [ 1848.548181] pppoe_connect+0xab7/0xc71 [pppoe]
> [ 1848.548535] SyS_connect+0x14b/0x1b7
> [ 1848.548878] do_syscall_64+0x1b1/0x31f
> [ 1848.549219] entry_SYSCALL_64_after_hwframe+0x21/0x86
> [ 1848.549562]
> [ 1848.549901] Freed by task 12626:
> [ 1848.550249] kfree+0xb0/0x11d
> [ 1848.550599] ppp_release+0x112/0x121 [ppp_generic]
> [ 1848.550941] __fput+0x2ba/0x51a
> [ 1848.551284] task_work_run+0x11c/0x13d
> [ 1848.551626] exit_to_usermode_loop+0x7c/0xaf
> [ 1848.551972] do_syscall_64+0x2ea/0x31f
> [ 1848.552322] entry_SYSCALL_64_after_hwframe+0x21/0x86
> [ 1848.552666]
> [ 1848.553012] The buggy address belongs to the object at ffff880354d3f980
> [ 1848.553012] which belongs to the cache kmalloc-256 of size 256
> [ 1848.553632] The buggy address is located 184 bytes inside of
> [ 1848.553632] 256-byte region [ffff880354d3f980, ffff880354d3fa80)
> [ 1848.554248] The buggy address belongs to the page:
> [ 1848.554591] page:ffffea000d534f00 count:1 mapcount:0 mapping:
> (null) index:0xffff880354d3f980 compound_mapcount: 0
> [ 1848.555211] flags: 0x17ffe00000008100(slab|head)
> [ 1848.555562] raw: 17ffe00000008100 0000000000000000 ffff880354d3f980
> 000000010033002c
> [ 1848.556174] raw: ffffea000d0f7420 ffffea000cf84d20 ffff8803f1c0f480
> 0000000000000000
> [ 1848.556783] page dumped because: kasan: bad access detected
> [ 1848.557123]
> [ 1848.557457] Memory state around the buggy address:
> [ 1848.557800] ffff880354d3f900: fb fb fb fb fb fb fb fb fc fc fc fc fc fc
> fc fc
> [ 1848.558416] ffff880354d3f980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> fb fb
> [ 1848.559028] >ffff880354d3fa00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> fb fb
> [ 1848.559643] ^
> [ 1848.559989] ffff880354d3fa80: fc fc fc fc fc fc fc fc 00 00 00 00 00 00
> 00 00
> [ 1848.560603] ffff880354d3fb00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00
> [ 1848.561217]
> ==================================================================
> [ 1848.561830] Disabling lock debugging due to kernel taint
>
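As a cross-check on the report quoted above, the offset that KASAN prints
can be recomputed from the two addresses in the log. This is only a
sketch: the helper names (object_offset, shadow_index) are made up for
this note, and the shadow byte meanings are quoted from memory of
mm/kasan/kasan.h for kernels of this era, so treat them as an assumption.

```python
import re

# Assumed shadow byte meanings (from memory of mm/kasan/kasan.h, 4.15 era).
SHADOW_MEANING = {
    0x00: "accessible",
    0xFB: "freed kmalloc object",
    0xFC: "kmalloc redzone",
}

# Each shadow byte describes 8 bytes of real memory.
KASAN_GRANULE = 8


def object_offset(report: str) -> int:
    """Return the offset of the faulting address inside its slab object."""
    access = re.search(r"at addr ([0-9a-f]{16})", report)
    obj = re.search(r"object at\s+([0-9a-f]{16})", report)
    if access is None or obj is None:
        raise ValueError("report does not contain both addresses")
    return int(access.group(1), 16) - int(obj.group(1), 16)


def shadow_index(addr: int, row_base: int) -> int:
    """Index of the shadow byte for addr within a 'Memory state' row."""
    return (addr - row_base) // KASAN_GRANULE


# Lines copied from the report above (log wrapping removed).
REPORT = """\
Write of size 8 at addr ffff880354d3fa38 by task accel-pppd/12626
The buggy address belongs to the object at ffff880354d3f980
"""

offset = object_offset(REPORT)
index = shadow_index(0xFFFF880354D3FA38, 0xFFFF880354D3FA00)
print(offset, index, SHADOW_MEANING[0xFB])
```

The computed offset matches the "184 bytes inside" line, and the shadow
byte for the faulting address falls in the row that is entirely fb, which
(under the assumption above) marks memory that has already been freed.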
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-16 18:48 ` Guillaume Nault
@ 2018-02-18 10:01 ` Denys Fedoryshchenko
2018-02-21 18:38 ` Guillaume Nault
2018-02-20 9:05 ` Denys Fedoryshchenko
2018-02-21 10:26 ` Denys Fedoryshchenko
2 siblings, 1 reply; 37+ messages in thread
From: Denys Fedoryshchenko @ 2018-02-18 10:01 UTC (permalink / raw)
To: Guillaume Nault; +Cc: Linux Kernel Network Developers
On 2018-02-16 20:48, Guillaume Nault wrote:
> On Fri, Feb 16, 2018 at 01:13:18PM +0200, Denys Fedoryshchenko wrote:
>> On 2018-02-15 21:42, Guillaume Nault wrote:
>> > On Thu, Feb 15, 2018 at 09:34:42PM +0200, Denys Fedoryshchenko wrote:
>> > > On 2018-02-15 21:31, Guillaume Nault wrote:
>> > > > On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
>> > > > > On 2018-02-15 17:55, Guillaume Nault wrote:
>> > > > > > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
>> > > > > > > Here we go:
>> > > > > > >
>> > > > > > > <srv> [24558.921549]
>> > > > > > > ==================================================================
>> > > > > > > <srv> [24558.922167] BUG: KASAN: use-after-free in
>> > > > > > > ppp_ioctl+0xa6a/0x1522
>> > > > > > > [ppp_generic]
>> > > > > > > <srv> [24558.922776] Write of size 8 at addr ffff8803d35bf3f8 by task
>> > > > > > > accel-pppd/12622
>> > > > > > > <srv> [24558.923113]
>> > > > > > > <srv> [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
>> > > > > > > W
>> > > > > > > 4.15.3-build-0134 #1
>> > > > > > > <srv> [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
>> > > > > > > BIOS P80
>> > > > > > > 04/02/2015
>> > > > > > > <srv> [24558.924406] Call Trace:
>> > > > > > > <srv> [24558.924753] dump_stack+0x46/0x59
>> > > > > > > <srv> [24558.925103] print_address_description+0x6b/0x23b
>> > > > > > > <srv> [24558.925451] ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> > > > > > > <srv> [24558.925797] kasan_report+0x21b/0x241
>> > > > > > > <srv> [24558.926136] ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> > > > > > > <srv> [24558.926479] ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
>> > > > > > > <srv> [24558.926829] ? sock_sendmsg+0x89/0x99
>> > > > > > > <srv> [24558.927176] ? __vfs_write+0xd9/0x4ad
>> > > > > > > <srv> [24558.927523] ? kernel_read+0xed/0xed
>> > > > > > > <srv> [24558.927872] ? SyS_getpeername+0x18c/0x18c
>> > > > > > > <srv> [24558.928213] ? bit_waitqueue+0x2a/0x2a
>> > > > > > > <srv> [24558.928561] ? wake_atomic_t_function+0x115/0x115
>> > > > > > > <srv> [24558.928898] vfs_ioctl+0x6e/0x81
>> > > > > > > <srv> [24558.929228] do_vfs_ioctl+0xa00/0xb10
>> > > > > > > <srv> [24558.929571] ? sigprocmask+0x1a6/0x1d0
>> > > > > > > <srv> [24558.929907] ? sigsuspend+0x13e/0x13e
>> > > > > > > <srv> [24558.930239] ? ioctl_preallocate+0x14e/0x14e
>> > > > > > > <srv> [24558.930568] ? SyS_rt_sigprocmask+0xf1/0x142
>> > > > > > > <srv> [24558.930904] ? sigprocmask+0x1d0/0x1d0
>> > > > > > > <srv> [24558.931252] SyS_ioctl+0x39/0x55
>> > > > > > > <srv> [24558.931595] ? do_vfs_ioctl+0xb10/0xb10
>> > > > > > > <srv> [24558.931942] do_syscall_64+0x1b1/0x31f
>> > > > > > > <srv> [24558.932288] entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > > > > > > <srv> [24558.932627] RIP: 0033:0x7f302849d8a7
>> > > > > > > <srv> [24558.932965] RSP: 002b:00007f3029a52af8 EFLAGS: 00000206
>> > > > > > > ORIG_RAX:
>> > > > > > > 0000000000000010
>> > > > > > > <srv> [24558.933578] RAX: ffffffffffffffda RBX: 00007f3027d861e3 RCX:
>> > > > > > > 00007f302849d8a7
>> > > > > > > <srv> [24558.933927] RDX: 00007f3023f49468 RSI: 000000004004743a RDI:
>> > > > > > > 0000000000003a67
>> > > > > > > <srv> [24558.934266] RBP: 00007f3029a52b20 R08: 0000000000000000 R09:
>> > > > > > > 000055c8308d8e40
>> > > > > > > <srv> [24558.934607] R10: 0000000000000008 R11: 0000000000000206 R12:
>> > > > > > > 00007f3023f49358
>> > > > > > > <srv> [24558.934947] R13: 00007ffe86e5723f R14: 0000000000000000 R15:
>> > > > > > > 00007f3029a53700
>> > > > > > > <srv> [24558.935288]
>> > > > > > > <srv> [24558.935626] Allocated by task 12622:
>> > > > > > > <srv> [24558.935972] ppp_register_net_channel+0x5f/0x5c6
>> > > > > > > [ppp_generic]
>> > > > > > > <srv> [24558.936306] pppoe_connect+0xab7/0xc71 [pppoe]
>> > > > > > > <srv> [24558.936640] SyS_connect+0x14b/0x1b7
>> > > > > > > <srv> [24558.936975] do_syscall_64+0x1b1/0x31f
>> > > > > > > <srv> [24558.937319] entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > > > > > > <srv> [24558.937655]
>> > > > > > > <srv> [24558.937993] Freed by task 12622:
>> > > > > > > <srv> [24558.938321] kfree+0xb0/0x11d
>> > > > > > > <srv> [24558.938658] ppp_release+0x111/0x120 [ppp_generic]
>> > > > > > > <srv> [24558.938994] __fput+0x2ba/0x51a
>> > > > > > > <srv> [24558.939332] task_work_run+0x11c/0x13d
>> > > > > > > <srv> [24558.939676] exit_to_usermode_loop+0x7c/0xaf
>> > > > > > > <srv> [24558.940022] do_syscall_64+0x2ea/0x31f
>> > > > > > > <srv> [24558.940368] entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > > > > > > <srv> [24558.947099]
>> > > > > >
>> > > > > > Your first guess was right. It looks like we have an issue with
>> > > > > > reference counting on the channels. Can you send me your ppp_generic.o?
>> > > > > http://nuclearcat.com/ppp_generic.o
>> > > > > Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
>> > > > >
>> > > > From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
>> > > > concurrently on the same ppp_file. Even if this ppp_file was pointed at
>> > > > by two different file descriptors, I can't see how this could defeat
>> > > > the reference counting mechanism. I'm going to think more about it.
>> > > >
>> > > > Can you test with CONFIG_REFCOUNT_FULL? (and keep
>> > > > d780cd44e3ce ("drivers, net, ppp: convert ppp_file.refcnt from
>> > > > atomic_t to refcount_t")).
>> > > Ok, i will try that tonight. On vanilla kernel or reversing
>> > > mentioned in
>> > > previous email patch?
>> > On vanilla kernel. The other is really a shot in the dark.
>>
>> As far as i can see there is only KASAN triggered again(and server
>> rebooted
>> shortly after that), but nothing else:
>>
> Ok, so no refcount failure detected. Not what I expected... but that's
> still an information. It's getting even harder to find a ppp scenario
> that could lead to such symptoms.
> If that's acceptable for you, you can try reverting the few commits
> that entered after 4.14.
>
> 02612bb05e51df8489db5e94d0cf8d1c81f87b0c pppoe: take ->needed_headroom
> of lower device into account on xmit
> 0171c41835591e9aa2e384b703ef9a6ae367c610 ppp: unlock all_ppp_mutex
> before registering device
> e6675000f9a404f7651724c0b2e2e71f7247d3a1 ppp: exit_net cleanup checks
> added
> f02b2320b27c16b644691267ee3b5c110846f49e ppp: Destroy the mutex when
> cleanup
> 90e229ef61fad240554f5899eb122fbe44990f78 ppp: allow usage in namespaces
> 709c89b45b874b2f81a074b8802a736009873f48 drivers, net, ppp: convert
> syncppp.refcnt from atomic_t to refcount_t
> d780cd44e3cea119a3346e6d7c04d35b9c50d54b drivers, net, ppp: convert
> ppp_file.refcnt from atomic_t to refcount_t
> 313a912155c78ed87ad6fca175dc56b75fd00a58 drivers, net, ppp: convert
> asyncppp.refcnt from atomic_t to refcount_t
>
> Sorry, but I have nothing better to propose for now. At least that
> should help narrowing the problem space.
> I'm going to stress test ppp_generic and pppoe on my side.
>
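For reference, on the quoted remark that ioctl(PPPIOCCONNECT) is
involved: the RSI value in the user-space register dump of the KASAN
reports (000000004004743a) should be the ioctl request number, since RSI
carries the second syscall argument on x86-64. The sketch below decodes
it. The helper names (decode_ioctl, iow) are hypothetical; the bit layout
is the generic one from asm-generic/ioctl.h, and the PPPIOCCONNECT
definition _IOW('t', 58, int) is quoted from memory of linux/ppp-ioctl.h.

```python
# Generic Linux ioctl number layout (asm-generic/ioctl.h):
#   bits 0-7: nr, bits 8-15: type, bits 16-29: size, bits 30-31: direction
IOC_WRITE = 1  # user space writes, kernel reads


def decode_ioctl(cmd: int) -> dict:
    """Split an ioctl request number into its four fields."""
    return {
        "nr": cmd & 0xFF,
        "type": chr((cmd >> 8) & 0xFF),
        "size": (cmd >> 16) & 0x3FFF,
        "dir": (cmd >> 30) & 0x3,
    }


def iow(type_char: str, nr: int, size: int) -> int:
    """Python equivalent of the C macro _IOW(type, nr, <type of given size>)."""
    return (IOC_WRITE << 30) | (size << 16) | (ord(type_char) << 8) | nr


# Assumed definition: #define PPPIOCCONNECT _IOW('t', 58, int)
PPPIOCCONNECT = iow("t", 58, 4)

rsi = 0x4004743A  # RSI from the register dump in the reports
fields = decode_ioctl(rsi)
print(fields, rsi == PPPIOCCONNECT)
```

If the assumed definition is right, the value in RSI is exactly
PPPIOCCONNECT, which agrees with the ppp_ioctl frame in the call trace.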
Quick update.
Testing with the first 5 patches reverted didn't change anything.
But reverting more, with the last 4 patches as well (I reverted them all
together), does change things. I probably need to repeat the test for one
more night with only the refcount_t patches reverted.
[25222.173840] ------------[ cut here ]------------
[25222.174259] NETDEV WATCHDOG: eth1 (ixgbe): transmit queue 3 timed
out
[25222.174618] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:323
dev_watchdog+0x44a/0x555
[25222.175212] Modules linked in: pppoe pppox ppp_generic slhc
netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre
nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4
xt_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4
xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net
ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
[25222.177133] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G B W
4.15.3-build-0134 #6
[25222.184121] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
04/02/2015
[25222.184457] RIP: 0010:dev_watchdog+0x44a/0x555
[25222.184791] RSP: 0018:ffff8803f22c7d98 EFLAGS: 00010292
[25222.185127] RAX: 0000000000000000 RBX: ffff8803ded00438 RCX:
0000000000000000
[25222.185463] RDX: 0000000000000001 RSI: 0000000000000002 RDI:
ffffed007e458fa8
[25222.185797] RBP: ffff8803ded00000 R08: 0000000000000001 R09:
0000000000000000
[25222.186133] R10: ffff8803f22c7e30 R11: 0000000000000001 R12:
ffff8803ded28450
[25222.186471] R13: 0000000000000003 R14: dffffc0000000000 R15:
ffff8803ded283c0
[25222.186804] FS: 0000000000000000(0000) GS:ffff8803f22c0000(0000)
knlGS:0000000000000000
[25222.187401] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25222.187739] CR2: 0000561f5bffc128 CR3: 0000000445a0d003 CR4:
00000000001606e0
[25222.188077] Call Trace:
[25222.188410] <IRQ>
[25222.188740] ? dev_graft_qdisc+0xfa/0xfa
[25222.189072] call_timer_fn+0x15/0x72
[25222.189407] ? dev_graft_qdisc+0xfa/0xfa
[25222.189741] expire_timers+0x1b9/0x1d5
[25222.190072] run_timer_softirq+0x184/0x361
[25222.190400] ? expire_timers+0x1d5/0x1d5
[25222.190723] ? enqueue_hrtimer+0xce/0xd8
[25222.191048] ? __hrtimer_run_queues+0x1ec/0x24d
[25222.191373] __do_softirq+0x17f/0x34a
[25222.191702] irq_exit+0x8f/0xf9
[25222.192034] smp_apic_timer_interrupt+0xcb/0xd6
[25222.192365] apic_timer_interrupt+0x92/0xa0
[25222.192695] </IRQ>
[25222.193023] RIP: 0010:mwait_idle+0x99/0xac
[25222.193355] RSP: 0018:ffff8803f030fef8 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff11
[25222.193956] RAX: 0000000000000000 RBX: ffff8803f02e3500 RCX:
0000000000000000
[25222.194290] RDX: 1ffff1007e05c6a0 RSI: 0000000000000000 RDI:
0000000000000000
[25222.194626] RBP: ffff8803f02e3500 R08: ffffed007ccc8eef R09:
ffff8803e6647728
[25222.194958] R10: ffff8803f030fdd0 R11: 0000000000000001 R12:
0000000000000000
[25222.195292] R13: dffffc0000000000 R14: ffffed007e05c6a0 R15:
ffff8803f02e3500
[25222.195627] do_idle+0xe6/0x19a
[25222.195963] cpu_startup_entry+0x18/0x1a
[25222.196295] secondary_startup_64+0xa5/0xb0
[25222.196625] Code: 68 87 40 01 00 75 3f 48 89 ef c6 05 5c 87 40 01 01
e8 64 93 fa ff 44 89 e9 48 89 c2 48 89 ee 48 c7 c7 80 28 68 83 e8 25 69
6d fe <0f> ff eb 17 41 ff c5 49 81 c4 40 01 00 00 44 3b 6c 24 04 0f 85
[25222.197511] ---[ end trace 4b04e9c6754a1cd5 ]---
and then
[25222.197853] ixgbe 0000:04:00.1 eth1: initiating reset due to tx
timeout
[25222.198194] ixgbe 0000:04:00.1 eth1: Reset adapter
[25227.805896] ixgbe 0000:04:00.1 eth1: initiating reset due to tx
timeout
[25232.925944] ixgbe 0000:04:00.1 eth1: initiating reset due to tx
timeout
[25236.084968] watchdog: BUG: soft lockup - CPU#0 stuck for 22s!
[accel-pppd:12627]
[25236.085562] Modules linked in: pppoe pppox ppp_generic slhc
netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre
nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4
xt_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4
xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net
ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
[25236.087496] CPU: 0 PID: 12627 Comm: accel-pppd Tainted: G B W
4.15.3-build-0134 #6
[25236.088095] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
04/02/2015
[25236.088430] RIP: 0010:queued_spin_lock_slowpath+0xb1/0x418
[25236.088759] RSP: 0018:ffff8803e6457a98 EFLAGS: 00000213 ORIG_RAX:
ffffffffffffff11
[25236.089353] RAX: 00000000000001fb RBX: ffff880345e75fe0 RCX:
ffffffff811aeca3
[25236.089685] RDX: 0000000000000000 RSI: 0000000000000004 RDI:
ffff880345e75fe0
[25236.090026] RBP: ffffed0068bcebfc R08: 06030a0001012180 R09:
ffffed006cc9beb2
[25236.090369] R10: ffffed006cc9beb3 R11: 0000000000000001 R12:
0000000000000003
[25236.090705] R13: 0000000000008021 R14: 0000000000008021 R15:
00000000034e4b06
[25236.091043] FS: 00007f94bd26c700(0000) GS:ffff8803f2200000(0000)
knlGS:0000000000000000
[25236.091636] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25236.091966] CR2: 00007ffc0935eff8 CR3: 00000003d709b003 CR4:
00000000001606f0
[25236.092304] Call Trace:
[25236.092638] ppp_push+0x112/0xdda [ppp_generic]
[25236.092975] ? enqueue_hrtimer+0xce/0xd8
[25236.093304] ? hrtimer_start_range_ns+0x827/0x854
[25236.093635] __ppp_xmit_process+0xc6a/0xdd5 [ppp_generic]
[25236.093969] ? __kmalloc_reserve.isra.5+0x29/0x96
[25236.094302] ? memset+0x1f/0x31
[25236.094631] ? ppp_receive_nonmp_frame+0x138c/0x138c [ppp_generic]
[25236.094962] ? __alloc_skb+0x2ec/0x431
[25236.095292] ? __kmalloc_reserve.isra.5+0x96/0x96
[25236.095620] ? timerfd_release+0x1d3/0x1d3
[25236.095950] ppp_xmit_process+0xc3/0x194 [ppp_generic]
[25236.096284] ppp_write+0x1b7/0x1c3 [ppp_generic]
[25236.096617] __vfs_write+0xd9/0x4ad
[25236.096953] ? kernel_read+0xed/0xed
[25236.097283] ? vfs_copy_file_range+0x6a8/0x6a8
[25236.097614] ? bit_waitqueue+0x2a/0x2a
[25236.097946] ? __fsnotify_inode_delete+0xc/0xc
[25236.098276] ? __fsnotify_inode_delete+0xc/0xc
[25236.098610] ? SyS_sendmmsg+0x13/0x13
[25236.098936] vfs_write+0x18c/0x378
[25236.099258] SyS_write+0xc4/0x13b
[25236.099579] ? SyS_read+0x13b/0x13b
[25236.099902] ? exit_to_usermode_loop+0x7c/0xaf
[25236.100225] ? SyS_read+0x13b/0x13b
[25236.100550] do_syscall_64+0x1b1/0x31f
[25236.100879] entry_SYSCALL_64_after_hwframe+0x21/0x86
[25236.101210] RIP: 0033:0x7f94bca53b2d
[25236.101536] RSP: 002b:00007f94bd26bb80 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[25236.102127] RAX: ffffffffffffffda RBX: 00007f94bb59f1e3 RCX:
00007f94bca53b2d
[25236.102461] RDX: 000000000000000c RSI: 00007f94b78895d0 RDI:
0000000000002f92
[25236.102793] RBP: 00007f94bd26bbb0 R08: 0000000000000030 R09:
0000000000000027
[25236.103127] R10: 0000000000000000 R11: 0000000000000293 R12:
00007f94b6450eb8
[25236.103460] R13: 00007ffc8c047a6f R14: 0000000000000000 R15:
00007f94bd26c700
[25236.103790] Code: 83 03 00 00 48 89 dd 49 89 dc 48 b8 00 00 00 00 00
fc ff df 48 c1 ed 03 41 83 e4 07 48 01 c5 41 83 c4 03 8a 45 00 41 38 c4
7c 0c <84> c0 74 08 48 89 df e8 31 54 17 00 8b 03 84 c0 74 04 f3 90 eb
Then the system rebooted automatically.
Maybe I am hitting some qdisc bug now...
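One detail that may be worth checking in the two traces above is the
taint string ("Tainted: G B W"). The sketch below decodes it. The flag
table is a partial one written from memory of the kernel's
tainted-kernels documentation, and decode_taint is a hypothetical helper
name, so treat both as assumptions.

```python
# Partial table of taint letters (assumed, from memory of the kernel docs).
TAINT_FLAGS = {
    "G": "no proprietary modules loaded",
    "P": "proprietary module loaded",
    "B": "bad page reference or unexpected page flags",
    "W": "a kernel warning was issued earlier",
    "D": "the kernel has oopsed or died before",
    "L": "a soft lockup occurred earlier",
    "O": "out-of-tree module loaded",
    "E": "unsigned module loaded",
}


def decode_taint(line: str) -> list:
    """Return the taint letters found after 'Tainted:' in a log line."""
    marker = "Tainted:"
    if marker not in line:
        return []  # covers "Not tainted" lines
    flags = []
    for token in line.split(marker, 1)[1].split():
        # Stop at the first token that is not made of known flag letters,
        # which in these logs is the kernel version string.
        if not all(ch in TAINT_FLAGS for ch in token):
            break
        flags.extend(token)
    return flags


line = ("[25236.087496] CPU: 0 PID: 12627 Comm: accel-pppd "
        "Tainted: G B W 4.15.3-build-0134 #6")
for flag in decode_taint(line):
    print(flag, "=", TAINT_FLAGS[flag])
```

As far as I know, a KASAN report also sets the B flag on kernels of this
era, so the B here may simply mean that a memory error had already been
reported earlier in the same boot, before the watchdog and soft lockup
messages.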
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-16 18:48 ` Guillaume Nault
2018-02-18 10:01 ` Denys Fedoryshchenko
@ 2018-02-20 9:05 ` Denys Fedoryshchenko
2018-02-21 10:26 ` Denys Fedoryshchenko
2 siblings, 0 replies; 37+ messages in thread
From: Denys Fedoryshchenko @ 2018-02-20 9:05 UTC (permalink / raw)
To: Guillaume Nault; +Cc: Linux Kernel Network Developers
Ok, more updates:
I updated 2 other production servers (different location) with a similar
workload and higher load, but slightly different configuration and
different hardware. 4.14.20 has been working fine there for almost 24h.
So even though they agreed to take part in this bug hunt, I can't easily
reproduce the bug on those servers, whereas on the first server I can
trigger the bug even on 4.14.20, but it doesn't happen on 4.14.2.
But there are several other changes compared to the old kernel build,
mostly because of Meltdown/Spectre: I had to update to a newer gcc and do
some userspace updates. So I am trying to roll back slowly to the same
conditions as the old stable build, to find out what is triggering the bug.
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-16 18:48 ` Guillaume Nault
2018-02-18 10:01 ` Denys Fedoryshchenko
2018-02-20 9:05 ` Denys Fedoryshchenko
@ 2018-02-21 10:26 ` Denys Fedoryshchenko
2018-02-21 18:55 ` Guillaume Nault
2 siblings, 1 reply; 37+ messages in thread
From: Denys Fedoryshchenko @ 2018-02-21 10:26 UTC (permalink / raw)
To: Guillaume Nault; +Cc: Linux Kernel Network Developers
It seems that even rebuilding a seemingly stable version triggers crashes
too (but different ones).
Maybe it is a coincidence, and a bug reproducer appeared in the network at
the same time I decided to upgrade the kernel, as happened with xt_MSS
(and that bug had existed for years).
I deleted the quoting and added more debug options (as many as the
performance degradation allows).
This is vanilla again:
[14834.090421]
==================================================================
[14834.091157] BUG: KASAN: use-after-free in __list_add_valid+0x69/0xad
[14834.091521] Read of size 8 at addr ffff8803dbeb8660 by task
accel-pppd/12636
[14834.091905]
[14834.092282] CPU: 0 PID: 12636 Comm: accel-pppd Not tainted
4.15.4-build-0134 #1
[14834.092930] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
04/02/2015
[14834.093320] Call Trace:
[14834.093680] dump_stack+0xb3/0x13e
[14834.094050] ? _atomic_dec_and_lock+0x10f/0x10f
[14834.094434] print_address_description+0x69/0x236
[14834.094814] ? __list_add_valid+0x69/0xad
[14834.095197] kasan_report+0x219/0x23f
[14834.095570] __list_add_valid+0x69/0xad
[14834.095957] ppp_ioctl+0x1216/0x2201 [ppp_generic]
[14834.096348] ? ppp_write+0x1cc/0x1cc [ppp_generic]
[14834.096723] ? get_usage_char.isra.2+0x36/0x36
[14834.097094] ? packet_poll+0x362/0x362
[14834.097455] ? lock_downgrade+0x4d0/0x4d0
[14834.097811] ? rcu_irq_enter_disabled+0x8/0x8
[14834.098187] ? get_usage_char.isra.2+0x36/0x36
[14834.098561] ? __fget+0x3b8/0x3eb
[14834.098936] ? get_usage_char.isra.2+0x36/0x36
[14834.099309] ? __fget+0x3a0/0x3eb
[14834.099682] ? get_usage_char.isra.2+0x36/0x36
[14834.100069] ? __fget+0x3a0/0x3eb
[14834.100443] ? lock_downgrade+0x4d0/0x4d0
[14834.100814] ? rcu_irq_enter_disabled+0x8/0x8
[14834.101203] ? __fget+0x3b8/0x3eb
[14834.101581] ? expand_files+0x62f/0x62f
[14834.101945] ? kernel_read+0xed/0xed
[14834.102322] ? SyS_getpeername+0x28b/0x28b
[14834.102690] vfs_ioctl+0x6e/0x81
[14834.103049] do_vfs_ioctl+0xe2f/0xe62
[14834.103413] ? ioctl_preallocate+0x211/0x211
[14834.103778] ? __fget_light+0x28c/0x2ca
[14834.104150] ? iterate_fd+0x2a8/0x2a8
[14834.104526] ? SyS_rt_sigprocmask+0x12e/0x181
[14834.104876] ? sigprocmask+0x23f/0x23f
[14834.105231] ? SyS_write+0x148/0x173
[14834.105580] ? SyS_read+0x173/0x173
[14834.105943] SyS_ioctl+0x39/0x55
[14834.106316] ? do_vfs_ioctl+0xe62/0xe62
[14834.106694] do_syscall_64+0x262/0x594
[14834.107076] ? syscall_return_slowpath+0x351/0x351
[14834.107447] ? up_read+0x17/0x2c
[14834.107806] ? __do_page_fault+0x68a/0x763
[14834.108171] ? entry_SYSCALL_64_after_hwframe+0x36/0x9b
[14834.108550] ? trace_hardirqs_off_thunk+0x1a/0x1c
[14834.108937] entry_SYSCALL_64_after_hwframe+0x26/0x9b
[14834.109293] RIP: 0033:0x7fc9be3758a7
[14834.109652] RSP: 002b:00007fc9bf92aaf8 EFLAGS: 00000206 ORIG_RAX:
0000000000000010
[14834.110313] RAX: ffffffffffffffda RBX: 00007fc9bdc5e1e3 RCX:
00007fc9be3758a7
[14834.110707] RDX: 00007fc9b7ad13e8 RSI: 000000004004743a RDI:
0000000000004b9f
[14834.111082] RBP: 00007fc9bf92ab20 R08: 0000000000000000 R09:
000055f07a27fe40
[14834.111471] R10: 0000000000000008 R11: 0000000000000206 R12:
00007fc9b7ad12d8
[14834.111845] R13: 00007ffd06346a6f R14: 0000000000000000 R15:
00007fc9bf92b700
[14834.112231]
[14834.112589] Allocated by task 12636:
[14834.112962] ppp_register_net_channel+0xc4/0x610 [ppp_generic]
[14834.113331] pppoe_connect+0xe6d/0x1097 [pppoe]
[14834.113691] SyS_connect+0x19c/0x274
[14834.114054] do_syscall_64+0x262/0x594
[14834.114421] entry_SYSCALL_64_after_hwframe+0x26/0x9b
[14834.114792]
[14834.115139] Freed by task 12636:
[14834.115504] kfree+0xe2/0x154
[14834.115866] ppp_release+0x11b/0x12a [ppp_generic]
[14834.116240] __fput+0x342/0x5ba
[14834.116611] task_work_run+0x15d/0x198
[14834.116973] exit_to_usermode_loop+0xc7/0x153
[14834.117320] do_syscall_64+0x53d/0x594
[14834.117694] entry_SYSCALL_64_after_hwframe+0x26/0x9b
[14834.118067]
[14834.118426] The buggy address belongs to the object at
ffff8803dbeb8480
[14834.119087] The buggy address is located 480 bytes inside of
[14834.119755] The buggy address belongs to the page:
[14834.120138] page:ffffea000f6fae00 count:1 mapcount:0 mapping:
(null) index:0xffff8803dbebd580 compound_mapcount: 0
[14834.120817] flags: 0x17ffe00000008100(slab|head)
[14834.121171] raw: 17ffe00000008100 0000000000000000 ffff8803dbebd580
00000001001c001b
[14834.121800] raw: ffffea000d718020 ffffea000d32d620 ffff8803f080ee80
0000000000000000
[14834.122415] page dumped because: kasan: bad access detected
[14834.122787]
[14834.123140] Memory state around the buggy address:
[14834.123503] ffff8803dbeb8500: fb fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb
[14834.124150] ffff8803dbeb8580: fb fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb
[14834.124806] >ffff8803dbeb8600: fb fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb
[14834.125467] ^
[14834.125848] ffff8803dbeb8680: fb fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb
[14834.126493] ffff8803dbeb8700: fb fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb
[14834.127143]
==================================================================
[14834.127768] Disabling lock debugging due to kernel taint
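A few numbers in the report above can be cross-checked by hand. The sketch below is only an illustration, not part of the original report: it recomputes the "480 bytes inside" offset, locates the faulting address in the shadow dump (assuming the usual KASAN granule of 8 bytes per shadow byte and 16 shadow bytes per dumped row), and splits the ioctl number found in RSI using the generic Linux ioctl bit layout. The helper names are hypothetical.

```python
# Values copied from the KASAN report above.
BUGGY_ADDR = 0xffff8803dbeb8660   # "Read of size 8 at addr ..."
OBJECT_ADDR = 0xffff8803dbeb8480  # "The buggy address belongs to the object at ..."
IOCTL_CMD = 0x4004743a            # RSI at syscall entry (second ioctl argument)


def offset_in_object(addr, obj):
    """Byte offset of the faulting address inside the freed object."""
    return addr - obj


def shadow_position(addr, granule=8, bytes_per_row=16):
    """Return (row start address, index of the shadow byte within that row).

    Assumes one shadow byte per 8 bytes of memory and 16 shadow bytes
    per dumped row, i.e. 128 bytes of memory per row.
    """
    row_span = granule * bytes_per_row
    row_start = addr - (addr % row_span)
    return row_start, (addr - row_start) // granule


def decode_ioctl(cmd):
    """Split an ioctl number using the asm-generic layout (nr, type, size, dir)."""
    return {
        "nr": cmd & 0xff,
        "type": chr((cmd >> 8) & 0xff),
        "size": (cmd >> 16) & 0x3fff,
        "dir": (cmd >> 30) & 0x3,  # 1 = write (user to kernel), 2 = read
    }


if __name__ == "__main__":
    print(offset_in_object(BUGGY_ADDR, OBJECT_ADDR))
    print([hex(v) for v in shadow_position(BUGGY_ADDR)])
    print(decode_ioctl(IOCTL_CMD))
```

The decoded ioctl is a write-direction command of type 't', number 58, with a 4-byte argument, which corresponds to `_IOW('t', 58, int)`, i.e. PPPIOCCONNECT in the PPP ioctl header. That would be consistent with the list insertion seen under ppp_ioctl, although the trace alone does not prove which code path was taken. The `fb` shadow bytes are the marker KASAN uses for freed kmalloc memory in kernels of this generation.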
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-18 10:01 ` Denys Fedoryshchenko
@ 2018-02-21 18:38 ` Guillaume Nault
0 siblings, 0 replies; 37+ messages in thread
From: Guillaume Nault @ 2018-02-21 18:38 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: Linux Kernel Network Developers
On Sun, Feb 18, 2018 at 12:01:02PM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-16 20:48, Guillaume Nault wrote:
> > On Fri, Feb 16, 2018 at 01:13:18PM +0200, Denys Fedoryshchenko wrote:
> > > As far as I can see, only KASAN triggered again (and the server
> > > rebooted shortly after that), but nothing else:
> > >
> > Ok, so no refcount failure detected. Not what I expected... but that's
> > still useful information. It's getting even harder to find a ppp scenario
> > that could lead to such symptoms.
> > If that's acceptable for you, you can try reverting the few commits
> > that entered after 4.14.
> >
> > 02612bb05e51df8489db5e94d0cf8d1c81f87b0c pppoe: take ->needed_headroom
> > of lower device into account on xmit
> > 0171c41835591e9aa2e384b703ef9a6ae367c610 ppp: unlock all_ppp_mutex
> > before registering device
> > e6675000f9a404f7651724c0b2e2e71f7247d3a1 ppp: exit_net cleanup checks
> > added
> > f02b2320b27c16b644691267ee3b5c110846f49e ppp: Destroy the mutex when
> > cleanup
> > 90e229ef61fad240554f5899eb122fbe44990f78 ppp: allow usage in namespaces
> > 709c89b45b874b2f81a074b8802a736009873f48 drivers, net, ppp: convert
> > syncppp.refcnt from atomic_t to refcount_t
> > d780cd44e3cea119a3346e6d7c04d35b9c50d54b drivers, net, ppp: convert
> > ppp_file.refcnt from atomic_t to refcount_t
> > 313a912155c78ed87ad6fca175dc56b75fd00a58 drivers, net, ppp: convert
> > asyncppp.refcnt from atomic_t to refcount_t
> >
> > Sorry, but I have nothing better to propose for now. At least that
> > should help narrowing the problem space.
> > I'm going to stress test ppp_generic and pppoe on my side.
> >
> Quick update.
> Testing with the first 5 patches reverted didn't change anything.
> But reverting more, with the last 4 patches as well (I did them all together),
> does change things. I probably need to repeat the test for one more night,
> reverting only the refcount_t patches.
>
So you got the following trace with all 8 patches reverted, right?
I prefer to concentrate on the other traces for now. If this one tends
to be reproducible, you can try to activate lockdep (for lack of a better
suggestion).
> [25222.173840] ------------[ cut here ]------------
> [25222.174259] NETDEV WATCHDOG: eth1 (ixgbe): transmit queue 3 timed out
> [25222.174618] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:323
> dev_watchdog+0x44a/0x555
> [25222.175212] Modules linked in: pppoe pppox ppp_generic slhc netconsole
> configfs coretemp nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp
> nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4 xt_REDIRECT
> nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4 xt_set
> xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net
> ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables
> x_tables 8021q garp mrp stp llc ixgbe dca
> [25222.177133] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G B W
> 4.15.3-build-0134 #6
> [25222.184121] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
> 04/02/2015
> [25222.184457] RIP: 0010:dev_watchdog+0x44a/0x555
> [25222.184791] RSP: 0018:ffff8803f22c7d98 EFLAGS: 00010292
> [25222.185127] RAX: 0000000000000000 RBX: ffff8803ded00438 RCX:
> 0000000000000000
> [25222.185463] RDX: 0000000000000001 RSI: 0000000000000002 RDI:
> ffffed007e458fa8
> [25222.185797] RBP: ffff8803ded00000 R08: 0000000000000001 R09:
> 0000000000000000
> [25222.186133] R10: ffff8803f22c7e30 R11: 0000000000000001 R12:
> ffff8803ded28450
> [25222.186471] R13: 0000000000000003 R14: dffffc0000000000 R15:
> ffff8803ded283c0
> [25222.186804] FS: 0000000000000000(0000) GS:ffff8803f22c0000(0000)
> knlGS:0000000000000000
> [25222.187401] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [25222.187739] CR2: 0000561f5bffc128 CR3: 0000000445a0d003 CR4:
> 00000000001606e0
> [25222.188077] Call Trace:
> [25222.188410] <IRQ>
> [25222.188740] ? dev_graft_qdisc+0xfa/0xfa
> [25222.189072] call_timer_fn+0x15/0x72
> [25222.189407] ? dev_graft_qdisc+0xfa/0xfa
> [25222.189741] expire_timers+0x1b9/0x1d5
> [25222.190072] run_timer_softirq+0x184/0x361
> [25222.190400] ? expire_timers+0x1d5/0x1d5
> [25222.190723] ? enqueue_hrtimer+0xce/0xd8
> [25222.191048] ? __hrtimer_run_queues+0x1ec/0x24d
> [25222.191373] __do_softirq+0x17f/0x34a
> [25222.191702] irq_exit+0x8f/0xf9
> [25222.192034] smp_apic_timer_interrupt+0xcb/0xd6
> [25222.192365] apic_timer_interrupt+0x92/0xa0
> [25222.192695] </IRQ>
> [25222.193023] RIP: 0010:mwait_idle+0x99/0xac
> [25222.193355] RSP: 0018:ffff8803f030fef8 EFLAGS: 00000246 ORIG_RAX:
> ffffffffffffff11
> [25222.193956] RAX: 0000000000000000 RBX: ffff8803f02e3500 RCX:
> 0000000000000000
> [25222.194290] RDX: 1ffff1007e05c6a0 RSI: 0000000000000000 RDI:
> 0000000000000000
> [25222.194626] RBP: ffff8803f02e3500 R08: ffffed007ccc8eef R09:
> ffff8803e6647728
> [25222.194958] R10: ffff8803f030fdd0 R11: 0000000000000001 R12:
> 0000000000000000
> [25222.195292] R13: dffffc0000000000 R14: ffffed007e05c6a0 R15:
> ffff8803f02e3500
> [25222.195627] do_idle+0xe6/0x19a
> [25222.195963] cpu_startup_entry+0x18/0x1a
> [25222.196295] secondary_startup_64+0xa5/0xb0
> [25222.196625] Code: 68 87 40 01 00 75 3f 48 89 ef c6 05 5c 87 40 01 01 e8
> 64 93 fa ff 44 89 e9 48 89 c2 48 89 ee 48 c7 c7 80 28 68 83 e8 25 69 6d fe
> <0f> ff eb 17 41 ff c5 49 81 c4 40 01 00 00 44 3b 6c 24 04 0f 85
> [25222.197511] ---[ end trace 4b04e9c6754a1cd5 ]---
>
> and then
>
> [25222.197853] ixgbe 0000:04:00.1 eth1: initiating reset due to tx timeout
> [25222.198194] ixgbe 0000:04:00.1 eth1: Reset adapter
> [25227.805896] ixgbe 0000:04:00.1 eth1: initiating reset due to tx timeout
> [25232.925944] ixgbe 0000:04:00.1 eth1: initiating reset due to tx timeout
> [25236.084968] watchdog: BUG: soft lockup - CPU#0 stuck for 22s!
> [accel-pppd:12627]
> [25236.085562] Modules linked in: pppoe pppox ppp_generic slhc netconsole
> configfs coretemp nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp
> nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4 xt_REDIRECT
> nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4 xt_set
> xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net
> ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables
> x_tables 8021q garp mrp stp llc ixgbe dca
> [25236.087496] CPU: 0 PID: 12627 Comm: accel-pppd Tainted: G B W
> 4.15.3-build-0134 #6
> [25236.088095] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
> 04/02/2015
> [25236.088430] RIP: 0010:queued_spin_lock_slowpath+0xb1/0x418
> [25236.088759] RSP: 0018:ffff8803e6457a98 EFLAGS: 00000213 ORIG_RAX:
> ffffffffffffff11
> [25236.089353] RAX: 00000000000001fb RBX: ffff880345e75fe0 RCX:
> ffffffff811aeca3
> [25236.089685] RDX: 0000000000000000 RSI: 0000000000000004 RDI:
> ffff880345e75fe0
> [25236.090026] RBP: ffffed0068bcebfc R08: 06030a0001012180 R09:
> ffffed006cc9beb2
> [25236.090369] R10: ffffed006cc9beb3 R11: 0000000000000001 R12:
> 0000000000000003
> [25236.090705] R13: 0000000000008021 R14: 0000000000008021 R15:
> 00000000034e4b06
> [25236.091043] FS: 00007f94bd26c700(0000) GS:ffff8803f2200000(0000)
> knlGS:0000000000000000
> [25236.091636] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [25236.091966] CR2: 00007ffc0935eff8 CR3: 00000003d709b003 CR4:
> 00000000001606f0
> [25236.092304] Call Trace:
> [25236.092638] ppp_push+0x112/0xdda [ppp_generic]
> [25236.092975] ? enqueue_hrtimer+0xce/0xd8
> [25236.093304] ? hrtimer_start_range_ns+0x827/0x854
> [25236.093635] __ppp_xmit_process+0xc6a/0xdd5 [ppp_generic]
> [25236.093969] ? __kmalloc_reserve.isra.5+0x29/0x96
> [25236.094302] ? memset+0x1f/0x31
> [25236.094631] ? ppp_receive_nonmp_frame+0x138c/0x138c [ppp_generic]
> [25236.094962] ? __alloc_skb+0x2ec/0x431
> [25236.095292] ? __kmalloc_reserve.isra.5+0x96/0x96
> [25236.095620] ? timerfd_release+0x1d3/0x1d3
> [25236.095950] ppp_xmit_process+0xc3/0x194 [ppp_generic]
> [25236.096284] ppp_write+0x1b7/0x1c3 [ppp_generic]
> [25236.096617] __vfs_write+0xd9/0x4ad
> [25236.096953] ? kernel_read+0xed/0xed
> [25236.097283] ? vfs_copy_file_range+0x6a8/0x6a8
> [25236.097614] ? bit_waitqueue+0x2a/0x2a
> [25236.097946] ? __fsnotify_inode_delete+0xc/0xc
> [25236.098276] ? __fsnotify_inode_delete+0xc/0xc
> [25236.098610] ? SyS_sendmmsg+0x13/0x13
> [25236.098936] vfs_write+0x18c/0x378
> [25236.099258] SyS_write+0xc4/0x13b
> [25236.099579] ? SyS_read+0x13b/0x13b
> [25236.099902] ? exit_to_usermode_loop+0x7c/0xaf
> [25236.100225] ? SyS_read+0x13b/0x13b
> [25236.100550] do_syscall_64+0x1b1/0x31f
> [25236.100879] entry_SYSCALL_64_after_hwframe+0x21/0x86
> [25236.101210] RIP: 0033:0x7f94bca53b2d
> [25236.101536] RSP: 002b:00007f94bd26bb80 EFLAGS: 00000293 ORIG_RAX:
> 0000000000000001
> [25236.102127] RAX: ffffffffffffffda RBX: 00007f94bb59f1e3 RCX:
> 00007f94bca53b2d
> [25236.102461] RDX: 000000000000000c RSI: 00007f94b78895d0 RDI:
> 0000000000002f92
> [25236.102793] RBP: 00007f94bd26bbb0 R08: 0000000000000030 R09:
> 0000000000000027
> [25236.103127] R10: 0000000000000000 R11: 0000000000000293 R12:
> 00007f94b6450eb8
> [25236.103460] R13: 00007ffc8c047a6f R14: 0000000000000000 R15:
> 00007f94bd26c700
> [25236.103790] Code: 83 03 00 00 48 89 dd 49 89 dc 48 b8 00 00 00 00 00 fc
> ff df 48 c1 ed 03 41 83 e4 07 48 01 c5 41 83 c4 03 8a 45 00 41 38 c4 7c 0c
> <84> c0 74 08 48 89 df e8 31 54 17 00 8b 03 84 c0 74 04 f3 90 eb
>
> Then the system rebooted automatically.
> Maybe I am hitting some qdisc bug now...
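The timestamps in the two quoted traces allow a rough ordering check. The sketch below is an illustration only: it subtracts the "stuck for 22s" figure from the soft-lockup timestamp and compares the result with the NETDEV WATCHDOG timestamp. The two log lines are copied from the traces above; the helper names are hypothetical, and the reported duration is the watchdog's own approximation, so the result is an estimate rather than a measurement.

```python
import re

# Lines copied from the quoted traces (the wrapped task name is omitted).
WATCHDOG = "[25222.174259] NETDEV WATCHDOG: eth1 (ixgbe): transmit queue 3 timed out"
LOCKUP = "[25236.084968] watchdog: BUG: soft lockup - CPU#0 stuck for 22s!"

TIMESTAMP_RE = re.compile(r"\[\s*(\d+\.\d+)\]")
STUCK_RE = re.compile(r"stuck for (\d+)s")


def timestamp(line):
    """Extract the kernel log timestamp (seconds since boot) from a log line."""
    match = TIMESTAMP_RE.search(line)
    if match is None:
        raise ValueError("no timestamp in line: %r" % line)
    return float(match.group(1))


def lockup_onset(lockup_line):
    """Estimate when the CPU stopped making progress: report time minus stuck time."""
    match = STUCK_RE.search(lockup_line)
    if match is None:
        raise ValueError("no 'stuck for Ns' in line: %r" % lockup_line)
    return timestamp(lockup_line) - int(match.group(1))


def onset_precedes(lockup_line, other_line):
    """True if the estimated lockup onset is earlier than the other event."""
    return lockup_onset(lockup_line) < timestamp(other_line)


if __name__ == "__main__":
    print(round(lockup_onset(LOCKUP), 6))
    print(onset_precedes(LOCKUP, WATCHDOG))
```

Under these assumptions the CPU spinning in ppp_push would have started roughly 8 seconds before the transmit queue timeout was reported. That ordering is only suggestive: it would fit a scenario where the tx timeout is a consequence of the stuck CPU rather than a separate qdisc problem, but it does not rule the latter out.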
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-21 10:26 ` Denys Fedoryshchenko
@ 2018-02-21 18:55 ` Guillaume Nault
2018-02-21 19:30 ` Denys Fedoryshchenko
0 siblings, 1 reply; 37+ messages in thread
From: Guillaume Nault @ 2018-02-21 18:55 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: Linux Kernel Network Developers
On Wed, Feb 21, 2018 at 12:26:51PM +0200, Denys Fedoryshchenko wrote:
> It seems that even rebuilding a seemingly stable version triggers crashes too
> (but different ones)
Different ones? The trace following your message looks very similar to
your first KASAN report. Or are you referring to the lockup you posted
on Sun, 18 Feb 2018?
Also, which stable versions are you referring to?
I'm interested in the ppp_generic.o file that produced the following
trace. Just to be sure that the differences come from the new debugging
options.
> Maybe it is a coincidence, and a bug reproducer appeared in the network at the
> same time I decided to upgrade the kernel,
> as happened with xt_MSS (and that bug had existed for years).
>
> Quoting deleted. I added more debug options (as many as the performance
> degradation allows).
> This is a vanilla kernel again:
>
> [14834.090421]
> ==================================================================
> [14834.091157] BUG: KASAN: use-after-free in __list_add_valid+0x69/0xad
> [14834.091521] Read of size 8 at addr ffff8803dbeb8660 by task
> accel-pppd/12636
> [14834.091905]
> [14834.092282] CPU: 0 PID: 12636 Comm: accel-pppd Not tainted
> 4.15.4-build-0134 #1
> [14834.092930] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
> 04/02/2015
> [14834.093320] Call Trace:
> [14834.093680] dump_stack+0xb3/0x13e
> [14834.094050] ? _atomic_dec_and_lock+0x10f/0x10f
> [14834.094434] print_address_description+0x69/0x236
> [14834.094814] ? __list_add_valid+0x69/0xad
> [14834.095197] kasan_report+0x219/0x23f
> [14834.095570] __list_add_valid+0x69/0xad
> [14834.095957] ppp_ioctl+0x1216/0x2201 [ppp_generic]
> [14834.096348] ? ppp_write+0x1cc/0x1cc [ppp_generic]
> [14834.096723] ? get_usage_char.isra.2+0x36/0x36
> [14834.097094] ? packet_poll+0x362/0x362
> [14834.097455] ? lock_downgrade+0x4d0/0x4d0
> [14834.097811] ? rcu_irq_enter_disabled+0x8/0x8
> [14834.098187] ? get_usage_char.isra.2+0x36/0x36
> [14834.098561] ? __fget+0x3b8/0x3eb
> [14834.098936] ? get_usage_char.isra.2+0x36/0x36
> [14834.099309] ? __fget+0x3a0/0x3eb
> [14834.099682] ? get_usage_char.isra.2+0x36/0x36
> [14834.100069] ? __fget+0x3a0/0x3eb
> [14834.100443] ? lock_downgrade+0x4d0/0x4d0
> [14834.100814] ? rcu_irq_enter_disabled+0x8/0x8
> [14834.101203] ? __fget+0x3b8/0x3eb
> [14834.101581] ? expand_files+0x62f/0x62f
> [14834.101945] ? kernel_read+0xed/0xed
> [14834.102322] ? SyS_getpeername+0x28b/0x28b
> [14834.102690] vfs_ioctl+0x6e/0x81
> [14834.103049] do_vfs_ioctl+0xe2f/0xe62
> [14834.103413] ? ioctl_preallocate+0x211/0x211
> [14834.103778] ? __fget_light+0x28c/0x2ca
> [14834.104150] ? iterate_fd+0x2a8/0x2a8
> [14834.104526] ? SyS_rt_sigprocmask+0x12e/0x181
> [14834.104876] ? sigprocmask+0x23f/0x23f
> [14834.105231] ? SyS_write+0x148/0x173
> [14834.105580] ? SyS_read+0x173/0x173
> [14834.105943] SyS_ioctl+0x39/0x55
> [14834.106316] ? do_vfs_ioctl+0xe62/0xe62
> [14834.106694] do_syscall_64+0x262/0x594
> [14834.107076] ? syscall_return_slowpath+0x351/0x351
> [14834.107447] ? up_read+0x17/0x2c
> [14834.107806] ? __do_page_fault+0x68a/0x763
> [14834.108171] ? entry_SYSCALL_64_after_hwframe+0x36/0x9b
> [14834.108550] ? trace_hardirqs_off_thunk+0x1a/0x1c
> [14834.108937] entry_SYSCALL_64_after_hwframe+0x26/0x9b
> [14834.109293] RIP: 0033:0x7fc9be3758a7
> [14834.109652] RSP: 002b:00007fc9bf92aaf8 EFLAGS: 00000206 ORIG_RAX:
> 0000000000000010
> [14834.110313] RAX: ffffffffffffffda RBX: 00007fc9bdc5e1e3 RCX:
> 00007fc9be3758a7
> [14834.110707] RDX: 00007fc9b7ad13e8 RSI: 000000004004743a RDI:
> 0000000000004b9f
> [14834.111082] RBP: 00007fc9bf92ab20 R08: 0000000000000000 R09:
> 000055f07a27fe40
> [14834.111471] R10: 0000000000000008 R11: 0000000000000206 R12:
> 00007fc9b7ad12d8
> [14834.111845] R13: 00007ffd06346a6f R14: 0000000000000000 R15:
> 00007fc9bf92b700
> [14834.112231]
> [14834.112589] Allocated by task 12636:
> [14834.112962] ppp_register_net_channel+0xc4/0x610 [ppp_generic]
> [14834.113331] pppoe_connect+0xe6d/0x1097 [pppoe]
> [14834.113691] SyS_connect+0x19c/0x274
> [14834.114054] do_syscall_64+0x262/0x594
> [14834.114421] entry_SYSCALL_64_after_hwframe+0x26/0x9b
> [14834.114792]
> [14834.115139] Freed by task 12636:
> [14834.115504] kfree+0xe2/0x154
> [14834.115866] ppp_release+0x11b/0x12a [ppp_generic]
> [14834.116240] __fput+0x342/0x5ba
> [14834.116611] task_work_run+0x15d/0x198
> [14834.116973] exit_to_usermode_loop+0xc7/0x153
> [14834.117320] do_syscall_64+0x53d/0x594
> [14834.117694] entry_SYSCALL_64_after_hwframe+0x26/0x9b
> [14834.118067]
> [14834.118426] The buggy address belongs to the object at ffff8803dbeb8480
> [14834.119087] The buggy address is located 480 bytes inside of
> [14834.119755] The buggy address belongs to the page:
> [14834.120138] page:ffffea000f6fae00 count:1 mapcount:0 mapping:
> (null) index:0xffff8803dbebd580 compound_mapcount: 0
> [14834.120817] flags: 0x17ffe00000008100(slab|head)
> [14834.121171] raw: 17ffe00000008100 0000000000000000 ffff8803dbebd580
> 00000001001c001b
> [14834.121800] raw: ffffea000d718020 ffffea000d32d620 ffff8803f080ee80
> 0000000000000000
> [14834.122415] page dumped because: kasan: bad access detected
> [14834.122787]
> [14834.123140] Memory state around the buggy address:
> [14834.123503] ffff8803dbeb8500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> fb fb
> [14834.124150] ffff8803dbeb8580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> fb fb
> [14834.124806] >ffff8803dbeb8600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> fb fb
> [14834.125467] ^
> [14834.125848] ffff8803dbeb8680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> fb fb
> [14834.126493] ffff8803dbeb8700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> fb fb
> [14834.127143]
> ==================================================================
> [14834.127768] Disabling lock debugging due to kernel taint
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-21 18:55 ` Guillaume Nault
@ 2018-02-21 19:30 ` Denys Fedoryshchenko
0 siblings, 0 replies; 37+ messages in thread
From: Denys Fedoryshchenko @ 2018-02-21 19:30 UTC (permalink / raw)
To: Guillaume Nault; +Cc: Linux Kernel Network Developers
On 2018-02-21 20:55, Guillaume Nault wrote:
> On Wed, Feb 21, 2018 at 12:26:51PM +0200, Denys Fedoryshchenko wrote:
>> It seems that even rebuilding a seemingly stable version triggers crashes too
>> (but different ones)
> Different ones? The trace following your message looks very similar to
> your first KASAN report. Or are you referring to the lockup you posted
> on Sun, 18 Feb 2018?
>
> Also, which stable versions are you referring to?
The trace I sent in the previous email is from the latest kernel, vanilla,
just with more debug options enabled and a few options disabled.
One of the disabled options, CONFIG_XFRM, was producing errors in
nf_xfrm_me_harder (obviously a bug; I reported it).
I also disabled namespaces, as they are often a source of trouble.
Today I will try to revert just:
drivers, net, ppp: convert asyncppp.refcnt from atomic_t to refcount_t
drivers, net, ppp: convert syncppp.refcnt from atomic_t to refcount_t
drivers, net, ppp: convert ppp_file.refcnt from atomic_t to refcount_t
I suspect that previously, after reverting these patches, I got a
different kernel panic (I didn't notice it at the time, and now it is too
late to tell it apart from the other crashes); it seems it was not KASAN.
I will report results after testing. Unfortunately I can't test more
than once per day.
"Stable" for me was 4.14.2, but it looks like I am now getting a
different issue on that kernel.
I will paste it below.
Another observation: just an hour ago, on another server where I am
testing 4.15 and 4.14.20 (it was running 4.14.20 at that moment, without
debug options), I killed accel-pppd (the PPPoE server software) with 8k
sessions online and got some weird behaviour. The accel-pppd process got
stuck, as did ifconfig and "ip link", and even kexec -e didn't work (it got
stuck too) until I ran kexec -e -x
(so that it doesn't try to bring the interfaces down on kexec).
I will try to reproduce this bug as well, with debug enabled (lockdep
and so on). I hope it is not related.
>
> I'm interested in the ppp_generic.o file that produced the following
> trace. Just to be sure that the differences come from the new debugging
> options.
Here it is, along with the kernel config:
https://nuclearcat.com/bughunting/config.txt
https://nuclearcat.com/bughunting/ppp_generic.o
This is on 4.14.2, which seemed stable before:
[50401.388670] NETDEV WATCHDOG: eth1 (ixgbe): transmit queue 1 timed out
[50401.389014] ------------[ cut here ]------------
[50401.389340] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:320
dev_watchdog+0x15c/0x1b9
[50401.389925] Modules linked in: pppoe pppox ppp_generic slhc
netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre
nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4
xt_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4
xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net
ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
[50401.391869] CPU: 3 PID: 0 Comm: swapper/3 Not tainted
4.14.2-build-0134 #4
[50401.392191] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
04/02/2015
[50401.392513] task: ffff880434d72640 task.stack: ffffc90001914000
[50401.392836] RIP: 0010:dev_watchdog+0x15c/0x1b9
[50401.393155] RSP: 0018:ffff8804364c3e90 EFLAGS: 00010286
[50401.393470] RAX: 0000000000000039 RBX: ffff88042f6e0000 RCX:
0000000000000000
[50401.393787] RDX: 0000000000000001 RSI: 0000000000000002 RDI:
ffffffff828dbc64
[50401.394103] RBP: ffff8804364c3eb0 R08: 0000000000000001 R09:
0000000000000000
[50401.394420] R10: 0000000000000002 R11: ffff8803fa075c00 R12:
0000000000000001
[50401.394739] R13: 0000000000000040 R14: 0000000000000003 R15:
ffffffff81e05108
[50401.395064] FS: 0000000000000000(0000) GS:ffff8804364c0000(0000)
knlGS:0000000000000000
[50401.395645] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50401.395970] CR2: 00007fff25fc20a8 CR3: 0000000001e09005 CR4:
00000000001606e0
[50401.396294] Call Trace:
[50401.396613] <IRQ>
[50401.396934] ? qdisc_rcu_free+0x3f/0x3f
[50401.397255] call_timer_fn.isra.4+0x17/0x7b
[50401.397576] expire_timers+0x6f/0x7e
[50401.397899] run_timer_softirq+0x6d/0x8f
[50401.398219] ? ktime_get+0x3b/0x8c
[50401.398540] ? lapic_next_event+0x18/0x1c
[50401.398862] ? clockevents_program_event+0xa3/0xbb
[50401.399186] __do_softirq+0xbc/0x1ab
[50401.399510] irq_exit+0x4d/0x8e
[50401.399832] smp_apic_timer_interrupt+0x73/0x80
[50401.400157] apic_timer_interrupt+0x8d/0xa0
[50401.400480] </IRQ>
[50401.400801] RIP: 0010:mwait_idle+0x4e/0x61
[50401.401123] RSP: 0018:ffffc90001917ec0 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff10
[50401.401714] RAX: 0000000000000000 RBX: ffff880434d72640 RCX:
0000000000000000
[50401.402037] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
0000000000000000
[50401.402362] RBP: ffffc90001917ec0 R08: 0000000000000000 R09:
0000000000000001
[50401.402685] R10: ffffc90001917e58 R11: 000000000000037a R12:
0000000000000000
[50401.403008] R13: 0000000000000000 R14: ffff880434d72640 R15:
ffff880434d72640
[50401.403336] arch_cpu_idle+0xa/0xc
[50401.403660] default_idle_call+0x20/0x22
[50401.403983] do_idle+0xb3/0x13b
[50401.404303] cpu_startup_entry+0x1a/0x1c
[50401.404627] start_secondary+0x102/0x105
[50401.404949] secondary_startup_64+0xa5/0xa5
[50401.405271] Code: d9 54 58 00 00 75 38 48 89 df c6 05 cd 54 58 00 01
e8 dc 24 fe ff 44 89 e1 48 89 c2 48 89 de 48 c7 c7 51 df d6 81 e8 db f6
7c ff <0f> ff eb 10 41 ff c4 48 05 40 01 00 00 41 39 cc 75 9a eb 0d 48
[50401.406154] ---[ end trace 433f9a49b32a3b4f ]---
[50401.406489] ixgbe 0000:04:00.1 eth1: initiating reset due to tx
timeout
[50401.406828] ixgbe 0000:04:00.1 eth1: Reset adapter
[50416.686709] watchdog: BUG: soft lockup - CPU#5 stuck for 23s!
[tc:24389]
[50416.687042] Modules linked in: pppoe pppox ppp_generic slhc
netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre
nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4
xt_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4
xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net
ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
[50416.687707] watchdog: BUG: soft lockup - CPU#7 stuck for 23s!
[accel-pppd:24377]
[50416.687708] Modules linked in: pppoe pppox ppp_generic slhc
netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre
nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4
xt_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4
xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net
ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
[50416.687720] CPU: 7 PID: 24377 Comm: accel-pppd Tainted: G W
4.14.2-build-0134 #4
[50416.687721] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
04/02/2015
[50416.687722] task: ffff8803f278d940 task.stack: ffffc90011a38000
[50416.687727] RIP: 0010:queued_spin_lock_slowpath+0x125/0x16e
[50416.687728] RSP: 0018:ffffc90011a3bd20 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff10
[50416.687729] RAX: 0000000000000000 RBX: ffff8804035abfb8 RCX:
ffff8804365d9a40
[50416.687730] RDX: ffff880436419a60 RSI: 0000000000200000 RDI:
ffff8804035abfa0
[50416.687730] RBP: ffffc90011a3bd20 R08: 0000000000000001 R09:
00000000ffffffff
[50416.687731] R10: ffffc90011a3bda8 R11: ffff8803f278d940 R12:
ffff88041a5fde00
[50416.687731] R13: 0000000000008021 R14: ffff8803fc5778c0 R15:
ffff8803fc577840
[50416.687732] FS: 00007f7b5726e700(0000) GS:ffff8804365c0000(0000)
knlGS:0000000000000000
[50416.687733] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50416.687733] CR2: 0000000001d5f408 CR3: 000000042bf7c006 CR4:
00000000001606e0
[50416.687734] Call Trace:
[50416.687738] _raw_spin_lock+0x1a/0x1c
[50416.687741] ppp_push+0x56/0x49b [ppp_generic]
[50416.687743] ? __kmalloc_node_track_caller+0xba/0xdf
[50416.687746] __ppp_xmit_process+0x38/0x4c9 [ppp_generic]
[50416.687747] ppp_xmit_process+0x39/0x8d [ppp_generic]
[50416.687749] ppp_write+0xbb/0xc7 [ppp_generic]
[50416.687750] __vfs_write+0x21/0x111
[50416.687752] ? handle_mm_fault+0xb1/0xc0
[50416.687753] vfs_write+0xcd/0x176
[50416.687754] SyS_write+0x49/0x83
[50416.687755] entry_SYSCALL_64_fastpath+0x17/0x98
[50416.687756] RIP: 0033:0x7f7b5ba4e14d
[50416.687756] RSP: 002b:00007f7b5726dbd0 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[50416.687757] RAX: ffffffffffffffda RBX: 00007f7b5244aae8 RCX:
00007f7b5ba4e14d
[50416.687758] RDX: 000000000000000c RSI: 00007f7b4c7e6180 RDI:
0000000000004e9e
[50416.687758] RBP: 00007f7b5726dba0 R08: 0000000000000000 R09:
0000000000633e00
[50416.687758] R10: 0000000000000000 R11: 0000000000000293 R12:
0000000000000000
[50416.687759] R13: 00007f7b5716c4ff R14: 0000000000000003 R15:
00007f7b5716c590
[50416.687759] Code: 89 c2 c1 e8 12 48 c1 ea 0c ff c8 83 e2 30 48 98 48
81 c2 40 9a 01 00 48 03 14 c5 a0 73 d7 81 48 89 0a 8b 41 08 85 c0 75 04
f3 90 <eb> f5 48 8b 11 48 85 d2 74 03 0f 18 0a 44 8b 07 66 45 85 c0 0f
[50416.687771] Kernel panic - not syncing: softlockup: hung tasks
[50416.687772] CPU: 7 PID: 24377 Comm: accel-pppd Tainted: G W
L 4.14.2-build-0134 #4
[50416.687773] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
04/02/2015
[50416.687773] Call Trace:
[50416.687774] <IRQ>
[50416.687776] dump_stack+0x4d/0x63
[50416.687778] panic+0xcd/0x215
[50416.687780] watchdog_timer_fn+0x1a4/0x1c6
[50416.687782] __hrtimer_run_queues+0x76/0xc0
[50416.687783] hrtimer_interrupt+0xa0/0x162
[50416.687785] smp_apic_timer_interrupt+0x6e/0x80
[50416.687786] apic_timer_interrupt+0x8d/0xa0
[50416.687786] </IRQ>
[50416.687787] RIP: 0010:queued_spin_lock_slowpath+0x125/0x16e
[50416.687788] RSP: 0018:ffffc90011a3bd20 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff10
[50416.687789] RAX: 0000000000000000 RBX: ffff8804035abfb8 RCX:
ffff8804365d9a40
[50416.687789] RDX: ffff880436419a60 RSI: 0000000000200000 RDI:
ffff8804035abfa0
[50416.687789] RBP: ffffc90011a3bd20 R08: 0000000000000001 R09:
00000000ffffffff
[50416.687790] R10: ffffc90011a3bda8 R11: ffff8803f278d940 R12:
ffff88041a5fde00
[50416.687790] R13: 0000000000008021 R14: ffff8803fc5778c0 R15:
ffff8803fc577840
[50416.687792] _raw_spin_lock+0x1a/0x1c
[50416.687793] ppp_push+0x56/0x49b [ppp_generic]
[50416.687794] ? __kmalloc_node_track_caller+0xba/0xdf
[50416.687795] __ppp_xmit_process+0x38/0x4c9 [ppp_generic]
[50416.687797] ppp_xmit_process+0x39/0x8d [ppp_generic]
[50416.687798] ppp_write+0xbb/0xc7 [ppp_generic]
[50416.687798] __vfs_write+0x21/0x111
[50416.687799] ? handle_mm_fault+0xb1/0xc0
[50416.687800] vfs_write+0xcd/0x176
[50416.687801] SyS_write+0x49/0x83
[50416.687802] entry_SYSCALL_64_fastpath+0x17/0x98
[50416.687802] RIP: 0033:0x7f7b5ba4e14d
[50416.687803] RSP: 002b:00007f7b5726dbd0 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[50416.687803] RAX: ffffffffffffffda RBX: 00007f7b5244aae8 RCX:
00007f7b5ba4e14d
[50416.687804] RDX: 000000000000000c RSI: 00007f7b4c7e6180 RDI:
0000000000004e9e
[50416.687804] RBP: 00007f7b5726dba0 R08: 0000000000000000 R09:
0000000000633e00
[50416.687805] R10: 0000000000000000 R11: 0000000000000293 R12:
0000000000000000
[50416.687805] R13: 00007f7b5716c4ff R14: 0000000000000003 R15:
00007f7b5716c590
[50416.723251] CPU: 5 PID: 24389 Comm: tc Tainted: G W L
4.14.2-build-0134 #4
[50416.723831] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
04/02/2015
[50416.724151] task: ffff8803f0f19980 task.stack: ffffc90011a70000
[50416.724477] RIP: 0010:queued_spin_lock_slowpath+0x54/0x16e
[50416.724799] RSP: 0018:ffffc90011a73940 EFLAGS: 00000202 ORIG_RAX:
ffffffffffffff10
[50416.725384] RAX: 0000000000000101 RBX: ffff8803fc577000 RCX:
0000000000000101
[50416.725709] RDX: 0000000000000100 RSI: 0000000000000001 RDI:
ffff8803fc5778c0
[50416.726034] RBP: ffffc90011a73940 R08: 0000000000000001 R09:
0000000000002e88
[50416.726355] R10: 0000000000000024 R11: 0000000000000000 R12:
ffff8803eea1d060
[50416.726678] R13: ffff8803fc5778c0 R14: ffffc90011a739f0 R15:
0000000000000000
[50416.727001] FS: 00007f72228f2700(0000) GS:ffff880436540000(0000)
knlGS:0000000000000000
[50416.727582] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50416.727904] CR2: 0000000001aea038 CR3: 00000003fb09a001 CR4:
00000000001606e0
[50416.728230] Call Trace:
[50416.728550] _raw_spin_lock_bh+0x25/0x27
[50416.728872] ppp_get_stats64+0x4c/0xaf [ppp_generic]
[50416.729195] dev_get_stats+0x33/0x84
[50416.729513] rtnl_fill_stats+0x41/0x121
[50416.729831] rtnl_fill_ifinfo+0x567/0xdf0
[50416.730148] rtnl_dump_ifinfo+0x232/0x2e0
[50416.730470] ? __alloc_pages_nodemask+0xd2/0xa99
[50416.730790] ? handle_mm_fault+0xb1/0xc0
[50416.731109] ? __do_page_fault+0x31c/0x3b6
[50416.731431] netlink_dump+0xfc/0x271
[50416.731749] netlink_recvmsg+0x1af/0x2d9
[50416.732068] sock_recvmsg+0x14/0x16
[50416.732388] ___sys_recvmsg+0xea/0x1a1
[50416.732708] ? lru_deactivate_file_fn+0x29f/0x29f
[50416.733028] ? pagevec_lru_move_fn+0xad/0xc1
[50416.733348] ? alloc_pages_vma+0x152/0x19c
[50416.733669] ? page_add_new_anon_rmap+0xa0/0xa9
[50416.733988] ? __handle_mm_fault+0xda4/0xf40
[50416.734310] ? sock_sendmsg+0x12/0x1d
[50416.734628] __sys_recvmsg+0x3d/0x5e
[50416.734945] ? __sys_recvmsg+0x3d/0x5e
[50416.735264] SyS_recvmsg+0xd/0x17
[50416.735576] entry_SYSCALL_64_fastpath+0x17/0x98
[50416.735890] RIP: 0033:0x7f7221afdac0
[50416.736202] RSP: 002b:00007ffcbbc1f328 EFLAGS: 00000246 ORIG_RAX:
000000000000002f
[50416.736772] RAX: ffffffffffffffda RBX: 00007f7221da8b98 RCX:
00007f7221afdac0
[50416.737092] RDX: 0000000000000000 RSI: 00007ffcbbc1f390 RDI:
0000000000000003
[50416.737411] RBP: 00007f7221da8b98 R08: 0000000000007d7c R09:
0000000000000000
[50416.737734] R10: 00007ffcbbc1f3d0 R11: 0000000000000246 R12:
000000000066a520
[50416.738055] R13: 0000000000000040 R14: 00007f7221da8b40 R15:
0000000000002710
[50416.738377] Code: ff ff 75 33 83 fe 01 89 ca 89 f0 41 0f 45 d0 f0 0f
b1 17 39 c6 74 04 89 c6 eb e1 ff ca 0f 84 20 01 00 00 8b 07 84 c0 74 04
f3 90 <eb> f6 66 c7 07 01 00 e9 0c 01 00 00 48 c7 c0 40 9a 01 00 65 48
Feb 21 00:04:31 10.0.252.9 [50417.724420] Shutting down cpus with NMI
Feb 21 00:04:31 10.0.252.9 [50417.724748] Kernel Offset: disabled
Feb 21 00:04:31 10.0.252.9 [50417.728891] Rebooting in 5 seconds..
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-15 19:31 ` Guillaume Nault
2018-02-15 19:34 ` Denys Fedoryshchenko
@ 2018-02-21 20:04 ` Cong Wang
2018-02-22 18:30 ` Guillaume Nault
1 sibling, 1 reply; 37+ messages in thread
From: Cong Wang @ 2018-02-21 20:04 UTC (permalink / raw)
To: Guillaume Nault; +Cc: Denys Fedoryshchenko, Linux Kernel Network Developers
On Thu, Feb 15, 2018 at 11:31 AM, Guillaume Nault <g.nault@alphalink.fr> wrote:
> On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
>> On 2018-02-15 17:55, Guillaume Nault wrote:
>> > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
>> > > Here we go:
>> > >
>> > > <srv> [24558.921549]
>> > > ==================================================================
>> > > <srv> [24558.922167] BUG: KASAN: use-after-free in
>> > > ppp_ioctl+0xa6a/0x1522
>> > > [ppp_generic]
>> > > <srv> [24558.922776] Write of size 8 at addr ffff8803d35bf3f8 by task
>> > > accel-pppd/12622
>> > > <srv> [24558.923113]
>> > > <srv> [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
>> > > W
>> > > 4.15.3-build-0134 #1
>> > > <srv> [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
>> > > BIOS P80
>> > > 04/02/2015
>> > > <srv> [24558.924406] Call Trace:
>> > > <srv> [24558.924753] dump_stack+0x46/0x59
>> > > <srv> [24558.925103] print_address_description+0x6b/0x23b
>> > > <srv> [24558.925451] ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> > > <srv> [24558.925797] kasan_report+0x21b/0x241
>> > > <srv> [24558.926136] ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> > > <srv> [24558.926479] ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
>> > > <srv> [24558.926829] ? sock_sendmsg+0x89/0x99
>> > > <srv> [24558.927176] ? __vfs_write+0xd9/0x4ad
>> > > <srv> [24558.927523] ? kernel_read+0xed/0xed
>> > > <srv> [24558.927872] ? SyS_getpeername+0x18c/0x18c
>> > > <srv> [24558.928213] ? bit_waitqueue+0x2a/0x2a
>> > > <srv> [24558.928561] ? wake_atomic_t_function+0x115/0x115
>> > > <srv> [24558.928898] vfs_ioctl+0x6e/0x81
>> > > <srv> [24558.929228] do_vfs_ioctl+0xa00/0xb10
>> > > <srv> [24558.929571] ? sigprocmask+0x1a6/0x1d0
>> > > <srv> [24558.929907] ? sigsuspend+0x13e/0x13e
>> > > <srv> [24558.930239] ? ioctl_preallocate+0x14e/0x14e
>> > > <srv> [24558.930568] ? SyS_rt_sigprocmask+0xf1/0x142
>> > > <srv> [24558.930904] ? sigprocmask+0x1d0/0x1d0
>> > > <srv> [24558.931252] SyS_ioctl+0x39/0x55
>> > > <srv> [24558.931595] ? do_vfs_ioctl+0xb10/0xb10
>> > > <srv> [24558.931942] do_syscall_64+0x1b1/0x31f
>> > > <srv> [24558.932288] entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > > <srv> [24558.932627] RIP: 0033:0x7f302849d8a7
>> > > <srv> [24558.932965] RSP: 002b:00007f3029a52af8 EFLAGS: 00000206
>> > > ORIG_RAX:
>> > > 0000000000000010
>> > > <srv> [24558.933578] RAX: ffffffffffffffda RBX: 00007f3027d861e3 RCX:
>> > > 00007f302849d8a7
>> > > <srv> [24558.933927] RDX: 00007f3023f49468 RSI: 000000004004743a RDI:
>> > > 0000000000003a67
>> > > <srv> [24558.934266] RBP: 00007f3029a52b20 R08: 0000000000000000 R09:
>> > > 000055c8308d8e40
>> > > <srv> [24558.934607] R10: 0000000000000008 R11: 0000000000000206 R12:
>> > > 00007f3023f49358
>> > > <srv> [24558.934947] R13: 00007ffe86e5723f R14: 0000000000000000 R15:
>> > > 00007f3029a53700
>> > > <srv> [24558.935288]
>> > > <srv> [24558.935626] Allocated by task 12622:
>> > > <srv> [24558.935972] ppp_register_net_channel+0x5f/0x5c6
>> > > [ppp_generic]
>> > > <srv> [24558.936306] pppoe_connect+0xab7/0xc71 [pppoe]
>> > > <srv> [24558.936640] SyS_connect+0x14b/0x1b7
>> > > <srv> [24558.936975] do_syscall_64+0x1b1/0x31f
>> > > <srv> [24558.937319] entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > > <srv> [24558.937655]
>> > > <srv> [24558.937993] Freed by task 12622:
>> > > <srv> [24558.938321] kfree+0xb0/0x11d
>> > > <srv> [24558.938658] ppp_release+0x111/0x120 [ppp_generic]
>> > > <srv> [24558.938994] __fput+0x2ba/0x51a
>> > > <srv> [24558.939332] task_work_run+0x11c/0x13d
>> > > <srv> [24558.939676] exit_to_usermode_loop+0x7c/0xaf
>> > > <srv> [24558.940022] do_syscall_64+0x2ea/0x31f
>> > > <srv> [24558.940368] entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > > <srv> [24558.947099]
>> >
>> > Your first guess was right. It looks like we have an issue with
>> > reference counting on the channels. Can you send me your ppp_generic.o?
>> http://nuclearcat.com/ppp_generic.o
>> Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
>>
> From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
> concurrently on the same ppp_file. Even if this ppp_file was pointed at
> by two different file descriptors, I can't see how this could defeat
> the reference counting mechanism. I'm going to think more about it.
For me it looks like pch->clist is not removed from the list ppp->channels
when destroyed via ppp_release(). But I don't want to pretend I understand
ppp logic.
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-21 20:04 ` Cong Wang
@ 2018-02-22 18:30 ` Guillaume Nault
2018-02-22 18:51 ` Denys Fedoryshchenko
2018-02-27 18:54 ` Guillaume Nault
0 siblings, 2 replies; 37+ messages in thread
From: Guillaume Nault @ 2018-02-22 18:30 UTC (permalink / raw)
To: Cong Wang; +Cc: Denys Fedoryshchenko, Linux Kernel Network Developers
On Wed, Feb 21, 2018 at 12:04:30PM -0800, Cong Wang wrote:
> On Thu, Feb 15, 2018 at 11:31 AM, Guillaume Nault <g.nault@alphalink.fr> wrote:
> > On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
> >> On 2018-02-15 17:55, Guillaume Nault wrote:
> >> > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
> >> > > Here we go:
> >> > >
> >> > > <srv> [24558.921549]
> >> > > ==================================================================
> >> > > <srv> [24558.922167] BUG: KASAN: use-after-free in
> >> > > ppp_ioctl+0xa6a/0x1522
> >> > > [ppp_generic]
> >> > > <srv> [24558.922776] Write of size 8 at addr ffff8803d35bf3f8 by task
> >> > > accel-pppd/12622
> >> > > <srv> [24558.923113]
> >> > > <srv> [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
> >> > > W
> >> > > 4.15.3-build-0134 #1
> >> > > <srv> [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
> >> > > BIOS P80
> >> > > 04/02/2015
> >> > > <srv> [24558.924406] Call Trace:
> >> > > <srv> [24558.924753] dump_stack+0x46/0x59
> >> > > <srv> [24558.925103] print_address_description+0x6b/0x23b
> >> > > <srv> [24558.925451] ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> >> > > <srv> [24558.925797] kasan_report+0x21b/0x241
> >> > > <srv> [24558.926136] ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> >> > > <srv> [24558.926479] ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
> >> > > <srv> [24558.926829] ? sock_sendmsg+0x89/0x99
> >> > > <srv> [24558.927176] ? __vfs_write+0xd9/0x4ad
> >> > > <srv> [24558.927523] ? kernel_read+0xed/0xed
> >> > > <srv> [24558.927872] ? SyS_getpeername+0x18c/0x18c
> >> > > <srv> [24558.928213] ? bit_waitqueue+0x2a/0x2a
> >> > > <srv> [24558.928561] ? wake_atomic_t_function+0x115/0x115
> >> > > <srv> [24558.928898] vfs_ioctl+0x6e/0x81
> >> > > <srv> [24558.929228] do_vfs_ioctl+0xa00/0xb10
> >> > > <srv> [24558.929571] ? sigprocmask+0x1a6/0x1d0
> >> > > <srv> [24558.929907] ? sigsuspend+0x13e/0x13e
> >> > > <srv> [24558.930239] ? ioctl_preallocate+0x14e/0x14e
> >> > > <srv> [24558.930568] ? SyS_rt_sigprocmask+0xf1/0x142
> >> > > <srv> [24558.930904] ? sigprocmask+0x1d0/0x1d0
> >> > > <srv> [24558.931252] SyS_ioctl+0x39/0x55
> >> > > <srv> [24558.931595] ? do_vfs_ioctl+0xb10/0xb10
> >> > > <srv> [24558.931942] do_syscall_64+0x1b1/0x31f
> >> > > <srv> [24558.932288] entry_SYSCALL_64_after_hwframe+0x21/0x86
> >> > > <srv> [24558.932627] RIP: 0033:0x7f302849d8a7
> >> > > <srv> [24558.932965] RSP: 002b:00007f3029a52af8 EFLAGS: 00000206
> >> > > ORIG_RAX:
> >> > > 0000000000000010
> >> > > <srv> [24558.933578] RAX: ffffffffffffffda RBX: 00007f3027d861e3 RCX:
> >> > > 00007f302849d8a7
> >> > > <srv> [24558.933927] RDX: 00007f3023f49468 RSI: 000000004004743a RDI:
> >> > > 0000000000003a67
> >> > > <srv> [24558.934266] RBP: 00007f3029a52b20 R08: 0000000000000000 R09:
> >> > > 000055c8308d8e40
> >> > > <srv> [24558.934607] R10: 0000000000000008 R11: 0000000000000206 R12:
> >> > > 00007f3023f49358
> >> > > <srv> [24558.934947] R13: 00007ffe86e5723f R14: 0000000000000000 R15:
> >> > > 00007f3029a53700
> >> > > <srv> [24558.935288]
> >> > > <srv> [24558.935626] Allocated by task 12622:
> >> > > <srv> [24558.935972] ppp_register_net_channel+0x5f/0x5c6
> >> > > [ppp_generic]
> >> > > <srv> [24558.936306] pppoe_connect+0xab7/0xc71 [pppoe]
> >> > > <srv> [24558.936640] SyS_connect+0x14b/0x1b7
> >> > > <srv> [24558.936975] do_syscall_64+0x1b1/0x31f
> >> > > <srv> [24558.937319] entry_SYSCALL_64_after_hwframe+0x21/0x86
> >> > > <srv> [24558.937655]
> >> > > <srv> [24558.937993] Freed by task 12622:
> >> > > <srv> [24558.938321] kfree+0xb0/0x11d
> >> > > <srv> [24558.938658] ppp_release+0x111/0x120 [ppp_generic]
> >> > > <srv> [24558.938994] __fput+0x2ba/0x51a
> >> > > <srv> [24558.939332] task_work_run+0x11c/0x13d
> >> > > <srv> [24558.939676] exit_to_usermode_loop+0x7c/0xaf
> >> > > <srv> [24558.940022] do_syscall_64+0x2ea/0x31f
> >> > > <srv> [24558.940368] entry_SYSCALL_64_after_hwframe+0x21/0x86
> >> > > <srv> [24558.947099]
> >> >
> >> > Your first guess was right. It looks like we have an issue with
> >> > reference counting on the channels. Can you send me your ppp_generic.o?
> >> http://nuclearcat.com/ppp_generic.o
> >> Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
> >>
> > From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
> > concurrently on the same ppp_file. Even if this ppp_file was pointed at
> > by two different file descriptors, I can't see how this could defeat
> > the reference counting mechanism. I'm going to think more about it.
>
> For me it looks like pch->clist is not removed from the list ppp->channels
> when destroyed via ppp_release(). But I don't want to pretend I understand
> ppp logic.
>
I've thought about that too, but couldn't find a scenario that could
trigger the bug.
To get ->private_data pointing to a struct channel pointer, a file needs
to ioctl(PPPIOCATTCHAN) first. For this call to succeed, the channel
must have been registered with ppp_register_net_channel(). Both
operations take a reference on the channel, which means that, before
adding pch->clist to a ppp->channels list (with ppp_connect_channel()),
the channel is already held by a /dev/ppp file and by the code that
registered the channel in the first place.
Therefore, closing the /dev/ppp file can't be enough to make
ppp_release() free the channel. We need to unregister the channel for
the refcount to drop to 0. And ppp_unregister_channel() removes
pch->clist from ppp->channels before decrementing the reference
counter.
I'm simplifying a bit, and there could always be a race somewhere. But
I couldn't find anything so far.
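
The reference-counting argument above can be sketched as a small
single-threaded userspace model. This is only an illustration under
stated assumptions, not the ppp_generic code: struct model_channel and
all model_*() helpers are hypothetical names, and each one stands in for
the kernel function named in its comment.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel's struct channel. */
struct model_channel {
    int refcnt;    /* models pch->file.refcnt */
    bool on_list;  /* models pch->clist being linked on ppp->channels */
    bool freed;    /* set instead of calling kfree(), so callers can inspect it */
};

/* Drop one reference; "free" the channel when the count reaches zero. */
void model_put(struct model_channel *pch)
{
    assert(pch->refcnt > 0);
    if (--pch->refcnt == 0) {
        /* Invariant argued above: never freed while still on the list. */
        assert(!pch->on_list);
        pch->freed = true;
    }
}

/* Models ppp_register_net_channel(): the registering code holds one ref. */
void model_register(struct model_channel *pch)
{
    pch->refcnt = 1;
    pch->on_list = false;
    pch->freed = false;
}

/* Models ioctl(PPPIOCATTCHAN): the /dev/ppp file takes a second ref. */
void model_attach(struct model_channel *pch)
{
    pch->refcnt++;
}

/* Models ppp_connect_channel(): link the channel on ppp->channels. */
void model_connect(struct model_channel *pch)
{
    pch->on_list = true;
}

/* Models ppp_release() on the /dev/ppp file: drop the file's ref. */
void model_release(struct model_channel *pch)
{
    model_put(pch);
}

/* Models ppp_unregister_channel(): unlink first, then drop the ref. */
void model_unregister(struct model_channel *pch)
{
    pch->on_list = false;
    model_put(pch);
}
```

Calling model_register(), model_attach(), model_connect() and then
model_release() leaves refcnt at 1 with the channel still listed and not
freed; only a following model_unregister() frees it, after unlinking.
The model has no concurrency, so it cannot show whether a race exists.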
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-22 18:30 ` Guillaume Nault
@ 2018-02-22 18:51 ` Denys Fedoryshchenko
2018-02-23 9:38 ` Guillaume Nault
2018-02-27 18:54 ` Guillaume Nault
1 sibling, 1 reply; 37+ messages in thread
From: Denys Fedoryshchenko @ 2018-02-22 18:51 UTC (permalink / raw)
To: Guillaume Nault; +Cc: Cong Wang, Linux Kernel Network Developers, netdev-owner
On 2018-02-22 20:30, Guillaume Nault wrote:
> On Wed, Feb 21, 2018 at 12:04:30PM -0800, Cong Wang wrote:
>> On Thu, Feb 15, 2018 at 11:31 AM, Guillaume Nault
>> <g.nault@alphalink.fr> wrote:
>> > On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
>> >> On 2018-02-15 17:55, Guillaume Nault wrote:
>> >> > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
>> >> > > Here we go:
>> >> > >
>> >> > > <srv> [24558.921549]
>> >> > > ==================================================================
>> >> > > <srv> [24558.922167] BUG: KASAN: use-after-free in
>> >> > > ppp_ioctl+0xa6a/0x1522
>> >> > > [ppp_generic]
>> >> > > <srv> [24558.922776] Write of size 8 at addr ffff8803d35bf3f8 by task
>> >> > > accel-pppd/12622
>> >> > > <srv> [24558.923113]
>> >> > > <srv> [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
>> >> > > W
>> >> > > 4.15.3-build-0134 #1
>> >> > > <srv> [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
>> >> > > BIOS P80
>> >> > > 04/02/2015
>> >> > > <srv> [24558.924406] Call Trace:
>> >> > > <srv> [24558.924753] dump_stack+0x46/0x59
>> >> > > <srv> [24558.925103] print_address_description+0x6b/0x23b
>> >> > > <srv> [24558.925451] ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> >> > > <srv> [24558.925797] kasan_report+0x21b/0x241
>> >> > > <srv> [24558.926136] ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> >> > > <srv> [24558.926479] ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
>> >> > > <srv> [24558.926829] ? sock_sendmsg+0x89/0x99
>> >> > > <srv> [24558.927176] ? __vfs_write+0xd9/0x4ad
>> >> > > <srv> [24558.927523] ? kernel_read+0xed/0xed
>> >> > > <srv> [24558.927872] ? SyS_getpeername+0x18c/0x18c
>> >> > > <srv> [24558.928213] ? bit_waitqueue+0x2a/0x2a
>> >> > > <srv> [24558.928561] ? wake_atomic_t_function+0x115/0x115
>> >> > > <srv> [24558.928898] vfs_ioctl+0x6e/0x81
>> >> > > <srv> [24558.929228] do_vfs_ioctl+0xa00/0xb10
>> >> > > <srv> [24558.929571] ? sigprocmask+0x1a6/0x1d0
>> >> > > <srv> [24558.929907] ? sigsuspend+0x13e/0x13e
>> >> > > <srv> [24558.930239] ? ioctl_preallocate+0x14e/0x14e
>> >> > > <srv> [24558.930568] ? SyS_rt_sigprocmask+0xf1/0x142
>> >> > > <srv> [24558.930904] ? sigprocmask+0x1d0/0x1d0
>> >> > > <srv> [24558.931252] SyS_ioctl+0x39/0x55
>> >> > > <srv> [24558.931595] ? do_vfs_ioctl+0xb10/0xb10
>> >> > > <srv> [24558.931942] do_syscall_64+0x1b1/0x31f
>> >> > > <srv> [24558.932288] entry_SYSCALL_64_after_hwframe+0x21/0x86
>> >> > > <srv> [24558.932627] RIP: 0033:0x7f302849d8a7
>> >> > > <srv> [24558.932965] RSP: 002b:00007f3029a52af8 EFLAGS: 00000206
>> >> > > ORIG_RAX:
>> >> > > 0000000000000010
>> >> > > <srv> [24558.933578] RAX: ffffffffffffffda RBX: 00007f3027d861e3 RCX:
>> >> > > 00007f302849d8a7
>> >> > > <srv> [24558.933927] RDX: 00007f3023f49468 RSI: 000000004004743a RDI:
>> >> > > 0000000000003a67
>> >> > > <srv> [24558.934266] RBP: 00007f3029a52b20 R08: 0000000000000000 R09:
>> >> > > 000055c8308d8e40
>> >> > > <srv> [24558.934607] R10: 0000000000000008 R11: 0000000000000206 R12:
>> >> > > 00007f3023f49358
>> >> > > <srv> [24558.934947] R13: 00007ffe86e5723f R14: 0000000000000000 R15:
>> >> > > 00007f3029a53700
>> >> > > <srv> [24558.935288]
>> >> > > <srv> [24558.935626] Allocated by task 12622:
>> >> > > <srv> [24558.935972] ppp_register_net_channel+0x5f/0x5c6
>> >> > > [ppp_generic]
>> >> > > <srv> [24558.936306] pppoe_connect+0xab7/0xc71 [pppoe]
>> >> > > <srv> [24558.936640] SyS_connect+0x14b/0x1b7
>> >> > > <srv> [24558.936975] do_syscall_64+0x1b1/0x31f
>> >> > > <srv> [24558.937319] entry_SYSCALL_64_after_hwframe+0x21/0x86
>> >> > > <srv> [24558.937655]
>> >> > > <srv> [24558.937993] Freed by task 12622:
>> >> > > <srv> [24558.938321] kfree+0xb0/0x11d
>> >> > > <srv> [24558.938658] ppp_release+0x111/0x120 [ppp_generic]
>> >> > > <srv> [24558.938994] __fput+0x2ba/0x51a
>> >> > > <srv> [24558.939332] task_work_run+0x11c/0x13d
>> >> > > <srv> [24558.939676] exit_to_usermode_loop+0x7c/0xaf
>> >> > > <srv> [24558.940022] do_syscall_64+0x2ea/0x31f
>> >> > > <srv> [24558.940368] entry_SYSCALL_64_after_hwframe+0x21/0x86
>> >> > > <srv> [24558.947099]
>> >> >
>> >> > Your first guess was right. It looks like we have an issue with
>> >> > reference counting on the channels. Can you send me your ppp_generic.o?
>> >> http://nuclearcat.com/ppp_generic.o
>> >> Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
>> >>
>> > From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
>> > concurrently on the same ppp_file. Even if this ppp_file was pointed at
>> > by two different file descriptors, I can't see how this could defeat
>> > the reference counting mechanism. I'm going to think more about it.
>>
>> For me it looks like pch->clist is not removed from the list
>> ppp->channels
>> when destroyed via ppp_release(). But I don't want to pretend I
>> understand
>> ppp logic.
>>
> I've thought about that too, but couldn't find a scenario that could
> trigger the bug.
>
> To get ->private_data pointing to a struct channel pointer, a file
> needs
> to ioctl(PPPIOCATTCHAN) first. For this call to succeed, the channel
> must have been registered with ppp_register_net_channel(). Both
> operations take a reference on the channel, which means that, before
> adding pch->clist to a ppp->channels list (with ppp_connect_channel()),
> the channel is already held by a /dev/ppp file and by the code that
> registered the channel in the first place.
>
> Therefore, closing the /dev/ppp file can't be enough to make
> ppp_release() free the channel. We need to unregister the channel for
> the refcount to drop to 0. And ppp_unregister_channel(), removes
> pch->clist from ppp->channels before decrementing the reference
> counter.
>
> I'm simplifying a bit, and there could always be a race somewhere. But
> I couldn't find anything so far.
I'm using accel-ppp, which has a unit-cache option, I guess for "reusing"
ppp interfaces (because creating a lot of interfaces on a BRAS with 8k
users is quite expensive).
Maybe it is somehow related, and could be the scenario causing this bug?
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-22 18:51 ` Denys Fedoryshchenko
@ 2018-02-23 9:38 ` Guillaume Nault
2018-02-23 9:41 ` Denys Fedoryshchenko
0 siblings, 1 reply; 37+ messages in thread
From: Guillaume Nault @ 2018-02-23 9:38 UTC (permalink / raw)
To: Denys Fedoryshchenko
Cc: Cong Wang, Linux Kernel Network Developers, netdev-owner
On Thu, Feb 22, 2018 at 08:51:19PM +0200, Denys Fedoryshchenko wrote:
> I'm using accel-ppp that has unit-cache option, i guess for "reusing" ppp
> interfaces (because creating a lot of interfaces on BRAS with 8k users quite
> expensive).
> Maybe it is somehow related and can be that scenario causing this bug?
>
Indeed, it'd be interesting to know if unit-cache is part of the
equation (if it's workable for you to disable it).
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-23 9:38 ` Guillaume Nault
@ 2018-02-23 9:41 ` Denys Fedoryshchenko
2018-02-23 10:07 ` Guillaume Nault
0 siblings, 1 reply; 37+ messages in thread
From: Denys Fedoryshchenko @ 2018-02-23 9:41 UTC (permalink / raw)
To: Guillaume Nault; +Cc: Cong Wang, Linux Kernel Network Developers, netdev-owner
On 2018-02-23 11:38, Guillaume Nault wrote:
> On Thu, Feb 22, 2018 at 08:51:19PM +0200, Denys Fedoryshchenko wrote:
>> I'm using accel-ppp that has unit-cache option, i guess for "reusing"
>> ppp
>> interfaces (because creating a lot of interfaces on BRAS with 8k users
>> quite
>> expensive).
>> Maybe it is somehow related and can be that scenario causing this bug?
>>
> Indeed, it'd be interesting to know if unit-cache is part of the
> equation (if it's workable for you to disable it).
Already did that and am testing now. Unfortunately I had to disable KASAN
and full refcount, as the performance hit is too heavy for me. I will try
to enable KASAN alone tomorrow.
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-23 9:41 ` Denys Fedoryshchenko
@ 2018-02-23 10:07 ` Guillaume Nault
2018-02-23 10:54 ` Denys Fedoryshchenko
` (2 more replies)
0 siblings, 3 replies; 37+ messages in thread
From: Guillaume Nault @ 2018-02-23 10:07 UTC (permalink / raw)
To: Denys Fedoryshchenko
Cc: Cong Wang, Linux Kernel Network Developers, netdev-owner
On Fri, Feb 23, 2018 at 11:41:43AM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-23 11:38, Guillaume Nault wrote:
> > On Thu, Feb 22, 2018 at 08:51:19PM +0200, Denys Fedoryshchenko wrote:
> > > I'm using accel-ppp that has unit-cache option, i guess for
> > > "reusing" ppp
> > > interfaces (because creating a lot of interfaces on BRAS with 8k
> > > users quite
> > > expensive).
> > > Maybe it is somehow related and can be that scenario causing this bug?
> > >
> > Indeed, it'd be interesting to know if unit-cache is part of the
> > equation (if it's workable for you to disable it).
> Already did that and testing, unfortunately i had to disable KASAN and full
> refcount, as performance hit is too heavy for me. I will try to enable KASAN
> alone tomorrow.
>
Don't hesitate to post the result even if you can't afford enabling KASAN.
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-23 10:07 ` Guillaume Nault
@ 2018-02-23 10:54 ` Denys Fedoryshchenko
2018-02-24 21:22 ` Denys Fedoryshchenko
2018-02-27 10:58 ` Denys Fedoryshchenko
2 siblings, 0 replies; 37+ messages in thread
From: Denys Fedoryshchenko @ 2018-02-23 10:54 UTC (permalink / raw)
To: Guillaume Nault; +Cc: Cong Wang, Linux Kernel Network Developers, netdev-owner
On 2018-02-23 12:07, Guillaume Nault wrote:
> On Fri, Feb 23, 2018 at 11:41:43AM +0200, Denys Fedoryshchenko wrote:
>> On 2018-02-23 11:38, Guillaume Nault wrote:
>> > On Thu, Feb 22, 2018 at 08:51:19PM +0200, Denys Fedoryshchenko wrote:
>> > > I'm using accel-ppp that has unit-cache option, i guess for
>> > > "reusing" ppp
>> > > interfaces (because creating a lot of interfaces on BRAS with 8k
>> > > users quite
>> > > expensive).
>> > > Maybe it is somehow related and can be that scenario causing this bug?
>> > >
>> > Indeed, it'd be interesting to know if unit-cache is part of the
>> > equation (if it's workable for you to disable it).
>> Already did that and testing, unfortunately i had to disable KASAN and
>> full
>> refcount, as performance hit is too heavy for me. I will try to enable
>> KASAN
>> alone tomorrow.
>>
> Don't hesitate to post the result even if you can't afford enabling
> KASAN.
For sure. I expect it to crash even without KASAN enabled (I just won't
have a clear message about the reason).
It usually happened within 6-10 hours after an upgrade done at night,
once load started to increase. I prefer to wait at least 48h, even if
there is no crash.
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-23 10:07 ` Guillaume Nault
2018-02-23 10:54 ` Denys Fedoryshchenko
@ 2018-02-24 21:22 ` Denys Fedoryshchenko
2018-02-27 10:58 ` Denys Fedoryshchenko
2 siblings, 0 replies; 37+ messages in thread
From: Denys Fedoryshchenko @ 2018-02-24 21:22 UTC (permalink / raw)
To: Guillaume Nault; +Cc: Cong Wang, Linux Kernel Network Developers, netdev-owner
On 2018-02-23 12:07, Guillaume Nault wrote:
> On Fri, Feb 23, 2018 at 11:41:43AM +0200, Denys Fedoryshchenko wrote:
>> On 2018-02-23 11:38, Guillaume Nault wrote:
>> > On Thu, Feb 22, 2018 at 08:51:19PM +0200, Denys Fedoryshchenko wrote:
>> > > I'm using accel-ppp that has unit-cache option, i guess for
>> > > "reusing" ppp
>> > > interfaces (because creating a lot of interfaces on BRAS with 8k
>> > > users quite
>> > > expensive).
>> > > Maybe it is somehow related and can be that scenario causing this bug?
>> > >
>> > Indeed, it'd be interesting to know if unit-cache is part of the
>> > equation (if it's workable for you to disable it).
>> Already did that and testing, unfortunately i had to disable KASAN and
>> full
>> refcount, as performance hit is too heavy for me. I will try to enable
>> KASAN
>> alone tomorrow.
>>
> Don't hesitate to post the result even if you can't afford enabling
> KASAN.
Very likely unit-cache is a major contributor to these reboots.
After disabling it, it has been almost 48h with no reboots yet.
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-23 10:07 ` Guillaume Nault
2018-02-23 10:54 ` Denys Fedoryshchenko
2018-02-24 21:22 ` Denys Fedoryshchenko
@ 2018-02-27 10:58 ` Denys Fedoryshchenko
2018-02-27 18:56 ` Guillaume Nault
2 siblings, 1 reply; 37+ messages in thread
From: Denys Fedoryshchenko @ 2018-02-27 10:58 UTC (permalink / raw)
To: Guillaume Nault; +Cc: Cong Wang, Linux Kernel Network Developers, netdev-owner
On 2018-02-23 12:07, Guillaume Nault wrote:
> On Fri, Feb 23, 2018 at 11:41:43AM +0200, Denys Fedoryshchenko wrote:
>> On 2018-02-23 11:38, Guillaume Nault wrote:
>> > On Thu, Feb 22, 2018 at 08:51:19PM +0200, Denys Fedoryshchenko wrote:
>> > > I'm using accel-ppp that has unit-cache option, i guess for
>> > > "reusing" ppp
>> > > interfaces (because creating a lot of interfaces on BRAS with 8k
>> > > users quite
>> > > expensive).
>> > > Maybe it is somehow related and can be that scenario causing this bug?
>> > >
>> > Indeed, it'd be interesting to know if unit-cache is part of the
>> > equation (if it's workable for you to disable it).
>> Already did that and testing, unfortunately i had to disable KASAN and
>> full
>> refcount, as performance hit is too heavy for me. I will try to enable
>> KASAN
>> alone tomorrow.
>>
> Don't hesitate to post the result even if you can't afford enabling
> KASAN.
So far, 4 days and no reboots.
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-22 18:30 ` Guillaume Nault
2018-02-22 18:51 ` Denys Fedoryshchenko
@ 2018-02-27 18:54 ` Guillaume Nault
1 sibling, 0 replies; 37+ messages in thread
From: Guillaume Nault @ 2018-02-27 18:54 UTC (permalink / raw)
To: Cong Wang; +Cc: Denys Fedoryshchenko, Linux Kernel Network Developers
On Thu, Feb 22, 2018 at 07:30:38PM +0100, Guillaume Nault wrote:
> On Wed, Feb 21, 2018 at 12:04:30PM -0800, Cong Wang wrote:
> > For me it looks like pch->clist is not removed from the list ppp->channels
> > when destroyed via ppp_release(). But I don't want to pretend I understand
> > ppp logic.
> >
> I've thought about that too, but couldn't find a scenario that could
> trigger the bug.
>
> To get ->private_data pointing to a struct channel pointer, a file needs
> to ioctl(PPPIOCATTCHAN) first. For this call to succeed, the channel
> must have been registered with ppp_register_net_channel(). Both
> operations take a reference on the channel, which means that, before
> adding pch->clist to a ppp->channels list (with ppp_connect_channel()),
> the channel is already held by a /dev/ppp file and by the code that
> registered the channel in the first place.
>
> Therefore, closing the /dev/ppp file can't be enough to make
> ppp_release() free the channel. We need to unregister the channel for
> the refcount to drop to 0. And ppp_unregister_channel() removes
> pch->clist from ppp->channels before decrementing the reference
> counter.
>
And this is where my reasoning failed... If pch->clist hasn't been
added to a ppp->channels list (that is, there was no
ppp_connect_channel() call for this channel), then
ppp_unregister_channel() only decrements the reference counter.
Therefore, we now have an unregistered channel which is only held by a
/dev/ppp file. But ioctl(PPPIOCCONNECT) still works on such a file, so
one can add pch->clist to a ppp->channels list. When the file
descriptor closes, we fall into Cong's scenario and the channel is freed,
leaving dangling pointers in ppp->channels.
Then, it's just a matter of calling ioctl(PPPIOCCONNECT) on this ppp
unit again to make list_add_tail() follow those invalid pointers and
crash.
Thank you Cong for putting me on the right track.
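
The sequence described above can be replayed in a small user-space
model. This is only a sketch: the toy_* names and the struct are
hypothetical stand-ins, not the kernel's data structures, and a `freed`
flag replaces a real free so the program stays well defined. The
`check_registered` switch models a guard that refuses to connect a
channel that is no longer registered (pch->chan == NULL in the kernel).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for struct channel; every name here is hypothetical. */
struct toy_channel {
    int refcnt;        /* one ref from the registrar, one from the /dev/ppp file */
    bool registered;   /* models pch->chan != NULL */
    bool on_unit_list; /* models pch->clist being linked into ppp->channels */
    bool freed;        /* set instead of freeing memory, to keep the demo defined */
};

/* Drop one reference; mark the channel freed when the count reaches zero. */
static void toy_put(struct toy_channel *c)
{
    if (--c->refcnt == 0)
        c->freed = true;
}

/* Models ppp_unregister_channel(): unlink (a no-op if never connected), drop a ref. */
static void toy_unregister(struct toy_channel *c)
{
    c->registered = false;
    c->on_unit_list = false;
    toy_put(c);
}

/* Models ioctl(PPPIOCCONNECT); check_registered enables the guard. */
static int toy_connect(struct toy_channel *c, bool check_registered)
{
    if (check_registered && !c->registered)
        return -ENOTCONN;
    c->on_unit_list = true;
    return 0;
}

/* Models closing the /dev/ppp file: only the file's reference is dropped. */
static void toy_release(struct toy_channel *c)
{
    toy_put(c);
}

/* Returns true if the sequence ends with a freed channel still on the list. */
static bool toy_scenario_leaves_dangling_entry(bool check_registered)
{
    struct toy_channel c = { .refcnt = 2, .registered = true };

    toy_unregister(&c);                       /* refcnt 2 -> 1, never connected */
    (void)toy_connect(&c, check_registered);  /* late connect on the open file */
    toy_release(&c);                          /* refcnt 1 -> 0 */
    assert(c.freed);
    return c.freed && c.on_unit_list;
}
```

Calling toy_scenario_leaves_dangling_entry(false) reports a freed
channel that is still linked, while passing true makes the connect step
fail with -ENOTCONN and leaves nothing behind on the list.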
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-27 10:58 ` Denys Fedoryshchenko
@ 2018-02-27 18:56 ` Guillaume Nault
2018-03-01 20:01 ` Guillaume Nault
0 siblings, 1 reply; 37+ messages in thread
From: Guillaume Nault @ 2018-02-27 18:56 UTC (permalink / raw)
To: Denys Fedoryshchenko
Cc: Cong Wang, Linux Kernel Network Developers, netdev-owner
On Tue, Feb 27, 2018 at 12:58:55PM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-23 12:07, Guillaume Nault wrote:
> > On Fri, Feb 23, 2018 at 11:41:43AM +0200, Denys Fedoryshchenko wrote:
> > > On 2018-02-23 11:38, Guillaume Nault wrote:
> > > > On Thu, Feb 22, 2018 at 08:51:19PM +0200, Denys Fedoryshchenko wrote:
> > > > > I'm using accel-ppp that has unit-cache option, i guess for
> > > > > "reusing" ppp
> > > > > interfaces (because creating a lot of interfaces on BRAS with 8k
> > > > > users quite
> > > > > expensive).
> > > > > Maybe it is somehow related and can be that scenario causing this bug?
> > > > >
> > > > Indeed, it'd be interesting to know if unit-cache is part of the
> > > > equation (if it's workable for you to disable it).
> > > Already did that and testing, unfortunately i had to disable KASAN
> > > and full
> > > refcount, as performance hit is too heavy for me. I will try to
> > > enable KASAN
> > > alone tomorrow.
> > >
> > Don't hesitate to post the result even if you can't afford enabling
> > KASAN.
> Till now 4 days and no reboots.
>
That unit-cache information was very useful. I can now reproduce the
issue and work on a fix.
Thanks Denys!
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-02-27 18:56 ` Guillaume Nault
@ 2018-03-01 20:01 ` Guillaume Nault
2018-03-01 20:07 ` Denys Fedoryshchenko
0 siblings, 1 reply; 37+ messages in thread
From: Guillaume Nault @ 2018-03-01 20:01 UTC (permalink / raw)
To: Denys Fedoryshchenko
Cc: Cong Wang, Linux Kernel Network Developers, netdev-owner
On Tue, Feb 27, 2018 at 07:56:27PM +0100, Guillaume Nault wrote:
> On Tue, Feb 27, 2018 at 12:58:55PM +0200, Denys Fedoryshchenko wrote:
> > On 2018-02-23 12:07, Guillaume Nault wrote:
> > > On Fri, Feb 23, 2018 at 11:41:43AM +0200, Denys Fedoryshchenko wrote:
> > > > On 2018-02-23 11:38, Guillaume Nault wrote:
> > > > > On Thu, Feb 22, 2018 at 08:51:19PM +0200, Denys Fedoryshchenko wrote:
> > > > > > I'm using accel-ppp that has unit-cache option, i guess for
> > > > > > "reusing" ppp
> > > > > > interfaces (because creating a lot of interfaces on BRAS with 8k
> > > > > > users quite
> > > > > > expensive).
> > > > > > Maybe it is somehow related and can be that scenario causing this bug?
> > > > > >
> > > > > Indeed, it'd be interesting to know if unit-cache is part of the
> > > > > equation (if it's workable for you to disable it).
> > > > Already did that and testing, unfortunately i had to disable KASAN
> > > > and full
> > > > refcount, as performance hit is too heavy for me. I will try to
> > > > enable KASAN
> > > > alone tomorrow.
> > > >
> > > Don't hesitate to post the result even if you can't afford enabling
> > > KASAN.
> > Till now 4 days and no reboots.
> >
> That unit-cache information was very useful. I can now reproduce the
> issue and work on a fix.
>
You can try the following patch.
Sorry for the delay, I'm a bit short on time these days.
diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index 255a5def56e9..2acf4b0eabd1 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -3161,6 +3161,15 @@ ppp_connect_channel(struct channel *pch, int unit)
 		goto outl;
 
 	ppp_lock(ppp);
+	spin_lock_bh(&pch->downl);
+	if (!pch->chan) {
+		/* Don't connect unregistered channels */
+		ppp_unlock(ppp);
+		spin_unlock_bh(&pch->downl);
+		ret = -ENOTCONN;
+		goto outl;
+	}
+	spin_unlock_bh(&pch->downl);
 	if (pch->file.hdrlen > ppp->file.hdrlen)
 		ppp->file.hdrlen = pch->file.hdrlen;
 	hdrlen = pch->file.hdrlen + 2;	/* for protocol bytes */
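
The hunk above takes ppp_lock(ppp) first and pch->downl second. A
follow-up in this thread notes that the error path should release them
in the reverse order. A minimal user-space sketch of that rule follows,
assuming a hypothetical lock tracker (the toy_* names are not kernel
APIs) that flags any unlock that is not last-in, first-out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_lock { TOY_PPP_LOCK, TOY_DOWNL };

/* Records held locks so the demo can tell whether release order is LIFO. */
struct toy_lock_stack {
    enum toy_lock held[4];
    size_t depth;
    bool lifo_violation;
};

static void toy_lock(struct toy_lock_stack *s, enum toy_lock l)
{
    assert(s->depth < 4);
    s->held[s->depth++] = l;
}

static void toy_unlock(struct toy_lock_stack *s, enum toy_lock l)
{
    size_t i = s->depth;

    /* Find the most recent entry for this lock (i is its index plus one). */
    while (i > 0 && s->held[i - 1] != l)
        i--;
    assert(i > 0);                    /* the lock must currently be held */
    if (i != s->depth)
        s->lifo_violation = true;     /* not the most recently taken lock */
    for (; i < s->depth; i++)         /* close the gap left by the removal */
        s->held[i - 1] = s->held[i];
    s->depth--;
}

/*
 * Models the guarded part of the connect path. first_draft_order selects
 * the unlock order of the hunk as first posted; otherwise the reverse,
 * LIFO order is used.
 */
static int toy_connect_path(struct toy_lock_stack *s, bool chan_registered,
                            bool first_draft_order)
{
    toy_lock(s, TOY_PPP_LOCK);
    toy_lock(s, TOY_DOWNL);
    if (!chan_registered) {
        if (first_draft_order) {
            toy_unlock(s, TOY_PPP_LOCK);
            toy_unlock(s, TOY_DOWNL);
        } else {
            toy_unlock(s, TOY_DOWNL);
            toy_unlock(s, TOY_PPP_LOCK);
        }
        return -ENOTCONN;
    }
    toy_unlock(s, TOY_DOWNL);
    /* ... the real function keeps working under the outer lock here ... */
    toy_unlock(s, TOY_PPP_LOCK);
    return 0;
}
```

With first_draft_order set, the tracker records a LIFO violation on the
error path; with it cleared, both the error path and the success path
release the locks in the reverse of the order they were taken.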
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-03-01 20:01 ` Guillaume Nault
@ 2018-03-01 20:07 ` Denys Fedoryshchenko
2018-03-02 17:43 ` Guillaume Nault
0 siblings, 1 reply; 37+ messages in thread
From: Denys Fedoryshchenko @ 2018-03-01 20:07 UTC (permalink / raw)
To: Guillaume Nault; +Cc: Cong Wang, Linux Kernel Network Developers, netdev-owner
On 2018-03-01 22:01, Guillaume Nault wrote:
> On Tue, Feb 27, 2018 at 07:56:27PM +0100, Guillaume Nault wrote:
>> On Tue, Feb 27, 2018 at 12:58:55PM +0200, Denys Fedoryshchenko wrote:
>> > On 2018-02-23 12:07, Guillaume Nault wrote:
>> > > On Fri, Feb 23, 2018 at 11:41:43AM +0200, Denys Fedoryshchenko wrote:
>> > > > On 2018-02-23 11:38, Guillaume Nault wrote:
>> > > > > On Thu, Feb 22, 2018 at 08:51:19PM +0200, Denys Fedoryshchenko wrote:
>> > > > > > I'm using accel-ppp that has unit-cache option, i guess for
>> > > > > > "reusing" ppp
>> > > > > > interfaces (because creating a lot of interfaces on BRAS with 8k
>> > > > > > users quite
>> > > > > > expensive).
>> > > > > > Maybe it is somehow related and can be that scenario causing this bug?
>> > > > > >
>> > > > > Indeed, it'd be interesting to know if unit-cache is part of the
>> > > > > equation (if it's workable for you to disable it).
>> > > > Already did that and testing, unfortunately i had to disable KASAN
>> > > > and full
>> > > > refcount, as performance hit is too heavy for me. I will try to
>> > > > enable KASAN
>> > > > alone tomorrow.
>> > > >
>> > > Don't hesitate to post the result even if you can't afford enabling
>> > > KASAN.
>> > Till now 4 days and no reboots.
>> >
>> That unit-cache information was very useful. I can now reproduce the
>> issue and work on a fix.
>>
> You can try the following patch.
>
> Sorry for the delay, I'm a bit out of time these days.
>
> diff --git a/drivers/net/ppp/ppp_generic.c
> b/drivers/net/ppp/ppp_generic.c
> index 255a5def56e9..2acf4b0eabd1 100644
> --- a/drivers/net/ppp/ppp_generic.c
> +++ b/drivers/net/ppp/ppp_generic.c
> @@ -3161,6 +3161,15 @@ ppp_connect_channel(struct channel *pch, int
> unit)
> goto outl;
>
> ppp_lock(ppp);
> + spin_lock_bh(&pch->downl);
> + if (!pch->chan) {
> + /* Don't connect unregistered channels */
> + ppp_unlock(ppp);
> + spin_unlock_bh(&pch->downl);
> + ret = -ENOTCONN;
> + goto outl;
> + }
> + spin_unlock_bh(&pch->downl);
> if (pch->file.hdrlen > ppp->file.hdrlen)
> ppp->file.hdrlen = pch->file.hdrlen;
> hdrlen = pch->file.hdrlen + 2; /* for protocol bytes */
OK, I will try to test that at night.
Thanks a lot! For me the problem is solved anyway by removing unit-cache,
but I think it's nice to have the bug fixed :)
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-03-01 20:07 ` Denys Fedoryshchenko
@ 2018-03-02 17:43 ` Guillaume Nault
2018-03-03 9:33 ` Denys Fedoryshchenko
0 siblings, 1 reply; 37+ messages in thread
From: Guillaume Nault @ 2018-03-02 17:43 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: Cong Wang, Linux Kernel Network Developers
On Thu, Mar 01, 2018 at 10:07:05PM +0200, Denys Fedoryshchenko wrote:
> On 2018-03-01 22:01, Guillaume Nault wrote:
> > diff --git a/drivers/net/ppp/ppp_generic.c
> > b/drivers/net/ppp/ppp_generic.c
> > index 255a5def56e9..2acf4b0eabd1 100644
> > --- a/drivers/net/ppp/ppp_generic.c
> > +++ b/drivers/net/ppp/ppp_generic.c
> > @@ -3161,6 +3161,15 @@ ppp_connect_channel(struct channel *pch, int
> > unit)
> > goto outl;
> >
> > ppp_lock(ppp);
> > + spin_lock_bh(&pch->downl);
> > + if (!pch->chan) {
> > + /* Don't connect unregistered channels */
> > + ppp_unlock(ppp);
> > + spin_unlock_bh(&pch->downl);
This is obviously wrong. It should have been
+ spin_unlock_bh(&pch->downl);
+ ppp_unlock(ppp);
Sorry, I shouldn't have hurried.
This is fixed in the official version.
> > + ret = -ENOTCONN;
> > + goto outl;
> > + }
> > + spin_unlock_bh(&pch->downl);
> > if (pch->file.hdrlen > ppp->file.hdrlen)
> > ppp->file.hdrlen = pch->file.hdrlen;
> > hdrlen = pch->file.hdrlen + 2; /* for protocol bytes */
> Ok, i will try to test that at night.
> Thanks a lot! For me also problem solved anyway by removing unit-cache, just
> i think it's nice to have bug fixed :)
>
I think this bug has been there forever; indeed, it's good to have it fixed.
Thanks a lot for your help (and patience!).
FYI, if you see accel-ppp logs like
"ioctl(PPPIOCCONNECT): Transport endpoint is not connected", then that
means the patch prevented the scenario that was leading to the original
crash.
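
For reference, a user-space caller could surface that error roughly as
follows. This is a hedged sketch, not accel-ppp's actual code: the
function name connect_channel_to_unit() is hypothetical, and only the
PPPIOCCONNECT ioctl and the ENOTCONN errno value come from the
discussion above.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/ppp-ioctl.h>   /* PPPIOCCONNECT */

/*
 * Try to connect the channel behind chan_fd to PPP unit `unit`.
 * Returns 0 on success or a negative errno value on failure.
 */
static int connect_channel_to_unit(int chan_fd, int unit)
{
    if (ioctl(chan_fd, PPPIOCCONNECT, &unit) < 0) {
        int err = errno;   /* save errno before any other library call */

        if (err == ENOTCONN)
            fprintf(stderr,
                    "ioctl(PPPIOCCONNECT): %s (channel no longer registered?)\n",
                    strerror(err));
        else
            fprintf(stderr, "ioctl(PPPIOCCONNECT): %s\n", strerror(err));
        return -err;
    }
    return 0;
}
```

A caller that gets -ENOTCONN back would typically drop the stale channel
descriptor instead of retrying on the same one.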
Out of curiosity, did unit-cache really bring performance improvements
on your workload?
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-03-02 17:43 ` Guillaume Nault
@ 2018-03-03 9:33 ` Denys Fedoryshchenko
2018-03-05 17:22 ` Guillaume Nault
0 siblings, 1 reply; 37+ messages in thread
From: Denys Fedoryshchenko @ 2018-03-03 9:33 UTC (permalink / raw)
To: Guillaume Nault; +Cc: Cong Wang, Linux Kernel Network Developers, netdev-owner
On 2018-03-02 19:43, Guillaume Nault wrote:
> On Thu, Mar 01, 2018 at 10:07:05PM +0200, Denys Fedoryshchenko wrote:
>> On 2018-03-01 22:01, Guillaume Nault wrote:
>> > diff --git a/drivers/net/ppp/ppp_generic.c
>> > b/drivers/net/ppp/ppp_generic.c
>> > index 255a5def56e9..2acf4b0eabd1 100644
>> > --- a/drivers/net/ppp/ppp_generic.c
>> > +++ b/drivers/net/ppp/ppp_generic.c
>> > @@ -3161,6 +3161,15 @@ ppp_connect_channel(struct channel *pch, int
>> > unit)
>> > goto outl;
>> >
>> > ppp_lock(ppp);
>> > + spin_lock_bh(&pch->downl);
>> > + if (!pch->chan) {
>> > + /* Don't connect unregistered channels */
>> > + ppp_unlock(ppp);
>> > + spin_unlock_bh(&pch->downl);
>
> This is obviously wrong. It should have been
> + spin_unlock_bh(&pch->downl);
> + ppp_unlock(ppp);
>
> Sorry, I shouldn't have hurried.
> This is fixed in the official version.
>
>> > + ret = -ENOTCONN;
>> > + goto outl;
>> > + }
>> > + spin_unlock_bh(&pch->downl);
>> > if (pch->file.hdrlen > ppp->file.hdrlen)
>> > ppp->file.hdrlen = pch->file.hdrlen;
>> > hdrlen = pch->file.hdrlen + 2; /* for protocol bytes */
>> Ok, i will try to test that at night.
>> Thanks a lot! For me also problem solved anyway by removing
>> unit-cache, just
>> i think it's nice to have bug fixed :)
>>
> I think this bug has been there forever, indeed it's good to have it
> fixed.
> Thanks a lot for your help (and patience!).
>
> FYI, if you see accel-ppp logs like
> "ioctl(PPPIOCCONNECT): Transport endpoint is not connected", then that
> means the patch prevented the scenario that was leading to the original
> crash.
>
> Out of curiosity, did unit-cache really bring performance improvements
> on your workload?
On old kernels it definitely did. Due to local specifics (electricity
outages), I might have a few thousand interfaces deleted and created
again in a short period of time.
And back then, interface creation/deletion (especially with thousands of
them) was very expensive.
* Re: ppp/pppoe, still panic 4.15.3 in ppp_push
2018-03-03 9:33 ` Denys Fedoryshchenko
@ 2018-03-05 17:22 ` Guillaume Nault
0 siblings, 0 replies; 37+ messages in thread
From: Guillaume Nault @ 2018-03-05 17:22 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: Cong Wang, Linux Kernel Network Developers
On Sat, Mar 03, 2018 at 11:33:53AM +0200, Denys Fedoryshchenko wrote:
> On 2018-03-02 19:43, Guillaume Nault wrote:
> > Out of curiosity, did unit-cache really bring performance improvements
> > on your workload?
> On old kernels it definitely did, due local specifics (electricity outages)
> i might have few thousands of interfaces deleted and created again in short
> period of time.
> And before interfaces creation/deletion (especially when there is thousands
> of them) was very expensive.
I see. Our workload is a bit different, that's probably why we've never
felt the need for the unit-cache.
end of thread, newest message: 2018-03-05 17:22 UTC
Thread overview: 37+ messages
2018-02-14 13:17 ppp/pppoe, still panic 4.15.3 in ppp_push Denys Fedoryshchenko
2018-02-14 16:07 ` Guillaume Nault
2018-02-14 16:29 ` Denys Fedoryshchenko
2018-02-14 16:47 ` Guillaume Nault
2018-02-14 16:49 ` Denys Fedoryshchenko
2018-02-14 17:25 ` Guillaume Nault
2018-02-15 10:19 ` Denys Fedoryshchenko
2018-02-15 15:55 ` Guillaume Nault
2018-02-15 16:01 ` Denys Fedoryshchenko
2018-02-15 19:31 ` Guillaume Nault
2018-02-15 19:34 ` Denys Fedoryshchenko
2018-02-15 19:42 ` Guillaume Nault
2018-02-16 11:13 ` Denys Fedoryshchenko
2018-02-16 18:48 ` Guillaume Nault
2018-02-18 10:01 ` Denys Fedoryshchenko
2018-02-21 18:38 ` Guillaume Nault
2018-02-20 9:05 ` Denys Fedoryshchenko
2018-02-21 10:26 ` Denys Fedoryshchenko
2018-02-21 18:55 ` Guillaume Nault
2018-02-21 19:30 ` Denys Fedoryshchenko
2018-02-21 20:04 ` Cong Wang
2018-02-22 18:30 ` Guillaume Nault
2018-02-22 18:51 ` Denys Fedoryshchenko
2018-02-23 9:38 ` Guillaume Nault
2018-02-23 9:41 ` Denys Fedoryshchenko
2018-02-23 10:07 ` Guillaume Nault
2018-02-23 10:54 ` Denys Fedoryshchenko
2018-02-24 21:22 ` Denys Fedoryshchenko
2018-02-27 10:58 ` Denys Fedoryshchenko
2018-02-27 18:56 ` Guillaume Nault
2018-03-01 20:01 ` Guillaume Nault
2018-03-01 20:07 ` Denys Fedoryshchenko
2018-03-02 17:43 ` Guillaume Nault
2018-03-03 9:33 ` Denys Fedoryshchenko
2018-03-05 17:22 ` Guillaume Nault
2018-02-27 18:54 ` Guillaume Nault
2018-02-15 19:20 ` Guillaume Nault