From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from nuclearcat.com ([144.76.183.226]:40038 "EHLO nuclearcat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750884AbeBTJFO (ORCPT ); Tue, 20 Feb 2018 04:05:14 -0500 MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed Content-Transfer-Encoding: 7bit Date: Tue, 20 Feb 2018 11:05:10 +0200 From: Denys Fedoryshchenko To: Guillaume Nault Cc: Linux Kernel Network Developers Subject: Re: ppp/pppoe, still panic 4.15.3 in ppp_push In-Reply-To: <20180216184812.GM1422@alphalink.fr> References: <20180214164728.GG1422@alphalink.fr> <802ba894fee75e34576d39aed5022dc8@nuclearcat.com> <20180214172549.GH1422@alphalink.fr> <20180215155517.GI1422@alphalink.fr> <79ba31ea47137ca539beea112bec8e57@nuclearcat.com> <20180215193114.GK1422@alphalink.fr> <61f3b6b7b437cd7b37e3aae0d2ee1df8@nuclearcat.com> <20180215194253.GL1422@alphalink.fr> <159534b0fadb78c727bce6d55d358689@nuclearcat.com> <20180216184812.GM1422@alphalink.fr> Message-ID: Sender: netdev-owner@vger.kernel.org List-ID: On 2018-02-16 20:48, Guillaume Nault wrote: > On Fri, Feb 16, 2018 at 01:13:18PM +0200, Denys Fedoryshchenko wrote: >> On 2018-02-15 21:42, Guillaume Nault wrote: >> > On Thu, Feb 15, 2018 at 09:34:42PM +0200, Denys Fedoryshchenko wrote: >> > > On 2018-02-15 21:31, Guillaume Nault wrote: >> > > > On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote: >> > > > > On 2018-02-15 17:55, Guillaume Nault wrote: >> > > > > > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote: >> > > > > > > Here we go: >> > > > > > > >> > > > > > > [24558.921549] >> > > > > > > ================================================================== >> > > > > > > [24558.922167] BUG: KASAN: use-after-free in >> > > > > > > ppp_ioctl+0xa6a/0x1522 >> > > > > > > [ppp_generic] >> > > > > > > [24558.922776] Write of size 8 at addr ffff8803d35bf3f8 by task >> > > > > > > accel-pppd/12622 >> > > > > > > [24558.923113] >> > > > > > > [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G >> > > > > > > W >> > > > > > > 4.15.3-build-0134 #1 >> > > > > > > [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2, >> > > > > > > BIOS P80 >> > > > > > > 04/02/2015 >> > > > > > > [24558.924406] Call Trace: >> > > > > > > [24558.924753] dump_stack+0x46/0x59 >> > > > > > > [24558.925103] print_address_description+0x6b/0x23b >> > > > > > > [24558.925451] ? ppp_ioctl+0xa6a/0x1522 [ppp_generic] >> > > > > > > [24558.925797] kasan_report+0x21b/0x241 >> > > > > > > [24558.926136] ppp_ioctl+0xa6a/0x1522 [ppp_generic] >> > > > > > > [24558.926479] ? ppp_nl_newlink+0x1da/0x1da [ppp_generic] >> > > > > > > [24558.926829] ? sock_sendmsg+0x89/0x99 >> > > > > > > [24558.927176] ? __vfs_write+0xd9/0x4ad >> > > > > > > [24558.927523] ? kernel_read+0xed/0xed >> > > > > > > [24558.927872] ? SyS_getpeername+0x18c/0x18c >> > > > > > > [24558.928213] ? bit_waitqueue+0x2a/0x2a >> > > > > > > [24558.928561] ? wake_atomic_t_function+0x115/0x115 >> > > > > > > [24558.928898] vfs_ioctl+0x6e/0x81 >> > > > > > > [24558.929228] do_vfs_ioctl+0xa00/0xb10 >> > > > > > > [24558.929571] ? sigprocmask+0x1a6/0x1d0 >> > > > > > > [24558.929907] ? sigsuspend+0x13e/0x13e >> > > > > > > [24558.930239] ? ioctl_preallocate+0x14e/0x14e >> > > > > > > [24558.930568] ? SyS_rt_sigprocmask+0xf1/0x142 >> > > > > > > [24558.930904] ? sigprocmask+0x1d0/0x1d0 >> > > > > > > [24558.931252] SyS_ioctl+0x39/0x55 >> > > > > > > [24558.931595] ? do_vfs_ioctl+0xb10/0xb10 >> > > > > > > [24558.931942] do_syscall_64+0x1b1/0x31f >> > > > > > > [24558.932288] entry_SYSCALL_64_after_hwframe+0x21/0x86 >> > > > > > > [24558.932627] RIP: 0033:0x7f302849d8a7 >> > > > > > > [24558.932965] RSP: 002b:00007f3029a52af8 EFLAGS: 00000206 >> > > > > > > ORIG_RAX: >> > > > > > > 0000000000000010 >> > > > > > > [24558.933578] RAX: ffffffffffffffda RBX: 00007f3027d861e3 RCX: >> > > > > > > 00007f302849d8a7 >> > > > > > > [24558.933927] RDX: 00007f3023f49468 RSI: 000000004004743a RDI: >> > > > > > > 0000000000003a67 >> > > > > > > [24558.934266] RBP: 00007f3029a52b20 R08: 0000000000000000 R09: >> > > > > > > 000055c8308d8e40 >> > > > > > > [24558.934607] R10: 0000000000000008 R11: 0000000000000206 R12: >> > > > > > > 00007f3023f49358 >> > > > > > > [24558.934947] R13: 00007ffe86e5723f R14: 0000000000000000 R15: >> > > > > > > 00007f3029a53700 >> > > > > > > [24558.935288] >> > > > > > > [24558.935626] Allocated by task 12622: >> > > > > > > [24558.935972] ppp_register_net_channel+0x5f/0x5c6 >> > > > > > > [ppp_generic] >> > > > > > > [24558.936306] pppoe_connect+0xab7/0xc71 [pppoe] >> > > > > > > [24558.936640] SyS_connect+0x14b/0x1b7 >> > > > > > > [24558.936975] do_syscall_64+0x1b1/0x31f >> > > > > > > [24558.937319] entry_SYSCALL_64_after_hwframe+0x21/0x86 >> > > > > > > [24558.937655] >> > > > > > > [24558.937993] Freed by task 12622: >> > > > > > > [24558.938321] kfree+0xb0/0x11d >> > > > > > > [24558.938658] ppp_release+0x111/0x120 [ppp_generic] >> > > > > > > [24558.938994] __fput+0x2ba/0x51a >> > > > > > > [24558.939332] task_work_run+0x11c/0x13d >> > > > > > > [24558.939676] exit_to_usermode_loop+0x7c/0xaf >> > > > > > > [24558.940022] do_syscall_64+0x2ea/0x31f >> > > > > > > [24558.940368] entry_SYSCALL_64_after_hwframe+0x21/0x86 >> > > > > > > [24558.947099] >> > > > > > >> > > > > > Your first guess was right. It looks like we have an issue with >> > > > > > reference counting on the channels. Can you send me your ppp_generic.o? >> > > > > http://nuclearcat.com/ppp_generic.o >> > > > > Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3) >> > > > > >> > > > From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called >> > > > concurrently on the same ppp_file. Even if this ppp_file was pointed at >> > > > by two different file descriptors, I can't see how this could defeat >> > > > the reference counting mechanism. I'm going to think more about it. >> > > > >> > > > Can you test with CONFIG_REFCOUNT_FULL? (and keep >> > > > d780cd44e3ce ("drivers, net, ppp: convert ppp_file.refcnt from >> > > > atomic_t to refcount_t")). >> > > Ok, i will try that tonight. On vanilla kernel or reversing >> > > mentioned in >> > > previous email patch? >> > On vanilla kernel. The other is really a shot in the dark. >> >> As far as i can see there is only KASAN triggered again(and server >> rebooted >> shortly after that), but nothing else: >> > Ok, so no refcount failure detected. Not what I expected... but that's > still an information. It's getting even harder to find a ppp scenario > that could lead to such symptoms. > If that's acceptable for you, you can try reverting the few commits > that entered after 4.14. > > 02612bb05e51df8489db5e94d0cf8d1c81f87b0c pppoe: take ->needed_headroom > of lower device into account on xmit > 0171c41835591e9aa2e384b703ef9a6ae367c610 ppp: unlock all_ppp_mutex > before registering device > e6675000f9a404f7651724c0b2e2e71f7247d3a1 ppp: exit_net cleanup checks > added > f02b2320b27c16b644691267ee3b5c110846f49e ppp: Destroy the mutex when > cleanup > 90e229ef61fad240554f5899eb122fbe44990f78 ppp: allow usage in namespaces > 709c89b45b874b2f81a074b8802a736009873f48 drivers, net, ppp: convert > syncppp.refcnt from atomic_t to refcount_t > d780cd44e3cea119a3346e6d7c04d35b9c50d54b drivers, net, ppp: convert > ppp_file.refcnt from atomic_t to refcount_t > 313a912155c78ed87ad6fca175dc56b75fd00a58 drivers, net, ppp: convert > asyncppp.refcnt from atomic_t to refcount_t > > Sorry, but I have nothing better to propose for now. At least that > should help narrowing the problem space. > I'm going to stress test ppp_generic and pppoe on my side. > >> [ 1848.527234] >> ================================================================== >> [ 1848.527863] BUG: KASAN: use-after-free in ppp_ioctl+0xa68/0x14e7 >> [ppp_generic] >> [ 1848.528468] Write of size 8 at addr ffff880354d3fa38 by task >> accel-pppd/12626 >> [ 1848.528807] >> [ 1848.529143] CPU: 5 PID: 12626 Comm: accel-pppd Tainted: G W >> 4.15.3-build-0134 #2 >> [ 1848.529755] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 >> 04/02/2015 >> [ 1848.530100] Call Trace: >> [ 1848.530448] dump_stack+0x46/0x59 >> [ 1848.530793] print_address_description+0x6b/0x23b >> [ 1848.531143] ? ppp_ioctl+0xa68/0x14e7 [ppp_generic] >> [ 1848.531488] kasan_report+0x21b/0x241 >> [ 1848.538248] ppp_ioctl+0xa68/0x14e7 [ppp_generic] >> [ 1848.538604] ? ppp_nl_newlink+0x1da/0x1da [ppp_generic] >> [ 1848.538949] ? fsnotify+0x770/0x7fe >> [ 1848.539295] ? __fsnotify_inode_delete+0xc/0xc >> [ 1848.539648] ? kernel_read+0xed/0xed >> [ 1848.539993] ? memcpy+0x34/0x46 >> [ 1848.540332] vfs_ioctl+0x6e/0x81 >> [ 1848.540679] do_vfs_ioctl+0xa00/0xb10 >> [ 1848.541029] ? __fsnotify_inode_delete+0xc/0xc >> [ 1848.541375] ? fcntl_setlk+0x6b8/0x6b8 >> [ 1848.541721] ? ioctl_preallocate+0x14e/0x14e >> [ 1848.542064] ? rcu_is_watching+0x5/0x9 >> [ 1848.542404] ? dput+0x4a/0x432 >> [ 1848.542739] ? __fput+0x4e4/0x51a >> [ 1848.543080] SyS_ioctl+0x39/0x55 >> [ 1848.543418] ? do_vfs_ioctl+0xb10/0xb10 >> [ 1848.543759] do_syscall_64+0x1b1/0x31f >> [ 1848.544100] entry_SYSCALL_64_after_hwframe+0x21/0x86 >> [ 1848.544439] RIP: 0033:0x7f752170a8a7 >> [ 1848.544780] RSP: 002b:00007f7522cbfaf8 EFLAGS: 00000206 ORIG_RAX: >> 0000000000000010 >> [ 1848.545403] RAX: ffffffffffffffda RBX: 00007f7520ff31e3 RCX: >> 00007f752170a8a7 >> [ 1848.545752] RDX: 00007f7518cba868 RSI: 000000004004743a RDI: >> 0000000000007c31 >> [ 1848.546097] RBP: 00007f7522cbfb20 R08: 0000000000000000 R09: >> 000055ba01b29e40 >> [ 1848.546448] R10: 0000000073646172 R11: 0000000000000206 R12: >> 00007f7518cba758 >> [ 1848.546802] R13: 00007ffe84df1a0f R14: 0000000000000000 R15: >> 00007f7522cc0700 >> [ 1848.547148] >> [ 1848.547490] Allocated by task 12626: >> [ 1848.547834] ppp_register_net_channel+0x5f/0x5c6 [ppp_generic] >> [ 1848.548181] pppoe_connect+0xab7/0xc71 [pppoe] >> [ 1848.548535] SyS_connect+0x14b/0x1b7 >> [ 1848.548878] do_syscall_64+0x1b1/0x31f >> [ 1848.549219] entry_SYSCALL_64_after_hwframe+0x21/0x86 >> [ 1848.549562] >> [ 1848.549901] Freed by task 12626: >> [ 1848.550249] kfree+0xb0/0x11d >> [ 1848.550599] ppp_release+0x112/0x121 [ppp_generic] >> [ 1848.550941] __fput+0x2ba/0x51a >> [ 1848.551284] task_work_run+0x11c/0x13d >> [ 1848.551626] exit_to_usermode_loop+0x7c/0xaf >> [ 1848.551972] do_syscall_64+0x2ea/0x31f >> [ 1848.552322] entry_SYSCALL_64_after_hwframe+0x21/0x86 >> [ 1848.552666] >> [ 1848.553012] The buggy address belongs to the object at >> ffff880354d3f980 >> [ 1848.553012] which belongs to the cache kmalloc-256 of size 256 >> [ 1848.553632] The buggy address is located 184 bytes inside of >> [ 1848.553632] 256-byte region [ffff880354d3f980, ffff880354d3fa80) >> [ 1848.554248] The buggy address belongs to the page: >> [ 1848.554591] page:ffffea000d534f00 count:1 mapcount:0 mapping: >> (null) index:0xffff880354d3f980 compound_mapcount: 0 >> [ 1848.555211] flags: 0x17ffe00000008100(slab|head) >> [ 1848.555562] raw: 17ffe00000008100 0000000000000000 ffff880354d3f980 >> 000000010033002c >> [ 1848.556174] raw: ffffea000d0f7420 ffffea000cf84d20 ffff8803f1c0f480 >> 0000000000000000 >> [ 1848.556783] page dumped because: kasan: bad access detected >> [ 1848.557123] >> [ 1848.557457] Memory state around the buggy address: >> [ 1848.557800] ffff880354d3f900: fb fb fb fb fb fb fb fb fc fc fc fc >> fc fc >> fc fc >> [ 1848.558416] ffff880354d3f980: fb fb fb fb fb fb fb fb fb fb fb fb >> fb fb >> fb fb >> [ 1848.559028] >ffff880354d3fa00: fb fb fb fb fb fb fb fb fb fb fb fb >> fb fb >> fb fb >> [ 1848.559643] ^ >> [ 1848.559989] ffff880354d3fa80: fc fc fc fc fc fc fc fc 00 00 00 00 >> 00 00 >> 00 00 >> [ 1848.560603] ffff880354d3fb00: 00 00 00 00 00 00 00 00 00 00 00 00 >> 00 00 >> 00 00 >> [ 1848.561217] >> ================================================================== >> [ 1848.561830] Disabling lock debugging due to kernel taint >> Ok, more updates: I updated 2 other production servers(different location), similar workload, higher load, but slightly different configration, different hardware. 4.14.20 working fine there almost 24h. So even they agreed to participate in this bug hunting, i can't reproduce bug easily on that server, because on first server i am able to trigger bug even on 4.14.20 as well, but it doesnt happen on 4.14.2. But there is several other changes from old kernel build, mostly because of Meltdown/Spectre, i had to update to newer gcc and to do some userspace updates, so i am trying to roll back slowly to same conditions as old stable build, to find out what is triggering bug.