From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from nuclearcat.com ([144.76.183.226]:55584 "EHLO nuclearcat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751376AbeBUTax (ORCPT ); Wed, 21 Feb 2018 14:30:53 -0500 MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed Content-Transfer-Encoding: 7bit Date: Wed, 21 Feb 2018 21:30:49 +0200 From: Denys Fedoryshchenko To: Guillaume Nault Cc: Linux Kernel Network Developers Subject: Re: ppp/pppoe, still panic 4.15.3 in ppp_push In-Reply-To: <20180221185545.GB1322@alphalink.fr> References: <20180214172549.GH1422@alphalink.fr> <20180215155517.GI1422@alphalink.fr> <79ba31ea47137ca539beea112bec8e57@nuclearcat.com> <20180215193114.GK1422@alphalink.fr> <61f3b6b7b437cd7b37e3aae0d2ee1df8@nuclearcat.com> <20180215194253.GL1422@alphalink.fr> <159534b0fadb78c727bce6d55d358689@nuclearcat.com> <20180216184812.GM1422@alphalink.fr> <20180221185545.GB1322@alphalink.fr> Message-ID: Sender: netdev-owner@vger.kernel.org List-ID: On 2018-02-21 20:55, Guillaume Nault wrote: > On Wed, Feb 21, 2018 at 12:26:51PM +0200, Denys Fedoryshchenko wrote: >> It seems even rebuilding seemingly stable version triggering crashes >> too >> (but different ones) > Different ones? The trace following your message looks very similar to > your first KASAN report. Or are you refering to the lockup you posted > on Sun, 18 Feb 2018? > > Also, which stable versions are you refering to? Trace i sent in previous email - is latest kernel, vanilla, just more debug options and few options disabled. One of disabled was spitting some errors (it is obviously bug), CONFIG_XFRM, in nf_xfrm_me_harder (i reported about it). And i disabled namespaces, as they are often source of trouble. Today i will try to revert just: drivers, net, ppp: convert asyncppp.refcnt from atomic_t to refcount_t drivers, net, ppp: convert syncppp.refcnt from atomic_t to refcount_t drivers, net, ppp: convert ppp_file.refcnt from atomic_t to refcount_t Because i suspect previously, after reverting this patches i got different kernel panic (and i didn't noticed that, now too late to identify between other crashes), seems it was not KASAN. I will report results after testing, unfortunately i can't test it more than once per day. "Stable" for me was 4.14.2 - but it looks like on that kernel i am getting different issue now. I will paste it below. Another observation, just hour ago, i noticed on another server, where i am testing 4.15, and 4.14.20 (at moment of testing 4.14.20, but no debug at that moment), when i killed accel-pppd (pppoe server software), with 8k sessions online, i got some weird behaviour, accel-pppd process got stuck, same as ifconfig and "ip link", and even kexec -e didn't worked(got stuck too), unless i did kexec -e -x (so it wont try to make interfaces down on kexec). I will try to reproduce this bug as well, with debug enabled (lockdep and so) i hope it is not related. > > I'm interested in the ppp_generic.o file that produced the following > trace. Just to be sure that the differences come from the new debugging > options. Also kernel config: https://nuclearcat.com/bughunting/config.txt https://nuclearcat.com/bughunting/ppp_generic.o This is in 4.14.2, was seemingly stable before: [50401.388670] NETDEV WATCHDOG: eth1 (ixgbe): transmit queue 1 timed out [50401.389014] ------------[ cut here ]------------ [50401.389340] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:320 dev_watchdog+0x15c/0x1b9 [50401.389925] Modules linked in: pppoe pppox ppp_generic slhc netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4 x t_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4 xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_na t nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc ixgbe dca [50401.391869] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.14.2-build-0134 #4 [50401.392191] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 04/02/2015 [50401.392513] task: ffff880434d72640 task.stack: ffffc90001914000 [50401.392836] RIP: 0010:dev_watchdog+0x15c/0x1b9 [50401.393155] RSP: 0018:ffff8804364c3e90 EFLAGS: 00010286 [50401.393470] RAX: 0000000000000039 RBX: ffff88042f6e0000 RCX: 0000000000000000 [50401.393787] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffffffff828dbc64 [50401.394103] RBP: ffff8804364c3eb0 R08: 0000000000000001 R09: 0000000000000000 [50401.394420] R10: 0000000000000002 R11: ffff8803fa075c00 R12: 0000000000000001 [50401.394739] R13: 0000000000000040 R14: 0000000000000003 R15: ffffffff81e05108 [50401.395064] FS: 0000000000000000(0000) GS:ffff8804364c0000(0000) knlGS:0000000000000000 [50401.395645] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [50401.395970] CR2: 00007fff25fc20a8 CR3: 0000000001e09005 CR4: 00000000001606e0 [50401.396294] Call Trace: [50401.396613] [50401.396934] ? qdisc_rcu_free+0x3f/0x3f [50401.397255] call_timer_fn.isra.4+0x17/0x7b [50401.397576] expire_timers+0x6f/0x7e [50401.397899] run_timer_softirq+0x6d/0x8f [50401.398219] ? ktime_get+0x3b/0x8c [50401.398540] ? lapic_next_event+0x18/0x1c [50401.398862] ? clockevents_program_event+0xa3/0xbb [50401.399186] __do_softirq+0xbc/0x1ab [50401.399510] irq_exit+0x4d/0x8e [50401.399832] smp_apic_timer_interrupt+0x73/0x80 [50401.400157] apic_timer_interrupt+0x8d/0xa0 [50401.400480] [50401.400801] RIP: 0010:mwait_idle+0x4e/0x61 [50401.401123] RSP: 0018:ffffc90001917ec0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10 [50401.401714] RAX: 0000000000000000 RBX: ffff880434d72640 RCX: 0000000000000000 [50401.402037] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [50401.402362] RBP: ffffc90001917ec0 R08: 0000000000000000 R09: 0000000000000001 [50401.402685] R10: ffffc90001917e58 R11: 000000000000037a R12: 0000000000000000 [50401.403008] R13: 0000000000000000 R14: ffff880434d72640 R15: ffff880434d72640 [50401.403336] arch_cpu_idle+0xa/0xc [50401.403660] default_idle_call+0x20/0x22 [50401.403983] do_idle+0xb3/0x13b [50401.404303] cpu_startup_entry+0x1a/0x1c [50401.404627] start_secondary+0x102/0x105 [50401.404949] secondary_startup_64+0xa5/0xa5 [50401.405271] Code: d9 54 58 00 00 75 38 48 89 df c6 05 cd 54 58 00 01 e8 dc 24 fe ff 44 89 e1 48 89 c2 48 89 de 48 c7 c7 51 df d6 81 e8 db f6 7c ff <0f> ff eb 10 41 ff c4 48 05 40 01 0 0 00 41 39 cc 75 9a eb 0d 48 [50401.406154] ---[ end trace 433f9a49b32a3b4f ]--- [50401.406489] ixgbe 0000:04:00.1 eth1: initiating reset due to tx timeout [50401.406828] ixgbe 0000:04:00.1 eth1: Reset adapter [50416.686709] watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [tc:24389] [50416.687042] Modules linked in: pppoe pppox ppp_generic slhc netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4 x t_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4 xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_na t nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc ixgbe dca [50416.687707] watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [accel-pppd:24377] [50416.687708] Modules linked in: pppoe pppox ppp_generic slhc netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4 x t_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4 xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_na t nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc ixgbe dca [50416.687720] CPU: 7 PID: 24377 Comm: accel-pppd Tainted: G W 4.14.2-build-0134 #4 [50416.687721] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 04/02/2015 [50416.687722] task: ffff8803f278d940 task.stack: ffffc90011a38000 [50416.687727] RIP: 0010:queued_spin_lock_slowpath+0x125/0x16e [50416.687728] RSP: 0018:ffffc90011a3bd20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10 [50416.687729] RAX: 0000000000000000 RBX: ffff8804035abfb8 RCX: ffff8804365d9a40 [50416.687730] RDX: ffff880436419a60 RSI: 0000000000200000 RDI: ffff8804035abfa0 [50416.687730] RBP: ffffc90011a3bd20 R08: 0000000000000001 R09: 00000000ffffffff [50416.687731] R10: ffffc90011a3bda8 R11: ffff8803f278d940 R12: ffff88041a5fde00 [50416.687731] R13: 0000000000008021 R14: ffff8803fc5778c0 R15: ffff8803fc577840 [50416.687732] FS: 00007f7b5726e700(0000) GS:ffff8804365c0000(0000) knlGS:0000000000000000 [50416.687733] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [50416.687733] CR2: 0000000001d5f408 CR3: 000000042bf7c006 CR4: 00000000001606e0 [50416.687734] Call Trace: [50416.687738] _raw_spin_lock+0x1a/0x1c [50416.687741] ppp_push+0x56/0x49b [ppp_generic] [50416.687743] ? __kmalloc_node_track_caller+0xba/0xdf [50416.687746] __ppp_xmit_process+0x38/0x4c9 [ppp_generic] [50416.687747] ppp_xmit_process+0x39/0x8d [ppp_generic] [50416.687749] ppp_write+0xbb/0xc7 [ppp_generic] [50416.687750] __vfs_write+0x21/0x111 [50416.687752] ? handle_mm_fault+0xb1/0xc0 [50416.687753] vfs_write+0xcd/0x176 [50416.687754] SyS_write+0x49/0x83 [50416.687755] entry_SYSCALL_64_fastpath+0x17/0x98 [50416.687756] RIP: 0033:0x7f7b5ba4e14d [50416.687756] RSP: 002b:00007f7b5726dbd0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [50416.687757] RAX: ffffffffffffffda RBX: 00007f7b5244aae8 RCX: 00007f7b5ba4e14d [50416.687758] RDX: 000000000000000c RSI: 00007f7b4c7e6180 RDI: 0000000000004e9e [50416.687758] RBP: 00007f7b5726dba0 R08: 0000000000000000 R09: 0000000000633e00 [50416.687758] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000 [50416.687759] R13: 00007f7b5716c4ff R14: 0000000000000003 R15: 00007f7b5716c590 [50416.687759] Code: 89 c2 c1 e8 12 48 c1 ea 0c ff c8 83 e2 30 48 98 48 81 c2 40 9a 01 00 48 03 14 c5 a0 73 d7 81 48 89 0a 8b 41 08 85 c0 75 04 f3 90 f5 48 8b 11 48 85 d2 74 03 0f 1 8 0a 44 8b 07 66 45 85 c0 0f [50416.687771] Kernel panic - not syncing: softlockup: hung tasks [50416.687772] CPU: 7 PID: 24377 Comm: accel-pppd Tainted: G W L 4.14.2-build-0134 #4 [50416.687773] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 04/02/2015 [50416.687773] Call Trace: [50416.687774] [50416.687776] dump_stack+0x4d/0x63 [50416.687778] panic+0xcd/0x215 [50416.687780] watchdog_timer_fn+0x1a4/0x1c6 [50416.687782] __hrtimer_run_queues+0x76/0xc0 [50416.687783] hrtimer_interrupt+0xa0/0x162 [50416.687785] smp_apic_timer_interrupt+0x6e/0x80 [50416.687786] apic_timer_interrupt+0x8d/0xa0 [50416.687786] [50416.687787] RIP: 0010:queued_spin_lock_slowpath+0x125/0x16e [50416.687788] RSP: 0018:ffffc90011a3bd20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10 [50416.687789] RAX: 0000000000000000 RBX: ffff8804035abfb8 RCX: ffff8804365d9a40 [50416.687789] RDX: ffff880436419a60 RSI: 0000000000200000 RDI: ffff8804035abfa0 [50416.687789] RBP: ffffc90011a3bd20 R08: 0000000000000001 R09: 00000000ffffffff [50416.687790] R10: ffffc90011a3bda8 R11: ffff8803f278d940 R12: ffff88041a5fde00 [50416.687790] R13: 0000000000008021 R14: ffff8803fc5778c0 R15: ffff8803fc577840 [50416.687792] _raw_spin_lock+0x1a/0x1c [50416.687793] ppp_push+0x56/0x49b [ppp_generic] [50416.687794] ? __kmalloc_node_track_caller+0xba/0xdf [50416.687795] __ppp_xmit_process+0x38/0x4c9 [ppp_generic] [50416.687797] ppp_xmit_process+0x39/0x8d [ppp_generic] [50416.687798] ppp_write+0xbb/0xc7 [ppp_generic] [50416.687798] __vfs_write+0x21/0x111 [50416.687799] ? handle_mm_fault+0xb1/0xc0 [50416.687800] vfs_write+0xcd/0x176 [50416.687801] SyS_write+0x49/0x83 [50416.687802] entry_SYSCALL_64_fastpath+0x17/0x98 [50416.687802] RIP: 0033:0x7f7b5ba4e14d [50416.687803] RSP: 002b:00007f7b5726dbd0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [50416.687803] RAX: ffffffffffffffda RBX: 00007f7b5244aae8 RCX: 00007f7b5ba4e14d [50416.687804] RDX: 000000000000000c RSI: 00007f7b4c7e6180 RDI: 0000000000004e9e [50416.687804] RBP: 00007f7b5726dba0 R08: 0000000000000000 R09: 0000000000633e00 [50416.687805] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000 [50416.687805] R13: 00007f7b5716c4ff R14: 0000000000000003 R15: 00007f7b5716c590 [50416.723251] CPU: 5 PID: 24389 Comm: tc Tainted: G W L 4.14.2-build-0134 #4 [50416.723831] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 04/02/2015 [50416.724151] task: ffff8803f0f19980 task.stack: ffffc90011a70000 [50416.724477] RIP: 0010:queued_spin_lock_slowpath+0x54/0x16e [50416.724799] RSP: 0018:ffffc90011a73940 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10 [50416.725384] RAX: 0000000000000101 RBX: ffff8803fc577000 RCX: 0000000000000101 [50416.725709] RDX: 0000000000000100 RSI: 0000000000000001 RDI: ffff8803fc5778c0 [50416.726034] RBP: ffffc90011a73940 R08: 0000000000000001 R09: 0000000000002e88 [50416.726355] R10: 0000000000000024 R11: 0000000000000000 R12: ffff8803eea1d060 [50416.726678] R13: ffff8803fc5778c0 R14: ffffc90011a739f0 R15: 0000000000000000 [50416.727001] FS: 00007f72228f2700(0000) GS:ffff880436540000(0000) knlGS:0000000000000000 [50416.727582] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [50416.727904] CR2: 0000000001aea038 CR3: 00000003fb09a001 CR4: 00000000001606e0 [50416.728230] Call Trace: [50416.728550] _raw_spin_lock_bh+0x25/0x27 [50416.728872] ppp_get_stats64+0x4c/0xaf [ppp_generic] [50416.729195] dev_get_stats+0x33/0x84 [50416.729513] rtnl_fill_stats+0x41/0x121 [50416.729831] rtnl_fill_ifinfo+0x567/0xdf0 [50416.730148] rtnl_dump_ifinfo+0x232/0x2e0 [50416.730470] ? __alloc_pages_nodemask+0xd2/0xa99 [50416.730790] ? handle_mm_fault+0xb1/0xc0 [50416.731109] ? __do_page_fault+0x31c/0x3b6 [50416.731431] netlink_dump+0xfc/0x271 [50416.731749] netlink_recvmsg+0x1af/0x2d9 [50416.732068] sock_recvmsg+0x14/0x16 [50416.732388] ___sys_recvmsg+0xea/0x1a1 [50416.732708] ? lru_deactivate_file_fn+0x29f/0x29f [50416.733028] ? pagevec_lru_move_fn+0xad/0xc1 [50416.733348] ? alloc_pages_vma+0x152/0x19c [50416.733669] ? page_add_new_anon_rmap+0xa0/0xa9 [50416.733988] ? __handle_mm_fault+0xda4/0xf40 [50416.734310] ? sock_sendmsg+0x12/0x1d [50416.734628] __sys_recvmsg+0x3d/0x5e [50416.734945] ? __sys_recvmsg+0x3d/0x5e [50416.735264] SyS_recvmsg+0xd/0x17 [50416.735576] entry_SYSCALL_64_fastpath+0x17/0x98 [50416.735890] RIP: 0033:0x7f7221afdac0 [50416.736202] RSP: 002b:00007ffcbbc1f328 EFLAGS: 00000246 ORIG_RAX: 000000000000002f [50416.736772] RAX: ffffffffffffffda RBX: 00007f7221da8b98 RCX: 00007f7221afdac0 [50416.737092] RDX: 0000000000000000 RSI: 00007ffcbbc1f390 RDI: 0000000000000003 [50416.737411] RBP: 00007f7221da8b98 R08: 0000000000007d7c R09: 0000000000000000 [50416.737734] R10: 00007ffcbbc1f3d0 R11: 0000000000000246 R12: 000000000066a520 [50416.738055] R13: 0000000000000040 R14: 00007f7221da8b40 R15: 0000000000002710 [50416.738377] Code: ff ff 75 33 83 fe 01 89 ca 89 f0 41 0f 45 d0 f0 0f b1 17 39 c6 74 04 89 c6 eb e1 ff ca 0f 84 20 01 00 00 8b 07 84 c0 74 04 f3 90 f6 66 c7 07 01 00 e9 0c 01 00 0 0 48 c7 c0 40 9a 01 00 65 48 Feb 21 00:04:31 10.0.252.9 [50417.724420] Shutting down cpus with NMI Feb 21 00:04:31 10.0.252.9 [50417.724748] Kernel Offset: disabled Feb 21 00:04:31 10.0.252.9 [50417.728891] Rebooting in 5 seconds..