netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
To: Guillaume Nault <g.nault@alphalink.fr>
Cc: Linux Kernel Network Developers <netdev@vger.kernel.org>
Subject: Re: ppp/pppoe, still panic 4.15.3 in ppp_push
Date: Tue, 20 Feb 2018 11:05:10 +0200	[thread overview]
Message-ID: <c549ef6cb3183bc3c89b1e4f60eb45fe@nuclearcat.com> (raw)
In-Reply-To: <20180216184812.GM1422@alphalink.fr>

On 2018-02-16 20:48, Guillaume Nault wrote:
> On Fri, Feb 16, 2018 at 01:13:18PM +0200, Denys Fedoryshchenko wrote:
>> On 2018-02-15 21:42, Guillaume Nault wrote:
>> > On Thu, Feb 15, 2018 at 09:34:42PM +0200, Denys Fedoryshchenko wrote:
>> > > On 2018-02-15 21:31, Guillaume Nault wrote:
>> > > > On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
>> > > > > On 2018-02-15 17:55, Guillaume Nault wrote:
>> > > > > > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
>> > > > > > > Here we go:
>> > > > > > >
>> > > > > > >  <srv> [24558.921549]
>> > > > > > > ==================================================================
>> > > > > > >  <srv> [24558.922167] BUG: KASAN: use-after-free in
>> > > > > > > ppp_ioctl+0xa6a/0x1522
>> > > > > > > [ppp_generic]
>> > > > > > >  <srv> [24558.922776] Write of size 8 at addr ffff8803d35bf3f8 by task
>> > > > > > > accel-pppd/12622
>> > > > > > >  <srv> [24558.923113]
>> > > > > > >  <srv> [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
>> > > > > > > W
>> > > > > > > 4.15.3-build-0134 #1
>> > > > > > >  <srv> [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
>> > > > > > > BIOS P80
>> > > > > > > 04/02/2015
>> > > > > > >  <srv> [24558.924406] Call Trace:
>> > > > > > >  <srv> [24558.924753]  dump_stack+0x46/0x59
>> > > > > > >  <srv> [24558.925103]  print_address_description+0x6b/0x23b
>> > > > > > >  <srv> [24558.925451]  ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> > > > > > >  <srv> [24558.925797]  kasan_report+0x21b/0x241
>> > > > > > >  <srv> [24558.926136]  ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> > > > > > >  <srv> [24558.926479]  ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
>> > > > > > >  <srv> [24558.926829]  ? sock_sendmsg+0x89/0x99
>> > > > > > >  <srv> [24558.927176]  ? __vfs_write+0xd9/0x4ad
>> > > > > > >  <srv> [24558.927523]  ? kernel_read+0xed/0xed
>> > > > > > >  <srv> [24558.927872]  ? SyS_getpeername+0x18c/0x18c
>> > > > > > >  <srv> [24558.928213]  ? bit_waitqueue+0x2a/0x2a
>> > > > > > >  <srv> [24558.928561]  ? wake_atomic_t_function+0x115/0x115
>> > > > > > >  <srv> [24558.928898]  vfs_ioctl+0x6e/0x81
>> > > > > > >  <srv> [24558.929228]  do_vfs_ioctl+0xa00/0xb10
>> > > > > > >  <srv> [24558.929571]  ? sigprocmask+0x1a6/0x1d0
>> > > > > > >  <srv> [24558.929907]  ? sigsuspend+0x13e/0x13e
>> > > > > > >  <srv> [24558.930239]  ? ioctl_preallocate+0x14e/0x14e
>> > > > > > >  <srv> [24558.930568]  ? SyS_rt_sigprocmask+0xf1/0x142
>> > > > > > >  <srv> [24558.930904]  ? sigprocmask+0x1d0/0x1d0
>> > > > > > >  <srv> [24558.931252]  SyS_ioctl+0x39/0x55
>> > > > > > >  <srv> [24558.931595]  ? do_vfs_ioctl+0xb10/0xb10
>> > > > > > >  <srv> [24558.931942]  do_syscall_64+0x1b1/0x31f
>> > > > > > >  <srv> [24558.932288]  entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > > > > > >  <srv> [24558.932627] RIP: 0033:0x7f302849d8a7
>> > > > > > >  <srv> [24558.932965] RSP: 002b:00007f3029a52af8 EFLAGS: 00000206
>> > > > > > > ORIG_RAX:
>> > > > > > > 0000000000000010
>> > > > > > >  <srv> [24558.933578] RAX: ffffffffffffffda RBX: 00007f3027d861e3 RCX:
>> > > > > > > 00007f302849d8a7
>> > > > > > >  <srv> [24558.933927] RDX: 00007f3023f49468 RSI: 000000004004743a RDI:
>> > > > > > > 0000000000003a67
>> > > > > > >  <srv> [24558.934266] RBP: 00007f3029a52b20 R08: 0000000000000000 R09:
>> > > > > > > 000055c8308d8e40
>> > > > > > >  <srv> [24558.934607] R10: 0000000000000008 R11: 0000000000000206 R12:
>> > > > > > > 00007f3023f49358
>> > > > > > >  <srv> [24558.934947] R13: 00007ffe86e5723f R14: 0000000000000000 R15:
>> > > > > > > 00007f3029a53700
>> > > > > > >  <srv> [24558.935288]
>> > > > > > >  <srv> [24558.935626] Allocated by task 12622:
>> > > > > > >  <srv> [24558.935972]  ppp_register_net_channel+0x5f/0x5c6
>> > > > > > > [ppp_generic]
>> > > > > > >  <srv> [24558.936306]  pppoe_connect+0xab7/0xc71 [pppoe]
>> > > > > > >  <srv> [24558.936640]  SyS_connect+0x14b/0x1b7
>> > > > > > >  <srv> [24558.936975]  do_syscall_64+0x1b1/0x31f
>> > > > > > >  <srv> [24558.937319]  entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > > > > > >  <srv> [24558.937655]
>> > > > > > >  <srv> [24558.937993] Freed by task 12622:
>> > > > > > >  <srv> [24558.938321]  kfree+0xb0/0x11d
>> > > > > > >  <srv> [24558.938658]  ppp_release+0x111/0x120 [ppp_generic]
>> > > > > > >  <srv> [24558.938994]  __fput+0x2ba/0x51a
>> > > > > > >  <srv> [24558.939332]  task_work_run+0x11c/0x13d
>> > > > > > >  <srv> [24558.939676]  exit_to_usermode_loop+0x7c/0xaf
>> > > > > > >  <srv> [24558.940022]  do_syscall_64+0x2ea/0x31f
>> > > > > > >  <srv> [24558.940368]  entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > > > > > >  <srv> [24558.947099]
>> > > > > >
>> > > > > > Your first guess was right. It looks like we have an issue with
>> > > > > > reference counting on the channels. Can you send me your ppp_generic.o?
>> > > > > http://nuclearcat.com/ppp_generic.o
>> > > > > Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
>> > > > >
>> > > > From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
>> > > > concurrently on the same ppp_file. Even if this ppp_file was pointed at
>> > > > by two different file descriptors, I can't see how this could defeat
>> > > > the reference counting mechanism. I'm going to think more about it.
>> > > >
>> > > > Can you test with CONFIG_REFCOUNT_FULL? (and keep
>> > > > d780cd44e3ce ("drivers, net, ppp: convert ppp_file.refcnt from
>> > > > atomic_t to refcount_t")).
>> > > Ok, i will try that tonight. On vanilla kernel or reversing
>> > > mentioned in
>> > > previous email patch?
>> > On vanilla kernel. The other is really a shot in the dark.
>> 
>> As far as i can see there is only KASAN triggered again(and server 
>> rebooted
>> shortly after that), but nothing else:
>> 
> Ok, so no refcount failure detected. Not what I expected... but that's
> still an information. It's getting even harder to find a ppp scenario
> that could lead to such symptoms.
> If that's acceptable for you, you can try reverting the few commits
> that entered after 4.14.
> 
> 02612bb05e51df8489db5e94d0cf8d1c81f87b0c pppoe: take ->needed_headroom
> of lower device into account on xmit
> 0171c41835591e9aa2e384b703ef9a6ae367c610 ppp: unlock all_ppp_mutex
> before registering device
> e6675000f9a404f7651724c0b2e2e71f7247d3a1 ppp: exit_net cleanup checks 
> added
> f02b2320b27c16b644691267ee3b5c110846f49e ppp: Destroy the mutex when 
> cleanup
> 90e229ef61fad240554f5899eb122fbe44990f78 ppp: allow usage in namespaces
> 709c89b45b874b2f81a074b8802a736009873f48 drivers, net, ppp: convert
> syncppp.refcnt from atomic_t to refcount_t
> d780cd44e3cea119a3346e6d7c04d35b9c50d54b drivers, net, ppp: convert
> ppp_file.refcnt from atomic_t to refcount_t
> 313a912155c78ed87ad6fca175dc56b75fd00a58 drivers, net, ppp: convert
> asyncppp.refcnt from atomic_t to refcount_t
> 
> Sorry, but I have nothing better to propose for now. At least that
> should help narrowing the problem space.
> I'm going to stress test ppp_generic and pppoe on my side.
> 
>> [ 1848.527234]
>> ==================================================================
>> [ 1848.527863] BUG: KASAN: use-after-free in ppp_ioctl+0xa68/0x14e7
>> [ppp_generic]
>> [ 1848.528468] Write of size 8 at addr ffff880354d3fa38 by task
>> accel-pppd/12626
>> [ 1848.528807]
>> [ 1848.529143] CPU: 5 PID: 12626 Comm: accel-pppd Tainted: G        W
>> 4.15.3-build-0134 #2
>> [ 1848.529755] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
>> 04/02/2015
>> [ 1848.530100] Call Trace:
>> [ 1848.530448]  dump_stack+0x46/0x59
>> [ 1848.530793]  print_address_description+0x6b/0x23b
>> [ 1848.531143]  ? ppp_ioctl+0xa68/0x14e7 [ppp_generic]
>> [ 1848.531488]  kasan_report+0x21b/0x241
>> [ 1848.538248]  ppp_ioctl+0xa68/0x14e7 [ppp_generic]
>> [ 1848.538604]  ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
>> [ 1848.538949]  ? fsnotify+0x770/0x7fe
>> [ 1848.539295]  ? __fsnotify_inode_delete+0xc/0xc
>> [ 1848.539648]  ? kernel_read+0xed/0xed
>> [ 1848.539993]  ? memcpy+0x34/0x46
>> [ 1848.540332]  vfs_ioctl+0x6e/0x81
>> [ 1848.540679]  do_vfs_ioctl+0xa00/0xb10
>> [ 1848.541029]  ? __fsnotify_inode_delete+0xc/0xc
>> [ 1848.541375]  ? fcntl_setlk+0x6b8/0x6b8
>> [ 1848.541721]  ? ioctl_preallocate+0x14e/0x14e
>> [ 1848.542064]  ? rcu_is_watching+0x5/0x9
>> [ 1848.542404]  ? dput+0x4a/0x432
>> [ 1848.542739]  ? __fput+0x4e4/0x51a
>> [ 1848.543080]  SyS_ioctl+0x39/0x55
>> [ 1848.543418]  ? do_vfs_ioctl+0xb10/0xb10
>> [ 1848.543759]  do_syscall_64+0x1b1/0x31f
>> [ 1848.544100]  entry_SYSCALL_64_after_hwframe+0x21/0x86
>> [ 1848.544439] RIP: 0033:0x7f752170a8a7
>> [ 1848.544780] RSP: 002b:00007f7522cbfaf8 EFLAGS: 00000206 ORIG_RAX:
>> 0000000000000010
>> [ 1848.545403] RAX: ffffffffffffffda RBX: 00007f7520ff31e3 RCX:
>> 00007f752170a8a7
>> [ 1848.545752] RDX: 00007f7518cba868 RSI: 000000004004743a RDI:
>> 0000000000007c31
>> [ 1848.546097] RBP: 00007f7522cbfb20 R08: 0000000000000000 R09:
>> 000055ba01b29e40
>> [ 1848.546448] R10: 0000000073646172 R11: 0000000000000206 R12:
>> 00007f7518cba758
>> [ 1848.546802] R13: 00007ffe84df1a0f R14: 0000000000000000 R15:
>> 00007f7522cc0700
>> [ 1848.547148]
>> [ 1848.547490] Allocated by task 12626:
>> [ 1848.547834]  ppp_register_net_channel+0x5f/0x5c6 [ppp_generic]
>> [ 1848.548181]  pppoe_connect+0xab7/0xc71 [pppoe]
>> [ 1848.548535]  SyS_connect+0x14b/0x1b7
>> [ 1848.548878]  do_syscall_64+0x1b1/0x31f
>> [ 1848.549219]  entry_SYSCALL_64_after_hwframe+0x21/0x86
>> [ 1848.549562]
>> [ 1848.549901] Freed by task 12626:
>> [ 1848.550249]  kfree+0xb0/0x11d
>> [ 1848.550599]  ppp_release+0x112/0x121 [ppp_generic]
>> [ 1848.550941]  __fput+0x2ba/0x51a
>> [ 1848.551284]  task_work_run+0x11c/0x13d
>> [ 1848.551626]  exit_to_usermode_loop+0x7c/0xaf
>> [ 1848.551972]  do_syscall_64+0x2ea/0x31f
>> [ 1848.552322]  entry_SYSCALL_64_after_hwframe+0x21/0x86
>> [ 1848.552666]
>> [ 1848.553012] The buggy address belongs to the object at 
>> ffff880354d3f980
>> [ 1848.553012]  which belongs to the cache kmalloc-256 of size 256
>> [ 1848.553632] The buggy address is located 184 bytes inside of
>> [ 1848.553632]  256-byte region [ffff880354d3f980, ffff880354d3fa80)
>> [ 1848.554248] The buggy address belongs to the page:
>> [ 1848.554591] page:ffffea000d534f00 count:1 mapcount:0 mapping:
>> (null) index:0xffff880354d3f980 compound_mapcount: 0
>> [ 1848.555211] flags: 0x17ffe00000008100(slab|head)
>> [ 1848.555562] raw: 17ffe00000008100 0000000000000000 ffff880354d3f980
>> 000000010033002c
>> [ 1848.556174] raw: ffffea000d0f7420 ffffea000cf84d20 ffff8803f1c0f480
>> 0000000000000000
>> [ 1848.556783] page dumped because: kasan: bad access detected
>> [ 1848.557123]
>> [ 1848.557457] Memory state around the buggy address:
>> [ 1848.557800]  ffff880354d3f900: fb fb fb fb fb fb fb fb fc fc fc fc 
>> fc fc
>> fc fc
>> [ 1848.558416]  ffff880354d3f980: fb fb fb fb fb fb fb fb fb fb fb fb 
>> fb fb
>> fb fb
>> [ 1848.559028] >ffff880354d3fa00: fb fb fb fb fb fb fb fb fb fb fb fb 
>> fb fb
>> fb fb
>> [ 1848.559643]                                         ^
>> [ 1848.559989]  ffff880354d3fa80: fc fc fc fc fc fc fc fc 00 00 00 00 
>> 00 00
>> 00 00
>> [ 1848.560603]  ffff880354d3fb00: 00 00 00 00 00 00 00 00 00 00 00 00 
>> 00 00
>> 00 00
>> [ 1848.561217]
>> ==================================================================
>> [ 1848.561830] Disabling lock debugging due to kernel taint
>> 
Ok, more updates:

I updated 2 other production servers(different location), similar 
workload, higher load,
but slightly different configration, different hardware. 4.14.20 working 
fine there almost 24h.
So even they agreed to participate in this bug hunting, i can't 
reproduce bug easily on that server, because
on first server i am able to trigger bug even on 4.14.20 as well, but it 
doesnt happen on 4.14.2.

But there is several other changes from old kernel build, mostly because 
of Meltdown/Spectre, i had to update
to newer gcc and to do some userspace updates, so i am trying to roll 
back slowly to same conditions as old stable build,
to find out what is triggering bug.

  parent reply	other threads:[~2018-02-20  9:05 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-14 13:17 ppp/pppoe, still panic 4.15.3 in ppp_push Denys Fedoryshchenko
2018-02-14 16:07 ` Guillaume Nault
2018-02-14 16:29   ` Denys Fedoryshchenko
2018-02-14 16:47     ` Guillaume Nault
2018-02-14 16:49       ` Denys Fedoryshchenko
2018-02-14 17:25         ` Guillaume Nault
2018-02-15 10:19           ` Denys Fedoryshchenko
2018-02-15 15:55             ` Guillaume Nault
2018-02-15 16:01               ` Denys Fedoryshchenko
2018-02-15 19:31                 ` Guillaume Nault
2018-02-15 19:34                   ` Denys Fedoryshchenko
2018-02-15 19:42                     ` Guillaume Nault
2018-02-16 11:13                       ` Denys Fedoryshchenko
2018-02-16 18:48                         ` Guillaume Nault
2018-02-18 10:01                           ` Denys Fedoryshchenko
2018-02-21 18:38                             ` Guillaume Nault
2018-02-20  9:05                           ` Denys Fedoryshchenko [this message]
2018-02-21 10:26                           ` Denys Fedoryshchenko
2018-02-21 18:55                             ` Guillaume Nault
2018-02-21 19:30                               ` Denys Fedoryshchenko
2018-02-21 20:04                   ` Cong Wang
2018-02-22 18:30                     ` Guillaume Nault
2018-02-22 18:51                       ` Denys Fedoryshchenko
2018-02-23  9:38                         ` Guillaume Nault
2018-02-23  9:41                           ` Denys Fedoryshchenko
2018-02-23 10:07                             ` Guillaume Nault
2018-02-23 10:54                               ` Denys Fedoryshchenko
2018-02-24 21:22                               ` Denys Fedoryshchenko
2018-02-27 10:58                               ` Denys Fedoryshchenko
2018-02-27 18:56                                 ` Guillaume Nault
2018-03-01 20:01                                   ` Guillaume Nault
2018-03-01 20:07                                     ` Denys Fedoryshchenko
2018-03-02 17:43                                       ` Guillaume Nault
2018-03-03  9:33                                         ` Denys Fedoryshchenko
2018-03-05 17:22                                           ` Guillaume Nault
2018-02-27 18:54                       ` Guillaume Nault
2018-02-15 19:20         ` Guillaume Nault

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c549ef6cb3183bc3c89b1e4f60eb45fe@nuclearcat.com \
    --to=nuclearcat@nuclearcat.com \
    --cc=g.nault@alphalink.fr \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).