All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Dmitry Vyukov <dvyukov@google.com>
Cc: Lance Roy <ldr709@gmail.com>,
	Andrey Konovalov <andreyknvl@google.com>,
	Lai Jiangshan <jiangshanlai@gmail.com>,
	Josh Triplett <josh@joshtriplett.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	LKML <linux-kernel@vger.kernel.org>,
	syzkaller <syzkaller@googlegroups.com>,
	Kostya Serebryany <kcc@google.com>
Subject: Re: srcu: BUG in __synchronize_srcu
Date: Mon, 27 Mar 2017 07:16:27 -0700	[thread overview]
Message-ID: <20170327141627.GE3637@linux.vnet.ibm.com> (raw)
In-Reply-To: <CACT4Y+YcQNEvR0pFD97JXPJ2hyZiOR6-QHdbw=9-XerTRapsfQ@mail.gmail.com>

On Mon, Mar 27, 2017 at 02:36:35PM +0200, Dmitry Vyukov wrote:
> On Tue, Mar 14, 2017 at 5:21 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Tue, Mar 14, 2017 at 12:47:02AM -0700, Lance Roy wrote:
> >> I am not sure how the rcu_scheduler_active changes in __synchronize_srcu work,
> >> but there seem to be a few problems in them. First,
> >> "if (done && likely(!driving))" on line 453 doesn't appear to ever happen,
> >> as driving doesn't get set to false when srcu_reschedule is called. This seems
> >> like it could cause a race condition if another thread notices that ->running is
> >> false, adds itself to the queue, set ->running to true, and starts on its own
> >> grace period before the first thread acquires the lock again on line 469. Then
> >> the first thread will then acquire the lock, set running to false, and release
> >> the lock, resulting in an invalid state where ->running is false but the second
> >> thread is still trying to finish its grace period.
> >>
> >> Second, the while loop on line 455 seems to violate to rule that ->running
> >> shouldn't be false when there are entries in the queue. If a second thread adds
> >> itself to the queue while the first thread is driving SRCU inside that loop, and
> >> then the first thread finishes its own grace period and quits the loop, it will
> >> set ->running to false even though there is still a thread on the queue.
> >>
> >> The second issue requires rcu_scheduler_active to be RCU_SCHEDULER_INIT to
> >> occur, and as I don't what the assumptions during RCU_SCHEDULER_INIT are I don't
> >> know if it is actually a problem, but the first issue looks like it could occur
> >> at any time.
> >
> > Thank you for looking into this!
> >
> > I determined that my patch-order strategy was flawed, as it required
> > me to rewrite the mid-boot functionality several times.  I therefore
> > removed the mid-boot commits.  I will add them in later, but they will
> > use a rather different approach based on a grace-period sequence number
> > similar to that used by the expedited grace periods.
> >
> > Which should also teach me to be less aggressive about pushing new code
> > to -next.  For a few weeks, anyway.  ;-)
> >
> >                                                         Thanx, Paul
> >
> >> Thanks,
> >> Lance
> >>
> >> On Fri, 10 Mar 2017 14:26:09 -0800
> >> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >> > On Fri, Mar 10, 2017 at 08:29:55PM +0100, Andrey Konovalov wrote:
> >> > > On Fri, Mar 10, 2017 at 8:28 PM, Andrey Konovalov <andreyknvl@google.com>
> >> > > wrote:
> >> > > > Kernel panic - not syncing: Fatal exception
> >> >
> >> > So the theory is that if !sp->running, all of SRCU's queues must be empty.
> >> > So if you are holding ->queue_lock (with irqs disabled) and you see
> >> > !sp->running, and then you enqueue a callback on ->batch_check0, then
> >> > that callback must be the first in the list.  And the code preceding
> >> > the WARN_ON() you triggered does in fact check and enqueue shile holding
> >> > ->queue_lock with irqs disabled.
> >> >
> >> > And rcu_batch_queue() does operate FIFO as required.  (Otherwise,
> >> > srcu_barrier() would not work.)
> >> >
> >> > There are only three calls to rcu_batch_queue(), and the one involved with
> >> > the WARN_ON() enqueues to ->batch_check0.  The other two enqueue to
> >> > ->batch_queue.  Callbacks move from ->batch_queue to ->batch_check0 to
> >> > ->batch_check1 to ->batch_done, so nothing should slip in front.
> >> >
> >> > Of course, if ->running were ever set to false with any of ->batch_check0,
> >> > ->batch_check1, or ->batch_done non-empty, this WARN_ON() would trigger.
> >> > But srcu_reschedule() sets it to false only if all four batches are
> >> > empty (and checks and sets under ->queue_lock()), and all other cases
> >> > where it is set to false happen at initialization time, and also clear
> >> > out the queues.  Of course, if someone raced an init_srcu_struct() with
> >> > either a call_srcu() or synchronize_srcu(), all bets are off.  Now,
> >> > mmu_notifier.c does invoke init_srcu_struct() manually, but it does
> >> > so at subsys_initcall() time.  Which -might- be after other things are
> >> > happening, so one "hail Mary" attempted fix is to remove mmu_notifier_init()
> >> > and replace the "static struct srcu_struct srcu" with:
> >>
> >> >
> >> >     DEFINE_STATIC_SRCU(srcu);
> >> >
> >> > But this might require changing the name -- I vaguely recall some
> >> > strangeness where the names of statically defined per-CPU variables need
> >> > to be globally unique even when static.  Easy enough to do, though.
> >> > Might need a similar change to the "srcu" instances defined in vmd.c
> >> > and kvm_host.h -- assuming that this change helps.
> >> >
> >> > Another possibility is that something in SRCU is messing with either the
> >> > queues or the ->running field without holding ->queue_lock.  And that does
> >> > seem to be happening -- srcu_advance_batches() invokes rcu_batch_move()
> >> > without holding anything.  Which seems like it could cause trouble
> >> > if someone else was invoking synchronize_srcu() concurrently.  Those
> >> > particular invocations might be safe due to access only by a single
> >> > kthread/workqueue, given that all updates to ->batch_queue are protected
> >> > by ->queue_lock (aside from initialization).
> >> >
> >> > But ->batch_check0 is updated by __synchronize_srcu(), though protected
> >> > by ->queue_lock, and only if ->running is false, and with both the
> >> > check and the update protected by the same ->queue_lock critical section.
> >> > If ->running is false, then the workqueue cannot be running, so it remains
> >> > to see if all other updates to ->batch_check0 are either with ->queue_lock
> >> > held and ->running false on the one hand or from the workqueue handler
> >> > on the other:
> >> >
> >> > srcu_collect_new() updates with ->queue_lock held, but does not check
> >> >     ->running.  It is invoked only from process_srcu(), which in
> >> >     turn is invoked only as a workqueue handler.  The work is queued
> >> >     from:
> >> >
> >> >     call_srcu(), which does so with ->queue_lock held having just
> >> >             set ->running to true.
> >> >
> >> >     srcu_reschedule(), which invokes it if there are non-empty
> >> >             queues.  This is invoked from __synchronize_srcu()
> >> >             in the case where it has set ->running to true
> >> >             after finding the queues empty, which should imply
> >> >             no other instances.
> >> >
> >> >             It is also invoked from process_srcu(), which is
> >> >             invoked only as a workqueue handler.  (Yay
> >> >             recursive inquiry!)
> >> >
> >> > srcu_advance_batches() updates without locks held.  It is invoked as
> >> >     follows:
> >> >
> >> >     __synchronize_srcu() in the case where ->running was set, which
> >> >             as noted before excludes workqueue handlers.
> >> >
> >> >     process_srcu() which as noted before is only invoked from
> >> >             a workqueue handler.
> >> >
> >> > So an SRCU workqueue is invoked only from a workqueue handler, or from
> >> > some other task that transitioned ->running from false to true while
> >> > holding ->queuelock.  There should therefore only be one SRCU workqueue
> >> > per srcu_struct, so this should be safe.  Though I hope that it can
> >> > be simplified a bit.  :-/
> >> >
> >> > So the only suggestion I have at the moment is static definition of
> >> > the "srcu" variable.  Lai, Josh, Steve, Mathieu, anything I missed?
> >> >
> >> >                                             Thanx, Paul
> 
> 
> 
> This happened on linux-next/65b2dc38291f9f27e5ec3b804d6eb3b5f79a3ce4
> and may be related.
> The report says that srcu subsystem still uses the srcu object after
> it has been freed. It can be a kvm fault as well.

Hmmm...  I am not seeing a call to cleanup_srcu_struct() for the
->track_srcu field of the kvm_page_track_notifier_head structure.
Or is this structure immortal, so that it is never cleaned up?
Or am I just blind this morning?

In any case, freeing the kvm_page_track_notifier_head structure
without first invoking cleanup_srcu_struct() on its ->track_srcu
srcu_struct field could easily result in a use-after-free bug.

							Thanx, Paul

> ==================================================================
> BUG: KASAN: use-after-free in debug_spin_unlock
> kernel/locking/spinlock_debug.c:97 [inline]
> BUG: KASAN: use-after-free in do_raw_spin_unlock+0x2ea/0x320
> kernel/locking/spinlock_debug.c:134
> Read of size 4 at addr ffff88014158a564 by task kworker/1:1/5712
> 
> CPU: 1 PID: 5712 Comm: kworker/1:1 Not tainted 4.11.0-rc3-next-20170324+ #1
> Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS Google 01/01/2011
> Workqueue: events_power_efficient process_srcu
> Call Trace:
>  __dump_stack lib/dump_stack.c:16 [inline]
>  dump_stack+0x2fb/0x40f lib/dump_stack.c:52
>  print_address_description+0x7f/0x260 mm/kasan/report.c:250
>  kasan_report_error mm/kasan/report.c:349 [inline]
>  kasan_report.part.3+0x21f/0x310 mm/kasan/report.c:372
>  kasan_report mm/kasan/report.c:392 [inline]
>  __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:392
>  debug_spin_unlock kernel/locking/spinlock_debug.c:97 [inline]
>  do_raw_spin_unlock+0x2ea/0x320 kernel/locking/spinlock_debug.c:134
>  __raw_spin_unlock_irq include/linux/spinlock_api_smp.h:167 [inline]
>  _raw_spin_unlock_irq+0x22/0x70 kernel/locking/spinlock.c:199
>  spin_unlock_irq include/linux/spinlock.h:349 [inline]
>  srcu_reschedule+0x1a1/0x260 kernel/rcu/srcu.c:582
>  process_srcu+0x63c/0x11c0 kernel/rcu/srcu.c:600
>  process_one_work+0xac0/0x1b00 kernel/workqueue.c:2097
>  worker_thread+0x1b4/0x1300 kernel/workqueue.c:2231
>  kthread+0x36c/0x440 kernel/kthread.c:231
>  ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
> 
> Allocated by task 20961:
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:515
>  set_track mm/kasan/kasan.c:527 [inline]
>  kasan_kmalloc+0xaa/0xd0 mm/kasan/kasan.c:619
>  kmem_cache_alloc_trace+0x10b/0x670 mm/slab.c:3635
>  kmalloc include/linux/slab.h:492 [inline]
>  kzalloc include/linux/slab.h:665 [inline]
>  kvm_arch_alloc_vm include/linux/kvm_host.h:773 [inline]
>  kvm_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:610 [inline]
>  kvm_dev_ioctl_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:3161 [inline]
>  kvm_dev_ioctl+0x1bf/0x1460 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3205
>  vfs_ioctl fs/ioctl.c:45 [inline]
>  do_vfs_ioctl+0x1bf/0x1780 fs/ioctl.c:685
>  SYSC_ioctl fs/ioctl.c:700 [inline]
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
>  entry_SYSCALL_64_fastpath+0x1f/0xbe
> 
> Freed by task 20960:
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:515
>  set_track mm/kasan/kasan.c:527 [inline]
>  kasan_slab_free+0x6e/0xc0 mm/kasan/kasan.c:592
>  __cache_free mm/slab.c:3511 [inline]
>  kfree+0xd3/0x250 mm/slab.c:3828
>  kvm_arch_free_vm include/linux/kvm_host.h:778 [inline]
>  kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:732 [inline]
>  kvm_put_kvm+0x709/0x9a0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:747
>  kvm_vm_release+0x42/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:758
>  __fput+0x332/0x800 fs/file_table.c:209
>  ____fput+0x15/0x20 fs/file_table.c:245
>  task_work_run+0x197/0x260 kernel/task_work.c:116
>  exit_task_work include/linux/task_work.h:21 [inline]
>  do_exit+0x1a53/0x27c0 kernel/exit.c:878
>  do_group_exit+0x149/0x420 kernel/exit.c:982
>  get_signal+0x7d8/0x1820 kernel/signal.c:2318
>  do_signal+0xd2/0x2190 arch/x86/kernel/signal.c:808
>  exit_to_usermode_loop+0x21c/0x2d0 arch/x86/entry/common.c:157
>  prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
>  syscall_return_slowpath+0x4d3/0x570 arch/x86/entry/common.c:263
>  entry_SYSCALL_64_fastpath+0xbc/0xbe
> 
> The buggy address belongs to the object at ffff880141581640
>  which belongs to the cache kmalloc-65536 of size 65536
> The buggy address is located 36644 bytes inside of
>  65536-byte region [ffff880141581640, ffff880141591640)
> The buggy address belongs to the page:
> page:ffffea000464b400 count:1 mapcount:0 mapping:ffff880141581640
> index:0x0 compound_mapcount: 0
> flags: 0x200000000008100(slab|head)
> raw: 0200000000008100 ffff880141581640 0000000000000000 0000000100000001
> raw: ffffea00064b1f20 ffffea000640fa20 ffff8801db800d00
> page dumped because: kasan: bad access detected
> 
> Memory state around the buggy address:
>  ffff88014158a400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  ffff88014158a480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >ffff88014158a500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>                                                        ^
>  ffff88014158a580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  ffff88014158a600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ==================================================================
> 

  reply	other threads:[~2017-03-27 14:17 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-03-10 19:28 srcu: BUG in __synchronize_srcu Andrey Konovalov
2017-03-10 19:29 ` Andrey Konovalov
2017-03-10 19:42   ` Dmitry Vyukov
2017-03-10 22:26   ` Paul E. McKenney
2017-03-11 14:25     ` Mathieu Desnoyers
2017-03-11 20:06       ` Paul E. McKenney
2017-03-14  7:47     ` Lance Roy
2017-03-14 16:21       ` Paul E. McKenney
2017-03-27 12:36         ` Dmitry Vyukov
2017-03-27 14:16           ` Paul E. McKenney [this message]
2017-03-27 14:51             ` Dmitry Vyukov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170327141627.GE3637@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=andreyknvl@google.com \
    --cc=dvyukov@google.com \
    --cc=jiangshanlai@gmail.com \
    --cc=josh@joshtriplett.org \
    --cc=kcc@google.com \
    --cc=ldr709@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=rostedt@goodmis.org \
    --cc=syzkaller@googlegroups.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.