* sched: memory corruption on completing completions @ 2015-02-04 23:24 Sasha Levin 2015-02-04 23:46 ` Andrew Morton 2015-02-05 0:16 ` Linus Torvalds 0 siblings, 2 replies; 14+ messages in thread From: Sasha Levin @ 2015-02-04 23:24 UTC (permalink / raw) To: Peter Zijlstra, Ingo Molnar Cc: Andrew Morton, Linus Torvalds, Andrey Ryabinin, Dave Jones, LKML Hi all, I was fuzzing with trinity on a -next kernel with the KASan patchset, and got what initially appeared to be a rather odd trace: [ 856.817966] BUG: AddressSanitizer: out of bounds on stack in do_raw_spin_unlock+0x417/0x4f0 at addr ffff8803875c7c42 [ 856.817966] Read of size 2 by task migration/15/140 [ 856.821565] page:ffffea000e1d71c0 count:0 mapcount:0 mapping: (null) index:0x0 [ 856.821565] flags: 0x2efffff80000000() [ 856.821565] page dumped because: kasan: bad access detected [ 856.821565] CPU: 15 PID: 140 Comm: migration/15 Not tainted 3.19.0-rc5-next-20150123-sasha-00061-g527ff0d-dirty #1814 [ 856.841712] 0000000000000000 0000000000000000 ffff8804d0447aa0 ffff8804d04479e8 [ 856.843478] ffffffff92eb033b 1ffffd4001c3ae3f ffffea000e1d71c0 ffff8804d0447a88 [ 856.843478] ffffffff81b4c292 ffff8804d0447a58 ffffffff815ef1e1 0000000000000000 [ 856.843478] Call Trace: [ 856.843478] dump_stack (lib/dump_stack.c:52) [ 856.843478] kasan_report_error (mm/kasan/report.c:136 mm/kasan/report.c:194) [ 856.843478] ? __lock_is_held (kernel/locking/lockdep.c:3518) [ 856.843478] ? sched_ttwu_pending (kernel/sched/core.c:4921) [ 856.843478] __asan_report_load2_noabort (mm/kasan/report.c:234) [ 856.843478] ? do_raw_spin_unlock (./arch/x86/include/asm/spinlock.h:157 kernel/locking/spinlock_debug.c:159) [ 856.843478] do_raw_spin_unlock (./arch/x86/include/asm/spinlock.h:157 kernel/locking/spinlock_debug.c:159) [ 856.843478] ? do_raw_spin_trylock (kernel/locking/spinlock_debug.c:157) [ 856.843478] _raw_spin_unlock_irqrestore (include/linux/spinlock_api_smp.h:162 kernel/locking/spinlock.c:191) [ 856.843478] complete (kernel/sched/completion.c:37) [ 856.843478] cpu_stop_signal_done (kernel/stop_machine.c:69) [ 856.843478] cpu_stopper_thread (include/linux/spinlock.h:342 kernel/stop_machine.c:456) [ 856.843478] ? irq_cpu_stop_queue_work (kernel/stop_machine.c:449) [ 856.843478] ? do_raw_spin_trylock (kernel/locking/spinlock_debug.c:157) [ 856.843478] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/paravirt.h:809 include/linux/spinlock_api_smp.h:162 kernel/locking/spinlock.c:191) [ 856.843478] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601) [ 856.843478] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:163 kernel/locking/spinlock.c:191) [ 856.843478] smpboot_thread_fn (kernel/smpboot.c:161) [ 856.843478] ? smpboot_unpark_thread (kernel/smpboot.c:105) [ 856.843478] ? schedule (kernel/sched/core.c:2853) [ 856.843478] ? __kthread_parkme (./arch/x86/include/asm/current.h:14 kernel/kthread.c:164) [ 856.843478] ? smpboot_unpark_thread (kernel/smpboot.c:105) [ 856.843478] kthread (kernel/kthread.c:207) [ 856.843478] ? flush_kthread_work (kernel/kthread.c:176) [ 856.843478] ? schedule_tail (./arch/x86/include/asm/preempt.h:95 kernel/sched/core.c:2318) [ 856.843478] ? flush_kthread_work (kernel/kthread.c:176) [ 856.843478] ret_from_fork (arch/x86/kernel/entry_64.S:349) [ 856.843478] ? flush_kthread_work (kernel/kthread.c:176) [ 856.843478] Memory state around the buggy address: [ 856.843478] ffff8803875c7b00: f3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 856.843478] ffff8803875c7b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 856.843478] >ffff8803875c7c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 [ 856.843478] ^ [ 856.843478] ffff8803875c7c80: f1 f1 f1 00 f4 f4 f4 f2 f2 f2 f2 00 00 f4 f4 00 [ 856.843478] ffff8803875c7d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 We see this trace only now because Andrey has recently made KASan able to detect issues with memory on stack, rather than just kmalloc()ed memory, so the underlying bug has probably been there for a while. Initially we couldn't explain it. It was reproducing often, and always on the spin_unlock_irqrestore() call at the end of the complete() code. I now have a theory for why it happens: Thread A Thread B ---------------------------------------------------------- [Enter function] DECLARE_COMPLETION_ONSTACK(x) wait_for_completion(x) complete(x) [In complete(x):] spin_lock_irqsave(&x->wait.lock, flags); x->done++; __wake_up_locked(&x->wait, TASK_NORMAL, 1); [Done waiting, wakes up] [Exit function] spin_unlock_irqrestore(&x->wait.lock, flags); So the spin_unlock_irqrestore() at the end of complete() would proceed to corruption the stack of thread A. Thanks, Sasha ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: sched: memory corruption on completing completions 2015-02-04 23:24 sched: memory corruption on completing completions Sasha Levin @ 2015-02-04 23:46 ` Andrew Morton 2015-02-05 0:16 ` Linus Torvalds 1 sibling, 0 replies; 14+ messages in thread From: Andrew Morton @ 2015-02-04 23:46 UTC (permalink / raw) To: Sasha Levin Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrey Ryabinin, Dave Jones, LKML On Wed, 04 Feb 2015 18:24:06 -0500 Sasha Levin <sasha.levin@oracle.com> wrote: > Hi all, > > I was fuzzing with trinity on a -next kernel with the KASan patchset, and > got what initially appeared to be a rather odd trace: > > ... > > > I now have a theory for why it happens: > > Thread A Thread B > ---------------------------------------------------------- > > [Enter function] > DECLARE_COMPLETION_ONSTACK(x) > wait_for_completion(x) > complete(x) > [In complete(x):] > spin_lock_irqsave(&x->wait.lock, flags); > x->done++; > __wake_up_locked(&x->wait, TASK_NORMAL, 1); > [Done waiting, wakes up] > [Exit function] > spin_unlock_irqrestore(&x->wait.lock, flags); > > > > So the spin_unlock_irqrestore() at the end of complete() would proceed to corruption > the stack of thread A. But wait_for_completion() takes ->wait.lock as well, which should provide the needed synchronization (__wait_for_common, do_wait_for_common). I'm not seeing a hole in the logic, but it looks like there might be one. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: sched: memory corruption on completing completions 2015-02-04 23:24 sched: memory corruption on completing completions Sasha Levin 2015-02-04 23:46 ` Andrew Morton @ 2015-02-05 0:16 ` Linus Torvalds 2015-02-05 7:10 ` Ingo Molnar ` (2 more replies) 1 sibling, 3 replies; 14+ messages in thread From: Linus Torvalds @ 2015-02-05 0:16 UTC (permalink / raw) To: Sasha Levin, Waiman Long Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Andrey Ryabinin, Dave Jones, LKML On Wed, Feb 4, 2015 at 3:24 PM, Sasha Levin <sasha.levin@oracle.com> wrote: > > I now have a theory for why it happens: > > Thread A Thread B > ---------------------------------------------------------- > > [Enter function] > DECLARE_COMPLETION_ONSTACK(x) > wait_for_completion(x) > complete(x) > [In complete(x):] > spin_lock_irqsave(&x->wait.lock, flags); > x->done++; > __wake_up_locked(&x->wait, TASK_NORMAL, 1); > [Done waiting, wakes up] > [Exit function] > spin_unlock_irqrestore(&x->wait.lock, flags); > > So the spin_unlock_irqrestore() at the end of complete() would proceed to corruption > the stack of thread A. We have indeed had bugs like that, but it *shouldn't be the case any more. The hard rule for spinlocks is: - the last thing *unlock* does is to release the lock with a single store. After that has happened, it will not touch it, and before that has happened, nobody else can get the spinlock which means that you cannot actually get the case you are talking about (because the "done waiting, wakes up" in thread A happens with the spinlock held). That said, as mentioned, we have had bugs in spinlock debugging code in particular, where the last unlock does other things after it releases the low-level lock. And you are clearly not using normal spinlocks, since your error trace has this: do_raw_spin_unlock+0x417/0x4f0 which means that "do_raw_spin_unlock() is 1250+ bytes in size. A "normal" do_raw_spin_unlock()" is inlined because it's a single store operation. Looks like the debug version of do_raw_spin_unlock() is slightly larger than that ;) Anyway, apparently that version of do_raw_spin_unlock isn't just huge, it's also buggy.. Although I thought we fixed that bug some time ago, and looking at the code it does void do_raw_spin_unlock(raw_spinlock_t *lock) { debug_spin_unlock(lock); arch_spin_unlock(&lock->raw_lock); } so it *looks* ok in the generic code. So off to look at the arch code... And looking at the arch version, I think the paravirtualized code is crap. It does: prev = *lock; add_smp(&lock->tickets.head, TICKET_LOCK_INC); /* add_smp() is a full mb() */ if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG)) __ticket_unlock_slowpath(lock, prev); which is *exactly* the kind of things you cannot do with spinlocks, because after you've done the "add_smp()" and released the spinlock for the fast-path, you can't access the spinlock any more. Exactly because a fast-path lock migth come in, and release the whole data structure. As usual, the paravirt code is a horribly buggy heap of crud. Film at 11. Why did I think we had this bug but already fixed it ? Maybe it's one of those things that Waiman fixed in his long delayed qspinlock series? Waiman? Or maybe I just remember the fixes where we changed from a mutex to a spinlock, which fixed a very similar case for non-paravirtualized cases. Linus ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: sched: memory corruption on completing completions 2015-02-05 0:16 ` Linus Torvalds @ 2015-02-05 7:10 ` Ingo Molnar 2015-02-05 9:30 ` Peter Zijlstra 2015-02-05 20:59 ` Davidlohr Bueso 2 siblings, 0 replies; 14+ messages in thread From: Ingo Molnar @ 2015-02-05 7:10 UTC (permalink / raw) To: Linus Torvalds Cc: Sasha Levin, Waiman Long, Peter Zijlstra, Andrew Morton, Andrey Ryabinin, Dave Jones, LKML * Linus Torvalds <torvalds@linux-foundation.org> wrote: > [...] > > As usual, the paravirt code is a horribly buggy heap of crud. > Film at 11. > > Why did I think we had this bug but already fixed it ? Maybe > it's one of those things that Waiman fixed in his long delayed > qspinlock series? Waiman? Or maybe I just remember the fixes > where we changed from a mutex to a spinlock, which fixed a very > similar case for non-paravirtualized cases. We definitely had such a high profile bug in the native case, years before the qspinlock series came along. I don't remember the specifics anymore, but maybe it was in the VFS code? Thanks, Ingo ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: sched: memory corruption on completing completions 2015-02-05 0:16 ` Linus Torvalds 2015-02-05 7:10 ` Ingo Molnar @ 2015-02-05 9:30 ` Peter Zijlstra 2015-02-05 20:44 ` Sasha Levin 2015-02-05 20:59 ` Davidlohr Bueso 2 siblings, 1 reply; 14+ messages in thread From: Peter Zijlstra @ 2015-02-05 9:30 UTC (permalink / raw) To: Linus Torvalds Cc: Sasha Levin, Waiman Long, Ingo Molnar, Andrew Morton, Andrey Ryabinin, Dave Jones, LKML On Wed, Feb 04, 2015 at 04:16:54PM -0800, Linus Torvalds wrote: > > Why did I think we had this bug but already fixed it ? Maybe it's one > of those things that Waiman fixed in his long delayed qspinlock > series? Waiman? ISTR that that would do the exact same thing, but I need to go look a the latest paravirt code -- that's the part that we all were still bothered with. > Or maybe I just remember the fixes where we changed > from a mutex to a spinlock, which fixed a very similar case for > non-paravirtualized cases. I think we had a non paravirt mutex case last year or so. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: sched: memory corruption on completing completions 2015-02-05 9:30 ` Peter Zijlstra @ 2015-02-05 20:44 ` Sasha Levin 0 siblings, 0 replies; 14+ messages in thread From: Sasha Levin @ 2015-02-05 20:44 UTC (permalink / raw) To: Peter Zijlstra, Linus Torvalds Cc: Waiman Long, Ingo Molnar, Andrew Morton, Andrey Ryabinin, Dave Jones, LKML On 02/05/2015 04:30 AM, Peter Zijlstra wrote: > On Wed, Feb 04, 2015 at 04:16:54PM -0800, Linus Torvalds wrote: >> > Why did I think we had this bug but already fixed it ? Maybe it's one >> > of those things that Waiman fixed in his long delayed qspinlock >> > series? Waiman? > ISTR that that would do the exact same thing, but I need to go look a > the latest paravirt code -- that's the part that we all were still > bothered with. Testing Linus's explanation, I tried simply: diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 7050d86..54454da 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -142,7 +142,7 @@ static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock, __ticket_unlock_kick(lock, old.tickets.head); } } - +static inline int arch_spin_is_locked(arch_spinlock_t *lock); static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) { if (TICKET_SLOWPATH_FLAG && @@ -153,7 +153,7 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) add_smp(&lock->tickets.head, TICKET_LOCK_INC); /* add_smp() is a full mb() */ - + WARN_ON(arch_spin_is_locked(lock)); if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG)) __ticket_unlock_slowpath(lock, prev); } else And the warnings confirmed that the lock is indeed "unlocked" before we finished arch_spin_unlock()... Thanks, Sasha ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: sched: memory corruption on completing completions 2015-02-05 0:16 ` Linus Torvalds 2015-02-05 7:10 ` Ingo Molnar 2015-02-05 9:30 ` Peter Zijlstra @ 2015-02-05 20:59 ` Davidlohr Bueso 2015-02-05 21:02 ` Sasha Levin 2 siblings, 1 reply; 14+ messages in thread From: Davidlohr Bueso @ 2015-02-05 20:59 UTC (permalink / raw) To: Linus Torvalds Cc: Sasha Levin, Waiman Long, Peter Zijlstra, Ingo Molnar, Andrew Morton, Andrey Ryabinin, Dave Jones, LKML, Raghavendra K T On Wed, 2015-02-04 at 16:16 -0800, Linus Torvalds wrote: > And looking at the arch version, I think the paravirtualized code is crap. > > It does: > > prev = *lock; > add_smp(&lock->tickets.head, TICKET_LOCK_INC); > > /* add_smp() is a full mb() */ > > if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG)) > __ticket_unlock_slowpath(lock, prev); > > > which is *exactly* the kind of things you cannot do with spinlocks, > because after you've done the "add_smp()" and released the spinlock > for the fast-path, you can't access the spinlock any more. Exactly > because a fast-path lock migth come in, and release the whole data > structure. > > As usual, the paravirt code is a horribly buggy heap of crud. Film at 11. Per http://lwn.net/Articles/495597/ which clearly describes the intent of the slowpath unlocking. Cc'ing Raghavendra. Thanks, Davidlohr ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: sched: memory corruption on completing completions 2015-02-05 20:59 ` Davidlohr Bueso @ 2015-02-05 21:02 ` Sasha Levin 2015-02-05 21:34 ` Linus Torvalds 0 siblings, 1 reply; 14+ messages in thread From: Sasha Levin @ 2015-02-05 21:02 UTC (permalink / raw) To: Davidlohr Bueso, Linus Torvalds Cc: Waiman Long, Peter Zijlstra, Ingo Molnar, Andrew Morton, Andrey Ryabinin, Dave Jones, LKML, Raghavendra K T On 02/05/2015 03:59 PM, Davidlohr Bueso wrote: > On Wed, 2015-02-04 at 16:16 -0800, Linus Torvalds wrote: >> And looking at the arch version, I think the paravirtualized code is crap. >> >> It does: >> >> prev = *lock; >> add_smp(&lock->tickets.head, TICKET_LOCK_INC); >> >> /* add_smp() is a full mb() */ >> >> if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG)) >> __ticket_unlock_slowpath(lock, prev); >> >> >> which is *exactly* the kind of things you cannot do with spinlocks, >> because after you've done the "add_smp()" and released the spinlock >> for the fast-path, you can't access the spinlock any more. Exactly >> because a fast-path lock migth come in, and release the whole data >> structure. >> >> As usual, the paravirt code is a horribly buggy heap of crud. Film at 11. > > Per http://lwn.net/Articles/495597/ which clearly describes the intent > of the slowpath unlocking. Cc'ing Raghavendra. Interestingly enough, according to that article this behaviour seems to be "by design": """ This version of the patch uses a locked add to do this, followed by a test to see if the slowflag is set. The lock prefix acts as a full memory barrier, so we can be sure that other CPUs will have seen the unlock before we read the flag """ Thanks, Sasha ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: sched: memory corruption on completing completions 2015-02-05 21:02 ` Sasha Levin @ 2015-02-05 21:34 ` Linus Torvalds 2015-02-05 22:37 ` Davidlohr Bueso 0 siblings, 1 reply; 14+ messages in thread From: Linus Torvalds @ 2015-02-05 21:34 UTC (permalink / raw) To: Sasha Levin Cc: Davidlohr Bueso, Waiman Long, Peter Zijlstra, Ingo Molnar, Andrew Morton, Andrey Ryabinin, Dave Jones, LKML, Raghavendra K T On Thu, Feb 5, 2015 at 1:02 PM, Sasha Levin <sasha.levin@oracle.com> wrote: > > Interestingly enough, according to that article this behaviour seems to be > "by design": Oh, it's definitely by design, it's just that the design looked at spinlocks without the admittedly very subtle issue of lifetime vs unlocking. Spinlocks (and completions) are special - for other locks we have basically allowed lifetimes to be separate from the lock state, and if you have a data structure with a mutex in it, you'll have to have some separate lifetime rule outside of the lock itself. But spinlocks and completions have their locking state tied into their lifetime. Completions are so very much by design (because dynamic completions on the stack is one of the core use cases), and spinlocks do it because in some cases you cannot sanely avoid it (and one of those cases is the implementation of completions - they aren't actually first-class locking primitives of their own, although they actually *used* to be, originally). It is possible that the paravirt spinlocks could be saved by: - moving the clearing of TICKET_SLOWPATH_FLAG into the fastpath locking code. - making sure that the "unlock" path never does a *write* to the possibly stale lock. KASan would still complain about the read, but we could just say that it's a speculative read - bad form, but not disastrous. but right now they are clearly horribly broken. Because the unlock path doesn't just read the value, it can end up writing to it too. Looking at the Xen version, for example, the unlock_kick part looks fine. It just compares the pointer to the lock against its data structures, and doesn't touch the lock itself. Which is fine. But right now, the real problem is that "__ticket_unlock_slowpath()" will clear the TICKET_SLOWPATH_FLAG on a lock that has already been unlocked - which means that it may actually *change* a byte that has already been freed. It might not be a spinlock any more, it might have been reused for something else, and might be anything else, and clearing TICKET_SLOWPATH_FLAG might be clearing a random bit in a kernel pointer or other value.. It is *very* hard to hit, though. And it obviously doesn't happen in raw hardware, but paravirt people need to think hard about this. Linus ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: sched: memory corruption on completing completions 2015-02-05 21:34 ` Linus Torvalds @ 2015-02-05 22:37 ` Davidlohr Bueso 2015-02-05 22:57 ` Linus Torvalds 2015-02-06 12:29 ` Raghavendra K T 0 siblings, 2 replies; 14+ messages in thread From: Davidlohr Bueso @ 2015-02-05 22:37 UTC (permalink / raw) To: Linus Torvalds Cc: Sasha Levin, Waiman Long, Peter Zijlstra, Ingo Molnar, Andrew Morton, Andrey Ryabinin, Dave Jones, LKML, Raghavendra K T On Thu, 2015-02-05 at 13:34 -0800, Linus Torvalds wrote: > On Thu, Feb 5, 2015 at 1:02 PM, Sasha Levin <sasha.levin@oracle.com> wrote: > > > > Interestingly enough, according to that article this behaviour seems to be > > "by design": > > Oh, it's definitely by design, it's just that the design looked at > spinlocks without the admittedly very subtle issue of lifetime vs > unlocking. > > Spinlocks (and completions) are special - for other locks we have > basically allowed lifetimes to be separate from the lock state, and if > you have a data structure with a mutex in it, you'll have to have some > separate lifetime rule outside of the lock itself. But spinlocks and > completions have their locking state tied into their lifetime. For spinlocks I find this very much a virtue. Tight lifetimes allow the overall locking logic to be *simple* - keeping people from being "smart" and bloating up spinlocks. Similarly, I hate how the paravirt alternative blends in with regular (sane) bare metal code. What was preventing this instead?? #ifdef CONFIG_PARAVIRT_SPINLOCKS static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) { if (!static_key_false(¶virt_ticketlocks_enabled)) return; add_smp(&lock->tickets.head, TICKET_LOCK_INC); /* Do slowpath tail stuff... */ } #else static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) { __add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX); } #endif I just don't see the point to all this TICKET_SLOWPATH_FLAG: #ifdef CONFIG_PARAVIRT_SPINLOCKS #define __TICKET_LOCK_INC 2 #define TICKET_SLOWPATH_FLAG ((__ticket_t)1) #else #define __TICKET_LOCK_INC 1 #define TICKET_SLOWPATH_FLAG ((__ticket_t)0) #endif when it is only for paravirt -- and the word slowpath implies the general steps as part of the generic algorithm. Lets keep code for simple locks simple. > Completions are so very much by design (because dynamic completions on > the stack is one of the core use cases), and spinlocks do it because > in some cases you cannot sanely avoid it (and one of those cases is > the implementation of completions - they aren't actually first-class > locking primitives of their own, although they actually *used* to be, > originally). > > It is possible that the paravirt spinlocks could be saved by: > > - moving the clearing of TICKET_SLOWPATH_FLAG into the fastpath locking code. Ouch, to avoid deadlocks they explicitly need the unlock to occur before the slowpath tail flag is read. > - making sure that the "unlock" path never does a *write* to the > possibly stale lock. KASan would still complain about the read, but we > could just say that it's a speculative read - bad form, but not > disastrous. Yeah, you just cannot have a slowpath without reads or writes :D Thanks, Davidlohr ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: sched: memory corruption on completing completions 2015-02-05 22:37 ` Davidlohr Bueso @ 2015-02-05 22:57 ` Linus Torvalds 2015-02-06 6:48 ` Raghavendra K T 2015-02-06 12:29 ` Raghavendra K T 1 sibling, 1 reply; 14+ messages in thread From: Linus Torvalds @ 2015-02-05 22:57 UTC (permalink / raw) To: Davidlohr Bueso Cc: Sasha Levin, Waiman Long, Peter Zijlstra, Ingo Molnar, Andrew Morton, Andrey Ryabinin, Dave Jones, LKML, Raghavendra K T On Thu, Feb 5, 2015 at 2:37 PM, Davidlohr Bueso <dave@stgolabs.net> wrote: >> >> It is possible that the paravirt spinlocks could be saved by: >> >> - moving the clearing of TICKET_SLOWPATH_FLAG into the fastpath locking code. > > Ouch, to avoid deadlocks they explicitly need the unlock to occur before > the slowpath tail flag is read. Well, just make the unlock do the actual real unlock operation ("fastpath"), leaving the TICKET_SLOWPATH_FLAG alone. The unlock path can *test* the flag and do whatever kicking it needs to get people started again, but not clear it again. Then, the clearing could be done by the lockers. That way the unlock path at least doesn't change the spinlock any more - sicne the spinlock might have been free'd at any time after the actual unlock. Or something. I'm handwaving. I really dislike all the paravirt crap we do. It tends to be really ugly, and I'm not just talking about spinlocks. TLB flushing etc tends to get interesting too. > Yeah, you just cannot have a slowpath without reads or writes :D Well, you *could* do the slow-path while holding the lock, and making the actual ock release be the last part of the operation (and do the final unlock with a "cmpxchg" that fails and re-does the slowpath if something changed). Then the slowpath can read and write the lock all it wants. But people generally want to avoid that, since it makes the hold times longer. So the "drop the lock, *then* test if we should do some slow path fixup" tends to be the preferred model, it's just that that model is broken due to the lifetime rules. Making it read-only would at least make it a lot less broken.. Linus ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: sched: memory corruption on completing completions 2015-02-05 22:57 ` Linus Torvalds @ 2015-02-06 6:48 ` Raghavendra K T 2015-02-06 15:00 ` Raghavendra K T 0 siblings, 1 reply; 14+ messages in thread From: Raghavendra K T @ 2015-02-06 6:48 UTC (permalink / raw) To: Linus Torvalds Cc: Davidlohr Bueso, Sasha Levin, Waiman Long, Peter Zijlstra, Ingo Molnar, Andrew Morton, Andrey Ryabinin, Dave Jones, LKML On 02/06/2015 04:27 AM, Linus Torvalds wrote: > On Thu, Feb 5, 2015 at 2:37 PM, Davidlohr Bueso <dave@stgolabs.net> wrote: >>> >>> It is possible that the paravirt spinlocks could be saved by: >>> >>> - moving the clearing of TICKET_SLOWPATH_FLAG into the fastpath locking code. >> >> Ouch, to avoid deadlocks they explicitly need the unlock to occur before >> the slowpath tail flag is read. > > Well, just make the unlock do the actual real unlock operation > ("fastpath"), leaving the TICKET_SLOWPATH_FLAG alone. The unlock path > can *test* the flag and do whatever kicking it needs to get people > started again, but not clear it again. > This is definitely a good idea, will think more on this. (especially since any remote possibility of forgetting to wake up the lock-waiter would result in eventual hang of kvm guest). Hopeful to come up with a solution soon. /me agreeing with the fact that we did not have the 'lifetime' in mind during the design :(. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: sched: memory corruption on completing completions 2015-02-06 6:48 ` Raghavendra K T @ 2015-02-06 15:00 ` Raghavendra K T 0 siblings, 0 replies; 14+ messages in thread From: Raghavendra K T @ 2015-02-06 15:00 UTC (permalink / raw) To: Linus Torvalds Cc: Davidlohr Bueso, Sasha Levin, Waiman Long, Peter Zijlstra, Ingo Molnar, Andrew Morton, Andrey Ryabinin, Dave Jones, LKML On 02/06/2015 12:18 PM, Raghavendra K T wrote: > On 02/06/2015 04:27 AM, Linus Torvalds wrote: >> On Thu, Feb 5, 2015 at 2:37 PM, Davidlohr Bueso <dave@stgolabs.net> >> wrote: >>>> >>>> It is possible that the paravirt spinlocks could be saved by: >>>> >>>> - moving the clearing of TICKET_SLOWPATH_FLAG into the fastpath >>>> locking code. >>> >>> Ouch, to avoid deadlocks they explicitly need the unlock to occur before >>> the slowpath tail flag is read. >> >> Well, just make the unlock do the actual real unlock operation >> ("fastpath"), leaving the TICKET_SLOWPATH_FLAG alone. The unlock path >> can *test* the flag and do whatever kicking it needs to get people >> started again, but not clear it again. >> > > This is definitely a good idea, will think more on this. > Just sent the patch with that idea, also keeping in mind that head = tail may not hold good for trylock checking (last unlock might have left TICKET_SLOWPATH_FLAG set) http://article.gmane.org/gmane.linux.kernel/1883900 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: sched: memory corruption on completing completions 2015-02-05 22:37 ` Davidlohr Bueso 2015-02-05 22:57 ` Linus Torvalds @ 2015-02-06 12:29 ` Raghavendra K T 1 sibling, 0 replies; 14+ messages in thread From: Raghavendra K T @ 2015-02-06 12:29 UTC (permalink / raw) To: Davidlohr Bueso Cc: Linus Torvalds, Sasha Levin, Waiman Long, Peter Zijlstra, Ingo Molnar, Andrew Morton, Andrey Ryabinin, Dave Jones, LKML On 02/06/2015 04:07 AM, Davidlohr Bueso wrote: > On Thu, 2015-02-05 at 13:34 -0800, Linus Torvalds wrote: >> On Thu, Feb 5, 2015 at 1:02 PM, Sasha Levin <sasha.levin@oracle.com> wrote: >>> >>> Interestingly enough, according to that article this behaviour seems to be >>> "by design": >> >> Oh, it's definitely by design, it's just that the design looked at >> spinlocks without the admittedly very subtle issue of lifetime vs >> unlocking. >> >> Spinlocks (and completions) are special - for other locks we have >> basically allowed lifetimes to be separate from the lock state, and if >> you have a data structure with a mutex in it, you'll have to have some >> separate lifetime rule outside of the lock itself. But spinlocks and >> completions have their locking state tied into their lifetime. > > For spinlocks I find this very much a virtue. Tight lifetimes allow the > overall locking logic to be *simple* - keeping people from being "smart" > and bloating up spinlocks. Similarly, I hate how the paravirt > alternative blends in with regular (sane) bare metal code. What was > preventing this instead?? > > #ifdef CONFIG_PARAVIRT_SPINLOCKS > static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) > { > if (!static_key_false(¶virt_ticketlocks_enabled)) > return; > > add_smp(&lock->tickets.head, TICKET_LOCK_INC); > /* Do slowpath tail stuff... */ > } > #else > static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) > { > __add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX); > } > #endif > > I just don't see the point to all this TICKET_SLOWPATH_FLAG: > > #ifdef CONFIG_PARAVIRT_SPINLOCKS > #define __TICKET_LOCK_INC 2 > #define TICKET_SLOWPATH_FLAG ((__ticket_t)1) > #else > #define __TICKET_LOCK_INC 1 > #define TICKET_SLOWPATH_FLAG ((__ticket_t)0) > #endif > > when it is only for paravirt -- and the word slowpath implies the > general steps as part of the generic algorithm. Lets keep code for > simple locks simple. > Good point, I will send this as a separate cleanup once I test the patch I have to correct the current problem. ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2015-02-06 14:59 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-02-04 23:24 sched: memory corruption on completing completions Sasha Levin 2015-02-04 23:46 ` Andrew Morton 2015-02-05 0:16 ` Linus Torvalds 2015-02-05 7:10 ` Ingo Molnar 2015-02-05 9:30 ` Peter Zijlstra 2015-02-05 20:44 ` Sasha Levin 2015-02-05 20:59 ` Davidlohr Bueso 2015-02-05 21:02 ` Sasha Levin 2015-02-05 21:34 ` Linus Torvalds 2015-02-05 22:37 ` Davidlohr Bueso 2015-02-05 22:57 ` Linus Torvalds 2015-02-06 6:48 ` Raghavendra K T 2015-02-06 15:00 ` Raghavendra K T 2015-02-06 12:29 ` Raghavendra K T
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.