* Possible lock-less list race in scheduler_ipi() [not found] <995381344.227770.1425597484864.JavaMail.zimbra@efficios.com> @ 2015-03-05 23:48 ` Mathieu Desnoyers 2015-03-06 1:02 ` Linus Torvalds 2015-03-06 17:09 ` Paul E. McKenney 0 siblings, 2 replies; 9+ messages in thread From: Mathieu Desnoyers @ 2015-03-05 23:48 UTC (permalink / raw) To: Paul E. McKenney Cc: Huang Ying, Lai Jiangshan, Lai Jiangshan, Peter Zijlstra, LKML, Ingo Molnar, Steven Rostedt, Linus Torvalds Hi, I recently reviewed the scheduler_ipi() code by coincidence, and noticed that the use of llist.h in there seemed a bit odd. So I'd like to have more eyes to help me look into this. I've been told that there has been issues with IPIs lately, so here is one possible scenario I think might be possible: - scheduler_ipi() - sched_ttwu_pending() - llist_del_all() (grabs the wake_list with xchg()) - raw_spin_lock_irqsave(&rq->lock, flags); - iteration on the llist (owned locally by interrupt handler) (for each item) p = llist_entry(llist, struct task_struct, wake_entry); llist = llist_next(llist); ttwu_do_activate(rq, p, 0); - raw_spin_unlock_irqrestore(&rq->lock, flags); ttwu_do_activate() ends up calling ttwu_activate() and ttwu_do_wakeup(), I'm worried that ttwu_do_wakeup() might end up indirectly handing off the task to the following functions (within another execution context): - try_to_wake_up() - ttwu_queue() - ttwu_queue_remote adds the process to another wake_list with a cmpxchg() llist_next() is pretty simple: static inline struct llist_node *llist_next(struct llist_node *node) { return node->next; } It is so simple that I wonder if the compiler would be within its rights to reorder the load of node->next after some operations within ttwu_do_activate(), thus causing corruption of this linked-list due to a concurrent try_to_wake_up() performed by another core. Am I too paranoid about the possible compiler mishaps there, or are my concerns justified ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Possible lock-less list race in scheduler_ipi() 2015-03-05 23:48 ` Possible lock-less list race in scheduler_ipi() Mathieu Desnoyers @ 2015-03-06 1:02 ` Linus Torvalds 2015-03-06 14:35 ` Mathieu Desnoyers 2015-03-06 17:09 ` Paul E. McKenney 1 sibling, 1 reply; 9+ messages in thread From: Linus Torvalds @ 2015-03-06 1:02 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Paul E. McKenney, Huang Ying, Lai Jiangshan, Lai Jiangshan, Peter Zijlstra, LKML, Ingo Molnar, Steven Rostedt On Thu, Mar 5, 2015 at 3:48 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > llist_next() is pretty simple: > > static inline struct llist_node *llist_next(struct llist_node *node) > { > return node->next; > } > > It is so simple that I wonder if the compiler would be > within its rights to reorder the load of node->next > after some operations within ttwu_do_activate(), thus > causing corruption of this linked-list due to a > concurrent try_to_wake_up() performed by another core. > > Am I too paranoid about the possible compiler mishaps > there, or are my concerns justified ? I *think* you are too paranoid, because that would be a major compiler bug anyway - gcc cannot reorder the load against anything that might be changing the value. Which obviously includes calling non-inlined functions. At least the code generation I see doesn't seem to say that gcc gets this wrong: ... leaq -32(%rbx), %rsi #, p movq (%rbx), %rbx # MEM[(struct llist_node *)__mptr_19].next, __mptr movq %r12, %rdi # tcp_ptr__, call ttwu_do_activate.constprop.85 # ... that "movq (%rbx), %rbx" is the "llist = llist_next(llist);" thing. Linus ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Possible lock-less list race in scheduler_ipi() 2015-03-06 1:02 ` Linus Torvalds @ 2015-03-06 14:35 ` Mathieu Desnoyers 2015-03-06 19:03 ` Steven Rostedt 0 siblings, 1 reply; 9+ messages in thread From: Mathieu Desnoyers @ 2015-03-06 14:35 UTC (permalink / raw) To: Linus Torvalds Cc: Paul E. McKenney, Huang Ying, Lai Jiangshan, Lai Jiangshan, Peter Zijlstra, LKML, Ingo Molnar, Steven Rostedt ----- Original Message ----- > From: "Linus Torvalds" <torvalds@linux-foundation.org> > To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com> > Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Huang Ying" <ying.huang@intel.com>, "Lai Jiangshan" > <laijs@cn.fujitsu.com>, "Lai Jiangshan" <eag0628@gmail.com>, "Peter Zijlstra" <peterz@infradead.org>, "LKML" > <linux-kernel@vger.kernel.org>, "Ingo Molnar" <mingo@kernel.org>, "Steven Rostedt" <rostedt@goodmis.org> > Sent: Thursday, March 5, 2015 8:02:06 PM > Subject: Re: Possible lock-less list race in scheduler_ipi() > > On Thu, Mar 5, 2015 at 3:48 PM, Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: > > > > llist_next() is pretty simple: > > > > static inline struct llist_node *llist_next(struct llist_node *node) > > { > > return node->next; > > } > > > > It is so simple that I wonder if the compiler would be > > within its rights to reorder the load of node->next > > after some operations within ttwu_do_activate(), thus > > causing corruption of this linked-list due to a > > concurrent try_to_wake_up() performed by another core. > > > > Am I too paranoid about the possible compiler mishaps > > there, or are my concerns justified ? > > I *think* you are too paranoid, because that would be a major compiler > bug anyway - gcc cannot reorder the load against anything that might > be changing the value. Which obviously includes calling non-inlined > functions. > > At least the code generation I see doesn't seem to say that gcc gets this > wrong: > > ... > leaq -32(%rbx), %rsi #, p > movq (%rbx), %rbx # MEM[(struct llist_node > *)__mptr_19].next, __mptr > movq %r12, %rdi # tcp_ptr__, > call ttwu_do_activate.constprop.85 # > ... > > that "movq (%rbx), %rbx" is the "llist = llist_next(llist);" thing. Indeed, the compiler should never reorder loads/stores from/to same memory location from a program order POV. What I had in mind is a bit more far-fetched though: it would involve having the compiler reorder this load after a store to another memory location, which would in turn allow another execution context (interrupt or thread) to corrupt the list. The assembly snippet you show above appears to be OK. However, another compiler may choose to inline ttwu_do_activate, leaving room for more aggressive optimizations. But I agree that it is rather far-fetched. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Possible lock-less list race in scheduler_ipi() 2015-03-06 14:35 ` Mathieu Desnoyers @ 2015-03-06 19:03 ` Steven Rostedt 2015-03-06 19:39 ` Mathieu Desnoyers 0 siblings, 1 reply; 9+ messages in thread From: Steven Rostedt @ 2015-03-06 19:03 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Linus Torvalds, Paul E. McKenney, Huang Ying, Lai Jiangshan, Lai Jiangshan, Peter Zijlstra, LKML, Ingo Molnar On Fri, 6 Mar 2015 14:35:23 +0000 (UTC) Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > Indeed, the compiler should never reorder loads/stores from/to > same memory location from a program order POV. What I had in mind > is a bit more far-fetched though: it would involve having the compiler > reorder this load after a store to another memory location, which > would in turn allow another execution context (interrupt or thread) > to corrupt the list. You mean on another CPU? Because the code you are worried about has interrupts disabled. -- Steve ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Possible lock-less list race in scheduler_ipi() 2015-03-06 19:03 ` Steven Rostedt @ 2015-03-06 19:39 ` Mathieu Desnoyers 2015-03-06 20:38 ` Steven Rostedt 0 siblings, 1 reply; 9+ messages in thread From: Mathieu Desnoyers @ 2015-03-06 19:39 UTC (permalink / raw) To: Steven Rostedt Cc: Linus Torvalds, Paul E. McKenney, Huang Ying, Lai Jiangshan, Lai Jiangshan, Peter Zijlstra, LKML, Ingo Molnar ----- Original Message ----- > From: "Steven Rostedt" <rostedt@goodmis.org> > To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com> > Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Huang Ying" > <ying.huang@intel.com>, "Lai Jiangshan" <laijs@cn.fujitsu.com>, "Lai Jiangshan" <eag0628@gmail.com>, "Peter > Zijlstra" <peterz@infradead.org>, "LKML" <linux-kernel@vger.kernel.org>, "Ingo Molnar" <mingo@kernel.org> > Sent: Friday, March 6, 2015 2:03:47 PM > Subject: Re: Possible lock-less list race in scheduler_ipi() > > On Fri, 6 Mar 2015 14:35:23 +0000 (UTC) > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > > > Indeed, the compiler should never reorder loads/stores from/to > > same memory location from a program order POV. What I had in mind > > is a bit more far-fetched though: it would involve having the compiler > > reorder this load after a store to another memory location, which > > would in turn allow another execution context (interrupt or thread) > > to corrupt the list. > > You mean on another CPU? Because the code you are worried about has > interrupts disabled. I'm worried that another CPU could issue try_to_wake_up() on a task concurrently with the llist iteration within sched_ttwu_pending(). AFAIU, ttwu_queue_remote() is called from ttwu_queue() without holding the rq lock. So I'm wondering what prevents corruption of the wake_list in this situation. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Possible lock-less list race in scheduler_ipi() 2015-03-06 19:39 ` Mathieu Desnoyers @ 2015-03-06 20:38 ` Steven Rostedt 2015-03-06 20:39 ` Steven Rostedt 0 siblings, 1 reply; 9+ messages in thread From: Steven Rostedt @ 2015-03-06 20:38 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Linus Torvalds, Paul E. McKenney, Huang Ying, Lai Jiangshan, Lai Jiangshan, Peter Zijlstra, LKML, Ingo Molnar On Fri, 6 Mar 2015 19:39:44 +0000 (UTC) Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: ask concurrently with the llist iteration within sched_ttwu_pending(). > > AFAIU, ttwu_queue_remote() is called from ttwu_queue() without holding > the rq lock. So I'm wondering what prevents corruption of the wake_list > in this situation. I guess if it is on the wake_list, then the task's state is already RUNNING. Any other task can switch a task's state to RUNNING but only the task itself can switch it back to something else. If the task is on the wake_list, it's state is already RUNNING, but it has not run yet. That means any other wakeup will jump to the "goto out" and skip over the ttwu_queue() call. -- Steve ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Possible lock-less list race in scheduler_ipi() 2015-03-06 20:38 ` Steven Rostedt @ 2015-03-06 20:39 ` Steven Rostedt 2015-03-06 22:07 ` Mathieu Desnoyers 0 siblings, 1 reply; 9+ messages in thread From: Steven Rostedt @ 2015-03-06 20:39 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Linus Torvalds, Paul E. McKenney, Huang Ying, Lai Jiangshan, Lai Jiangshan, Peter Zijlstra, LKML, Ingo Molnar On Fri, 6 Mar 2015 15:38:21 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > On Fri, 6 Mar 2015 19:39:44 +0000 (UTC) > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > ask concurrently with the llist iteration within sched_ttwu_pending(). > > > > AFAIU, ttwu_queue_remote() is called from ttwu_queue() without holding > > the rq lock. So I'm wondering what prevents corruption of the wake_list > > in this situation. > > I guess if it is on the wake_list, then the task's state is already > RUNNING. Any other task can switch a task's state to RUNNING but only > the task itself can switch it back to something else. If the task is on > the wake_list, it's state is already RUNNING, but it has not run yet. > That means any other wakeup will jump to the "goto out" and skip over > the ttwu_queue() call. If my assumption is indeed the case, then these types of subtleties really need comments in the code. -- Steve ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Possible lock-less list race in scheduler_ipi() 2015-03-06 20:39 ` Steven Rostedt @ 2015-03-06 22:07 ` Mathieu Desnoyers 0 siblings, 0 replies; 9+ messages in thread From: Mathieu Desnoyers @ 2015-03-06 22:07 UTC (permalink / raw) To: Steven Rostedt Cc: Linus Torvalds, Paul E. McKenney, Huang Ying, Lai Jiangshan, Lai Jiangshan, Peter Zijlstra, LKML, Ingo Molnar ----- Original Message ----- > From: "Steven Rostedt" <rostedt@goodmis.org> > To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com> > Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Huang Ying" > <ying.huang@intel.com>, "Lai Jiangshan" <laijs@cn.fujitsu.com>, "Lai Jiangshan" <eag0628@gmail.com>, "Peter > Zijlstra" <peterz@infradead.org>, "LKML" <linux-kernel@vger.kernel.org>, "Ingo Molnar" <mingo@kernel.org> > Sent: Friday, March 6, 2015 3:39:25 PM > Subject: Re: Possible lock-less list race in scheduler_ipi() > > On Fri, 6 Mar 2015 15:38:21 -0500 > Steven Rostedt <rostedt@goodmis.org> wrote: > > > On Fri, 6 Mar 2015 19:39:44 +0000 (UTC) > > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > > > ask concurrently with the llist iteration within sched_ttwu_pending(). > > > > > > AFAIU, ttwu_queue_remote() is called from ttwu_queue() without holding > > > the rq lock. So I'm wondering what prevents corruption of the wake_list > > > in this situation. > > > > I guess if it is on the wake_list, then the task's state is already > > RUNNING. Any other task can switch a task's state to RUNNING but only > > the task itself can switch it back to something else. If the task is on > > the wake_list, it's state is already RUNNING, but it has not run yet. > > That means any other wakeup will jump to the "goto out" and skip over > > the ttwu_queue() call. > > If my assumption is indeed the case, then these types of subtleties > really need comments in the code. My understanding is that try_to_wake_up, by calling ttwu_queue(), is responsible for enqueuing the task into the wake_list. Inspection of try_to_wake_up() seems to show that the state of the task is set to TASK_WAKING by try_to_wake_up. Then when dequeuing the task from the llist, ttwu_do_wakeup sets the task state to TASK_RUNNING. Both TASK_WAKING and TASK_RUNNING mean that the try_to_wake_up check for if (!(p->state & state)), which is typically done against TASK_NORMAL, will skip the following ttwu_queue() for that task until it is set to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE again. So there should not be any double-enqueue AFAIU. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Possible lock-less list race in scheduler_ipi() 2015-03-05 23:48 ` Possible lock-less list race in scheduler_ipi() Mathieu Desnoyers 2015-03-06 1:02 ` Linus Torvalds @ 2015-03-06 17:09 ` Paul E. McKenney 1 sibling, 0 replies; 9+ messages in thread From: Paul E. McKenney @ 2015-03-06 17:09 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Huang Ying, Lai Jiangshan, Lai Jiangshan, Peter Zijlstra, LKML, Ingo Molnar, Steven Rostedt, Linus Torvalds On Thu, Mar 05, 2015 at 11:48:51PM +0000, Mathieu Desnoyers wrote: > Hi, > > I recently reviewed the scheduler_ipi() code by > coincidence, and noticed that the use of llist.h > in there seemed a bit odd. So I'd like to have > more eyes to help me look into this. > > I've been told that there has been issues with > IPIs lately, so here is one possible scenario > I think might be possible: > > - scheduler_ipi() > - sched_ttwu_pending() > - llist_del_all() > (grabs the wake_list with xchg()) > - raw_spin_lock_irqsave(&rq->lock, flags); > - iteration on the llist (owned locally by interrupt handler) > (for each item) > p = llist_entry(llist, struct task_struct, wake_entry); > llist = llist_next(llist); > ttwu_do_activate(rq, p, 0); > - raw_spin_unlock_irqrestore(&rq->lock, flags); > > ttwu_do_activate() ends up calling > ttwu_activate() and ttwu_do_wakeup(), > > I'm worried that ttwu_do_wakeup() might end up > indirectly handing off the task to the following > functions (within another execution context): > > - try_to_wake_up() > - ttwu_queue() > - ttwu_queue_remote adds the process to another > wake_list with a cmpxchg() > > llist_next() is pretty simple: > > static inline struct llist_node *llist_next(struct llist_node *node) > { > return node->next; > } > > It is so simple that I wonder if the compiler would be > within its rights to reorder the load of node->next > after some operations within ttwu_do_activate(), thus > causing corruption of this linked-list due to a > concurrent try_to_wake_up() performed by another core. Well, it clearly cannot leak this load out of the critical section. That said, there is nothing in llist_next() that tells the compiler that it need to be careful with this load. And all three functions, sched_ttwu_pending(), ttwu_do_activate(), and llist_next(), are in the same compilation unit. The compiler therefore has full knowledge, and full ability to rearrange. It can slide this load down past the CONFIG_SMP #ifdef in ttwu_do_activate() and into ttwu_activate(). However, it cannot push it past the call to sched_clock_cpu(), which is defined in the separate compilation unit kernel/sched/clock.c. And sched_clock_cpu() is invoked from update_rq_clock(), which is invoked from enqueue_task(), which is invoked from activate_task(), which is invoked from ttwu_activate(). And even if sched_clock_cpu() was in the same compilation unit, its preempt_disable_notrace() keeps the compiler from reordering. At least it does if the compiler cannot prove that sched_clock_cpu() takes one of two early exits. ;-) So I am not seeing anything important that the compiler could reorder this past. The CPU might do additional reordering, of course. > Am I too paranoid about the possible compiler mishaps > there, or are my concerns justified ? Now try_to_wake_up() does acquire p->pi_lock, but that of course does not exclude sched_ttwu_pending(), which instead acquires rq->lock. And in a virtualized environment, sched_ttwu_pending() can be preempted and delayed pretty much arbitrarily pretty much anywhere, even when its interrupts are disabled. So let's look at the llist code. One question is of course "How does llist_add_batch() avoid the ABA problem?" In this case, the answer appears to be that llist_del_all() is used, and never llist_del_first(). (Mixing use of llist_del_all() and llist_del_first() by multiple threads could cause this ABA problem, even if only one thread at a time used llist_del_first().) The code does appear to assume that if a given task is on the local llist, that it cannot be awakened any other way without acquiring the same runqueue lock that sched_ttwu_pending() holds. If this rule was violated, then of course the list could be corrupted. However, the only accesses to the list appear to be sched_ttwu_pending()'s llist_del_all(), ttwu_queue_remote()'s llist_add(), and a couple of llist_empty() calls. This combination appears to me to be safe. The scheduler_ipi() code assumes that an IPI cannot outrun data. If this is violated, the scheduler_ipi() might still see an empty ->wake_list() even though the sender of the IPI just filled it. The last time I worked with a machine that violated this assumption was about twenty years ago, and I very much hope that such hardware designs have not come back from the dead. But this would cause a hang, not a corrupt list. So, does this trigger any helpful thoughts? Thanx, Paul ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2015-03-06 22:07 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <995381344.227770.1425597484864.JavaMail.zimbra@efficios.com>
2015-03-05 23:48 ` Possible lock-less list race in scheduler_ipi() Mathieu Desnoyers
2015-03-06 1:02 ` Linus Torvalds
2015-03-06 14:35 ` Mathieu Desnoyers
2015-03-06 19:03 ` Steven Rostedt
2015-03-06 19:39 ` Mathieu Desnoyers
2015-03-06 20:38 ` Steven Rostedt
2015-03-06 20:39 ` Steven Rostedt
2015-03-06 22:07 ` Mathieu Desnoyers
2015-03-06 17:09 ` Paul E. McKenney
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.