[BUG] Possible locking issues in stop_machine code on 6k core machine

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [BUG] Possible locking issues in stop_machine code on 6k core machine
@ 2014-12-03 20:40 Alex Thorlton
  2014-12-03 23:34 ` Thomas Gleixner
  0 siblings, 1 reply; 3+ messages in thread
From: Alex Thorlton @ 2014-12-03 20:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Peter Zijlstra, Fabian Frederick, Ingo Molnar,
	Alex Thorlton, Russ Anderson, linux-kernel

Hey guys,                                                                                                                                                                                                                                                                                                                                                                                 

While working to get our newly upgraded 6k core machine online, we've                                                                                                                                                                                                                                                                                                                     
discovered a few possible locking issues in the stop_machine code that                                                                                                                                                                                                                                                                                                                    
we're trying to get sorted out.  (We think) the problems we're seeing                                                                                                                                                                                                                                                                                                                     
stem from possible interaction between stop_cpus and stop_one_cpu.  The                                                                                                                                                                                                                                                                                                                   
issue presents as a deadlock, and seems to only show itself                                                                                                                                                                                                                                                                                                                               
intermittently.                                                                                                                                                                                                                                                                                                                                                                           

After quite a bit of debugging we think we've narrowed the issue down to                                                                                                                                                                                                                                                                                                                  
the fact that stop_one_cpu does not respect many of the locks that are                                                                                                                                                                                                                                                                                                                    
taken in the stop_cpus code path.  For reference the stop_cpus code path                                                                                                                                                                                                                                                                                                                  
takes the stop_cpus_mutex, then stop_cpus_lock, and then takes each                                                                                                                                                                                                                                                                                                                       
cpu's stopper->lock.  stop_one_cpu seems to rely solely on the                                                                                                                                                                                                                                                                                                                            
stopper->lock.                                                                                                                                                                                                                                                                                                                                                                            

What appears to be happening to cause our deadlock is, stop_cpus works                                                                                                                                                                                                                                                                                                                    
its way down to queue_stop_cpus_work, which tells each cpu's stopper                                                                                                                                                                                                                                                                                                                      
task to wake up, take its lock, and do its work.  As the loop that does                                                                                                                                                                                                                                                                                                                   
this progresses, the lowest numbered cpus complete their work, and are                                                                                                                                                                                                                                                                                                                    
allowed to go on about their business.  The problem occurs when one of                                                                                                                                                                                                                                                                                                                    
these lower numbered cpus calls stop_one_cpu, targeting one of the                                                                                                                                                                                                                                                                                                                        
higher numbered cpus, which the stop_cpus loop has not yet reached.  If                                                                                                                                                                                                                                                                                                                   
this happens, that higher numbered cpu's completion variable will get                                                                                                                                                                                                                                                                                                                     
stomped on, and the wait_for_completion in the stop_cpus code path will                                                                                                                                                                                                                                                                                                                   
never return.                                                                                                                                                                                                                                                                                                                                                                             

A quick example: CPU 0 calls stop_cpus, which will hit all 6,000 cores.                                                                                                                                                                                                                                                                                                                   
CPU 50 completes its stopper work, and at some point in the near future                                                                                                                                                                                                                                                                                                                   
calls stop_one_cpu on CPU 5000.  This clobbers CPU 5000's pointer to the                                                                                                                                                                                                                                                                                                                  
cpu_stop_done struct set up in queue_stop_cpus_work, meaning that, once                                                                                                                                                                                                                                                                                                                   
CPU 5000 completes its work, it won't be able to decrement the nr_todo                                                                                                                                                                                                                                                                                                                    
for the correct cpu_stop_done struct, and CPU 0's wait_for_completion                                                                                                                                                                                                                                                                                                                     
will never return.                                                                                                                                                                                                                                                                                                                                                                        

Again, much of this is semi-educated guesswork, put together based on                                                                                                                                                                                                                                                                                                                     
information gathered from examining lots of debug output, in an attempt                                                                                                                                                                                                                                                                                                                   
to spot the problem.  We're fairly certain that we've pinned down our                                                                                                                                                                                                                                                                                                                     
issue, but we'd like to ask those who are more knowledgeable of these                                                                                                                                                                                                                                                                                                                     
code paths to weigh in their opinions here.                                                                                                                                                                                                                                                                                                                                               

We'd really appreciate any help that anyone can offer.  Thanks!                                                                                                                                                                                                                                                                                                                           

- Alex

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [BUG] Possible locking issues in stop_machine code on 6k core machine
  2014-12-03 20:40 [BUG] Possible locking issues in stop_machine code on 6k core machine Alex Thorlton
@ 2014-12-03 23:34 ` Thomas Gleixner
  2014-12-04 15:58   ` Alex Thorlton
  0 siblings, 1 reply; 3+ messages in thread
From: Thomas Gleixner @ 2014-12-03 23:34 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-kernel, Andrew Morton, Peter Zijlstra, Fabian Frederick,
	Ingo Molnar, Russ Anderson

Alex,

first of all, could you please format your mail so it is actually
readable and can be replied to? This looks like copy and paste from a
office app with about 5 gazillion spaces at the end of each line. I
have no idea how you tricked mutt to do that.

On Wed, 3 Dec 2014, Alex Thorlton wrote:

> While working to get our newly upgraded 6k core machine online,
> we've discovered a few possible locking issues in the stop_machine
> code that we're trying to get sorted out.  (We think) the problems
> we're seeing stem from possible interaction between stop_cpus and
> stop_one_cpu.  The issue presents as a deadlock, and seems to only
> show itself intermittently.

> After quite a bit of debugging we think we've narrowed the issue
> down to the fact that stop_one_cpu does not respect many of the
> locks that are taken in the stop_cpus code path.  For reference the
> stop_cpus code path takes the stop_cpus_mutex, then stop_cpus_lock,
> and then takes each cpu's stopper->lock.  stop_one_cpu seems to rely
> solely on the stopper->lock.

And what's actually wrong with that?

stop_cpus()
  mutex_lock(&stop_cpus_mutex);
  __stop_cpus()
    cpu_stop_init_done(); <-- works on a stack variable
    queue_stop_cpus_work()
      for_each_cpu()
        per_cpu(stop_cpus_work, cpu)
	        ^^^^ is a static per cpu work struct protected by
		     stop_cpus_mutex

      /* Now we have to queue that work for real */
      lg_global_lock(&stop_cpus_lock);
      for_each_cpu(cpu, cpumask)
      	cpu_stop_queue_work(cpu ...)
	  lock(per_cpu(stopper_lock, cpu);
	  list_add_tail(work, per_cpu(worklist, cpu);
	  unlock(per_cpu(stopper_lock, cpu);
      lg_global_unlock(&stop_cpus_lock);

stop_one_cpu()
  cpu_stop_init_done(&done, 1); <-- works on a stack variable
  cpu_stop_queue_work(cpu, &work); <-- work is on stack
  wait_for_completion(&done.completion);

So stop_cpus() operates on static allocated per cpu cpu_stop_work. The
work->done of each per cpu work points to cpu_stop_done which is on
the stack of the caller. Nothing can stomp on the per_cpu work as long
as stop_cpus_mutex is held. Nothing can stomp on the done struct which
is on the stack of the caller and only referenced by the cpu work.

stop_one_cpu() has both the work and the done on the callers stack. So
it does neither need to take the stop_cpus_mutex nor the
stop_cpus_lock lglock. All it requires is the per cpu stopper->lock to
add the work to the per cpu list of the target cpu.

> What appears to be happening to cause our deadlock is, stop_cpus
> works its way down to queue_stop_cpus_work, which tells each cpu's
> stopper task to wake up, take its lock, and do its work.  As the
> loop that does this progresses, the lowest numbered cpus complete
> their work, and are allowed to go on about their business.  The
> problem occurs when one of these lower numbered cpus calls
> stop_one_cpu, targeting one of the higher numbered cpus, which the
> stop_cpus loop has not yet reached.  If this happens, that higher
> numbered cpu's completion variable will get stomped on, and the
> wait_for_completion in the stop_cpus code path will never return.

That's complete nonsense. It CANNOT stomp on the completion variable
of a higher numbered cpu, because there is no such variable.

Lets look at a single CPU:

     The per cpu struct cpu_stopper has work A (from
     stop_cpus) queued.

     A looks like this:

     A->fn   = stop_cpus(fn);
     A->arg  = stop_cpus(arg);
     A->done = struct cpu_stop_done on the stack of the stop_cpus() caller

     Now B gets queued:

     B looks like this:

     B->fn   = stop_one_cpu(fn);
     B->arg  = stop_one_cpu(arg);
     B->done = struct cpu_stop_done on the stack of the stop_one_cpu() caller

     The only connection between A and B is that they are enqueued on
     the same list, i.e. the er cpu struct cpu_stopper::works list.

So now explain, how the enqueueing of B stomps on A->done. There is no
way that the list_add_tail(B) can modify A->done.

Again. This are the protection scopes:

 stop_cpus_mutex protects stop_cpus_work

   It is held across the stop_cpus() call until the
   wait__for_completion() returns. So nothing can stomp on
   stop_cpus_work and the done pointer of any cpu until
   stop_cpus_mutex is released.

 stop_cpus_lock serializes multi_cpu_stop invocations

   You have not provided any indication that multi_cpu_stop is
   involved, so that's irrelevant.

 stopper->lock protects the per cpu work list

> Again, much of this is semi-educated guesswork, put together based

I'd say semi-educated guesswork is a euphemism.

> on information gathered from examining lots of debug output, in an
> attempt to spot the problem. 

Of course you completely missed to provide the actual call chains and
callback functions involved. Instead of that you provide your
interpretation of the problem, which seems to be a bit off.

> We're fairly certain that we've pinned down our issue,

I'm very certain that your interpretation of lots of debug output is
pretty much in line with your creative description of the problem.

> but we'd like to ask those who are more knowledgeable of these code
> paths to weigh in their opinions here.

Done. 

> We'd really appreciate any help that anyone can offer.  Thanks!

If you would provide real data instead of half baken assumptions
we might actually be able to help you.

There might be some hidden issue in the stomp machine crap, but surely
not the one you described.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [BUG] Possible locking issues in stop_machine code on 6k core machine
  2014-12-03 23:34 ` Thomas Gleixner
@ 2014-12-04 15:58   ` Alex Thorlton
  0 siblings, 0 replies; 3+ messages in thread
From: Alex Thorlton @ 2014-12-04 15:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Alex Thorlton, linux-kernel, Andrew Morton, Peter Zijlstra,
	Fabian Frederick, Ingo Molnar, Russ Anderson

On Thu, Dec 04, 2014 at 12:34:04AM +0100, Thomas Gleixner wrote:
> first of all, could you please format your mail so it is actually
> readable and can be replied to?

My fault there - stupid mistake.

> If you would provide real data instead of half baken assumptions
> we might actually be able to help you.
> 
> There might be some hidden issue in the stomp machine crap, but surely
> not the one you described.

I'll return with some actual data so that others can get a better
picture of what we're seeing.  I'll also see if I can come up with a
way to reproduce the issue on a smaller configuration; might not be
possible, but attempting to do so might provide some more insight into
our problem.

Thanks for your detailed description of the code path!  I definitely
understand that our assumption can't be the actual problem here.  I'll
dig back into the issue and try to get some better information.

Thanks again!

- Alex

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2014-12-04 15:58 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-12-03 20:40 [BUG] Possible locking issues in stop_machine code on 6k core machine Alex Thorlton
2014-12-03 23:34 ` Thomas Gleixner
2014-12-04 15:58   ` Alex Thorlton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox