[BUG] Possible locking issues in stop_machine code on 6k core machine

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Alex Thorlton <athorlton@sgi.com>
To: linux-kernel@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Fabian Frederick <fabf@skynet.be>, Ingo Molnar <mingo@kernel.org>,
	Alex Thorlton <athorlton@sgi.com>, Russ Anderson <rja@sgi.com>,
	linux-kernel@vger.kernel.org
Subject: [BUG] Possible locking issues in stop_machine code on 6k core machine
Date: Wed, 3 Dec 2014 14:40:49 -0600	[thread overview]
Message-ID: <20141203204048.GJ4720@sgi.com> (raw)

Hey guys,                                                                                                                                                                                                                                                                                                                                                                                 

While working to get our newly upgraded 6k core machine online, we've                                                                                                                                                                                                                                                                                                                     
discovered a few possible locking issues in the stop_machine code that                                                                                                                                                                                                                                                                                                                    
we're trying to get sorted out.  (We think) the problems we're seeing                                                                                                                                                                                                                                                                                                                     
stem from possible interaction between stop_cpus and stop_one_cpu.  The                                                                                                                                                                                                                                                                                                                   
issue presents as a deadlock, and seems to only show itself                                                                                                                                                                                                                                                                                                                               
intermittently.                                                                                                                                                                                                                                                                                                                                                                           

After quite a bit of debugging we think we've narrowed the issue down to                                                                                                                                                                                                                                                                                                                  
the fact that stop_one_cpu does not respect many of the locks that are                                                                                                                                                                                                                                                                                                                    
taken in the stop_cpus code path.  For reference the stop_cpus code path                                                                                                                                                                                                                                                                                                                  
takes the stop_cpus_mutex, then stop_cpus_lock, and then takes each                                                                                                                                                                                                                                                                                                                       
cpu's stopper->lock.  stop_one_cpu seems to rely solely on the                                                                                                                                                                                                                                                                                                                            
stopper->lock.                                                                                                                                                                                                                                                                                                                                                                            

What appears to be happening to cause our deadlock is, stop_cpus works                                                                                                                                                                                                                                                                                                                    
its way down to queue_stop_cpus_work, which tells each cpu's stopper                                                                                                                                                                                                                                                                                                                      
task to wake up, take its lock, and do its work.  As the loop that does                                                                                                                                                                                                                                                                                                                   
this progresses, the lowest numbered cpus complete their work, and are                                                                                                                                                                                                                                                                                                                    
allowed to go on about their business.  The problem occurs when one of                                                                                                                                                                                                                                                                                                                    
these lower numbered cpus calls stop_one_cpu, targeting one of the                                                                                                                                                                                                                                                                                                                        
higher numbered cpus, which the stop_cpus loop has not yet reached.  If                                                                                                                                                                                                                                                                                                                   
this happens, that higher numbered cpu's completion variable will get                                                                                                                                                                                                                                                                                                                     
stomped on, and the wait_for_completion in the stop_cpus code path will                                                                                                                                                                                                                                                                                                                   
never return.                                                                                                                                                                                                                                                                                                                                                                             

A quick example: CPU 0 calls stop_cpus, which will hit all 6,000 cores.                                                                                                                                                                                                                                                                                                                   
CPU 50 completes its stopper work, and at some point in the near future                                                                                                                                                                                                                                                                                                                   
calls stop_one_cpu on CPU 5000.  This clobbers CPU 5000's pointer to the                                                                                                                                                                                                                                                                                                                  
cpu_stop_done struct set up in queue_stop_cpus_work, meaning that, once                                                                                                                                                                                                                                                                                                                   
CPU 5000 completes its work, it won't be able to decrement the nr_todo                                                                                                                                                                                                                                                                                                                    
for the correct cpu_stop_done struct, and CPU 0's wait_for_completion                                                                                                                                                                                                                                                                                                                     
will never return.                                                                                                                                                                                                                                                                                                                                                                        

Again, much of this is semi-educated guesswork, put together based on                                                                                                                                                                                                                                                                                                                     
information gathered from examining lots of debug output, in an attempt                                                                                                                                                                                                                                                                                                                   
to spot the problem.  We're fairly certain that we've pinned down our                                                                                                                                                                                                                                                                                                                     
issue, but we'd like to ask those who are more knowledgeable of these                                                                                                                                                                                                                                                                                                                     
code paths to weigh in their opinions here.                                                                                                                                                                                                                                                                                                                                               

We'd really appreciate any help that anyone can offer.  Thanks!                                                                                                                                                                                                                                                                                                                           

- Alex

next             reply	other threads:[~2014-12-03 20:40 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-12-03 20:40 Alex Thorlton [this message]
2014-12-03 23:34 ` [BUG] Possible locking issues in stop_machine code on 6k core machine Thomas Gleixner
2014-12-04 15:58   ` Alex Thorlton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20141203204048.GJ4720@sgi.com \
    --to=athorlton@sgi.com \
    --cc=akpm@linux-foundation.org \
    --cc=fabf@skynet.be \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=rja@sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox