linux-kernel.vger.kernel.org archive mirror
From: Mel Gorman <mgorman@suse.de>
To: Rik van Riel <riel@redhat.com>
Cc: peterz@infradead.org, mingo@kernel.org, prarit@redhat.com,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH -tip] fix race between stop_two_cpus and stop_cpus
Date: Fri, 1 Nov 2013 11:08:25 +0000
Message-ID: <20131101110825.GX2400@suse.de>
In-Reply-To: <20131031163144.0fd27457@annuminas.surriel.com>

On Thu, Oct 31, 2013 at 04:31:44PM -0400, Rik van Riel wrote:
> There is a race between stop_two_cpus, and the global stop_cpus.
> 

What was the trigger for this? I want to see what was missing from my own
testing. I'm going to go out on a limb and guess that CPU hotplug was also
running in the background to specifically stress this sort of rare condition.
Something like running a standard test with the monitors/watch-cpuoffline.sh
from mmtests running in parallel.

> It is possible for two CPUs to get their stopper functions queued
> "backwards" from one another, resulting in the stopper threads
> getting stuck, and the system hanging. This can happen because
> queuing up stoppers is not synchronized.
> 
> This patch adds synchronization between stop_cpus (a rare operation),
> and stop_two_cpus.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> Prarit is running a test with this patch. By now the kernel would have
> crashed already, yet it is still going. I expect Prarit will add his
> Tested-by: some time tomorrow morning.
> 
>  kernel/stop_machine.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 42 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index 32a6c44..46cb4c2 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -40,8 +40,10 @@ struct cpu_stopper {
>  };
>  
>  static DEFINE_PER_CPU(struct cpu_stopper, cpu_stopper);
> +static DEFINE_PER_CPU(bool, stop_two_cpus_queueing);
>  static DEFINE_PER_CPU(struct task_struct *, cpu_stopper_task);
>  static bool stop_machine_initialized = false;
> +static bool stop_cpus_queueing = false;
>  
>  static void cpu_stop_init_done(struct cpu_stop_done *done, unsigned int nr_todo)
>  {
> @@ -261,16 +263,37 @@ int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *
>  	cpu_stop_init_done(&done, 2);
>  	set_state(&msdata, MULTI_STOP_PREPARE);
>  
> + wait_for_global:
> +	/* If a global stop_cpus is queuing up stoppers, wait. */
> +	while (unlikely(stop_cpus_queueing))
> +		cpu_relax();
> +

This partially serialises callers of migrate_swap() while it checks
whether the pair of CPUs is currently affected. It's two-stage locking:
the global lock is short-lived, held only while the per-cpu data is
updated, and the per-cpu values allow a degree of parallelisation on
call_cpu which could not be done with a spinlock held anyway. Why not
protect the initial update with a normal spinlock? i.e.

spin_lock(&stop_cpus_queue_lock);
this_cpu_write(stop_two_cpus_queueing, true);
spin_unlock(&stop_cpus_queue_lock);

and get rid of the barriers and the goto wait_for_global loop entirely? I'm
not seeing the hidden advantage. The this_cpu_write(stop_two_cpus_queueing, false)
would also need to be within the lock, as would the checks in queue_stop_cpus_work.

The locks look bad but it's not clear to me why the barriers and retries
are better.
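
To make the suggestion concrete, here is a rough user-space sketch of the
spinlock variant. It is not the patch and not kernel code: a pthread mutex
stands in for the spinlock, a plain array stands in for the per-cpu
variable, and NR_CPUS, set_two_cpus_queueing(), try_queue_global() and
fake_queue_work() are names made up for illustration. It also assumes the
global side keeps the lock held while it queues its work, which is only one
possible reading of "the checks would also need to be within the lock".

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define NR_CPUS 4	/* hypothetical CPU count for this sketch */

/* One lock covers every update and every check of the per-CPU flags. */
static pthread_mutex_t stop_cpus_queue_lock = PTHREAD_MUTEX_INITIALIZER;
static bool stop_two_cpus_queueing[NR_CPUS];

/* Hypothetical stand-in for queueing the global stopper work. */
static int nr_global_queued;
static void fake_queue_work(void)
{
	nr_global_queued++;
}

/* stop_two_cpus() side: announce or retract queueing for @cpu. */
static void set_two_cpus_queueing(unsigned int cpu, bool queueing)
{
	assert(cpu < NR_CPUS);
	pthread_mutex_lock(&stop_cpus_queue_lock);
	stop_two_cpus_queueing[cpu] = queueing;
	pthread_mutex_unlock(&stop_cpus_queue_lock);
}

/*
 * stop_cpus() side: call @queue_fn with the lock held, but only if no
 * CPU is in the middle of a stop_two_cpus() queueing step. Returns
 * false when the caller has to back off and retry.
 */
static bool try_queue_global(void (*queue_fn)(void))
{
	bool ok = true;
	unsigned int cpu;

	pthread_mutex_lock(&stop_cpus_queue_lock);
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (stop_two_cpus_queueing[cpu]) {
			ok = false;
			break;
		}
	}
	if (ok)
		queue_fn();
	pthread_mutex_unlock(&stop_cpus_queue_lock);
	return ok;
}
```

In this sketch a caller on the stop_two_cpus side would wrap its queueing
in set_two_cpus_queueing(cpu, true) and set_two_cpus_queueing(cpu, false),
and the global side would loop on try_queue_global(fake_queue_work) until
it returns true. Pair callers still queue in parallel with each other; only
the flag updates and the global check are serialised.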

-- 
Mel Gorman
SUSE Labs

