Re: [PATCH -tip] fix race between stop_two_cpus and stop_cpus

From: Prarit Bhargava <prarit@redhat.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>,
	peterz@infradead.org, mingo@kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH -tip] fix race between stop_two_cpus and stop_cpus
Date: Fri, 01 Nov 2013 07:39:21 -0400
Message-ID: <527392E9.6080005@redhat.com>
In-Reply-To: <20131101110825.GX2400@suse.de>



On 11/01/2013 07:08 AM, Mel Gorman wrote:
> On Thu, Oct 31, 2013 at 04:31:44PM -0400, Rik van Riel wrote:
>> There is a race between stop_two_cpus, and the global stop_cpus.
>>
> 
> What was the trigger for this? I want to see what was missing from my own
> testing. I'm going to go out on a limb and guess that CPU hotplug was also
> running in the background to specifically stress this sort of rare condition.
> Something like running a standard test with the monitors/watch-cpuoffline.sh
> from mmtests running in parallel.
> 

I have a test that loads and unloads each module in /lib/modules/3.*/...
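
For readers who want to try something similar, here is a minimal sketch of
such a load/unload loop. It is not the actual test script, which is not shown
in this message. The helper names module_names() and cycle_modules() are
hypothetical, and the sketch assumes that modprobe(8) is available and that
the module tree contains .ko (possibly compressed) files. Run it only on a
disposable test machine, as root.

```python
import os
import subprocess


def module_names(root):
    """Return the sorted module names of all .ko / .ko.* files under root."""
    names = set()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            base, sep, rest = fname.partition(".ko")
            # Accept "foo.ko" and compressed variants such as "foo.ko.xz",
            # but skip unrelated files like "modules.dep".
            if sep and (rest == "" or rest.startswith(".")):
                names.add(base)
    return sorted(names)


def cycle_modules(names, run=subprocess.call):
    """Load each module, then unload it again if the load succeeded."""
    for name in names:
        if run(["modprobe", name]) == 0:
            run(["modprobe", "-r", name])


# Example use (assumption: modules live under /lib/modules/<release>):
#   cycle_modules(module_names("/lib/modules/" + os.uname().release))
```

The run parameter exists only so the loop can be exercised without touching
the running kernel, for example by passing a function that records the
commands instead of executing them.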

Each run typically takes a few minutes.  After running 4-5 times, the system
issues a soft lockup warning with a CPU in multi_cpu_stop().  Unfortunately,
kdump isn't working on this particular system (due to another bug) so I modified
the code with (sorry for the cut-and-paste):

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 05039e3..4a8c9f9 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -323,8 +323,10 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtime
                else
                        dump_stack();

-               if (softlockup_panic)
+               if (softlockup_panic) {
+                       show_state();
                        panic("softlockup: hung tasks");
+               }
                __this_cpu_write(soft_watchdog_warn, true);
        } else
                __this_cpu_write(soft_watchdog_warn, false);

and then 'echo 1 > /proc/sys/kernel/softlockup_panic' to get a full trace of all
tasks.

When I did this and ran the kernel module load/unload test, I got:

[prarit@prarit tmp]$ cat /tmp/intel.log | grep RIP
[  678.081168] RIP: 0010:[<ffffffff810d3292>]  [<ffffffff810d3292>]
multi_cpu_stop+0x82/0xf0
[  678.156180] RIP: 0010:[<ffffffff810d3296>]  [<ffffffff810d3296>]
multi_cpu_stop+0x86/0xf0
[  678.230190] RIP: 0010:[<ffffffff810d3296>]  [<ffffffff810d3296>]
multi_cpu_stop+0x86/0xf0
[  678.244186] RIP: 0010:[<ffffffff810d3296>]  [<ffffffff810d3296>]
multi_cpu_stop+0x86/0xf0
[  678.259194] RIP: 0010:[<ffffffff810d3292>]  [<ffffffff810d3292>]
multi_cpu_stop+0x82/0xf0
[  678.274192] RIP: 0010:[<ffffffff810d3296>]  [<ffffffff810d3296>]
multi_cpu_stop+0x86/0xf0
[  678.288195] RIP: 0010:[<ffffffff810d3296>]  [<ffffffff810d3296>]
multi_cpu_stop+0x86/0xf0
[  678.303197] RIP: 0010:[<ffffffff810d3292>]  [<ffffffff810d3292>]
multi_cpu_stop+0x82/0xf0
[  678.318200] RIP: 0010:[<ffffffff810d3296>]  [<ffffffff810d3296>]
multi_cpu_stop+0x86/0xf0
[  678.333203] RIP: 0010:[<ffffffff810d3296>]  [<ffffffff810d3296>]
multi_cpu_stop+0x86/0xf0
[  678.349206] RIP: 0010:[<ffffffff810d3296>]  [<ffffffff810d3296>]
multi_cpu_stop+0x86/0xf0
[  678.364208] RIP: 0010:[<ffffffff810d328b>]  [<ffffffff810d328b>]
multi_cpu_stop+0x7b/0xf0
[  678.379211] RIP: 0010:[<ffffffff810d3292>]  [<ffffffff810d3292>]
multi_cpu_stop+0x82/0xf0
[  678.394212] RIP: 0010:[<ffffffff810d3296>]  [<ffffffff810d3296>]
multi_cpu_stop+0x86/0xf0
[  678.409215] RIP: 0010:[<ffffffff810d3296>]  [<ffffffff810d3296>]
multi_cpu_stop+0x86/0xf0
[  678.424217] RIP: 0010:[<ffffffff810d3292>]  [<ffffffff810d3292>]
multi_cpu_stop+0x82/0xf0
[  678.438219] RIP: 0010:[<ffffffff810d3292>]  [<ffffffff810d3292>]
multi_cpu_stop+0x82/0xf0
[  678.452221] RIP: 0010:[<ffffffff810d3296>]  [<ffffffff810d3296>]
multi_cpu_stop+0x86/0xf0
[  678.466228] RIP: 0010:[<ffffffff810d3292>]  [<ffffffff810d3292>]
multi_cpu_stop+0x82/0xf0
[  678.481228] RIP: 0010:[<ffffffff810d3296>]  [<ffffffff810d3296>]
multi_cpu_stop+0x86/0xf0
[  678.496230] RIP: 0010:[<ffffffff810d3292>]  [<ffffffff810d3292>]
multi_cpu_stop+0x82/0xf0
[  678.511234] RIP: 0010:[<ffffffff810d3296>]  [<ffffffff810d3296>]
multi_cpu_stop+0x86/0xf0
[  678.526236] RIP: 0010:[<ffffffff810d3296>]  [<ffffffff810d3296>]
multi_cpu_stop+0x86/0xf0
[  678.541238] RIP: 0010:[<ffffffff810d3292>]  [<ffffffff810d3292>]
multi_cpu_stop+0x82/0xf0
[  678.556244] RIP: 0010:[<ffffffff810d3296>]  [<ffffffff810d3296>]
multi_cpu_stop+0x86/0xf0
[  678.571243] RIP: 0010:[<ffffffff810d3292>]  [<ffffffff810d3292>]
multi_cpu_stop+0x82/0xf0
[  678.586247] RIP: 0010:[<ffffffff810d3296>]  [<ffffffff810d3296>]
multi_cpu_stop+0x86/0xf0
[  678.601248] RIP: 0010:[<ffffffff810d3292>]  [<ffffffff810d3292>]
multi_cpu_stop+0x82/0xf0
[  678.616251] RIP: 0010:[<ffffffff810d3292>]  [<ffffffff810d3292>]
multi_cpu_stop+0x82/0xf0
[  678.632254] RIP: 0010:[<ffffffff810d328b>]  [<ffffffff810d328b>]
multi_cpu_stop+0x7b/0xf0
[  678.647257] RIP: 0010:[<ffffffff810d3292>]  [<ffffffff810d3292>]
multi_cpu_stop+0x82/0xf0
[  687.570464] RIP: 0010:[<ffffffff810d3296>]  [<ffffffff810d3296>]
multi_cpu_stop+0x86/0xf0

and,

[prarit@prarit tmp]$ cat /tmp/intel.log | grep RIP | wc -l
32

which shows all 32 cpus are "correctly" in the cpu stop threads.  After some
investigation, Rik came up with his patch.
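
The same log can also be broken down by instruction offset rather than just
counted. The following is a small sketch, assuming the RIP lines have the
format shown above, including the case where the mail client has wrapped the
symbol onto the next line. The function name tally_rip() is hypothetical.

```python
import re
from collections import Counter

# Matches "RIP: 0010:[<addr>]  [<addr>] symbol+0xOFF/0xSIZE".  Because \s+
# also matches a newline, the symbol may sit on the following line.
RIP_RE = re.compile(
    r"RIP: \S+\s+\[<[0-9a-f]+>\]\s+(\w+\+0x[0-9a-f]+)/0x[0-9a-f]+"
)


def tally_rip(log_text):
    """Count how many RIP entries point at each symbol+offset."""
    return Counter(RIP_RE.findall(log_text))


# Example use (path taken from the transcript above):
#   with open("/tmp/intel.log") as f:
#       for location, count in tally_rip(f.read()).most_common():
#           print(count, location)
```

This groups the entries by offset, so it is easy to see how many CPUs are at
each instruction within multi_cpu_stop().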

Hope this explains things,

P.

Thread overview: 13+ messages
2013-10-31 20:31 [PATCH -tip] fix race between stop_two_cpus and stop_cpus Rik van Riel
2013-11-01 11:08 ` Mel Gorman
2013-11-01 11:36   ` Rik van Riel
2013-11-01 12:08     ` Prarit Bhargava
2013-11-01 13:44     ` Mel Gorman
2013-11-01 14:24       ` Peter Zijlstra
2013-11-01 14:27         ` Rik van Riel
2013-11-01 14:41           ` [PATCH -v2 " Rik van Riel
2013-11-01 14:47             ` Mel Gorman
2013-11-01 14:49               ` Prarit Bhargava
2013-11-01 18:24               ` Prarit Bhargava
2013-11-11 17:52             ` [tip:sched/core] stop_machine: Fix race between stop_two_cpus() and stop_cpus() tip-bot for Rik van Riel
2013-11-01 11:39   ` Prarit Bhargava [this message]