linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@kernel.org>
To: Chris J Arges <chris.j.arges@canonical.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Rafael David Tinoco <inaddy@ubuntu.com>,
	Peter Anvin <hpa@zytor.com>,
	Jiang Liu <jiang.liu@linux.intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>, Jens Axboe <axboe@kernel.dk>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Gema Gomez <gema.gomez-solano@canonical.com>,
	the arch/x86 maintainers <x86@kernel.org>
Subject: Re: smp_call_function_single lockups
Date: Thu, 2 Apr 2015 21:07:25 +0200	[thread overview]
Message-ID: <20150402190725.GA10570@gmail.com> (raw)
In-Reply-To: <551D8FAF.5070805@canonical.com>


* Chris J Arges <chris.j.arges@canonical.com> wrote:

> Whenever we look through the crashdump we see csd_lock_wait waiting 
> for CSD_FLAG_LOCK bit to be cleared.  Usually the signature leading 
> up to that looks like the following (in the openstack tempest on 
> openstack and nested VM stress case)
> 
> (qemu-system-x86 task)
> kvm_sched_in
>  -> kvm_arch_vcpu_load
>   -> vmx_vcpu_load
>    -> loaded_vmcs_clear
>     -> smp_call_function_single
> 
> (ksmd task)
> pmdp_clear_flush
>  -> flush_tlb_mm_range
>   -> native_flush_tlb_others
>     -> smp_call_function_many

So is this two separate smp_call_function instances, crossing each 
other, and none makes any progress, indefinitely - as if the two IPIs 
got lost?

The traces Rafael he linked to show a simpler scenario with two CPUs 
apparently locked up, doing this:

CPU0:

 #5 [ffffffff81c03e88] native_safe_halt at ffffffff81059386
 #6 [ffffffff81c03e90] default_idle at ffffffff8101eaee
 #7 [ffffffff81c03eb0] arch_cpu_idle at ffffffff8101f46f
 #8 [ffffffff81c03ec0] cpu_startup_entry at ffffffff810b6563
 #9 [ffffffff81c03f30] rest_init at ffffffff817a6067
#10 [ffffffff81c03f40] start_kernel at ffffffff81d4cfce
#11 [ffffffff81c03f80] x86_64_start_reservations at ffffffff81d4c4d7
#12 [ffffffff81c03f90] x86_64_start_kernel at ffffffff81d4c61c

This CPU is idle.

CPU1:

#10 [ffff88081993fa70] smp_call_function_single at ffffffff810f4d69
#11 [ffff88081993fb10] native_flush_tlb_others at ffffffff810671ae
#12 [ffff88081993fb40] flush_tlb_mm_range at ffffffff810672d4
#13 [ffff88081993fb80] pmdp_splitting_flush at ffffffff81065e0d
#14 [ffff88081993fba0] split_huge_page_to_list at ffffffff811ddd39
#15 [ffff88081993fc30] __split_huge_page_pmd at ffffffff811dec65
#16 [ffff88081993fcc0] unmap_single_vma at ffffffff811a4f03
#17 [ffff88081993fdc0] zap_page_range at ffffffff811a5d08
#18 [ffff88081993fe80] sys_madvise at ffffffff811b9775
#19 [ffff88081993ff80] system_call_fastpath at ffffffff817b8bad

This CPU is busy-waiting for the TLB flush IPI to finish.

There's no unexpected pattern here (other than it not finishing) 
AFAICS, the smp_call_function_single() is just the usual way we invoke 
the TLB flushing methods AFAICS.

So one possibility would be that an 'IPI was sent but lost'.

We could try the following trick: poll for completion for a couple of 
seconds (since an IPI is not held up by anything but irqs-off 
sections, it should arrive within microseconds typically - seconds of 
polling should be more than enough), and if the IPI does not arrive, 
print a warning message and re-send the IPI.

If the IPI was lost due to some race and there's no other failure mode 
that we don't understand, then this would work around the bug and 
would make the tests pass indefinitely - with occasional hickups and a 
handful of messages produced along the way whenever it would have 
locked up with a previous kernel.

If testing indeed confirms that kind of behavior we could drill down 
more closely to figure out why the IPI did not get to its destination.

Or if the behavior is different, we'd have some new behavior to look 
at. (for example the IPI sending mechanism might be wedged 
indefinitely for some reason, so that even a resend won't work.)

Agreed?

Thanks,

	Ingo

  reply	other threads:[~2015-04-02 19:07 UTC|newest]

Thread overview: 77+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-02-11 13:19 smp_call_function_single lockups Rafael David Tinoco
2015-02-11 18:18 ` Linus Torvalds
2015-02-11 19:59   ` Linus Torvalds
2015-02-11 20:42     ` Linus Torvalds
2015-02-12 16:38       ` Rafael David Tinoco
2015-02-18 22:25       ` Peter Zijlstra
2015-02-19 15:42         ` Rafael David Tinoco
2015-02-19 16:14           ` Linus Torvalds
2015-02-23 14:01             ` Rafael David Tinoco
2015-02-23 19:32               ` Linus Torvalds
2015-02-23 20:50                 ` Peter Zijlstra
2015-02-23 21:02                   ` Rafael David Tinoco
2015-02-19 16:16           ` Peter Zijlstra
2015-02-19 16:26           ` Linus Torvalds
2015-02-19 16:32             ` Rafael David Tinoco
2015-02-19 16:59               ` Linus Torvalds
2015-02-19 17:30                 ` Rafael David Tinoco
2015-02-19 17:39                 ` Linus Torvalds
2015-02-19 20:29                   ` Linus Torvalds
2015-02-19 21:59                     ` Linus Torvalds
2015-02-19 22:45                       ` Linus Torvalds
2015-03-31  3:15                         ` Chris J Arges
2015-03-31  4:28                           ` Linus Torvalds
2015-03-31 10:56                             ` [debug PATCHes] " Ingo Molnar
2015-03-31 22:38                               ` Chris J Arges
2015-04-01 12:39                                 ` Ingo Molnar
2015-04-01 14:10                                   ` Chris J Arges
2015-04-01 14:55                                     ` Ingo Molnar
2015-03-31  4:46                           ` Linus Torvalds
2015-03-31 15:08                           ` Linus Torvalds
2015-03-31 22:23                             ` Chris J Arges
2015-03-31 23:07                               ` Linus Torvalds
2015-04-01 14:32                                 ` Chris J Arges
2015-04-01 15:36                                   ` Linus Torvalds
2015-04-02  9:55                                     ` Ingo Molnar
2015-04-02 17:35                                       ` Linus Torvalds
2015-04-01 12:43                               ` Ingo Molnar
2015-04-01 16:10                                 ` Chris J Arges
2015-04-01 16:14                                   ` Linus Torvalds
2015-04-01 21:59                                     ` Chris J Arges
2015-04-02 17:31                                       ` Linus Torvalds
2015-04-02 18:26                                         ` Ingo Molnar
2015-04-02 18:51                                           ` Chris J Arges
2015-04-02 19:07                                             ` Ingo Molnar [this message]
2015-04-02 20:57                                               ` Linus Torvalds
2015-04-02 21:13                                               ` Chris J Arges
2015-04-03  5:43                                                 ` [PATCH] smp/call: Detect stuck CSD locks Ingo Molnar
2015-04-03  5:47                                                   ` Ingo Molnar
2015-04-06 16:58                                                   ` Chris J Arges
2015-04-06 17:32                                                     ` Linus Torvalds
2015-04-07  9:21                                                       ` Ingo Molnar
2015-04-07 20:59                                                         ` Chris J Arges
2015-04-07 21:15                                                           ` Linus Torvalds
2015-04-08  6:47                                                           ` Ingo Molnar
2015-04-13  3:56                                                             ` Chris J Arges
2015-04-13  6:14                                                               ` Ingo Molnar
2015-04-15 19:54                                                                 ` Chris J Arges
2015-04-16 11:04                                                                   ` Ingo Molnar
2015-04-16 15:58                                                                     ` Chris J Arges
2015-04-16 16:31                                                                       ` Ingo Molnar
2015-04-29 21:08                                                                         ` Chris J Arges
2015-05-11 14:00                                                                           ` Ingo Molnar
2015-05-20 18:19                                                                             ` Chris J Arges
2015-04-03  5:45                                                 ` smp_call_function_single lockups Ingo Molnar
2015-04-06 17:23                                         ` Chris J Arges
2015-02-20  9:30                     ` Ingo Molnar
2015-02-20 16:49                       ` Linus Torvalds
2015-02-20 19:41                         ` Ingo Molnar
2015-02-20 20:03                           ` Linus Torvalds
2015-02-20 20:11                             ` Ingo Molnar
2015-03-20 10:15       ` Peter Zijlstra
2015-03-20 16:26         ` Linus Torvalds
2015-03-20 17:14           ` Mike Galbraith
2015-04-01 14:22       ` Frederic Weisbecker
2015-04-18 10:13       ` [tip:locking/urgent] smp: Fix smp_call_function_single_async() locking tip-bot for Linus Torvalds
  -- strict thread matches above, loose matches on Subject: below --
2015-02-22  8:59 smp_call_function_single lockups Daniel J Blueman
2015-02-22 10:37 ` Ingo Molnar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150402190725.GA10570@gmail.com \
    --to=mingo@kernel.org \
    --cc=axboe@kernel.dk \
    --cc=chris.j.arges@canonical.com \
    --cc=fweisbec@gmail.com \
    --cc=gema.gomez-solano@canonical.com \
    --cc=hpa@zytor.com \
    --cc=inaddy@ubuntu.com \
    --cc=jiang.liu@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=torvalds@linux-foundation.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).