Re: [PATCH 16/19] sched, numa: NUMA home-node selection code

From: Don Morris <don.morris@hp.com>
To: linux-kernel@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH 16/19] sched, numa: NUMA home-node selection code
Date: Wed, 26 Sep 2012 07:31:42 -0700
Message-ID: <506311CE.30406@hp.com>
In-Reply-To: <506310A8.5040402@hp.com>

Re-sending to LKML due to mailer picking up an incorrect
address. (Sorry for the dupe).

On 09/26/2012 07:26 AM, Don Morris wrote:
> Peter --
> 
> You may have / probably have already seen this, and if so I
> apologize in advance (can't find any sign of a fix via any
> searches...).
> 
> I picked up your August sched/numa patch set and have been
> working on it with a 2-node and an 8-node configuration. I got
> a very intermittent crash on the 2-node, which of course
> hasn't reproduced since I got crash/kdump configured.
> (I suspect it is related, however.)
> 
> On the 8-node, however, I very reliably got a hard lockup
> NMI after several minutes. This occurs reliably in the first
> test of Andrea's autonuma-benchmark
> (git://gitorious.org/autonuma-benchmark/autonuma-benchmark.git):
> two processes, one thread per core/vcore, each looping over a
> single malloc space. I'll attach the full stack set from that crash.
> 
> Since the NMI output consistently pointed to the hard lockup
> stemming from a wait on a spinlock that was never acquired,
> I turned on lock debugging in the .config and got a very
> clear, very consistent circular dependency warning (just
> below).
> 
> As far as I can tell, the warning is correct and is consistent
> with the actual NMI crash output. The crash is a variant: the
> "pidof" process on CPU 52 goes through task_sched_runtime() to
> do the task_rq_lock() operation on the numa01 process, so it
> holds the pi_lock and waits for the rq->lock, while numa01
> (back on CPU 0) holds the rq->lock from scheduler_tick() and
> is going for the pi_lock via task_work_add().
> 
> I'm nowhere near confident enough in my knowledge of the
> nuances of run queue locking during the tick update to try
> to hack a workaround - so sorry no proposed patch fix here,
> just a bug report.
> 
> On another minor note: while looking over this, I noticed that
> most other CPUs were tied up waiting for the page lock on one
> of the huge pages (THP was of course on) while one of them
> busied itself invalidating across the other CPUs. The question
> comes to mind whether that's really needed.
> 
> It certainly is needed in the true PROT_NONE case you're
> building off of, as you can't allow access to a translation
> which is now supposed to be locked out. But you could allow
> transitory minor faults when going from PROT_NONE back to
> access, as the fault would clear the TLB anyway. (That holds
> at least on x86; any architecture which doesn't do that would
> need an explicit TLB invalidation for cases where the
> translation is detected as updated anyway, so that should be
> okay.)
> 
> In your case, I would think transitory faults on what's really
> a hint to the system would be much better than tying up N-1
> other CPUs to do the flush on a process that spans the system,
> especially if the other processors are running that process
> but working on a different page (and hence may never even
> touch the page whose access is changing).
> 
> Even in the case where you're adding the hint (access to NONE),
> you could be willing to miss an access in favor of letting the
> next context switch invalidate the TLB for you, given that you
> really need a non-trivial run time to merit doing this work
> and to have a good chance of settling into a good access
> pattern. (Again, there may be architectures where you'll never
> invalidate unless it is done explicitly; I think IPF was that
> way, but it has been a while.)
> 
> Just a thought.
> 
> Thanks for your work,
> Don Morris
> 
> ======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.6.0-rc4 #28 Not tainted
> -------------------------------------------------------
> numa01/35386 is trying to acquire lock:
>  (&p->pi_lock){-.-.-.}, at: [<ffffffff81073e68>] task_work_add+0x38/0xa0
> 
> but task is already holding lock:
>  (&rq->lock){-.-.-.}, at: [<ffffffff81085d83>] scheduler_tick+0x53/0x150
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (&rq->lock){-.-.-.}:
>        [<ffffffff810b52e3>] validate_chain+0x633/0x730
>        [<ffffffff810b57d2>] __lock_acquire+0x3f2/0x490
>        [<ffffffff810b5959>] lock_acquire+0xe9/0x120
>        [<ffffffff8152e306>] _raw_spin_lock+0x36/0x70
>        [<ffffffff8108c1f1>] wake_up_new_task+0xd1/0x190
>        [<ffffffff810513f2>] do_fork+0x1f2/0x280
>        [<ffffffff8101bcd6>] kernel_thread+0x76/0x80
>        [<ffffffff81513976>] rest_init+0x26/0xc0
>        [<ffffffff81cdfeff>] start_kernel+0x3c6/0x3d3
>        [<ffffffff81cdf356>] x86_64_start_reservations+0x131/0x136
>        [<ffffffff81cdf45c>] x86_64_start_kernel+0x101/0x110
> 
> -> #0 (&p->pi_lock){-.-.-.}:
>        [<ffffffff810b48ef>] check_prev_add+0x11f/0x4e0
>        [<ffffffff810b52e3>] validate_chain+0x633/0x730
>        [<ffffffff810b57d2>] __lock_acquire+0x3f2/0x490
>        [<ffffffff810b5959>] lock_acquire+0xe9/0x120
>        [<ffffffff8152e4b5>] _raw_spin_lock_irqsave+0x55/0xa0
>        [<ffffffff81073e68>] task_work_add+0x38/0xa0
>        [<ffffffff810905d7>] task_tick_numa+0xb7/0xd0
>        [<ffffffff8109237a>] task_tick_fair+0x5a/0x70
>        [<ffffffff81085e0e>] scheduler_tick+0xde/0x150
>        [<ffffffff8106267e>] update_process_times+0x6e/0x90
>        [<ffffffff810ad803>] tick_sched_timer+0xa3/0xe0
>        [<ffffffff8107c266>] __run_hrtimer+0x106/0x1c0
>        [<ffffffff8107c5f0>] hrtimer_interrupt+0x120/0x260
>        [<ffffffff81538fdd>] smp_apic_timer_interrupt+0x8d/0xa3
>        [<ffffffff81537eaf>] apic_timer_interrupt+0x6f/0x80
>        [<ffffffff8152e326>] _raw_spin_lock+0x56/0x70
>        [<ffffffff811488e8>] do_anonymous_page+0x1e8/0x270
>        [<ffffffff8114d1fc>] handle_pte_fault+0x9c/0x2a0
>        [<ffffffff8114d5a0>] handle_mm_fault+0x1a0/0x1c0
>        [<ffffffff81532de1>] do_page_fault+0x421/0x450
>        [<ffffffff8152f2d5>] page_fault+0x25/0x30
> 
> other info that might help us debug this:
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(&rq->lock);
>                                lock(&p->pi_lock);
>                                lock(&rq->lock);
>   lock(&p->pi_lock);
> 
>  *** DEADLOCK ***
> 
> 3 locks held by numa01/35386:
>  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff81532bbc>] do_page_fault+0x1fc/0x450
>  #1:  (&(&mm->page_table_lock)->rlock){+.+...}, at: [<ffffffff811488e8>] do_anonymous_page+0x1e8/0x270
>  #2:  (&rq->lock){-.-.-.}, at: [<ffffffff81085d83>] scheduler_tick+0x53/0x150
> 
> stack backtrace:
> Pid: 35386, comm: numa01 Not tainted 3.6.0-rc4 #28
> Call Trace:
>  <IRQ>  [<ffffffff810b36a7>] print_circular_bug+0xf7/0x120
>  [<ffffffff8108f5d7>] ? update_sd_lb_stats+0x347/0x700
>  [<ffffffff810b48ef>] check_prev_add+0x11f/0x4e0
>  [<ffffffff8101afe5>] ? native_sched_clock+0x35/0x80
>  [<ffffffff8101a5d9>] ? sched_clock+0x9/0x10
>  [<ffffffff8108d82f>] ? sched_clock_cpu+0x4f/0x110
>  [<ffffffff810b52e3>] validate_chain+0x633/0x730
>  [<ffffffff8101a5d9>] ? sched_clock+0x9/0x10
>  [<ffffffff810b57d2>] __lock_acquire+0x3f2/0x490
>  [<ffffffff810afc5d>] ? trace_hardirqs_off+0xd/0x10
>  [<ffffffff810b5959>] lock_acquire+0xe9/0x120
>  [<ffffffff81073e68>] ? task_work_add+0x38/0xa0
>  [<ffffffff8152e4b5>] _raw_spin_lock_irqsave+0x55/0xa0
>  [<ffffffff81073e68>] ? task_work_add+0x38/0xa0
>  [<ffffffff81073e68>] task_work_add+0x38/0xa0
>  [<ffffffff810905d7>] task_tick_numa+0xb7/0xd0
>  [<ffffffff8109237a>] task_tick_fair+0x5a/0x70
>  [<ffffffff81085e0e>] scheduler_tick+0xde/0x150
>  [<ffffffff8106267e>] update_process_times+0x6e/0x90
>  [<ffffffff810ad803>] tick_sched_timer+0xa3/0xe0
>  [<ffffffff8107c266>] __run_hrtimer+0x106/0x1c0
>  [<ffffffff810ad760>] ? tick_nohz_restart+0xa0/0xa0
>  [<ffffffff8107c5f0>] hrtimer_interrupt+0x120/0x260
>  [<ffffffff81538fdd>] smp_apic_timer_interrupt+0x8d/0xa3
>  [<ffffffff81537eaf>] apic_timer_interrupt+0x6f/0x80
>  <EOI>  [<ffffffff8108d93b>] ? local_clock+0x4b/0x70
>  [<ffffffff812754e2>] ? do_raw_spin_lock+0xb2/0x140
>  [<ffffffff81275509>] ? do_raw_spin_lock+0xd9/0x140
>  [<ffffffff8152e326>] _raw_spin_lock+0x56/0x70
>  [<ffffffff811488e8>] ? do_anonymous_page+0x1e8/0x270
>  [<ffffffff811488e8>] do_anonymous_page+0x1e8/0x270
>  [<ffffffff8114d1fc>] handle_pte_fault+0x9c/0x2a0
>  [<ffffffff81532bbc>] ? do_page_fault+0x1fc/0x450
>  [<ffffffff810b5ddf>] ? __lock_release+0x14f/0x180
>  [<ffffffff8114d5a0>] handle_mm_fault+0x1a0/0x1c0
>  [<ffffffff8107d1c5>] ? down_read_trylock+0x55/0x70
>  [<ffffffff81532de1>] do_page_fault+0x421/0x450
>  [<ffffffff810b5ddf>] ? __lock_release+0x14f/0x180
>  [<ffffffff810b4522>] ? trace_hardirqs_on_caller+0x152/0x1c0
>  [<ffffffff810b459d>] ? trace_hardirqs_on+0xd/0x10
>  [<ffffffff8152ed60>] ? _raw_spin_unlock_irq+0x30/0x40
>  [<ffffffff8152d670>] ? __schedule+0x610/0x690
>  [<ffffffff8126f03d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
>  [<ffffffff8152f2d5>] page_fault+0x25/0x30
>
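
For readers less familiar with this kind of report, the inversion
described above can be modelled in a few lines of user-space Python.
This is only a toy sketch of lock-order tracking, not the kernel's
lockdep implementation: the class LockOrderGraph and its acquire()
method are hypothetical names invented for illustration, and the lock
names are plain strings taken from the warning.

```python
class LockOrderGraph:
    """Toy lock-order tracker (hypothetical; not the kernel's lockdep)."""

    def __init__(self):
        # edges[a] holds every lock that was taken while a was held.
        self.edges = {}

    def _reaches(self, src, dst):
        """Return True if dst is reachable from src along recorded edges."""
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges.get(node, ()))
        return False

    def acquire(self, held, new):
        """Record taking `new` while `held` is held.

        Returns True when the new edge closes a cycle, i.e. some earlier
        path already took `held` (directly or indirectly) after `new`.
        """
        inversion = self._reaches(new, held)
        self.edges.setdefault(held, set()).add(new)
        return inversion


graph = LockOrderGraph()

# Order described for task_rq_lock(): pi_lock first, then rq->lock.
rq_path = graph.acquire("p->pi_lock", "rq->lock")

# Order in the tick path: rq->lock held, task_work_add() takes pi_lock.
tick_path = graph.acquire("rq->lock", "p->pi_lock")

print("task_rq_lock path flagged:", rq_path)
print("tick path flagged:", tick_path)
```

The first ordering is recorded without complaint; the second is
flagged because it closes the cycle rq->lock -> pi_lock -> rq->lock,
which is the same shape as the CPU0/CPU1 scenario in the warning.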
