From: Joe Lawrence <joe.lawrence@stratus.com>
To: Mel Gorman <mgorman@suse.de>, Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Subject: Re: numa balancing stuck in task_work_run
Date: Thu, 24 Sep 2015 17:14:32 -0400
Message-ID: <560467B8.6000101@stratus.com>
In-Reply-To: <5604665D.3030504@stratus.com>
[ +cc for linux-mm mailing list address ]
On 09/24/2015 05:08 PM, Joe Lawrence wrote:
> Hi Mel, Rik et al,
>
> We've encountered interesting NUMA balancing behavior on RHEL7.1,
> reproduced with an upstream 4.2 kernel (of similar .config), that can
> leave a user process trapped in the kernel performing task_numa_work.
>
> Our test group set up a server with 256GB memory running a program that
> allocates and dirties ~50% of that memory. They reported the following
> condition when they attempted to kill the test process -- the signal was
> never handled, instead traces showed the task stuck here:
>
> PID: 36205 TASK: ffff887a692a8b60 CPU: 23 COMMAND: "memory_test_64"
> [exception RIP: change_protection_range+0x4f8]
> RIP: ffffffff8118a878 RSP: ffff887a777c3d68 RFLAGS: 00000282
> RAX: ffff887f24065141 RBX: ffff887a38e8ee58 RCX: 00000000d37a9780
> RDX: 80000034dea5e906 RSI: 00000006e9bcb000 RDI: 80000034dea5e906
> RBP: ffff887a777c3e60 R8: ffff887ae51f6948 R9: 0000000000000001
> R10: 0000000000000000 R11: ffff887f26f6d428 R12: 0000000000000000
> R13: 00000006e9c00000 R14: 8000000000000025 R15: 00000006e9bcb000
> CS: 0010 SS: 0018
> #0 [ffff887a777c3e68] change_protection at ffffffff8118abf5
> #1 [ffff887a777c3ea0] change_prot_numa at ffffffff811a106b
> #2 [ffff887a777c3eb0] task_numa_work at ffffffff810adb23
> #3 [ffff887a777c3f00] task_work_run at ffffffff81093b37
> #4 [ffff887a777c3f30] do_notify_resume at ffffffff81013b0c
> #5 [ffff887a777c3f50] retint_signal at ffffffff8160bafc
> RIP: 00000000004025c4 RSP: 00007fff80aa5cf0 RFLAGS: 00000206
> RAX: 0000000ddaa64a60 RBX: 00000000000fdd00 RCX: 00000000000afbc0
> RDX: 00007fe45b6b6010 RSI: 000000000002dd50 RDI: 0000000ddaa36d10
> RBP: 00007fff80aa5d40 R8: 0000000000000000 R9: 000000000000007d
> R10: 00007fff80aa5a70 R11: 0000000000000246 R12: 00007fe45b6b6010
> R13: 00007fff80aa5f30 R14: 0000000000000000 R15: 0000000000000000
> ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
>
> A quick sanity check of the kernel .config and sysctl values:
>
> % grep NUMA .config
> CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
> CONFIG_NUMA_BALANCING=y
> CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
> # CONFIG_X86_NUMACHIP is not set
> CONFIG_NUMA=y
> CONFIG_AMD_NUMA=y
> CONFIG_X86_64_ACPI_NUMA=y
> # CONFIG_NUMA_EMU is not set
> CONFIG_USE_PERCPU_NUMA_NODE_ID=y
> CONFIG_ACPI_NUMA=y
>
> % sysctl -a | grep numa_balancing
> kernel.numa_balancing = 1
> kernel.numa_balancing_scan_delay_ms = 1000
> kernel.numa_balancing_scan_period_max_ms = 60000
> kernel.numa_balancing_scan_period_min_ms = 1000
> kernel.numa_balancing_scan_size_mb = 256
>
> A systemtap probe confirmed the task was indeed stuck in task_work_run,
> with new task_work_add occurring before the previous task_numa_work had a
> chance to return (prefix is current jiffies):
>
> 4555133534: kernel.function("task_numa_work@kernel/sched/fair.c:1796")
> 4555133676: kernel.function("task_work_add@kernel/task_work.c:8")
> 4555134412: kernel.function("task_numa_work@kernel/sched/fair.c:1796").return
> 4555134412: kernel.function("task_numa_work@kernel/sched/fair.c:1796")
> 4555134554: kernel.function("task_work_add@kernel/task_work.c:8")
> 4555135291: kernel.function("task_numa_work@kernel/sched/fair.c:1796").return
> 4555135291: kernel.function("task_numa_work@kernel/sched/fair.c:1796")
> 4555135433: kernel.function("task_work_add@kernel/task_work.c:8")
> 4555136167: kernel.function("task_numa_work@kernel/sched/fair.c:1796").return
> 4555136167: kernel.function("task_numa_work@kernel/sched/fair.c:1796")
> 4555136309: kernel.function("task_work_add@kernel/task_work.c:8")
>
> Looking at the implementation of task_work_run, it will continue to
> churn for as long as task->task_works keeps feeding it new work.
>
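
For illustration, here is a rough user-space model of the loop described
above. It is only a sketch: struct work, work_add, work_run, slow_scan and
requeues_left are hypothetical names standing in for the kernel's callback
list, not the kernel code itself. It shows that the loop only exits once a
pass leaves the pending list empty, so work that is queued again while it
runs keeps the caller inside the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an entry on task->task_works. */
struct work {
	struct work *next;
	void (*func)(struct work *w);
};

static struct work *pending;	/* models task->task_works */
static int requeues_left;	/* how many times the model re-queues itself */

static void work_add(struct work *w)
{
	assert(w && w->func);
	w->next = pending;
	pending = w;
}

/* Models a scan that runs so long that it gets queued again meanwhile. */
static void slow_scan(struct work *w)
{
	if (requeues_left > 0) {
		requeues_left--;
		work_add(w);
	}
}

/* Returns how many callbacks ran before the list stayed empty. */
static int work_run(void)
{
	int ran = 0;

	for (;;) {
		struct work *w = pending;	/* detach the whole list */

		pending = NULL;
		if (!w)
			break;			/* nothing was re-queued: done */

		while (w) {
			struct work *next = w->next;

			w->func(w);
			ran++;
			w = next;
		}
	}
	return ran;
}
```

With requeues_left set to 3 and one slow_scan entry queued, work_run()
executes the callback four times before returning. If the re-queue never
stops, as in the trace above, the loop never returns to user space.
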
> I did further systemtap investigation to watch the program find its way
> into this condition. What I found was that numa_scan_period_max was
> dropping below 200 ms. This was an effect of a ballooning MM_ANONPAGES
> value (by way of task_nr_scan_windows() and task_scan_max()):
>
> [ ... shortly after program start ... ]
>
> numa_scan_period_max = 1621 task_nr_scan_windows = 39 MM_ANONPAGES = 2548181
> numa_scan_period_max = 1538 task_nr_scan_windows = 40 MM_ANONPAGES = 2574349
> numa_scan_period_max = 1538 task_nr_scan_windows = 40 MM_ANONPAGES = 2599734
>
> [ ... snip about 20 minutes of data... ]
>
> numa_scan_period_max = 119 task_nr_scan_windows = 503 MM_ANONPAGES = 32956990
> numa_scan_period_max = 119 task_nr_scan_windows = 503 MM_ANONPAGES = 32958955
> numa_scan_period_max = 119 task_nr_scan_windows = 503 MM_ANONPAGES = 32960104
> numa_scan_period_max = 119 task_nr_scan_windows = 503 MM_ANONPAGES = 32960104
>
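
The steady-state rows above can be reproduced with some simple arithmetic.
The following is a sketch under assumptions, not the kernel functions
themselves: it assumes 4 KiB pages, treats the MM_ANONPAGES count as the
whole resident set, and leaves out any lower bound the kernel applies to
the result. The function names nr_scan_windows and scan_period_max are
made up for this example.

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages, as on x86_64 */

/* Number of scan windows needed to cover the resident pages once. */
static unsigned long nr_scan_windows(unsigned long rss_pages,
				     unsigned int scan_size_mb)
{
	/* Pages covered by one scan window (scan_size_mb is in MB). */
	unsigned long window_pages =
		(unsigned long)scan_size_mb << (20 - PAGE_SHIFT);

	assert(window_pages > 0);
	if (!rss_pages)
		rss_pages = window_pages;

	/* Round up so that a partial window still counts as one. */
	return (rss_pages + window_pages - 1) / window_pages;
}

/* Per-window period (ms) when a full pass should take period_max_ms. */
static unsigned int scan_period_max(unsigned long rss_pages,
				    unsigned int scan_size_mb,
				    unsigned int period_max_ms)
{
	return (unsigned int)(period_max_ms /
			      nr_scan_windows(rss_pages, scan_size_mb));
}
```

With the sysctl values quoted earlier (scan size 256 MB, scan period max
60000 ms), 32956990 pages give 503 windows and a period of 119 ms, which
matches the last rows of the systemtap output.
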
> update_task_scan_period will set numa_scan_period to the minimum of
> numa_scan_period_max and numa_scan_period * 2. As numa_scan_period_max
> decreases, it becomes the smaller of the two, and hence successive
> numa_next_scan deadlines land closer and closer together.
>
> Looking back through the changelog, commit 598f0ec0bc99 "sched/numa:
> Set the scan rate proportional to the memory usage of the task being
> scanned" changed numa_balancing_scan_period_max semantics to tune the
> length of time to complete a full scan. This may introduce the
> possibility of falling into this condition, though I'm not 100% sure.
>
> Let me know if there is any additional data that would be helpful to
> report. In the meantime, I've run a few hours with the following
> workaround to hold off the task_numa_work grind.
>
> Regards,
>
> -- Joe
>
> -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8--
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 3d6baa7d4534..df34df492949 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -467,6 +467,11 @@ struct mm_struct {
>  	 */
>  	unsigned long numa_next_scan;
>
> +	/* numa_last_work_time is the jiffy runtime of the previous
> +	 * task_numa_work invocation, providing hysteresis for numa_next_scan
> +	 * so it will be at least this many jiffies in the future. */
> +	unsigned long numa_last_work_time;
> +
>  	/* Restart point for scanning and setting pte_numa */
>  	unsigned long numa_scan_offset;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e2e3483b1ec..16a96297e2a3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1651,7 +1651,7 @@ static void update_task_scan_period(struct task_struct *p,
>  		p->numa_scan_period = min(p->numa_scan_period_max,
>  			p->numa_scan_period << 1);
>
> -		p->mm->numa_next_scan = jiffies +
> +		p->mm->numa_next_scan = jiffies + p->mm->numa_last_work_time +
>  			msecs_to_jiffies(p->numa_scan_period);
>
>  		return;
> @@ -2269,6 +2269,9 @@ out:
>  		mm->numa_scan_offset = start;
>  	else
>  		reset_ptenuma_scan(p);
> +
> +	mm->numa_last_work_time = jiffies - now;
> +
>  	up_read(&mm->mmap_sem);
>  }
> }
>
>
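
To make the effect of the workaround concrete, here is a small sketch of
the deadline arithmetic before and after the change. It assumes HZ=1000
(one jiffy per millisecond); msecs_to_jiffies_model, next_scan_orig and
next_scan_patched are hypothetical helpers written for this example, not
kernel functions.

```c
#include <assert.h>

#define HZ 1000	/* assumption: one jiffy per millisecond */

/* Simplified stand-in for msecs_to_jiffies() under the HZ assumption. */
static unsigned long msecs_to_jiffies_model(unsigned int ms)
{
	return (unsigned long)ms * HZ / 1000;
}

/* Deadline as computed before the workaround. */
static unsigned long next_scan_orig(unsigned long now, unsigned int period_ms)
{
	assert(period_ms > 0);
	return now + msecs_to_jiffies_model(period_ms);
}

/* Deadline with the workaround: pushed out by the previous run time. */
static unsigned long next_scan_patched(unsigned long now,
				       unsigned long last_work_time,
				       unsigned int period_ms)
{
	assert(period_ms > 0);
	return now + last_work_time + msecs_to_jiffies_model(period_ms);
}
```

In the trace above each task_numa_work call took roughly 878 jiffies. With
a 119 ms period, next_scan_orig(0, 119) is 119, well inside the run time of
the scan itself, while next_scan_patched(0, 878, 119) is 997, so the next
scan is held off for at least as long as the previous one took.
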