Re: numa balancing stuck in task_work_run

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Joe Lawrence <joe.lawrence@stratus.com>
To: Mel Gorman <mgorman@suse.de>, Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Subject: Re: numa balancing stuck in task_work_run
Date: Thu, 24 Sep 2015 17:14:32 -0400	[thread overview]
Message-ID: <560467B8.6000101@stratus.com> (raw)
In-Reply-To: <5604665D.3030504@stratus.com>

[ +cc for linux-mm mailinglist address ]

On 09/24/2015 05:08 PM, Joe Lawrence wrote:
> Hi Mel, Rik et al,
> 
> We've encountered interesting NUMA balancing behavior on RHEL7.1,
> reproduced with an upstream 4.2 kernel (of similar .config), that can
> leave a user process trapped in the kernel performing task_numa_work.
> 
> Our test group set up a server with 256GB memory running a program that
> allocates and dirties ~50% of that memory.  They reported the following
> condition when they attempted to kill the test process -- the signal was
> never handled, instead traces showed the task stuck here:
> 
>   PID: 36205  TASK: ffff887a692a8b60  CPU: 23  COMMAND: "memory_test_64"
>       [exception RIP: change_protection_range+0x4f8]
>       RIP: ffffffff8118a878  RSP: ffff887a777c3d68  RFLAGS: 00000282
>       RAX: ffff887f24065141  RBX: ffff887a38e8ee58  RCX: 00000000d37a9780
>       RDX: 80000034dea5e906  RSI: 00000006e9bcb000  RDI: 80000034dea5e906
>       RBP: ffff887a777c3e60   R8: ffff887ae51f6948   R9: 0000000000000001
>       R10: 0000000000000000  R11: ffff887f26f6d428  R12: 0000000000000000
>       R13: 00000006e9c00000  R14: 8000000000000025  R15: 00000006e9bcb000
>       CS: 0010  SS: 0018
>    #0 [ffff887a777c3e68] change_protection at ffffffff8118abf5
>    #1 [ffff887a777c3ea0] change_prot_numa at ffffffff811a106b
>    #2 [ffff887a777c3eb0] task_numa_work at ffffffff810adb23
>    #3 [ffff887a777c3f00] task_work_run at ffffffff81093b37
>    #4 [ffff887a777c3f30] do_notify_resume at ffffffff81013b0c
>    #5 [ffff887a777c3f50] retint_signal at ffffffff8160bafc
>       RIP: 00000000004025c4  RSP: 00007fff80aa5cf0  RFLAGS: 00000206
>       RAX: 0000000ddaa64a60  RBX: 00000000000fdd00  RCX: 00000000000afbc0
>       RDX: 00007fe45b6b6010  RSI: 000000000002dd50  RDI: 0000000ddaa36d10
>       RBP: 00007fff80aa5d40   R8: 0000000000000000   R9: 000000000000007d
>       R10: 00007fff80aa5a70  R11: 0000000000000246  R12: 00007fe45b6b6010
>       R13: 00007fff80aa5f30  R14: 0000000000000000  R15: 0000000000000000
>       ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
> 
> A quick sanity check of the kernel .config and sysctl values:
> 
>   % grep NUMA .config
>   CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
>   CONFIG_NUMA_BALANCING=y
>   CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
>   # CONFIG_X86_NUMACHIP is not set
>   CONFIG_NUMA=y
>   CONFIG_AMD_NUMA=y
>   CONFIG_X86_64_ACPI_NUMA=y
>   # CONFIG_NUMA_EMU is not set
>   CONFIG_USE_PERCPU_NUMA_NODE_ID=y
>   CONFIG_ACPI_NUMA=y
> 
>   % sysctl -a | grep numa_balancing
>   kernel.numa_balancing = 1
>   kernel.numa_balancing_scan_delay_ms = 1000
>   kernel.numa_balancing_scan_period_max_ms = 60000
>   kernel.numa_balancing_scan_period_min_ms = 1000
>   kernel.numa_balancing_scan_size_mb = 256
> 
> A systemtap probe confirmed the task was indeed stuck in task_work_run,
> with new task_work_add occurring before the previous task_numa_work had a
> chance to return (prefix is current jiffies):
> 
>   4555133534: kernel.function("task_numa_work@kernel/sched/fair.c:1796")
>   4555133676: kernel.function("task_work_add@kernel/task_work.c:8")
>   4555134412: kernel.function("task_numa_work@kernel/sched/fair.c:1796").return
>   4555134412: kernel.function("task_numa_work@kernel/sched/fair.c:1796")
>   4555134554: kernel.function("task_work_add@kernel/task_work.c:8")
>   4555135291: kernel.function("task_numa_work@kernel/sched/fair.c:1796").return
>   4555135291: kernel.function("task_numa_work@kernel/sched/fair.c:1796")
>   4555135433: kernel.function("task_work_add@kernel/task_work.c:8")
>   4555136167: kernel.function("task_numa_work@kernel/sched/fair.c:1796").return
>   4555136167: kernel.function("task_numa_work@kernel/sched/fair.c:1796")
>   4555136309: kernel.function("task_work_add@kernel/task_work.c:8")
> 
> Looking at the implementation of task_work_run, it will continue to
> churn as long as task->task_works will feed it.
> 
> I did further systemtap investigation to watch the program find its way
> into this condition.  What I found was that the numa_scan_period_max was
> dropping < 200.  This was an effect of a ballooning MM_ANONPAGES value
> (by way of task_nr_scan_windows() and task_scan_max()):
> 
>   [ ... shortly after program start ... ]
> 
>   numa_scan_period_max = 1621     task_nr_scan_windows = 39       MM_ANONPAGES = 2548181
>   numa_scan_period_max = 1538     task_nr_scan_windows = 40       MM_ANONPAGES = 2574349
>   numa_scan_period_max = 1538     task_nr_scan_windows = 40       MM_ANONPAGES = 2599734 
> 
>   [ ... snip about 20 minutes of data... ]
> 
>   numa_scan_period_max = 119      task_nr_scan_windows = 503      MM_ANONPAGES = 32956990
>   numa_scan_period_max = 119      task_nr_scan_windows = 503      MM_ANONPAGES = 32958955
>   numa_scan_period_max = 119      task_nr_scan_windows = 503      MM_ANONPAGES = 32960104
>   numa_scan_period_max = 119      task_nr_scan_windows = 503      MM_ANONPAGES = 32960104
> 
> update_task_scan_period will assign the numa_scan_period to the minimum
> of numa_scan_period_max and numa_scan_period * 2.  As
> numa_scan_period_max decreases, it will be the smaller value and hence
> the numa_next_scan's get closer and closer.
> 
> Looking back through the changelog, commit 598f0ec0bc99 "sched/numa:
> Set the scan rate proportional to the memory usage of the task being
> scanned" changed numa_balancing_scan_period_max semantics to tune the
> length of time to complete a full scan.  This may introduce the
> possibility of falling into this condition, though I'm not 100% sure.
> 
> Let me know if there is any additional data that would be helpful to
> report.  In the meantime, I've run a few hours with the following
> workaround to hold off the task_numa_work grind.
> 
> Regards,
> 
> -- Joe
> 
> -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8--
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 3d6baa7d4534..df34df492949 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -467,6 +467,11 @@ struct mm_struct {
>  	 */
>  	unsigned long numa_next_scan;
>  
> +	/* numa_last_work_time is the jiffy runtime of the previous
> +	 * task_numa_work invocation, providing hysteresis for numa_next_scan
> +	 * so it will be at least this many jiffies in the future. */
> +	unsigned long numa_last_work_time;
> +
>  	/* Restart point for scanning and setting pte_numa */
>  	unsigned long numa_scan_offset;
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e2e3483b1ec..16a96297e2a3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1651,7 +1651,7 @@ static void update_task_scan_period(struct task_struct *p,
>  		p->numa_scan_period = min(p->numa_scan_period_max,
>  			p->numa_scan_period << 1);
>  
> -		p->mm->numa_next_scan = jiffies +
> +		p->mm->numa_next_scan = jiffies + p->mm->numa_last_work_time +
>  			msecs_to_jiffies(p->numa_scan_period);
>  
>  		return;
> @@ -2269,6 +2269,9 @@ out:
>  		mm->numa_scan_offset = start;
>  	else
>  		reset_ptenuma_scan(p);
> +
> +	mm->numa_last_work_time = jiffies - now;
> +
>  	up_read(&mm->mmap_sem);
>  }
>  
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next      parent reply	other threads:[~2015-09-24 21:14 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <5604665D.3030504@stratus.com>
2015-09-24 21:14 ` Joe Lawrence [this message]
2015-09-25  1:04   ` numa balancing stuck in task_work_run Rik van Riel
2015-09-25 17:27     ` Joe Lawrence
2015-09-25 17:38       ` Rik van Riel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=560467B8.6000101@stratus.com \
    --to=joe.lawrence@stratus.com \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.