* Re: numa balancing stuck in task_work_run
From: Joe Lawrence @ 2015-09-24 21:14 UTC (permalink / raw)
To: Mel Gorman, Rik van Riel; +Cc: linux-mm
[ +cc for the linux-mm mailing list address ]
On 09/24/2015 05:08 PM, Joe Lawrence wrote:
> Hi Mel, Rik et al,
>
> We've encountered interesting NUMA balancing behavior on RHEL 7.1,
> reproduced with an upstream 4.2 kernel (with a similar .config), that
> can leave a user process trapped in the kernel performing
> task_numa_work.
>
> Our test group set up a server with 256GB of memory running a program
> that allocates and dirties ~50% of that memory. They reported the
> following condition when they attempted to kill the test process -- the
> signal was never handled; instead, traces showed the task stuck here:
>
> PID: 36205 TASK: ffff887a692a8b60 CPU: 23 COMMAND: "memory_test_64"
> [exception RIP: change_protection_range+0x4f8]
> RIP: ffffffff8118a878 RSP: ffff887a777c3d68 RFLAGS: 00000282
> RAX: ffff887f24065141 RBX: ffff887a38e8ee58 RCX: 00000000d37a9780
> RDX: 80000034dea5e906 RSI: 00000006e9bcb000 RDI: 80000034dea5e906
> RBP: ffff887a777c3e60 R8: ffff887ae51f6948 R9: 0000000000000001
> R10: 0000000000000000 R11: ffff887f26f6d428 R12: 0000000000000000
> R13: 00000006e9c00000 R14: 8000000000000025 R15: 00000006e9bcb000
> CS: 0010 SS: 0018
> #0 [ffff887a777c3e68] change_protection at ffffffff8118abf5
> #1 [ffff887a777c3ea0] change_prot_numa at ffffffff811a106b
> #2 [ffff887a777c3eb0] task_numa_work at ffffffff810adb23
> #3 [ffff887a777c3f00] task_work_run at ffffffff81093b37
> #4 [ffff887a777c3f30] do_notify_resume at ffffffff81013b0c
> #5 [ffff887a777c3f50] retint_signal at ffffffff8160bafc
> RIP: 00000000004025c4 RSP: 00007fff80aa5cf0 RFLAGS: 00000206
> RAX: 0000000ddaa64a60 RBX: 00000000000fdd00 RCX: 00000000000afbc0
> RDX: 00007fe45b6b6010 RSI: 000000000002dd50 RDI: 0000000ddaa36d10
> RBP: 00007fff80aa5d40 R8: 0000000000000000 R9: 000000000000007d
> R10: 00007fff80aa5a70 R11: 0000000000000246 R12: 00007fe45b6b6010
> R13: 00007fff80aa5f30 R14: 0000000000000000 R15: 0000000000000000
> ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
>
> A quick sanity check of the kernel .config and sysctl values:
>
> % grep NUMA .config
> CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
> CONFIG_NUMA_BALANCING=y
> CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
> # CONFIG_X86_NUMACHIP is not set
> CONFIG_NUMA=y
> CONFIG_AMD_NUMA=y
> CONFIG_X86_64_ACPI_NUMA=y
> # CONFIG_NUMA_EMU is not set
> CONFIG_USE_PERCPU_NUMA_NODE_ID=y
> CONFIG_ACPI_NUMA=y
>
> % sysctl -a | grep numa_balancing
> kernel.numa_balancing = 1
> kernel.numa_balancing_scan_delay_ms = 1000
> kernel.numa_balancing_scan_period_max_ms = 60000
> kernel.numa_balancing_scan_period_min_ms = 1000
> kernel.numa_balancing_scan_size_mb = 256
>
> A systemtap probe confirmed the task was indeed stuck in task_work_run,
> with a new task_work_add occurring before the previous task_numa_work
> had a chance to return (the prefix is the current jiffies value):
>
> 4555133534: kernel.function("task_numa_work@kernel/sched/fair.c:1796")
> 4555133676: kernel.function("task_work_add@kernel/task_work.c:8")
> 4555134412: kernel.function("task_numa_work@kernel/sched/fair.c:1796").return
> 4555134412: kernel.function("task_numa_work@kernel/sched/fair.c:1796")
> 4555134554: kernel.function("task_work_add@kernel/task_work.c:8")
> 4555135291: kernel.function("task_numa_work@kernel/sched/fair.c:1796").return
> 4555135291: kernel.function("task_numa_work@kernel/sched/fair.c:1796")
> 4555135433: kernel.function("task_work_add@kernel/task_work.c:8")
> 4555136167: kernel.function("task_numa_work@kernel/sched/fair.c:1796").return
> 4555136167: kernel.function("task_numa_work@kernel/sched/fair.c:1796")
> 4555136309: kernel.function("task_work_add@kernel/task_work.c:8")
>
> Looking at the implementation of task_work_run, one can see that it
> will continue to churn as long as task->task_works keeps feeding it
> work.
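> 
> For reference, here is a simplified paraphrase of the 4.2-era loop (the
> cmpxchg dance, locking, and PF_EXITING handling are omitted); the names
> match kernel/task_work.c, but the body is a sketch rather than the
> verbatim source:
> 
>     void task_work_run(void)
>     {
>         struct callback_head *work, *next;
> 
>         for (;;) {
>             /* Atomically detach the entire pending list. */
>             work = xchg(&current->task_works, NULL);
>             if (!work)
>                 break;  /* the only way out: an empty list */
>             do {
>                 next = work->next;
>                 work->func(work);   /* here: task_numa_work() */
>                 work = next;
>             } while (work);
>             /* If task_work_add() ran in the meantime (here, from the
>              * scheduler tick), the outer loop picks the new callback
>              * up immediately. */
>         }
>     }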
>
> I did further systemtap investigation to watch the program find its way
> into this condition. What I found was that numa_scan_period_max was
> dropping below 200ms. This was an effect of a ballooning MM_ANONPAGES
> count (by way of task_nr_scan_windows() and task_scan_max()):
>
> [ ... shortly after program start ... ]
>
> numa_scan_period_max = 1621 task_nr_scan_windows = 39 MM_ANONPAGES = 2548181
> numa_scan_period_max = 1538 task_nr_scan_windows = 40 MM_ANONPAGES = 2574349
> numa_scan_period_max = 1538 task_nr_scan_windows = 40 MM_ANONPAGES = 2599734
>
> [ ... snip about 20 minutes of data... ]
>
> numa_scan_period_max = 119 task_nr_scan_windows = 503 MM_ANONPAGES = 32956990
> numa_scan_period_max = 119 task_nr_scan_windows = 503 MM_ANONPAGES = 32958955
> numa_scan_period_max = 119 task_nr_scan_windows = 503 MM_ANONPAGES = 32960104
> numa_scan_period_max = 119 task_nr_scan_windows = 503 MM_ANONPAGES = 32960104
>
> update_task_scan_period assigns numa_scan_period the minimum of
> numa_scan_period_max and numa_scan_period * 2. As numa_scan_period_max
> decreases, it becomes the smaller of the two, and hence the
> numa_next_scan values get closer and closer together.
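> 
> For reference, a sketch of the arithmetic (paraphrased from 4.2's
> task_nr_scan_windows() and task_scan_max() in kernel/sched/fair.c --
> the real code uses get_mm_rss() and round_up() -- with the sysctl
> values and the final MM_ANONPAGES count from above plugged in):
> 
>     /* scan_size_mb converted to 4K pages: 256MB -> 65536 pages */
>     unsigned long nr_scan_pages = 256UL << (20 - 12);
> 
>     /* rss rounded up to whole scan windows: 32960104 pages -> 503 */
>     unsigned long windows = (32960104UL + nr_scan_pages - 1) / nr_scan_pages;
> 
>     /* scan_period_max_ms spread over the windows: 60000 / 503 = 119ms */
>     unsigned long numa_scan_period_max = 60000 / windows;
> 
> In other words, the larger the task's resident memory, the lower the
> ceiling on its scan period, until the next scan is due almost as soon
> as the current task_numa_work pass completes.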
>
> Looking back through the changelog, commit 598f0ec0bc99 ("sched/numa:
> Set the scan rate proportional to the memory usage of the task being
> scanned") changed the numa_balancing_scan_period_max semantics to tune
> the length of time needed to complete a full scan. That change may have
> opened the door to this condition, though I'm not 100% sure.
>
> Let me know if there is any additional data that would be helpful to
> report. In the meantime, I've been running for a few hours with the
> following workaround to hold off the task_numa_work grind.
>
> Regards,
>
> -- Joe
>
> -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8--
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 3d6baa7d4534..df34df492949 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -467,6 +467,11 @@ struct mm_struct {
>           */
>          unsigned long numa_next_scan;
>  
> +        /* numa_last_work_time is the jiffy runtime of the previous
> +         * task_numa_work invocation, providing hysteresis for numa_next_scan
> +         * so it will be at least this many jiffies in the future. */
> +        unsigned long numa_last_work_time;
> +
>          /* Restart point for scanning and setting pte_numa */
>          unsigned long numa_scan_offset;
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e2e3483b1ec..16a96297e2a3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1651,7 +1651,7 @@ static void update_task_scan_period(struct task_struct *p,
>                  p->numa_scan_period = min(p->numa_scan_period_max,
>                          p->numa_scan_period << 1);
>  
> -                p->mm->numa_next_scan = jiffies +
> +                p->mm->numa_next_scan = jiffies + p->mm->numa_last_work_time +
>                          msecs_to_jiffies(p->numa_scan_period);
>  
>                  return;
> @@ -2269,6 +2269,9 @@ out:
>                  mm->numa_scan_offset = start;
>          else
>                  reset_ptenuma_scan(p);
> +
> +        mm->numa_last_work_time = jiffies - now;
> +
>          up_read(&mm->mmap_sem);
>  }
>
>