From: Vlastimil Babka <vbabka@suse.cz>
To: Mel Gorman <mgorman@techsingularity.net>,
Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Linux-MM <linux-mm@kvack.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/1] mm: thp: Set THP defrag by default to madvise and add a stall-free defrag option
Date: Tue, 1 Mar 2016 13:42:06 +0100 [thread overview]
Message-ID: <56D58E1E.5090708@suse.cz> (raw)
In-Reply-To: <1456503359-4910-1-git-send-email-mgorman@techsingularity.net>
On 02/26/2016 05:15 PM, Mel Gorman wrote:
> Changelog since v1
> o Default defrag to madvise instead of never
> o Introduce "defer" defrag option to wake kswapd/kcompact if THP is unavailable
> o Restore "always" to historical behaviour
> o Update documentation
>
> THP defrag is enabled by default to direct reclaim/compact but not wake
> kswapd in the event of a THP allocation failure. The problem is that THP
> allocation requests potentially enter reclaim/compaction. This potentially
> incurs a severe stall that is not guaranteed to be offset by reduced TLB
> misses. While there has been considerable effort to reduce the impact
> of reclaim/compaction, it is still a high cost and workloads that should
> fit in memory fail to do so. Specifically, a simple anon/file streaming
> workload will enter direct reclaim on NUMA at least even though the working
> set size is 80% of RAM. It's been years and it's time to throw in the towel.
>
> First, this patch defines THP defrag as follows;
>
> madvise: A failed allocation will direct reclaim/compact if the application requests it
> never: Neither reclaim/compact nor wake kswapd
> defer: A failed allocation will wake kswapd/kcompactd
> always: A failed allocation will direct reclaim/compact (historical behaviour)
> khugepaged defrag will enter direct/reclaim but not wake kswapd.
>
> Next it sets the default defrag option to be "madvise" to only enter direct
> reclaim/compaction for applications that specifically requested it.
>
> Lastly, it removes a check from the page allocator slowpath that is related
> to __GFP_THISNODE to allow "defer" to work. The callers that really cares are
> slub/slab and they are updated accordingly. The slab one may be surprising
> because it also corrects a comment as kswapd was never woken up by that path.
It would be also nice if we could remove the is_thp_gfp_mask() checks one day.
They try to make direct reclaim/compaction for THP less intrusive, but maybe
could be removed now that the stalls are limited otherwise. But that's out of
scope here.
[...]
> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> index 8a282687ee06..a19b173cbc57 100644
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -113,9 +113,26 @@ guaranteed, but it may be more likely in case the allocation is for a
> MADV_HUGEPAGE region.
>
> echo always >/sys/kernel/mm/transparent_hugepage/defrag
> +echo defer >/sys/kernel/mm/transparent_hugepage/defrag
> echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
> echo never >/sys/kernel/mm/transparent_hugepage/defrag
>
> +"always" means that an application requesting THP will stall on allocation
> +failure and directly reclaim pages and compact memory in an effort to
> +allocate a THP immediately. This may be desirable for virtual machines
> +that benefit heavily from THP use and are willing to delay the VM start
> +to utilise them.
> +
> +"defer" means that an application will wake kswapd in the background
> +to reclaim pages and wake kcompact to compact memory so that THP is
> +available in the near future. It's the responsibility of khugepaged
> +to then install the THP pages later.
> +
> +"madvise" will enter direct reclaim like "always" but only for regions
> +that are have used madvise(). This is the default behaviour.
"madvise(MADV_HUGEPAGE)" perhaps?
[...]
> @@ -277,17 +273,23 @@ static ssize_t double_flag_store(struct kobject *kobj,
> static ssize_t enabled_show(struct kobject *kobj,
> struct kobj_attribute *attr, char *buf)
> {
> - return double_flag_show(kobj, attr, buf,
> - TRANSPARENT_HUGEPAGE_FLAG,
> - TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
> + if (test_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags)) {
> + VM_BUG_ON(test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags));
> + return sprintf(buf, "[always] madvise never\n");
> + } else if (test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags))
> + return sprintf(buf, "always [madvise] never\n");
> + else
> + return sprintf(buf, "always madvise [never]\n");
Somewhat ugly wrt consistent usage of { }, due to the VM_BUG_ON(), which I would
just drop. Also I wonder if some racy read vs write of the file can trigger the
BUG_ON? Or are the kobject accesses synchronized at a higher level?
> }
> +
> static ssize_t enabled_store(struct kobject *kobj,
> struct kobj_attribute *attr,
> const char *buf, size_t count)
> {
> ssize_t ret;
>
> - ret = double_flag_store(kobj, attr, buf, count,
> + ret = triple_flag_store(kobj, attr, buf, count,
> + TRANSPARENT_HUGEPAGE_FLAG,
> TRANSPARENT_HUGEPAGE_FLAG,
> TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
So this makes "echo defer > enabled" behave just like "echo always"? For
userspace interface that becomes a fixed ABI, I would prefer to be more careful
with unintended aliases like this. Maybe pass something like "-1" that
triple_flag_store() would check in the "defer" case and return -EINVAL?
> @@ -345,16 +347,23 @@ static ssize_t single_flag_store(struct kobject *kobj,
> static ssize_t defrag_show(struct kobject *kobj,
> struct kobj_attribute *attr, char *buf)
> {
> - return double_flag_show(kobj, attr, buf,
> - TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> - TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> + if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> + return sprintf(buf, "[always] defer madvise never\n");
> + if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> + return sprintf(buf, "always [defer] madvise never\n");
> + else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
> + return sprintf(buf, "always defer [madvise] never\n");
> + else
> + return sprintf(buf, "always defer madvise [never]\n");
> +
> }
> static ssize_t defrag_store(struct kobject *kobj,
> struct kobj_attribute *attr,
> const char *buf, size_t count)
> {
> - return double_flag_store(kobj, attr, buf, count,
> - TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> + return triple_flag_store(kobj, attr, buf, count,
> + TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
> + TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
> TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> }
> static struct kobj_attribute defrag_attr =
> @@ -784,9 +793,30 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
> return 0;
> }
>
> -static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
> +/*
> + * If THP is set to always then directly reclaim/compact as necessary
> + * If set to defer then do no reclaim and defer to khugepaged
> + * If set to madvise and the VMA is flagged then directly reclaim/compact
> + */
> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +{
> + gfp_t reclaim_flags = 0;
> +
> + if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags) &&
> + (vma->vm_flags & VM_HUGEPAGE))
> + reclaim_flags = __GFP_DIRECT_RECLAIM;
> + else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> + reclaim_flags = __GFP_KSWAPD_RECLAIM;
> + else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> + reclaim_flags = __GFP_DIRECT_RECLAIM;
Hmm, here's a trick question. What if I wanted direct reclaim for madvise()
vma's and kswapd/kcompactd for others? Right now there's no such option, right?
And expressing that with different values for a single tunable becomes ugly...
(For completeness, somebody could want kswapd/kcompactd defrag only for
madvise() but that's arguably not much useful)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: Vlastimil Babka <vbabka@suse.cz>
To: Mel Gorman <mgorman@techsingularity.net>,
Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Linux-MM <linux-mm@kvack.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/1] mm: thp: Set THP defrag by default to madvise and add a stall-free defrag option
Date: Tue, 1 Mar 2016 13:42:06 +0100 [thread overview]
Message-ID: <56D58E1E.5090708@suse.cz> (raw)
In-Reply-To: <1456503359-4910-1-git-send-email-mgorman@techsingularity.net>
On 02/26/2016 05:15 PM, Mel Gorman wrote:
> Changelog since v1
> o Default defrag to madvise instead of never
> o Introduce "defer" defrag option to wake kswapd/kcompact if THP is unavailable
> o Restore "always" to historical behaviour
> o Update documentation
>
> THP defrag is enabled by default to direct reclaim/compact but not wake
> kswapd in the event of a THP allocation failure. The problem is that THP
> allocation requests potentially enter reclaim/compaction. This potentially
> incurs a severe stall that is not guaranteed to be offset by reduced TLB
> misses. While there has been considerable effort to reduce the impact
> of reclaim/compaction, it is still a high cost and workloads that should
> fit in memory fail to do so. Specifically, a simple anon/file streaming
> workload will enter direct reclaim on NUMA at least even though the working
> set size is 80% of RAM. It's been years and it's time to throw in the towel.
>
> First, this patch defines THP defrag as follows;
>
> madvise: A failed allocation will direct reclaim/compact if the application requests it
> never: Neither reclaim/compact nor wake kswapd
> defer: A failed allocation will wake kswapd/kcompactd
> always: A failed allocation will direct reclaim/compact (historical behaviour)
> khugepaged defrag will enter direct/reclaim but not wake kswapd.
>
> Next it sets the default defrag option to be "madvise" to only enter direct
> reclaim/compaction for applications that specifically requested it.
>
> Lastly, it removes a check from the page allocator slowpath that is related
> to __GFP_THISNODE to allow "defer" to work. The callers that really cares are
> slub/slab and they are updated accordingly. The slab one may be surprising
> because it also corrects a comment as kswapd was never woken up by that path.
It would be also nice if we could remove the is_thp_gfp_mask() checks one day.
They try to make direct reclaim/compaction for THP less intrusive, but maybe
could be removed now that the stalls are limited otherwise. But that's out of
scope here.
[...]
> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> index 8a282687ee06..a19b173cbc57 100644
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -113,9 +113,26 @@ guaranteed, but it may be more likely in case the allocation is for a
> MADV_HUGEPAGE region.
>
> echo always >/sys/kernel/mm/transparent_hugepage/defrag
> +echo defer >/sys/kernel/mm/transparent_hugepage/defrag
> echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
> echo never >/sys/kernel/mm/transparent_hugepage/defrag
>
> +"always" means that an application requesting THP will stall on allocation
> +failure and directly reclaim pages and compact memory in an effort to
> +allocate a THP immediately. This may be desirable for virtual machines
> +that benefit heavily from THP use and are willing to delay the VM start
> +to utilise them.
> +
> +"defer" means that an application will wake kswapd in the background
> +to reclaim pages and wake kcompact to compact memory so that THP is
> +available in the near future. It's the responsibility of khugepaged
> +to then install the THP pages later.
> +
> +"madvise" will enter direct reclaim like "always" but only for regions
> +that are have used madvise(). This is the default behaviour.
"madvise(MADV_HUGEPAGE)" perhaps?
[...]
> @@ -277,17 +273,23 @@ static ssize_t double_flag_store(struct kobject *kobj,
> static ssize_t enabled_show(struct kobject *kobj,
> struct kobj_attribute *attr, char *buf)
> {
> - return double_flag_show(kobj, attr, buf,
> - TRANSPARENT_HUGEPAGE_FLAG,
> - TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
> + if (test_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags)) {
> + VM_BUG_ON(test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags));
> + return sprintf(buf, "[always] madvise never\n");
> + } else if (test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags))
> + return sprintf(buf, "always [madvise] never\n");
> + else
> + return sprintf(buf, "always madvise [never]\n");
Somewhat ugly wrt consistent usage of { }, due to the VM_BUG_ON(), which I would
just drop. Also I wonder if some racy read vs write of the file can trigger the
BUG_ON? Or are the kobject accesses synchronized at a higher level?
> }
> +
> static ssize_t enabled_store(struct kobject *kobj,
> struct kobj_attribute *attr,
> const char *buf, size_t count)
> {
> ssize_t ret;
>
> - ret = double_flag_store(kobj, attr, buf, count,
> + ret = triple_flag_store(kobj, attr, buf, count,
> + TRANSPARENT_HUGEPAGE_FLAG,
> TRANSPARENT_HUGEPAGE_FLAG,
> TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
So this makes "echo defer > enabled" behave just like "echo always"? For
userspace interface that becomes a fixed ABI, I would prefer to be more careful
with unintended aliases like this. Maybe pass something like "-1" that
triple_flag_store() would check in the "defer" case and return -EINVAL?
> @@ -345,16 +347,23 @@ static ssize_t single_flag_store(struct kobject *kobj,
> static ssize_t defrag_show(struct kobject *kobj,
> struct kobj_attribute *attr, char *buf)
> {
> - return double_flag_show(kobj, attr, buf,
> - TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> - TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> + if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> + return sprintf(buf, "[always] defer madvise never\n");
> + if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> + return sprintf(buf, "always [defer] madvise never\n");
> + else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
> + return sprintf(buf, "always defer [madvise] never\n");
> + else
> + return sprintf(buf, "always defer madvise [never]\n");
> +
> }
> static ssize_t defrag_store(struct kobject *kobj,
> struct kobj_attribute *attr,
> const char *buf, size_t count)
> {
> - return double_flag_store(kobj, attr, buf, count,
> - TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> + return triple_flag_store(kobj, attr, buf, count,
> + TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
> + TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
> TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> }
> static struct kobj_attribute defrag_attr =
> @@ -784,9 +793,30 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
> return 0;
> }
>
> -static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
> +/*
> + * If THP is set to always then directly reclaim/compact as necessary
> + * If set to defer then do no reclaim and defer to khugepaged
> + * If set to madvise and the VMA is flagged then directly reclaim/compact
> + */
> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +{
> + gfp_t reclaim_flags = 0;
> +
> + if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags) &&
> + (vma->vm_flags & VM_HUGEPAGE))
> + reclaim_flags = __GFP_DIRECT_RECLAIM;
> + else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> + reclaim_flags = __GFP_KSWAPD_RECLAIM;
> + else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> + reclaim_flags = __GFP_DIRECT_RECLAIM;
Hmm, here's a trick question. What if I wanted direct reclaim for madvise()
vma's and kswapd/kcompactd for others? Right now there's no such option, right?
And expressing that with different values for a single tunable becomes ugly...
(For completeness, somebody could want kswapd/kcompactd defrag only for
madvise() but that's arguably not much useful)
next prev parent reply other threads:[~2016-03-01 12:42 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-02-26 16:15 [PATCH 1/1] mm: thp: Set THP defrag by default to madvise and add a stall-free defrag option Mel Gorman
2016-02-26 16:15 ` Mel Gorman
2016-03-01 12:42 ` Vlastimil Babka [this message]
2016-03-01 12:42 ` Vlastimil Babka
2016-03-01 13:31 ` Mel Gorman
2016-03-01 13:31 ` Mel Gorman
2016-03-01 13:36 ` Vlastimil Babka
2016-03-01 13:36 ` Vlastimil Babka
2016-03-01 13:44 ` [PATCH] mm: thp: Set THP defrag by default to madvise and add a stall-free defrag option -fix Mel Gorman
2016-03-01 13:44 ` Mel Gorman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=56D58E1E.5090708@suse.cz \
--to=vbabka@suse.cz \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=hannes@cmpxchg.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@techsingularity.net \
--cc=riel@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.