From: Sasha Levin <sasha.levin@oracle.com>
To: Dave Hansen <dave@sr71.net>
Cc: akpm@linux-foundation.org, n-horiguchi@ah.jp.nec.com,
mike.kravetz@oracle.com, hillf.zj@alibaba-inc.com,
rientjes@google.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, dave.hansen@linux.intel.com
Subject: Re: [PATCH] mm, hugetlb: use memory policy when available
Date: Thu, 22 Oct 2015 17:39:39 -0400
Message-ID: <5629579B.8050507@oracle.com>
In-Reply-To: <20151020195317.ADA052D8@viggo.jf.intel.com>

On 10/20/2015 03:53 PM, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> I have a hugetlbfs user which is never explicitly allocating huge pages
> with 'nr_hugepages'. They only set 'nr_overcommit_hugepages' and then let
> the pages be allocated from the buddy allocator at fault time.
>
> This works, but they noticed that mbind() was not doing them any good and
> the pages were being allocated without respect for the policy they
> specified.
>
> The code in question is this:
>
>     struct page *alloc_huge_page(struct vm_area_struct *vma,
>     ...
>         page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
>         if (!page) {
>             page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
>
> dequeue_huge_page_vma() is smart and will respect the VMA's memory policy.
> But, it only grabs _existing_ huge pages from the huge page pool. If the
> pool is empty, we fall back to alloc_buddy_huge_page() which obviously
> can't do anything with the VMA's policy because it isn't even passed the
> VMA.
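
For anyone following along, the fallback described above can be modelled in user space. This is only a toy sketch, not the kernel code: every toy_* name is hypothetical, the "policy" is reduced to a single preferred node, and the buddy allocator is assumed to pick the caller's local node when handed NUMA_NO_NODE.

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)
#define TOY_NR_NODES 2

/* Toy stand-in for the per-node pool of preallocated huge pages. */
int toy_pool_free[TOY_NR_NODES];

/* Models the pool path: only hands out an existing page on the policy's node. */
int toy_dequeue(int policy_node)
{
    assert(policy_node >= 0 && policy_node < TOY_NR_NODES);
    if (toy_pool_free[policy_node] > 0) {
        toy_pool_free[policy_node]--;
        return policy_node;
    }
    return NUMA_NO_NODE; /* pool empty for that node */
}

/* Models the buddy path: NUMA_NO_NODE means "no preference", so use local_node. */
int toy_buddy_alloc(int nid, int local_node)
{
    return nid == NUMA_NO_NODE ? local_node : nid;
}

/*
 * Returns the node the page ends up on. pass_policy == 0 mimics the
 * behaviour described in the quoted text (policy dropped on fallback);
 * pass_policy == 1 mimics plumbing the policy down to the buddy call.
 */
int toy_alloc_huge_page(int policy_node, int local_node, int pass_policy)
{
    int node = toy_dequeue(policy_node);

    if (node == NUMA_NO_NODE)
        node = toy_buddy_alloc(pass_policy ? policy_node : NUMA_NO_NODE,
                               local_node);
    return node;
}
```

With an empty pool, toy_alloc_huge_page(1, 0, 0) lands on node 0 even though the policy asked for node 1, while toy_alloc_huge_page(1, 0, 1) lands on node 1.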
>
> Almost everybody preallocates huge pages. That's probably why nobody has
> ever noticed this. Looking back at the git history, I don't think this
> _ever_ worked from when alloc_buddy_huge_page() was introduced in 7893d1d5,
> 8 years ago.
>
> The fix is to pass vma/addr down in to the places where we actually call in
> to the buddy allocator. It's fairly straightforward plumbing. This has
> been lightly tested.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

Hey Dave,

Trinity seems to be able to hit the newly added warnings pretty easily:

[ 339.282065] WARNING: CPU: 4 PID: 10181 at mm/hugetlb.c:1520 __alloc_buddy_huge_page+0xff/0xa80()
[ 339.360228] Modules linked in:
[ 339.360838] CPU: 4 PID: 10181 Comm: trinity-c291 Not tainted 4.3.0-rc6-next-20151022-sasha-00040-g5ecc711-dirty #2608
[ 339.362629] ffff88015e59c000 00000000e6475701 ffff88015e61f9a0 ffffffff9dd3ef48
[ 339.363896] 0000000000000000 ffff88015e61f9e0 ffffffff9c32d1ca ffffffff9c7175bf
[ 339.365167] ffffffffabddc0c8 ffff88015e61faf0 0000000000000000 ffffffffffffffff
[ 339.366387] Call Trace:
[ 339.366831] [<ffffffff9dd3ef48>] dump_stack+0x4e/0x86
[ 339.367648] [<ffffffff9c32d1ca>] warn_slowpath_common+0xfa/0x120
[ 339.368635] [<ffffffff9c7175bf>] ? __alloc_buddy_huge_page+0xff/0xa80
[ 339.369631] [<ffffffff9c32d3ca>] warn_slowpath_null+0x1a/0x20
[ 339.370574] [<ffffffff9c7175bf>] __alloc_buddy_huge_page+0xff/0xa80
[ 339.371551] [<ffffffff9c7174c0>] ? return_unused_surplus_pages+0x120/0x120
[ 339.372698] [<ffffffff9dda0327>] ? debug_smp_processor_id+0x17/0x20
[ 339.373683] [<ffffffff9c41574b>] ? get_lock_stats+0x1b/0x80
[ 339.374551] [<ffffffff9c42e901>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[ 339.375744] [<ffffffff9c433870>] ? do_raw_spin_unlock+0x1d0/0x1e0
[ 339.376728] [<ffffffff9c718333>] hugetlb_acct_memory+0x193/0x990
[ 339.377663] [<ffffffff9c7181a0>] ? dequeue_huge_page_node+0x260/0x260
[ 339.378658] [<ffffffff9c41c970>] ? trace_hardirqs_on_caller+0x540/0x5e0
[ 339.379671] [<ffffffff9c71e469>] hugetlb_reserve_pages+0x229/0x330
[ 339.380738] [<ffffffff9cba273b>] hugetlb_file_setup+0x54b/0x810
[ 339.381689] [<ffffffff9cba21f0>] ? hugetlbfs_fallocate+0x9e0/0x9e0
[ 339.382653] [<ffffffff9dd669f0>] ? scnprintf+0x100/0x100
[ 339.383526] [<ffffffff9da638af>] newseg+0x49f/0xa70
[ 339.384371] [<ffffffff9dda0327>] ? debug_smp_processor_id+0x17/0x20
[ 339.385345] [<ffffffff9da63410>] ? shm_try_destroy_orphaned+0x190/0x190
[ 339.386365] [<ffffffff9da52cf0>] ? ipcget+0x60/0x510
[ 339.387139] [<ffffffff9da52d1f>] ipcget+0x8f/0x510
[ 339.387902] [<ffffffff9c0046f0>] ? do_audit_syscall_entry+0x2b0/0x2b0
[ 339.388931] [<ffffffff9da64e1a>] SyS_shmget+0x11a/0x160
[ 339.389737] [<ffffffff9da64d00>] ? is_file_shm_hugepages+0x40/0x40
[ 339.393268] [<ffffffff9c006ac2>] ? syscall_trace_enter_phase2+0x462/0x5f0
[ 339.395643] [<ffffffffa55ce0f8>] tracesys_phase2+0x88/0x8d
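
The trace enters through shmget() and reaches hugetlb_file_setup(), which suggests a SHM_HUGETLB segment. Below is a rough sketch of a call that should take the same entry path. It is not what Trinity actually ran; the helper name try_hugetlb_shmget() is made up, and whether the warning fires presumably depends on the pool being short and nr_overcommit_hugepages being non-zero, which I am assuming rather than stating as fact.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Fallback for libc headers that do not expose the Linux flag. */
#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000
#endif

/*
 * Hypothetical helper: create a hugetlb-backed SysV segment of 'size'
 * bytes and remove it again. Returns 0 on success, -1 on failure.
 * Creation reserves huge pages up front, which is the
 * hugetlb_reserve_pages() step visible in the trace.
 */
int try_hugetlb_shmget(size_t size)
{
    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | SHM_HUGETLB | 0600);

    if (id < 0) {
        /* Expected when no huge pages are available or permitted. */
        fprintf(stderr, "shmget(SHM_HUGETLB): %s\n", strerror(errno));
        return -1;
    }
    /* Mark the segment for removal so nothing is left behind. */
    if (shmctl(id, IPC_RMID, NULL) < 0)
        return -1;
    return 0;
}
```

Calling try_hugetlb_shmget(2UL * 1024 * 1024) asks for one huge page, assuming a 2 MB huge page size; on a box with no huge pages configured it simply fails and returns -1.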

Thanks,
Sasha