From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
To: Andrea Arcangeli <aarcange@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org,
"\\\"Kirill A. Shutemov\\\"" <kirill@shutemov.name>,
Mel Gorman <mgorman@techsingularity.net>,
Hugh Dickins <hughd@google.com>,
Johannes Weiner <jweiner@redhat.com>,
Dave Hansen <dave.hansen@intel.com>,
Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [PATCH 1/2] mm: thp: introduce thp_mmu_gather to pin tail pages during MMU gather
Date: Sat, 05 Dec 2015 13:54:51 +0530 [thread overview]
Message-ID: <87poyl5mlo.fsf@linux.vnet.ibm.com> (raw)
In-Reply-To: <20151123160302.GX5078@redhat.com>
Andrea Arcangeli <aarcange@redhat.com> writes:
> On Thu, Nov 19, 2015 at 04:22:55PM -0800, Andrew Morton wrote:
>> On Thu, 19 Nov 2015 14:00:51 +0100 Andrea Arcangeli <aarcange@redhat.com> wrote:
>>
>> > This theoretical SMP race condition was found with source review. No
>> > real life app could be affected as the result of freeing memory while
>> > accessing it is either undefined or it's a workload the produces no
>> > information.
>> >
>> > For something to go wrong because the SMP race condition triggered,
>> > it'd require a further tiny window within the SMP race condition
>> > window. So nothing bad is happening in practice even if the SMP race
>> > condition triggers. It's still better to apply the fix to have the
>> > math guarantee.
>> >
>> > The fix just adds a thp_mmu_gather atomic_t counter to the THP pages,
>> > so split_huge_page can elevate the tail page count accordingly and
>> > leave the tail page freeing task to whoever elevated thp_mmu_gather.
>> >
>>
>> This is a pretty nasty patch :( We now have random page*'s with bit 0
>> set floating around in mmu_gather.__pages[]. It assumes/requires that
>
> Yes, and bit 0 is only relevant for the mmu_gather structure and its
> users.
>
>> nobody uses those pages until they hit release_pages(). And the tlb
>> flushing code is pretty twisty, with various Kconfig and arch dependent
>> handlers.
>
> I already reviewed all callers and the mmu_gather freeing path for all
> archs to be sure they all take the two paths that I updated in order
> to free the pages. They call free_page_and_swap_cache or
> free_pages_and_swap_cache.
>
> The freeing is using common code VM functions, and it shouldn't
> improvise calling put_page() manually, the freeing has to take care of
> collecting the swapcache if needed etc... it has to deal with VM
> details the arch is not aware about.
But it is still lot of really complicated code.
>
>> Is there no nicer way?
>
> We can grow the size of the mmu_gather to keep track that the page was
> THP before the pmd_lock was dropped, in a separate bit from the struct
> page pointer, but it'll take more memory.
>
> This bit 0 looks a walk in the park if compared to the bit 0 in
> page->compound_head that was just introduced. The compound_head bit 0
> isn't only visible to the mmu_gather users (which should never try to
> mess with the page pointer themself) and it collides with lru/next,
> rcu_head.next users.
>
> If you prefer me to enlarge the mmu_gather structure I can do that.
>
If we can update mmu_gather to track the page size of the pages, that
will also help some archs to better implement tlb_flush(struct
mmu_gather *). Right now arch/powerpc/mm/tlb_nohash.c does flush the tlb
mapping for the entire mm_struct.
we can also make sure that we do a force flush when we are trying to
gather pages of different size. So one instance of mmu_gather will end
up gathering pages of specific size only ?
> 1 bit of extra information needs to be extracted before dropping the
> pmd_lock in zap_huge_pmd() and it has to be available in
> release_pages(), to know if the tail pages needs an explicit put_page
> or not. It's basically a bit in the local stack, except it's not in
> the local stack because it is an array of pages, so it needs many
> bits and it's stored in the mmu_gather along the page.
>
> Aside from the implementation of bit 0, I can't think of a simpler
> design that provides for the same performance and low locking overhead
> (this patch actually looks like an optimization when it's not helping
> to prevent the race, because to fix the race I had to reduce the
> number of times the lru_lock is taken in release_pages).
>
>> > +/*
>> > + * free_trans_huge_page_list() is used to free the pages returned by
>> > + * trans_huge_page_release() (if still PageTransHuge()) in
>> > + * release_pages().
>> > + */
>>
>> There is no function trans_huge_page_release().
>
> Oops I updated the function name but not the comment... thanks!
> Andrea
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index f7ae08f..2810322 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -225,9 +225,8 @@ static inline int trans_huge_mmu_gather_count(struct page *page)
> }
>
> /*
> - * free_trans_huge_page_list() is used to free the pages returned by
> - * trans_huge_page_release() (if still PageTransHuge()) in
> - * release_pages().
> + * free_trans_huge_page_list() is used to free THP pages (if still
> + * PageTransHuge()) in release_pages().
> */
> extern void free_trans_huge_page_list(struct list_head *list);
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2015-12-05 8:25 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-11-19 13:00 [PATCH 0/2] THP MMU gather Andrea Arcangeli
2015-11-19 13:00 ` [PATCH 1/2] mm: thp: introduce thp_mmu_gather to pin tail pages during " Andrea Arcangeli
2015-11-20 0:22 ` Andrew Morton
2015-11-23 16:03 ` Andrea Arcangeli
2015-12-05 8:24 ` Aneesh Kumar K.V [this message]
2015-12-07 14:44 ` Andrea Arcangeli
2015-12-07 9:30 ` Aneesh Kumar K.V
2015-12-07 15:11 ` Andrea Arcangeli
2015-11-19 13:00 ` [PATCH 2/2] mm: thp: put_huge_zero_page() with " Andrea Arcangeli
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87poyl5mlo.fsf@linux.vnet.ibm.com \
--to=aneesh.kumar@linux.vnet.ibm.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=dave.hansen@intel.com \
--cc=hughd@google.com \
--cc=jweiner@redhat.com \
--cc=kirill@shutemov.name \
--cc=linux-mm@kvack.org \
--cc=mgorman@techsingularity.net \
--cc=vbabka@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.