All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tarun Sahu <tsahu@linux.ibm.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
	muchun.song@linux.dev, mike.kravetz@oracle.com,
	aneesh.kumar@linux.ibm.com, sidhartha.kumar@oracle.com,
	gerald.schaefer@linux.ibm.com, linux-kernel@vger.kernel.org,
	jaypatel@linux.ibm.com
Subject: Re: [PATCH] mm/folio: Avoid special handling for order value 0 in folio_set_order
Date: Tue, 18 Apr 2023 14:33:48 +0530	[thread overview]
Message-ID: <878repa7ez.fsf@linux.ibm.com> (raw)
In-Reply-To: <ZDmzyag88pO1Kdk8@casper.infradead.org>

Matthew Wilcox <willy@infradead.org> writes:

Hi Mathew,
Thanks for reviewing. please find my comments inline.

> On Sat, Apr 15, 2023 at 01:18:32AM +0530, Tarun Sahu wrote:
>> folio_set_order(folio, 0); which is an abuse of folio_set_order as 0-order
>> folio does not have any tail page to set order.
>
> I think you're missing the point of how folio_set_order() is used.
> When splitting a large folio, we need to zero out the folio_nr_pages
> in the tail, so it does have a tail page, and that tail page needs to
> be zeroed.  We even assert that there is a tail page:
>
>         if (WARN_ON_ONCE(!folio_test_large(folio)))
>                 return;
>
> Or maybe you need to explain yourself better.
>

Yes, I understand, folio_set_order(order, 0) is called to clear out
tail pages folio_order/folio_nr_pages. With this patch, I am trying
to convey two things:-
 1. It is not necessary to clear out these field if page->mapping is
 being explicitly updated. I explain this below [EXP].

 2. folio_set_order(order, 0) now currently being used to clear
 folio_order and folio_nr_pages which is ok. But looking at
 folio_set_order(folio, 0) is confusing as setting order 0 implies that
 only 1 page in folio. and folio_order and folio_nr_pages are part of
 first tail page. IIRC, there was a discussion to use folio_clear_order
 to avoid such confusion. But if above point 1 deemed to be correct, there
 will not be any need of this too.

**[EXP]**
IIUC, during splitting, page->mapping is updated
explicitly for tail pages. There is no code path I see, where
folio_set_order(order, 0) or set_compound_order(head, 0) is called
except below two places.

  1. __destroy_compound_gigantic_folio
       Here, in past there was a problem when struct page used to have
       compound_nr field which used to overlap with page->mapping. So
       while freeing, it was necessary to explicitly clear out
       compound_nr. Which was taken care by Commit ba9c1201beaa
       ("mm/hugetlb: clear compound_nr before freeing gigantic pages").

       But after, Commit a01f43901cfb ("hugetlb: be sure to free demoted CMA
       pages to CMA"), page->mapping has explicitly been cleared out for
       all tail pages.

           	for (i = 1; i < nr_pages; i++) {
            		p = folio_page(folio, i);
                p->mapping = NULL;    <======== (Here)
                clear_compound_head(p);
                if (!demote)
                   set_page_refcounted(p);
                }
            folio_set_order(folio, 0); <== this line can be removed.

   2. __prep_compound_gigantic_folio
        Here, folio_set_order(folio, 0) is called in error path only.
        which can be avoided if we call folio_set_order(folio, order)
        after the for loop.
        I am new to memory allocators. But as far as I could understood by
        looking at past discussion around this function [1][2], During RCU
        grace period there could be a race condition causing ref count
        inflation. But IIUC, that doesn't have any dependency on newly
        allocated gigantic page except that the ref count might be taken
        by folio_ref_try_add_rcu for the same page/s which will cause
        prep_compound_gigantic_folio to fail. So IMHO, it will be ok to
        move __folio_set_head and folio_set_order after the for loop.
        Here, Just for reference, below I copy pasted the *for loop*,
        from before, I am moving these two calls to after this.

	for (i = 0; i < nr_pages; i++) {
		p = folio_page(folio, i);

		if (i != 0)	/* head page cleared above */
			__ClearPageReserved(p);
		if (!demote) {
			if (!page_ref_freeze(p, 1)) {
				pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
				goto out_error;
			}
		} else {
			VM_BUG_ON_PAGE(page_count(p), p);
		}
		if (i != 0)
			set_compound_head(p, &folio->page);
	}

I also tested it with triggering demotion of gigantic hugepages to PMD
hugepages.

$ echo 5 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
$ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
5
$ cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
0
$ echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
$ cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
512

I am quite new to field. Please correct me if I understood it
differently than it is. Also if I didn't consider other code path
for its consideration.

[1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/
[2] https://lore.kernel.org/all/20210622021423.154662-3-mike.kravetz@oracle.com/T/#u

>> folio->_folio_nr_pages is
>> set to 0 for order 0 in folio_set_order. It is required because
>> _folio_nr_pages overlapped with page->mapping and leaving it non zero
>> caused "bad page" error while freeing gigantic hugepages. This was fixed in
>> Commit ba9c1201beaa ("mm/hugetlb: clear compound_nr before freeing gigantic
>> pages"). Also commit a01f43901cfb ("hugetlb: be sure to free demoted CMA
>> pages to CMA") now explicitly clear page->mapping and hence we won't see
>> the bad page error even if _folio_nr_pages remains unset. Also the order 0
>> folios are not supposed to call folio_set_order, So now we can get rid of
>> folio_set_order(folio, 0) from hugetlb code path to clear the confusion.
>
> ... this is all very confusing.
>

Sorry, for this. Lemme know if above explanation [EXP] is clear.

>> The patch also moves _folio_set_head and folio_set_order calls in
>> __prep_compound_gigantic_folio() such that we avoid clearing them in the
>> error path.
>
> But don't we need those bits set while we operate on the folio to set it
> up?  It makes me nervous if we don't have those bits set because we can
> end up with speculative references that point to a head page while that
> page is not marked as a head page.  It may not be a problem, but I want
> to see some air-tight analysis of that.
>


>> Testing: I have run LTP tests, which all passes. and also I have written
>> the test in LTP which tests the bug caused by compound_nr and page->mapping
>> overlapping.
>> 
>> https://lore.kernel.org/all/20230413090753.883953-1-tsahu@linux.ibm.com/
>> 
>> Running on older kernel ( < 5.10-rc7) with the above bug this fails while
>> on newer kernel and, also with this patch it passes.
>> 
>> Signed-off-by: Tarun Sahu <tsahu@linux.ibm.com>
>> ---
>>  mm/hugetlb.c  | 9 +++------
>>  mm/internal.h | 8 ++------
>>  2 files changed, 5 insertions(+), 12 deletions(-)
>> 
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index f16b25b1a6b9..e2540269c1dc 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1489,7 +1489,6 @@ static void __destroy_compound_gigantic_folio(struct folio *folio,
>>  			set_page_refcounted(p);
>>  	}
>>  
>> -	folio_set_order(folio, 0);
>>  	__folio_clear_head(folio);
>>  }
>>  
>> @@ -1951,9 +1950,6 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
>>  	struct page *p;
>>  
>>  	__folio_clear_reserved(folio);
>> -	__folio_set_head(folio);
>> -	/* we rely on prep_new_hugetlb_folio to set the destructor */
>> -	folio_set_order(folio, order);
>>  	for (i = 0; i < nr_pages; i++) {
>>  		p = folio_page(folio, i);
>>  
>> @@ -1999,6 +1995,9 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
>>  		if (i != 0)
>>  			set_compound_head(p, &folio->page);
>>  	}
>> +	__folio_set_head(folio);
>> +	/* we rely on prep_new_hugetlb_folio to set the destructor */
>> +	folio_set_order(folio, order);
>>  	atomic_set(&folio->_entire_mapcount, -1);
>>  	atomic_set(&folio->_nr_pages_mapped, 0);
>>  	atomic_set(&folio->_pincount, 0);
>> @@ -2017,8 +2016,6 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
>>  		p = folio_page(folio, j);
>>  		__ClearPageReserved(p);
>>  	}
>> -	folio_set_order(folio, 0);
>> -	__folio_clear_head(folio);
>>  	return false;
>>  }
>>  
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 18cda26b8a92..0d96a3bc1d58 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -425,16 +425,12 @@ int split_free_page(struct page *free_page,
>>   */
>>  static inline void folio_set_order(struct folio *folio, unsigned int order)
>>  {
>> -	if (WARN_ON_ONCE(!folio_test_large(folio)))
>> +	if (WARN_ON_ONCE(!order || !folio_test_large(folio)))
>>  		return;
>>  
>>  	folio->_folio_order = order;
>>  #ifdef CONFIG_64BIT
>> -	/*
>> -	 * When hugetlb dissolves a folio, we need to clear the tail
>> -	 * page, rather than setting nr_pages to 1.
>> -	 */
>> -	folio->_folio_nr_pages = order ? 1U << order : 0;
>> +	folio->_folio_nr_pages = 1U << order;
>>  #endif
>>  }
>>  
>> -- 
>> 2.31.1
>> 


  reply	other threads:[~2023-04-18  9:04 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-14 19:48 [PATCH] mm/folio: Avoid special handling for order value 0 in folio_set_order Tarun Sahu
2023-04-14 20:12 ` Matthew Wilcox
2023-04-18  9:03   ` Tarun Sahu [this message]
2023-04-18 18:56   ` Mike Kravetz
2023-04-24 15:40     ` Tarun Sahu
2023-04-24 15:31   ` Tarun Sahu
2023-04-14 21:35 ` Sidhartha Kumar
2023-04-18  9:20   ` Tarun Sahu
2023-04-18  9:25     ` Tarun Sahu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=878repa7ez.fsf@linux.ibm.com \
    --to=tsahu@linux.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=gerald.schaefer@linux.ibm.com \
    --cc=jaypatel@linux.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mike.kravetz@oracle.com \
    --cc=muchun.song@linux.dev \
    --cc=sidhartha.kumar@oracle.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.