Re: [PATCH] mm: support large mapping building for tmpfs

From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: David Hildenbrand <david@redhat.com>,
	akpm@linux-foundation.org, hughd@google.com
Cc: ziy@nvidia.com, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
	dev.jain@arm.com, baohua@kernel.org, vbabka@suse.cz,
	rppt@kernel.org, surenb@google.com, mhocko@suse.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: support large mapping building for tmpfs
Date: Wed, 2 Jul 2025 17:44:11 +0800
Message-ID: <67c79f65-ca6d-43be-a4ec-decd08bbce0a@linux.alibaba.com>
In-Reply-To: <ec5d4e52-658b-4fdc-b7f9-f844ab29665c@redhat.com>



On 2025/7/2 16:45, David Hildenbrand wrote:
>>> Hm, are we sure about that?
>>
>> IMO, referring to the definition of RSS:
>> "resident set size (RSS) is the portion of memory (measured in
>> kilobytes) occupied by a process that is held in main memory (RAM). "
>>
>> Seems we should report the whole large folio already in file to users.
>> Moreover, the tmpfs mount already adds the 'huge=always (or within)'
>> option to allocate large folios, so the increase in RSS seems also 
>> expected?
> 
> Well, traditionally we only account what is actually mapped. If you
> MADV_DONTNEED part of the large folio, or only mmap() parts of it,
> the RSS would never cover the whole folio -- only what is mapped.
> 
> I discuss part of that in:
> 
> commit 749492229e3bd6222dda7267b8244135229d1fd8
> Author: David Hildenbrand <david@redhat.com>
> Date:   Mon Mar 3 17:30:13 2025 +0100
> 
>      mm: stop maintaining the per-page mapcount of large folios (CONFIG_NO_PAGE_MAPCOUNT)
> 
> And how my changes there affect some system stats (e.g., "AnonPages",
> "Mapped").
> But the RSS stays unchanged and corresponds to what is actually mapped into
> the process.
> Doing something similar for the RSS would be extremely hard (single page
> mapped into process -> account whole folio to RSS), because it's
> per-folio-per-process information, not per-folio information.

Thanks. Good to know this.

> So by mapping more in a single page fault, you end up increasing "RSS".
> But I wouldn't call that "expected". I rather suspect that nobody will
> really care :)

But tmpfs is a little special here. It uses the 'huge=' option to
control large folio allocation, so users who set it have explicitly
asked for large folios and, I think, would expect the whole large
folio to be mapped. That is why I call it 'expected'.
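
As a toy illustration of the accounting discussed above (a userspace
sketch, not kernel code): RSS counts only the pages of a folio that a
process actually has mapped, so mapping the whole folio in one fault
raises it. The names rss_kb_for_folio() and SKETCH_PAGE_KB are
hypothetical, and 4 KiB base pages are an assumption.

```c
#include <assert.h>

#define SKETCH_PAGE_KB 4UL	/* assumption: 4 KiB base pages */

/*
 * Toy model, not kernel code: bit i of mapped_mask is set if this
 * process has page i of the folio mapped. Only those pages count
 * towards the process's RSS.
 */
static unsigned long rss_kb_for_folio(unsigned long mapped_mask,
				      unsigned int order)
{
	unsigned long nr_pages;

	assert(order <= 5);	/* keep nr_pages below the width of the mask */
	nr_pages = 1UL << order;

	/* Ignore any bits beyond the end of the folio. */
	mapped_mask &= (1UL << nr_pages) - 1;

	return (unsigned long)__builtin_popcountl(mapped_mask) * SKETCH_PAGE_KB;
}
```

With an order-2 folio, a process that has mapped only the first page
contributes 4 kB, while one that has the whole folio mapped contributes
16 kB.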

>>> Also, how does fault_around_bytes interact
>>> here?
>>
>> The 'fault_around' is a bit tricky. Currently, 'fault_around' only
>> applies to read faults (via do_read_fault()) and does not control write
>> shared faults (via do_shared_fault()). Additionally, in the
>> do_shared_fault() function, PMD-sized large folios are also not
>> controlled by 'fault_around', so I just follow the handling of PMD-sized
>> large folios.
>>
>>>> In order to support large mappings for tmpfs, besides checking VMA
>>>> limits and
>>>> PMD pagetable limits, it is also necessary to check if the linear page
>>>> offset
>>>> of the VMA is order-aligned within the file.
>>>
>>> Why?
>>>
>>> This only applies to PMD mappings. See below.
>>
>> I previously had the same question, but I saw the comments for
>> 'thp_vma_suitable_order' function, so I added the check here. If it's
>> not necessary to check non-PMD-sized large folios, should we update the
>> comments for 'thp_vma_suitable_order'?
> 
> I was not quite clear about PMD vs. !PMD.
> 
> The thing is, when you *allocate* a new folio, it must adhere at least to
> pagecache alignment (e.g., cannot place an order-2 folio at pgoff 1) -- 

Yes, agree.

> that is what thp_vma_suitable_order() checks. Otherwise you cannot add
> it to the pagecache.

But this alignment is not done by thp_vma_suitable_order().

For tmpfs, the alignment is checked in shmem_suitable_orders() via:
"
	if (!xa_find(&mapping->i_pages, &aligned_index,
			aligned_index + pages - 1, XA_PRESENT))
"
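
A userspace sketch of the idea behind that check (not the shmem code
itself): round the index down to the folio's natural alignment, then
require that no entry is already present in the aligned range. The
bool array standing in for the page cache and the helper names
range_is_empty() and order_fits() are hypothetical, and it is an
assumption here that aligned_index is the index rounded down to a
multiple of pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the page cache: present[i] is true if index i is occupied. */
static bool range_is_empty(const bool *present, size_t nr,
			   size_t first, size_t last)
{
	for (size_t i = first; i <= last && i < nr; i++)
		if (present[i])
			return false;
	return true;
}

/*
 * Hypothetical helper: can an order-'order' folio covering 'index' be
 * placed at its naturally aligned position without hitting an
 * existing entry?
 */
static bool order_fits(const bool *present, size_t nr,
		       size_t index, unsigned int order)
{
	size_t pages, aligned_index;

	assert(order < 8 * sizeof(size_t));
	pages = (size_t)1 << order;
	aligned_index = index & ~(pages - 1);	/* round down to 2^order */

	return range_is_empty(present, nr, aligned_index,
			      aligned_index + pages - 1);
}
```

If index 5 is already occupied, an order-2 folio covering index 6 would
span indices 4 to 7 and does not fit, while an order-1 folio covering
indices 6 to 7 does.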

For other filesystems, the alignment is checked in
__filemap_get_folio() via:
"
	/* If we're not aligned, allocate a smaller folio */
	if (index & ((1UL << order) - 1))
		order = __ffs(index);
"
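
The same arithmetic as a small userspace sketch (not kernel code):
clamp_order_to_index() is a hypothetical helper name, and the
GCC/Clang builtin __builtin_ctzl() stands in for the kernel's __ffs().

```c
#include <assert.h>

/*
 * Userspace sketch, not kernel code: if 'index' is not a multiple of
 * 2^order, fall back to the largest order that 'index' is aligned to.
 */
static unsigned int clamp_order_to_index(unsigned long index,
					 unsigned int order)
{
	assert(order < 8 * sizeof(unsigned long));

	/* A non-zero result here implies index != 0, so ctzl is well defined. */
	if (index & ((1UL << order) - 1))
		order = (unsigned int)__builtin_ctzl(index);

	return order;
}
```

For example, an order-2 request at index 1 is reduced to order 0, and
at index 6 to order 1, while index 4 keeps order 2.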

> But once you *obtain* a folio from the pagecache and are supposed to map it
> into the page tables, that must already hold true.
> 
> So you should be able to just blindly map whatever is given to you here
> AFAIKS.
> 
> If you would get a pagecache folio that violates the linear page offset
> requirement at that point, something else would have messed up the
> pagecache.

Yes. But the comment on thp_vma_suitable_order() is not about
pagecache alignment. It says "the order-aligned addresses in the VMA map
to order-aligned offsets within the file", which was originally written
for PMD mapping alignment. So I wonder if we need this restriction for
non-PMD-sized large folios?

"
  *   - For file vma, check if the linear page offset of vma is
  *     order-aligned within the file.  The hugepage is
  *     guaranteed to be order-aligned within the file, but we must
  *     check that the order-aligned addresses in the VMA map to
  *     order-aligned offsets within the file, else the hugepage will
  *     not be mappable.
"
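
The condition that comment describes can be sketched in userspace as
follows. This is an illustration, not the kernel function: struct
vma_sketch, file_offset_is_order_aligned() and SKETCH_PAGE_SHIFT are
hypothetical names, and 4 KiB pages are an assumption.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12UL	/* assumption: 4 KiB pages */

/* Minimal stand-in for the two VMA fields the comment refers to. */
struct vma_sketch {
	unsigned long vm_start;	/* first virtual address of the mapping */
	unsigned long vm_pgoff;	/* file offset of vm_start, in pages */
};

/*
 * True if order-aligned virtual addresses in the VMA correspond to
 * order-aligned page offsets in the file, i.e. the difference between
 * the virtual page number and the file page offset is a multiple of
 * 2^order.
 */
static bool file_offset_is_order_aligned(const struct vma_sketch *vma,
					 unsigned int order)
{
	unsigned long nr_pages, delta;

	assert(order < 8 * sizeof(unsigned long));
	nr_pages = 1UL << order;
	delta = (vma->vm_start >> SKETCH_PAGE_SHIFT) - vma->vm_pgoff;

	return (delta & (nr_pages - 1)) == 0;
}
```

A VMA starting at 0x10000 with vm_pgoff 0 passes for order 2, but the
same VMA with vm_pgoff 1 does not: there, order-2-aligned addresses map
to file offsets that are off by one page.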
