linux-mm.kvack.org archive mirror
* Re: regression - mm: shmem: add large folio support for tmpfs affect GPU performance.
       [not found]       ` <a8ac7ec3-4cb3-4dd8-8d02-ede6905f322e@linux.alibaba.com>
@ 2025-07-25  2:38         ` Baolin Wang
  2025-07-25  4:47           ` Hugh Dickins
  0 siblings, 1 reply; 7+ messages in thread
From: Baolin Wang @ 2025-07-25  2:38 UTC (permalink / raw)
  To: David Hildenbrand, Hugh Dickins
  Cc: Patryk Kowalczyk, da.gomez, baohua, wangkefeng.wang, ioworker0,
	willy, ryan.roberts, akpm, eero.t.tamminen,
	Ville Syrjälä, linux-mm@kvack.org

Cc +mm maillist

On 2025/7/25 10:29, Baolin Wang wrote:
> 
> 
> On 2025/7/25 03:34, David Hildenbrand wrote:
>> On 24.07.25 21:19, Hugh Dickins wrote:
>>> On Thu, 24 Jul 2025, David Hildenbrand wrote:
>>>> On 24.07.25 20:34, Patryk Kowalczyk wrote:
>>>>>
>>>>> Recently, I have observed a significant drop in the performance of the
>>>>> graphics card on a Meteor Lake platform equipped with an Intel Core Ultra
>>>>> 155H CPU and Xe 128EU GPU, using the i915 driver. Nearly every workload
>>>>> now runs slower, with memory-intensive tasks experiencing performance
>>>>> reduced by several percent.
>>>>>
>>>>> This issue began with the Linux kernel version 6.14.x. Using git bisect,
>>>>> I identified the problematic commit as
>>>>> acd7ccb284b86da1b2e3233a6826fe933844fc06, which relates to the "large
>>>>> folio support for tmpfs in the memory management subsystem."
>>>>>
>>>>> More information about this regression can be found at the following
>>>>> issue tracker:
>>>>> https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14645
>>>>>
>>>>> Older bug for textures:
>>>>> https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13845
>>>>
>>>> Reading the tracker, the valuable information is:
>>>>
>>>> "I tested all the options available via the sysfs interface for
>>>> shmem_enabled in transparent hugepage. The selectable settings are:
>>>> always, within_size, advise, [never] (the default), deny, and force.
>>>> Among these, only the force option restores GPU performance on kernels
>>>> later than version 6.13, and only for applications executed after the
>>>> change."
>>>>
>>>> So probably you are no longer getting PMD THPs, whereas previously you
>>>> would have gotten PMD THPs.
>>>>
>>>> Do we know how the shmem memory is accessed? read/write or mmap? I
>>>> suspect it is accessed through write() first.
>>>>
>>>> I think we had a change in behavior regarding write(): we will now try
>>>> allocating a PMD THP only if the write() spans the complete PMD range.
>>>>
>>>> I recall that we had a similar report before ...
> 
> Yes, Ville reported the same issue before[1], and I provided a fix to 
> Ville off-list, but I have not received any feedback.
> 
> [1] https://lore.kernel.org/lkml/aBEP-6iFhIC87zmb@intel.com/
> 
>>> I haven't noticed a similar report, but that's my guess too: although
>>> I thought I checked for precisely this (knowing it to be a danger)
>>> during 6.14-rc, I did later find that I must have messed my test up;
>>> but still haven't found time to establish precisely what's going on and
>>> fix it (can't worry too much about releases between LTSs these days).
>>
>> At least scanning the code, write() would behave differently now.
>>
>> Now, I don't know which "hint" we should use to decide to allocate
>> PMD-sized THPs and ignore the write size.
>>
>> Maybe the last resort would be starting at PMD-size, but falling back 
>> to smaller orders if we fail to allocate one, hmmmm
> 
> I hope to correct the logic of the i915 driver's shmem allocation by
> extending the shmem write length in the i915 driver so that it can
> allocate PMD-sized THPs. IIUC, some sample fix code is as follows
> (untested). Patryk, could you help test it to see if this resolves your
> issue? Thanks.
> 
> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> index 1e659d2660f7..5dee740d1e70 100644
> --- a/drivers/gpu/drm/drm_gem.c
> +++ b/drivers/gpu/drm/drm_gem.c
> @@ -591,7 +591,7 @@ struct page **drm_gem_get_pages(struct drm_gem_object *obj)
>          i = 0;
>          while (i < npages) {
>                  long nr;
> -               folio = shmem_read_folio_gfp(mapping, i,
> +               folio = shmem_read_folio_gfp(mapping, i, 0,
>                                  mapping_gfp_mask(mapping));
>                  if (IS_ERR(folio))
>                          goto fail;
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> index f263615f6ece..0edc75208b7a 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> @@ -69,6 +69,7 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st,
>          struct scatterlist *sg;
>          unsigned long next_pfn = 0;     /* suppress gcc warning */
>          gfp_t noreclaim;
> +       size_t chunk;
>          int ret;
>
>          if (overflows_type(size / PAGE_SIZE, page_count))
> @@ -94,6 +95,7 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st,
>          mapping_set_unevictable(mapping);
>          noreclaim = mapping_gfp_constraint(mapping, ~__GFP_RECLAIM);
>          noreclaim |= __GFP_NORETRY | __GFP_NOWARN;
> +       chunk = mapping_max_folio_size(mapping);
>
>          sg = st->sgl;
>          st->nents = 0;
> @@ -105,10 +107,13 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st,
>                          0,
>                  }, *s = shrink;
>                  gfp_t gfp = noreclaim;
> +               size_t bytes = (page_count - i) << PAGE_SHIFT;
> +               loff_t pos = i << PAGE_SHIFT;
>
> +               bytes = min(chunk, bytes);
>                  do {
>                          cond_resched();
> -                       folio = shmem_read_folio_gfp(mapping, i, gfp);
> +                       folio = shmem_read_folio_gfp(mapping, i, pos + bytes, gfp);
>                          if (!IS_ERR(folio))
>                                  break;
>
> diff --git a/drivers/gpu/drm/ttm/ttm_backup.c b/drivers/gpu/drm/ttm/ttm_backup.c
> index 6f2e58be4f3e..0c90ae338afb 100644
> --- a/drivers/gpu/drm/ttm/ttm_backup.c
> +++ b/drivers/gpu/drm/ttm/ttm_backup.c
> @@ -100,7 +100,7 @@ ttm_backup_backup_page(struct file *backup, struct page *page,
>          struct folio *to_folio;
>          int ret;
>
> -       to_folio = shmem_read_folio_gfp(mapping, idx, alloc_gfp);
> +       to_folio = shmem_read_folio_gfp(mapping, idx, 0, alloc_gfp);
>          if (IS_ERR(to_folio))
>                  return PTR_ERR(to_folio);
>
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index cbe46e0c8bce..9fb5f30552e4 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -156,12 +156,12 @@ enum sgp_type {
>   int shmem_get_folio(struct inode *inode, pgoff_t index, loff_t write_end,
>                  struct folio **foliop, enum sgp_type sgp);
>   struct folio *shmem_read_folio_gfp(struct address_space *mapping,
> -               pgoff_t index, gfp_t gfp);
> +               pgoff_t index, loff_t write_end, gfp_t gfp);
>
>   static inline struct folio *shmem_read_folio(struct address_space *mapping,
>                  pgoff_t index)
>   {
> -       return shmem_read_folio_gfp(mapping, index, mapping_gfp_mask(mapping));
> +       return shmem_read_folio_gfp(mapping, index, 0, mapping_gfp_mask(mapping));
>   }
>
>   static inline struct page *shmem_read_mapping_page(
> diff --git a/mm/shmem.c b/mm/shmem.c
> index c5eea697a96f..fcf233440c34 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -5920,6 +5920,7 @@ int shmem_zero_setup(struct vm_area_struct *vma)
>    * shmem_read_folio_gfp - read into page cache, using specified page allocation flags.
>    * @mapping:   the folio's address_space
>    * @index:     the folio index
> + * @write_end: end of a write if allocating a new folio
>    * @gfp:       the page allocator flags to use if allocating
>    *
>    * This behaves as a tmpfs "read_cache_page_gfp(mapping, index, gfp)",
> @@ -5932,14 +5933,14 @@ int shmem_zero_setup(struct vm_area_struct *vma)
>    * with the mapping_gfp_mask(), to avoid OOMing the machine unnecessarily.
>    */
>   struct folio *shmem_read_folio_gfp(struct address_space *mapping,
> -               pgoff_t index, gfp_t gfp)
> +               pgoff_t index, loff_t write_end, gfp_t gfp)
>   {
>   #ifdef CONFIG_SHMEM
>          struct inode *inode = mapping->host;
>          struct folio *folio;
>          int error;
>
> -       error = shmem_get_folio_gfp(inode, index, 0, &folio, SGP_CACHE,
> +       error = shmem_get_folio_gfp(inode, index, write_end, &folio, SGP_CACHE,
>                                      gfp, NULL, NULL);
>          if (error)
>                  return ERR_PTR(error);
> @@ -5958,7 +5959,7 @@ EXPORT_SYMBOL_GPL(shmem_read_folio_gfp);
>   struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
>                                           pgoff_t index, gfp_t gfp)
>   {
> -       struct folio *folio = shmem_read_folio_gfp(mapping, index, gfp);
> +       struct folio *folio = shmem_read_folio_gfp(mapping, index, 0, gfp);
>          struct page *page;
>
>          if (IS_ERR(folio))
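For anyone unfamiliar with the new write_end argument, the sketch below is a
standalone illustration (not the actual mm/shmem.c logic; the helper name is
made up) of how a write-size hint caps the folio order that can be allocated.
With write_end == 0 there is no hint, so only an order-0 folio is attempted,
which is why drivers calling shmem_read_folio_gfp() page by page stopped
getting PMD-sized folios:

/*
 * Simplified illustration only -- not the real mm/shmem.c code, and
 * pick_order_from_write_hint() is a hypothetical name.  It returns the
 * highest folio order that a write ending at @write_end justifies for
 * the folio starting at @index, capped at the PMD order.
 */
#include <stdint.h>

#define PAGE_SHIFT	12	/* assumes 4K pages */
#define PMD_ORDER	9	/* 2M PMD folios on such a configuration */

static unsigned int pick_order_from_write_hint(uint64_t index, uint64_t write_end)
{
	uint64_t pos = index << PAGE_SHIFT;
	unsigned int order = 0;

	if (write_end <= pos)
		return 0;	/* no usable hint: fall back to a single page */

	/* grow the order while the whole (aligned) folio stays inside the write */
	while (order < PMD_ORDER) {
		uint64_t next_size = 1ULL << (PAGE_SHIFT + order + 1);

		if ((pos & (next_size - 1)) || pos + next_size > write_end)
			break;
		order++;
	}
	return order;
}

In the i915 chunking above, pos + bytes plays the role of write_end, with
bytes capped by mapping_max_folio_size(), so each call can again justify up
to a PMD-sized folio.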




* Re: regression - mm: shmem: add large folio support for tmpfs affect GPU performance.
  2025-07-25  2:38         ` regression - mm: shmem: add large folio support for tmpfs affect GPU performance Baolin Wang
@ 2025-07-25  4:47           ` Hugh Dickins
  2025-07-25  6:05             ` Baolin Wang
  0 siblings, 1 reply; 7+ messages in thread
From: Hugh Dickins @ 2025-07-25  4:47 UTC (permalink / raw)
  To: Baolin Wang
  Cc: David Hildenbrand, Hugh Dickins, Patryk Kowalczyk, da.gomez,
	baohua, wangkefeng.wang, ioworker0, willy, ryan.roberts, akpm,
	eero.t.tamminen, Ville Syrjälä, linux-mm@kvack.org

On Fri, 25 Jul 2025, Baolin Wang wrote:
> > 
> > I hope to correct the logic of i915 driver's shmem allocation, by extending
> > the shmem write length in the i915 driver to allocate PMD-sized THPs. IIUC,
> > some sample fix code is as follows (untested). Patryk, could you help test
> > it to see if this resolves your issue? Thanks.

This patch cannot be the right fix.  It may be a very sensible workaround
for some in-kernel drivers (I've not looked or tried); but unless I
misunderstand, it does nothing to restore userspace behaviour on a
huge=always tmpfs.

Please reread my comment earlier in the thread, in particular: "Passing a
new SIGBUS xfstest does not excuse a regression: strict PAGE_SIZE SIGBUS
behaviour is fine for the newly-featured mTHPs or large folios, but not for
the long-established huge=always."

Thanks,
Hugh



* Re: regression - mm: shmem: add large folio support for tmpfs affect GPU performance.
  2025-07-25  4:47           ` Hugh Dickins
@ 2025-07-25  6:05             ` Baolin Wang
  2025-07-25  8:36               ` Patryk Kowalczyk
  2025-07-28  5:35               ` Hugh Dickins
  0 siblings, 2 replies; 7+ messages in thread
From: Baolin Wang @ 2025-07-25  6:05 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Hildenbrand, Patryk Kowalczyk, da.gomez, baohua,
	wangkefeng.wang, ioworker0, willy, ryan.roberts, akpm,
	eero.t.tamminen, Ville Syrjälä, linux-mm@kvack.org



On 2025/7/25 12:47, Hugh Dickins wrote:
> On Fri, 25 Jul 2025, Baolin Wang wrote:
>>>
>>> I hope to correct the logic of i915 driver's shmem allocation, by extending
> >> the shmem write length in the i915 driver to allocate PMD-sized THPs. IIUC,
>>> some sample fix code is as follows (untested). Patryk, could you help test
>>> it to see if this resolves your issue? Thanks.
> 
> This patch cannot be the right fix.  It may be a very sensible workaround
> for some in-kernel drivers (I've not looked or tried); but unless I
> misunderstand, it does nothing to restore userspace behaviour on a
> huge=always tmpfs.

Yes. Initially, we wanted to maintain compatibility with the 'huge=' 
option, meaning that a 'huge=always' tmpfs mount would still allocate 
PMD-sized THPs. However, the current implementation is the consensus we 
reached after much debate:

1. “When using tmpfs as a filesystem, it should behave like other 
filesystems. No more special mount options.” Per Matthew.
2. “Do not let the 'huge=' mount option mean 'PMD-sized' when other 
sizes exist.” Per David.

At the time, we should have sought your advice, but we failed to do so. 
The long historical discussion is in this thread[1]. So the current 
strategy for tmpfs's large folio support is:

"
Considering that tmpfs already has the 'huge=' option to control the 
PMD-sized large folios allocation, we can extend the 'huge=' option to 
allow any sized large folios. The semantics of the 'huge=' mount option are:
huge=never: no any sized large folios
huge=always: any sized large folios
huge=within_size: like 'always' but respect i_size
huge=advise: like 'always' if requested with madvise()

Note: For tmpfs mmap() faults, due to the lack of a write size hint, 
still allocate the PMD-sized large folios if 
huge=always/within_size/advise is set.

Moreover, the 'deny' and 'force' testing options controlled by 
'/sys/kernel/mm/transparent_hugepage/shmem_enabled' still retain the 
same semantics. The 'deny' can disable any sized large folios for tmpfs, 
while the 'force' can enable PMD sized large folios for tmpfs.
"

Currently, we have observed a regression in the i915 driver, but have not 
yet seen a userspace regression on a huge=always tmpfs.

If you have better suggestions, please feel free to point them out. Thanks.

[1] https://lore.kernel.org/lkml/Zw_IT136rxW_KuhU@casper.infradead.org/

> Please reread my comment earlier in the thread, in particular,
> Passing a new SIGBUS xfstest does not excuse a regression: strict PAGE_SIZE
> SIGBUS behaviour is fine for the newly-featured mTHPs or large folios,
> but not for the long-established huge=always.




* Re: regression - mm: shmem: add large folio support for tmpfs affect GPU performance.
  2025-07-25  6:05             ` Baolin Wang
@ 2025-07-25  8:36               ` Patryk Kowalczyk
  2025-07-25  9:17                 ` Baolin Wang
  2025-07-28  5:35               ` Hugh Dickins
  1 sibling, 1 reply; 7+ messages in thread
From: Patryk Kowalczyk @ 2025-07-25  8:36 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Hugh Dickins, David Hildenbrand, da.gomez, baohua,
	wangkefeng.wang, ioworker0, willy, ryan.roberts, akpm,
	eero.t.tamminen, Ville Syrjälä, linux-mm@kvack.org


I have tested the fix for the i915 driver with kernel 6.16-rc7 and an Intel
Core Ultra 155H.
The performance in several tests is back at the kernel 6.12 level,
so from a functional standpoint, the issue has been resolved.
Additionally, I also see potential for further optimization of the Xe
driver.

regards,
Patryk
On Fri, 25 Jul 2025 at 08:05, Baolin Wang <baolin.wang@linux.alibaba.com>
wrote:

>
>
> On 2025/7/25 12:47, Hugh Dickins wrote:
> > On Fri, 25 Jul 2025, Baolin Wang wrote:
> >>>
> >>> I hope to correct the logic of i915 driver's shmem allocation, by extending
> >>> the shmem write length in the i915 driver to allocate PMD-sized THPs. IIUC,
> >>> some sample fix code is as follows (untested). Patryk, could you help test
> >>> it to see if this resolves your issue? Thanks.
> >
> > This patch cannot be the right fix.  It may be a very sensible workaround
> > for some in-kernel drivers (I've not looked or tried); but unless I
> > misunderstand, it does nothing to restore userspace behaviour on a
> > huge=always tmpfs.
>
> Yes. Initially, we wanted to maintain compatibility with the 'huge='
> option, meaning that 'huge=always' tmpfs mount would still allocate
> PMD-sized THPs. However, the current implementation is the consensus we
> reached after much debate:
>
> 1. “When using tmpfs as a filesystem, it should behave like other
> filesystems. No more special mount options.” Per Matthew.
> 2. “Do not let the 'huge=' mount option mean 'PMD-sized' when other
> sizes exist.” Per David.
>
> At the time, we should have sought your advice, but we failed. The long
> historical discussion is in this thread[1]. So now the strategy for
> tmpfs supporting large folios is:
>
> "
> Considering that tmpfs already has the 'huge=' option to control the
> PMD-sized large folios allocation, we can extend the 'huge=' option to
> > allow any sized large folios. The semantics of the 'huge=' mount option are:
> huge=never: no any sized large folios
> huge=always: any sized large folios
> huge=within_size: like 'always' but respect i_size
> huge=advise: like 'always' if requested with madvise()
>
> Note: For tmpfs mmap() faults, due to the lack of a write size hint,
> still allocate the PMD-sized large folios if
> huge=always/within_size/advise is set.
>
> Moreover, the 'deny' and 'force' testing options controlled by
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled' still retain the
> same semantics. The 'deny' can disable any sized large folios for tmpfs,
> while the 'force' can enable PMD sized large folios for tmpfs.
> "
>
> Currently, we have observed regression in the i915 driver but have not
> yet seen userspace regression on a huge=always tmpfs.
>
> If you have better suggestions, please feel free to point them out. Thanks.
>
> [1] https://lore.kernel.org/lkml/Zw_IT136rxW_KuhU@casper.infradead.org/
>
> > Please reread my comment earlier in the thread, in particular,
> > Passing a new SIGBUS xfstest does not excuse a regression: strict PAGE_SIZE
> > SIGBUS behaviour is fine for the newly-featured mTHPs or large folios,
> > but not for the long-established huge=always.
>
>



* Re: regression - mm: shmem: add large folio support for tmpfs affect GPU performance.
  2025-07-25  8:36               ` Patryk Kowalczyk
@ 2025-07-25  9:17                 ` Baolin Wang
  0 siblings, 0 replies; 7+ messages in thread
From: Baolin Wang @ 2025-07-25  9:17 UTC (permalink / raw)
  To: Patryk Kowalczyk
  Cc: Hugh Dickins, David Hildenbrand, da.gomez, baohua,
	wangkefeng.wang, ioworker0, willy, ryan.roberts, akpm,
	eero.t.tamminen, Ville Syrjälä, linux-mm@kvack.org



On 2025/7/25 16:36, Patryk Kowalczyk wrote:
> I have tested the fix for the i915 driver with kernel 6.16-rc7 and intel
> core ultra 155H.
> The performance in several tests is on the kernel 6.12 level,
> so from a functional standpoint, the issue has been resolved.

Thanks for testing. I need to wait for Hugh's opinion before sending out 
a formal fix patch.

> Additionally, I also see potential for further optimization of the Xe
> driver.

If you can point me to where the Xe driver allocates shmem memory, I can 
help with the fix. Thanks.

>> On 2025/7/25 12:47, Hugh Dickins wrote:
>>> On Fri, 25 Jul 2025, Baolin Wang wrote:
>>>>>
>>>>> I hope to correct the logic of i915 driver's shmem allocation, by extending
>>>>> the shmem write length in the i915 driver to allocate PMD-sized THPs. IIUC,
>>>>> some sample fix code is as follows (untested). Patryk, could you help test
>>>>> it to see if this resolves your issue? Thanks.
>>>
>>> This patch cannot be the right fix.  It may be a very sensible workaround
>>> for some in-kernel drivers (I've not looked or tried); but unless I
>>> misunderstand, it does nothing to restore userspace behaviour on a
>>> huge=always tmpfs.
>>
>> Yes. Initially, we wanted to maintain compatibility with the 'huge='
>> option, meaning that 'huge=always' tmpfs mount would still allocate
>> PMD-sized THPs. However, the current implementation is the consensus we
>> reached after much debate:
>>
>> 1. “When using tmpfs as a filesystem, it should behave like other
>> filesystems. No more special mount options.” Per Matthew.
>> 2. “Do not let the 'huge=' mount option mean 'PMD-sized' when other
>> sizes exist.” Per David.
>>
>> At the time, we should have sought your advice, but we failed. The long
>> historical discussion is in this thread[1]. So now the strategy for
>> tmpfs supporting large folios is:
>>
>> "
>> Considering that tmpfs already has the 'huge=' option to control the
>> PMD-sized large folios allocation, we can extend the 'huge=' option to
>> allow any sized large folios. The semantics of the 'huge=' mount option
>> are:
>> huge=never: no any sized large folios
>> huge=always: any sized large folios
>> huge=within_size: like 'always' but respect i_size
>> huge=advise: like 'always' if requested with madvise()
>>
>> Note: For tmpfs mmap() faults, due to the lack of a write size hint,
>> still allocate the PMD-sized large folios if
>> huge=always/within_size/advise is set.
>>
>> Moreover, the 'deny' and 'force' testing options controlled by
>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled' still retain the
>> same semantics. The 'deny' can disable any sized large folios for tmpfs,
>> while the 'force' can enable PMD sized large folios for tmpfs.
>> "
>>
>> Currently, we have observed regression in the i915 driver but have not
>> yet seen userspace regression on a huge=always tmpfs.
>>
>> If you have better suggestions, please feel free to point them out. Thanks.
>>
>> [1] https://lore.kernel.org/lkml/Zw_IT136rxW_KuhU@casper.infradead.org/
>>
>>> Please reread my comment earlier in the thread, in particular,
>>> Passing a new SIGBUS xfstest does not excuse a regression: strict PAGE_SIZE
>>> SIGBUS behaviour is fine for the newly-featured mTHPs or large folios,
>>> but not for the long-established huge=always.
>>
>>
> 




* Re: regression - mm: shmem: add large folio support for tmpfs affect GPU performance.
  2025-07-25  6:05             ` Baolin Wang
  2025-07-25  8:36               ` Patryk Kowalczyk
@ 2025-07-28  5:35               ` Hugh Dickins
  2025-07-28  6:29                 ` Baolin Wang
  1 sibling, 1 reply; 7+ messages in thread
From: Hugh Dickins @ 2025-07-28  5:35 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Hugh Dickins, David Hildenbrand, Patryk Kowalczyk, da.gomez,
	baohua, wangkefeng.wang, ioworker0, willy, ryan.roberts, akpm,
	eero.t.tamminen, Ville Syrjälä, linux-mm@kvack.org


On Fri, 25 Jul 2025, Baolin Wang wrote:
> On 2025/7/25 12:47, Hugh Dickins wrote:
> > On Fri, 25 Jul 2025, Baolin Wang wrote:
> >>>
> >>> I hope to correct the logic of i915 driver's shmem allocation, by
> >>> extending
> >>> the shmem write length in the i915 driver to allocate PMD-sized THPs.
> >>> IIUC,
> >>> some sample fix code is as follows (untested). Patryk, could you help test
> >>> it to see if this resolves your issue? Thanks.
> > 
> > This patch cannot be the right fix.  It may be a very sensible workaround
> > for some in-kernel drivers (I've not looked or tried); but unless I
> > misunderstand, it does nothing to restore userspace behaviour on a
> > huge=always tmpfs.
> 
> Yes. Initially, we wanted to maintain compatibility with the 'huge=' option,
> meaning that 'huge=always' tmpfs mount would still allocate PMD-sized THPs.
> However, the current implementation is the consensus we reached after much
> debate:
> 
> 1. “When using tmpfs as a filesystem, it should behave like other filesystems.
> No more special mount options.” Per Matthew.

That's okay, I've not proposed a new mount option at all (though that is
rather how "never" came to end up meaning "not usually": our shared dislike
for adding yet more options).  I'm proposing (shock horror) respecting the
long-standing meaning of "huge=always".

> 2. “Do not let the 'huge=' mount option mean 'PMD-sized' when other sizes
> exist.” Per David.

That's less obvious.  The collision in tmpfs between anon mTHP, file large
folio, and huge mount option (where shmem_enabled in sysfs provides that
mount option for the internal mounts) is certainly difficult to resolve
in any way pleasing to all (or any) of us.

But what remains clear is that we should not degrade the behaviour of
"huge=always" for existing users: they were given PMD-sized when possible
before, and they should be given PMD-sized when possible now (not suited
to all usages, when "huge=within_size" may be more suitable).

> 
> At the time, we should have sought your advice, but we failed. The long
> historical discussion is in this thread[1]. So now the strategy for tmpfs
> supporting large folios is:

Yes, it's a pity how limited and unresponsive I am, then and now and forever;
but the principle of not regressing userspace is not a topic on which my
special input should be needed.

> 
> "
> Considering that tmpfs already has the 'huge=' option to control the PMD-sized
> large folios allocation, we can extend the 'huge=' option to allow any sized
> large folios. The semantics of the 'huge=' mount option are:
> huge=never: no any sized large folios
> huge=always: any sized large folios
> huge=within_size: like 'always' but respect i_size
> huge=advise: like 'always' if requested with madvise()
> 
> Note: For tmpfs mmap() faults, due to the lack of a write size hint, still
> allocate the PMD-sized large folios if huge=always/within_size/advise is set.
> 
> Moreover, the 'deny' and 'force' testing options controlled by
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled' still retain the same
> semantics. The 'deny' can disable any sized large folios for tmpfs, while the
> 'force' can enable PMD sized large folios for tmpfs.
> "

Thanks for the summary, I'll have to come back to it another time: on
first reading, it is not incompatible with "huge=always" always trying
for PMD-sized, but falling back to smaller large folios when unsuccessful.
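In rough code terms, that would look something like the sketch below (an
untested illustration only, ignoring mempolicy, the swap path and the real
shmem allocation helpers; it is not what mm/shmem.c currently does):

/*
 * Rough sketch of "huge=always tries PMD-sized first, then falls back":
 * untested illustration, not the current mm/shmem.c behaviour.
 */
#include <linux/gfp.h>
#include <linux/huge_mm.h>
#include <linux/mm.h>

static struct folio *huge_always_alloc(gfp_t gfp)
{
	int order;

	/* try PMD-sized first, then each smaller large order */
	for (order = HPAGE_PMD_ORDER; order > 0; order--) {
		struct folio *folio;

		folio = folio_alloc(gfp | __GFP_NORETRY | __GFP_NOWARN, order);
		if (folio)
			return folio;
	}
	/* order-0 as the last resort */
	return folio_alloc(gfp, 0);
}

That is what existing "huge=always" users would expect, at the cost of more
failed high-order attempts when memory is fragmented.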

(I'll mention in passing that I find it strange the way shmem is getting
large folios of a selected subset of sizes from one direction, but large
folios of all possible sizes from another direction - often dependent
on whether i_nlink is 0 at the time, but maybe not.  My own preference,
so long as those tunings exist, is that shmem should always be restricted
to the selected subset of sizes; but I may well alienate everyone I've
not already annoyed with that opinion, and it's probably "not a hill I'm
prepared to die on", nor even directly relevant here - except that I'd
better mention that unhappiness while I'm in the area.)

> 
> Currently, we have observed regression in the i915 driver but have not yet
> seen userspace regression on a huge=always tmpfs.

I shall not object to a temporary workaround to suit the i915 driver; but
insist it not be taken as excuse not to fix the userspace regression later.

> 
> If you have better suggestions, please feel free to point them out. Thanks.

Sounds like you're disinclined to fix it yourself, and I'll lose the
argument if it's not fixed during this cycle (since 6.17-next will become
6.18 LTS); so I'd better carve out the time to get into it in coming weeks.

Hugh

> 
> [1] https://lore.kernel.org/lkml/Zw_IT136rxW_KuhU@casper.infradead.org/
> 
> > Please reread my comment earlier in the thread, in particular,
> > Passing a new SIGBUS xfstest does not excuse a regression: strict PAGE_SIZE
> > SIGBUS behaviour is fine for the newly-featured mTHPs or large folios,
> > but not for the long-established huge=always.


* Re: regression - mm: shmem: add large folio support for tmpfs affect GPU performance.
  2025-07-28  5:35               ` Hugh Dickins
@ 2025-07-28  6:29                 ` Baolin Wang
  0 siblings, 0 replies; 7+ messages in thread
From: Baolin Wang @ 2025-07-28  6:29 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Hildenbrand, Patryk Kowalczyk, da.gomez, baohua,
	wangkefeng.wang, ioworker0, willy, ryan.roberts, akpm,
	eero.t.tamminen, Ville Syrjälä, linux-mm@kvack.org



On 2025/7/28 13:35, Hugh Dickins wrote:
> On Fri, 25 Jul 2025, Baolin Wang wrote:
>> On 2025/7/25 12:47, Hugh Dickins wrote:
>>> On Fri, 25 Jul 2025, Baolin Wang wrote:
>>>>>
>>>>> I hope to correct the logic of i915 driver's shmem allocation, by
>>>>> extending
>>>>> the shmem write length in the i915 driver to allocate PMD-sized THPs.
>>>>> IIUC,
>>>>> some sample fix code is as follows (untested). Patryk, could you help test
>>>>> it to see if this resolves your issue? Thanks.
>>>
>>> This patch cannot be the right fix.  It may be a very sensible workaround
>>> for some in-kernel drivers (I've not looked or tried); but unless I
>>> misunderstand, it does nothing to restore userspace behaviour on a
>>> huge=always tmpfs.
>>
>> Yes. Initially, we wanted to maintain compatibility with the 'huge=' option,
>> meaning that 'huge=always' tmpfs mount would still allocate PMD-sized THPs.
>> However, the current implementation is the consensus we reached after much
>> debate:
>>
>> 1. “When using tmpfs as a filesystem, it should behave like other filesystems.
>> No more special mount options.” Per Matthew.
> 
> That's okay, I've not proposed a new mount option at all (though that is
> rather how "never" came to end up meaning "not usually": our shared dislike
> for adding yet more options).  I'm proposing (shock horror) respecting the
> long-standing meaning of "huge=always".

OK.

>> 2. “Do not let the 'huge=' mount option mean 'PMD-sized' when other sizes
>> exist.” Per David.
> 
> That's less obvious.  The collision in tmpfs between anon mTHP, file large
> folio, and huge mount option (where shmem_enabled in sysfs provides that
> mount option for the internal mounts) is certainly difficult to resolve
> in any way pleasing to all (or any) of us.
> 
> But what remains clear is that we should not degrade the behaviour of
> "huge=always" for existing users: they were given PMD-sized when possible
> before, and they should be given PMD-sized when possible now (not suited
> to all usages, when "huge=within_size" may be more suitable).

This is the most contentious point. Once we agree on the principle that 
'when using tmpfs as a filesystem, it should behave like other 
filesystems', it follows that tmpfs's large folio allocations should also 
be consistent with other filesystems, i.e. tmpfs derives a highest-order 
hint from the size of the write and fallocate paths, and then tries each 
allowable huge order (see shmem_huge_global_enabled()).

So, based on the above principle, we allocate large folios in the same 
way when the 'huge=always' option is set. However, as you mentioned, this 
breaks the previous behavior, where the 'huge=always' option always meant 
PMD-sized large folio allocation.
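To make that concrete, here is a heavily simplified sketch of the current
policy (an illustration only: the names are hypothetical and the real logic
lives in mm/shmem.c). The write length caps the highest order considered,
and each enabled order at or below that cap is then a candidate, largest
first:

/*
 * Illustration only (hypothetical names, heavily simplified from
 * mm/shmem.c): "enabled_orders" stands for the set of folio orders
 * enabled via the transparent_hugepage sysfs knobs, and the write
 * length caps which of those orders may be tried for this allocation.
 */
#include <stdint.h>

#define PAGE_SHIFT	12
#define PMD_ORDER	9

static int highest_usable_order(unsigned long enabled_orders, uint64_t write_len)
{
	int order;

	for (order = PMD_ORDER; order > 0; order--) {
		if (!(enabled_orders & (1UL << order)))
			continue;		/* this size is not enabled */
		if (write_len < (1ULL << (PAGE_SHIFT + order)))
			continue;		/* the write does not cover it */
		return order;			/* largest order worth trying first */
	}
	return 0;
}

With a page-by-page or 64K write this never reaches the PMD order, whereas
the pre-6.14 'huge=always' behaviour effectively started at the PMD order
regardless of the write length, which is exactly the change you are
objecting to.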

>> At the time, we should have sought your advice, but we failed. The long
>> historical discussion is in this thread[1]. So now the strategy for tmpfs
>> supporting large folios is:
> 
> Yes, it's a pity how limited and unresponsive I am, then and now and forever;
> but the principle of not regressing userspace is not a topic on which my
> special input should be needed.
> 
>>
>> "
>> Considering that tmpfs already has the 'huge=' option to control the PMD-sized
>> large folios allocation, we can extend the 'huge=' option to allow any sized
>> large folios. The semantics of the 'huge=' mount option are:
>> huge=never: no any sized large folios
>> huge=always: any sized large folios
>> huge=within_size: like 'always' but respect i_size
>> huge=advise: like 'always' if requested with madvise()
>>
>> Note: For tmpfs mmap() faults, due to the lack of a write size hint, still
>> allocate the PMD-sized large folios if huge=always/within_size/advise is set.
>>
>> Moreover, the 'deny' and 'force' testing options controlled by
>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled' still retain the same
>> semantics. The 'deny' can disable any sized large folios for tmpfs, while the
>> 'force' can enable PMD sized large folios for tmpfs.
>> "
> 
> Thanks for the summary, I'll have to come back to it another time: on
> first reading, it is not incompatible with "huge=always" always trying
> for PMD-sized, but falling back to smaller large folios when unsuccessful.
> 
> (I'll mention in passing that I find it strange the way shmem is getting
> large folios of a selected subset of sizes from one direction, but large
> folios of all possible sizes from another direction - often dependent
> on whether i_nlink is 0 at the time, but maybe not.  My own preference,
> so long as those tunings exist, is that shmem should always be restricted
> to the selected subset of sizes; but I may well alienate everyone I've
> not already annoyed with that opinion, and it's probably "not a hill I'm
> prepared to die on", nor even directly relevant here - except that I'd
> better mention that unhappiness while I'm in the area.)
> 
>>
>> Currently, we have observed regression in the i915 driver but have not yet
>> seen userspace regression on a huge=always tmpfs.
> 
> I shall not object to a temporary workaround to suit the i915 driver; but
> insist it not be taken as excuse not to fix the userspace regression later.

OK. Let me fix the i915 driver first.

>> If you have better suggestions, please feel free to point them out. Thanks.
> 
> Sounds like you're disinclined to fix it yourself, and I'll lose the

No, I do intend to address this incompatibility myself. However, I 
wanted to clearly describe our previous discussions and decisions to you 
first, and then make a decision after fully understanding your opinion.

Now I see your point and I can create a patch to fix the semantics of 
'huge=always,' but I think there will still be a lot of contention and 
discussion. Let's continue the discussion in that patch.

Thanks for your input.

> argument if it's not fixed during this cycle (since 6.17-next will become
> 6.18 LTS); so I'd better carve out the time to get into it in coming weeks.



end of thread

Thread overview: 7+ messages
     [not found] <CAJCW39JCDX6_S2Ojt1HMmX-h_qAKm2eBRzxX5kOHNJz60Zu=vw@mail.gmail.com>
     [not found] ` <d5c6ac93-1af0-4093-afea-94a29a387903@redhat.com>
     [not found]   ` <63b69425-2fd1-2c77-06d6-e7ea25c92f34@google.com>
     [not found]     ` <3f204974-26c8-4d5f-b7ae-4052cbfdf4ac@redhat.com>
     [not found]       ` <a8ac7ec3-4cb3-4dd8-8d02-ede6905f322e@linux.alibaba.com>
2025-07-25  2:38         ` regression - mm: shmem: add large folio support for tmpfs affect GPU performance Baolin Wang
2025-07-25  4:47           ` Hugh Dickins
2025-07-25  6:05             ` Baolin Wang
2025-07-25  8:36               ` Patryk Kowalczyk
2025-07-25  9:17                 ` Baolin Wang
2025-07-28  5:35               ` Hugh Dickins
2025-07-28  6:29                 ` Baolin Wang
