* [PATCH v3 0/6] Support large folios for tmpfs
@ 2024-11-28 7:40 Baolin Wang
2024-11-28 7:40 ` [PATCH v3 1/6] mm: factor out the order calculation into a new helper Baolin Wang
` (5 more replies)
0 siblings, 6 replies; 17+ messages in thread
From: Baolin Wang @ 2024-11-28 7:40 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, baolin.wang, linux-mm, linux-kernel
Traditionally, tmpfs only supported PMD-sized large folios. However, nowadays,
with other file systems supporting any-sized large folios and anonymous memory
extended to support mTHP, we should not restrict tmpfs to allocating only
PMD-sized large folios, which makes it a special case. Instead, we should
allow tmpfs to allocate any-sized large folios.
Considering that tmpfs already has the 'huge=' option to control the PMD-sized
large folios allocation, we can extend the 'huge=' option to allow any-sized
large folios. The semantics of the 'huge=' mount option are:
huge=never: never allocate large folios of any size
huge=always: allocate large folios of any size
huge=within_size: like 'always', but respect the i_size
huge=advise: like 'always' if requested with madvise()
Note: for tmpfs mmap() faults, we still allocate PMD-sized large folios if
huge=always/within_size/advise is set, due to the lack of a write size hint.
Moreover, the 'deny' and 'force' testing options controlled by
'/sys/kernel/mm/transparent_hugepage/shmem_enabled' still retain the same
semantics: 'deny' disables large folios of any size for tmpfs, while 'force'
enables PMD-sized large folios for tmpfs.
Any comments and suggestions are appreciated. Thanks.
Changes from v2:
- Collect reviewed tags. Thanks.
- Add a new patch to drop fadvise from the docs.
- Drop the 'MAX_PAGECACHE_ORDER' check in shmem_huge_global_enabled(),
per David.
- Update the commit message, per Daniel.
- Rebase on the latest mm-unstable branch.
Changes from v1:
- Add reviewed tag from Barry and David. Thanks.
- Fix building warnings reported by kernel test robot.
- Add a new patch to control the default huge policy for tmpfs.
Changes from RFC v3:
- Drop the huge=write_size option.
- Allow any-sized huge folios for the 'huge' option.
- Update the documentation, per David.
Changes from RFC v2:
- Drop mTHP interfaces to control huge page allocation, per Matthew.
- Add a new helper to calculate the order, suggested by Matthew.
- Add a new huge=write_size option to allocate large folios based on
the write size.
- Add a new patch to update the documentation.
Changes from RFC v1:
- Drop patch 1.
- Use 'write_end' to calculate the length in shmem_allowable_huge_orders().
- Update shmem_mapping_size_order() per Daniel.
Baolin Wang (5):
mm: factor out the order calculation into a new helper
mm: shmem: change shmem_huge_global_enabled() to return huge order
bitmap
mm: shmem: add large folio support for tmpfs
mm: shmem: add a kernel command line to change the default huge policy
for tmpfs
docs: tmpfs: drop 'fadvise()' from the documentation
David Hildenbrand (1):
docs: tmpfs: update the large folios policy for tmpfs and shmem
.../admin-guide/kernel-parameters.txt | 7 +
Documentation/admin-guide/mm/transhuge.rst | 72 ++++++---
include/linux/pagemap.h | 16 +-
mm/shmem.c | 150 ++++++++++++++----
4 files changed, 188 insertions(+), 57 deletions(-)
--
2.39.3
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v3 1/6] mm: factor out the order calculation into a new helper
2024-11-28 7:40 [PATCH v3 0/6] Support large folios for tmpfs Baolin Wang
@ 2024-11-28 7:40 ` Baolin Wang
2024-11-28 7:40 ` [PATCH v3 2/6] mm: shmem: change shmem_huge_global_enabled() to return huge order bitmap Baolin Wang
` (4 subsequent siblings)
5 siblings, 0 replies; 17+ messages in thread
From: Baolin Wang @ 2024-11-28 7:40 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, baolin.wang, linux-mm, linux-kernel
Factor out the order calculation into a new helper, which can be reused
by shmem in the following patch.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
---
include/linux/pagemap.h | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index bcf0865a38ae..d796c8a33647 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -727,6 +727,16 @@ typedef unsigned int __bitwise fgf_t;
#define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
+static inline unsigned int filemap_get_order(size_t size)
+{
+ unsigned int shift = ilog2(size);
+
+ if (shift <= PAGE_SHIFT)
+ return 0;
+
+ return shift - PAGE_SHIFT;
+}
+
/**
* fgf_set_order - Encode a length in the fgf_t flags.
* @size: The suggested size of the folio to create.
@@ -740,11 +750,11 @@ typedef unsigned int __bitwise fgf_t;
*/
static inline fgf_t fgf_set_order(size_t size)
{
- unsigned int shift = ilog2(size);
+ unsigned int order = filemap_get_order(size);
- if (shift <= PAGE_SHIFT)
+ if (!order)
return 0;
- return (__force fgf_t)((shift - PAGE_SHIFT) << 26);
+ return (__force fgf_t)(order << 26);
}
void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
--
2.39.3
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH v3 2/6] mm: shmem: change shmem_huge_global_enabled() to return huge order bitmap
2024-11-28 7:40 [PATCH v3 0/6] Support large folios for tmpfs Baolin Wang
2024-11-28 7:40 ` [PATCH v3 1/6] mm: factor out the order calculation into a new helper Baolin Wang
@ 2024-11-28 7:40 ` Baolin Wang
2024-11-28 7:40 ` [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs Baolin Wang
` (3 subsequent siblings)
5 siblings, 0 replies; 17+ messages in thread
From: Baolin Wang @ 2024-11-28 7:40 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, baolin.wang, linux-mm, linux-kernel
Change shmem_huge_global_enabled() to return a bitmap of the suitable huge
orders, and return 0 if huge pages are not allowed. This is a preparation
for supporting allocation of various huge orders for tmpfs in the following
patches.
No functional changes.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
mm/shmem.c | 40 ++++++++++++++++++++--------------------
1 file changed, 20 insertions(+), 20 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index ccb9629a0f70..7595c3db4c1c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -554,37 +554,37 @@ static bool shmem_confirm_swap(struct address_space *mapping,
static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
-static bool shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
- loff_t write_end, bool shmem_huge_force,
- unsigned long vm_flags)
+static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
+ loff_t write_end, bool shmem_huge_force,
+ unsigned long vm_flags)
{
loff_t i_size;
if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
- return false;
+ return 0;
if (!S_ISREG(inode->i_mode))
- return false;
+ return 0;
if (shmem_huge == SHMEM_HUGE_DENY)
- return false;
+ return 0;
if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
- return true;
+ return BIT(HPAGE_PMD_ORDER);
switch (SHMEM_SB(inode->i_sb)->huge) {
case SHMEM_HUGE_ALWAYS:
- return true;
+ return BIT(HPAGE_PMD_ORDER);
case SHMEM_HUGE_WITHIN_SIZE:
index = round_up(index + 1, HPAGE_PMD_NR);
i_size = max(write_end, i_size_read(inode));
i_size = round_up(i_size, PAGE_SIZE);
if (i_size >> PAGE_SHIFT >= index)
- return true;
+ return BIT(HPAGE_PMD_ORDER);
fallthrough;
case SHMEM_HUGE_ADVISE:
if (vm_flags & VM_HUGEPAGE)
- return true;
+ return BIT(HPAGE_PMD_ORDER);
fallthrough;
default:
- return false;
+ return 0;
}
}
@@ -779,11 +779,11 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
return 0;
}
-static bool shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
- loff_t write_end, bool shmem_huge_force,
- unsigned long vm_flags)
+static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
+ loff_t write_end, bool shmem_huge_force,
+ unsigned long vm_flags)
{
- return false;
+ return 0;
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -1685,21 +1685,21 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
unsigned long mask = READ_ONCE(huge_shmem_orders_always);
unsigned long within_size_orders = READ_ONCE(huge_shmem_orders_within_size);
unsigned long vm_flags = vma ? vma->vm_flags : 0;
- bool global_huge;
+ unsigned int global_orders;
loff_t i_size;
int order;
if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags)))
return 0;
- global_huge = shmem_huge_global_enabled(inode, index, write_end,
- shmem_huge_force, vm_flags);
+ global_orders = shmem_huge_global_enabled(inode, index, write_end,
+ shmem_huge_force, vm_flags);
if (!vma || !vma_is_anon_shmem(vma)) {
/*
* For tmpfs, we now only support PMD sized THP if huge page
* is enabled, otherwise fallback to order 0.
*/
- return global_huge ? BIT(HPAGE_PMD_ORDER) : 0;
+ return global_orders;
}
/*
@@ -1732,7 +1732,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
if (vm_flags & VM_HUGEPAGE)
mask |= READ_ONCE(huge_shmem_orders_madvise);
- if (global_huge)
+ if (global_orders > 0)
mask |= READ_ONCE(huge_shmem_orders_inherit);
return THP_ORDERS_ALL_FILE_DEFAULT & mask;
--
2.39.3
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
2024-11-28 7:40 [PATCH v3 0/6] Support large folios for tmpfs Baolin Wang
2024-11-28 7:40 ` [PATCH v3 1/6] mm: factor out the order calculation into a new helper Baolin Wang
2024-11-28 7:40 ` [PATCH v3 2/6] mm: shmem: change shmem_huge_global_enabled() to return huge order bitmap Baolin Wang
@ 2024-11-28 7:40 ` Baolin Wang
2025-04-29 17:44 ` [REGRESSION] " Ville Syrjälä
2024-11-28 7:40 ` [PATCH v3 4/6] mm: shmem: add a kernel command line to change the default huge policy " Baolin Wang
` (2 subsequent siblings)
5 siblings, 1 reply; 17+ messages in thread
From: Baolin Wang @ 2024-11-28 7:40 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, baolin.wang, linux-mm, linux-kernel
Add large folio support for the tmpfs write and fallocate paths, matching the
same high-order preference mechanism used in the iomap buffered IO path, as
used in __filemap_get_folio().
Add shmem_mapping_size_orders() to get a hint for the folio orders based on
the file size, which takes care of the mapping requirements.
Traditionally, tmpfs only supported PMD-sized large folios. However, nowadays,
with other file systems supporting any-sized large folios and anonymous memory
extended to support mTHP, we should not restrict tmpfs to allocating only
PMD-sized large folios, which makes it a special case. Instead, we should
allow tmpfs to allocate any-sized large folios.
Considering that tmpfs already has the 'huge=' option to control the PMD-sized
large folios allocation, we can extend the 'huge=' option to allow any-sized
large folios. The semantics of the 'huge=' mount option are:
huge=never: never allocate large folios of any size
huge=always: allocate large folios of any size
huge=within_size: like 'always', but respect the i_size
huge=advise: like 'always' if requested with madvise()
Note: for tmpfs mmap() faults, we still allocate PMD-sized huge folios if
huge=always/within_size/advise is set, due to the lack of a write size hint.
Moreover, the 'deny' and 'force' testing options controlled by
'/sys/kernel/mm/transparent_hugepage/shmem_enabled' still retain the same
semantics: 'deny' disables large folios of any size for tmpfs, while 'force'
enables PMD-sized large folios for tmpfs.
Co-developed-by: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/shmem.c | 99 ++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 81 insertions(+), 18 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 7595c3db4c1c..54eaa724c153 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -554,34 +554,100 @@ static bool shmem_confirm_swap(struct address_space *mapping,
static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
+/**
+ * shmem_mapping_size_orders - Get allowable folio orders for the given file size.
+ * @mapping: Target address_space.
+ * @index: The page index.
+ * @write_end: end of a write, could extend inode size.
+ *
+ * This returns huge orders for folios (when supported) based on the file size
+ * which the mapping currently allows at the given index. The index is relevant
+ * due to alignment considerations the mapping might have. The returned order
+ * may be less than the size passed.
+ *
+ * Return: The orders.
+ */
+static inline unsigned int
+shmem_mapping_size_orders(struct address_space *mapping, pgoff_t index, loff_t write_end)
+{
+ unsigned int order;
+ size_t size;
+
+ if (!mapping_large_folio_support(mapping) || !write_end)
+ return 0;
+
+ /* Calculate the write size based on the write_end */
+ size = write_end - (index << PAGE_SHIFT);
+ order = filemap_get_order(size);
+ if (!order)
+ return 0;
+
+ /* If we're not aligned, allocate a smaller folio */
+ if (index & ((1UL << order) - 1))
+ order = __ffs(index);
+
+ order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
+ return order > 0 ? BIT(order + 1) - 1 : 0;
+}
+
static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
loff_t write_end, bool shmem_huge_force,
+ struct vm_area_struct *vma,
unsigned long vm_flags)
{
+ unsigned int maybe_pmd_order = HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER ?
+ 0 : BIT(HPAGE_PMD_ORDER);
+ unsigned long within_size_orders;
+ unsigned int order;
+ pgoff_t aligned_index;
loff_t i_size;
- if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
- return 0;
if (!S_ISREG(inode->i_mode))
return 0;
if (shmem_huge == SHMEM_HUGE_DENY)
return 0;
if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
- return BIT(HPAGE_PMD_ORDER);
+ return maybe_pmd_order;
+ /*
+ * The huge order allocation for anon shmem is controlled through
+ * the mTHP interface, so we still use PMD-sized huge order to
+ * check whether global control is enabled.
+ *
+ * For tmpfs mmap()'s huge order, we still use PMD-sized order to
+ * allocate huge pages due to lack of a write size hint.
+ *
+ * Otherwise, tmpfs will allow getting a highest order hint based on
+ * the size of write and fallocate paths, then will try each allowable
+ * huge orders.
+ */
switch (SHMEM_SB(inode->i_sb)->huge) {
case SHMEM_HUGE_ALWAYS:
- return BIT(HPAGE_PMD_ORDER);
+ if (vma)
+ return maybe_pmd_order;
+
+ return shmem_mapping_size_orders(inode->i_mapping, index, write_end);
case SHMEM_HUGE_WITHIN_SIZE:
- index = round_up(index + 1, HPAGE_PMD_NR);
- i_size = max(write_end, i_size_read(inode));
- i_size = round_up(i_size, PAGE_SIZE);
- if (i_size >> PAGE_SHIFT >= index)
- return BIT(HPAGE_PMD_ORDER);
+ if (vma)
+ within_size_orders = maybe_pmd_order;
+ else
+ within_size_orders = shmem_mapping_size_orders(inode->i_mapping,
+ index, write_end);
+
+ order = highest_order(within_size_orders);
+ while (within_size_orders) {
+ aligned_index = round_up(index + 1, 1 << order);
+ i_size = max(write_end, i_size_read(inode));
+ i_size = round_up(i_size, PAGE_SIZE);
+ if (i_size >> PAGE_SHIFT >= aligned_index)
+ return within_size_orders;
+
+ order = next_order(&within_size_orders, order);
+ }
fallthrough;
case SHMEM_HUGE_ADVISE:
if (vm_flags & VM_HUGEPAGE)
- return BIT(HPAGE_PMD_ORDER);
+ return maybe_pmd_order;
fallthrough;
default:
return 0;
@@ -781,6 +847,7 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
loff_t write_end, bool shmem_huge_force,
+ struct vm_area_struct *vma,
unsigned long vm_flags)
{
return 0;
@@ -1176,7 +1243,7 @@ static int shmem_getattr(struct mnt_idmap *idmap,
STATX_ATTR_NODUMP);
generic_fillattr(idmap, request_mask, inode, stat);
- if (shmem_huge_global_enabled(inode, 0, 0, false, 0))
+ if (shmem_huge_global_enabled(inode, 0, 0, false, NULL, 0))
stat->blksize = HPAGE_PMD_SIZE;
if (request_mask & STATX_BTIME) {
@@ -1693,14 +1760,10 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
return 0;
global_orders = shmem_huge_global_enabled(inode, index, write_end,
- shmem_huge_force, vm_flags);
- if (!vma || !vma_is_anon_shmem(vma)) {
- /*
- * For tmpfs, we now only support PMD sized THP if huge page
- * is enabled, otherwise fallback to order 0.
- */
+ shmem_huge_force, vma, vm_flags);
+ /* Tmpfs huge pages allocation */
+ if (!vma || !vma_is_anon_shmem(vma))
return global_orders;
- }
/*
* Following the 'deny' semantics of the top level, force the huge
--
2.39.3
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH v3 4/6] mm: shmem: add a kernel command line to change the default huge policy for tmpfs
2024-11-28 7:40 [PATCH v3 0/6] Support large folios for tmpfs Baolin Wang
` (2 preceding siblings ...)
2024-11-28 7:40 ` [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs Baolin Wang
@ 2024-11-28 7:40 ` Baolin Wang
2024-11-28 7:40 ` [PATCH v3 5/6] docs: tmpfs: update the large folios policy for tmpfs and shmem Baolin Wang
2024-11-28 7:40 ` [PATCH v3 6/6] docs: tmpfs: drop 'fadvise()' from the documentation Baolin Wang
5 siblings, 0 replies; 17+ messages in thread
From: Baolin Wang @ 2024-11-28 7:40 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, baolin.wang, linux-mm, linux-kernel
Now tmpfs can allocate any-sized large folios, but the default huge policy
is still preferred to be 'never', since tmpfs does not behave like other
file systems in some cases, as previously explained by David[1]:
"
I think I raised this in the past, but tmpfs/shmem is just like any
other file system .. except it sometimes really isn't and behaves much
more like (swappable) anonymous memory. (or mlocked files)
There are many systems out there that run without swap enabled, or with
extremely minimal swap (IIRC until recently kubernetes was completely
incompatible with swapping). Swap can even be disabled today for shmem
using a mount option.
That's a big difference to all other file systems where you are
guaranteed to have backend storage where you can simply evict under
memory pressure (might temporarily fail, of course).
I *think* that's the reason why we have the "huge=" parameter that also
controls the THP allocations during page faults (IOW possible memory
over-allocation). Maybe also because it was a new feature, and we only
had a single THP size.
"
Thus, adding a new kernel command line option to change the default huge
policy will be helpful for using large folios with tmpfs, similar to the
'transparent_hugepage_shmem' cmdline for shmem.
[1] https://lore.kernel.org/all/cbadd5fe-69d5-4c21-8eb8-3344ed36c721@redhat.com/
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
.../admin-guide/kernel-parameters.txt | 7 ++++++
Documentation/admin-guide/mm/transhuge.rst | 6 +++++
mm/shmem.c | 23 ++++++++++++++++++-
3 files changed, 35 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index dc663c0ca670..e73383450240 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6987,6 +6987,13 @@
See Documentation/admin-guide/mm/transhuge.rst
for more details.
+ transparent_hugepage_tmpfs= [KNL]
+ Format: [always|within_size|advise|never]
+ Can be used to control the default hugepage allocation policy
+ for the tmpfs mount.
+ See Documentation/admin-guide/mm/transhuge.rst
+ for more details.
+
trusted.source= [KEYS]
Format: <string>
This parameter identifies the trust source as a backend
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 5034915f4e8e..9ae775eaacbe 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -332,6 +332,12 @@ allocation policy for the internal shmem mount by using the kernel parameter
seven valid policies for shmem (``always``, ``within_size``, ``advise``,
``never``, ``deny``, and ``force``).
+Similarly to ``transparent_hugepage_shmem``, you can control the default
+hugepage allocation policy for the tmpfs mount by using the kernel parameter
+``transparent_hugepage_tmpfs=<policy>``, where ``<policy>`` is one of the
+four valid policies for tmpfs (``always``, ``within_size``, ``advise``,
+``never``). The tmpfs mount default policy is ``never``.
+
In the same manner as ``thp_anon`` controls each supported anonymous THP
size, ``thp_shmem`` controls each supported shmem THP size. ``thp_shmem``
has the same format as ``thp_anon``, but also supports the policy
diff --git a/mm/shmem.c b/mm/shmem.c
index 54eaa724c153..8a602fc61edb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -553,6 +553,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
/* ifdef here to avoid bloating shmem.o when not necessary */
static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
+static int tmpfs_huge __read_mostly = SHMEM_HUGE_NEVER;
/**
* shmem_mapping_size_orders - Get allowable folio orders for the given file size.
@@ -4951,7 +4952,12 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
sbinfo->gid = ctx->gid;
sbinfo->full_inums = ctx->full_inums;
sbinfo->mode = ctx->mode;
- sbinfo->huge = ctx->huge;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if (ctx->seen & SHMEM_SEEN_HUGE)
+ sbinfo->huge = ctx->huge;
+ else
+ sbinfo->huge = tmpfs_huge;
+#endif
sbinfo->mpol = ctx->mpol;
ctx->mpol = NULL;
@@ -5502,6 +5508,21 @@ static int __init setup_transparent_hugepage_shmem(char *str)
}
__setup("transparent_hugepage_shmem=", setup_transparent_hugepage_shmem);
+static int __init setup_transparent_hugepage_tmpfs(char *str)
+{
+ int huge;
+
+ huge = shmem_parse_huge(str);
+ if (huge < 0) {
+ pr_warn("transparent_hugepage_tmpfs= cannot parse, ignored\n");
+ return huge;
+ }
+
+ tmpfs_huge = huge;
+ return 1;
+}
+__setup("transparent_hugepage_tmpfs=", setup_transparent_hugepage_tmpfs);
+
static char str_dup[PAGE_SIZE] __initdata;
static int __init setup_thp_shmem(char *str)
{
--
2.39.3
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH v3 5/6] docs: tmpfs: update the large folios policy for tmpfs and shmem
2024-11-28 7:40 [PATCH v3 0/6] Support large folios for tmpfs Baolin Wang
` (3 preceding siblings ...)
2024-11-28 7:40 ` [PATCH v3 4/6] mm: shmem: add a kernel command line to change the default huge policy " Baolin Wang
@ 2024-11-28 7:40 ` Baolin Wang
2024-11-28 7:40 ` [PATCH v3 6/6] docs: tmpfs: drop 'fadvise()' from the documentation Baolin Wang
5 siblings, 0 replies; 17+ messages in thread
From: Baolin Wang @ 2024-11-28 7:40 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, baolin.wang, linux-mm, linux-kernel
From: David Hildenbrand <david@redhat.com>
Update the large folios policy for tmpfs and shmem.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
Documentation/admin-guide/mm/transhuge.rst | 58 +++++++++++++++-------
1 file changed, 41 insertions(+), 17 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 9ae775eaacbe..ba6edff728ed 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -358,8 +358,21 @@ default to ``never``.
Hugepages in tmpfs/shmem
========================
-You can control hugepage allocation policy in tmpfs with mount option
-``huge=``. It can have following values:
+Traditionally, tmpfs only supported a single huge page size ("PMD"). Today,
+it also supports smaller sizes just like anonymous memory, often referred
+to as "multi-size THP" (mTHP). Huge pages of any size are commonly
+represented in the kernel as "large folios".
+
+While there is fine control over the huge page sizes to use for the internal
+shmem mount (see below), ordinary tmpfs mounts will make use of all available
+huge page sizes without any control over the exact sizes, behaving more like
+other file systems.
+
+tmpfs mounts
+------------
+
+The THP allocation policy for tmpfs mounts can be adjusted using the mount
+option: ``huge=``. It can have following values:
always
Attempt to allocate huge pages every time we need a new page;
@@ -374,19 +387,19 @@ within_size
advise
Only allocate huge pages if requested with fadvise()/madvise();
-The default policy is ``never``.
+Remember, that the kernel may use huge pages of all available sizes, and
+that no fine control as for the internal tmpfs mount is available.
+
+The default policy in the past was ``never``, but it can now be adjusted
+using the kernel parameter ``transparent_hugepage_tmpfs=<policy>``.
``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
``huge=never`` will not attempt to break up huge pages at all, just stop more
from being allocated.
-There's also sysfs knob to control hugepage allocation policy for internal
-shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
-is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
-MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
-
-In addition to policies listed above, shmem_enabled allows two further
-values:
+In addition to policies listed above, the sysfs knob
+/sys/kernel/mm/transparent_hugepage/shmem_enabled will affect the
+allocation policy of tmpfs mounts, when set to the following values:
deny
For use in emergencies, to force the huge option off from
@@ -394,13 +407,24 @@ deny
force
Force the huge option on for all - very useful for testing;
-Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to
-control mTHP allocation:
-'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled',
-and its value for each mTHP is essentially consistent with the global
-setting. An 'inherit' option is added to ensure compatibility with these
-global settings. Conversely, the options 'force' and 'deny' are dropped,
-which are rather testing artifacts from the old ages.
+shmem / internal tmpfs
+----------------------
+The mount internal tmpfs mount is used for SysV SHM, memfds, shared anonymous
+mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
+
+To control the THP allocation policy for this internal tmpfs mount, the
+sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled and the knobs
+per THP size in
+'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled'
+can be used.
+
+The global knob has the same semantics as the ``huge=`` mount options
+for tmpfs mounts, except that the different huge page sizes can be controlled
+individually, and will only use the setting of the global knob when the
+per-size knob is set to 'inherit'.
+
+The options 'force' and 'deny' are dropped for the individual sizes, which
+are rather testing artifacts from the old ages.
always
Attempt to allocate <size> huge pages every time we need a new page;
--
2.39.3
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH v3 6/6] docs: tmpfs: drop 'fadvise()' from the documentation
2024-11-28 7:40 [PATCH v3 0/6] Support large folios for tmpfs Baolin Wang
` (4 preceding siblings ...)
2024-11-28 7:40 ` [PATCH v3 5/6] docs: tmpfs: update the large folios policy for tmpfs and shmem Baolin Wang
@ 2024-11-28 7:40 ` Baolin Wang
5 siblings, 0 replies; 17+ messages in thread
From: Baolin Wang @ 2024-11-28 7:40 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, baolin.wang, linux-mm, linux-kernel
Drop 'fadvise()' from the doc, since fadvise() currently has no HUGEPAGE
advice.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
---
Documentation/admin-guide/mm/transhuge.rst | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index ba6edff728ed..333958ef0d5f 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -382,10 +382,10 @@ never
within_size
Only allocate huge page if it will be fully within i_size.
- Also respect fadvise()/madvise() hints;
+ Also respect madvise() hints;
advise
- Only allocate huge pages if requested with fadvise()/madvise();
+ Only allocate huge pages if requested with madvise();
Remember, that the kernel may use huge pages of all available sizes, and
that no fine control as for the internal tmpfs mount is available.
@@ -438,10 +438,10 @@ never
within_size
Only allocate <size> huge page if it will be fully within i_size.
- Also respect fadvise()/madvise() hints;
+ Also respect madvise() hints;
advise
- Only allocate <size> huge pages if requested with fadvise()/madvise();
+ Only allocate <size> huge pages if requested with madvise();
Need of application restart
===========================
--
2.39.3
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
2024-11-28 7:40 ` [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs Baolin Wang
@ 2025-04-29 17:44 ` Ville Syrjälä
2025-04-30 6:32 ` Baolin Wang
0 siblings, 1 reply; 17+ messages in thread
From: Ville Syrjälä @ 2025-04-29 17:44 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, willy, david, wangkefeng.wang, 21cnbao, ryan.roberts,
ioworker0, da.gomez, linux-mm, linux-kernel, regressions,
intel-gfx, Eero Tamminen
On Thu, Nov 28, 2024 at 03:40:41PM +0800, Baolin Wang wrote:
> Add large folio support for the tmpfs write and fallocate paths, matching the
> same high-order preference mechanism used in the iomap buffered IO path, as
> used in __filemap_get_folio().
>
> Add shmem_mapping_size_orders() to get a hint for the folio orders based on
> the file size, which takes care of the mapping requirements.
>
> Traditionally, tmpfs only supported PMD-sized large folios. However, nowadays,
> with other file systems supporting any-sized large folios and anonymous memory
> extended to support mTHP, we should not restrict tmpfs to allocating only
> PMD-sized large folios, which makes it a special case. Instead, we should
> allow tmpfs to allocate any-sized large folios.
>
> Considering that tmpfs already has the 'huge=' option to control the PMD-sized
> large folios allocation, we can extend the 'huge=' option to allow any-sized
> large folios. The semantics of the 'huge=' mount option are:
>
> huge=never: never allocate large folios of any size
> huge=always: allocate large folios of any size
> huge=within_size: like 'always', but respect the i_size
> huge=advise: like 'always' if requested with madvise()
>
> Note: for tmpfs mmap() faults, we still allocate PMD-sized huge folios if
> huge=always/within_size/advise is set, due to the lack of a write size hint.
>
> Moreover, the 'deny' and 'force' testing options controlled by
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled' still retain the same
> semantics: 'deny' disables large folios of any size for tmpfs, while 'force'
> enables PMD-sized large folios for tmpfs.
>
> Co-developed-by: Daniel Gomez <da.gomez@samsung.com>
> Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Hi,
This causes a huge regression in Intel iGPU texturing performance.
I haven't had time to look at this in detail, but presumably the
problem is that we're no longer getting huge pages from our
private tmpfs mount (done in i915_gemfs_init()).
Some more details at
https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13845
> ---
> mm/shmem.c | 99 ++++++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 81 insertions(+), 18 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 7595c3db4c1c..54eaa724c153 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -554,34 +554,100 @@ static bool shmem_confirm_swap(struct address_space *mapping,
>
> static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
>
> +/**
> + * shmem_mapping_size_orders - Get allowable folio orders for the given file size.
> + * @mapping: Target address_space.
> + * @index: The page index.
> + * @write_end: end of a write, could extend inode size.
> + *
> + * This returns huge orders for folios (when supported) based on the file size
> + * which the mapping currently allows at the given index. The index is relevant
> + * due to alignment considerations the mapping might have. The returned order
> + * may be less than the size passed.
> + *
> + * Return: The orders.
> + */
> +static inline unsigned int
> +shmem_mapping_size_orders(struct address_space *mapping, pgoff_t index, loff_t write_end)
> +{
> + unsigned int order;
> + size_t size;
> +
> + if (!mapping_large_folio_support(mapping) || !write_end)
> + return 0;
> +
> + /* Calculate the write size based on the write_end */
> + size = write_end - (index << PAGE_SHIFT);
> + order = filemap_get_order(size);
> + if (!order)
> + return 0;
> +
> + /* If we're not aligned, allocate a smaller folio */
> + if (index & ((1UL << order) - 1))
> + order = __ffs(index);
> +
> + order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
> + return order > 0 ? BIT(order + 1) - 1 : 0;
> +}
> +
> static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
> loff_t write_end, bool shmem_huge_force,
> + struct vm_area_struct *vma,
> unsigned long vm_flags)
> {
> + unsigned int maybe_pmd_order = HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER ?
> + 0 : BIT(HPAGE_PMD_ORDER);
> + unsigned long within_size_orders;
> + unsigned int order;
> + pgoff_t aligned_index;
> loff_t i_size;
>
> - if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
> - return 0;
> if (!S_ISREG(inode->i_mode))
> return 0;
> if (shmem_huge == SHMEM_HUGE_DENY)
> return 0;
> if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
> - return BIT(HPAGE_PMD_ORDER);
> + return maybe_pmd_order;
>
> + /*
> + * The huge order allocation for anon shmem is controlled through
> + * the mTHP interface, so we still use PMD-sized huge order to
> + * check whether global control is enabled.
> + *
> + * For tmpfs mmap()'s huge order, we still use PMD-sized order to
> + * allocate huge pages due to lack of a write size hint.
> + *
> + * Otherwise, tmpfs will allow getting a highest order hint based on
> + * the size of write and fallocate paths, then will try each allowable
> + * huge order.
> + */
> switch (SHMEM_SB(inode->i_sb)->huge) {
> case SHMEM_HUGE_ALWAYS:
> - return BIT(HPAGE_PMD_ORDER);
> + if (vma)
> + return maybe_pmd_order;
> +
> + return shmem_mapping_size_orders(inode->i_mapping, index, write_end);
> case SHMEM_HUGE_WITHIN_SIZE:
> - index = round_up(index + 1, HPAGE_PMD_NR);
> - i_size = max(write_end, i_size_read(inode));
> - i_size = round_up(i_size, PAGE_SIZE);
> - if (i_size >> PAGE_SHIFT >= index)
> - return BIT(HPAGE_PMD_ORDER);
> + if (vma)
> + within_size_orders = maybe_pmd_order;
> + else
> + within_size_orders = shmem_mapping_size_orders(inode->i_mapping,
> + index, write_end);
> +
> + order = highest_order(within_size_orders);
> + while (within_size_orders) {
> + aligned_index = round_up(index + 1, 1 << order);
> + i_size = max(write_end, i_size_read(inode));
> + i_size = round_up(i_size, PAGE_SIZE);
> + if (i_size >> PAGE_SHIFT >= aligned_index)
> + return within_size_orders;
> +
> + order = next_order(&within_size_orders, order);
> + }
> fallthrough;
> case SHMEM_HUGE_ADVISE:
> if (vm_flags & VM_HUGEPAGE)
> - return BIT(HPAGE_PMD_ORDER);
> + return maybe_pmd_order;
> fallthrough;
> default:
> return 0;
> @@ -781,6 +847,7 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
>
> static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
> loff_t write_end, bool shmem_huge_force,
> + struct vm_area_struct *vma,
> unsigned long vm_flags)
> {
> return 0;
> @@ -1176,7 +1243,7 @@ static int shmem_getattr(struct mnt_idmap *idmap,
> STATX_ATTR_NODUMP);
> generic_fillattr(idmap, request_mask, inode, stat);
>
> - if (shmem_huge_global_enabled(inode, 0, 0, false, 0))
> + if (shmem_huge_global_enabled(inode, 0, 0, false, NULL, 0))
> stat->blksize = HPAGE_PMD_SIZE;
>
> if (request_mask & STATX_BTIME) {
> @@ -1693,14 +1760,10 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
> return 0;
>
> global_orders = shmem_huge_global_enabled(inode, index, write_end,
> - shmem_huge_force, vm_flags);
> - if (!vma || !vma_is_anon_shmem(vma)) {
> - /*
> - * For tmpfs, we now only support PMD sized THP if huge page
> - * is enabled, otherwise fallback to order 0.
> - */
> + shmem_huge_force, vma, vm_flags);
> + /* Tmpfs huge pages allocation */
> + if (!vma || !vma_is_anon_shmem(vma))
> return global_orders;
> - }
>
> /*
> * Following the 'deny' semantics of the top level, force the huge
> --
> 2.39.3
>
--
Ville Syrjälä
Intel
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
2025-04-29 17:44 ` [REGRESSION] " Ville Syrjälä
@ 2025-04-30 6:32 ` Baolin Wang
2025-04-30 11:20 ` Ville Syrjälä
0 siblings, 1 reply; 17+ messages in thread
From: Baolin Wang @ 2025-04-30 6:32 UTC (permalink / raw)
To: Ville Syrjälä
Cc: akpm, hughd, willy, david, wangkefeng.wang, 21cnbao, ryan.roberts,
ioworker0, da.gomez, linux-mm, linux-kernel, regressions,
intel-gfx, Eero Tamminen
Hi,
On 2025/4/30 01:44, Ville Syrjälä wrote:
> On Thu, Nov 28, 2024 at 03:40:41PM +0800, Baolin Wang wrote:
>> Add large folio support for tmpfs write and fallocate paths matching the
>> same high order preference mechanism used in the iomap buffered IO path
>> as used in __filemap_get_folio().
>>
>> Add shmem_mapping_size_orders() to get a hint for the orders of the folio
>> based on the file size which takes care of the mapping requirements.
>>
>> Traditionally, tmpfs only supported PMD-sized large folios. However, nowadays
>> with other file systems supporting large folios of any size, and anonymous
>> memory extended to support mTHP, we should not restrict tmpfs to allocating
>> only PMD-sized large folios, making it more special. Instead, we should allow
>> tmpfs to allocate large folios of any size.
>>
>> Considering that tmpfs already has the 'huge=' option to control PMD-sized
>> large folio allocation, we can extend the 'huge=' option to allow large
>> folios of any size. The semantics of the 'huge=' mount option are:
>>
>> huge=never: no large folios of any size
>> huge=always: large folios of any size
>> huge=within_size: like 'always' but respect the i_size
>> huge=advise: like 'always' if requested with madvise()
>>
>> Note: for tmpfs mmap() faults, due to the lack of a write size hint, we still
>> allocate PMD-sized large folios if huge=always/within_size/advise is set.
>>
>> Moreover, the 'deny' and 'force' testing options controlled by
>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled' still retain the same
>> semantics: 'deny' disables large folios of any size for tmpfs, while
>> 'force' enables PMD-sized large folios for tmpfs.
>>
>> Co-developed-by: Daniel Gomez <da.gomez@samsung.com>
>> Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> Hi,
>
> This causes a huge regression in Intel iGPU texturing performance.
Unfortunately, I don't have such a platform to test it.
>
> I haven't had time to look at this in detail, but presumably the
> problem is that we're no longer getting huge pages from our
> private tmpfs mount (done in i915_gemfs_init()).
IIUC, the i915 driver still limits the maximum write size to PAGE_SIZE
in shmem_pwrite(), which prevents tmpfs from allocating large folios.
As mentioned in the comments below, tmpfs, like other file systems that
support large folios, derives a highest-order hint from the size of the
write and fallocate paths, and then tries each allowable huge order.

Therefore, I think shmem_pwrite() should be changed to remove the
limitation that the write size cannot exceed PAGE_SIZE.

Something like the following code (untested):
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index ae3343c81a64..97eefb73c5d2 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -420,6 +420,7 @@ shmem_pwrite(struct drm_i915_gem_object *obj,
struct address_space *mapping = obj->base.filp->f_mapping;
const struct address_space_operations *aops = mapping->a_ops;
char __user *user_data = u64_to_user_ptr(arg->data_ptr);
+ size_t chunk = mapping_max_folio_size(mapping);
u64 remain;
loff_t pos;
unsigned int pg;
@@ -463,10 +464,10 @@ shmem_pwrite(struct drm_i915_gem_object *obj,
void *data, *vaddr;
int err;
char __maybe_unused c;
+ size_t offset;
- len = PAGE_SIZE - pg;
- if (len > remain)
- len = remain;
+ offset = pos & (chunk - 1);
+ len = min(chunk - offset, remain);
/* Prefault the user page to reduce potential recursion */
err = __get_user(c, user_data);
* Re: [REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
2025-04-30 6:32 ` Baolin Wang
@ 2025-04-30 11:20 ` Ville Syrjälä
2025-04-30 13:24 ` Daniel Gomez
0 siblings, 1 reply; 17+ messages in thread
From: Ville Syrjälä @ 2025-04-30 11:20 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, willy, david, wangkefeng.wang, 21cnbao, ryan.roberts,
ioworker0, da.gomez, linux-mm, linux-kernel, regressions,
intel-gfx, Eero Tamminen
On Wed, Apr 30, 2025 at 02:32:39PM +0800, Baolin Wang wrote:
> Hi,
>
> On 2025/4/30 01:44, Ville Syrjälä wrote:
> > On Thu, Nov 28, 2024 at 03:40:41PM +0800, Baolin Wang wrote:
> >> Add large folio support for tmpfs write and fallocate paths matching the
> >> same high order preference mechanism used in the iomap buffered IO path
> >> as used in __filemap_get_folio().
> >>
> >> Add shmem_mapping_size_orders() to get a hint for the orders of the folio
> >> based on the file size which takes care of the mapping requirements.
> >>
> >> Traditionally, tmpfs only supported PMD-sized large folios. However, nowadays
> >> with other file systems supporting large folios of any size, and anonymous
> >> memory extended to support mTHP, we should not restrict tmpfs to allocating
> >> only PMD-sized large folios, making it more special. Instead, we should allow
> >> tmpfs to allocate large folios of any size.
> >>
> >> Considering that tmpfs already has the 'huge=' option to control PMD-sized
> >> large folio allocation, we can extend the 'huge=' option to allow large
> >> folios of any size. The semantics of the 'huge=' mount option are:
> >>
> >> huge=never: no large folios of any size
> >> huge=always: large folios of any size
> >> huge=within_size: like 'always' but respect the i_size
> >> huge=advise: like 'always' if requested with madvise()
> >>
> >> Note: for tmpfs mmap() faults, due to the lack of a write size hint, we still
> >> allocate PMD-sized large folios if huge=always/within_size/advise is set.
> >>
> >> Moreover, the 'deny' and 'force' testing options controlled by
> >> '/sys/kernel/mm/transparent_hugepage/shmem_enabled' still retain the same
> >> semantics: 'deny' disables large folios of any size for tmpfs, while
> >> 'force' enables PMD-sized large folios for tmpfs.
> >>
> >> Co-developed-by: Daniel Gomez <da.gomez@samsung.com>
> >> Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> >> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >
> > Hi,
> >
> > This causes a huge regression in Intel iGPU texturing performance.
>
> Unfortunately, I don't have such platform to test it.
>
> >
> > I haven't had time to look at this in detail, but presumably the
> > problem is that we're no longer getting huge pages from our
> > private tmpfs mount (done in i915_gemfs_init()).
>
> IIUC, the i915 driver still limits the maximum write size to PAGE_SIZE
> in the shmem_pwrite(),
pwrite is just one random way to write to objects, and probably
not something that's even used by current Mesa.
> which prevents tmpfs from allocating large
> folios. As mentioned in the comments below, tmpfs like other file
> systems that support large folios, will allow getting a highest order
> hint based on the size of the write and fallocate paths, and then will
> attempt each allowable huge order.
>
> Therefore, I think the shmem_pwrite() function should be changed to
> remove the limitation that the write size cannot exceed PAGE_SIZE.
>
> Something like the following code (untested):
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> index ae3343c81a64..97eefb73c5d2 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> @@ -420,6 +420,7 @@ shmem_pwrite(struct drm_i915_gem_object *obj,
> struct address_space *mapping = obj->base.filp->f_mapping;
> const struct address_space_operations *aops = mapping->a_ops;
> char __user *user_data = u64_to_user_ptr(arg->data_ptr);
> + size_t chunk = mapping_max_folio_size(mapping);
> u64 remain;
> loff_t pos;
> unsigned int pg;
> @@ -463,10 +464,10 @@ shmem_pwrite(struct drm_i915_gem_object *obj,
> void *data, *vaddr;
> int err;
> char __maybe_unused c;
> + size_t offset;
>
> - len = PAGE_SIZE - pg;
> - if (len > remain)
> - len = remain;
> + offset = pos & (chunk - 1);
> + len = min(chunk - offset, remain);
>
> /* Prefault the user page to reduce potential recursion */
> err = __get_user(c, user_data);
--
Ville Syrjälä
Intel
* Re: [REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
2025-04-30 11:20 ` Ville Syrjälä
@ 2025-04-30 13:24 ` Daniel Gomez
2025-05-02 1:02 ` Baolin Wang
0 siblings, 1 reply; 17+ messages in thread
From: Daniel Gomez @ 2025-04-30 13:24 UTC (permalink / raw)
To: Ville Syrjälä
Cc: Baolin Wang, akpm, hughd, willy, david, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, da.gomez, linux-mm, linux-kernel,
regressions, intel-gfx, Eero Tamminen
On Wed, Apr 30, 2025 at 02:20:02PM +0100, Ville Syrjälä wrote:
> On Wed, Apr 30, 2025 at 02:32:39PM +0800, Baolin Wang wrote:
> > On 2025/4/30 01:44, Ville Syrjälä wrote:
> > > On Thu, Nov 28, 2024 at 03:40:41PM +0800, Baolin Wang wrote:
> > > Hi,
> > >
> > > This causes a huge regression in Intel iGPU texturing performance.
> >
> > Unfortunately, I don't have such platform to test it.
> >
> > >
> > > I haven't had time to look at this in detail, but presumably the
> > > problem is that we're no longer getting huge pages from our
> > > private tmpfs mount (done in i915_gemfs_init()).
> >
> > IIUC, the i915 driver still limits the maximum write size to PAGE_SIZE
> > in the shmem_pwrite(),
>
> pwrite is just one random way to write to objects, and probably
> not something that's even used by current Mesa.
>
> > which prevents tmpfs from allocating large
> > folios. As mentioned in the comments below, tmpfs like other file
> > systems that support large folios, will allow getting a highest order
> > hint based on the size of the write and fallocate paths, and then will
> > attempt each allowable huge order.
> >
> > Therefore, I think the shmem_pwrite() function should be changed to
> > remove the limitation that the write size cannot exceed PAGE_SIZE.
To enable mTHP on tmpfs, the necessary knobs must first be enabled in sysfs,
as they are not enabled by default IIRC (only THP, at PMD level). Ville, I
see that i915_gemfs is mounted with the huge=within_size option. Can you
confirm whether /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled are
also marked as 'always' when the regression occurs?

Even if these are enabled, the possible difference may be that before, i915
always used PMD-sized pages (THP), whereas now mTHP will be used unless the
file size is as large as a PMD page. I think the 'always' mount option would
also try to infer the size and pick a folio order appropriate to that size.
Baolin, is that correct?

And Ville, can you confirm whether what i915 needs is to always enable
PMD-sized allocations?
* Re: [REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
2025-04-30 13:24 ` Daniel Gomez
@ 2025-05-02 1:02 ` Baolin Wang
2025-05-02 7:18 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: Baolin Wang @ 2025-05-02 1:02 UTC (permalink / raw)
To: Daniel Gomez, Ville Syrjälä
Cc: akpm, hughd, willy, david, wangkefeng.wang, 21cnbao, ryan.roberts,
ioworker0, da.gomez, linux-mm, linux-kernel, regressions,
intel-gfx, Eero Tamminen
On 2025/4/30 21:24, Daniel Gomez wrote:
> On Wed, Apr 30, 2025 at 02:20:02PM +0100, Ville Syrjälä wrote:
>> On Wed, Apr 30, 2025 at 02:32:39PM +0800, Baolin Wang wrote:
>>> On 2025/4/30 01:44, Ville Syrjälä wrote:
>>>> On Thu, Nov 28, 2024 at 03:40:41PM +0800, Baolin Wang wrote:
>>>> Hi,
>>>>
>>>> This causes a huge regression in Intel iGPU texturing performance.
>>>
>>> Unfortunately, I don't have such platform to test it.
>>>
>>>>
>>>> I haven't had time to look at this in detail, but presumably the
>>>> problem is that we're no longer getting huge pages from our
>>>> private tmpfs mount (done in i915_gemfs_init()).
>>>
>>> IIUC, the i915 driver still limits the maximum write size to PAGE_SIZE
>>> in the shmem_pwrite(),
>>
>> pwrite is just one random way to write to objects, and probably
>> not something that's even used by current Mesa.
>>
>>> which prevents tmpfs from allocating large
>>> folios. As mentioned in the comments below, tmpfs like other file
>>> systems that support large folios, will allow getting a highest order
>>> hint based on the size of the write and fallocate paths, and then will
>>> attempt each allowable huge order.
>>>
>>> Therefore, I think the shmem_pwrite() function should be changed to
>>> remove the limitation that the write size cannot exceed PAGE_SIZE.
>
> To enable mTHP on tmpfs, the necessary knobs must first be enabled in sysfs
> as they are not enabled by default IIRC (only THP, PMD level). Ville, I
> see i915_gemfs the huge=within_size mount option is passed. Can you confirm
> if /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled are also marked as
> 'always' when the regression is found?
The tmpfs mount will not be controlled by
'/sys/kernel/mm/transparent_hugepage/hugepages-*Kb/enabled' (except for
the debugging options 'deny' and 'force').
> Even if these are enabled, the possible difference may be that before, i915 was
> using PMD pages (THP) always and now mTHP will be used, unless the file size is
> as big as the PMD page. I think the always mount option would also try to infer
> the size to actually give a proper order folio according to that size. Baolin,
> is that correct?
Right.
> And Ville, can you confirm if what i915 needs is to enable PMD-size allocations
> always?
* Re: [REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
2025-05-02 1:02 ` Baolin Wang
@ 2025-05-02 7:18 ` David Hildenbrand
2025-05-02 13:10 ` Daniel Gomez
0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2025-05-02 7:18 UTC (permalink / raw)
To: Baolin Wang, Daniel Gomez, Ville Syrjälä
Cc: akpm, hughd, willy, wangkefeng.wang, 21cnbao, ryan.roberts,
ioworker0, da.gomez, linux-mm, linux-kernel, regressions,
intel-gfx, Eero Tamminen
On 02.05.25 03:02, Baolin Wang wrote:
>
>
> On 2025/4/30 21:24, Daniel Gomez wrote:
>> On Wed, Apr 30, 2025 at 02:20:02PM +0100, Ville Syrjälä wrote:
>>> On Wed, Apr 30, 2025 at 02:32:39PM +0800, Baolin Wang wrote:
>>>> On 2025/4/30 01:44, Ville Syrjälä wrote:
>>>>> On Thu, Nov 28, 2024 at 03:40:41PM +0800, Baolin Wang wrote:
>>>>> Hi,
>>>>>
>>>>> This causes a huge regression in Intel iGPU texturing performance.
>>>>
>>>> Unfortunately, I don't have such platform to test it.
>>>>
>>>>>
>>>>> I haven't had time to look at this in detail, but presumably the
>>>>> problem is that we're no longer getting huge pages from our
>>>>> private tmpfs mount (done in i915_gemfs_init()).
>>>>
>>>> IIUC, the i915 driver still limits the maximum write size to PAGE_SIZE
>>>> in the shmem_pwrite(),
>>>
>>> pwrite is just one random way to write to objects, and probably
>>> not something that's even used by current Mesa.
>>>
>>>> which prevents tmpfs from allocating large
>>>> folios. As mentioned in the comments below, tmpfs like other file
>>>> systems that support large folios, will allow getting a highest order
>>>> hint based on the size of the write and fallocate paths, and then will
>>>> attempt each allowable huge order.
>>>>
>>>> Therefore, I think the shmem_pwrite() function should be changed to
>>>> remove the limitation that the write size cannot exceed PAGE_SIZE.
>>
>> To enable mTHP on tmpfs, the necessary knobs must first be enabled in sysfs
>> as they are not enabled by default IIRC (only THP, PMD level). Ville, I
>> see i915_gemfs the huge=within_size mount option is passed. Can you confirm
>> if /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled are also marked as
>> 'always' when the regression is found?
>
> The tmpfs mount will not be controlled by
> '/sys/kernel/mm/transparent_hugepage/hugepages-*Kb/enabled' (except for
> the debugging options 'deny' and 'force').
Right, IIRC as requested by Willy, it should behave like other FSes
where there is no control over the folio size to be used.
--
Cheers,
David / dhildenb
* Re: [REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
2025-05-02 7:18 ` David Hildenbrand
@ 2025-05-02 13:10 ` Daniel Gomez
2025-05-02 15:31 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: Daniel Gomez @ 2025-05-02 13:10 UTC (permalink / raw)
To: David Hildenbrand, Baolin Wang
Cc: Ville Syrjälä, akpm, hughd, willy, wangkefeng.wang,
21cnbao, ryan.roberts, ioworker0, da.gomez, linux-mm,
linux-kernel, regressions, intel-gfx, Eero Tamminen
On Fri, May 02, 2025 at 09:18:41AM +0100, David Hildenbrand wrote:
> On 02.05.25 03:02, Baolin Wang wrote:
> >
> >
> > On 2025/4/30 21:24, Daniel Gomez wrote:
> > > On Wed, Apr 30, 2025 at 02:20:02PM +0100, Ville Syrjälä wrote:
> > > > On Wed, Apr 30, 2025 at 02:32:39PM +0800, Baolin Wang wrote:
> > > > > On 2025/4/30 01:44, Ville Syrjälä wrote:
> > > > > > On Thu, Nov 28, 2024 at 03:40:41PM +0800, Baolin Wang wrote:
> > > > > > Hi,
> > > > > >
> > > > > > This causes a huge regression in Intel iGPU texturing performance.
> > > > >
> > > > > Unfortunately, I don't have such platform to test it.
> > > > >
> > > > > >
> > > > > > I haven't had time to look at this in detail, but presumably the
> > > > > > problem is that we're no longer getting huge pages from our
> > > > > > private tmpfs mount (done in i915_gemfs_init()).
> > > > >
> > > > > IIUC, the i915 driver still limits the maximum write size to PAGE_SIZE
> > > > > in the shmem_pwrite(),
> > > >
> > > > pwrite is just one random way to write to objects, and probably
> > > > not something that's even used by current Mesa.
> > > >
> > > > > which prevents tmpfs from allocating large
> > > > > folios. As mentioned in the comments below, tmpfs like other file
> > > > > systems that support large folios, will allow getting a highest order
> > > > > hint based on the size of the write and fallocate paths, and then will
> > > > > attempt each allowable huge order.
> > > > >
> > > > > Therefore, I think the shmem_pwrite() function should be changed to
> > > > > remove the limitation that the write size cannot exceed PAGE_SIZE.
> > >
> > > To enable mTHP on tmpfs, the necessary knobs must first be enabled in sysfs
> > > as they are not enabled by default IIRC (only THP, PMD level). Ville, I
> > > see i915_gemfs the huge=within_size mount option is passed. Can you confirm
> > > if /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled are also marked as
> > > 'always' when the regression is found?
> >
> > The tmpfs mount will not be controlled by
> > '/sys/kernel/mm/transparent_hugepage/hugepages-*Kb/enabled' (except for
> > the debugging options 'deny' and 'force').
>
> Right, IIRC as requested by Willy, it should behave like other FSes where
> there is no control over the folio size to be used.
Thanks for reminding me. I forgot we finally changed it.
Could the performance drop be due to the driver no longer using PMD-level pages?
I also recall a performance drop when using order-8 and order-9 folios in tmpfs
with the initial per-block implementation. Baolin, did you experience anything
similar in the final implementation?
These were my numbers:
| Block Size (bs) | Linux Kernel v6.9 (GiB/s) | tmpfs with Large Folios v6.9 (GiB/s) |
| 4k | 20.4 | 20.5 |
| 8k | 34.3 | 34.3 |
| 16k | 52.9 | 52.2 |
| 32k | 70.2 | 76.9 |
| 64k | 73.9 | 92.5 |
| 128k | 76.7 | 101 |
| 256k | 80.5 | 114 |
| 512k | 80.3 | 132 |
| 1M | 78.5 | 75.2 |
| 2M | 65.7 | 47.1 |
>
> --
> Cheers,
>
> David / dhildenb
>
* Re: [REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
2025-05-02 13:10 ` Daniel Gomez
@ 2025-05-02 15:31 ` David Hildenbrand
2025-05-06 3:33 ` Baolin Wang
0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2025-05-02 15:31 UTC (permalink / raw)
To: Daniel Gomez, Baolin Wang
Cc: Ville Syrjälä, akpm, hughd, willy, wangkefeng.wang,
21cnbao, ryan.roberts, ioworker0, da.gomez, linux-mm,
linux-kernel, regressions, intel-gfx, Eero Tamminen
On 02.05.25 15:10, Daniel Gomez wrote:
> On Fri, May 02, 2025 at 09:18:41AM +0100, David Hildenbrand wrote:
>> On 02.05.25 03:02, Baolin Wang wrote:
>>>
>>>
>>> On 2025/4/30 21:24, Daniel Gomez wrote:
>>>> On Wed, Apr 30, 2025 at 02:20:02PM +0100, Ville Syrjälä wrote:
>>>>> On Wed, Apr 30, 2025 at 02:32:39PM +0800, Baolin Wang wrote:
>>>>>> On 2025/4/30 01:44, Ville Syrjälä wrote:
>>>>>>> On Thu, Nov 28, 2024 at 03:40:41PM +0800, Baolin Wang wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> This causes a huge regression in Intel iGPU texturing performance.
>>>>>>
>>>>>> Unfortunately, I don't have such platform to test it.
>>>>>>
>>>>>>>
>>>>>>> I haven't had time to look at this in detail, but presumably the
>>>>>>> problem is that we're no longer getting huge pages from our
>>>>>>> private tmpfs mount (done in i915_gemfs_init()).
>>>>>>
>>>>>> IIUC, the i915 driver still limits the maximum write size to PAGE_SIZE
>>>>>> in the shmem_pwrite(),
>>>>>
>>>>> pwrite is just one random way to write to objects, and probably
>>>>> not something that's even used by current Mesa.
>>>>>
>>>>>> which prevents tmpfs from allocating large
>>>>>> folios. As mentioned in the comments below, tmpfs like other file
>>>>>> systems that support large folios, will allow getting a highest order
>>>>>> hint based on the size of the write and fallocate paths, and then will
>>>>>> attempt each allowable huge order.
>>>>>>
>>>>>> Therefore, I think the shmem_pwrite() function should be changed to
>>>>>> remove the limitation that the write size cannot exceed PAGE_SIZE.
>>>>
>>>> To enable mTHP on tmpfs, the necessary knobs must first be enabled in sysfs
>>>> as they are not enabled by default IIRC (only THP, PMD level). Ville, I
>>>> see i915_gemfs the huge=within_size mount option is passed. Can you confirm
>>>> if /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled are also marked as
>>>> 'always' when the regression is found?
>>>
>>> The tmpfs mount will not be controlled by
>>> '/sys/kernel/mm/transparent_hugepage/hugepages-*Kb/enabled' (except for
>>> the debugging options 'deny' and 'force').
>>
>> Right, IIRC as requested by Willy, it should behave like other FSes where
>> there is no control over the folio size to be used.
>
> Thanks for reminding me. I forgot we finally changed it.
>
> Could the performance drop be due to the driver no longer using PMD-level pages?
I suspect that the faulting logic will now go to a smaller order first,
indeed.
... trying to digest shmem_allowable_huge_orders() and
shmem_huge_global_enabled(), I'm having a hard time isolating the
tmpfs case: especially whether we run into the vma vs. !vma case here.

Without a VMA, I think we should have "tmpfs will allow getting a highest
order hint based on the size of write and fallocate paths, then will try
each allowable order".

With a VMA (no access hint), "we still use PMD-sized order to allocate
huge pages due to lack of a write size hint".

So if we get a fallocate()/write() that is, say, 1 MiB, we'd now
allocate a 1 MiB folio instead of a 2 MiB one.
--
Cheers,
David / dhildenb
* Re: [REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
2025-05-02 15:31 ` David Hildenbrand
@ 2025-05-06 3:33 ` Baolin Wang
2025-05-06 14:36 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: Baolin Wang @ 2025-05-06 3:33 UTC (permalink / raw)
To: David Hildenbrand, Daniel Gomez
Cc: Ville Syrjälä, akpm, hughd, willy, wangkefeng.wang,
21cnbao, ryan.roberts, ioworker0, da.gomez, linux-mm,
linux-kernel, regressions, intel-gfx, Eero Tamminen
On 2025/5/2 23:31, David Hildenbrand wrote:
> On 02.05.25 15:10, Daniel Gomez wrote:
>> On Fri, May 02, 2025 at 09:18:41AM +0100, David Hildenbrand wrote:
>>> On 02.05.25 03:02, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2025/4/30 21:24, Daniel Gomez wrote:
>>>>> On Wed, Apr 30, 2025 at 02:20:02PM +0100, Ville Syrjälä wrote:
>>>>>> On Wed, Apr 30, 2025 at 02:32:39PM +0800, Baolin Wang wrote:
>>>>>>> On 2025/4/30 01:44, Ville Syrjälä wrote:
>>>>>>>> On Thu, Nov 28, 2024 at 03:40:41PM +0800, Baolin Wang wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> This causes a huge regression in Intel iGPU texturing performance.
>>>>>>>
>>>>>>> Unfortunately, I don't have such platform to test it.
>>>>>>>
>>>>>>>>
>>>>>>>> I haven't had time to look at this in detail, but presumably the
>>>>>>>> problem is that we're no longer getting huge pages from our
>>>>>>>> private tmpfs mount (done in i915_gemfs_init()).
>>>>>>>
>>>>>>> IIUC, the i915 driver still limits the maximum write size to
>>>>>>> PAGE_SIZE
>>>>>>> in the shmem_pwrite(),
>>>>>>
>>>>>> pwrite is just one random way to write to objects, and probably
>>>>>> not something that's even used by current Mesa.
>>>>>>
>>>>>>> which prevents tmpfs from allocating large
>>>>>>> folios. As mentioned in the comments below, tmpfs like other file
>>>>>>> systems that support large folios, will allow getting a highest
>>>>>>> order
>>>>>>> hint based on the size of the write and fallocate paths, and then
>>>>>>> will
>>>>>>> attempt each allowable huge order.
>>>>>>>
>>>>>>> Therefore, I think the shmem_pwrite() function should be changed to
>>>>>>> remove the limitation that the write size cannot exceed PAGE_SIZE.
>>>>>
>>>>> To enable mTHP on tmpfs, the necessary knobs must first be enabled
>>>>> in sysfs
>>>>> as they are not enabled by default IIRC (only THP, PMD level).
>>>>> Ville, I
>>>>> see i915_gemfs the huge=within_size mount option is passed. Can you
>>>>> confirm
>>>>> if /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled are also
>>>>> marked as
>>>>> 'always' when the regression is found?
>>>>
>>>> The tmpfs mount will not be controlled by
>>>> '/sys/kernel/mm/transparent_hugepage/hugepages-*Kb/enabled' (except for
>>>> the debugging options 'deny' and 'force').
>>>
>>> Right, IIRC as requested by Willy, it should behave like other FSes
>>> where
>>> there is no control over the folio size to be used.
>>
>> Thanks for reminding me. I forgot we finally changed it.
>>
>> Could the performance drop be due to the driver no longer using
>> PMD-level pages?
>
> I suspect that the faulting logic will now go to a smaller order first,
> indeed.
>
> ... trying to digest shmem_allowable_huge_orders() and
> shmem_huge_global_enabled(), having a hard time trying to isolate the
> tmpfs case: especially, if we run here into the vma vs. !vma case.
>
> Without a VMA, I think we should have "tmpfs will allow getting a
> highest order hint based on the size of the write and fallocate paths,
> then will try each allowable order".
>
> With a VMA (no access hint), "we still use PMD-sized order to locate
> huge pages due to lack of a write size hint."
>
> So if we get a fallocate()/write() that is, say, 1 MiB, we'd now
> allocate a 1 MiB folio instead of a 2 MiB one.
Right.

So I asked Ville how the shmem folios are allocated in the i915 driver,
to see if we can make some improvements.
* Re: [REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
2025-05-06 3:33 ` Baolin Wang
@ 2025-05-06 14:36 ` David Hildenbrand
0 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-05-06 14:36 UTC (permalink / raw)
To: Baolin Wang, Daniel Gomez
Cc: Ville Syrjälä, akpm, hughd, willy, wangkefeng.wang,
21cnbao, ryan.roberts, ioworker0, da.gomez, linux-mm,
linux-kernel, regressions, intel-gfx, Eero Tamminen
On 06.05.25 05:33, Baolin Wang wrote:
>
>
> On 2025/5/2 23:31, David Hildenbrand wrote:
>> On 02.05.25 15:10, Daniel Gomez wrote:
>>> On Fri, May 02, 2025 at 09:18:41AM +0100, David Hildenbrand wrote:
>>>> On 02.05.25 03:02, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2025/4/30 21:24, Daniel Gomez wrote:
>>>>>> On Wed, Apr 30, 2025 at 02:20:02PM +0100, Ville Syrjälä wrote:
>>>>>>> On Wed, Apr 30, 2025 at 02:32:39PM +0800, Baolin Wang wrote:
>>>>>>>> On 2025/4/30 01:44, Ville Syrjälä wrote:
>>>>>>>>> On Thu, Nov 28, 2024 at 03:40:41PM +0800, Baolin Wang wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> This causes a huge regression in Intel iGPU texturing performance.
>>>>>>>>
>>>>>>>> Unfortunately, I don't have such a platform to test it.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I haven't had time to look at this in detail, but presumably the
>>>>>>>>> problem is that we're no longer getting huge pages from our
>>>>>>>>> private tmpfs mount (done in i915_gemfs_init()).
>>>>>>>>
>>>>>>>> IIUC, the i915 driver still limits the maximum write size to
>>>>>>>> PAGE_SIZE in shmem_pwrite(),
>>>>>>>
>>>>>>> pwrite is just one random way to write to objects, and probably
>>>>>>> not something that's even used by current Mesa.
>>>>>>>
>>>>>>>> which prevents tmpfs from allocating large folios. As mentioned
>>>>>>>> in the comments below, tmpfs, like other file systems that
>>>>>>>> support large folios, will take a highest-order hint based on the
>>>>>>>> size of the write and fallocate paths, and then attempt each
>>>>>>>> allowable huge order.
>>>>>>>>
>>>>>>>> Therefore, I think the shmem_pwrite() function should be changed to
>>>>>>>> remove the limitation that the write size cannot exceed PAGE_SIZE.
>>>>>>
>>>>>> To enable mTHP on tmpfs, the necessary knobs must first be enabled
>>>>>> in sysfs, as they are not enabled by default IIRC (only THP at the
>>>>>> PMD level is). Ville, I see the huge=within_size mount option is
>>>>>> passed to the i915_gemfs mount. Can you confirm whether
>>>>>> /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled is also set
>>>>>> to 'always' when the regression is seen?
>>>>>
>>>>> The tmpfs mount will not be controlled by
>>>>> '/sys/kernel/mm/transparent_hugepage/hugepages-*Kb/enabled' (except for
>>>>> the debugging options 'deny' and 'force').
>>>>
>>>> Right, IIRC as requested by Willy, it should behave like other FSes
>>>> where
>>>> there is no control over the folio size to be used.
>>>
>>> Thanks for reminding me. I forgot we finally changed it.
>>>
>>> Could the performance drop be due to the driver no longer using
>>> PMD-level pages?
>>
>> I suspect that the faulting logic will now go to a smaller order first,
>> indeed.
>>
>> ... trying to digest shmem_allowable_huge_orders() and
>> shmem_huge_global_enabled(), having a hard time trying to isolate the
>> tmpfs case: especially, if we run here into the vma vs. !vma case.
>>
>> Without a VMA, I think we should have "tmpfs will allow getting a
>> highest order hint based on the size of the write and fallocate paths,
>> then will try each allowable order".
>>
>> With a VMA (no access hint), "we still use PMD-sized order to locate
>> huge pages due to lack of a write size hint."
>>
>> So if we get a fallocate()/write() that is, say, 1 MiB, we'd now
>> allocate a 1 MiB folio instead of a 2 MiB one.
>
> Right.
>
> So I asked Ville how the shmem folios are allocated in the i915 driver,
> to see if we can make some improvements.
Preallocation (using fallocate) might be reasonable for their use case,
if they know they will consume all that memory either way. If it's
sparse, it's more problematic.
--
Cheers,
David / dhildenb
end of thread, other threads: [~2025-05-06 14:36 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-28 7:40 [PATCH v3 0/6] Support large folios for tmpfs Baolin Wang
2024-11-28 7:40 ` [PATCH v3 1/6] mm: factor out the order calculation into a new helper Baolin Wang
2024-11-28 7:40 ` [PATCH v3 2/6] mm: shmem: change shmem_huge_global_enabled() to return huge order bitmap Baolin Wang
2024-11-28 7:40 ` [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs Baolin Wang
2025-04-29 17:44 ` [REGRESSION] " Ville Syrjälä
2025-04-30 6:32 ` Baolin Wang
2025-04-30 11:20 ` Ville Syrjälä
2025-04-30 13:24 ` Daniel Gomez
2025-05-02 1:02 ` Baolin Wang
2025-05-02 7:18 ` David Hildenbrand
2025-05-02 13:10 ` Daniel Gomez
2025-05-02 15:31 ` David Hildenbrand
2025-05-06 3:33 ` Baolin Wang
2025-05-06 14:36 ` David Hildenbrand
2024-11-28 7:40 ` [PATCH v3 4/6] mm: shmem: add a kernel command line to change the default huge policy " Baolin Wang
2024-11-28 7:40 ` [PATCH v3 5/6] docs: tmpfs: update the large folios policy for tmpfs and shmem Baolin Wang
2024-11-28 7:40 ` [PATCH v3 6/6] docs: tmpfs: drop 'fadvise()' from the documentation Baolin Wang