linux-mm.kvack.org archive mirror
* [RFC PATCH v1 0/4] Control folio sizes used for page cache memory
@ 2024-07-17  7:12 Ryan Roberts
  2024-07-17  7:12 ` [RFC PATCH v1 1/4] mm: mTHP user controls to configure pagecache large folio sizes Ryan Roberts
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Ryan Roberts @ 2024-07-17  7:12 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), David Hildenbrand, Barry Song,
	Lance Yang, Baolin Wang, Gavin Shan, Pankaj Raghav, Daniel Gomez
  Cc: Ryan Roberts, linux-kernel, linux-mm

Hi All,

This series is an RFC that adds sysfs and kernel cmdline controls to configure
the set of allowed large folio sizes that can be used when allocating
file-memory for the page cache. As part of the control mechanism, it provides
for a special-case "preferred folio size for executable mappings" marker.

I'm trying to solve 2 separate problems with this series:

1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
approach for the change at [1]. Instead of hardcoding the preferred executable
folio size into the arch, user space can now select it. This decouples the arch
code and also makes the mechanism more generic; it can be bypassed (the default)
or any folio size can be set. For my use case, 64K is preferred, but I've also
heard from Willy of a use case where putting all text into 2M PMD-sized folios
is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
therefore faulting in all text ahead of time) to achieve that.

2. Reduce memory fragmentation in systems under high memory pressure (e.g.
Android): The theory goes that if all folios are 64K, then failure to allocate a
64K folio should become unlikely. But if the page cache is allocating lots of
different orders, with most allocations smaller than 64K (as is the
case today), then the ability to allocate 64K folios diminishes. By providing control
over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
allocation failure. Additionally, I've heard (second-hand) of the need to disable
large folios in the page cache entirely due to latency concerns in some
settings. These controls allow all of this without kernel changes.

The value of (1) is clear and the performance improvements are documented in
patch 2. I don't yet have any data demonstrating the theory for (2) since I
can't reproduce the setup that Barry had at [2]. But my view is that by adding
these controls we will enable the community to explore further, in the same way
that the anon mTHP controls helped harden the understanding for anonymous
memory.

---
This series depends on the "mTHP allocation stats for file-backed memory" series
at [3], which itself applies on top of yesterday's mm-unstable (650b6752c8a3). All
mm selftests have been run; no regressions were observed.

[1] https://lore.kernel.org/linux-mm/20240215154059.2863126-1-ryan.roberts@arm.com/
[2] https://www.youtube.com/watch?v=ht7eGWqwmNs&list=PLbzoR-pLrL6oj1rVTXLnV7cOuetvjKn9q&index=4
[3] https://lore.kernel.org/linux-mm/20240716135907.4047689-1-ryan.roberts@arm.com/

Thanks,
Ryan

Ryan Roberts (4):
  mm: mTHP user controls to configure pagecache large folio sizes
  mm: Introduce "always+exec" for mTHP file_enabled control
  mm: Override mTHP "enabled" defaults at kernel cmdline
  mm: Override mTHP "file_enabled" defaults at kernel cmdline

 .../admin-guide/kernel-parameters.txt         |  16 ++
 Documentation/admin-guide/mm/transhuge.rst    |  66 +++++++-
 include/linux/huge_mm.h                       |  61 ++++---
 mm/filemap.c                                  |  26 ++-
 mm/huge_memory.c                              | 158 +++++++++++++++++-
 mm/readahead.c                                |  43 ++++-
 6 files changed, 329 insertions(+), 41 deletions(-)

--
2.43.0




* [RFC PATCH v1 1/4] mm: mTHP user controls to configure pagecache large folio sizes
  2024-07-17  7:12 [RFC PATCH v1 0/4] Control folio sizes used for page cache memory Ryan Roberts
@ 2024-07-17  7:12 ` Ryan Roberts
  2024-07-17  7:12 ` [RFC PATCH v1 2/4] mm: Introduce "always+exec" for mTHP file_enabled control Ryan Roberts
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Ryan Roberts @ 2024-07-17  7:12 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), David Hildenbrand, Barry Song,
	Lance Yang, Baolin Wang, Gavin Shan, Pankaj Raghav, Daniel Gomez
  Cc: Ryan Roberts, linux-kernel, linux-mm

Add mTHP controls to sysfs to allow user space to configure the folio
sizes that can be considered for allocation of file-backed memory:

  /sys/kernel/mm/transparent_hugepage/hugepages-*kB/file_enabled

For now, the control can be set to either `always` or `never` to enable
or disable that size. More options may be added in future.

By default, at boot, all folio sizes are enabled, and the algorithm used
to select a folio size remains conceptually unchanged: increase by 2
enabled orders each time a readahead marker is hit, then reduce to the
closest enabled order that fits within the bounds of the readahead size,
index alignment and EOF. So when all folio sizes are enabled, behavior
should be unchanged. When folio sizes are disabled, the algorithm will
never select them.

Systems such as Android are always under extreme memory pressure and as
a result fragmentation often causes attempts to allocate large folios to
fail and fall back to smaller folios. By fixing the pagecache to one
large folio size (e.g. 64K) plus fallback to small folios, a large
source of this fragmentation can be removed and 64K mTHP allocations
succeed more often, allowing the system to benefit from improved
performance on arm64 and other arches that support "contpte".

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 21 +++++++++
 include/linux/huge_mm.h                    | 50 +++++++++++++---------
 mm/filemap.c                               | 15 ++++---
 mm/huge_memory.c                           | 43 +++++++++++++++++++
 mm/readahead.c                             | 43 +++++++++++++++----
 5 files changed, 138 insertions(+), 34 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index d4857e457add..9f3ed504c646 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -284,6 +284,27 @@ that THP is shared. Exceeding the number would block the collapse::
 
 A higher value may increase memory footprint for some workloads.
 
+File-Backed Hugepages
+---------------------
+
+The kernel will automatically select an appropriate THP size for file-backed
+memory from a set of allowed sizes. By default all THP sizes that the page cache
+supports are allowed, but this set can be modified with one of::
+
+	echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/file_enabled
+	echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/file_enabled
+
+where <size> is the hugepage size being addressed, the available sizes for which
+vary by system. ``always`` adds the hugepage size to the set of allowed sizes,
+and ``never`` removes the hugepage size from the set of allowed sizes.
+
+In some situations, constraining the allowed sizes can reduce memory
+fragmentation, resulting in fewer allocation fallbacks and improved system
+performance.
+
+Note that any changes to the allowed set of sizes only apply to future
+file-backed THP allocations.
+
 Boot parameter
 ==============
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4f9109fcdded..19ced8192d39 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -114,6 +114,24 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 #define HPAGE_PUD_MASK	(~(HPAGE_PUD_SIZE - 1))
 #define HPAGE_PUD_SIZE	((1UL) << HPAGE_PUD_SHIFT)
 
+static inline int lowest_order(unsigned long orders)
+{
+	if (orders)
+		return __ffs(orders);
+	return -1;
+}
+
+static inline int highest_order(unsigned long orders)
+{
+	return fls_long(orders) - 1;
+}
+
+static inline int next_order(unsigned long *orders, int prev)
+{
+	*orders &= ~BIT(prev);
+	return highest_order(*orders);
+}
+
 enum mthp_stat_item {
 	MTHP_STAT_ANON_FAULT_ALLOC,
 	MTHP_STAT_ANON_FAULT_FALLBACK,
@@ -158,6 +176,12 @@ extern unsigned long transparent_hugepage_flags;
 extern unsigned long huge_anon_orders_always;
 extern unsigned long huge_anon_orders_madvise;
 extern unsigned long huge_anon_orders_inherit;
+extern unsigned long huge_file_orders_always;
+
+static inline unsigned long file_orders_always(void)
+{
+	return READ_ONCE(huge_file_orders_always);
+}
 
 static inline bool hugepage_global_enabled(void)
 {
@@ -172,17 +196,6 @@ static inline bool hugepage_global_always(void)
 			(1<<TRANSPARENT_HUGEPAGE_FLAG);
 }
 
-static inline int highest_order(unsigned long orders)
-{
-	return fls_long(orders) - 1;
-}
-
-static inline int next_order(unsigned long *orders, int prev)
-{
-	*orders &= ~BIT(prev);
-	return highest_order(*orders);
-}
-
 /*
  * Do the below checks:
  *   - For file vma, check if the linear page offset of vma is
@@ -435,6 +448,11 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
 
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline unsigned long file_orders_always(void)
+{
+	return 0;
+}
+
 static inline bool folio_test_pmd_mappable(struct folio *folio)
 {
 	return false;
@@ -578,16 +596,6 @@ static inline bool thp_migration_supported(void)
 {
 	return false;
 }
-
-static inline int highest_order(unsigned long orders)
-{
-	return 0;
-}
-
-static inline int next_order(unsigned long *orders, int prev)
-{
-	return 0;
-}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static inline int split_folio_to_list_to_order(struct folio *folio,
diff --git a/mm/filemap.c b/mm/filemap.c
index 131d514fca29..870016fcfdde 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1922,6 +1922,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 no_page:
 	if (!folio && (fgp_flags & FGP_CREAT)) {
 		unsigned order = FGF_GET_ORDER(fgp_flags);
+		unsigned long orders;
 		int err;
 
 		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
@@ -1937,13 +1938,15 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 
 		if (!mapping_large_folio_support(mapping))
 			order = 0;
-		if (order > MAX_PAGECACHE_ORDER)
-			order = MAX_PAGECACHE_ORDER;
+
+		orders = file_orders_always() | BIT(0);
+		orders &= BIT(order + 1) - 1;
 		/* If we're not aligned, allocate a smaller folio */
 		if (index & ((1UL << order) - 1))
-			order = __ffs(index);
+			orders &= BIT(__ffs(index) + 1) - 1;
+		order = highest_order(orders);
 
-		do {
+		while (orders) {
 			gfp_t alloc_gfp = gfp;
 
 			err = -ENOMEM;
@@ -1962,7 +1965,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 				break;
 			folio_put(folio);
 			folio = NULL;
-		} while (order-- > 0);
+
+			order = next_order(&orders, order);
+		}
 
 		if (err == -EEXIST)
 			goto repeat;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 26d558e3e80f..e8fe28fe9cf9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -80,6 +80,7 @@ unsigned long huge_zero_pfn __read_mostly = ~0UL;
 unsigned long huge_anon_orders_always __read_mostly;
 unsigned long huge_anon_orders_madvise __read_mostly;
 unsigned long huge_anon_orders_inherit __read_mostly;
+unsigned long huge_file_orders_always __read_mostly;
 
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 unsigned long vm_flags,
@@ -525,6 +526,37 @@ static ssize_t anon_enabled_store(struct kobject *kobj,
 	return ret;
 }
 
+static ssize_t file_enabled_show(struct kobject *kobj,
+				 struct kobj_attribute *attr, char *buf)
+{
+	int order = to_thpsize(kobj)->order;
+	const char *output;
+
+	if (test_bit(order, &huge_file_orders_always))
+		output = "[always] never";
+	else
+		output = "always [never]";
+
+	return sysfs_emit(buf, "%s\n", output);
+}
+
+static ssize_t file_enabled_store(struct kobject *kobj,
+				  struct kobj_attribute *attr,
+				  const char *buf, size_t count)
+{
+	int order = to_thpsize(kobj)->order;
+	ssize_t ret = count;
+
+	if (sysfs_streq(buf, "always"))
+		set_bit(order, &huge_file_orders_always);
+	else if (sysfs_streq(buf, "never"))
+		clear_bit(order, &huge_file_orders_always);
+	else
+		ret = -EINVAL;
+
+	return ret;
+}
+
 static struct kobj_attribute anon_enabled_attr =
 	__ATTR(enabled, 0644, anon_enabled_show, anon_enabled_store);
 
@@ -537,7 +569,11 @@ static const struct attribute_group anon_ctrl_attr_grp = {
 	.attrs = anon_ctrl_attrs,
 };
 
+static struct kobj_attribute file_enabled_attr =
+	__ATTR(file_enabled, 0644, file_enabled_show, file_enabled_store);
+
 static struct attribute *file_ctrl_attrs[] = {
+	&file_enabled_attr.attr,
 #ifdef CONFIG_SHMEM
 	&thpsize_shmem_enabled_attr.attr,
 #endif
@@ -712,6 +748,13 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
 	 */
 	huge_anon_orders_inherit = BIT(PMD_ORDER);
 
+	/*
+	 * For pagecache, default to enabling all orders. powerpc's PMD_ORDER
+	 * (and therefore THP_ORDERS_ALL_FILE_DEFAULT) isn't a compile-time
+	 * constant so we have to do this here.
+	 */
+	huge_file_orders_always = THP_ORDERS_ALL_FILE_DEFAULT;
+
 	*hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
 	if (unlikely(!*hugepage_kobj)) {
 		pr_err("failed to create transparent hugepage kobject\n");
diff --git a/mm/readahead.c b/mm/readahead.c
index 517c0be7ce66..e05f85974396 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -432,6 +432,34 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
 	return 0;
 }
 
+static int select_new_order(int old_order, int max_order, unsigned long orders)
+{
+	unsigned long hi_orders, lo_orders;
+
+	/*
+	 * Select the next order to use from the set in `orders`, while ensuring
+	 * we don't go above max_order. Prefer the next + 1 highest allowed
+	 * order after old_order, unless there isn't one, in which case return
+	 * the closest allowed order, which is either the next highest allowed
+	 * order or less than or equal to old_order. The "next + 1" skip
+	 * behaviour is intended to allow ramping up to large folios quickly.
+	 */
+
+	orders &= BIT(max_order + 1) - 1;
+	VM_WARN_ON(!orders);
+	hi_orders = orders & ~(BIT(old_order + 1) - 1);
+
+	if (hi_orders) {
+		old_order = lowest_order(hi_orders);
+		hi_orders &= ~BIT(old_order);
+		if (hi_orders)
+			return lowest_order(hi_orders);
+	}
+
+	lo_orders = orders & (BIT(old_order + 1) - 1);
+	return highest_order(lo_orders);
+}
+
 void page_cache_ra_order(struct readahead_control *ractl,
 		struct file_ra_state *ra, unsigned int new_order)
 {
@@ -443,17 +471,15 @@ void page_cache_ra_order(struct readahead_control *ractl,
 	unsigned int nofs;
 	int err = 0;
 	gfp_t gfp = readahead_gfp_mask(mapping);
+	unsigned long orders;
 
-	if (!mapping_large_folio_support(mapping) || ra->size < 4)
+	if (!mapping_large_folio_support(mapping))
 		goto fallback;
 
 	limit = min(limit, index + ra->size - 1);
 
-	if (new_order < MAX_PAGECACHE_ORDER)
-		new_order += 2;
-
-	new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
-	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
+	orders = file_orders_always() | BIT(0);
+	new_order = select_new_order(new_order, ilog2(ra->size), orders);
 
 	/* See comment in page_cache_ra_unbounded() */
 	nofs = memalloc_nofs_save();
@@ -463,9 +489,10 @@ void page_cache_ra_order(struct readahead_control *ractl,
 
 		/* Align with smaller pages if needed */
 		if (index & ((1UL << order) - 1))
-			order = __ffs(index);
+			order = select_new_order(order, __ffs(index), orders);
 		/* Don't allocate pages past EOF */
-		while (index + (1UL << order) - 1 > limit)
+		while (index + (1UL << order) - 1 > limit &&
+				(BIT(order) & orders) == 0)
 			order--;
 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
 		if (err)
-- 
2.43.0




* [RFC PATCH v1 2/4] mm: Introduce "always+exec" for mTHP file_enabled control
  2024-07-17  7:12 [RFC PATCH v1 0/4] Control folio sizes used for page cache memory Ryan Roberts
  2024-07-17  7:12 ` [RFC PATCH v1 1/4] mm: mTHP user controls to configure pagecache large folio sizes Ryan Roberts
@ 2024-07-17  7:12 ` Ryan Roberts
  2024-07-17 17:10   ` Ryan Roberts
  2024-07-17  7:12 ` [RFC PATCH v1 3/4] mm: Override mTHP "enabled" defaults at kernel cmdline Ryan Roberts
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Ryan Roberts @ 2024-07-17  7:12 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), David Hildenbrand, Barry Song,
	Lance Yang, Baolin Wang, Gavin Shan, Pankaj Raghav, Daniel Gomez
  Cc: Ryan Roberts, linux-kernel, linux-mm

In addition to `always` and `never`, add `always+exec` as an option for:

  /sys/kernel/mm/transparent_hugepage/hugepages-*kB/file_enabled

`always+exec` acts like `always` but additionally marks the hugepage
size as the preferred hugepage size for sections of any file mapped with
execute permission. A maximum of one hugepage size can be marked as
`exec` at a time, so applying it to a new size implicitly removes it
from any size it was previously set for.

Change readahead to use this flagged exec size; when a request is made
for an executable mapping, do a synchronous read of the size in a
naturally aligned manner.

On arm64 if memory is physically contiguous and naturally aligned to the
"contpte" size, we can use contpte mappings, which improves utilization
of the TLB. When paired with the "multi-size THP" changes, this works
well to reduce dTLB pressure. However, iTLB pressure is still high due to
executable mappings having a low likelihood of being in the required
folio size and mapping alignment, even when the filesystem supports
readahead into large folios (e.g. XFS).

The reason for the low likelihood is that the current readahead algorithm
starts with an order-2 folio and increases the folio order by 2 every
time the readahead mark is hit. But most executable memory is faulted in
fairly randomly and so the readahead mark is rarely hit and most
executable folios remain order-2. This was observed empirically and
confirmed in discussion with a GNU linker expert; in general, the
linker does nothing to group temporally accessed text together
spatially. Additionally, with the current read-around approach there are
no alignment guarantees between the file and folio. This is
insufficient for arm64's contpte mapping requirement (order-4 for 4K
base pages).

So it seems reasonable to special-case the read(ahead) logic for
executable mappings. The trade-off is performance improvement (due to
more efficient storage of the translations in iTLB) vs potential read
amplification (due to reading too much data around the fault which won't
be used), and the latter is independent of base page size.

Of course if no hugepage size is marked as `always+exec` the old
behaviour is maintained.

Performance Benchmarking
------------------------

The below shows kernel compilation and speedometer javascript benchmarks
on Ampere Altra arm64 system. When the patch is applied, `always+exec`
is set for 64K folios.

First, confirmation that this patch causes more memory to be contained
in 64K folios (this is for all file-backed memory so includes
non-executable too):

| File-backed folios      |   Speedometer   |  Kernel Compile |
| by size as percentage   |-----------------|-----------------|
| of all mapped file mem  | before |  after | before |  after |
|=========================|========|========|========|========|
|file-thp-aligned-16kB    |    45% |     9% |    46% |     7% |
|file-thp-aligned-32kB    |     2% |     0% |     3% |     1% |
|file-thp-aligned-64kB    |     3% |    63% |     5% |    80% |
|file-thp-aligned-128kB   |    11% |    11% |     0% |     0% |
|file-thp-unaligned-16kB  |     1% |     0% |     3% |     1% |
|file-thp-unaligned-128kB |     1% |     0% |     0% |     0% |
|file-thp-partial         |     0% |     0% |     0% |     0% |
|-------------------------|--------|--------|--------|--------|
|file-cont-aligned-64kB   |    16% |    75% |     5% |    80% |

The above shows that for both use cases, the amount of file memory
backed by 16K folios reduces and the amount backed by 64K folios
increases significantly. And the amount of memory that is contpte-mapped
significantly increases (last line).

And this is reflected in performance improvement:

Kernel Compilation (smaller is faster):
| kernel   |   real-time |   kern-time |   user-time |   peak memory |
|----------|-------------|-------------|-------------|---------------|
| before   |        0.0% |        0.0% |        0.0% |          0.0% |
| after    |       -1.6% |       -2.1% |       -1.7% |          0.0% |

Speedometer (bigger is faster):
| kernel   |   runs_per_min |   peak memory |
|----------|----------------|---------------|
| before   |           0.0% |          0.0% |
| after    |           1.3% |          1.0% |

Both benchmarks show a ~1.5% improvement once the patch is applied.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 Documentation/admin-guide/mm/transhuge.rst |  6 +++++
 include/linux/huge_mm.h                    | 11 ++++++++
 mm/filemap.c                               | 11 ++++++++
 mm/huge_memory.c                           | 31 +++++++++++++++++-----
 4 files changed, 52 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 9f3ed504c646..1aaf8e3a0b5a 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -292,12 +292,18 @@ memory from a set of allowed sizes. By default all THP sizes that the page cache
 supports are allowed, but this set can be modified with one of::

 	echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/file_enabled
+	echo always+exec >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/file_enabled
 	echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/file_enabled

 where <size> is the hugepage size being addressed, the available sizes for which
 vary by system. ``always`` adds the hugepage size to the set of allowed sizes,
 and ``never`` removes the hugepage size from the set of allowed sizes.

+``always+exec`` acts like ``always`` but additionally marks the hugepage size as
+the preferred hugepage size for sections of any file mapped executable. A
+maximum of one hugepage size can be marked as ``exec`` at a time, so applying it
+to a new size implicitly removes it from any size it was previously set for.
+
 In some situations, constraining the allowed sizes can reduce memory
 fragmentation, resulting in fewer allocation fallbacks and improved system
 performance.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 19ced8192d39..3571ea0c3d8c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -177,12 +177,18 @@ extern unsigned long huge_anon_orders_always;
 extern unsigned long huge_anon_orders_madvise;
 extern unsigned long huge_anon_orders_inherit;
 extern unsigned long huge_file_orders_always;
+extern int huge_file_exec_order;

 static inline unsigned long file_orders_always(void)
 {
 	return READ_ONCE(huge_file_orders_always);
 }

+static inline int file_exec_order(void)
+{
+	return READ_ONCE(huge_file_exec_order);
+}
+
 static inline bool hugepage_global_enabled(void)
 {
 	return transparent_hugepage_flags &
@@ -453,6 +459,11 @@ static inline unsigned long file_orders_always(void)
 	return 0;
 }

+static inline int file_exec_order(void)
+{
+	return -1;
+}
+
 static inline bool folio_test_pmd_mappable(struct folio *folio)
 {
 	return false;
diff --git a/mm/filemap.c b/mm/filemap.c
index 870016fcfdde..c4a3cc6a2e46 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3128,6 +3128,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	struct file *fpin = NULL;
 	unsigned long vm_flags = vmf->vma->vm_flags;
 	unsigned int mmap_miss;
+	int exec_order = file_exec_order();

 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	/* Use the readahead code, even if readahead is disabled */
@@ -3147,6 +3148,16 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	}
 #endif

+	/* If explicit order is set for exec mappings, use it. */
+	if ((vm_flags & VM_EXEC) && exec_order >= 0) {
+		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
+		ra->size = 1UL << exec_order;
+		ra->async_size = 0;
+		ractl._index &= ~((unsigned long)ra->size - 1);
+		page_cache_ra_order(&ractl, ra, exec_order);
+		return fpin;
+	}
+
 	/* If we don't want any read-ahead, don't bother */
 	if (vm_flags & VM_RAND_READ)
 		return fpin;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e8fe28fe9cf9..4249c0bc9388 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -81,6 +81,7 @@ unsigned long huge_anon_orders_always __read_mostly;
 unsigned long huge_anon_orders_madvise __read_mostly;
 unsigned long huge_anon_orders_inherit __read_mostly;
 unsigned long huge_file_orders_always __read_mostly;
+int huge_file_exec_order __read_mostly = -1;

 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 unsigned long vm_flags,
@@ -462,6 +463,7 @@ static const struct attribute_group hugepage_attr_group = {
 static void hugepage_exit_sysfs(struct kobject *hugepage_kobj);
 static void thpsize_release(struct kobject *kobj);
 static DEFINE_SPINLOCK(huge_anon_orders_lock);
+static DEFINE_SPINLOCK(huge_file_orders_lock);
 static LIST_HEAD(thpsize_list);

 static ssize_t anon_enabled_show(struct kobject *kobj,
@@ -531,11 +533,15 @@ static ssize_t file_enabled_show(struct kobject *kobj,
 {
 	int order = to_thpsize(kobj)->order;
 	const char *output;
+	bool exec;

-	if (test_bit(order, &huge_file_orders_always))
-		output = "[always] never";
-	else
-		output = "always [never]";
+	if (test_bit(order, &huge_file_orders_always)) {
+		exec = READ_ONCE(huge_file_exec_order) == order;
+		output = exec ? "always [always+exec] never" :
+				"[always] always+exec never";
+	} else {
+		output = "always always+exec [never]";
+	}

 	return sysfs_emit(buf, "%s\n", output);
 }
@@ -547,13 +553,24 @@ static ssize_t file_enabled_store(struct kobject *kobj,
 	int order = to_thpsize(kobj)->order;
 	ssize_t ret = count;

-	if (sysfs_streq(buf, "always"))
+	spin_lock(&huge_file_orders_lock);
+
+	if (sysfs_streq(buf, "always")) {
 		set_bit(order, &huge_file_orders_always);
-	else if (sysfs_streq(buf, "never"))
+		if (huge_file_exec_order == order)
+			huge_file_exec_order = -1;
+	} else if (sysfs_streq(buf, "always+exec")) {
+		set_bit(order, &huge_file_orders_always);
+		huge_file_exec_order = order;
+	} else if (sysfs_streq(buf, "never")) {
 		clear_bit(order, &huge_file_orders_always);
-	else
+		if (huge_file_exec_order == order)
+			huge_file_exec_order = -1;
+	} else {
 		ret = -EINVAL;
+	}

+	spin_unlock(&huge_file_orders_lock);
 	return ret;
 }

--
2.43.0




* [RFC PATCH v1 3/4] mm: Override mTHP "enabled" defaults at kernel cmdline
  2024-07-17  7:12 [RFC PATCH v1 0/4] Control folio sizes used for page cache memory Ryan Roberts
  2024-07-17  7:12 ` [RFC PATCH v1 1/4] mm: mTHP user controls to configure pagecache large folio sizes Ryan Roberts
  2024-07-17  7:12 ` [RFC PATCH v1 2/4] mm: Introduce "always+exec" for mTHP file_enabled control Ryan Roberts
@ 2024-07-17  7:12 ` Ryan Roberts
  2024-07-19  0:46   ` Barry Song
  2024-07-22  9:13   ` Daniel Gomez
  2024-07-17  7:12 ` [RFC PATCH v1 4/4] mm: Override mTHP "file_enabled" " Ryan Roberts
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 23+ messages in thread
From: Ryan Roberts @ 2024-07-17  7:12 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), David Hildenbrand, Barry Song,
	Lance Yang, Baolin Wang, Gavin Shan, Pankaj Raghav, Daniel Gomez
  Cc: Ryan Roberts, linux-kernel, linux-mm

Add thp_anon= cmdline parameter to allow specifying the default
enablement of each supported anon THP size. The parameter accepts the
following format and can be provided multiple times to configure each
size:

thp_anon=<size>[KMG]:<value>

See Documentation/admin-guide/mm/transhuge.rst for more details.

Configuring the defaults at boot time is useful, allowing early user
space to take advantage of mTHP before it has been configured through
sysfs.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 .../admin-guide/kernel-parameters.txt         |  8 +++
 Documentation/admin-guide/mm/transhuge.rst    | 26 +++++++--
 mm/huge_memory.c                              | 55 ++++++++++++++++++-
 3 files changed, 82 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bc55fb55cd26..48443ad12e3f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6592,6 +6592,14 @@
 			<deci-seconds>: poll all this frequency
 			0: no polling (default)
 
+	thp_anon=	[KNL]
+			Format: <size>[KMG]:always|madvise|never|inherit
+			Can be used to control the default behavior of the
+			system with respect to anonymous transparent hugepages.
+			Can be used multiple times for multiple anon THP sizes.
+			See Documentation/admin-guide/mm/transhuge.rst for more
+			details.
+
 	threadirqs	[KNL,EARLY]
 			Force threading of all interrupt handlers except those
 			marked explicitly IRQF_NO_THREAD.
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 1aaf8e3a0b5a..f53d43d986e2 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -311,13 +311,27 @@ performance.
 Note that any changes to the allowed set of sizes only apply to future
 file-backed THP allocations.
 
-Boot parameter
-==============
+Boot parameters
+===============
 
-You can change the sysfs boot time defaults of Transparent Hugepage
-Support by passing the parameter ``transparent_hugepage=always`` or
-``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
-to the kernel command line.
+You can change the sysfs boot time default for the top-level "enabled"
+control by passing the parameter ``transparent_hugepage=always`` or
+``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the
+kernel command line.
+
+Alternatively, each supported anonymous THP size can be controlled by
+passing ``thp_anon=<size>[KMG]:<state>``, where ``<size>`` is the THP size
+and ``<state>`` is one of ``always``, ``madvise``, ``never`` or
+``inherit``.
+
+For example, the following will set 64K THP to ``always``::
+
+	thp_anon=64K:always
+
+``thp_anon=`` may be specified multiple times to configure all THP sizes as
+required. If ``thp_anon=`` is specified at least once, any anon THP sizes
+not explicitly configured on the command line are implicitly set to
+``never``.
 
 Hugepages in tmpfs/shmem
 ========================
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4249c0bc9388..794d2790d90d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -82,6 +82,7 @@ unsigned long huge_anon_orders_madvise __read_mostly;
 unsigned long huge_anon_orders_inherit __read_mostly;
 unsigned long huge_file_orders_always __read_mostly;
 int huge_file_exec_order __read_mostly = -1;
+static bool anon_orders_configured;
 
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 unsigned long vm_flags,
@@ -763,7 +764,10 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
 	 * disable all other sizes. powerpc's PMD_ORDER isn't a compile-time
 	 * constant so we have to do this here.
 	 */
-	huge_anon_orders_inherit = BIT(PMD_ORDER);
+	if (!anon_orders_configured) {
+		huge_anon_orders_inherit = BIT(PMD_ORDER);
+		anon_orders_configured = true;
+	}
 
 	/*
 	 * For pagecache, default to enabling all orders. powerpc's PMD_ORDER
@@ -955,6 +959,55 @@ static int __init setup_transparent_hugepage(char *str)
 }
 __setup("transparent_hugepage=", setup_transparent_hugepage);
 
+static int __init setup_thp_anon(char *str)
+{
+	unsigned long size;
+	char *state;
+	int order;
+	int ret = 0;
+
+	if (!str)
+		goto out;
+
+	size = (unsigned long)memparse(str, &state);
+	order = ilog2(size >> PAGE_SHIFT);
+	if (*state != ':' || !is_power_of_2(size) || size <= PAGE_SIZE ||
+	    !(BIT(order) & THP_ORDERS_ALL_ANON))
+		goto out;
+
+	state++;
+
+	if (!strcmp(state, "always")) {
+		clear_bit(order, &huge_anon_orders_inherit);
+		clear_bit(order, &huge_anon_orders_madvise);
+		set_bit(order, &huge_anon_orders_always);
+		ret = 1;
+	} else if (!strcmp(state, "inherit")) {
+		clear_bit(order, &huge_anon_orders_always);
+		clear_bit(order, &huge_anon_orders_madvise);
+		set_bit(order, &huge_anon_orders_inherit);
+		ret = 1;
+	} else if (!strcmp(state, "madvise")) {
+		clear_bit(order, &huge_anon_orders_always);
+		clear_bit(order, &huge_anon_orders_inherit);
+		set_bit(order, &huge_anon_orders_madvise);
+		ret = 1;
+	} else if (!strcmp(state, "never")) {
+		clear_bit(order, &huge_anon_orders_always);
+		clear_bit(order, &huge_anon_orders_inherit);
+		clear_bit(order, &huge_anon_orders_madvise);
+		ret = 1;
+	}
+
+	if (ret)
+		anon_orders_configured = true;
+out:
+	if (!ret)
+		pr_warn("thp_anon=%s: cannot parse, ignored\n", str);
+	return ret;
+}
+__setup("thp_anon=", setup_thp_anon);
+
 pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
 	if (likely(vma->vm_flags & VM_WRITE))
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH v1 4/4] mm: Override mTHP "file_enabled" defaults at kernel cmdline
  2024-07-17  7:12 [RFC PATCH v1 0/4] Control folio sizes used for page cache memory Ryan Roberts
                   ` (2 preceding siblings ...)
  2024-07-17  7:12 ` [RFC PATCH v1 3/4] mm: Override mTHP "enabled" defaults at kernel cmdline Ryan Roberts
@ 2024-07-17  7:12 ` Ryan Roberts
  2024-07-17 10:31 ` [RFC PATCH v1 0/4] Control folio sizes used for page cache memory David Hildenbrand
       [not found] ` <480f34d0-a943-40da-9c69-2353fe311cf7@arm.com>
  5 siblings, 0 replies; 23+ messages in thread
From: Ryan Roberts @ 2024-07-17  7:12 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), David Hildenbrand, Barry Song,
	Lance Yang, Baolin Wang, Gavin Shan, Pankaj Raghav, Daniel Gomez
  Cc: Ryan Roberts, linux-kernel, linux-mm

Add thp_file= cmdline parameter to allow specifying the default
enablement of each supported file-backed THP size. The parameter accepts
the following format and can be provided multiple times to configure
each size:

  thp_file=<size>[KMG]:<value>

See Documentation/admin-guide/mm/transhuge.rst for more details.

Configuring the defaults at boot time is often necessary because it's not
always possible to drop active executable pages from the page cache,
especially if they are heavily used, like libc. The command line parameter
allows configuring the values before the first page is installed in the
page cache.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 .../admin-guide/kernel-parameters.txt         |  8 ++++
 Documentation/admin-guide/mm/transhuge.rst    | 13 ++++++
 mm/huge_memory.c                              | 45 ++++++++++++++++++-
 3 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 48443ad12e3f..e3e99def5691 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6600,6 +6600,14 @@
 			See Documentation/admin-guide/mm/transhuge.rst for more
 			details.
 
+	thp_file=	[KNL]
+			Format: <size>[KMG]:always|always+exec|never
+			Can be used to control the default behavior of the
+			system with respect to file-backed transparent hugepages.
+			Can be used multiple times for multiple file-backed THP
+			sizes. See Documentation/admin-guide/mm/transhuge.rst
+			for more details.
+
 	threadirqs	[KNL,EARLY]
 			Force threading of all interrupt handlers except those
 			marked explicitly IRQF_NO_THREAD.
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index f53d43d986e2..2379ed4ad085 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -333,6 +333,19 @@ required. If ``thp_anon=`` is specified at least once, any anon THP sizes
 not explicitly configured on the command line are implicitly set to
 ``never``.
 
+Each supported file-backed THP size can be controlled by passing
+``thp_file=<size>[KMG]:<state>``, where ``<size>`` is the THP size and
+``<state>`` is one of ``always``, ``always+exec`` or ``never``.
+
+For example, the following will set 64K THP to ``always+exec``::
+
+	thp_file=64K:always+exec
+
+``thp_file=`` may be specified multiple times to configure all THP sizes as
+required. If ``thp_file=`` is specified at least once, any file-backed THP
+sizes not explicitly configured on the command line are implicitly set to
+``never``.
+
 Hugepages in tmpfs/shmem
 ========================
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 794d2790d90d..4d963dde7aea 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -83,6 +83,7 @@ unsigned long huge_anon_orders_inherit __read_mostly;
 unsigned long huge_file_orders_always __read_mostly;
 int huge_file_exec_order __read_mostly = -1;
 static bool anon_orders_configured;
+static bool file_orders_configured;
 
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 unsigned long vm_flags,
@@ -774,7 +775,10 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
 	 * (and therefore THP_ORDERS_ALL_FILE_DEFAULT) isn't a compile-time
 	 * constant so we have to do this here.
 	 */
-	huge_file_orders_always = THP_ORDERS_ALL_FILE_DEFAULT;
+	if (!file_orders_configured) {
+		huge_file_orders_always = THP_ORDERS_ALL_FILE_DEFAULT;
+		file_orders_configured = true;
+	}
 
 	*hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
 	if (unlikely(!*hugepage_kobj)) {
@@ -1008,6 +1012,45 @@ static int __init setup_thp_anon(char *str)
 }
 __setup("thp_anon=", setup_thp_anon);
 
+static int __init setup_thp_file(char *str)
+{
+	unsigned long size;
+	char *state;
+	int order;
+	int ret = 0;
+
+	if (!str)
+		goto out;
+
+	size = (unsigned long)memparse(str, &state);
+	order = ilog2(size >> PAGE_SHIFT);
+	if (*state != ':' || !is_power_of_2(size) || size <= PAGE_SIZE ||
+	    !(BIT(order) & THP_ORDERS_ALL_FILE_DEFAULT))
+		goto out;
+
+	state++;
+
+	if (!strcmp(state, "always")) {
+		set_bit(order, &huge_file_orders_always);
+		ret = 1;
+	} else if (!strcmp(state, "always+exec")) {
+		set_bit(order, &huge_file_orders_always);
+		huge_file_exec_order = order;
+		ret = 1;
+	} else if (!strcmp(state, "never")) {
+		clear_bit(order, &huge_file_orders_always);
+		ret = 1;
+	}
+
+	if (ret)
+		file_orders_configured = true;
+out:
+	if (!ret)
+		pr_warn("thp_file=%s: cannot parse, ignored\n", str);
+	return ret;
+}
+__setup("thp_file=", setup_thp_file);
+
 pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
 	if (likely(vma->vm_flags & VM_WRITE))
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 0/4] Control folio sizes used for page cache memory
  2024-07-17  7:12 [RFC PATCH v1 0/4] Control folio sizes used for page cache memory Ryan Roberts
                   ` (3 preceding siblings ...)
  2024-07-17  7:12 ` [RFC PATCH v1 4/4] mm: Override mTHP "file_enabled" " Ryan Roberts
@ 2024-07-17 10:31 ` David Hildenbrand
  2024-07-17 10:45   ` Ryan Roberts
       [not found] ` <480f34d0-a943-40da-9c69-2353fe311cf7@arm.com>
  5 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2024-07-17 10:31 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), Barry Song, Lance Yang, Baolin Wang,
	Gavin Shan, Pankaj Raghav, Daniel Gomez
  Cc: linux-kernel, linux-mm

On 17.07.24 09:12, Ryan Roberts wrote:
> Hi All,
> 
> This series is an RFC that adds sysfs and kernel cmdline controls to configure
> the set of allowed large folio sizes that can be used when allocating
> file-memory for the page cache. As part of the control mechanism, it provides
> for a special-case "preferred folio size for executable mappings" marker.
> 
> I'm trying to solve 2 separate problems with this series:
> 
> 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
> approach for the change at [1]. Instead of hardcoding the preferred executable
> folio size into the arch, user space can now select it. This decouples the arch
> code and also makes the mechanism more generic; it can be bypassed (the default)
> or any folio size can be set. For my use case, 64K is preferred, but I've also
> heard from Willy of a use case where putting all text into 2M PMD-sized folios
> is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
> therefore faulting in all text ahead of time) to achieve that.
> 
> 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
> Android): The theory goes that if all folios are 64K, then failure to allocate a
> 64K folio should become unlikely. But if the page cache is allocating lots of
> different orders, with most allocations having an order below 64K (as is the
> case today) then ability to allocate 64K folios diminishes. By providing control
> over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
> allocation failure. Additionally I've heard (second hand) of the need to disable
> large folios in the page cache entirely due to latency concerns in some
> settings. These controls allow all of this without kernel changes.
> 
> The value of (1) is clear and the performance improvements are documented in
> patch 2. I don't yet have any data demonstrating the theory for (2) since I
> can't reproduce the setup that Barry had at [2]. But my view is that by adding
> these controls we will enable the community to explore further, in the same way
> that the anon mTHP controls helped harden the understanding for anonymous
> memory.
> 
> ---

How would this interact with other requirements we get from the 
filesystem (for example, because of the device) [1].

Assuming a device's filesystem has a min order of X, but we disable 
anything >= X, how would we combine that configuration/information?


[1] 
https://lore.kernel.org/all/20240715094457.452836-2-kernel@pankajraghav.com/T/#u

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 0/4] Control folio sizes used for page cache memory
  2024-07-17 10:31 ` [RFC PATCH v1 0/4] Control folio sizes used for page cache memory David Hildenbrand
@ 2024-07-17 10:45   ` Ryan Roberts
  2024-07-17 14:25     ` David Hildenbrand
  2024-07-22  9:35     ` Daniel Gomez
  0 siblings, 2 replies; 23+ messages in thread
From: Ryan Roberts @ 2024-07-17 10:45 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), Barry Song, Lance Yang, Baolin Wang,
	Gavin Shan, Pankaj Raghav, Daniel Gomez
  Cc: linux-kernel, linux-mm

On 17/07/2024 11:31, David Hildenbrand wrote:
> On 17.07.24 09:12, Ryan Roberts wrote:
>> Hi All,
>>
>> This series is an RFC that adds sysfs and kernel cmdline controls to configure
>> the set of allowed large folio sizes that can be used when allocating
>> file-memory for the page cache. As part of the control mechanism, it provides
>> for a special-case "preferred folio size for executable mappings" marker.
>>
>> I'm trying to solve 2 separate problems with this series:
>>
>> 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
>> approach for the change at [1]. Instead of hardcoding the preferred executable
>> folio size into the arch, user space can now select it. This decouples the arch
>> code and also makes the mechanism more generic; it can be bypassed (the default)
>> or any folio size can be set. For my use case, 64K is preferred, but I've also
>> heard from Willy of a use case where putting all text into 2M PMD-sized folios
>> is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
>> therefore faulting in all text ahead of time) to achieve that.
>>
>> 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
>> Android): The theory goes that if all folios are 64K, then failure to allocate a
>> 64K folio should become unlikely. But if the page cache is allocating lots of
>> different orders, with most allocations having an order below 64K (as is the
>> case today) then ability to allocate 64K folios diminishes. By providing control
>> over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
>> allocation failure. Additionally I've heard (second hand) of the need to disable
>> large folios in the page cache entirely due to latency concerns in some
>> settings. These controls allow all of this without kernel changes.
>>
>> The value of (1) is clear and the performance improvements are documented in
>> patch 2. I don't yet have any data demonstrating the theory for (2) since I
>> can't reproduce the setup that Barry had at [2]. But my view is that by adding
>> these controls we will enable the community to explore further, in the same way
>> that the anon mTHP controls helped harden the understanding for anonymous
>> memory.
>>
>> ---
> 
> How would this interact with other requirements we get from the filesystem (for
> example, because of the device) [1].
> 
> Assuming a device's filesystem has a min order of X, but we disable anything
>>= X, how would we combine that configuration/information?

Currently order-0 is implicitly the "always-on" fallback order. My thinking was
that with [1], the specified min order just becomes that "always-on" fallback order.

Today:

  orders = file_orders_always() | BIT(0);

Tomorrow:

  orders = (file_orders_always() & ~(BIT(min_order) - 1)) | BIT(min_order);

That does mean that in this case, a user-disabled order could still be used. So
the controls are really hints rather than definitive commands.


> 
> 
> [1]
> https://lore.kernel.org/all/20240715094457.452836-2-kernel@pankajraghav.com/T/#u
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 0/4] Control folio sizes used for page cache memory
  2024-07-17 10:45   ` Ryan Roberts
@ 2024-07-17 14:25     ` David Hildenbrand
  2024-07-22  9:35     ` Daniel Gomez
  1 sibling, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2024-07-17 14:25 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), Barry Song, Lance Yang, Baolin Wang,
	Gavin Shan, Pankaj Raghav, Daniel Gomez
  Cc: linux-kernel, linux-mm

On 17.07.24 12:45, Ryan Roberts wrote:
> On 17/07/2024 11:31, David Hildenbrand wrote:
>> On 17.07.24 09:12, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> This series is an RFC that adds sysfs and kernel cmdline controls to configure
>>> the set of allowed large folio sizes that can be used when allocating
>>> file-memory for the page cache. As part of the control mechanism, it provides
>>> for a special-case "preferred folio size for executable mappings" marker.
>>>
>>> I'm trying to solve 2 separate problems with this series:
>>>
>>> 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
>>> approach for the change at [1]. Instead of hardcoding the preferred executable
>>> folio size into the arch, user space can now select it. This decouples the arch
>>> code and also makes the mechanism more generic; it can be bypassed (the default)
>>> or any folio size can be set. For my use case, 64K is preferred, but I've also
>>> heard from Willy of a use case where putting all text into 2M PMD-sized folios
>>> is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
>>> therefore faulting in all text ahead of time) to achieve that.
>>>
>>> 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
>>> Android): The theory goes that if all folios are 64K, then failure to allocate a
>>> 64K folio should become unlikely. But if the page cache is allocating lots of
>>> different orders, with most allocations having an order below 64K (as is the
>>> case today) then ability to allocate 64K folios diminishes. By providing control
>>> over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
>>> allocation failure. Additionally I've heard (second hand) of the need to disable
>>> large folios in the page cache entirely due to latency concerns in some
>>> settings. These controls allow all of this without kernel changes.
>>>
>>> The value of (1) is clear and the performance improvements are documented in
>>> patch 2. I don't yet have any data demonstrating the theory for (2) since I
>>> can't reproduce the setup that Barry had at [2]. But my view is that by adding
>>> these controls we will enable the community to explore further, in the same way
>>> that the anon mTHP controls helped harden the understanding for anonymous
>>> memory.
>>>
>>> ---
>>
>> How would this interact with other requirements we get from the filesystem (for
>> example, because of the device) [1].
>>
>> Assuming a device's filesystem has a min order of X, but we disable anything
>>> = X, how would we combine that configuration/information?
> 
> Currently order-0 is implicitly the "always-on" fallback order. My thinking was
> that with [1], the specified min order just becomes that "always-on" fallback order.
> 
> Today:
> 
>    orders = file_orders_always() | BIT(0);
> 
> Tomorrow:
> 
>    orders = (file_orders_always() & ~(BIT(min_order) - 1)) | BIT(min_order);
> 
> That does mean that in this case, a user-disabled order could still be used. So
> the controls are really hints rather than definitive commands.

Okay, because that's a difference to order-0, which is -- as you note -- 
always-on (not even a toggle).

Staring at patch #1, you use the name "file_enabled". That might indeed 
cause some confusion. Thinking out loud, I wonder if a different 
terminology could better express the semantics. Hm ... but maybe it just 
needs to be documented.

Thanks for the details.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 2/4] mm: Introduce "always+exec" for mTHP file_enabled control
  2024-07-17  7:12 ` [RFC PATCH v1 2/4] mm: Introduce "always+exec" for mTHP file_enabled control Ryan Roberts
@ 2024-07-17 17:10   ` Ryan Roberts
  0 siblings, 0 replies; 23+ messages in thread
From: Ryan Roberts @ 2024-07-17 17:10 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), David Hildenbrand, Barry Song,
	Lance Yang, Baolin Wang, Gavin Shan, Pankaj Raghav, Daniel Gomez
  Cc: linux-kernel, linux-mm

On 17/07/2024 08:12, Ryan Roberts wrote:
> In addition to `always` and `never`, add `always+exec` as an option for:
> 
>   /sys/kernel/mm/transparent_hugepage/hugepages-*kB/file_enabled
> 
> `always+exec` acts like `always` but additionally marks the hugepage
> size as the preferred hugepage size for sections of any file mapped with
> execute permission. A maximum of one hugepage size can be marked as
> `exec` at a time, so applying it to a new size implicitly removes it
> from any size it was previously set for.

Just to document discussion around this which happened in the THP Cabal meeting,
there was a proposal to avoid user controls by always always allocating the
largest folio that we can for an exec mapping, given the bounds of the VMA. But
we concluded that this would likely cause latency concerns for some workloads
(e.g. increased app start up time). We then thought about setting a ceiling on
the folio size to allocate (e.g. 64K). But in that case, who decides what the
right ceiling value is? So, in my opinion, we are back to user controls.

Thanks,
Ryan

> 
> Change readahead to use this flagged exec size; when a request is made
> for an executable mapping, do a synchronous read of the size in a
> naturally aligned manner.
> 
> On arm64 if memory is physically contiguous and naturally aligned to the
> "contpte" size, we can use contpte mappings, which improves utilization
> of the TLB. When paired with the "multi-size THP" changes, this works
> well to reduce dTLB pressure. However iTLB pressure is still high due to
> executable mappings having a low likelihood of being in the required
> folio size and mapping alignment, even when the filesystem supports
> readahead into large folios (e.g. XFS).
> 
> The reason for the low likelihood is that the current readahead algorithm
> starts with an order-2 folio and increases the folio order by 2 every
> time the readahead mark is hit. But most executable memory is faulted in
> fairly randomly and so the readahead mark is rarely hit and most
> executable folios remain order-2. This is observed empirically and
> confirmed from discussion with a GNU linker expert; in general, the
> linker does nothing to group temporally accessed text together
> spatially. Additionally, with the current read-around approach there are
> no alignment guarantees between the file and folio. This is
> insufficient for arm64's contpte mapping requirement (order-4 for 4K
> base pages).
> 
> So it seems reasonable to special-case the read(ahead) logic for
> executable mappings. The trade-off is performance improvement (due to
> more efficient storage of the translations in iTLB) vs potential read
> amplification (due to reading too much data around the fault which won't
> be used), and the latter is independent of base page size.
> 
> Of course if no hugepage size is marked as `always+exec` the old
> behaviour is maintained.
> 
> Performance Benchmarking
> ------------------------
> 
> The below shows kernel compilation and speedometer javascript benchmarks
> on Ampere Altra arm64 system. When the patch is applied, `always+exec`
> is set for 64K folios.
> 
> First, confirmation that this patch causes more memory to be contained
> in 64K folios (this is for all file-backed memory so includes
> non-executable too):
> 
> | File-backed folios      |   Speedometer   |  Kernel Compile |
> | by size as percentage   |-----------------|-----------------|
> | of all mapped file mem  | before |  after | before |  after |
> |=========================|========|========|========|========|
> |file-thp-aligned-16kB    |    45% |     9% |    46% |     7% |
> |file-thp-aligned-32kB    |     2% |     0% |     3% |     1% |
> |file-thp-aligned-64kB    |     3% |    63% |     5% |    80% |
> |file-thp-aligned-128kB   |    11% |    11% |     0% |     0% |
> |file-thp-unaligned-16kB  |     1% |     0% |     3% |     1% |
> |file-thp-unaligned-128kB |     1% |     0% |     0% |     0% |
> |file-thp-partial         |     0% |     0% |     0% |     0% |
> |-------------------------|--------|--------|--------|--------|
> |file-cont-aligned-64kB   |    16% |    75% |     5% |    80% |
> 
> The above shows that for both use cases, the amount of file memory
> backed by 16K folios reduces and the amount backed by 64K folios
> increases significantly. And the amount of memory that is contpte-mapped
> significantly increases (last line).
> 
> And this is reflected in performance improvement:
> 
> Kernel Compilation (smaller is faster):
> | kernel   |   real-time |   kern-time |   user-time |   peak memory |
> |----------|-------------|-------------|-------------|---------------|
> | before   |        0.0% |        0.0% |        0.0% |          0.0% |
> | after    |       -1.6% |       -2.1% |       -1.7% |          0.0% |
> 
> Speedometer (bigger is faster):
> | kernel   |   runs_per_min |   peak memory |
> |----------|----------------|---------------|
> | before   |           0.0% |          0.0% |
> | after    |           1.3% |          1.0% |
> 
> Both benchmarks show a ~1.5% improvement once the patch is applied.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  Documentation/admin-guide/mm/transhuge.rst |  6 +++++
>  include/linux/huge_mm.h                    | 11 ++++++++
>  mm/filemap.c                               | 11 ++++++++
>  mm/huge_memory.c                           | 31 +++++++++++++++++-----
>  4 files changed, 52 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 9f3ed504c646..1aaf8e3a0b5a 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -292,12 +292,18 @@ memory from a set of allowed sizes. By default all THP sizes that the page cache
>  supports are allowed, but this set can be modified with one of::
> 
>  	echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/file_enabled
> +	echo always+exec >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/file_enabled
>  	echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/file_enabled
> 
>  where <size> is the hugepage size being addressed, the available sizes for which
>  vary by system. ``always`` adds the hugepage size to the set of allowed sizes,
>  and ``never`` removes the hugepage size from the set of allowed sizes.
> 
> +``always+exec`` acts like ``always`` but additionally marks the hugepage size as
> +the preferred hugepage size for sections of any file mapped executable. A
> +maximum of one hugepage size can be marked as ``exec`` at a time, so applying it
> +to a new size implicitly removes it from any size it was previously set for.
> +
>  In some situations, constraining the allowed sizes can reduce memory
>  fragmentation, resulting in fewer allocation fallbacks and improved system
>  performance.
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 19ced8192d39..3571ea0c3d8c 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -177,12 +177,18 @@ extern unsigned long huge_anon_orders_always;
>  extern unsigned long huge_anon_orders_madvise;
>  extern unsigned long huge_anon_orders_inherit;
>  extern unsigned long huge_file_orders_always;
> +extern int huge_file_exec_order;
> 
>  static inline unsigned long file_orders_always(void)
>  {
>  	return READ_ONCE(huge_file_orders_always);
>  }
> 
> +static inline int file_exec_order(void)
> +{
> +	return READ_ONCE(huge_file_exec_order);
> +}
> +
>  static inline bool hugepage_global_enabled(void)
>  {
>  	return transparent_hugepage_flags &
> @@ -453,6 +459,11 @@ static inline unsigned long file_orders_always(void)
>  	return 0;
>  }
> 
> +static inline int file_exec_order(void)
> +{
> +	return -1;
> +}
> +
>  static inline bool folio_test_pmd_mappable(struct folio *folio)
>  {
>  	return false;
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 870016fcfdde..c4a3cc6a2e46 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3128,6 +3128,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  	struct file *fpin = NULL;
>  	unsigned long vm_flags = vmf->vma->vm_flags;
>  	unsigned int mmap_miss;
> +	int exec_order = file_exec_order();
> 
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  	/* Use the readahead code, even if readahead is disabled */
> @@ -3147,6 +3148,16 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  	}
>  #endif
> 
> +	/* If explicit order is set for exec mappings, use it. */
> +	if ((vm_flags & VM_EXEC) && exec_order >= 0) {
> +		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> +		ra->size = 1UL << exec_order;
> +		ra->async_size = 0;
> +		ractl._index &= ~((unsigned long)ra->size - 1);
> +		page_cache_ra_order(&ractl, ra, exec_order);
> +		return fpin;
> +	}
> +
>  	/* If we don't want any read-ahead, don't bother */
>  	if (vm_flags & VM_RAND_READ)
>  		return fpin;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e8fe28fe9cf9..4249c0bc9388 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -81,6 +81,7 @@ unsigned long huge_anon_orders_always __read_mostly;
>  unsigned long huge_anon_orders_madvise __read_mostly;
>  unsigned long huge_anon_orders_inherit __read_mostly;
>  unsigned long huge_file_orders_always __read_mostly;
> +int huge_file_exec_order __read_mostly = -1;
> 
>  unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>  					 unsigned long vm_flags,
> @@ -462,6 +463,7 @@ static const struct attribute_group hugepage_attr_group = {
>  static void hugepage_exit_sysfs(struct kobject *hugepage_kobj);
>  static void thpsize_release(struct kobject *kobj);
>  static DEFINE_SPINLOCK(huge_anon_orders_lock);
> +static DEFINE_SPINLOCK(huge_file_orders_lock);
>  static LIST_HEAD(thpsize_list);
> 
>  static ssize_t anon_enabled_show(struct kobject *kobj,
> @@ -531,11 +533,15 @@ static ssize_t file_enabled_show(struct kobject *kobj,
>  {
>  	int order = to_thpsize(kobj)->order;
>  	const char *output;
> +	bool exec;
> 
> -	if (test_bit(order, &huge_file_orders_always))
> -		output = "[always] never";
> -	else
> -		output = "always [never]";
> +	if (test_bit(order, &huge_file_orders_always)) {
> +		exec = READ_ONCE(huge_file_exec_order) == order;
> +		output = exec ? "always [always+exec] never" :
> +				"[always] always+exec never";
> +	} else {
> +		output = "always always+exec [never]";
> +	}
> 
>  	return sysfs_emit(buf, "%s\n", output);
>  }
> @@ -547,13 +553,24 @@ static ssize_t file_enabled_store(struct kobject *kobj,
>  	int order = to_thpsize(kobj)->order;
>  	ssize_t ret = count;
> 
> -	if (sysfs_streq(buf, "always"))
> +	spin_lock(&huge_file_orders_lock);
> +
> +	if (sysfs_streq(buf, "always")) {
>  		set_bit(order, &huge_file_orders_always);
> -	else if (sysfs_streq(buf, "never"))
> +		if (huge_file_exec_order == order)
> +			huge_file_exec_order = -1;
> +	} else if (sysfs_streq(buf, "always+exec")) {
> +		set_bit(order, &huge_file_orders_always);
> +		huge_file_exec_order = order;
> +	} else if (sysfs_streq(buf, "never")) {
>  		clear_bit(order, &huge_file_orders_always);
> -	else
> +		if (huge_file_exec_order == order)
> +			huge_file_exec_order = -1;
> +	} else {
>  		ret = -EINVAL;
> +	}
> 
> +	spin_unlock(&huge_file_orders_lock);
>  	return ret;
>  }
> 
> --
> 2.43.0
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 3/4] mm: Override mTHP "enabled" defaults at kernel cmdline
  2024-07-17  7:12 ` [RFC PATCH v1 3/4] mm: Override mTHP "enabled" defaults at kernel cmdline Ryan Roberts
@ 2024-07-19  0:46   ` Barry Song
  2024-07-19  7:47     ` Ryan Roberts
  2024-07-22  9:13   ` Daniel Gomez
  1 sibling, 1 reply; 23+ messages in thread
From: Barry Song @ 2024-07-19  0:46 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), David Hildenbrand, Lance Yang,
	Baolin Wang, Gavin Shan, Pankaj Raghav, Daniel Gomez,
	linux-kernel, linux-mm

On Wed, Jul 17, 2024 at 7:13 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Add thp_anon= cmdline parameter to allow specifying the default
> enablement of each supported anon THP size. The parameter accepts the
> following format and can be provided multiple times to configure each
> size:
>
> thp_anon=<size>[KMG]:<value>
>
> See Documentation/admin-guide/mm/transhuge.rst for more details.
>
> Configuring the defaults at boot time is useful to allow early user
> space to take advantage of mTHP before it's been configured through
> sysfs.

This is exactly what I need and have wanted to implement, as the current
behavior is problematic. The system has to boot all the way to the point
where the sysfs interfaces can be used to enable mTHP, so many early
processes miss the opportunity to use mTHP.

On the other hand, userspace may have been tuned to detect that mTHP
is enabled, for example a .so library. But it turns out we have
inconsistent settings between the two stages - before and after mTHP
is enabled via the sysfs interfaces.

>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |  8 +++
>  Documentation/admin-guide/mm/transhuge.rst    | 26 +++++++--
>  mm/huge_memory.c                              | 55 ++++++++++++++++++-
>  3 files changed, 82 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index bc55fb55cd26..48443ad12e3f 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -6592,6 +6592,14 @@
>                         <deci-seconds>: poll all this frequency
>                         0: no polling (default)
>
> +       thp_anon=       [KNL]
> +                       Format: <size>[KMG]:always|madvise|never|inherit
> +                       Can be used to control the default behavior of the
> +                       system with respect to anonymous transparent hugepages.
> +                       Can be used multiple times for multiple anon THP sizes.
> +                       See Documentation/admin-guide/mm/transhuge.rst for more
> +                       details.
> +
>         threadirqs      [KNL,EARLY]
>                         Force threading of all interrupt handlers except those
>                         marked explicitly IRQF_NO_THREAD.
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 1aaf8e3a0b5a..f53d43d986e2 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -311,13 +311,27 @@ performance.
>  Note that any changes to the allowed set of sizes only applies to future
>  file-backed THP allocations.
>
> -Boot parameter
> -==============
> +Boot parameters
> +===============
>
> -You can change the sysfs boot time defaults of Transparent Hugepage
> -Support by passing the parameter ``transparent_hugepage=always`` or
> -``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
> -to the kernel command line.
> +You can change the sysfs boot time default for the top-level "enabled"
> +control by passing the parameter ``transparent_hugepage=always`` or
> +``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the
> +kernel command line.
> +
> +Alternatively, each supported anonymous THP size can be controlled by
> +passing ``thp_anon=<size>[KMG]:<state>``, where ``<size>`` is the THP size
> +and ``<state>`` is one of ``always``, ``madvise``, ``never`` or
> +``inherit``.
> +
> +For example, the following will set 64K THP to ``always``::
> +
> +       thp_anon=64K:always
> +
> +``thp_anon=`` may be specified multiple times to configure all THP sizes as
> +required. If ``thp_anon=`` is specified at least once, any anon THP sizes
> +not explicitly configured on the command line are implicitly set to
> +``never``.
>
>  Hugepages in tmpfs/shmem
>  ========================
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 4249c0bc9388..794d2790d90d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -82,6 +82,7 @@ unsigned long huge_anon_orders_madvise __read_mostly;
>  unsigned long huge_anon_orders_inherit __read_mostly;
>  unsigned long huge_file_orders_always __read_mostly;
>  int huge_file_exec_order __read_mostly = -1;
> +static bool anon_orders_configured;
>
>  unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>                                          unsigned long vm_flags,
> @@ -763,7 +764,10 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
>          * disable all other sizes. powerpc's PMD_ORDER isn't a compile-time
>          * constant so we have to do this here.
>          */
> -       huge_anon_orders_inherit = BIT(PMD_ORDER);
> +       if (!anon_orders_configured) {
> +               huge_anon_orders_inherit = BIT(PMD_ORDER);
> +               anon_orders_configured = true;
> +       }
>
>         /*
>          * For pagecache, default to enabling all orders. powerpc's PMD_ORDER
> @@ -955,6 +959,55 @@ static int __init setup_transparent_hugepage(char *str)
>  }
>  __setup("transparent_hugepage=", setup_transparent_hugepage);
>
> +static int __init setup_thp_anon(char *str)
> +{
> +       unsigned long size;
> +       char *state;
> +       int order;
> +       int ret = 0;
> +
> +       if (!str)
> +               goto out;
> +
> +       size = (unsigned long)memparse(str, &state);
> +       order = ilog2(size >> PAGE_SHIFT);
> +       if (*state != ':' || !is_power_of_2(size) || size <= PAGE_SIZE ||
> +           !(BIT(order) & THP_ORDERS_ALL_ANON))
> +               goto out;
> +
> +       state++;
> +
> +       if (!strcmp(state, "always")) {
> +               clear_bit(order, &huge_anon_orders_inherit);
> +               clear_bit(order, &huge_anon_orders_madvise);
> +               set_bit(order, &huge_anon_orders_always);
> +               ret = 1;
> +       } else if (!strcmp(state, "inherit")) {
> +               clear_bit(order, &huge_anon_orders_always);
> +               clear_bit(order, &huge_anon_orders_madvise);
> +               set_bit(order, &huge_anon_orders_inherit);
> +               ret = 1;
> +       } else if (!strcmp(state, "madvise")) {
> +               clear_bit(order, &huge_anon_orders_always);
> +               clear_bit(order, &huge_anon_orders_inherit);
> +               set_bit(order, &huge_anon_orders_madvise);
> +               ret = 1;
> +       } else if (!strcmp(state, "never")) {
> +               clear_bit(order, &huge_anon_orders_always);
> +               clear_bit(order, &huge_anon_orders_inherit);
> +               clear_bit(order, &huge_anon_orders_madvise);
> +               ret = 1;
> +       }
> +
> +       if (ret)
> +               anon_orders_configured = true;
> +out:
> +       if (!ret)
> +               pr_warn("thp_anon=%s: cannot parse, ignored\n", str);
> +       return ret;
> +}
> +__setup("thp_anon=", setup_thp_anon);
> +
>  pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
>  {
>         if (likely(vma->vm_flags & VM_WRITE))
> --
> 2.43.0
>

Thanks
Barry



* Re: [RFC PATCH v1 3/4] mm: Override mTHP "enabled" defaults at kernel cmdline
  2024-07-19  0:46   ` Barry Song
@ 2024-07-19  7:47     ` Ryan Roberts
  2024-07-19  7:52       ` Barry Song
  0 siblings, 1 reply; 23+ messages in thread
From: Ryan Roberts @ 2024-07-19  7:47 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), David Hildenbrand, Lance Yang,
	Baolin Wang, Gavin Shan, Pankaj Raghav, Daniel Gomez,
	linux-kernel, linux-mm

On 19/07/2024 01:46, Barry Song wrote:
> On Wed, Jul 17, 2024 at 7:13 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Add thp_anon= cmdline parameter to allow specifying the default
>> enablement of each supported anon THP size. The parameter accepts the
>> following format and can be provided multiple times to configure each
>> size:
>>
>> thp_anon=<size>[KMG]:<value>
>>
>> See Documentation/admin-guide/mm/transhuge.rst for more details.
>>
>> Configuring the defaults at boot time is useful to allow early user
>> space to take advantage of mTHP before it's been configured through
>> sysfs.
> 
> This is exactly what I need and want to implement, as the current behavior
> is problematic. We need to boot up the system and reach the point where
> we can set up the sys interfaces to enable mTHP. Many processes miss the
> opportunity to use mTHP.
> 
> On the other hand, userspace might have been tuned to detect that mTHP
> is enabled, such as a .so library. However, it turns out we have had
> inconsistent settings between the two stages - before and after setting
> mTHP enabled by sys interfaces.

Good feedback - sounds like I should separate out this patch from the rest of
the series to get it reviewed and merged faster?

> 
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  .../admin-guide/kernel-parameters.txt         |  8 +++
>>  Documentation/admin-guide/mm/transhuge.rst    | 26 +++++++--
>>  mm/huge_memory.c                              | 55 ++++++++++++++++++-
>>  3 files changed, 82 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index bc55fb55cd26..48443ad12e3f 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -6592,6 +6592,14 @@
>>                         <deci-seconds>: poll all this frequency
>>                         0: no polling (default)
>>
>> +       thp_anon=       [KNL]
>> +                       Format: <size>[KMG]:always|madvise|never|inherit
>> +                       Can be used to control the default behavior of the
>> +                       system with respect to anonymous transparent hugepages.
>> +                       Can be used multiple times for multiple anon THP sizes.
>> +                       See Documentation/admin-guide/mm/transhuge.rst for more
>> +                       details.
>> +
>>         threadirqs      [KNL,EARLY]
>>                         Force threading of all interrupt handlers except those
>>                         marked explicitly IRQF_NO_THREAD.
>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>> index 1aaf8e3a0b5a..f53d43d986e2 100644
>> --- a/Documentation/admin-guide/mm/transhuge.rst
>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>> @@ -311,13 +311,27 @@ performance.
>>  Note that any changes to the allowed set of sizes only applies to future
>>  file-backed THP allocations.
>>
>> -Boot parameter
>> -==============
>> +Boot parameters
>> +===============
>>
>> -You can change the sysfs boot time defaults of Transparent Hugepage
>> -Support by passing the parameter ``transparent_hugepage=always`` or
>> -``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
>> -to the kernel command line.
>> +You can change the sysfs boot time default for the top-level "enabled"
>> +control by passing the parameter ``transparent_hugepage=always`` or
>> +``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the
>> +kernel command line.
>> +
>> +Alternatively, each supported anonymous THP size can be controlled by
>> +passing ``thp_anon=<size>[KMG]:<state>``, where ``<size>`` is the THP size
>> +and ``<state>`` is one of ``always``, ``madvise``, ``never`` or
>> +``inherit``.
>> +
>> +For example, the following will set 64K THP to ``always``::
>> +
>> +       thp_anon=64K:always
>> +
>> +``thp_anon=`` may be specified multiple times to configure all THP sizes as
>> +required. If ``thp_anon=`` is specified at least once, any anon THP sizes
>> +not explicitly configured on the command line are implicitly set to
>> +``never``.
>>
>>  Hugepages in tmpfs/shmem
>>  ========================
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 4249c0bc9388..794d2790d90d 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -82,6 +82,7 @@ unsigned long huge_anon_orders_madvise __read_mostly;
>>  unsigned long huge_anon_orders_inherit __read_mostly;
>>  unsigned long huge_file_orders_always __read_mostly;
>>  int huge_file_exec_order __read_mostly = -1;
>> +static bool anon_orders_configured;
>>
>>  unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>                                          unsigned long vm_flags,
>> @@ -763,7 +764,10 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
>>          * disable all other sizes. powerpc's PMD_ORDER isn't a compile-time
>>          * constant so we have to do this here.
>>          */
>> -       huge_anon_orders_inherit = BIT(PMD_ORDER);
>> +       if (!anon_orders_configured) {
>> +               huge_anon_orders_inherit = BIT(PMD_ORDER);
>> +               anon_orders_configured = true;
>> +       }
>>
>>         /*
>>          * For pagecache, default to enabling all orders. powerpc's PMD_ORDER
>> @@ -955,6 +959,55 @@ static int __init setup_transparent_hugepage(char *str)
>>  }
>>  __setup("transparent_hugepage=", setup_transparent_hugepage);
>>
>> +static int __init setup_thp_anon(char *str)
>> +{
>> +       unsigned long size;
>> +       char *state;
>> +       int order;
>> +       int ret = 0;
>> +
>> +       if (!str)
>> +               goto out;
>> +
>> +       size = (unsigned long)memparse(str, &state);
>> +       order = ilog2(size >> PAGE_SHIFT);
>> +       if (*state != ':' || !is_power_of_2(size) || size <= PAGE_SIZE ||
>> +           !(BIT(order) & THP_ORDERS_ALL_ANON))
>> +               goto out;
>> +
>> +       state++;
>> +
>> +       if (!strcmp(state, "always")) {
>> +               clear_bit(order, &huge_anon_orders_inherit);
>> +               clear_bit(order, &huge_anon_orders_madvise);
>> +               set_bit(order, &huge_anon_orders_always);
>> +               ret = 1;
>> +       } else if (!strcmp(state, "inherit")) {
>> +               clear_bit(order, &huge_anon_orders_always);
>> +               clear_bit(order, &huge_anon_orders_madvise);
>> +               set_bit(order, &huge_anon_orders_inherit);
>> +               ret = 1;
>> +       } else if (!strcmp(state, "madvise")) {
>> +               clear_bit(order, &huge_anon_orders_always);
>> +               clear_bit(order, &huge_anon_orders_inherit);
>> +               set_bit(order, &huge_anon_orders_madvise);
>> +               ret = 1;
>> +       } else if (!strcmp(state, "never")) {
>> +               clear_bit(order, &huge_anon_orders_always);
>> +               clear_bit(order, &huge_anon_orders_inherit);
>> +               clear_bit(order, &huge_anon_orders_madvise);
>> +               ret = 1;
>> +       }
>> +
>> +       if (ret)
>> +               anon_orders_configured = true;
>> +out:
>> +       if (!ret)
>> +               pr_warn("thp_anon=%s: cannot parse, ignored\n", str);
>> +       return ret;
>> +}
>> +__setup("thp_anon=", setup_thp_anon);
>> +
>>  pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
>>  {
>>         if (likely(vma->vm_flags & VM_WRITE))
>> --
>> 2.43.0
>>
> 
> Thanks
> Barry




* Re: [RFC PATCH v1 3/4] mm: Override mTHP "enabled" defaults at kernel cmdline
  2024-07-19  7:47     ` Ryan Roberts
@ 2024-07-19  7:52       ` Barry Song
  2024-07-19  8:18         ` Ryan Roberts
  2024-07-19  8:29         ` David Hildenbrand
  0 siblings, 2 replies; 23+ messages in thread
From: Barry Song @ 2024-07-19  7:52 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), David Hildenbrand, Lance Yang,
	Baolin Wang, Gavin Shan, Pankaj Raghav, Daniel Gomez,
	linux-kernel, linux-mm

On Fri, Jul 19, 2024 at 7:48 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 19/07/2024 01:46, Barry Song wrote:
> > On Wed, Jul 17, 2024 at 7:13 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Add thp_anon= cmdline parameter to allow specifying the default
> >> enablement of each supported anon THP size. The parameter accepts the
> >> following format and can be provided multiple times to configure each
> >> size:
> >>
> >> thp_anon=<size>[KMG]:<value>
> >>
> >> See Documentation/admin-guide/mm/transhuge.rst for more details.
> >>
> >> Configuring the defaults at boot time is useful to allow early user
> >> space to take advantage of mTHP before it's been configured through
> >> sysfs.
> >
> > This is exactly what I need and want to implement, as the current behavior
> > is problematic. We need to boot up the system and reach the point where
> > we can set up the sys interfaces to enable mTHP. Many processes miss the
> > opportunity to use mTHP.
> >
> > On the other hand, userspace might have been tuned to detect that mTHP
> > is enabled, such as a .so library. However, it turns out we have had
> > inconsistent settings between the two stages - before and after setting
> > mTHP enabled by sys interfaces.
>
> Good feedback - sounds like I should separate out this patch from the rest of
> the series to get it reviewed and merged faster?

+1

>
> >
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  .../admin-guide/kernel-parameters.txt         |  8 +++
> >>  Documentation/admin-guide/mm/transhuge.rst    | 26 +++++++--
> >>  mm/huge_memory.c                              | 55 ++++++++++++++++++-
> >>  3 files changed, 82 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> >> index bc55fb55cd26..48443ad12e3f 100644
> >> --- a/Documentation/admin-guide/kernel-parameters.txt
> >> +++ b/Documentation/admin-guide/kernel-parameters.txt
> >> @@ -6592,6 +6592,14 @@
> >>                         <deci-seconds>: poll all this frequency
> >>                         0: no polling (default)
> >>
> >> +       thp_anon=       [KNL]
> >> +                       Format: <size>[KMG]:always|madvise|never|inherit
> >> +                       Can be used to control the default behavior of the
> >> +                       system with respect to anonymous transparent hugepages.
> >> +                       Can be used multiple times for multiple anon THP sizes.
> >> +                       See Documentation/admin-guide/mm/transhuge.rst for more
> >> +                       details.
> >> +
> >>         threadirqs      [KNL,EARLY]
> >>                         Force threading of all interrupt handlers except those
> >>                         marked explicitly IRQF_NO_THREAD.
> >> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> >> index 1aaf8e3a0b5a..f53d43d986e2 100644
> >> --- a/Documentation/admin-guide/mm/transhuge.rst
> >> +++ b/Documentation/admin-guide/mm/transhuge.rst
> >> @@ -311,13 +311,27 @@ performance.
> >>  Note that any changes to the allowed set of sizes only applies to future
> >>  file-backed THP allocations.
> >>
> >> -Boot parameter
> >> -==============
> >> +Boot parameters
> >> +===============
> >>
> >> -You can change the sysfs boot time defaults of Transparent Hugepage
> >> -Support by passing the parameter ``transparent_hugepage=always`` or
> >> -``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
> >> -to the kernel command line.
> >> +You can change the sysfs boot time default for the top-level "enabled"
> >> +control by passing the parameter ``transparent_hugepage=always`` or
> >> +``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the
> >> +kernel command line.
> >> +
> >> +Alternatively, each supported anonymous THP size can be controlled by
> >> +passing ``thp_anon=<size>[KMG]:<state>``, where ``<size>`` is the THP size
> >> +and ``<state>`` is one of ``always``, ``madvise``, ``never`` or
> >> +``inherit``.
> >> +
> >> +For example, the following will set 64K THP to ``always``::
> >> +
> >> +       thp_anon=64K:always
> >> +
> >> +``thp_anon=`` may be specified multiple times to configure all THP sizes as
> >> +required. If ``thp_anon=`` is specified at least once, any anon THP sizes
> >> +not explicitly configured on the command line are implicitly set to
> >> +``never``.
> >>
> >>  Hugepages in tmpfs/shmem
> >>  ========================
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 4249c0bc9388..794d2790d90d 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -82,6 +82,7 @@ unsigned long huge_anon_orders_madvise __read_mostly;
> >>  unsigned long huge_anon_orders_inherit __read_mostly;
> >>  unsigned long huge_file_orders_always __read_mostly;
> >>  int huge_file_exec_order __read_mostly = -1;
> >> +static bool anon_orders_configured;
> >>
> >>  unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> >>                                          unsigned long vm_flags,
> >> @@ -763,7 +764,10 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
> >>          * disable all other sizes. powerpc's PMD_ORDER isn't a compile-time
> >>          * constant so we have to do this here.
> >>          */
> >> -       huge_anon_orders_inherit = BIT(PMD_ORDER);
> >> +       if (!anon_orders_configured) {
> >> +               huge_anon_orders_inherit = BIT(PMD_ORDER);
> >> +               anon_orders_configured = true;
> >> +       }
> >>
> >>         /*
> >>          * For pagecache, default to enabling all orders. powerpc's PMD_ORDER
> >> @@ -955,6 +959,55 @@ static int __init setup_transparent_hugepage(char *str)
> >>  }
> >>  __setup("transparent_hugepage=", setup_transparent_hugepage);
> >>
> >> +static int __init setup_thp_anon(char *str)
> >> +{
> >> +       unsigned long size;
> >> +       char *state;
> >> +       int order;
> >> +       int ret = 0;
> >> +
> >> +       if (!str)
> >> +               goto out;
> >> +
> >> +       size = (unsigned long)memparse(str, &state);
> >> +       order = ilog2(size >> PAGE_SHIFT);
> >> +       if (*state != ':' || !is_power_of_2(size) || size <= PAGE_SIZE ||
> >> +           !(BIT(order) & THP_ORDERS_ALL_ANON))
> >> +               goto out;
> >> +
> >> +       state++;
> >> +
> >> +       if (!strcmp(state, "always")) {
> >> +               clear_bit(order, &huge_anon_orders_inherit);
> >> +               clear_bit(order, &huge_anon_orders_madvise);
> >> +               set_bit(order, &huge_anon_orders_always);
> >> +               ret = 1;
> >> +       } else if (!strcmp(state, "inherit")) {
> >> +               clear_bit(order, &huge_anon_orders_always);
> >> +               clear_bit(order, &huge_anon_orders_madvise);
> >> +               set_bit(order, &huge_anon_orders_inherit);
> >> +               ret = 1;
> >> +       } else if (!strcmp(state, "madvise")) {
> >> +               clear_bit(order, &huge_anon_orders_always);
> >> +               clear_bit(order, &huge_anon_orders_inherit);
> >> +               set_bit(order, &huge_anon_orders_madvise);
> >> +               ret = 1;
> >> +       } else if (!strcmp(state, "never")) {
> >> +               clear_bit(order, &huge_anon_orders_always);
> >> +               clear_bit(order, &huge_anon_orders_inherit);
> >> +               clear_bit(order, &huge_anon_orders_madvise);
> >> +               ret = 1;
> >> +       }
> >> +
> >> +       if (ret)
> >> +               anon_orders_configured = true;
> >> +out:
> >> +       if (!ret)
> >> +               pr_warn("thp_anon=%s: cannot parse, ignored\n", str);
> >> +       return ret;
> >> +}
> >> +__setup("thp_anon=", setup_thp_anon);
> >> +
> >>  pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
> >>  {
> >>         if (likely(vma->vm_flags & VM_WRITE))
> >> --
> >> 2.43.0
> >>
> >
> > Thanks
> > Barry
>
>



* Re: [RFC PATCH v1 3/4] mm: Override mTHP "enabled" defaults at kernel cmdline
  2024-07-19  7:52       ` Barry Song
@ 2024-07-19  8:18         ` Ryan Roberts
  2024-07-19  8:29         ` David Hildenbrand
  1 sibling, 0 replies; 23+ messages in thread
From: Ryan Roberts @ 2024-07-19  8:18 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), David Hildenbrand, Lance Yang,
	Baolin Wang, Gavin Shan, Pankaj Raghav, Daniel Gomez,
	linux-kernel, linux-mm

On 19/07/2024 08:52, Barry Song wrote:
> On Fri, Jul 19, 2024 at 7:48 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 19/07/2024 01:46, Barry Song wrote:
>>> On Wed, Jul 17, 2024 at 7:13 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Add thp_anon= cmdline parameter to allow specifying the default
>>>> enablement of each supported anon THP size. The parameter accepts the
>>>> following format and can be provided multiple times to configure each
>>>> size:
>>>>
>>>> thp_anon=<size>[KMG]:<value>
>>>>
>>>> See Documentation/admin-guide/mm/transhuge.rst for more details.
>>>>
>>>> Configuring the defaults at boot time is useful to allow early user
>>>> space to take advantage of mTHP before it's been configured through
>>>> sysfs.
>>>
>>> This is exactly what I need and want to implement, as the current behavior
>>> is problematic. We need to boot up the system and reach the point where
>>> we can set up the sys interfaces to enable mTHP. Many processes miss the
>>> opportunity to use mTHP.
>>>
>>> On the other hand, userspace might have been tuned to detect that mTHP
>>> is enabled, such as a .so library. However, it turns out we have had
>>> inconsistent settings between the two stages - before and after setting
>>> mTHP enabled by sys interfaces.
>>
>> Good feedback - sounds like I should separate out this patch from the rest of
>> the series to get it reviewed and merged faster?
> 
> +1

OK I'll wait a couple of days to see if anyone has any feedback against this
version, then I'll re-post this on its own.

> 
>>
>>>
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>  .../admin-guide/kernel-parameters.txt         |  8 +++
>>>>  Documentation/admin-guide/mm/transhuge.rst    | 26 +++++++--
>>>>  mm/huge_memory.c                              | 55 ++++++++++++++++++-
>>>>  3 files changed, 82 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>>> index bc55fb55cd26..48443ad12e3f 100644
>>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>>> @@ -6592,6 +6592,14 @@
>>>>                         <deci-seconds>: poll all this frequency
>>>>                         0: no polling (default)
>>>>
>>>> +       thp_anon=       [KNL]
>>>> +                       Format: <size>[KMG]:always|madvise|never|inherit
>>>> +                       Can be used to control the default behavior of the
>>>> +                       system with respect to anonymous transparent hugepages.
>>>> +                       Can be used multiple times for multiple anon THP sizes.
>>>> +                       See Documentation/admin-guide/mm/transhuge.rst for more
>>>> +                       details.
>>>> +
>>>>         threadirqs      [KNL,EARLY]
>>>>                         Force threading of all interrupt handlers except those
>>>>                         marked explicitly IRQF_NO_THREAD.
>>>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>>>> index 1aaf8e3a0b5a..f53d43d986e2 100644
>>>> --- a/Documentation/admin-guide/mm/transhuge.rst
>>>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>>>> @@ -311,13 +311,27 @@ performance.
>>>>  Note that any changes to the allowed set of sizes only applies to future
>>>>  file-backed THP allocations.
>>>>
>>>> -Boot parameter
>>>> -==============
>>>> +Boot parameters
>>>> +===============
>>>>
>>>> -You can change the sysfs boot time defaults of Transparent Hugepage
>>>> -Support by passing the parameter ``transparent_hugepage=always`` or
>>>> -``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
>>>> -to the kernel command line.
>>>> +You can change the sysfs boot time default for the top-level "enabled"
>>>> +control by passing the parameter ``transparent_hugepage=always`` or
>>>> +``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the
>>>> +kernel command line.
>>>> +
>>>> +Alternatively, each supported anonymous THP size can be controlled by
>>>> +passing ``thp_anon=<size>[KMG]:<state>``, where ``<size>`` is the THP size
>>>> +and ``<state>`` is one of ``always``, ``madvise``, ``never`` or
>>>> +``inherit``.
>>>> +
>>>> +For example, the following will set 64K THP to ``always``::
>>>> +
>>>> +       thp_anon=64K:always
>>>> +
>>>> +``thp_anon=`` may be specified multiple times to configure all THP sizes as
>>>> +required. If ``thp_anon=`` is specified at least once, any anon THP sizes
>>>> +not explicitly configured on the command line are implicitly set to
>>>> +``never``.
>>>>
>>>>  Hugepages in tmpfs/shmem
>>>>  ========================
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 4249c0bc9388..794d2790d90d 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -82,6 +82,7 @@ unsigned long huge_anon_orders_madvise __read_mostly;
>>>>  unsigned long huge_anon_orders_inherit __read_mostly;
>>>>  unsigned long huge_file_orders_always __read_mostly;
>>>>  int huge_file_exec_order __read_mostly = -1;
>>>> +static bool anon_orders_configured;
>>>>
>>>>  unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>>                                          unsigned long vm_flags,
>>>> @@ -763,7 +764,10 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
>>>>          * disable all other sizes. powerpc's PMD_ORDER isn't a compile-time
>>>>          * constant so we have to do this here.
>>>>          */
>>>> -       huge_anon_orders_inherit = BIT(PMD_ORDER);
>>>> +       if (!anon_orders_configured) {
>>>> +               huge_anon_orders_inherit = BIT(PMD_ORDER);
>>>> +               anon_orders_configured = true;
>>>> +       }
>>>>
>>>>         /*
>>>>          * For pagecache, default to enabling all orders. powerpc's PMD_ORDER
>>>> @@ -955,6 +959,55 @@ static int __init setup_transparent_hugepage(char *str)
>>>>  }
>>>>  __setup("transparent_hugepage=", setup_transparent_hugepage);
>>>>
>>>> +static int __init setup_thp_anon(char *str)
>>>> +{
>>>> +       unsigned long size;
>>>> +       char *state;
>>>> +       int order;
>>>> +       int ret = 0;
>>>> +
>>>> +       if (!str)
>>>> +               goto out;
>>>> +
>>>> +       size = (unsigned long)memparse(str, &state);
>>>> +       order = ilog2(size >> PAGE_SHIFT);
>>>> +       if (*state != ':' || !is_power_of_2(size) || size <= PAGE_SIZE ||
>>>> +           !(BIT(order) & THP_ORDERS_ALL_ANON))
>>>> +               goto out;
>>>> +
>>>> +       state++;
>>>> +
>>>> +       if (!strcmp(state, "always")) {
>>>> +               clear_bit(order, &huge_anon_orders_inherit);
>>>> +               clear_bit(order, &huge_anon_orders_madvise);
>>>> +               set_bit(order, &huge_anon_orders_always);
>>>> +               ret = 1;
>>>> +       } else if (!strcmp(state, "inherit")) {
>>>> +               clear_bit(order, &huge_anon_orders_always);
>>>> +               clear_bit(order, &huge_anon_orders_madvise);
>>>> +               set_bit(order, &huge_anon_orders_inherit);
>>>> +               ret = 1;
>>>> +       } else if (!strcmp(state, "madvise")) {
>>>> +               clear_bit(order, &huge_anon_orders_always);
>>>> +               clear_bit(order, &huge_anon_orders_inherit);
>>>> +               set_bit(order, &huge_anon_orders_madvise);
>>>> +               ret = 1;
>>>> +       } else if (!strcmp(state, "never")) {
>>>> +               clear_bit(order, &huge_anon_orders_always);
>>>> +               clear_bit(order, &huge_anon_orders_inherit);
>>>> +               clear_bit(order, &huge_anon_orders_madvise);
>>>> +               ret = 1;
>>>> +       }
>>>> +
>>>> +       if (ret)
>>>> +               anon_orders_configured = true;
>>>> +out:
>>>> +       if (!ret)
>>>> +               pr_warn("thp_anon=%s: cannot parse, ignored\n", str);
>>>> +       return ret;
>>>> +}
>>>> +__setup("thp_anon=", setup_thp_anon);
>>>> +
>>>>  pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
>>>>  {
>>>>         if (likely(vma->vm_flags & VM_WRITE))
>>>> --
>>>> 2.43.0
>>>>
>>>
>>> Thanks
>>> Barry
>>
>>



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 3/4] mm: Override mTHP "enabled" defaults at kernel cmdline
  2024-07-19  7:52       ` Barry Song
  2024-07-19  8:18         ` Ryan Roberts
@ 2024-07-19  8:29         ` David Hildenbrand
  1 sibling, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2024-07-19  8:29 UTC (permalink / raw)
  To: Barry Song, Ryan Roberts
  Cc: Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), Lance Yang, Baolin Wang, Gavin Shan,
	Pankaj Raghav, Daniel Gomez, linux-kernel, linux-mm

On 19.07.24 09:52, Barry Song wrote:
> On Fri, Jul 19, 2024 at 7:48 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 19/07/2024 01:46, Barry Song wrote:
>>> On Wed, Jul 17, 2024 at 7:13 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Add thp_anon= cmdline parameter to allow specifying the default
>>>> enablement of each supported anon THP size. The parameter accepts the
>>>> following format and can be provided multiple times to configure each
>>>> size:
>>>>
>>>> thp_anon=<size>[KMG]:<value>
>>>>
>>>> See Documentation/admin-guide/mm/transhuge.rst for more details.
>>>>
>>>> Configuring the defaults at boot time is useful to allow early user
>>>> space to take advantage of mTHP before its been configured through
>>>> sysfs.
>>>
>>> This is exactly what I need and want to implement, as the current behavior
>>> is problematic. We need to boot up the system and reach the point where
>>> we can set up the sys interfaces to enable mTHP. Many processes miss the
>>> opportunity to use mTHP.
>>>
>>> On the other hand, userspace might have been tuned to detect whether mTHP
>>> is enabled, for example in a .so library. However, it turns out we have had
>>> inconsistent settings between the two stages - before and after mTHP is
>>> enabled through the sysfs interfaces.
>>
>> Good feedback - sounds like I should separate out this patch from the rest of
>> the series to get it reviewed and merged faster?
> 
> +1

Agreed, this is reasonable to have.

-- 
Cheers,

David / dhildenb




* Re: [RFC PATCH v1 3/4] mm: Override mTHP "enabled" defaults at kernel cmdline
  2024-07-17  7:12 ` [RFC PATCH v1 3/4] mm: Override mTHP "enabled" defaults at kernel cmdline Ryan Roberts
  2024-07-19  0:46   ` Barry Song
@ 2024-07-22  9:13   ` Daniel Gomez
  2024-07-22  9:36     ` Ryan Roberts
  1 sibling, 1 reply; 23+ messages in thread
From: Daniel Gomez @ 2024-07-22  9:13 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), David Hildenbrand, Barry Song,
	Lance Yang, Baolin Wang, Gavin Shan, Pankaj Raghav,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Wed, Jul 17, 2024 at 08:12:55AM GMT, Ryan Roberts wrote:
> Add thp_anon= cmdline parameter to allow specifying the default
> enablement of each supported anon THP size. The parameter accepts the
> following format and can be provided multiple times to configure each
> size:
> 
> thp_anon=<size>[KMG]:<value>

Minor suggestion. Should this be renamed to hp_anon= or hugepages_anon= instead?
This would align with the values under /sys/kernel/mm/transparent_hugepage/
hugepages-*kB.

> 
> See Documentation/admin-guide/mm/transhuge.rst for more details.
> 
> Configuring the defaults at boot time is useful to allow early user
> space to take advantage of mTHP before it's been configured through
> sysfs.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |  8 +++
>  Documentation/admin-guide/mm/transhuge.rst    | 26 +++++++--
>  mm/huge_memory.c                              | 55 ++++++++++++++++++-
>  3 files changed, 82 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index bc55fb55cd26..48443ad12e3f 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -6592,6 +6592,14 @@
>  			<deci-seconds>: poll all this frequency
>  			0: no polling (default)
>  
> +	thp_anon=	[KNL]
> +			Format: <size>[KMG]:always|madvise|never|inherit
> +			Can be used to control the default behavior of the
> +			system with respect to anonymous transparent hugepages.
> +			Can be used multiple times for multiple anon THP sizes.
> +			See Documentation/admin-guide/mm/transhuge.rst for more
> +			details.
> +
>  	threadirqs	[KNL,EARLY]
>  			Force threading of all interrupt handlers except those
>  			marked explicitly IRQF_NO_THREAD.
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 1aaf8e3a0b5a..f53d43d986e2 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -311,13 +311,27 @@ performance.
>  Note that any changes to the allowed set of sizes only applies to future
>  file-backed THP allocations.
>  
> -Boot parameter
> -==============
> +Boot parameters
> +===============
>  
> -You can change the sysfs boot time defaults of Transparent Hugepage
> -Support by passing the parameter ``transparent_hugepage=always`` or
> -``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
> -to the kernel command line.
> +You can change the sysfs boot time default for the top-level "enabled"
> +control by passing the parameter ``transparent_hugepage=always`` or
> +``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the
> +kernel command line.
> +
> +Alternatively, each supported anonymous THP size can be controlled by
> +passing ``thp_anon=<size>[KMG]:<state>``, where ``<size>`` is the THP size
> +and ``<state>`` is one of ``always``, ``madvise``, ``never`` or
> +``inherit``.
> +
> +For example, the following will set 64K THP to ``always``::
> +
> +	thp_anon=64K:always
> +
> +``thp_anon=`` may be specified multiple times to configure all THP sizes as
> +required. If ``thp_anon=`` is specified at least once, any anon THP sizes
> +not explicitly configured on the command line are implicitly set to
> +``never``.
>  
>  Hugepages in tmpfs/shmem
>  ========================
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 4249c0bc9388..794d2790d90d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -82,6 +82,7 @@ unsigned long huge_anon_orders_madvise __read_mostly;
>  unsigned long huge_anon_orders_inherit __read_mostly;
>  unsigned long huge_file_orders_always __read_mostly;
>  int huge_file_exec_order __read_mostly = -1;
> +static bool anon_orders_configured;
>  
>  unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>  					 unsigned long vm_flags,
> @@ -763,7 +764,10 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
>  	 * disable all other sizes. powerpc's PMD_ORDER isn't a compile-time
>  	 * constant so we have to do this here.
>  	 */
> -	huge_anon_orders_inherit = BIT(PMD_ORDER);
> +	if (!anon_orders_configured) {
> +		huge_anon_orders_inherit = BIT(PMD_ORDER);

PMD_ORDER on systems with a 64K base page size would result in 512M folios, which
exceeds the xarray limit [1]. Therefore, I think we need to avoid PMD-size orders
by checking if PMD_ORDER > MAX_PAGECACHE_ORDER.

[1] https://lore.kernel.org/all/20240627003953.1262512-1-gshan@redhat.com/

> +		anon_orders_configured = true;
> +	}
>  
>  	/*
>  	 * For pagecache, default to enabling all orders. powerpc's PMD_ORDER
> @@ -955,6 +959,55 @@ static int __init setup_transparent_hugepage(char *str)
>  }
>  __setup("transparent_hugepage=", setup_transparent_hugepage);
>  
> +static int __init setup_thp_anon(char *str)
> +{
> +	unsigned long size;
> +	char *state;
> +	int order;
> +	int ret = 0;
> +
> +	if (!str)
> +		goto out;
> +
> +	size = (unsigned long)memparse(str, &state);
> +	order = ilog2(size >> PAGE_SHIFT);
> +	if (*state != ':' || !is_power_of_2(size) || size <= PAGE_SIZE ||
> +	    !(BIT(order) & THP_ORDERS_ALL_ANON))
> +		goto out;
> +
> +	state++;
> +
> +	if (!strcmp(state, "always")) {
> +		clear_bit(order, &huge_anon_orders_inherit);
> +		clear_bit(order, &huge_anon_orders_madvise);
> +		set_bit(order, &huge_anon_orders_always);
> +		ret = 1;
> +	} else if (!strcmp(state, "inherit")) {
> +		clear_bit(order, &huge_anon_orders_always);
> +		clear_bit(order, &huge_anon_orders_madvise);
> +		set_bit(order, &huge_anon_orders_inherit);
> +		ret = 1;
> +	} else if (!strcmp(state, "madvise")) {
> +		clear_bit(order, &huge_anon_orders_always);
> +		clear_bit(order, &huge_anon_orders_inherit);
> +		set_bit(order, &huge_anon_orders_madvise);
> +		ret = 1;
> +	} else if (!strcmp(state, "never")) {
> +		clear_bit(order, &huge_anon_orders_always);
> +		clear_bit(order, &huge_anon_orders_inherit);
> +		clear_bit(order, &huge_anon_orders_madvise);
> +		ret = 1;
> +	}
> +
> +	if (ret)
> +		anon_orders_configured = true;
> +out:
> +	if (!ret)
> +		pr_warn("thp_anon=%s: cannot parse, ignored\n", str);
> +	return ret;
> +}
> +__setup("thp_anon=", setup_thp_anon);
> +
>  pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
>  {
>  	if (likely(vma->vm_flags & VM_WRITE))
> -- 
> 2.43.0
> 


* Re: [RFC PATCH v1 0/4] Control folio sizes used for page cache memory
  2024-07-17 10:45   ` Ryan Roberts
  2024-07-17 14:25     ` David Hildenbrand
@ 2024-07-22  9:35     ` Daniel Gomez
  2024-07-22  9:43       ` Ryan Roberts
  1 sibling, 1 reply; 23+ messages in thread
From: Daniel Gomez @ 2024-07-22  9:35 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: David Hildenbrand, Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), Barry Song, Lance Yang, Baolin Wang,
	Gavin Shan, Pankaj Raghav, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org

On Wed, Jul 17, 2024 at 11:45:48AM GMT, Ryan Roberts wrote:
> On 17/07/2024 11:31, David Hildenbrand wrote:
> > On 17.07.24 09:12, Ryan Roberts wrote:
> >> Hi All,
> >>
> >> This series is an RFC that adds sysfs and kernel cmdline controls to configure
> >> the set of allowed large folio sizes that can be used when allocating
> >> file-memory for the page cache. As part of the control mechanism, it provides
> >> for a special-case "preferred folio size for executable mappings" marker.
> >>
> >> I'm trying to solve 2 separate problems with this series:
> >>
> >> 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
> >> approach for the change at [1]. Instead of hardcoding the preferred executable
> >> folio size into the arch, user space can now select it. This decouples the arch
> >> code and also makes the mechanism more generic; it can be bypassed (the default)
> >> or any folio size can be set. For my use case, 64K is preferred, but I've also
> >> heard from Willy of a use case where putting all text into 2M PMD-sized folios
> >> is preferred. This approach avoids the need for synchonous MADV_COLLAPSE (and
> >> therefore faulting in all text ahead of time) to achieve that.
> >>
> >> 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
> >> Android): The theory goes that if all folios are 64K, then failure to allocate a
> >> 64K folio should become unlikely. But if the page cache is allocating lots of
> >> different orders, with most allocations having an order below 64K (as is the
> >> case today) then ability to allocate 64K folios diminishes. By providing control
> >> over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
> >> allocation failure. Additionally I've heard (second hand) of the need to disable
> >> large folios in the page cache entirely due to latency concerns in some
> >> settings. These controls allow all of this without kernel changes.
> >>
> >> The value of (1) is clear and the performance improvements are documented in
> >> patch 2. I don't yet have any data demonstrating the theory for (2) since I
> >> can't reproduce the setup that Barry had at [2]. But my view is that by adding
> >> these controls we will enable the community to explore further, in the same way
> >> that the anon mTHP controls helped harden the understanding for anonymous
> >> memory.
> >>
> >> ---
> > 
> > How would this interact with other requirements we get from the filesystem (for
> > example, because of the device) [1].
> > 
> > Assuming a device has a filesystem has a min order of X, but we disable anything
> >>= X, how would we combine that configuration/information?
> 
> Currently order-0 is implicitly the "always-on" fallback order. My thinking was
> that with [1], the specified min order just becomes that "always-on" fallback order.
> 
> Today:
> 
>   orders = file_orders_always() | BIT(0);
> 
> Tomorrow:
> 
>   orders = (file_orders_always() & ~(BIT(min_order) - 1)) | BIT(min_order);
> 
> That does mean that in this case, a user-disabled order could still be used. So
> the controls are really hints rather than definitive commands.

In the scenario where a min order is not enabled in hugepages-<size>kB/
file_enabled, will the user still be allowed to automatically mkfs/mount with
blocksize=min_order, and will sysfs reflect this? Or, since it's a hint, will it
remain hidden but still allow mkfs/mount to proceed?

> 
> 
> > 
> > 
> > [1]
> > https://lore.kernel.org/all/20240715094457.452836-2-kernel@pankajraghav.com/T/#u
> > 
> 


* Re: [RFC PATCH v1 3/4] mm: Override mTHP "enabled" defaults at kernel cmdline
  2024-07-22  9:13   ` Daniel Gomez
@ 2024-07-22  9:36     ` Ryan Roberts
  2024-07-22 14:10       ` Ryan Roberts
  0 siblings, 1 reply; 23+ messages in thread
From: Ryan Roberts @ 2024-07-22  9:36 UTC (permalink / raw)
  To: Daniel Gomez
  Cc: Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), David Hildenbrand, Barry Song,
	Lance Yang, Baolin Wang, Gavin Shan, Pankaj Raghav,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

On 22/07/2024 10:13, Daniel Gomez wrote:
> On Wed, Jul 17, 2024 at 08:12:55AM GMT, Ryan Roberts wrote:
>> Add thp_anon= cmdline parameter to allow specifying the default
>> enablement of each supported anon THP size. The parameter accepts the
>> following format and can be provided multiple times to configure each
>> size:
>>
>> thp_anon=<size>[KMG]:<value>
> 
> Minor suggestion. Should this be renamed to hp_anon= or hugepages_anon= instead?
> This would align with the values under /sys/kernel/mm/transparent_hugepage/
> hugepages-*kB.

"hp" doesn't feel right; that's not an abbreviation we use today, to my knowledge.
But I'd be happy to change it to "hugepages_anon", if that's the consensus.

> 
>>
>> See Documentation/admin-guide/mm/transhuge.rst for more details.
>>
>> Configuring the defaults at boot time is useful to allow early user
>> space to take advantage of mTHP before it's been configured through
>> sysfs.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  .../admin-guide/kernel-parameters.txt         |  8 +++
>>  Documentation/admin-guide/mm/transhuge.rst    | 26 +++++++--
>>  mm/huge_memory.c                              | 55 ++++++++++++++++++-
>>  3 files changed, 82 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index bc55fb55cd26..48443ad12e3f 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -6592,6 +6592,14 @@
>>  			<deci-seconds>: poll all this frequency
>>  			0: no polling (default)
>>  
>> +	thp_anon=	[KNL]
>> +			Format: <size>[KMG]:always|madvise|never|inherit
>> +			Can be used to control the default behavior of the
>> +			system with respect to anonymous transparent hugepages.
>> +			Can be used multiple times for multiple anon THP sizes.
>> +			See Documentation/admin-guide/mm/transhuge.rst for more
>> +			details.
>> +
>>  	threadirqs	[KNL,EARLY]
>>  			Force threading of all interrupt handlers except those
>>  			marked explicitly IRQF_NO_THREAD.
>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>> index 1aaf8e3a0b5a..f53d43d986e2 100644
>> --- a/Documentation/admin-guide/mm/transhuge.rst
>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>> @@ -311,13 +311,27 @@ performance.
>>  Note that any changes to the allowed set of sizes only applies to future
>>  file-backed THP allocations.
>>  
>> -Boot parameter
>> -==============
>> +Boot parameters
>> +===============
>>  
>> -You can change the sysfs boot time defaults of Transparent Hugepage
>> -Support by passing the parameter ``transparent_hugepage=always`` or
>> -``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
>> -to the kernel command line.
>> +You can change the sysfs boot time default for the top-level "enabled"
>> +control by passing the parameter ``transparent_hugepage=always`` or
>> +``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the
>> +kernel command line.
>> +
>> +Alternatively, each supported anonymous THP size can be controlled by
>> +passing ``thp_anon=<size>[KMG]:<state>``, where ``<size>`` is the THP size
>> +and ``<state>`` is one of ``always``, ``madvise``, ``never`` or
>> +``inherit``.
>> +
>> +For example, the following will set 64K THP to ``always``::
>> +
>> +	thp_anon=64K:always
>> +
>> +``thp_anon=`` may be specified multiple times to configure all THP sizes as
>> +required. If ``thp_anon=`` is specified at least once, any anon THP sizes
>> +not explicitly configured on the command line are implicitly set to
>> +``never``.
>>  
>>  Hugepages in tmpfs/shmem
>>  ========================
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 4249c0bc9388..794d2790d90d 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -82,6 +82,7 @@ unsigned long huge_anon_orders_madvise __read_mostly;
>>  unsigned long huge_anon_orders_inherit __read_mostly;
>>  unsigned long huge_file_orders_always __read_mostly;
>>  int huge_file_exec_order __read_mostly = -1;
>> +static bool anon_orders_configured;
>>  
>>  unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>  					 unsigned long vm_flags,
>> @@ -763,7 +764,10 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
>>  	 * disable all other sizes. powerpc's PMD_ORDER isn't a compile-time
>>  	 * constant so we have to do this here.
>>  	 */
>> -	huge_anon_orders_inherit = BIT(PMD_ORDER);
>> +	if (!anon_orders_configured) {
>> +		huge_anon_orders_inherit = BIT(PMD_ORDER);
> 
> PMD_ORDER on systems with a 64K base page size would result in 512M folios, which
> exceeds the xarray limit [1]. Therefore, I think we need to avoid PMD-size orders
> by checking if PMD_ORDER > MAX_PAGECACHE_ORDER.

This is for anon memory, which isn't installed in the page cache, so it's
independent of MAX_PAGECACHE_ORDER. I don't believe there is a problem here.

> 
> [1] https://lore.kernel.org/all/20240627003953.1262512-1-gshan@redhat.com/
> 
>> +		anon_orders_configured = true;
>> +	}
>>  
>>  	/*
>>  	 * For pagecache, default to enabling all orders. powerpc's PMD_ORDER
>> @@ -955,6 +959,55 @@ static int __init setup_transparent_hugepage(char *str)
>>  }
>>  __setup("transparent_hugepage=", setup_transparent_hugepage);
>>  
>> +static int __init setup_thp_anon(char *str)
>> +{
>> +	unsigned long size;
>> +	char *state;
>> +	int order;
>> +	int ret = 0;
>> +
>> +	if (!str)
>> +		goto out;
>> +
>> +	size = (unsigned long)memparse(str, &state);
>> +	order = ilog2(size >> PAGE_SHIFT);
>> +	if (*state != ':' || !is_power_of_2(size) || size <= PAGE_SIZE ||
>> +	    !(BIT(order) & THP_ORDERS_ALL_ANON))
>> +		goto out;
>> +
>> +	state++;
>> +
>> +	if (!strcmp(state, "always")) {
>> +		clear_bit(order, &huge_anon_orders_inherit);
>> +		clear_bit(order, &huge_anon_orders_madvise);
>> +		set_bit(order, &huge_anon_orders_always);
>> +		ret = 1;
>> +	} else if (!strcmp(state, "inherit")) {
>> +		clear_bit(order, &huge_anon_orders_always);
>> +		clear_bit(order, &huge_anon_orders_madvise);
>> +		set_bit(order, &huge_anon_orders_inherit);
>> +		ret = 1;
>> +	} else if (!strcmp(state, "madvise")) {
>> +		clear_bit(order, &huge_anon_orders_always);
>> +		clear_bit(order, &huge_anon_orders_inherit);
>> +		set_bit(order, &huge_anon_orders_madvise);
>> +		ret = 1;
>> +	} else if (!strcmp(state, "never")) {
>> +		clear_bit(order, &huge_anon_orders_always);
>> +		clear_bit(order, &huge_anon_orders_inherit);
>> +		clear_bit(order, &huge_anon_orders_madvise);
>> +		ret = 1;
>> +	}
>> +
>> +	if (ret)
>> +		anon_orders_configured = true;
>> +out:
>> +	if (!ret)
>> +		pr_warn("thp_anon=%s: cannot parse, ignored\n", str);
>> +	return ret;
>> +}
>> +__setup("thp_anon=", setup_thp_anon);
>> +
>>  pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
>>  {
>>  	if (likely(vma->vm_flags & VM_WRITE))
>> -- 
>> 2.43.0




* Re: [RFC PATCH v1 0/4] Control folio sizes used for page cache memory
  2024-07-22  9:35     ` Daniel Gomez
@ 2024-07-22  9:43       ` Ryan Roberts
  0 siblings, 0 replies; 23+ messages in thread
From: Ryan Roberts @ 2024-07-22  9:43 UTC (permalink / raw)
  To: Daniel Gomez
  Cc: David Hildenbrand, Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), Barry Song, Lance Yang, Baolin Wang,
	Gavin Shan, Pankaj Raghav, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org

On 22/07/2024 10:35, Daniel Gomez wrote:
> On Wed, Jul 17, 2024 at 11:45:48AM GMT, Ryan Roberts wrote:
>> On 17/07/2024 11:31, David Hildenbrand wrote:
>>> On 17.07.24 09:12, Ryan Roberts wrote:
>>>> Hi All,
>>>>
>>>> This series is an RFC that adds sysfs and kernel cmdline controls to configure
>>>> the set of allowed large folio sizes that can be used when allocating
>>>> file-memory for the page cache. As part of the control mechanism, it provides
>>>> for a special-case "preferred folio size for executable mappings" marker.
>>>>
>>>> I'm trying to solve 2 separate problems with this series:
>>>>
>>>> 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
>>>> approach for the change at [1]. Instead of hardcoding the preferred executable
>>>> folio size into the arch, user space can now select it. This decouples the arch
>>>> code and also makes the mechanism more generic; it can be bypassed (the default)
>>>> or any folio size can be set. For my use case, 64K is preferred, but I've also
>>>> heard from Willy of a use case where putting all text into 2M PMD-sized folios
>>>> is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
>>>> therefore faulting in all text ahead of time) to achieve that.
>>>>
>>>> 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
>>>> Android): The theory goes that if all folios are 64K, then failure to allocate a
>>>> 64K folio should become unlikely. But if the page cache is allocating lots of
>>>> different orders, with most allocations having an order below 64K (as is the
>>>> case today) then ability to allocate 64K folios diminishes. By providing control
>>>> over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
>>>> allocation failure. Additionally I've heard (second hand) of the need to disable
>>>> large folios in the page cache entirely due to latency concerns in some
>>>> settings. These controls allow all of this without kernel changes.
>>>>
>>>> The value of (1) is clear and the performance improvements are documented in
>>>> patch 2. I don't yet have any data demonstrating the theory for (2) since I
>>>> can't reproduce the setup that Barry had at [2]. But my view is that by adding
>>>> these controls we will enable the community to explore further, in the same way
>>>> that the anon mTHP controls helped harden the understanding for anonymous
>>>> memory.
>>>>
>>>> ---
>>>
>>> How would this interact with other requirements we get from the filesystem (for
>>> example, because of the device) [1].
>>>
>>> Assuming a device has a filesystem has a min order of X, but we disable anything
>>>> = X, how would we combine that configuration/information?
>>
>> Currently order-0 is implicitly the "always-on" fallback order. My thinking was
>> that with [1], the specified min order just becomes that "always-on" fallback order.
>>
>> Today:
>>
>>   orders = file_orders_always() | BIT(0);
>>
>> Tomorrow:
>>
>>   orders = (file_orders_always() & ~(BIT(min_order) - 1)) | BIT(min_order);
>>
>> That does mean that in this case, a user-disabled order could still be used. So
>> the controls are really hints rather than definitive commands.
> 
> In the scenario where a min order is not enabled in hugepages-<size>kB/
> file_enabled, will the user still be allowed to automatically mkfs/mount with
> blocksize=min_order, and will sysfs reflect this? Or, since it's a hint, will it
> remain hidden but still allow mkfs/mount to proceed?

My proposal is that the controls are hints, and they would not block mounting a
file system.

As an example, the user may set
`/sys/kernel/mm/transparent_hugepage/hugepages-16kB/file_enable` to `never`. In
this case the kernel would never pick a 16K folio to back a file whose minimum
folio size is not 16K. If the file's minimum folio size is 16K then it would
still allocate that folio size in the fallback case, after trying any
appropriate bigger folio sizes that are set to `always`.

Thanks,
Ryan

> 
>>
>>
>>>
>>>
>>> [1]
>>> https://lore.kernel.org/all/20240715094457.452836-2-kernel@pankajraghav.com/T/#u
>>>




* Re: [RFC PATCH v1 3/4] mm: Override mTHP "enabled" defaults at kernel cmdline
  2024-07-22  9:36     ` Ryan Roberts
@ 2024-07-22 14:10       ` Ryan Roberts
  0 siblings, 0 replies; 23+ messages in thread
From: Ryan Roberts @ 2024-07-22 14:10 UTC (permalink / raw)
  To: Daniel Gomez
  Cc: Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), David Hildenbrand, Barry Song,
	Lance Yang, Baolin Wang, Gavin Shan, Pankaj Raghav,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

On 22/07/2024 10:36, Ryan Roberts wrote:
> On 22/07/2024 10:13, Daniel Gomez wrote:
>> On Wed, Jul 17, 2024 at 08:12:55AM GMT, Ryan Roberts wrote:
>>> Add thp_anon= cmdline parameter to allow specifying the default
>>> enablement of each supported anon THP size. The parameter accepts the
>>> following format and can be provided multiple times to configure each
>>> size:
>>>
>>> thp_anon=<size>[KMG]:<value>
>>
>> Minor suggestion. Should this be renamed to hp_anon= or hugepages_anon= instead?
>> This would align with the values under /sys/kernel/mm/transparent_hugepage/
>> hugepages-*kB.
> 
> "hp" doesn't feel right; that's not an abbreviation we use today, to my knowledge.
> But I'd be happy to change it to "hugepages_anon", if that's the consensus.

Thinking about this a bit more, "hugepages=" is already a cmdline parameter used
to reserve hugepages for use with HugeTLB. So I think that could get confusing.

transparent_hugepage= is the existing cmdline parameter for the top-level (anon)
control. I considered "transparent_hugepage_anon=" or even just extending to use
the same parameter for both the top level and the per-size controls (with
optional size):

  transparent_hugepage=[<size>[KMG]:]<value>

But given they likely need to be provided multiple times, both of those options
seem too long. Which is how I settled on thp_anon= (and in the next patch,
thp_file=).

> 
>>
>>>
>>> See Documentation/admin-guide/mm/transhuge.rst for more details.
>>>
>>> Configuring the defaults at boot time is useful to allow early user
>>> space to take advantage of mTHP before it's been configured through
>>> sysfs.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>  .../admin-guide/kernel-parameters.txt         |  8 +++
>>>  Documentation/admin-guide/mm/transhuge.rst    | 26 +++++++--
>>>  mm/huge_memory.c                              | 55 ++++++++++++++++++-
>>>  3 files changed, 82 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>> index bc55fb55cd26..48443ad12e3f 100644
>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>> @@ -6592,6 +6592,14 @@
>>>  			<deci-seconds>: poll all this frequency
>>>  			0: no polling (default)
>>>  
>>> +	thp_anon=	[KNL]
>>> +			Format: <size>[KMG]:always|madvise|never|inherit
>>> +			Can be used to control the default behavior of the
>>> +			system with respect to anonymous transparent hugepages.
>>> +			Can be used multiple times for multiple anon THP sizes.
>>> +			See Documentation/admin-guide/mm/transhuge.rst for more
>>> +			details.
>>> +
>>>  	threadirqs	[KNL,EARLY]
>>>  			Force threading of all interrupt handlers except those
>>>  			marked explicitly IRQF_NO_THREAD.
>>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>>> index 1aaf8e3a0b5a..f53d43d986e2 100644
>>> --- a/Documentation/admin-guide/mm/transhuge.rst
>>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>>> @@ -311,13 +311,27 @@ performance.
>>>  Note that any changes to the allowed set of sizes only applies to future
>>>  file-backed THP allocations.
>>>  
>>> -Boot parameter
>>> -==============
>>> +Boot parameters
>>> +===============
>>>  
>>> -You can change the sysfs boot time defaults of Transparent Hugepage
>>> -Support by passing the parameter ``transparent_hugepage=always`` or
>>> -``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
>>> -to the kernel command line.
>>> +You can change the sysfs boot time default for the top-level "enabled"
>>> +control by passing the parameter ``transparent_hugepage=always`` or
>>> +``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the
>>> +kernel command line.
>>> +
>>> +Alternatively, each supported anonymous THP size can be controlled by
>>> +passing ``thp_anon=<size>[KMG]:<state>``, where ``<size>`` is the THP size
>>> +and ``<state>`` is one of ``always``, ``madvise``, ``never`` or
>>> +``inherit``.
>>> +
>>> +For example, the following will set 64K THP to ``always``::
>>> +
>>> +	thp_anon=64K:always
>>> +
>>> +``thp_anon=`` may be specified multiple times to configure all THP sizes as
>>> +required. If ``thp_anon=`` is specified at least once, any anon THP sizes
>>> +not explicitly configured on the command line are implicitly set to
>>> +``never``.
>>>  
>>>  Hugepages in tmpfs/shmem
>>>  ========================
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 4249c0bc9388..794d2790d90d 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -82,6 +82,7 @@ unsigned long huge_anon_orders_madvise __read_mostly;
>>>  unsigned long huge_anon_orders_inherit __read_mostly;
>>>  unsigned long huge_file_orders_always __read_mostly;
>>>  int huge_file_exec_order __read_mostly = -1;
>>> +static bool anon_orders_configured;
>>>  
>>>  unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>  					 unsigned long vm_flags,
>>> @@ -763,7 +764,10 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
>>>  	 * disable all other sizes. powerpc's PMD_ORDER isn't a compile-time
>>>  	 * constant so we have to do this here.
>>>  	 */
>>> -	huge_anon_orders_inherit = BIT(PMD_ORDER);
>>> +	if (!anon_orders_configured) {
>>> +		huge_anon_orders_inherit = BIT(PMD_ORDER);
>>
>> PMD_ORDER on systems with a 64K base page size would result in 512M folios, which
>> exceeds the xarray limit [1]. Therefore, I think we need to avoid PMD-size orders
>> by checking if PMD_ORDER > MAX_PAGECACHE_ORDER.
> 
> This is for anon memory, which isn't installed in the page cache, so it's
> independent of MAX_PAGECACHE_ORDER. I don't believe there is a problem here.
> 
>>
>> [1] https://lore.kernel.org/all/20240627003953.1262512-1-gshan@redhat.com/
>>
>>> +		anon_orders_configured = true;
>>> +	}
>>>  
>>>  	/*
>>>  	 * For pagecache, default to enabling all orders. powerpc's PMD_ORDER
>>> @@ -955,6 +959,55 @@ static int __init setup_transparent_hugepage(char *str)
>>>  }
>>>  __setup("transparent_hugepage=", setup_transparent_hugepage);
>>>  
>>> +static int __init setup_thp_anon(char *str)
>>> +{
>>> +	unsigned long size;
>>> +	char *state;
>>> +	int order;
>>> +	int ret = 0;
>>> +
>>> +	if (!str)
>>> +		goto out;
>>> +
>>> +	size = (unsigned long)memparse(str, &state);
>>> +	order = ilog2(size >> PAGE_SHIFT);
>>> +	if (*state != ':' || !is_power_of_2(size) || size <= PAGE_SIZE ||
>>> +	    !(BIT(order) & THP_ORDERS_ALL_ANON))
>>> +		goto out;
>>> +
>>> +	state++;
>>> +
>>> +	if (!strcmp(state, "always")) {
>>> +		clear_bit(order, &huge_anon_orders_inherit);
>>> +		clear_bit(order, &huge_anon_orders_madvise);
>>> +		set_bit(order, &huge_anon_orders_always);
>>> +		ret = 1;
>>> +	} else if (!strcmp(state, "inherit")) {
>>> +		clear_bit(order, &huge_anon_orders_always);
>>> +		clear_bit(order, &huge_anon_orders_madvise);
>>> +		set_bit(order, &huge_anon_orders_inherit);
>>> +		ret = 1;
>>> +	} else if (!strcmp(state, "madvise")) {
>>> +		clear_bit(order, &huge_anon_orders_always);
>>> +		clear_bit(order, &huge_anon_orders_inherit);
>>> +		set_bit(order, &huge_anon_orders_madvise);
>>> +		ret = 1;
>>> +	} else if (!strcmp(state, "never")) {
>>> +		clear_bit(order, &huge_anon_orders_always);
>>> +		clear_bit(order, &huge_anon_orders_inherit);
>>> +		clear_bit(order, &huge_anon_orders_madvise);
>>> +		ret = 1;
>>> +	}
>>> +
>>> +	if (ret)
>>> +		anon_orders_configured = true;
>>> +out:
>>> +	if (!ret)
>>> +		pr_warn("thp_anon=%s: cannot parse, ignored\n", str);
>>> +	return ret;
>>> +}
>>> +__setup("thp_anon=", setup_thp_anon);
>>> +
>>>  pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
>>>  {
>>>  	if (likely(vma->vm_flags & VM_WRITE))
>>> -- 
>>> 2.43.0
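
The bitmask bookkeeping in the quoted setup_thp_anon() can be modeled in a
few lines of Python (a hedged sketch, not kernel code; the dict stands in
for the three huge_anon_orders_* bitmaps, and the example orders assume 4K
base pages):

```python
# Each of the three bitmaps holds one bit per folio order; assigning an
# order to a state clears its bit from the other bitmaps, and "never"
# clears it everywhere.
def set_anon_state(masks, order, state):
    for key in masks:                    # clear the order everywhere first
        masks[key] &= ~(1 << order)
    if state != "never":                 # "never" leaves all bits clear
        masks[state] |= 1 << order

masks = {"always": 0, "inherit": 0, "madvise": 0}
set_anon_state(masks, 4, "always")       # thp_anon=64K:always
set_anon_state(masks, 9, "madvise")      # thp_anon=2M:madvise
```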
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 0/4] Control folio sizes used for page cache memory
       [not found] ` <480f34d0-a943-40da-9c69-2353fe311cf7@arm.com>
@ 2024-09-19  8:20   ` Barry Song
  2024-09-19 17:21     ` Ryan Roberts
  2024-12-06  5:09     ` Barry Song
  0 siblings, 2 replies; 23+ messages in thread
From: Barry Song @ 2024-09-19  8:20 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), David Hildenbrand, Lance Yang,
	Baolin Wang, Gavin Shan, Pankaj Raghav, Daniel Gomez,
	linux-kernel, linux-mm

On Thu, Aug 8, 2024 at 10:27 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 17/07/2024 08:12, Ryan Roberts wrote:
> > Hi All,
> >
> > This series is an RFC that adds sysfs and kernel cmdline controls to configure
> > the set of allowed large folio sizes that can be used when allocating
> > file-memory for the page cache. As part of the control mechanism, it provides
> > for a special-case "preferred folio size for executable mappings" marker.
> >
> > I'm trying to solve 2 separate problems with this series:
> >
> > 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
> > approach for the change at [1]. Instead of hardcoding the preferred executable
> > folio size into the arch, user space can now select it. This decouples the arch
> > code and also makes the mechanism more generic; it can be bypassed (the default)
> > or any folio size can be set. For my use case, 64K is preferred, but I've also
> > heard from Willy of a use case where putting all text into 2M PMD-sized folios
> > is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
> > therefore faulting in all text ahead of time) to achieve that.
>
> Just a polite bump on this; I'd really like to get something like this merged to
> help reduce iTLB pressure. We had a discussion at the THP Cabal meeting a few
> weeks back without solid conclusion. I haven't heard any concrete objections
> yet, but also only a lukewarm reception. How can I move this forwards?

Hi Ryan,

These requirements seem to apply to anon, swap, pagecache, and shmem to
some extent. While the swapin_enabled knob was rejected, the shmem_enabled
option is already in place.

I wonder if it's possible to use the existing 'enabled' setting across
all cases, as
from an architectural perspective with cont-pte, pagecache may not differ from
anon. The demand for reducing page faults, LRU overhead, etc., also seems
quite similar.

I imagine that once Android's file systems support mTHP, we’ll uniformly enable
64KB for anon, swap, shmem, and page cache. It should then be sufficient to
enable all of them using a single knob:
'/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/enabled'.

Is there anything that makes pagecache and shmem significantly different
from anon? In my Android case, they all seem the same. However, I assume
there might be other use cases where differentiating them is necessary?

>
> Thanks,
> Ryan
>
>
> >
> > 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
> > Android): The theory goes that if all folios are 64K, then failure to allocate a
> > 64K folio should become unlikely. But if the page cache is allocating lots of
> > different orders, with most allocations having an order below 64K (as is the
> > case today) then ability to allocate 64K folios diminishes. By providing control
> > over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
> > allocation failure. Additionally I've heard (second hand) of the need to disable
> > large folios in the page cache entirely due to latency concerns in some
> > settings. These controls allow all of this without kernel changes.
> >
> > The value of (1) is clear and the performance improvements are documented in
> > patch 2. I don't yet have any data demonstrating the theory for (2) since I
> > can't reproduce the setup that Barry had at [2]. But my view is that by adding
> > these controls we will enable the community to explore further, in the same way
> > that the anon mTHP controls helped harden the understanding for anonymous
> > memory.
> >
> > ---
> > This series depends on the "mTHP allocation stats for file-backed memory" series
> > at [3], which itself applies on top of yesterday's mm-unstable (650b6752c8a3). All
> > mm selftests have been run; no regressions were observed.
> >
> > [1] https://lore.kernel.org/linux-mm/20240215154059.2863126-1-ryan.roberts@arm.com/
> > [2] https://www.youtube.com/watch?v=ht7eGWqwmNs&list=PLbzoR-pLrL6oj1rVTXLnV7cOuetvjKn9q&index=4
> > [3] https://lore.kernel.org/linux-mm/20240716135907.4047689-1-ryan.roberts@arm.com/
> >
> > Thanks,
> > Ryan
> >
> > Ryan Roberts (4):
> >   mm: mTHP user controls to configure pagecache large folio sizes
> >   mm: Introduce "always+exec" for mTHP file_enabled control
> >   mm: Override mTHP "enabled" defaults at kernel cmdline
> >   mm: Override mTHP "file_enabled" defaults at kernel cmdline
> >
> >  .../admin-guide/kernel-parameters.txt         |  16 ++
> >  Documentation/admin-guide/mm/transhuge.rst    |  66 +++++++-
> >  include/linux/huge_mm.h                       |  61 ++++---
> >  mm/filemap.c                                  |  26 ++-
> >  mm/huge_memory.c                              | 158 +++++++++++++++++-
> >  mm/readahead.c                                |  43 ++++-
> >  6 files changed, 329 insertions(+), 41 deletions(-)
> >
> > --
> > 2.43.0
> >
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 0/4] Control folio sizes used for page cache memory
  2024-09-19  8:20   ` Barry Song
@ 2024-09-19 17:21     ` Ryan Roberts
  2024-12-06  5:09     ` Barry Song
  1 sibling, 0 replies; 23+ messages in thread
From: Ryan Roberts @ 2024-09-19 17:21 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), David Hildenbrand, Lance Yang,
	Baolin Wang, Gavin Shan, Pankaj Raghav, Daniel Gomez,
	linux-kernel, linux-mm

On 19/09/2024 09:20, Barry Song wrote:
> On Thu, Aug 8, 2024 at 10:27 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 17/07/2024 08:12, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> This series is an RFC that adds sysfs and kernel cmdline controls to configure
>>> the set of allowed large folio sizes that can be used when allocating
>>> file-memory for the page cache. As part of the control mechanism, it provides
>>> for a special-case "preferred folio size for executable mappings" marker.
>>>
>>> I'm trying to solve 2 separate problems with this series:
>>>
>>> 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
>>> approach for the change at [1]. Instead of hardcoding the preferred executable
>>> folio size into the arch, user space can now select it. This decouples the arch
>>> code and also makes the mechanism more generic; it can be bypassed (the default)
>>> or any folio size can be set. For my use case, 64K is preferred, but I've also
>>> heard from Willy of a use case where putting all text into 2M PMD-sized folios
>>> is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
>>> therefore faulting in all text ahead of time) to achieve that.
>>
>> Just a polite bump on this; I'd really like to get something like this merged to
>> help reduce iTLB pressure. We had a discussion at the THP Cabal meeting a few
>> weeks back without solid conclusion. I haven't heard any concrete objections
>> yet, but also only a lukewarm reception. How can I move this forwards?
> 
> Hi Ryan,
> 
> These requirements seem to apply to anon, swap, pagecache, and shmem to
> some extent. While the swapin_enabled knob was rejected, the shmem_enabled
> option is already in place.
> 
> I wonder if it's possible to use the existing 'enabled' setting across
> all cases, as
> from an architectural perspective with cont-pte, pagecache may not differ from
> anon. The demand for reducing page faults, LRU overhead, etc., also seems
> quite similar.
> 
> I imagine that once Android's file systems support mTHP, we’ll uniformly enable
> 64KB for anon, swap, shmem, and page cache. It should then be sufficient to
> enable all of them using a single knob:
> '/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/enabled'.
> 
> Is there anything that makes pagecache and shmem significantly different
> from anon? In my Android case, they all seem the same. However, I assume
> there might be other use cases where differentiating them is necessary?

For anon vs shmem, we were just following the precedent set by the legacy PMD
controls, which separated them. I vaguely recall David explaining why there are
separate controls but don't recall the exact reason; I believe there was some
use case where anon THP made sense, but shmem THP was problematic for some
reason. Note too that the controls expose different options; anon has {always,
never, madvise}, shmem has {always, never, advise (no m; it applies to fadvise
too), within_size, force, deny}. So I guess if the extra shmem options are
important then it makes sense to have a separate control.

For pagecache vs anon, I'm not sure it makes sense to tie these to the same
control. We have readahead information to help us make an educated guess at the
folio size we should use (currently we start at order-2 and increase by 2 orders
every time we hit the readahead marker) and it's much easier to drop pagecache
folios under memory pressure. So by default, I think most/all orders would be
enabled for pagecache. But for anon, things are harder. In the common case,
likely we only want 2M when madvised, and 64K always (and possibly 16K always).
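
The readahead ramp described above can be sketched as follows (the order-2
start and step of 2 are from the description; the cap is an assumption here,
PMD order 9 for 4K pages):

```python
# Pagecache readahead folio-order ramp: start at order 2 and add 2 each
# time the readahead marker is hit, clamped to a maximum order.
def readahead_orders(marker_hits, start=2, step=2, cap=9):
    orders, order = [], start
    for _ in range(marker_hits + 1):   # initial allocation + each marker hit
        orders.append(order)
        order = min(order + step, cap)
    return orders

# Four marker hits ramp the allocation orders 2 -> 4 -> 6 -> 8 -> 9
```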

Talking with Willy today, his preference is to not expose any controls for
pagecache at all, and let the architecture hint the preferred folio size for
code - basically how I did it at [1] - linked in the original post. This is very
simple and exposes no user controls so could be easily modified over time as we
get more data.

Trouble is, nobody seemed willing to R-b the first approach. So perhaps we're
stuck waiting for Android's FSs to support large folios so we can start
benchmarking the real-world gains?

Thanks,
Ryan

> 
>>
>> Thanks,
>> Ryan
>>
>>
>>>
>>> 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
>>> Android): The theory goes that if all folios are 64K, then failure to allocate a
>>> 64K folio should become unlikely. But if the page cache is allocating lots of
>>> different orders, with most allocations having an order below 64K (as is the
>>> case today) then ability to allocate 64K folios diminishes. By providing control
>>> over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
>>> allocation failure. Additionally I've heard (second hand) of the need to disable
>>> large folios in the page cache entirely due to latency concerns in some
>>> settings. These controls allow all of this without kernel changes.
>>>
>>> The value of (1) is clear and the performance improvements are documented in
>>> patch 2. I don't yet have any data demonstrating the theory for (2) since I
>>> can't reproduce the setup that Barry had at [2]. But my view is that by adding
>>> these controls we will enable the community to explore further, in the same way
>>> that the anon mTHP controls helped harden the understanding for anonymous
>>> memory.
>>>
>>> ---
>>> This series depends on the "mTHP allocation stats for file-backed memory" series
>>> at [3], which itself applies on top of yesterday's mm-unstable (650b6752c8a3). All
>>> mm selftests have been run; no regressions were observed.
>>>
>>> [1] https://lore.kernel.org/linux-mm/20240215154059.2863126-1-ryan.roberts@arm.com/
>>> [2] https://www.youtube.com/watch?v=ht7eGWqwmNs&list=PLbzoR-pLrL6oj1rVTXLnV7cOuetvjKn9q&index=4
>>> [3] https://lore.kernel.org/linux-mm/20240716135907.4047689-1-ryan.roberts@arm.com/
>>>
>>> Thanks,
>>> Ryan
>>>
>>> Ryan Roberts (4):
>>>   mm: mTHP user controls to configure pagecache large folio sizes
>>>   mm: Introduce "always+exec" for mTHP file_enabled control
>>>   mm: Override mTHP "enabled" defaults at kernel cmdline
>>>   mm: Override mTHP "file_enabled" defaults at kernel cmdline
>>>
>>>  .../admin-guide/kernel-parameters.txt         |  16 ++
>>>  Documentation/admin-guide/mm/transhuge.rst    |  66 +++++++-
>>>  include/linux/huge_mm.h                       |  61 ++++---
>>>  mm/filemap.c                                  |  26 ++-
>>>  mm/huge_memory.c                              | 158 +++++++++++++++++-
>>>  mm/readahead.c                                |  43 ++++-
>>>  6 files changed, 329 insertions(+), 41 deletions(-)
>>>
>>> --
>>> 2.43.0
>>>
>>
> 
> Thanks
> Barry



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 0/4] Control folio sizes used for page cache memory
  2024-09-19  8:20   ` Barry Song
  2024-09-19 17:21     ` Ryan Roberts
@ 2024-12-06  5:09     ` Barry Song
  2024-12-06  5:29       ` Baolin Wang
  1 sibling, 1 reply; 23+ messages in thread
From: Barry Song @ 2024-12-06  5:09 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), David Hildenbrand, Lance Yang,
	Baolin Wang, Gavin Shan, Pankaj Raghav, Daniel Gomez,
	linux-kernel, linux-mm

It's unusual that many emails sent days ago are resurfacing on LKML.
Please ignore them.
By the way, does anyone know what happened?

On Fri, Dec 6, 2024 at 5:12 AM Barry Song <baohua@kernel.org> wrote:
>
> On Thu, Aug 8, 2024 at 10:27 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > On 17/07/2024 08:12, Ryan Roberts wrote:
> > > Hi All,
> > >
> > > This series is an RFC that adds sysfs and kernel cmdline controls to configure
> > > the set of allowed large folio sizes that can be used when allocating
> > > file-memory for the page cache. As part of the control mechanism, it provides
> > > for a special-case "preferred folio size for executable mappings" marker.
> > >
> > > I'm trying to solve 2 separate problems with this series:
> > >
> > > 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
> > > approach for the change at [1]. Instead of hardcoding the preferred executable
> > > folio size into the arch, user space can now select it. This decouples the arch
> > > code and also makes the mechanism more generic; it can be bypassed (the default)
> > > or any folio size can be set. For my use case, 64K is preferred, but I've also
> > > heard from Willy of a use case where putting all text into 2M PMD-sized folios
> > > is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
> > > therefore faulting in all text ahead of time) to achieve that.
> >
> > Just a polite bump on this; I'd really like to get something like this merged to
> > help reduce iTLB pressure. We had a discussion at the THP Cabal meeting a few
> > weeks back without solid conclusion. I haven't heard any concrete objections
> > yet, but also only a lukewarm reception. How can I move this forwards?
>
> Hi Ryan,
>
> These requirements seem to apply to anon, swap, pagecache, and shmem to
> some extent. While the swapin_enabled knob was rejected, the shmem_enabled
> option is already in place.
>
> I wonder if it's possible to use the existing 'enabled' setting across
> all cases, as
> from an architectural perspective with cont-pte, pagecache may not differ from
> anon. The demand for reducing page faults, LRU overhead, etc., also seems
> quite similar.
>
> I imagine that once Android's file systems support mTHP, we’ll uniformly enable
> 64KB for anon, swap, shmem, and page cache. It should then be sufficient to
> enable all of them using a single knob:
> '/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/enabled'.
>
> Is there anything that makes pagecache and shmem significantly different
> from anon? In my Android case, they all seem the same. However, I assume
> there might be other use cases where differentiating them is necessary?
>
> >
> > Thanks,
> > Ryan
> >
> >
> > >
> > > 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
> > > Android): The theory goes that if all folios are 64K, then failure to allocate a
> > > 64K folio should become unlikely. But if the page cache is allocating lots of
> > > different orders, with most allocations having an order below 64K (as is the
> > > case today) then ability to allocate 64K folios diminishes. By providing control
> > > over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
> > > allocation failure. Additionally I've heard (second hand) of the need to disable
> > > large folios in the page cache entirely due to latency concerns in some
> > > settings. These controls allow all of this without kernel changes.
> > >
> > > The value of (1) is clear and the performance improvements are documented in
> > > patch 2. I don't yet have any data demonstrating the theory for (2) since I
> > > can't reproduce the setup that Barry had at [2]. But my view is that by adding
> > > these controls we will enable the community to explore further, in the same way
> > > that the anon mTHP controls helped harden the understanding for anonymous
> > > memory.
> > >
> > > ---
> > > This series depends on the "mTHP allocation stats for file-backed memory" series
> > > at [3], which itself applies on top of yesterday's mm-unstable (650b6752c8a3). All
> > > mm selftests have been run; no regressions were observed.
> > >
> > > [1] https://lore.kernel.org/linux-mm/20240215154059.2863126-1-ryan.roberts@arm.com/
> > > [2] https://www.youtube.com/watch?v=ht7eGWqwmNs&list=PLbzoR-pLrL6oj1rVTXLnV7cOuetvjKn9q&index=4
> > > [3] https://lore.kernel.org/linux-mm/20240716135907.4047689-1-ryan.roberts@arm.com/
> > >
> > > Thanks,
> > > Ryan
> > >
> > > Ryan Roberts (4):
> > >   mm: mTHP user controls to configure pagecache large folio sizes
> > >   mm: Introduce "always+exec" for mTHP file_enabled control
> > >   mm: Override mTHP "enabled" defaults at kernel cmdline
> > >   mm: Override mTHP "file_enabled" defaults at kernel cmdline
> > >
> > >  .../admin-guide/kernel-parameters.txt         |  16 ++
> > >  Documentation/admin-guide/mm/transhuge.rst    |  66 +++++++-
> > >  include/linux/huge_mm.h                       |  61 ++++---
> > >  mm/filemap.c                                  |  26 ++-
> > >  mm/huge_memory.c                              | 158 +++++++++++++++++-
> > >  mm/readahead.c                                |  43 ++++-
> > >  6 files changed, 329 insertions(+), 41 deletions(-)
> > >
> > > --
> > > 2.43.0
> > >
> >
>
> Thanks
> Barry
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 0/4] Control folio sizes used for page cache memory
  2024-12-06  5:09     ` Barry Song
@ 2024-12-06  5:29       ` Baolin Wang
  0 siblings, 0 replies; 23+ messages in thread
From: Baolin Wang @ 2024-12-06  5:29 UTC (permalink / raw)
  To: Barry Song, Ryan Roberts
  Cc: Andrew Morton, Hugh Dickins, Jonathan Corbet,
	Matthew Wilcox (Oracle), David Hildenbrand, Lance Yang,
	Gavin Shan, Pankaj Raghav, Daniel Gomez, linux-kernel, linux-mm



On 2024/12/6 13:09, Barry Song wrote:
> It's unusual that many emails sent days ago are resurfacing on LKML.
> Please ignore them.
> By the way, does anyone know what happened?

I also received many previous emails; it seems that a change to the
mailing list spam filtering rules by the owner of linux-mm@kvack.org
caused this.

See: https://lore.kernel.org/all/20241205154213.GA5247@kvack.org/


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2024-12-06  5:29 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-17  7:12 [RFC PATCH v1 0/4] Control folio sizes used for page cache memory Ryan Roberts
2024-07-17  7:12 ` [RFC PATCH v1 1/4] mm: mTHP user controls to configure pagecache large folio sizes Ryan Roberts
2024-07-17  7:12 ` [RFC PATCH v1 2/4] mm: Introduce "always+exec" for mTHP file_enabled control Ryan Roberts
2024-07-17 17:10   ` Ryan Roberts
2024-07-17  7:12 ` [RFC PATCH v1 3/4] mm: Override mTHP "enabled" defaults at kernel cmdline Ryan Roberts
2024-07-19  0:46   ` Barry Song
2024-07-19  7:47     ` Ryan Roberts
2024-07-19  7:52       ` Barry Song
2024-07-19  8:18         ` Ryan Roberts
2024-07-19  8:29         ` David Hildenbrand
2024-07-22  9:13   ` Daniel Gomez
2024-07-22  9:36     ` Ryan Roberts
2024-07-22 14:10       ` Ryan Roberts
2024-07-17  7:12 ` [RFC PATCH v1 4/4] mm: Override mTHP "file_enabled" " Ryan Roberts
2024-07-17 10:31 ` [RFC PATCH v1 0/4] Control folio sizes used for page cache memory David Hildenbrand
2024-07-17 10:45   ` Ryan Roberts
2024-07-17 14:25     ` David Hildenbrand
2024-07-22  9:35     ` Daniel Gomez
2024-07-22  9:43       ` Ryan Roberts
     [not found] ` <480f34d0-a943-40da-9c69-2353fe311cf7@arm.com>
2024-09-19  8:20   ` Barry Song
2024-09-19 17:21     ` Ryan Roberts
2024-12-06  5:09     ` Barry Song
2024-12-06  5:29       ` Baolin Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).