* [RFC PATCH v3 0/4] Support large folios for tmpfs
From: Baolin Wang @ 2024-10-10 9:58 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, baolin.wang, linux-mm, linux-kernel
Hi,
This RFC patch series attempts to support large folios for tmpfs.
Considering that tmpfs already has the 'huge=' option to control THP
allocation, it is necessary to maintain compatibility with the 'huge='
option, as well as with the 'deny' and 'force' options controlled by
'/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
Add a new huge option, 'write_size', to support large folio allocation based
on the write size for the tmpfs write and fallocate paths. The huge page
allocation strategy for tmpfs is then as follows: if the 'huge=' option is
enabled (huge=always/within_size/advise) or the 'shmem_enabled' option is
'force', only PMD-sized THP is allowed, to keep backward compatibility for
tmpfs. If the 'huge=' option is disabled (huge=never) or the 'shmem_enabled'
option is 'deny', all large folio allocations remain disabled. Only when the
'huge=' option is 'write_size' are large folios allocated based on the write
size.
I think the 'huge=write_size' option should become the default behavior
for tmpfs in the future.
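For illustration, the policy above can be sketched as follows (a minimal
stand-alone paraphrase, not the actual kernel code; it assumes 4K base pages
and a 2M PMD, and the constant and helper names are simplified):

#include <stddef.h>

enum { HUGE_NEVER, HUGE_ALWAYS, HUGE_WITHIN_SIZE, HUGE_ADVISE, HUGE_WRITE_SIZE };

#define PAGE_SHIFT	12
#define PMD_ORDER	9			/* 2M PMD with 4K pages */
#define BIT(n)		(1UL << (n))

static unsigned long allowable_orders(int huge_opt, int sysfs_force,
				      int sysfs_deny, size_t write_size)
{
	unsigned int shift;

	if (sysfs_deny || huge_opt == HUGE_NEVER)
		return 0;			/* no large folios at all */
	if (sysfs_force || huge_opt != HUGE_WRITE_SIZE)
		return BIT(PMD_ORDER);		/* PMD-sized THP only */
	/* huge=write_size: every order up to ilog2(write_size) - PAGE_SHIFT */
	shift = 63 - __builtin_clzl(write_size | 1);	/* ilog2() */
	return shift > PAGE_SHIFT ? BIT(shift - PAGE_SHIFT + 1) - 1 : 0;
}

For example, a 1M write allows orders 0-8 (mask 0x1ff), while a 4K write
allows no large folios.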
Any comments and suggestions are appreciated. Thanks.
Changes from RFC v2:
- Drop mTHP interfaces to control huge page allocation, per Matthew.
- Add a new helper to calculate the order, suggested by Matthew.
- Add a new huge=write_size option to allocate large folios based on
the write size.
- Add a new patch to update the documentation.
Changes from RFC v1:
- Drop patch 1.
- Use 'write_end' to calculate the length in shmem_allowable_huge_orders().
- Update shmem_mapping_size_order() per Daniel.
Baolin Wang (4):
mm: factor out the order calculation into a new helper
mm: shmem: change shmem_huge_global_enabled() to return huge order
bitmap
mm: shmem: add large folio support to the write and fallocate paths
for tmpfs
docs: tmpfs: add documentation for 'write_size' huge option
Documentation/filesystems/tmpfs.rst | 7 +-
include/linux/pagemap.h | 16 ++++-
mm/shmem.c | 105 ++++++++++++++++++++--------
3 files changed, 94 insertions(+), 34 deletions(-)
--
2.39.3
* [RFC PATCH v3 1/4] mm: factor out the order calculation into a new helper
From: Baolin Wang @ 2024-10-10 9:58 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, baolin.wang, linux-mm, linux-kernel
Factor out the order calculation into a new helper, which can be reused
by shmem in the following patch.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
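A quick stand-alone sanity check of the helper's intended behaviour (this
mirrors the logic below with a GCC builtin standing in for ilog2() and
assumes 4K pages; it is not a kernel selftest):

#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12

static unsigned int filemap_get_order(size_t size)
{
	unsigned int shift = 63 - __builtin_clzl(size);	/* ilog2(size) */

	return shift <= PAGE_SHIFT ? 0 : shift - PAGE_SHIFT;
}

int main(void)
{
	assert(filemap_get_order(4096) == 0);	/* one page */
	assert(filemap_get_order(8192) == 1);	/* 8K -> order 1 */
	assert(filemap_get_order(65536) == 4);	/* 64K -> order 4 */
	assert(filemap_get_order(65537) == 4);	/* ilog2 rounds down */
	return 0;
}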
include/linux/pagemap.h | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index bcf0865a38ae..d796c8a33647 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -727,6 +727,16 @@ typedef unsigned int __bitwise fgf_t;
#define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
+static inline unsigned int filemap_get_order(size_t size)
+{
+ unsigned int shift = ilog2(size);
+
+ if (shift <= PAGE_SHIFT)
+ return 0;
+
+ return shift - PAGE_SHIFT;
+}
+
/**
* fgf_set_order - Encode a length in the fgf_t flags.
* @size: The suggested size of the folio to create.
@@ -740,11 +750,11 @@ typedef unsigned int __bitwise fgf_t;
*/
static inline fgf_t fgf_set_order(size_t size)
{
- unsigned int shift = ilog2(size);
+ unsigned int order = filemap_get_order(size);
- if (shift <= PAGE_SHIFT)
+ if (!order)
return 0;
- return (__force fgf_t)((shift - PAGE_SHIFT) << 26);
+ return (__force fgf_t)(order << 26);
}
void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
--
2.39.3
* [RFC PATCH v3 2/4] mm: shmem: change shmem_huge_global_enabled() to return huge order bitmap
From: Baolin Wang @ 2024-10-10 9:58 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, baolin.wang, linux-mm, linux-kernel
Change shmem_huge_global_enabled() to return the suitable huge order
bitmap, returning 0 if huge pages are not allowed. This is preparation
for adding a new huge option that supports allocation of various huge
orders in the following patch.
No functional changes.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
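To illustrate the new return convention (a stand-alone sketch, not kernel
code: 0 replaces the old 'false', and a one-bit mask at HPAGE_PMD_ORDER
replaces the old 'true'; 4K base pages and a 2M PMD are assumed):

#include <stdio.h>

#define BIT(n)		(1UL << (n))
#define HPAGE_PMD_ORDER	9	/* assumes 4K base pages, 2M PMD */

int main(void)
{
	unsigned long orders = BIT(HPAGE_PMD_ORDER);	/* old "true" */

	/* Callers can scan the bitmap from the highest order down; with
	 * only the PMD bit set, behaviour is unchanged from before. */
	for (int order = HPAGE_PMD_ORDER; order >= 0; order--)
		if (orders & BIT(order))
			printf("try order %d\n", order);	/* prints 9 */
	return 0;
}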
mm/shmem.c | 43 ++++++++++++++++++++++---------------------
1 file changed, 22 insertions(+), 21 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 0613421e09e7..f04935722457 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -548,48 +548,48 @@ static bool shmem_confirm_swap(struct address_space *mapping,
static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
-static bool __shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
- loff_t write_end, bool shmem_huge_force,
- struct vm_area_struct *vma,
- unsigned long vm_flags)
+static unsigned int __shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
+ loff_t write_end, bool shmem_huge_force,
+ struct vm_area_struct *vma,
+ unsigned long vm_flags)
{
struct mm_struct *mm = vma ? vma->vm_mm : NULL;
loff_t i_size;
if (!S_ISREG(inode->i_mode))
- return false;
+ return 0;
if (mm && ((vm_flags & VM_NOHUGEPAGE) || test_bit(MMF_DISABLE_THP, &mm->flags)))
- return false;
+ return 0;
if (shmem_huge == SHMEM_HUGE_DENY)
- return false;
+ return 0;
if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
- return true;
+ return BIT(HPAGE_PMD_ORDER);
switch (SHMEM_SB(inode->i_sb)->huge) {
case SHMEM_HUGE_ALWAYS:
- return true;
+ return BIT(HPAGE_PMD_ORDER);
case SHMEM_HUGE_WITHIN_SIZE:
index = round_up(index + 1, HPAGE_PMD_NR);
i_size = max(write_end, i_size_read(inode));
i_size = round_up(i_size, PAGE_SIZE);
if (i_size >> PAGE_SHIFT >= index)
- return true;
+ return BIT(HPAGE_PMD_ORDER);
fallthrough;
case SHMEM_HUGE_ADVISE:
if (mm && (vm_flags & VM_HUGEPAGE))
- return true;
+ return BIT(HPAGE_PMD_ORDER);
fallthrough;
default:
- return false;
+ return 0;
}
}
-static bool shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
+static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
loff_t write_end, bool shmem_huge_force,
struct vm_area_struct *vma, unsigned long vm_flags)
{
if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
- return false;
+ return 0;
return __shmem_huge_global_enabled(inode, index, write_end,
shmem_huge_force, vma, vm_flags);
@@ -771,11 +771,11 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
return 0;
}
-static bool shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
+static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
loff_t write_end, bool shmem_huge_force,
struct vm_area_struct *vma, unsigned long vm_flags)
{
- return false;
+ return 0;
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -1170,7 +1170,8 @@ static int shmem_getattr(struct mnt_idmap *idmap,
generic_fillattr(idmap, request_mask, inode, stat);
inode_unlock_shared(inode);
- if (shmem_huge_global_enabled(inode, 0, 0, false, NULL, 0))
+ if (shmem_huge_global_enabled(inode, 0, 0, false, NULL, 0) ==
+ BIT(HPAGE_PMD_ORDER))
stat->blksize = HPAGE_PMD_SIZE;
if (request_mask & STATX_BTIME) {
@@ -1679,7 +1680,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
unsigned long mask = READ_ONCE(huge_shmem_orders_always);
unsigned long within_size_orders = READ_ONCE(huge_shmem_orders_within_size);
unsigned long vm_flags = vma ? vma->vm_flags : 0;
- bool global_huge;
+ unsigned int global_order;
loff_t i_size;
int order;
@@ -1691,14 +1692,14 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
return 0;
- global_huge = shmem_huge_global_enabled(inode, index, write_end,
+ global_order = shmem_huge_global_enabled(inode, index, write_end,
shmem_huge_force, vma, vm_flags);
if (!vma || !vma_is_anon_shmem(vma)) {
/*
* For tmpfs, we now only support PMD sized THP if huge page
* is enabled, otherwise fallback to order 0.
*/
- return global_huge ? BIT(HPAGE_PMD_ORDER) : 0;
+ return global_order;
}
/*
@@ -1731,7 +1732,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
if (vm_flags & VM_HUGEPAGE)
mask |= READ_ONCE(huge_shmem_orders_madvise);
- if (global_huge)
+ if (global_order > 0)
mask |= READ_ONCE(huge_shmem_orders_inherit);
return THP_ORDERS_ALL_FILE_DEFAULT & mask;
--
2.39.3
* [RFC PATCH v3 3/4] mm: shmem: add large folio support to the write and fallocate paths for tmpfs
From: Baolin Wang @ 2024-10-10 9:58 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, baolin.wang, linux-mm, linux-kernel
Add large folio support to the tmpfs write and fallocate paths, matching
the same high-order preference mechanism used in the iomap buffered IO
path, as implemented in __filemap_get_folio().

Add shmem_mapping_size_order() to get a hint for the folio order based on
the file size, which takes the mapping's requirements into account.
Considering that tmpfs already has the 'huge=' option to control huge page
allocation, it is necessary to maintain compatibility with the 'huge='
option, as well as with the 'deny' and 'force' options controlled by
'/sys/kernel/mm/transparent_hugepage/shmem_enabled'.

Add a new huge option, 'write_size', to support large folio allocation based
on the write size for the tmpfs write and fallocate paths. The huge page
allocation strategy for tmpfs is then as follows: if the 'huge=' option is
enabled (huge=always/within_size/advise) or the 'shmem_enabled' option is
'force', only PMD-sized THP is allowed, to keep backward compatibility for
tmpfs. If the 'huge=' option is disabled (huge=never) or the 'shmem_enabled'
option is 'deny', all large folio allocations remain disabled. Only when the
'huge=' option is 'write_size' are large folios allocated based on the write
size.
Co-developed-by: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
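A stand-alone sketch of the order hint computed for the write path (this
mirrors shmem_mapping_size_order() below, assuming 4K pages and an
illustrative MAX_PAGECACHE_ORDER of 8; it is not the kernel code itself):

#include <stdio.h>
#include <stddef.h>

#define PAGE_SHIFT		12
#define MAX_PAGECACHE_ORDER	8	/* illustrative cap */

static unsigned int size_order(unsigned long index, size_t size)
{
	unsigned int shift = 63 - __builtin_clzl(size | 1);	/* ilog2() */
	unsigned int order = shift <= PAGE_SHIFT ? 0 : shift - PAGE_SHIFT;

	if (!order)
		return 0;
	/* If the start index is not aligned, allocate a smaller folio. */
	if (index & ((1UL << order) - 1))
		order = __builtin_ctzl(index);		/* __ffs(index) */
	return order < MAX_PAGECACHE_ORDER ? order : MAX_PAGECACHE_ORDER;
}

int main(void)
{
	/* A 1M write at index 0 allows orders 0-8 (mask 0x1ff)... */
	unsigned int order = size_order(0, 1024 * 1024);

	printf("order=%u mask=%#lx\n", order, (1UL << (order + 1)) - 1);
	/* ...but the same write at page index 4 is limited to order 2. */
	printf("order=%u\n", size_order(4, 1024 * 1024));
	return 0;
}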
mm/shmem.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 55 insertions(+), 7 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index f04935722457..66f1cf5b1645 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -523,12 +523,15 @@ static bool shmem_confirm_swap(struct address_space *mapping,
* also respect fadvise()/madvise() hints;
* SHMEM_HUGE_ADVISE:
* only allocate huge pages if requested with fadvise()/madvise();
+ * SHMEM_HUGE_WRITE_SIZE:
+ * only allocate huge pages based on the write size.
*/
#define SHMEM_HUGE_NEVER 0
#define SHMEM_HUGE_ALWAYS 1
#define SHMEM_HUGE_WITHIN_SIZE 2
#define SHMEM_HUGE_ADVISE 3
+#define SHMEM_HUGE_WRITE_SIZE 4
/*
* Special values.
@@ -548,12 +551,46 @@ static bool shmem_confirm_swap(struct address_space *mapping,
static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
+/**
+ * shmem_mapping_size_order - Get maximum folio order for the given file size.
+ * @mapping: Target address_space.
+ * @index: The page index.
+ * @size: The suggested size of the folio to create.
+ *
+ * This returns a high order for folios (when supported) based on the file
+ * size that the mapping currently allows at the given index. The index is
+ * relevant due to alignment considerations the mapping might have. The
+ * returned order may correspond to a size smaller than the size passed.
+ *
+ * Return: The order.
+ */
+static inline unsigned int
+shmem_mapping_size_order(struct address_space *mapping, pgoff_t index, size_t size)
+{
+ unsigned int order;
+
+ if (!mapping_large_folio_support(mapping))
+ return 0;
+
+ order = filemap_get_order(size);
+ if (!order)
+ return 0;
+
+ /* If we're not aligned, allocate a smaller folio */
+ if (index & ((1UL << order) - 1))
+ order = __ffs(index);
+
+ return min_t(size_t, order, MAX_PAGECACHE_ORDER);
+}
+
static unsigned int __shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
loff_t write_end, bool shmem_huge_force,
struct vm_area_struct *vma,
unsigned long vm_flags)
{
struct mm_struct *mm = vma ? vma->vm_mm : NULL;
+ unsigned int order;
+ size_t len;
loff_t i_size;
if (!S_ISREG(inode->i_mode))
@@ -568,6 +605,17 @@ static unsigned int __shmem_huge_global_enabled(struct inode *inode, pgoff_t ind
switch (SHMEM_SB(inode->i_sb)->huge) {
case SHMEM_HUGE_ALWAYS:
return BIT(HPAGE_PMD_ORDER);
+ /*
+ * For SHMEM_HUGE_WRITE_SIZE, get the highest allowable order hint
+ * based on the size of the write or fallocate request, and then
+ * try each allowable huge order.
+ */
+ case SHMEM_HUGE_WRITE_SIZE:
+ if (!write_end)
+ return 0;
+ len = write_end - (index << PAGE_SHIFT);
+ order = shmem_mapping_size_order(inode->i_mapping, index, len);
+ return order > 0 ? BIT(order + 1) - 1 : 0;
case SHMEM_HUGE_WITHIN_SIZE:
index = round_up(index + 1, HPAGE_PMD_NR);
i_size = max(write_end, i_size_read(inode));
@@ -624,6 +672,8 @@ static const char *shmem_format_huge(int huge)
return "always";
case SHMEM_HUGE_WITHIN_SIZE:
return "within_size";
+ case SHMEM_HUGE_WRITE_SIZE:
+ return "write_size";
case SHMEM_HUGE_ADVISE:
return "advise";
case SHMEM_HUGE_DENY:
@@ -1694,13 +1744,9 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
global_order = shmem_huge_global_enabled(inode, index, write_end,
shmem_huge_force, vma, vm_flags);
- if (!vma || !vma_is_anon_shmem(vma)) {
- /*
- * For tmpfs, we now only support PMD sized THP if huge page
- * is enabled, otherwise fallback to order 0.
- */
+ /* Tmpfs huge pages allocation? */
+ if (!vma || !vma_is_anon_shmem(vma))
return global_order;
- }
/*
* Following the 'deny' semantics of the top level, force the huge
@@ -2851,7 +2897,8 @@ static struct inode *__shmem_get_inode(struct mnt_idmap *idmap,
cache_no_acl(inode);
if (sbinfo->noswap)
mapping_set_unevictable(inode->i_mapping);
- mapping_set_large_folios(inode->i_mapping);
+ if (sbinfo->huge)
+ mapping_set_large_folios(inode->i_mapping);
switch (mode & S_IFMT) {
default:
@@ -4224,6 +4271,7 @@ static const struct constant_table shmem_param_enums_huge[] = {
{"always", SHMEM_HUGE_ALWAYS },
{"within_size", SHMEM_HUGE_WITHIN_SIZE },
{"advise", SHMEM_HUGE_ADVISE },
+ {"write_size", SHMEM_HUGE_WRITE_SIZE },
{}
};
--
2.39.3
* [RFC PATCH v3 4/4] docs: tmpfs: add documentation for 'write_size' huge option
From: Baolin Wang @ 2024-10-10 9:58 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, baolin.wang, linux-mm, linux-kernel
Add documentation for the 'write_size' huge option, and clarify the
descriptions of the existing huge options.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
Documentation/filesystems/tmpfs.rst | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
index 56a26c843dbe..be998ff47018 100644
--- a/Documentation/filesystems/tmpfs.rst
+++ b/Documentation/filesystems/tmpfs.rst
@@ -115,10 +115,11 @@ The mount options for this are:
================ ==============================================================
huge=never Do not allocate huge pages. This is the default.
-huge=always Attempt to allocate huge page every time a new page is needed.
-huge=within_size Only allocate huge page if it will be fully within i_size.
+huge=always Attempt to allocate PMD-sized huge pages every time a new page is needed.
+huge=within_size Only allocate PMD-sized huge pages if they will be fully within i_size.
Also respect madvise(2) hints.
-huge=advise Only allocate huge page if requested with madvise(2).
+huge=advise Only allocate PMD-sized huge pages if requested with madvise(2).
+huge=write_size Allocate huge pages of various sizes based on the write size.
================ ==============================================================
See also Documentation/admin-guide/mm/transhuge.rst, which describes the
--
2.39.3
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
From: Kefeng Wang @ 2024-10-16 7:49 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: willy, david, 21cnbao, ryan.roberts, ioworker0, da.gomez,
linux-mm, linux-kernel
On 2024/10/10 17:58, Baolin Wang wrote:
> Hi,
>
> This RFC patch series attempts to support large folios for tmpfs.
>
> Considering that tmpfs already has the 'huge=' option to control the THP
> allocation, it is necessary to maintain compatibility with the 'huge='
> option, as well as considering the 'deny' and 'force' option controlled
> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>
> Add a new huge option 'write_size' to support large folio allocation based
> on the write size for tmpfs write and fallocate paths. So the huge pages
> allocation strategy for tmpfs is that, if the 'huge=' option
> (huge=always/within_size/advise) is enabled or the 'shmem_enabled' option
> is 'force', it need just allow PMD sized THP to keep backward compatibility
> for tmpfs. While 'huge=' option is disabled (huge=never) or the 'shmem_enabled'
> option is 'deny', it will still disable any large folio allocations. Only
> when the 'huge=' option is 'write_size', it will allow allocating large
> folios based on the write size.
>
> And I think the 'huge=write_size' option should be the default behavior
> for tmpfs in future.
Could we avoid adding a new huge= option for tmpfs, and instead support
other orders for read/write/fallocate when mounted with huge?
>
> Any comments and suggestions are appreciated. Thanks.
>
> Changes from RFC v2:
> - Drop mTHP interfaces to control huge page allocation, per Matthew.
> - Add a new helper to calculate the order, suggested by Matthew.
> - Add a new huge=write_size option to allocate large folios based on
> the write size.
> - Add a new patch to update the documentation.
>
> Changes from RFC v1:
> - Drop patch 1.
> - Use 'write_end' to calculate the length in shmem_allowable_huge_orders().
> - Update shmem_mapping_size_order() per Daniel.
>
> Baolin Wang (4):
> mm: factor out the order calculation into a new helper
> mm: shmem: change shmem_huge_global_enabled() to return huge order
> bitmap
> mm: shmem: add large folio support to the write and fallocate paths
> for tmpfs
> docs: tmpfs: add documention for 'write_size' huge option
>
> Documentation/filesystems/tmpfs.rst | 7 +-
> include/linux/pagemap.h | 16 ++++-
> mm/shmem.c | 105 ++++++++++++++++++++--------
> 3 files changed, 94 insertions(+), 34 deletions(-)
>
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
From: Baolin Wang @ 2024-10-16 9:29 UTC (permalink / raw)
To: Kefeng Wang, akpm, hughd
Cc: willy, david, 21cnbao, ryan.roberts, ioworker0, da.gomez,
linux-mm, linux-kernel
On 2024/10/16 15:49, Kefeng Wang wrote:
>
>
> On 2024/10/10 17:58, Baolin Wang wrote:
>> Hi,
>>
>> This RFC patch series attempts to support large folios for tmpfs.
>>
>> Considering that tmpfs already has the 'huge=' option to control the THP
>> allocation, it is necessary to maintain compatibility with the 'huge='
>> option, as well as considering the 'deny' and 'force' option controlled
>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>
>> Add a new huge option 'write_size' to support large folio allocation
>> based
>> on the write size for tmpfs write and fallocate paths. So the huge pages
>> allocation strategy for tmpfs is that, if the 'huge=' option
>> (huge=always/within_size/advise) is enabled or the 'shmem_enabled' option
>> is 'force', it need just allow PMD sized THP to keep backward
>> compatibility
>> for tmpfs. While 'huge=' option is disabled (huge=never) or the
>> 'shmem_enabled'
>> option is 'deny', it will still disable any large folio allocations. Only
>> when the 'huge=' option is 'write_size', it will allow allocating large
>> folios based on the write size.
>>
>> And I think the 'huge=write_size' option should be the default behavior
>> for tmpfs in future.
>
> Could we avoid new huge= option for tmpfs, maybe support other orders
> for both read/write/fallocate if mount with huge?
Um, I am afraid not, as that would break 'huge=' compatibility. That is
to say, users still expect PMD-sized huge pages with 'huge=always'.
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
From: Kefeng Wang @ 2024-10-16 13:45 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: willy, david, 21cnbao, ryan.roberts, ioworker0, da.gomez,
linux-mm, linux-kernel
On 2024/10/16 17:29, Baolin Wang wrote:
>
>
> On 2024/10/16 15:49, Kefeng Wang wrote:
>>
>>
>> On 2024/10/10 17:58, Baolin Wang wrote:
>>> Hi,
>>>
>>> This RFC patch series attempts to support large folios for tmpfs.
>>>
>>> Considering that tmpfs already has the 'huge=' option to control the THP
>>> allocation, it is necessary to maintain compatibility with the 'huge='
>>> option, as well as considering the 'deny' and 'force' option controlled
>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>
>>> Add a new huge option 'write_size' to support large folio allocation
>>> based
>>> on the write size for tmpfs write and fallocate paths. So the huge pages
>>> allocation strategy for tmpfs is that, if the 'huge=' option
>>> (huge=always/within_size/advise) is enabled or the 'shmem_enabled'
>>> option
>>> is 'force', it need just allow PMD sized THP to keep backward
>>> compatibility
>>> for tmpfs. While 'huge=' option is disabled (huge=never) or the
>>> 'shmem_enabled'
>>> option is 'deny', it will still disable any large folio allocations.
>>> Only
>>> when the 'huge=' option is 'write_size', it will allow allocating large
>>> folios based on the write size.
>>>
>>> And I think the 'huge=write_size' option should be the default behavior
>>> for tmpfs in future.
>>
>> Could we avoid new huge= option for tmpfs, maybe support other orders
>> for both read/write/fallocate if mount with huge?
>
> Um, I am afraid not, as that would break the 'huge=' compatibility. That
> is to say, users still want PMD-sized huge pages if 'huge=always'.
Yes, compatibility may be an issue, but supporting large folios only on
the write/fallocate side is a little strange; maybe a new mode could
support read/write/fallocate together?
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
From: Matthew Wilcox @ 2024-10-16 14:06 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, david, wangkefeng.wang, 21cnbao, ryan.roberts,
ioworker0, da.gomez, linux-mm, linux-kernel
On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
> Considering that tmpfs already has the 'huge=' option to control the THP
> allocation, it is necessary to maintain compatibility with the 'huge='
> option, as well as considering the 'deny' and 'force' option controlled
> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
No, it's not. No other filesystem honours these settings. tmpfs would
not have had these settings if it were written today. It should simply
ignore them, the way that NFS ignores the "intr" mount option now that
we have a better solution to the original problem.
To reiterate my position:
- When using tmpfs as a filesystem, it should behave like other
filesystems.
- When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
behave like anonymous memory.
No more special mount options.
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
From: Baolin Wang @ 2024-10-17 9:34 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, hughd, david, wangkefeng.wang, 21cnbao, ryan.roberts,
ioworker0, da.gomez, linux-mm, linux-kernel, Kirill A . Shutemov
+ Kirill
On 2024/10/16 22:06, Matthew Wilcox wrote:
> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>> Considering that tmpfs already has the 'huge=' option to control the THP
>> allocation, it is necessary to maintain compatibility with the 'huge='
>> option, as well as considering the 'deny' and 'force' option controlled
>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>
> No, it's not. No other filesystem honours these settings. tmpfs would
> not have had these settings if it were written today. It should simply
> ignore them, the way that NFS ignores the "intr" mount option now that
> we have a better solution to the original problem.
>
> To reiterate my position:
>
> - When using tmpfs as a filesystem, it should behave like other
> filesystems.
> - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
> behave like anonymous memory.
I do agree with your point to some extent, but the ‘huge=’ option has
existed for nearly 8 years, and huge orders based on the write size may
not achieve the performance of PMD-sized THP in some scenarios, such as
when the write length is consistently 4K. So I am still concerned that
ignoring the 'huge=' option could lead to compatibility issues.

Another possible choice is to make huge page allocation based on the
write size the *default* behavior for tmpfs, while marking the 'huge='
option as deprecated and gradually removing it if there are no user
complaints about performance issues.
Let's also see what Hugh and Kirill think.
Hugh, Kirill, do you have any inputs?
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
From: Baolin Wang @ 2024-10-17 9:52 UTC (permalink / raw)
To: Kefeng Wang, akpm, hughd
Cc: willy, david, 21cnbao, ryan.roberts, ioworker0, da.gomez,
linux-mm, linux-kernel
On 2024/10/16 21:45, Kefeng Wang wrote:
>
>
> On 2024/10/16 17:29, Baolin Wang wrote:
>>
>>
>> On 2024/10/16 15:49, Kefeng Wang wrote:
>>>
>>>
>>> On 2024/10/10 17:58, Baolin Wang wrote:
>>>> Hi,
>>>>
>>>> This RFC patch series attempts to support large folios for tmpfs.
>>>>
>>>> Considering that tmpfs already has the 'huge=' option to control the
>>>> THP
>>>> allocation, it is necessary to maintain compatibility with the 'huge='
>>>> option, as well as considering the 'deny' and 'force' option controlled
>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>
>>>> Add a new huge option 'write_size' to support large folio allocation
>>>> based
>>>> on the write size for tmpfs write and fallocate paths. So the huge
>>>> pages
>>>> allocation strategy for tmpfs is that, if the 'huge=' option
>>>> (huge=always/within_size/advise) is enabled or the 'shmem_enabled'
>>>> option
>>>> is 'force', it need just allow PMD sized THP to keep backward
>>>> compatibility
>>>> for tmpfs. While 'huge=' option is disabled (huge=never) or the
>>>> 'shmem_enabled'
>>>> option is 'deny', it will still disable any large folio allocations.
>>>> Only
>>>> when the 'huge=' option is 'write_size', it will allow allocating large
>>>> folios based on the write size.
>>>>
>>>> And I think the 'huge=write_size' option should be the default behavior
>>>> for tmpfs in future.
>>>
>>> Could we avoid new huge= option for tmpfs, maybe support other orders
>>> for both read/write/fallocate if mount with huge?
>>
>> Um, I am afraid not, as that would break the 'huge=' compatibility.
>> That is to say, users still want PMD-sized huge pages if 'huge=always'.
>
> Yes, compatibility maybe an issue, but only write/fallocate side support
> large folio is a little strange, maybe a new mode to support both read/
> write/fallocate?
Because tmpfs read() does not allocate folios for holes, and uses the
ZERO_PAGE instead. (If shmem folios are swapped out, we currently always
swap in base pages, but that is another story...)

For tmpfs mmap() reads, we do not have a length to indicate how large a
folio should be allocated. Moreover, we have decided against adding any
mTHP interfaces for tmpfs in the previous discussion[1].
[1] https://lore.kernel.org/all/ZvVRiJYfaXD645Nh@casper.infradead.org/
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
From: Kirill A. Shutemov @ 2024-10-17 11:26 UTC (permalink / raw)
To: Baolin Wang
Cc: Matthew Wilcox, akpm, hughd, david, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, da.gomez, linux-mm, linux-kernel,
Kirill A . Shutemov
On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
> + Kirill
>
> On 2024/10/16 22:06, Matthew Wilcox wrote:
> > On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
> > > Considering that tmpfs already has the 'huge=' option to control the THP
> > > allocation, it is necessary to maintain compatibility with the 'huge='
> > > option, as well as considering the 'deny' and 'force' option controlled
> > > by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
> >
> > No, it's not. No other filesystem honours these settings. tmpfs would
> > not have had these settings if it were written today. It should simply
> > ignore them, the way that NFS ignores the "intr" mount option now that
> > we have a better solution to the original problem.
> >
> > To reiterate my position:
> >
> > - When using tmpfs as a filesystem, it should behave like other
> > filesystems.
> > - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
> > behave like anonymous memory.
>
> I do agree with your point to some extent, but the ‘huge=’ option has
> existed for nearly 8 years, and the huge orders based on write size may not
> achieve the performance of PMD-sized THP in some scenarios, such as when the
> write length is consistently 4K. So, I am still concerned that ignoring the
> 'huge' option could lead to compatibility issues.
Yeah, I don't think we are there yet to ignore the mount option.
Maybe we need a new generic interface to request the semantics tmpfs has
with huge= at the per-inode level, on any fs. Something like a set of
FADV_* handles to make the kernel allocate PMD-sized folios on any
allocation, or on allocations within i_size. I think this behaviour is
useful beyond tmpfs.

The huge= implementation for tmpfs could then be re-defined to set these
per-inode FADV_* flags by default. This way we can keep tmpfs compatible
with current deployments and make it less special compared to the rest
of the filesystems on the kernel side.
If huge= is not set, tmpfs would behave the same way as the rest of
filesystems.
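For concreteness, such a hint might look like the following sketch
(FADV_HUGEPAGE is a made-up name and value here, not an existing flag in
any kernel; only posix_fadvise() itself is a real interface):

#include <fcntl.h>

#define FADV_HUGEPAGE	8	/* hypothetical, not in any kernel today */

static int hint_huge(int fd)
{
	/* Would ask the kernel to prefer PMD-sized folios for this file,
	 * much as MADV_HUGEPAGE does for a mapping. */
	return posix_fadvise(fd, 0, 0, FADV_HUGEPAGE);
}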
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
From: Baolin Wang @ 2024-10-21 6:24 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, david, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, da.gomez, linux-mm, linux-kernel,
Kirill A . Shutemov
On 2024/10/17 19:26, Kirill A. Shutemov wrote:
> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>> + Kirill
>>
>> On 2024/10/16 22:06, Matthew Wilcox wrote:
>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>>>> Considering that tmpfs already has the 'huge=' option to control the THP
>>>> allocation, it is necessary to maintain compatibility with the 'huge='
>>>> option, as well as considering the 'deny' and 'force' option controlled
>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>
>>> No, it's not. No other filesystem honours these settings. tmpfs would
>>> not have had these settings if it were written today. It should simply
>>> ignore them, the way that NFS ignores the "intr" mount option now that
>>> we have a better solution to the original problem.
>>>
>>> To reiterate my position:
>>>
>>> - When using tmpfs as a filesystem, it should behave like other
>>> filesystems.
>>> - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
>>> behave like anonymous memory.
>>
>> I do agree with your point to some extent, but the ‘huge=’ option has
>> existed for nearly 8 years, and the huge orders based on write size may not
>> achieve the performance of PMD-sized THP in some scenarios, such as when the
>> write length is consistently 4K. So, I am still concerned that ignoring the
>> 'huge' option could lead to compatibility issues.
>
> Yeah, I don't think we are there yet to ignore the mount option.
OK.
> Maybe we need to get a new generic interface to request the semantics
> tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_*
> handles to make kernel allocate PMD-size folio on any allocation or on
> allocations within i_size. I think this behaviour is useful beyond tmpfs.
>
> Then huge= implementation for tmpfs can be re-defined to set these
> per-inode FADV_ flags by default. This way we can keep tmpfs compatible
> with current deployments and less special comparing to rest of
> filesystems on kernel side.
I did a quick search and didn't find any other filesystem that requires
PMD-sized huge pages, so I am not sure whether FADV_* would be useful for
filesystems other than tmpfs. Please correct me if I missed something.
> If huge= is not set, tmpfs would behave the same way as the rest of
> filesystems.
So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate
large folios based on the write size? If so, that changes the default
huge behavior for tmpfs, because previously, leaving 'huge=' unset meant
the huge option was 'SHMEM_HUGE_NEVER'. This is similar to what I
mentioned:

"Another possible choice is to make huge page allocation based on the
write size the *default* behavior for tmpfs, ..."
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
From: Kirill A. Shutemov @ 2024-10-21 8:54 UTC (permalink / raw)
To: Baolin Wang
Cc: Matthew Wilcox, akpm, hughd, david, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, da.gomez, linux-mm, linux-kernel,
Kirill A . Shutemov
On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
>
>
> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
> > On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
> > > + Kirill
> > >
> > > On 2024/10/16 22:06, Matthew Wilcox wrote:
> > > > On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
> > > > > Considering that tmpfs already has the 'huge=' option to control the THP
> > > > > allocation, it is necessary to maintain compatibility with the 'huge='
> > > > > option, as well as considering the 'deny' and 'force' option controlled
> > > > > by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
> > > >
> > > > No, it's not. No other filesystem honours these settings. tmpfs would
> > > > not have had these settings if it were written today. It should simply
> > > > ignore them, the way that NFS ignores the "intr" mount option now that
> > > > we have a better solution to the original problem.
> > > >
> > > > To reiterate my position:
> > > >
> > > > - When using tmpfs as a filesystem, it should behave like other
> > > > filesystems.
> > > > - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
> > > > behave like anonymous memory.
> > >
> > > I do agree with your point to some extent, but the ‘huge=’ option has
> > > existed for nearly 8 years, and the huge orders based on write size may not
> > > achieve the performance of PMD-sized THP in some scenarios, such as when the
> > > write length is consistently 4K. So, I am still concerned that ignoring the
> > > 'huge' option could lead to compatibility issues.
> >
> > Yeah, I don't think we are there yet to ignore the mount option.
>
> OK.
>
> > Maybe we need to get a new generic interface to request the semantics
> > tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_*
> > handles to make kernel allocate PMD-size folio on any allocation or on
> > allocations within i_size. I think this behaviour is useful beyond tmpfs.
> >
> > Then huge= implementation for tmpfs can be re-defined to set these
> > per-inode FADV_ flags by default. This way we can keep tmpfs compatible
> > with current deployments and less special comparing to rest of
> > filesystems on kernel side.
>
> I did a quick search, and I didn't find any other fs that require PMD-sized
> huge pages, so I am not sure if FADV_* is useful for filesystems other than
> tmpfs. Please correct me if I missed something.
What do you mean by "require"? THPs are always opportunistic.
IIUC, we don't have a way to hint the kernel to use huge pages for a file
on read from backing storage. Readahead is not always the right way.
> > If huge= is not set, tmpfs would behave the same way as the rest of
> > filesystems.
>
> So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate large
> folios based on the write size? If yes, that means it will change the
> default huge behavior for tmpfs. Because previously having 'huge=' is not
> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I
> mentioned:
> "Another possible choice is to make the huge pages allocation based on write
> size as the *default* behavior for tmpfs, ..."
I am more worried about breaking existing users of huge pages. So
changing the behaviour for users who don't specify huge= is okay with me.
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
From: Daniel Gomez @ 2024-10-21 13:34 UTC (permalink / raw)
To: Kirill A. Shutemov, Baolin Wang
Cc: Matthew Wilcox, akpm, hughd, david, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel,
Kirill A . Shutemov
On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
>>
>>
>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
>> > On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>> > > + Kirill
>> > >
>> > > On 2024/10/16 22:06, Matthew Wilcox wrote:
>> > > > On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>> > > > > Considering that tmpfs already has the 'huge=' option to control the THP
>> > > > > allocation, it is necessary to maintain compatibility with the 'huge='
>> > > > > option, as well as considering the 'deny' and 'force' option controlled
>> > > > > by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>> > > >
>> > > > No, it's not. No other filesystem honours these settings. tmpfs would
>> > > > not have had these settings if it were written today. It should simply
>> > > > ignore them, the way that NFS ignores the "intr" mount option now that
>> > > > we have a better solution to the original problem.
>> > > >
>> > > > To reiterate my position:
>> > > >
>> > > > - When using tmpfs as a filesystem, it should behave like other
>> > > > filesystems.
>> > > > - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
>> > > > behave like anonymous memory.
>> > >
>> > > I do agree with your point to some extent, but the ‘huge=’ option has
>> > > existed for nearly 8 years, and the huge orders based on write size may not
>> > > achieve the performance of PMD-sized THP in some scenarios, such as when the
>> > > write length is consistently 4K. So, I am still concerned that ignoring the
>> > > 'huge' option could lead to compatibility issues.
>> >
>> > Yeah, I don't think we are there yet to ignore the mount option.
>>
>> OK.
>>
>> > Maybe we need to get a new generic interface to request the semantics
>> > tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_*
>> > handles to make kernel allocate PMD-size folio on any allocation or on
>> > allocations within i_size. I think this behaviour is useful beyond tmpfs.
>> >
>> > Then huge= implementation for tmpfs can be re-defined to set these
>> > per-inode FADV_ flags by default. This way we can keep tmpfs compatible
>> > with current deployments and less special comparing to rest of
>> > filesystems on kernel side.
>>
>> I did a quick search, and I didn't find any other fs that require PMD-sized
>> huge pages, so I am not sure if FADV_* is useful for filesystems other than
>> tmpfs. Please correct me if I missed something.
>
> What do you mean by "require"? THPs are always opportunistic.
>
> IIUC, we don't have a way to hint kernel to use huge pages for a file on
> read from backing storage. Readahead is not always the right way.
>
>> > If huge= is not set, tmpfs would behave the same way as the rest of
>> > filesystems.
>>
>> So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate large
>> folios based on the write size? If yes, that means it will change the
>> default huge behavior for tmpfs. Because previously having 'huge=' is not
>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I
>> mentioned:
>> "Another possible choice is to make the huge pages allocation based on write
>> size as the *default* behavior for tmpfs, ..."
>
> I am more worried about breaking existing users of huge pages. So changing
> behaviour of users who don't specify huge is okay to me.
I think moving tmpfs to allocate large folios opportunistically by
default (as was proposed initially) doesn't necessarily conflict with the
default behaviour (huge=never). We just need to clarify that in the
documentation.

However, IIRC, one of the requests from Hugh was to have a way to disable
large folios, which is something other filesystems have no control over
as of today. Ryan sent a proposal to actually control that globally, but
I think it didn't move forward. So, what are we missing to go back to
implementing large folios in tmpfs in the default case, like any other fs
leveraging large folios?
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
From: Baolin Wang @ 2024-10-22 3:34 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, david, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, da.gomez, linux-mm, linux-kernel,
Kirill A . Shutemov
On 2024/10/21 16:54, Kirill A. Shutemov wrote:
> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
>>
>>
>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>>>> + Kirill
>>>>
>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>>>>>> Considering that tmpfs already has the 'huge=' option to control the THP
>>>>>> allocation, it is necessary to maintain compatibility with the 'huge='
>>>>>> option, as well as considering the 'deny' and 'force' option controlled
>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>
>>>>> No, it's not. No other filesystem honours these settings. tmpfs would
>>>>> not have had these settings if it were written today. It should simply
>>>>> ignore them, the way that NFS ignores the "intr" mount option now that
>>>>> we have a better solution to the original problem.
>>>>>
>>>>> To reiterate my position:
>>>>>
>>>>> - When using tmpfs as a filesystem, it should behave like other
>>>>> filesystems.
>>>>> - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
>>>>> behave like anonymous memory.
>>>>
>>>> I do agree with your point to some extent, but the ‘huge=’ option has
>>>> existed for nearly 8 years, and the huge orders based on write size may not
>>>> achieve the performance of PMD-sized THP in some scenarios, such as when the
>>>> write length is consistently 4K. So, I am still concerned that ignoring the
>>>> 'huge' option could lead to compatibility issues.
>>>
>>> Yeah, I don't think we are there yet to ignore the mount option.
>>
>> OK.
>>
>>> Maybe we need to get a new generic interface to request the semantics
>>> tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_*
>>> handles to make kernel allocate PMD-size folio on any allocation or on
>>> allocations within i_size. I think this behaviour is useful beyond tmpfs.
>>>
>>> Then huge= implementation for tmpfs can be re-defined to set these
>>> per-inode FADV_ flags by default. This way we can keep tmpfs compatible
>>> with current deployments and less special comparing to rest of
>>> filesystems on kernel side.
>>
>> I did a quick search, and I didn't find any other fs that require PMD-sized
>> huge pages, so I am not sure if FADV_* is useful for filesystems other than
>> tmpfs. Please correct me if I missed something.
>
> What do you mean by "require"? THPs are always opportunistic.
>
> IIUC, we don't have a way to hint kernel to use huge pages for a file on
> read from backing storage. Readahead is not always the right way.
IIUC, most filesystems use a method similar to the iomap buffered IO path
(see iomap_get_folio()) to allocate huge pages. What I mean is that it
would be better to have a real use case before adding a hint for
allocating THP (other than tmpfs).
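For reference, the pattern referred to here looks roughly like the
following (a sketch of how a write path passes the write length as a
folio order hint via fgf_set_order() from patch 1, as iomap does; not an
exact copy of iomap_get_folio()):

#include <linux/pagemap.h>

static struct folio *get_write_folio(struct address_space *mapping,
				     loff_t pos, size_t len)
{
	/* Encode the write length as a preferred folio order. */
	fgf_t fgp = FGP_WRITEBEGIN | fgf_set_order(len);

	return __filemap_get_folio(mapping, pos >> PAGE_SHIFT, fgp,
				   mapping_gfp_mask(mapping));
}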
>>> If huge= is not set, tmpfs would behave the same way as the rest of
>>> filesystems.
>>
>> So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate large
>> folios based on the write size? If yes, that means it will change the
>> default huge behavior for tmpfs. Because previously having 'huge=' is not
>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I
>> mentioned:
>> "Another possible choice is to make the huge pages allocation based on write
>> size as the *default* behavior for tmpfs, ..."
>
> I am more worried about breaking existing users of huge pages. So changing
> behaviour of users who don't specify huge is okay to me.
OK. Good.
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
From: Baolin Wang @ 2024-10-22 3:41 UTC (permalink / raw)
To: Daniel Gomez, Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, david, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel,
Kirill A . Shutemov
On 2024/10/21 21:34, Daniel Gomez wrote:
> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
>>>
>>>
>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>>>>> + Kirill
>>>>>
>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>>>>>>> Considering that tmpfs already has the 'huge=' option to control the THP
>>>>>>> allocation, it is necessary to maintain compatibility with the 'huge='
>>>>>>> option, as well as considering the 'deny' and 'force' option controlled
>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>
>>>>>> No, it's not. No other filesystem honours these settings. tmpfs would
>>>>>> not have had these settings if it were written today. It should simply
>>>>>> ignore them, the way that NFS ignores the "intr" mount option now that
>>>>>> we have a better solution to the original problem.
>>>>>>
>>>>>> To reiterate my position:
>>>>>>
>>>>>> - When using tmpfs as a filesystem, it should behave like other
>>>>>> filesystems.
>>>>>> - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
>>>>>> behave like anonymous memory.
>>>>>
>>>>> I do agree with your point to some extent, but the ‘huge=’ option has
>>>>> existed for nearly 8 years, and the huge orders based on write size may not
>>>>> achieve the performance of PMD-sized THP in some scenarios, such as when the
>>>>> write length is consistently 4K. So, I am still concerned that ignoring the
>>>>> 'huge' option could lead to compatibility issues.
>>>>
>>>> Yeah, I don't think we are there yet to ignore the mount option.
>>>
>>> OK.
>>>
>>>> Maybe we need to get a new generic interface to request the semantics
>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_*
>>>> handles to make kernel allocate PMD-size folio on any allocation or on
>>>> allocations within i_size. I think this behaviour is useful beyond tmpfs.
>>>>
>>>> Then huge= implementation for tmpfs can be re-defined to set these
>>>> per-inode FADV_ flags by default. This way we can keep tmpfs compatible
>>>> with current deployments and less special comparing to rest of
>>>> filesystems on kernel side.
>>>
>>> I did a quick search, and I didn't find any other fs that require PMD-sized
>>> huge pages, so I am not sure if FADV_* is useful for filesystems other than
>>> tmpfs. Please correct me if I missed something.
>>
>> What do you mean by "require"? THPs are always opportunistic.
>>
>> IIUC, we don't have a way to hint kernel to use huge pages for a file on
>> read from backing storage. Readahead is not always the right way.
>>
>>>> If huge= is not set, tmpfs would behave the same way as the rest of
>>>> filesystems.
>>>
>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate large
>>> folios based on the write size? If yes, that means it will change the
>>> default huge behavior for tmpfs. Because previously having 'huge=' is not
>>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I
>>> mentioned:
>>> "Another possible choice is to make the huge pages allocation based on write
>>> size as the *default* behavior for tmpfs, ..."
>>
>> I am more worried about breaking existing users of huge pages. So changing
>> behaviour of users who don't specify huge is okay to me.
>
> I think moving tmpfs to allocate large folios opportunistically by
> default (as it was proposed initially) doesn't necessary conflict with
> the default behaviour (huge=never). We just need to clarify that in
> the documentation.
>
> However, and IIRC, one of the requests from Hugh was to have a way to
> disable large folios which is something other FS do not have control
> of as of today. Ryan sent a proposal to actually control that globally
> but I think it didn't move forward. So, what are we missing to go back
> to implement large folios in tmpfs in the default case, as any other fs
> leveraging large folios?
IMHO, as I discussed with Kirill, we still need to maintain compatibility
with the 'huge=' mount option. This means that if 'huge=never' is set for
tmpfs, huge page allocation will still be prohibited (which may address
Hugh's request?). However, if 'huge=' is not set, we can allocate large
folios based on the write size.
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
From: Kirill A. Shutemov @ 2024-10-22 10:06 UTC (permalink / raw)
To: Baolin Wang
Cc: Matthew Wilcox, akpm, hughd, david, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, da.gomez, linux-mm, linux-kernel,
Kirill A . Shutemov
On Tue, Oct 22, 2024 at 11:34:14AM +0800, Baolin Wang wrote:
> IIUC, most file systems use method similar to iomap buffered IO (see
> iomap_get_folio()) to allocate huge pages. What I mean is that, it would be
> better to have a real use case to add a hint for allocating THP (other than
> tmpfs).
It would be nice to hear from folks who work with production systems what
the actual needs are.

But I find the asymmetry between MADV_* hints and FADV_* hints w.r.t.
huge pages unjustified. I think it would be easy to find use cases for
FADV_HUGEPAGE/FADV_NOHUGEPAGE.

Furthermore, I think it would be useful to have some kind of mechanism to
make these hints persistent: any open of a file would have these hints
set by default, based on inode metadata on the backing storage. Although
I am not sure what the right way to achieve that is. xattrs?
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
From: David Hildenbrand @ 2024-10-22 15:31 UTC (permalink / raw)
To: Baolin Wang, Daniel Gomez, Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel,
Kirill A . Shutemov
On 22.10.24 05:41, Baolin Wang wrote:
>
>
> On 2024/10/21 21:34, Daniel Gomez wrote:
>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>>>>>> + Kirill
>>>>>>
>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>>>>>>>> Considering that tmpfs already has the 'huge=' option to control the THP
>>>>>>>> allocation, it is necessary to maintain compatibility with the 'huge='
>>>>>>>> option, as well as considering the 'deny' and 'force' option controlled
>>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>>
>>>>>>> No, it's not. No other filesystem honours these settings. tmpfs would
>>>>>>> not have had these settings if it were written today. It should simply
>>>>>>> ignore them, the way that NFS ignores the "intr" mount option now that
>>>>>>> we have a better solution to the original problem.
>>>>>>>
>>>>>>> To reiterate my position:
>>>>>>>
>>>>>>> - When using tmpfs as a filesystem, it should behave like other
>>>>>>> filesystems.
>>>>>>> - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
>>>>>>> behave like anonymous memory.
>>>>>>
>>>>>> I do agree with your point to some extent, but the ‘huge=’ option has
>>>>>> existed for nearly 8 years, and the huge orders based on write size may not
>>>>>> achieve the performance of PMD-sized THP in some scenarios, such as when the
>>>>>> write length is consistently 4K. So, I am still concerned that ignoring the
>>>>>> 'huge' option could lead to compatibility issues.
>>>>>
>>>>> Yeah, I don't think we are there yet to ignore the mount option.
>>>>
>>>> OK.
>>>>
>>>>> Maybe we need to get a new generic interface to request the semantics
>>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_*
>>>>> handles to make kernel allocate PMD-size folio on any allocation or on
>>>>> allocations within i_size. I think this behaviour is useful beyond tmpfs.
>>>>>
>>>>> Then huge= implementation for tmpfs can be re-defined to set these
>>>>> per-inode FADV_ flags by default. This way we can keep tmpfs compatible
>>>>> with current deployments and less special comparing to rest of
>>>>> filesystems on kernel side.
>>>>
>>>> I did a quick search, and I didn't find any other fs that require PMD-sized
>>>> huge pages, so I am not sure if FADV_* is useful for filesystems other than
>>>> tmpfs. Please correct me if I missed something.
>>>
>>> What do you mean by "require"? THPs are always opportunistic.
>>>
>>> IIUC, we don't have a way to hint kernel to use huge pages for a file on
>>> read from backing storage. Readahead is not always the right way.
>>>
>>>>> If huge= is not set, tmpfs would behave the same way as the rest of
>>>>> filesystems.
>>>>
>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate large
>>>> folios based on the write size? If yes, that means it will change the
>>>> default huge behavior for tmpfs. Because previously having 'huge=' is not
>>>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I
>>>> mentioned:
>>>> "Another possible choice is to make the huge pages allocation based on write
>>>> size as the *default* behavior for tmpfs, ..."
>>>
>>> I am more worried about breaking existing users of huge pages. So changing
>>> behaviour of users who don't specify huge is okay to me.
>>
>> I think moving tmpfs to allocate large folios opportunistically by
>> default (as it was proposed initially) doesn't necessary conflict with
>> the default behaviour (huge=never). We just need to clarify that in
>> the documentation.
>>
>> However, and IIRC, one of the requests from Hugh was to have a way to
>> disable large folios which is something other FS do not have control
>> of as of today. Ryan sent a proposal to actually control that globally
>> but I think it didn't move forward. So, what are we missing to go back
>> to implement large folios in tmpfs in the default case, as any other fs
>> leveraging large folios?
>
> IMHO, as I discussed with Kirill, we still need maintain compatibility
> with the 'huge=' mount option. This means that if 'huge=never' is set
> for tmpfs, huge page allocation will still be prohibited (which can
> address Hugh's request?). However, if 'huge=' is not set, we can
> allocate large folios based on the write size.
I consider allocating large folios in shmem/tmpfs on the write path less
controversial than allocating them on the page fault path -- especially
as long as we stay within the size to-be-written.
I think in RHEL, THP on shmem/tmpfs is disabled by default (e.g.,
shmem_enabled=never). Maybe because of some rather undesired
side-effects (maybe some are historical?): I recall issues with VMs with
THP + memory ballooning, as we cannot reclaim pages of folios if
splitting fails. I assume most of these problematic use cases don't use
tmpfs as an ordinary file system (write()/read()), but mmap() the whole
thing.
Sadly, I can't find any information about shmem/tmpfs + THP in the RHEL
documentation; most documentation is only concerned with anon THP,
which makes me conclude that they are not recommended as of now.
I see more issues with allocating them on the page fault path and not
having a way to disable it -- compared to allocating them on the write()
path.
Getting Hugh's opinion in this would be very valuable.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
2024-10-22 15:31 ` David Hildenbrand
@ 2024-10-23 8:04 ` Baolin Wang
2024-10-23 9:27 ` David Hildenbrand
0 siblings, 1 reply; 37+ messages in thread
From: Baolin Wang @ 2024-10-23 8:04 UTC (permalink / raw)
To: David Hildenbrand, Daniel Gomez, Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel,
Kirill A . Shutemov
On 2024/10/22 23:31, David Hildenbrand wrote:
> On 22.10.24 05:41, Baolin Wang wrote:
>>
>>
>> On 2024/10/21 21:34, Daniel Gomez wrote:
>>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
>>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
>>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>>>>>>> + Kirill
>>>>>>>
>>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
>>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>>>>>>>>> Considering that tmpfs already has the 'huge=' option to
>>>>>>>>> control the THP
>>>>>>>>> allocation, it is necessary to maintain compatibility with the
>>>>>>>>> 'huge='
>>>>>>>>> option, as well as considering the 'deny' and 'force' option
>>>>>>>>> controlled
>>>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>>>
>>>>>>>> No, it's not. No other filesystem honours these settings.
>>>>>>>> tmpfs would
>>>>>>>> not have had these settings if it were written today. It should
>>>>>>>> simply
>>>>>>>> ignore them, the way that NFS ignores the "intr" mount option
>>>>>>>> now that
>>>>>>>> we have a better solution to the original problem.
>>>>>>>>
>>>>>>>> To reiterate my position:
>>>>>>>>
>>>>>>>> - When using tmpfs as a filesystem, it should behave like
>>>>>>>> other
>>>>>>>> filesystems.
>>>>>>>> - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED,
>>>>>>>> it should
>>>>>>>> behave like anonymous memory.
>>>>>>>
>>>>>>> I do agree with your point to some extent, but the ‘huge=’ option
>>>>>>> has
>>>>>>> existed for nearly 8 years, and the huge orders based on write
>>>>>>> size may not
>>>>>>> achieve the performance of PMD-sized THP in some scenarios, such
>>>>>>> as when the
>>>>>>> write length is consistently 4K. So, I am still concerned that
>>>>>>> ignoring the
>>>>>>> 'huge' option could lead to compatibility issues.
>>>>>>
>>>>>> Yeah, I don't think we are there yet to ignore the mount option.
>>>>>
>>>>> OK.
>>>>>
>>>>>> Maybe we need to get a new generic interface to request the semantics
>>>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of
>>>>>> FADV_*
>>>>>> handles to make kernel allocate PMD-size folio on any allocation
>>>>>> or on
>>>>>> allocations within i_size. I think this behaviour is useful beyond
>>>>>> tmpfs.
>>>>>>
>>>>>> Then huge= implementation for tmpfs can be re-defined to set these
>>>>>> per-inode FADV_ flags by default. This way we can keep tmpfs
>>>>>> compatible
>>>>>> with current deployments and less special comparing to rest of
>>>>>> filesystems on kernel side.
>>>>>
>>>>> I did a quick search, and I didn't find any other fs that require
>>>>> PMD-sized
>>>>> huge pages, so I am not sure if FADV_* is useful for filesystems
>>>>> other than
>>>>> tmpfs. Please correct me if I missed something.
>>>>
>>>> What do you mean by "require"? THPs are always opportunistic.
>>>>
>>>> IIUC, we don't have a way to hint kernel to use huge pages for a
>>>> file on
>>>> read from backing storage. Readahead is not always the right way.
>>>>
>>>>>> If huge= is not set, tmpfs would behave the same way as the rest of
>>>>>> filesystems.
>>>>>
>>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still
>>>>> allocate large
>>>>> folios based on the write size? If yes, that means it will change the
>>>>> default huge behavior for tmpfs. Because previously having 'huge='
>>>>> is not
>>>>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar
>>>>> to what I
>>>>> mentioned:
>>>>> "Another possible choice is to make the huge pages allocation based
>>>>> on write
>>>>> size as the *default* behavior for tmpfs, ..."
>>>>
>>>> I am more worried about breaking existing users of huge pages. So
>>>> changing
>>>> behaviour of users who don't specify huge is okay to me.
>>>
>>> I think moving tmpfs to allocate large folios opportunistically by
>>> default (as it was proposed initially) doesn't necessary conflict with
>>> the default behaviour (huge=never). We just need to clarify that in
>>> the documentation.
>>>
>>> However, and IIRC, one of the requests from Hugh was to have a way to
>>> disable large folios which is something other FS do not have control
>>> of as of today. Ryan sent a proposal to actually control that globally
>>> but I think it didn't move forward. So, what are we missing to go back
>>> to implement large folios in tmpfs in the default case, as any other fs
>>> leveraging large folios?
>>
>> IMHO, as I discussed with Kirill, we still need maintain compatibility
>> with the 'huge=' mount option. This means that if 'huge=never' is set
>> for tmpfs, huge page allocation will still be prohibited (which can
>> address Hugh's request?). However, if 'huge=' is not set, we can
>> allocate large folios based on the write size.
>
> I consider allocating large folios in shmem/tmpfs on the write path less
> controversial than allocating them on the page fault path -- especially
> as long as we stay within the size to-be-written.
>
> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g.,
> shmem_enabled=never). Maybe because of some rather undesired
> side-effects (maybe some are historical?): I recall issues with VMs with
> THP+ memory ballooning, as we cannot reclaim pages of folios if
> splitting fails). I assume most of these problematic use cases don't use
> tmpfs as an ordinary file system (write()/read()), but mmap() the whole
> thing.
>
> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL
> documentation; most documentation is only concerned about anon THP.
> Which makes me conclude that they are not suggested as of now.
>
> I see more issues with allocating them on the page fault path and not
> having a way to disable it -- compared to allocating them on the write()
> path.
I may not fully understand your concern. IIUC, you can disable
allocating huge pages on the page fault path by using the 'huge=never'
mount option or setting shmem_enabled=deny. No?
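For reference, both knobs end up in the same global check. Roughly, in
simplified and paraphrased form (the real logic in mm/shmem.c also
handles shmem_enabled=force and the within_size/advise cases):

/* Simplified paraphrase of the existing checks in mm/shmem.c. */
static bool tmpfs_huge_allowed(struct inode *inode)
{
        if (shmem_huge == SHMEM_HUGE_DENY)      /* shmem_enabled=deny */
                return false;
        if (SHMEM_SB(inode->i_sb)->huge == SHMEM_HUGE_NEVER) /* huge=never */
                return false;
        return true;
}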
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
2024-10-22 10:06 ` Kirill A. Shutemov
@ 2024-10-23 9:25 ` Baolin Wang
0 siblings, 0 replies; 37+ messages in thread
From: Baolin Wang @ 2024-10-23 9:25 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, david, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, da.gomez, linux-mm, linux-kernel,
Kirill A . Shutemov
On 2024/10/22 18:06, Kirill A. Shutemov wrote:
> On Tue, Oct 22, 2024 at 11:34:14AM +0800, Baolin Wang wrote:
>> IIUC, most file systems use method similar to iomap buffered IO (see
>> iomap_get_folio()) to allocate huge pages. What I mean is that, it would be
>> better to have a real use case to add a hint for allocating THP (other than
>> tmpfs).
>
> I would be nice to hear from folks who works with production what the
> actual needs are.
>
> But I find asymmetry between MADV_ hints and FADV_ hints wrt huge pages
> not justified. I think it would be easy to find use-cases for
> FADV_HUGEPAGE/FADV_NOHUGEPAGE.
>
> Furthermore I think it would be useful to have some kind of mechanism to
> make these hints persistent: any open of a file would have these hints set
> by default based on inode metadata on backing storage. Although, I am not
> sure what the right way to archive that. xattrs?
Maybe we can re-use mapping_set_folio_order_range()?
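That helper already lets a filesystem pin the range of folio orders a
mapping may use, set once at inode-init time before any folios are in
the page cache. Applying a persisted hint could then look like the
sketch below (read_huge_hint() is made up; the xattr lookup is
hand-waved):

/* Sketch: apply a persisted per-inode hint at inode setup time. */
static void apply_huge_hint(struct inode *inode)
{
        if (read_huge_hint(inode))      /* made-up xattr lookup */
                mapping_set_folio_order_range(inode->i_mapping, 0,
                                              HPAGE_PMD_ORDER);
        else
                mapping_set_folio_order_range(inode->i_mapping, 0, 0);
}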
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
2024-10-23 8:04 ` Baolin Wang
@ 2024-10-23 9:27 ` David Hildenbrand
2024-10-24 10:49 ` Daniel Gomez
0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2024-10-23 9:27 UTC (permalink / raw)
To: Baolin Wang, Daniel Gomez, Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel,
Kirill A . Shutemov
On 23.10.24 10:04, Baolin Wang wrote:
>
>
> On 2024/10/22 23:31, David Hildenbrand wrote:
>> On 22.10.24 05:41, Baolin Wang wrote:
>>>
>>>
>>> On 2024/10/21 21:34, Daniel Gomez wrote:
>>>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
>>>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
>>>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>>>>>>>> + Kirill
>>>>>>>>
>>>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
>>>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>>>>>>>>>> Considering that tmpfs already has the 'huge=' option to
>>>>>>>>>> control the THP
>>>>>>>>>> allocation, it is necessary to maintain compatibility with the
>>>>>>>>>> 'huge='
>>>>>>>>>> option, as well as considering the 'deny' and 'force' option
>>>>>>>>>> controlled
>>>>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>>>>
>>>>>>>>> No, it's not. No other filesystem honours these settings.
>>>>>>>>> tmpfs would
>>>>>>>>> not have had these settings if it were written today. It should
>>>>>>>>> simply
>>>>>>>>> ignore them, the way that NFS ignores the "intr" mount option
>>>>>>>>> now that
>>>>>>>>> we have a better solution to the original problem.
>>>>>>>>>
>>>>>>>>> To reiterate my position:
>>>>>>>>>
>>>>>>>>> - When using tmpfs as a filesystem, it should behave like
>>>>>>>>> other
>>>>>>>>> filesystems.
>>>>>>>>> - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED,
>>>>>>>>> it should
>>>>>>>>> behave like anonymous memory.
>>>>>>>>
>>>>>>>> I do agree with your point to some extent, but the ‘huge=’ option
>>>>>>>> has
>>>>>>>> existed for nearly 8 years, and the huge orders based on write
>>>>>>>> size may not
>>>>>>>> achieve the performance of PMD-sized THP in some scenarios, such
>>>>>>>> as when the
>>>>>>>> write length is consistently 4K. So, I am still concerned that
>>>>>>>> ignoring the
>>>>>>>> 'huge' option could lead to compatibility issues.
>>>>>>>
>>>>>>> Yeah, I don't think we are there yet to ignore the mount option.
>>>>>>
>>>>>> OK.
>>>>>>
>>>>>>> Maybe we need to get a new generic interface to request the semantics
>>>>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of
>>>>>>> FADV_*
>>>>>>> handles to make kernel allocate PMD-size folio on any allocation
>>>>>>> or on
>>>>>>> allocations within i_size. I think this behaviour is useful beyond
>>>>>>> tmpfs.
>>>>>>>
>>>>>>> Then huge= implementation for tmpfs can be re-defined to set these
>>>>>>> per-inode FADV_ flags by default. This way we can keep tmpfs
>>>>>>> compatible
>>>>>>> with current deployments and less special comparing to rest of
>>>>>>> filesystems on kernel side.
>>>>>>
>>>>>> I did a quick search, and I didn't find any other fs that require
>>>>>> PMD-sized
>>>>>> huge pages, so I am not sure if FADV_* is useful for filesystems
>>>>>> other than
>>>>>> tmpfs. Please correct me if I missed something.
>>>>>
>>>>> What do you mean by "require"? THPs are always opportunistic.
>>>>>
>>>>> IIUC, we don't have a way to hint kernel to use huge pages for a
>>>>> file on
>>>>> read from backing storage. Readahead is not always the right way.
>>>>>
>>>>>>> If huge= is not set, tmpfs would behave the same way as the rest of
>>>>>>> filesystems.
>>>>>>
>>>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still
>>>>>> allocate large
>>>>>> folios based on the write size? If yes, that means it will change the
>>>>>> default huge behavior for tmpfs. Because previously having 'huge='
>>>>>> is not
>>>>>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar
>>>>>> to what I
>>>>>> mentioned:
>>>>>> "Another possible choice is to make the huge pages allocation based
>>>>>> on write
>>>>>> size as the *default* behavior for tmpfs, ..."
>>>>>
>>>>> I am more worried about breaking existing users of huge pages. So
>>>>> changing
>>>>> behaviour of users who don't specify huge is okay to me.
>>>>
>>>> I think moving tmpfs to allocate large folios opportunistically by
>>>> default (as it was proposed initially) doesn't necessary conflict with
>>>> the default behaviour (huge=never). We just need to clarify that in
>>>> the documentation.
>>>>
>>>> However, and IIRC, one of the requests from Hugh was to have a way to
>>>> disable large folios which is something other FS do not have control
>>>> of as of today. Ryan sent a proposal to actually control that globally
>>>> but I think it didn't move forward. So, what are we missing to go back
>>>> to implement large folios in tmpfs in the default case, as any other fs
>>>> leveraging large folios?
>>>
>>> IMHO, as I discussed with Kirill, we still need maintain compatibility
>>> with the 'huge=' mount option. This means that if 'huge=never' is set
>>> for tmpfs, huge page allocation will still be prohibited (which can
>>> address Hugh's request?). However, if 'huge=' is not set, we can
>>> allocate large folios based on the write size.
>>
>> I consider allocating large folios in shmem/tmpfs on the write path less
>> controversial than allocating them on the page fault path -- especially
>> as long as we stay within the size to-be-written.
>>
>> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g.,
>> shmem_enabled=never). Maybe because of some rather undesired
>> side-effects (maybe some are historical?): I recall issues with VMs with
>> THP+ memory ballooning, as we cannot reclaim pages of folios if
>> splitting fails). I assume most of these problematic use cases don't use
>> tmpfs as an ordinary file system (write()/read()), but mmap() the whole
>> thing.
>>
>> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL
>> documentation; most documentation is only concerned about anon THP.
>> Which makes me conclude that they are not suggested as of now.
>>
>> I see more issues with allocating them on the page fault path and not
>> having a way to disable it -- compared to allocating them on the write()
>> path.
>
> I may not understand your issues. IIUC, you can disable allocating huge
> pages on the page fault path by using the 'huge=never' mount option or
> setting shmem_enabled=deny. No?
That's what I am saying: if there is some way to disable it that will
keep working, great.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
2024-10-23 9:27 ` David Hildenbrand
@ 2024-10-24 10:49 ` Daniel Gomez
2024-10-24 10:52 ` Daniel Gomez
` (2 more replies)
0 siblings, 3 replies; 37+ messages in thread
From: Daniel Gomez @ 2024-10-24 10:49 UTC (permalink / raw)
To: David Hildenbrand, Baolin Wang, Daniel Gomez, Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel,
Kirill A . Shutemov
On Wed Oct 23, 2024 at 11:27 AM CEST, David Hildenbrand wrote:
> On 23.10.24 10:04, Baolin Wang wrote:
> >
> >
> > On 2024/10/22 23:31, David Hildenbrand wrote:
> >> On 22.10.24 05:41, Baolin Wang wrote:
> >>>
> >>>
> >>> On 2024/10/21 21:34, Daniel Gomez wrote:
> >>>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
> >>>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
> >>>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
> >>>>>>>> + Kirill
> >>>>>>>>
> >>>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
> >>>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
> >>>>>>>>>> Considering that tmpfs already has the 'huge=' option to
> >>>>>>>>>> control the THP
> >>>>>>>>>> allocation, it is necessary to maintain compatibility with the
> >>>>>>>>>> 'huge='
> >>>>>>>>>> option, as well as considering the 'deny' and 'force' option
> >>>>>>>>>> controlled
> >>>>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
> >>>>>>>>>
> >>>>>>>>> No, it's not. No other filesystem honours these settings.
> >>>>>>>>> tmpfs would
> >>>>>>>>> not have had these settings if it were written today. It should
> >>>>>>>>> simply
> >>>>>>>>> ignore them, the way that NFS ignores the "intr" mount option
> >>>>>>>>> now that
> >>>>>>>>> we have a better solution to the original problem.
> >>>>>>>>>
> >>>>>>>>> To reiterate my position:
> >>>>>>>>>
> >>>>>>>>> - When using tmpfs as a filesystem, it should behave like
> >>>>>>>>> other
> >>>>>>>>> filesystems.
> >>>>>>>>> - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED,
> >>>>>>>>> it should
> >>>>>>>>> behave like anonymous memory.
> >>>>>>>>
> >>>>>>>> I do agree with your point to some extent, but the ‘huge=’ option
> >>>>>>>> has
> >>>>>>>> existed for nearly 8 years, and the huge orders based on write
> >>>>>>>> size may not
> >>>>>>>> achieve the performance of PMD-sized THP in some scenarios, such
> >>>>>>>> as when the
> >>>>>>>> write length is consistently 4K. So, I am still concerned that
> >>>>>>>> ignoring the
> >>>>>>>> 'huge' option could lead to compatibility issues.
> >>>>>>>
> >>>>>>> Yeah, I don't think we are there yet to ignore the mount option.
> >>>>>>
> >>>>>> OK.
> >>>>>>
> >>>>>>> Maybe we need to get a new generic interface to request the semantics
> >>>>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of
> >>>>>>> FADV_*
> >>>>>>> handles to make kernel allocate PMD-size folio on any allocation
> >>>>>>> or on
> >>>>>>> allocations within i_size. I think this behaviour is useful beyond
> >>>>>>> tmpfs.
> >>>>>>>
> >>>>>>> Then huge= implementation for tmpfs can be re-defined to set these
> >>>>>>> per-inode FADV_ flags by default. This way we can keep tmpfs
> >>>>>>> compatible
> >>>>>>> with current deployments and less special comparing to rest of
> >>>>>>> filesystems on kernel side.
> >>>>>>
> >>>>>> I did a quick search, and I didn't find any other fs that require
> >>>>>> PMD-sized
> >>>>>> huge pages, so I am not sure if FADV_* is useful for filesystems
> >>>>>> other than
> >>>>>> tmpfs. Please correct me if I missed something.
> >>>>>
> >>>>> What do you mean by "require"? THPs are always opportunistic.
> >>>>>
> >>>>> IIUC, we don't have a way to hint kernel to use huge pages for a
> >>>>> file on
> >>>>> read from backing storage. Readahead is not always the right way.
> >>>>>
> >>>>>>> If huge= is not set, tmpfs would behave the same way as the rest of
> >>>>>>> filesystems.
> >>>>>>
> >>>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still
> >>>>>> allocate large
> >>>>>> folios based on the write size? If yes, that means it will change the
> >>>>>> default huge behavior for tmpfs. Because previously having 'huge='
> >>>>>> is not
> >>>>>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar
> >>>>>> to what I
> >>>>>> mentioned:
> >>>>>> "Another possible choice is to make the huge pages allocation based
> >>>>>> on write
> >>>>>> size as the *default* behavior for tmpfs, ..."
> >>>>>
> >>>>> I am more worried about breaking existing users of huge pages. So
> >>>>> changing
> >>>>> behaviour of users who don't specify huge is okay to me.
> >>>>
> >>>> I think moving tmpfs to allocate large folios opportunistically by
> >>>> default (as it was proposed initially) doesn't necessary conflict with
> >>>> the default behaviour (huge=never). We just need to clarify that in
> >>>> the documentation.
> >>>>
> >>>> However, and IIRC, one of the requests from Hugh was to have a way to
> >>>> disable large folios which is something other FS do not have control
> >>>> of as of today. Ryan sent a proposal to actually control that globally
> >>>> but I think it didn't move forward. So, what are we missing to go back
> >>>> to implement large folios in tmpfs in the default case, as any other fs
> >>>> leveraging large folios?
> >>>
> >>> IMHO, as I discussed with Kirill, we still need maintain compatibility
> >>> with the 'huge=' mount option. This means that if 'huge=never' is set
> >>> for tmpfs, huge page allocation will still be prohibited (which can
> >>> address Hugh's request?). However, if 'huge=' is not set, we can
> >>> allocate large folios based on the write size.
So, in order to make tmpfs behave like other filesystems, we need to
allocate large folios by default. Not setting 'huge=' is the same as
setting it to 'huge=never' as per documentation. But 'huge=' is meant to
control THP, not large folios, so it should not have a conflict here, or
else, what case are you thinking?
So, to make tmpfs behave like other filesystems, we need to allocate
large folios by default. According to the documentation, not setting
'huge=' is the same as setting 'huge=never.' However, 'huge=' is
intended to control THP, not large folios, so there shouldn't be
a conflict in this case. Can you clarify what specific scenario or
conflict you're considering here? Perhaps when large folios order is the
same as PMD-size?
> >>
> >> I consider allocating large folios in shmem/tmpfs on the write path less
> >> controversial than allocating them on the page fault path -- especially
> >> as long as we stay within the size to-be-written.
> >>
> >> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g.,
> >> shmem_enabled=never). Maybe because of some rather undesired
> >> side-effects (maybe some are historical?): I recall issues with VMs with
> >> THP+ memory ballooning, as we cannot reclaim pages of folios if
> >> splitting fails). I assume most of these problematic use cases don't use
> >> tmpfs as an ordinary file system (write()/read()), but mmap() the whole
> >> thing.
> >>
> >> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL
> >> documentation; most documentation is only concerned about anon THP.
> >> Which makes me conclude that they are not suggested as of now.
> >>
> >> I see more issues with allocating them on the page fault path and not
> >> having a way to disable it -- compared to allocating them on the write()
> >> path.
> >
> > I may not understand your issues. IIUC, you can disable allocating huge
> > pages on the page fault path by using the 'huge=never' mount option or
> > setting shmem_enabled=deny. No?
>
> That's what I am saying: if there is some way to disable it that will
> keep working, great.
I agree. That aligns with what I recall Hugh requested. However, I
believe that if that is the way to go, we shouldn't limit it to tmpfs.
Otherwise, why should tmpfs be prevented from allocating large folios if
other filesystems in the system are allowed to allocate them? I think
that if we want to disable large folios, we should make it more generic,
something similar to Ryan's proposal [1] for controlling folio sizes.
[1] https://lore.kernel.org/all/20240717071257.4141363-1-ryan.roberts@arm.com/
That said, there has already been disagreement on this point here [2].
[2] https://lore.kernel.org/all/ZvVRiJYfaXD645Nh@casper.infradead.org/
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
2024-10-24 10:49 ` Daniel Gomez
@ 2024-10-24 10:52 ` Daniel Gomez
2024-10-25 2:56 ` Baolin Wang
2024-10-25 20:21 ` David Hildenbrand
2 siblings, 0 replies; 37+ messages in thread
From: Daniel Gomez @ 2024-10-24 10:52 UTC (permalink / raw)
To: Daniel Gomez, David Hildenbrand, Baolin Wang, Daniel Gomez,
Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel,
Kirill A . Shutemov
On Thu Oct 24, 2024 at 12:49 PM CEST, Daniel Gomez wrote:
> On Wed Oct 23, 2024 at 11:27 AM CEST, David Hildenbrand wrote:
> > On 23.10.24 10:04, Baolin Wang wrote:
> > >
> > >
> > > On 2024/10/22 23:31, David Hildenbrand wrote:
> > >> On 22.10.24 05:41, Baolin Wang wrote:
> > >>>
> > >>>
> > >>> On 2024/10/21 21:34, Daniel Gomez wrote:
> > >>>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
> > >>>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
> > >>>>>>
> > >>>>>>
> > >>>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
> > >>>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
> > >>>>>>>> + Kirill
> > >>>>>>>>
> > >>>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
> > >>>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
> > >>>>>>>>>> Considering that tmpfs already has the 'huge=' option to
> > >>>>>>>>>> control the THP
> > >>>>>>>>>> allocation, it is necessary to maintain compatibility with the
> > >>>>>>>>>> 'huge='
> > >>>>>>>>>> option, as well as considering the 'deny' and 'force' option
> > >>>>>>>>>> controlled
> > >>>>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
> > >>>>>>>>>
> > >>>>>>>>> No, it's not. No other filesystem honours these settings.
> > >>>>>>>>> tmpfs would
> > >>>>>>>>> not have had these settings if it were written today. It should
> > >>>>>>>>> simply
> > >>>>>>>>> ignore them, the way that NFS ignores the "intr" mount option
> > >>>>>>>>> now that
> > >>>>>>>>> we have a better solution to the original problem.
> > >>>>>>>>>
> > >>>>>>>>> To reiterate my position:
> > >>>>>>>>>
> > >>>>>>>>> - When using tmpfs as a filesystem, it should behave like
> > >>>>>>>>> other
> > >>>>>>>>> filesystems.
> > >>>>>>>>> - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED,
> > >>>>>>>>> it should
> > >>>>>>>>> behave like anonymous memory.
> > >>>>>>>>
> > >>>>>>>> I do agree with your point to some extent, but the ‘huge=’ option
> > >>>>>>>> has
> > >>>>>>>> existed for nearly 8 years, and the huge orders based on write
> > >>>>>>>> size may not
> > >>>>>>>> achieve the performance of PMD-sized THP in some scenarios, such
> > >>>>>>>> as when the
> > >>>>>>>> write length is consistently 4K. So, I am still concerned that
> > >>>>>>>> ignoring the
> > >>>>>>>> 'huge' option could lead to compatibility issues.
> > >>>>>>>
> > >>>>>>> Yeah, I don't think we are there yet to ignore the mount option.
> > >>>>>>
> > >>>>>> OK.
> > >>>>>>
> > >>>>>>> Maybe we need to get a new generic interface to request the semantics
> > >>>>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of
> > >>>>>>> FADV_*
> > >>>>>>> handles to make kernel allocate PMD-size folio on any allocation
> > >>>>>>> or on
> > >>>>>>> allocations within i_size. I think this behaviour is useful beyond
> > >>>>>>> tmpfs.
> > >>>>>>>
> > >>>>>>> Then huge= implementation for tmpfs can be re-defined to set these
> > >>>>>>> per-inode FADV_ flags by default. This way we can keep tmpfs
> > >>>>>>> compatible
> > >>>>>>> with current deployments and less special comparing to rest of
> > >>>>>>> filesystems on kernel side.
> > >>>>>>
> > >>>>>> I did a quick search, and I didn't find any other fs that require
> > >>>>>> PMD-sized
> > >>>>>> huge pages, so I am not sure if FADV_* is useful for filesystems
> > >>>>>> other than
> > >>>>>> tmpfs. Please correct me if I missed something.
> > >>>>>
> > >>>>> What do you mean by "require"? THPs are always opportunistic.
> > >>>>>
> > >>>>> IIUC, we don't have a way to hint kernel to use huge pages for a
> > >>>>> file on
> > >>>>> read from backing storage. Readahead is not always the right way.
> > >>>>>
> > >>>>>>> If huge= is not set, tmpfs would behave the same way as the rest of
> > >>>>>>> filesystems.
> > >>>>>>
> > >>>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still
> > >>>>>> allocate large
> > >>>>>> folios based on the write size? If yes, that means it will change the
> > >>>>>> default huge behavior for tmpfs. Because previously having 'huge='
> > >>>>>> is not
> > >>>>>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar
> > >>>>>> to what I
> > >>>>>> mentioned:
> > >>>>>> "Another possible choice is to make the huge pages allocation based
> > >>>>>> on write
> > >>>>>> size as the *default* behavior for tmpfs, ..."
> > >>>>>
> > >>>>> I am more worried about breaking existing users of huge pages. So
> > >>>>> changing
> > >>>>> behaviour of users who don't specify huge is okay to me.
> > >>>>
> > >>>> I think moving tmpfs to allocate large folios opportunistically by
> > >>>> default (as it was proposed initially) doesn't necessary conflict with
> > >>>> the default behaviour (huge=never). We just need to clarify that in
> > >>>> the documentation.
> > >>>>
> > >>>> However, and IIRC, one of the requests from Hugh was to have a way to
> > >>>> disable large folios which is something other FS do not have control
> > >>>> of as of today. Ryan sent a proposal to actually control that globally
> > >>>> but I think it didn't move forward. So, what are we missing to go back
> > >>>> to implement large folios in tmpfs in the default case, as any other fs
> > >>>> leveraging large folios?
> > >>>
> > >>> IMHO, as I discussed with Kirill, we still need maintain compatibility
> > >>> with the 'huge=' mount option. This means that if 'huge=never' is set
> > >>> for tmpfs, huge page allocation will still be prohibited (which can
> > >>> address Hugh's request?). However, if 'huge=' is not set, we can
> > >>> allocate large folios based on the write size.
>
> So, in order to make tmpfs behave like other filesystems, we need to
> allocate large folios by default. Not setting 'huge=' is the same as
> setting it to 'huge=never' as per documentation. But 'huge=' is meant to
> control THP, not large folios, so it should not have a conflict here, or
> else, what case are you thinking?
>
> So, to make tmpfs behave like other filesystems, we need to allocate
> large folios by default. According to the documentation, not setting
> 'huge=' is the same as setting 'huge=never.' However, 'huge=' is
> intended to control THP, not large folios, so there shouldn't be
> a conflict in this case. Can you clarify what specific scenario or
> conflict you're considering here? Perhaps when large folios order is the
> same as PMD-size?
Sorry for duplicate paragraph.
>
> > >>
> > >> I consider allocating large folios in shmem/tmpfs on the write path less
> > >> controversial than allocating them on the page fault path -- especially
> > >> as long as we stay within the size to-be-written.
> > >>
> > >> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g.,
> > >> shmem_enabled=never). Maybe because of some rather undesired
> > >> side-effects (maybe some are historical?): I recall issues with VMs with
> > >> THP+ memory ballooning, as we cannot reclaim pages of folios if
> > >> splitting fails). I assume most of these problematic use cases don't use
> > >> tmpfs as an ordinary file system (write()/read()), but mmap() the whole
> > >> thing.
> > >>
> > >> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL
> > >> documentation; most documentation is only concerned about anon THP.
> > >> Which makes me conclude that they are not suggested as of now.
> > >>
> > >> I see more issues with allocating them on the page fault path and not
> > >> having a way to disable it -- compared to allocating them on the write()
> > >> path.
> > >
> > > I may not understand your issues. IIUC, you can disable allocating huge
> > > pages on the page fault path by using the 'huge=never' mount option or
> > > setting shmem_enabled=deny. No?
> >
> > That's what I am saying: if there is some way to disable it that will
> > keep working, great.
>
> I agree. That aligns with what I recall Hugh requested. However, I
> believe if that is the way to go, we shouldn't limit it to tmpfs.
> Otherwise, why should tmpfs be prevented from allocating large folios if
> other filesystems in the system are allowed to allocate them? I think,
> if we want to disable large folios we should make it more generic,
> something similar to Ryan's proposal [1] for controlling folio sizes.
>
> [1] https://lore.kernel.org/all/20240717071257.4141363-1-ryan.roberts@arm.com/
>
> That said, there has already been disagreement on this point here [2].
>
> [2] https://lore.kernel.org/all/ZvVRiJYfaXD645Nh@casper.infradead.org/
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
2024-10-24 10:49 ` Daniel Gomez
2024-10-24 10:52 ` Daniel Gomez
@ 2024-10-25 2:56 ` Baolin Wang
2024-10-25 20:21 ` David Hildenbrand
2 siblings, 0 replies; 37+ messages in thread
From: Baolin Wang @ 2024-10-25 2:56 UTC (permalink / raw)
To: Daniel Gomez, David Hildenbrand, Daniel Gomez, Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel,
Kirill A . Shutemov
On 2024/10/24 18:49, Daniel Gomez wrote:
> On Wed Oct 23, 2024 at 11:27 AM CEST, David Hildenbrand wrote:
>> On 23.10.24 10:04, Baolin Wang wrote:
>>>
>>>
>>> On 2024/10/22 23:31, David Hildenbrand wrote:
>>>> On 22.10.24 05:41, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/10/21 21:34, Daniel Gomez wrote:
>>>>>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
>>>>>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
>>>>>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>>>>>>>>>> + Kirill
>>>>>>>>>>
>>>>>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
>>>>>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>>>>>>>>>>>> Considering that tmpfs already has the 'huge=' option to
>>>>>>>>>>>> control the THP
>>>>>>>>>>>> allocation, it is necessary to maintain compatibility with the
>>>>>>>>>>>> 'huge='
>>>>>>>>>>>> option, as well as considering the 'deny' and 'force' option
>>>>>>>>>>>> controlled
>>>>>>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>>>>>>
>>>>>>>>>>> No, it's not. No other filesystem honours these settings.
>>>>>>>>>>> tmpfs would
>>>>>>>>>>> not have had these settings if it were written today. It should
>>>>>>>>>>> simply
>>>>>>>>>>> ignore them, the way that NFS ignores the "intr" mount option
>>>>>>>>>>> now that
>>>>>>>>>>> we have a better solution to the original problem.
>>>>>>>>>>>
>>>>>>>>>>> To reiterate my position:
>>>>>>>>>>>
>>>>>>>>>>> - When using tmpfs as a filesystem, it should behave like
>>>>>>>>>>> other
>>>>>>>>>>> filesystems.
>>>>>>>>>>> - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED,
>>>>>>>>>>> it should
>>>>>>>>>>> behave like anonymous memory.
>>>>>>>>>>
>>>>>>>>>> I do agree with your point to some extent, but the ‘huge=’ option
>>>>>>>>>> has
>>>>>>>>>> existed for nearly 8 years, and the huge orders based on write
>>>>>>>>>> size may not
>>>>>>>>>> achieve the performance of PMD-sized THP in some scenarios, such
>>>>>>>>>> as when the
>>>>>>>>>> write length is consistently 4K. So, I am still concerned that
>>>>>>>>>> ignoring the
>>>>>>>>>> 'huge' option could lead to compatibility issues.
>>>>>>>>>
>>>>>>>>> Yeah, I don't think we are there yet to ignore the mount option.
>>>>>>>>
>>>>>>>> OK.
>>>>>>>>
>>>>>>>>> Maybe we need to get a new generic interface to request the semantics
>>>>>>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of
>>>>>>>>> FADV_*
>>>>>>>>> handles to make kernel allocate PMD-size folio on any allocation
>>>>>>>>> or on
>>>>>>>>> allocations within i_size. I think this behaviour is useful beyond
>>>>>>>>> tmpfs.
>>>>>>>>>
>>>>>>>>> Then huge= implementation for tmpfs can be re-defined to set these
>>>>>>>>> per-inode FADV_ flags by default. This way we can keep tmpfs
>>>>>>>>> compatible
>>>>>>>>> with current deployments and less special comparing to rest of
>>>>>>>>> filesystems on kernel side.
>>>>>>>>
>>>>>>>> I did a quick search, and I didn't find any other fs that require
>>>>>>>> PMD-sized
>>>>>>>> huge pages, so I am not sure if FADV_* is useful for filesystems
>>>>>>>> other than
>>>>>>>> tmpfs. Please correct me if I missed something.
>>>>>>>
>>>>>>> What do you mean by "require"? THPs are always opportunistic.
>>>>>>>
>>>>>>> IIUC, we don't have a way to hint kernel to use huge pages for a
>>>>>>> file on
>>>>>>> read from backing storage. Readahead is not always the right way.
>>>>>>>
>>>>>>>>> If huge= is not set, tmpfs would behave the same way as the rest of
>>>>>>>>> filesystems.
>>>>>>>>
>>>>>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still
>>>>>>>> allocate large
>>>>>>>> folios based on the write size? If yes, that means it will change the
>>>>>>>> default huge behavior for tmpfs. Because previously having 'huge='
>>>>>>>> is not
>>>>>>>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar
>>>>>>>> to what I
>>>>>>>> mentioned:
>>>>>>>> "Another possible choice is to make the huge pages allocation based
>>>>>>>> on write
>>>>>>>> size as the *default* behavior for tmpfs, ..."
>>>>>>>
>>>>>>> I am more worried about breaking existing users of huge pages. So
>>>>>>> changing
>>>>>>> behaviour of users who don't specify huge is okay to me.
>>>>>>
>>>>>> I think moving tmpfs to allocate large folios opportunistically by
>>>>>> default (as it was proposed initially) doesn't necessary conflict with
>>>>>> the default behaviour (huge=never). We just need to clarify that in
>>>>>> the documentation.
>>>>>>
>>>>>> However, and IIRC, one of the requests from Hugh was to have a way to
>>>>>> disable large folios which is something other FS do not have control
>>>>>> of as of today. Ryan sent a proposal to actually control that globally
>>>>>> but I think it didn't move forward. So, what are we missing to go back
>>>>>> to implement large folios in tmpfs in the default case, as any other fs
>>>>>> leveraging large folios?
>>>>>
>>>>> IMHO, as I discussed with Kirill, we still need maintain compatibility
>>>>> with the 'huge=' mount option. This means that if 'huge=never' is set
>>>>> for tmpfs, huge page allocation will still be prohibited (which can
>>>>> address Hugh's request?). However, if 'huge=' is not set, we can
>>>>> allocate large folios based on the write size.
>
> So, in order to make tmpfs behave like other filesystems, we need to
> allocate large folios by default. Not setting 'huge=' is the same as
> setting it to 'huge=never' as per documentation. But 'huge=' is meant to
> control THP, not large folios, so it should not have a conflict here, or
> else, what case are you thinking?
>
> So, to make tmpfs behave like other filesystems, we need to allocate
> large folios by default. According to the documentation, not setting
Right.
> 'huge=' is the same as setting 'huge=never.' However, 'huge=' is
I will update the documentation in the next version. That means that if
the 'huge=' option is not set, we can still allocate large folios based
on the write size (which will no longer be the same as setting
'huge=never').
> intended to control THP, not large folios, so there shouldn't be
> a conflict in this case. Can you clarify what specific scenario or
Yes, we should still keep the same semantics for the
'huge=always/within_size/advise' settings, which only control THP
allocations.
> conflict you're considering here? Perhaps when large folios order is the
> same as PMD-size?
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
2024-10-24 10:49 ` Daniel Gomez
2024-10-24 10:52 ` Daniel Gomez
2024-10-25 2:56 ` Baolin Wang
@ 2024-10-25 20:21 ` David Hildenbrand
2024-10-28 9:48 ` David Hildenbrand
2024-10-28 21:56 ` Daniel Gomez
2 siblings, 2 replies; 37+ messages in thread
From: David Hildenbrand @ 2024-10-25 20:21 UTC (permalink / raw)
To: Daniel Gomez, Baolin Wang, Daniel Gomez, Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel,
Kirill A . Shutemov
Sorry for the late reply!
>>>>> IMHO, as I discussed with Kirill, we still need maintain compatibility
>>>>> with the 'huge=' mount option. This means that if 'huge=never' is set
>>>>> for tmpfs, huge page allocation will still be prohibited (which can
>>>>> address Hugh's request?). However, if 'huge=' is not set, we can
>>>>> allocate large folios based on the write size.
>
> So, in order to make tmpfs behave like other filesystems, we need to
> allocate large folios by default. Not setting 'huge=' is the same as
> setting it to 'huge=never' as per documentation. But 'huge=' is meant to
> control THP, not large folios, so it should not have a conflict here, or
> else, what case are you thinking?
I think we really have to move away from "huge/thp == PMD"; that's a
historical artifact. Everything else will simply be inconsistent and
confusing in the future -- and I don't see any real need for that. For
anonymous memory and anon shmem we managed the transition. (There is a
longer writeup from me about this topic, so I won't go into detail.)
I think I raised this in the past, but tmpfs/shmem is just like any
other file system ... except it sometimes really isn't, and behaves much
more like (swappable) anonymous memory (or mlocked files).
There are many systems out there that run without swap enabled, or with
extremely minimal swap (IIRC until recently Kubernetes was completely
incompatible with swapping). Swap can even be disabled today for shmem
using a mount option.
That's a big difference from all other file systems, where you are
guaranteed to have backing storage to which you can simply evict under
memory pressure (which might temporarily fail, of course).
I *think* that's the reason why we have the "huge=" parameter that also
controls the THP allocations during page faults (IOW possible memory
over-allocation). Maybe also because it was a new feature, and we only
had a single THP size.
There is, of course, also the "fallocate() might not free up memory if
there is an unexpected reference on the page because splitting it will
fail" problem, which exists even when not over-allocating memory in the
first place ...
So ... I don't think tmpfs behaves like other file systems in some
cases. And I don't think ignoring these points is a good idea.
Fortunately I don't maintain that code :)
If we don't want to go with the shmem_enabled toggles, we should
probably still extend the documentation to cover "all THP sizes", like
we did elsewhere.
huge=never: no THPs of any size
huge=always: THPs of any size (fault/write/etc)
huge=fadvise: like "always" but only with fadvise/madvise
huge=within_size: like "fadvise" but respect i_size
We could think about adding a "nowaste" extension and try make it the
default.
For example
"huge=always:nowaste: THPs of any size as long as we don't over-allocate
memory (write)"
The sysfs toggles have their beauty as well and could be useful (I'm
pretty sure they will be useful :) ):
"huge=always:sysfs": THPs of any size (fault/write/etc) as configured in
sysfs.
Too many options here to explore, too little time I have to spend on
this. Just to throw out some ideas.
What I can really suggest is not making this one of the remaining
interfaces where "huge" means "PMD-sized" once other sizes exist.
>
>>>>
>>>> I consider allocating large folios in shmem/tmpfs on the write path less
>>>> controversial than allocating them on the page fault path -- especially
>>>> as long as we stay within the size to-be-written.
>>>>
>>>> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g.,
>>>> shmem_enabled=never). Maybe because of some rather undesired
>>>> side-effects (maybe some are historical?): I recall issues with VMs with
>>>> THP+ memory ballooning, as we cannot reclaim pages of folios if
>>>> splitting fails). I assume most of these problematic use cases don't use
>>>> tmpfs as an ordinary file system (write()/read()), but mmap() the whole
>>>> thing.
>>>>
>>>> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL
>>>> documentation; most documentation is only concerned about anon THP.
>>>> Which makes me conclude that they are not suggested as of now.
>>>>
>>>> I see more issues with allocating them on the page fault path and not
>>>> having a way to disable it -- compared to allocating them on the write()
>>>> path.
>>>
>>> I may not understand your issues. IIUC, you can disable allocating huge
>>> pages on the page fault path by using the 'huge=never' mount option or
>>> setting shmem_enabled=deny. No?
>>
>> That's what I am saying: if there is some way to disable it that will
>> keep working, great.
>
> I agree. That aligns with what I recall Hugh requested. However, I
> believe if that is the way to go, we shouldn't limit it to tmpfs.
> Otherwise, why should tmpfs be prevented from allocating large folios if
> other filesystems in the system are allowed to allocate them?
See above. On systems with no/little swap you might not want them for
shmem/tmpfs, but would happily use them elsewhere.
The "write() won't waste memory" case is really interesting; the
"fallocate() cannot free the memory" problem still exists. A shrinker
might help.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
2024-10-25 20:21 ` David Hildenbrand
@ 2024-10-28 9:48 ` David Hildenbrand
2024-10-31 3:43 ` Baolin Wang
2024-10-28 21:56 ` Daniel Gomez
1 sibling, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2024-10-28 9:48 UTC (permalink / raw)
To: Daniel Gomez, Baolin Wang, Daniel Gomez, Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel,
Kirill A . Shutemov
On 25.10.24 22:21, David Hildenbrand wrote:
> Sorry for the late reply!
>
>>>>>> IMHO, as I discussed with Kirill, we still need maintain compatibility
>>>>>> with the 'huge=' mount option. This means that if 'huge=never' is set
>>>>>> for tmpfs, huge page allocation will still be prohibited (which can
>>>>>> address Hugh's request?). However, if 'huge=' is not set, we can
>>>>>> allocate large folios based on the write size.
>>
>> So, in order to make tmpfs behave like other filesystems, we need to
>> allocate large folios by default. Not setting 'huge=' is the same as
>> setting it to 'huge=never' as per documentation. But 'huge=' is meant to
>> control THP, not large folios, so it should not have a conflict here, or
>> else, what case are you thinking?
>
> I think we really have to move away from "huge/thp == PMD", that's a
> historical artifact. Everything else will simply be inconsistent and
> confusing in the future -- and I don't see any real need for that. For
> anonymous memory and anon shmem we managed the transition. (there is a
> longer writeup from me about this topic, so I won't go into detail).
>
>
> I think I raised this in the past, but tmpfs/shmem is just like any
> other file system .. except it sometimes really isn't and behaves much
> more like (swappable) anonymous memory. (or mlocked files)
>
> There are many systems out there that run without swap enabled, or with
> extremely minimal swap (IIRC until recently kubernetes was completely
> incompatible with swapping). Swap can even be disabled today for shmem
> using a mount option.
>
> That's a big difference to all other file systems where you are
> guaranteed to have backend storage where you can simply evict under
> memory pressure (might temporarily fail, of course).
>
> I *think* that's the reason why we have the "huge=" parameter that also
> controls the THP allocations during page faults (IOW possible memory
> over-allocation). Maybe also because it was a new feature, and we only
> had a single THP size.
>
> There is, of course also the "fallocate() might not free up memory if
> there is an unexpected reference on the page because splitting it will
> fail" problem, that even exists when not over-allocating memory in the
> first place ...
>
>
> So ...I don't think tmpfs behaves like other file system in some cases.
> And I don't think ignoring these points is a good idea.
>
> Fortunately I don't maintain that code :)
>
>
> If we don't want to go with the shmem_enabled toggles, we should
> probably still extend the documentation to cover "all THP sizes", like
> we did elsewhere.
>
> huge=never: no THPs of any size
> huge=always: THPs of any size (fault/write/etc)
> huge=fadvise: like "always" but only with fadvise/madvise
> huge=within_size: like "fadvise" but respect i_size
Thinking some more about that over the weekend, this is likely the way
to go, paired with conditionally changing the default to
always/within_size. I suggest a Kconfig option for that (see the sketch
below).
That should probably do as a first shot; I assume people will want more
control over which size to use, especially during page faults, but that
can likely be added later.
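Something like this would do for the build-time default (the CONFIG
symbol is made up):

/* Sketch: build-time default when no huge= option is given. */
static inline int shmem_default_huge(void)
{
#ifdef CONFIG_TMPFS_HUGE_WITHIN_SIZE    /* made-up Kconfig symbol */
        return SHMEM_HUGE_WITHIN_SIZE;
#else
        return SHMEM_HUGE_NEVER;
#endif
}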
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
2024-10-25 20:21 ` David Hildenbrand
2024-10-28 9:48 ` David Hildenbrand
@ 2024-10-28 21:56 ` Daniel Gomez
2024-10-29 12:20 ` David Hildenbrand
1 sibling, 1 reply; 37+ messages in thread
From: Daniel Gomez @ 2024-10-28 21:56 UTC (permalink / raw)
To: David Hildenbrand, Baolin Wang, Daniel Gomez, Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel,
Kirill A . Shutemov
On Fri Oct 25, 2024 at 10:21 PM CEST, David Hildenbrand wrote:
> Sorry for the late reply!
>
> >>>>> IMHO, as I discussed with Kirill, we still need maintain compatibility
> >>>>> with the 'huge=' mount option. This means that if 'huge=never' is set
> >>>>> for tmpfs, huge page allocation will still be prohibited (which can
> >>>>> address Hugh's request?). However, if 'huge=' is not set, we can
> >>>>> allocate large folios based on the write size.
> >
> > So, in order to make tmpfs behave like other filesystems, we need to
> > allocate large folios by default. Not setting 'huge=' is the same as
> > setting it to 'huge=never' as per documentation. But 'huge=' is meant to
> > control THP, not large folios, so it should not have a conflict here, or
> > else, what case are you thinking?
>
> I think we really have to move away from "huge/thp == PMD", that's a
> historical artifact. Everything else will simply be inconsistent and
> confusing in the future -- and I don't see any real need for that. For
> anonymous memory and anon shmem we managed the transition. (there is a
> longer writeup from me about this topic, so I won't go into detail).
>
>
> I think I raised this in the past, but tmpfs/shmem is just like any
> other file system .. except it sometimes really isn't and behaves much
> more like (swappable) anonymous memory. (or mlocked files)
>
> There are many systems out there that run without swap enabled, or with
> extremely minimal swap (IIRC until recently kubernetes was completely
> incompatible with swapping). Swap can even be disabled today for shmem
> using a mount option.
>
> That's a big difference to all other file systems where you are
> guaranteed to have backend storage where you can simply evict under
> memory pressure (might temporarily fail, of course).
>
> I *think* that's the reason why we have the "huge=" parameter that also
> controls the THP allocations during page faults (IOW possible memory
> over-allocation). Maybe also because it was a new feature, and we only
> had a single THP size.
>
> There is, of course also the "fallocate() might not free up memory if
> there is an unexpected reference on the page because splitting it will
> fail" problem, that even exists when not over-allocating memory in the
> first place ...
>
>
> So ...I don't think tmpfs behaves like other file system in some cases.
> And I don't think ignoring these points is a good idea.
Assuming a system without swap, what's the difference you are concerned
about between using the current tmpfs allocation method vs the large
folios implementation?
>
> Fortunately I don't maintain that code :)
>
>
> If we don't want to go with the shmem_enabled toggles, we should
> probably still extend the documentation to cover "all THP sizes", like
> we did elsewhere.
>
> huge=never: no THPs of any size
> huge=always: THPs of any size (fault/write/etc)
> huge=fadvise: like "always" but only with fadvise/madvise
> huge=within_size: like "fadvise" but respect i_size
>
> We could think about adding a "nowaste" extension and try to make it the
> default.
>
> For example
>
> "huge=always:nowaste: THPs of any size as long as we don't over-allocate
> memory (write)"
This is the default behaviour in other fs too. I don't think it is
necessary to make it explicit.
>
> The sysfs toggles have their beauty as well and could be useful (I'm
> pretty sure they will be useful :) ):
>
> "huge=always;sysfs": THPs of any size (fault/write/etc) as configured in
> sysfs.
>
> Too many options here to explore, too little time I have to spend on
> this. Just to throw out some ideas.
>
> What I can really suggest is not making this one of the remaining
> interfaces where "huge" means "PMD-sized" once other sizes exist.
>
> >
> >>>>
> >>>> I consider allocating large folios in shmem/tmpfs on the write path less
> >>>> controversial than allocating them on the page fault path -- especially
> >>>> as long as we stay within the size to-be-written.
> >>>>
> >>>> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g.,
> >>>> shmem_enabled=never). Maybe because of some rather undesired
> >>>> side-effects (maybe some are historical?): I recall issues with VMs with
> >>>> THP + memory ballooning, as we cannot reclaim pages of folios if
> >>>> splitting fails. I assume most of these problematic use cases don't use
> >>>> tmpfs as an ordinary file system (write()/read()), but mmap() the whole
> >>>> thing.
> >>>>
> >>>> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL
> >>>> documentation; most documentation is only concerned about anon THP.
> >>>> Which makes me conclude that they are not suggested as of now.
> >>>>
> >>>> I see more issues with allocating them on the page fault path and not
> >>>> having a way to disable it -- compared to allocating them on the write()
> >>>> path.
> >>>
> >>> I may not understand your issues. IIUC, you can disable allocating huge
> >>> pages on the page fault path by using the 'huge=never' mount option or
> >>> setting shmem_enabled=deny. No?
> >>
> >> That's what I am saying: if there is some way to disable it that will
> >> keep working, great.
> >
> > I agree. That aligns with what I recall Hugh requested. However, I
> > believe if that is the way to go, we shouldn't limit it to tmpfs.
> > Otherwise, why should tmpfs be prevented from allocating large folios if
> > other filesystems in the system are allowed to allocate them?
>
> See above. On systems without/little swap you might not want them for
> shmem/tmpfs, but would happily use them elsewhere.
>
> The "write() won't waste memory" case is really interesting, the
> "fallocate cannot free the memory" still exists. A shrinker might help.
The previous implementation of large folio allocation was wrong
and was actually wasting memory by rounding up while trying to find
the order. Matthew already pointed it out [1]. So, with that fixed, we
should not end up wasting memory.
[1] https://lore.kernel.org/all/ZvVQoY8Tn_BNc79T@casper.infradead.org/
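To make the difference concrete, here is a small standalone sketch (not
the kernel code; PAGE_SHIFT and the log2 helper are simplified stand-ins)
showing why rounding the order up over-allocates for a 5-page write while
rounding down stays within it:

#include <stdio.h>

#define PAGE_SHIFT 12

/* Integer log2: position of the highest set bit. */
static unsigned int ilog2_u(unsigned long x)
{
        unsigned int r = 0;

        while (x >>= 1)
                r++;
        return r;
}

int main(void)
{
        unsigned long pages = 5;        /* a 5-page (20 KiB) write */

        /*
         * Rounding down picks order 2 (4 pages); the remaining page can
         * be covered by smaller folios, so nothing beyond the write is
         * allocated.
         */
        unsigned int down = ilog2_u(pages);

        /*
         * Rounding up picks order 3 (8 pages), allocating 3 pages past
         * the end of the write.
         */
        unsigned int up = (pages & (pages - 1)) ? ilog2_u(pages) + 1
                                                : ilog2_u(pages);

        printf("round-down order: %u, round-up order: %u\n", down, up);
        return 0;
}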
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
2024-10-28 21:56 ` Daniel Gomez
@ 2024-10-29 12:20 ` David Hildenbrand
0 siblings, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2024-10-29 12:20 UTC (permalink / raw)
To: Daniel Gomez, Baolin Wang, Daniel Gomez, Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel,
Kirill A . Shutemov
On 28.10.24 22:56, Daniel Gomez wrote:
> On Fri Oct 25, 2024 at 10:21 PM CEST, David Hildenbrand wrote:
>> Sorry for the late reply!
>>
>>>>>>> IMHO, as I discussed with Kirill, we still need to maintain compatibility
>>>>>>> with the 'huge=' mount option. This means that if 'huge=never' is set
>>>>>>> for tmpfs, huge page allocation will still be prohibited (which can
>>>>>>> address Hugh's request?). However, if 'huge=' is not set, we can
>>>>>>> allocate large folios based on the write size.
>>>
>>> So, in order to make tmpfs behave like other filesystems, we need to
>>> allocate large folios by default. Not setting 'huge=' is the same as
>>> setting it to 'huge=never' as per documentation. But 'huge=' is meant to
>>> control THP, not large folios, so it should not have a conflict here, or
>>> else, what case are you thinking of?
>>
>> I think we really have to move away from "huge/thp == PMD", that's a
>> historical artifact. Everything else will simply be inconsistent and
>> confusing in the future -- and I don't see any real need for that. For
>> anonymous memory and anon shmem we managed the transition. (there is a
>> longer writeup from me about this topic, so I won't go into detail).
>>
>>
>> I think I raised this in the past, but tmpfs/shmem is just like any
>> other file system .. except it sometimes really isn't and behaves much
>> more like (swappable) anonymous memory. (or mlocked files)
>>
>> There are many systems out there that run without swap enabled, or with
>> extremely minimal swap (IIRC until recently kubernetes was completely
>> incompatible with swapping). Swap can even be disabled today for shmem
>> using a mount option.
>>
>> That's a big difference to all other file systems where you are
>> guaranteed to have backend storage where you can simply evict under
>> memory pressure (might temporarily fail, of course).
>>
>> I *think* that's the reason why we have the "huge=" parameter that also
>> controls the THP allocations during page faults (IOW possible memory
>> over-allocation). Maybe also because it was a new feature, and we only
>> had a single THP size.
>>
>> There is, of course also the "fallocate() might not free up memory if
>> there is an unexpected reference on the page because splitting it will
>> fail" problem, that even exists when not over-allocating memory in the
>> first place ...
>>
>>
>> So ... I don't think tmpfs behaves like other file systems in some cases.
>> And I don't think ignoring these points is a good idea.
>
> Assuming a system without swap, what's the difference you are concerned
> about between using the current tmpfs allocation method vs the large
> folios implementation?
As raised above, there is the interesting interaction between
fallocate(FALLOC_FL_PUNCH_HOLE) and raised refcounts, where we can fail
to reclaim memory.
shmem_fallocate()->shmem_truncate_range()->truncate_inode_pages_range()->truncate_inode_partial_folio().
It's better than it was in the past -- in the past we didn't even try
splitting, but today splitting can still fail and we'll never try
reclaiming that memory again later. This is very different to anonymous
memory, where we have the deferred split queue and remember which pages were
zapped implicitly using the page tables (instead of zeroing them out and
not freeing up the memory).
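As a concrete illustration of the operation being discussed, here is a
minimal userspace sketch (file path and sizes are made up) that punches a
hole into a tmpfs file; if a large folio backs the range, the kernel goes
through the shmem_fallocate() path above and must attempt a split:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* tmpfs-backed file; /dev/shm is a tmpfs mount on most systems */
        int fd = open("/dev/shm/punch-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
                return 1;

        /* Populate 2 MiB so a large (possibly PMD-sized) folio may back it. */
        size_t len = 2UL << 20;
        char *buf = malloc(len);
        memset(buf, 0xaa, len);
        if (write(fd, buf, len) != (ssize_t)len)
                return 1;

        /*
         * Punch a 64 KiB hole in the middle. If a large folio backs this
         * range, the kernel must split it; if the split fails (e.g. due
         * to a raised refcount), that memory is not freed.
         */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      1UL << 20, 64UL << 10))
                return 1;

        free(buf);
        close(fd);
        return 0;
}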
It's one of the issues people ran into when using THP+shmem for backing
guest VMs along with memory ballooning. For that reason, the
recommendation still is to disable THP when using shmem for backing
guest VMs and relying on memory overcommit optimizations such as memory
balloon inflation.
>
>>
>> Fortunately I don't maintain that code :)
>>
>>
>> If we don't want to go with the shmem_enabled toggles, we should
>> probably still extend the documentation to cover "all THP sizes", like
>> we did elsewhere.
>>
>> huge=never: no THPs of any size
>> huge=always: THPs of any size (fault/write/etc)
>> huge=fadvise: like "always" but only with fadvise/madvise
>> huge=within_size: like "fadvise" but respect i_size
>>
>> We could think about adding a "nowaste" extension and try to make it the
>> default.
>>
>> For example
>>
>> "huge=always:nowaste: THPs of any size as long as we don't over-allocate
>> memory (write)"
>
> This is the default behaviour in other fs too. I don't think it is
> necessary to make it explicit.
Please keep in mind that allocating THPs of different size during *page
faults* will have to fit into the whole picture we are creating here.
That's also what "huge=always" controls for shmem today IIRC.
>>>
>>>>>>
>>>>>> I consider allocating large folios in shmem/tmpfs on the write path less
>>>>>> controversial than allocating them on the page fault path -- especially
>>>>>> as long as we stay within the size to-be-written.
>>>>>>
>>>>>> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g.,
>>>>>> shmem_enabled=never). Maybe because of some rather undesired
>>>>>> side-effects (maybe some are historical?): I recall issues with VMs with
>>>>>> THP + memory ballooning, as we cannot reclaim pages of folios if
>>>>>> splitting fails. I assume most of these problematic use cases don't use
>>>>>> tmpfs as an ordinary file system (write()/read()), but mmap() the whole
>>>>>> thing.
>>>>>>
>>>>>> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL
>>>>>> documentation; most documentation is only concerned about anon THP.
>>>>>> Which makes me conclude that they are not suggested as of now.
>>>>>>
>>>>>> I see more issues with allocating them on the page fault path and not
>>>>>> having a way to disable it -- compared to allocating them on the write()
>>>>>> path.
>>>>>
>>>>> I may not understand your issues. IIUC, you can disable allocating huge
>>>>> pages on the page fault path by using the 'huge=never' mount option or
>>>>> setting shmem_enabled=deny. No?
>>>>
>>>> That's what I am saying: if there is some way to disable it that will
>>>> keep working, great.
>>>
>>> I agree. That aligns with what I recall Hugh requested. However, I
>>> believe if that is the way to go, we shouldn't limit it to tmpfs.
>>> Otherwise, why should tmpfs be prevented from allocating large folios if
>>> other filesystems in the system are allowed to allocate them?
>>
>> See above. On systems without/little swap you might not want them for
>> shmem/tmpfs, but would happily use them elsewhere.
>>
>> The "write() won't waste memory" case is really interesting, the
>> "fallocate cannot free the memory" still exists. A shrinker might help.
>
> The previous implementation of large folio allocation was wrong
> and was actually wasting memory by rounding up while trying to find
> the order. Matthew already pointed it out [1]. So, with that fixed, we
> should not end up wasting memory.
Again, we should have a clear path forward for how we deal with page faults
and how this fits into the picture.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
2024-10-28 9:48 ` David Hildenbrand
@ 2024-10-31 3:43 ` Baolin Wang
2024-10-31 8:53 ` David Hildenbrand
0 siblings, 1 reply; 37+ messages in thread
From: Baolin Wang @ 2024-10-31 3:43 UTC (permalink / raw)
To: David Hildenbrand, Daniel Gomez, Daniel Gomez, Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel,
Kirill A . Shutemov
Sorry for the late reply.
On 2024/10/28 17:48, David Hildenbrand wrote:
> On 25.10.24 22:21, David Hildenbrand wrote:
>> Sorry for the late reply!
>>
>>>>>>> IMHO, as I discussed with Kirill, we still need to maintain
>>>>>>> compatibility
>>>>>>> with the 'huge=' mount option. This means that if 'huge=never' is
>>>>>>> set
>>>>>>> for tmpfs, huge page allocation will still be prohibited (which can
>>>>>>> address Hugh's request?). However, if 'huge=' is not set, we can
>>>>>>> allocate large folios based on the write size.
>>>
>>> So, in order to make tmpfs behave like other filesystems, we need to
>>> allocate large folios by default. Not setting 'huge=' is the same as
>>> setting it to 'huge=never' as per documentation. But 'huge=' is meant to
>>> control THP, not large folios, so it should not have a conflict here, or
>>> else, what case are you thinking of?
>>
>> I think we really have to move away from "huge/thp == PMD", that's a
>> historical artifact. Everything else will simply be inconsistent and
>> confusing in the future -- and I don't see any real need for that. For
>> anonymous memory and anon shmem we managed the transition. (there is a
>> longer writeup from me about this topic, so I won't go into detail).
>>
>>
>> I think I raised this in the past, but tmpfs/shmem is just like any
>> other file system .. except it sometimes really isn't and behaves much
>> more like (swappable) anonymous memory. (or mlocked files)
>>
>> There are many systems out there that run without swap enabled, or with
>> extremely minimal swap (IIRC until recently kubernetes was completely
>> incompatible with swapping). Swap can even be disabled today for shmem
>> using a mount option.
>>
>> That's a big difference to all other file systems where you are
>> guaranteed to have backend storage where you can simply evict under
>> memory pressure (might temporarily fail, of course).
>>
>> I *think* that's the reason why we have the "huge=" parameter that also
>> controls the THP allocations during page faults (IOW possible memory
>> over-allocation). Maybe also because it was a new feature, and we only
>> had a single THP size.
>>
>> There is, of course also the "fallocate() might not free up memory if
>> there is an unexpected reference on the page because splitting it will
>> fail" problem, that even exists when not over-allocating memory in the
>> first place ...
>>
>>
>> So ... I don't think tmpfs behaves like other file systems in some cases.
>> And I don't think ignoring these points is a good idea.
>>
>> Fortunately I don't maintain that code :)
>>
>>
>> If we don't want to go with the shmem_enabled toggles, we should
>> probably still extend the documentation to cover "all THP sizes", like
>> we did elsewhere.
>>
>> huge=never: no THPs of any size
>> huge=always: THPs of any size (fault/write/etc)
>> huge=fadvise: like "always" but only with fadvise/madvise
>> huge=within_size: like "fadvise" but respect i_size
>
> Thinking some more about that over the weekend, this is likely the way
> to go, paired with conditionally changing the default to
> always/within_size. I suggest a kconfig option for that.
I am still worried about adding a new kconfig option, which might
complicate the tmpfs controls further.
> That should probably do as a first shot; I assume people will want more
> control over which size to use, especially during page faults, but that
> can likely be added later.
After some discussions, I think the first step is to achieve two goals:
1) Try to make tmpfs use large folios like other file systems, which
means we should avoid adding more complex control options (per Matthew).
2) Still need to maintain compatibility with the 'huge=' mount option (per
Kirill), as I also remember we have customers who use
'huge=within_size' to allocate THPs for better performance.
Based on these considerations, my first step is to neither add a new
'huge=' option parameter nor introduce the mTHP interfaces control for
tmpfs, but rather to change the default huge allocation behavior for
tmpfs. That is to say, when 'huge=' option is not configured, we will
allow the huge folios allocation based on the write size. As a result,
the behavior of huge pages for tmpfs will change as follows:
no 'huge=' set: can allocate any size huge folios based on write size
huge=never: no huge folios of any size
huge=always: only PMD sized THP allocation as before
huge=fadvise: like "always" but only with fadvise/madvise
huge=within_size: like "fadvise" but respect i_size
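For illustration, the proposed behaviors map onto mount(2) like this (a
sketch; the mount points are made up, and the 'no huge=' line describes
the behavior proposed here, not what current kernels do):

#include <sys/mount.h>

int main(void)
{
        /* no 'huge=': large folios of any size, sized by the write */
        mount("tmpfs", "/mnt/a", "tmpfs", 0, NULL);

        /* 'huge=never': no huge folios of any size */
        mount("tmpfs", "/mnt/b", "tmpfs", 0, "huge=never");

        /* 'huge=always': only PMD-sized THP allocation, as before */
        mount("tmpfs", "/mnt/c", "tmpfs", 0, "huge=always");

        return 0;
}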
The next step is to continue discussing whether to add a new Kconfig
option or FADV_* in the future.
So what do you think?
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
2024-10-31 3:43 ` Baolin Wang
@ 2024-10-31 8:53 ` David Hildenbrand
2024-10-31 10:04 ` Baolin Wang
0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2024-10-31 8:53 UTC (permalink / raw)
To: Baolin Wang, Daniel Gomez, Daniel Gomez, Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel,
Kirill A . Shutemov
>>>
>>> If we don't want to go with the shmem_enabled toggles, we should
>>> probably still extend the documentation to cover "all THP sizes", like
>>> we did elsewhere.
>>>
>>> huge=never: no THPs of any size
>>> huge=always: THPs of any size (fault/write/etc)
>>> huge=fadvise: like "always" but only with fadvise/madvise
>>> huge=within_size: like "fadvise" but respect i_size
>>
>> Thinking some more about that over the weekend, this is likely the way
>> to go, paired with conditionally changing the default to
>> always/within_size. I suggest a kconfig option for that.
>
> I am still worried about adding a new kconfig option, which might
> complicate the tmpfs controls further.
Why exactly?
If we are changing a default similar to
CONFIG_TRANSPARENT_HUGEPAGE_NEVER -> CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS,
it would make perfectly sense to give people building a kernel control
over that.
If we want to support this feature in a distro kernel like RHEL we'll
have to leave the default unmodified. Otherwise I see no way (excluding
downstream-only hacks) to backport this into distro kernels.
>
>> That should probably do as a first shot; I assume people will want more
>> control over which size to use, especially during page faults, but that
>> can likely be added later.
I know, it puts you in a bad position because there are different
opinions floating around. But let's try to find something that is
reasonable and still acceptable. And let's hope that Hugh will voice an
opinion :D
>
> After some discussions, I think the first step is to achieve two goals:
> 1) Try to make tmpfs use large folios like other file systems, which
> means we should avoid adding more complex control options (per Matthew).
> 2) Still need to maintain compatibility with the 'huge=' mount option (per
> Kirill), as I also remember we have customers who use
> 'huge=within_size' to allocate THPs for better performance.
>
> Based on these considerations, my first step is to neither add a new
> 'huge=' option parameter nor introduce the mTHP interfaces control for
> tmpfs, but rather to change the default huge allocation behavior for
> tmpfs. That is to say, when 'huge=' option is not configured, we will
> allow the huge folios allocation based on the write size. As a result,
> the behavior of huge pages for tmpfs will change as follows:
> no 'huge=' set: can allocate any size huge folios based on write size
> huge=never: no huge folios of any size
> huge=always: only PMD sized THP allocation as before
> huge=fadvise: like "always" but only with fadvise/madvise
> huge=within_size: like "fadvise" but respect i_size
I don't like that:
(a) there is no way to explicitly enable/name that new behavior.
(b) "always" etc. are only concerned about PMDs.
So again, I suggest:
huge=never: No THPs of any size
huge=always: THPs of any size
huge=fadvise: like "always" but only with fadvise/madvise
huge=within_size: like "fadvise" but respect i_size
"huge=" default depends on a Kconfig option.
With that we:
(1) Maximize the cases where we will use large folios of any sizes
(which Willy cares about).
(2) Have a way to disable them completely (which I care about).
(3) Allow distros to keep the default unchanged.
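A minimal kernel-context sketch of such a Kconfig-selected default (the
CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_* names and the helper are assumptions
for illustration, mirroring the existing CONFIG_TRANSPARENT_HUGEPAGE_*
options for anon THP):

/* Sketch only: pick the default 'huge=' policy at build time. */
static int shmem_default_huge_policy(void)
{
        if (IS_ENABLED(CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_ALWAYS))
                return SHMEM_HUGE_ALWAYS;
        if (IS_ENABLED(CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_WITHIN_SIZE))
                return SHMEM_HUGE_WITHIN_SIZE;
        if (IS_ENABLED(CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_ADVISE))
                return SHMEM_HUGE_ADVISE;
        return SHMEM_HUGE_NEVER;        /* the old default */
}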
Likely, for now we will only try allocating PMD-sized THPs during page
faults, and allocate different sizes only during write(). So the effect
for many use cases (VMs, DBs) that primarily mmap() tmpfs files will be
completely unchanged even with "huge=always".
It will get more tricky once we change that behavior as well, but that's
something to likely figure out if it is a real problem at a different
day :)
I really preferred using the sysfs toggles (as discussed with Hugh in
the meeting back then), but I can also understand why we at least want
to try making tmpfs behave more like other file systems. But I'm a bit
more careful to not ignore the cases where it really isn't like any
other file system.
If we start making PMD-sized THPs special in any non-configurable way,
then we are effectively *worse* off than allowing to configure them
properly. So if someone voices "but we want only PMD-sized" ones, the
next one will say "but we only want cont-pte sized-ones" and then we
should provide an option to control the actual sizes to use differently,
in some way. But let's see if that is even required.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
2024-10-31 8:53 ` David Hildenbrand
@ 2024-10-31 10:04 ` Baolin Wang
2024-10-31 10:46 ` David Hildenbrand
2024-10-31 10:46 ` David Hildenbrand
0 siblings, 2 replies; 37+ messages in thread
From: Baolin Wang @ 2024-10-31 10:04 UTC (permalink / raw)
To: David Hildenbrand, Daniel Gomez, Daniel Gomez, Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel,
Kirill A . Shutemov
On 2024/10/31 16:53, David Hildenbrand wrote:
>>>>
>>>> If we don't want to go with the shmem_enabled toggles, we should
>>>> probably still extend the documentation to cover "all THP sizes", like
>>>> we did elsewhere.
>>>>
>>>> huge=never: no THPs of any size
>>>> huge=always: THPs of any size (fault/write/etc)
>>>> huge=fadvise: like "always" but only with fadvise/madvise
>>>> huge=within_size: like "fadvise" but respect i_size
>>>
>>> Thinking some more about that over the weekend, this is likely the way
>>> to go, paired with conditionally changing the default to
>>> always/within_size. I suggest a kconfig option for that.
>>
>> I am still worried about adding a new kconfig option, which might
>> complicate the tmpfs controls further.
>
> Why exactly?
There will be more options to control huge pages allocation for tmpfs,
which may confuse users and make life harder? Yes, we can add some
documentation, but I'm still a bit cautious about this.
> If we are changing a default similar to
> CONFIG_TRANSPARENT_HUGEPAGE_NEVER -> CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS,
> it would make perfectly sense to give people building a kernel control
> over that.
>
> If we want to support this feature in a distro kernel like RHEL we'll
> have to leave the default unmodified. Otherwise I see no way (excluding
> downstream-only hacks) to backport this into distro kernels.
>
>>
>>> That should probably do as a first shot; I assume people will want more
>>> control over which size to use, especially during page faults, but that
>>> can likely be added later.
>
> I know, it puts you in a bad position because there are different
> opinions floating around. But let's try to find something that is
> reasonable and still acceptable. And let's hope that Hugh will voice an
> opinion :D
Yes, I am also waiting to see if Hugh has any inputs :)
>> After some discussions, I think the first step is to achieve two goals:
>> 1) Try to make tmpfs use large folios like other file systems, which
>> means we should avoid adding more complex control options (per Matthew).
>> 2) Still need to maintain compatibility with the 'huge=' mount option (per
>> Kirill), as I also remember we have customers who use
>> 'huge=within_size' to allocate THPs for better performance.
>
>>
>> Based on these considerations, my first step is to neither add a new
>> 'huge=' option parameter nor introduce the mTHP interfaces control for
>> tmpfs, but rather to change the default huge allocation behavior for
>> tmpfs. That is to say, when 'huge=' option is not configured, we will
>> allow the huge folios allocation based on the write size. As a result,
>> the behavior of huge pages for tmpfs will change as follows:
>> no 'huge=' set: can allocate any size huge folios based on write size
>> huge=never: no huge folios of any size
>> huge=always: only PMD sized THP allocation as before
>> huge=fadvise: like "always" but only with fadvise/madvise
>> huge=within_size: like "fadvise" but respect i_size
>
> I don't like that:
>
> (a) there is no way to explicitly enable/name that new behavior.
But this is similar to other file systems that enable large folios
(setting mapping_set_large_folios()), and I haven't seen any other file
systems supporting large folios requiring a new Kconfig. Maybe tmpfs is
a bit special?
If we all agree that tmpfs is a bit special when using huge pages, then
fine, a Kconfig option might be needed.
> (b) "always" etc. are only concerned about PMDs.
Yes, we currently maintain the same semantics as before, in case users
still expect THPs.
> So again, I suggest:
>
> huge=never: No THPs of any size
> huge=always: THPs of any size
> huge=fadvise: like "always" but only with fadvise/madvise
> huge=within_size: like "fadvise" but respect i_size
>
> "huge=" default depends on a Kconfig option.
>
> With that we:
>
> (1) Maximize the cases where we will use large folios of any sizes
> (which Willy cares about).
> (2) Have a way to disable them completely (which I care about).
> (3) Allow distros to keep the default unchanged.
>
> Likely, for now we will only try allocating PMD-sized THPs during page
> faults, and allocate different sizes only during write(). So the effect
> for many use cases (VMs, DBs) that primarily mmap() tmpfs files will be
> completely unchanged even with "huge=always".
>
> It will get more tricky once we change that behavior as well, but that's
> something to likely figure out if it is a real problem at a different
> day :)
>
>
> I really preferred using the sysfs toggles (as discussed with Hugh in
> the meeting back then), but I can also understand why we at least want
> to try making tmpfs behave more like other file systems. But I'm a bit
> more careful to not ignore the cases where it really isn't like any
> other file system.
That's also my previous thought, but Matthew is strongly against that.
Let's go step by step.
> If we start making PMD-sized THPs special in any non-configurable way,
> then we are effectively off *worse* than allowing to configure them
> properly. So if someone voices "but we want only PMD-sized" ones, the
> next one will say "but we only want cont-pte sized-ones" and then we
> should provide an option to control the actual sizes to use differently,
> in some way. But let's see if that is even required.
Yes, I agree. So what I am thinking is, the 'huge=' option should be
gradually deprecated in the future and eventually tmpfs can allocate any
size large folios by default.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
2024-10-31 10:04 ` Baolin Wang
2024-10-31 10:46 ` David Hildenbrand
@ 2024-10-31 10:46 ` David Hildenbrand
2024-11-05 12:45 ` Baolin Wang
1 sibling, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2024-10-31 10:46 UTC (permalink / raw)
To: Baolin Wang, Daniel Gomez, Daniel Gomez, Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel,
Kirill A . Shutemov
>>> I am still worried about adding a new kconfig option, which might
>>> complicate the tmpfs controls further.
>>
>> Why exactly?
>
> There will be more options to control huge pages allocation for tmpfs,
> which may confuse users and make life harder? Yes, we can add some
> documentation, but I'm still a bit cautious about this.
If it's just "changing the default from "huge=never" to "huge=X" I don't
see a big problem here. Again, we already do that for anon THPs.
If we make more behavior depend on that (which I don't think we should
be doing), I agree that it would be more controversial.
[..]
>>>
>>>> That should probably do as a first shot; I assume people will want more
>>>> control over which size to use, especially during page faults, but that
>>>> can likely be added later.
>>
>> I know, it puts you in a bad position because there are different
>> opinions floating around. But let's try to find something that is
>> reasonable and still acceptable. And let's hope that Hugh will voice an
>> opinion :D
>
> Yes, I am also waiting to see if Hugh has any inputs :)
We keep saying that ... I have to find a way to summon him :)
>
>>> After some discussions, I think the first step is to achieve two goals:
>>> 1) Try to make tmpfs use large folios like other file systems, which
>>> means we should avoid adding more complex control options (per Matthew).
>>> 2) Still need to maintain compatibility with the 'huge=' mount option (per
>>> Kirill), as I also remember we have customers who use
>>> 'huge=within_size' to allocate THPs for better performance.
>>
>>>
>>> Based on these considerations, my first step is to neither add a new
>>> 'huge=' option parameter nor introduce the mTHP interfaces control for
>>> tmpfs, but rather to change the default huge allocation behavior for
>>> tmpfs. That is to say, when 'huge=' option is not configured, we will
>>> allow the huge folios allocation based on the write size. As a result,
>>> the behavior of huge pages for tmpfs will change as follows:
>>> no 'huge=' set: can allocate any size huge folios based on write size
>>> huge=never: no huge folios of any size
>>> huge=always: only PMD sized THP allocation as before
>>> huge=fadvise: like "always" but only with fadvise/madvise
>>> huge=within_size: like "fadvise" but respect i_size
>>
>> I don't like that:
>>
>> (a) there is no way to explicitly enable/name that new behavior.
>
> But this is similar to other file systems that enable large folios
> (setting mapping_set_large_folios()), and I haven't seen any other file
> systems supporting large folios requiring a new Kconfig. Maybe tmpfs is
> a bit special?
I'm afraid I don't have the energy to explain once more why I think
tmpfs is not just like any other file system in some cases.
And distributions are rather careful when it comes to something like
this ...
>
> If we all agree that tmpfs is a bit special when using huge pages, then
> fine, a Kconfig option might be needed.
>
>> (b) "always" etc. are only concerned about PMDs.
>
> Yes, we currently maintain the same semantics as before, in case users
> still expect THPs.
Again, I don't think that is a reasonable approach to make PMD-sized
ones special here. It will all get seriously confusing and inconsistent.
THPs are opportunistic after all, and page fault behavior will remain
unchanged (PMD-sized) for now. And even if we support other sizes during
page faults, we'd like start with the largest size (PMD-size) first, and
it likely might just all work better than before.
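(For illustration, "largest size first" usually means a fallback loop
along these lines -- a kernel-context sketch with a made-up helper name,
not the actual shmem code:)

/* Sketch: try folio orders from the largest allowed down to order 0. */
static struct folio *alloc_largest_folio(gfp_t gfp, unsigned int max_order)
{
        unsigned int order;

        for (order = max_order; order > 0; order--) {
                struct folio *folio = folio_alloc(gfp, order);

                if (folio)
                        return folio;
        }
        return folio_alloc(gfp, 0);
}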
Happy to learn where this really makes a difference.
Of course, if you change the default behavior (which you are planning),
it's ... a changed default.
If there are reasons to have more tunables regarding the sizes to use,
then it should not be limited to PMD-size.
>> So again, I suggest:
>>
>> huge=never: No THPs of any size
>> huge=always: THPs of any size
>> huge=fadvise: like "always" but only with fadvise/madvise
>> huge=within_size: like "fadvise" but respect i_size
>>
>> "huge=" default depends on a Kconfig option.
>>
>> With that we:
>>
>> (1) Maximize the cases where we will use large folios of any sizes
>> (which Willy cares about).
>> (2) Have a way to disable them completely (which I care about).
>> (3) Allow distros to keep the default unchanged.
>>
>> Likely, for now we will only try allocating PMD-sized THPs during page
>> faults, and allocate different sizes only during write(). So the effect
>> for many use cases (VMs, DBs) that primarily mmap() tmpfs files will be
>> completely unchanged even with "huge=always".
>>
>> It will get more tricky once we change that behavior as well, but that's
>> something to likely figure out if it is a real problem at a different
>> day :)
>>
>>
>> I really preferred using the sysfs toggles (as discussed with Hugh in
>> the meeting back then), but I can also understand why we at least want
>> to try making tmpfs behave more like other file systems. But I'm a bit
>> more careful to not ignore the cases where it really isn't like any
>> other file system.
>
> That's also my previous thought, but Matthew is strongly against that.
> Let's go step by step.
Yes, I understand his view as well.
But I won't blindly agree to the "tmpfs is just like any other file
system" opinion :)
>> If we start making PMD-sized THPs special in any non-configurable way,
>> then we are effectively *worse* off than allowing to configure them
>> properly. So if someone voices "but we want only PMD-sized" ones, the
>> next one will say "but we only want cont-pte sized-ones" and then we
>> should provide an option to control the actual sizes to use differently,
>> in some way. But let's see if that is even required.
>
> Yes, I agree. So what I am thinking is, the 'huge=' option should be
> gradually deprecated in the future and eventually tmpfs can allocate any
> size large folios by default.
Let's be realistic, it won't get removed any time soon. ;)
So changing "huge=always" etc. semantics to reflect our new size
options, and then try changing the default (with the option for
people/distros to have the old default) is a reasonable approach, at
least to me.
I'm trying to stay open-minded here, but the proposal I heard so far is
not particularly appealing.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
2024-10-31 10:46 ` David Hildenbrand
@ 2024-11-05 12:45 ` Baolin Wang
2024-11-05 14:56 ` David Hildenbrand
0 siblings, 1 reply; 37+ messages in thread
From: Baolin Wang @ 2024-11-05 12:45 UTC (permalink / raw)
To: David Hildenbrand, Daniel Gomez, Daniel Gomez, Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel,
Kirill A . Shutemov
On 2024/10/31 18:46, David Hildenbrand wrote:
[snip]
>>> I don't like that:
>>>
>>> (a) there is no way to explicitly enable/name that new behavior.
>>
>> But this is similar to other file systems that enable large folios
>> (setting mapping_set_large_folios()), and I haven't seen any other file
>> systems supporting large folios requiring a new Kconfig. Maybe tmpfs is
>> a bit special?
>
> I'm afraid I don't have the energy to explain once more why I think
> tmpfs is not just like any other file system in some cases.
>
> And distributions are rather careful when it comes to something like
> this ...
>
>>
>> If we all agree that tmpfs is a bit special when using huge pages, then
>> fine, a Kconfig option might be needed.
>>
>>> (b) "always" etc. are only concerned about PMDs.
>>
>> Yes, we currently maintain the same semantics as before, in case users
>> still expect THPs.
>
> Again, I don't think that is a reasonable approach to make PMD-sized
> ones special here. It will all get seriously confusing and inconsistent.
I agree PMD-sized should not be special. This is all for backward
compatibility with the ‘huge=’ mount option, and adding a new kconfig is
also for this purpose.
> THPs are opportunistic after all, and page fault behavior will remain
> unchanged (PMD-sized) for now. And even if we support other sizes during
> page faults, we'd like start with the largest size (PMD-size) first, and
> it likely might just all work better than before.
>
> Happy to learn where this really makes a difference.
>
> Of course, if you change the default behavior (which you are planning),
> it's ... a changed default.
>
> If there are reasons to have more tunables regarding the sizes to use,
> then it should not be limited to PMD-size.
I have tried to modify the code according to your suggestion (not tested
yet). Is this what you had in mind?
static inline unsigned int
shmem_mapping_size_order(struct address_space *mapping, pgoff_t index,
                         loff_t write_end)
{
        unsigned int order;
        size_t size;

        if (!mapping_large_folio_support(mapping) || !write_end)
                return 0;

        /* Calculate the write size based on the write_end */
        size = write_end - (index << PAGE_SHIFT);
        order = filemap_get_order(size);
        if (!order)
                return 0;

        /* If we're not aligned, allocate a smaller folio */
        if (index & ((1UL << order) - 1))
                order = __ffs(index);

        order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
        return order > 0 ? BIT(order + 1) - 1 : 0;
}
static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
                                              loff_t write_end, bool shmem_huge_force,
                                              unsigned long vm_flags)
{
        bool is_shmem = inode->i_sb == shm_mnt->mnt_sb;
        unsigned long within_size_orders;
        unsigned int order;
        loff_t i_size;

        if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
                return 0;
        if (!S_ISREG(inode->i_mode))
                return 0;
        if (shmem_huge == SHMEM_HUGE_DENY)
                return 0;
        if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
                return BIT(HPAGE_PMD_ORDER);

        switch (SHMEM_SB(inode->i_sb)->huge) {
        case SHMEM_HUGE_NEVER:
                return 0;
        case SHMEM_HUGE_ALWAYS:
                if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
                        return BIT(HPAGE_PMD_ORDER);

                return shmem_mapping_size_order(inode->i_mapping,
                                                index, write_end);
        case SHMEM_HUGE_WITHIN_SIZE:
                if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
                        within_size_orders = BIT(HPAGE_PMD_ORDER);
                else
                        within_size_orders =
                                shmem_mapping_size_order(inode->i_mapping,
                                                         index, write_end);

                order = highest_order(within_size_orders);
                while (within_size_orders) {
                        index = round_up(index + 1, 1 << order);
                        i_size = max(write_end, i_size_read(inode));
                        i_size = round_up(i_size, PAGE_SIZE);
                        if (i_size >> PAGE_SHIFT >= index)
                                return within_size_orders;

                        order = next_order(&within_size_orders, order);
                }
                fallthrough;
        case SHMEM_HUGE_ADVISE:
                if (vm_flags & VM_HUGEPAGE) {
                        if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
                                return BIT(HPAGE_PMD_ORDER);

                        return shmem_mapping_size_order(inode->i_mapping,
                                                        index, write_end);
                }
                fallthrough;
        default:
                return 0;
        }
}
1) Add a new 'CONFIG_USE_ONLY_THP_FOR_TMPFS' kconfig to keep ‘huge=’
mount option compatibility.
2) For tmpfs write(), if CONFIG_USE_ONLY_THP_FOR_TMPFS is not enabled,
we will get the possible huge orders based on the write size.
3) For tmpfs mmap() fault, always use PMD-sized huge order.
4) For shmem, ignore the write size logic and always use PMD-sized THP
to check if the global huge is enabled.
However, in case 2), if 'huge=always' is set and the write size is less
than 4K, we will allocate small pages; will that break the 'huge'
semantics? Maybe it's not something to worry too much about.
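To make that concrete, here is a standalone mimic of
shmem_mapping_size_order() above (assuming 4 KiB pages, an aligned index,
and that filemap_get_order() rounds down, as discussed earlier in the
thread; the MAX_PAGECACHE_ORDER value is illustrative):

#include <stdio.h>

#define PAGE_SHIFT 12
#define MAX_PAGECACHE_ORDER 9   /* illustrative */

static unsigned long size_order_mask(unsigned long size)
{
        unsigned long pages = size >> PAGE_SHIFT;
        unsigned int order = 0;

        while (pages >>= 1)
                order++;                /* round down, like ilog2() */
        if (!order)
                return 0;               /* writes below 8 KiB: small pages */
        if (order > MAX_PAGECACHE_ORDER)
                order = MAX_PAGECACHE_ORDER;
        return (1UL << (order + 1)) - 1;        /* orders 0..order */
}

int main(void)
{
        printf("4 KiB write  -> 0x%lx\n", size_order_mask(4UL << 10));  /* 0x0 */
        printf("32 KiB write -> 0x%lx\n", size_order_mask(32UL << 10)); /* 0xf */
        printf("2 MiB write  -> 0x%lx\n", size_order_mask(2UL << 20));  /* 0x3ff */
        return 0;
}

The 4 KiB case is exactly the small-write situation mentioned above: the
returned order bitmap is empty, so only small pages are allocated.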
>>> huge=never: No THPs of any size
>>> huge=always: THPs of any size
>>> huge=fadvise: like "always" but only with fadvise/madvise
>>> huge=within_size: like "fadvise" but respect i_size
>>>
>>> "huge=" default depends on a Kconfig option.
>>>
>>> With that we:
>>>
>>> (1) Maximize the cases where we will use large folios of any sizes
>>> (which Willy cares about).
>>> (2) Have a way to disable them completely (which I care about).
>>> (3) Allow distros to keep the default unchanged.
>>>
>>> Likely, for now we will only try allocating PMD-sized THPs during page
>>> faults, and allocate different sizes only during write(). So the effect
>>> for many use cases (VMs, DBs) that primarily mmap() tmpfs files will be
>>> completely unchanged even with "huge=always".
>>>
>>> It will get more tricky once we change that behavior as well, but that's
>>> something to likely figure out if it is a real problem at a different
>>> day :)
>>>
>>>
>>> I really preferred using the sysfs toggles (as discussed with Hugh in
>>> the meeting back then), but I can also understand why we at least want
>>> to try making tmpfs behave more like other file systems. But I'm a bit
>>> more careful to not ignore the cases where it really isn't like any
>>> other file system.
>>
>> That's also my previous thought, but Matthew is strongly against that.
>> Let's go step by step.
>
> Yes, I understand his view as well.
>
> But I won't blindly agree to the "tmpfs is just like any other file
> system" opinion :)
>
>>> If we start making PMD-sized THPs special in any non-configurable way,
>>> then we are effectively *worse* off than allowing to configure them
>>> properly. So if someone voices "but we want only PMD-sized" ones, the
>>> next one will say "but we only want cont-pte sized-ones" and then we
>>> should provide an option to control the actual sizes to use differently,
>>> in some way. But let's see if that is even required.
>>
>> Yes, I agree. So what I am thinking is, the 'huge=' option should be
>> gradually deprecated in the future and eventually tmpfs can allocate any
>> size large folios by default.
>
> Let's be realistic, it won't get removed any time soon. ;)
>
> So changing "huge=always" etc. semantics to reflect our new size
> options, and then try changing the default (with the option for
> people/distros to have the old default) is a reasonable approach, at
> least to me.
>
> I'm trying to stay open-minded here, but the proposal I heard so far is
> not particularly appealing.
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
2024-11-05 12:45 ` Baolin Wang
@ 2024-11-05 14:56 ` David Hildenbrand
2024-11-06 3:17 ` Baolin Wang
0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2024-11-05 14:56 UTC (permalink / raw)
To: Baolin Wang, Daniel Gomez, Daniel Gomez, Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel,
Kirill A . Shutemov
On 05.11.24 13:45, Baolin Wang wrote:
>
>
> On 2024/10/31 18:46, David Hildenbrand wrote:
> [snip]
>
>>>> I don't like that:
>>>>
>>>> (a) there is no way to explicitly enable/name that new behavior.
>>>
>>> But this is similar to other file systems that enable large folios
>>> (setting mapping_set_large_folios()), and I haven't seen any other file
>>> systems supporting large folios requiring a new Kconfig. Maybe tmpfs is
>>> a bit special?
>>
>> I'm afraid I don't have the energy to explain once more why I think
>> tmpfs is not just like any other file system in some cases.
>>
>> And distributions are rather careful when it comes to something like
>> this ...
>>
>>>
>>> If we all agree that tmpfs is a bit special when using huge pages, then
>>> fine, a Kconfig option might be needed.
>>>
>>>> (b) "always" etc. are only concerned about PMDs.
>>>
>>>> Yes, we currently maintain the same semantics as before, in case users
>>> still expect THPs.
>>
>> Again, I don't think that is a reasonable approach to make PMD-sized
>> ones special here. It will all get seriously confusing and inconsistent.
>
> I agree PMD-sized should not be special. This is all for backward
> compatibility with the ‘huge=’ mount option, and adding a new kconfig is
> also for this purpose.
>
>> THPs are opportunistic after all, and page fault behavior will remain
>> unchanged (PMD-sized) for now. And even if we support other sizes during
>> page faults, we'd likely start with the largest size (PMD-size) first, and
>> it likely might just all work better than before.
>>
>> Happy to learn where this really makes a difference.
>>
>> Of course, if you change the default behavior (which you are planning),
>> it's ... a changed default.
>>
>> If there are reasons to have more tunables regarding the sizes to use,
>> then it should not be limited to PMD-size.
>
> I have tried to modify the code according to your suggestion (not tested
> yet). Is this what you had in mind?
>
> static inline unsigned int
> shmem_mapping_size_order(struct address_space *mapping, pgoff_t index,
>                          loff_t write_end)
> {
>         unsigned int order;
>         size_t size;
>
>         if (!mapping_large_folio_support(mapping) || !write_end)
>                 return 0;
>
>         /* Calculate the write size based on the write_end */
>         size = write_end - (index << PAGE_SHIFT);
>         order = filemap_get_order(size);
>         if (!order)
>                 return 0;
>
>         /* If we're not aligned, allocate a smaller folio */
>         if (index & ((1UL << order) - 1))
>                 order = __ffs(index);
>
>         order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
>         return order > 0 ? BIT(order + 1) - 1 : 0;
> }
>
> static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
>                                               loff_t write_end, bool shmem_huge_force,
>                                               unsigned long vm_flags)
> {
>         bool is_shmem = inode->i_sb == shm_mnt->mnt_sb;
>         unsigned long within_size_orders;
>         unsigned int order;
>         loff_t i_size;
>
>         if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
>                 return 0;
>         if (!S_ISREG(inode->i_mode))
>                 return 0;
>         if (shmem_huge == SHMEM_HUGE_DENY)
>                 return 0;
>         if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
>                 return BIT(HPAGE_PMD_ORDER);
>
>         switch (SHMEM_SB(inode->i_sb)->huge) {
>         case SHMEM_HUGE_NEVER:
>                 return 0;
>         case SHMEM_HUGE_ALWAYS:
>                 if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>                         return BIT(HPAGE_PMD_ORDER);
>
>                 return shmem_mapping_size_order(inode->i_mapping,
>                                                 index, write_end);
>         case SHMEM_HUGE_WITHIN_SIZE:
>                 if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>                         within_size_orders = BIT(HPAGE_PMD_ORDER);
>                 else
>                         within_size_orders =
>                                 shmem_mapping_size_order(inode->i_mapping,
>                                                          index, write_end);
>
>                 order = highest_order(within_size_orders);
>                 while (within_size_orders) {
>                         index = round_up(index + 1, 1 << order);
>                         i_size = max(write_end, i_size_read(inode));
>                         i_size = round_up(i_size, PAGE_SIZE);
>                         if (i_size >> PAGE_SHIFT >= index)
>                                 return within_size_orders;
>
>                         order = next_order(&within_size_orders, order);
>                 }
>                 fallthrough;
>         case SHMEM_HUGE_ADVISE:
>                 if (vm_flags & VM_HUGEPAGE) {
>                         if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>                                 return BIT(HPAGE_PMD_ORDER);
>
>                         return shmem_mapping_size_order(inode->i_mapping,
>                                                         index, write_end);
>                 }
>                 fallthrough;
>         default:
>                 return 0;
>         }
> }
>
> 1) Add a new 'CONFIG_USE_ONLY_THP_FOR_TMPFS' kconfig to keep ‘huge=’
> mount option compatibility.
> 2) For tmpfs write(), if CONFIG_USE_ONLY_THP_FOR_TMPFS is not enabled,
> we will get the possible huge orders based on the write size.
> 3) For tmpfs mmap() fault, always use PMD-sized huge order.
> 4) For shmem, ignore the write size logic and always use PMD-sized THP
> to check if the global huge is enabled.
>
> However, in case 2), if 'huge=always' is set and the write size is less
> than 4K, we will allocate small pages; will that break the 'huge'
> semantics? Maybe it's not something to worry too much about.
Probably I didn't express clearly what I think we should do, because this is
not quite what I had in mind.
I would use the CONFIG_USE_ONLY_THP_FOR_TMPFS way of doing it only if
really required. As raised, if someone needs finer control, providing that
only for a single size is rather limiting.
This is what I hope we can do (doc update to show what I mean):
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 5034915f4e8e8..d7d1a9acdbfc5 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -349,11 +349,24 @@ user, the PMD_ORDER hugepage policy will be overridden. If the policy for
PMD_ORDER is not defined within a valid ``thp_shmem``, its policy will
default to ``never``.
-Hugepages in tmpfs/shmem
-========================
+tmpfs/shmem
+===========
-You can control hugepage allocation policy in tmpfs with mount option
-``huge=``. It can have following values:
+Traditionally, tmpfs only supported a single huge page size ("PMD"). Today,
+it also supports smaller sizes just like anonymous memory, often referred
+to as "multi-size THP" (mTHP). Huge pages of any size are commonly
+represented in the kernel as "large folios".
+
+While there is fine control over the huge page sizes to use for the internal
+shmem mount (see below), ordinary tmpfs mounts will make use of all
+available huge page sizes without any control over the exact sizes,
+behaving more like other file systems.
+
+tmpfs mounts
+------------
+
+The THP allocation policy for tmpfs mounts can be adjusted using the mount
+option: ``huge=``. It can have following values:
always
Attempt to allocate huge pages every time we need a new page;
@@ -368,19 +381,20 @@ within_size
advise
Only allocate huge pages if requested with fadvise()/madvise();
-The default policy is ``never``.
+Remember that the kernel may use huge pages of all available sizes, and
+that no fine-grained control, as for the internal tmpfs mount, is available.
+
+The default policy in the past was ``never``, but it can now be adjusted
+using the CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_ALWAYS,
+CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_NEVER etc.
``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
``huge=never`` will not attempt to break up huge pages at all, just stop more
from being allocated.
-There's also sysfs knob to control hugepage allocation policy for internal
-shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
-is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
-MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
-
-In addition to policies listed above, shmem_enabled allows two further
-values:
+In addition to policies listed above, the sysfs knob
+/sys/kernel/mm/transparent_hugepage/shmem_enabled will affect the
+allocation policy of tmpfs mounts, when set to the following values:
deny
For use in emergencies, to force the huge option off from
@@ -388,13 +402,26 @@ deny
force
Force the huge option on for all - very useful for testing;
-Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to
-control mTHP allocation:
-'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled',
-and its value for each mTHP is essentially consistent with the global
-setting. An 'inherit' option is added to ensure compatibility with these
-global settings. Conversely, the options 'force' and 'deny' are dropped,
-which are rather testing artifacts from the old ages.
+
+shmem / internal tmpfs
+----------------------
+
+The internal tmpfs mount is used for SysV SHM, memfds, shared anonymous
+mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
+
+To control the THP allocation policy for this internal tmpfs mount, the
+sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled and the knobs
+per THP size in
+'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled'
+can be used.
+
+The global knob has the same semantics as the ``huge=`` mount options
+for tmpfs mounts, except that the different huge page sizes can be controlled
+individually, and will only use the setting of the global knob when the
+per-size knob is set to 'inherit'.
+
+The options 'force' and 'deny' are dropped for the individual sizes, which
+are rather testing artifacts from the old ages.
always
Attempt to allocate <size> huge pages every time we need a new page;
diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
index 56a26c843dbe9..10de8f706d07b 100644
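To make the two control points described above concrete, here is a minimal
sketch of configuring both an ordinary tmpfs mount and the internal shmem
mount (assumptions: a THP-enabled kernel, root privileges, and an existing
/mnt/tmp mount point; the knob paths and values are the ones named in the
doc text):

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        /* Ordinary tmpfs mount: only the huge= policy is selectable. */
        if (mount("tmpfs", "/mnt/tmp", "tmpfs", 0, "huge=within_size"))
                perror("mount");

        /* Internal shmem mount: global policy plus per-size knobs. */
        FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/shmem_enabled", "w");
        if (f) {
                fputs("advise", f);
                fclose(f);
        }

        /* Let the 64kB size follow the global policy set above. */
        f = fopen("/sys/kernel/mm/transparent_hugepage/hugepages-64kB/shmem_enabled", "w");
        if (f) {
                fputs("inherit", f);
                fclose(f);
        }
        return 0;
}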
There is this question of "do we need the old way of doing it and only
allocate PMDs". For that, likely a config similar to the one you propose might
make sense, but I would want to see if there is real demand for that. In
particular: for whom are the smaller sizes a problem when bigger (PMD) sizes
were enabled in the past?
--
Cheers,
David / dhildenb
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
2024-11-05 14:56 ` David Hildenbrand
@ 2024-11-06 3:17 ` Baolin Wang
0 siblings, 0 replies; 37+ messages in thread
From: Baolin Wang @ 2024-11-06 3:17 UTC (permalink / raw)
To: David Hildenbrand, Daniel Gomez, Daniel Gomez, Kirill A. Shutemov
Cc: Matthew Wilcox, akpm, hughd, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel,
Kirill A . Shutemov
On 2024/11/5 22:56, David Hildenbrand wrote:
> On 05.11.24 13:45, Baolin Wang wrote:
>>
>>
>> On 2024/10/31 18:46, David Hildenbrand wrote:
>> [snip]
>>
>>>>> I don't like that:
>>>>>
>>>>> (a) there is no way to explicitly enable/name that new behavior.
>>>>
>>>> But this is similar to other file systems that enable large folios
>>>> (setting mapping_set_large_folios()), and I haven't seen any other file
>>>> systems supporting large folios requiring a new Kconfig. Maybe tmpfs is
>>>> a bit special?
>>>
>>> I'm afraid I don't have the energy to explain once more why I think
>>> tmpfs is not just like any other file system in some cases.
>>>
>>> And distributions are rather careful when it comes to something like
>>> this ...
>>>
>>>>
>>>> If we all agree that tmpfs is a bit special when using huge pages, then
>>>> fine, a Kconfig option might be needed.
>>>>
>>>>> (b) "always" etc. are only concerned about PMDs.
>>>>
>>>> Yes, currently maintain the same semantics as before, in case users
>>>> still expect THPs.
>>>
>>> Again, I don't think that is a reasonable approach to make PMD-sized
>>> ones special here. It will all get seriously confusing and inconsistent.
>>
>> I agree PMD-sized should not be special. This is all for backward
>> compatibility with the ‘huge=’ mount option, and adding a new kconfig is
>> also for this purpose.
>>
>>> THPs are opportunistic after all, and page fault behavior will remain
>>> unchanged (PMD-sized) for now. And even if we support other sizes during
>>> page faults, we'd like start with the largest size (PMD-size) first, and
>>> it likely might just all work better than before.
>>>
>>> Happy to learn where this really makes a difference.
>>>
>>> Of course, if you change the default behavior (which you are planning),
>>> it's ... a changed default.
>>>
>>> If there are reasons to have more tunables regarding the sizes to use,
>>> then it should not be limited to PMD-size.
>>
>> I have tried to modify the code according to your suggestion (not tested
>> yet). These are what you had in mind?
>>
>>
>> static inline unsigned int
>> shmem_mapping_size_order(struct address_space *mapping, pgoff_t index,
>>                          loff_t write_end)
>> {
>>         unsigned int order;
>>         size_t size;
>>
>>         if (!mapping_large_folio_support(mapping) || !write_end)
>>                 return 0;
>>
>>         /* Calculate the write size based on the write_end */
>>         size = write_end - (index << PAGE_SHIFT);
>>         order = filemap_get_order(size);
>>         if (!order)
>>                 return 0;
>>
>>         /* If we're not aligned, allocate a smaller folio */
>>         if (index & ((1UL << order) - 1))
>>                 order = __ffs(index);
>>
>>         order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
>>         return order > 0 ? BIT(order + 1) - 1 : 0;
>> }
>>
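As a quick sanity check of the math in shmem_mapping_size_order(), here is a
minimal userspace sketch of the same calculation (PAGE_SHIFT and
MAX_PAGECACHE_ORDER values are assumptions for x86-64; size_to_order() stands
in for filemap_get_order() as ilog2 of the page count, and __builtin_ctzl()
stands in for __ffs()):

#include <stdio.h>

#define PAGE_SHIFT 12                   /* assumption: 4K base pages */
#define MAX_PAGECACHE_ORDER 9           /* assumption: PMD order on x86-64 */

/* Userspace stand-in for filemap_get_order(): ilog2 of the page count. */
static unsigned int size_to_order(size_t size)
{
        unsigned long pages = size >> PAGE_SHIFT;

        return pages ? 63 - __builtin_clzl(pages) : 0;
}

int main(void)
{
        unsigned long index = 4;                /* write starts at page 4 */
        long long write_end = 16 * 4096;        /* write ends at 64K */
        size_t size = write_end - (index << PAGE_SHIFT);
        unsigned int order = size_to_order(size);       /* 12 pages -> order 3 */

        /* Page 4 is not order-3 aligned, so fall back to __ffs(index) == 2. */
        if (index & ((1UL << order) - 1))
                order = __builtin_ctzl(index);

        if (order > MAX_PAGECACHE_ORDER)
                order = MAX_PAGECACHE_ORDER;

        /* Prints 0x7: orders 0-2 are allowed for this write. */
        printf("0x%x\n", order > 0 ? (1U << (order + 1)) - 1 : 0);
        return 0;
}

For an aligned 64K write starting at index 0, the same math yields 0x1f
(orders up to 4), so the resulting bitmap grows with both the write size and
the index alignment.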
>> static unsigned int shmem_huge_global_enabled(struct inode *inode,
>>                                               pgoff_t index, loff_t write_end,
>>                                               bool shmem_huge_force,
>>                                               unsigned long vm_flags)
>> {
>>         bool is_shmem = inode->i_sb == shm_mnt->mnt_sb;
>>         unsigned long within_size_orders;
>>         unsigned int order;
>>         loff_t i_size;
>>
>>         if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
>>                 return 0;
>>         if (!S_ISREG(inode->i_mode))
>>                 return 0;
>>         if (shmem_huge == SHMEM_HUGE_DENY)
>>                 return 0;
>>         if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
>>                 return BIT(HPAGE_PMD_ORDER);
>>
>>         switch (SHMEM_SB(inode->i_sb)->huge) {
>>         case SHMEM_HUGE_NEVER:
>>                 return 0;
>>         case SHMEM_HUGE_ALWAYS:
>>                 if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>>                         return BIT(HPAGE_PMD_ORDER);
>>
>>                 return shmem_mapping_size_order(inode->i_mapping,
>>                                                 index, write_end);
>>         case SHMEM_HUGE_WITHIN_SIZE:
>>                 if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>>                         within_size_orders = BIT(HPAGE_PMD_ORDER);
>>                 else
>>                         within_size_orders = shmem_mapping_size_order(inode->i_mapping,
>>                                                                       index, write_end);
>>
>>                 order = highest_order(within_size_orders);
>>                 while (within_size_orders) {
>>                         index = round_up(index + 1, 1 << order);
>>                         i_size = max(write_end, i_size_read(inode));
>>                         i_size = round_up(i_size, PAGE_SIZE);
>>                         if (i_size >> PAGE_SHIFT >= index)
>>                                 return within_size_orders;
>>
>>                         order = next_order(&within_size_orders, order);
>>                 }
>>                 fallthrough;
>>         case SHMEM_HUGE_ADVISE:
>>                 if (vm_flags & VM_HUGEPAGE) {
>>                         if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>>                                 return BIT(HPAGE_PMD_ORDER);
>>
>>                         return shmem_mapping_size_order(inode->i_mapping,
>>                                                         index, write_end);
>>                 }
>>                 fallthrough;
>>         default:
>>                 return 0;
>>         }
>> }
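For reference, the highest_order()/next_order() walk in the
SHMEM_HUGE_WITHIN_SIZE branch tries the allowed orders from largest to
smallest. A minimal userspace model of that iteration (the two helpers are
simplified stand-ins for the kernel ones):

#include <stdio.h>

/* Simplified stand-ins for the kernel's highest_order()/next_order(). */
static int highest_order(unsigned long orders)
{
        return 63 - __builtin_clzl(orders);
}

static int next_order(unsigned long *orders, int prev)
{
        *orders &= ~(1UL << prev);
        return *orders ? highest_order(*orders) : 0;
}

int main(void)
{
        unsigned long orders = (1UL << 4) | (1UL << 2) | 1UL;   /* orders 4, 2, 0 */
        int order = highest_order(orders);

        /* Visits 4, then 2, then 0 -- largest allowed folio first. */
        while (orders) {
                printf("try order %d\n", order);
                order = next_order(&orders, order);
        }
        return 0;
}

In the kernel function, the within_size loop returns as soon as the rounded
i_size covers the index at the current order, so the largest allowed order
that still fits within the file size wins.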
>>
>> 1) Add a new 'CONFIG_USE_ONLY_THP_FOR_TMPFS' kconfig to keep ‘huge=’
>> mount option compatibility.
>> 2) For tmpfs write(), if CONFIG_USE_ONLY_THP_FOR_TMPFS is not enabled,
>> it will get the possible huge orders based on the write size.
>> 3) For tmpfs mmap() fault, always use PMD-sized huge order.
>> 4) For shmem, ignore the write size logic and always use PMD-sized THP
>> to check if the global huge is enabled.
>>
>> However, in case 2), if 'huge=always' and the write size is less than 4K,
>> we will allocate small pages; will that break the 'huge' semantics?
>> Maybe it's not something to worry too much about.
>
> Probably I didn't express clearly what I think we should do, because this is
> not quite what I had in mind.
>
> I would use the CONFIG_USE_ONLY_THP_FOR_TMPFS way of doing it only if
> really required. As raised, if someone needs finer control, providing that
> only for a single size is rather limiting.
OK. I misunderstood your points.
> This is what I hope we can do (doc update to show what I mean):
Thanks for updating the docs. I'd like to include them in the next version.
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 5034915f4e8e8..d7d1a9acdbfc5 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -349,11 +349,24 @@ user, the PMD_ORDER hugepage policy will be overridden. If the policy for
> PMD_ORDER is not defined within a valid ``thp_shmem``, its policy will
> default to ``never``.
>
> -Hugepages in tmpfs/shmem
> -========================
> +tmpfs/shmem
> +===========
>
> -You can control hugepage allocation policy in tmpfs with mount option
> -``huge=``. It can have following values:
> +Traditionally, tmpfs only supported a single huge page size ("PMD"). Today,
> +it also supports smaller sizes just like anonymous memory, often referred
> +to as "multi-size THP" (mTHP). Huge pages of any size are commonly
> +represented in the kernel as "large folios".
> +
> +While there is fine control over the huge page sizes to use for the internal
> +shmem mount (see below), ordinary tmpfs mounts will make use of all
> +available huge page sizes without any control over the exact sizes,
> +behaving more like other file systems.
> +
> +tmpfs mounts
> +------------
> +
> +The THP allocation policy for tmpfs mounts can be adjusted using the mount
> +option: ``huge=``. It can have following values:
>
> always
> Attempt to allocate huge pages every time we need a new page;
> @@ -368,19 +381,20 @@ within_size
> advise
> Only allocate huge pages if requested with fadvise()/madvise();
>
> -The default policy is ``never``.
> +Remember that the kernel may use huge pages of all available sizes, and
> +that there is no fine-grained size control as there is for the internal
> +tmpfs mount.
> +
> +The default policy in the past was ``never``, but it can now be adjusted
> +using the CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_ALWAYS,
> +CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_NEVER etc.
>
> ``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
> ``huge=never`` will not attempt to break up huge pages at all, just stop more
> from being allocated.
>
> -There's also sysfs knob to control hugepage allocation policy for internal
> -shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
> -is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
> -MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
> -
> -In addition to policies listed above, shmem_enabled allows two further
> -values:
> +In addition to policies listed above, the sysfs knob
> +/sys/kernel/mm/transparent_hugepage/shmem_enabled will affect the
> +allocation policy of tmpfs mounts, when set to the following values:
>
> deny
> For use in emergencies, to force the huge option off from
> @@ -388,13 +402,26 @@ deny
> force
> Force the huge option on for all - very useful for testing;
>
> -Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to
> -control mTHP allocation:
> -'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled',
> -and its value for each mTHP is essentially consistent with the global
> -setting. An 'inherit' option is added to ensure compatibility with these
> -global settings. Conversely, the options 'force' and 'deny' are dropped,
> -which are rather testing artifacts from the old ages.
> +
> +shmem / internal tmpfs
> +----------------------
> +
> +The internal tmpfs mount is used for SysV SHM, memfds, shared anonymous
> +mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
> +
> +To control the THP allocation policy for this internal tmpfs mount, the
> +sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled and the knobs
> +per THP size in
> +'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled'
> +can be used.
> +
> +The global knob has the same semantics as the ``huge=`` mount options
> +for tmpfs mounts, except that the different huge page sizes can be controlled
> +individually, and will only use the setting of the global knob when the
> +per-size knob is set to 'inherit'.
> +
> +The options 'force' and 'deny' are dropped for the individual sizes, which
> +are rather testing artifacts from the old ages.
>
> always
> Attempt to allocate <size> huge pages every time we need a new page;
> diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
> index 56a26c843dbe9..10de8f706d07b 100644
>
>
>
> There is this question of "do we need the old way of doing it and only
> allocate PMDs". For that, likely a config similar to the one you propose
> might
> make sense, but I would want to see if there is real demand for that. In
> particular:
> for whom the smaller sizes are a problem when bigger (PMD) sizes were
> enabled in the past.
I am also not sure if such a case exists. I can remove this kconfig for
now, and we can consider it again if someone really complains about this in
the future.
^ permalink raw reply [flat|nested] 37+ messages in thread
end of thread, newest: 2024-11-06 3:17 UTC
Thread overview: 37+ messages -- links below jump to the message on this page --
2024-10-10 9:58 [RFC PATCH v3 0/4] Support large folios for tmpfs Baolin Wang
2024-10-10 9:58 ` [RFC PATCH v3 1/4] mm: factor out the order calculation into a new helper Baolin Wang
2024-10-10 9:58 ` [RFC PATCH v3 2/4] mm: shmem: change shmem_huge_global_enabled() to return huge order bitmap Baolin Wang
2024-10-10 9:58 ` [RFC PATCH v3 3/4] mm: shmem: add large folio support to the write and fallocate paths for tmpfs Baolin Wang
2024-10-10 9:58 ` [RFC PATCH v3 4/4] docs: tmpfs: add documention for 'write_size' huge option Baolin Wang
2024-10-16 7:49 ` [RFC PATCH v3 0/4] Support large folios for tmpfs Kefeng Wang
2024-10-16 9:29 ` Baolin Wang
2024-10-16 13:45 ` Kefeng Wang
2024-10-17 9:52 ` Baolin Wang
2024-10-16 14:06 ` Matthew Wilcox
2024-10-17 9:34 ` Baolin Wang
2024-10-17 11:26 ` Kirill A. Shutemov
2024-10-21 6:24 ` Baolin Wang
2024-10-21 8:54 ` Kirill A. Shutemov
2024-10-21 13:34 ` Daniel Gomez
2024-10-22 3:41 ` Baolin Wang
2024-10-22 15:31 ` David Hildenbrand
2024-10-23 8:04 ` Baolin Wang
2024-10-23 9:27 ` David Hildenbrand
2024-10-24 10:49 ` Daniel Gomez
2024-10-24 10:52 ` Daniel Gomez
2024-10-25 2:56 ` Baolin Wang
2024-10-25 20:21 ` David Hildenbrand
2024-10-28 9:48 ` David Hildenbrand
2024-10-31 3:43 ` Baolin Wang
2024-10-31 8:53 ` David Hildenbrand
2024-10-31 10:04 ` Baolin Wang
2024-10-31 10:46 ` David Hildenbrand
2024-10-31 10:46 ` David Hildenbrand
2024-11-05 12:45 ` Baolin Wang
2024-11-05 14:56 ` David Hildenbrand
2024-11-06 3:17 ` Baolin Wang
2024-10-28 21:56 ` Daniel Gomez
2024-10-29 12:20 ` David Hildenbrand
2024-10-22 3:34 ` Baolin Wang
2024-10-22 10:06 ` Kirill A. Shutemov
2024-10-23 9:25 ` Baolin Wang