* Re: [PATCH xfsprogs v2] xfs_io: add FALLOC_FL_WRITE_ZEROES support
From: Darrick J. Wong @ 2025-08-15 14:42 UTC
To: Zhang Yi
Cc: linux-fsdevel, linux-block, dm-devel, linux-nvme, linux-scsi,
linux-xfs, linux-kernel, linux-api, hch, tytso, bmarzins,
chaitanyak, shinichiro.kawasaki, brauner, martin.petersen,
yi.zhang, chengzhihao1, yukuai3, yangerkun
In-Reply-To: <1428e3fe-ae7a-410d-97b5-7dd0249c41c0@huaweicloud.com>
On Fri, Aug 15, 2025 at 05:59:01PM +0800, Zhang Yi wrote:
> On 2025/8/15 0:54, Darrick J. Wong wrote:
> > On Wed, Aug 13, 2025 at 10:42:50AM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> The Linux kernel (since version 6.17) supports FALLOC_FL_WRITE_ZEROES in
> >> fallocate(2). Add support for FALLOC_FL_WRITE_ZEROES to the
> >> fallocate utility by introducing a new 'fwzero' command in the xfs_io
> >> tool.
> >>
> >> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=278c7d9b5e0c
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> ---
> >> v1->v2:
> >> - Minor description modification to align with the kernel.
> >>
> >> io/prealloc.c | 36 ++++++++++++++++++++++++++++++++++++
> >> man/man8/xfs_io.8 | 6 ++++++
> >> 2 files changed, 42 insertions(+)
> >>
> >> diff --git a/io/prealloc.c b/io/prealloc.c
> >> index 8e968c9f..9a64bf53 100644
> >> --- a/io/prealloc.c
> >> +++ b/io/prealloc.c
> >> @@ -30,6 +30,10 @@
> >> #define FALLOC_FL_UNSHARE_RANGE 0x40
> >> #endif
> >>
> >> +#ifndef FALLOC_FL_WRITE_ZEROES
> >> +#define FALLOC_FL_WRITE_ZEROES 0x80
> >> +#endif
> >> +
> >> static cmdinfo_t allocsp_cmd;
> >> static cmdinfo_t freesp_cmd;
> >> static cmdinfo_t resvsp_cmd;
> >> @@ -41,6 +45,7 @@ static cmdinfo_t fcollapse_cmd;
> >> static cmdinfo_t finsert_cmd;
> >> static cmdinfo_t fzero_cmd;
> >> static cmdinfo_t funshare_cmd;
> >> +static cmdinfo_t fwzero_cmd;
> >>
> >> static int
> >> offset_length(
> >> @@ -377,6 +382,27 @@ funshare_f(
> >> return 0;
> >> }
> >>
> >> +static int
> >> +fwzero_f(
> >> + int argc,
> >> + char **argv)
> >> +{
> >> + xfs_flock64_t segment;
> >> + int mode = FALLOC_FL_WRITE_ZEROES;
> >
> > Shouldn't this take a -k to add FALLOC_FL_KEEP_SIZE like fzero?
> >
>
> Since allocating blocks with written extents beyond the inode size
> is not permitted, the FALLOC_FL_WRITE_ZEROES flag cannot be used
> together with the FALLOC_FL_KEEP_SIZE.
Heh, apparently I didn't read the manpage well enough.
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
--D
> Thanks,
> Yi.
>
> > (The code otherwise looks fine to me)
> >
> > --D
> >
> >> +
> >> + if (!offset_length(argv[1], argv[2], &segment)) {
> >> + exitcode = 1;
> >> + return 0;
> >> + }
> >> +
> >> + if (fallocate(file->fd, mode, segment.l_start, segment.l_len)) {
> >> + perror("fallocate");
> >> + exitcode = 1;
> >> + return 0;
> >> + }
> >> + return 0;
> >> +}
> >> +
> >> void
> >> prealloc_init(void)
> >> {
> >> @@ -489,4 +515,14 @@ prealloc_init(void)
> >> funshare_cmd.oneline =
> >> _("unshares shared blocks within the range");
> >> add_command(&funshare_cmd);
> >> +
> >> + fwzero_cmd.name = "fwzero";
> >> + fwzero_cmd.cfunc = fwzero_f;
> >> + fwzero_cmd.argmin = 2;
> >> + fwzero_cmd.argmax = 2;
> >> + fwzero_cmd.flags = CMD_NOMAP_OK | CMD_FOREIGN_OK;
> >> + fwzero_cmd.args = _("off len");
> >> + fwzero_cmd.oneline =
> >> + _("zeroes space and eliminates holes by allocating and submitting write zeroes");
> >> + add_command(&fwzero_cmd);
> >> }
> >> diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
> >> index b0dcfdb7..0a673322 100644
> >> --- a/man/man8/xfs_io.8
> >> +++ b/man/man8/xfs_io.8
> >> @@ -550,6 +550,12 @@ With the
> >> .B -k
> >> option, use the FALLOC_FL_KEEP_SIZE flag as well.
> >> .TP
> >> +.BI fwzero " offset length"
> >> +Call fallocate with FALLOC_FL_WRITE_ZEROES flag as described in the
> >> +.BR fallocate (2)
> >> +manual page to allocate and zero blocks within the range by submitting write
> >> +zeroes.
> >> +.TP
> >> .BI zero " offset length"
> >> Call xfsctl with
> >> .B XFS_IOC_ZERO_RANGE
> >> --
> >> 2.39.2
> >>
> >>
>
>
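The exchange above about -k can be sketched in C. This is a minimal illustration, not the xfs_io code: the helper names fwzero_mode_ok() and write_zeroes_range() are hypothetical, the 0x80 fallback value is taken from the patch, and the local EINVAL is an assumption of this sketch (the thread does not say which errno the kernel reports for the combination).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

#ifndef FALLOC_FL_KEEP_SIZE
#define FALLOC_FL_KEEP_SIZE 0x01
#endif
#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES 0x80	/* fallback value used by the patch */
#endif

/*
 * Hypothetical helper: returns 1 when 'mode' is a write-zeroes request
 * that does not also ask to keep the file size, 0 otherwise.
 */
static int fwzero_mode_ok(int mode)
{
	if (!(mode & FALLOC_FL_WRITE_ZEROES))
		return 0;
	return !(mode & FALLOC_FL_KEEP_SIZE);
}

/*
 * Hypothetical wrapper: rejects the unsupported flag combination up front
 * (EINVAL is chosen by this sketch), otherwise forwards to fallocate(2).
 */
static int write_zeroes_range(int fd, off_t off, off_t len, int mode)
{
	if (!fwzero_mode_ok(mode)) {
		errno = EINVAL;
		return -1;
	}
	return fallocate(fd, mode, off, len);
}
```

On a kernel or filesystem without FALLOC_FL_WRITE_ZEROES support the fallocate() call itself is expected to fail, so callers would still need to check its return value.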
* Re: [PATCH util-linux v2] fallocate: add FALLOC_FL_WRITE_ZEROES support
From: Darrick J. Wong @ 2025-08-15 14:29 UTC
To: Zhang Yi
Cc: linux-fsdevel, linux-block, dm-devel, linux-nvme, linux-scsi,
linux-kernel, linux-api, hch, tytso, bmarzins, chaitanyak,
shinichiro.kawasaki, brauner, martin.petersen, yi.zhang,
chengzhihao1, yukuai3, yangerkun
In-Reply-To: <a0eda581-ae6c-4b49-8b4f-7bb039b17487@huaweicloud.com>
On Fri, Aug 15, 2025 at 05:29:19PM +0800, Zhang Yi wrote:
> Thank you for your review comments!
>
> On 2025/8/15 0:52, Darrick J. Wong wrote:
> > On Wed, Aug 13, 2025 at 10:40:15AM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> The Linux kernel (since version 6.17) supports FALLOC_FL_WRITE_ZEROES in
> >> fallocate(2). Add support for FALLOC_FL_WRITE_ZEROES to the fallocate
> >> utility by introducing a new option -w|--write-zeroes.
> >>
> >> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=278c7d9b5e0c
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> ---
> >> v1->v2:
> >> - Minor description modification to align with the kernel.
> >>
> >> sys-utils/fallocate.1.adoc | 11 +++++++++--
> >> sys-utils/fallocate.c | 20 ++++++++++++++++----
> >> 2 files changed, 25 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/sys-utils/fallocate.1.adoc b/sys-utils/fallocate.1.adoc
> >> index 44ee0ef4c..0ec9ff9a9 100644
> >> --- a/sys-utils/fallocate.1.adoc
> >> +++ b/sys-utils/fallocate.1.adoc
> >> @@ -12,7 +12,7 @@ fallocate - preallocate or deallocate space to a file
> >
> > <snip all the long lines>
> >
> >> +*-w*, *--write-zeroes*::
> >> +Zeroes space in the byte range starting at _offset_ and continuing
> >> for _length_ bytes. Within the specified range, blocks are
> >> preallocated for the regions that span the holes in the file. After a
> >> successful call, subsequent reads from this range will return zeroes,
> >> subsequent writes to that range do not require further changes to the
> >> file mapping metadata.
> >
> > "...will return zeroes and subsequent writes to that range..." ?
> >
>
> Yeah.
>
> >> ++
> >> +Zeroing is done within the filesystem by preferably submitting write
> >
> > I think we should say less about what the filesystem actually does to
> > preserve some flexibility:
> >
> > "Zeroing is done within the filesystem. The filesystem may use a
> > hardware accelerated zeroing command, or it may submit regular writes.
> > The behavior depends on the filesystem design and available hardware."
> >
>
> Sure.
>
> >> zeroes commands, the alternative way is submitting actual zeroed data,
> >> the specified range will be converted into written extents. The write
> >> zeroes command is typically faster than write actual data if the
> >> device supports unmap write zeroes, the specified range will not be
> >> physically zeroed out on the device.
> >> ++
> >> +Options *--keep-size* can not be specified for the write-zeroes
> >> operation.
> >> +
> >> include::man-common/help-version.adoc[]
> >>
> >> == AUTHORS
> [..]
> >> @@ -429,6 +438,9 @@ int main(int argc, char **argv)
> >> else if (mode & FALLOC_FL_ZERO_RANGE)
> >> fprintf(stdout, _("%s: %s (%ju bytes) zeroed.\n"),
> >> filename, str, length);
> >> + else if (mode & FALLOC_FL_WRITE_ZEROES)
> >> + fprintf(stdout, _("%s: %s (%ju bytes) write zeroed.\n"),
> >
> > "write zeroed" is a little strange, but I don't have a better
> > suggestion. :)
> >
>
> Hmm... What about simply using "zeroed", the same as for FALLOC_FL_ZERO_RANGE?
> Users should be aware of the parameters they have passed to fallocate(),
> so they should not use this print for further differentiation.
No thanks, different inputs should produce different outputs. :)
--D
> Thanks,
> Yi.
>
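The "different inputs should produce different outputs" point can be sketched as a small mode-to-message mapping. This is only an illustration: result_verb() is a hypothetical helper, not a function in util-linux, and the "allocated" fallback string is a placeholder assumed for this sketch. The flag fallback values follow the kernel UAPI header and the patch.

```c
#include <assert.h>
#include <string.h>

#ifndef FALLOC_FL_ZERO_RANGE
#define FALLOC_FL_ZERO_RANGE 0x10
#endif
#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES 0x80	/* fallback value used by the patch */
#endif

/*
 * Hypothetical helper: pick a distinct verb for each zeroing mode so the
 * printed result reflects which operation was actually requested.
 */
static const char *result_verb(int mode)
{
	if (mode & FALLOC_FL_ZERO_RANGE)
		return "zeroed";
	if (mode & FALLOC_FL_WRITE_ZEROES)
		return "write zeroed";
	return "allocated";	/* placeholder for every other mode */
}
```

Keeping the two strings distinct means a script that logs the tool's output can tell the two operations apart without re-parsing the command line.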
* [PATCH v2] fs: Add 'rootfsflags' to set rootfs mount options
From: Lichen Liu @ 2025-08-15 12:14 UTC
To: viro, brauner, jack
Cc: linux-fsdevel, linux-kernel, safinaskar, kexec, rob, weilongchen,
cyphar, linux-api, zohar, stefanb, initramfs, corbet, linux-doc,
Lichen Liu
When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs.
By default, a tmpfs mount is limited to using 50% of the available RAM
for its content. This can be problematic in memory-constrained
environments, particularly during a kdump capture.
In a kdump scenario, the capture kernel boots with a limited amount of
memory specified by the 'crashkernel' parameter. If the initramfs is
large, it may fail to unpack into the tmpfs rootfs due to insufficient
space. This is because to get X MB of usable space in tmpfs, 2*X MB of
memory must be available for the mount. This leads to an OOM failure
during the early boot process, preventing a successful crash dump.
This patch introduces a new kernel command-line parameter, rootfsflags,
which allows passing specific mount options directly to the rootfs when
it is first mounted. This gives users control over the rootfs behavior.
For example, a user can now specify rootfsflags=size=75% to allow the
tmpfs to use up to 75% of the available memory. This can significantly
reduce the memory pressure for kdump.
Consider a practical example:
To unpack a 48MB initramfs, the tmpfs needs 48MB of usable space. With
the default 50% limit, this requires a memory pool of 96MB to be
available for the tmpfs mount. The total memory requirement is therefore
approximately: 16MB (vmlinuz) + 48MB (loaded initramfs) + 48MB (unpacked
kernel) + 96MB (for tmpfs) + 12MB (runtime overhead) ≈ 220MB.
By using rootfsflags=size=75%, the memory pool required for the 48MB
tmpfs is reduced to 48MB / 0.75 = 64MB. This reduces the total memory
requirement by 32MB (96MB - 64MB), allowing the kdump to succeed with a
smaller crashkernel size, such as 192MB.
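The arithmetic above can be checked with a short C sketch. The function names tmpfs_pool_mb() and kdump_total_mb() are hypothetical and exist only for this illustration; the component sizes are the ones quoted in the example, and the model (pool = usable space / size fraction) is the simplification the commit message itself uses.

```c
#include <assert.h>

/*
 * Memory pool (in MB) that must be available so that a tmpfs limited to
 * 'size_percent' of it still offers 'usable_mb' of space (rounded up).
 */
static unsigned tmpfs_pool_mb(unsigned usable_mb, unsigned size_percent)
{
	assert(size_percent > 0 && size_percent <= 100);
	return (usable_mb * 100 + size_percent - 1) / size_percent;
}

/* Total for the example: 16MB vmlinuz, 48MB initramfs, 48MB unpacked, 12MB overhead. */
static unsigned kdump_total_mb(unsigned size_percent)
{
	const unsigned vmlinuz = 16, initramfs = 48, unpacked = 48, overhead = 12;

	return vmlinuz + initramfs + unpacked +
	       tmpfs_pool_mb(initramfs, size_percent) + overhead;
}
```

With these numbers, the default 50% limit gives a 96MB pool and a 220MB total, while size=75% gives a 64MB pool and a 188MB total, which fits within the 192MB crashkernel size mentioned above.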
An alternative approach of reusing the existing rootflags parameter was
considered. However, a new, dedicated rootfsflags parameter was chosen
to avoid altering the current behavior of rootflags (which applies to
the final root filesystem) and to prevent any potential regressions.
Also add documentation for the new kernel parameter "rootfsflags".
This approach is inspired by prior discussions and patches on the topic.
Ref: https://www.lightofdawn.org/blog/?viewDetailed=00128
Ref: https://landley.net/notes-2015.html#01-01-2015
Ref: https://lkml.org/lkml/2021/6/29/783
Ref: https://www.kernel.org/doc/html/latest/filesystems/ramfs-rootfs-initramfs.html#what-is-rootfs
Signed-off-by: Lichen Liu <lichliu@redhat.com>
Tested-by: Rob Landley <rob@landley.net>
---
Changes in v2:
- Add documentation for the new kernel parameter.
Documentation/admin-guide/kernel-parameters.txt | 3 +++
fs/namespace.c | 11 ++++++++++-
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb8752b42ec8..0c00f651d431 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6220,6 +6220,9 @@
rootflags= [KNL] Set root filesystem mount option string
+ rootfsflags= [KNL] Set initial root filesystem mount option string
+ (e.g. tmpfs for initramfs)
+
rootfstype= [KNL] Set root filesystem type
rootwait [KNL] Wait (indefinitely) for root device to show up.
diff --git a/fs/namespace.c b/fs/namespace.c
index 8f1000f9f3df..e484c26d5e3f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -65,6 +65,15 @@ static int __init set_mphash_entries(char *str)
}
__setup("mphash_entries=", set_mphash_entries);
+static char * __initdata rootfs_flags;
+static int __init rootfs_flags_setup(char *str)
+{
+ rootfs_flags = str;
+ return 1;
+}
+
+__setup("rootfsflags=", rootfs_flags_setup);
+
static u64 event;
static DEFINE_XARRAY_FLAGS(mnt_id_xa, XA_FLAGS_ALLOC);
static DEFINE_IDA(mnt_group_ida);
@@ -5677,7 +5686,7 @@ static void __init init_mount_tree(void)
struct mnt_namespace *ns;
struct path root;
- mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", NULL);
+ mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", rootfs_flags);
if (IS_ERR(mnt))
panic("Can't create rootfs");
--
2.47.0
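As a rough userspace illustration of what the __setup("rootfsflags=", ...) hook receives, the sketch below extracts the value of a rootfsflags= parameter from a command-line string. It is not the kernel's parser: find_rootfsflags() is a hypothetical helper, and quoting and other details of real command-line handling are deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns a pointer to the value of the first
 * space-separated "rootfsflags=" parameter in 'cmdline' and stores the
 * value's length in '*len'. Returns NULL when the parameter is absent.
 */
static const char *find_rootfsflags(const char *cmdline, size_t *len)
{
	static const char key[] = "rootfsflags=";
	const size_t klen = sizeof(key) - 1;
	const char *p = cmdline;

	while (*p) {
		const char *end;

		while (*p == ' ')	/* skip separators */
			p++;
		end = p;
		while (*end && *end != ' ')	/* find end of this parameter */
			end++;
		if ((size_t)(end - p) >= klen && strncmp(p, key, klen) == 0) {
			*len = (size_t)(end - p) - klen;
			return p + klen;
		}
		p = end;
	}
	return NULL;
}
```

Because the match is on the full "rootfsflags=" prefix, an existing rootflags= parameter on the same command line is not picked up, which mirrors the patch's intent of keeping the two options separate.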
* Re: [PATCH RESEND] fs: Add 'rootfsflags' to set rootfs mount options
From: Lichen Liu @ 2025-08-15 12:12 UTC
To: Randy Dunlap
Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, safinaskar,
kexec, rob, weilongchen, cyphar, linux-api, zohar, stefanb,
initramfs
In-Reply-To: <dd25041f-98e0-4bb5-bcd5-ba3507262c76@infradead.org>
Thanks Randy,
I will send a v2 with documentation.
On Fri, Aug 15, 2025 at 12:27 AM Randy Dunlap <rdunlap@infradead.org> wrote:
>
> Hi,
>
> On 8/14/25 3:34 AM, Lichen Liu wrote:
> > When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs.
> > By default, a tmpfs mount is limited to using 50% of the available RAM
> > for its content. This can be problematic in memory-constrained
> > environments, particularly during a kdump capture.
> >
> > In a kdump scenario, the capture kernel boots with a limited amount of
> > memory specified by the 'crashkernel' parameter. If the initramfs is
> > large, it may fail to unpack into the tmpfs rootfs due to insufficient
> > space. This is because to get X MB of usable space in tmpfs, 2*X MB of
> > memory must be available for the mount. This leads to an OOM failure
> > during the early boot process, preventing a successful crash dump.
> >
> > This patch introduces a new kernel command-line parameter, rootfsflags,
> > which allows passing specific mount options directly to the rootfs when
> > it is first mounted. This gives users control over the rootfs behavior.
> >
> > For example, a user can now specify rootfsflags=size=75% to allow the
> > tmpfs to use up to 75% of the available memory. This can significantly
> > reduce the memory pressure for kdump.
> >
> > Consider a practical example:
> >
> > To unpack a 48MB initramfs, the tmpfs needs 48MB of usable space. With
> > the default 50% limit, this requires a memory pool of 96MB to be
> > available for the tmpfs mount. The total memory requirement is therefore
> > approximately: 16MB (vmlinuz) + 48MB (loaded initramfs) + 48MB (unpacked
> > kernel) + 96MB (for tmpfs) + 12MB (runtime overhead) ≈ 220MB.
> >
> > By using rootfsflags=size=75%, the memory pool required for the 48MB
> > tmpfs is reduced to 48MB / 0.75 = 64MB. This reduces the total memory
> > requirement by 32MB (96MB - 64MB), allowing the kdump to succeed with a
> > smaller crashkernel size, such as 192MB.
> >
> > An alternative approach of reusing the existing rootflags parameter was
> > considered. However, a new, dedicated rootfsflags parameter was chosen
> > to avoid altering the current behavior of rootflags (which applies to
> > the final root filesystem) and to prevent any potential regressions.
> >
> > This approach is inspired by prior discussions and patches on the topic.
> > Ref: https://www.lightofdawn.org/blog/?viewDetailed=00128
> > Ref: https://landley.net/notes-2015.html#01-01-2015
> > Ref: https://lkml.org/lkml/2021/6/29/783
> > Ref: https://www.kernel.org/doc/html/latest/filesystems/ramfs-rootfs-initramfs.html#what-is-rootfs
> >
> > Signed-off-by: Lichen Liu <lichliu@redhat.com>
> > Tested-by: Rob Landley <rob@landley.net>
> > ---
> > Hi VFS maintainers,
> >
> > Resending this patch as it did not get picked up.
> > This patch is intended for the VFS tree.
> >
> > fs/namespace.c | 11 ++++++++++-
> > 1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index 8f1000f9f3df..e484c26d5e3f 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -65,6 +65,15 @@ static int __init set_mphash_entries(char *str)
> > }
> > __setup("mphash_entries=", set_mphash_entries);
> >
> > +static char * __initdata rootfs_flags;
> > +static int __init rootfs_flags_setup(char *str)
> > +{
> > + rootfs_flags = str;
> > + return 1;
> > +}
> > +
> > +__setup("rootfsflags=", rootfs_flags_setup);
>
> Please document this option (alphabetically) in
> Documentation/admin-guide/kernel-parameters.txt.
>
> Thanks.
>
> > +
> > static u64 event;
> > static DEFINE_XARRAY_FLAGS(mnt_id_xa, XA_FLAGS_ALLOC);
> > static DEFINE_IDA(mnt_group_ida);
> > @@ -5677,7 +5686,7 @@ static void __init init_mount_tree(void)
> > struct mnt_namespace *ns;
> > struct path root;
> >
> > - mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", NULL);
> > + mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", rootfs_flags);
> > if (IS_ERR(mnt))
> > panic("Can't create rootfs");
> >
>
> --
> ~Randy
>
>
* Re: [PATCH v5 1/3] man/man2/mremap.2: explicitly document the simple move operation
From: Alejandro Colomar @ 2025-08-15 10:05 UTC
To: Lorenzo Stoakes
Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
linux-api
In-Reply-To: <0a5d0d6e9f75e8e2de05506f73c41b069d77de36.1754924278.git.lorenzo.stoakes@oracle.com>
Hi Lorenzo,
On Mon, Aug 11, 2025 at 03:59:37PM +0100, Lorenzo Stoakes wrote:
> In preparation for discussing newly introduced mremap() behaviour to permit
> the move of multiple mappings at once, add a section to the mremap.2 man
> page to describe these operations in general.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Thanks! I've applied this patch.
<https://www.alejandro-colomar.es/src/alx/linux/man-pages/man-pages.git/commit/?h=contrib&id=6ba37b9e14f6565d0cccecb634100d7fe11d22fb>
Have a lovely day!
Alex
> ---
> man/man2/mremap.2 | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
> index 2168ca728..4e3c8e54e 100644
> --- a/man/man2/mremap.2
> +++ b/man/man2/mremap.2
> @@ -25,6 +25,20 @@ moving it at the same time (controlled by the
> argument and
> the available virtual address space).
> .P
> +Mappings can also simply be moved
> +(without any resizing)
> +by specifying equal
> +.I old_size
> +and
> +.I new_size
> +and using the
> +.B MREMAP_FIXED
> +flag
> +(see below).
> +The
> +.B MREMAP_DONTUNMAP
> +flag may also be specified.
> +.P
> .I old_address
> is the old address of the virtual memory block that you
> want to expand (or shrink).
> --
> 2.50.1
>
--
<https://www.alejandro-colomar.es/>
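The simple move described in the quoted man page text can be sketched as follows. This is an illustration under assumptions, not text from the man page: move_mapping() and demo_simple_move() are hypothetical names, the destination is reserved here with a PROT_NONE mapping purely to obtain a free address, and MREMAP_MAYMOVE is passed because mremap(2) requires it whenever MREMAP_FIXED is used.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Move a mapping without resizing it: old_size == new_size plus MREMAP_FIXED. */
static void *move_mapping(void *old, size_t size, void *target)
{
	return mremap(old, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, target);
}

/* Returns 0 if a one-page anonymous mapping was moved with its contents intact. */
static int demo_simple_move(void)
{
	size_t size = (size_t)sysconf(_SC_PAGESIZE);
	char *src, *moved;
	void *dst;
	int ok;

	src = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED)
		return -1;

	/* Reserve a destination address; mremap() replaces this mapping. */
	dst = mmap(NULL, size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (dst == MAP_FAILED) {
		munmap(src, size);
		return -1;
	}

	strcpy(src, "moved");
	moved = move_mapping(src, size, dst);
	if (moved == MAP_FAILED) {
		munmap(src, size);
		munmap(dst, size);
		return -1;
	}

	ok = (moved == (char *)dst) && strcmp(moved, "moved") == 0;
	munmap(moved, size);
	return ok ? 0 : -1;
}
```

After a successful move the old address range is no longer mapped, so only the returned address should be used.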
* Re: [PATCH xfsprogs v2] xfs_io: add FALLOC_FL_WRITE_ZEROES support
From: Zhang Yi @ 2025-08-15 9:59 UTC
To: Darrick J. Wong
Cc: linux-fsdevel, linux-block, dm-devel, linux-nvme, linux-scsi,
linux-xfs, linux-kernel, linux-api, hch, tytso, bmarzins,
chaitanyak, shinichiro.kawasaki, brauner, martin.petersen,
yi.zhang, chengzhihao1, yukuai3, yangerkun
In-Reply-To: <20250814165430.GR7942@frogsfrogsfrogs>
On 2025/8/15 0:54, Darrick J. Wong wrote:
> On Wed, Aug 13, 2025 at 10:42:50AM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> The Linux kernel (since version 6.17) supports FALLOC_FL_WRITE_ZEROES in
>> fallocate(2). Add support for FALLOC_FL_WRITE_ZEROES to the
>> fallocate utility by introducing a new 'fwzero' command in the xfs_io
>> tool.
>>
>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=278c7d9b5e0c
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>> v1->v2:
>> - Minor description modification to align with the kernel.
>>
>> io/prealloc.c | 36 ++++++++++++++++++++++++++++++++++++
>> man/man8/xfs_io.8 | 6 ++++++
>> 2 files changed, 42 insertions(+)
>>
>> diff --git a/io/prealloc.c b/io/prealloc.c
>> index 8e968c9f..9a64bf53 100644
>> --- a/io/prealloc.c
>> +++ b/io/prealloc.c
>> @@ -30,6 +30,10 @@
>> #define FALLOC_FL_UNSHARE_RANGE 0x40
>> #endif
>>
>> +#ifndef FALLOC_FL_WRITE_ZEROES
>> +#define FALLOC_FL_WRITE_ZEROES 0x80
>> +#endif
>> +
>> static cmdinfo_t allocsp_cmd;
>> static cmdinfo_t freesp_cmd;
>> static cmdinfo_t resvsp_cmd;
>> @@ -41,6 +45,7 @@ static cmdinfo_t fcollapse_cmd;
>> static cmdinfo_t finsert_cmd;
>> static cmdinfo_t fzero_cmd;
>> static cmdinfo_t funshare_cmd;
>> +static cmdinfo_t fwzero_cmd;
>>
>> static int
>> offset_length(
>> @@ -377,6 +382,27 @@ funshare_f(
>> return 0;
>> }
>>
>> +static int
>> +fwzero_f(
>> + int argc,
>> + char **argv)
>> +{
>> + xfs_flock64_t segment;
>> + int mode = FALLOC_FL_WRITE_ZEROES;
>
> Shouldn't this take a -k to add FALLOC_FL_KEEP_SIZE like fzero?
>
Since allocating blocks with written extents beyond the inode size
is not permitted, the FALLOC_FL_WRITE_ZEROES flag cannot be used
together with the FALLOC_FL_KEEP_SIZE.
Thanks,
Yi.
> (The code otherwise looks fine to me)
>
> --D
>
>> +
>> + if (!offset_length(argv[1], argv[2], &segment)) {
>> + exitcode = 1;
>> + return 0;
>> + }
>> +
>> + if (fallocate(file->fd, mode, segment.l_start, segment.l_len)) {
>> + perror("fallocate");
>> + exitcode = 1;
>> + return 0;
>> + }
>> + return 0;
>> +}
>> +
>> void
>> prealloc_init(void)
>> {
>> @@ -489,4 +515,14 @@ prealloc_init(void)
>> funshare_cmd.oneline =
>> _("unshares shared blocks within the range");
>> add_command(&funshare_cmd);
>> +
>> + fwzero_cmd.name = "fwzero";
>> + fwzero_cmd.cfunc = fwzero_f;
>> + fwzero_cmd.argmin = 2;
>> + fwzero_cmd.argmax = 2;
>> + fwzero_cmd.flags = CMD_NOMAP_OK | CMD_FOREIGN_OK;
>> + fwzero_cmd.args = _("off len");
>> + fwzero_cmd.oneline =
>> + _("zeroes space and eliminates holes by allocating and submitting write zeroes");
>> + add_command(&fwzero_cmd);
>> }
>> diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
>> index b0dcfdb7..0a673322 100644
>> --- a/man/man8/xfs_io.8
>> +++ b/man/man8/xfs_io.8
>> @@ -550,6 +550,12 @@ With the
>> .B -k
>> option, use the FALLOC_FL_KEEP_SIZE flag as well.
>> .TP
>> +.BI fwzero " offset length"
>> +Call fallocate with FALLOC_FL_WRITE_ZEROES flag as described in the
>> +.BR fallocate (2)
>> +manual page to allocate and zero blocks within the range by submitting write
>> +zeroes.
>> +.TP
>> .BI zero " offset length"
>> Call xfsctl with
>> .B XFS_IOC_ZERO_RANGE
>> --
>> 2.39.2
>>
>>
* Re: [PATCH util-linux v2] fallocate: add FALLOC_FL_WRITE_ZEROES support
From: Zhang Yi @ 2025-08-15 9:29 UTC
To: Darrick J. Wong
Cc: linux-fsdevel, linux-block, dm-devel, linux-nvme, linux-scsi,
linux-kernel, linux-api, hch, tytso, bmarzins, chaitanyak,
shinichiro.kawasaki, brauner, martin.petersen, yi.zhang,
chengzhihao1, yukuai3, yangerkun
In-Reply-To: <20250814165218.GQ7942@frogsfrogsfrogs>
Thank you for your review comments!
On 2025/8/15 0:52, Darrick J. Wong wrote:
> On Wed, Aug 13, 2025 at 10:40:15AM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> The Linux kernel (since version 6.17) supports FALLOC_FL_WRITE_ZEROES in
>> fallocate(2). Add support for FALLOC_FL_WRITE_ZEROES to the fallocate
>> utility by introducing a new option -w|--write-zeroes.
>>
>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=278c7d9b5e0c
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>> v1->v2:
>> - Minor description modification to align with the kernel.
>>
>> sys-utils/fallocate.1.adoc | 11 +++++++++--
>> sys-utils/fallocate.c | 20 ++++++++++++++++----
>> 2 files changed, 25 insertions(+), 6 deletions(-)
>>
>> diff --git a/sys-utils/fallocate.1.adoc b/sys-utils/fallocate.1.adoc
>> index 44ee0ef4c..0ec9ff9a9 100644
>> --- a/sys-utils/fallocate.1.adoc
>> +++ b/sys-utils/fallocate.1.adoc
>> @@ -12,7 +12,7 @@ fallocate - preallocate or deallocate space to a file
>
> <snip all the long lines>
>
>> +*-w*, *--write-zeroes*::
>> +Zeroes space in the byte range starting at _offset_ and continuing
>> for _length_ bytes. Within the specified range, blocks are
>> preallocated for the regions that span the holes in the file. After a
>> successful call, subsequent reads from this range will return zeroes,
>> subsequent writes to that range do not require further changes to the
>> file mapping metadata.
>
> "...will return zeroes and subsequent writes to that range..." ?
>
Yeah.
>> ++
>> +Zeroing is done within the filesystem by preferably submitting write
>
> I think we should say less about what the filesystem actually does to
> preserve some flexibility:
>
> "Zeroing is done within the filesystem. The filesystem may use a
> hardware accelerated zeroing command, or it may submit regular writes.
> The behavior depends on the filesystem design and available hardware."
>
Sure.
>> zeroes commands, the alternative way is submitting actual zeroed data,
>> the specified range will be converted into written extents. The write
>> zeroes command is typically faster than write actual data if the
>> device supports unmap write zeroes, the specified range will not be
>> physically zeroed out on the device.
>> ++
>> +Options *--keep-size* can not be specified for the write-zeroes
>> operation.
>> +
>> include::man-common/help-version.adoc[]
>>
>> == AUTHORS
[..]
>> @@ -429,6 +438,9 @@ int main(int argc, char **argv)
>> else if (mode & FALLOC_FL_ZERO_RANGE)
>> fprintf(stdout, _("%s: %s (%ju bytes) zeroed.\n"),
>> filename, str, length);
>> + else if (mode & FALLOC_FL_WRITE_ZEROES)
>> + fprintf(stdout, _("%s: %s (%ju bytes) write zeroed.\n"),
>
> "write zeroed" is a little strange, but I don't have a better
> suggestion. :)
>
Hmm... What about simply using "zeroed", the same as for FALLOC_FL_ZERO_RANGE?
Users should be aware of the parameters they have passed to fallocate(),
so they should not use this print for further differentiation.
Thanks,
Yi.
* Re: [PATCH v3 07/30] kho: add interfaces to unpreserve folios and physical memory ranges
From: Mike Rapoport @ 2025-08-15 9:12 UTC
To: Jason Gunthorpe
Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, dmatlack,
rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250814132233.GB802098@nvidia.com>
On Thu, Aug 14, 2025 at 10:22:33AM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 07, 2025 at 01:44:13AM +0000, Pasha Tatashin wrote:
> > +int kho_unpreserve_phys(phys_addr_t phys, size_t size)
> > +{
>
> Why are we adding phys apis? Didn't we talk about this before and
> agree not to expose these?
>
> The places using it are goofy:
>
> +static int luo_fdt_setup(void)
> +{
> + fdt_out = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
> + get_order(LUO_FDT_SIZE));
>
> + ret = kho_preserve_phys(__pa(fdt_out), LUO_FDT_SIZE);
>
> + WARN_ON_ONCE(kho_unpreserve_phys(__pa(fdt_out), LUO_FDT_SIZE));
>
> It literally allocated a page and then for some reason switches to
> phys with an open coded __pa??
>
> This is ugly, if you want a helper to match __get_free_pages() then
> make one that works on void * directly. You can get the order of the
> void * directly from the struct page IIRC when using GFP_COMP.
>
> Which is perhaps another comment, if this __get_free_pages() is going
> to be a common pattern (and I guess it will be) then the API should be
> streamlined a lot more:
>
> void *kho_alloc_preserved_memory(gfp, size);
> void kho_free_preserved_memory(void *);
This looks backwards to me. KHO should not deal with memory allocation;
its responsibility is to preserve/restore the memory objects it supports.
For __get_free_pages() the natural KHO API is kho_(un)preserve_pages().
With struct page/memdesc we always have page_to_<specialized object> on
one side and page_to_pfn on the other side.
Then the folio and phys/virt APIs just become thin wrappers around the _page
APIs. And down the road we can add slab and maybe vmalloc.
Once folio no longer overlaps struct page, we'll have a hard time with only
kho_preserve_folio() for memory that's not actually a folio (i.e. anon and
page cache).
> Which can wrapper the get_free_pages and the preserve logic and gives
> a nice path to possibly someday supporting non-PAGE_SIZE allocations.
>
> Jason
>
--
Sincerely yours,
Mike.
* Re: [PATCH v3 07/30] kho: add interfaces to unpreserve folios and physical memory ranges
From: Jason Gunthorpe @ 2025-08-14 17:01 UTC
To: Pasha Tatashin
Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bCbjmRKtVVAok7GH8xvh8JWrga5Oj-iK-p=1M79AqvhRA@mail.gmail.com>
On Thu, Aug 14, 2025 at 03:05:04PM +0000, Pasha Tatashin wrote:
> On Thu, Aug 14, 2025 at 1:22 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Thu, Aug 07, 2025 at 01:44:13AM +0000, Pasha Tatashin wrote:
> > > +int kho_unpreserve_phys(phys_addr_t phys, size_t size)
> > > +{
> >
> > Why are we adding phys apis? Didn't we talk about this before and
> > agree not to expose these?
>
> It is already there, this patch simply completes a lacking unpreserve part.
This patch yes, but that is because the later patches intend to use
it, which I argue those patches should not.
There should not be any users of these phys interfaces because they
make no sense. The API preserves folios and brings allocated folios
back on the other side. None of that is phys.
> > Which is perhaps another comment, if this __get_free_pages() is going
> > to be a common pattern (and I guess it will be) then the API should be
> > streamlined a lot more:
> >
> > void *kho_alloc_preserved_memory(gfp, size);
> > void kho_free_preserved_memory(void *);
>
> Hm, not all GFP flags are compatible with KHO preserve, but we could
> add this or similar API, but first let's make KHO completely
> stateless: remove, finalize and abort parts from it.
Right, in those cases we often warn on and mask invalid flags.
Jason
* Re: [PATCH xfsprogs v2] xfs_io: add FALLOC_FL_WRITE_ZEROES support
From: Darrick J. Wong @ 2025-08-14 16:54 UTC
To: Zhang Yi
Cc: linux-fsdevel, linux-block, dm-devel, linux-nvme, linux-scsi,
linux-xfs, linux-kernel, linux-api, hch, tytso, bmarzins,
chaitanyak, shinichiro.kawasaki, brauner, martin.petersen,
yi.zhang, chengzhihao1, yukuai3, yangerkun
In-Reply-To: <20250813024250.2504126-1-yi.zhang@huaweicloud.com>
On Wed, Aug 13, 2025 at 10:42:50AM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> The Linux kernel (since version 6.17) supports FALLOC_FL_WRITE_ZEROES in
> fallocate(2). Add support for FALLOC_FL_WRITE_ZEROES to the
> fallocate utility by introducing a new 'fwzero' command in the xfs_io
> tool.
>
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=278c7d9b5e0c
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> v1->v2:
> - Minor description modification to align with the kernel.
>
> io/prealloc.c | 36 ++++++++++++++++++++++++++++++++++++
> man/man8/xfs_io.8 | 6 ++++++
> 2 files changed, 42 insertions(+)
>
> diff --git a/io/prealloc.c b/io/prealloc.c
> index 8e968c9f..9a64bf53 100644
> --- a/io/prealloc.c
> +++ b/io/prealloc.c
> @@ -30,6 +30,10 @@
> #define FALLOC_FL_UNSHARE_RANGE 0x40
> #endif
>
> +#ifndef FALLOC_FL_WRITE_ZEROES
> +#define FALLOC_FL_WRITE_ZEROES 0x80
> +#endif
> +
> static cmdinfo_t allocsp_cmd;
> static cmdinfo_t freesp_cmd;
> static cmdinfo_t resvsp_cmd;
> @@ -41,6 +45,7 @@ static cmdinfo_t fcollapse_cmd;
> static cmdinfo_t finsert_cmd;
> static cmdinfo_t fzero_cmd;
> static cmdinfo_t funshare_cmd;
> +static cmdinfo_t fwzero_cmd;
>
> static int
> offset_length(
> @@ -377,6 +382,27 @@ funshare_f(
> return 0;
> }
>
> +static int
> +fwzero_f(
> + int argc,
> + char **argv)
> +{
> + xfs_flock64_t segment;
> + int mode = FALLOC_FL_WRITE_ZEROES;
Shouldn't this take a -k to add FALLOC_FL_KEEP_SIZE like fzero?
(The code otherwise looks fine to me)
--D
> +
> + if (!offset_length(argv[1], argv[2], &segment)) {
> + exitcode = 1;
> + return 0;
> + }
> +
> + if (fallocate(file->fd, mode, segment.l_start, segment.l_len)) {
> + perror("fallocate");
> + exitcode = 1;
> + return 0;
> + }
> + return 0;
> +}
> +
> void
> prealloc_init(void)
> {
> @@ -489,4 +515,14 @@ prealloc_init(void)
> funshare_cmd.oneline =
> _("unshares shared blocks within the range");
> add_command(&funshare_cmd);
> +
> + fwzero_cmd.name = "fwzero";
> + fwzero_cmd.cfunc = fwzero_f;
> + fwzero_cmd.argmin = 2;
> + fwzero_cmd.argmax = 2;
> + fwzero_cmd.flags = CMD_NOMAP_OK | CMD_FOREIGN_OK;
> + fwzero_cmd.args = _("off len");
> + fwzero_cmd.oneline =
> + _("zeroes space and eliminates holes by allocating and submitting write zeroes");
> + add_command(&fwzero_cmd);
> }
> diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
> index b0dcfdb7..0a673322 100644
> --- a/man/man8/xfs_io.8
> +++ b/man/man8/xfs_io.8
> @@ -550,6 +550,12 @@ With the
> .B -k
> option, use the FALLOC_FL_KEEP_SIZE flag as well.
> .TP
> +.BI fwzero " offset length"
> +Call fallocate with FALLOC_FL_WRITE_ZEROES flag as described in the
> +.BR fallocate (2)
> +manual page to allocate and zero blocks within the range by submitting write
> +zeroes.
> +.TP
> .BI zero " offset length"
> Call xfsctl with
> .B XFS_IOC_ZERO_RANGE
> --
> 2.39.2
>
>
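
For reference, the call that the proposed fwzero command ends up making can be sketched in userspace roughly as follows. The helper name write_zeroes_range() is made up for this illustration; the fallback define mirrors the one in the patch. Whether FALLOC_FL_KEEP_SIZE may be combined with this flag is not settled in this message, so the sketch passes FALLOC_FL_WRITE_ZEROES alone.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Fallback for older headers; same value the patch uses. */
#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES 0x80
#endif

/*
 * Ask the filesystem to allocate and zero [off, off + len).
 * Returns 0 on success or a negative errno value. A kernel or
 * filesystem without support is expected to report an error
 * (typically EOPNOTSUPP) instead of silently doing something else.
 */
static int write_zeroes_range(int fd, off_t off, off_t len)
{
	if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, off, len) < 0)
		return -errno;
	return 0;
}
```

Returning the negative errno lets a caller distinguish "not supported here" from a genuine I/O failure and fall back to FALLOC_FL_ZERO_RANGE or plain writes if it wants to.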
* Re: [PATCH util-linux v2] fallocate: add FALLOC_FL_WRITE_ZEROES support
From: Darrick J. Wong @ 2025-08-14 16:52 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-fsdevel, linux-block, dm-devel, linux-nvme, linux-scsi,
linux-kernel, linux-api, hch, tytso, bmarzins, chaitanyak,
shinichiro.kawasaki, brauner, martin.petersen, yi.zhang,
chengzhihao1, yukuai3, yangerkun
In-Reply-To: <20250813024015.2502234-1-yi.zhang@huaweicloud.com>
On Wed, Aug 13, 2025 at 10:40:15AM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> The Linux kernel (since version 6.17) supports FALLOC_FL_WRITE_ZEROES in
> fallocate(2). Add support for FALLOC_FL_WRITE_ZEROES to the fallocate
> utility by introducing a new option -w|--write-zeroes.
>
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=278c7d9b5e0c
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> v1->v2:
> - Minor description modification to align with the kernel.
>
> sys-utils/fallocate.1.adoc | 11 +++++++++--
> sys-utils/fallocate.c | 20 ++++++++++++++++----
> 2 files changed, 25 insertions(+), 6 deletions(-)
>
> diff --git a/sys-utils/fallocate.1.adoc b/sys-utils/fallocate.1.adoc
> index 44ee0ef4c..0ec9ff9a9 100644
> --- a/sys-utils/fallocate.1.adoc
> +++ b/sys-utils/fallocate.1.adoc
> @@ -12,7 +12,7 @@ fallocate - preallocate or deallocate space to a file
<snip all the long lines>
> +*-w*, *--write-zeroes*::
> +Zeroes space in the byte range starting at _offset_ and continuing
> for _length_ bytes. Within the specified range, blocks are
> preallocated for the regions that span the holes in the file. After a
> successful call, subsequent reads from this range will return zeroes,
> subsequent writes to that range do not require further changes to the
> file mapping metadata.
"...will return zeroes and subsequent writes to that range..." ?
> ++
> +Zeroing is done within the filesystem by preferably submitting write
I think we should say less about what the filesystem actually does to
preserve some flexibility:
"Zeroing is done within the filesystem. The filesystem may use a
hardware accelerated zeroing command, or it may submit regular writes.
The behavior depends on the filesystem design and available hardware."
> zeores commands, the alternative way is submitting actual zeroed data,
> the specified range will be converted into written extents. The write
> zeroes command is typically faster than write actual data if the
> device supports unmap write zeroes, the specified range will not be
> physically zeroed out on the device.
> ++
> +Options *--keep-size* can not be specified for the write-zeroes
> operation.
> +
> include::man-common/help-version.adoc[]
>
> == AUTHORS
> diff --git a/sys-utils/fallocate.c b/sys-utils/fallocate.c
> index 13bf52915..8d37fdad7 100644
> --- a/sys-utils/fallocate.c
> +++ b/sys-utils/fallocate.c
> @@ -40,7 +40,7 @@
> #if defined(HAVE_LINUX_FALLOC_H) && \
> (!defined(FALLOC_FL_KEEP_SIZE) || !defined(FALLOC_FL_PUNCH_HOLE) || \
> !defined(FALLOC_FL_COLLAPSE_RANGE) || !defined(FALLOC_FL_ZERO_RANGE) || \
> - !defined(FALLOC_FL_INSERT_RANGE))
> + !defined(FALLOC_FL_INSERT_RANGE) || !defined(FALLOC_FL_WRITE_ZEROES))
> # include <linux/falloc.h> /* non-libc fallback for FALLOC_FL_* flags */
> #endif
>
> @@ -65,6 +65,10 @@
> # define FALLOC_FL_INSERT_RANGE 0x20
> #endif
>
> +#ifndef FALLOC_FL_WRITE_ZEROES
> +# define FALLOC_FL_WRITE_ZEROES 0x80
> +#endif
> +
> #include "nls.h"
> #include "strutils.h"
> #include "c.h"
> @@ -94,6 +98,7 @@ static void __attribute__((__noreturn__)) usage(void)
> fputs(_(" -o, --offset <num> offset for range operations, in bytes\n"), out);
> fputs(_(" -p, --punch-hole replace a range with a hole (implies -n)\n"), out);
> fputs(_(" -z, --zero-range zero and ensure allocation of a range\n"), out);
> + fputs(_(" -w, --write-zeroes write zeroes and ensure allocation of a range\n"), out);
> #ifdef HAVE_POSIX_FALLOCATE
> fputs(_(" -x, --posix use posix_fallocate(3) instead of fallocate(2)\n"), out);
> #endif
> @@ -304,6 +309,7 @@ int main(int argc, char **argv)
> { "dig-holes", no_argument, NULL, 'd' },
> { "insert-range", no_argument, NULL, 'i' },
> { "zero-range", no_argument, NULL, 'z' },
> + { "write-zeroes", no_argument, NULL, 'w' },
> { "offset", required_argument, NULL, 'o' },
> { "length", required_argument, NULL, 'l' },
> { "posix", no_argument, NULL, 'x' },
> @@ -312,8 +318,8 @@ int main(int argc, char **argv)
> };
>
> static const ul_excl_t excl[] = { /* rows and cols in ASCII order */
> - { 'c', 'd', 'i', 'p', 'x', 'z'},
> - { 'c', 'i', 'n', 'x' },
> + { 'c', 'd', 'i', 'p', 'w', 'x', 'z'},
> + { 'c', 'i', 'n', 'w', 'x' },
> { 0 }
> };
> int excl_st[ARRAY_SIZE(excl)] = UL_EXCL_STATUS_INIT;
> @@ -323,7 +329,7 @@ int main(int argc, char **argv)
> textdomain(PACKAGE);
> close_stdout_atexit();
>
> - while ((c = getopt_long(argc, argv, "hvVncpdizxl:o:", longopts, NULL))
> + while ((c = getopt_long(argc, argv, "hvVncpdizwxl:o:", longopts, NULL))
> != -1) {
>
> err_exclusive_options(c, longopts, excl, excl_st);
> @@ -353,6 +359,9 @@ int main(int argc, char **argv)
> case 'z':
> mode |= FALLOC_FL_ZERO_RANGE;
> break;
> + case 'w':
> + mode |= FALLOC_FL_WRITE_ZEROES;
> + break;
> case 'x':
> #ifdef HAVE_POSIX_FALLOCATE
> posix = 1;
> @@ -429,6 +438,9 @@ int main(int argc, char **argv)
> else if (mode & FALLOC_FL_ZERO_RANGE)
> fprintf(stdout, _("%s: %s (%ju bytes) zeroed.\n"),
> filename, str, length);
> + else if (mode & FALLOC_FL_WRITE_ZEROES)
> + fprintf(stdout, _("%s: %s (%ju bytes) write zeroed.\n"),
"write zeroed" is a little strange, but I don't have a better
suggestion. :)
--D
> + filename, str, length);
> else
> fprintf(stdout, _("%s: %s (%ju bytes) allocated.\n"),
> filename, str, length);
> --
> 2.39.2
>
>
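
The option-exclusion rules in the quoted patch (the two rows of excl[] plus the man page sentence that --keep-size cannot be used with write-zeroes) can be restated as a check on the resulting mode bits. This is only a sketch: falloc_mode_ok() is a hypothetical helper, not part of util-linux, and it ignores -d and -x because they are not fallocate(2) mode flags. The flag values are the ABI values from linux/falloc.h, repeated here so the block stands alone.

```c
#include <stdbool.h>

/* ABI values from linux/falloc.h, defined locally for a standalone sketch. */
#ifndef FALLOC_FL_KEEP_SIZE
#define FALLOC_FL_KEEP_SIZE      0x01
#define FALLOC_FL_PUNCH_HOLE     0x02
#define FALLOC_FL_COLLAPSE_RANGE 0x08
#define FALLOC_FL_ZERO_RANGE     0x10
#define FALLOC_FL_INSERT_RANGE   0x20
#endif
#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES   0x80
#endif

/* True if at most one bit is set in v. */
static bool at_most_one_bit(unsigned int v)
{
	return (v & (v - 1)) == 0;
}

/* Hypothetical helper: does `mode` respect the exclusions in the patch? */
static bool falloc_mode_ok(int mode)
{
	unsigned int ops = mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_COLLAPSE_RANGE |
				   FALLOC_FL_ZERO_RANGE | FALLOC_FL_INSERT_RANGE |
				   FALLOC_FL_WRITE_ZEROES);

	/* Row 1: -c, -i, -p, -w and -z are mutually exclusive operations. */
	if (!at_most_one_bit(ops))
		return false;

	/* Row 2: -n (keep size) cannot be combined with -c, -i or -w. */
	if ((mode & FALLOC_FL_KEEP_SIZE) &&
	    (ops & (FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_INSERT_RANGE |
		    FALLOC_FL_WRITE_ZEROES)))
		return false;

	return true;
}
```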
* Re: [PATCH RESEND] fs: Add 'rootfsflags' to set rootfs mount options
From: Randy Dunlap @ 2025-08-14 16:23 UTC (permalink / raw)
To: Lichen Liu, viro, brauner, jack
Cc: linux-fsdevel, linux-kernel, safinaskar, kexec, rob, weilongchen,
cyphar, linux-api, zohar, stefanb, initramfs
In-Reply-To: <20250814103424.3287358-2-lichliu@redhat.com>
Hi,
On 8/14/25 3:34 AM, Lichen Liu wrote:
> When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs.
> By default, a tmpfs mount is limited to using 50% of the available RAM
> for its content. This can be problematic in memory-constrained
> environments, particularly during a kdump capture.
>
> In a kdump scenario, the capture kernel boots with a limited amount of
> memory specified by the 'crashkernel' parameter. If the initramfs is
> large, it may fail to unpack into the tmpfs rootfs due to insufficient
> space. This is because to get X MB of usable space in tmpfs, 2*X MB of
> memory must be available for the mount. This leads to an OOM failure
> during the early boot process, preventing a successful crash dump.
>
> This patch introduces a new kernel command-line parameter, rootfsflags,
> which allows passing specific mount options directly to the rootfs when
> it is first mounted. This gives users control over the rootfs behavior.
>
> For example, a user can now specify rootfsflags=size=75% to allow the
> tmpfs to use up to 75% of the available memory. This can significantly
> reduce the memory pressure for kdump.
>
> Consider a practical example:
>
> To unpack a 48MB initramfs, the tmpfs needs 48MB of usable space. With
> the default 50% limit, this requires a memory pool of 96MB to be
> available for the tmpfs mount. The total memory requirement is therefore
> approximately: 16MB (vmlinuz) + 48MB (loaded initramfs) + 48MB (unpacked
> kernel) + 96MB (for tmpfs) + 12MB (runtime overhead) ≈ 220MB.
>
> By using rootfsflags=size=75%, the memory pool required for the 48MB
> tmpfs is reduced to 48MB / 0.75 = 64MB. This reduces the total memory
> requirement by 32MB (96MB - 64MB), allowing the kdump to succeed with a
> smaller crashkernel size, such as 192MB.
>
> An alternative approach of reusing the existing rootflags parameter was
> considered. However, a new, dedicated rootfsflags parameter was chosen
> to avoid altering the current behavior of rootflags (which applies to
> the final root filesystem) and to prevent any potential regressions.
>
> This approach is inspired by prior discussions and patches on the topic.
> Ref: https://www.lightofdawn.org/blog/?viewDetailed=00128
> Ref: https://landley.net/notes-2015.html#01-01-2015
> Ref: https://lkml.org/lkml/2021/6/29/783
> Ref: https://www.kernel.org/doc/html/latest/filesystems/ramfs-rootfs-initramfs.html#what-is-rootfs
>
> Signed-off-by: Lichen Liu <lichliu@redhat.com>
> Tested-by: Rob Landley <rob@landley.net>
> ---
> Hi VFS maintainers,
>
> Resending this patch as it did not get picked up.
> This patch is intended for the VFS tree.
>
> fs/namespace.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 8f1000f9f3df..e484c26d5e3f 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -65,6 +65,15 @@ static int __init set_mphash_entries(char *str)
> }
> __setup("mphash_entries=", set_mphash_entries);
>
> +static char * __initdata rootfs_flags;
> +static int __init rootfs_flags_setup(char *str)
> +{
> + rootfs_flags = str;
> + return 1;
> +}
> +
> +__setup("rootfsflags=", rootfs_flags_setup);
Please document this option (alphabetically) in
Documentation/admin-guide/kernel-parameters.txt.
Thanks.
> +
> static u64 event;
> static DEFINE_XARRAY_FLAGS(mnt_id_xa, XA_FLAGS_ALLOC);
> static DEFINE_IDA(mnt_group_ida);
> @@ -5677,7 +5686,7 @@ static void __init init_mount_tree(void)
> struct mnt_namespace *ns;
> struct path root;
>
> - mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", NULL);
> + mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", rootfs_flags);
> if (IS_ERR(mnt))
> panic("Can't create rootfs");
>
--
~Randy
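
The memory arithmetic in the quoted patch description can be checked with a few lines of C. This is a sketch of the patch's own model (a tmpfs with X MB usable at size=P% needs X / (P/100) MB of RAM behind it), using the example figures from the description; the function names are invented for the illustration.

```c
/* RAM (MB) needed to back `usable_mb` of tmpfs space at size=`size_pct`%. */
static unsigned int tmpfs_pool_mb(unsigned int usable_mb, unsigned int size_pct)
{
	/* Round up so the pool is never undersized. */
	return (usable_mb * 100 + size_pct - 1) / size_pct;
}

/* Total (MB) for the example in the patch description. */
static unsigned int kdump_total_mb(unsigned int size_pct)
{
	const unsigned int vmlinuz = 16;	/* compressed kernel image */
	const unsigned int initramfs = 48;	/* loaded initramfs */
	const unsigned int unpacked = 48;	/* "unpacked kernel" in the description */
	const unsigned int overhead = 12;	/* runtime overhead */

	return vmlinuz + initramfs + unpacked +
	       tmpfs_pool_mb(48, size_pct) + overhead;
}
```

With the default 50% this gives 96 MB for the pool and 220 MB in total; with size=75% the pool drops to 64 MB and the total to 188 MB, which is consistent with the 192MB crashkernel figure mentioned in the description.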
* Re: [PATCH v3 07/30] kho: add interfaces to unpreserve folios and physical memory ranges
From: Pasha Tatashin @ 2025-08-14 15:05 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250814132233.GB802098@nvidia.com>
On Thu, Aug 14, 2025 at 1:22 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Aug 07, 2025 at 01:44:13AM +0000, Pasha Tatashin wrote:
> > +int kho_unpreserve_phys(phys_addr_t phys, size_t size)
> > +{
>
> Why are we adding phys apis? Didn't we talk about this before and
> agree not to expose these?
It is already there; this patch simply adds the missing unpreserve part.
We can talk about removing it in the future, but the phys interface
has the benefit of not requiring preserved objects to be a power of
two in length.
>
> The places using it are goofy:
>
> +static int luo_fdt_setup(void)
> +{
> + fdt_out = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
> + get_order(LUO_FDT_SIZE));
>
> + ret = kho_preserve_phys(__pa(fdt_out), LUO_FDT_SIZE);
>
> + WARN_ON_ONCE(kho_unpreserve_phys(__pa(fdt_out), LUO_FDT_SIZE));
>
> It literally allocated a page and then for some reason switches to
> phys with an open coded __pa??
>
> This is ugly, if you want a helper to match __get_free_pages() then
> make one that works on void * directly. You can get the order of the
> void * directly from the struct page IIRC when using GFP_COMP.
I will make these changes.
>
> Which is perhaps another comment, if this __get_free_pages() is going
> to be a common pattern (and I guess it will be) then the API should be
> streamlined a lot more:
>
> void *kho_alloc_preserved_memory(gfp, size);
> void kho_free_preserved_memory(void *);
Hm, not all GFP flags are compatible with KHO preserve, but we could
add this or similar API, but first let's make KHO completely
stateless: remove, finalize and abort parts from it.
>
> Which can wrap the get_free_pages and the preserve logic and gives
> a nice path to possibly someday supporting non-PAGE_SIZE allocations.
>
> Jason
* Re: [PATCH v3 01/30] kho: init new_physxa->phys_bits to fix lockdep
From: Pasha Tatashin @ 2025-08-14 14:57 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250814131153.GA802098@nvidia.com>
On Thu, Aug 14, 2025 at 1:11 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Aug 07, 2025 at 01:44:07AM +0000, Pasha Tatashin wrote:
> > - physxa = xa_load_or_alloc(&track->orders, order, sizeof(*physxa));
> > - if (IS_ERR(physxa))
> > - return PTR_ERR(physxa);
>
> It is probably better to introduce a function pointer argument to this
> xa_load_or_alloc() to do the alloc and init operation than to open
> code the thing.
Agreed, but this should be a separate clean-up. This particular patch
is a hotfix that should land soon (it was separated from this
series). Once it lands, we are going to do this clean-up.
Pasha
* Re: [PATCH v3 18/30] liveupdate: luo_files: luo_ioctl: Add ioctls for per-file state management
From: Jason Gunthorpe @ 2025-08-14 14:02 UTC (permalink / raw)
To: Pasha Tatashin
Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-19-pasha.tatashin@soleen.com>
On Thu, Aug 07, 2025 at 01:44:24AM +0000, Pasha Tatashin wrote:
> +struct liveupdate_ioctl_get_fd_state {
> + __u32 size;
> + __u8 incoming;
> + __aligned_u64 token;
> + __u32 state;
> +};
Same remark about explicit padding and checking padding for 0
> + * luo_file_get_state - Get the preservation state of a specific file.
> + * @token: The token of the file to query.
> + * @statep: Output pointer to store the file's current live update state.
> + * @incoming: If true, query the state of a restored file from the incoming
> + * (previous kernel's) set. If false, query a file being prepared
> + * for preservation in the current set.
> + *
> + * Finds the file associated with the given @token in either the incoming
> + * or outgoing tracking arrays and returns its current LUO state
> + * (NORMAL, PREPARED, FROZEN, UPDATED).
> + *
> + * Return: 0 on success, -ENOENT if the token is not found.
> + */
> +int luo_file_get_state(u64 token, enum liveupdate_state *statep, bool incoming)
> +{
> + struct luo_file *luo_file;
> + struct xarray *target_xa;
> + int ret = 0;
> +
> + luo_state_read_enter();
Fewer globals: at this point everything should be within memory
attached to the file descriptor and not in globals. Doing this will
promote a good, maintainable structure rather than spaghetti.
Also I think a BKL design is not a good idea for new code. We've had
so many bad experiences with this pattern promoting uncontrolled
incomprehensible locking.
The xarray already has a lock, why not have reasonable locking inside
the luo_file? Probably just a refcount?
> + target_xa = incoming ? &luo_files_xa_in : &luo_files_xa_out;
> + luo_file = xa_load(target_xa, token);
> +
> + if (!luo_file) {
> + ret = -ENOENT;
> + goto out_unlock;
> + }
> +
> + scoped_guard(mutex, &luo_file->mutex)
> + *statep = luo_file->state;
> +
> +out_unlock:
> + luo_state_read_exit();
If we are using cleanup.h then use it for this too..
But it seems kind of weird, why not just
xa_lock()
xa_load()
*statep = READ_ONCE(luo_file->state);
xa_unlock()
?
> +static int luo_ioctl_set_fd_event(struct luo_ucmd *ucmd)
> +{
> + struct liveupdate_ioctl_set_fd_event *argp = ucmd->cmd;
> + int ret;
> +
> + switch (argp->event) {
> + case LIVEUPDATE_PREPARE:
> + ret = luo_file_prepare(argp->token);
> + break;
> + case LIVEUPDATE_FREEZE:
> + ret = luo_file_freeze(argp->token);
> + break;
> + case LIVEUPDATE_FINISH:
> + ret = luo_file_finish(argp->token);
> + break;
> + case LIVEUPDATE_CANCEL:
> + ret = luo_file_cancel(argp->token);
> + break;
The token should be converted to a file here instead of duplicated in
each function
> static int luo_open(struct inode *inodep, struct file *filep)
> {
> if (atomic_cmpxchg(&luo_device_in_use, 0, 1))
> @@ -149,6 +191,8 @@ union ucmd_buffer {
> struct liveupdate_ioctl_fd_restore restore;
> struct liveupdate_ioctl_get_state state;
> struct liveupdate_ioctl_set_event event;
> + struct liveupdate_ioctl_get_fd_state fd_state;
> + struct liveupdate_ioctl_set_fd_event fd_event;
> };
>
> struct luo_ioctl_op {
> @@ -179,6 +223,10 @@ static const struct luo_ioctl_op luo_ioctl_ops[] = {
> struct liveupdate_ioctl_get_state, state),
> IOCTL_OP(LIVEUPDATE_IOCTL_SET_EVENT, luo_ioctl_set_event,
> struct liveupdate_ioctl_set_event, event),
> + IOCTL_OP(LIVEUPDATE_IOCTL_GET_FD_STATE, luo_ioctl_get_fd_state,
> + struct liveupdate_ioctl_get_fd_state, token),
> + IOCTL_OP(LIVEUPDATE_IOCTL_SET_FD_EVENT, luo_ioctl_set_fd_event,
> + struct liveupdate_ioctl_set_fd_event, token),
> };
Keep sorted
Jason
* Re: [PATCH v3 16/30] liveupdate: luo_ioctl: add userpsace interface
From: Jason Gunthorpe @ 2025-08-14 13:49 UTC (permalink / raw)
To: Pasha Tatashin
Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-17-pasha.tatashin@soleen.com>
On Thu, Aug 07, 2025 at 01:44:22AM +0000, Pasha Tatashin wrote:
> +/**
> + * DOC: General ioctl format
> + *
> + * The ioctl interface follows a general format to allow for extensibility. Each
> + * ioctl is passed in a structure pointer as the argument providing the size of
> + * the structure in the first u32. The kernel checks that any structure space
> + * beyond what it understands is 0. This allows userspace to use the backward
> + * compatible portion while consistently using the newer, larger, structures.
> + *
> + * ioctls use a standard meaning for common errnos:
> + *
> + * - ENOTTY: The IOCTL number itself is not supported at all
> + * - E2BIG: The IOCTL number is supported, but the provided structure has
> + * non-zero in a part the kernel does not understand.
> + * - EOPNOTSUPP: The IOCTL number is supported, and the structure is
> + * understood, however a known field has a value the kernel does not
> + * understand or support.
> + * - EINVAL: Everything about the IOCTL was understood, but a field is not
> + * correct.
> + * - ENOENT: An ID or IOVA provided does not exist.
^^^^^^^^^
Maybe this should be 'token' ?
> + * - ENOMEM: Out of memory.
> + * - EOVERFLOW: Mathematics overflowed.
> + *
> + * As well as additional errnos, within specific ioctls.
> + */
Ah if you copy the comment make sure to faithfully follow it in the
implementation :)
> +struct liveupdate_ioctl_fd_unpreserve {
> + __u32 size;
> + __aligned_u64 token;
> +};
It is best to explicitly pad, so add a __u32 reserved between size and
token
Then you need to also check that the reserved is 0 when parsing it,
return -EOPNOTSUPP otherwise.
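
A userspace model of that suggestion, assuming fixed-width stand-ins for __u32 and __aligned_u64, might look like this. The struct and function names are invented for the sketch; only the layout idea (explicit reserved field, rejected when non-zero) comes from the review.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the reviewed struct with the hole made explicit. */
struct model_fd_unpreserve {
	uint32_t size;
	uint32_t reserved;	/* must be zero; room for future use */
	uint64_t token;
};

/* The layout no longer depends on compiler-inserted padding. */
static_assert(offsetof(struct model_fd_unpreserve, token) == 8,
	      "token follows the explicit reserved field");
static_assert(sizeof(struct model_fd_unpreserve) == 16,
	      "no tail padding");

/* Reject input that sets bits the kernel does not understand yet. */
static int model_check_unpreserve(const struct model_fd_unpreserve *arg)
{
	if (arg->reserved)
		return -EOPNOTSUPP;
	return 0;
}
```

Checking the reserved field now is what allows it to be given a meaning later without breaking old userspace that happened to pass garbage there.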
> +static atomic_t luo_device_in_use = ATOMIC_INIT(0);
I suggest you bundle this together into one struct with the misc_dev
and the other globals and largely pretend it is not global, eg refer
to it through container_of, etc
Following practices like this make it harder to abuse the globals.
> +struct luo_ucmd {
> + void __user *ubuffer;
> + u32 user_size;
> + void *cmd;
> +};
> +
> +static int luo_ioctl_fd_preserve(struct luo_ucmd *ucmd)
> +{
> + struct liveupdate_ioctl_fd_preserve *argp = ucmd->cmd;
> + int ret;
> +
> + ret = luo_register_file(argp->token, argp->fd);
> + if (!ret)
> + return ret;
> +
> + if (copy_to_user(ucmd->ubuffer, argp, ucmd->user_size))
> + return -EFAULT;
This will overflow memory, ucmd->user_size may be > sizeof(*argp)
The respond function is an important part of this scheme:
static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
				       size_t cmd_len)
{
	if (copy_to_user(ucmd->ubuffer, ucmd->cmd,
			 min_t(size_t, ucmd->user_size, cmd_len)))
		return -EFAULT;
	return 0;
}
The min (sizeof(*argp) in this case) can't be skipped!
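
To see why the min matters, here is a small userspace model of the respond step. memcpy() stands in for copy_to_user(), and every model_* name is made up for the illustration; the field names follow the quoted struct luo_ucmd.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Model of the per-call command state. */
struct model_ucmd {
	void *ubuffer;		/* stands in for the __user pointer */
	size_t user_size;	/* size the caller claimed */
	void *cmd;		/* kernel-side copy of the struct */
};

/* Stand-in for copy_to_user(): returns the number of bytes NOT copied. */
static size_t model_copy_to_user(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return 0;
}

/* Copy back at most cmd_len bytes, even if the caller claimed more. */
static int model_ucmd_respond(struct model_ucmd *ucmd, size_t cmd_len)
{
	size_t n = ucmd->user_size < cmd_len ? ucmd->user_size : cmd_len;

	if (model_copy_to_user(ucmd->ubuffer, ucmd->cmd, n))
		return -EFAULT;
	return 0;
}
```

If userspace passes a larger (newer) struct than the kernel knows about, only the part the kernel actually filled in is copied back, and nothing is read past the end of the kernel-side buffer.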
> +static int luo_ioctl_fd_restore(struct luo_ucmd *ucmd)
> +{
> + struct liveupdate_ioctl_fd_restore *argp = ucmd->cmd;
> + struct file *file;
> + int ret;
> +
> + argp->fd = get_unused_fd_flags(O_CLOEXEC);
> + if (argp->fd < 0) {
> + pr_err("Failed to allocate new fd: %d\n", argp->fd);
No need
> + return argp->fd;
> + }
> +
> + ret = luo_retrieve_file(argp->token, &file);
> + if (ret < 0) {
> + put_unused_fd(argp->fd);
> +
> + return ret;
> + }
> +
> + fd_install(argp->fd, file);
> +
> + if (copy_to_user(ucmd->ubuffer, argp, ucmd->user_size))
> + return -EFAULT;
Wrong order, fd_install must be last right before return 0. Failing
system calls should not leave behind installed FDs.
> +static int luo_ioctl_set_event(struct luo_ucmd *ucmd)
> +{
> + struct liveupdate_ioctl_set_event *argp = ucmd->cmd;
> + int ret;
> +
> + switch (argp->event) {
> + case LIVEUPDATE_PREPARE:
> + ret = luo_prepare();
> + break;
> + case LIVEUPDATE_FINISH:
> + ret = luo_finish();
> + break;
> + case LIVEUPDATE_CANCEL:
> + ret = luo_cancel();
> + break;
> + default:
> + ret = -EINVAL;
EOPNOTSUPP
> +union ucmd_buffer {
> + struct liveupdate_ioctl_fd_preserve preserve;
> + struct liveupdate_ioctl_fd_unpreserve unpreserve;
> + struct liveupdate_ioctl_fd_restore restore;
> + struct liveupdate_ioctl_get_state state;
> + struct liveupdate_ioctl_set_event event;
> +};
I discourage the column alignment. Also sort by name.
> +static const struct luo_ioctl_op luo_ioctl_ops[] = {
> + IOCTL_OP(LIVEUPDATE_IOCTL_FD_PRESERVE, luo_ioctl_fd_preserve,
> + struct liveupdate_ioctl_fd_preserve, token),
> + IOCTL_OP(LIVEUPDATE_IOCTL_FD_UNPRESERVE, luo_ioctl_fd_unpreserve,
> + struct liveupdate_ioctl_fd_unpreserve, token),
> + IOCTL_OP(LIVEUPDATE_IOCTL_FD_RESTORE, luo_ioctl_fd_restore,
> + struct liveupdate_ioctl_fd_restore, token),
> + IOCTL_OP(LIVEUPDATE_IOCTL_GET_STATE, luo_ioctl_get_state,
> + struct liveupdate_ioctl_get_state, state),
> + IOCTL_OP(LIVEUPDATE_IOCTL_SET_EVENT, luo_ioctl_set_event,
> + struct liveupdate_ioctl_set_event, event),
Sort by name
Jason
* Re: [PATCH v3 10/30] liveupdate: luo_core: luo_ioctl: Live Update Orchestrator
From: Jason Gunthorpe @ 2025-08-14 13:31 UTC (permalink / raw)
To: Pasha Tatashin
Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-11-pasha.tatashin@soleen.com>
On Thu, Aug 07, 2025 at 01:44:16AM +0000, Pasha Tatashin wrote:
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -383,6 +383,8 @@ Code Seq# Include File Comments
> 0xB8 01-02 uapi/misc/mrvl_cn10k_dpi.h Marvell CN10K DPI driver
> 0xB8 all uapi/linux/mshv.h Microsoft Hyper-V /dev/mshv driver
> <mailto:linux-hyperv@vger.kernel.org>
> +0xBA all uapi/linux/liveupdate.h Pasha Tatashin
> + <mailto:pasha.tatashin@soleen.com>
Let's not be greedy ;) Just take 00-0F for the moment
Jason
* Re: [PATCH v3 08/30] kho: don't unpreserve memory during abort
From: Jason Gunthorpe @ 2025-08-14 13:30 UTC (permalink / raw)
To: Pasha Tatashin
Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-9-pasha.tatashin@soleen.com>
On Thu, Aug 07, 2025 at 01:44:14AM +0000, Pasha Tatashin wrote:
> static int __kho_abort(void)
> {
> - int err = 0;
> - unsigned long order;
> - struct kho_mem_phys *physxa;
> -
> - xa_for_each(&kho_out.track.orders, order, physxa) {
> - struct kho_mem_phys_bits *bits;
> - unsigned long phys;
> -
> - xa_for_each(&physxa->phys_bits, phys, bits)
> - kfree(bits);
> -
> - xa_destroy(&physxa->phys_bits);
> - kfree(physxa);
> - }
> - xa_destroy(&kho_out.track.orders);
Now nothing ever cleans this up :\
Are you sure the issue isn't in the caller that it shouldn't be
calling kho abort until all the other stuff is cleaned up first?
I feel like this is another case where abusing globals gives an unclear
lifecycle model.
Jason
* Re: [PATCH v3 07/30] kho: add interfaces to unpreserve folios and physical memory ranges
From: Jason Gunthorpe @ 2025-08-14 13:22 UTC (permalink / raw)
To: Pasha Tatashin
Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-8-pasha.tatashin@soleen.com>
On Thu, Aug 07, 2025 at 01:44:13AM +0000, Pasha Tatashin wrote:
> +int kho_unpreserve_phys(phys_addr_t phys, size_t size)
> +{
Why are we adding phys apis? Didn't we talk about this before and
agree not to expose these?
The places using it are goofy:
+static int luo_fdt_setup(void)
+{
+ fdt_out = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
+ get_order(LUO_FDT_SIZE));
+ ret = kho_preserve_phys(__pa(fdt_out), LUO_FDT_SIZE);
+ WARN_ON_ONCE(kho_unpreserve_phys(__pa(fdt_out), LUO_FDT_SIZE));
It literally allocated a page and then for some reason switches to
phys with an open coded __pa??
This is ugly, if you want a helper to match __get_free_pages() then
make one that works on void * directly. You can get the order of the
void * directly from the struct page IIRC when using GFP_COMP.
Which is perhaps another comment, if this __get_free_pages() is going
to be a common pattern (and I guess it will be) then the API should be
streamlined a lot more:
void *kho_alloc_preserved_memory(gfp, size);
void kho_free_preserved_memory(void *);
Which can wrap the get_free_pages and the preserve logic and gives
a nice path to possibly someday supporting non-PAGE_SIZE allocations.
Jason
^ permalink raw reply
* Re: [PATCH v3 01/30] kho: init new_physxa->phys_bits to fix lockdep
From: Jason Gunthorpe @ 2025-08-14 13:11 UTC (permalink / raw)
To: Pasha Tatashin
Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-2-pasha.tatashin@soleen.com>
On Thu, Aug 07, 2025 at 01:44:07AM +0000, Pasha Tatashin wrote:
> - physxa = xa_load_or_alloc(&track->orders, order, sizeof(*physxa));
> - if (IS_ERR(physxa))
> - return PTR_ERR(physxa);
It is probably better to introduce a function pointer argument to this
xa_load_or_alloc() to do the alloc and init operation than to open
code the thing.
Jason
^ permalink raw reply
* [PATCH RESEND] fs: Add 'rootfsflags' to set rootfs mount options
From: Lichen Liu @ 2025-08-14 10:34 UTC (permalink / raw)
To: viro, brauner, jack
Cc: linux-fsdevel, linux-kernel, safinaskar, kexec, rob, weilongchen,
cyphar, linux-api, zohar, stefanb, initramfs, Lichen Liu
When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs.
By default, a tmpfs mount is limited to using 50% of the available RAM
for its content. This can be problematic in memory-constrained
environments, particularly during a kdump capture.
In a kdump scenario, the capture kernel boots with a limited amount of
memory specified by the 'crashkernel' parameter. If the initramfs is
large, it may fail to unpack into the tmpfs rootfs due to insufficient
space. This is because to get X MB of usable space in tmpfs, 2*X MB of
memory must be available for the mount. This leads to an OOM failure
during the early boot process, preventing a successful crash dump.
This patch introduces a new kernel command-line parameter, rootfsflags,
which allows passing specific mount options directly to the rootfs when
it is first mounted. This gives users control over the rootfs behavior.
For example, a user can now specify rootfsflags=size=75% to allow the
tmpfs to use up to 75% of the available memory. This can significantly
reduce the memory pressure for kdump.
Consider a practical example:
To unpack a 48MB initramfs, the tmpfs needs 48MB of usable space. With
the default 50% limit, this requires a memory pool of 96MB to be
available for the tmpfs mount. The total memory requirement is therefore
approximately: 16MB (vmlinuz) + 48MB (loaded initramfs) + 48MB (unpacked
kernel) + 96MB (for tmpfs) + 12MB (runtime overhead) ≈ 220MB.
By using rootfsflags=size=75%, the memory pool required for the 48MB
tmpfs is reduced to 48MB / 0.75 = 64MB. This reduces the total memory
requirement by 32MB (96MB - 64MB), allowing the kdump to succeed with a
smaller crashkernel size, such as 192MB.
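As a rough sanity check of the arithmetic above, the figures can be
reproduced with a few lines of C. This is only a sketch: the helper names
tmpfs_pool_mb() and kdump_total_mb() are made up for illustration, and it
assumes the simple model used in this message, where a tmpfs limited to
P% of RAM needs usable / (P / 100) MB of memory behind it.

```c
/*
 * Illustrative only: these helpers are hypothetical and are not part of
 * the patch. They restate the sizing model from the commit message.
 */

/* RAM (in MB) needed so that usable_mb fits under a size=<percent>% limit,
 * rounded up to a whole MB. */
static unsigned long tmpfs_pool_mb(unsigned long usable_mb, unsigned int percent)
{
	return (usable_mb * 100 + percent - 1) / percent;
}

/* Total for the 48MB example: vmlinuz + loaded initramfs + the second 48MB
 * term + tmpfs pool + runtime overhead, all in MB. */
static unsigned long kdump_total_mb(unsigned int percent)
{
	return 16 + 48 + 48 + tmpfs_pool_mb(48, percent) + 12;
}
```

With these, tmpfs_pool_mb(48, 50) is 96 and tmpfs_pool_mb(48, 75) is 64, so
the estimated total drops from 220MB to 188MB, which fits under a 192MB
crashkernel reservation.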
An alternative approach of reusing the existing rootflags parameter was
considered. However, a new, dedicated rootfsflags parameter was chosen
to avoid altering the current behavior of rootflags (which applies to
the final root filesystem) and to prevent any potential regressions.
This approach is inspired by prior discussions and patches on the topic.
Ref: https://www.lightofdawn.org/blog/?viewDetailed=00128
Ref: https://landley.net/notes-2015.html#01-01-2015
Ref: https://lkml.org/lkml/2021/6/29/783
Ref: https://www.kernel.org/doc/html/latest/filesystems/ramfs-rootfs-initramfs.html#what-is-rootfs
Signed-off-by: Lichen Liu <lichliu@redhat.com>
Tested-by: Rob Landley <rob@landley.net>
---
Hi VFS maintainers,
Resending this patch as it did not get picked up.
This patch is intended for the VFS tree.
fs/namespace.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 8f1000f9f3df..e484c26d5e3f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -65,6 +65,15 @@ static int __init set_mphash_entries(char *str)
}
__setup("mphash_entries=", set_mphash_entries);
+static char * __initdata rootfs_flags;
+static int __init rootfs_flags_setup(char *str)
+{
+ rootfs_flags = str;
+ return 1;
+}
+
+__setup("rootfsflags=", rootfs_flags_setup);
+
static u64 event;
static DEFINE_XARRAY_FLAGS(mnt_id_xa, XA_FLAGS_ALLOC);
static DEFINE_IDA(mnt_group_ida);
@@ -5677,7 +5686,7 @@ static void __init init_mount_tree(void)
struct mnt_namespace *ns;
struct path root;
- mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", NULL);
+ mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", rootfs_flags);
if (IS_ERR(mnt))
panic("Can't create rootfs");
--
2.47.0
^ permalink raw reply related
* Re: [PATCH] fs: Add 'rootfsflags' to set rootfs mount options
From: Lichen Liu @ 2025-08-14 10:25 UTC (permalink / raw)
To: Askar Safin
Cc: brauner, kexec, linux-kernel, rob, viro, weilongchen, cyphar,
linux-fsdevel, linux-api, initramfs, Mimi Zohar, Stefan Berger
In-Reply-To: <20250814081339.3007358-1-safinaskar@zohomail.com>
On Thu, Aug 14, 2025 at 4:15 PM Askar Safin <safinaskar@zohomail.com> wrote:
>
> Lichen Liu <lichliu@redhat.com>:
> > When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs.
> > By default, a tmpfs mount is limited to using 50% of the available RAM
> > for its content. This can be problematic in memory-constrained
> > environments, particularly during a kdump capture.
> >
> > In a kdump scenario, the capture kernel boots with a limited amount of
> > memory specified by the 'crashkernel' parameter. If the initramfs is
> > large, it may fail to unpack into the tmpfs rootfs due to insufficient
> > space. This is because to get X MB of usable space in tmpfs, 2*X MB of
> > memory must be available for the mount. This leads to an OOM failure
> > during the early boot process, preventing a successful crash dump.
> >
> > This patch introduces a new kernel command-line parameter, rootfsflags,
> > which allows passing specific mount options directly to the rootfs when
> > it is first mounted. This gives users control over the rootfs behavior.
> >
> > For example, a user can now specify rootfsflags=size=75% to allow the
> > tmpfs to use up to 75% of the available memory. This can significantly
> > reduce the memory pressure for kdump.
> >
> > Consider a practical example:
> >
> > To unpack a 48MB initramfs, the tmpfs needs 48MB of usable space. With
> > the default 50% limit, this requires a memory pool of 96MB to be
> > available for the tmpfs mount. The total memory requirement is therefore
> > approximately: 16MB (vmlinuz) + 48MB (loaded initramfs) + 48MB (unpacked
> > kernel) + 96MB (for tmpfs) + 12MB (runtime overhead) ≈ 220MB.
> >
> > By using rootfsflags=size=75%, the memory pool required for the 48MB
> > tmpfs is reduced to 48MB / 0.75 = 64MB. This reduces the total memory
> > requirement by 32MB (96MB - 64MB), allowing the kdump to succeed with a
> > smaller crashkernel size, such as 192MB.
> >
> > An alternative approach of reusing the existing rootflags parameter was
> > considered. However, a new, dedicated rootfsflags parameter was chosen
> > to avoid altering the current behavior of rootflags (which applies to
> > the final root filesystem) and to prevent any potential regressions.
> >
> > This approach is inspired by prior discussions and patches on the topic.
> > Ref: https://www.lightofdawn.org/blog/?viewDetailed=00128
> > Ref: https://landley.net/notes-2015.html#01-01-2015
> > Ref: https://lkml.org/lkml/2021/6/29/783
> > Ref: https://www.kernel.org/doc/html/latest/filesystems/ramfs-rootfs-initramfs.html#what-is-rootfs
> >
> > Signed-off-by: Lichen Liu <lichliu@redhat.com>
> > ---
> > fs/namespace.c | 11 ++++++++++-
> > 1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index ddfd4457d338..a450db31613e 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -65,6 +65,15 @@ static int __init set_mphash_entries(char *str)
> > }
> > __setup("mphash_entries=", set_mphash_entries);
> >
> > +static char * __initdata rootfs_flags;
> > +static int __init rootfs_flags_setup(char *str)
> > +{
> > + rootfs_flags = str;
> > + return 1;
> > +}
> > +
> > +__setup("rootfsflags=", rootfs_flags_setup);
> > +
> > static u64 event;
> > static DEFINE_XARRAY_FLAGS(mnt_id_xa, XA_FLAGS_ALLOC);
> > static DEFINE_IDA(mnt_group_ida);
> > @@ -6086,7 +6095,7 @@ static void __init init_mount_tree(void)
> > struct mnt_namespace *ns;
> > struct path root;
> >
> > - mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", NULL);
> > + mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", rootfs_flags);
> > if (IS_ERR(mnt))
> > panic("Can't create rootfs");
> >
> > --
> > 2.50.1
>
> Thank you for this patch!
>
> I suggest periodically checking linux-next to see whether the patch got there.
>
> If it was not applied in reasonable time, then resend it.
> But this time, please clearly specify the tree that should accept it.
> I think the most appropriate tree here is the VFS tree.
> So, when resending, please add linux-fsdevel@vger.kernel.org to CC and say in the first paragraph
> of your mail that the patch is for the VFS tree.
Thank you!
I checked linux-next and the patch has not been applied yet. I will
resend this patch and CC linux-fsdevel@vger.kernel.org.
>
> --
> Askar Safin
>
^ permalink raw reply
* Re: [PATCH] fs: Add 'rootfsflags' to set rootfs mount options
From: Askar Safin @ 2025-08-14 8:13 UTC (permalink / raw)
To: lichliu
Cc: brauner, kexec, linux-kernel, rob, viro, weilongchen, cyphar,
linux-fsdevel, linux-api, initramfs, Mimi Zohar, Stefan Berger
In-Reply-To: <20250808015134.2875430-2-lichliu@redhat.com>
Lichen Liu <lichliu@redhat.com>:
> When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs.
> By default, a tmpfs mount is limited to using 50% of the available RAM
> for its content. This can be problematic in memory-constrained
> environments, particularly during a kdump capture.
>
> In a kdump scenario, the capture kernel boots with a limited amount of
> memory specified by the 'crashkernel' parameter. If the initramfs is
> large, it may fail to unpack into the tmpfs rootfs due to insufficient
> space. This is because to get X MB of usable space in tmpfs, 2*X MB of
> memory must be available for the mount. This leads to an OOM failure
> during the early boot process, preventing a successful crash dump.
>
> This patch introduces a new kernel command-line parameter, rootfsflags,
> which allows passing specific mount options directly to the rootfs when
> it is first mounted. This gives users control over the rootfs behavior.
>
> For example, a user can now specify rootfsflags=size=75% to allow the
> tmpfs to use up to 75% of the available memory. This can significantly
> reduce the memory pressure for kdump.
>
> Consider a practical example:
>
> To unpack a 48MB initramfs, the tmpfs needs 48MB of usable space. With
> the default 50% limit, this requires a memory pool of 96MB to be
> available for the tmpfs mount. The total memory requirement is therefore
> approximately: 16MB (vmlinuz) + 48MB (loaded initramfs) + 48MB (unpacked
> kernel) + 96MB (for tmpfs) + 12MB (runtime overhead) ≈ 220MB.
>
> By using rootfsflags=size=75%, the memory pool required for the 48MB
> tmpfs is reduced to 48MB / 0.75 = 64MB. This reduces the total memory
> requirement by 32MB (96MB - 64MB), allowing the kdump to succeed with a
> smaller crashkernel size, such as 192MB.
>
> An alternative approach of reusing the existing rootflags parameter was
> considered. However, a new, dedicated rootfsflags parameter was chosen
> to avoid altering the current behavior of rootflags (which applies to
> the final root filesystem) and to prevent any potential regressions.
>
> This approach is inspired by prior discussions and patches on the topic.
> Ref: https://www.lightofdawn.org/blog/?viewDetailed=00128
> Ref: https://landley.net/notes-2015.html#01-01-2015
> Ref: https://lkml.org/lkml/2021/6/29/783
> Ref: https://www.kernel.org/doc/html/latest/filesystems/ramfs-rootfs-initramfs.html#what-is-rootfs
>
> Signed-off-by: Lichen Liu <lichliu@redhat.com>
> ---
> fs/namespace.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index ddfd4457d338..a450db31613e 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -65,6 +65,15 @@ static int __init set_mphash_entries(char *str)
> }
> __setup("mphash_entries=", set_mphash_entries);
>
> +static char * __initdata rootfs_flags;
> +static int __init rootfs_flags_setup(char *str)
> +{
> + rootfs_flags = str;
> + return 1;
> +}
> +
> +__setup("rootfsflags=", rootfs_flags_setup);
> +
> static u64 event;
> static DEFINE_XARRAY_FLAGS(mnt_id_xa, XA_FLAGS_ALLOC);
> static DEFINE_IDA(mnt_group_ida);
> @@ -6086,7 +6095,7 @@ static void __init init_mount_tree(void)
> struct mnt_namespace *ns;
> struct path root;
>
> - mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", NULL);
> + mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", rootfs_flags);
> if (IS_ERR(mnt))
> panic("Can't create rootfs");
>
> --
> 2.50.1
Thank you for this patch!
I suggest periodically checking linux-next to see whether the patch got there.
If it was not applied in reasonable time, then resend it.
But this time, please clearly specify the tree that should accept it.
I think the most appropriate tree here is the VFS tree.
So, when resending, please add linux-fsdevel@vger.kernel.org to CC and say in the first paragraph
of your mail that the patch is for the VFS tree.
--
Askar Safin
^ permalink raw reply
* Re: do_change_type(): refuse to operate on unmounted/not ours mounts
From: Pavel Tikhomirov @ 2025-08-14 7:07 UTC (permalink / raw)
To: Al Viro
Cc: Tycho Andersen, Andrei Vagin, Andrei Vagin, Christian Brauner,
linux-fsdevel, LKML, criu, Linux API, stable
In-Reply-To: <CAE1zp77jmFD=rySJVLf6yU+JKZnUpjkBagC3qQHrxPotrccEbQ@mail.gmail.com>
> It should be enough to run the zdtm test suite to check that the change
> does not break anything for CRIU (will do).
JFYI: I checked 0cc53520e68 with the patch "[PATCH] use uniform permission
checks for all mount propagation changes" (+ s/from/to/); there are no
problems in the mount-related criu-zdtm tests. I do see some problems in
the socket-related tests, but they look unrelated.
^ permalink raw reply
* Re: [RFC][CFT] selftest for permission checks in mount propagation changes
From: Al Viro @ 2025-08-14 6:37 UTC (permalink / raw)
To: linux-fsdevel
Cc: Tycho Andersen, Andrei Vagin, Andrei Vagin, Christian Brauner,
Pavel Tikhomirov, LKML, criu, Linux API, stable
In-Reply-To: <20250814055702.GO222315@ZenIV>
> void do_unshare(void)
> {
> FILE *f;
> uid_t uid = geteuid();
> gid_t gid = getegid();
> unshare(CLONE_NEWNS|CLONE_NEWUSER);
> f = fopen("/proc/self/uid_map", "w");
> fprintf(f, "0 %d 1", uid);
> fclose(f);
> f = fopen("/proc/self/setgroups", "w");
> fprintf(f, "deny");
> fclose(f);
> f = fopen("/proc/self/gid_map", "w");
> fprintf(f, "0 %d 1", gid);
> fclose(f);
> mount(NULL, "/", NULL, MS_REC|MS_PRIVATE, NULL);
> }
This obviously needs error checking - in this form it won't do
anything good without userns enabled (coredump on the first
fprintf() in there, since there won't be /proc/self/uid_map);
should probably just report CLONE_NEWUSER failure, warn about
skipped tests, fall back to unshare(CLONE_NEWNS) and skip
everything in in_child()...
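Something along these lines might do; a rough sketch only, not what the
selftest actually ends up using. The write_file() helper and the 0/1/-1
return convention are made up here for illustration: 0 means both
namespaces were set up, 1 means only the mount namespace was (so the
userns-dependent tests should be skipped), -1 means nothing worked.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

/* Hypothetical helper: write a short string to an existing file.
 * Returns 0 on success, -1 on any failure. */
static int write_file(const char *path, const char *buf)
{
	size_t len = strlen(buf);
	int fd = open(path, O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, buf, len);
	if (close(fd) < 0 || n != (ssize_t)len)
		return -1;
	return 0;
}

/* 0: new userns + mount ns; 1: mount ns only; -1: failure. */
int do_unshare(void)
{
	char map[64];
	/* Must be read before unshare(); they are unmapped afterwards. */
	uid_t uid = geteuid();
	gid_t gid = getegid();

	if (unshare(CLONE_NEWNS | CLONE_NEWUSER) < 0) {
		fprintf(stderr, "CLONE_NEWUSER failed (%s), skipping userns tests\n",
			strerror(errno));
		/* Fall back to a plain mount namespace (needs CAP_SYS_ADMIN). */
		if (unshare(CLONE_NEWNS) < 0)
			return -1;
		if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0)
			return -1;
		return 1;
	}

	snprintf(map, sizeof(map), "0 %u 1", (unsigned int)uid);
	if (write_file("/proc/self/uid_map", map) < 0)
		return -1;
	if (write_file("/proc/self/setgroups", "deny") < 0)
		return -1;
	snprintf(map, sizeof(map), "0 %u 1", (unsigned int)gid);
	if (write_file("/proc/self/gid_map", map) < 0)
		return -1;
	if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0)
		return -1;
	return 0;
}
```

The caller would then run the in_child() tests only when this returns 0,
and print a "skipped" warning otherwise.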
^ permalink raw reply