Linux userland API discussions
* Re: [PATCH v3 01/30] kho: init new_physxa->phys_bits to fix lockdep
From: Jason Gunthorpe @ 2025-08-14 13:11 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-2-pasha.tatashin@soleen.com>

On Thu, Aug 07, 2025 at 01:44:07AM +0000, Pasha Tatashin wrote:
> -	physxa = xa_load_or_alloc(&track->orders, order, sizeof(*physxa));
> -	if (IS_ERR(physxa))
> -		return PTR_ERR(physxa);

It is probably better to introduce a function pointer argument to this
xa_load_or_alloc() to do the alloc and init operation than to open
code the thing.
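
A rough user-space illustration of that suggestion, not the actual KHO
code: the slot table stands in for the xarray, and table_load_or_alloc(),
elm_init_fn and phys_track_init() are hypothetical names. It ignores the
concurrent-insert handling a real xarray helper would need.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define SLOT_COUNT 16

/* Stand-in for the xarray: a fixed table of pointers indexed by order. */
struct slot_table {
	void *slots[SLOT_COUNT];
};

/* Hypothetical callback type: initialise a freshly zeroed element. */
typedef int (*elm_init_fn)(void *elm);

/*
 * Return the element at @index, allocating and initialising it on first
 * use. Returns NULL and sets errno on failure.
 */
static void *table_load_or_alloc(struct slot_table *t, size_t index,
				 size_t elm_size, elm_init_fn init)
{
	void *elm;
	int ret;

	assert(t != NULL && elm_size > 0);

	if (index >= SLOT_COUNT) {
		errno = EINVAL;
		return NULL;
	}
	if (t->slots[index])
		return t->slots[index];

	elm = calloc(1, elm_size);
	if (!elm) {
		errno = ENOMEM;
		return NULL;
	}

	/* The init step runs before the element becomes visible in the table. */
	ret = init ? init(elm) : 0;
	if (ret) {
		free(elm);
		errno = -ret;
		return NULL;
	}

	t->slots[index] = elm;
	return elm;
}

/* Example element type with an init step, loosely mirroring phys_bits. */
struct phys_track {
	int initialised;
};

static int phys_track_init(void *elm)
{
	struct phys_track *p = elm;

	p->initialised = 1;
	return 0;
}
```

The caller then never sees a half-initialised element, and the init
logic is not open coded at each call site.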

Jason

* Re: [PATCH v3 07/30] kho: add interfaces to unpreserve folios and physical memory ranges
From: Jason Gunthorpe @ 2025-08-14 13:22 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-8-pasha.tatashin@soleen.com>

On Thu, Aug 07, 2025 at 01:44:13AM +0000, Pasha Tatashin wrote:
> +int kho_unpreserve_phys(phys_addr_t phys, size_t size)
> +{

Why are we adding phys apis? Didn't we talk about this before and
agree not to expose these?

The places using it are goofy:

+static int luo_fdt_setup(void)
+{
+       fdt_out = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
+                                          get_order(LUO_FDT_SIZE));

+       ret = kho_preserve_phys(__pa(fdt_out), LUO_FDT_SIZE);

+       WARN_ON_ONCE(kho_unpreserve_phys(__pa(fdt_out), LUO_FDT_SIZE));

It literally allocates a page and then for some reason switches to
phys with an open coded __pa??

This is ugly, if you want a helper to match __get_free_pages() then
make one that works on void * directly. You can get the order of the
void * directly from the struct page IIRC when using GFP_COMP.

Which is perhaps another comment, if this __get_free_pages() is going
to be a common pattern (and I guess it will be) then the API should be
streamlined a lot more:

 void *kho_alloc_preserved_memory(gfp, size);
 void kho_free_preserved_memory(void *);

Which can wrap the get_free_pages and the preserve logic and give
a nice path to possibly someday supporting non-PAGE_SIZE allocations.

Jason

* Re: [PATCH v3 08/30] kho: don't unpreserve memory during abort
From: Jason Gunthorpe @ 2025-08-14 13:30 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-9-pasha.tatashin@soleen.com>

On Thu, Aug 07, 2025 at 01:44:14AM +0000, Pasha Tatashin wrote:
>  static int __kho_abort(void)
>  {
> -	int err = 0;
> -	unsigned long order;
> -	struct kho_mem_phys *physxa;
> -
> -	xa_for_each(&kho_out.track.orders, order, physxa) {
> -		struct kho_mem_phys_bits *bits;
> -		unsigned long phys;
> -
> -		xa_for_each(&physxa->phys_bits, phys, bits)
> -			kfree(bits);
> -
> -		xa_destroy(&physxa->phys_bits);
> -		kfree(physxa);
> -	}
> -	xa_destroy(&kho_out.track.orders);

Now nothing ever cleans this up :\

Are you sure the issue isn't in the caller, which shouldn't be
calling kho abort until all the other stuff is cleaned up first?

I feel like this is another case where abusing globals gives an unclear
lifecycle model.

Jason

* Re: [PATCH v3 10/30] liveupdate: luo_core: luo_ioctl: Live Update Orchestrator
From: Jason Gunthorpe @ 2025-08-14 13:31 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-11-pasha.tatashin@soleen.com>

On Thu, Aug 07, 2025 at 01:44:16AM +0000, Pasha Tatashin wrote:
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -383,6 +383,8 @@ Code  Seq#    Include File                                             Comments
>  0xB8  01-02  uapi/misc/mrvl_cn10k_dpi.h                                Marvell CN10K DPI driver
>  0xB8  all    uapi/linux/mshv.h                                         Microsoft Hyper-V /dev/mshv driver
>                                                                         <mailto:linux-hyperv@vger.kernel.org>
> +0xBA  all    uapi/linux/liveupdate.h                                   Pasha Tatashin
> +                                                                       <mailto:pasha.tatashin@soleen.com>

Let's not be greedy ;) Just take 00-0F for the moment

Jason

* Re: [PATCH v3 16/30] liveupdate: luo_ioctl: add userpsace interface
From: Jason Gunthorpe @ 2025-08-14 13:49 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-17-pasha.tatashin@soleen.com>

On Thu, Aug 07, 2025 at 01:44:22AM +0000, Pasha Tatashin wrote:
> +/**
> + * DOC: General ioctl format
> + *
> + * The ioctl interface follows a general format to allow for extensibility. Each
> + * ioctl is passed in a structure pointer as the argument providing the size of
> + * the structure in the first u32. The kernel checks that any structure space
> + * beyond what it understands is 0. This allows userspace to use the backward
> + * compatible portion while consistently using the newer, larger, structures.
> + *
> + * ioctls use a standard meaning for common errnos:
> + *
> + *  - ENOTTY: The IOCTL number itself is not supported at all
> + *  - E2BIG: The IOCTL number is supported, but the provided structure has
> + *    non-zero in a part the kernel does not understand.
> + *  - EOPNOTSUPP: The IOCTL number is supported, and the structure is
> + *    understood, however a known field has a value the kernel does not
> + *    understand or support.
> + *  - EINVAL: Everything about the IOCTL was understood, but a field is not
> + *    correct.
> + *  - ENOENT: An ID or IOVA provided does not exist.
                    ^^^^^^^^^

Maybe this should be 'token' ?

> + *  - ENOMEM: Out of memory.
> + *  - EOVERFLOW: Mathematics overflowed.
> + *
> + * As well as additional errnos, within specific ioctls.
> + */

Ah if you copy the comment make sure to faithfully follow it in the
implementation :)

> +struct liveupdate_ioctl_fd_unpreserve {
> +       __u32           size;
> +       __aligned_u64   token;
> +};

It is best to explicitly pad, so add a __u32 reserved between size and
token

Then you need to also check that the reserved is 0 when parsing it,
return -EOPNOTSUPP otherwise.
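
A minimal user-space sketch of that layout and check. The struct and
function names are hypothetical, and uint32_t/uint64_t stand in for
__u32/__aligned_u64:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the uapi struct with the hole made explicit. */
struct unpreserve_args {
	uint32_t size;
	uint32_t reserved;	/* explicit padding, must be zero */
	uint64_t token;
};

/* With the reserved field in place there is no implicit padding left. */
static_assert(offsetof(struct unpreserve_args, token) == 8,
	      "token must sit at offset 8");
static_assert(sizeof(struct unpreserve_args) == 16,
	      "struct must not contain hidden padding");

/* Reject non-zero reserved bits so they can be given a meaning later. */
static int unpreserve_args_check(const struct unpreserve_args *args)
{
	if (args->reserved != 0)
		return -EOPNOTSUPP;
	return 0;
}
```

Because old userspace is forced to pass 0, a later kernel can reuse the
field without breaking anyone.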

> +static atomic_t luo_device_in_use = ATOMIC_INIT(0);

I suggest you bundle this together into one struct with the misc_dev
and the other globals and largely pretend it is not global, eg refer
to it through container_of, etc

Following practices like this makes it harder to abuse the globals.

> +struct luo_ucmd {
> +	void __user *ubuffer;
> +	u32 user_size;
> +	void *cmd;
> +};
> +
> +static int luo_ioctl_fd_preserve(struct luo_ucmd *ucmd)
> +{
> +	struct liveupdate_ioctl_fd_preserve *argp = ucmd->cmd;
> +	int ret;
> +
> +	ret = luo_register_file(argp->token, argp->fd);
> +	if (!ret)
> +		return ret;
> +
> +	if (copy_to_user(ucmd->ubuffer, argp, ucmd->user_size))
> +		return -EFAULT;

This will overflow memory, ucmd->user_size may be > sizeof(*argp)

The respond function is an important part of this scheme:

static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
                                       size_t cmd_len)
{
        if (copy_to_user(ucmd->ubuffer, ucmd->cmd,
                         min_t(size_t, ucmd->user_size, cmd_len)))
                return -EFAULT;

The min (sizeof(*argp) in this case) can't be skipped!
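
To make the clamp concrete, here is a user-space model of the respond
step. It is only a sketch: memcpy() stands in for copy_to_user(), and
struct ucmd_model and ucmd_respond() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* User-space model of the ucmd bookkeeping. */
struct ucmd_model {
	void *ubuffer;		/* stands in for the __user pointer */
	size_t user_size;	/* size userspace put in the first u32 */
	void *cmd;		/* kernel-side copy of the command struct */
};

/*
 * Copy the response back, never reading more than @cmd_len bytes from
 * the kernel-side struct and never writing more than user_size bytes.
 */
static int ucmd_respond(struct ucmd_model *ucmd, size_t cmd_len)
{
	size_t n = ucmd->user_size < cmd_len ? ucmd->user_size : cmd_len;

	assert(ucmd->ubuffer != NULL && ucmd->cmd != NULL);
	memcpy(ucmd->ubuffer, ucmd->cmd, n);
	return 0;
}
```

When userspace passes a struct larger than the kernel knows about, only
the part the kernel understands is written back and the tail is left
untouched.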

> +static int luo_ioctl_fd_restore(struct luo_ucmd *ucmd)
> +{
> +	struct liveupdate_ioctl_fd_restore *argp = ucmd->cmd;
> +	struct file *file;
> +	int ret;
> +
> +	argp->fd = get_unused_fd_flags(O_CLOEXEC);
> +	if (argp->fd < 0) {
> +		pr_err("Failed to allocate new fd: %d\n", argp->fd);

No need

> +		return argp->fd;
> +	}
> +
> +	ret = luo_retrieve_file(argp->token, &file);
> +	if (ret < 0) {
> +		put_unused_fd(argp->fd);
> +
> +		return ret;
> +	}
> +
> +	fd_install(argp->fd, file);
> +
> +	if (copy_to_user(ucmd->ubuffer, argp, ucmd->user_size))
> +		return -EFAULT;

Wrong order, fd_install must be last right before return 0. Failing
system calls should not leave behind installed FDs.

> +static int luo_ioctl_set_event(struct luo_ucmd *ucmd)
> +{
> +	struct liveupdate_ioctl_set_event *argp = ucmd->cmd;
> +	int ret;
> +
> +	switch (argp->event) {
> +	case LIVEUPDATE_PREPARE:
> +		ret = luo_prepare();
> +		break;
> +	case LIVEUPDATE_FINISH:
> +		ret = luo_finish();
> +		break;
> +	case LIVEUPDATE_CANCEL:
> +		ret = luo_cancel();
> +		break;
> +	default:
> +		ret = -EINVAL;

EOPNOTSUPP
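
That follows from the errno list in the DOC comment: the structure is
understood, but a known field carries a value the kernel does not
support. A small user-space sketch, where the numeric event values,
struct set_event_args and the model_*() stubs are all assumptions made
for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical numeric values; the real ones come from the uapi header. */
enum liveupdate_event_model {
	LIVEUPDATE_PREPARE = 0,
	LIVEUPDATE_FINISH = 1,
	LIVEUPDATE_CANCEL = 2,
};

/* Hypothetical argument layout: size first, then the event. */
struct set_event_args {
	uint32_t size;
	uint32_t event;
};

/* Stubs standing in for luo_prepare(), luo_finish() and luo_cancel(). */
static int model_prepare(void) { return 0; }
static int model_finish(void) { return 0; }
static int model_cancel(void) { return 0; }

static int model_set_event(const struct set_event_args *args)
{
	assert(args != NULL);

	switch (args->event) {
	case LIVEUPDATE_PREPARE:
		return model_prepare();
	case LIVEUPDATE_FINISH:
		return model_finish();
	case LIVEUPDATE_CANCEL:
		return model_cancel();
	default:
		/* Known field, but a value this kernel does not support. */
		return -EOPNOTSUPP;
	}
}
```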

> +union ucmd_buffer {
> +	struct liveupdate_ioctl_fd_preserve	preserve;
> +	struct liveupdate_ioctl_fd_unpreserve	unpreserve;
> +	struct liveupdate_ioctl_fd_restore	restore;
> +	struct liveupdate_ioctl_get_state	state;
> +	struct liveupdate_ioctl_set_event	event;
> +};

I discourage the column alignment. Also sort by name.

> +static const struct luo_ioctl_op luo_ioctl_ops[] = {
> +	IOCTL_OP(LIVEUPDATE_IOCTL_FD_PRESERVE, luo_ioctl_fd_preserve,
> +		 struct liveupdate_ioctl_fd_preserve, token),
> +	IOCTL_OP(LIVEUPDATE_IOCTL_FD_UNPRESERVE, luo_ioctl_fd_unpreserve,
> +		 struct liveupdate_ioctl_fd_unpreserve, token),
> +	IOCTL_OP(LIVEUPDATE_IOCTL_FD_RESTORE, luo_ioctl_fd_restore,
> +		 struct liveupdate_ioctl_fd_restore, token),
> +	IOCTL_OP(LIVEUPDATE_IOCTL_GET_STATE, luo_ioctl_get_state,
> +		 struct liveupdate_ioctl_get_state, state),
> +	IOCTL_OP(LIVEUPDATE_IOCTL_SET_EVENT, luo_ioctl_set_event,
> +		 struct liveupdate_ioctl_set_event, event),

Sort by name

Jason

* Re: [PATCH v3 18/30] liveupdate: luo_files: luo_ioctl: Add ioctls for per-file state management
From: Jason Gunthorpe @ 2025-08-14 14:02 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-19-pasha.tatashin@soleen.com>

On Thu, Aug 07, 2025 at 01:44:24AM +0000, Pasha Tatashin wrote:
> +struct liveupdate_ioctl_get_fd_state {
> +	__u32		size;
> +	__u8		incoming;
> +	__aligned_u64	token;
> +	__u32		state;
> +};

Same remark about explicit padding and checking padding for 0

> + * luo_file_get_state - Get the preservation state of a specific file.
> + * @token: The token of the file to query.
> + * @statep: Output pointer to store the file's current live update state.
> + * @incoming: If true, query the state of a restored file from the incoming
> + *            (previous kernel's) set. If false, query a file being prepared
> + *            for preservation in the current set.
> + *
> + * Finds the file associated with the given @token in either the incoming
> + * or outgoing tracking arrays and returns its current LUO state
> + * (NORMAL, PREPARED, FROZEN, UPDATED).
> + *
> + * Return: 0 on success, -ENOENT if the token is not found.
> + */
> +int luo_file_get_state(u64 token, enum liveupdate_state *statep, bool incoming)
> +{
> +	struct luo_file *luo_file;
> +	struct xarray *target_xa;
> +	int ret = 0;
> +
> +	luo_state_read_enter();

Fewer globals; at this point everything should be within memory
attached to the file descriptor and not in globals. Doing this will
promote a good maintainable structure and not spaghetti.

Also I think a BKL design is not a good idea for new code. We've had
so many bad experiences with this pattern promoting uncontrolled
incomprehensible locking.

The xarray already has a lock, why not have reasonable locking inside
the luo_file? Probably just a refcount?

> +	target_xa = incoming ? &luo_files_xa_in : &luo_files_xa_out;
> +	luo_file = xa_load(target_xa, token);
> +
> +	if (!luo_file) {
> +		ret = -ENOENT;
> +		goto out_unlock;
> +	}
> +
> +	scoped_guard(mutex, &luo_file->mutex)
> +		*statep = luo_file->state;
> +
> +out_unlock:
> +	luo_state_read_exit();

If we are using cleanup.h then use it for this too..

But it seems kind of weird, why not just

xa_lock()
xa_load()
*statep = READ_ONCE(luo_file->state);
xa_unlock()

?

> +static int luo_ioctl_set_fd_event(struct luo_ucmd *ucmd)
> +{
> +	struct liveupdate_ioctl_set_fd_event *argp = ucmd->cmd;
> +	int ret;
> +
> +	switch (argp->event) {
> +	case LIVEUPDATE_PREPARE:
> +		ret = luo_file_prepare(argp->token);
> +		break;
> +	case LIVEUPDATE_FREEZE:
> +		ret = luo_file_freeze(argp->token);
> +		break;
> +	case LIVEUPDATE_FINISH:
> +		ret = luo_file_finish(argp->token);
> +		break;
> +	case LIVEUPDATE_CANCEL:
> +		ret = luo_file_cancel(argp->token);
> +		break;

The token should be converted to a file here instead of duplicated in
each function

>  static int luo_open(struct inode *inodep, struct file *filep)
>  {
>  	if (atomic_cmpxchg(&luo_device_in_use, 0, 1))
> @@ -149,6 +191,8 @@ union ucmd_buffer {
>  	struct liveupdate_ioctl_fd_restore	restore;
>  	struct liveupdate_ioctl_get_state	state;
>  	struct liveupdate_ioctl_set_event	event;
> +	struct liveupdate_ioctl_get_fd_state	fd_state;
> +	struct liveupdate_ioctl_set_fd_event	fd_event;
>  };
>  
>  struct luo_ioctl_op {
> @@ -179,6 +223,10 @@ static const struct luo_ioctl_op luo_ioctl_ops[] = {
>  		 struct liveupdate_ioctl_get_state, state),
>  	IOCTL_OP(LIVEUPDATE_IOCTL_SET_EVENT, luo_ioctl_set_event,
>  		 struct liveupdate_ioctl_set_event, event),
> +	IOCTL_OP(LIVEUPDATE_IOCTL_GET_FD_STATE, luo_ioctl_get_fd_state,
> +		 struct liveupdate_ioctl_get_fd_state, token),
> +	IOCTL_OP(LIVEUPDATE_IOCTL_SET_FD_EVENT, luo_ioctl_set_fd_event,
> +		 struct liveupdate_ioctl_set_fd_event, token),
>  };

Keep sorted

Jason

* Re: [PATCH v3 01/30] kho: init new_physxa->phys_bits to fix lockdep
From: Pasha Tatashin @ 2025-08-14 14:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250814131153.GA802098@nvidia.com>

On Thu, Aug 14, 2025 at 1:11 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Aug 07, 2025 at 01:44:07AM +0000, Pasha Tatashin wrote:
> > -     physxa = xa_load_or_alloc(&track->orders, order, sizeof(*physxa));
> > -     if (IS_ERR(physxa))
> > -             return PTR_ERR(physxa);
>
> It is probably better to introduce a function pointer argument to this
> xa_load_or_alloc() to do the alloc and init operation than to open
> code the thing.

Agreed, but this should be a separate clean-up. This particular patch
is a hotfix that should land soon (it was separated from this
series). Once it lands, we are going to do this clean-up.

Pasha

* Re: [PATCH v3 07/30] kho: add interfaces to unpreserve folios and physical memory ranges
From: Pasha Tatashin @ 2025-08-14 15:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250814132233.GB802098@nvidia.com>

On Thu, Aug 14, 2025 at 1:22 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Aug 07, 2025 at 01:44:13AM +0000, Pasha Tatashin wrote:
> > +int kho_unpreserve_phys(phys_addr_t phys, size_t size)
> > +{
>
> Why are we adding phys apis? Didn't we talk about this before and
> agree not to expose these?

It is already there; this patch simply adds the missing unpreserve part.

We can talk about removing it in the future, but the phys interface
provides the benefit of not requiring preserved objects to be a power
of two in length.

>
> The places using it are goofy:
>
> +static int luo_fdt_setup(void)
> +{
> +       fdt_out = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
> +                                          get_order(LUO_FDT_SIZE));
>
> +       ret = kho_preserve_phys(__pa(fdt_out), LUO_FDT_SIZE);
>
> +       WARN_ON_ONCE(kho_unpreserve_phys(__pa(fdt_out), LUO_FDT_SIZE));
>
> It literally allocated a page and then for some reason switches to
> phys with an open coded __pa??
>
> This is ugly, if you want a helper to match __get_free_pages() then
> make one that works on void * directly. You can get the order of the
> void * directly from the struct page IIRC when using GFP_COMP.

I will make these changes.

>
> Which is perhaps another comment, if this __get_free_pages() is going
> to be a common pattern (and I guess it will be) then the API should be
> streamlined alot more:
>
>  void *kho_alloc_preserved_memory(gfp, size);
>  void kho_free_preserved_memory(void *);

Hm, not all GFP flags are compatible with KHO preserve. We could
add this or a similar API, but first let's make KHO completely
stateless: remove the finalize and abort parts from it.

>
> Which can wrapper the get_free_pages and the preserve logic and gives
> a nice path to possibly someday supporting non-PAGE_SIZE allocations.
>
> Jason

* Re: [PATCH RESEND] fs: Add 'rootfsflags' to set rootfs mount options
From: Randy Dunlap @ 2025-08-14 16:23 UTC (permalink / raw)
  To: Lichen Liu, viro, brauner, jack
  Cc: linux-fsdevel, linux-kernel, safinaskar, kexec, rob, weilongchen,
	cyphar, linux-api, zohar, stefanb, initramfs
In-Reply-To: <20250814103424.3287358-2-lichliu@redhat.com>

Hi,

On 8/14/25 3:34 AM, Lichen Liu wrote:
> When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs.
> By default, a tmpfs mount is limited to using 50% of the available RAM
> for its content. This can be problematic in memory-constrained
> environments, particularly during a kdump capture.
> 
> In a kdump scenario, the capture kernel boots with a limited amount of
> memory specified by the 'crashkernel' parameter. If the initramfs is
> large, it may fail to unpack into the tmpfs rootfs due to insufficient
> space. This is because to get X MB of usable space in tmpfs, 2*X MB of
> memory must be available for the mount. This leads to an OOM failure
> during the early boot process, preventing a successful crash dump.
> 
> This patch introduces a new kernel command-line parameter, rootfsflags,
> which allows passing specific mount options directly to the rootfs when
> it is first mounted. This gives users control over the rootfs behavior.
> 
> For example, a user can now specify rootfsflags=size=75% to allow the
> tmpfs to use up to 75% of the available memory. This can significantly
> reduce the memory pressure for kdump.
> 
> Consider a practical example:
> 
> To unpack a 48MB initramfs, the tmpfs needs 48MB of usable space. With
> the default 50% limit, this requires a memory pool of 96MB to be
> available for the tmpfs mount. The total memory requirement is therefore
> approximately: 16MB (vmlinuz) + 48MB (loaded initramfs) + 48MB (unpacked
> kernel) + 96MB (for tmpfs) + 12MB (runtime overhead) ≈ 220MB.
> 
> By using rootfsflags=size=75%, the memory pool required for the 48MB
> tmpfs is reduced to 48MB / 0.75 = 64MB. This reduces the total memory
> requirement by 32MB (96MB - 64MB), allowing the kdump to succeed with a
> smaller crashkernel size, such as 192MB.
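
The arithmetic in the example above can be checked with a short sketch.
The function names are hypothetical, and the fixed 16MB and 12MB figures
and the way the terms are summed simply mirror the commit message's own
accounting:

```c
#include <assert.h>

/*
 * Memory pool (in MB) that tmpfs needs so that @usable_mb fits under a
 * size=<percent>% limit, rounded up to a whole MB.
 */
static unsigned int tmpfs_pool_mb(unsigned int usable_mb, unsigned int percent)
{
	assert(percent > 0 && percent <= 100);
	return (usable_mb * 100 + percent - 1) / percent;
}

/* Total estimate using the same terms as the commit message. */
static unsigned int kdump_total_mb(unsigned int initramfs_mb,
				   unsigned int percent)
{
	const unsigned int vmlinuz_mb = 16;
	const unsigned int overhead_mb = 12;

	return vmlinuz_mb +
	       initramfs_mb +	/* loaded initramfs */
	       initramfs_mb +	/* unpacked copy */
	       tmpfs_pool_mb(initramfs_mb, percent) +
	       overhead_mb;
}
```

For a 48MB initramfs this gives 220MB at the default 50% and 188MB at
75%, which is consistent with the 192MB crashkernel size mentioned.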
> 
> An alternative approach of reusing the existing rootflags parameter was
> considered. However, a new, dedicated rootfsflags parameter was chosen
> to avoid altering the current behavior of rootflags (which applies to
> the final root filesystem) and to prevent any potential regressions.
> 
> This approach is inspired by prior discussions and patches on the topic.
> Ref: https://www.lightofdawn.org/blog/?viewDetailed=00128
> Ref: https://landley.net/notes-2015.html#01-01-2015
> Ref: https://lkml.org/lkml/2021/6/29/783
> Ref: https://www.kernel.org/doc/html/latest/filesystems/ramfs-rootfs-initramfs.html#what-is-rootfs
> 
> Signed-off-by: Lichen Liu <lichliu@redhat.com>
> Tested-by: Rob Landley <rob@landley.net>
> ---
> Hi VFS maintainers,
> 
> Resending this patch as it did not get picked up.
> This patch is intended for the VFS tree.
> 
>  fs/namespace.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 8f1000f9f3df..e484c26d5e3f 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -65,6 +65,15 @@ static int __init set_mphash_entries(char *str)
>  }
>  __setup("mphash_entries=", set_mphash_entries);
>  
> +static char * __initdata rootfs_flags;
> +static int __init rootfs_flags_setup(char *str)
> +{
> +	rootfs_flags = str;
> +	return 1;
> +}
> +
> +__setup("rootfsflags=", rootfs_flags_setup);

Please document this option (alphabetically) in
Documentation/admin-guide/kernel-parameters.txt.

Thanks.

> +
>  static u64 event;
>  static DEFINE_XARRAY_FLAGS(mnt_id_xa, XA_FLAGS_ALLOC);
>  static DEFINE_IDA(mnt_group_ida);
> @@ -5677,7 +5686,7 @@ static void __init init_mount_tree(void)
>  	struct mnt_namespace *ns;
>  	struct path root;
>  
> -	mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", NULL);
> +	mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", rootfs_flags);
>  	if (IS_ERR(mnt))
>  		panic("Can't create rootfs");
>  

-- 
~Randy


* Re: [PATCH util-linux v2] fallocate: add FALLOC_FL_WRITE_ZEROES support
From: Darrick J. Wong @ 2025-08-14 16:52 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-block, dm-devel, linux-nvme, linux-scsi,
	linux-kernel, linux-api, hch, tytso, bmarzins, chaitanyak,
	shinichiro.kawasaki, brauner, martin.petersen, yi.zhang,
	chengzhihao1, yukuai3, yangerkun
In-Reply-To: <20250813024015.2502234-1-yi.zhang@huaweicloud.com>

On Wed, Aug 13, 2025 at 10:40:15AM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> The Linux kernel (since version 6.17) supports FALLOC_FL_WRITE_ZEROES in
> fallocate(2). Add support for FALLOC_FL_WRITE_ZEROES to the fallocate
> utility by introducing a new option -w|--write-zeroes.
> 
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=278c7d9b5e0c
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> v1->v2:
>  - Minor description modification to align with the kernel.
> 
>  sys-utils/fallocate.1.adoc | 11 +++++++++--
>  sys-utils/fallocate.c      | 20 ++++++++++++++++----
>  2 files changed, 25 insertions(+), 6 deletions(-)
> 
> diff --git a/sys-utils/fallocate.1.adoc b/sys-utils/fallocate.1.adoc
> index 44ee0ef4c..0ec9ff9a9 100644
> --- a/sys-utils/fallocate.1.adoc
> +++ b/sys-utils/fallocate.1.adoc
> @@ -12,7 +12,7 @@ fallocate - preallocate or deallocate space to a file

<snip all the long lines>

> +*-w*, *--write-zeroes*::
> +Zeroes space in the byte range starting at _offset_ and continuing
> for _length_ bytes. Within the specified range, blocks are
> preallocated for the regions that span the holes in the file. After a
> successful call, subsequent reads from this range will return zeroes,
> subsequent writes to that range do not require further changes to the
> file mapping metadata.

"...will return zeroes and subsequent writes to that range..." ?

> ++
> +Zeroing is done within the filesystem by preferably submitting write

I think we should say less about what the filesystem actually does to
preserve some flexibility:

"Zeroing is done within the filesystem. The filesystem may use a
hardware accelerated zeroing command, or it may submit regular writes.
The behavior depends on the filesystem design and available hardware."
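
For reference, a minimal sketch of what a caller of the new mode might
look like. write_zeroes_range() is a hypothetical helper; the 0x80
fallback value is the one from the patch's own #ifndef block. Kernels or
filesystems without support are expected to fail the call, so the helper
just hands the errno back:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

#ifndef FALLOC_FL_WRITE_ZEROES
# define FALLOC_FL_WRITE_ZEROES 0x80	/* fallback value taken from the patch */
#endif

/*
 * Ask the filesystem to zero and allocate [offset, offset + length).
 * Returns 0 on success or a negative errno. FALLOC_FL_KEEP_SIZE is
 * deliberately not passed, since the man page text says it cannot be
 * combined with the write-zeroes operation.
 */
static int write_zeroes_range(int fd, off_t offset, off_t length)
{
	/* fallocate(2) would reject these with EINVAL anyway. */
	assert(offset >= 0 && length > 0);

	if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, offset, length) == 0)
		return 0;
	return -errno;
}
```

A caller that needs zeroed space regardless can fall back to
FALLOC_FL_ZERO_RANGE or plain writes when this returns an error.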

> zeores commands, the alternative way is submitting actual zeroed data,
> the specified range will be converted into written extents. The write
> zeroes command is typically faster than write actual data if the
> device supports unmap write zeroes, the specified range will not be
> physically zeroed out on the device.
> ++
> +Options *--keep-size* can not be specified for the write-zeroes
> operation.
> +
>  include::man-common/help-version.adoc[]
>  
>  == AUTHORS
> diff --git a/sys-utils/fallocate.c b/sys-utils/fallocate.c
> index 13bf52915..8d37fdad7 100644
> --- a/sys-utils/fallocate.c
> +++ b/sys-utils/fallocate.c
> @@ -40,7 +40,7 @@
>  #if defined(HAVE_LINUX_FALLOC_H) && \
>      (!defined(FALLOC_FL_KEEP_SIZE) || !defined(FALLOC_FL_PUNCH_HOLE) || \
>       !defined(FALLOC_FL_COLLAPSE_RANGE) || !defined(FALLOC_FL_ZERO_RANGE) || \
> -     !defined(FALLOC_FL_INSERT_RANGE))
> +     !defined(FALLOC_FL_INSERT_RANGE) || !defined(FALLOC_FL_WRITE_ZEROES))
>  # include <linux/falloc.h>	/* non-libc fallback for FALLOC_FL_* flags */
>  #endif
>  
> @@ -65,6 +65,10 @@
>  # define FALLOC_FL_INSERT_RANGE		0x20
>  #endif
>  
> +#ifndef FALLOC_FL_WRITE_ZEROES
> +# define FALLOC_FL_WRITE_ZEROES		0x80
> +#endif
> +
>  #include "nls.h"
>  #include "strutils.h"
>  #include "c.h"
> @@ -94,6 +98,7 @@ static void __attribute__((__noreturn__)) usage(void)
>  	fputs(_(" -o, --offset <num>   offset for range operations, in bytes\n"), out);
>  	fputs(_(" -p, --punch-hole     replace a range with a hole (implies -n)\n"), out);
>  	fputs(_(" -z, --zero-range     zero and ensure allocation of a range\n"), out);
> +	fputs(_(" -w, --write-zeroes   write zeroes and ensure allocation of a range\n"), out);
>  #ifdef HAVE_POSIX_FALLOCATE
>  	fputs(_(" -x, --posix          use posix_fallocate(3) instead of fallocate(2)\n"), out);
>  #endif
> @@ -304,6 +309,7 @@ int main(int argc, char **argv)
>  	    { "dig-holes",      no_argument,       NULL, 'd' },
>  	    { "insert-range",   no_argument,       NULL, 'i' },
>  	    { "zero-range",     no_argument,       NULL, 'z' },
> +	    { "write-zeroes",   no_argument,       NULL, 'w' },
>  	    { "offset",         required_argument, NULL, 'o' },
>  	    { "length",         required_argument, NULL, 'l' },
>  	    { "posix",          no_argument,       NULL, 'x' },
> @@ -312,8 +318,8 @@ int main(int argc, char **argv)
>  	};
>  
>  	static const ul_excl_t excl[] = {	/* rows and cols in ASCII order */
> -		{ 'c', 'd', 'i', 'p', 'x', 'z'},
> -		{ 'c', 'i', 'n', 'x' },
> +		{ 'c', 'd', 'i', 'p', 'w', 'x', 'z'},
> +		{ 'c', 'i', 'n', 'w', 'x' },
>  		{ 0 }
>  	};
>  	int excl_st[ARRAY_SIZE(excl)] = UL_EXCL_STATUS_INIT;
> @@ -323,7 +329,7 @@ int main(int argc, char **argv)
>  	textdomain(PACKAGE);
>  	close_stdout_atexit();
>  
> -	while ((c = getopt_long(argc, argv, "hvVncpdizxl:o:", longopts, NULL))
> +	while ((c = getopt_long(argc, argv, "hvVncpdizwxl:o:", longopts, NULL))
>  			!= -1) {
>  
>  		err_exclusive_options(c, longopts, excl, excl_st);
> @@ -353,6 +359,9 @@ int main(int argc, char **argv)
>  		case 'z':
>  			mode |= FALLOC_FL_ZERO_RANGE;
>  			break;
> +		case 'w':
> +			mode |= FALLOC_FL_WRITE_ZEROES;
> +			break;
>  		case 'x':
>  #ifdef HAVE_POSIX_FALLOCATE
>  			posix = 1;
> @@ -429,6 +438,9 @@ int main(int argc, char **argv)
>  			else if (mode & FALLOC_FL_ZERO_RANGE)
>  				fprintf(stdout, _("%s: %s (%ju bytes) zeroed.\n"),
>  								filename, str, length);
> +			else if (mode & FALLOC_FL_WRITE_ZEROES)
> +				fprintf(stdout, _("%s: %s (%ju bytes) write zeroed.\n"),

"write zeroed" is a little strange, but I don't have a better
suggestion. :)

--D

> +								filename, str, length);
>  			else
>  				fprintf(stdout, _("%s: %s (%ju bytes) allocated.\n"),
>  								filename, str, length);
> -- 
> 2.39.2
> 
> 


* Re: [PATCH xfsprogs v2] xfs_io: add FALLOC_FL_WRITE_ZEROES support
From: Darrick J. Wong @ 2025-08-14 16:54 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-block, dm-devel, linux-nvme, linux-scsi,
	linux-xfs, linux-kernel, linux-api, hch, tytso, bmarzins,
	chaitanyak, shinichiro.kawasaki, brauner, martin.petersen,
	yi.zhang, chengzhihao1, yukuai3, yangerkun
In-Reply-To: <20250813024250.2504126-1-yi.zhang@huaweicloud.com>

On Wed, Aug 13, 2025 at 10:42:50AM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> The Linux kernel (since version 6.17) supports FALLOC_FL_WRITE_ZEROES in
> fallocate(2). Add support for FALLOC_FL_WRITE_ZEROES support to the
> fallocate utility by introducing a new 'fwzero' command in the xfs_io
> tool.
> 
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=278c7d9b5e0c
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> v1->v2:
>  - Minor description modification to align with the kernel.
> 
>  io/prealloc.c     | 36 ++++++++++++++++++++++++++++++++++++
>  man/man8/xfs_io.8 |  6 ++++++
>  2 files changed, 42 insertions(+)
> 
> diff --git a/io/prealloc.c b/io/prealloc.c
> index 8e968c9f..9a64bf53 100644
> --- a/io/prealloc.c
> +++ b/io/prealloc.c
> @@ -30,6 +30,10 @@
>  #define FALLOC_FL_UNSHARE_RANGE 0x40
>  #endif
>  
> +#ifndef FALLOC_FL_WRITE_ZEROES
> +#define FALLOC_FL_WRITE_ZEROES 0x80
> +#endif
> +
>  static cmdinfo_t allocsp_cmd;
>  static cmdinfo_t freesp_cmd;
>  static cmdinfo_t resvsp_cmd;
> @@ -41,6 +45,7 @@ static cmdinfo_t fcollapse_cmd;
>  static cmdinfo_t finsert_cmd;
>  static cmdinfo_t fzero_cmd;
>  static cmdinfo_t funshare_cmd;
> +static cmdinfo_t fwzero_cmd;
>  
>  static int
>  offset_length(
> @@ -377,6 +382,27 @@ funshare_f(
>  	return 0;
>  }
>  
> +static int
> +fwzero_f(
> +	int		argc,
> +	char		**argv)
> +{
> +	xfs_flock64_t	segment;
> +	int		mode = FALLOC_FL_WRITE_ZEROES;

Shouldn't this take a -k to add FALLOC_FL_KEEP_SIZE like fzero?

(The code otherwise looks fine to me)

--D

> +
> +	if (!offset_length(argv[1], argv[2], &segment)) {
> +		exitcode = 1;
> +		return 0;
> +	}
> +
> +	if (fallocate(file->fd, mode, segment.l_start, segment.l_len)) {
> +		perror("fallocate");
> +		exitcode = 1;
> +		return 0;
> +	}
> +	return 0;
> +}
> +
>  void
>  prealloc_init(void)
>  {
> @@ -489,4 +515,14 @@ prealloc_init(void)
>  	funshare_cmd.oneline =
>  	_("unshares shared blocks within the range");
>  	add_command(&funshare_cmd);
> +
> +	fwzero_cmd.name = "fwzero";
> +	fwzero_cmd.cfunc = fwzero_f;
> +	fwzero_cmd.argmin = 2;
> +	fwzero_cmd.argmax = 2;
> +	fwzero_cmd.flags = CMD_NOMAP_OK | CMD_FOREIGN_OK;
> +	fwzero_cmd.args = _("off len");
> +	fwzero_cmd.oneline =
> +	_("zeroes space and eliminates holes by allocating and submitting write zeroes");
> +	add_command(&fwzero_cmd);
>  }
> diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
> index b0dcfdb7..0a673322 100644
> --- a/man/man8/xfs_io.8
> +++ b/man/man8/xfs_io.8
> @@ -550,6 +550,12 @@ With the
>  .B -k
>  option, use the FALLOC_FL_KEEP_SIZE flag as well.
>  .TP
> +.BI fwzero " offset length"
> +Call fallocate with FALLOC_FL_WRITE_ZEROES flag as described in the
> +.BR fallocate (2)
> +manual page to allocate and zero blocks within the range by submitting write
> +zeroes.
> +.TP
>  .BI zero " offset length"
>  Call xfsctl with
>  .B XFS_IOC_ZERO_RANGE
> -- 
> 2.39.2
> 
> 


* Re: [PATCH v3 07/30] kho: add interfaces to unpreserve folios and physical memory ranges
From: Jason Gunthorpe @ 2025-08-14 17:01 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bCbjmRKtVVAok7GH8xvh8JWrga5Oj-iK-p=1M79AqvhRA@mail.gmail.com>

On Thu, Aug 14, 2025 at 03:05:04PM +0000, Pasha Tatashin wrote:
> On Thu, Aug 14, 2025 at 1:22 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Thu, Aug 07, 2025 at 01:44:13AM +0000, Pasha Tatashin wrote:
> > > +int kho_unpreserve_phys(phys_addr_t phys, size_t size)
> > > +{
> >
> > Why are we adding phys apis? Didn't we talk about this before and
> > agree not to expose these?
> 
> It is already there, this patch simply completes a lacking unpreserve part.

This patch yes, but that is because the later patches intend to use
it, which I argue those patches should not.

There should not be any users of these phys interfaces because they
make no sense. The API preserves folios and brings allocated folios
back on the other side. None of that is phys.

> > Which is perhaps another comment, if this __get_free_pages() is going
> > to be a common pattern (and I guess it will be) then the API should be
> > streamlined alot more:
> >
> >  void *kho_alloc_preserved_memory(gfp, size);
> >  void kho_free_preserved_memory(void *);
> 
> Hm, not all GFP flags are compatible with KHO preserve, but we could
> add this or similar API, but first let's make KHO completely
> stateless: remove, finalize and abort parts from it.

Right, in those cases we often warn on and mask the invalid flags.
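
A rough userspace illustration of that warn-and-mask pattern follows. The
MODEL_* flag names and model_sanitize_gfp() are made up for this sketch and
are not kernel APIs; in the kernel the warning would more likely be a
WARN_ON_ONCE() than an fprintf().

```c
#include <stdio.h>

/* Hypothetical flag bits, standing in for real GFP flags. */
#define MODEL_GFP_KERNEL  0x01u
#define MODEL_GFP_ZERO    0x02u
#define MODEL_GFP_MOVABLE 0x04u	/* assumed incompatible with preservation */
#define MODEL_GFP_ALLOWED (MODEL_GFP_KERNEL | MODEL_GFP_ZERO)

/*
 * Warn once about flags the preserved allocator cannot honour, then
 * strip them so the allocation can still proceed.
 */
static unsigned int model_sanitize_gfp(unsigned int gfp)
{
	static int warned;
	unsigned int bad = gfp & ~MODEL_GFP_ALLOWED;

	if (bad && !warned) {
		warned = 1;
		fprintf(stderr, "preserved alloc: dropping unsupported flags 0x%x\n", bad);
	}
	return gfp & MODEL_GFP_ALLOWED;
}
```

The caller keeps working with the reduced flag set, and the one-time
warning points at the call site that needs fixing.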

Jason


* Re: [PATCH v3 07/30] kho: add interfaces to unpreserve folios and physical memory ranges
From: Mike Rapoport @ 2025-08-15  9:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250814132233.GB802098@nvidia.com>

On Thu, Aug 14, 2025 at 10:22:33AM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 07, 2025 at 01:44:13AM +0000, Pasha Tatashin wrote:
> > +int kho_unpreserve_phys(phys_addr_t phys, size_t size)
> > +{
> 
> Why are we adding phys apis? Didn't we talk about this before and
> agree not to expose these?
> 
> The places using it are goofy:
> 
> +static int luo_fdt_setup(void)
> +{
> +       fdt_out = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
> +                                          get_order(LUO_FDT_SIZE));
> 
> +       ret = kho_preserve_phys(__pa(fdt_out), LUO_FDT_SIZE);
> 
> +       WARN_ON_ONCE(kho_unpreserve_phys(__pa(fdt_out), LUO_FDT_SIZE));
> 
> It literally allocated a page and then for some reason switches to
> phys with an open coded __pa??
> 
> This is ugly, if you want a helper to match __get_free_pages() then
> make one that works on void * directly. You can get the order of the
> void * directly from the struct page IIRC when using GFP_COMP.
> 
> Which is perhaps another comment, if this __get_free_pages() is going
> to be a common pattern (and I guess it will be) then the API should be
> streamlined alot more:
> 
>  void *kho_alloc_preserved_memory(gfp, size);
>  void kho_free_preserved_memory(void *);

This looks backwards to me. KHO should not deal with memory allocation;
its responsibility is to preserve/restore the memory objects it supports.

For __get_free_pages() the natural KHO API is kho_(un)preserve_pages().
With struct page/memdesc we always have page_to_<specialized object> from
one side and page_to_pfn from the other side.

Then the folio and phys/virt APIs just become thin wrappers around the _page
APIs. And down the road we can add slab and maybe vmalloc.

Once folios no longer overlap struct page, we'll have a hard time with only
kho_preserve_folio() for memory that's not actually a folio (i.e. not anon
or page cache).
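
The layering described above can be sketched in userspace roughly as
follows. This is only a model under stated assumptions: every model_*
name, the fixed 4 KiB page size and the byte-per-pfn tracking array are
invented for illustration and do not reflect how KHO actually tracks
preserved memory. The point is just that one page-range primitive carries
the logic and the folio and phys variants only convert their arguments.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  ((uint64_t)1 << MODEL_PAGE_SHIFT)
#define MODEL_NR_PFNS    1024UL

/* Hypothetical tracking state: 1 means the pfn is preserved. */
static unsigned char model_preserved[MODEL_NR_PFNS];

/* Core primitive: mark or clear a contiguous pfn range. */
static int model_mark_pages(unsigned long pfn, unsigned long nr, unsigned char val)
{
	unsigned long i;

	if (nr == 0 || pfn >= MODEL_NR_PFNS || nr > MODEL_NR_PFNS - pfn)
		return -1;
	for (i = 0; i < nr; i++)
		model_preserved[pfn + i] = val;
	return 0;
}

static int model_preserve_pages(unsigned long pfn, unsigned long nr)
{
	return model_mark_pages(pfn, nr, 1);
}

static int model_unpreserve_pages(unsigned long pfn, unsigned long nr)
{
	return model_mark_pages(pfn, nr, 0);
}

/* Stand-in for a folio: a head pfn plus an allocation order. */
struct model_folio {
	unsigned long head_pfn;
	unsigned int order;
};

/* Thin wrapper: folio -> pfn range. */
static int model_preserve_folio(const struct model_folio *folio)
{
	return model_preserve_pages(folio->head_pfn, 1UL << folio->order);
}

/* Thin wrappers: page-aligned physical range -> pfn range. */
static int model_preserve_phys(uint64_t phys, uint64_t size)
{
	if ((phys | size) & (MODEL_PAGE_SIZE - 1))
		return -1;
	return model_preserve_pages((unsigned long)(phys >> MODEL_PAGE_SHIFT),
				    (unsigned long)(size >> MODEL_PAGE_SHIFT));
}

static int model_unpreserve_phys(uint64_t phys, uint64_t size)
{
	if ((phys | size) & (MODEL_PAGE_SIZE - 1))
		return -1;
	return model_unpreserve_pages((unsigned long)(phys >> MODEL_PAGE_SHIFT),
				      (unsigned long)(size >> MODEL_PAGE_SHIFT));
}
```

With this shape, adding another object type later (slab, vmalloc) would
mean adding another converter on top of the same page-range primitive.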
 
> Which can wrapper the get_free_pages and the preserve logic and gives
> a nice path to possibly someday supporting non-PAGE_SIZE allocations.
> 
> Jason
> 

-- 
Sincerely yours,
Mike.


* Re: [PATCH util-linux v2] fallocate: add FALLOC_FL_WRITE_ZEROES support
From: Zhang Yi @ 2025-08-15  9:29 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, linux-block, dm-devel, linux-nvme, linux-scsi,
	linux-kernel, linux-api, hch, tytso, bmarzins, chaitanyak,
	shinichiro.kawasaki, brauner, martin.petersen, yi.zhang,
	chengzhihao1, yukuai3, yangerkun
In-Reply-To: <20250814165218.GQ7942@frogsfrogsfrogs>

Thank you for your review comments!

On 2025/8/15 0:52, Darrick J. Wong wrote:
> On Wed, Aug 13, 2025 at 10:40:15AM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> The Linux kernel (since version 6.17) supports FALLOC_FL_WRITE_ZEROES in
>> fallocate(2). Add support for FALLOC_FL_WRITE_ZEROES to the fallocate
>> utility by introducing a new option -w|--write-zeroes.
>>
>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=278c7d9b5e0c
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>> v1->v2:
>>  - Minor description modification to align with the kernel.
>>
>>  sys-utils/fallocate.1.adoc | 11 +++++++++--
>>  sys-utils/fallocate.c      | 20 ++++++++++++++++----
>>  2 files changed, 25 insertions(+), 6 deletions(-)
>>
>> diff --git a/sys-utils/fallocate.1.adoc b/sys-utils/fallocate.1.adoc
>> index 44ee0ef4c..0ec9ff9a9 100644
>> --- a/sys-utils/fallocate.1.adoc
>> +++ b/sys-utils/fallocate.1.adoc
>> @@ -12,7 +12,7 @@ fallocate - preallocate or deallocate space to a file
> 
> <snip all the long lines>
> 
>> +*-w*, *--write-zeroes*::
>> +Zeroes space in the byte range starting at _offset_ and continuing
>> for _length_ bytes. Within the specified range, blocks are
>> preallocated for the regions that span the holes in the file. After a
>> successful call, subsequent reads from this range will return zeroes,
>> subsequent writes to that range do not require further changes to the
>> file mapping metadata.
> 
> "...will return zeroes and subsequent writes to that range..." ?
> 

Yeah.

>> ++
>> +Zeroing is done within the filesystem by preferably submitting write
> 
> I think we should say less about what the filesystem actually does to
> preserve some flexibility:
> 
> "Zeroing is done within the filesystem. The filesystem may use a
> hardware accelerated zeroing command, or it may submit regular writes.
> The behavior depends on the filesystem design and available hardware."
> 

Sure.

>> zeores commands, the alternative way is submitting actual zeroed data,
>> the specified range will be converted into written extents. The write
>> zeroes command is typically faster than write actual data if the
>> device supports unmap write zeroes, the specified range will not be
>> physically zeroed out on the device.
>> ++
>> +Options *--keep-size* can not be specified for the write-zeroes
>> operation.
>> +
>>  include::man-common/help-version.adoc[]
>>  
>>  == AUTHORS
[..]
>> @@ -429,6 +438,9 @@ int main(int argc, char **argv)
>>  			else if (mode & FALLOC_FL_ZERO_RANGE)
>>  				fprintf(stdout, _("%s: %s (%ju bytes) zeroed.\n"),
>>  								filename, str, length);
>> +			else if (mode & FALLOC_FL_WRITE_ZEROES)
>> +				fprintf(stdout, _("%s: %s (%ju bytes) write zeroed.\n"),
> 
> "write zeroed" is a little strange, but I don't have a better
> suggestion. :)
> 

Hmm... What about simply using "zeroed", the same as for FALLOC_FL_ZERO_RANGE?
Users should be aware of the parameters they have passed to fallocate(),
so they should not use this print for further differentiation.

Thanks,
Yi.



* Re: [PATCH xfsprogs v2] xfs_io: add FALLOC_FL_WRITE_ZEROES support
From: Zhang Yi @ 2025-08-15  9:59 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, linux-block, dm-devel, linux-nvme, linux-scsi,
	linux-xfs, linux-kernel, linux-api, hch, tytso, bmarzins,
	chaitanyak, shinichiro.kawasaki, brauner, martin.petersen,
	yi.zhang, chengzhihao1, yukuai3, yangerkun
In-Reply-To: <20250814165430.GR7942@frogsfrogsfrogs>

On 2025/8/15 0:54, Darrick J. Wong wrote:
> On Wed, Aug 13, 2025 at 10:42:50AM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> The Linux kernel (since version 6.17) supports FALLOC_FL_WRITE_ZEROES in
>> fallocate(2). Add support for FALLOC_FL_WRITE_ZEROES support to the
>> fallocate utility by introducing a new 'fwzero' command in the xfs_io
>> tool.
>>
>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=278c7d9b5e0c
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>> v1->v2:
>>  - Minor description modification to align with the kernel.
>>
>>  io/prealloc.c     | 36 ++++++++++++++++++++++++++++++++++++
>>  man/man8/xfs_io.8 |  6 ++++++
>>  2 files changed, 42 insertions(+)
>>
>> diff --git a/io/prealloc.c b/io/prealloc.c
>> index 8e968c9f..9a64bf53 100644
>> --- a/io/prealloc.c
>> +++ b/io/prealloc.c
>> @@ -30,6 +30,10 @@
>>  #define FALLOC_FL_UNSHARE_RANGE 0x40
>>  #endif
>>  
>> +#ifndef FALLOC_FL_WRITE_ZEROES
>> +#define FALLOC_FL_WRITE_ZEROES 0x80
>> +#endif
>> +
>>  static cmdinfo_t allocsp_cmd;
>>  static cmdinfo_t freesp_cmd;
>>  static cmdinfo_t resvsp_cmd;
>> @@ -41,6 +45,7 @@ static cmdinfo_t fcollapse_cmd;
>>  static cmdinfo_t finsert_cmd;
>>  static cmdinfo_t fzero_cmd;
>>  static cmdinfo_t funshare_cmd;
>> +static cmdinfo_t fwzero_cmd;
>>  
>>  static int
>>  offset_length(
>> @@ -377,6 +382,27 @@ funshare_f(
>>  	return 0;
>>  }
>>  
>> +static int
>> +fwzero_f(
>> +	int		argc,
>> +	char		**argv)
>> +{
>> +	xfs_flock64_t	segment;
>> +	int		mode = FALLOC_FL_WRITE_ZEROES;
> 
> Shouldn't this take a -k to add FALLOC_FL_KEEP_SIZE like fzero?
> 

Since allocating blocks with written extents beyond the inode size
is not permitted, the FALLOC_FL_WRITE_ZEROES flag cannot be used
together with FALLOC_FL_KEEP_SIZE.
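
As a minimal sketch of what that rule means for a caller, the helper below
builds the mode for an fwzero-style command and refuses a keep-size
request up front. fwzero_mode() is a hypothetical name, not part of
xfs_io; 0x80 is the fallback value from the patch and 0x01 is the
FALLOC_FL_KEEP_SIZE value from the Linux UAPI header.

```c
/* Fallbacks for headers that predate the new flag. */
#ifndef FALLOC_FL_KEEP_SIZE
#define FALLOC_FL_KEEP_SIZE 0x01
#endif
#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES 0x80
#endif

/*
 * Build the fallocate(2) mode for a write-zeroes request.
 * Returns 0 and stores the mode, or -1 if the caller asked to keep the
 * file size, which cannot be combined with FALLOC_FL_WRITE_ZEROES.
 */
static int fwzero_mode(int keep_size, int *mode)
{
	if (keep_size)
		return -1;
	*mode = FALLOC_FL_WRITE_ZEROES;
	return 0;
}
```

On success the mode would be passed to fallocate(2) together with the
offset and length, as fwzero_f() does in the patch.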

Thanks,
Yi.

> (The code otherwise looks fine to me)
> 
> --D
> 
>> +
>> +	if (!offset_length(argv[1], argv[2], &segment)) {
>> +		exitcode = 1;
>> +		return 0;
>> +	}
>> +
>> +	if (fallocate(file->fd, mode, segment.l_start, segment.l_len)) {
>> +		perror("fallocate");
>> +		exitcode = 1;
>> +		return 0;
>> +	}
>> +	return 0;
>> +}
>> +
>>  void
>>  prealloc_init(void)
>>  {
>> @@ -489,4 +515,14 @@ prealloc_init(void)
>>  	funshare_cmd.oneline =
>>  	_("unshares shared blocks within the range");
>>  	add_command(&funshare_cmd);
>> +
>> +	fwzero_cmd.name = "fwzero";
>> +	fwzero_cmd.cfunc = fwzero_f;
>> +	fwzero_cmd.argmin = 2;
>> +	fwzero_cmd.argmax = 2;
>> +	fwzero_cmd.flags = CMD_NOMAP_OK | CMD_FOREIGN_OK;
>> +	fwzero_cmd.args = _("off len");
>> +	fwzero_cmd.oneline =
>> +	_("zeroes space and eliminates holes by allocating and submitting write zeroes");
>> +	add_command(&fwzero_cmd);
>>  }
>> diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
>> index b0dcfdb7..0a673322 100644
>> --- a/man/man8/xfs_io.8
>> +++ b/man/man8/xfs_io.8
>> @@ -550,6 +550,12 @@ With the
>>  .B -k
>>  option, use the FALLOC_FL_KEEP_SIZE flag as well.
>>  .TP
>> +.BI fwzero " offset length"
>> +Call fallocate with FALLOC_FL_WRITE_ZEROES flag as described in the
>> +.BR fallocate (2)
>> +manual page to allocate and zero blocks within the range by submitting write
>> +zeroes.
>> +.TP
>>  .BI zero " offset length"
>>  Call xfsctl with
>>  .B XFS_IOC_ZERO_RANGE
>> -- 
>> 2.39.2
>>
>>



* Re: [PATCH v5 1/3] man/man2/mremap.2: explicitly document the simple move operation
From: Alejandro Colomar @ 2025-08-15 10:05 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <0a5d0d6e9f75e8e2de05506f73c41b069d77de36.1754924278.git.lorenzo.stoakes@oracle.com>


Hi Lorenzo,

On Mon, Aug 11, 2025 at 03:59:37PM +0100, Lorenzo Stoakes wrote:
> In preparation for discussing newly introduced mremap() behaviour to permit
> the move of multiple mappings at once, add a section to the mremap.2 man
> page to describe these operations in general.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thanks!  I've applied this patch.
<https://www.alejandro-colomar.es/src/alx/linux/man-pages/man-pages.git/commit/?h=contrib&id=6ba37b9e14f6565d0cccecb634100d7fe11d22fb>


Have a lovely day!
Alex

> ---
>  man/man2/mremap.2 | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
> index 2168ca728..4e3c8e54e 100644
> --- a/man/man2/mremap.2
> +++ b/man/man2/mremap.2
> @@ -25,6 +25,20 @@ moving it at the same time (controlled by the
>  argument and
>  the available virtual address space).
>  .P
> +Mappings can also simply be moved
> +(without any resizing)
> +by specifying equal
> +.I old_size
> +and
> +.I new_size
> +and using the
> +.B MREMAP_FIXED
> +flag
> +(see below).
> +The
> +.B MREMAP_DONTUNMAP
> +flag may also be specified.
> +.P
>  .I old_address
>  is the old address of the virtual memory block that you
>  want to expand (or shrink).
> -- 
> 2.50.1
> 

-- 
<https://www.alejandro-colomar.es/>



* Re: [PATCH RESEND] fs: Add 'rootfsflags' to set rootfs mount options
From: Lichen Liu @ 2025-08-15 12:12 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, safinaskar,
	kexec, rob, weilongchen, cyphar, linux-api, zohar, stefanb,
	initramfs
In-Reply-To: <dd25041f-98e0-4bb5-bcd5-ba3507262c76@infradead.org>

Thanks Randy,

I will send a v2 with documentation.

On Fri, Aug 15, 2025 at 12:27 AM Randy Dunlap <rdunlap@infradead.org> wrote:
>
> Hi,
>
> On 8/14/25 3:34 AM, Lichen Liu wrote:
> > When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs.
> > By default, a tmpfs mount is limited to using 50% of the available RAM
> > for its content. This can be problematic in memory-constrained
> > environments, particularly during a kdump capture.
> >
> > In a kdump scenario, the capture kernel boots with a limited amount of
> > memory specified by the 'crashkernel' parameter. If the initramfs is
> > large, it may fail to unpack into the tmpfs rootfs due to insufficient
> > space. This is because to get X MB of usable space in tmpfs, 2*X MB of
> > memory must be available for the mount. This leads to an OOM failure
> > during the early boot process, preventing a successful crash dump.
> >
> > This patch introduces a new kernel command-line parameter, rootfsflags,
> > which allows passing specific mount options directly to the rootfs when
> > it is first mounted. This gives users control over the rootfs behavior.
> >
> > For example, a user can now specify rootfsflags=size=75% to allow the
> > tmpfs to use up to 75% of the available memory. This can significantly
> > reduce the memory pressure for kdump.
> >
> > Consider a practical example:
> >
> > To unpack a 48MB initramfs, the tmpfs needs 48MB of usable space. With
> > the default 50% limit, this requires a memory pool of 96MB to be
> > available for the tmpfs mount. The total memory requirement is therefore
> > approximately: 16MB (vmlinuz) + 48MB (loaded initramfs) + 48MB (unpacked
> > kernel) + 96MB (for tmpfs) + 12MB (runtime overhead) ≈ 220MB.
> >
> > By using rootfsflags=size=75%, the memory pool required for the 48MB
> > tmpfs is reduced to 48MB / 0.75 = 64MB. This reduces the total memory
> > requirement by 32MB (96MB - 64MB), allowing the kdump to succeed with a
> > smaller crashkernel size, such as 192MB.
> >
> > An alternative approach of reusing the existing rootflags parameter was
> > considered. However, a new, dedicated rootfsflags parameter was chosen
> > to avoid altering the current behavior of rootflags (which applies to
> > the final root filesystem) and to prevent any potential regressions.
> >
> > This approach is inspired by prior discussions and patches on the topic.
> > Ref: https://www.lightofdawn.org/blog/?viewDetailed=00128
> > Ref: https://landley.net/notes-2015.html#01-01-2015
> > Ref: https://lkml.org/lkml/2021/6/29/783
> > Ref: https://www.kernel.org/doc/html/latest/filesystems/ramfs-rootfs-initramfs.html#what-is-rootfs
> >
> > Signed-off-by: Lichen Liu <lichliu@redhat.com>
> > Tested-by: Rob Landley <rob@landley.net>
> > ---
> > Hi VFS maintainers,
> >
> > Resending this patch as it did not get picked up.
> > This patch is intended for the VFS tree.
> >
> >  fs/namespace.c | 11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index 8f1000f9f3df..e484c26d5e3f 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -65,6 +65,15 @@ static int __init set_mphash_entries(char *str)
> >  }
> >  __setup("mphash_entries=", set_mphash_entries);
> >
> > +static char * __initdata rootfs_flags;
> > +static int __init rootfs_flags_setup(char *str)
> > +{
> > +     rootfs_flags = str;
> > +     return 1;
> > +}
> > +
> > +__setup("rootfsflags=", rootfs_flags_setup);
>
> Please document this option (alphabetically) in
> Documentation/admin-guide/kernel-parameters.txt.
>
> Thanks.
>
> > +
> >  static u64 event;
> >  static DEFINE_XARRAY_FLAGS(mnt_id_xa, XA_FLAGS_ALLOC);
> >  static DEFINE_IDA(mnt_group_ida);
> > @@ -5677,7 +5686,7 @@ static void __init init_mount_tree(void)
> >       struct mnt_namespace *ns;
> >       struct path root;
> >
> > -     mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", NULL);
> > +     mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", rootfs_flags);
> >       if (IS_ERR(mnt))
> >               panic("Can't create rootfs");
> >
>
> --
> ~Randy
>
>



* [PATCH v2] fs: Add 'rootfsflags' to set rootfs mount options
From: Lichen Liu @ 2025-08-15 12:14 UTC (permalink / raw)
  To: viro, brauner, jack
  Cc: linux-fsdevel, linux-kernel, safinaskar, kexec, rob, weilongchen,
	cyphar, linux-api, zohar, stefanb, initramfs, corbet, linux-doc,
	Lichen Liu

When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs.
By default, a tmpfs mount is limited to using 50% of the available RAM
for its content. This can be problematic in memory-constrained
environments, particularly during a kdump capture.

In a kdump scenario, the capture kernel boots with a limited amount of
memory specified by the 'crashkernel' parameter. If the initramfs is
large, it may fail to unpack into the tmpfs rootfs due to insufficient
space. This is because to get X MB of usable space in tmpfs, 2*X MB of
memory must be available for the mount. This leads to an OOM failure
during the early boot process, preventing a successful crash dump.

This patch introduces a new kernel command-line parameter, rootfsflags,
which allows passing specific mount options directly to the rootfs when
it is first mounted. This gives users control over the rootfs behavior.

For example, a user can now specify rootfsflags=size=75% to allow the
tmpfs to use up to 75% of the available memory. This can significantly
reduce the memory pressure for kdump.

Consider a practical example:

To unpack a 48MB initramfs, the tmpfs needs 48MB of usable space. With
the default 50% limit, this requires a memory pool of 96MB to be
available for the tmpfs mount. The total memory requirement is therefore
approximately: 16MB (vmlinuz) + 48MB (loaded initramfs) + 48MB (unpacked
kernel) + 96MB (for tmpfs) + 12MB (runtime overhead) ≈ 220MB.

By using rootfsflags=size=75%, the memory pool required for the 48MB
tmpfs is reduced to 48MB / 0.75 = 64MB. This reduces the total memory
requirement by 32MB (96MB - 64MB), allowing the kdump to succeed with a
smaller crashkernel size, such as 192MB.
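
The arithmetic above can be reproduced with a small helper. This is only
a sketch that follows the accounting used in this message (16MB vmlinuz,
12MB runtime overhead, the initramfs counted once loaded and once
unpacked); the function names are made up for illustration and real
memory use will differ from this estimate.

```c
/*
 * Memory pool (in MB) that tmpfs needs so that usable_mb fits under a
 * size=<percent>% limit, rounded up.
 */
static unsigned int tmpfs_pool_mb(unsigned int usable_mb, unsigned int percent)
{
	return (usable_mb * 100 + percent - 1) / percent;
}

/* Total estimate in MB, using the same terms as the example above. */
static unsigned int kdump_total_mb(unsigned int initramfs_mb, unsigned int percent)
{
	const unsigned int vmlinuz_mb = 16;
	const unsigned int overhead_mb = 12;

	return vmlinuz_mb + initramfs_mb	/* loaded initramfs */
	       + initramfs_mb			/* unpacked copy */
	       + tmpfs_pool_mb(initramfs_mb, percent)
	       + overhead_mb;
}
```

For a 48MB initramfs this gives a 96MB pool and 220MB total at 50%, and
a 64MB pool and 188MB total at 75%, which is why a 192MB crashkernel
becomes sufficient with rootfsflags=size=75% on the command line.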

An alternative approach of reusing the existing rootflags parameter was
considered. However, a new, dedicated rootfsflags parameter was chosen
to avoid altering the current behavior of rootflags (which applies to
the final root filesystem) and to prevent any potential regressions.

Also add documentation for the new kernel parameter "rootfsflags".

This approach is inspired by prior discussions and patches on the topic.
Ref: https://www.lightofdawn.org/blog/?viewDetailed=00128
Ref: https://landley.net/notes-2015.html#01-01-2015
Ref: https://lkml.org/lkml/2021/6/29/783
Ref: https://www.kernel.org/doc/html/latest/filesystems/ramfs-rootfs-initramfs.html#what-is-rootfs

Signed-off-by: Lichen Liu <lichliu@redhat.com>
Tested-by: Rob Landley <rob@landley.net>
---
Changes in v2:
  - Add documentation for the new kernel parameter.

 Documentation/admin-guide/kernel-parameters.txt |  3 +++
 fs/namespace.c                                  | 11 ++++++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb8752b42ec8..0c00f651d431 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6220,6 +6220,9 @@
 
 	rootflags=	[KNL] Set root filesystem mount option string
 
+	rootfsflags=	[KNL] Set initial root filesystem mount option string
+			(e.g. tmpfs for initramfs)
+
 	rootfstype=	[KNL] Set root filesystem type
 
 	rootwait	[KNL] Wait (indefinitely) for root device to show up.
diff --git a/fs/namespace.c b/fs/namespace.c
index 8f1000f9f3df..e484c26d5e3f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -65,6 +65,15 @@ static int __init set_mphash_entries(char *str)
 }
 __setup("mphash_entries=", set_mphash_entries);
 
+static char * __initdata rootfs_flags;
+static int __init rootfs_flags_setup(char *str)
+{
+	rootfs_flags = str;
+	return 1;
+}
+
+__setup("rootfsflags=", rootfs_flags_setup);
+
 static u64 event;
 static DEFINE_XARRAY_FLAGS(mnt_id_xa, XA_FLAGS_ALLOC);
 static DEFINE_IDA(mnt_group_ida);
@@ -5677,7 +5686,7 @@ static void __init init_mount_tree(void)
 	struct mnt_namespace *ns;
 	struct path root;
 
-	mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", NULL);
+	mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", rootfs_flags);
 	if (IS_ERR(mnt))
 		panic("Can't create rootfs");
 
-- 
2.47.0



* Re: [PATCH util-linux v2] fallocate: add FALLOC_FL_WRITE_ZEROES support
From: Darrick J. Wong @ 2025-08-15 14:29 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-block, dm-devel, linux-nvme, linux-scsi,
	linux-kernel, linux-api, hch, tytso, bmarzins, chaitanyak,
	shinichiro.kawasaki, brauner, martin.petersen, yi.zhang,
	chengzhihao1, yukuai3, yangerkun
In-Reply-To: <a0eda581-ae6c-4b49-8b4f-7bb039b17487@huaweicloud.com>

On Fri, Aug 15, 2025 at 05:29:19PM +0800, Zhang Yi wrote:
> Thank you for your review comments!
> 
> On 2025/8/15 0:52, Darrick J. Wong wrote:
> > On Wed, Aug 13, 2025 at 10:40:15AM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> The Linux kernel (since version 6.17) supports FALLOC_FL_WRITE_ZEROES in
> >> fallocate(2). Add support for FALLOC_FL_WRITE_ZEROES to the fallocate
> >> utility by introducing a new option -w|--write-zeroes.
> >>
> >> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=278c7d9b5e0c
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> ---
> >> v1->v2:
> >>  - Minor description modification to align with the kernel.
> >>
> >>  sys-utils/fallocate.1.adoc | 11 +++++++++--
> >>  sys-utils/fallocate.c      | 20 ++++++++++++++++----
> >>  2 files changed, 25 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/sys-utils/fallocate.1.adoc b/sys-utils/fallocate.1.adoc
> >> index 44ee0ef4c..0ec9ff9a9 100644
> >> --- a/sys-utils/fallocate.1.adoc
> >> +++ b/sys-utils/fallocate.1.adoc
> >> @@ -12,7 +12,7 @@ fallocate - preallocate or deallocate space to a file
> > 
> > <snip all the long lines>
> > 
> >> +*-w*, *--write-zeroes*::
> >> +Zeroes space in the byte range starting at _offset_ and continuing
> >> for _length_ bytes. Within the specified range, blocks are
> >> preallocated for the regions that span the holes in the file. After a
> >> successful call, subsequent reads from this range will return zeroes,
> >> subsequent writes to that range do not require further changes to the
> >> file mapping metadata.
> > 
> > "...will return zeroes and subsequent writes to that range..." ?
> > 
> 
> Yeah.
> 
> >> ++
> >> +Zeroing is done within the filesystem by preferably submitting write
> > 
> > I think we should say less about what the filesystem actually does to
> > preserve some flexibility:
> > 
> > "Zeroing is done within the filesystem. The filesystem may use a
> > hardware accelerated zeroing command, or it may submit regular writes.
> > The behavior depends on the filesystem design and available hardware."
> > 
> 
> Sure.
> 
> >> zeores commands, the alternative way is submitting actual zeroed data,
> >> the specified range will be converted into written extents. The write
> >> zeroes command is typically faster than write actual data if the
> >> device supports unmap write zeroes, the specified range will not be
> >> physically zeroed out on the device.
> >> ++
> >> +Options *--keep-size* can not be specified for the write-zeroes
> >> operation.
> >> +
> >>  include::man-common/help-version.adoc[]
> >>  
> >>  == AUTHORS
> [..]
> >> @@ -429,6 +438,9 @@ int main(int argc, char **argv)
> >>  			else if (mode & FALLOC_FL_ZERO_RANGE)
> >>  				fprintf(stdout, _("%s: %s (%ju bytes) zeroed.\n"),
> >>  								filename, str, length);
> >> +			else if (mode & FALLOC_FL_WRITE_ZEROES)
> >> +				fprintf(stdout, _("%s: %s (%ju bytes) write zeroed.\n"),
> > 
> > "write zeroed" is a little strange, but I don't have a better
> > suggestion. :)
> > 
> 
> Hmm... What about simply using "zeroed", the same to FALLOC_FL_ZERO_RANGE?
> Users should be aware of the parameters they have passed to fallocate(),
> so they should not use this print for further differentiation.

No thanks, different inputs should produce different outputs. :)

--D

> Thanks,
> Yi.
> 


* Re: [PATCH xfsprogs v2] xfs_io: add FALLOC_FL_WRITE_ZEROES support
From: Darrick J. Wong @ 2025-08-15 14:42 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-block, dm-devel, linux-nvme, linux-scsi,
	linux-xfs, linux-kernel, linux-api, hch, tytso, bmarzins,
	chaitanyak, shinichiro.kawasaki, brauner, martin.petersen,
	yi.zhang, chengzhihao1, yukuai3, yangerkun
In-Reply-To: <1428e3fe-ae7a-410d-97b5-7dd0249c41c0@huaweicloud.com>

On Fri, Aug 15, 2025 at 05:59:01PM +0800, Zhang Yi wrote:
> On 2025/8/15 0:54, Darrick J. Wong wrote:
> > On Wed, Aug 13, 2025 at 10:42:50AM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> The Linux kernel (since version 6.17) supports FALLOC_FL_WRITE_ZEROES in
> >> fallocate(2). Add support for FALLOC_FL_WRITE_ZEROES to the xfs_io
> >> tool by introducing a new 'fwzero' command.
> >>
> >> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=278c7d9b5e0c
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> ---
> >> v1->v2:
> >>  - Minor description modification to align with the kernel.
> >>
> >>  io/prealloc.c     | 36 ++++++++++++++++++++++++++++++++++++
> >>  man/man8/xfs_io.8 |  6 ++++++
> >>  2 files changed, 42 insertions(+)
> >>
> >> diff --git a/io/prealloc.c b/io/prealloc.c
> >> index 8e968c9f..9a64bf53 100644
> >> --- a/io/prealloc.c
> >> +++ b/io/prealloc.c
> >> @@ -30,6 +30,10 @@
> >>  #define FALLOC_FL_UNSHARE_RANGE 0x40
> >>  #endif
> >>  
> >> +#ifndef FALLOC_FL_WRITE_ZEROES
> >> +#define FALLOC_FL_WRITE_ZEROES 0x80
> >> +#endif
> >> +
> >>  static cmdinfo_t allocsp_cmd;
> >>  static cmdinfo_t freesp_cmd;
> >>  static cmdinfo_t resvsp_cmd;
> >> @@ -41,6 +45,7 @@ static cmdinfo_t fcollapse_cmd;
> >>  static cmdinfo_t finsert_cmd;
> >>  static cmdinfo_t fzero_cmd;
> >>  static cmdinfo_t funshare_cmd;
> >> +static cmdinfo_t fwzero_cmd;
> >>  
> >>  static int
> >>  offset_length(
> >> @@ -377,6 +382,27 @@ funshare_f(
> >>  	return 0;
> >>  }
> >>  
> >> +static int
> >> +fwzero_f(
> >> +	int		argc,
> >> +	char		**argv)
> >> +{
> >> +	xfs_flock64_t	segment;
> >> +	int		mode = FALLOC_FL_WRITE_ZEROES;
> > 
> > Shouldn't this take a -k to add FALLOC_FL_KEEP_SIZE like fzero?
> > 
> 
> Since allocating blocks with written extents beyond the inode size
> is not permitted, the FALLOC_FL_WRITE_ZEROES flag cannot be used
> together with the FALLOC_FL_KEEP_SIZE.

Heh, apparently I didn't read the manpage well enough.
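
For readers following the thread, here is a minimal sketch of the call
that fwzero_f() wraps, assuming a kernel and filesystem that support
FALLOC_FL_WRITE_ZEROES. The helper name try_write_zeroes() is
hypothetical and not part of xfs_io; the 0x80 fallback mirrors the
define in the patch. As Yi explains above, combining the flag with
FALLOC_FL_KEEP_SIZE is expected to be rejected:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef FALLOC_FL_KEEP_SIZE
#define FALLOC_FL_KEEP_SIZE 0x01
#endif
#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES 0x80	/* same fallback as in the patch */
#endif

/*
 * Hypothetical helper (not from xfs_io): request written, zeroed extents
 * for [off, off + len). Returns 0 on success, otherwise the errno value
 * reported by fallocate().
 */
int try_write_zeroes(int fd, off_t off, off_t len, int keep_size)
{
	int mode = FALLOC_FL_WRITE_ZEROES;

	/* Per the discussion above, this combination should be refused. */
	if (keep_size)
		mode |= FALLOC_FL_KEEP_SIZE;

	if (fallocate(fd, mode, off, len) != 0)
		return errno;
	return 0;
}
```

A caller would typically treat any non-zero return as a hint to fall
back to writing zeroed buffers, since older kernels and filesystems
without write-zeroes support refuse the flag even without
FALLOC_FL_KEEP_SIZE.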

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D


> Thanks,
> Yi.
> 
> > (The code otherwise looks fine to me)
> > 
> > --D
> > 
> >> +
> >> +	if (!offset_length(argv[1], argv[2], &segment)) {
> >> +		exitcode = 1;
> >> +		return 0;
> >> +	}
> >> +
> >> +	if (fallocate(file->fd, mode, segment.l_start, segment.l_len)) {
> >> +		perror("fallocate");
> >> +		exitcode = 1;
> >> +		return 0;
> >> +	}
> >> +	return 0;
> >> +}
> >> +
> >>  void
> >>  prealloc_init(void)
> >>  {
> >> @@ -489,4 +515,14 @@ prealloc_init(void)
> >>  	funshare_cmd.oneline =
> >>  	_("unshares shared blocks within the range");
> >>  	add_command(&funshare_cmd);
> >> +
> >> +	fwzero_cmd.name = "fwzero";
> >> +	fwzero_cmd.cfunc = fwzero_f;
> >> +	fwzero_cmd.argmin = 2;
> >> +	fwzero_cmd.argmax = 2;
> >> +	fwzero_cmd.flags = CMD_NOMAP_OK | CMD_FOREIGN_OK;
> >> +	fwzero_cmd.args = _("off len");
> >> +	fwzero_cmd.oneline =
> >> +	_("zeroes space and eliminates holes by allocating and submitting write zeroes");
> >> +	add_command(&fwzero_cmd);
> >>  }
> >> diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
> >> index b0dcfdb7..0a673322 100644
> >> --- a/man/man8/xfs_io.8
> >> +++ b/man/man8/xfs_io.8
> >> @@ -550,6 +550,12 @@ With the
> >>  .B -k
> >>  option, use the FALLOC_FL_KEEP_SIZE flag as well.
> >>  .TP
> >> +.BI fwzero " offset length"
> >> +Call fallocate with FALLOC_FL_WRITE_ZEROES flag as described in the
> >> +.BR fallocate (2)
> >> +manual page to allocate and zero blocks within the range by submitting write
> >> +zeroes.
> >> +.TP
> >>  .BI zero " offset length"
> >>  Call xfsctl with
> >>  .B XFS_IOC_ZERO_RANGE
> >> -- 
> >> 2.39.2
> >>
> >>
> 
> 

^ permalink raw reply

* Re: [PATCH v5 2/3] man/man2/mremap.2: describe multiple mapping move
From: Alejandro Colomar @ 2025-08-15 15:19 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <4e0c992a6374e417367475e3b3bbbc9d43380f4c.1754924278.git.lorenzo.stoakes@oracle.com>

Hi Lorenzo,

On Mon, Aug 11, 2025 at 03:59:38PM +0100, Lorenzo Stoakes wrote:
> Document the new behaviour introduced in Linux 6.17 whereby it is now
> possible to move multiple mappings in a single operation, as long as the
> operation is purely a move, that is old_size is equal to new_size and
> MREMAP_FIXED is specified.
> 
> This change also explains the limitations of this method and the
> possibility of partial failure.
> 
> Finally, we pluralise language where it makes sense to, so that the
> documentation contradicts neither this new capability nor the pre-existing
> edge case.
> 
> Example code is enclosed below demonstrating the behaviour which is now
> possible:
> 
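> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <unistd.h>
> 
> int main(void)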
> {
> 	unsigned long page_size = sysconf(_SC_PAGESIZE);
> 	void *ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> 			 MAP_ANON | MAP_PRIVATE, -1, 0);
> 	void *tgt_ptr = mmap(NULL, 10 * page_size, PROT_NONE,
> 			     MAP_ANON | MAP_PRIVATE, -1, 0);
> 	int i;
> 
> 	if (ptr == MAP_FAILED || tgt_ptr == MAP_FAILED) {
> 		perror("mmap");
> 		return EXIT_FAILURE;
> 	}
> 
> 	/* Unmap every other page. */
> 	for (i = 1; i < 10; i += 2)
> 		munmap(ptr + i * page_size, page_size);
> 
> 	/* Now move all 5 distinct mappings to tgt_ptr. */
> 	ptr = mremap(ptr, 10 * page_size, 10 * page_size,
> 		     MREMAP_MAYMOVE | MREMAP_FIXED, tgt_ptr);
> 	if (ptr == MAP_FAILED) {
> 		perror("mremap");
> 		return EXIT_FAILURE;
> 	}
> 
> 	return EXIT_SUCCESS;
> }

I've applied some editorial changes to the program.

> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thanks!  I've applied the patch, with a small amendment (see below).

<https://www.alejandro-colomar.es/src/alx/linux/man-pages/man-pages.git/commit/?h=contrib&id=d99a3495372a69b48309f2a1a4e2067af2bfbe69>


Have a lovely day!
Alex

> ---
>  man/man2/mremap.2 | 68 +++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 57 insertions(+), 11 deletions(-)
> 
> diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
> index 4e3c8e54e..6d14bf627 100644
> --- a/man/man2/mremap.2
> +++ b/man/man2/mremap.2
> @@ -35,22 +35,36 @@ and using the
>  .B MREMAP_FIXED
>  flag
>  (see below).
> +Since Linux 6.17,
> +while
> +.I old_address
> +must be mapped,
> +.I old_size
> +may span multiple mappings
> +including unmapped areas between
> +them when performing a move like this.

And I've reworded this last line.  Repetitive consistent language helps
understanding documentation, so it's better to repeat "a simple move"
here again.

	diff --git i/man/man2/mremap.2 w/man/man2/mremap.2
	index 6d14bf627..65b4d5f58 100644
	--- i/man/man2/mremap.2
	+++ w/man/man2/mremap.2
	@@ -42,7 +42,7 @@ .SH DESCRIPTION
	 .I old_size
	 may span multiple mappings
	 including unmapped areas between
	-them when performing a move like this.
	+them when performing a simple move.
	 The
	 .B MREMAP_DONTUNMAP
	 flag may also be specified.

>  The
>  .B MREMAP_DONTUNMAP
>  flag may also be specified.
>  .P
> +If the operation is not
> +simply moving mappings,
> +then
> +.I old_size
> +must span only a single mapping.
> +.P
>  .I old_address
> -is the old address of the virtual memory block that you
> -want to expand (or shrink).
> +is the old address of the first virtual memory block that you
> +want to expand, shrink, and/or move.
>  Note that
>  .I old_address
>  has to be page aligned.
>  .I old_size
> -is the old size of the
> -virtual memory block.
> +is the size of the range containing
> +virtual memory blocks to be manipulated.
>  .I new_size
>  is the requested size of the
> -virtual memory block after the resize.
> +virtual memory blocks after the resize.
>  An optional fifth argument,
>  .IR new_address ,
>  may be provided; see the description of
> @@ -119,13 +133,42 @@ If
>  is specified, then
>  .B MREMAP_MAYMOVE
>  must also be specified.
> +.IP
> +Since Linux 6.17,
> +if
> +.I old_size
> +is equal to
> +.I new_size
> +and
> +.B MREMAP_FIXED
> +is specified, then
> +.I old_size
> +may span beyond the mapping in which
> +.I old_address
> +resides.
> +In this case,
> +gaps between mappings in the original range
> +are maintained in the new range.
> +The whole operation is performed atomically
> +unless an error arises,
> +in which case the operation may be partially
> +completed,
> +that is,
> +some mappings may be moved and others not.
> +.IP
> +Moving multiple mappings is not permitted if
> +any of those mappings have either
> +been registered with
> +.BR userfaultfd (2) ,
> +or map drivers that
> +specify their own custom address mapping logic.
>  .TP
>  .BR MREMAP_DONTUNMAP " (since Linux 5.7)"
>  .\" commit e346b3813067d4b17383f975f197a9aa28a3b077
>  This flag, which must be used in conjunction with
>  .BR MREMAP_MAYMOVE ,
> -remaps a mapping to a new address but does not unmap the mapping at
> -.IR old_address .
> +remaps mappings to a new address but does not unmap them
> +from their original address.
>  .IP
>  The
>  .B MREMAP_DONTUNMAP
> @@ -163,13 +206,13 @@ mapped.
>  See NOTES for some possible applications of
>  .BR MREMAP_DONTUNMAP .
>  .P
> -If the memory segment specified by
> +If the memory segments specified by
>  .I old_address
>  and
>  .I old_size
> -is locked (using
> +are locked (using
>  .BR mlock (2)
> -or similar), then this lock is maintained when the segment is
> +or similar), then this lock is maintained when the segments are
>  resized and/or relocated.
>  As a consequence, the amount of memory locked by the process may change.
>  .SH RETURN VALUE
> @@ -202,7 +245,10 @@ virtual memory address for this process.
>  You can also get
>  .B EFAULT
>  even if there exist mappings that cover the
> -whole address space requested, but those mappings are of different types.
> +whole address space requested, but those mappings are of different types,
> +and the
> +.BR mremap ()
> +operation being performed does not support this.
>  .TP
>  .B EINVAL
>  An invalid argument was given.
> -- 
> 2.50.1
> 

-- 
<https://www.alejandro-colomar.es/>

^ permalink raw reply

* Re: [PATCH v5 3/3] man/man2/mremap.2: describe previously undocumented shrink behaviour
From: Alejandro Colomar @ 2025-08-15 21:36 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <ab2264d8c29d103d400c028f0417cada002ffc11.1754924278.git.lorenzo.stoakes@oracle.com>

Hi Lorenzo,

On Mon, Aug 11, 2025 at 03:59:39PM +0100, Lorenzo Stoakes wrote:
> There is pre-existing logic that appears to be undocumented for an mremap()
> shrink operation, where it turns out that the usual 'input range must span
> a single mapping' requirement no longer applies.
> 
> In fact, it turns out that the input range specified by [old_address,
> old_address + old_size) may span any number of mappings.
> 
> If shrinking in-place (that is, neither the MREMAP_FIXED nor
> MREMAP_DONTUNMAP flags are specified), then the new span may also span any
> number of VMAs - [old_address, old_address + new_size).
> 
> If shrinking and moving, the range specified by [old_address, old_address +
> new_size) must span a single VMA.
> 
> There must be at least one VMA contained within the [old_address,
> old_address + old_size) range, and old_address must be within the range of
> a VMA.
> 
> Explicitly document this.
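
The in-place shrink case described above can be sketched as follows.
This is only an illustration under the assumptions stated in the commit
message (no MREMAP_FIXED or MREMAP_DONTUNMAP, old_address inside a
mapping); the function name shrink_across_mappings() is hypothetical,
and the layout reuses the "unmap every other page" setup from the
example in patch 2/3:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical demo: shrink, in place, a range that spans several
 * mappings separated by holes. Returns 0 if the shrink succeeded and
 * the surviving pages are still usable, -1 otherwise.
 */
int shrink_across_mappings(void)
{
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
	char *ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
			 MAP_ANON | MAP_PRIVATE, -1, 0);
	void *res;
	int i;

	if (ptr == MAP_FAILED) {
		perror("mmap");
		return -1;
	}

	/* Unmap every other page, leaving 5 single-page mappings. */
	for (i = 1; i < 10; i += 2)
		munmap(ptr + i * page_size, page_size);

	/* In-place shrink: no flags, old_size > new_size. */
	res = mremap(ptr, 10 * page_size, 5 * page_size, 0);
	if (res == MAP_FAILED) {
		perror("mremap");
		munmap(ptr, 10 * page_size);
		return -1;
	}
	if (res != ptr) {
		fprintf(stderr, "mapping moved unexpectedly\n");
		return -1;
	}

	/* Pages 0, 2 and 4 lie below new_size and should still be mapped. */
	for (i = 0; i < 5; i += 2)
		ptr[i * page_size] = 1;

	munmap(ptr, 5 * page_size);
	return 0;
}
```

Note that the remaining range [ptr, ptr + 5 pages) still contains holes
at pages 1 and 3, which is the "new span may also span any number of
VMAs" case.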
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  man/man2/mremap.2 | 31 +++++++++++++++++++++++++++++--
>  1 file changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
> index 6d14bf627..53d4eda29 100644
> --- a/man/man2/mremap.2
> +++ b/man/man2/mremap.2
> @@ -47,8 +47,35 @@ The
>  .B MREMAP_DONTUNMAP
>  flag may also be specified.
>  .P
> -If the operation is not
> -simply moving mappings,
> +Equally, if the operation performs a shrink,
> +that is if

Missing comma.

> +.I old_size
> +is greater than
> +.IR new_size ,
> +then
> +.I old_size
> +may also span multiple mappings
> +which do not have to be
> +adjacent to one another.

I'm wondering if there's a missing comma or not before 'which'.
The meaning of the sentence would be different.

So, I should ask:

Does old_size > new_size mean that old_size may span multiple mappings
and you're commenting that multiple mappings need not be adjacent?

Or are multiple mappings always allowed and old_size > new_size allows
non-adjacent ones?

I suspect it's the former, right?  Then, it's missing a comma, right?


Other than this, the patch looks good.


Have a lovely night!
Alex

> +If this shrink is performed
> +in-place,
> +that is,
> +neither
> +.BR MREMAP_FIXED ,
> +nor
> +.B MREMAP_DONTUNMAP
> +are specified,
> +.I new_size
> +may also span multiple VMAs.
> +However, if the range is moved,
> +then
> +.I new_size
> +must span only a single mapping.
> +.P
> +If the operation is neither a
> +.B MREMAP_FIXED
> +move
> +nor a shrink,
>  then
>  .I old_size
>  must span only a single mapping.
> -- 
> 2.50.1
> 

-- 
<https://www.alejandro-colomar.es/>

^ permalink raw reply

* Re: [PATCH v18 0/8] fork: Support shadow stacks in clone3()
From: Edgecombe, Rick P @ 2025-08-15 22:39 UTC (permalink / raw)
  To: dietmar.eggemann@arm.com, broonie@kernel.org,
	Szabolcs.Nagy@arm.com, brauner@kernel.org,
	dave.hansen@linux.intel.com, debug@rivosinc.com, mgorman@suse.de,
	vincent.guittot@linaro.org, fweimer@redhat.com, mingo@redhat.com,
	rostedt@goodmis.org, hjl.tools@gmail.com, tglx@linutronix.de,
	vschneid@redhat.com, shuah@kernel.org, hpa@zytor.com,
	peterz@infradead.org, bp@alien8.de, bsegall@google.com,
	x86@kernel.org, juri.lelli@redhat.com
  Cc: yury.khrustalev@arm.com, linux-kselftest@vger.kernel.org,
	akpm@linux-foundation.org, jannh@google.com,
	linux-kernel@vger.kernel.org, catalin.marinas@arm.com,
	will@kernel.org, wilco.dijkstra@arm.com,
	skhan@linuxfoundation.org, kees@kernel.org,
	linux-api@vger.kernel.org
In-Reply-To: <20250702-clone3-shadow-stack-v18-0-7965d2b694db@kernel.org>

On Wed, 2025-07-02 at 11:39 +0100, Mark Brown wrote:
> Changes in v16:
> - Rebase onto v6.15-rc2.
> - Roll in fixes from x86 testing from Rick Edgecombe.
> - Rework so that the argument is shadow_stack_token.
> - Link to v15: https://lore.kernel.org/r/20250408-clone3-shadow-stack-v15-0-3fa245c6e3be@kernel.org

Sorry for the delay.

Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

^ permalink raw reply

* Re: [PATCH v3 00/12] man2: document "new" mount API
From: Askar Safin @ 2025-08-17  7:52 UTC (permalink / raw)
  To: cyphar
  Cc: alx, brauner, dhowells, g.branden.robinson, jack, linux-api,
	linux-fsdevel, linux-kernel, linux-man, mtk.manpages, safinaskar,
	viro, Ian Kent, autofs mailing list
In-Reply-To: <20250809-new-mount-api-v3-0-f61405c80f34@cyphar.com>

I noticed that you changed the docs for automounts,
so I dug into the automount implementation
and found a bug in openat2.
If RESOLVE_NO_XDEV is specified, then name resolution
doesn't cross automount points (i.e. we get EXDEV),
but the automount is still triggered!
I think this is a bug.
The bug reproduces on 6.17-rc1.
At the end of this mail you will find a reproducer
and a miniconfig.

If you send patches for this bug, please CC me.

Are automounts actually used? Is it possible to deprecate or
remove them? It seems to me that automounts are a rarely tested, obscure
feature that affects core namei code.

This reproducer is based on the "tracing" automount, which
actually *IS* already deprecated. But the automount mechanism
itself is not deprecated, as far as I know.

Also, I read the namei code, and I think that the
options AT_NO_AUTOMOUNT, FSPICK_NO_AUTOMOUNT, etc. affect
the last component only, not all of them. I haven't tested this yet.
I plan to test it within the next few days.

Also, I still haven't finished my experiments. Hopefully I will
finish them within 7 days. :)

Askar Safin

====

miniconfig:

CONFIG_64BIT=y

CONFIG_EXPERT=y

CONFIG_PRINTK=y
CONFIG_PRINTK_TIME=y

CONFIG_TTY=y
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y

CONFIG_PROC_FS=y
CONFIG_DEVTMPFS=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_DEBUG_FS=y
CONFIG_USER_EVENTS=y
CONFIG_FTRACE=y
CONFIG_MULTIUSER=y
CONFIG_NAMESPACES=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y


CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y

CONFIG_BLK_DEV_INITRD=y
CONFIG_RD_GZIP=y

CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_SCRIPT=y

CONFIG_TRACEFS_AUTOMOUNT_DEPRECATED=y

CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y

====

/*
Author: Askar Safin
Public domain

Make sure your kernel is compiled with CONFIG_TRACEFS_AUTOMOUNT_DEPRECATED=y

If that openat2 bug reproduces, then this program will
print "BUG REPRODUCED". If openat2 is fixed, then
the program will print "BUG NOT REPRODUCED".
Any other output means that something went wrong,
i.e. the results are indeterminate.

This program requires root in the initial user namespace
*/

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <linux/openat2.h>

int
main (void)
{
    if (unshare (CLONE_NEWNS) != 0)
        {
            perror ("unshare");
            return 1;
        }
    if (mount (NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0)
        {
            perror ("mount(NULL, /, NULL, MS_REC | MS_PRIVATE, NULL)");
            return 1;
        }
    if (mount (NULL, "/tmp", "tmpfs", 0, NULL) != 0)
        {
            perror ("mount tmpfs");
            return 1;
        }
    if (mkdir ("/tmp/debugfs", 0777) != 0)
        {
            perror ("mkdir(/tmp/debugfs)");
            return 1;
        }
    if (mount (NULL, "/tmp/debugfs", "debugfs", 0, NULL) != 0)
        {
            perror ("mount debugfs");
            return 1;
        }
    {
        struct statx tracing;
        if (statx (AT_FDCWD, "/tmp/debugfs/tracing", AT_NO_AUTOMOUNT, 0, &tracing) != 0)
            {
                perror ("statx tracing");
                return 1;
            }
        if (!(tracing.stx_attributes_mask & STATX_ATTR_MOUNT_ROOT))
            {
                fprintf (stderr, "???\n");
                return 1;
            }
        // Let's check that nothing is mounted at /tmp/debugfs/tracing yet
        if (tracing.stx_attributes & STATX_ATTR_MOUNT_ROOT)
            {
                fprintf (stderr, "Something already mounted at /tmp/debugfs/tracing\n");
                return 1;
            }
    }
    if (chdir ("/tmp/debugfs") != 0)
        {
            perror ("chdir");
            return 1;
        }
    {
        struct open_how how;
        memset (&how, 0, sizeof how);
        how.flags = O_DIRECTORY;
        how.mode = 0;
        how.resolve = RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS;
        if (syscall (SYS_openat2, AT_FDCWD, "tracing", &how, sizeof how) != -1)
            {
                fprintf (stderr, "openat2 crossed automount point\n");
                return 1;
            }
        if (errno != EXDEV)
            {
                fprintf (stderr, "wrong errno\n");
                return 1;
            }
    }
    {
        struct statx tracing;
        if (statx (AT_FDCWD, "/tmp/debugfs/tracing", AT_NO_AUTOMOUNT, 0, &tracing) != 0)
            {
                perror ("statx tracing (2)");
                return 1;
            }
        if (!(tracing.stx_attributes_mask & STATX_ATTR_MOUNT_ROOT))
            {
                fprintf (stderr, "???\n");
                return 1;
            }
        if (tracing.stx_attributes & STATX_ATTR_MOUNT_ROOT)
            {
                fprintf (stderr, "BUG REPRODUCED. Something mounted at /tmp/debugfs/tracing\n");
                return 0;
            }
        else
            {
                fprintf (stderr, "BUG NOT REPRODUCED\n");
                return 0;
            }
    }
}

^ permalink raw reply

* Re: [PATCH] LoongArch: Increase COMMAND_LINE_SIZE to 4096
From: Xose Vazquez Perez @ 2025-08-17  7:57 UTC (permalink / raw)
  To: Ming Wang; +Cc: LINUX_ARCH-ML, API-ML, KERNEL-ML, X86-ML

Ming Wang wrote:

> The default COMMAND_LINE_SIZE of 512, inherited from asm-generic, is
> too small for modern use cases. For example, kdump configurations or
> extensive debugging parameters can easily exceed this limit.
> 
> Therefore, increase the command line size to 4096 bytes, aligning
> LoongArch with the MIPS architecture. This change follows a broader
> trend among architectures to raise this limit to support modern needs;
> for instance, PowerPC increased its value for similar reasons in
> commit a5980d064fe2 ("powerpc: Bump COMMAND_LINE_SIZE to 2048").
> 
> Similar to the change made for RISC-V in commit 61fc1ee8be26
> ("riscv: Bump COMMAND_LINE_SIZE value to 1024"), this is considered
> a safe change. The broader kernel community has reached a consensus
> that modifying COMMAND_LINE_SIZE from UAPI headers does not
> constitute a uABI breakage, as well-behaved userspace applications
> should not rely on this macro.
> 
> Suggested-by: Huang Cun <cunhuang@tencent.com>
> Signed-off-by: Ming Wang <wangming01@loongson.cn>
> ---
>  arch/loongarch/include/uapi/asm/setup.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
>  create mode 100644 arch/loongarch/include/uapi/asm/setup.h
> 
> diff --git a/arch/loongarch/include/uapi/asm/setup.h b/arch/loongarch/include/uapi/asm/setup.h
> new file mode 100644
> index 000000000000..d46363ce3e02
> --- /dev/null
> +++ b/arch/loongarch/include/uapi/asm/setup.h
> @@ -0,0 +1,8 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +
> +#ifndef _UAPI_ASM_LOONGARCH_SETUP_H
> +#define _UAPI_ASM_LOONGARCH_SETUP_H
> +
> +#define COMMAND_LINE_SIZE	4096
> +
> +#endif /* _UAPI_ASM_LOONGARCH_SETUP_H */
> -- 
> 2.43.0

The sizes are a bit chaotic and arbitrary:

$ git grep "define.*COMMAND_LINE_SIZE"
arch/alpha/include/uapi/asm/setup.h:#define COMMAND_LINE_SIZE	256
arch/arc/include/asm/setup.h:#define COMMAND_LINE_SIZE 256
arch/arm64/include/uapi/asm/setup.h:#define COMMAND_LINE_SIZE	2048
arch/arm/include/uapi/asm/setup.h:#define COMMAND_LINE_SIZE 1024
arch/m68k/include/asm/setup.h:#define CL_SIZE COMMAND_LINE_SIZE
arch/m68k/include/uapi/asm/setup.h:#define COMMAND_LINE_SIZE 256
arch/microblaze/include/uapi/asm/setup.h:#define COMMAND_LINE_SIZE	256
arch/mips/include/uapi/asm/setup.h:#define COMMAND_LINE_SIZE	4096
arch/mips/loongson64/reset.c:#define KEXEC_ARGV_SIZE	COMMAND_LINE_SIZE
arch/parisc/include/uapi/asm/setup.h:#define COMMAND_LINE_SIZE	1024
arch/powerpc/boot/ops.h:#define	BOOT_COMMAND_LINE_SIZE	2048
arch/powerpc/include/uapi/asm/setup.h:#define COMMAND_LINE_SIZE	2048
arch/riscv/include/uapi/asm/setup.h:#define COMMAND_LINE_SIZE	1024
arch/s390/include/asm/setup.h:#define COMMAND_LINE_SIZE CONFIG_COMMAND_LINE_SIZE
arch/s390/include/asm/setup.h:#define LEGACY_COMMAND_LINE_SIZE	896
arch/sparc/include/uapi/asm/setup.h:# define COMMAND_LINE_SIZE 2048
arch/sparc/include/uapi/asm/setup.h:# define COMMAND_LINE_SIZE 256
arch/um/include/asm/setup.h:#define COMMAND_LINE_SIZE 4096
arch/x86/include/asm/setup.h:#define COMMAND_LINE_SIZE 2048
arch/xtensa/include/uapi/asm/setup.h:#define COMMAND_LINE_SIZE	256
include/uapi/asm-generic/setup.h:#define COMMAND_LINE_SIZE	512
kernel/trace/ftrace.c:#define FTRACE_FILTER_SIZE		COMMAND_LINE_SIZE
tools/power/x86/turbostat/turbostat.c:#define COMMAND_LINE_SIZE 2048
tools/testing/selftests/kho/init.c:#define COMMAND_LINE_SIZE	2048

Maybe they should be standardized?

And for s390 it is configurable, see 622021cd6c560
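
On the uABI point in the quoted commit message: a userspace program
that wants the kernel command line does not need COMMAND_LINE_SIZE at
all. A minimal sketch, assuming the text is read from /proc/cmdline
with a buffer that grows as needed (the helper name read_all() is
hypothetical, not an existing API):

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: read a whole file (for example /proc/cmdline)
 * into a NUL-terminated heap buffer without assuming any maximum size.
 * Returns NULL on failure; the caller frees the result.
 */
char *read_all(const char *path, size_t *len_out)
{
	FILE *f = fopen(path, "r");
	size_t cap = 256, len = 0, n;
	char *buf, *tmp;

	if (!f)
		return NULL;

	buf = malloc(cap);
	if (!buf) {
		fclose(f);
		return NULL;
	}

	/* Always keep one spare byte for the terminating NUL. */
	while ((n = fread(buf + len, 1, cap - len - 1, f)) > 0) {
		len += n;
		if (len == cap - 1) {
			/* Buffer is full: double it and keep reading. */
			tmp = realloc(buf, cap * 2);
			if (!tmp) {
				free(buf);
				fclose(f);
				return NULL;
			}
			buf = tmp;
			cap *= 2;
		}
	}

	if (ferror(f)) {
		free(buf);
		fclose(f);
		return NULL;
	}

	fclose(f);
	buf[len] = '\0';
	if (len_out)
		*len_out = len;
	return buf;
}
```

Calling read_all("/proc/cmdline", &len) then works unchanged whether the
architecture limit is 256, 4096 or configurable.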

^ permalink raw reply

