Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Jason Gunthorpe @ 2025-08-13 12:41 UTC (permalink / raw)
  To: Greg KH
  Cc: Pratyush Yadav, Vipin Sharma, Pasha Tatashin, jasonmiu, graf,
	changyuanl, rppt, dmatlack, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
	brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, parav,
	leonro, witu
In-Reply-To: <2025081351-tinsel-sprinkler-af77@gregkh>

On Wed, Aug 13, 2025 at 02:14:23PM +0200, Greg KH wrote:
> On Wed, Aug 13, 2025 at 02:02:07PM +0200, Pratyush Yadav wrote:
> > On Wed, Aug 13 2025, Greg KH wrote:
> > 
> > > On Tue, Aug 12, 2025 at 11:34:37PM -0700, Vipin Sharma wrote:
> > >> On 2025-08-07 01:44:35, Pasha Tatashin wrote:
> > >> > From: Pratyush Yadav <ptyadav@amazon.de>
> > >> > +static void memfd_luo_unpreserve_folios(const struct memfd_luo_preserved_folio *pfolios,
> > >> > +					unsigned int nr_folios)
> > >> > +{
> > >> > +	unsigned int i;
> > >> > +
> > >> > +	for (i = 0; i < nr_folios; i++) {
> > >> > +		const struct memfd_luo_preserved_folio *pfolio = &pfolios[i];
> > >> > +		struct folio *folio;
> > >> > +
> > >> > +		if (!pfolio->foliodesc)
> > >> > +			continue;
> > >> > +
> > >> > +		folio = pfn_folio(PRESERVED_FOLIO_PFN(pfolio->foliodesc));
> > >> > +
> > >> > +		kho_unpreserve_folio(folio);
> > >> 
> > >> This one is missing WARN_ON_ONCE() similar to the one in
> > >> memfd_luo_preserve_folios().
> > >
> > > So you really want to cause a machine to reboot and get a CVE issued for
> > > this, if it could be triggered?  That's bold :)
> > >
> > > Please don't.  If that can happen, handle the issue and move on, don't
> > > crash boxes.
> > 
> > Why would a WARN() crash the machine? That is what BUG() does, not
> > WARN().
> 
> See 'panic_on_warn' which is enabled in a few billion Linux systems
> these days :(

This has been discussed so many times already:

https://lwn.net/Articles/969923/

When someone tried to formalize this "don't use WARN_ON" position 
in the coding-style.rst it was NAK'd:

https://lwn.net/ml/linux-kernel/10af93f8-83f2-48ce-9bc3-80fe4c60082c@redhat.com/

Based on Linus's opposition to the idea:

https://lore.kernel.org/all/CAHk-=wgF7K2gSSpy=m_=K3Nov4zaceUX9puQf1TjkTJLA2XC_g@mail.gmail.com/

Use the warn ons. Make sure they can't be triggered by userspace. Use
them to detect corruption/malfunction in the kernel.

In this case if kho_unpreserve_folio() fails in this call chain it
means some error unwind is wrongly happening out of sequence, and we
are now forced to leak memory. Unwind is not something that userspace
should be controlling, so of course we want a WARN_ON here.

Jason

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pratyush Yadav @ 2025-08-13 12:29 UTC (permalink / raw)
  To: Vipin Sharma
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <20250813063407.GA3182745.vipinsh@google.com>

Hi Vipin,

Thanks for the review.

On Tue, Aug 12 2025, Vipin Sharma wrote:

> On 2025-08-07 01:44:35, Pasha Tatashin wrote:
>> From: Pratyush Yadav <ptyadav@amazon.de>
>> +static void memfd_luo_unpreserve_folios(const struct memfd_luo_preserved_folio *pfolios,
>> +					unsigned int nr_folios)
>> +{
>> +	unsigned int i;
>> +
>> +	for (i = 0; i < nr_folios; i++) {
>> +		const struct memfd_luo_preserved_folio *pfolio = &pfolios[i];
>> +		struct folio *folio;
>> +
>> +		if (!pfolio->foliodesc)
>> +			continue;
>> +
>> +		folio = pfn_folio(PRESERVED_FOLIO_PFN(pfolio->foliodesc));
>> +
>> +		kho_unpreserve_folio(folio);
>
> This one is missing WARN_ON_ONCE() similar to the one in
> memfd_luo_preserve_folios().

Right, will add.

>
>> +		unpin_folio(folio);

Looking at this code caught my eye. This can also be called from LUO's
finish callback if no one claimed the memfd after live update. In that
case, unpin_folio() is going to underflow the pincount or refcount on
the folio since after the kexec, the folio is no longer pinned. We
should only be doing folio_put().

I think this function should take a argument to specify which of these
cases it is dealing with.

>> +	}
>> +}
>> +
>> +static void *memfd_luo_create_fdt(unsigned long size)
>> +{
>> +	unsigned int order = get_order(size);
>> +	struct folio *fdt_folio;
>> +	int err = 0;
>> +	void *fdt;
>> +
>> +	if (order > MAX_PAGE_ORDER)
>> +		return NULL;
>> +
>> +	fdt_folio = folio_alloc(GFP_KERNEL, order);
>
> __GFP_ZERO should also be used here. Otherwise this can lead to
> unintentional passing of old kernel memory.

fdt_create() zeroes out the buffer so this should not be a problem.

>
>> +static int memfd_luo_prepare(struct liveupdate_file_handler *handler,
>> +			     struct file *file, u64 *data)
>> +{
>> +	struct memfd_luo_preserved_folio *preserved_folios;
>> +	struct inode *inode = file_inode(file);
>> +	unsigned int max_folios, nr_folios = 0;
>> +	int err = 0, preserved_size;
>> +	struct folio **folios;
>> +	long size, nr_pinned;
>> +	pgoff_t offset;
>> +	void *fdt;
>> +	u64 pos;
>> +
>> +	if (WARN_ON_ONCE(!shmem_file(file)))
>> +		return -EINVAL;
>
> This one is only check for shmem_file, whereas in
> memfd_luo_can_preserve() there is check for inode->i_nlink also. Is that
> not needed here?

Actually, this should never happen since the LUO can_preserve() callback
should make sure of this. I think it would be perfectly fine to just
drop this check. I only added it because I was being extra careful.

>
>> +
>> +	inode_lock(inode);
>> +	shmem_i_mapping_freeze(inode, true);
>> +
>> +	size = i_size_read(inode);
>> +	if ((PAGE_ALIGN(size) / PAGE_SIZE) > UINT_MAX) {
>> +		err = -E2BIG;
>> +		goto err_unlock;
>> +	}
>> +
>> +	/*
>> +	 * Guess the number of folios based on inode size. Real number might end
>> +	 * up being smaller if there are higher order folios.
>> +	 */
>> +	max_folios = PAGE_ALIGN(size) / PAGE_SIZE;
>> +	folios = kvmalloc_array(max_folios, sizeof(*folios), GFP_KERNEL);
>
> __GFP_ZERO?

Why? This is only used in this function and gets freed on return. And
the function only looks at the elements that get initialized by
memfd_pin_folios().

>
>> +static int memfd_luo_freeze(struct liveupdate_file_handler *handler,
>> +			    struct file *file, u64 *data)
>> +{
>> +	u64 pos = file->f_pos;
>> +	void *fdt;
>> +	int err;
>> +
>> +	if (WARN_ON_ONCE(!*data))
>> +		return -EINVAL;
>> +
>> +	fdt = phys_to_virt(*data);
>> +
>> +	/*
>> +	 * The pos or size might have changed since prepare. Everything else
>> +	 * stays the same.
>> +	 */
>> +	err = fdt_setprop(fdt, 0, "pos", &pos, sizeof(pos));
>> +	if (err)
>> +		return err;
>
> Comment is talking about pos and size but code is only updating pos. 

Right. Comment is out of date. size can no longer change since prepare.
So will update the comment.

>
>> +static int memfd_luo_retrieve(struct liveupdate_file_handler *handler, u64 data,
>> +			      struct file **file_p)
>> +{
>> +	const struct memfd_luo_preserved_folio *pfolios;
>> +	int nr_pfolios, len, ret = 0, i = 0;
>> +	struct address_space *mapping;
>> +	struct folio *folio, *fdt_folio;
>> +	const u64 *pos, *size;
>> +	struct inode *inode;
>> +	struct file *file;
>> +	const void *fdt;
>> +
>> +	fdt_folio = memfd_luo_get_fdt(data);
>> +	if (!fdt_folio)
>> +		return -ENOENT;
>> +
>> +	fdt = page_to_virt(folio_page(fdt_folio, 0));
>> +
>> +	pfolios = fdt_getprop(fdt, 0, "folios", &len);
>> +	if (!pfolios || len % sizeof(*pfolios)) {
>> +		pr_err("invalid 'folios' property\n");
>
> Print should clearly state that error is because fields is not found or
> len is not multiple of sizeof(*pfolios).

Eh, there is already too much boilerplate one has to write (and read)
for parsing the FDT. Is there really a need for an extra 3-4 lines of
code for _each_ property that is parsed?

Long term, I think we shouldn't be doing this manually anyway. I think
the maintainable path forward is to define a schema for the serialized
data and have a parser that takes in the schema and gives out a parsed
struct, doing all sorts of checks in the process.

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Greg KH @ 2025-08-13 12:14 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Vipin Sharma, Pasha Tatashin, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, tglx, mingo, bp, dave.hansen, x86, hpa,
	rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <mafs01ppfxwe8.fsf@kernel.org>

On Wed, Aug 13, 2025 at 02:02:07PM +0200, Pratyush Yadav wrote:
> On Wed, Aug 13 2025, Greg KH wrote:
> 
> > On Tue, Aug 12, 2025 at 11:34:37PM -0700, Vipin Sharma wrote:
> >> On 2025-08-07 01:44:35, Pasha Tatashin wrote:
> >> > From: Pratyush Yadav <ptyadav@amazon.de>
> >> > +static void memfd_luo_unpreserve_folios(const struct memfd_luo_preserved_folio *pfolios,
> >> > +					unsigned int nr_folios)
> >> > +{
> >> > +	unsigned int i;
> >> > +
> >> > +	for (i = 0; i < nr_folios; i++) {
> >> > +		const struct memfd_luo_preserved_folio *pfolio = &pfolios[i];
> >> > +		struct folio *folio;
> >> > +
> >> > +		if (!pfolio->foliodesc)
> >> > +			continue;
> >> > +
> >> > +		folio = pfn_folio(PRESERVED_FOLIO_PFN(pfolio->foliodesc));
> >> > +
> >> > +		kho_unpreserve_folio(folio);
> >> 
> >> This one is missing WARN_ON_ONCE() similar to the one in
> >> memfd_luo_preserve_folios().
> >
> > So you really want to cause a machine to reboot and get a CVE issued for
> > this, if it could be triggered?  That's bold :)
> >
> > Please don't.  If that can happen, handle the issue and move on, don't
> > crash boxes.
> 
> Why would a WARN() crash the machine? That is what BUG() does, not
> WARN().

See 'panic_on_warn' which is enabled in a few billion Linux systems
these days :(

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pratyush Yadav @ 2025-08-13 12:02 UTC (permalink / raw)
  To: Greg KH
  Cc: Vipin Sharma, Pasha Tatashin, pratyush, jasonmiu, graf,
	changyuanl, rppt, dmatlack, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
	brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
	parav, leonro, witu
In-Reply-To: <2025081310-custodian-ashamed-3104@gregkh>

On Wed, Aug 13 2025, Greg KH wrote:

> On Tue, Aug 12, 2025 at 11:34:37PM -0700, Vipin Sharma wrote:
>> On 2025-08-07 01:44:35, Pasha Tatashin wrote:
>> > From: Pratyush Yadav <ptyadav@amazon.de>
>> > +static void memfd_luo_unpreserve_folios(const struct memfd_luo_preserved_folio *pfolios,
>> > +					unsigned int nr_folios)
>> > +{
>> > +	unsigned int i;
>> > +
>> > +	for (i = 0; i < nr_folios; i++) {
>> > +		const struct memfd_luo_preserved_folio *pfolio = &pfolios[i];
>> > +		struct folio *folio;
>> > +
>> > +		if (!pfolio->foliodesc)
>> > +			continue;
>> > +
>> > +		folio = pfn_folio(PRESERVED_FOLIO_PFN(pfolio->foliodesc));
>> > +
>> > +		kho_unpreserve_folio(folio);
>> 
>> This one is missing WARN_ON_ONCE() similar to the one in
>> memfd_luo_preserve_folios().
>
> So you really want to cause a machine to reboot and get a CVE issued for
> this, if it could be triggered?  That's bold :)
>
> Please don't.  If that can happen, handle the issue and move on, don't
> crash boxes.

Why would a WARN() crash the machine? That is what BUG() does, not
WARN().

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Greg KH @ 2025-08-13  7:09 UTC (permalink / raw)
  To: Vipin Sharma
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, tglx, mingo, bp, dave.hansen, x86, hpa,
	rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <20250813063407.GA3182745.vipinsh@google.com>

On Tue, Aug 12, 2025 at 11:34:37PM -0700, Vipin Sharma wrote:
> On 2025-08-07 01:44:35, Pasha Tatashin wrote:
> > From: Pratyush Yadav <ptyadav@amazon.de>
> > +static void memfd_luo_unpreserve_folios(const struct memfd_luo_preserved_folio *pfolios,
> > +					unsigned int nr_folios)
> > +{
> > +	unsigned int i;
> > +
> > +	for (i = 0; i < nr_folios; i++) {
> > +		const struct memfd_luo_preserved_folio *pfolio = &pfolios[i];
> > +		struct folio *folio;
> > +
> > +		if (!pfolio->foliodesc)
> > +			continue;
> > +
> > +		folio = pfn_folio(PRESERVED_FOLIO_PFN(pfolio->foliodesc));
> > +
> > +		kho_unpreserve_folio(folio);
> 
> This one is missing WARN_ON_ONCE() similar to the one in
> memfd_luo_preserve_folios().

So you really want to cause a machine to reboot and get a CVE issued for
this, if it could be triggered?  That's bold :)

Please don't.  If that can happen, handle the issue and move on, don't
crash boxes.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Vipin Sharma @ 2025-08-13  6:34 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-30-pasha.tatashin@soleen.com>

On 2025-08-07 01:44:35, Pasha Tatashin wrote:
> From: Pratyush Yadav <ptyadav@amazon.de>
> +static void memfd_luo_unpreserve_folios(const struct memfd_luo_preserved_folio *pfolios,
> +					unsigned int nr_folios)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < nr_folios; i++) {
> +		const struct memfd_luo_preserved_folio *pfolio = &pfolios[i];
> +		struct folio *folio;
> +
> +		if (!pfolio->foliodesc)
> +			continue;
> +
> +		folio = pfn_folio(PRESERVED_FOLIO_PFN(pfolio->foliodesc));
> +
> +		kho_unpreserve_folio(folio);

This one is missing WARN_ON_ONCE() similar to the one in
memfd_luo_preserve_folios().

> +		unpin_folio(folio);
> +	}
> +}
> +
> +static void *memfd_luo_create_fdt(unsigned long size)
> +{
> +	unsigned int order = get_order(size);
> +	struct folio *fdt_folio;
> +	int err = 0;
> +	void *fdt;
> +
> +	if (order > MAX_PAGE_ORDER)
> +		return NULL;
> +
> +	fdt_folio = folio_alloc(GFP_KERNEL, order);

__GFP_ZERO should also be used here. Otherwise this can lead to
unintentional passing of old kernel memory.

> +static int memfd_luo_prepare(struct liveupdate_file_handler *handler,
> +			     struct file *file, u64 *data)
> +{
> +	struct memfd_luo_preserved_folio *preserved_folios;
> +	struct inode *inode = file_inode(file);
> +	unsigned int max_folios, nr_folios = 0;
> +	int err = 0, preserved_size;
> +	struct folio **folios;
> +	long size, nr_pinned;
> +	pgoff_t offset;
> +	void *fdt;
> +	u64 pos;
> +
> +	if (WARN_ON_ONCE(!shmem_file(file)))
> +		return -EINVAL;

This one is only check for shmem_file, whereas in
memfd_luo_can_preserve() there is check for inode->i_nlink also. Is that
not needed here?

> +
> +	inode_lock(inode);
> +	shmem_i_mapping_freeze(inode, true);
> +
> +	size = i_size_read(inode);
> +	if ((PAGE_ALIGN(size) / PAGE_SIZE) > UINT_MAX) {
> +		err = -E2BIG;
> +		goto err_unlock;
> +	}
> +
> +	/*
> +	 * Guess the number of folios based on inode size. Real number might end
> +	 * up being smaller if there are higher order folios.
> +	 */
> +	max_folios = PAGE_ALIGN(size) / PAGE_SIZE;
> +	folios = kvmalloc_array(max_folios, sizeof(*folios), GFP_KERNEL);

__GFP_ZERO?

> +static int memfd_luo_freeze(struct liveupdate_file_handler *handler,
> +			    struct file *file, u64 *data)
> +{
> +	u64 pos = file->f_pos;
> +	void *fdt;
> +	int err;
> +
> +	if (WARN_ON_ONCE(!*data))
> +		return -EINVAL;
> +
> +	fdt = phys_to_virt(*data);
> +
> +	/*
> +	 * The pos or size might have changed since prepare. Everything else
> +	 * stays the same.
> +	 */
> +	err = fdt_setprop(fdt, 0, "pos", &pos, sizeof(pos));
> +	if (err)
> +		return err;

Comment is talking about pos and size but code is only updating pos. 

> +static int memfd_luo_retrieve(struct liveupdate_file_handler *handler, u64 data,
> +			      struct file **file_p)
> +{
> +	const struct memfd_luo_preserved_folio *pfolios;
> +	int nr_pfolios, len, ret = 0, i = 0;
> +	struct address_space *mapping;
> +	struct folio *folio, *fdt_folio;
> +	const u64 *pos, *size;
> +	struct inode *inode;
> +	struct file *file;
> +	const void *fdt;
> +
> +	fdt_folio = memfd_luo_get_fdt(data);
> +	if (!fdt_folio)
> +		return -ENOENT;
> +
> +	fdt = page_to_virt(folio_page(fdt_folio, 0));
> +
> +	pfolios = fdt_getprop(fdt, 0, "folios", &len);
> +	if (!pfolios || len % sizeof(*pfolios)) {
> +		pr_err("invalid 'folios' property\n");

Print should clearly state that error is because fields is not found or
len is not multiple of sizeof(*pfolios).


^ permalink raw reply

* [PATCH xfsprogs v2] xfs_io: add FALLOC_FL_WRITE_ZEROES support
From: Zhang Yi @ 2025-08-13  2:42 UTC (permalink / raw)
  To: linux-fsdevel, linux-block, dm-devel, linux-nvme, linux-scsi,
	linux-xfs
  Cc: linux-kernel, linux-api, hch, tytso, djwong, bmarzins, chaitanyak,
	shinichiro.kawasaki, brauner, martin.petersen, yi.zhang, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

The Linux kernel (since version 6.17) supports FALLOC_FL_WRITE_ZEROES in
fallocate(2). Add support for FALLOC_FL_WRITE_ZEROES support to the
fallocate utility by introducing a new 'fwzero' command in the xfs_io
tool.

Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=278c7d9b5e0c
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
v1->v2:
 - Minor description modification to align with the kernel.

 io/prealloc.c     | 36 ++++++++++++++++++++++++++++++++++++
 man/man8/xfs_io.8 |  6 ++++++
 2 files changed, 42 insertions(+)

diff --git a/io/prealloc.c b/io/prealloc.c
index 8e968c9f..9a64bf53 100644
--- a/io/prealloc.c
+++ b/io/prealloc.c
@@ -30,6 +30,10 @@
 #define FALLOC_FL_UNSHARE_RANGE 0x40
 #endif
 
+#ifndef FALLOC_FL_WRITE_ZEROES
+#define FALLOC_FL_WRITE_ZEROES 0x80
+#endif
+
 static cmdinfo_t allocsp_cmd;
 static cmdinfo_t freesp_cmd;
 static cmdinfo_t resvsp_cmd;
@@ -41,6 +45,7 @@ static cmdinfo_t fcollapse_cmd;
 static cmdinfo_t finsert_cmd;
 static cmdinfo_t fzero_cmd;
 static cmdinfo_t funshare_cmd;
+static cmdinfo_t fwzero_cmd;
 
 static int
 offset_length(
@@ -377,6 +382,27 @@ funshare_f(
 	return 0;
 }
 
+static int
+fwzero_f(
+	int		argc,
+	char		**argv)
+{
+	xfs_flock64_t	segment;
+	int		mode = FALLOC_FL_WRITE_ZEROES;
+
+	if (!offset_length(argv[1], argv[2], &segment)) {
+		exitcode = 1;
+		return 0;
+	}
+
+	if (fallocate(file->fd, mode, segment.l_start, segment.l_len)) {
+		perror("fallocate");
+		exitcode = 1;
+		return 0;
+	}
+	return 0;
+}
+
 void
 prealloc_init(void)
 {
@@ -489,4 +515,14 @@ prealloc_init(void)
 	funshare_cmd.oneline =
 	_("unshares shared blocks within the range");
 	add_command(&funshare_cmd);
+
+	fwzero_cmd.name = "fwzero";
+	fwzero_cmd.cfunc = fwzero_f;
+	fwzero_cmd.argmin = 2;
+	fwzero_cmd.argmax = 2;
+	fwzero_cmd.flags = CMD_NOMAP_OK | CMD_FOREIGN_OK;
+	fwzero_cmd.args = _("off len");
+	fwzero_cmd.oneline =
+	_("zeroes space and eliminates holes by allocating and submitting write zeroes");
+	add_command(&fwzero_cmd);
 }
diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
index b0dcfdb7..0a673322 100644
--- a/man/man8/xfs_io.8
+++ b/man/man8/xfs_io.8
@@ -550,6 +550,12 @@ With the
 .B -k
 option, use the FALLOC_FL_KEEP_SIZE flag as well.
 .TP
+.BI fwzero " offset length"
+Call fallocate with FALLOC_FL_WRITE_ZEROES flag as described in the
+.BR fallocate (2)
+manual page to allocate and zero blocks within the range by submitting write
+zeroes.
+.TP
 .BI zero " offset length"
 Call xfsctl with
 .B XFS_IOC_ZERO_RANGE
-- 
2.39.2


^ permalink raw reply related

* [PATCH util-linux v2] fallocate: add FALLOC_FL_WRITE_ZEROES support
From: Zhang Yi @ 2025-08-13  2:40 UTC (permalink / raw)
  To: linux-fsdevel, linux-block, dm-devel, linux-nvme, linux-scsi
  Cc: linux-kernel, linux-api, hch, tytso, djwong, bmarzins, chaitanyak,
	shinichiro.kawasaki, brauner, martin.petersen, yi.zhang, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

The Linux kernel (since version 6.17) supports FALLOC_FL_WRITE_ZEROES in
fallocate(2). Add support for FALLOC_FL_WRITE_ZEROES to the fallocate
utility by introducing a new option -w|--write-zeroes.

Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=278c7d9b5e0c
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
v1->v2:
 - Minor description modification to align with the kernel.

 sys-utils/fallocate.1.adoc | 11 +++++++++--
 sys-utils/fallocate.c      | 20 ++++++++++++++++----
 2 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/sys-utils/fallocate.1.adoc b/sys-utils/fallocate.1.adoc
index 44ee0ef4c..0ec9ff9a9 100644
--- a/sys-utils/fallocate.1.adoc
+++ b/sys-utils/fallocate.1.adoc
@@ -12,7 +12,7 @@ fallocate - preallocate or deallocate space to a file
 
 == SYNOPSIS
 
-*fallocate* [*-c*|*-p*|*-z*] [*-o* _offset_] *-l* _length_ [*-n*] _filename_
+*fallocate* [*-c*|*-p*|*-z*|*-w*] [*-o* _offset_] *-l* _length_ [*-n*] _filename_
 
 *fallocate* *-d* [*-o* _offset_] [*-l* _length_] _filename_
 
@@ -28,7 +28,7 @@ The exit status returned by *fallocate* is 0 on success and 1 on failure.
 
 The _length_ and _offset_ arguments may be followed by the multiplicative suffixes KiB (=1024), MiB (=1024*1024), and so on for GiB, TiB, PiB, EiB, ZiB, and YiB (the "iB" is optional, e.g., "K" has the same meaning as "KiB") or the suffixes KB (=1000), MB (=1000*1000), and so on for GB, TB, PB, EB, ZB, and YB.
 
-The options *--collapse-range*, *--dig-holes*, *--punch-hole*, *--zero-range* and *--posix* are mutually exclusive.
+The options *--collapse-range*, *--dig-holes*, *--punch-hole*, *--zero-range*, *--write-zeroes* and *--posix* are mutually exclusive.
 
 *-c*, *--collapse-range*::
 Removes a byte range from a file, without leaving a hole. The byte range to be collapsed starts at _offset_ and continues for _length_ bytes. At the completion of the operation, the contents of the file starting at the location __offset__+_length_ will be appended at the location _offset_, and the file will be _length_ bytes smaller. The option *--keep-size* may not be specified for the collapse-range operation.
@@ -76,6 +76,13 @@ Option *--keep-size* can be specified to prevent file length modification.
 +
 Available since Linux 3.14 for ext4 (only for extent-based files) and XFS.
 
+*-w*, *--write-zeroes*::
+Zeroes space in the byte range starting at _offset_ and continuing for _length_ bytes. Within the specified range, blocks are preallocated for the regions that span the holes in the file. After a successful call, subsequent reads from this range will return zeroes, subsequent writes to that range do not require further changes to the file mapping metadata.
++
+Zeroing is done within the filesystem by preferably submitting write zeores commands, the alternative way is submitting actual zeroed data, the specified range will be converted into written extents. The write zeroes command is typically faster than write actual data if the device supports unmap write zeroes, the specified range will not be physically zeroed out on the device.
++
+Options *--keep-size* can not be specified for the write-zeroes operation.
+
 include::man-common/help-version.adoc[]
 
 == AUTHORS
diff --git a/sys-utils/fallocate.c b/sys-utils/fallocate.c
index 13bf52915..8d37fdad7 100644
--- a/sys-utils/fallocate.c
+++ b/sys-utils/fallocate.c
@@ -40,7 +40,7 @@
 #if defined(HAVE_LINUX_FALLOC_H) && \
     (!defined(FALLOC_FL_KEEP_SIZE) || !defined(FALLOC_FL_PUNCH_HOLE) || \
      !defined(FALLOC_FL_COLLAPSE_RANGE) || !defined(FALLOC_FL_ZERO_RANGE) || \
-     !defined(FALLOC_FL_INSERT_RANGE))
+     !defined(FALLOC_FL_INSERT_RANGE) || !defined(FALLOC_FL_WRITE_ZEROES))
 # include <linux/falloc.h>	/* non-libc fallback for FALLOC_FL_* flags */
 #endif
 
@@ -65,6 +65,10 @@
 # define FALLOC_FL_INSERT_RANGE		0x20
 #endif
 
+#ifndef FALLOC_FL_WRITE_ZEROES
+# define FALLOC_FL_WRITE_ZEROES		0x80
+#endif
+
 #include "nls.h"
 #include "strutils.h"
 #include "c.h"
@@ -94,6 +98,7 @@ static void __attribute__((__noreturn__)) usage(void)
 	fputs(_(" -o, --offset <num>   offset for range operations, in bytes\n"), out);
 	fputs(_(" -p, --punch-hole     replace a range with a hole (implies -n)\n"), out);
 	fputs(_(" -z, --zero-range     zero and ensure allocation of a range\n"), out);
+	fputs(_(" -w, --write-zeroes   write zeroes and ensure allocation of a range\n"), out);
 #ifdef HAVE_POSIX_FALLOCATE
 	fputs(_(" -x, --posix          use posix_fallocate(3) instead of fallocate(2)\n"), out);
 #endif
@@ -304,6 +309,7 @@ int main(int argc, char **argv)
 	    { "dig-holes",      no_argument,       NULL, 'd' },
 	    { "insert-range",   no_argument,       NULL, 'i' },
 	    { "zero-range",     no_argument,       NULL, 'z' },
+	    { "write-zeroes",   no_argument,       NULL, 'w' },
 	    { "offset",         required_argument, NULL, 'o' },
 	    { "length",         required_argument, NULL, 'l' },
 	    { "posix",          no_argument,       NULL, 'x' },
@@ -312,8 +318,8 @@ int main(int argc, char **argv)
 	};
 
 	static const ul_excl_t excl[] = {	/* rows and cols in ASCII order */
-		{ 'c', 'd', 'i', 'p', 'x', 'z'},
-		{ 'c', 'i', 'n', 'x' },
+		{ 'c', 'd', 'i', 'p', 'w', 'x', 'z'},
+		{ 'c', 'i', 'n', 'w', 'x' },
 		{ 0 }
 	};
 	int excl_st[ARRAY_SIZE(excl)] = UL_EXCL_STATUS_INIT;
@@ -323,7 +329,7 @@ int main(int argc, char **argv)
 	textdomain(PACKAGE);
 	close_stdout_atexit();
 
-	while ((c = getopt_long(argc, argv, "hvVncpdizxl:o:", longopts, NULL))
+	while ((c = getopt_long(argc, argv, "hvVncpdizwxl:o:", longopts, NULL))
 			!= -1) {
 
 		err_exclusive_options(c, longopts, excl, excl_st);
@@ -353,6 +359,9 @@ int main(int argc, char **argv)
 		case 'z':
 			mode |= FALLOC_FL_ZERO_RANGE;
 			break;
+		case 'w':
+			mode |= FALLOC_FL_WRITE_ZEROES;
+			break;
 		case 'x':
 #ifdef HAVE_POSIX_FALLOCATE
 			posix = 1;
@@ -429,6 +438,9 @@ int main(int argc, char **argv)
 			else if (mode & FALLOC_FL_ZERO_RANGE)
 				fprintf(stdout, _("%s: %s (%ju bytes) zeroed.\n"),
 								filename, str, length);
+			else if (mode & FALLOC_FL_WRITE_ZEROES)
+				fprintf(stdout, _("%s: %s (%ju bytes) write zeroed.\n"),
+								filename, str, length);
 			else
 				fprintf(stdout, _("%s: %s (%ju bytes) allocated.\n"),
 								filename, str, length);
-- 
2.39.2


^ permalink raw reply related

* Re: [PATCH v3 06/12] man/man2/fsconfig.2: document "new" mount API
From: Aleksa Sarai @ 2025-08-12 18:25 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250809-new-mount-api-v3-6-f61405c80f34@cyphar.com>

[-- Attachment #1: Type: text/plain, Size: 968 bytes --]

On 2025-08-09, Aleksa Sarai <cyphar@cyphar.com> wrote:
> +Note that the Linux kernel reuses filesystem instances
> +for many filesystems,
> +so (depending on the filesystem being configured and parameters used)
> +it is possible for the filesystem instance "created" by
> +.B \%FSCONFIG_CMD_CREATE
> +to, in fact, be a reference
> +to an existing filesystem instance in the kernel.
> +The kernel will attempt to merge the specified parameters
> +of this filesystem configuration context
> +with those of the filesystem instance being reused,
> +but some parameters may be
> +.IR "silently ignored" .

While looking at this again, I realised this explanation is almost
certainly incorrect in a few places (and was based on a misunderstanding
of how sget_fc() works and how it interacts with vfs_get_tree()).

I'll rewrite this in the next version.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH v3 07/12] man/man2/fsmount.2: document "new" mount API
From: Aleksa Sarai @ 2025-08-12 16:18 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <2025-08-12.1755007445-rural-feudal-spacebar-forehead-28QkCN@cyphar.com>

[-- Attachment #1: Type: text/plain, Size: 481 bytes --]

On 2025-08-13, Aleksa Sarai <cyphar@cyphar.com> wrote:
> But yeah, "filesystem context" is more accurate here, so probably just:

Oops, I meant to include

>   Unlike open_tree(2) with OPEN_TREE_CLONE, fsmount() can only be called
>   once in the lifetime of a filesystem context.
                                                 to create a mount object.

at the end.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH v3 08/12] man/man2/move_mount.2: document "new" mount API
From: Aleksa Sarai @ 2025-08-12 14:36 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <1989db97e30.b71849c573511.8013418925525314426@zohomail.com>

[-- Attachment #1: Type: text/plain, Size: 1123 bytes --]

On 2025-08-12, Askar Safin <safinaskar@zohomail.com> wrote:
> move_mount for v2 contained this:
> > Mounts cannot be moved beneath the rootfs
> 
> In v3 you changed this to this:
> > Mount objects cannot be attached beneath the filesystem root
> 
> You made this phrase worse.
> 
> "Filesystem root" can be understood as "root of superblock".
> So, please, change this to "root directory" or something.

Maybe I should borrow the "root mount" terminology from pivot_root(2)?
(Though they use "root mount in the mount namespace of the calling
process", which is a little wordy.) I didn't like using "rootfs" as
shorthand in a man-page.

> > This would create a new bind-mount of /home/cyphar as attached mount object, and then attach
> You meant "as detached mount object"

Thanks, I have already fixed this in my branch (and the two other
misuses of "attach" in fsopen(2)). FWIW, open_tree(2) was the first
man-page in this series that I wrote, so I hadn't settled on the wording
the first draft.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH v3 07/12] man/man2/fsmount.2: document "new" mount API
From: Aleksa Sarai @ 2025-08-12 14:33 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <1989d90de76.d3b8b3cc73065.2447955224950374755@zohomail.com>

[-- Attachment #1: Type: text/plain, Size: 2231 bytes --]

On 2025-08-12, Askar Safin <safinaskar@zohomail.com> wrote:
> fsmount:
> > Unlike open_tree(2) with OPEN_TREE_CLONE, fsmount() can only be called once in the lifetime of a filesystem instance to produce a mount object.
>
> I don't understand what you meant here. This phrase in its current form is wrong.
> Consider this scenario: we did this:
> fsopen(...)
> fsconfig(..., FSCONFIG_SET_STRING, "source", ...)
> fsconfig(..., FSCONFIG_CMD_CREATE, ...)
> fsmount(...)
> fsopen(...)
> fsconfig(..., FSCONFIG_SET_STRING, "source", ...)
> fsconfig(..., FSCONFIG_CMD_CREATE, ...)
> fsmount(...)
> 
> We used FSCONFIG_CMD_CREATE here as opposed to FSCONFIG_CMD_CREATE_EXCL, thus
> it is possible that second fsmount will return mount for the same superblock.
> Thus that statement "fsmount() can only be called once in the lifetime of a filesystem instance to produce a mount object"
> is not true.

Yeah, the superblock reuse behaviour makes this description less
coherent than what I was going for. My thinking was that a reused
superblock is (to userspace) conceptually a new filesystem instance
because they create it the same way as any other filesystem instance.
(In fact, the rest of the VFS treats them the same way too -- only
sget_fc() knows about superblock reuse.)

But yeah, "filesystem context" is more accurate here, so probably just:

  Unlike open_tree(2) with OPEN_TREE_CLONE, fsmount() can only be called
  once in the lifetime of a filesystem context.

Though maybe we should mention that it's fsopen(2)-only (even though
it's mentioned earlier in the DESCRIPTION)? If you read the sentence in
isolation you might get the wrong impression. Do you have any
alternative suggestions?

FWIW, superblock reuse is one of those things that is a fairly hairy
implementation detail of the VFS, and as such it has quite odd
semantics. I probably wouldn't have documented it as heavily if it
wasn't for the addition of FSCONFIG_CMD_CREATE_EXCL (maybe an entry in
BUGS or CAVEATS at most -- this behaviour has an even worse impact on
mount(2) but it's completely undocumented there).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH v3 08/12] man/man2/move_mount.2: document "new" mount API
From: Askar Safin @ 2025-08-12 10:00 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250809-new-mount-api-v3-8-f61405c80f34@cyphar.com>

move_mount for v2 contained this:
> Mounts cannot be moved beneath the rootfs

In v3 you changed this to this:
> Mount objects cannot be attached beneath the filesystem root

You made this phrase worse.

"Filesystem root" can be understood as "root of superblock".
So, please, change this to "root directory" or something.

> This would create a new bind-mount of /home/cyphar as attached mount object, and then attach
You meant "as detached mount object"

Also: I see that you addressed all my v2 comments. Thank you!

--
Askar Safin
https://types.pl/@safinaskar


^ permalink raw reply

* Re: [PATCH v3 07/12] man/man2/fsmount.2: document "new" mount API
From: Askar Safin @ 2025-08-12  9:16 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250809-new-mount-api-v3-7-f61405c80f34@cyphar.com>

fsmount:
> Unlike open_tree(2) with OPEN_TREE_CLONE, fsmount() can only be called once in the lifetime of a filesystem instance to produce a mount object.

I don't understand what you meant here. This phrase in its current form is wrong.
Consider this scenario: we did this:
fsopen(...)
fsconfig(..., FSCONFIG_SET_STRING, "source", ...)
fsconfig(..., FSCONFIG_CMD_CREATE, ...)
fsmount(...)
fsopen(...)
fsconfig(..., FSCONFIG_SET_STRING, "source", ...)
fsconfig(..., FSCONFIG_CMD_CREATE, ...)
fsmount(...)

We used FSCONFIG_CMD_CREATE here as opposed to FSCONFIG_CMD_CREATE_EXCL, thus
it is possible that second fsmount will return mount for the same superblock.
Thus that statement "fsmount() can only be called once in the lifetime of a filesystem instance to produce a mount object"
is not true.

--
Askar Safin
https://types.pl/@safinaskar

^ permalink raw reply

* Re: [PATCH v3 26/30] mm: shmem: use SHMEM_F_* flags instead of VM_* flags
From: Vipin Sharma @ 2025-08-11 23:11 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-27-pasha.tatashin@soleen.com>

On 2025-08-07 01:44:32, Pasha Tatashin wrote:
> From: Pratyush Yadav <ptyadav@amazon.de>
> @@ -3123,7 +3123,9 @@ static struct inode *__shmem_get_inode(struct mnt_idmap *idmap,
>  	spin_lock_init(&info->lock);
>  	atomic_set(&info->stop_eviction, 0);
>  	info->seals = F_SEAL_SEAL;
> -	info->flags = flags & VM_NORESERVE;
> +	info->flags = 0;

This is not needed as the 'info' is being set to 0 just above
spin_lock_init.

> +	if (flags & VM_NORESERVE)
> +		info->flags |= SHMEM_F_NORESERVE;

As info->flags will be 0, this can be just direct assignment '='.

>  	info->i_crtime = inode_get_mtime(inode);
>  	info->fsflags = (dir == NULL) ? 0 :
>  		SHMEM_I(dir)->fsflags & SHMEM_FL_INHERITED;
> @@ -5862,8 +5864,10 @@ static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap,
>  /* common code */
>  
>  static struct file *__shmem_file_setup(struct vfsmount *mnt, const char *name,
> -			loff_t size, unsigned long flags, unsigned int i_flags)
> +				       loff_t size, unsigned long vm_flags,
> +				       unsigned int i_flags)

Nit: Might be just my editor, but this alignment seems off.

>  {
> +	unsigned long flags = (vm_flags & VM_NORESERVE) ? SHMEM_F_NORESERVE : 0;
>  	struct inode *inode;
>  	struct file *res;
>  
> @@ -5880,7 +5884,7 @@ static struct file *__shmem_file_setup(struct vfsmount *mnt, const char *name,
>  		return ERR_PTR(-ENOMEM);
>  
>  	inode = shmem_get_inode(&nop_mnt_idmap, mnt->mnt_sb, NULL,
> -				S_IFREG | S_IRWXUGO, 0, flags);
> +				S_IFREG | S_IRWXUGO, 0, vm_flags);
>  	if (IS_ERR(inode)) {
>  		shmem_unacct_size(flags, size);
>  		return ERR_CAST(inode);
> -- 
> 2.50.1.565.gc32cd1483b-goog
> 

^ permalink raw reply

* [PATCH v5 3/3] man/man2/mremap.2: describe previously undocumented shrink behaviour
From: Lorenzo Stoakes @ 2025-08-11 14:59 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <cover.1754924278.git.lorenzo.stoakes@oracle.com>

There is pre-existing logic that appears to be undocumented for an mremap()
shrink operation, where it turns out that the usual 'input range must span
a single mapping' requirement no longer applies.

In fact, it turns out that the input range specified by [old_address,
old_address + old_size) may span any number of mappings.

If shrinking in-place (that is, neither the MREMAP_FIXED nor
MREMAP_DONTUNMAP flags are specified), then the new span may also span any
number of VMAs - [old_address, old_address + new_size).

If shrinking and moving, the range specified by [old_address, old_address +
new_size) must span a single VMA.

There must be at least one VMA contained within the [old_address,
old_address + old_size) range, and old_address must be within the range of
a VMA.

Explicitly document this.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 man/man2/mremap.2 | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
index 6d14bf627..53d4eda29 100644
--- a/man/man2/mremap.2
+++ b/man/man2/mremap.2
@@ -47,8 +47,35 @@ The
 .B MREMAP_DONTUNMAP
 flag may also be specified.
 .P
-If the operation is not
-simply moving mappings,
+Equally, if the operation performs a shrink,
+that is if
+.I old_size
+is greater than
+.IR new_size ,
+then
+.I old_size
+may also span multiple mappings
+which do not have to be
+adjacent to one another.
+If this shrink is performed
+in-place,
+that is,
+neither
+.BR MREMAP_FIXED ,
+nor
+.B MREMAP_DONTUNMAP
+are specified,
+.I new_size
+may also span multiple VMAs.
+However, if the range is moved,
+then
+.I new_size
+must span only a single mapping.
+.P
+If the operation is neither a
+.B MREMAP_FIXED
+move
+nor a shrink,
 then
 .I old_size
 must span only a single mapping.
-- 
2.50.1

^ permalink raw reply related

* [PATCH v5 1/3] man/man2/mremap.2: explicitly document the simple move operation
From: Lorenzo Stoakes @ 2025-08-11 14:59 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <cover.1754924278.git.lorenzo.stoakes@oracle.com>

In preparation for discussing newly introduced mremap() behaviour to permit
the move of multiple mappings at once, add a section to the mremap.2 man
page to describe these operations in general.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 man/man2/mremap.2 | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
index 2168ca728..4e3c8e54e 100644
--- a/man/man2/mremap.2
+++ b/man/man2/mremap.2
@@ -25,6 +25,20 @@ moving it at the same time (controlled by the
 argument and
 the available virtual address space).
 .P
+Mappings can also simply be moved
+(without any resizing)
+by specifying equal
+.I old_size
+and
+.I new_size
+and using the
+.B MREMAP_FIXED
+flag
+(see below).
+The
+.B MREMAP_DONTUNMAP
+flag may also be specified.
+.P
 .I old_address
 is the old address of the virtual memory block that you
 want to expand (or shrink).
-- 
2.50.1


^ permalink raw reply related

* [PATCH v5 2/3] man/man2/mremap.2: describe multiple mapping move
From: Lorenzo Stoakes @ 2025-08-11 14:59 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <cover.1754924278.git.lorenzo.stoakes@oracle.com>

Document the new behaviour introduced in Linux 6.17 whereby it is now
possible to move multiple mappings in a single operation, as long as the
operation is purely a move, that is old_size is equal to new_size and
MREMAP_FIXED is specified.

This change also explains the limitations of of this method and the
possibility of partial failure.

Finally, we pluralise language where it makes sense to so the documentation
does not contradict either this new capability nor the pre-existing edge
case.

Example code is enclosed below demonstrating the behaviour which is now
possible:

int main(void)
{
	unsigned long page_size = sysconf(_SC_PAGESIZE);
	void *ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
			 MAP_ANON | MAP_PRIVATE, -1, 0);
	void *tgt_ptr = mmap(NULL, 10 * page_size, PROT_NONE,
			     MAP_ANON | MAP_PRIVATE, -1, 0);
	int i;

	if (ptr == MAP_FAILED || tgt_ptr == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	/* Unmap every other page. */
	for (i = 1; i < 10; i += 2)
		munmap(ptr + i * page_size, page_size);

	/* Now move all 5 distinct mappings to tgt_ptr. */
	ptr = mremap(ptr, 10 * page_size, 10 * page_size,
		     MREMAP_MAYMOVE | MREMAP_FIXED, tgt_ptr);
	if (ptr == MAP_FAILED) {
		perror("mremap");
		return EXIT_FAILURE;
	}

	return EXIT_SUCCESS;
}

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 man/man2/mremap.2 | 68 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 57 insertions(+), 11 deletions(-)

diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
index 4e3c8e54e..6d14bf627 100644
--- a/man/man2/mremap.2
+++ b/man/man2/mremap.2
@@ -35,22 +35,36 @@ and using the
 .B MREMAP_FIXED
 flag
 (see below).
+Since Linux 6.17,
+while
+.I old_address
+must be mapped,
+.I old_size
+may span multiple mappings
+including unmapped areas between
+them when performing a move like this.
 The
 .B MREMAP_DONTUNMAP
 flag may also be specified.
 .P
+If the operation is not
+simply moving mappings,
+then
+.I old_size
+must span only a single mapping.
+.P
 .I old_address
-is the old address of the virtual memory block that you
-want to expand (or shrink).
+is the old address of the first virtual memory block that you
+want to expand, shrink, and/or move.
 Note that
 .I old_address
 has to be page aligned.
 .I old_size
-is the old size of the
-virtual memory block.
+is the size of the range containing
+virtual memory blocks to be manipulated.
 .I new_size
 is the requested size of the
-virtual memory block after the resize.
+virtual memory blocks after the resize.
 An optional fifth argument,
 .IR new_address ,
 may be provided; see the description of
@@ -119,13 +133,42 @@ If
 is specified, then
 .B MREMAP_MAYMOVE
 must also be specified.
+.IP
+Since Linux 6.17,
+if
+.I old_size
+is equal to
+.I new_size
+and
+.B MREMAP_FIXED
+is specified, then
+.I old_size
+may span beyond the mapping in which
+.I old_address
+resides.
+In this case,
+gaps between mappings in the original range
+are maintained in the new range.
+The whole operation is performed atomically
+unless an error arises,
+in which case the operation may be partially
+completed,
+that is,
+some mappings may be moved and others not.
+.IP
+Moving multiple mappings is not permitted if
+any of those mappings have either
+been registered with
+.BR userfaultfd (2) ,
+or map drivers that
+specify their own custom address mapping logic.
 .TP
 .BR MREMAP_DONTUNMAP " (since Linux 5.7)"
 .\" commit e346b3813067d4b17383f975f197a9aa28a3b077
 This flag, which must be used in conjunction with
 .BR MREMAP_MAYMOVE ,
-remaps a mapping to a new address but does not unmap the mapping at
-.IR old_address .
+remaps mappings to a new address but does not unmap them
+from their original address.
 .IP
 The
 .B MREMAP_DONTUNMAP
@@ -163,13 +206,13 @@ mapped.
 See NOTES for some possible applications of
 .BR MREMAP_DONTUNMAP .
 .P
-If the memory segment specified by
+If the memory segments specified by
 .I old_address
 and
 .I old_size
-is locked (using
+are locked (using
 .BR mlock (2)
-or similar), then this lock is maintained when the segment is
+or similar), then this lock is maintained when the segments are
 resized and/or relocated.
 As a consequence, the amount of memory locked by the process may change.
 .SH RETURN VALUE
@@ -202,7 +245,10 @@ virtual memory address for this process.
 You can also get
 .B EFAULT
 even if there exist mappings that cover the
-whole address space requested, but those mappings are of different types.
+whole address space requested, but those mappings are of different types,
+and the
+.BR mremap ()
+operation being performed does not support this.
 .TP
 .B EINVAL
 An invalid argument was given.
-- 
2.50.1


^ permalink raw reply related

* [PATCH v5 0/3] man/man2/mremap.2: describe multiple mapping move, shrink
From: Lorenzo Stoakes @ 2025-08-11 14:59 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api

We have added new functionality to mremap() in Linux 6.17, permitting the
move of multiple VMAs when performing a move alone (that is - providing
MREMAP_MAYMOVE | MREMAP_FIXED flags and specifying old_size == new_size).

We document this new feature.

Additionally, we document previously undocumented behaviour around
shrinking of input VMA ranges which permits the input range to span
multiple VMAs.

v5:
* Reword point about 'old_address' needing to be in a mapped range more clearly
  as per Alejandro.
* Update wording accordingly around this, e.g. referencing unmapped areas
  between mappings to be consistent.
* Minor wording fixup stating that the MREMAP_DONTUNMAP flag may _also_ be
  specified.
* Separated out adding a section on 'pure' moves and describing the new
  behaviour, as per Alejandro.
* Update the commit message of 2/3 to reflect the above.
* Removed erroneously introduced blank line in 2/3 as per Alejandro.

v4:
* Update description of newly discovered mremap() behaviour to highlight the
  fact that, if in-place, [old_address, old_address + new_length) may span
  multiple VMAs also.
* Fix up commit message for 2/2 to correct typo on specified range.
* Added code sample to 1/2 as per Alejandro.
https://lore.kernel.org/all/cover.1754414738.git.lorenzo.stoakes@oracle.com/

v3:
* Use more precise language around mremap() move description as per Jann.
* Fix some typos in commit messages.
https://lore.kernel.org/all/cover.1753795807.git.lorenzo.stoakes@oracle.com/

v2:
* Split out the two man page changes as requested by Alejandro.
https://lore.kernel.org/all/cover.1753711160.git.lorenzo.stoakes@oracle.com/

v1:
https://lore.kernel.org/all/20250723174634.75054-1-lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (3):
  man/man2/mremap.2: explicitly document the simple move operation
  man/man2/mremap.2: describe multiple mapping move
  man/man2/mremap.2: describe previously undocumented shrink behaviour

 man/man2/mremap.2 | 109 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 98 insertions(+), 11 deletions(-)

--
2.50.1

^ permalink raw reply

* Re: [PATCH v3 00/12] man2: document "new" mount API
From: Alejandro Colomar @ 2025-08-11 11:27 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Askar Safin, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <2025-08-09.1754760145-silky-magic-obituary-sting-3OnpC7@cyphar.com>

[-- Attachment #1: Type: text/plain, Size: 2154 bytes --]

Hi Aleksa,

On Sun, Aug 10, 2025 at 03:32:25AM +1000, Aleksa Sarai wrote:
> On 2025-08-09, Askar Safin <safinaskar@zohomail.com> wrote:
> > I plan to do a lot of testing of "new" mount API on my computer.
> > It is quiet possible that I will find some bugs in these manpages during testing.
> > (I already found some, but I'm not sure.)
> > I think this will take 3-7 days.
> > So, Alejandro Colomar, please, don't merge this patchset until then.
> 
> I don't plan to work on this again for the next week at least (I've
> already spent over a week on these docs -- writing, rewriting, and then
> rewriting once more for good measure; I've started seeing groff in my
> nightmares...), so I will go through review comments after you're done.
>
> There are some rough edges on these APIs I found while writing these
> docs, so I plan to fix those this cycle if possible (hopefully those
> aren't the bugs you said you found in the docs). Two of the fixes have
> already been merged in the vfs tree for 6.18 (the -ENODATA handling bug,
> as well as a bug in open_tree_attr() that would've let userspace trigger
> UAFs). (Once 6.18 is out, I will send a follow-up patchset to document
> the fixes.)
> 
> FYI, I've already fixed the few ".BR \% FOO" typos. (My terminal font
> doesn't have a bold typeface, so when reviewing the rendered man-pages,
> mistakes involving .B are hard to spot.)

You can review in PDF if you want.  See the pdfman(1) script under
src/bin/.  It's quite portable:

	$ cat src/bin/pdfman 
	#!/bin/bash
	#
	# Copyright, the authors of the Linux man-pages project
	# SPDX-License-Identifier: GPL-3.0-or-later

	set -Eeuo pipefail;
	shopt -s lastpipe;

	printf '%s\n' "${!#}.XXXXXX" \
	| sed 's,.*/,,' \
	| xargs mktemp -t \
	| read -r tmp;

	man -Tpdf "$@" >"$tmp";
	xdg-open "$tmp";

It works essentially like man(1), so you can pass any man(7) file as its
argument to read it as a PDF.

(You may or may not have it available in your system, if your distro
 packages a recent enough version of the project.)


Have a lovely day!
Alex

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v4 1/2] man/man2/mremap.2: describe multiple mapping move
From: Alejandro Colomar @ 2025-08-11  9:37 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <39fb9e1b-b806-47da-a711-20c6cc12913a@lucifer.local>

[-- Attachment #1: Type: text/plain, Size: 1766 bytes --]

On Mon, Aug 11, 2025 at 10:25:56AM +0100, Lorenzo Stoakes wrote:
> On Mon, Aug 11, 2025 at 11:20:05AM +0200, Alejandro Colomar wrote:
> > Hi Lorenzo,
> >
> > On Mon, Aug 11, 2025 at 06:30:38AM +0100, Lorenzo Stoakes wrote:
> > > > > +Mappings can also simply be moved
> > > > > +(without any resizing)
> > > > > +by specifying equal
> > > > > +.I old_size
> > > > > +and
> > > > > +.I new_size
> > > > > +and using the
> > > > > +.B MREMAP_FIXED
> > > > > +flag
> > > > > +(see below).
> > > > > +Since Linux 6.17,
> > > > > +while
> > > > > +.I old_address
> > > > > +must reside within a mapping,
> > > >
> > > > I don't understand this.  What does it mean that old_address must reside
> > > > within a mapping?  It's a point, not a size, so I'm not sure I
> > > > understand it.
> > >
> > > I think if it were a size it would be more confusing no?
> > >
> > > It's an address, the address must be located within an existing memory mapping.
> >
> > What I don't understand is: how could you not comply with that?  Could
> > you pass some old_address that is in two mappings?  Being a single
> > address, that would be impossible, right?
> 
> It can be in an unmapped area. It's either in an unmapped area or a mapped one.
> 
> I could simply reword this as 'old_address must be mapped'?

Yup, that seems better.  Thanks!

Cheers,
Alex

> > > Will replace with 'located' for clarity.
> > >
> > > >
> > > > > +.I old_size
> > > > > +may span multiple mappings
> > > > > +which do not have to be
> > > > > +adjacent to one another when
> > > > > +performing a move like this.
> >
> >
> > Have a lovely day!
> > Alex
> >
> > --
> > <https://www.alejandro-colomar.es/>

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v4 1/2] man/man2/mremap.2: describe multiple mapping move
From: Lorenzo Stoakes @ 2025-08-11  9:25 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <iny6ro5f37vcthqwscklqx73jscahodilug5d6umleyzq6a67k@ecoey5ud3aer>

On Mon, Aug 11, 2025 at 11:20:05AM +0200, Alejandro Colomar wrote:
> Hi Lorenzo,
>
> On Mon, Aug 11, 2025 at 06:30:38AM +0100, Lorenzo Stoakes wrote:
> > > > +Mappings can also simply be moved
> > > > +(without any resizing)
> > > > +by specifying equal
> > > > +.I old_size
> > > > +and
> > > > +.I new_size
> > > > +and using the
> > > > +.B MREMAP_FIXED
> > > > +flag
> > > > +(see below).
> > > > +Since Linux 6.17,
> > > > +while
> > > > +.I old_address
> > > > +must reside within a mapping,
> > >
> > > I don't understand this.  What does it mean that old_address must reside
> > > within a mapping?  It's a point, not a size, so I'm not sure I
> > > understand it.
> >
> > I think if it were a size it would be more confusing no?
> >
> > It's an address, the address must be located within an existing memory mapping.
>
> What I don't understand is: how could you not comply with that?  Could
> you pass some old_address that is in two mappings?  Being a single
> address, that would be impossible, right?

It can be in an unmapped area. It's either in an unmapped area or a mapped one.

I could simply reword this as 'old_address must be mapped'?

>
> > Will replace with 'located' for clarity.
> >
> > >
> > > > +.I old_size
> > > > +may span multiple mappings
> > > > +which do not have to be
> > > > +adjacent to one another when
> > > > +performing a move like this.
>
>
> Have a lovely day!
> Alex
>
> --
> <https://www.alejandro-colomar.es/>

^ permalink raw reply

* Re: [PATCH v4 2/2] man/man2/mremap.2: describe previously undocumented shrink behaviour
From: Alejandro Colomar @ 2025-08-11  9:20 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <0b0928a3-03b0-4a36-817b-b75c1f5c78f9@lucifer.local>

[-- Attachment #1: Type: text/plain, Size: 552 bytes --]

On Mon, Aug 11, 2025 at 06:31:25AM +0100, Lorenzo Stoakes wrote:
> > Since this is documenting old behavior, could we have this patch before
> > the patch documenting new behavior?  Or do you prefer it in this order?
> 
> I think it's fine in this order, it's more convenient for me as it'd be a pain
> to re-order otherwise, and we've waited ~20 years (or longer?) to document this
> so a delay in ordering is probably fine :P
> 
> Cheers, Lorenzo

Okay, that's fine.  Thanks!


Cheers,
Alex

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v4 1/2] man/man2/mremap.2: describe multiple mapping move
From: Alejandro Colomar @ 2025-08-11  9:20 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <664b00e3-69a0-498a-a7dd-a3d294c0c188@lucifer.local>

[-- Attachment #1: Type: text/plain, Size: 1212 bytes --]

Hi Lorenzo,

On Mon, Aug 11, 2025 at 06:30:38AM +0100, Lorenzo Stoakes wrote:
> > > +Mappings can also simply be moved
> > > +(without any resizing)
> > > +by specifying equal
> > > +.I old_size
> > > +and
> > > +.I new_size
> > > +and using the
> > > +.B MREMAP_FIXED
> > > +flag
> > > +(see below).
> > > +Since Linux 6.17,
> > > +while
> > > +.I old_address
> > > +must reside within a mapping,
> >
> > I don't understand this.  What does it mean that old_address must reside
> > within a mapping?  It's a point, not a size, so I'm not sure I
> > understand it.
> 
> I think if it were a size it would be more confusing no?
> 
> It's an address, the address must be located within an existing memory mapping.

What I don't understand is: how could you not comply with that?  Could
you pass some old_address that is in two mappings?  Being a single
address, that would be impossible, right?

> Will replace with 'located' for clarity.
> 
> >
> > > +.I old_size
> > > +may span multiple mappings
> > > +which do not have to be
> > > +adjacent to one another when
> > > +performing a move like this.


Have a lovely day!
Alex

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v4 2/2] man/man2/mremap.2: describe previously undocumented shrink behaviour
From: Lorenzo Stoakes @ 2025-08-11  5:31 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <iolucvrp6is43yjulbluchhw5wy6urq2gtwmcelg5atwuv63se@ovzuthfrup26>

On Sat, Aug 09, 2025 at 04:33:12PM +0200, Alejandro Colomar wrote:
> Hi Lorenzo,
>
> On Tue, Aug 05, 2025 at 06:31:56PM +0100, Lorenzo Stoakes wrote:
> > There is pre-existing logic that appears to be undocumented for an mremap()
> > shrink operation, where it turns out that the usual 'input range must span
> > a single mapping' requirement no longer applies.
> >
> > In fact, it turns out that the input range specified by [old_address,
> > old_address + old_size) may span any number of mappings.
> >
> > If shrinking in-place (that is, neither the MREMAP_FIXED nor
> > MREMAP_DONTUNMAP flags are specified), then the new span may also span any
> > number of VMAs - [old_address, old_address + new_size).
> >
> > If shrinking and moving, the range specified by [old_address, old_address +
> > new_size) must span a single VMA.
> >
> > There must be at least one VMA contained within the [old_address,
> > old_address + old_size) range, and old_address must be within the range of
> > a VMA.
> >
> > Explicitly document this.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Since this is documenting old behavior, could we have this patch before
> the patch documenting new behavior?  Or do you prefer it in this order?

I think it's fine in this order, it's more convenient for me as it'd be a pain
to re-order otherwise, and we've waited ~20 years (or longer?) to document this
so a delay in ordering is probably fine :P

Cheers, Lorenzo

>
>
> Cheers,
> Alex
>
> > ---
> >  man/man2/mremap.2 | 31 +++++++++++++++++++++++++++++--
> >  1 file changed, 29 insertions(+), 2 deletions(-)
> >
> > diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
> > index 6ba51310c..631c835b8 100644
> > --- a/man/man2/mremap.2
> > +++ b/man/man2/mremap.2
> > @@ -48,8 +48,35 @@ The
> >  .B MREMAP_DONTUNMAP
> >  flag may be specified.
> >  .P
> > -If the operation is not
> > -simply moving mappings,
> > +Equally, if the operation performs a shrink,
> > +that is if
> > +.I old_size
> > +is greater than
> > +.IR new_size ,
> > +then
> > +.I old_size
> > +may also span multiple mappings
> > +which do not have to be
> > +adjacent to one another.
> > +If this shrink is performed
> > +in-place,
> > +that is,
> > +neither
> > +.BR MREMAP_FIXED ,
> > +nor
> > +.B MREMAP_DONTUNMAP
> > +are specified,
> > +.I new_size
> > +may also span multiple VMAs.
> > +However, if the range is moved,
> > +then
> > +.I new_size
> > +must span only a single mapping.
> > +.P
> > +If the operation is neither a
> > +.B MREMAP_FIXED
> > +move
> > +nor a shrink,
> >  then
> >  .I old_size
> >  must span only a single mapping.
> > --
> > 2.50.1
> >
>
> --
> <https://www.alejandro-colomar.es/>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox