[LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes

linux-fsdevel.vger.kernel.org archive mirror

* [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
@ 2025-01-29  7:06 Ojaswin Mujoo
  2025-01-29  8:59 ` John Garry
  2025-03-23  7:00 ` [RFCv1 0/1] EXT4 support of multi-fsblock atomic write with bigalloc Ritesh Harjani (IBM)
  0 siblings, 2 replies; 15+ messages in thread
From: Ojaswin Mujoo @ 2025-01-29  7:06 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-xfs, linux-fsdevel, John Garry, djwong, dchinner, hch,
	ritesh.list, jack, tytso, linux-ext4

Greetings,

I would like to submit a proposal to discuss the design of extsize and
forcealign and various open questions around it.

 ** Background **

Modern NVMe/SCSI disks with atomic write capabilities can write a multi-KB
range to disk atomically. This feature has a wide variety of use cases,
especially for databases like MySQL and PostgreSQL that can leverage atomic
writes to gain significant performance. However, in order to enable atomic
writes on Linux, the underlying disk may have some size and alignment
constraints that the upper layers like filesystems should follow. extsize with
forcealign is one of the ways filesystems can make sure the IO submitted to the
disk adheres to the atomic writes constraints.

extsize is a hint to the FS to allocate extents at a certain logical alignment
and size. forcealign builds on this by forcing the allocator to enforce the
alignment guarantees for physical blocks as well, which is essential for atomic
writes.
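
To make the two requirements above concrete, here is a minimal userspace
sketch (not kernel code). The helper names are hypothetical. The rules in
atomic_write_ok() follow the documented RWF_ATOMIC constraints as I
understand them (power-of-2 length within the advertised min/max unit,
offset naturally aligned to the length). forcealign_ok() assumes a
power-of-2 extsize given in bytes.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if v is a non-zero power of two. */
static bool is_pow2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Hypothetical helper: does a write of len bytes at file offset off
 * satisfy the size/alignment rules for an atomic write, given the
 * advertised minimum and maximum atomic write unit (in bytes)?
 */
static bool atomic_write_ok(uint64_t off, uint64_t len,
			    uint64_t unit_min, uint64_t unit_max)
{
	if (!is_pow2(len) || len < unit_min || len > unit_max)
		return false;
	return (off & (len - 1)) == 0;	/* naturally aligned offset */
}

/*
 * Hypothetical helper: extsize alone aligns the logical (file) offset;
 * forcealign additionally requires the physical (disk) byte address to
 * be aligned. extsize is assumed to be a power of two here.
 */
static bool forcealign_ok(uint64_t logical, uint64_t physical,
			  uint64_t extsize)
{
	return is_pow2(extsize) &&
	       ((logical | physical) & (extsize - 1)) == 0;
}
```

For example, with a 16K extsize an extent at file offset 0 that starts at
disk byte 36864 is logically aligned but physically misaligned, so
forcealign_ok() rejects it even though an extsize hint alone would not.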

 ** Points of discussion **

The extsize hints feature is already supported by XFS [1], with forcealign still
under development and discussion [2]. After taking a look at ext4's multi-block
allocator design, supporting extsize with forcealign can be done in ext4 as
well. There is an RFC which adds support for the extsize hints feature in
ext4 [3]. However, there are some caveats and deviations from the XFS design. With
these in mind, I would like to propose an LSFMM topic on:

 * exact semantics of extsize w/ forcealign which can bring a consistent
   interface among ext4 and xfs and possibly any other FS that plans to
   implement them in the future.

 * Documenting how forcealign with extsize should behave with various FS
   operations like fallocate, truncate, punch hole, insert/collapse range etc.

 * Implementing extsize with delayed allocation and the challenges there.

 * Discussing tooling support for forcealign, e.g. how we plan to maintain
   block alignment guarantees during fsck, resize and other times where we might
   need to move blocks around.

 * Documenting any areas where FSes might differ in their implementations of the
   same. For example, ext4 doesn't plan to support non-power-of-2 extsizes whereas
   XFS has support for that.

Hopefully this discussion will be relevant in defining consistent semantics for
extsize hints and forcealign, which might also be useful to other FS
developers.

Thoughts and suggestions are welcome.

References:
[1] https://man7.org/linux/man-pages/man2/ioctl_xfs_fsgetxattr.2.html
[2] https://lore.kernel.org/linux-xfs/20240813163638.3751939-1-john.g.garry@oracle.com/
[3] https://lore.kernel.org/linux-ext4/cover.1733901374.git.ojaswin@linux.ibm.com/

Regards,
ojaswin


* Re: [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
  2025-01-29  7:06 [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes Ojaswin Mujoo
@ 2025-01-29  8:59 ` John Garry
  2025-01-29 16:06   ` Ojaswin Mujoo
  2025-03-23  7:00 ` [RFCv1 0/1] EXT4 support of multi-fsblock atomic write with bigalloc Ritesh Harjani (IBM)
  1 sibling, 1 reply; 15+ messages in thread
From: John Garry @ 2025-01-29  8:59 UTC (permalink / raw)
  To: Ojaswin Mujoo, lsf-pc
  Cc: linux-xfs, linux-fsdevel, djwong, dchinner, hch, ritesh.list,
	jack, tytso, linux-ext4

On 29/01/2025 07:06, Ojaswin Mujoo wrote:

Hi Ojaswin,

> 
> I would like to submit a proposal to discuss the design of extsize and
> forcealign and various open questions around it.
> 
>   ** Background **
> 
> Modern NVMe/SCSI disks with atomic write capabilities can allow writes to a
> multi-KB range on disk to go atomically. This feature has a wide variety of use
> cases especially for databases like mysql and postgres that can leverage atomic
> writes to gain significant performance. However, in order to enable atomic
> writes on Linux, the underlying disk may have some size and alignment
> constraints that the upper layers like filesystems should follow. extsize with
> forcealign is one of the ways filesystems can make sure the IO submitted to the
> disk adheres to the atomic writes constraints.
> 
> extsize is a hint to the FS to allocate extents at a certain logical alignment
> and size. forcealign builds on this by forcing the allocator to enforce the
> alignment guarantees for physical blocks as well, which is essential for atomic
> writes.
> 
>   ** Points of discussion **
> 
> Extsize hints feature is already supported by XFS [1] with forcealign still
> under development and discussion [2].

 From 
https://lore.kernel.org/linux-xfs/20241212013433.GC6678@frogsfrogsfrogs/ 
thread, the alternate solution to forcealign for XFS is to use a 
software-emulated fallback for unaligned atomic writes. I am looking at 
a PoC implementation now. Note that this does rely on CoW.

There has been pushback on forcealign for XFS, so we need to 
prove/disprove that this software-emulated fallback can work, see 
https://lore.kernel.org/linux-xfs/20240924061719.GA11211@lst.de/

> After taking a look at ext4's multi-block
> allocator design, supporting extsize with forcealign can be done in ext4 as
> well. There is a RFC proposed which adds support for extsize hints feature in
> ext4 [3]. However there are some caveats and deviations from XFS design. With
> these in mind, I would like to propose LSFMM topic on:
> 
>   * exact semantics of extsize w/ forcealign which can bring a consistent
>     interface among ext4 and xfs and possibly any other FS that plans to
>     implement them in the future.
> 
>   * Documenting how forcealign with extsize should behave with various FS
>     operations like fallocate, truncate, punch hole, insert/collapse range etc.
> 
>   * Implementing extsize with delayed allocation and the challenges there.
> 
>   * Discussing tooling support of forcealign like how are we planning to maintain
>     block alignment guarantees during fsck, resize and other times where we might
>     need to move blocks around?
> 
>   * Documenting any areas where FSes might differ in their implementations of the
>     same. Example, ext4 doesn't plan to support non power of 2 extsizes whereas
>     XFS has support for that.
> 
> Hopefully this discussion will be relevant in defining consistent semantics for
> extsize hints and forcealign which might as well come useful for other FS
> developers too.
> 
> Thoughts and suggestions are welcome.
> 
> References:
> [1] https://man7.org/linux/man-pages/man2/ioctl_xfs_fsgetxattr.2.html
> [2] https://lore.kernel.org/linux-xfs/20240813163638.3751939-1-john.g.garry@oracle.com/
> [3] https://lore.kernel.org/linux-ext4/cover.1733901374.git.ojaswin@linux.ibm.com/
> 
> Regards,
> ojaswin



* Re: [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
  2025-01-29  8:59 ` John Garry
@ 2025-01-29 16:06   ` Ojaswin Mujoo
  2025-01-30 14:08     ` John Garry
  0 siblings, 1 reply; 15+ messages in thread
From: Ojaswin Mujoo @ 2025-01-29 16:06 UTC (permalink / raw)
  To: John Garry
  Cc: lsf-pc, linux-xfs, linux-fsdevel, djwong, dchinner, hch,
	ritesh.list, jack, tytso, linux-ext4

On Wed, Jan 29, 2025 at 08:59:15AM +0000, John Garry wrote:
> On 29/01/2025 07:06, Ojaswin Mujoo wrote:
> 
> Hi Ojaswin,
> 
> > 
> > I would like to submit a proposal to discuss the design of extsize and
> > forcealign and various open questions around it.
> > 
> >   ** Background **
> > 
> > Modern NVMe/SCSI disks with atomic write capabilities can allow writes to a
> > multi-KB range on disk to go atomically. This feature has a wide variety of use
> > cases especially for databases like mysql and postgres that can leverage atomic
> > writes to gain significant performance. However, in order to enable atomic
> > writes on Linux, the underlying disk may have some size and alignment
> > constraints that the upper layers like filesystems should follow. extsize with
> > forcealign is one of the ways filesystems can make sure the IO submitted to the
> > disk adheres to the atomic writes constraints.
> > 
> > extsize is a hint to the FS to allocate extents at a certain logical alignment
> > and size. forcealign builds on this by forcing the allocator to enforce the
> > alignment guarantees for physical blocks as well, which is essential for atomic
> > writes.
> > 
> >   ** Points of discussion **
> > 
> > Extsize hints feature is already supported by XFS [1] with forcealign still
> > under development and discussion [2].
> 
> From
> https://lore.kernel.org/linux-xfs/20241212013433.GC6678@frogsfrogsfrogs/
> thread, the alternate solution to forcealign for XFS is to use a
> software-emulated fallback for unaligned atomic writes. I am looking at a
> PoC implementation now. Note that this does rely on CoW.
> 
> There has been push back on forcealign for XFS, so we need to prove/disprove
> that this software-emulated fallback can work, see
> https://lore.kernel.org/linux-xfs/20240924061719.GA11211@lst.de/
> 

Hey John,

Thanks for taking a look. I did go through the 2 series some time back.
I agree that there are some open challenges in getting the multi-block
atomic write interface correct, especially for mixed mappings, and this is
one of the main reasons we want to explore the exchange_range fallback
in case blocks are not aligned.

That being said, I believe forcealign as a feature still holds a lot 
of relevance as:

1. Right now, it is the only way to guarantee aligned blocks and hence
   guarantee that our atomic writes can always benefit from hardware atomic
   write support. IIUC DBs are not very keen on losing out on performance
   due to some writes going via the software fallback path.

2. Not all FSes support COW (major example being ext4) and hence it will
   be very difficult to have a software fallback in case the blocks are
   not aligned.

3. As pointed out in [1], even with exchange_range there is still value
   in having forcealign to find the new blocks to be exchanged.

I agree that forcealign is not the only way we can have atomic writes
work but I do feel there is value in having forcealign for FSes and
hence we should have a discussion around it so we can get the interface
right. 

Just to be clear, the intention of this proposal is to mainly discuss
forcealign as a feature. I am hoping there would be another different
proposal to discuss atomic writes and the plethora of other open
challenges there ;)

[1]  https://lore.kernel.org/linux-xfs/20250117182945.GH1611770@frogsfrogsfrogs/


> > After taking a look at ext4's multi-block
> > allocator design, supporting extsize with forcealign can be done in ext4 as
> > well. There is a RFC proposed which adds support for extsize hints feature in
> > ext4 [3]. However there are some caveats and deviations from XFS design. With
> > these in mind, I would like to propose LSFMM topic on:
> > 
> >   * exact semantics of extsize w/ forcealign which can bring a consistent
> >     interface among ext4 and xfs and possibly any other FS that plans to
> >     implement them in the future.
> > 
> >   * Documenting how forcealign with extsize should behave with various FS
> >     operations like fallocate, truncate, punch hole, insert/collapse range etc.
> > 
> >   * Implementing extsize with delayed allocation and the challenges there.
> > 
> >   * Discussing tooling support of forcealign like how are we planning to maintain
> >     block alignment guarantees during fsck, resize and other times where we might
> >     need to move blocks around?
> > 
> >   * Documenting any areas where FSes might differ in their implementations of the
> >     same. Example, ext4 doesn't plan to support non power of 2 extsizes whereas
> >     XFS has support for that.
> > 
> > Hopefully this discussion will be relevant in defining consistent semantics for
> > extsize hints and forcealign which might as well come useful for other FS
> > developers too.
> > 
> > Thoughts and suggestions are welcome.
> > 
> > References:
> > [1] https://man7.org/linux/man-pages/man2/ioctl_xfs_fsgetxattr.2.html 
> > [2] https://lore.kernel.org/linux-xfs/20240813163638.3751939-1-john.g.garry@oracle.com/ 
> > [3] https://lore.kernel.org/linux-ext4/cover.1733901374.git.ojaswin@linux.ibm.com/ 
> > 
> > Regards,
> > ojaswin
> 


* Re: [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
  2025-01-29 16:06   ` Ojaswin Mujoo
@ 2025-01-30 14:08     ` John Garry
  2025-02-01  7:12       ` Ojaswin Mujoo
  0 siblings, 1 reply; 15+ messages in thread
From: John Garry @ 2025-01-30 14:08 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: lsf-pc, linux-xfs, linux-fsdevel, djwong, dchinner, hch,
	ritesh.list, jack, tytso, linux-ext4

On 29/01/2025 16:06, Ojaswin Mujoo wrote:
> On Wed, Jan 29, 2025 at 08:59:15AM +0000, John Garry wrote:
>> On 29/01/2025 07:06, Ojaswin Mujoo wrote:
>>
>> Hi Ojaswin,
>>
>>>
>>> I would like to submit a proposal to discuss the design of extsize and
>>> forcealign and various open questions around it.
>>>
>>>    ** Background **
>>>
>>> Modern NVMe/SCSI disks with atomic write capabilities can allow writes to a
>>> multi-KB range on disk to go atomically. This feature has a wide variety of use
>>> cases especially for databases like mysql and postgres that can leverage atomic
>>> writes to gain significant performance. However, in order to enable atomic
>>> writes on Linux, the underlying disk may have some size and alignment
>>> constraints that the upper layers like filesystems should follow. extsize with
>>> forcealign is one of the ways filesystems can make sure the IO submitted to the
>>> disk adheres to the atomic writes constraints.
>>>
>>> extsize is a hint to the FS to allocate extents at a certain logical alignment
>>> and size. forcealign builds on this by forcing the allocator to enforce the
>>> alignment guarantees for physical blocks as well, which is essential for atomic
>>> writes.
>>>
>>>    ** Points of discussion **
>>>
>>> Extsize hints feature is already supported by XFS [1] with forcealign still
>>> under development and discussion [2].
>>
>> From
>> https://lore.kernel.org/linux-xfs/20241212013433.GC6678@frogsfrogsfrogs/
>> thread, the alternate solution to forcealign for XFS is to use a
>> software-emulated fallback for unaligned atomic writes. I am looking at a
>> PoC implementation now. Note that this does rely on CoW.
>>
>> There has been push back on forcealign for XFS, so we need to prove/disprove
>> that this software-emulated fallback can work, see
>> https://lore.kernel.org/linux-xfs/20240924061719.GA11211@lst.de/
>>
> 
> Hey John,
> 
> Thanks for taking a look. I did go through the 2 series sometime back.
> I agree that there are some open challenges in getting the multi block
> atomic write interface correct especially for mixed mappings and this is
> one of the main reasons we want to explore the exchange_range fallback
> in case blocks are not aligned.

Right, so for XFS I am looking at a CoW-based fallback for 
unaligned/mixed mapping atomic writes. I have no idea on how this could 
work for ext4.

> 
> That being said, I believe forcealign as a feature still holds a lot
> of relevance as:
> 
> 1. Right now, it is the only way to guarantee aligned blocks and hence
>     guarantee that our atomic writes can always benefit from hardware atomic
>     write support. IIUC DBs are not very keen on losing out on performance
>     due to some writes going via the software fallback path.

Sure, we need performance figures for this first.

> 
> 2. Not all FSes support COW (major example being ext4) and hence it will
>     be very difficult to have a software fallback in case the blocks are
> 	 not aligned.

Understood

> 
> 3. As pointed out in [1], even with exchange_range there is still value
>     in having forcealign to find the new blocks to be exchanged.

Yeah, again, we need performance figures.

For my test case, I am trying 16K atomic writes with 4K FS block size,
so I expect the software fallback to not kick in often after running the
system for a while (as eventually we will get aligned allocations). I
am concerned about the prospect of heavily fragmented files, though.
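
As a rough illustration of when the fallback would kick in for this case
(16K writes on 4K blocks, so 4 blocks per write), the sketch below walks a
made-up extent list and decides whether a write can go to the hardware
directly. The struct layout and function name are hypothetical, the write
is assumed to be naturally aligned in the file already, and extent state
(written vs. unwritten, i.e. the mixed-mapping problem) is ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One mapped extent, in filesystem blocks (hypothetical layout). */
struct extent {
	uint64_t lblk;	/* first logical block in the file */
	uint64_t pblk;	/* first physical block on disk */
	uint64_t len;	/* length in blocks */
};

/*
 * Can the nblks-block write starting at logical block lblk be issued as
 * one hardware atomic write? It must sit inside a single extent and
 * start at a physical block that is a multiple of nblks.
 */
static bool can_use_hw_atomic(const struct extent *map, size_t nr,
			      uint64_t lblk, uint64_t nblks)
{
	if (nblks == 0)
		return false;

	for (size_t i = 0; i < nr; i++) {
		const struct extent *e = &map[i];

		/* Skip extents that do not fully contain the write. */
		if (lblk < e->lblk || lblk + nblks > e->lblk + e->len)
			continue;
		/* Physical block backing lblk must be aligned. */
		return (e->pblk + (lblk - e->lblk)) % nblks == 0;
	}
	return false;	/* spans extents or a hole: software fallback */
}
```

A heavily fragmented file shows up here as many short extents, so more
writes either straddle an extent boundary or land on a misaligned
physical start and take the fallback path.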

> 
> I agree that forcealign is not the only way we can have atomic writes
> work but I do feel there is value in having forcealign for FSes and
> hence we should have a discussion around it so we can get the interface
> right.
> 

I thought that the interface for forcealign according to the candidate 
xfs implementation was quite straightforward, no?

What was not clear was the age-old issue of how to issue an atomic write 
of mixed extents, which is really an atomic write issue.

> Just to be clear, the intention of this proposal is to mainly discuss
> forcealign as a feature. I am hoping there would be another different
> proposal to discuss atomic writes and the plethora of other open
> challenges there ;)

Thanks,
John


* Re: [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
  2025-01-30 14:08     ` John Garry
@ 2025-02-01  7:12       ` Ojaswin Mujoo
  2025-02-04 12:20         ` John Garry
  0 siblings, 1 reply; 15+ messages in thread
From: Ojaswin Mujoo @ 2025-02-01  7:12 UTC (permalink / raw)
  To: John Garry
  Cc: lsf-pc, linux-xfs, linux-fsdevel, djwong, dchinner, hch,
	ritesh.list, jack, tytso, linux-ext4

On Thu, Jan 30, 2025 at 02:08:30PM +0000, John Garry wrote:
> On 29/01/2025 16:06, Ojaswin Mujoo wrote:
> > On Wed, Jan 29, 2025 at 08:59:15AM +0000, John Garry wrote:
> > > On 29/01/2025 07:06, Ojaswin Mujoo wrote:
> > > 
> > > Hi Ojaswin,
> > > 
> > > > 
> > > > I would like to submit a proposal to discuss the design of extsize and
> > > > forcealign and various open questions around it.
> > > > 
> > > >    ** Background **
> > > > 
> > > > Modern NVMe/SCSI disks with atomic write capabilities can allow writes to a
> > > > multi-KB range on disk to go atomically. This feature has a wide variety of use
> > > > cases especially for databases like mysql and postgres that can leverage atomic
> > > > writes to gain significant performance. However, in order to enable atomic
> > > > writes on Linux, the underlying disk may have some size and alignment
> > > > constraints that the upper layers like filesystems should follow. extsize with
> > > > forcealign is one of the ways filesystems can make sure the IO submitted to the
> > > > disk adheres to the atomic writes constraints.
> > > > 
> > > > extsize is a hint to the FS to allocate extents at a certain logical alignment
> > > > and size. forcealign builds on this by forcing the allocator to enforce the
> > > > alignment guarantees for physical blocks as well, which is essential for atomic
> > > > writes.
> > > > 
> > > >    ** Points of discussion **
> > > > 
> > > > Extsize hints feature is already supported by XFS [1] with forcealign still
> > > > under development and discussion [2].
> > > 
> > > From
> > > https://lore.kernel.org/linux-xfs/20241212013433.GC6678@frogsfrogsfrogs/ 
> > > thread, the alternate solution to forcealign for XFS is to use a
> > > software-emulated fallback for unaligned atomic writes. I am looking at a
> > > PoC implementation now. Note that this does rely on CoW.
> > > 
> > > There has been push back on forcealign for XFS, so we need to prove/disprove
> > > that this software-emulated fallback can work, see
> > > https://lore.kernel.org/linux-xfs/20240924061719.GA11211@lst.de/ 
> > > 
> > 
> > Hey John,
> > 
> > Thanks for taking a look. I did go through the 2 series sometime back.
> > I agree that there are some open challenges in getting the multi block
> > atomic write interface correct especially for mixed mappings and this is
> > one of the main reasons we want to explore the exchange_range fallback
> > in case blocks are not aligned.
> 
> Right, so for XFS I am looking at a CoW-based fallback for unaligned/mixed
> mapping atomic writes. I have no idea on how this could work for ext4.
> 
> > 
> > That being said, I believe forcealign as a feature still holds a lot
> > of relevance as:
> > 
> > 1. Right now, it is the only way to guarantee aligned blocks and hence
> > >     guarantee that our atomic writes can always benefit from hardware atomic
> >     write support. IIUC DBs are not very keen on losing out on performance
> >     due to some writes going via the software fallback path.
> 
> Sure, we need performance figures for this first.
> 
> > 
> > 2. Not all FSes support COW (major example being ext4) and hence it will
> > >     be very difficult to have a software fallback in case the blocks are
> > 	 not aligned.
> 
> Understood
> 
> > 
> > 3. As pointed out in [1], even with exchange_range there is still value
> >     in having forcealign to find the new blocks to be exchanged.
> 
> Yeah, again, we need performance figures.
> 
> For my test case, I am trying 16K atomic writes with 4K FS block size, so I
> expect the software fallback to not kick in often after running the system
> for a while (as eventually we will get an aligned allocations). I am
> concerned of prospect of heavily fragmented files, though.

Yes that's true, if the FS is up long enough there is bound to be
fragmentation eventually which might make it harder for extsize to
get the blocks.

With software fallback, there's again the point that many FSes will need
some sort of COW/exchange_range support before they can support anything
like that. 

Although I've not looked at what it would take to add that to
ext4, I'm assuming it will not be trivial at all.

> 
> > 
> > I agree that forcealign is not the only way we can have atomic writes
> > work but I do feel there is value in having forcealign for FSes and
> > hence we should have a discussion around it so we can get the interface
> > right.
> > 
> 
> I thought that the interface for forcealign according to the candidate xfs
> implementation was quite straightforward. no?

As mentioned in the original proposal, there are still open problems
around extsize and forcealign:

- The allocation and deallocation semantics are not completely clear to
  me. For example, we allow operations like unaligned punch_hole but not
  unaligned insert and collapse range, and I couldn't see that
  documented anywhere.

- There are challenges in extsize with delayed allocation as well as how
  the tooling should handle forcealigned inodes.

- How are FSes supposed to behave when forcealign/extsize is used with
  other FS features that change the allocation granularity, like bigalloc
  or rtvol?

I agree that XFS's implementation is a good reference but I'm
sure as I continue working on the same from the ext4 perspective we will have
more points of discussion. So I definitely feel that it's worth
discussing this at LSFMM.

> 
> What was not clear was the age-old issue of how to issue an atomic write of
> mixed extents, which is really an atomic write issue.

Right, btw are you planning any talk for atomic writes at LSFMM?

Regards,
ojaswin

> 
> > Just to be clear, the intention of this proposal is to mainly discuss
> > forcealign as a feature. I am hoping there would be another different
> > proposal to discuss atomic writes and the plethora of other open
> > challenges there ;)
> 
> Thanks,
> John


* Re: [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
  2025-02-01  7:12       ` Ojaswin Mujoo
@ 2025-02-04 12:20         ` John Garry
  2025-02-04 20:12           ` Dave Chinner
  2025-02-07  6:08           ` Ojaswin Mujoo
  0 siblings, 2 replies; 15+ messages in thread
From: John Garry @ 2025-02-04 12:20 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: lsf-pc, linux-xfs, linux-fsdevel, djwong, dchinner, hch,
	ritesh.list, jack, tytso, linux-ext4

On 01/02/2025 07:12, Ojaswin Mujoo wrote:

Hi Ojaswin,

>> For my test case, I am trying 16K atomic writes with 4K FS block size, so I
>> expect the software fallback to not kick in often after running the system
>> for a while (as eventually we will get an aligned allocations). I am
>> concerned of prospect of heavily fragmented files, though.
> Yes that's true, if the FS is up long enough there is bound to be
> fragmentation eventually which might make it harder for extsize to
> get the blocks.
> 
> With software fallback, there's again the point that many FSes will need
> some sort of COW/exchange_range support before they can support anything
> like that.
> 
> Although I've not looked at what it will take to add that to
> ext4 but I'm assuming it will not be trivial at all.

Sure, but then again you may not have issues with getting forcealign 
support accepted for ext4. However, I would have thought that bigalloc 
was good enough to use initially.

> 
>>> I agree that forcealign is not the only way we can have atomic writes
>>> work but I do feel there is value in having forcealign for FSes and
>>> hence we should have a discussion around it so we can get the interface
>>> right.
>>>
>> I thought that the interface for forcealign according to the candidate xfs
>> implementation was quite straightforward. no?
> As mentioned in the original proposal, there are still open problems
> around extsize and forcealign.
> 
> - The allocation and deallocation semantics are not completely clear to
> 	me for example we allow operations like unaligned punch_hole but not
> 	unaligned insert and collapse range, and I couldn't see that
> 	documented anywhere.

For xfs, we were imposing the same restrictions that we have for 
rtextsize > 1.

If you check the following:
https://lore.kernel.org/linux-xfs/20240813163638.3751939-9-john.g.garry@oracle.com/

You can see how the large allocunit value is affected by forcealign, and 
then check callers of xfs_is_falloc_aligned() -> 
xfs_inode_alloc_unitsize() to see how this affects some fallocate modes.
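
For readers without the tree at hand, the kind of check being referred to
can be approximated in userspace as below. This is only a sketch of the
idea, not the XFS code: the function name is hypothetical and the
allocation unit is passed in as a plain byte count rather than derived
from the inode.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Approximation of an "is this fallocate range aligned to the inode's
 * allocation unit?" check. alloc_unit is in bytes; with forcealign it
 * would be the extsize rather than a single filesystem block.
 */
static bool falloc_range_aligned(uint64_t pos, uint64_t len,
				 uint64_t alloc_unit)
{
	if (alloc_unit == 0)
		return false;
	/* Non-power-of-2 units need a real modulo on both values. */
	if (alloc_unit & (alloc_unit - 1))
		return pos % alloc_unit == 0 && len % alloc_unit == 0;
	/* Power-of-2 units can test both values with one mask. */
	return ((pos | len) & (alloc_unit - 1)) == 0;
}
```

A mode such as collapse range or insert range would reject the request
when this returns false, whereas punch hole is allowed to proceed on an
unaligned range.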

> 
> - There are challenges in extsize with delayed allocation as well as how
> 	the tooling should handle forcealigned inodes.

Yeah, maybe. I was only testing my xfs forcealign solution for dio (and 
no delayed alloc).

> 
> - How are FSes supposed to behave when forcealign/extsize is used with
> 	other FS features that change the allocation granularity like bigalloc
> 	or rtvol.

As you would expect, they need to be aligned with one another.

For example, in the case of xfs rtvol, rextsize needs to be a multiple 
of extsize when forcealign is enabled. Or the other way around, I forget 
now..

> 
> I agree that XFS's implementation is a good reference but I'm
> sure as I continue working on the same from ext4 perspective we will have
> more points of discussion. So I definitely feel that it's worth
> discussing this at LSFMM.

Understood, but I will wait to see where my CoW-based method for XFS 
goes before commenting on what needs to be discussed for xfs.

> 
>> What was not clear was the age-old issue of how to issue an atomic write of
>> mixed extents, which is really an atomic write issue.
> Right, btw are you planning any talk for atomic writes at LSFMM?

I hadn't planned on it, but I guess that Martin will add something to 
the agenda.

Thanks,
John



* Re: [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
  2025-02-04 12:20         ` John Garry
@ 2025-02-04 20:12           ` Dave Chinner
  2025-02-07  6:08           ` Ojaswin Mujoo
  1 sibling, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2025-02-04 20:12 UTC (permalink / raw)
  To: John Garry
  Cc: Ojaswin Mujoo, lsf-pc, linux-xfs, linux-fsdevel, djwong, dchinner,
	hch, ritesh.list, jack, tytso, linux-ext4

On Tue, Feb 04, 2025 at 12:20:25PM +0000, John Garry wrote:
> On 01/02/2025 07:12, Ojaswin Mujoo wrote:
> 
> Hi Ojaswin,
> 
> > > For my test case, I am trying 16K atomic writes with 4K FS block size, so I
> > > expect the software fallback to not kick in often after running the system
> > > for a while (as eventually we will get an aligned allocations). I am
> > > concerned of prospect of heavily fragmented files, though.
> > Yes that's true, if the FS is up long enough there is bound to be
> > fragmentation eventually which might make it harder for extsize to
> > get the blocks.
> > 
> > With software fallback, there's again the point that many FSes will need
> > some sort of COW/exchange_range support before they can support anything
> > like that.
> > 
> > Although I've not looked at what it will take to add that to
> > ext4 but I'm assuming it will not be trivial at all.
> 
> Sure, but then again you may not have issues with getting forcealign support
> accepted for ext4. However, I would have thought that bigalloc was good
> enough to use initially.
> 
> > 
> > > > I agree that forcealign is not the only way we can have atomic writes
> > > > work but I do feel there is value in having forcealign for FSes and
> > > > hence we should have a discussion around it so we can get the interface
> > > > right.
> > > > 
> > > I thought that the interface for forcealign according to the candidate xfs
> > > implementation was quite straightforward. no?
> > As mentioned in the original proposal, there are still open problems
> > around extsize and forcealign.
> > 
> > - The allocation and deallocation semantics are not completely clear to
> > 	me for example we allow operations like unaligned punch_hole but not
> > 	unaligned insert and collapse range, and I couldn't see that
> > 	documented anywhere.
> 
> For xfs, we were imposing the same restrictions as which we have for
> rtextsize > 1.
> 
> If you check the following:
> https://lore.kernel.org/linux-xfs/20240813163638.3751939-9-john.g.garry@oracle.com/
> 
> You can see how the large allocunit value is affected by forcealign, and
> then check callers of xfs_is_falloc_aligned() -> xfs_inode_alloc_unitsize()
> to see how this affects some fallocate modes.
> 
> > 
> > - There are challenges in extsize with delayed allocation as well as how
> > 	the tooling should handle forcealigned inodes.
> 
> Yeah, maybe. I was only testing my xfs forcealign solution for dio (and no
> delayed alloc).

XFS turns off delalloc when extsize hints are set. See
xfs_buffered_write_iomap_begin() - it starts with:

	/* we can't use delayed allocations when using extent size hints */
	if (xfs_get_extsz_hint(ip))
		return xfs_direct_write_iomap_begin(inode, offset, count,
				flags, iomap, srcmap);

and so it treats the allocation like a direct IO write and so
force-align should work with buffered writes as expected.

This delalloc constraint is a historic relic in XFS - now that we
use unwritten extents for delalloc we -could- use delalloc with
extsize hints; it just requires the delalloc extents to be aligned
to extsize hints.
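
To make the alignment requirement concrete, here is a minimal sketch in
plain C of widening a block range so that both ends land on extsize
boundaries. It is an illustration only, not XFS code:
align_range_to_extsize() and its parameters are hypothetical names, and
all values are assumed to be in filesystem blocks.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not an XFS function): widen the block range
 * [*start, *start + *len) so that both ends fall on extsize boundaries.
 */
static void align_range_to_extsize(uint64_t *start, uint64_t *len,
				   uint64_t extsize)
{
	uint64_t end;

	assert(extsize > 0);
	end = *start + *len;
	/* Round the start down to the previous extsize boundary. */
	*start -= *start % extsize;
	/* Round the end up to the next extsize boundary. */
	end = (end + extsize - 1) / extsize * extsize;
	*len = end - *start;
}
```

For example, with extsize = 4 a request for blocks [5, 11) is widened
to [4, 12), i.e. start 4 and length 8.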

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
  2025-02-04 12:20         ` John Garry
  2025-02-04 20:12           ` Dave Chinner
@ 2025-02-07  6:08           ` Ojaswin Mujoo
  2025-02-07 12:01             ` John Garry
  1 sibling, 1 reply; 15+ messages in thread
From: Ojaswin Mujoo @ 2025-02-07  6:08 UTC (permalink / raw)
  To: John Garry
  Cc: lsf-pc, linux-xfs, linux-fsdevel, djwong, dchinner, hch,
	ritesh.list, jack, tytso, linux-ext4

On Tue, Feb 04, 2025 at 12:20:25PM +0000, John Garry wrote:
> On 01/02/2025 07:12, Ojaswin Mujoo wrote:
> 
> Hi Ojaswin,
> 
> > > For my test case, I am trying 16K atomic writes with 4K FS block size, so I
> > > expect the software fallback to not kick in often after running the system
> > > for a while (as eventually we will get an aligned allocations). I am
> > > concerned of prospect of heavily fragmented files, though.
> > Yes that's true, if the FS is up long enough there is bound to be
> > fragmentation eventually which might make it harder for extsize to
> > get the blocks.
> > 
> > With software fallback, there's again the point that many FSes will need
> > some sort of COW/exchange_range support before they can support anything
> > like that.
> > 
> > Although I;ve not looked at what it will take to add that to
> > ext4 but I'm assuming it will not be trivial at all.
> 
> Sure, but then again you may not have issues with getting forcealign support
> accepted for ext4. However, I would have thought that bigalloc was good
> enough to use initially.

Yes, bigalloc is indeed good enough as a start, but eventually
something like forcealign will be beneficial, as not everyone prefers an
FS-wide cluster-size allocation granularity.

We do have a patch for atomic writes with bigalloc that was sent way
back in mid 2024 but then we went into the same discussion of mixed
mapping[1].

Hmm I think it might be time to revisit that and see if we can do
something better there.

[1] https://lore.kernel.org/linux-ext4/37baa9f4c6c2994df7383d8b719078a527e521b9.1729825985.git.ritesh.list@gmail.com/
> 
> > 
> > > > I agree that forcealign is not the only way we can have atomic writes
> > > > work but I do feel there is value in having forcealign for FSes and
> > > > hence we should have a discussion around it so we can get the interface
> > > > right.
> > > > 
> > > I thought that the interface for forcealign according to the candidate xfs
> > > implementation was quite straightforward. no?
> > As mentioned in the original proposal, there are still a open problems
> > around extsize and forcealign.
> > 
> > - The allocation and deallocation semantics are not completely clear to
> > 	me for example we allow operations like unaligned punch_hole but not
> > 	unaligned insert and collapse range, and I couldn't see that
> > 	documented anywhere.
> 
> For xfs, we were imposing the same restrictions as which we have for
> rtextsize > 1.
> 
> If you check the following:
> https://lore.kernel.org/linux-xfs/20240813163638.3751939-9-john.g.garry@oracle.com/
> 
> You can see how the large allocunit value is affected by forcealign, and
> then check callers of xfs_is_falloc_aligned() -> xfs_inode_alloc_unitsize()
> to see how this affects some fallocate modes.

True, but it's something that just implicitly happens when we use
forcealign. I eventually found this out while testing forcealign with
different operations, but such things can come as a surprise to users,
especially when we allow some operations to be unaligned and then
reject other similar ones.

punch_hole/collapse_range is just an example and yes it might not be
very important to support unaligned collapse range but in the long run
it would be good to have these things documented/discussed.
> 
> > 
> > - There are challenges in extsize with delayed allocation as well as how
> > 	the tooling should handle forcealigned inodes.
> 
> Yeah, maybe. I was only testing my xfs forcealign solution for dio (and no
> delayed alloc).
> 
> > 
> > - How are FSes supposed to behave when forcealign/extsize is used with
> > 	other FS features that change the allocation granularity like bigalloc
> > 	or rtvol.
> 
> As you would expect, they need to be aligned with one another.
> 
> For example, in the case of xfs rtvol, rextsize needs to be a multiple of
> extsize when forcealign is enabled. Or the other way around, I forget now..
> 
> > 
> > I agree that XFS's implementation is a good reference but I'm
> > sure as I continue working on the same from ext4 perspective we will have
> > more points of discussion. So I definitely feel that its worth
> > discussing this at LSFMM.
> 
> Understood, but I wait to see what happens to my CoW-based method for XFS to
> see where that goes before commenting on what needs to be discussed for xfs

Got it.
> 
> > 
> > > What was not clear was the age-old issue of how to issue an atomic write of
> > > mixed extents, which is really an atomic write issue.
> > Right, btw are you planning any talk for atomic writes at LSFMM?
> 
> I hadn't planned on it, but I guess that Martin will add something to the
> agenda.
> 
> Thanks,
> John
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
  2025-02-07  6:08           ` Ojaswin Mujoo
@ 2025-02-07 12:01             ` John Garry
  2025-02-08 17:05               ` Ojaswin Mujoo
  0 siblings, 1 reply; 15+ messages in thread
From: John Garry @ 2025-02-07 12:01 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: lsf-pc, linux-xfs, linux-fsdevel, djwong, dchinner, hch,
	ritesh.list, jack, tytso, linux-ext4


> Yes, bigalloc is indeed good enough as a start but yes eventually
> something like forcealign will be beneficial as not everyone prefers an
> FS-wide cluster-size allocation granularity.
> 
> We do have a patch for atomic writes with bigalloc that was sent way
> back in mid 2024 but then we went into the same discussion of mixed
> mapping[1].
> 
> Hmm I think it might be time to revisit that and see if we can do
> something better there.
> 
> [1] https://lore.kernel.org/linux-ext4/37baa9f4c6c2994df7383d8b719078a527e521b9.1729825985.git.ritesh.list@gmail.com/

Feel free to pick up the iomap patches I had for zeroing when trying to 
atomic write mixed mappings - that's in my v3 series IIRC.

But you might still get some push back on them...

>>
>>>
>>>>> I agree that forcealign is not the only way we can have atomic writes
>>>>> work but I do feel there is value in having forcealign for FSes and
>>>>> hence we should have a discussion around it so we can get the interface
>>>>> right.
>>>>>
>>>> I thought that the interface for forcealign according to the candidate xfs
>>>> implementation was quite straightforward. no?
>>> As mentioned in the original proposal, there are still a open problems
>>> around extsize and forcealign.
>>>
>>> - The allocation and deallocation semantics are not completely clear to
>>> 	me for example we allow operations like unaligned punch_hole but not
>>> 	unaligned insert and collapse range, and I couldn't see that
>>> 	documented anywhere.
>>
>> For xfs, we were imposing the same restrictions as which we have for
>> rtextsize > 1.
>>
>> If you check the following:
>> https://lore.kernel.org/linux-xfs/20240813163638.3751939-9-john.g.garry@oracle.com/
>>
>> You can see how the large allocunit value is affected by forcealign, and
>> then check callers of xfs_is_falloc_aligned() -> xfs_inode_alloc_unitsize()
>> to see how this affects some fallocate modes.
> 
> True, but it's something that just implicitly happens when we use
> forcealign. I eventually found out while testing forcealign with
> different operations but such things can come as a surprise to users
> especially when we support some operations to be unaligned and then
> reject some other similar ones.
> 
> punch_hole/collapse_range is just an example and yes it might not be
> very important to support unaligned collapse range but in the long run
> it would be good to have these things documented/discussed.

Maybe the man pages can document the punch hole/collapse range
behaviour for forcealign/rtextsize > 1 - at a quick glance, I could not
see anything. Indeed, I am not sure how bigalloc affects punch
hole/collapse range either.

Thanks,
John

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
  2025-02-07 12:01             ` John Garry
@ 2025-02-08 17:05               ` Ojaswin Mujoo
  0 siblings, 0 replies; 15+ messages in thread
From: Ojaswin Mujoo @ 2025-02-08 17:05 UTC (permalink / raw)
  To: John Garry
  Cc: lsf-pc, linux-xfs, linux-fsdevel, djwong, dchinner, hch,
	ritesh.list, jack, tytso, linux-ext4

On Fri, Feb 07, 2025 at 12:01:32PM +0000, John Garry wrote:
> 
> > Yes, bigalloc is indeed good enough as a start but yes eventually
> > something like forcealign will be beneficial as not everyone prefers an
> > FS-wide cluster-size allocation granularity.
> > 
> > We do have a patch for atomic writes with bigalloc that was sent way
> > back in mid 2024 but then we went into the same discussion of mixed
> > mapping[1].
> > 
> > Hmm I think it might be time to revisit that and see if we can do
> > something better there.
> > 
> > [1] https://lore.kernel.org/linux-ext4/37baa9f4c6c2994df7383d8b719078a527e521b9.1729825985.git.ritesh.list@gmail.com/ 
> 
> Feel free to pick up the iomap patches I had for zeroing when trying to
> atomic write mixed mappings - that's in my v3 series IIRC.

Thanks I'll give it a try.
> 
> But you might still get some push back on them...

Right, it would be good if we can all come to a consensus on what to do
if an FS wants to implement something like forcealign for atomic writes
but does not have a way to implement software fallback. As I see it,
there seem to be 2 (un)popular options:

1. Reject atomic writes on mixed mappings. This is not user space
   friendly but is the simplest to implement.

2. Zero out the unwritten part of the mapping and convert to a single
   mapping before performing the IO.

Both options have their shortcomings but I think 2 is still okay. I
believe that's the path we've taken in the latest XFS patches, right?
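
For illustration, the two options can be modelled on an array of
per-block states. This is a toy userspace sketch with made-up names
(blk_state, mixed_policy, prepare_atomic_range), not kernel code. In a
real filesystem option 2 also has to write zeroes to disk and update
the extent tree, which is omitted here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative block states; these are not kernel definitions. */
enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_MAPPED };
enum mixed_policy { POLICY_REJECT, POLICY_ZERO_OUT };

/*
 * Returns 0 when the range can back a single atomic write, or
 * -EINVAL when the policy rejects a mixed mapping.
 */
static int prepare_atomic_range(enum blk_state *blks, size_t n,
				enum mixed_policy policy)
{
	size_t i;
	int mixed = 0;

	assert(blks != NULL && n > 0);

	for (i = 1; i < n; i++)
		if (blks[i] != blks[0])
			mixed = 1;

	if (!mixed)
		return 0;	/* one mapping type throughout */
	if (policy == POLICY_REJECT)
		return -EINVAL;	/* option 1: refuse the write */

	/* Option 2: holes and unwritten blocks become zeroed, mapped blocks. */
	for (i = 0; i < n; i++)
		blks[i] = BLK_MAPPED;
	return 0;
}
```

Under POLICY_REJECT a range of {mapped, unwritten, hole, mapped} fails
with -EINVAL, while under POLICY_ZERO_OUT the same range ends up fully
mapped.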

> 
> > > 
> > > > 
> > > > > > I agree that forcealign is not the only way we can have atomic writes
> > > > > > work but I do feel there is value in having forcealign for FSes and
> > > > > > hence we should have a discussion around it so we can get the interface
> > > > > > right.
> > > > > > 
> > > > > I thought that the interface for forcealign according to the candidate xfs
> > > > > implementation was quite straightforward. no?
> > > > As mentioned in the original proposal, there are still a open problems
> > > > around extsize and forcealign.
> > > > 
> > > > - The allocation and deallocation semantics are not completely clear to
> > > > 	me for example we allow operations like unaligned punch_hole but not
> > > > 	unaligned insert and collapse range, and I couldn't see that
> > > > 	documented anywhere.
> > > 
> > > For xfs, we were imposing the same restrictions as which we have for
> > > rtextsize > 1.
> > > 
> > > If you check the following:
> > > https://lore.kernel.org/linux-xfs/20240813163638.3751939-9-john.g.garry@oracle.com/ 
> > > 
> > > You can see how the large allocunit value is affected by forcealign, and
> > > then check callers of xfs_is_falloc_aligned() -> xfs_inode_alloc_unitsize()
> > > to see how this affects some fallocate modes.
> > 
> > True, but it's something that just implicitly happens when we use
> > forcealign. I eventually found out while testing forcealign with
> > different operations but such things can come as a surprise to users
> > especially when we support some operations to be unaligned and then
> > reject some other similar ones.
> > 
> > punch_hole/collapse_range is just an example and yes it might not be
> > very important to support unaligned collapse range but in the long run
> > it would be good to have these things documented/discussed.
> 
> Maybe the man pages can be documented for forcealign/rtextsize > 1 punch
> holes/collapse behaviour - at a quick glance, I could not see anything.

Yep sounds good.

> Indeed, I am not sure how bigalloc affects punch holes/collapse range
> either.

Yeah, I think even bigalloc has similar behavior: it disallows
unaligned insert/collapse range but allows unaligned punch hole.
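
A minimal sketch of that rule as a predicate, in plain C. The names are
illustrative, and alloc_unit is assumed to be the cluster size
(bigalloc) or the forced alignment (forcealign) in bytes. Real
filesystems perform these checks inside their fallocate paths, so treat
this only as a model of the behaviour described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative operation names; not the fallocate() mode flags. */
enum range_op { OP_PUNCH_HOLE, OP_COLLAPSE_RANGE, OP_INSERT_RANGE };

static bool range_op_allowed(enum range_op op, uint64_t offset,
			     uint64_t len, uint64_t alloc_unit)
{
	assert(alloc_unit > 0);

	/* An unaligned punch hole is accepted. */
	if (op == OP_PUNCH_HOLE)
		return true;

	/* Collapse and insert shift extents, so both ends must be aligned. */
	return offset % alloc_unit == 0 && len % alloc_unit == 0;
}
```

With a 16384-byte allocation unit, a collapse range at offset 4096 is
rejected, while a punch hole at the same offset is allowed.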
> 
> Thanks,
> John

Regards,
ojaswin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFCv1 0/1] EXT4 support of multi-fsblock atomic write with bigalloc
  2025-01-29  7:06 [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes Ojaswin Mujoo
  2025-01-29  8:59 ` John Garry
@ 2025-03-23  7:00 ` Ritesh Harjani (IBM)
  2025-03-23  7:00   ` [RFCv1 1/1] ext4: Add multi-fsblock atomic write support " Ritesh Harjani (IBM)
  2025-03-23  7:02   ` [RFCv1 0/1] EXT4 support of multi-fsblock atomic write " Ritesh Harjani (IBM)
  1 sibling, 2 replies; 15+ messages in thread
From: Ritesh Harjani (IBM) @ 2025-03-23  7:00 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, John Garry, djwong, linux-xfs, Theodore Ts'o,
	Ojaswin Mujoo, Ritesh Harjani (IBM)

This is an RFC patch before LSFMM to preview what multi-fsblock atomic write
support with bigalloc looks like. There is scope for improvement in the
implementation, but this shows the general idea of the design. More details
are provided in the actual patch. There are still TODOs and more testing is
needed. But with the iomap limitation of single-fsblock atomic writes now
lifted, the patch has definitely started to look better.

This is based on the vfs.all tree [1] for 6.15, which now has the necessary
iomap changes required for the bigalloc support in ext4.

TODOs:
1. Add better testcases to test atomic write support with bigalloc.
2. Discuss the approach of keeping the jbd2 txn open while zeroing the short
   underlying unwritten extents or short holes to create a single mapped
   extent. This should anyway be a non-performance-critical path.
3. We use ext4_map_blocks() in a loop instead of modifying the block allocator.
   Again, since it's a non-performance-sensitive path, hopefully that should
   be ok? Otherwise one can argue about why we take and release
   EXT4_I(inode)->i_data_sem multiple times. We won't take & release any group
   lock for this, since we know that with bigalloc the cluster is anyway
   available to us.
4. Once we start supporting files/inodes marked with an atomic writes
   attribute, maybe we can add some optimizations like zeroing out the entire
   underlying cluster when someone forcefully wants to fzero or fpunch an
   underlying disk block, to keep the mapped extent intact.
5. Stress testing of this through fsx and xfstests is still pending.

Reviews are appreciated.

[1]: https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.all&id=4f76518956c037517a4e4b120186075d3afb8266

Ritesh Harjani (IBM) (1):
  ext4: Add atomic write support for bigalloc

 fs/ext4/inode.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/ext4/super.c |  8 +++--
 2 files changed, 93 insertions(+), 5 deletions(-)

--
2.48.1

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFCv1 1/1] ext4: Add multi-fsblock atomic write support with bigalloc
  2025-03-23  7:00 ` [RFCv1 0/1] EXT4 support of multi-fsblock atomic write with bigalloc Ritesh Harjani (IBM)
@ 2025-03-23  7:00   ` Ritesh Harjani (IBM)
  2025-03-23  7:02     ` Ritesh Harjani (IBM)
  2025-03-23  7:02   ` [RFCv1 0/1] EXT4 support of multi-fsblock atomic write " Ritesh Harjani (IBM)
  1 sibling, 1 reply; 15+ messages in thread
From: Ritesh Harjani (IBM) @ 2025-03-23  7:00 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, John Garry, djwong, linux-xfs, Theodore Ts'o,
	Ojaswin Mujoo, Ritesh Harjani (IBM)

EXT4 supports the bigalloc feature, which allows the FS to work in units of
clusters (groups of blocks) rather than individual blocks. This patch
adds atomic write support for bigalloc so that systems with bs = ps can
also create an FS using:
    mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>

With bigalloc, ext4 can support multi-fsblock atomic writes. We have to
adjust ext4's atomic write unit max value to the cluster size. ext4 can then
support atomic writes of size anywhere in [blocksize, clustersize].
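
As a worked example of the resulting limits, the sketch below mirrors
the min/max computation in plain userspace C. The struct and function
names are made up for illustration, and the device limits passed in are
assumed values, not numbers read from a real device.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the filesystem's atomic write unit limits. */
struct awu_limits {
	unsigned int min;
	unsigned int max;
	bool enabled;
};

static struct awu_limits compute_awu(unsigned int blocksize,
				     unsigned int clustersize,
				     unsigned int bdev_min,
				     unsigned int bdev_max)
{
	struct awu_limits l;

	assert(clustersize >= blocksize);

	/* The FS cannot go below one block or below the device minimum. */
	l.min = blocksize > bdev_min ? blocksize : bdev_min;
	/* The FS cannot exceed one cluster or the device maximum. */
	l.max = clustersize < bdev_max ? clustersize : bdev_max;
	l.enabled = l.min && l.max && l.min <= l.max;
	return l;
}
```

With 4096-byte blocks, 16384-byte clusters and an assumed device range
of [512, 65536], this yields min = 4096 and max = 16384.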

We first query the underlying region of the requested range by calling
ext4_map_blocks(). Here are the various cases which we then handle
for block allocation depending upon the underlying mapping type:
1. If the underlying region for the entire requested range is a mapped extent,
   then we don't call ext4_map_blocks() to allocate anything. We don't even
   need to start the jbd2 txn in this case.
2. For an append write case, we create a mapped extent.
3. If the underlying region is entirely a hole, then we create an unwritten
   extent for the requested range.
4. If the underlying region is a large unwritten extent, then we split the
   extent into 2 unwritten extents of the required size.
5. If the underlying region has any type of mixed mapping, then we call
   ext4_map_blocks() in a loop to zero out the unwritten and the hole regions
   within the requested range. This then provides a single mapped extent type
   mapping for the requested range.

Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO
flag only when the underlying extent mapping of the requested range is
not entirely a hole, an unwritten extent, or a fully mapped extent. That
is, if the underlying region contains a mix of hole(s), unwritten
extent(s), and mapped extent(s), we use this loop to ensure that all the
short mappings are zeroed out. This guarantees that the entire requested
range becomes a single, uniformly mapped extent. It is ok to do so
because we know this is being done on a bigalloc enabled filesystem
where the block bitmap represents the entire cluster unit.
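
The case analysis above can be summarised as a small decision function.
This is a simplified userspace model, not the patch itself: the enum
and function names are hypothetical, found_len stands for the length in
blocks of the first mapping (or hole) found at the start of the
requested range, and beyond_eof stands for the i_size check.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative names only; these are not ext4 definitions. */
enum map_type { MAP_HOLE, MAP_UNWRITTEN, MAP_MAPPED };
enum alloc_action {
	ACT_NONE,		/* case 1: already one mapped extent */
	ACT_CREATE_MAPPED,	/* case 2: append write */
	ACT_CREATE_UNWRITTEN,	/* cases 3 and 4: hole or unwritten extent */
	ACT_ZERO_MIXED,		/* case 5: allocate and zero in a loop */
};

static enum alloc_action pick_alloc_action(enum map_type type,
					   unsigned int found_len,
					   unsigned int want_len,
					   bool beyond_eof)
{
	assert(want_len > 0);

	if (beyond_eof)
		return ACT_CREATE_MAPPED;

	/* One mapping type covers the whole requested range. */
	if (found_len >= want_len)
		return type == MAP_MAPPED ? ACT_NONE : ACT_CREATE_UNWRITTEN;

	/* The first mapping is too short, so the range is mixed. */
	return ACT_ZERO_MIXED;
}
```

For a 4-block request inside i_size, a 2-block mapped extent at the
start of the range selects ACT_ZERO_MIXED, while a 16-block unwritten
extent selects ACT_CREATE_UNWRITTEN.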

Cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext4/inode.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/ext4/super.c |  8 +++--
 2 files changed, 93 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d04d8a7f12e7..0096a597ad04 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3332,6 +3332,67 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
 		iomap->addr = IOMAP_NULL_ADDR;
 	}
 }
+/*
+ * ext4_map_blocks_atomic: Helper routine to ensure the entire requested mapping
+ * [map.m_lblk, map.m_len] is one single contiguous extent with no mixed
+ * mappings. This function is only called when the bigalloc is enabled, so we
+ * know that the allocated physical extent start is always aligned properly.
+ *
+ * We call EXT4_GET_BLOCKS_ZERO only when the underlying physical extent for the
+ * requested range does not have a single mapping type (Hole, Mapped, or
+ * Unwritten) throughout. In that case we will loop over the requested range to
+ * allocate and zero out the unwritten / holes in between, to get a single
+ * mapped extent from [m_lblk, m_len]. This case is mostly non-performance
+ * critical path, so it should be ok to loop using ext4_map_blocks() with
+ * appropriate flags to allocate & zero the underlying short holes/unwritten
+ * extents within the requested range.
+ */
+static int ext4_map_blocks_atomic(handle_t *handle, struct inode *inode,
+				  struct ext4_map_blocks *map)
+{
+	ext4_lblk_t m_lblk = map->m_lblk;
+	unsigned int m_len = map->m_len;
+	unsigned int mapped_len = 0, flags = 0;
+	u8 blkbits = inode->i_blkbits;
+	int ret;
+
+	WARN_ON(!ext4_has_feature_bigalloc(inode->i_sb));
+
+	ret = ext4_map_blocks(handle, inode, map, 0);
+	if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
+		flags = EXT4_GET_BLOCKS_CREATE;
+	else if ((ret == 0 && map->m_len >= m_len) ||
+		(ret >= m_len && map->m_flags & EXT4_MAP_UNWRITTEN))
+		flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
+	else
+		flags = EXT4_GET_BLOCKS_CREATE_ZERO;
+
+	do {
+		ret = ext4_map_blocks(handle, inode, map, flags);
+		if (ret < 0)
+			return ret;
+		mapped_len += map->m_len;
+		map->m_lblk += map->m_len;
+		map->m_len = m_len - mapped_len;
+	} while (mapped_len < m_len);
+
+	map->m_lblk = m_lblk;
+	map->m_len = m_len;
+
+	/*
+	 * We might have done some work in above loop. Let's ensure we query the
+	 * start of the physical extent, based on the origin m_lblk and m_len
+	 * and also ensure we were able to allocate the required range for doing
+	 * atomic write.
+	 */
+	ret = ext4_map_blocks(handle, inode, map, 0);
+	if (ret != m_len) {
+		ext4_warning_inode(inode, "allocation failed for atomic write request pos:%u, len:%u\n",
+				m_lblk, m_len);
+		return -EINVAL;
+	}
+	return mapped_len;
+}

 static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
 			    unsigned int flags)
@@ -3377,7 +3438,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
 	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
 		m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;

-	ret = ext4_map_blocks(handle, inode, map, m_flags);
+	if (flags & IOMAP_ATOMIC && ext4_has_feature_bigalloc(inode->i_sb))
+		ret = ext4_map_blocks_atomic(handle, inode, map);
+	else
+		ret = ext4_map_blocks(handle, inode, map, m_flags);

 	/*
 	 * We cannot fill holes in indirect tree based inodes as that could
@@ -3401,6 +3465,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	int ret;
 	struct ext4_map_blocks map;
 	u8 blkbits = inode->i_blkbits;
+	unsigned int m_len_orig;

 	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
 		return -EINVAL;
@@ -3414,6 +3479,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	map.m_lblk = offset >> blkbits;
 	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
 			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
+	m_len_orig = map.m_len;

 	if (flags & IOMAP_WRITE) {
 		/*
@@ -3424,8 +3490,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 		 */
 		if (offset + length <= i_size_read(inode)) {
 			ret = ext4_map_blocks(NULL, inode, &map, 0);
-			if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
-				goto out;
+			/*
+			 * For atomic writes the entire requested length should
+			 * be mapped.
+			 */
+			if (map.m_flags & EXT4_MAP_MAPPED) {
+				if ((!(flags & IOMAP_ATOMIC) && ret > 0) ||
+				   (flags & IOMAP_ATOMIC && ret >= m_len_orig))
+					goto out;
+			}
+			map.m_len = m_len_orig;
 		}
 		ret = ext4_iomap_alloc(inode, &map, flags);
 	} else {
@@ -3442,6 +3516,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	 */
 	map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);

+	/*
+	 * Before returning to iomap, let's ensure the allocated mapping
+	 * covers the entire requested length for atomic writes.
+	 */
+	if (flags & IOMAP_ATOMIC) {
+		if (map.m_len < (length >> blkbits)) {
+			WARN_ON(1);
+			return -EINVAL;
+		}
+	}
 	ext4_set_iomap(inode, iomap, &map, offset, length, flags);

 	return 0;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a50e5c31b937..cbb24d535d59 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4442,12 +4442,13 @@ static int ext4_handle_clustersize(struct super_block *sb)
 /*
  * ext4_atomic_write_init: Initializes filesystem min & max atomic write units.
  * @sb: super block
- * TODO: Later add support for bigalloc
  */
 static void ext4_atomic_write_init(struct super_block *sb)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct block_device *bdev = sb->s_bdev;
+	unsigned int blkbits = sb->s_blocksize_bits;
+	unsigned int clustersize = sb->s_blocksize;

 	if (!bdev_can_atomic_write(bdev))
 		return;
@@ -4455,9 +4456,12 @@ static void ext4_atomic_write_init(struct super_block *sb)
 	if (!ext4_has_feature_extents(sb))
 		return;

+	if (ext4_has_feature_bigalloc(sb))
+		clustersize = 1U << (sbi->s_cluster_bits + blkbits);
+
 	sbi->s_awu_min = max(sb->s_blocksize,
 			      bdev_atomic_write_unit_min_bytes(bdev));
-	sbi->s_awu_max = min(sb->s_blocksize,
+	sbi->s_awu_max = min(clustersize,
 			      bdev_atomic_write_unit_max_bytes(bdev));
 	if (sbi->s_awu_min && sbi->s_awu_max &&
 	    sbi->s_awu_min <= sbi->s_awu_max) {
--
2.48.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFCv1 1/1] ext4: Add multi-fsblock atomic write support with bigalloc
  2025-03-23  7:00   ` [RFCv1 1/1] ext4: Add multi-fsblock atomic write support " Ritesh Harjani (IBM)
@ 2025-03-23  7:02     ` Ritesh Harjani (IBM)
  2025-03-25 11:42       ` Ojaswin Mujoo
  0 siblings, 1 reply; 15+ messages in thread
From: Ritesh Harjani (IBM) @ 2025-03-23  7:02 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, John Garry, djwong, linux-xfs, Theodore Ts'o,
	Ojaswin Mujoo, Ritesh Harjani (IBM)

EXT4 supports the bigalloc feature, which allows the FS to work in units of
clusters (groups of blocks) rather than individual blocks. This patch
adds atomic write support for bigalloc so that systems with bs = ps can
also create an FS using:
    mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>

With bigalloc, ext4 can support multi-fsblock atomic writes. We have to
adjust ext4's atomic write unit max value to the cluster size. ext4 can then
support atomic writes of size anywhere in [blocksize, clustersize].

We first query the underlying region of the requested range by calling
ext4_map_blocks(). Here are the various cases which we then handle
for block allocation depending upon the underlying mapping type:
1. If the underlying region for the entire requested range is a mapped extent,
   then we don't call ext4_map_blocks() to allocate anything. We don't even
   need to start the jbd2 txn in this case.
2. For an append write case, we create a mapped extent.
3. If the underlying region is entirely a hole, then we create an unwritten
   extent for the requested range.
4. If the underlying region is a large unwritten extent, then we split the
   extent into 2 unwritten extents of the required size.
5. If the underlying region has any type of mixed mapping, then we call
   ext4_map_blocks() in a loop to zero out the unwritten and the hole regions
   within the requested range. This then provides a single mapped extent type
   mapping for the requested range.

Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO
flag only when the underlying extent mapping of the requested range is
not entirely a hole, an unwritten extent, or a fully mapped extent. That
is, if the underlying region contains a mix of hole(s), unwritten
extent(s), and mapped extent(s), we use this loop to ensure that all the
short mappings are zeroed out. This guarantees that the entire requested
range becomes a single, uniformly mapped extent. It is ok to do so
because we know this is being done on a bigalloc enabled filesystem
where the block bitmap represents the entire cluster unit.

Cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext4/inode.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/ext4/super.c |  8 +++--
 2 files changed, 93 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d04d8a7f12e7..0096a597ad04 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3332,6 +3332,67 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
 		iomap->addr = IOMAP_NULL_ADDR;
 	}
 }
+/*
+ * ext4_map_blocks_atomic: Helper routine to ensure the entire requested mapping
+ * [map.m_lblk, map.m_len] is one single contiguous extent with no mixed
+ * mappings. This function is only called when the bigalloc is enabled, so we
+ * know that the allocated physical extent start is always aligned properly.
+ *
+ * We call EXT4_GET_BLOCKS_ZERO only when the underlying physical extent for the
+ * requested range does not have a single mapping type (Hole, Mapped, or
+ * Unwritten) throughout. In that case we will loop over the requested range to
+ * allocate and zero out the unwritten / holes in between, to get a single
+ * mapped extent from [m_lblk, m_len]. This case is mostly non-performance
+ * critical path, so it should be ok to loop using ext4_map_blocks() with
+ * appropriate flags to allocate & zero the underlying short holes/unwritten
+ * extents within the requested range.
+ */
+static int ext4_map_blocks_atomic(handle_t *handle, struct inode *inode,
+				  struct ext4_map_blocks *map)
+{
+	ext4_lblk_t m_lblk = map->m_lblk;
+	unsigned int m_len = map->m_len;
+	unsigned int mapped_len = 0, flags = 0;
+	u8 blkbits = inode->i_blkbits;
+	int ret;
+
+	WARN_ON(!ext4_has_feature_bigalloc(inode->i_sb));
+
+	ret = ext4_map_blocks(handle, inode, map, 0);
+	if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
+		flags = EXT4_GET_BLOCKS_CREATE;
+	else if ((ret == 0 && map->m_len >= m_len) ||
+		(ret >= m_len && map->m_flags & EXT4_MAP_UNWRITTEN))
+		flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
+	else
+		flags = EXT4_GET_BLOCKS_CREATE_ZERO;
+
+	do {
+		ret = ext4_map_blocks(handle, inode, map, flags);
+		if (ret < 0)
+			return ret;
+		mapped_len += map->m_len;
+		map->m_lblk += map->m_len;
+		map->m_len = m_len - mapped_len;
+	} while (mapped_len < m_len);
+
+	map->m_lblk = m_lblk;
+	map->m_len = m_len;
+
+	/*
+	 * We might have done some work in above loop. Let's ensure we query the
+	 * start of the physical extent, based on the origin m_lblk and m_len
+	 * and also ensure we were able to allocate the required range for doing
+	 * atomic write.
+	 */
+	ret = ext4_map_blocks(handle, inode, map, 0);
+	if (ret != m_len) {
+		ext4_warning_inode(inode, "allocation failed for atomic write request pos:%u, len:%u\n",
+				m_lblk, m_len);
+		return -EINVAL;
+	}
+	return mapped_len;
+}

 static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
 			    unsigned int flags)
@@ -3377,7 +3438,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
 	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
 		m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;

-	ret = ext4_map_blocks(handle, inode, map, m_flags);
+	if (flags & IOMAP_ATOMIC && ext4_has_feature_bigalloc(inode->i_sb))
+		ret = ext4_map_blocks_atomic(handle, inode, map);
+	else
+		ret = ext4_map_blocks(handle, inode, map, m_flags);

 	/*
 	 * We cannot fill holes in indirect tree based inodes as that could
@@ -3401,6 +3465,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	int ret;
 	struct ext4_map_blocks map;
 	u8 blkbits = inode->i_blkbits;
+	unsigned int m_len_orig;

 	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
 		return -EINVAL;
@@ -3414,6 +3479,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	map.m_lblk = offset >> blkbits;
 	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
 			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
+	m_len_orig = map.m_len;

 	if (flags & IOMAP_WRITE) {
 		/*
@@ -3424,8 +3490,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 		 */
 		if (offset + length <= i_size_read(inode)) {
 			ret = ext4_map_blocks(NULL, inode, &map, 0);
-			if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
-				goto out;
+			/*
+			 * For atomic writes the entire requested length should
+			 * be mapped.
+			 */
+			if (map.m_flags & EXT4_MAP_MAPPED) {
+				if ((!(flags & IOMAP_ATOMIC) && ret > 0) ||
+				   (flags & IOMAP_ATOMIC && ret >= m_len_orig))
+					goto out;
+			}
+			map.m_len = m_len_orig;
 		}
 		ret = ext4_iomap_alloc(inode, &map, flags);
 	} else {
@@ -3442,6 +3516,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	 */
 	map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);

+	/*
+	 * Before returning to iomap, let's ensure the allocated mapping
+	 * covers the entire requested length for atomic writes.
+	 */
+	if (flags & IOMAP_ATOMIC) {
+		if (map.m_len < (length >> blkbits)) {
+			WARN_ON(1);
+			return -EINVAL;
+		}
+	}
 	ext4_set_iomap(inode, iomap, &map, offset, length, flags);

 	return 0;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a50e5c31b937..cbb24d535d59 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4442,12 +4442,13 @@ static int ext4_handle_clustersize(struct super_block *sb)
 /*
  * ext4_atomic_write_init: Initializes filesystem min & max atomic write units.
  * @sb: super block
- * TODO: Later add support for bigalloc
  */
 static void ext4_atomic_write_init(struct super_block *sb)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct block_device *bdev = sb->s_bdev;
+	unsigned int blkbits = sb->s_blocksize_bits;
+	unsigned int clustersize = sb->s_blocksize;

 	if (!bdev_can_atomic_write(bdev))
 		return;
@@ -4455,9 +4456,12 @@ static void ext4_atomic_write_init(struct super_block *sb)
 	if (!ext4_has_feature_extents(sb))
 		return;

+	if (ext4_has_feature_bigalloc(sb))
+		clustersize = 1U << (sbi->s_cluster_bits + blkbits);
+
 	sbi->s_awu_min = max(sb->s_blocksize,
 			      bdev_atomic_write_unit_min_bytes(bdev));
-	sbi->s_awu_max = min(sb->s_blocksize,
+	sbi->s_awu_max = min(clustersize,
 			      bdev_atomic_write_unit_max_bytes(bdev));
 	if (sbi->s_awu_min && sbi->s_awu_max &&
 	    sbi->s_awu_min <= sbi->s_awu_max) {
--
2.48.1



* Re: [RFCv1 1/1] ext4: Add multi-fsblock atomic write support with bigalloc
  2025-03-23  7:02     ` Ritesh Harjani (IBM)
@ 2025-03-25 11:42       ` Ojaswin Mujoo
  0 siblings, 0 replies; 15+ messages in thread
From: Ojaswin Mujoo @ 2025-03-25 11:42 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-ext4, linux-fsdevel, John Garry, djwong, linux-xfs,
	Theodore Ts'o

On Sun, Mar 23, 2025 at 12:32:18PM +0530, Ritesh Harjani (IBM) wrote:
> EXT4 supports bigalloc feature which allows the FS to work in size of
> clusters (group of blocks) rather than individual blocks. This patch
> adds atomic write support for bigalloc so that systems with bs = ps can
> also create FS using -
>     mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>
> 
> With bigalloc ext4 can support multi-fsblock atomic writes. We will have to
> adjust ext4's atomic write unit max value to cluster size. This can then support
> atomic write of size anywhere between [blocksize, clustersize].
> 
> We first query the underlying region of the requested range by calling
> ext4_map_blocks() call. Here are the various cases which we then handle
> for block allocation depending upon the underlying mapping type:
> 1. If the underlying region for the entire requested range is a mapped extent,
>    then we don't call ext4_map_blocks() to allocate anything. We don't need to
>    even start the jbd2 txn in this case.
> 2. For an append write case, we create a mapped extent.
> 3. If the underlying region is entirely a hole, then we create an unwritten
>    extent for the requested range.
> 4. If the underlying region is a large unwritten extent, then we split the
>    extent into 2 unwritten extent of required size.
> 5. If the underlying region has any type of mixed mapping, then we call
>    ext4_map_blocks() in a loop to zero out the unwritten and the hole regions
>    within the requested range. This then provide a single mapped extent type
>    mapping for the requested range.
> 
> Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO
> flag only when the underlying extent mapping of the requested range is
> not entirely a hole, an unwritten extent, or a fully mapped extent. That
> is, if the underlying region contains a mix of hole(s), unwritten
> extent(s), and mapped extent(s), we use this loop to ensure that all the
> short mappings are zeroed out. This guarantees that the entire requested
> range becomes a single, uniformly mapped extent. It is ok to do so
> because we know this is being done on a bigalloc enabled filesystem
> where the block bitmap represents the entire cluster unit.

Hi Ritesh, thanks for the patch. The approach looks good to me, just
adding a few comments below.
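
For reference, the [blocksize, clustersize] window described above can be
sketched in plain userspace C. This is only an illustration of the min/max
logic in the super.c hunk, not kernel code: struct awu_limits and
awu_limits_sketch() are made-up names, and cluster_bits == 0 is assumed to
stand in for a filesystem without bigalloc.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative container for the computed limits (not a kernel type). */
struct awu_limits {
	unsigned int min;	/* bytes */
	unsigned int max;	/* bytes */
	bool enabled;		/* both non-zero and min <= max */
};

/*
 * Mirror of the min/max logic in the patch: the lower bound is the larger
 * of the fs block size and the device minimum, the upper bound is the
 * smaller of the cluster size and the device maximum.
 */
static struct awu_limits awu_limits_sketch(unsigned int blkbits,
					   unsigned int cluster_bits,
					   unsigned int bdev_min,
					   unsigned int bdev_max)
{
	struct awu_limits l;
	unsigned int blocksize, clustersize;

	assert(blkbits + cluster_bits < 32);
	blocksize = 1U << blkbits;
	clustersize = 1U << (cluster_bits + blkbits);

	l.min = blocksize > bdev_min ? blocksize : bdev_min;
	l.max = clustersize < bdev_max ? clustersize : bdev_max;
	l.enabled = l.min && l.max && l.min <= l.max;
	return l;
}
```

With the mkfs example above (-b 4096 -C 16384, so blkbits = 12 and
cluster_bits = 2) and a device advertising 512..65536 bytes, this gives
min = 4096 and max = 16384. If the device maximum is below the block size,
the window is empty and atomic writes stay disabled.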
> 
> Cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/ext4/inode.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++--
>  fs/ext4/super.c |  8 +++--
>  2 files changed, 93 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d04d8a7f12e7..0096a597ad04 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3332,6 +3332,67 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>     iomap->addr = IOMAP_NULL_ADDR;
>   }
>  }
> +/*
> + * ext4_map_blocks_atomic: Helper routine to ensure the entire requested mapping
> + * [map.m_lblk, map.m_len] is one single contiguous extent with no mixed
> + * mappings. This function is only called when the bigalloc is enabled, so we
> + * know that the allocated physical extent start is always aligned properly.
> + *
> + * We call EXT4_GET_BLOCKS_ZERO only when the underlying physical extent for the
> + * requested range does not have a single mapping type (Hole, Mapped, or
> + * Unwritten) throughout. In that case we will loop over the requested range to
> + * allocate and zero out the unwritten / holes in between, to get a single
> + * mapped extent from [m_lblk, m_len]. This case is mostly non-performance
> + * critical path, so it should be ok to loop using ext4_map_blocks() with
> + * appropriate flags to allocate & zero the underlying short holes/unwritten
> + * extents within the requested range.
> + */
> +static int ext4_map_blocks_atomic(handle_t *handle, struct inode *inode,
> +         struct ext4_map_blocks *map)
> +{
> + ext4_lblk_t m_lblk = map->m_lblk;
> + unsigned int m_len = map->m_len;
> + unsigned int mapped_len = 0, flags = 0;
> + u8 blkbits = inode->i_blkbits;
> + int ret;
> +
> + WARN_ON(!ext4_has_feature_bigalloc(inode->i_sb));
> +
> + ret = ext4_map_blocks(handle, inode, map, 0);
> + if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
> +   flags = EXT4_GET_BLOCKS_CREATE;
> + else if ((ret == 0 && map->m_len >= m_len) ||
> +   (ret >= m_len && map->m_flags & EXT4_MAP_UNWRITTEN))
> +   flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> + else
> +   flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> +
> + do {
> +   ret = ext4_map_blocks(handle, inode, map, flags);

With the multiple calls to ext4_map_blocks() for converting the extents, I
don't think the transaction reservation would be enough anymore, since in
the worst case we could be converting at least (max atomic write size /
blocksize) extents. We need to account for that as well.
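
To put a number on that bound, a back-of-the-envelope sketch in userspace C
(not kernel code; worst_case_extents() is a hypothetical helper). It assumes
the worst case is one extent per filesystem block, i.e. the mapping type
changes at every block in the range:

```c
#include <assert.h>

/*
 * Hypothetical helper: upper bound on the number of extents that may need
 * converting for one atomic write, assuming one extent per fs block.
 */
static unsigned int worst_case_extents(unsigned int awu_max_bytes,
				       unsigned int blocksize)
{
	assert(blocksize != 0 && awu_max_bytes % blocksize == 0);
	return awu_max_bytes / blocksize;
}
```

For the -b 4096 -C 16384 example this is 16384 / 4096 = 4 extents, and with
a 64k cluster it would be 16.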

> +   if (ret < 0)
> +     return ret;
> +   mapped_len += map->m_len;
> +   map->m_lblk += map->m_len;
> +   map->m_len = m_len - mapped_len;
> + } while (mapped_len < m_len);

> +
> + map->m_lblk = m_lblk;
> + map->m_len = m_len;
> +
> + /*
> +  * We might have done some work in above loop. Let's ensure we query the
> +  * start of the physical extent, based on the origin m_lblk and m_len
> +  * and also ensure we were able to allocate the required range for doing
> +  * atomic write.
> +  */
> + ret = ext4_map_blocks(handle, inode, map, 0);

 Here, we are calling ext4_map_blocks() 3 times unnecessarily even if a
 single complete mapping is found. I think a better approach would be to
 just go for the map_blocks and then decide if we want to split. Also,
 factor out a function to do the zero out. So, something like:

  if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
    flags = EXT4_GET_BLOCKS_CREATE;
  else
    flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;

  ret = ext4_map_blocks(handle, inode, map, flags);

  if (map->m_len < m_len) {
    map->m_len = m_len;

    /* do the zero out */
    ext4_zero_mixed_mappings(handle, inode, map);
    ext4_map_blocks(handle, inode, map, 0);

    WARN_ON(!(map->m_flags & EXT4_MAP_MAPPED) || map->m_len < m_len);
  }

 I think this covers the 5 cases you mentioned in the commit message, if
 I'm not missing anything.  Also, this way we avoid the duplication for
 non zero-out cases and the zero-out function can then be reused in case
 we want to do the same for forcealign atomic writes in the future.
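
 To sanity check the "short mapping means mixed range" rule used above, here
 is a small userspace model (not kernel code). The enum and both helper names
 are made up for illustration, and the model tracks only the mapping type per
 block, ignoring physical addresses:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-block mapping state in this toy model (names are illustrative). */
enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_MAPPED };

/*
 * Length of the run starting at 'start' that shares one mapping state,
 * capped at 'len'. Stands in for the m_len a single lookup would return.
 */
static size_t first_run_len(const enum blk_state *blks, size_t start,
			    size_t len)
{
	size_t n = 1;

	assert(len > 0);
	while (n < len && blks[start + n] == blks[start])
		n++;
	return n;
}

/* True when the range mixes mapping types and would need the zero-out pass. */
static bool needs_zero_out(const enum blk_state *blks, size_t start,
			   size_t len)
{
	return first_run_len(blks, start, len) < len;
}
```

 A range that is entirely a hole, entirely unwritten or entirely mapped
 returns a full-length first run, so only the mixed case (5) reaches the
 zero-out helper.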

Regards,
ojaswin

> + if (ret != m_len) {
> +   ext4_warning_inode(inode, "allocation failed for atomic write request pos:%u, len:%u\n",
> +       m_lblk, m_len);
> +   return -EINVAL;
> + }
> + return mapped_len;
> +}
> 
>  static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>           unsigned int flags)
> @@ -3377,7 +3438,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>   else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>     m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> 
> - ret = ext4_map_blocks(handle, inode, map, m_flags);
> + if (flags & IOMAP_ATOMIC && ext4_has_feature_bigalloc(inode->i_sb))
> +   ret = ext4_map_blocks_atomic(handle, inode, map);
> + else
> +   ret = ext4_map_blocks(handle, inode, map, m_flags);
> 
>   /*
>    * We cannot fill holes in indirect tree based inodes as that could
> @@ -3401,6 +3465,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   int ret;
>   struct ext4_map_blocks map;
>   u8 blkbits = inode->i_blkbits;
> + unsigned int m_len_orig;
> 
>   if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
>     return -EINVAL;
> @@ -3414,6 +3479,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   map.m_lblk = offset >> blkbits;
>   map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>         EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> + m_len_orig = map.m_len;
> 
>   if (flags & IOMAP_WRITE) {
>     /*
> @@ -3424,8 +3490,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>      */
>     if (offset + length <= i_size_read(inode)) {
>       ret = ext4_map_blocks(NULL, inode, &map, 0);
> -     if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
> -       goto out;
> +     /*
> +      * For atomic writes the entire requested length should
> +      * be mapped.
> +      */
> +     if (map.m_flags & EXT4_MAP_MAPPED) {
> +       if ((!(flags & IOMAP_ATOMIC) && ret > 0) ||
> +          (flags & IOMAP_ATOMIC && ret >= m_len_orig))
> +         goto out;
> +     }
> +     map.m_len = m_len_orig;
>     }
>     ret = ext4_iomap_alloc(inode, &map, flags);
>   } else {
> @@ -3442,6 +3516,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>    */
>   map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);
> 
> + /*
> +  * Before returning to iomap, let's ensure the allocated mapping
> +  * covers the entire requested length for atomic writes.
> +  */
> + if (flags & IOMAP_ATOMIC) {
> +   if (map.m_len < (length >> blkbits)) {
> +     WARN_ON(1);
> +     return -EINVAL;
> +   }
> + }
>   ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> 
>   return 0;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index a50e5c31b937..cbb24d535d59 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4442,12 +4442,13 @@ static int ext4_handle_clustersize(struct super_block *sb)
>  /*
>   * ext4_atomic_write_init: Initializes filesystem min & max atomic write units.
>   * @sb: super block
> - * TODO: Later add support for bigalloc
>   */
>  static void ext4_atomic_write_init(struct super_block *sb)
>  {
>   struct ext4_sb_info *sbi = EXT4_SB(sb);
>   struct block_device *bdev = sb->s_bdev;
> + unsigned int blkbits = sb->s_blocksize_bits;
> + unsigned int clustersize = sb->s_blocksize;
> 
>   if (!bdev_can_atomic_write(bdev))
>     return;
> @@ -4455,9 +4456,12 @@ static void ext4_atomic_write_init(struct super_block *sb)
>   if (!ext4_has_feature_extents(sb))
>     return;
> 
> + if (ext4_has_feature_bigalloc(sb))
> +   clustersize = 1U << (sbi->s_cluster_bits + blkbits);
> +
>   sbi->s_awu_min = max(sb->s_blocksize,
>             bdev_atomic_write_unit_min_bytes(bdev));
> - sbi->s_awu_max = min(sb->s_blocksize,
> + sbi->s_awu_max = min(clustersize,
>             bdev_atomic_write_unit_max_bytes(bdev));
>   if (sbi->s_awu_min && sbi->s_awu_max &&
>       sbi->s_awu_min <= sbi->s_awu_max) {
> --
> 2.48.1
> 


* [RFCv1 0/1] EXT4 support of multi-fsblock atomic write with bigalloc
  2025-03-23  7:00 ` [RFCv1 0/1] EXT4 support of multi-fsblock atomic write with bigalloc Ritesh Harjani (IBM)
  2025-03-23  7:00   ` [RFCv1 1/1] ext4: Add multi-fsblock atomic write support " Ritesh Harjani (IBM)
@ 2025-03-23  7:02   ` Ritesh Harjani (IBM)
  1 sibling, 0 replies; 15+ messages in thread
From: Ritesh Harjani (IBM) @ 2025-03-23  7:02 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, John Garry, djwong, linux-xfs, Theodore Ts'o,
	Ojaswin Mujoo, Ritesh Harjani (IBM)

This is an RFC patch before LSFMM to preview what multi-fsblock atomic write
support with bigalloc would look like. There is scope for improvement in the
implementation, however this shows the general idea of the design. More details
are provided in the actual patch. There are still todos and more testing is
needed. But with the iomap limitation of single fsblock atomic write now lifted,
the patch has definitely started to look better.

This is based on the vfs.all tree [1] for 6.15, which now has the necessary
iomap changes required for the bigalloc support in ext4.

TODOs:
1. Add better testcases to test atomic write support with bigalloc.
2. Discuss the approach of keeping the jbd2 txn open while zeroing the short
   underlying unwritten extents or short holes to create a single mapped type
   extent mapping. This anyway should be a non-performance critical path.
3. We use ext4_map_blocks() in a loop instead of modifying the block allocator.
   Again, since it's a non-performance sensitive path, hopefully it should be ok?
   Because otherwise one can argue why take and release
   EXT4_I(inode)->i_data_sem multiple times. We won't take & release any group
   lock for this, since we know that with bigalloc the cluster is anyway
   available to us.
4. Once we start supporting files/inodes marked with the atomic writes attribute,
   maybe we can add some optimizations like zero out the entire underlying
   cluster when someone forcefully wants to fzero or fpunch an underlying disk
   block, to keep the mapped extent intact.
5. Stress test of this is still pending through fsx and xfstests.
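
For the testcases in (1), a userspace checker along the following lines could
pre-compute which (offset, length) pairs are expected to succeed. This is only
a sketch: atomic_write_ok() is a hypothetical helper, and it assumes the
usual atomic write rules (length is a power of two within the advertised unit
min/max, offset naturally aligned to the length).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if x is a non-zero power of two. */
static bool is_pow2(uint64_t x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/*
 * Hypothetical test helper: would an atomic write of 'len' bytes at 'off'
 * satisfy the assumed constraints, given the advertised unit min/max?
 */
static bool atomic_write_ok(uint64_t off, uint64_t len,
			    uint64_t unit_min, uint64_t unit_max)
{
	assert(unit_min <= unit_max);
	if (!is_pow2(len))
		return false;
	if (len < unit_min || len > unit_max)
		return false;
	return (off % len) == 0;	/* naturally aligned */
}
```

With unit min/max of 4096/16384, a 16k write at offset 16384 passes, while a
16k write at offset 4096 (misaligned), a 12k write (not a power of two) and a
32k write (above the max) are all rejected.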

Reviews are appreciated.

[1]: https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.all&id=4f76518956c037517a4e4b120186075d3afb8266

Ritesh Harjani (IBM) (1):
  ext4: Add atomic write support for bigalloc

 fs/ext4/inode.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/ext4/super.c |  8 +++--
 2 files changed, 93 insertions(+), 5 deletions(-)

--
2.48.1


Thread overview: 15+ messages
2025-01-29  7:06 [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes Ojaswin Mujoo
2025-01-29  8:59 ` John Garry
2025-01-29 16:06   ` Ojaswin Mujoo
2025-01-30 14:08     ` John Garry
2025-02-01  7:12       ` Ojaswin Mujoo
2025-02-04 12:20         ` John Garry
2025-02-04 20:12           ` Dave Chinner
2025-02-07  6:08           ` Ojaswin Mujoo
2025-02-07 12:01             ` John Garry
2025-02-08 17:05               ` Ojaswin Mujoo
2025-03-23  7:00 ` [RFCv1 0/1] EXT4 support of multi-fsblock atomic write with bigalloc Ritesh Harjani (IBM)
2025-03-23  7:00   ` [RFCv1 1/1] ext4: Add multi-fsblock atomic write support " Ritesh Harjani (IBM)
2025-03-23  7:02     ` Ritesh Harjani (IBM)
2025-03-25 11:42       ` Ojaswin Mujoo
2025-03-23  7:02   ` [RFCv1 0/1] EXT4 support of multi-fsblock atomic write " Ritesh Harjani (IBM)
