* [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
@ 2025-01-29  7:06 Ojaswin Mujoo
  2025-01-29  8:59 ` John Garry
  2025-03-23  7:00 ` [RFCv1 0/1] EXT4 support of multi-fsblock atomic write with bigalloc Ritesh Harjani (IBM)
  0 siblings, 2 replies; 15+ messages in thread
From: Ojaswin Mujoo @ 2025-01-29  7:06 UTC
To: lsf-pc
Cc: linux-xfs, linux-fsdevel, John Garry, djwong, dchinner, hch, ritesh.list, jack, tytso, linux-ext4

Greetings,

I would like to submit a proposal to discuss the design of extsize and
forcealign and various open questions around it.

** Background **

Modern NVMe/SCSI disks with atomic write capabilities can allow writes to a
multi-KB range on disk to go atomically. This feature has a wide variety of use
cases especially for databases like mysql and postgres that can leverage atomic
writes to gain significant performance. However, in order to enable atomic
writes on Linux, the underlying disk may have some size and alignment
constraints that the upper layers like filesystems should follow. extsize with
forcealign is one of the ways filesystems can make sure the IO submitted to the
disk adheres to the atomic writes constraints.

extsize is a hint to the FS to allocate extents at a certain logical alignment
and size. forcealign builds on this by forcing the allocator to enforce the
alignment guarantees for physical blocks as well, which is essential for atomic
writes.

** Points of discussion **

Extsize hints feature is already supported by XFS [1] with forcealign still
under development and discussion [2]. After taking a look at ext4's multi-block
allocator design, supporting extsize with forcealign can be done in ext4 as
well. There is a RFC proposed which adds support for extsize hints feature in
ext4 [3]. However there are some caveats and deviations from XFS design. With
these in mind, I would like to propose LSFMM topic on:

 * exact semantics of extsize w/ forcealign which can bring a consistent
   interface among ext4 and xfs and possibly any other FS that plans to
   implement them in the future.

 * Documenting how forcealign with extsize should behave with various FS
   operations like fallocate, truncate, punch hole, insert/collapse range etc

 * Implementing extsize with delayed allocation and the challenges there.

 * Discussing tooling support of forcealign like how are we planning to maintain
   block alignment guarantees during fsck, resize and other times where we might
   need to move blocks around?

 * Documenting any areas where FSes might differ in their implementations of the
   same. Example, ext4 doesn't plan to support non power of 2 extsizes whereas
   XFS has support for that.

Hopefully this discussion will be relevant in defining consistent semantics for
extsize hints and forcealign which might as well come useful for other FS
developers too.

Thoughts and suggestions are welcome.

References:
[1] https://man7.org/linux/man-pages/man2/ioctl_xfs_fsgetxattr.2.html
[2] https://lore.kernel.org/linux-xfs/20240813163638.3751939-1-john.g.garry@oracle.com/
[3] https://lore.kernel.org/linux-ext4/cover.1733901374.git.ojaswin@linux.ibm.com/

Regards,
ojaswin
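To make the "size and alignment constraints" mentioned in the proposal concrete, here is a minimal sketch in C of the kind of check an application might run before issuing an atomic write. It assumes the commonly documented rules for Linux atomic writes (power-of-2 length, naturally aligned offset, length within the advertised minimum and maximum unit). The helper names and parameters are made up for illustration and are not taken from any patch in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if x is a non-zero power of two. */
static bool is_pow2(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/*
 * Hypothetical helper (not a kernel API): check whether a write of `len`
 * bytes at byte offset `pos` satisfies the assumed atomic write rules:
 *   - len is a power of two,
 *   - unit_min <= len <= unit_max,
 *   - pos is naturally aligned, i.e. a multiple of len.
 */
static bool atomic_write_ok(uint64_t pos, uint64_t len,
                            uint64_t unit_min, uint64_t unit_max)
{
    if (!is_pow2(len))
        return false;
    if (len < unit_min || len > unit_max)
        return false;
    return (pos & (len - 1)) == 0;
}
```

With unit_min = 4096 and unit_max = 65536, a 16K write at offset 16K passes this check, while the same write at offset 4K is rejected because it is not naturally aligned. Passing the check says nothing about whether the blocks behind the range are physically aligned, which is the part extsize with forcealign is meant to cover.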
* Re: [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
From: John Garry @ 2025-01-29  8:59 UTC
To: Ojaswin Mujoo, lsf-pc
Cc: linux-xfs, linux-fsdevel, djwong, dchinner, hch, ritesh.list, jack, tytso, linux-ext4

On 29/01/2025 07:06, Ojaswin Mujoo wrote:

Hi Ojaswin,

>
> I would like to submit a proposal to discuss the design of extsize and
> forcealign and various open questions around it.
>
> ** Background **
>
> Modern NVMe/SCSI disks with atomic write capabilities can allow writes to a
> multi-KB range on disk to go atomically. This feature has a wide variety of use
> cases especially for databases like mysql and postgres that can leverage atomic
> writes to gain significant performance. However, in order to enable atomic
> writes on Linux, the underlying disk may have some size and alignment
> constraints that the upper layers like filesystems should follow. extsize with
> forcealign is one of the ways filesystems can make sure the IO submitted to the
> disk adheres to the atomic writes constraints.
>
> extsize is a hint to the FS to allocate extents at a certain logical alignment
> and size. forcealign builds on this by forcing the allocator to enforce the
> alignment guarantees for physical blocks as well, which is essential for atomic
> writes.
>
> ** Points of discussion **
>
> Extsize hints feature is already supported by XFS [1] with forcealign still
> under development and discussion [2].

From
https://lore.kernel.org/linux-xfs/20241212013433.GC6678@frogsfrogsfrogs/
thread, the alternate solution to forcealign for XFS is to use a
software-emulated fallback for unaligned atomic writes. I am looking at a
PoC implementation now. Note that this does rely on CoW.

There has been push back on forcealign for XFS, so we need to prove/disprove
that this software-emulated fallback can work, see
https://lore.kernel.org/linux-xfs/20240924061719.GA11211@lst.de/

> After taking a look at ext4's multi-block
> allocator design, supporting extsize with forcealign can be done in ext4 as
> well. There is a RFC proposed which adds support for extsize hints feature in
> ext4 [3]. However there are some caveats and deviations from XFS design. With
> these in mind, I would like to propose LSFMM topic on:
>
>  * exact semantics of extsize w/ forcealign which can bring a consistent
>    interface among ext4 and xfs and possibly any other FS that plans to
>    implement them in the future.
>
>  * Documenting how forcealign with extsize should behave with various FS
>    operations like fallocate, truncate, punch hole, insert/collapse range etc
>
>  * Implementing extsize with delayed allocation and the challenges there.
>
>  * Discussing tooling support of forcealign like how are we planning to maintain
>    block alignment guarantees during fsck, resize and other times where we might
>    need to move blocks around?
>
>  * Documenting any areas where FSes might differ in their implementations of the
>    same. Example, ext4 doesn't plan to support non power of 2 extsizes whereas
>    XFS has support for that.
>
> Hopefully this discussion will be relevant in defining consistent semantics for
> extsize hints and forcealign which might as well come useful for other FS
> developers too.
>
> Thoughts and suggestions are welcome.
>
> References:
> [1] https://man7.org/linux/man-pages/man2/ioctl_xfs_fsgetxattr.2.html
> [2] https://lore.kernel.org/linux-xfs/20240813163638.3751939-1-john.g.garry@oracle.com/
> [3] https://lore.kernel.org/linux-ext4/cover.1733901374.git.ojaswin@linux.ibm.com/
>
> Regards,
> ojaswin
* Re: [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
From: Ojaswin Mujoo @ 2025-01-29 16:06 UTC
To: John Garry
Cc: lsf-pc, linux-xfs, linux-fsdevel, djwong, dchinner, hch, ritesh.list, jack, tytso, linux-ext4

On Wed, Jan 29, 2025 at 08:59:15AM +0000, John Garry wrote:
> On 29/01/2025 07:06, Ojaswin Mujoo wrote:
>
> Hi Ojaswin,
>
> >
> > I would like to submit a proposal to discuss the design of extsize and
> > forcealign and various open questions around it.
> >
> > ** Background **
> >
> > Modern NVMe/SCSI disks with atomic write capabilities can allow writes to a
> > multi-KB range on disk to go atomically. This feature has a wide variety of use
> > cases especially for databases like mysql and postgres that can leverage atomic
> > writes to gain significant performance. However, in order to enable atomic
> > writes on Linux, the underlying disk may have some size and alignment
> > constraints that the upper layers like filesystems should follow. extsize with
> > forcealign is one of the ways filesystems can make sure the IO submitted to the
> > disk adheres to the atomic writes constraints.
> >
> > extsize is a hint to the FS to allocate extents at a certain logical alignment
> > and size. forcealign builds on this by forcing the allocator to enforce the
> > alignment guarantees for physical blocks as well, which is essential for atomic
> > writes.
> >
> > ** Points of discussion **
> >
> > Extsize hints feature is already supported by XFS [1] with forcealign still
> > under development and discussion [2].
>
> From
> https://lore.kernel.org/linux-xfs/20241212013433.GC6678@frogsfrogsfrogs/
> thread, the alternate solution to forcealign for XFS is to use a
> software-emulated fallback for unaligned atomic writes. I am looking at a
> PoC implementation now. Note that this does rely on CoW.
>
> There has been push back on forcealign for XFS, so we need to prove/disprove
> that this software-emulated fallback can work, see
> https://lore.kernel.org/linux-xfs/20240924061719.GA11211@lst.de/
>

Hey John,

Thanks for taking a look. I did go through the 2 series sometime back.
I agree that there are some open challenges in getting the multi block
atomic write interface correct especially for mixed mappings and this is
one of the main reasons we want to explore the exchange_range fallback
in case blocks are not aligned.

That being said, I believe forcealign as a feature still holds a lot
of relevance as:

1. Right now, it is the only way to guarantee aligned blocks and hence
   guarantee that our atomic writes can always benefit from hardware atomic
   write support. IIUC DBs are not very keen on losing out on performance
   due to some writes going via the software fallback path.

2. Not all FSes support COW (major example being ext4) and hence it will
   be very difficult to have a software fallback in case the blocks are
   not aligned.

3. As pointed out in [1], even with exchange_range there is still value
   in having forcealign to find the new blocks to be exchanged.

I agree that forcealign is not the only way we can have atomic writes
work but I do feel there is value in having forcealign for FSes and
hence we should have a discussion around it so we can get the interface
right.

Just to be clear, the intention of this proposal is to mainly discuss
forcealign as a feature. I am hoping there would be another different
proposal to discuss atomic writes and the plethora of other open
challenges there ;)

[1] https://lore.kernel.org/linux-xfs/20250117182945.GH1611770@frogsfrogsfrogs/

> > After taking a look at ext4's multi-block
> > allocator design, supporting extsize with forcealign can be done in ext4 as
> > well. There is a RFC proposed which adds support for extsize hints feature in
> > ext4 [3]. However there are some caveats and deviations from XFS design. With
> > these in mind, I would like to propose LSFMM topic on:
> >
> >  * exact semantics of extsize w/ forcealign which can bring a consistent
> >    interface among ext4 and xfs and possibly any other FS that plans to
> >    implement them in the future.
> >
> >  * Documenting how forcealign with extsize should behave with various FS
> >    operations like fallocate, truncate, punch hole, insert/collapse range etc
> >
> >  * Implementing extsize with delayed allocation and the challenges there.
> >
> >  * Discussing tooling support of forcealign like how are we planning to maintain
> >    block alignment guarantees during fsck, resize and other times where we might
> >    need to move blocks around?
> >
> >  * Documenting any areas where FSes might differ in their implementations of the
> >    same. Example, ext4 doesn't plan to support non power of 2 extsizes whereas
> >    XFS has support for that.
> >
> > Hopefully this discussion will be relevant in defining consistent semantics for
> > extsize hints and forcealign which might as well come useful for other FS
> > developers too.
> >
> > Thoughts and suggestions are welcome.
> >
> > References:
> > [1] https://man7.org/linux/man-pages/man2/ioctl_xfs_fsgetxattr.2.html
> > [2] https://lore.kernel.org/linux-xfs/20240813163638.3751939-1-john.g.garry@oracle.com/
> > [3] https://lore.kernel.org/linux-ext4/cover.1733901374.git.ojaswin@linux.ibm.com/
> >
> > Regards,
> > ojaswin
>
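The distinction drawn in this thread between extsize (a logical alignment hint) and forcealign (physical alignment enforced as well) can be sketched as two predicates over a single extent. This is a simplified illustration only: the struct and function names are hypothetical, units are filesystem blocks, and real allocators track considerably more state than this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified description of one mapped extent, in FS blocks. */
struct extent {
    uint64_t lblk; /* logical start block within the file */
    uint64_t pblk; /* physical start block on disk */
    uint64_t len;  /* length in blocks */
};

/* extsize alone: only the logical start and the length follow the hint. */
static bool extent_matches_extsize(const struct extent *e, uint64_t extsize)
{
    return extsize != 0 && e->lblk % extsize == 0 && e->len % extsize == 0;
}

/* forcealign on top of extsize: the physical start must be aligned too. */
static bool extent_is_forcealigned(const struct extent *e, uint64_t extsize)
{
    return extent_matches_extsize(e, extsize) && e->pblk % extsize == 0;
}
```

For an extsize of 4 blocks, an extent at logical block 4 and physical block 1030 satisfies the first predicate but not the second. That is the case where a hardware atomic write of 4 blocks could not be guaranteed without a software fallback.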
* Re: [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
From: John Garry @ 2025-01-30 14:08 UTC
To: Ojaswin Mujoo
Cc: lsf-pc, linux-xfs, linux-fsdevel, djwong, dchinner, hch, ritesh.list, jack, tytso, linux-ext4

On 29/01/2025 16:06, Ojaswin Mujoo wrote:
> On Wed, Jan 29, 2025 at 08:59:15AM +0000, John Garry wrote:
>> On 29/01/2025 07:06, Ojaswin Mujoo wrote:
>>
>> Hi Ojaswin,
>>
>>>
>>> I would like to submit a proposal to discuss the design of extsize and
>>> forcealign and various open questions around it.
>>>
>>> ** Background **
>>>
>>> Modern NVMe/SCSI disks with atomic write capabilities can allow writes to a
>>> multi-KB range on disk to go atomically. This feature has a wide variety of use
>>> cases especially for databases like mysql and postgres that can leverage atomic
>>> writes to gain significant performance. However, in order to enable atomic
>>> writes on Linux, the underlying disk may have some size and alignment
>>> constraints that the upper layers like filesystems should follow. extsize with
>>> forcealign is one of the ways filesystems can make sure the IO submitted to the
>>> disk adheres to the atomic writes constraints.
>>>
>>> extsize is a hint to the FS to allocate extents at a certain logical alignment
>>> and size. forcealign builds on this by forcing the allocator to enforce the
>>> alignment guarantees for physical blocks as well, which is essential for atomic
>>> writes.
>>>
>>> ** Points of discussion **
>>>
>>> Extsize hints feature is already supported by XFS [1] with forcealign still
>>> under development and discussion [2].
>>
>> From
>> https://lore.kernel.org/linux-xfs/20241212013433.GC6678@frogsfrogsfrogs/
>> thread, the alternate solution to forcealign for XFS is to use a
>> software-emulated fallback for unaligned atomic writes. I am looking at a
>> PoC implementation now. Note that this does rely on CoW.
>>
>> There has been push back on forcealign for XFS, so we need to prove/disprove
>> that this software-emulated fallback can work, see
>> https://lore.kernel.org/linux-xfs/20240924061719.GA11211@lst.de/
>>
>
> Hey John,
>
> Thanks for taking a look. I did go through the 2 series sometime back.
> I agree that there are some open challenges in getting the multi block
> atomic write interface correct especially for mixed mappings and this is
> one of the main reasons we want to explore the exchange_range fallback
> in case blocks are not aligned.

Right, so for XFS I am looking at a CoW-based fallback for unaligned/mixed
mapping atomic writes. I have no idea on how this could work for ext4.

>
> That being said, I believe forcealign as a feature still holds a lot
> of relevance as:
>
> 1. Right now, it is the only way to guarantee aligned blocks and hence
>    guarantee that our atomic writes can always benefit from hardware atomic
>    write support. IIUC DBs are not very keen on losing out on performance
>    due to some writes going via the software fallback path.

Sure, we need performance figures for this first.

>
> 2. Not all FSes support COW (major example being ext4) and hence it will
>    be very difficult to have a software fallback in case the blocks are
>    not aligned.

Understood

>
> 3. As pointed out in [1], even with exchange_range there is still value
>    in having forcealign to find the new blocks to be exchanged.

Yeah, again, we need performance figures.

For my test case, I am trying 16K atomic writes with 4K FS block size, so I
expect the software fallback to not kick in often after running the system
for a while (as eventually we will get aligned allocations). I am
concerned about the prospect of heavily fragmented files, though.

>
> I agree that forcealign is not the only way we can have atomic writes
> work but I do feel there is value in having forcealign for FSes and
> hence we should have a discussion around it so we can get the interface
> right.
>

I thought that the interface for forcealign according to the candidate xfs
implementation was quite straightforward. no?

What was not clear was the age-old issue of how to issue an atomic write of
mixed extents, which is really an atomic write issue.

> Just to be clear, the intention of this proposal is to mainly discuss
> forcealign as a feature. I am hoping there would be another different
> proposal to discuss atomic writes and the plethora of other open
> challenges there ;)

Thanks,
John
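As a rough illustration of when a software fallback would kick in for the test case above (16K atomic writes on 4K blocks, so 4 blocks per write), the sketch below picks a path based on a file's extent mappings. It is a deliberately simplified model with hypothetical names, not the XFS PoC: it takes the hardware path only when a single mapping covers the whole write and the physical start is aligned to the write size, and treats everything else (unaligned or split across mappings) as a fallback case.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent mapping, all fields in FS blocks. */
struct mapping {
    uint64_t lblk; /* logical start block */
    uint64_t pblk; /* physical start block */
    uint64_t len;  /* length in blocks */
};

enum aw_path { AW_HARDWARE, AW_FALLBACK };

/*
 * Decide how an atomic write of `nblks` blocks at logical block `lblk`
 * could be issued. Assumes the caller already validated that nblks is a
 * power of two and that lblk is a multiple of nblks.
 */
static enum aw_path choose_atomic_path(const struct mapping *maps, size_t nmaps,
                                       uint64_t lblk, uint64_t nblks)
{
    if (nblks == 0)
        return AW_FALLBACK;

    for (size_t i = 0; i < nmaps; i++) {
        const struct mapping *m = &maps[i];

        /* Skip mappings that do not contain the whole write. */
        if (lblk < m->lblk || lblk + nblks > m->lblk + m->len)
            continue;

        /* Physical block backing the first block of the write. */
        uint64_t pstart = m->pblk + (lblk - m->lblk);

        return (pstart % nblks == 0) ? AW_HARDWARE : AW_FALLBACK;
    }

    /* No single mapping covers the range: split or mixed mapping. */
    return AW_FALLBACK;
}
```

With forcealign, every mapping would start on an aligned physical block, so a properly aligned write would always land on the hardware path in this model. Without it, the outcome depends on what the allocator happened to hand out, which is where the fragmentation concern comes from.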
* Re: [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
From: Ojaswin Mujoo @ 2025-02-01  7:12 UTC
To: John Garry
Cc: lsf-pc, linux-xfs, linux-fsdevel, djwong, dchinner, hch, ritesh.list, jack, tytso, linux-ext4

On Thu, Jan 30, 2025 at 02:08:30PM +0000, John Garry wrote:
> On 29/01/2025 16:06, Ojaswin Mujoo wrote:
> > On Wed, Jan 29, 2025 at 08:59:15AM +0000, John Garry wrote:
> > > On 29/01/2025 07:06, Ojaswin Mujoo wrote:
> > >
> > > Hi Ojaswin,
> > >
> > > >
> > > > I would like to submit a proposal to discuss the design of extsize and
> > > > forcealign and various open questions around it.
> > > >
> > > > ** Background **
> > > >
> > > > Modern NVMe/SCSI disks with atomic write capabilities can allow writes to a
> > > > multi-KB range on disk to go atomically. This feature has a wide variety of use
> > > > cases especially for databases like mysql and postgres that can leverage atomic
> > > > writes to gain significant performance. However, in order to enable atomic
> > > > writes on Linux, the underlying disk may have some size and alignment
> > > > constraints that the upper layers like filesystems should follow. extsize with
> > > > forcealign is one of the ways filesystems can make sure the IO submitted to the
> > > > disk adheres to the atomic writes constraints.
> > > >
> > > > extsize is a hint to the FS to allocate extents at a certain logical alignment
> > > > and size. forcealign builds on this by forcing the allocator to enforce the
> > > > alignment guarantees for physical blocks as well, which is essential for atomic
> > > > writes.
> > > >
> > > > ** Points of discussion **
> > > >
> > > > Extsize hints feature is already supported by XFS [1] with forcealign still
> > > > under development and discussion [2].
> > >
> > > From
> > > https://lore.kernel.org/linux-xfs/20241212013433.GC6678@frogsfrogsfrogs/
> > > thread, the alternate solution to forcealign for XFS is to use a
> > > software-emulated fallback for unaligned atomic writes. I am looking at a
> > > PoC implementation now. Note that this does rely on CoW.
> > >
> > > There has been push back on forcealign for XFS, so we need to prove/disprove
> > > that this software-emulated fallback can work, see
> > > https://lore.kernel.org/linux-xfs/20240924061719.GA11211@lst.de/
> > >
> >
> > Hey John,
> >
> > Thanks for taking a look. I did go through the 2 series sometime back.
> > I agree that there are some open challenges in getting the multi block
> > atomic write interface correct especially for mixed mappings and this is
> > one of the main reasons we want to explore the exchange_range fallback
> > in case blocks are not aligned.
>
> Right, so for XFS I am looking at a CoW-based fallback for unaligned/mixed
> mapping atomic writes. I have no idea on how this could work for ext4.
>
> >
> > That being said, I believe forcealign as a feature still holds a lot
> > of relevance as:
> >
> > 1. Right now, it is the only way to guarantee aligned blocks and hence
> >    guarantee that our atomic writes can always benefit from hardware atomic
> >    write support. IIUC DBs are not very keen on losing out on performance
> >    due to some writes going via the software fallback path.
>
> Sure, we need performance figures for this first.
>
> >
> > 2. Not all FSes support COW (major example being ext4) and hence it will
> >    be very difficult to have a software fallback in case the blocks are
> >    not aligned.
>
> Understood
>
> >
> > 3. As pointed out in [1], even with exchange_range there is still value
> >    in having forcealign to find the new blocks to be exchanged.
>
> Yeah, again, we need performance figures.
>
> For my test case, I am trying 16K atomic writes with 4K FS block size, so I
> expect the software fallback to not kick in often after running the system
> for a while (as eventually we will get aligned allocations). I am
> concerned about the prospect of heavily fragmented files, though.

Yes that's true, if the FS is up long enough there is bound to be
fragmentation eventually which might make it harder for extsize to
get the blocks.

With software fallback, there's again the point that many FSes will need
some sort of COW/exchange_range support before they can support anything
like that.

Although I've not looked at what it will take to add that to
ext4 but I'm assuming it will not be trivial at all.

> >
> > I agree that forcealign is not the only way we can have atomic writes
> > work but I do feel there is value in having forcealign for FSes and
> > hence we should have a discussion around it so we can get the interface
> > right.
> >
>
> I thought that the interface for forcealign according to the candidate xfs
> implementation was quite straightforward. no?

As mentioned in the original proposal, there are still open problems
around extsize and forcealign.

- The allocation and deallocation semantics are not completely clear to
  me for example we allow operations like unaligned punch_hole but not
  unaligned insert and collapse range, and I couldn't see that
  documented anywhere.

- There are challenges in extsize with delayed allocation as well as how
  the tooling should handle forcealigned inodes.

- How are FSes supposed to behave when forcealign/extsize is used with
  other FS features that change the allocation granularity like bigalloc
  or rtvol.

I agree that XFS's implementation is a good reference but I'm
sure as I continue working on the same from ext4 perspective we will have
more points of discussion. So I definitely feel that it's worth
discussing this at LSFMM.

>
> What was not clear was the age-old issue of how to issue an atomic write of
> mixed extents, which is really an atomic write issue.

Right, btw are you planning any talk for atomic writes at LSFMM?

Regards,
ojaswin

>
> > Just to be clear, the intention of this proposal is to mainly discuss
> > forcealign as a feature. I am hoping there would be another different
> > proposal to discuss atomic writes and the plethora of other open
> > challenges there ;)
>
> Thanks,
> John
* Re: [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
From: John Garry @ 2025-02-04 12:20 UTC
To: Ojaswin Mujoo
Cc: lsf-pc, linux-xfs, linux-fsdevel, djwong, dchinner, hch, ritesh.list, jack, tytso, linux-ext4

On 01/02/2025 07:12, Ojaswin Mujoo wrote:

Hi Ojaswin,

>> For my test case, I am trying 16K atomic writes with 4K FS block size, so I
>> expect the software fallback to not kick in often after running the system
>> for a while (as eventually we will get aligned allocations). I am
>> concerned about the prospect of heavily fragmented files, though.
> Yes that's true, if the FS is up long enough there is bound to be
> fragmentation eventually which might make it harder for extsize to
> get the blocks.
>
> With software fallback, there's again the point that many FSes will need
> some sort of COW/exchange_range support before they can support anything
> like that.
>
> Although I've not looked at what it will take to add that to
> ext4 but I'm assuming it will not be trivial at all.

Sure, but then again you may not have issues with getting forcealign support
accepted for ext4. However, I would have thought that bigalloc was good
enough to use initially.

>
>>> I agree that forcealign is not the only way we can have atomic writes
>>> work but I do feel there is value in having forcealign for FSes and
>>> hence we should have a discussion around it so we can get the interface
>>> right.
>>>
>> I thought that the interface for forcealign according to the candidate xfs
>> implementation was quite straightforward. no?
> As mentioned in the original proposal, there are still open problems
> around extsize and forcealign.
>
> - The allocation and deallocation semantics are not completely clear to
>   me for example we allow operations like unaligned punch_hole but not
>   unaligned insert and collapse range, and I couldn't see that
>   documented anywhere.

For xfs, we were imposing the same restrictions that we have for
rtextsize > 1.

If you check the following:
https://lore.kernel.org/linux-xfs/20240813163638.3751939-9-john.g.garry@oracle.com/

You can see how the large allocunit value is affected by forcealign, and
then check callers of xfs_is_falloc_aligned() -> xfs_inode_alloc_unitsize()
to see how this affects some fallocate modes.

>
> - There are challenges in extsize with delayed allocation as well as how
>   the tooling should handle forcealigned inodes.

Yeah, maybe. I was only testing my xfs forcealign solution for dio (and no
delayed alloc).

>
> - How are FSes supposed to behave when forcealign/extsize is used with
>   other FS features that change the allocation granularity like bigalloc
>   or rtvol.

As you would expect, they need to be aligned with one another.

For example, in the case of xfs rtvol, rextsize needs to be a multiple of
extsize when forcealign is enabled. Or the other way around, I forget now..

>
> I agree that XFS's implementation is a good reference but I'm
> sure as I continue working on the same from ext4 perspective we will have
> more points of discussion. So I definitely feel that it's worth
> discussing this at LSFMM.

Understood, but I wait to see what happens to my CoW-based method for XFS to
see where that goes before commenting on what needs to be discussed for xfs

>
>> What was not clear was the age-old issue of how to issue an atomic write of
>> mixed extents, which is really an atomic write issue.
> Right, btw are you planning any talk for atomic writes at LSFMM?

I hadn't planned on it, but I guess that Martin will add something to the
agenda.

Thanks,
John
* Re: [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
From: Dave Chinner @ 2025-02-04 20:12 UTC
To: John Garry
Cc: Ojaswin Mujoo, lsf-pc, linux-xfs, linux-fsdevel, djwong, dchinner, hch, ritesh.list, jack, tytso, linux-ext4

On Tue, Feb 04, 2025 at 12:20:25PM +0000, John Garry wrote:
> On 01/02/2025 07:12, Ojaswin Mujoo wrote:
>
> Hi Ojaswin,
>
> > > For my test case, I am trying 16K atomic writes with 4K FS block size, so I
> > > expect the software fallback to not kick in often after running the system
> > > for a while (as eventually we will get aligned allocations). I am
> > > concerned about the prospect of heavily fragmented files, though.
> > Yes that's true, if the FS is up long enough there is bound to be
> > fragmentation eventually which might make it harder for extsize to
> > get the blocks.
> >
> > With software fallback, there's again the point that many FSes will need
> > some sort of COW/exchange_range support before they can support anything
> > like that.
> >
> > Although I've not looked at what it will take to add that to
> > ext4 but I'm assuming it will not be trivial at all.
>
> Sure, but then again you may not have issues with getting forcealign support
> accepted for ext4. However, I would have thought that bigalloc was good
> enough to use initially.
>
> >
> > > > I agree that forcealign is not the only way we can have atomic writes
> > > > work but I do feel there is value in having forcealign for FSes and
> > > > hence we should have a discussion around it so we can get the interface
> > > > right.
> > > >
> > > I thought that the interface for forcealign according to the candidate xfs
> > > implementation was quite straightforward. no?
> > As mentioned in the original proposal, there are still open problems
> > around extsize and forcealign.
> >
> > - The allocation and deallocation semantics are not completely clear to
> >   me for example we allow operations like unaligned punch_hole but not
> >   unaligned insert and collapse range, and I couldn't see that
> >   documented anywhere.
>
> For xfs, we were imposing the same restrictions that we have for
> rtextsize > 1.
>
> If you check the following:
> https://lore.kernel.org/linux-xfs/20240813163638.3751939-9-john.g.garry@oracle.com/
>
> You can see how the large allocunit value is affected by forcealign, and
> then check callers of xfs_is_falloc_aligned() -> xfs_inode_alloc_unitsize()
> to see how this affects some fallocate modes.
>
> >
> > - There are challenges in extsize with delayed allocation as well as how
> >   the tooling should handle forcealigned inodes.
>
> Yeah, maybe. I was only testing my xfs forcealign solution for dio (and no
> delayed alloc).

XFS turns off delalloc when extsize hints are set. See
xfs_buffered_write_iomap_begin() - it starts with:

        /* we can't use delayed allocations when using extent size hints */
        if (xfs_get_extsz_hint(ip))
                return xfs_direct_write_iomap_begin(inode, offset, count,
                                flags, iomap, srcmap);

and so it treats the allocation like a direct IO write and so
force-align should work with buffered writes as expected.

This delalloc constraint is a historic relic in XFS - now that we
use unwritten extents for delalloc we -could- use delalloc with
extsize hints; it just requires the delalloc extents to be aligned
to extsize hints.

-Dave.
--
Dave Chinner
david@fromorbit.com
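The last point above, that delalloc extents would need to be aligned to the extsize hint, amounts to widening the byte range being reserved so that both ends fall on extsize boundaries. A minimal sketch of that rounding follows. The struct and function names are hypothetical and this is not XFS code; it uses modulo arithmetic so that a non power-of-2 hint (which XFS allows, per the original proposal) also works.

```c
#include <stdint.h>

/* Half-open byte range [start, end). Hypothetical type for illustration. */
struct byte_range {
    uint64_t start;
    uint64_t end;
};

/*
 * Widen the range [offset, offset + count) outward so that both ends are
 * multiples of extsize_bytes. An extsize of 0 means "no hint" here and
 * leaves the range unchanged.
 */
static struct byte_range align_to_extsize(uint64_t offset, uint64_t count,
                                          uint64_t extsize_bytes)
{
    struct byte_range r = { offset, offset + count };

    if (extsize_bytes == 0)
        return r;

    /* Round the start down to the previous boundary. */
    r.start = offset - (offset % extsize_bytes);

    /* Round the end up to the next boundary, if it is not already on one. */
    uint64_t rem = r.end % extsize_bytes;
    if (rem != 0)
        r.end += extsize_bytes - rem;

    return r;
}
```

For example, with a 16K hint a 3000-byte buffered write at offset 5000 would reserve the whole range from 0 to 16384, and a 20000-byte write at offset 20000 would reserve 16384 to 49152.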
* Re: [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
  2025-02-04 12:20 ` John Garry
  2025-02-04 20:12 ` Dave Chinner
@ 2025-02-07  6:08 ` Ojaswin Mujoo
  2025-02-07 12:01 ` John Garry
  1 sibling, 1 reply; 15+ messages in thread
From: Ojaswin Mujoo @ 2025-02-07  6:08 UTC (permalink / raw)
  To: John Garry
  Cc: lsf-pc, linux-xfs, linux-fsdevel, djwong, dchinner, hch, ritesh.list, jack, tytso, linux-ext4

On Tue, Feb 04, 2025 at 12:20:25PM +0000, John Garry wrote:
> On 01/02/2025 07:12, Ojaswin Mujoo wrote:
>
> Hi Ojaswin,
>
> > > For my test case, I am trying 16K atomic writes with 4K FS block size, so I
> > > expect the software fallback to not kick in often after running the system
> > > for a while (as eventually we will get an aligned allocations). I am
> > > concerned of prospect of heavily fragmented files, though.
> > Yes that's true, if the FS is up long enough there is bound to be
> > fragmentation eventually which might make it harder for extsize to
> > get the blocks.
> >
> > With software fallback, there's again the point that many FSes will need
> > some sort of COW/exchange_range support before they can support anything
> > like that.
> >
> > Although I've not looked at what it will take to add that to
> > ext4 but I'm assuming it will not be trivial at all.
>
> Sure, but then again you may not have issues with getting forcealign support
> accepted for ext4. However, I would have thought that bigalloc was good
> enough to use initially.

Yes, bigalloc is indeed good enough as a start but yes eventually
something like forcealign will be beneficial as not everyone prefers an
FS-wide cluster-size allocation granularity.

We do have a patch for atomic writes with bigalloc that was sent way
back in mid 2024 but then we went into the same discussion of mixed
mapping[1].

Hmm I think it might be time to revisit that and see if we can do
something better there.

[1] https://lore.kernel.org/linux-ext4/37baa9f4c6c2994df7383d8b719078a527e521b9.1729825985.git.ritesh.list@gmail.com/

>
> >
> > > > I agree that forcealign is not the only way we can have atomic writes
> > > > work but I do feel there is value in having forcealign for FSes and
> > > > hence we should have a discussion around it so we can get the interface
> > > > right.
> > > >
> > > I thought that the interface for forcealign according to the candidate xfs
> > > implementation was quite straightforward. no?
> > As mentioned in the original proposal, there are still a open problems
> > around extsize and forcealign.
> >
> > - The allocation and deallocation semantics are not completely clear to
> >   me for example we allow operations like unaligned punch_hole but not
> >   unaligned insert and collapse range, and I couldn't see that
> >   documented anywhere.
>
> For xfs, we were imposing the same restrictions as which we have for
> rtextsize > 1.
>
> If you check the following:
> https://lore.kernel.org/linux-xfs/20240813163638.3751939-9-john.g.garry@oracle.com/
>
> You can see how the large allocunit value is affected by forcealign, and
> then check callers of xfs_is_falloc_aligned() -> xfs_inode_alloc_unitsize()
> to see how this affects some fallocate modes.

True, but it's something that just implicitly happens when we use
forcealign. I eventually found out while testing forcealign with
different operations but such things can come as a surprise to users
especially when we support some operations to be unaligned and then
reject some other similar ones.

punch_hole/collapse_range is just an example and yes it might not be
very important to support unaligned collapse range but in the long run
it would be good to have these things documented/discussed.

>
> >
> > - There are challenges in extsize with delayed allocation as well as how
> >   the tooling should handle forcealigned inodes.
>
> Yeah, maybe. I was only testing my xfs forcealign solution for dio (and no
> delayed alloc).
>
> >
> > - How are FSes supposed to behave when forcealign/extsize is used with
> >   other FS features that change the allocation granularity like bigalloc
> >   or rtvol.
>
> As you would expect, they need to be aligned with one another.
>
> For example, in the case of xfs rtvol, rextsize needs to be a multiple of
> extsize when forcealign is enabled. Or the other way around, I forget now..
>
> >
> > I agree that XFS's implementation is a good reference but I'm
> > sure as I continue working on the same from ext4 perspective we will have
> > more points of discussion. So I definitely feel that its worth
> > discussing this at LSFMM.
>
> Understood, but I wait to see what happens to my CoW-based method for XFS to
> see where that goes before commenting on what needs to be discussed for xfs

Got it.

>
> >
> > > What was not clear was the age-old issue of how to issue an atomic write of
> > > mixed extents, which is really an atomic write issue.
> > Right, btw are you planning any talk for atomic writes at LSFMM?
>
> I hadn't planned on it, but I guess that Martin will add something to the
> agenda.
>
> Thanks,
> John
>
* Re: [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
  2025-02-07  6:08 ` Ojaswin Mujoo
@ 2025-02-07 12:01 ` John Garry
  2025-02-08 17:05 ` Ojaswin Mujoo
  0 siblings, 1 reply; 15+ messages in thread
From: John Garry @ 2025-02-07 12:01 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: lsf-pc, linux-xfs, linux-fsdevel, djwong, dchinner, hch, ritesh.list, jack, tytso, linux-ext4

> Yes, bigalloc is indeed good enough as a start but yes eventually
> something like forcealign will be beneficial as not everyone prefers an
> FS-wide cluster-size allocation granularity.
>
> We do have a patch for atomic writes with bigalloc that was sent way
> back in mid 2024 but then we went into the same discussion of mixed
> mapping[1].
>
> Hmm I think it might be time to revisit that and see if we can do
> something better there.
>
> [1] https://lore.kernel.org/linux-ext4/37baa9f4c6c2994df7383d8b719078a527e521b9.1729825985.git.ritesh.list@gmail.com/

Feel free to pick up the iomap patches I had for zeroing when trying to
atomic write mixed mappings - that's in my v3 series IIRC.

But you might still get some push back on them...

>>
>>>
>>>>> I agree that forcealign is not the only way we can have atomic writes
>>>>> work but I do feel there is value in having forcealign for FSes and
>>>>> hence we should have a discussion around it so we can get the interface
>>>>> right.
>>>>>
>>>> I thought that the interface for forcealign according to the candidate xfs
>>>> implementation was quite straightforward. no?
>>> As mentioned in the original proposal, there are still a open problems
>>> around extsize and forcealign.
>>>
>>> - The allocation and deallocation semantics are not completely clear to
>>>   me for example we allow operations like unaligned punch_hole but not
>>>   unaligned insert and collapse range, and I couldn't see that
>>>   documented anywhere.
>>
>> For xfs, we were imposing the same restrictions as which we have for
>> rtextsize > 1.
>>
>> If you check the following:
>> https://lore.kernel.org/linux-xfs/20240813163638.3751939-9-john.g.garry@oracle.com/
>>
>> You can see how the large allocunit value is affected by forcealign, and
>> then check callers of xfs_is_falloc_aligned() -> xfs_inode_alloc_unitsize()
>> to see how this affects some fallocate modes.
>
> True, but it's something that just implicitly happens when we use
> forcealign. I eventually found out while testing forcealign with
> different operations but such things can come as a surprise to users
> especially when we support some operations to be unaligned and then
> reject some other similar ones.
>
> punch_hole/collapse_range is just an example and yes it might not be
> very important to support unaligned collapse range but in the long run
> it would be good to have these things documented/discussed.

Maybe the man pages can be documented for forcealign/rtextsize > 1 punch
holes/collapse behaviour - at a quick glance, I could not see anything.

Indeed, I am not sure how bigalloc affects punch holes/collapse range
either.

Thanks,
John
* Re: [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes
  2025-02-07 12:01 ` John Garry
@ 2025-02-08 17:05 ` Ojaswin Mujoo
  0 siblings, 0 replies; 15+ messages in thread
From: Ojaswin Mujoo @ 2025-02-08 17:05 UTC (permalink / raw)
  To: John Garry
  Cc: lsf-pc, linux-xfs, linux-fsdevel, djwong, dchinner, hch, ritesh.list, jack, tytso, linux-ext4

On Fri, Feb 07, 2025 at 12:01:32PM +0000, John Garry wrote:
>
> > Yes, bigalloc is indeed good enough as a start but yes eventually
> > something like forcealign will be beneficial as not everyone prefers an
> > FS-wide cluster-size allocation granularity.
> >
> > We do have a patch for atomic writes with bigalloc that was sent way
> > back in mid 2024 but then we went into the same discussion of mixed
> > mapping[1].
> >
> > Hmm I think it might be time to revisit that and see if we can do
> > something better there.
> >
> > [1] https://lore.kernel.org/linux-ext4/37baa9f4c6c2994df7383d8b719078a527e521b9.1729825985.git.ritesh.list@gmail.com/
>
> Feel free to pick up the iomap patches I had for zeroing when trying to
> atomic write mixed mappings - that's in my v3 series IIRC.

Thanks, I'll give it a try.

>
> But you might still get some push back on them...

Right, it would be good if we can all come to a consensus on what to do
if an FS wants to implement something like forcealign for atomic writes
but does not have a way to implement a software fallback. As I see it,
there seem to be 2 (un)popular options:

1. Reject atomic writes on mixed mappings. This is not user space
   friendly but is the simplest to implement.
2. Zero out the unwritten part of the mapping and convert to a single
   mapping before performing the IO.

All options have their shortcomings but I think 2 is still okay. I
believe that's the path we've taken in the latest XFS patches, right?

>
> > >
> > > >
> > > > > > I agree that forcealign is not the only way we can have atomic writes
> > > > > > work but I do feel there is value in having forcealign for FSes and
> > > > > > hence we should have a discussion around it so we can get the interface
> > > > > > right.
> > > > > >
> > > > > I thought that the interface for forcealign according to the candidate xfs
> > > > > implementation was quite straightforward. no?
> > > > As mentioned in the original proposal, there are still a open problems
> > > > around extsize and forcealign.
> > > >
> > > > - The allocation and deallocation semantics are not completely clear to
> > > >   me for example we allow operations like unaligned punch_hole but not
> > > >   unaligned insert and collapse range, and I couldn't see that
> > > >   documented anywhere.
> > >
> > > For xfs, we were imposing the same restrictions as which we have for
> > > rtextsize > 1.
> > >
> > > If you check the following:
> > > https://lore.kernel.org/linux-xfs/20240813163638.3751939-9-john.g.garry@oracle.com/
> > >
> > > You can see how the large allocunit value is affected by forcealign, and
> > > then check callers of xfs_is_falloc_aligned() -> xfs_inode_alloc_unitsize()
> > > to see how this affects some fallocate modes.
> >
> > True, but it's something that just implicitly happens when we use
> > forcealign. I eventually found out while testing forcealign with
> > different operations but such things can come as a surprise to users
> > especially when we support some operations to be unaligned and then
> > reject some other similar ones.
> >
> > punch_hole/collapse_range is just an example and yes it might not be
> > very important to support unaligned collapse range but in the long run
> > it would be good to have these things documented/discussed.
>
> Maybe the man pages can be documented for forcealign/rtextsize > 1 punch
> holes/collapse behaviour - at a quick glance, I could not see anything.

Yep, sounds good.

> Indeed, I am not sure how bigalloc affects punch holes/collapse range
> either.

Yeah, I think even bigalloc has similar behavior: it disallows unaligned
insert/collapse range but allows unaligned punch hole.

>
> Thanks,
> John

Regards,
ojaswin
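The behaviour discussed in this subthread (unaligned punch hole is accepted, unaligned insert/collapse range is rejected) could be modelled roughly as below. This is a simplified user-space sketch, not the XFS or ext4 code: falloc_op, range_is_aligned() and falloc_op_allowed() are made-up names, and alloc_unit stands for whatever allocation unit applies (extsize with forcealign, rtextsize, or the bigalloc cluster size), in bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of fallocate operations, for illustration only. */
enum falloc_op {
	OP_PUNCH_HOLE,
	OP_COLLAPSE_RANGE,
	OP_INSERT_RANGE,
};

/* True if both the offset and the length are multiples of alloc_unit. */
bool range_is_aligned(uint64_t pos, uint64_t len, uint64_t alloc_unit)
{
	assert(alloc_unit > 0);
	return (pos % alloc_unit) == 0 && (len % alloc_unit) == 0;
}

/*
 * Assumed policy, as described in the thread: punch hole may be
 * unaligned, while insert/collapse range must be aligned to the
 * allocation unit.
 */
bool falloc_op_allowed(enum falloc_op op, uint64_t pos, uint64_t len,
		       uint64_t alloc_unit)
{
	if (op == OP_PUNCH_HOLE)
		return true;
	return range_is_aligned(pos, len, alloc_unit);
}
```

For a 16K allocation unit, a 4K-aligned punch hole passes while a collapse range starting at offset 4K is refused. Whichever policy is finally agreed on, a table of this kind is what the man pages would need to spell out.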
* [RFCv1 0/1] EXT4 support of multi-fsblock atomic write with bigalloc
  2025-01-29  7:06 [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes Ojaswin Mujoo
  2025-01-29  8:59 ` John Garry
@ 2025-03-23  7:00 ` Ritesh Harjani (IBM)
  2025-03-23  7:00 ` [RFCv1 1/1] ext4: Add multi-fsblock atomic write support " Ritesh Harjani (IBM)
  2025-03-23  7:02 ` [RFCv1 0/1] EXT4 support of multi-fsblock atomic write " Ritesh Harjani (IBM)
  1 sibling, 2 replies; 15+ messages in thread
From: Ritesh Harjani (IBM) @ 2025-03-23  7:00 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, John Garry, djwong, linux-xfs, Theodore Ts'o, Ojaswin Mujoo, Ritesh Harjani (IBM)

This is an RFC patch before LSFMM to preview how multi-fsblock atomic
write support with bigalloc would look. There is scope for improvement
in the implementation, however this shows the general idea of the
design. More details are provided in the actual patch. There are still
todos and more testing is needed. But with the iomap limitation of
single fsblock atomic writes now lifted, the patch has definitely
started to look better.

This is based on the vfs.all tree [1] for 6.15, which now has the
necessary iomap changes required for the bigalloc support in ext4.

TODOs:
1. Add better testcases to test atomic write support with bigalloc.
2. Discuss the approach of keeping the jbd2 txn open while zeroing the
   short underlying unwritten extents or short holes to create a single
   mapped type extent mapping. This should anyway be a non-performance
   critical path.
3. We use ext4_map_blocks() in a loop instead of modifying the block
   allocator. Again, since it's a non-performance sensitive path,
   hopefully it should be ok? Because otherwise one can argue why take
   and release EXT4_I(inode)->i_data_sem multiple times. We won't take
   & release any group lock for this, since we know that with bigalloc
   the cluster is anyway available to us.
4. Once we start supporting files/inodes marked with an atomic writes
   attribute, maybe we can add some optimizations like zeroing out the
   entire underlying cluster when someone forcefully wants to fzero or
   fpunch an underlying disk block, to keep the mapped extent intact.
5. Stress testing of this through fsx and xfstests is still pending.

Reviews are appreciated.

[1]: https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.all&id=4f76518956c037517a4e4b120186075d3afb8266

Ritesh Harjani (IBM) (1):
  ext4: Add atomic write support for bigalloc

 fs/ext4/inode.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/ext4/super.c |  8 +++--
 2 files changed, 93 insertions(+), 5 deletions(-)

-- 
2.48.1
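To make the idea behind TODO items 2 and 3 a bit more concrete, here is a toy user-space model of it. It is only a sketch under stated assumptions: blk_state, range_kind, classify_range() and map_mixed_range() are hypothetical names, the array stands in for the per-block mapping state of the requested range, and each pass of the outer loop plays the role of one ext4_map_blocks() call on one run of identically mapped blocks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-block mapping state within the requested range. */
enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_MAPPED };

enum range_kind { RANGE_HOLE, RANGE_UNWRITTEN, RANGE_MAPPED, RANGE_MIXED };

/* Report whether the whole range has one mapping type or a mix. */
enum range_kind classify_range(const enum blk_state *blks, size_t n)
{
	assert(n > 0);

	for (size_t i = 1; i < n; i++)
		if (blks[i] != blks[0])
			return RANGE_MIXED;

	switch (blks[0]) {
	case BLK_HOLE:
		return RANGE_HOLE;
	case BLK_UNWRITTEN:
		return RANGE_UNWRITTEN;
	default:
		return RANGE_MAPPED;
	}
}

/*
 * Walk a range run by run and turn every run into "mapped", which
 * models allocating and zeroing the short holes/unwritten extents.
 * Returns the number of runs visited, i.e. the number of mapping
 * calls the loop would have issued.
 */
size_t map_mixed_range(enum blk_state *blks, size_t n)
{
	size_t calls = 0, i = 0;

	while (i < n) {
		size_t j = i;

		/* find the end of the current run of identical state */
		while (j < n && blks[j] == blks[i])
			j++;
		for (size_t k = i; k < j; k++)
			blks[k] = BLK_MAPPED;
		calls++;
		i = j;
	}
	return calls;
}
```

For a 4-block range laid out as mapped, hole, unwritten, unwritten, the model reports a mixed range, needs three passes, and ends with a uniformly mapped range. In the worst case the number of passes equals the number of blocks in the range.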
* [RFCv1 1/1] ext4: Add multi-fsblock atomic write support with bigalloc
  2025-03-23  7:00 ` [RFCv1 0/1] EXT4 support of multi-fsblock atomic write with bigalloc Ritesh Harjani (IBM)
@ 2025-03-23  7:00 ` Ritesh Harjani (IBM)
  2025-03-23  7:02 ` Ritesh Harjani (IBM)
  2025-03-23  7:02 ` [RFCv1 0/1] EXT4 support of multi-fsblock atomic write " Ritesh Harjani (IBM)
  1 sibling, 1 reply; 15+ messages in thread
From: Ritesh Harjani (IBM) @ 2025-03-23  7:00 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, John Garry, djwong, linux-xfs, Theodore Ts'o, Ojaswin Mujoo, Ritesh Harjani (IBM)

EXT4 supports bigalloc feature which allows the FS to work in size of
clusters (group of blocks) rather than individual blocks. This patch
adds atomic write support for bigalloc so that systems with bs = ps can
also create FS using -
    mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>

With bigalloc ext4 can support multi-fsblock atomic writes. We will have to
adjust ext4's atomic write unit max value to cluster size. This can then support
atomic write of size anywhere between [blocksize, clustersize].

We first query the underlying region of the requested range by calling
ext4_map_blocks(). Here are the various cases which we then handle
for block allocation depending upon the underlying mapping type:
1. If the underlying region for the entire requested range is a mapped extent,
   then we don't call ext4_map_blocks() to allocate anything. We don't need to
   even start the jbd2 txn in this case.
2. For an append write case, we create a mapped extent.
3. If the underlying region is entirely a hole, then we create an unwritten
   extent for the requested range.
4. If the underlying region is a large unwritten extent, then we split the
   extent into 2 unwritten extents of required size.
5. If the underlying region has any type of mixed mapping, then we call
   ext4_map_blocks() in a loop to zero out the unwritten and the hole regions
   within the requested range. This then provides a single mapped extent type
   mapping for the requested range.

Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO
flag only when the underlying extent mapping of the requested range is
not entirely a hole, an unwritten extent, or a fully mapped extent. That
is, if the underlying region contains a mix of hole(s), unwritten
extent(s), and mapped extent(s), we use this loop to ensure that all the
short mappings are zeroed out. This guarantees that the entire requested
range becomes a single, uniformly mapped extent. It is ok to do so
because we know this is being done on a bigalloc enabled filesystem
where the block bitmap represents the entire cluster unit.

Cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext4/inode.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/ext4/super.c |  8 +++--
 2 files changed, 93 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d04d8a7f12e7..0096a597ad04 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3332,6 +3332,67 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
 		iomap->addr = IOMAP_NULL_ADDR;
 	}
 }
+/*
+ * ext4_map_blocks_atomic: Helper routine to ensure the entire requested mapping
+ * [map.m_lblk, map.m_len] is one single contiguous extent with no mixed
+ * mappings. This function is only called when the bigalloc is enabled, so we
+ * know that the allocated physical extent start is always aligned properly.
+ *
+ * We call EXT4_GET_BLOCKS_ZERO only when the underlying physical extent for the
+ * requested range does not have a single mapping type (Hole, Mapped, or
+ * Unwritten) throughout. In that case we will loop over the requested range to
+ * allocate and zero out the unwritten / holes in between, to get a single
+ * mapped extent from [m_lblk, m_len]. This case is mostly non-performance
+ * critical path, so it should be ok to loop using ext4_map_blocks() with
+ * appropriate flags to allocate & zero the underlying short holes/unwritten
+ * extents within the requested range.
+ */
+static int ext4_map_blocks_atomic(handle_t *handle, struct inode *inode,
+				  struct ext4_map_blocks *map)
+{
+	ext4_lblk_t m_lblk = map->m_lblk;
+	unsigned int m_len = map->m_len;
+	unsigned int mapped_len = 0, flags = 0;
+	u8 blkbits = inode->i_blkbits;
+	int ret;
+
+	WARN_ON(!ext4_has_feature_bigalloc(inode->i_sb));
+
+	ret = ext4_map_blocks(handle, inode, map, 0);
+	if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
+		flags = EXT4_GET_BLOCKS_CREATE;
+	else if ((ret == 0 && map->m_len >= m_len) ||
+		(ret >= m_len && map->m_flags & EXT4_MAP_UNWRITTEN))
+		flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
+	else
+		flags = EXT4_GET_BLOCKS_CREATE_ZERO;
+
+	do {
+		ret = ext4_map_blocks(handle, inode, map, flags);
+		if (ret < 0)
+			return ret;
+		mapped_len += map->m_len;
+		map->m_lblk += map->m_len;
+		map->m_len = m_len - mapped_len;
+	} while (mapped_len < m_len);
+
+	map->m_lblk = m_lblk;
+	map->m_len = m_len;
+
+	/*
+	 * We might have done some work in above loop. Let's ensure we query the
+	 * start of the physical extent, based on the origin m_lblk and m_len
+	 * and also ensure we were able to allocate the required range for doing
+	 * atomic write.
+	 */
+	ret = ext4_map_blocks(handle, inode, map, 0);
+	if (ret != m_len) {
+		ext4_warning_inode(inode, "allocation failed for atomic write request pos:%u, len:%u\n",
+				m_lblk, m_len);
+		return -EINVAL;
+	}
+	return mapped_len;
+}
 
 static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
 			    unsigned int flags)
@@ -3377,7 +3438,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
 	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
 		m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
 
-	ret = ext4_map_blocks(handle, inode, map, m_flags);
+	if (flags & IOMAP_ATOMIC && ext4_has_feature_bigalloc(inode->i_sb))
+		ret = ext4_map_blocks_atomic(handle, inode, map);
+	else
+		ret = ext4_map_blocks(handle, inode, map, m_flags);
 
 	/*
 	 * We cannot fill holes in indirect tree based inodes as that could
@@ -3401,6 +3465,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	int ret;
 	struct ext4_map_blocks map;
 	u8 blkbits = inode->i_blkbits;
+	unsigned int m_len_orig;
 
 	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
 		return -EINVAL;
@@ -3414,6 +3479,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	map.m_lblk = offset >> blkbits;
 	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
 			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
+	m_len_orig = map.m_len;
 
 	if (flags & IOMAP_WRITE) {
 		/*
@@ -3424,8 +3490,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 		 */
 		if (offset + length <= i_size_read(inode)) {
 			ret = ext4_map_blocks(NULL, inode, &map, 0);
-			if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
-				goto out;
+			/*
+			 * For atomic writes the entire requested length should
+			 * be mapped.
+			 */
+			if (map.m_flags & EXT4_MAP_MAPPED) {
+				if ((!(flags & IOMAP_ATOMIC) && ret > 0) ||
+				   (flags & IOMAP_ATOMIC && ret >= m_len_orig))
+					goto out;
+			}
+			map.m_len = m_len_orig;
 		}
 		ret = ext4_iomap_alloc(inode, &map, flags);
 	} else {
@@ -3442,6 +3516,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	 */
 	map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);
 
+	/*
+	 * Before returning to iomap, let's ensure the allocated mapping
+	 * covers the entire requested length for atomic writes.
+	 */
+	if (flags & IOMAP_ATOMIC) {
+		if (map.m_len < (length >> blkbits)) {
+			WARN_ON(1);
+			return -EINVAL;
+		}
+	}
 	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
 
 	return 0;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a50e5c31b937..cbb24d535d59 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4442,12 +4442,13 @@ static int ext4_handle_clustersize(struct super_block *sb)
 /*
  * ext4_atomic_write_init: Initializes filesystem min & max atomic write units.
  * @sb: super block
- * TODO: Later add support for bigalloc
  */
 static void ext4_atomic_write_init(struct super_block *sb)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct block_device *bdev = sb->s_bdev;
+	unsigned int blkbits = sb->s_blocksize_bits;
+	unsigned int clustersize = sb->s_blocksize;
 
 	if (!bdev_can_atomic_write(bdev))
 		return;
@@ -4455,9 +4456,12 @@ static void ext4_atomic_write_init(struct super_block *sb)
 	if (!ext4_has_feature_extents(sb))
 		return;
 
+	if (ext4_has_feature_bigalloc(sb))
+		clustersize = 1U << (sbi->s_cluster_bits + blkbits);
+
 	sbi->s_awu_min = max(sb->s_blocksize,
 			      bdev_atomic_write_unit_min_bytes(bdev));
-	sbi->s_awu_max = min(sb->s_blocksize,
+	sbi->s_awu_max = min(clustersize,
 			      bdev_atomic_write_unit_max_bytes(bdev));
 	if (sbi->s_awu_min && sbi->s_awu_max &&
 	    sbi->s_awu_min <= sbi->s_awu_max) {
-- 
2.48.1
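For a quick sanity check of the limits computed in ext4_atomic_write_init() above, the arithmetic can be reproduced in user space. This is only a sketch: struct awu_limits and compute_awu() are hypothetical names, the bdev_min/bdev_max parameters stand in for the values returned by bdev_atomic_write_unit_min_bytes() and bdev_atomic_write_unit_max_bytes(), and the early returns for devices without atomic write support or filesystems without extents are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the computed atomic write unit limits. */
struct awu_limits {
	unsigned int min;	/* bytes */
	unsigned int max;	/* bytes */
	bool enabled;		/* limits are usable for atomic writes */
};

/*
 * Mirror of the min/max arithmetic in the patch:
 *   min = max(blocksize, device minimum)
 *   max = min(clustersize, device maximum)
 * where clustersize equals blocksize unless bigalloc is enabled.
 */
struct awu_limits compute_awu(unsigned int blkbits, unsigned int cluster_bits,
			      bool bigalloc, unsigned int bdev_min,
			      unsigned int bdev_max)
{
	struct awu_limits l;
	unsigned int blocksize, clustersize;

	/* keep the shifts within the width of unsigned int */
	assert(blkbits + cluster_bits < 32);

	blocksize = 1U << blkbits;
	clustersize = blocksize;
	if (bigalloc)
		clustersize = 1U << (cluster_bits + blkbits);

	l.min = blocksize > bdev_min ? blocksize : bdev_min;
	l.max = clustersize < bdev_max ? clustersize : bdev_max;
	l.enabled = l.min && l.max && l.min <= l.max;
	return l;
}
```

For the mkfs example from the commit message (-b 4096 -C 16384), blkbits is 12 and cluster_bits is 2, so on a device advertising, say, 4096..65536 bytes this gives a minimum of 4096 and a maximum of 16384. Without bigalloc the maximum stays at the block size, which is the single-fsblock behaviour that existed before the patch.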
* [RFCv1 1/1] ext4: Add multi-fsblock atomic write support with bigalloc
  2025-03-23  7:00 ` [RFCv1 1/1] ext4: Add multi-fsblock atomic write support " Ritesh Harjani (IBM)
@ 2025-03-23  7:02 ` Ritesh Harjani (IBM)
  2025-03-25 11:42 ` Ojaswin Mujoo
  0 siblings, 1 reply; 15+ messages in thread
From: Ritesh Harjani (IBM) @ 2025-03-23  7:02 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, John Garry, djwong, linux-xfs, Theodore Ts'o, Ojaswin Mujoo, Ritesh Harjani (IBM)
* Re: [RFCv1 1/1] ext4: Add multi-fsblock atomic write support with bigalloc
  2025-03-23  7:02 ` Ritesh Harjani (IBM)
@ 2025-03-25 11:42   ` Ojaswin Mujoo
  0 siblings, 0 replies; 15+ messages in thread
From: Ojaswin Mujoo @ 2025-03-25 11:42 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-ext4, linux-fsdevel, John Garry, djwong, linux-xfs, Theodore Ts'o

On Sun, Mar 23, 2025 at 12:32:18PM +0530, Ritesh Harjani (IBM) wrote:
> EXT4 supports bigalloc feature which allows the FS to work in size of
> clusters (group of blocks) rather than individual blocks. This patch
> adds atomic write support for bigalloc so that systems with bs = ps can
> also create FS using -
> mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>
>
> With bigalloc ext4 can support multi-fsblock atomic writes. We will have to
> adjust ext4's atomic write unit max value to cluster size. This can then support
> atomic write of size anywhere between [blocksize, clustersize].
>
> We first query the underlying region of the requested range by calling
> ext4_map_blocks(). Here are the various cases which we then handle
> for block allocation depending upon the underlying mapping type:
> 1. If the underlying region for the entire requested range is a mapped extent,
>    then we don't call ext4_map_blocks() to allocate anything. We don't need to
>    even start the jbd2 txn in this case.
> 2. For an append write case, we create a mapped extent.
> 3. If the underlying region is entirely a hole, then we create an unwritten
>    extent for the requested range.
> 4. If the underlying region is a large unwritten extent, then we split the
>    extent into 2 unwritten extents of required size.
> 5. If the underlying region has any type of mixed mapping, then we call
>    ext4_map_blocks() in a loop to zero out the unwritten and the hole regions
>    within the requested range. This then provides a single mapped extent type
>    mapping for the requested range.
>
> Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO
> flag only when the underlying extent mapping of the requested range is
> not entirely a hole, an unwritten extent, or a fully mapped extent. That
> is, if the underlying region contains a mix of hole(s), unwritten
> extent(s), and mapped extent(s), we use this loop to ensure that all the
> short mappings are zeroed out. This guarantees that the entire requested
> range becomes a single, uniformly mapped extent. It is ok to do so
> because we know this is being done on a bigalloc enabled filesystem
> where the block bitmap represents the entire cluster unit.

Hi Ritesh, thanks for the patch. The approach looks good to me, just
adding a few comments below.

>
> Cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/ext4/inode.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++--
>  fs/ext4/super.c |  8 +++--
>  2 files changed, 93 insertions(+), 5 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d04d8a7f12e7..0096a597ad04 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3332,6 +3332,67 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>  		iomap->addr = IOMAP_NULL_ADDR;
>  	}
>  }
> +/*
> + * ext4_map_blocks_atomic: Helper routine to ensure the entire requested mapping
> + * [map.m_lblk, map.m_len] is one single contiguous extent with no mixed
> + * mappings. This function is only called when the bigalloc is enabled, so we
> + * know that the allocated physical extent start is always aligned properly.
> + *
> + * We call EXT4_GET_BLOCKS_ZERO only when the underlying physical extent for the
> + * requested range does not have a single mapping type (Hole, Mapped, or
> + * Unwritten) throughout. In that case we will loop over the requested range to
> + * allocate and zero out the unwritten / holes in between, to get a single
> + * mapped extent from [m_lblk, m_len].
> + * This case is mostly non-performance
> + * critical path, so it should be ok to loop using ext4_map_blocks() with
> + * appropriate flags to allocate & zero the underlying short holes/unwritten
> + * extents within the requested range.
> + */
> +static int ext4_map_blocks_atomic(handle_t *handle, struct inode *inode,
> +				  struct ext4_map_blocks *map)
> +{
> +	ext4_lblk_t m_lblk = map->m_lblk;
> +	unsigned int m_len = map->m_len;
> +	unsigned int mapped_len = 0, flags = 0;
> +	u8 blkbits = inode->i_blkbits;
> +	int ret;
> +
> +	WARN_ON(!ext4_has_feature_bigalloc(inode->i_sb));
> +
> +	ret = ext4_map_blocks(handle, inode, map, 0);
> +	if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
> +		flags = EXT4_GET_BLOCKS_CREATE;
> +	else if ((ret == 0 && map->m_len >= m_len) ||
> +		(ret >= m_len && map->m_flags & EXT4_MAP_UNWRITTEN))
> +		flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> +	else
> +		flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> +
> +	do {
> +		ret = ext4_map_blocks(handle, inode, map, flags);

With the multiple calls to map blocks for converting the extents, I don't
think the transaction reservation would be enough anymore, since in the
worst case we could be converting at least (max atomic write size / blocksize)
extents. We need to account for that as well.

> +		if (ret < 0)
> +			return ret;
> +		mapped_len += map->m_len;
> +		map->m_lblk += map->m_len;
> +		map->m_len = m_len - mapped_len;
> +	} while (mapped_len < m_len);
> +
> +	map->m_lblk = m_lblk;
> +	map->m_len = m_len;
> +
> +	/*
> +	 * We might have done some work in above loop. Let's ensure we query the
> +	 * start of the physical extent, based on the origin m_lblk and m_len
> +	 * and also ensure we were able to allocate the required range for doing
> +	 * atomic write.
> +	 */
> +	ret = ext4_map_blocks(handle, inode, map, 0);

Here, we are calling ext4_map_blocks() 3 times unnecessarily even if a
single complete mapping is found. I think a better approach would be to
just go for the map_blocks and then decide if we want to split. Also,
factor out a function to do the zero out. So, something like:

	if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
		flags = EXT4_GET_BLOCKS_CREATE;
	else
		flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;

	ret = ext4_map_blocks(handle, inode, map, flags);

	if (map->m_len < m_len) {
		map->m_len = m_len;
		/* do the zero out */
		ext4_zero_mixed_mappings(handle, inode, map);
		ext4_map_blocks(handle, inode, map, 0);
		WARN_ON(!(map->m_flags & EXT4_MAP_MAPPED) || map->m_len < m_len);
	}

I think this covers the 5 cases you mentioned in the commit message, if
I'm not missing anything. Also, this way we avoid the duplication for
non zero-out cases, and the zero-out function can then be reused in case
we want to do the same for forcealign atomic writes in the future.

Regards,
ojaswin

> +	if (ret != m_len) {
> +		ext4_warning_inode(inode, "allocation failed for atomic write request pos:%u, len:%u\n",
> +				m_lblk, m_len);
> +		return -EINVAL;
> +	}
> +	return mapped_len;
> +}
>
>  static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>  			    unsigned int flags)
> @@ -3377,7 +3438,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>  	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>  		m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
>
> -	ret = ext4_map_blocks(handle, inode, map, m_flags);
> +	if (flags & IOMAP_ATOMIC && ext4_has_feature_bigalloc(inode->i_sb))
> +		ret = ext4_map_blocks_atomic(handle, inode, map);
> +	else
> +		ret = ext4_map_blocks(handle, inode, map, m_flags);
>
>  	/*
>  	 * We cannot fill holes in indirect tree based inodes as that could
> @@ -3401,6 +3465,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  	int ret;
>  	struct ext4_map_blocks map;
>  	u8 blkbits = inode->i_blkbits;
> +	unsigned int m_len_orig;
>
>  	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
>  		return -EINVAL;
> @@ -3414,6 +3479,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  	map.m_lblk = offset >> blkbits;
>  	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>  			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> +	m_len_orig = map.m_len;
>
>  	if (flags & IOMAP_WRITE) {
>  		/*
> @@ -3424,8 +3490,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  		 */
>  		if (offset + length <= i_size_read(inode)) {
>  			ret = ext4_map_blocks(NULL, inode, &map, 0);
> -			if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
> -				goto out;
> +			/*
> +			 * For atomic writes the entire requested length should
> +			 * be mapped.
> +			 */
> +			if (map.m_flags & EXT4_MAP_MAPPED) {
> +				if ((!(flags & IOMAP_ATOMIC) && ret > 0) ||
> +				   (flags & IOMAP_ATOMIC && ret >= m_len_orig))
> +					goto out;
> +			}
> +			map.m_len = m_len_orig;
>  		}
>  		ret = ext4_iomap_alloc(inode, &map, flags);
>  	} else {
> @@ -3442,6 +3516,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  	 */
>  	map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);
>
> +	/*
> +	 * Before returning to iomap, let's ensure the allocated mapping
> +	 * covers the entire requested length for atomic writes.
> +	 */
> +	if (flags & IOMAP_ATOMIC) {
> +		if (map.m_len < (length >> blkbits)) {
> +			WARN_ON(1);
> +			return -EINVAL;
> +		}
> +	}
>  	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>
>  	return 0;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index a50e5c31b937..cbb24d535d59 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4442,12 +4442,13 @@ static int ext4_handle_clustersize(struct super_block *sb)
>  /*
>   * ext4_atomic_write_init: Initializes filesystem min & max atomic write units.
>   * @sb: super block
> - * TODO: Later add support for bigalloc
>   */
>  static void ext4_atomic_write_init(struct super_block *sb)
>  {
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
>  	struct block_device *bdev = sb->s_bdev;
> +	unsigned int blkbits = sb->s_blocksize_bits;
> +	unsigned int clustersize = sb->s_blocksize;
>
>  	if (!bdev_can_atomic_write(bdev))
>  		return;
> @@ -4455,9 +4456,12 @@ static void ext4_atomic_write_init(struct super_block *sb)
>  	if (!ext4_has_feature_extents(sb))
>  		return;
>
> +	if (ext4_has_feature_bigalloc(sb))
> +		clustersize = 1U << (sbi->s_cluster_bits + blkbits);
> +
>  	sbi->s_awu_min = max(sb->s_blocksize,
>  			      bdev_atomic_write_unit_min_bytes(bdev));
> -	sbi->s_awu_max = min(sb->s_blocksize,
> +	sbi->s_awu_max = min(clustersize,
>  			      bdev_atomic_write_unit_max_bytes(bdev));
>  	if (sbi->s_awu_min && sbi->s_awu_max &&
>  	    sbi->s_awu_min <= sbi->s_awu_max) {
> --
> 2.48.1
>

^ permalink raw reply [flat|nested] 15+ messages in thread
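To make the transaction-credit concern in the review above concrete, here is a small user-space model, not kernel code. It treats the requested range as an array of per-block states and counts the runs of identical state, which approximates how many separate mappings a loop over ext4_map_blocks() would have to visit. The names enum blk_state, count_runs(), needs_zeroout() and worst_case_runs() are hypothetical, and real extent boundaries can split a run further, so the run count is only a lower bound on the number of iterations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-block mapping state for a requested logical range. */
enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_MAPPED };

/* Count maximal runs of identical state in blocks[0..n). */
static size_t count_runs(const enum blk_state *blocks, size_t n)
{
	size_t runs = 0;

	for (size_t i = 0; i < n; i++) {
		if (i == 0 || blocks[i] != blocks[i - 1])
			runs++;
	}
	return runs;
}

/*
 * A range with more than one run is a "mixed mapping" (case 5 in the
 * commit message) and is the only case that needs the zero-out loop.
 */
static int needs_zeroout(const enum blk_state *blocks, size_t n)
{
	return count_runs(blocks, n) > 1;
}

/*
 * Worst case named in the review: every block of a maximum-sized atomic
 * write is its own run, i.e. max atomic write size / blocksize.
 */
static size_t worst_case_runs(size_t awu_max_bytes, size_t blocksize)
{
	assert(blocksize != 0);
	return awu_max_bytes / blocksize;
}
```

For a 16 KiB cluster with 4 KiB blocks, worst_case_runs(16384, 4096) is 4, so a range laid out as mapped, hole, unwritten, mapped already hits the worst case and each conversion would need its own share of journal credits.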
* [RFCv1 0/1] EXT4 support of multi-fsblock atomic write with bigalloc
  2025-03-23  7:00 ` [RFCv1 0/1] EXT4 support of multi-fsblock atomic write with bigalloc Ritesh Harjani (IBM)
  2025-03-23  7:00   ` [RFCv1 1/1] ext4: Add multi-fsblock atomic write support " Ritesh Harjani (IBM)
@ 2025-03-23  7:02   ` Ritesh Harjani (IBM)
  1 sibling, 0 replies; 15+ messages in thread
From: Ritesh Harjani (IBM) @ 2025-03-23  7:02 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, John Garry, djwong, linux-xfs, Theodore Ts'o,
	Ojaswin Mujoo, Ritesh Harjani (IBM)

This is an RFC patch before LSFMM to preview what multi-fsblock atomic
write support with bigalloc would look like. There is scope for
improvement in the implementation, however this shows the general idea of
the design. More details are provided in the actual patch. There are still
todos and more testing is needed. But with the iomap limitation of single
fsblock atomic write now lifted, the patch has definitely started to look
better.

This is based out of the vfs.all tree [1] for 6.15, which now has the
necessary iomap changes required for the bigalloc support in ext4.

TODOs:
1. Add better testcases to test atomic write support with bigalloc.
2. Discuss the approach of keeping the jbd2 txn open while zeroing the
   short underlying unwritten extents or short holes to create a single
   mapped type extent mapping. This anyway should be a non-performance
   critical path.
3. We use ext4_map_blocks() in a loop instead of modifying the block
   allocator. Again, since it's a non-performance sensitive path, hopefully
   it should be ok? Because otherwise one can argue why take and release
   EXT4_I(inode)->i_data_sem multiple times. We won't take & release any
   group lock for this, since we know that with bigalloc the cluster is
   anyway available to us.
4. Once we start supporting files/inodes marked with the atomic writes
   attribute, maybe we can add some optimizations like zeroing out the
   entire underlying cluster when someone forcefully wants to fzero or
   fpunch an underlying disk block, to keep the mapped extent intact.
5. Stress test of this is still pending through fsx and xfstests.

Reviews are appreciated.

[1]: https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.all&id=4f76518956c037517a4e4b120186075d3afb8266

Ritesh Harjani (IBM) (1):
  ext4: Add atomic write support for bigalloc

 fs/ext4/inode.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/ext4/super.c |  8 +++--
 2 files changed, 93 insertions(+), 5 deletions(-)

--
2.48.1

^ permalink raw reply [flat|nested] 15+ messages in thread
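For TODO 1, a test case first has to pick offsets and lengths that qualify as atomic writes at all. A minimal sketch of that check in plain C, assuming the usual RWF_ATOMIC rules: the length is a power of two within [awu_min, awu_max], and the offset is naturally aligned to the length. atomic_write_ok() and is_pow2() are hypothetical helpers, not existing kernel or libc functions, and the limits are passed in by the caller rather than queried from the filesystem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if v is a non-zero power of two. */
static bool is_pow2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Check whether a write of len bytes at offset could be issued as an
 * atomic write, given the filesystem's advertised unit limits.
 */
static bool atomic_write_ok(uint64_t offset, uint64_t len,
			    uint64_t awu_min, uint64_t awu_max)
{
	if (!is_pow2(len))
		return false;			/* length must be a power of two */
	if (len < awu_min || len > awu_max)
		return false;			/* length must be within limits */
	return (offset & (len - 1)) == 0;	/* offset aligned to the length */
}
```

With limits of 4096 and 16384 bytes, as on a bigalloc filesystem with 4 KiB blocks and 16 KiB clusters, a 16384-byte write at offset 0 and an 8192-byte write at offset 8192 both pass, while an 8192-byte write at offset 4096 (misaligned) and a 12288-byte write (not a power of two) are rejected.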