* [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc
@ 2025-05-08 20:50 Ritesh Harjani (IBM)
2025-05-08 20:50 ` [PATCH v3 1/7] ext4: Document an edge case for overwrites Ritesh Harjani (IBM)
` (7 more replies)
0 siblings, 8 replies; 26+ messages in thread
From: Ritesh Harjani (IBM) @ 2025-05-08 20:50 UTC (permalink / raw)
To: linux-ext4
Cc: Theodore Ts'o, Jan Kara, John Garry, djwong, Ojaswin Mujoo,
linux-fsdevel, Ritesh Harjani (IBM)
This is v3 of multi-fsblock atomic write support using bigalloc. The series
is now in much better shape. The major design changes are in patches 4 and 5.
The series is ready for careful review; all error-handling code paths should
now be properly taken care of.
v2 -> v3:
=========
1. Improved error handling at several places.
2. Further fixed some worst case journal credits estimation.
3. Added better checks in the slow path allocation loop for atomic writes.
v3 testing so far:
===============
- This has survived "quick" & "auto" group testing with bigalloc on x86 and Power.
- We have also run atomic write tests using fio, and some data integrity
tests with sudden power off during writes on the scsi_debug module.
(We will clean up these tests and try to post them soon!)
Appreciate any review comments / feedback!
v1 -> v2:
==========
1. Handled review comments from Ojaswin to optimize the ext4_map_blocks() calls
in ext4_iomap_alloc().
2. Fixed the journal credits calculation for both:
- during block allocation in ext4_iomap_alloc()
- during dio completion in ->end_io callback.
Earlier we were starting multiple txns in the ->end_io callback for unwritten to
written conversion. Since atomic writes need a single jbd2 txn, the necessary
changes were made there.
[v2]: https://lore.kernel.org/linux-ext4/cover.1745987268.git.ritesh.list@gmail.com/
Ritesh Harjani (IBM) (7):
ext4: Document an edge case for overwrites
ext4: Check if inode uses extents in ext4_inode_can_atomic_write()
ext4: Make ext4_meta_trans_blocks() non-static for later use
ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS
ext4: Add multi-fsblock atomic write support with bigalloc
ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc
ext4: Add atomic block write documentation
.../filesystems/ext4/atomic_writes.rst | 208 +++++++++++++
Documentation/filesystems/ext4/overview.rst | 1 +
fs/ext4/ext4.h | 26 +-
fs/ext4/extents.c | 99 ++++++
fs/ext4/file.c | 7 +-
fs/ext4/inode.c | 291 ++++++++++++++++--
fs/ext4/super.c | 7 +-
7 files changed, 614 insertions(+), 25 deletions(-)
create mode 100644 Documentation/filesystems/ext4/atomic_writes.rst
--
2.49.0
* [PATCH v3 1/7] ext4: Document an edge case for overwrites
2025-05-08 20:50 [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani (IBM)
@ 2025-05-08 20:50 ` Ritesh Harjani (IBM)
2025-05-09 5:19 ` Ojaswin Mujoo
2025-05-14 16:23 ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 2/7] ext4: Check if inode uses extents in ext4_inode_can_atomic_write() Ritesh Harjani (IBM)
` (6 subsequent siblings)
7 siblings, 2 replies; 26+ messages in thread
From: Ritesh Harjani (IBM) @ 2025-05-08 20:50 UTC (permalink / raw)
To: linux-ext4
Cc: Theodore Ts'o, Jan Kara, John Garry, djwong, Ojaswin Mujoo,
linux-fsdevel, Ritesh Harjani (IBM)
ext4_iomap_overwrite_begin() clears the flag for IOMAP_WRITE before
calling ext4_iomap_begin(). Document this above ext4_map_blocks() call
as it is easy to miss it when focusing on write paths alone.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
fs/ext4/inode.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 94c7d2d828a6..b10e5cd5bb5c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3436,6 +3436,10 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
}
ret = ext4_iomap_alloc(inode, &map, flags);
} else {
+ /*
+ * This can be called for overwrites path from
+ * ext4_iomap_overwrite_begin().
+ */
ret = ext4_map_blocks(NULL, inode, &map, 0);
}
--
2.49.0
* [PATCH v3 2/7] ext4: Check if inode uses extents in ext4_inode_can_atomic_write()
2025-05-08 20:50 [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani (IBM)
2025-05-08 20:50 ` [PATCH v3 1/7] ext4: Document an edge case for overwrites Ritesh Harjani (IBM)
@ 2025-05-08 20:50 ` Ritesh Harjani (IBM)
2025-05-09 5:20 ` Ojaswin Mujoo
2025-05-14 16:24 ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 3/7] ext4: Make ext4_meta_trans_blocks() non-static for later use Ritesh Harjani (IBM)
` (5 subsequent siblings)
7 siblings, 2 replies; 26+ messages in thread
From: Ritesh Harjani (IBM) @ 2025-05-08 20:50 UTC (permalink / raw)
To: linux-ext4
Cc: Theodore Ts'o, Jan Kara, John Garry, djwong, Ojaswin Mujoo,
linux-fsdevel, Ritesh Harjani (IBM)
ext4 only supports atomic writes on inodes that use extents, so add a
check in ext4_inode_can_atomic_write(), which gets called during
open.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
fs/ext4/ext4.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 5a20e9cd7184..c0240f6f6491 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3847,7 +3847,9 @@ static inline int ext4_buffer_uptodate(struct buffer_head *bh)
static inline bool ext4_inode_can_atomic_write(struct inode *inode)
{
- return S_ISREG(inode->i_mode) && EXT4_SB(inode->i_sb)->s_awu_min > 0;
+ return S_ISREG(inode->i_mode) &&
+ ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
+ EXT4_SB(inode->i_sb)->s_awu_min > 0;
}
extern int ext4_block_write_begin(handle_t *handle, struct folio *folio,
--
2.49.0
* [PATCH v3 3/7] ext4: Make ext4_meta_trans_blocks() non-static for later use
2025-05-08 20:50 [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani (IBM)
2025-05-08 20:50 ` [PATCH v3 1/7] ext4: Document an edge case for overwrites Ritesh Harjani (IBM)
2025-05-08 20:50 ` [PATCH v3 2/7] ext4: Check if inode uses extents in ext4_inode_can_atomic_write() Ritesh Harjani (IBM)
@ 2025-05-08 20:50 ` Ritesh Harjani (IBM)
2025-05-09 5:21 ` Ojaswin Mujoo
2025-05-14 16:24 ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 4/7] ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS Ritesh Harjani (IBM)
` (4 subsequent siblings)
7 siblings, 2 replies; 26+ messages in thread
From: Ritesh Harjani (IBM) @ 2025-05-08 20:50 UTC (permalink / raw)
To: linux-ext4
Cc: Theodore Ts'o, Jan Kara, John Garry, djwong, Ojaswin Mujoo,
linux-fsdevel, Ritesh Harjani (IBM)
Let's make ext4_meta_trans_blocks() non-static for use in later
functions during ->end_io conversion for atomic writes.
We will need this function to estimate journal credits for a special
case. Instead of adding another wrapper around it, let's make this
non-static.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
fs/ext4/ext4.h | 2 ++
fs/ext4/inode.c | 6 +-----
2 files changed, 3 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c0240f6f6491..e2b36a3c1b0f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3039,6 +3039,8 @@ extern void ext4_set_aops(struct inode *inode);
extern int ext4_writepage_trans_blocks(struct inode *);
extern int ext4_normal_submit_inode_data_buffers(struct jbd2_inode *jinode);
extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
+extern int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
+ int pextents);
extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
loff_t lstart, loff_t lend);
extern vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b10e5cd5bb5c..2f99b087a5d8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -142,9 +142,6 @@ static inline int ext4_begin_ordered_truncate(struct inode *inode,
new_size);
}
-static int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
- int pextents);
-
/*
* Test whether an inode is a fast symlink.
* A fast symlink has its symlink data stored in ext4_inode_info->i_data.
@@ -5777,8 +5774,7 @@ static int ext4_index_trans_blocks(struct inode *inode, int lblocks,
*
* Also account for superblock, inode, quota and xattr blocks
*/
-static int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
- int pextents)
+int ext4_meta_trans_blocks(struct inode *inode, int lblocks, int pextents)
{
ext4_group_t groups, ngroups = ext4_get_groups_count(inode->i_sb);
int gdpblocks;
--
2.49.0
* [PATCH v3 4/7] ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS
2025-05-08 20:50 [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani (IBM)
` (2 preceding siblings ...)
2025-05-08 20:50 ` [PATCH v3 3/7] ext4: Make ext4_meta_trans_blocks() non-static for later use Ritesh Harjani (IBM)
@ 2025-05-08 20:50 ` Ritesh Harjani (IBM)
2025-05-14 16:16 ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 5/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani (IBM)
` (3 subsequent siblings)
7 siblings, 1 reply; 26+ messages in thread
From: Ritesh Harjani (IBM) @ 2025-05-08 20:50 UTC (permalink / raw)
To: linux-ext4
Cc: Theodore Ts'o, Jan Kara, John Garry, djwong, Ojaswin Mujoo,
linux-fsdevel, Ritesh Harjani (IBM)
Contiguous extents can sit on adjacent leaf nodes of the on-disk extent
tree. When someone writes to such a contiguous range, ext4_map_blocks()
splits the mapping and returns one extent at a time, unless the extents
are already cached in the extent status tree (where they can get merged
since they are contiguous).
This is fine for a normal write, but an atomic write cannot afford to be
broken into two. This only happens in the slow write path, where we call
ext4_map_blocks() for each of these extents spread across different leaf
nodes. However, there is no guarantee that the extent status cache
entries are not reclaimed before the last call to ext4_map_blocks() in
ext4_map_blocks_atomic_write_slow().
Hence this patch adds support for EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF.
With this flag, ext4_map_blocks() first checks whether the requested
range can be fully found in the extent status cache and returns if so.
If not, it looks up the on-disk extent tree via ext4_map_query_blocks().
If the found extent is the last entry in its leaf node, it then queries
the next lblk to see if there is a contiguous extent in the adjacent
leaf node of the on-disk extent tree.
Adjacent extent entries can be spread across more than two leaf nodes,
but we only read one adjacent leaf block, i.e. a total of 2 extent
entries spread across 2 leaf nodes. The reason is that we mostly intend
to support atomic writes of up to 64KB, or at most up to 1MB.
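For illustration, the adjacent-leaf lookup described above can be modelled in
userspace roughly as follows. This is only a sketch under stated assumptions,
not the kernel code: `struct model_extent` and `model_merge_next_in_leaf()` are
hypothetical names, and the model covers just the contiguity and
written/unwritten checks, not the extent status cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of a mapped extent; not the kernel's struct. */
struct model_extent {
	uint32_t lblk;      /* first logical block */
	uint64_t pblk;      /* first physical block */
	uint32_t len;       /* number of blocks */
	bool     unwritten; /* unwritten vs. written */
};

/*
 * 'first' is the last extent in its leaf, 'next' is the extent found by
 * querying the block right after it, and 'want' is the originally requested
 * length. Returns the resulting length of 'first'. The two are merged only
 * when they are logically and physically contiguous and share the same
 * written/unwritten state; the result never exceeds 'want'.
 */
uint32_t model_merge_next_in_leaf(struct model_extent *first,
				  const struct model_extent *next,
				  uint32_t want)
{
	assert(first && next);

	/* Nothing to extend if the first extent already covers the request. */
	if (first->len >= want)
		return first->len;

	bool logically_adjacent = first->lblk + first->len == next->lblk;
	bool physically_adjacent = first->pblk + first->len == next->pblk;

	if (logically_adjacent && physically_adjacent &&
	    first->unwritten == next->unwritten) {
		uint32_t remaining = want - first->len;

		first->len += next->len < remaining ? next->len : remaining;
	}
	return first->len;
}
```

If the merged length still falls short of the request, a caller in this model
would treat the range as not mappable in one piece, which is the situation the
slow path exists for.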
Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
fs/ext4/ext4.h | 18 ++++++++-
fs/ext4/extents.c | 12 ++++++
fs/ext4/inode.c | 97 +++++++++++++++++++++++++++++++++++++++++------
3 files changed, 115 insertions(+), 12 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e2b36a3c1b0f..b4bbe2837423 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -256,9 +256,19 @@ struct ext4_allocation_request {
#define EXT4_MAP_UNWRITTEN BIT(BH_Unwritten)
#define EXT4_MAP_BOUNDARY BIT(BH_Boundary)
#define EXT4_MAP_DELAYED BIT(BH_Delay)
+/*
+ * This is for use in ext4_map_query_blocks() for a special case where we can
+ * have physically and logically contiguous blocks split across two leaf
+ * nodes instead of a single extent. This is required in case of atomic writes
+ * to know whether the returned extent is last in leaf. If yes, then lookup for
+ * next in leaf block in ext4_map_query_blocks_next_in_leaf().
+ * - This is never going to be added to any buffer head state.
+ * - We use the next available bit after BH_BITMAP_UPTODATE.
+ */
+#define EXT4_MAP_QUERY_LAST_IN_LEAF BIT(BH_BITMAP_UPTODATE + 1)
#define EXT4_MAP_FLAGS (EXT4_MAP_NEW | EXT4_MAP_MAPPED |\
EXT4_MAP_UNWRITTEN | EXT4_MAP_BOUNDARY |\
- EXT4_MAP_DELAYED)
+ EXT4_MAP_DELAYED | EXT4_MAP_QUERY_LAST_IN_LEAF)
struct ext4_map_blocks {
ext4_fsblk_t m_pblk;
@@ -725,6 +735,12 @@ enum {
#define EXT4_GET_BLOCKS_IO_SUBMIT 0x0400
/* Caller is in the atomic contex, find extent if it has been cached */
#define EXT4_GET_BLOCKS_CACHED_NOWAIT 0x0800
+/*
+ * Atomic write caller needs this to query in the slow path of mixed mapping
+ * case, when a contiguous extent can be split across two adjacent leaf nodes.
+ * See EXT4_MAP_QUERY_LAST_IN_LEAF.
+ */
+#define EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF 0x1000
/*
* The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index c616a16a9f36..fa850f188d46 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4433,6 +4433,18 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
allocated = map->m_len;
ext4_ext_show_leaf(inode, path);
out:
+ /*
+ * We never use EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF with CREATE flag.
+ * So we know that the depth used here is correct, since there was no
+ * block allocation done if EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF is set.
+ * If tomorrow we start using this QUERY flag with CREATE, then we will
+ * need to re-calculate the depth as it might have changed due to block
+ * allocation.
+ */
+ if (flags & EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF)
+ if (!err && ex && (ex == EXT_LAST_EXTENT(path[depth].p_hdr)))
+ map->m_flags |= EXT4_MAP_QUERY_LAST_IN_LEAF;
+
ext4_free_ext_path(path);
trace_ext4_ext_map_blocks_exit(inode, flags, map,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2f99b087a5d8..8b86b1a29bdc 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -459,14 +459,71 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
}
#endif /* ES_AGGRESSIVE_TEST */
+static int ext4_map_query_blocks_next_in_leaf(handle_t *handle,
+ struct inode *inode, struct ext4_map_blocks *map,
+ unsigned int orig_mlen)
+{
+ struct ext4_map_blocks map2;
+ unsigned int status, status2;
+ int retval;
+
+ status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+ EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+
+ WARN_ON_ONCE(!(map->m_flags & EXT4_MAP_QUERY_LAST_IN_LEAF));
+ WARN_ON_ONCE(orig_mlen <= map->m_len);
+
+ /* Prepare map2 for lookup in next leaf block */
+ map2.m_lblk = map->m_lblk + map->m_len;
+ map2.m_len = orig_mlen - map->m_len;
+ map2.m_flags = 0;
+ retval = ext4_ext_map_blocks(handle, inode, &map2, 0);
+
+ if (retval <= 0) {
+ ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+ map->m_pblk, status, false);
+ return map->m_len;
+ }
+
+ if (unlikely(retval != map2.m_len)) {
+ ext4_warning(inode->i_sb,
+ "ES len assertion failed for inode "
+ "%lu: retval %d != map2.m_len %d",
+ inode->i_ino, retval, map2.m_len);
+ WARN_ON(1);
+ }
+
+ status2 = map2.m_flags & EXT4_MAP_UNWRITTEN ?
+ EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+
+ /*
+ * If map2 is contiguous with map, then let's insert it as a single
+ * extent in es cache and return the combined length of both the maps.
+ */
+ if (map->m_pblk + map->m_len == map2.m_pblk &&
+ status == status2) {
+ ext4_es_insert_extent(inode, map->m_lblk,
+ map->m_len + map2.m_len, map->m_pblk,
+ status, false);
+ map->m_len += map2.m_len;
+ } else {
+ ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+ map->m_pblk, status, false);
+ }
+
+ return map->m_len;
+}
+
static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
- struct ext4_map_blocks *map)
+ struct ext4_map_blocks *map, int flags)
{
unsigned int status;
int retval;
+ unsigned int orig_mlen = map->m_len;
+ unsigned int query_flags = flags & EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF;
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
- retval = ext4_ext_map_blocks(handle, inode, map, 0);
+ retval = ext4_ext_map_blocks(handle, inode, map, query_flags);
else
retval = ext4_ind_map_blocks(handle, inode, map, 0);
@@ -481,11 +538,23 @@ static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
WARN_ON(1);
}
- status = map->m_flags & EXT4_MAP_UNWRITTEN ?
- EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
- ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
- map->m_pblk, status, false);
- return retval;
+ /*
+ * No need to query next in leaf:
+ * - if returned extent is not last in leaf or
+ * - if the last in leaf is the full requested range
+ */
+ if (!(map->m_flags & EXT4_MAP_QUERY_LAST_IN_LEAF) ||
+ ((map->m_flags & EXT4_MAP_QUERY_LAST_IN_LEAF) &&
+ (map->m_len == orig_mlen))) {
+ status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+ EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+ ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+ map->m_pblk, status, false);
+ return retval;
+ }
+
+ return ext4_map_query_blocks_next_in_leaf(handle, inode, map,
+ orig_mlen);
}
static int ext4_map_create_blocks(handle_t *handle, struct inode *inode,
@@ -599,6 +668,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
struct extent_status es;
int retval;
int ret = 0;
+ unsigned int orig_mlen = map->m_len;
#ifdef ES_AGGRESSIVE_TEST
struct ext4_map_blocks orig_map;
@@ -650,7 +720,12 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
ext4_map_blocks_es_recheck(handle, inode, map,
&orig_map, flags);
#endif
- goto found;
+ if (!(flags & EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF) ||
+ orig_mlen == map->m_len)
+ goto found;
+
+ if (flags & EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF)
+ map->m_len = orig_mlen;
}
/*
* In the query cache no-wait mode, nothing we can do more if we
@@ -664,7 +739,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
* file system block.
*/
down_read(&EXT4_I(inode)->i_data_sem);
- retval = ext4_map_query_blocks(handle, inode, map);
+ retval = ext4_map_query_blocks(handle, inode, map, flags);
up_read((&EXT4_I(inode)->i_data_sem));
found:
@@ -1802,7 +1877,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
if (ext4_has_inline_data(inode))
retval = 0;
else
- retval = ext4_map_query_blocks(NULL, inode, map);
+ retval = ext4_map_query_blocks(NULL, inode, map, 0);
up_read(&EXT4_I(inode)->i_data_sem);
if (retval)
return retval < 0 ? retval : 0;
@@ -1825,7 +1900,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
goto found;
}
} else if (!ext4_has_inline_data(inode)) {
- retval = ext4_map_query_blocks(NULL, inode, map);
+ retval = ext4_map_query_blocks(NULL, inode, map, 0);
if (retval) {
up_write(&EXT4_I(inode)->i_data_sem);
return retval < 0 ? retval : 0;
--
2.49.0
* [PATCH v3 5/7] ext4: Add multi-fsblock atomic write support with bigalloc
2025-05-08 20:50 [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani (IBM)
` (3 preceding siblings ...)
2025-05-08 20:50 ` [PATCH v3 4/7] ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS Ritesh Harjani (IBM)
@ 2025-05-08 20:50 ` Ritesh Harjani (IBM)
2025-05-14 16:19 ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 6/7] ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc Ritesh Harjani (IBM)
` (2 subsequent siblings)
7 siblings, 1 reply; 26+ messages in thread
From: Ritesh Harjani (IBM) @ 2025-05-08 20:50 UTC (permalink / raw)
To: linux-ext4
Cc: Theodore Ts'o, Jan Kara, John Garry, djwong, Ojaswin Mujoo,
linux-fsdevel, Ritesh Harjani (IBM)
ext4 supports the bigalloc feature, which lets the filesystem allocate in
units of clusters (groups of blocks) rather than individual blocks. This
patch adds atomic write support for bigalloc so that systems with bs = ps
can also create a filesystem using:
mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>
With bigalloc, ext4 can support multi-fsblock atomic writes. We have to
adjust ext4's atomic write unit max value to the cluster size. This then
supports atomic writes of any size in [blocksize, clustersize]. This
patch adds the required changes; multi-fsblock atomic write support
using bigalloc is enabled in the next patch.
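As a rough illustration of the [blocksize, clustersize] window, here is a
minimal userspace sketch using the mkfs example above (4096-byte blocks,
16384-byte clusters). The function name `model_atomic_write_ok()` is
hypothetical. The power-of-two length and natural alignment checks are an
assumption taken from the generic atomic write rules, not something this patch
itself states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool is_pow2(uint64_t v)
{
	return v && !(v & (v - 1));
}

/*
 * Hypothetical check: is a write of 'len' bytes at byte offset 'pos'
 * eligible when the atomic write unit min is the block size and the
 * atomic write unit max is the cluster size?
 */
bool model_atomic_write_ok(uint64_t pos, uint64_t len,
			   uint32_t blocksize, uint32_t clustersize)
{
	assert(is_pow2(blocksize) && is_pow2(clustersize));
	assert(blocksize <= clustersize);

	/* Size must lie within [blocksize, clustersize]. */
	if (len < blocksize || len > clustersize)
		return false;
	/* Assumption: generic rule that the length is a power of two. */
	if (!is_pow2(len))
		return false;
	/* Assumption: generic rule that the offset is aligned to the length. */
	return (pos & (len - 1)) == 0;
}
```

With these limits, a 16384-byte write at offset 16384 passes, while a
32768-byte write or an 8192-byte write at offset 4096 does not.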
In this patch, for block allocation:
we first query the underlying region of the requested range by calling
ext4_map_blocks(). Here are the various cases which we then handle
depending upon the underlying mapping type:
1. If the underlying region for the entire requested range is a mapped extent,
then we don't call ext4_map_blocks() to allocate anything. We don't even
need to start the jbd2 txn in this case.
2. For an append write case, we create a mapped extent.
3. If the underlying region is entirely a hole, then we create an unwritten
extent for the requested range.
4. If the underlying region is a large unwritten extent, then we split the
extent into 2 unwritten extents of the required size.
5. If the underlying region has any type of mixed mapping, then we call
ext4_map_blocks() in a loop to zero out the unwritten and the hole regions
within the requested range. This then provides a single mapped extent type
mapping for the requested range.
Note: We invoke ext4_map_blocks() in a loop with the
EXT4_GET_BLOCKS_CREATE_ZERO flag only when the underlying extent mapping
of the requested range is not entirely a hole, an unwritten extent, or a
fully mapped extent. That is, if the underlying region contains a mix of
hole(s), unwritten extent(s), and mapped extent(s), we use this loop to
ensure that all the short mappings are zeroed out. This guarantees that
the entire requested range becomes a single, uniformly mapped extent. It
is ok to do so because we know this is being done on a bigalloc enabled
filesystem where the block bitmap represents the entire cluster unit.
Note that having a single contiguous underlying region of type mapped,
unwritten or hole is not a problem. The reason to avoid writing on top
of a mixed mapping region is that atomic writes require that either all
or nothing gets written for the userspace pwritev2 request. So if a
crash or a sudden poweroff occurs at any point during the write, the
region undergoing the atomic write should read back either completely
old data or completely new data. It should never have a mix of both old
and new data.
So, we first convert any mixed mapping region to a single contiguous
mapped extent before any data gets written to it. This is because
normally the FS only converts unwritten extents to written at the end of
the write in the ->end_io() call. If we allowed writes over a mixed
mapping and a sudden power off happened in between, we would end up
reading a mix of new data (over mapped extents) and old data (over
unwritten extents), because the unwritten to written conversion never
went through.
So to avoid this, and to avoid writes getting torn due to mixed
mappings, we first allocate a single contiguous block mapping and then
do the write.
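The case analysis above can be sketched as a small userspace model. This is
only an illustration under stated assumptions: the enum names and
`model_pick_action()` are hypothetical, each block's state is given as an
array entry, and the append write case (case 2) is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-block mapping state of the requested range. */
enum model_blk { MODEL_HOLE, MODEL_UNWRITTEN, MODEL_MAPPED };

enum model_action {
	MODEL_NO_ALLOC,        /* case 1: whole range already mapped */
	MODEL_ALLOC_UNWRITTEN, /* case 3: whole range is a hole */
	MODEL_SPLIT_UNWRITTEN, /* case 4: range lies in an unwritten extent */
	MODEL_ZERO_AND_MAP,    /* case 5: mixed mapping, slow path */
};

/*
 * Pick an action for a range of 'n' blocks. A range with a single uniform
 * mapping type can be written as is; any mix of types is first turned into
 * one zeroed, mapped extent so that a crash cannot expose partly old and
 * partly new data.
 */
enum model_action model_pick_action(const enum model_blk *blks, size_t n)
{
	assert(blks && n > 0);

	for (size_t i = 1; i < n; i++)
		if (blks[i] != blks[0])
			return MODEL_ZERO_AND_MAP;

	switch (blks[0]) {
	case MODEL_MAPPED:
		return MODEL_NO_ALLOC;
	case MODEL_HOLE:
		return MODEL_ALLOC_UNWRITTEN;
	default:
		return MODEL_SPLIT_UNWRITTEN;
	}
}
```

Only the last action costs extra I/O for zeroing, which matches the
description of the mixed mapping case as a slow path.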
Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
fs/ext4/ext4.h | 2 +
fs/ext4/extents.c | 87 ++++++++++++++++++++++
fs/ext4/file.c | 7 +-
fs/ext4/inode.c | 184 +++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 275 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b4bbe2837423..2567ed181f9f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3728,6 +3728,8 @@ extern long ext4_fallocate(struct file *file, int mode, loff_t offset,
loff_t len);
extern int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
loff_t offset, ssize_t len);
+extern int ext4_convert_unwritten_extents_atomic(handle_t *handle,
+ struct inode *inode, loff_t offset, ssize_t len);
extern int ext4_convert_unwritten_io_end_vec(handle_t *handle,
ext4_io_end_t *io_end);
extern int ext4_map_blocks(handle_t *handle, struct inode *inode,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index fa850f188d46..2967c74dabaf 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4792,6 +4792,93 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
return ret;
}
+/*
+ * This function converts a range of blocks to written extents. The caller of
+ * this function will pass the start offset and the size. All unwritten extents
+ * within this range will be converted to written extents.
+ *
+ * This function is called from the direct IO end io call back function for
+ * atomic writes, to convert the unwritten extents after IO is completed.
+ *
+ * Note that the requirement for atomic writes is that all conversion should
+ * happen atomically in a single fs journal transaction. We mainly only allocate
+ * unwritten extents either on a hole or on a pre-existing unwritten extent range in
+ * ext4_map_blocks_atomic_write(). The only case where we can have multiple
+ * unwritten extents in a range [offset, offset+len) is when there is a split
+ * unwritten extent between two leaf nodes which was cached in extent status
+ * cache during ext4_iomap_alloc() time. That will allow
+ * ext4_map_blocks_atomic_write() to return the unwritten extent range w/o going
+ * into the slow path. That means we might need a loop for conversion of this
+ * unwritten extent split across leaf block within a single journal transaction.
+ * Split extents across leaf nodes is a rare case, but let's still handle that
+ * to meet the requirements of multi-fsblock atomic writes.
+ *
+ * Returns 0 on success.
+ */
+int ext4_convert_unwritten_extents_atomic(handle_t *handle, struct inode *inode,
+ loff_t offset, ssize_t len)
+{
+ unsigned int max_blocks;
+ int ret = 0, ret2 = 0, ret3 = 0;
+ struct ext4_map_blocks map;
+ unsigned int blkbits = inode->i_blkbits;
+ unsigned int credits = 0;
+ int flags = EXT4_GET_BLOCKS_IO_CONVERT_EXT;
+
+ map.m_lblk = offset >> blkbits;
+ max_blocks = EXT4_MAX_BLOCKS(len, offset, blkbits);
+
+ if (!handle) {
+ /*
+ * TODO: An optimization can be added later by having an extent
+ * status flag e.g. EXTENT_STATUS_SPLIT_LEAF. If we query that
+ * it can tell if the extent in the cache is a split extent.
+ * But for now let's assume pextents as 2 always.
+ */
+ credits = ext4_meta_trans_blocks(inode, max_blocks, 2);
+ }
+
+ if (credits) {
+ handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ return ret;
+ }
+ }
+
+ while (ret >= 0 && ret < max_blocks) {
+ map.m_lblk += ret;
+ map.m_len = (max_blocks -= ret);
+ ret = ext4_map_blocks(handle, inode, &map, flags);
+ if (ret != max_blocks)
+ ext4_msg(inode->i_sb, KERN_INFO,
+ "inode #%lu: block %u: len %u: "
+ "split block mapping found for atomic write, "
+ "ret = %d",
+ inode->i_ino, map.m_lblk,
+ map.m_len, ret);
+ if (ret <= 0)
+ break;
+ }
+
+ ret2 = ext4_mark_inode_dirty(handle, inode);
+
+ if (credits) {
+ ret3 = ext4_journal_stop(handle);
+ if (unlikely(ret3))
+ ret2 = ret3;
+ }
+
+ if (ret <= 0 || ret2)
+ ext4_warning(inode->i_sb,
+ "inode #%lu: block %u: len %u: "
+ "returned %d or %d",
+ inode->i_ino, map.m_lblk,
+ map.m_len, ret, ret2);
+
+ return ret > 0 ? ret2 : ret;
+}
+
/*
* This function convert a range of blocks to written extents
* The caller of this function will pass the start offset and the size.
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index beb078ee4811..959328072c15 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -377,7 +377,12 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
loff_t pos = iocb->ki_pos;
struct inode *inode = file_inode(iocb->ki_filp);
- if (!error && size && flags & IOMAP_DIO_UNWRITTEN)
+
+ if (!error && size && (flags & IOMAP_DIO_UNWRITTEN) &&
+ (iocb->ki_flags & IOCB_ATOMIC))
+ error = ext4_convert_unwritten_extents_atomic(NULL, inode, pos,
+ size);
+ else if (!error && size && flags & IOMAP_DIO_UNWRITTEN)
error = ext4_convert_unwritten_extents(NULL, inode, pos, size);
if (error)
return error;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8b86b1a29bdc..2642e1ef128f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3412,6 +3412,136 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
}
}
+static int ext4_map_blocks_atomic_write_slow(handle_t *handle,
+ struct inode *inode, struct ext4_map_blocks *map)
+{
+ ext4_lblk_t m_lblk = map->m_lblk;
+ unsigned int m_len = map->m_len;
+ unsigned int mapped_len = 0, m_flags = 0;
+ ext4_fsblk_t next_pblk;
+ bool check_next_pblk = false;
+ int ret = 0;
+
+ WARN_ON_ONCE(!ext4_has_feature_bigalloc(inode->i_sb));
+
+ /*
+ * This is a slow path in case of mixed mapping. We use
+ * EXT4_GET_BLOCKS_CREATE_ZERO flag here to make sure we get a single
+ * contiguous mapped mapping. This will ensure any unwritten or hole
+ * regions within the requested range is zeroed out and we return
+ * a single contiguous mapped extent.
+ */
+ m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
+
+ do {
+ ret = ext4_map_blocks(handle, inode, map, m_flags);
+ if (ret < 0 && ret != -ENOSPC)
+ goto out_err;
+ /*
+ * This should never happen, but let's return an error code to
+ * avoid an infinite loop in here.
+ */
+ if (ret == 0) {
+ ret = -EFSCORRUPTED;
+ ext4_warning_inode(inode,
+ "ext4_map_blocks() couldn't allocate blocks m_flags: 0x%x, ret:%d",
+ m_flags, ret);
+ goto out_err;
+ }
+ /*
+ * With bigalloc we should never get ENOSPC nor discontiguous
+ * physical extents.
+ */
+ if ((check_next_pblk && next_pblk != map->m_pblk) ||
+ ret == -ENOSPC) {
+ ext4_warning_inode(inode,
+ "Non-contiguous allocation detected: expected %llu, got %llu, "
+ "or ext4_map_blocks() returned out of space ret: %d",
+ next_pblk, map->m_pblk, ret);
+ ret = -ENOSPC;
+ goto out_err;
+ }
+ next_pblk = map->m_pblk + map->m_len;
+ check_next_pblk = true;
+
+ mapped_len += map->m_len;
+ map->m_lblk += map->m_len;
+ map->m_len = m_len - mapped_len;
+ } while (mapped_len < m_len);
+
+ /*
+ * We might have done some work in above loop, so we need to query the
+ * start of the physical extent, based on the origin m_lblk and m_len.
+ * Let's also ensure we were able to allocate the required range for
+ * mixed mapping case.
+ */
+ map->m_lblk = m_lblk;
+ map->m_len = m_len;
+ map->m_flags = 0;
+
+ ret = ext4_map_blocks(handle, inode, map,
+ EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF);
+ if (ret != m_len) {
+ ext4_warning_inode(inode,
+ "allocation failed for atomic write request m_lblk:%u, m_len:%u, ret:%d\n",
+ m_lblk, m_len, ret);
+ ret = -EINVAL;
+ }
+ return ret;
+
+out_err:
+ /* reset map before returning an error */
+ map->m_lblk = m_lblk;
+ map->m_len = m_len;
+ map->m_flags = 0;
+ return ret;
+}
+
+/*
+ * ext4_map_blocks_atomic_write: Helper routine to ensure the entire requested
+ * range in @map [lblk, lblk + len) is one single contiguous extent with no
+ * mixed mappings.
+ *
+ * We first use m_flags passed to us by our caller (ext4_iomap_alloc()).
+ * We only use EXT4_GET_BLOCKS_CREATE_ZERO in the slow path, when the underlying
+ * physical extent for the requested range does not have a single contiguous
+ * mapping type i.e. (Hole, Mapped, or Unwritten) throughout.
+ * In that case we will loop over the requested range to allocate and zero out
+ * the unwritten / holes in between, to get a single mapped extent from
+ * [m_lblk, m_lblk + m_len). Note that this is only possible because we know
+ * this can be called only with bigalloc enabled filesystem where the underlying
+ * cluster is already allocated. This avoids allocating discontiguous extents
+ * in the slow path due to multiple calls to ext4_map_blocks().
+ * The slow path is mostly non-performance critical path, so it should be ok to
+ * loop using ext4_map_blocks() with appropriate flags to allocate & zero the
+ * underlying short holes/unwritten extents within the requested range.
+ */
+static int ext4_map_blocks_atomic_write(handle_t *handle, struct inode *inode,
+ struct ext4_map_blocks *map, int m_flags)
+{
+ ext4_lblk_t m_lblk = map->m_lblk;
+ unsigned int m_len = map->m_len;
+ int ret = 0;
+
+ WARN_ON_ONCE(m_len > 1 && !ext4_has_feature_bigalloc(inode->i_sb));
+
+ ret = ext4_map_blocks(handle, inode, map, m_flags);
+ if (ret < 0 || ret == m_len)
+ goto out;
+ /*
+ * This is a mixed mapping case where we were not able to allocate
+ * a single contiguous extent. In that case let's reset requested
+ * mapping and call the slow path.
+ */
+ map->m_lblk = m_lblk;
+ map->m_len = m_len;
+ map->m_flags = 0;
+
+ return ext4_map_blocks_atomic_write_slow(handle, inode, map);
+out:
+ return ret;
+}
+
static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
unsigned int flags)
{
@@ -3425,7 +3555,30 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
*/
if (map->m_len > DIO_MAX_BLOCKS)
map->m_len = DIO_MAX_BLOCKS;
- dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
+
+ /*
+ * journal credits estimation for atomic writes. We call
+ * ext4_map_blocks(), to find if there could be a mixed mapping. If yes,
+ * then let's assume the no. of pextents required can be m_len i.e.
+ * every alternate block can be unwritten and hole.
+ */
+ if (flags & IOMAP_ATOMIC) {
+ unsigned int orig_mlen = map->m_len;
+
+ ret = ext4_map_blocks(NULL, inode, map, 0);
+ if (ret < 0)
+ return ret;
+ if (map->m_len < orig_mlen) {
+ map->m_len = orig_mlen;
+ dio_credits = ext4_meta_trans_blocks(inode, orig_mlen,
+ map->m_len);
+ } else {
+ dio_credits = ext4_chunk_trans_blocks(inode,
+ map->m_len);
+ }
+ } else {
+ dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
+ }
retry:
/*
@@ -3456,7 +3609,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
- ret = ext4_map_blocks(handle, inode, map, m_flags);
+ if (flags & IOMAP_ATOMIC)
+ ret = ext4_map_blocks_atomic_write(handle, inode, map, m_flags);
+ else
+ ret = ext4_map_blocks(handle, inode, map, m_flags);
/*
* We cannot fill holes in indirect tree based inodes as that could
@@ -3480,6 +3636,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
int ret;
struct ext4_map_blocks map;
u8 blkbits = inode->i_blkbits;
+ unsigned int orig_mlen;
if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
return -EINVAL;
@@ -3493,6 +3650,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
map.m_lblk = offset >> blkbits;
map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
+ orig_mlen = map.m_len;
if (flags & IOMAP_WRITE) {
/*
@@ -3503,8 +3661,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
*/
if (offset + length <= i_size_read(inode)) {
ret = ext4_map_blocks(NULL, inode, &map, 0);
- if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
- goto out;
+ /*
+ * For atomic writes the entire requested length should
+ * be mapped.
+ */
+ if (map.m_flags & EXT4_MAP_MAPPED) {
+ if ((!(flags & IOMAP_ATOMIC) && ret > 0) ||
+ (flags & IOMAP_ATOMIC && ret >= orig_mlen))
+ goto out;
+ }
+ map.m_len = orig_mlen;
}
ret = ext4_iomap_alloc(inode, &map, flags);
} else {
@@ -3525,6 +3691,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
*/
map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);
+ /*
+ * Before returning to iomap, let's ensure the allocated mapping
+ * covers the entire requested length for atomic writes.
+ */
+ if (flags & IOMAP_ATOMIC) {
+ if (map.m_len < (length >> blkbits)) {
+ WARN_ON(1);
+ return -EINVAL;
+ }
+ }
ext4_set_iomap(inode, iomap, &map, offset, length, flags);
return 0;
--
2.49.0
* [PATCH v3 6/7] ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc
2025-05-08 20:50 [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani (IBM)
` (4 preceding siblings ...)
2025-05-08 20:50 ` [PATCH v3 5/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani (IBM)
@ 2025-05-08 20:50 ` Ritesh Harjani (IBM)
2025-05-14 16:21 ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 7/7] ext4: Add atomic block write documentation Ritesh Harjani (IBM)
2025-05-09 17:42 ` [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani
7 siblings, 1 reply; 26+ messages in thread
From: Ritesh Harjani (IBM) @ 2025-05-08 20:50 UTC (permalink / raw)
To: linux-ext4
Cc: Theodore Ts'o, Jan Kara, John Garry, djwong, Ojaswin Mujoo,
linux-fsdevel, Ritesh Harjani (IBM)
The last couple of patches added the needed support for multi-fsblock atomic
writes using bigalloc. This patch ensures that the filesystem advertises the
needed atomic write unit min and max values for enabling multi-fsblock
atomic write support with bigalloc.
Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
fs/ext4/super.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
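
The limit selection done by this patch can be illustrated outside the kernel.
What follows is only a rough userspace sketch, not the kernel code: the names
awu_limits and compute_awu() are hypothetical, and the device limits are passed
in as plain numbers instead of being read through
bdev_atomic_write_unit_min_bytes() / bdev_atomic_write_unit_max_bytes().

```c
#include <stdbool.h>

/* Hypothetical container for the computed limits (not a kernel type). */
struct awu_limits {
	unsigned int awu_min;	/* bytes */
	unsigned int awu_max;	/* bytes */
	bool enabled;		/* both limits nonzero and min <= max */
};

static unsigned int max_u(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Sketch of the selection in ext4_atomic_write_init(): the minimum is
 * bounded below by the block size, the maximum is bounded above by the
 * cluster size. Without bigalloc the caller passes clustersize == blocksize.
 */
static struct awu_limits compute_awu(unsigned int blocksize,
				     unsigned int clustersize,
				     unsigned int bdev_min,
				     unsigned int bdev_max)
{
	struct awu_limits l;

	l.awu_min = max_u(blocksize, bdev_min);
	l.awu_max = min_u(clustersize, bdev_max);
	l.enabled = l.awu_min && l.awu_max && l.awu_min <= l.awu_max;
	return l;
}
```

With a 4KB block size, a 64KB cluster size and a device reporting 512B/16KB
limits, this sketch gives awu_min = 4KB and awu_max = 16KB, i.e. the device
limit caps the maximum below the cluster size.
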
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 181934499624..508ea5cff1c7 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4442,12 +4442,12 @@ static int ext4_handle_clustersize(struct super_block *sb)
/*
* ext4_atomic_write_init: Initializes filesystem min & max atomic write units.
* @sb: super block
- * TODO: Later add support for bigalloc
*/
static void ext4_atomic_write_init(struct super_block *sb)
{
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct block_device *bdev = sb->s_bdev;
+ unsigned int clustersize = sb->s_blocksize;
if (!bdev_can_atomic_write(bdev))
return;
@@ -4455,9 +4455,12 @@ static void ext4_atomic_write_init(struct super_block *sb)
if (!ext4_has_feature_extents(sb))
return;
+ if (ext4_has_feature_bigalloc(sb))
+ clustersize = EXT4_CLUSTER_SIZE(sb);
+
sbi->s_awu_min = max(sb->s_blocksize,
bdev_atomic_write_unit_min_bytes(bdev));
- sbi->s_awu_max = min(sb->s_blocksize,
+ sbi->s_awu_max = min(clustersize,
bdev_atomic_write_unit_max_bytes(bdev));
if (sbi->s_awu_min && sbi->s_awu_max &&
sbi->s_awu_min <= sbi->s_awu_max) {
--
2.49.0
* [PATCH v3 7/7] ext4: Add atomic block write documentation
2025-05-08 20:50 [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani (IBM)
` (5 preceding siblings ...)
2025-05-08 20:50 ` [PATCH v3 6/7] ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc Ritesh Harjani (IBM)
@ 2025-05-08 20:50 ` Ritesh Harjani (IBM)
2025-05-09 7:34 ` Ojaswin Mujoo
2025-05-09 17:42 ` [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani
7 siblings, 1 reply; 26+ messages in thread
From: Ritesh Harjani (IBM) @ 2025-05-08 20:50 UTC (permalink / raw)
To: linux-ext4
Cc: Theodore Ts'o, Jan Kara, John Garry, djwong, Ojaswin Mujoo,
linux-fsdevel, Ritesh Harjani (IBM)
Add initial documentation for atomic write support in ext4.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
.../filesystems/ext4/atomic_writes.rst | 208 ++++++++++++++++++
Documentation/filesystems/ext4/overview.rst | 1 +
2 files changed, 209 insertions(+)
create mode 100644 Documentation/filesystems/ext4/atomic_writes.rst
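
The request constraints mentioned in the "Application Interface" section of
the new document can be sketched as a small userspace pre-check. This is only
an illustration under stated assumptions: atomic_write_request_ok() is a
hypothetical helper, unit_min/unit_max are assumed to come from the statx()
fields described in the document, and the power-of-two length and natural
alignment rules are my reading of generic_atomic_write_valid(), which remains
the authoritative check.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if n is a nonzero power of two. */
static bool is_power_of_2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical pre-check an application could run before issuing
 * pwritev2(..., RWF_ATOMIC). unit_min and unit_max are the values
 * reported in stx_atomic_write_unit_min / stx_atomic_write_unit_max.
 */
static bool atomic_write_request_ok(uint64_t offset, uint64_t len,
				    uint32_t unit_min, uint32_t unit_max)
{
	/* Assumed rule: the length must be a power of two. */
	if (!is_power_of_2(len))
		return false;
	/* Assumed rule: the offset must be aligned to the length. */
	if (offset % len != 0)
		return false;
	/* The length must fall within the advertised unit limits. */
	return len >= unit_min && len <= unit_max;
}
```

For example, with limits of 4KB and 64KB, a 16KB write at offset 16KB would
pass this pre-check, while a 16KB write at offset 4KB or a 12KB write at
offset 0 would not.
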
diff --git a/Documentation/filesystems/ext4/atomic_writes.rst b/Documentation/filesystems/ext4/atomic_writes.rst
new file mode 100644
index 000000000000..59b03d8dbb79
--- /dev/null
+++ b/Documentation/filesystems/ext4/atomic_writes.rst
@@ -0,0 +1,208 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _atomic_writes:
+
+Atomic Block Writes
+-------------------------
+
+Introduction
+~~~~~~~~~~~~
+
+Atomic (untorn) block writes ensure that either the entire write is committed
+to disk or none of it is. This prevents "torn writes" during power loss or
+system crashes. The ext4 filesystem supports atomic writes (only with Direct
+I/O) on regular files with extents, provided the underlying storage device
+supports hardware atomic writes. This is supported in the following two ways:
+
+1. **Single-fsblock Atomic Writes**:
+ EXT4 supports atomic write operations with a single filesystem block since
+ v6.13. In this case the atomic write unit minimum and maximum sizes are both
+ set to the filesystem blocksize.
+ e.g. doing an atomic write of 16KB with a 16KB filesystem blocksize on a 64KB
+ pagesize system is possible.
+
+2. **Multi-fsblock Atomic Writes with Bigalloc**:
+ EXT4 now also supports atomic writes spanning multiple filesystem blocks
+ using a feature known as bigalloc. The atomic write unit's minimum and
+ maximum sizes are determined by the filesystem block size and cluster size,
+ based on the underlying device’s supported atomic write unit limits.
+
+Requirements
+~~~~~~~~~~~~
+
+Basic requirements for atomic writes in ext4:
+
+ 1. The extents feature must be enabled (default for ext4)
+ 2. The underlying block device must support atomic writes
+ 3. For single-fsblock atomic writes:
+
+ 1. A filesystem with appropriate block size (up to the page size)
+ 4. For multi-fsblock atomic writes:
+
+ 1. The bigalloc feature must be enabled
+ 2. The cluster size must be appropriately configured
+
+NOTE: EXT4 does not support software or COW based atomic writes, which means
+atomic writes on ext4 are only supported if the underlying storage device
+supports them.
+
+Multi-fsblock Implementation Details
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The bigalloc feature changes ext4 to use clustered allocations. With bigalloc
+each bit within block bitmap represents clusters (power of 2 number of blocks)
+rather than individual filesystem blocks. EXT4 supports atomic writes using
+bigalloc by making sure that atomic write min and max are within [blocksize,
+clustersize].
+
+Here is the block allocation strategy in bigalloc for atomic writes:
+
+ * For regions with fully mapped extents, no additional allocation is needed
+ * For append writes, a new mapped extent is allocated
+ * For regions that are entirely holes, an unwritten extent is created
+ * For large unwritten extents, the extent gets split into two unwritten
+ extents of appropriate requested size
+ * For mixed mapping regions (combinations of holes, unwritten extents, or
+ mapped extents), ext4_map_blocks() is called in a loop with
+ EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
+ mapped extent
+
+Note: Writing on a single contiguous underlying extent, whether mapped or
+unwritten, is not inherently problematic. However, writing to a mixed mapping
+region (i.e. one containing a combination of mapped and unwritten extents)
+must be avoided when performing atomic writes.
+
+The reason is that atomic writes, when issued via pwritev2() with the RWF_ATOMIC
+flag, require that either all data is written or none at all. In the event of
+a system crash or unexpected power loss during the write operation, the affected
+region (when later read) must reflect either the complete old data or the
+complete new data, but never a mix of both.
+
+To enforce this guarantee, we ensure that the write target is backed by
+a single, contiguous extent before any data is written. This is critical because
+ext4 defers the conversion of unwritten extents to written extents until the I/O
+completion path (typically in ->end_io()). If a write is allowed to proceed over
+a mixed mapping region (with mapped and unwritten extents) and a failure occurs
+mid-write, the system could observe partially updated regions after reboot, i.e.
+new data over mapped areas, and stale (old) data over unwritten extents that
+were never marked written. This violates the atomicity and/or torn write
+prevention guarantee.
+
+To prevent such torn writes, ext4 proactively allocates a single contiguous
+extent for the entire requested region in ``ext4_iomap_alloc()`` via
+``ext4_map_blocks_atomic_write()``. Only after this allocation is the write
+operation performed by iomap.
+
+Handling Split Extents Across Leaf Blocks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There can be a special edge case where we have logically and physically
+contiguous extents stored in separate leaf nodes of the on-disk extent tree.
+This occurs because on-disk extent tree merges only happen within the leaf
+blocks, except for the case where a 2-level tree can get merged and
+collapsed entirely into the inode.
+If such a layout exists and, in the worst case, the extent status cache entries
+are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
+a single contiguous extent for these split leaf extents.
+
+To address this edge case, a new get block flag,
+``EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF``, is added to enhance the
+``ext4_map_query_blocks()`` lookup behavior.
+
+This new get block flag allows ``ext4_map_blocks()`` to first check if there is
+an entry in the extent status cache for the full range.
+If not present, it consults the on-disk extent tree using
+``ext4_map_query_blocks()``.
+If the located extent is at the end of a leaf node, it probes the next logical
+block (lblk) to detect a contiguous extent in the adjacent leaf.
+
+For now only one additional leaf block is queried to maintain efficiency, as
+atomic writes are typically constrained to small sizes
+(e.g. [blocksize, clustersize]).
+
+
+Handling Journal transactions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To support multi-fsblock atomic writes, we ensure enough journal credits are
+reserved during:
+
+ 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
+ could be a mixed mapping for the underlying requested range. If yes, then we
+ reserve credits of up to ``m_len``, assuming every alternate block can be
+ an unwritten extent followed by a hole.
+
+ 2. During ``->end_io()`` call, we make sure a single transaction is started for
+ doing unwritten-to-written conversion. The loop for conversion is mainly
+ only required to handle a split extent across leaf blocks.
+
+How to
+------
+
+Creating Filesystems with Atomic Write Support
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For single-fsblock atomic writes with a larger block size
+(on systems with block size < page size):
+
+.. code-block:: bash
+
+ # Create an ext4 filesystem with a 16KB block size
+ # (requires page size >= 16KB)
+ mkfs.ext4 -b 16384 /dev/device
+
+For multi-fsblock atomic writes with bigalloc:
+
+.. code-block:: bash
+
+ # Create an ext4 filesystem with bigalloc and 64KB cluster size
+ mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device
+
+Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
+and ``-O bigalloc`` enables the bigalloc feature.
+
+Application Interface
+~~~~~~~~~~~~~~~~~~~~~
+
+Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
+to perform atomic writes:
+
+.. code-block:: c
+
+ pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);
+
+The write must be aligned to the filesystem's block size and not exceed the
+filesystem's maximum atomic write unit size.
+See ``generic_atomic_write_valid()`` for more details.
+
+The ``statx()`` system call with the ``STATX_WRITE_ATOMIC`` flag can provide
+the following details:
+
+ * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
+ * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
+ * ``stx_atomic_write_segments_max``: Upper limit for segments, i.e. the number
+ of separate memory buffers that can be gathered into a write operation
+ (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.
+
+The ``STATX_ATTR_WRITE_ATOMIC`` flag in ``statx.stx_attributes`` is set if
+atomic writes are supported.
+
+Hardware Support
+----------------
+
+The underlying storage device must support atomic write operations.
+Modern NVMe and SCSI devices often provide this capability.
+The Linux kernel exposes this information through sysfs:
+
+* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size
+* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size
+
+Nonzero values for these attributes indicate that the device supports
+atomic writes.
+
+See Also
+--------
+
+* :doc:`bigalloc` - Documentation on the bigalloc feature
+* :doc:`allocators` - Documentation on block allocation in ext4
+* Support for atomic block writes in 6.13:
+ https://lwn.net/Articles/1009298/
diff --git a/Documentation/filesystems/ext4/overview.rst b/Documentation/filesystems/ext4/overview.rst
index 0fad6eda6e15..9d4054c17ecb 100644
--- a/Documentation/filesystems/ext4/overview.rst
+++ b/Documentation/filesystems/ext4/overview.rst
@@ -25,3 +25,4 @@ order.
.. include:: inlinedata.rst
.. include:: eainode.rst
.. include:: verity.rst
+.. include:: atomic_writes.rst
--
2.49.0
* Re: [PATCH v3 1/7] ext4: Document an edge case for overwrites
2025-05-08 20:50 ` [PATCH v3 1/7] ext4: Document an edge case for overwrites Ritesh Harjani (IBM)
@ 2025-05-09 5:19 ` Ojaswin Mujoo
2025-05-14 16:23 ` Darrick J. Wong
1 sibling, 0 replies; 26+ messages in thread
From: Ojaswin Mujoo @ 2025-05-09 5:19 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-ext4, Theodore Ts'o, Jan Kara, John Garry, djwong,
linux-fsdevel
On Fri, May 09, 2025 at 02:20:31AM +0530, Ritesh Harjani (IBM) wrote:
> ext4_iomap_overwrite_begin() clears the flag for IOMAP_WRITE before
> calling ext4_iomap_begin(). Document this above ext4_map_blocks() call
> as it is easy to miss it when focusing on write paths alone.
>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Looks good Ritesh. Feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
> fs/ext4/inode.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 94c7d2d828a6..b10e5cd5bb5c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3436,6 +3436,10 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> }
> ret = ext4_iomap_alloc(inode, &map, flags);
> } else {
> + /*
> + * This can be called for overwrites path from
> + * ext4_iomap_overwrite_begin().
> + */
> ret = ext4_map_blocks(NULL, inode, &map, 0);
> }
>
> --
> 2.49.0
>
* Re: [PATCH v3 2/7] ext4: Check if inode uses extents in ext4_inode_can_atomic_write()
2025-05-08 20:50 ` [PATCH v3 2/7] ext4: Check if inode uses extents in ext4_inode_can_atomic_write() Ritesh Harjani (IBM)
@ 2025-05-09 5:20 ` Ojaswin Mujoo
2025-05-14 16:24 ` Darrick J. Wong
1 sibling, 0 replies; 26+ messages in thread
From: Ojaswin Mujoo @ 2025-05-09 5:20 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-ext4, Theodore Ts'o, Jan Kara, John Garry, djwong,
linux-fsdevel
On Fri, May 09, 2025 at 02:20:32AM +0530, Ritesh Harjani (IBM) wrote:
> EXT4 only supports doing atomic write on inodes which uses extents, so
> add a check in ext4_inode_can_atomic_write() which gets called during
> open.
>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Looks good Ritesh. Feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
> fs/ext4/ext4.h | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 5a20e9cd7184..c0240f6f6491 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3847,7 +3847,9 @@ static inline int ext4_buffer_uptodate(struct buffer_head *bh)
> static inline bool ext4_inode_can_atomic_write(struct inode *inode)
> {
>
> - return S_ISREG(inode->i_mode) && EXT4_SB(inode->i_sb)->s_awu_min > 0;
> + return S_ISREG(inode->i_mode) &&
> + ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> + EXT4_SB(inode->i_sb)->s_awu_min > 0;
> }
>
> extern int ext4_block_write_begin(handle_t *handle, struct folio *folio,
> --
> 2.49.0
>
* Re: [PATCH v3 3/7] ext4: Make ext4_meta_trans_blocks() non-static for later use
2025-05-08 20:50 ` [PATCH v3 3/7] ext4: Make ext4_meta_trans_blocks() non-static for later use Ritesh Harjani (IBM)
@ 2025-05-09 5:21 ` Ojaswin Mujoo
2025-05-14 16:24 ` Darrick J. Wong
1 sibling, 0 replies; 26+ messages in thread
From: Ojaswin Mujoo @ 2025-05-09 5:21 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-ext4, Theodore Ts'o, Jan Kara, John Garry, djwong,
linux-fsdevel
On Fri, May 09, 2025 at 02:20:33AM +0530, Ritesh Harjani (IBM) wrote:
> Let's make ext4_meta_trans_blocks() non-static for use in later
> functions during ->end_io conversion for atomic writes.
> We will need this function to estimate journal credits for a special
> case. Instead of adding another wrapper around it, let's make this
> non-static.
>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Looks good Ritesh. Feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
> fs/ext4/ext4.h | 2 ++
> fs/ext4/inode.c | 6 +-----
> 2 files changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index c0240f6f6491..e2b36a3c1b0f 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3039,6 +3039,8 @@ extern void ext4_set_aops(struct inode *inode);
> extern int ext4_writepage_trans_blocks(struct inode *);
> extern int ext4_normal_submit_inode_data_buffers(struct jbd2_inode *jinode);
> extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
> +extern int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
> + int pextents);
> extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
> loff_t lstart, loff_t lend);
> extern vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b10e5cd5bb5c..2f99b087a5d8 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -142,9 +142,6 @@ static inline int ext4_begin_ordered_truncate(struct inode *inode,
> new_size);
> }
>
> -static int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
> - int pextents);
> -
> /*
> * Test whether an inode is a fast symlink.
> * A fast symlink has its symlink data stored in ext4_inode_info->i_data.
> @@ -5777,8 +5774,7 @@ static int ext4_index_trans_blocks(struct inode *inode, int lblocks,
> *
> * Also account for superblock, inode, quota and xattr blocks
> */
> -static int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
> - int pextents)
> +int ext4_meta_trans_blocks(struct inode *inode, int lblocks, int pextents)
> {
> ext4_group_t groups, ngroups = ext4_get_groups_count(inode->i_sb);
> int gdpblocks;
> --
> 2.49.0
>
* Re: [PATCH v3 7/7] ext4: Add atomic block write documentation
2025-05-08 20:50 ` [PATCH v3 7/7] ext4: Add atomic block write documentation Ritesh Harjani (IBM)
@ 2025-05-09 7:34 ` Ojaswin Mujoo
2025-05-14 16:38 ` Darrick J. Wong
2025-05-15 2:18 ` Ritesh Harjani
0 siblings, 2 replies; 26+ messages in thread
From: Ojaswin Mujoo @ 2025-05-09 7:34 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-ext4, Theodore Ts'o, Jan Kara, John Garry, djwong,
linux-fsdevel
On Fri, May 09, 2025 at 02:20:37AM +0530, Ritesh Harjani (IBM) wrote:
> Add an initial documentation around atomic writes support in ext4.
>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Hi Ritesh,
The docs look mostly good. I'll add some feedback below:
> ---
> .../filesystems/ext4/atomic_writes.rst | 208 ++++++++++++++++++
> Documentation/filesystems/ext4/overview.rst | 1 +
> 2 files changed, 209 insertions(+)
> create mode 100644 Documentation/filesystems/ext4/atomic_writes.rst
>
> diff --git a/Documentation/filesystems/ext4/atomic_writes.rst b/Documentation/filesystems/ext4/atomic_writes.rst
> new file mode 100644
> index 000000000000..59b03d8dbb79
> --- /dev/null
> +++ b/Documentation/filesystems/ext4/atomic_writes.rst
> @@ -0,0 +1,208 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. _atomic_writes:
> +
> +Atomic Block Writes
> +-------------------------
> +
> +Introduction
> +~~~~~~~~~~~~
> +
> +Atomic (untorn) block writes ensure that either the entire write is committed
> +to disk or none of it is. This prevents "torn writes" during power loss or
> +system crashes. The ext4 filesystem supports atomic writes (only with Direct
> +I/O) on regular files with extents, provided the underlying storage device
> +supports hardware atomic writes. This is supported in the following two ways:
> +
> +1. **Single-fsblock Atomic Writes**:
> + EXT4's supports atomic write operations with a single filesystem block since
> + v6.13. In this the atomic write unit minimum and maximum sizes are both set
> + to filesystem blocksize.
> + e.g. doing atomic write of 16KB with 16KB filesystem blocksize on 64KB
> + pagesize system is possible.
> +
> +2. **Multi-fsblock Atomic Writes with Bigalloc**:
> + EXT4 now also supports atomic writes spanning multiple filesystem blocks
> + using a feature known as bigalloc. The atomic write unit's minimum and
> + maximum sizes are determined by the filesystem block size and cluster size,
> + based on the underlying device’s supported atomic write unit limits.
> +
> +Requirements
> +~~~~~~~~~~~~
> +
> +Basic requirements for atomic writes in ext4:
> +
> + 1. The extents feature must be enabled (default for ext4)
> + 2. The underlying block device must support atomic writes
> + 3. For single-fsblock atomic writes:
> +
> + 1. A filesystem with appropriate block size (up to the page size)
> + 4. For multi-fsblock atomic writes:
> +
> + 1. The bigalloc feature must be enabled
> + 2. The cluster size must be appropriately configured
> +
> +NOTE: EXT4 does not support software or COW based atomic write, which means
> +atomic writes on ext4 are only supported if underlying storage device supports
> +it.
> +
> +Multi-fsblock Implementation Details
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The bigalloc feature changes ext4 to use clustered allocations. With bigalloc
> +each bit within block bitmap represents clusters (power of 2 number of blocks)
> +rather than individual filesystem blocks. EXT4 supports atomic writes using
> +bigalloc by making sure that atomic write min and max are within [blocksize,
> +clustersize].
Should we add a line like:
Atomic write max unit is capped to the max supported by the underlying
device, in case it is less than the clustersize.
Also, maybe we can have a line with something like "With bigalloc's
clustered allocation we can be sure that an atomic write will always
be allocated aligned blocks. The only thing we need to ensure is that
we have a continuous mapping in the write range."
> +
> +Here is the block allocation strategy in bigalloc for atomic writes:
> +
> + * For regions with fully mapped extents, no additional allocation is needed
> + * For append writes, a new mapped extent is allocated
> + * For regions that are entirely holes, unwritten extent is created
> + * For large unwritten extents, the extent gets split into two unwritten
> + extents of appropriate requested size
Are the above 4 points needed explicitly? Maybe we can have:
Append writes, and writes on regions that are fully mapped,
unwritten or hole follow the same flow as non atomic writes.
> + * For mixed mapping regions (combinations of holes, unwritten extents, or
> + mapped extents), ext4_map_blocks() is called in a loop with
> + EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
> + mapped extent
Maybe:
... single continuous mapped extents by writing zeroes to it
So that we explicitly mention what we are doing and not rely on people
knowing the meaning of EXT4_GET_BLOCKS_ZERO flag.
> +
> +Note: Writing on a single contiguous underlying extent, whether mapped or
> +unwritten, is not inherently problematic. However, writing to a mixed mapping
> +region (i.e. one containing a combination of mapped and unwritten extents)
> +must be avoided when performing atomic writes.
> +
> +The reason is that, atomic writes when issued via pwritev2() with the RWF_ATOMIC
> +flag, requires that either all data is written or none at all. In the event of
> +a system crash or unexpected power loss during the write operation, the affected
> +region (when later read) must reflect either the complete old data or the
> +complete new data, but never a mix of both.
> +
> +To enforce this guarantee, we ensure that the write target is backed by
> +a single, contiguous extent before any data is written. This is critical because
> +ext4 defers the conversion of unwritten extents to written extents until the I/O
> +completion path (typically in ->end_io()). If a write is allowed to proceed over
> +a mixed mapping region (with mapped and unwritten extents) and a failure occurs
> +mid-write, the system could observe partially updated regions after reboot, i.e.
> +new data over mapped areas, and stale (old) data over unwritten extents that
> +were never marked written. This violates the atomicity and/or torn write
> +prevention guarantee.
> +
> +To prevent such torn writes, ext4 proactively allocates a single contiguous
> +extent for the entire requested region in ``ext4_iomap_alloc`` via
> +``ext4_map_blocks_atomic()``. Only after this allocation, is the write
> +operation performed by iomap.
> +
> +Handling Split Extents Across Leaf Blocks
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +There can be a special edge case where we have logically and physically
> +contiguous extents stored in separate leaf nodes of the on-disk extent tree.
> +This occurs because on-disk extent tree merges only happens within the leaf
> +blocks except for a case where we have 2-level tree which can get merged and
> +collapsed entirely into the inode.
> +If such a layout exists and, in the worst case, the extent status cache entries
> +are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
> +a single contiguous extent for these split leaf extents.
> +
> +To address this edge case, a new get block flag
> +``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS flag`` is added to enhance the
> +``ext4_map_query_blocks()`` lookup behavior.
> +
> +This new get block flag allows ``ext4_map_blocks()`` to first checks if there is
s/checks/check
> +an entry in the extent status cache for the full range.
> +If not present, it consults the on-disk extent tree using
> +``ext4_map_query_blocks()``.
> +If the located extent is at the end of a leaf node, it probes the next logical
> +block (lblk) to detect a contiguous extent in the adjacent leaf.
> +
> +For now only one additional leaf block is queried to maintain efficiency, as
> +atomic writes are typically constrained to small sizes
> +(e.g. [blocksize, clustersize]).
> +
> +
> +Handling Journal transactions
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +To support multi-fsblock atomic writes, we ensure enough journal credits are
> +reserved during:
> +
> + 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
> + could be a mixed mapping for the underlying requested range. If yes, then we
> + reserve credits of up to ``m_len``, assuming every alternate block can be
> + an unwritten extent followed by a hole.
> +
> + 2. During ``->end_io()`` call, we make sure a single transaction is started for
> + doing unwritten-to-written conversion. The loop for conversion is mainly
> + only required to handle a split extent across leaf blocks.
> +
> +How to
> +------
> +
> +Creating Filesystems with Atomic Write Support
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +For single-fsblock atomic writes with a larger block size
> +(on systems with block size < page size):
> +
> +.. code-block:: bash
> +
> + # Create an ext4 filesystem with a 16KB block size
> + # (requires page size >= 16KB)
> + mkfs.ext4 -b 16384 /dev/device
> +
> +For multi-fsblock atomic writes with bigalloc:
> +
> +.. code-block:: bash
> +
> + # Create an ext4 filesystem with bigalloc and 64KB cluster size
> + mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device
> +
> +Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
> +and ``-O bigalloc`` enables the bigalloc feature.
> +
> +Application Interface
> +~~~~~~~~~~~~~~~~~~~~~
> +
> +Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
> +to perform atomic writes:
> +
> +.. code-block:: c
> +
> + pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);
> +
> +The write must be aligned to the filesystem's block size and not exceed the
> +filesystem's maximum atomic write unit size.
> +See ``generic_atomic_write_valid()`` for more details.
> +
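A userspace pre-check before issuing the write could look like the
sketch below. It assumes the rules are a power-of-two length within the
advertised [min, max] units and an offset naturally aligned to that
length; atomic_write_ok() and is_pow2() are hypothetical helpers, and
the limits are assumed to have been read beforehand.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when x is a nonzero power of two. */
static bool is_pow2(uint64_t x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/*
 * Pre-check for an RWF_ATOMIC write (assumed rules, see lead-in):
 * the length must be a power of two within [unit_min, unit_max],
 * and the offset must be aligned to the length.
 */
static bool atomic_write_ok(uint64_t offset, uint64_t len,
			    uint32_t unit_min, uint32_t unit_max)
{
	if (!is_pow2(len))
		return false;
	if (len < unit_min || len > unit_max)
		return false;
	return (offset % len) == 0;
}
```

With a 4KB block size and a 64KB cluster size, a 16KB write at offset 0
passes, while the same write at offset 4KB does not.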
> +The ``statx()`` system call with the ``STATX_WRITE_ATOMIC`` flag provides the
> +following details:
> +
> + * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
> + * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
> + * ``stx_atomic_write_segments_max``: Upper limit for segments, i.e. the number of
> + separate memory buffers that can be gathered into a write operation
> + (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.
> +
> +The ``STATX_ATTR_WRITE_ATOMIC`` flag in ``statx->attributes`` is set if atomic
> +writes are supported.
> +
> +Hardware Support
> +----------------
> +
> +The underlying storage device must support atomic write operations.
> +Modern NVMe and SCSI devices often provide this capability.
> +The Linux kernel exposes this information through sysfs:
> +
> +* ``/sys/block/<device>/queue/atomic_write_unit_min_bytes`` - Minimum atomic write size
> +* ``/sys/block/<device>/queue/atomic_write_unit_max_bytes`` - Maximum atomic write size
> +
> +Nonzero values for these attributes indicate that the device supports
> +atomic writes.
> +
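Checking these attributes from a program could be sketched as below.
The helper names are hypothetical. The attribute names used are the
ones listed in the block layer sysfs ABI documentation, which carry a
"_bytes" suffix.

```c
#include <stdio.h>

/*
 * Read one numeric attribute from /sys/block/<dev>/queue/.
 * Returns the value, or -1 if the file is missing or unparsable.
 */
static long read_queue_attr(const char *dev, const char *attr)
{
	char path[256];
	long val = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

/* Nonzero min and max mean the device advertises atomic writes. */
static int dev_supports_atomic_writes(const char *dev)
{
	return read_queue_attr(dev, "atomic_write_unit_min_bytes") > 0 &&
	       read_queue_attr(dev, "atomic_write_unit_max_bytes") > 0;
}
```

A missing device or attribute is treated the same as "not supported".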
> +See Also
> +--------
> +
> +* :doc:`bigalloc` - Documentation on the bigalloc feature
> +* :doc:`allocators` - Documentation on block allocation in ext4
> +* Support for atomic block writes in 6.13:
> + https://lwn.net/Articles/1009298/
> diff --git a/Documentation/filesystems/ext4/overview.rst b/Documentation/filesystems/ext4/overview.rst
> index 0fad6eda6e15..9d4054c17ecb 100644
> --- a/Documentation/filesystems/ext4/overview.rst
> +++ b/Documentation/filesystems/ext4/overview.rst
> @@ -25,3 +25,4 @@ order.
> .. include:: inlinedata.rst
> .. include:: eainode.rst
> .. include:: verity.rst
> +.. include:: atomic_writes.rst
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc
2025-05-08 20:50 [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani (IBM)
` (6 preceding siblings ...)
2025-05-08 20:50 ` [PATCH v3 7/7] ext4: Add atomic block write documentation Ritesh Harjani (IBM)
@ 2025-05-09 17:42 ` Ritesh Harjani
2025-05-14 16:40 ` Darrick J. Wong
7 siblings, 1 reply; 26+ messages in thread
From: Ritesh Harjani @ 2025-05-09 17:42 UTC (permalink / raw)
To: linux-ext4
Cc: Theodore Ts'o, Jan Kara, John Garry, djwong, Ojaswin Mujoo,
linux-fsdevel
"Ritesh Harjani (IBM)" <ritesh.list@gmail.com> writes:
> This is v3 of multi-fsblock atomic write support using bigalloc. This has
> started looking into much better shape now. The major chunk of the design
> changes has been kept in Patch-4 & 5.
>
> This series can now be carefully reviewed, as all the error handling related
> code paths should be properly taken care of.
>
We spotted that the multi-fsblock changes might need to force a journal
commit if there are mixed mappings in the underlying region, e.g. WUWUWUW...
The issue arises when, during block allocation, the unwritten ranges are
first zeroed out, followed by the unwritten-to-written extent
conversion. This conversion is part of a journaled metadata transaction
that has not yet been committed, as the transaction is still running.
If an iomap write then modifies the data on those multi-fsblocks and a
sudden power loss occurs before the transaction commits, the
unwritten-to-written conversion will not be replayed during journal
recovery. As a result, we end up with new data written over mapped
blocks, while the alternate unwritten blocks will read zeroes. This
could cause a torn write behavior for atomic writes.
So we were thinking we might need something like this. Hopefully this
should still be ok, as the mixed mapping case is mostly a
non-performance-critical path. Thoughts?
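To make the failure mode concrete, here is a toy userspace model of it.
It is not kernel code: enum blk_state, read_block() and
torn_after_crash() are made-up names, and the model only captures "the
conversion either reached disk before the crash or it did not".

```c
#include <stdbool.h>
#include <stddef.h>

enum blk_state { BLK_WRITTEN, BLK_UNWRITTEN };

/* What a read returns after recovery: unwritten blocks read as zero. */
static char read_block(enum blk_state state, char data)
{
	return state == BLK_UNWRITTEN ? 0 : data;
}

/*
 * Model an atomic write of the nonzero byte 'newc' over n blocks whose
 * on-disk states are state[]. The unwritten-to-written conversion sits
 * in a journal transaction; 'committed' says whether it reached disk
 * before the crash. Returns true if the post-crash read is torn, i.e.
 * a mix of new data and zeroes.
 */
static bool torn_after_crash(const enum blk_state *state, size_t n,
			     char newc, bool committed)
{
	size_t i, new_blocks = 0;

	for (i = 0; i < n; i++) {
		/* Without a commit, recovery leaves the old extent state. */
		enum blk_state s = committed ? BLK_WRITTEN : state[i];

		if (read_block(s, newc) == newc)
			new_blocks++;
	}
	return new_blocks != 0 && new_blocks != n;
}
```

For a WUWU layout the model reports a torn read only when the
conversion was not committed before the data write hit the disk.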
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2642e1ef128f..59b59d609976 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3517,7 +3517,8 @@ static int ext4_map_blocks_atomic_write_slow(handle_t *handle,
* underlying short holes/unwritten extents within the requested range.
*/
static int ext4_map_blocks_atomic_write(handle_t *handle, struct inode *inode,
- struct ext4_map_blocks *map, int m_flags)
+ struct ext4_map_blocks *map, int m_flags,
+ bool *force_commit)
{
ext4_lblk_t m_lblk = map->m_lblk;
unsigned int m_len = map->m_len;
@@ -3537,6 +3538,11 @@ static int ext4_map_blocks_atomic_write(handle_t *handle, struct inode *inode,
map->m_len = m_len;
map->m_flags = 0;
+ /*
+ * slow path means we have mixed mapping, that means we will need
+ * to force txn commit.
+ */
+ *force_commit = true;
return ext4_map_blocks_atomic_write_slow(handle, inode, map);
out:
return ret;
@@ -3548,6 +3554,7 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
handle_t *handle;
u8 blkbits = inode->i_blkbits;
int ret, dio_credits, m_flags = 0, retries = 0;
+ bool force_commit = false;
/*
* Trim the mapping request to the maximum value that we can map at
@@ -3610,7 +3617,8 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
if (flags & IOMAP_ATOMIC)
- ret = ext4_map_blocks_atomic_write(handle, inode, map, m_flags);
+ ret = ext4_map_blocks_atomic_write(handle, inode, map, m_flags,
+ &force_commit);
else
ret = ext4_map_blocks(handle, inode, map, m_flags);
@@ -3626,6 +3634,9 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
goto retry;
+ if (ret > 0 && force_commit)
+ ext4_force_commit(inode->i_sb);
+
return ret;
}
-ritesh
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v3 4/7] ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS
2025-05-08 20:50 ` [PATCH v3 4/7] ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS Ritesh Harjani (IBM)
@ 2025-05-14 16:16 ` Darrick J. Wong
2025-05-14 18:47 ` Ritesh Harjani
0 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2025-05-14 16:16 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-ext4, Theodore Ts'o, Jan Kara, John Garry,
Ojaswin Mujoo, linux-fsdevel
On Fri, May 09, 2025 at 02:20:34AM +0530, Ritesh Harjani (IBM) wrote:
> There can be a case where contiguous extents sit on adjacent leaf nodes
> of the on-disk extent tree. When someone writes to this contiguous
> range, ext4_map_blocks() splits the request by returning one extent at
> a time, unless the range is already cached in the extent status tree
> (where such extents get merged once cached, since they are
> contiguous).
>
> This is fine for a normal write, but an atomic write cannot afford to
> be broken into two. This only happens in the slow write case, where we
> call ext4_map_blocks() for each of these extents spread across
> different leaf nodes. However, there is no guarantee that these extent
> status cache entries will not be reclaimed before the last call to
> ext4_map_blocks() in ext4_map_blocks_atomic_write_slow().
Can you have two physically and logically contiguous mappings within a
single leaf node? Or is the key idea here that the extent status tree
will merge adjacent mappings from the same leaf block, just not between
leaf blocks?
> Hence this patch adds support of EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS.
> This flag checks whether the requested range can be fully found in the
> extent status cache and, if so, returns it. If not, it looks up the
> on-disk extent tree via ext4_map_query_blocks(). If the found extent is
> the last entry in the leaf node, it queries the next lblk to see
> whether there is a contiguous extent in the adjacent leaf node of the
> on-disk extent tree.
>
> There can be a case where multiple adjacent extent entries are spread
> across multiple leaf nodes, but we only read one adjacent leaf block,
> i.e. a total of 2 extent entries spread across 2 leaf nodes. The reason
> is that we are mostly only going to support atomic writes of up to
> 64KB, or at most up to 1MB.
>
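The merge rule for the two lookups can be sketched in userspace C as
follows. struct simple_map and merge_if_contiguous() are hypothetical,
simplified stand-ins and not the kernel structures. The second mapping
is assumed to start at the logical block right after the first one.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a mapping returned by a lookup. */
struct simple_map {
	unsigned long long pblk; /* first physical block */
	unsigned int len;        /* length in blocks */
	bool unwritten;          /* extent state */
};

/*
 * If 'next' starts physically right after 'cur' and has the same state,
 * grow 'cur' to cover both. Returns the (possibly extended) length.
 */
static unsigned int merge_if_contiguous(struct simple_map *cur,
					const struct simple_map *next)
{
	if (cur->pblk + cur->len == next->pblk &&
	    cur->unwritten == next->unwritten)
		cur->len += next->len;
	return cur->len;
}
```

If either the physical block or the written/unwritten state does not
match, the first mapping is returned unchanged and the caller sees a
short mapping.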
> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
> fs/ext4/ext4.h | 18 ++++++++-
> fs/ext4/extents.c | 12 ++++++
> fs/ext4/inode.c | 97 +++++++++++++++++++++++++++++++++++++++++------
> 3 files changed, 115 insertions(+), 12 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index e2b36a3c1b0f..b4bbe2837423 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -256,9 +256,19 @@ struct ext4_allocation_request {
> #define EXT4_MAP_UNWRITTEN BIT(BH_Unwritten)
> #define EXT4_MAP_BOUNDARY BIT(BH_Boundary)
> #define EXT4_MAP_DELAYED BIT(BH_Delay)
> +/*
> + * This is for use in ext4_map_query_blocks() for a special case where we can
> + * have a physically and logically contiguous blocks explit across two leaf
s/explit/split/ ?
--D
> + * nodes instead of a single extent. This is required in case of atomic writes
> + * to know whether the returned extent is last in leaf. If yes, then lookup for
> + * next in leaf block in ext4_map_query_blocks_next_in_leaf().
> + * - This is never going to be added to any buffer head state.
> + * - We use the next available bit after BH_BITMAP_UPTODATE.
> + */
> +#define EXT4_MAP_QUERY_LAST_IN_LEAF BIT(BH_BITMAP_UPTODATE + 1)
> #define EXT4_MAP_FLAGS (EXT4_MAP_NEW | EXT4_MAP_MAPPED |\
> EXT4_MAP_UNWRITTEN | EXT4_MAP_BOUNDARY |\
> - EXT4_MAP_DELAYED)
> + EXT4_MAP_DELAYED | EXT4_MAP_QUERY_LAST_IN_LEAF)
>
> struct ext4_map_blocks {
> ext4_fsblk_t m_pblk;
> @@ -725,6 +735,12 @@ enum {
> #define EXT4_GET_BLOCKS_IO_SUBMIT 0x0400
> /* Caller is in the atomic contex, find extent if it has been cached */
> #define EXT4_GET_BLOCKS_CACHED_NOWAIT 0x0800
> +/*
> + * Atomic write caller needs this to query in the slow path of mixed mapping
> + * case, when a contiguous extent can be split across two adjacent leaf nodes.
> + * Look EXT4_MAP_QUERY_LAST_IN_LEAF.
> + */
> +#define EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF 0x1000
>
> /*
> * The bit position of these flags must not overlap with any of the
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index c616a16a9f36..fa850f188d46 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4433,6 +4433,18 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
> allocated = map->m_len;
> ext4_ext_show_leaf(inode, path);
> out:
> + /*
> + * We never use EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF with CREATE flag.
> + * So we know that the depth used here is correct, since there was no
> + * block allocation done if EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF is set.
> + * If tomorrow we start using this QUERY flag with CREATE, then we will
> + * need to re-calculate the depth as it might have changed due to block
> + * allocation.
> + */
> + if (flags & EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF)
> + if (!err && ex && (ex == EXT_LAST_EXTENT(path[depth].p_hdr)))
> + map->m_flags |= EXT4_MAP_QUERY_LAST_IN_LEAF;
> +
> ext4_free_ext_path(path);
>
> trace_ext4_ext_map_blocks_exit(inode, flags, map,
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 2f99b087a5d8..8b86b1a29bdc 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -459,14 +459,71 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
> }
> #endif /* ES_AGGRESSIVE_TEST */
>
> +static int ext4_map_query_blocks_next_in_leaf(handle_t *handle,
> + struct inode *inode, struct ext4_map_blocks *map,
> + unsigned int orig_mlen)
> +{
> + struct ext4_map_blocks map2;
> + unsigned int status, status2;
> + int retval;
> +
> + status = map->m_flags & EXT4_MAP_UNWRITTEN ?
> + EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
> +
> + WARN_ON_ONCE(!(map->m_flags & EXT4_MAP_QUERY_LAST_IN_LEAF));
> + WARN_ON_ONCE(orig_mlen <= map->m_len);
> +
> + /* Prepare map2 for lookup in next leaf block */
> + map2.m_lblk = map->m_lblk + map->m_len;
> + map2.m_len = orig_mlen - map->m_len;
> + map2.m_flags = 0;
> + retval = ext4_ext_map_blocks(handle, inode, &map2, 0);
> +
> + if (retval <= 0) {
> + ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> + map->m_pblk, status, false);
> + return map->m_len;
> + }
> +
> + if (unlikely(retval != map2.m_len)) {
> + ext4_warning(inode->i_sb,
> + "ES len assertion failed for inode "
> + "%lu: retval %d != map->m_len %d",
> + inode->i_ino, retval, map2.m_len);
> + WARN_ON(1);
> + }
> +
> + status2 = map2.m_flags & EXT4_MAP_UNWRITTEN ?
> + EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
> +
> + /*
> + * If map2 is contiguous with map, then let's insert it as a single
> + * extent in es cache and return the combined length of both the maps.
> + */
> + if (map->m_pblk + map->m_len == map2.m_pblk &&
> + status == status2) {
> + ext4_es_insert_extent(inode, map->m_lblk,
> + map->m_len + map2.m_len, map->m_pblk,
> + status, false);
> + map->m_len += map2.m_len;
> + } else {
> + ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> + map->m_pblk, status, false);
> + }
> +
> + return map->m_len;
> +}
> +
> static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
> - struct ext4_map_blocks *map)
> + struct ext4_map_blocks *map, int flags)
> {
> unsigned int status;
> int retval;
> + unsigned int orig_mlen = map->m_len;
> + unsigned int query_flags = flags & EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF;
>
> if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> - retval = ext4_ext_map_blocks(handle, inode, map, 0);
> + retval = ext4_ext_map_blocks(handle, inode, map, query_flags);
> else
> retval = ext4_ind_map_blocks(handle, inode, map, 0);
>
> @@ -481,11 +538,23 @@ static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
> WARN_ON(1);
> }
>
> - status = map->m_flags & EXT4_MAP_UNWRITTEN ?
> - EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
> - ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> - map->m_pblk, status, false);
> - return retval;
> + /*
> + * No need to query next in leaf:
> + * - if returned extent is not last in leaf or
> + * - if the last in leaf is the full requested range
> + */
> + if (!(map->m_flags & EXT4_MAP_QUERY_LAST_IN_LEAF) ||
> + ((map->m_flags & EXT4_MAP_QUERY_LAST_IN_LEAF) &&
> + (map->m_len == orig_mlen))) {
> + status = map->m_flags & EXT4_MAP_UNWRITTEN ?
> + EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
> + ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> + map->m_pblk, status, false);
> + return retval;
> + }
> +
> + return ext4_map_query_blocks_next_in_leaf(handle, inode, map,
> + orig_mlen);
> }
>
> static int ext4_map_create_blocks(handle_t *handle, struct inode *inode,
> @@ -599,6 +668,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> struct extent_status es;
> int retval;
> int ret = 0;
> + unsigned int orig_mlen = map->m_len;
> #ifdef ES_AGGRESSIVE_TEST
> struct ext4_map_blocks orig_map;
>
> @@ -650,7 +720,12 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> ext4_map_blocks_es_recheck(handle, inode, map,
> &orig_map, flags);
> #endif
> - goto found;
> + if (!(flags & EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF) ||
> + orig_mlen == map->m_len)
> + goto found;
> +
> + if (flags & EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF)
> + map->m_len = orig_mlen;
> }
> /*
> * In the query cache no-wait mode, nothing we can do more if we
> @@ -664,7 +739,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> * file system block.
> */
> down_read(&EXT4_I(inode)->i_data_sem);
> - retval = ext4_map_query_blocks(handle, inode, map);
> + retval = ext4_map_query_blocks(handle, inode, map, flags);
> up_read((&EXT4_I(inode)->i_data_sem));
>
> found:
> @@ -1802,7 +1877,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
> if (ext4_has_inline_data(inode))
> retval = 0;
> else
> - retval = ext4_map_query_blocks(NULL, inode, map);
> + retval = ext4_map_query_blocks(NULL, inode, map, 0);
> up_read(&EXT4_I(inode)->i_data_sem);
> if (retval)
> return retval < 0 ? retval : 0;
> @@ -1825,7 +1900,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
> goto found;
> }
> } else if (!ext4_has_inline_data(inode)) {
> - retval = ext4_map_query_blocks(NULL, inode, map);
> + retval = ext4_map_query_blocks(NULL, inode, map, 0);
> if (retval) {
> up_write(&EXT4_I(inode)->i_data_sem);
> return retval < 0 ? retval : 0;
> --
> 2.49.0
>
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 5/7] ext4: Add multi-fsblock atomic write support with bigalloc
2025-05-08 20:50 ` [PATCH v3 5/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani (IBM)
@ 2025-05-14 16:19 ` Darrick J. Wong
2025-05-14 19:04 ` Ritesh Harjani
0 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2025-05-14 16:19 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-ext4, Theodore Ts'o, Jan Kara, John Garry,
Ojaswin Mujoo, linux-fsdevel
On Fri, May 09, 2025 at 02:20:35AM +0530, Ritesh Harjani (IBM) wrote:
> EXT4 supports bigalloc feature which allows the FS to work in size of
> clusters (group of blocks) rather than individual blocks. This patch
> adds atomic write support for bigalloc so that systems with bs = ps can
> also create FS using -
> mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>
>
> With bigalloc ext4 can support multi-fsblock atomic writes. We will have to
> adjust ext4's atomic write unit max value to cluster size. This can then support
> atomic write of size anywhere between [blocksize, clustersize]. This
> patch adds the required changes to enable multi-fsblock atomic write
> support using bigalloc in the next patch.
>
> In this patch for block allocation:
> we first query the underlying region of the requested range by calling
> ext4_map_blocks() call. Here are the various cases which we then handle
> depending upon the underlying mapping type:
> 1. If the underlying region for the entire requested range is a mapped extent,
> then we don't call ext4_map_blocks() to allocate anything. We don't need to
> even start the jbd2 txn in this case.
> 2. For an append write case, we create a mapped extent.
> 3. If the underlying region is entirely a hole, then we create an unwritten
> extent for the requested range.
> 4. If the underlying region is a large unwritten extent, then we split the
> extent into 2 unwritten extents of required size.
> 5. If the underlying region has any type of mixed mapping, then we call
> ext4_map_blocks() in a loop to zero out the unwritten and the hole regions
> within the requested range. This then provides a single mapped extent type
> mapping for the requested range.
>
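The distinction between the uniform cases and the mixed mapping case
above can be sketched like this. It is a userspace illustration with
hypothetical names (enum blk_kind, enum range_kind, classify_range())
that looks at a per-block view of the requested range.

```c
#include <stddef.h>

enum blk_kind { BLK_HOLE, BLK_UNWRITTEN, BLK_MAPPED };
enum range_kind { RANGE_HOLE, RANGE_UNWRITTEN, RANGE_MAPPED, RANGE_MIXED };

/* Classify a non-empty range by the kind of each of its blocks. */
static enum range_kind classify_range(const enum blk_kind *blk, size_t n)
{
	size_t i;

	/* Any block that differs from the first makes the range mixed. */
	for (i = 1; i < n; i++)
		if (blk[i] != blk[0])
			return RANGE_MIXED;

	switch (blk[0]) {
	case BLK_HOLE:
		return RANGE_HOLE;
	case BLK_UNWRITTEN:
		return RANGE_UNWRITTEN;
	default:
		return RANGE_MAPPED;
	}
}
```

Only RANGE_MIXED would need the zero-out loop. The other three results
correspond to a single mapping call.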
> Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO
> flag only when the underlying extent mapping of the requested range is
> not entirely a hole, an unwritten extent, or a fully mapped extent. That
> is, if the underlying region contains a mix of hole(s), unwritten
> extent(s), and mapped extent(s), we use this loop to ensure that all the
> short mappings are zeroed out. This guarantees that the entire requested
> range becomes a single, uniformly mapped extent. It is ok to do so
> because we know this is being done on a bigalloc enabled filesystem
> where the block bitmap represents the entire cluster unit.
>
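The bookkeeping of that loop can be sketched in userspace C as below.
struct seg and check_single_run() are hypothetical. Each array entry
stands for what one mapping call returned, and -EIO is used for the "no
progress" case only because the kernel-internal error code is not
available to userspace.

```c
#include <errno.h>
#include <stddef.h>

struct seg {
	unsigned long long pblk; /* physical start returned by one call */
	unsigned int len;        /* blocks mapped by that call */
};

/*
 * Walk the segments a series of mapping calls would return for a request
 * of m_len blocks. Returns 0 when together they form one physically
 * contiguous run covering the request, or a negative errno otherwise.
 */
static int check_single_run(const struct seg *segs, size_t nsegs,
			    unsigned int m_len)
{
	unsigned long long next_pblk = 0;
	unsigned int mapped_len = 0;
	size_t i;

	for (i = 0; i < nsegs && mapped_len < m_len; i++) {
		if (segs[i].len == 0)
			return -EIO;    /* no progress: avoid looping forever */
		if (i > 0 && segs[i].pblk != next_pblk)
			return -ENOSPC; /* discontiguous, as in the v3 patch */
		next_pblk = segs[i].pblk + segs[i].len;
		mapped_len += segs[i].len;
	}
	return mapped_len >= m_len ? 0 : -EINVAL;
}
```

Physical contiguity is expected here because the underlying cluster is
already allocated, so every call maps blocks from the same cluster.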
> Note that having a single contiguous underlying region of type mapped,
> unwritten or hole is not a problem. The reason to avoid writing on top
> of a mixed mapping region is that atomic writes require that either all
> or nothing gets written for the userspace pwritev2 request. So if a
> crash or a sudden poweroff occurs at any point during the write, the
> region undergoing atomic write should read either complete old data or
> complete new data. But it should never have a mix of both old and new
> data.
> So, we first convert any mixed mapping region to a single contiguous
> mapped extent before any data gets written to it. This is because
> normally FS will only convert unwritten extents to written at the end of
> the write in ->end_io() call. And if we allow the writes over a mixed
> mapping and if a sudden power off happens in between, we will end up
> reading mix of new data (over mapped extents) and old data (over
> unwritten extents), because unwritten to written conversion never went
> through.
> So to avoid this and to avoid writes getting torn due to mixed
> mapping, we first allocate a single contiguous block mapping and then
> do the write.
>
> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Looks fine (I don't like the pre-zeroing but options are limited on
ext4) except for one thing...
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 8b86b1a29bdc..2642e1ef128f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3412,6 +3412,136 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> }
> }
>
> +static int ext4_map_blocks_atomic_write_slow(handle_t *handle,
> + struct inode *inode, struct ext4_map_blocks *map)
> +{
> + ext4_lblk_t m_lblk = map->m_lblk;
> + unsigned int m_len = map->m_len;
> + unsigned int mapped_len = 0, m_flags = 0;
> + ext4_fsblk_t next_pblk;
> + bool check_next_pblk = false;
> + int ret = 0;
> +
> + WARN_ON_ONCE(!ext4_has_feature_bigalloc(inode->i_sb));
> +
> + /*
> + * This is a slow path in case of mixed mapping. We use
> + * EXT4_GET_BLOCKS_CREATE_ZERO flag here to make sure we get a single
> + * contiguous mapped mapping. This will ensure any unwritten or hole
> + * regions within the requested range is zeroed out and we return
> + * a single contiguous mapped extent.
> + */
> + m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> +
> + do {
> + ret = ext4_map_blocks(handle, inode, map, m_flags);
> + if (ret < 0 && ret != -ENOSPC)
> + goto out_err;
> + /*
> + * This should never happen, but let's return an error code to
> + * avoid an infinite loop in here.
> + */
> + if (ret == 0) {
> + ret = -EFSCORRUPTED;
> + ext4_warning_inode(inode,
> + "ext4_map_blocks() couldn't allocate blocks m_flags: 0x%x, ret:%d",
> + m_flags, ret);
> + goto out_err;
> + }
> + /*
> + * With bigalloc we should never get ENOSPC nor discontiguous
> + * physical extents.
> + */
> + if ((check_next_pblk && next_pblk != map->m_pblk) ||
> + ret == -ENOSPC) {
> + ext4_warning_inode(inode,
> + "Non-contiguous allocation detected: expected %llu, got %llu, "
> + "or ext4_map_blocks() returned out of space ret: %d",
> + next_pblk, map->m_pblk, ret);
> + ret = -ENOSPC;
> + goto out_err;
If you get physically discontiguous mappings within a cluster, the
extent tree is corrupt.
--D
> + }
> + next_pblk = map->m_pblk + map->m_len;
> + check_next_pblk = true;
> +
> + mapped_len += map->m_len;
> + map->m_lblk += map->m_len;
> + map->m_len = m_len - mapped_len;
> + } while (mapped_len < m_len);
> +
> + /*
> + * We might have done some work in above loop, so we need to query the
> + * start of the physical extent, based on the origin m_lblk and m_len.
> + * Let's also ensure we were able to allocate the required range for
> + * mixed mapping case.
> + */
> + map->m_lblk = m_lblk;
> + map->m_len = m_len;
> + map->m_flags = 0;
> +
> + ret = ext4_map_blocks(handle, inode, map,
> + EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF);
> + if (ret != m_len) {
> + ext4_warning_inode(inode,
> + "allocation failed for atomic write request m_lblk:%u, m_len:%u, ret:%d\n",
> + m_lblk, m_len, ret);
> + ret = -EINVAL;
> + }
> + return ret;
> +
> +out_err:
> + /* reset map before returning an error */
> + map->m_lblk = m_lblk;
> + map->m_len = m_len;
> + map->m_flags = 0;
> + return ret;
> +}
> +
> +/*
> + * ext4_map_blocks_atomic_write: Helper routine to ensure the entire requested
> + * range in @map [lblk, lblk + len) is one single contiguous extent with no
> + * mixed mappings.
> + *
> + * We first use m_flags passed to us by our caller (ext4_iomap_alloc()).
> + * We only call EXT4_GET_BLOCKS_ZERO in the slow path, when the underlying
> + * physical extent for the requested range does not have a single contiguous
> + * mapping type i.e. (Hole, Mapped, or Unwritten) throughout.
> + * In that case we will loop over the requested range to allocate and zero out
> + * the unwritten / holes in between, to get a single mapped extent from
> + * [m_lblk, m_lblk + m_len). Note that this is only possible because we know
> + * this can be called only with bigalloc enabled filesystem where the underlying
> + * cluster is already allocated. This avoids allocating discontiguous extents
> + * in the slow path due to multiple calls to ext4_map_blocks().
> + * The slow path is mostly non-performance critical path, so it should be ok to
> + * loop using ext4_map_blocks() with appropriate flags to allocate & zero the
> + * underlying short holes/unwritten extents within the requested range.
> + */
> +static int ext4_map_blocks_atomic_write(handle_t *handle, struct inode *inode,
> + struct ext4_map_blocks *map, int m_flags)
> +{
> + ext4_lblk_t m_lblk = map->m_lblk;
> + unsigned int m_len = map->m_len;
> + int ret = 0;
> +
> + WARN_ON_ONCE(m_len > 1 && !ext4_has_feature_bigalloc(inode->i_sb));
> +
> + ret = ext4_map_blocks(handle, inode, map, m_flags);
> + if (ret < 0 || ret == m_len)
> + goto out;
> + /*
> + * This is a mixed mapping case where we were not able to allocate
> + * a single contiguous extent. In that case let's reset requested
> + * mapping and call the slow path.
> + */
> + map->m_lblk = m_lblk;
> + map->m_len = m_len;
> + map->m_flags = 0;
> +
> + return ext4_map_blocks_atomic_write_slow(handle, inode, map);
> +out:
> + return ret;
> +}
> +
> static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> unsigned int flags)
> {
> @@ -3425,7 +3555,30 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> */
> if (map->m_len > DIO_MAX_BLOCKS)
> map->m_len = DIO_MAX_BLOCKS;
> - dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
> +
> + /*
> + * journal credits estimation for atomic writes. We call
> + * ext4_map_blocks(), to find if there could be a mixed mapping. If yes,
> + * then let's assume the no. of pextents required can be m_len i.e.
> + * every alternate block can be unwritten and hole.
> + */
> + if (flags & IOMAP_ATOMIC) {
> + unsigned int orig_mlen = map->m_len;
> +
> + ret = ext4_map_blocks(NULL, inode, map, 0);
> + if (ret < 0)
> + return ret;
> + if (map->m_len < orig_mlen) {
> + map->m_len = orig_mlen;
> + dio_credits = ext4_meta_trans_blocks(inode, orig_mlen,
> + map->m_len);
> + } else {
> + dio_credits = ext4_chunk_trans_blocks(inode,
> + map->m_len);
> + }
> + } else {
> + dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
> + }
>
> retry:
> /*
> @@ -3456,7 +3609,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
>
> - ret = ext4_map_blocks(handle, inode, map, m_flags);
> + if (flags & IOMAP_ATOMIC)
> + ret = ext4_map_blocks_atomic_write(handle, inode, map, m_flags);
> + else
> + ret = ext4_map_blocks(handle, inode, map, m_flags);
>
> /*
> * We cannot fill holes in indirect tree based inodes as that could
> @@ -3480,6 +3636,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> int ret;
> struct ext4_map_blocks map;
> u8 blkbits = inode->i_blkbits;
> + unsigned int orig_mlen;
>
> if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> return -EINVAL;
> @@ -3493,6 +3650,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> map.m_lblk = offset >> blkbits;
> map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> + orig_mlen = map.m_len;
>
> if (flags & IOMAP_WRITE) {
> /*
> @@ -3503,8 +3661,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> */
> if (offset + length <= i_size_read(inode)) {
> ret = ext4_map_blocks(NULL, inode, &map, 0);
> - if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
> - goto out;
> + /*
> + * For atomic writes the entire requested length should
> + * be mapped.
> + */
> + if (map.m_flags & EXT4_MAP_MAPPED) {
> + if ((!(flags & IOMAP_ATOMIC) && ret > 0) ||
> + (flags & IOMAP_ATOMIC && ret >= orig_mlen))
> + goto out;
> + }
> + map.m_len = orig_mlen;
> }
> ret = ext4_iomap_alloc(inode, &map, flags);
> } else {
> @@ -3525,6 +3691,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> */
> map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);
>
> + /*
> + * Before returning to iomap, let's ensure the allocated mapping
> + * covers the entire requested length for atomic writes.
> + */
> + if (flags & IOMAP_ATOMIC) {
> + if (map.m_len < (length >> blkbits)) {
> + WARN_ON(1);
> + return -EINVAL;
> + }
> + }
> ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>
> return 0;
> --
> 2.49.0
>
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 6/7] ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc
2025-05-08 20:50 ` [PATCH v3 6/7] ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc Ritesh Harjani (IBM)
@ 2025-05-14 16:21 ` Darrick J. Wong
0 siblings, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2025-05-14 16:21 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-ext4, Theodore Ts'o, Jan Kara, John Garry,
Ojaswin Mujoo, linux-fsdevel
On Fri, May 09, 2025 at 02:20:36AM +0530, Ritesh Harjani (IBM) wrote:
> The last couple of patches added the needed support for multi-fsblock atomic
> writes using bigalloc. This patch ensures that the filesystem advertises the
> needed atomic write unit min and max values for enabling multi-fsblock
> atomic write support with bigalloc.
>
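How the advertised units fall out of the filesystem and device limits
can be sketched in userspace C as below. struct awu_limits and
compute_awu() are hypothetical names. fs_unit stands for the cluster
size with bigalloc and the block size otherwise.

```c
#include <stdbool.h>

struct awu_limits {
	unsigned int min;
	unsigned int max;
};

/*
 * fs_unit is the cluster size with bigalloc, else the block size.
 * A zero device limit means the device offers no atomic writes.
 * Returns true and fills *out when atomic writes can be advertised.
 */
static bool compute_awu(unsigned int blocksize, unsigned int fs_unit,
			unsigned int bdev_min, unsigned int bdev_max,
			struct awu_limits *out)
{
	unsigned int lo = blocksize > bdev_min ? blocksize : bdev_min;
	unsigned int hi = fs_unit < bdev_max ? fs_unit : bdev_max;

	if (!lo || !hi || lo > hi)
		return false;
	out->min = lo;
	out->max = hi;
	return true;
}
```

For example, a 4KB block size with 64KB clusters on a device that
reports 512 bytes and 32KB gives a range of [4KB, 32KB].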
> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
> fs/ext4/super.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 181934499624..508ea5cff1c7 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4442,12 +4442,12 @@ static int ext4_handle_clustersize(struct super_block *sb)
> /*
> * ext4_atomic_write_init: Initializes filesystem min & max atomic write units.
> * @sb: super block
> - * TODO: Later add support for bigalloc
> */
> static void ext4_atomic_write_init(struct super_block *sb)
> {
> struct ext4_sb_info *sbi = EXT4_SB(sb);
> struct block_device *bdev = sb->s_bdev;
> + unsigned int clustersize = sb->s_blocksize;
>
> if (!bdev_can_atomic_write(bdev))
> return;
> @@ -4455,9 +4455,12 @@ static void ext4_atomic_write_init(struct super_block *sb)
> if (!ext4_has_feature_extents(sb))
> return;
>
> + if (ext4_has_feature_bigalloc(sb))
> + clustersize = EXT4_CLUSTER_SIZE(sb);
Doesn't EXT4_CLUSTER_SIZE return EXT4_BLOCK_SIZE(sb) (aka s_blocksize)
for !bigalloc filesystems?
Looks fine to me otherwise
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
--D
> +
> sbi->s_awu_min = max(sb->s_blocksize,
> bdev_atomic_write_unit_min_bytes(bdev));
> - sbi->s_awu_max = min(sb->s_blocksize,
> + sbi->s_awu_max = min(clustersize,
> bdev_atomic_write_unit_max_bytes(bdev));
> if (sbi->s_awu_min && sbi->s_awu_max &&
> sbi->s_awu_min <= sbi->s_awu_max) {
> --
> 2.49.0
>
>
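To make the question above concrete, here is a minimal userspace sketch. It assumes, without confirmation from this thread, that the cluster size is derived by shifting the block size left by a per-superblock "cluster bits" value that is zero on non-bigalloc filesystems. The helper name cluster_size_bytes() is hypothetical and is not a kernel function.

```c
#include <stdint.h>

/*
 * Hypothetical model of a cluster-size computation: block size shifted
 * by the number of cluster bits. ASSUMPTION: cluster_bits is 0 when
 * bigalloc is not enabled, so the result equals the block size.
 */
static uint32_t cluster_size_bytes(uint32_t block_size,
				   unsigned int cluster_bits)
{
	return block_size << cluster_bits;
}
```

If that assumption holds, the bigalloc branch in the hunk would be redundant but harmless, since both paths yield the block size without bigalloc.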
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 1/7] ext4: Document an edge case for overwrites
2025-05-08 20:50 ` [PATCH v3 1/7] ext4: Document an edge case for overwrites Ritesh Harjani (IBM)
2025-05-09 5:19 ` Ojaswin Mujoo
@ 2025-05-14 16:23 ` Darrick J. Wong
1 sibling, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2025-05-14 16:23 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-ext4, Theodore Ts'o, Jan Kara, John Garry,
Ojaswin Mujoo, linux-fsdevel
On Fri, May 09, 2025 at 02:20:31AM +0530, Ritesh Harjani (IBM) wrote:
> ext4_iomap_overwrite_begin() clears the flag for IOMAP_WRITE before
> calling ext4_iomap_begin(). Document this above ext4_map_blocks() call
> as it is easy to miss it when focusing on write paths alone.
>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Weird but ok,
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
--D
> ---
> fs/ext4/inode.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 94c7d2d828a6..b10e5cd5bb5c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3436,6 +3436,10 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> }
> ret = ext4_iomap_alloc(inode, &map, flags);
> } else {
> + /*
> + * This can be called for overwrites path from
> + * ext4_iomap_overwrite_begin().
> + */
> ret = ext4_map_blocks(NULL, inode, &map, 0);
> }
>
> --
> 2.49.0
>
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/7] ext4: Check if inode uses extents in ext4_inode_can_atomic_write()
2025-05-08 20:50 ` [PATCH v3 2/7] ext4: Check if inode uses extents in ext4_inode_can_atomic_write() Ritesh Harjani (IBM)
2025-05-09 5:20 ` Ojaswin Mujoo
@ 2025-05-14 16:24 ` Darrick J. Wong
1 sibling, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2025-05-14 16:24 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-ext4, Theodore Ts'o, Jan Kara, John Garry,
Ojaswin Mujoo, linux-fsdevel
On Fri, May 09, 2025 at 02:20:32AM +0530, Ritesh Harjani (IBM) wrote:
> EXT4 only supports doing atomic write on inodes which uses extents, so
> add a check in ext4_inode_can_atomic_write() which gets called during
> open.
>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Makes sense seeing as advertising the awu geometry is gated on the
filesystem having extents turned on...
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
--D
> ---
> fs/ext4/ext4.h | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 5a20e9cd7184..c0240f6f6491 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3847,7 +3847,9 @@ static inline int ext4_buffer_uptodate(struct buffer_head *bh)
> static inline bool ext4_inode_can_atomic_write(struct inode *inode)
> {
>
> - return S_ISREG(inode->i_mode) && EXT4_SB(inode->i_sb)->s_awu_min > 0;
> + return S_ISREG(inode->i_mode) &&
> + ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> + EXT4_SB(inode->i_sb)->s_awu_min > 0;
> }
>
> extern int ext4_block_write_begin(handle_t *handle, struct folio *folio,
> --
> 2.49.0
>
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 3/7] ext4: Make ext4_meta_trans_blocks() non-static for later use
2025-05-08 20:50 ` [PATCH v3 3/7] ext4: Make ext4_meta_trans_blocks() non-static for later use Ritesh Harjani (IBM)
2025-05-09 5:21 ` Ojaswin Mujoo
@ 2025-05-14 16:24 ` Darrick J. Wong
1 sibling, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2025-05-14 16:24 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-ext4, Theodore Ts'o, Jan Kara, John Garry,
Ojaswin Mujoo, linux-fsdevel
On Fri, May 09, 2025 at 02:20:33AM +0530, Ritesh Harjani (IBM) wrote:
> Let's make ext4_meta_trans_blocks() non-static for use in later
> functions during ->end_io conversion for atomic writes.
> We will need this function to estimate journal credits for a special
> case. Instead of adding another wrapper around it, let's make this
> non-static.
>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
--D
> ---
> fs/ext4/ext4.h | 2 ++
> fs/ext4/inode.c | 6 +-----
> 2 files changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index c0240f6f6491..e2b36a3c1b0f 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3039,6 +3039,8 @@ extern void ext4_set_aops(struct inode *inode);
> extern int ext4_writepage_trans_blocks(struct inode *);
> extern int ext4_normal_submit_inode_data_buffers(struct jbd2_inode *jinode);
> extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
> +extern int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
> + int pextents);
> extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
> loff_t lstart, loff_t lend);
> extern vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b10e5cd5bb5c..2f99b087a5d8 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -142,9 +142,6 @@ static inline int ext4_begin_ordered_truncate(struct inode *inode,
> new_size);
> }
>
> -static int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
> - int pextents);
> -
> /*
> * Test whether an inode is a fast symlink.
> * A fast symlink has its symlink data stored in ext4_inode_info->i_data.
> @@ -5777,8 +5774,7 @@ static int ext4_index_trans_blocks(struct inode *inode, int lblocks,
> *
> * Also account for superblock, inode, quota and xattr blocks
> */
> -static int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
> - int pextents)
> +int ext4_meta_trans_blocks(struct inode *inode, int lblocks, int pextents)
> {
> ext4_group_t groups, ngroups = ext4_get_groups_count(inode->i_sb);
> int gdpblocks;
> --
> 2.49.0
>
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 7/7] ext4: Add atomic block write documentation
2025-05-09 7:34 ` Ojaswin Mujoo
@ 2025-05-14 16:38 ` Darrick J. Wong
2025-05-15 2:15 ` Ritesh Harjani
2025-05-15 2:18 ` Ritesh Harjani
1 sibling, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2025-05-14 16:38 UTC (permalink / raw)
To: Ojaswin Mujoo
Cc: Ritesh Harjani (IBM), linux-ext4, Theodore Ts'o, Jan Kara,
John Garry, linux-fsdevel
On Fri, May 09, 2025 at 01:04:05PM +0530, Ojaswin Mujoo wrote:
> On Fri, May 09, 2025 at 02:20:37AM +0530, Ritesh Harjani (IBM) wrote:
> > Add an initial documentation around atomic writes support in ext4.
> >
> > Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>
> Hi Ritesh,
>
> The docs look mostly good. I'll add some feedback below:
> > ---
> > .../filesystems/ext4/atomic_writes.rst | 208 ++++++++++++++++++
> > Documentation/filesystems/ext4/overview.rst | 1 +
> > 2 files changed, 209 insertions(+)
> > create mode 100644 Documentation/filesystems/ext4/atomic_writes.rst
> >
> > diff --git a/Documentation/filesystems/ext4/atomic_writes.rst b/Documentation/filesystems/ext4/atomic_writes.rst
> > new file mode 100644
> > index 000000000000..59b03d8dbb79
> > --- /dev/null
> > +++ b/Documentation/filesystems/ext4/atomic_writes.rst
> > @@ -0,0 +1,208 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +.. _atomic_writes:
> > +
> > +Atomic Block Writes
> > +-------------------------
> > +
> > +Introduction
> > +~~~~~~~~~~~~
> > +
> > +Atomic (untorn) block writes ensure that either the entire write is committed
> > +to disk or none of it is. This prevents "torn writes" during power loss or
> > +system crashes. The ext4 filesystem supports atomic writes (only with Direct
> > +I/O) on regular files with extents, provided the underlying storage device
> > +supports hardware atomic writes. This is supported in the following two ways:
> > +
> > +1. **Single-fsblock Atomic Writes**:
> > + EXT4's supports atomic write operations with a single filesystem block since
> > + v6.13. In this the atomic write unit minimum and maximum sizes are both set
> > + to filesystem blocksize.
> > + e.g. doing atomic write of 16KB with 16KB filesystem blocksize on 64KB
> > + pagesize system is possible.
> > +
> > +2. **Multi-fsblock Atomic Writes with Bigalloc**:
> > + EXT4 now also supports atomic writes spanning multiple filesystem blocks
> > + using a feature known as bigalloc. The atomic write unit's minimum and
> > + maximum sizes are determined by the filesystem block size and cluster size,
> > + based on the underlying device’s supported atomic write unit limits.
> > +
> > +Requirements
> > +~~~~~~~~~~~~
> > +
> > +Basic requirements for atomic writes in ext4:
> > +
> > + 1. The extents feature must be enabled (default for ext4)
> > + 2. The underlying block device must support atomic writes
> > + 3. For single-fsblock atomic writes:
> > +
> > + 1. A filesystem with appropriate block size (up to the page size)
> > + 4. For multi-fsblock atomic writes:
> > +
> > + 1. The bigalloc feature must be enabled
> > + 2. The cluster size must be appropriately configured
> > +
> > +NOTE: EXT4 does not support software or COW based atomic write, which means
> > +atomic writes on ext4 are only supported if underlying storage device supports
> > +it.
> > +
> > +Multi-fsblock Implementation Details
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +The bigalloc feature changes ext4 to use clustered allocations. With bigalloc
I would say "...changes ext4 to allocate in units of multiple fs blocks,
also known as clusters." so that the definition of a cluster is right
there in the first sentence instead of the second.
> > +each bit within block bitmap represents clusters (power of 2 number of blocks)
> > +rather than individual filesystem blocks. EXT4 supports atomic writes using
> > +bigalloc by making sure that atomic write min and max are within [blocksize,
> > +clustersize].
>
> Should we add a line like:
>
> Atomic write max unit is capped to the max supported by the underlying
> device, in case it is less than the clustersize.
I think the documentation should say exactly what the untorn write
geometry is constrained to:
"EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the
following constraints: The minimum atomic write size is the larger of
the fs block size and the minimum hardware atomic write unit; and the
maximum atomic write size is the smaller of the bigalloc cluster size and
the maximum hardware atomic write unit. Bigalloc ensures that all
allocations are aligned to the cluster size, which satisfies the LBA
alignment requirements of the hardware device if the start of the
partition/logical volume is itself aligned correctly."
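The constraint described above can be sketched in plain C as follows. This is a userspace illustration only; struct awu_limits and awu_compute() are hypothetical names, and a zero hardware limit is used here to stand in for "the device cannot do atomic writes".

```c
#include <stdint.h>

/* Resulting atomic write unit limits; both 0 means "not supported". */
struct awu_limits {
	uint32_t min;
	uint32_t max;
};

static uint32_t u32_max(uint32_t a, uint32_t b) { return a > b ? a : b; }
static uint32_t u32_min(uint32_t a, uint32_t b) { return a < b ? a : b; }

/*
 * Hypothetical helper: min is the larger of the fs block size and the
 * hardware minimum; max is the smaller of the cluster size and the
 * hardware maximum. The limits are only advertised when min <= max.
 */
static struct awu_limits awu_compute(uint32_t block_size,
				     uint32_t cluster_size,
				     uint32_t hw_min, uint32_t hw_max)
{
	struct awu_limits l = { 0, 0 };
	uint32_t lo, hi;

	if (hw_min == 0 || hw_max == 0)
		return l;	/* device has no atomic write support */

	lo = u32_max(block_size, hw_min);
	hi = u32_min(cluster_size, hw_max);
	if (lo <= hi) {
		l.min = lo;
		l.max = hi;
	}
	return l;
}
```

For example, a 4096-byte block size with a 65536-byte cluster on a device advertising 4096..16384 yields limits of 4096..16384, while a device whose minimum exceeds the cluster size yields no atomic write support at all.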
> Also, maybe we can have a line with something like "With bigalloc's
> clustered allocation we can be sure that an atomic write will always
> be allocated aligned blocks. The only thing we need to ensure is that
> we have a continuous mapping in the write range."
>
> > +
> > +Here is the block allocation strategy in bigalloc for atomic writes:
> > +
> > + * For regions with fully mapped extents, no additional allocation is needed
"No additional work is needed" ?
> > + * For append writes, a new mapped extent is allocated
> > + * For regions that are entirely holes, unwritten extent is created
> > + * For large unwritten extents, the extent gets split into two unwritten
> > + extents of appropriate requested size
>
> Are the above 4 points needed explicitly? Maybe we can have:
>
> Append writes, and writes on regions that are fully mapped,
> unwritten or hole follow the same flow as non atomic writes.
>
> > + * For mixed mapping regions (combinations of holes, unwritten extents, or
> > + mapped extents), ext4_map_blocks() is called in a loop with
> > + EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
> > + mapped extent
> Maybe:
>
> ... single continuous mapped extents by writing zeroes to it
>
> So that we explicitly mention what we are doing and not rely on people
> knowing the meaning of EXT4_GET_BLOCKS_ZERO flag.
(Yeah.)
> > +Note: Writing on a single contiguous underlying extent, whether mapped or
> > +unwritten, is not inherently problematic. However, writing to a mixed mapping
> > +region (i.e. one containing a combination of mapped and unwritten extents)
> > +must be avoided when performing atomic writes.
> > +
> > +The reason is that, atomic writes when issued via pwritev2() with the RWF_ATOMIC
> > +flag, requires that either all data is written or none at all. In the event of
> > +a system crash or unexpected power loss during the write operation, the affected
> > +region (when later read) must reflect either the complete old data or the
> > +complete new data, but never a mix of both.
> > +
> > +To enforce this guarantee, we ensure that the write target is backed by
> > +a single, contiguous extent before any data is written. This is critical because
> > +ext4 defers the conversion of unwritten extents to written extents until the I/O
> > +completion path (typically in ->end_io()). If a write is allowed to proceed over
> > +a mixed mapping region (with mapped and unwritten extents) and a failure occurs
> > +mid-write, the system could observe partially updated regions after reboot, i.e.
> > +new data over mapped areas, and stale (old) data over unwritten extents that
> > +were never marked written. This violates the atomicity and/or torn write
> > +prevention guarantee.
> > +
> > +To prevent such torn writes, ext4 proactively allocates a single contiguous
> > +extent for the entire requested region in ``ext4_iomap_alloc`` via
> > +``ext4_map_blocks_atomic()``. Only after this allocation, is the write
> > +operation performed by iomap.
> > +
> > +Handling Split Extents Across Leaf Blocks
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +There can be a special edge case where we have logically and physically
> > +contiguous extents stored in separate leaf nodes of the on-disk extent tree.
> > +This occurs because on-disk extent tree merges only happens within the leaf
> > +blocks except for a case where we have 2-level tree which can get merged and
> > +collapsed entirely into the inode.
Aha, I guess this is the answer to my earlier question. :)
> > +If such a layout exists and, in the worst case, the extent status cache entries
> > +are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
> > +a single contiguous extent for these split leaf extents.
> > +
> > +To address this edge case, a new get block flag
> > +``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS flag`` is added to enhance the
> > +``ext4_map_query_blocks()`` lookup behavior.
> > +
> > +This new get block flag allows ``ext4_map_blocks()`` to first checks if there is
>
> s/checks/check
>
> > +an entry in the extent status cache for the full range.
> > +If not present, it consults the on-disk extent tree using
> > +``ext4_map_query_blocks()``.
> > +If the located extent is at the end of a leaf node, it probes the next logical
> > +block (lblk) to detect a contiguous extent in the adjacent leaf.
> > +
> > +For now only one additional leaf block is queried to maintain efficiency, as
> > +atomic writes are typically constrained to small sizes
> > +(e.g. [blocksize, clustersize]).
> > +
> > +
> > +Handling Journal transactions
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +To support multi-fsblock atomic writes, we ensure enough journal credits are
> > +reserved during:
> > +
> > + 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
> > + could be a mixed mapping for the underlying requested range. If yes, then we
> > + reserve credits of up to ``m_len``, assuming every alternate block can be
> > + an unwritten extent followed by a hole.
> > +
> > + 2. During ``->end_io()`` call, we make sure a single transaction is started for
> > + doing unwritten-to-written conversion. The loop for conversion is mainly
> > + only required to handle a split extent across leaf blocks.
> > +
> > +How to
> > +------
> > +
> > +Creating Filesystems with Atomic Write Support
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +For single-fsblock atomic writes with a larger block size
> > +(on systems with block size < page size):
> > +
> > +.. code-block:: bash
> > +
> > + # Create an ext4 filesystem with a 16KB block size
> > + # (requires page size >= 16KB)
> > + mkfs.ext4 -b 16384 /dev/device
> > +
> > +For multi-fsblock atomic writes with bigalloc:
> > +
> > +.. code-block:: bash
> > +
> > + # Create an ext4 filesystem with bigalloc and 64KB cluster size
> > + mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device
> > +
> > +Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
> > +and ``-O bigalloc`` enables the bigalloc feature.
Might want to add at least a sentence along the lines of "figure out what
atomic write unit your application needs by querying statx on the block
device" or whatever. Or refer them to the "Hardware Support" section. :)
> > +
> > +Application Interface
> > +~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
> > +to perform atomic writes:
> > +
> > +.. code-block:: c
> > +
> > + pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);
> > +
> > +The write must be aligned to the filesystem's block size and not exceed the
> > +filesystem's maximum atomic write unit size.
> > +See ``generic_atomic_write_valid()`` for more details.
> > +
> > +``statx()`` system call with ``STATX_WRITE_ATOMIC`` flag can provides following
> > +details:
> > +
> > + * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
> > + * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
> > + * ``stx_atomic_write_segments_max``: Upper limit for segments. Tthe number of
s/Tthe/The/
> > + separate memory buffers that can be gathered into a write operation
> > + (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.
> > +
> > +The STATX_ATTR_WRITE_ATOMIC flag in ``statx->attributes`` is set if atomic
> > +writes are supported.
> > +
> > +Hardware Support
> > +----------------
> > +
> > +The underlying storage device must support atomic write operations.
> > +Modern NVMe and SCSI devices often provide this capability.
> > +The Linux kernel exposes this information through sysfs:
> > +
> > +* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size
> > +* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size
> > +
> > +Nonzero values for these attributes indicate that the device supports
> > +atomic writes.
The rest fits with my understanding of atomic untorn writes.
--D
> > +
> > +See Also
> > +--------
> > +
> > +* :doc:`bigalloc` - Documentation on the bigalloc feature
> > +* :doc:`allocators` - Documentation on block allocation in ext4
> > +* Support for atomic block writes in 6.13:
> > + https://lwn.net/Articles/1009298/
> > diff --git a/Documentation/filesystems/ext4/overview.rst b/Documentation/filesystems/ext4/overview.rst
> > index 0fad6eda6e15..9d4054c17ecb 100644
> > --- a/Documentation/filesystems/ext4/overview.rst
> > +++ b/Documentation/filesystems/ext4/overview.rst
> > @@ -25,3 +25,4 @@ order.
> > .. include:: inlinedata.rst
> > .. include:: eainode.rst
> > .. include:: verity.rst
> > +.. include:: atomic_writes.rst
> > --
> > 2.49.0
> >
>
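As a rough illustration of the "Application Interface" rules quoted above, the following userspace sketch checks whether a request would be acceptable before it is issued with RWF_ATOMIC. It assumes the generic rules are: length within [unit_min, unit_max], length a power of two, and offset naturally aligned to the length. The helper name atomic_write_request_ok() is hypothetical, and the real checks in the kernel may be stricter.

```c
#include <stdbool.h>
#include <stdint.h>

static bool is_power_of_two(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Hypothetical pre-flight check for an atomic write request.
 * unit_min/unit_max would come from statx() with STATX_WRITE_ATOMIC;
 * zero values are treated as "atomic writes not supported".
 */
static bool atomic_write_request_ok(uint64_t offset, uint64_t len,
				    uint32_t unit_min, uint32_t unit_max)
{
	if (unit_min == 0 || unit_max == 0)
		return false;
	if (len < unit_min || len > unit_max)
		return false;
	if (!is_power_of_two(len))
		return false;
	/* ASSUMPTION: the offset must be a multiple of the length. */
	return (offset & (len - 1)) == 0;
}
```

With limits of 4096..65536, a 16KB write at offset 0 or 16384 passes, while a 16KB write at offset 4096 or a 12KB write at any offset is rejected.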
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc
2025-05-09 17:42 ` [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani
@ 2025-05-14 16:40 ` Darrick J. Wong
2025-05-14 18:55 ` Ritesh Harjani
0 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2025-05-14 16:40 UTC (permalink / raw)
To: Ritesh Harjani
Cc: linux-ext4, Theodore Ts'o, Jan Kara, John Garry,
Ojaswin Mujoo, linux-fsdevel
On Fri, May 09, 2025 at 11:12:46PM +0530, Ritesh Harjani wrote:
> "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> writes:
>
> > This is v3 of multi-fsblock atomic write support using bigalloc. This has
> > started looking into much better shape now. The major chunk of the design
> > changes has been kept in Patch-4 & 5.
> >
> > This series can now be carefully reviewed, as all the error handling related
> > code paths should be properly taken care of.
> >
>
> We spotted that multi-fsblock changes might need to force a journal
> commit if there were mixed mappings in the underlying region e.g. say WUWUWUW...
>
> The issue arises when, during block allocation, the unwritten ranges are
> first zeroed out, followed by the unwritten-to-written extent
> conversion. This conversion is part of a journaled metadata transaction
> that has not yet been committed, as the transaction is still running.
> If an iomap write then modifies the data on those multi-fsblocks and a
> sudden power loss occurs before the transaction commits, the
> unwritten-to-written conversion will not be replayed during journal
> recovery. As a result, we end up with new data written over mapped
> blocks, while the alternate unwritten blocks will read zeroes. This
> could cause a torn write behavior for atomic writes.
>
> So we were thinking we might need something like this. Hopefully this
> should still be ok, as mixed mapping case mostly is a non-performance
> critical path. Thoughts?
I agree the journal has to be written out before the atomic write is
sent to the device.
--D
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 2642e1ef128f..59b59d609976 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3517,7 +3517,8 @@ static int ext4_map_blocks_atomic_write_slow(handle_t *handle,
> * underlying short holes/unwritten extents within the requested range.
> */
> static int ext4_map_blocks_atomic_write(handle_t *handle, struct inode *inode,
> - struct ext4_map_blocks *map, int m_flags)
> + struct ext4_map_blocks *map, int m_flags,
> + bool *force_commit)
> {
> ext4_lblk_t m_lblk = map->m_lblk;
> unsigned int m_len = map->m_len;
> @@ -3537,6 +3538,11 @@ static int ext4_map_blocks_atomic_write(handle_t *handle, struct inode *inode,
> map->m_len = m_len;
> map->m_flags = 0;
>
> + /*
> + * slow path means we have mixed mapping, that means we will need
> + * to force txn commit.
> + */
> + *force_commit = true;
> return ext4_map_blocks_atomic_write_slow(handle, inode, map);
> out:
> return ret;
> @@ -3548,6 +3554,7 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> handle_t *handle;
> u8 blkbits = inode->i_blkbits;
> int ret, dio_credits, m_flags = 0, retries = 0;
> + bool force_commit = false;
>
> /*
> * Trim the mapping request to the maximum value that we can map at
> @@ -3610,7 +3617,8 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
>
> if (flags & IOMAP_ATOMIC)
> - ret = ext4_map_blocks_atomic_write(handle, inode, map, m_flags);
> + ret = ext4_map_blocks_atomic_write(handle, inode, map, m_flags,
> + &force_commit);
> else
> ret = ext4_map_blocks(handle, inode, map, m_flags);
>
> @@ -3626,6 +3634,9 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> goto retry;
>
> + if (ret > 0 && force_commit)
> + ext4_force_commit(inode->i_sb);
> +
> return ret;
> }
>
>
> -ritesh
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 4/7] ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS
2025-05-14 16:16 ` Darrick J. Wong
@ 2025-05-14 18:47 ` Ritesh Harjani
0 siblings, 0 replies; 26+ messages in thread
From: Ritesh Harjani @ 2025-05-14 18:47 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-ext4, Theodore Ts'o, Jan Kara, John Garry,
Ojaswin Mujoo, linux-fsdevel
"Darrick J. Wong" <djwong@kernel.org> writes:
> On Fri, May 09, 2025 at 02:20:34AM +0530, Ritesh Harjani (IBM) wrote:
>> There can be a case where there are contiguous extents on the adjacent
>> leaf nodes of on-disk extent trees. So when someone tries to write to
>> this contiguous range, ext4_map_blocks() call will split by returning
>> 1 extent at a time if this is not already cached in extent_status tree
>> cache (where if these extents when cached can get merged since they are
>> contiguous).
>>
>> This is fine for a normal write however in case of atomic writes, it
>> can't afford to break the write into two. Now this is also something
>> that will only happen in the slow write case where we call
>> ext4_map_blocks() for each of these extents spread across different leaf
>> nodes. However, there is no guarantee that these extent status cache
>> cannot be reclaimed before the last call to ext4_map_blocks() in
>> ext4_map_blocks_atomic_write_slow().
>
> Can you have two physically and logically contiguous mappings within a
> single leaf node?
On disk extent tree can merge two such blocks if it is within the same
leaf node. But there can be a case where there are two logically and
physically contiguous mappings lying on two different leaf nodes.
(since on disk extent tree does not merge extents across branches.)
In that case ext4_map_blocks() can return only 1 mapping at a time
(unless it is cached in extent status cache).
> Or is the key idea here that the extent status tree
> will merge adjacent mappings from the same leaf block, just not between
> leaf blocks?
>
Yes, in memory extent status cache can still merge this. But there can
be a case (we can argue in this case it may practically never happen)
that, the extent status cache got pruned due to memory pressure and we
have to look over on-disk extent tree. In that case we will need to look
ahead in the adjacent leaf block to see if we have a contiguous mapping.
Otherwise the atomic write will always fail over such contiguous region
split across two leaf nodes.
>> Hence this patch adds support of EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS.
>> This flag checks if the requested range can be fully found in extent
>> status cache and return. If not, it looks up in on-disk extent
>> tree via ext4_map_query_blocks(). If the found extent is the last entry
>> in the leaf node, then it goes and queries the next lblk to see if there
>> is an adjacent contiguous extent in the adjacent leaf node of the
>> on-disk extent tree.
>>
>> Even though there can be a case where there are multiple adjacent extent
>> entries spread across multiple leaf nodes. But we only read an adjacent
>> leaf block i.e. in total of 2 extent entries spread across 2 leaf nodes.
>> The reason for this is that we are mostly only going to support atomic
>> writes with upto 64KB or maybe max upto 1MB of atomic write support.
>>
>> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>> fs/ext4/ext4.h | 18 ++++++++-
>> fs/ext4/extents.c | 12 ++++++
>> fs/ext4/inode.c | 97 +++++++++++++++++++++++++++++++++++++++++------
>> 3 files changed, 115 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index e2b36a3c1b0f..b4bbe2837423 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -256,9 +256,19 @@ struct ext4_allocation_request {
>> #define EXT4_MAP_UNWRITTEN BIT(BH_Unwritten)
>> #define EXT4_MAP_BOUNDARY BIT(BH_Boundary)
>> #define EXT4_MAP_DELAYED BIT(BH_Delay)
>> +/*
>> + * This is for use in ext4_map_query_blocks() for a special case where we can
>> + * have a physically and logically contiguous blocks explit across two leaf
>
> s/explit/split/ ?
Thanks! Will fix it.
-ritesh
>
> --D
>
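The look-ahead idea discussed above can be modelled with a toy userspace structure. This is only a sketch under simplifying assumptions: leaves are a flat array, each leaf holds sorted extents, and at most one adjacent leaf is probed. struct extent, struct leaf and query_len() are hypothetical names and do not mirror the on-disk ext4 format.

```c
#include <stddef.h>
#include <stdint.h>

struct extent {
	uint32_t lblk;	/* first logical block */
	uint32_t len;	/* length in blocks */
	uint64_t pblk;	/* first physical block */
};

struct leaf {
	const struct extent *ext;
	size_t nr;
};

/*
 * Return how many blocks, up to 'want', are mapped contiguously from
 * 'lblk'. If the extent found is the last one in its leaf and does not
 * cover the request, probe only the first extent of the next leaf and
 * extend the result when it is logically and physically contiguous.
 */
static uint32_t query_len(const struct leaf *leaves, size_t nr_leaves,
			  uint32_t lblk, uint32_t want)
{
	for (size_t i = 0; i < nr_leaves; i++) {
		for (size_t j = 0; j < leaves[i].nr; j++) {
			const struct extent *e = &leaves[i].ext[j];
			uint32_t got;

			if (lblk < e->lblk || lblk >= e->lblk + e->len)
				continue;

			got = e->lblk + e->len - lblk;
			if (got < want && j == leaves[i].nr - 1 &&
			    i + 1 < nr_leaves && leaves[i + 1].nr > 0) {
				const struct extent *n = &leaves[i + 1].ext[0];

				if (n->lblk == e->lblk + e->len &&
				    n->pblk == e->pblk + e->len)
					got += n->len;
			}
			return got < want ? got : want;
		}
	}
	return 0;	/* hole: nothing mapped at lblk */
}
```

Without the probe, a request that straddles the two leaves would be reported as a short mapping even though the blocks are contiguous on disk, which is the case an atomic write cannot tolerate.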
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc
2025-05-14 16:40 ` Darrick J. Wong
@ 2025-05-14 18:55 ` Ritesh Harjani
0 siblings, 0 replies; 26+ messages in thread
From: Ritesh Harjani @ 2025-05-14 18:55 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-ext4, Theodore Ts'o, Jan Kara, John Garry,
Ojaswin Mujoo, linux-fsdevel
"Darrick J. Wong" <djwong@kernel.org> writes:
> On Fri, May 09, 2025 at 11:12:46PM +0530, Ritesh Harjani wrote:
>> "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> writes:
>>
>> > This is v3 of multi-fsblock atomic write support using bigalloc. This has
>> > started looking into much better shape now. The major chunk of the design
>> > changes has been kept in Patch-4 & 5.
>> >
>> > This series can now be carefully reviewed, as all the error handling related
>> > code paths should be properly taken care of.
>> >
>>
>> We spotted that multi-fsblock changes might need to force a journal
>> commit if there were mixed mappings in the underlying region e.g. say WUWUWUW...
>>
>> The issue arises when, during block allocation, the unwritten ranges are
>> first zeroed out, followed by the unwritten-to-written extent
>> conversion. This conversion is part of a journaled metadata transaction
>> that has not yet been committed, as the transaction is still running.
>> If an iomap write then modifies the data on those multi-fsblocks and a
>> sudden power loss occurs before the transaction commits, the
>> unwritten-to-written conversion will not be replayed during journal
>> recovery. As a result, we end up with new data written over mapped
>> blocks, while the alternate unwritten blocks will read zeroes. This
>> could cause a torn write behavior for atomic writes.
>>
>> So we were thinking we might need something like this. Hopefully this
>> should still be ok, as mixed mapping case mostly is a non-performance
>> critical path. Thoughts?
>
> I agree the journal has to be written out before the atomic write is
> sent to the device.
Yes, we were even able to reproduce this problem on an actual nvme
(which supports atomic write), with one of our data integrity tests
(which btw still needs a little cleanup before we can post it for
integration with xfstests).
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 2642e1ef128f..59b59d609976 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3517,7 +3517,8 @@ static int ext4_map_blocks_atomic_write_slow(handle_t *handle,
>> * underlying short holes/unwritten extents within the requested range.
>> */
>> static int ext4_map_blocks_atomic_write(handle_t *handle, struct inode *inode,
>> - struct ext4_map_blocks *map, int m_flags)
>> + struct ext4_map_blocks *map, int m_flags,
>> + bool *force_commit)
>> {
>> ext4_lblk_t m_lblk = map->m_lblk;
>> unsigned int m_len = map->m_len;
>> @@ -3537,6 +3538,11 @@ static int ext4_map_blocks_atomic_write(handle_t *handle, struct inode *inode,
>> map->m_len = m_len;
>> map->m_flags = 0;
>>
>> + /*
>> + * slow path means we have mixed mapping, that means we will need
>> + * to force txn commit.
>> + */
>> + *force_commit = true;
>> return ext4_map_blocks_atomic_write_slow(handle, inode, map);
>> out:
>> return ret;
>> @@ -3548,6 +3554,7 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>> handle_t *handle;
>> u8 blkbits = inode->i_blkbits;
>> int ret, dio_credits, m_flags = 0, retries = 0;
>> + bool force_commit = false;
>>
>> /*
>> * Trim the mapping request to the maximum value that we can map at
>> @@ -3610,7 +3617,8 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>> m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
>>
>> if (flags & IOMAP_ATOMIC)
>> - ret = ext4_map_blocks_atomic_write(handle, inode, map, m_flags);
>> + ret = ext4_map_blocks_atomic_write(handle, inode, map, m_flags,
>> + &force_commit);
>> else
>> ret = ext4_map_blocks(handle, inode, map, m_flags);
>>
>> @@ -3626,6 +3634,9 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>> if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
>> goto retry;
>>
>> + if (ret > 0 && force_commit)
>> + ext4_force_commit(inode->i_sb);
>> +
Needs to handle return value from ext4_force_commit() here. Will
integrate this change in v4.
-ritesh
>> return ret;
>> }
>>
>>
>> -ritesh
>>
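For reference, a rough userspace model of what handling that return value
could look like. This is only a sketch: fake_force_commit() and
finish_alloc() are made-up stand-ins for ext4_force_commit() and the tail
of ext4_iomap_alloc(), and the actual v4 change may be shaped differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for ext4_force_commit(): returns 0 on success or
 * a negative errno. The result is controlled through this variable.
 */
int fake_commit_result;

int fake_force_commit(void)
{
	return fake_commit_result;
}

/*
 * Models the tail of the allocation path. map_ret is the result of the
 * block mapping call (> 0 means blocks were mapped). A failed forced
 * commit overrides the mapping result; otherwise map_ret is passed on.
 */
int finish_alloc(int map_ret, bool force_commit)
{
	if (map_ret > 0 && force_commit) {
		int err = fake_force_commit();

		if (err)
			return err;
	}
	return map_ret;
}
```

The point is only that a commit failure is propagated instead of being
silently dropped, while the no-commit and mapping-error cases are unchanged.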
* Re: [PATCH v3 5/7] ext4: Add multi-fsblock atomic write support with bigalloc
2025-05-14 16:19 ` Darrick J. Wong
@ 2025-05-14 19:04 ` Ritesh Harjani
0 siblings, 0 replies; 26+ messages in thread
From: Ritesh Harjani @ 2025-05-14 19:04 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-ext4, Theodore Ts'o, Jan Kara, John Garry,
Ojaswin Mujoo, linux-fsdevel
"Darrick J. Wong" <djwong@kernel.org> writes:
> On Fri, May 09, 2025 at 02:20:35AM +0530, Ritesh Harjani (IBM) wrote:
>> EXT4 supports bigalloc feature which allows the FS to work in size of
>> clusters (group of blocks) rather than individual blocks. This patch
>> adds atomic write support for bigalloc so that systems with bs = ps can
>> also create FS using -
>> mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>
>>
>> With bigalloc ext4 can support multi-fsblock atomic writes. We will have to
>> adjust ext4's atomic write unit max value to cluster size. This can then support
>> atomic write of size anywhere between [blocksize, clustersize]. This
>> patch adds the required changes to enable multi-fsblock atomic write
>> support using bigalloc in the next patch.
>>
>> In this patch for block allocation:
>> we first query the underlying region of the requested range by calling
>> ext4_map_blocks() call. Here are the various cases which we then handle
>> depending upon the underlying mapping type:
>> 1. If the underlying region for the entire requested range is a mapped extent,
>> then we don't call ext4_map_blocks() to allocate anything. We don't need to
>> even start the jbd2 txn in this case.
>> 2. For an append write case, we create a mapped extent.
>> 3. If the underlying region is entirely a hole, then we create an unwritten
>> extent for the requested range.
>> 4. If the underlying region is a large unwritten extent, then we split the
>> extent into 2 unwritten extents of required size.
>> 5. If the underlying region has any type of mixed mapping, then we call
>> ext4_map_blocks() in a loop to zero out the unwritten and the hole regions
>> within the requested range. This then provides a single mapped extent type
>> mapping for the requested range.
>>
>> Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO
>> flag only when the underlying extent mapping of the requested range is
>> not entirely a hole, an unwritten extent, or a fully mapped extent. That
>> is, if the underlying region contains a mix of hole(s), unwritten
>> extent(s), and mapped extent(s), we use this loop to ensure that all the
>> short mappings are zeroed out. This guarantees that the entire requested
>> range becomes a single, uniformly mapped extent. It is ok to do so
>> because we know this is being done on a bigalloc enabled filesystem
>> where the block bitmap represents the entire cluster unit.
>>
>> Note having a single contiguous underlying region of type mapped,
>> unwritten or hole is not a problem. But the reason to avoid writing on
>> top of mixed mapping region is because, atomic writes requires all or
>> nothing should get written for the userspace pwritev2 request. So if at
>> any point in time during the write if a crash or a sudden poweroff
>> occurs, the region undergoing atomic write should read either complete
>> old data or complete new data. But it should never have a mix of both
>> old and new data.
>> So, we first convert any mixed mapping region to a single contiguous
>> mapped extent before any data gets written to it. This is because
>> normally FS will only convert unwritten extents to written at the end of
>> the write in ->end_io() call. And if we allow the writes over a mixed
>> mapping and if a sudden power off happens in between, we will end up
>> reading mix of new data (over mapped extents) and old data (over
>> unwritten extents), because unwritten to written conversion never went
>> through.
>> So to avoid this and to avoid writes getting torn due to mixed
>> mapping, we first allocate a single contiguous block mapping and then
>> do the write.
>>
>> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>
> Looks fine (I don't like the pre-zeroing but options are limited on
> ext4) except for one thing...
>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 8b86b1a29bdc..2642e1ef128f 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3412,6 +3412,136 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>> }
>> }
>>
>> +static int ext4_map_blocks_atomic_write_slow(handle_t *handle,
>> + struct inode *inode, struct ext4_map_blocks *map)
>> +{
>> + ext4_lblk_t m_lblk = map->m_lblk;
>> + unsigned int m_len = map->m_len;
>> + unsigned int mapped_len = 0, m_flags = 0;
>> + ext4_fsblk_t next_pblk;
>> + bool check_next_pblk = false;
>> + int ret = 0;
>> +
>> + WARN_ON_ONCE(!ext4_has_feature_bigalloc(inode->i_sb));
>> +
>> + /*
>> + * This is a slow path in case of mixed mapping. We use
>> + * EXT4_GET_BLOCKS_CREATE_ZERO flag here to make sure we get a single
>> + * contiguous mapped mapping. This will ensure any unwritten or hole
>> + * regions within the requested range is zeroed out and we return
>> + * a single contiguous mapped extent.
>> + */
>> + m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
>> +
>> + do {
>> + ret = ext4_map_blocks(handle, inode, map, m_flags);
>> + if (ret < 0 && ret != -ENOSPC)
>> + goto out_err;
>> + /*
>> + * This should never happen, but let's return an error code to
>> + * avoid an infinite loop in here.
>> + */
>> + if (ret == 0) {
>> + ret = -EFSCORRUPTED;
>> + ext4_warning_inode(inode,
>> + "ext4_map_blocks() couldn't allocate blocks m_flags: 0x%x, ret:%d",
>> + m_flags, ret);
>> + goto out_err;
>> + }
>> + /*
>> + * With bigalloc we should never get ENOSPC nor discontiguous
>> + * physical extents.
>> + */
>> + if ((check_next_pblk && next_pblk != map->m_pblk) ||
>> + ret == -ENOSPC) {
>> + ext4_warning_inode(inode,
>> + "Non-contiguous allocation detected: expected %llu, got %llu, "
>> + "or ext4_map_blocks() returned out of space ret: %d",
>> + next_pblk, map->m_pblk, ret);
>> + ret = -ENOSPC;
>> + goto out_err;
>
> If you get physically discontiguous mappings within a cluster, the
> extent tree is corrupt.
>
Yes, I guess I was just being hesitant to do that. But you are right,
we should return -EFSCORRUPTED here.
I will change the error code along with the other force-commit change in v4.
> --D
>
Thanks for the review!
-ritesh
>> + next_pblk = map->m_pblk + map->m_len;
>> + check_next_pblk = true;
>> +
>> + mapped_len += map->m_len;
>> + map->m_lblk += map->m_len;
>> + map->m_len = m_len - mapped_len;
>> + } while (mapped_len < m_len);
>> +
>> + /*
>> + * We might have done some work in above loop, so we need to query the
>> + * start of the physical extent, based on the origin m_lblk and m_len.
>> + * Let's also ensure we were able to allocate the required range for
>> + * mixed mapping case.
>> + */
>> + map->m_lblk = m_lblk;
>> + map->m_len = m_len;
>> + map->m_flags = 0;
>> +
>> + ret = ext4_map_blocks(handle, inode, map,
>> + EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF);
>> + if (ret != m_len) {
>> + ext4_warning_inode(inode,
>> + "allocation failed for atomic write request m_lblk:%u, m_len:%u, ret:%d\n",
>> + m_lblk, m_len, ret);
>> + ret = -EINVAL;
>> + }
>> + return ret;
>> +
>> +out_err:
>> + /* reset map before returning an error */
>> + map->m_lblk = m_lblk;
>> + map->m_len = m_len;
>> + map->m_flags = 0;
>> + return ret;
>> +}
>> +
>> +/*
>> + * ext4_map_blocks_atomic: Helper routine to ensure the entire requested
>> + * range in @map [lblk, lblk + len) is one single contiguous extent with no
>> + * mixed mappings.
>> + *
>> + * We first use m_flags passed to us by our caller (ext4_iomap_alloc()).
>> + * We only call EXT4_GET_BLOCKS_ZERO in the slow path, when the underlying
>> + * physical extent for the requested range does not have a single contiguous
>> + * mapping type i.e. (Hole, Mapped, or Unwritten) throughout.
>> + * In that case we will loop over the requested range to allocate and zero out
>> + * the unwritten / holes in between, to get a single mapped extent from
>> + * [m_lblk, m_lblk + m_len). Note that this is only possible because we know
>> + * this can be called only with bigalloc enabled filesystem where the underlying
>> + * cluster is already allocated. This avoids allocating discontiguous extents
>> + * in the slow path due to multiple calls to ext4_map_blocks().
>> + * The slow path is mostly non-performance critical path, so it should be ok to
>> + * loop using ext4_map_blocks() with appropriate flags to allocate & zero the
>> + * underlying short holes/unwritten extents within the requested range.
>> + */
>> +static int ext4_map_blocks_atomic_write(handle_t *handle, struct inode *inode,
>> + struct ext4_map_blocks *map, int m_flags)
>> +{
>> + ext4_lblk_t m_lblk = map->m_lblk;
>> + unsigned int m_len = map->m_len;
>> + int ret = 0;
>> +
>> + WARN_ON_ONCE(m_len > 1 && !ext4_has_feature_bigalloc(inode->i_sb));
>> +
>> + ret = ext4_map_blocks(handle, inode, map, m_flags);
>> + if (ret < 0 || ret == m_len)
>> + goto out;
>> + /*
>> + * This is a mixed mapping case where we were not able to allocate
>> + * a single contiguous extent. In that case let's reset requested
>> + * mapping and call the slow path.
>> + */
>> + map->m_lblk = m_lblk;
>> + map->m_len = m_len;
>> + map->m_flags = 0;
>> +
>> + return ext4_map_blocks_atomic_write_slow(handle, inode, map);
>> +out:
>> + return ret;
>> +}
>> +
>> static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>> unsigned int flags)
>> {
>> @@ -3425,7 +3555,30 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>> */
>> if (map->m_len > DIO_MAX_BLOCKS)
>> map->m_len = DIO_MAX_BLOCKS;
>> - dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
>> +
>> + /*
>> + * journal credits estimation for atomic writes. We call
>> + * ext4_map_blocks(), to find if there could be a mixed mapping. If yes,
>> + * then let's assume the no. of pextents required can be m_len i.e.
>> + * every alternate block can be unwritten and hole.
>> + */
>> + if (flags & IOMAP_ATOMIC) {
>> + unsigned int orig_mlen = map->m_len;
>> +
>> + ret = ext4_map_blocks(NULL, inode, map, 0);
>> + if (ret < 0)
>> + return ret;
>> + if (map->m_len < orig_mlen) {
>> + map->m_len = orig_mlen;
>> + dio_credits = ext4_meta_trans_blocks(inode, orig_mlen,
>> + map->m_len);
>> + } else {
>> + dio_credits = ext4_chunk_trans_blocks(inode,
>> + map->m_len);
>> + }
>> + } else {
>> + dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
>> + }
>>
>> retry:
>> /*
>> @@ -3456,7 +3609,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>> else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>> m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
>>
>> - ret = ext4_map_blocks(handle, inode, map, m_flags);
>> + if (flags & IOMAP_ATOMIC)
>> + ret = ext4_map_blocks_atomic_write(handle, inode, map, m_flags);
>> + else
>> + ret = ext4_map_blocks(handle, inode, map, m_flags);
>>
>> /*
>> * We cannot fill holes in indirect tree based inodes as that could
>> @@ -3480,6 +3636,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>> int ret;
>> struct ext4_map_blocks map;
>> u8 blkbits = inode->i_blkbits;
>> + unsigned int orig_mlen;
>>
>> if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
>> return -EINVAL;
>> @@ -3493,6 +3650,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>> map.m_lblk = offset >> blkbits;
>> map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>> EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
>> + orig_mlen = map.m_len;
>>
>> if (flags & IOMAP_WRITE) {
>> /*
>> @@ -3503,8 +3661,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>> */
>> if (offset + length <= i_size_read(inode)) {
>> ret = ext4_map_blocks(NULL, inode, &map, 0);
>> - if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
>> - goto out;
>> + /*
>> + * For atomic writes the entire requested length should
>> + * be mapped.
>> + */
>> + if (map.m_flags & EXT4_MAP_MAPPED) {
>> + if ((!(flags & IOMAP_ATOMIC) && ret > 0) ||
>> + (flags & IOMAP_ATOMIC && ret >= orig_mlen))
>> + goto out;
>> + }
>> + map.m_len = orig_mlen;
>> }
>> ret = ext4_iomap_alloc(inode, &map, flags);
>> } else {
>> @@ -3525,6 +3691,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>> */
>> map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);
>>
>> + /*
>> + * Before returning to iomap, let's ensure the allocated mapping
>> + * covers the entire requested length for atomic writes.
>> + */
>> + if (flags & IOMAP_ATOMIC) {
>> + if (map.m_len < (length >> blkbits)) {
>> + WARN_ON(1);
>> + return -EINVAL;
>> + }
>> + }
>> ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>>
>> return 0;
>> --
>> 2.49.0
>>
>>
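To make the error code discussion above concrete, here is a small userspace
model of the contiguity check in the slow-path loop. It is a sketch under
assumptions: struct piece and check_contiguous() are invented names, each
array element stands for what one ext4_map_blocks() call returned, and
EFSCORRUPTED is defined locally the way the kernel defines it (as EUCLEAN,
so this assumes a Linux errno.h).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#ifndef EFSCORRUPTED
#define EFSCORRUPTED EUCLEAN	/* the kernel aliases EFSCORRUPTED to EUCLEAN */
#endif

/* Hypothetical result of one mapping call: physical start and length. */
struct piece {
	unsigned long long pblk;
	unsigned int len;
};

/*
 * Walk the pieces covering a request of m_len blocks.
 * Returns m_len if the pieces are physically contiguous and cover the
 * request, -EFSCORRUPTED for an empty or discontiguous piece, and
 * -EINVAL if the pieces run out before the request is covered.
 */
int check_contiguous(const struct piece *p, size_t n, unsigned int m_len)
{
	unsigned long long next_pblk = 0;
	unsigned int mapped = 0;
	size_t i;

	for (i = 0; i < n && mapped < m_len; i++) {
		if (p[i].len == 0)
			return -EFSCORRUPTED;
		if (i > 0 && p[i].pblk != next_pblk)
			return -EFSCORRUPTED;
		next_pblk = p[i].pblk + p[i].len;
		mapped += p[i].len;
	}
	return mapped >= m_len ? (int)m_len : -EINVAL;
}
```

With bigalloc the cluster is already allocated, so a physical gap between
two pieces of the same request is treated here as corruption rather than
as an out-of-space condition.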
* Re: [PATCH v3 7/7] ext4: Add atomic block write documentation
2025-05-14 16:38 ` Darrick J. Wong
@ 2025-05-15 2:15 ` Ritesh Harjani
0 siblings, 0 replies; 26+ messages in thread
From: Ritesh Harjani @ 2025-05-15 2:15 UTC (permalink / raw)
To: Darrick J. Wong, Ojaswin Mujoo
Cc: linux-ext4, Theodore Ts'o, Jan Kara, John Garry,
linux-fsdevel
"Darrick J. Wong" <djwong@kernel.org> writes:
> On Fri, May 09, 2025 at 01:04:05PM +0530, Ojaswin Mujoo wrote:
>> On Fri, May 09, 2025 at 02:20:37AM +0530, Ritesh Harjani (IBM) wrote:
>> > Add an initial documentation around atomic writes support in ext4.
>> >
>> > Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>>
>> Hi Ritesh,
>>
>> The docs look mostly good. I'll add some feedback below:
>> > ---
>> > .../filesystems/ext4/atomic_writes.rst | 208 ++++++++++++++++++
>> > Documentation/filesystems/ext4/overview.rst | 1 +
>> > 2 files changed, 209 insertions(+)
>> > create mode 100644 Documentation/filesystems/ext4/atomic_writes.rst
>> >
>> > diff --git a/Documentation/filesystems/ext4/atomic_writes.rst b/Documentation/filesystems/ext4/atomic_writes.rst
>> > new file mode 100644
>> > index 000000000000..59b03d8dbb79
>> > --- /dev/null
>> > +++ b/Documentation/filesystems/ext4/atomic_writes.rst
>> > @@ -0,0 +1,208 @@
>> > +.. SPDX-License-Identifier: GPL-2.0
>> > +.. _atomic_writes:
>> > +
>> > +Atomic Block Writes
>> > +-------------------------
>> > +
>> > +Introduction
>> > +~~~~~~~~~~~~
>> > +
>> > +Atomic (untorn) block writes ensure that either the entire write is committed
>> > +to disk or none of it is. This prevents "torn writes" during power loss or
>> > +system crashes. The ext4 filesystem supports atomic writes (only with Direct
>> > +I/O) on regular files with extents, provided the underlying storage device
>> > +supports hardware atomic writes. This is supported in the following two ways:
>> > +
>> > +1. **Single-fsblock Atomic Writes**:
>> > + EXT4's supports atomic write operations with a single filesystem block since
>> > + v6.13. In this the atomic write unit minimum and maximum sizes are both set
>> > + to filesystem blocksize.
>> > + e.g. doing atomic write of 16KB with 16KB filesystem blocksize on 64KB
>> > + pagesize system is possible.
>> > +
>> > +2. **Multi-fsblock Atomic Writes with Bigalloc**:
>> > + EXT4 now also supports atomic writes spanning multiple filesystem blocks
>> > + using a feature known as bigalloc. The atomic write unit's minimum and
>> > + maximum sizes are determined by the filesystem block size and cluster size,
>> > + based on the underlying device’s supported atomic write unit limits.
>> > +
>> > +Requirements
>> > +~~~~~~~~~~~~
>> > +
>> > +Basic requirements for atomic writes in ext4:
>> > +
>> > + 1. The extents feature must be enabled (default for ext4)
>> > + 2. The underlying block device must support atomic writes
>> > + 3. For single-fsblock atomic writes:
>> > +
>> > + 1. A filesystem with appropriate block size (up to the page size)
>> > + 4. For multi-fsblock atomic writes:
>> > +
>> > + 1. The bigalloc feature must be enabled
>> > + 2. The cluster size must be appropriately configured
>> > +
>> > +NOTE: EXT4 does not support software or COW based atomic write, which means
>> > +atomic writes on ext4 are only supported if underlying storage device supports
>> > +it.
>> > +
>> > +Multi-fsblock Implementation Details
>> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> > +
>> > +The bigalloc feature changes ext4 to use clustered allocations. With bigalloc
>
> I would say "...changes ext4 to allocate in units of multiple fs blocks,
> also known as clusters." so that the definition of a cluster is right
> there in the first sentence instead of the second.
>
Makes sense. Will make the change.
>> > +each bit within block bitmap represents clusters (power of 2 number of blocks)
>> > +rather than individual filesystem blocks. EXT4 supports atomic writes using
>> > +bigalloc by making sure that atomic write min and max are within [blocksize,
>> > +clustersize].
>>
>> Should we add a line like:
>>
>> Atomic write max unit is capped to the max supported by the underlying
>> device, in case it is less than the clustersize.
>
> I think the documentation should say exactly what the untorn write
> geometry is constrained to:
>
> "EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the
> following constraints: The minimum atomic write size is the larger of
> the fs block size and the minimum hardware atomic write unit; and the
> maximum atomic write size is smaller of the bigalloc cluster size and
> the maximum hardware atomic write unit. Bigalloc ensures that all
> allocations are aligned to the cluster size, which satisfies the LBA
> alignment requirements of the hardware device if the start of the
> partition/logical volume is itself aligned correctly."
>
Thanks! I will add this.
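As a sanity check of that wording, the constraint can be written down as
a few lines of C. This is a sketch of my reading of the paragraph above,
not kernel code: struct awu_limits and compute_awu_limits() are
hypothetical names, and hw_min/hw_max stand for the device's atomic write
unit limits in bytes (0 meaning "not supported").

```c
#include <assert.h>

/* Hypothetical container for the resulting atomic write unit window. */
struct awu_limits {
	unsigned int min;
	unsigned int max;
};

/*
 * min = larger of (fs block size, hardware minimum unit)
 * max = smaller of (bigalloc cluster size, hardware maximum unit)
 * Returns 0 and fills *out, or -1 if the device reports no atomic write
 * support or the resulting window is empty.
 */
int compute_awu_limits(unsigned int blocksize, unsigned int clustersize,
		       unsigned int hw_min, unsigned int hw_max,
		       struct awu_limits *out)
{
	unsigned int lo, hi;

	if (hw_min == 0 || hw_max == 0)
		return -1;

	lo = blocksize > hw_min ? blocksize : hw_min;
	hi = clustersize < hw_max ? clustersize : hw_max;
	if (lo > hi)
		return -1;

	out->min = lo;
	out->max = hi;
	return 0;
}
```

For example, a 4096-byte block size with a 65536-byte cluster on a device
whose maximum unit is 16384 bytes ends up with a [4096, 16384] window.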
>> Also, maybe we can have a line with something like "With bigalloc's
>> clustered allocation we can be sure that an atomic write will always
>> be allocated aligned blocks. The only thing we need to ensure is that
>> we have a continuous mapping in the write range."
>>
>> > +
>> > +Here is the block allocation strategy in bigalloc for atomic writes:
>> > +
>> > + * For regions with fully mapped extents, no additional allocation is needed
>
> "No additional work is needed" ?
>
Yes, makes sense.
>> > + * For append writes, a new mapped extent is allocated
>> > + * For regions that are entirely holes, unwritten extent is created
>> > + * For large unwritten extents, the extent gets split into two unwritten
>> > + extents of appropriate requested size
>>
>> Are the above 4 points needed explicitly? Maybe we can have:
>>
>> Append writes, and writes on regions that are fully mapped,
>> unwritten or hole follow the same flow as non atomic writes.
>>
>> > + * For mixed mapping regions (combinations of holes, unwritten extents, or
>> > + mapped extents), ext4_map_blocks() is called in a loop with
>> > + EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
>> > + mapped extent
>> Maybe:
>>
>> ... single continuous mapped extents by writing zeroes to it
>>
>> So that we explicitly mention what we are doing and not rely on people
>> knowing the meaning of EXT4_GET_BLOCKS_ZERO flag.
>
> (Yeah.)
>
I agree.
>> > +Note: Writing on a single contiguous underlying extent, whether mapped or
>> > +unwritten, is not inherently problematic. However, writing to a mixed mapping
>> > +region (i.e. one containing a combination of mapped and unwritten extents)
>> > +must be avoided when performing atomic writes.
>> > +
>> > +The reason is that, atomic writes when issued via pwritev2() with the RWF_ATOMIC
>> > +flag, requires that either all data is written or none at all. In the event of
>> > +a system crash or unexpected power loss during the write operation, the affected
>> > +region (when later read) must reflect either the complete old data or the
>> > +complete new data, but never a mix of both.
>> > +
>> > +To enforce this guarantee, we ensure that the write target is backed by
>> > +a single, contiguous extent before any data is written. This is critical because
>> > +ext4 defers the conversion of unwritten extents to written extents until the I/O
>> > +completion path (typically in ->end_io()). If a write is allowed to proceed over
>> > +a mixed mapping region (with mapped and unwritten extents) and a failure occurs
>> > +mid-write, the system could observe partially updated regions after reboot, i.e.
>> > +new data over mapped areas, and stale (old) data over unwritten extents that
>> > +were never marked written. This violates the atomicity and/or torn write
>> > +prevention guarantee.
>> > +
>> > +To prevent such torn writes, ext4 proactively allocates a single contiguous
>> > +extent for the entire requested region in ``ext4_iomap_alloc`` via
>> > +``ext4_map_blocks_atomic()``. Only after this allocation, is the write
>> > +operation performed by iomap.
>> > +
>> > +Handling Split Extents Across Leaf Blocks
>> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> > +
>> > +There can be a special edge case where we have logically and physically
>> > +contiguous extents stored in separate leaf nodes of the on-disk extent tree.
>> > +This occurs because on-disk extent tree merges only happens within the leaf
>> > +blocks except for a case where we have 2-level tree which can get merged and
>> > +collapsed entirely into the inode.
>
> Aha, I guess this is the answer to my earlier question. :)
>
Yes, it is easy to miss, so it is better to have this documented.
>> > +If such a layout exists and, in the worst case, the extent status cache entries
>> > +are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
>> > +a single contiguous extent for these split leaf extents.
>> > +
>> > +To address this edge case, a new get block flag
>> > +``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS flag`` is added to enhance the
>> > +``ext4_map_query_blocks()`` lookup behavior.
>> > +
>> > +This new get block flag allows ``ext4_map_blocks()`` to first checks if there is
>>
>> s/checks/check
>>
Done.
>> > +an entry in the extent status cache for the full range.
>> > +If not present, it consults the on-disk extent tree using
>> > +``ext4_map_query_blocks()``.
>> > +If the located extent is at the end of a leaf node, it probes the next logical
>> > +block (lblk) to detect a contiguous extent in the adjacent leaf.
>> > +
>> > +For now only one additional leaf block is queried to maintain efficiency, as
>> > +atomic writes are typically constrained to small sizes
>> > +(e.g. [blocksize, clustersize]).
>> > +
>> > +
>> > +Handling Journal transactions
>> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> > +
>> > +To support multi-fsblock atomic writes, we ensure enough journal credits are
>> > +reserved during:
>> > +
>> > + 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
>> > + could be a mixed mapping for the underlying requested range. If yes, then we
>> > + reserve credits of up to ``m_len``, assuming every alternate block can be
>> > + an unwritten extent followed by a hole.
>> > +
>> > + 2. During ``->end_io()`` call, we make sure a single transaction is started for
>> > + doing unwritten-to-written conversion. The loop for conversion is mainly
>> > + only required to handle a split extent across leaf blocks.
>> > +
>> > +How to
>> > +------
>> > +
>> > +Creating Filesystems with Atomic Write Support
>> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> > +
>> > +For single-fsblock atomic writes with a larger block size
>> > +(on systems with block size < page size):
>> > +
>> > +.. code-block:: bash
>> > +
>> > + # Create an ext4 filesystem with a 16KB block size
>> > + # (requires page size >= 16KB)
>> > + mkfs.ext4 -b 16384 /dev/device
>> > +
>> > +For multi-fsblock atomic writes with bigalloc:
>> > +
>> > +.. code-block:: bash
>> > +
>> > + # Create an ext4 filesystem with bigalloc and 64KB cluster size
>> > + mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device
>> > +
>> > +Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
>> > +and ``-O bigalloc`` enables the bigalloc feature.
>
> Might want to add at least a sentence about "figure out what atomic
> write unit your application needs by querying statx of the block device
> or whatever." Or refer them to the "Hardware Support" section. :)
>
Sure.
>> > +
>> > +Application Interface
>> > +~~~~~~~~~~~~~~~~~~~~~
>> > +
>> > +Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
>> > +to perform atomic writes:
>> > +
>> > +.. code-block:: c
>> > +
>> > + pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);
>> > +
>> > +The write must be aligned to the filesystem's block size and not exceed the
>> > +filesystem's maximum atomic write unit size.
>> > +See ``generic_atomic_write_valid()`` for more details.
>> > +
>> > +``statx()`` system call with ``STATX_WRITE_ATOMIC`` flag can provides following
>> > +details:
>> > +
>> > + * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
>> > + * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
>> > + * ``stx_atomic_write_segments_max``: Upper limit for segments. Tthe number of
>
> s/Tthe/The/
>
Thanks!
>> > + separate memory buffers that can be gathered into a write operation
>> > + (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.
>> > +
>> > +The STATX_ATTR_WRITE_ATOMIC flag in ``statx->attributes`` is set if atomic
>> > +writes are supported.
>> > +
>> > +Hardware Support
>> > +----------------
>> > +
>> > +The underlying storage device must support atomic write operations.
>> > +Modern NVMe and SCSI devices often provide this capability.
>> > +The Linux kernel exposes this information through sysfs:
>> > +
>> > +* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size
>> > +* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size
>> > +
>> > +Nonzero values for these attributes indicate that the device supports
>> > +atomic writes.
>
> The rest fits with my understanding of atomic untorn writes.
>
> --D
>
Thanks Darrick for the review. I will incorporate these changes.
-ritesh
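On the application side, the size and alignment rules from the
"Application Interface" section can be checked before issuing pwritev2()
with RWF_ATOMIC. The helper below is a sketch with an invented name
(atomic_write_ok()); it assumes, as I understand the generic checks, that
the length must be a power of two, lie within the unit_min/unit_max window
reported by statx(), and that the file offset must be aligned to the
length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if a write of len bytes at offset would satisfy the
 * assumed atomic write rules for the given [unit_min, unit_max] window.
 */
bool atomic_write_ok(uint64_t offset, uint64_t len,
		     uint32_t unit_min, uint32_t unit_max)
{
	/* length must be a non-zero power of two */
	if (len == 0 || (len & (len - 1)) != 0)
		return false;
	/* length must fall inside the advertised window */
	if (len < unit_min || len > unit_max)
		return false;
	/* offset must be naturally aligned to the length */
	return (offset & (len - 1)) == 0;
}
```

A request that fails such a check would be expected to get -EINVAL from the
kernel, so checking up front mostly saves a syscall round trip.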
>> > +
>> > +See Also
>> > +--------
>> > +
>> > +* :doc:`bigalloc` - Documentation on the bigalloc feature
>> > +* :doc:`allocators` - Documentation on block allocation in ext4
>> > +* Support for atomic block writes in 6.13:
>> > + https://lwn.net/Articles/1009298/
>> > diff --git a/Documentation/filesystems/ext4/overview.rst b/Documentation/filesystems/ext4/overview.rst
>> > index 0fad6eda6e15..9d4054c17ecb 100644
>> > --- a/Documentation/filesystems/ext4/overview.rst
>> > +++ b/Documentation/filesystems/ext4/overview.rst
>> > @@ -25,3 +25,4 @@ order.
>> > .. include:: inlinedata.rst
>> > .. include:: eainode.rst
>> > .. include:: verity.rst
>> > +.. include:: atomic_writes.rst
>> > --
>> > 2.49.0
>> >
>>
* Re: [PATCH v3 7/7] ext4: Add atomic block write documentation
2025-05-09 7:34 ` Ojaswin Mujoo
2025-05-14 16:38 ` Darrick J. Wong
@ 2025-05-15 2:18 ` Ritesh Harjani
1 sibling, 0 replies; 26+ messages in thread
From: Ritesh Harjani @ 2025-05-15 2:18 UTC (permalink / raw)
To: Ojaswin Mujoo
Cc: linux-ext4, Theodore Ts'o, Jan Kara, John Garry, djwong,
linux-fsdevel
Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> On Fri, May 09, 2025 at 02:20:37AM +0530, Ritesh Harjani (IBM) wrote:
>> Add an initial documentation around atomic writes support in ext4.
>>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>
> Hi Ritesh,
>
> The docs look mostly good. I'll add some feedback below:
>> ---
>> .../filesystems/ext4/atomic_writes.rst | 208 ++++++++++++++++++
>> Documentation/filesystems/ext4/overview.rst | 1 +
>> 2 files changed, 209 insertions(+)
>> create mode 100644 Documentation/filesystems/ext4/atomic_writes.rst
>>
>> diff --git a/Documentation/filesystems/ext4/atomic_writes.rst b/Documentation/filesystems/ext4/atomic_writes.rst
>> new file mode 100644
>> index 000000000000..59b03d8dbb79
>> --- /dev/null
>> +++ b/Documentation/filesystems/ext4/atomic_writes.rst
>> @@ -0,0 +1,208 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +.. _atomic_writes:
>> +
>> +Atomic Block Writes
>> +-------------------------
>> +
>> +Introduction
>> +~~~~~~~~~~~~
>> +
>> +Atomic (untorn) block writes ensure that either the entire write is committed
>> +to disk or none of it is. This prevents "torn writes" during power loss or
>> +system crashes. The ext4 filesystem supports atomic writes (only with Direct
>> +I/O) on regular files with extents, provided the underlying storage device
>> +supports hardware atomic writes. This is supported in the following two ways:
>> +
>> +1. **Single-fsblock Atomic Writes**:
>> + EXT4's supports atomic write operations with a single filesystem block since
>> + v6.13. In this the atomic write unit minimum and maximum sizes are both set
>> + to filesystem blocksize.
>> + e.g. doing atomic write of 16KB with 16KB filesystem blocksize on 64KB
>> + pagesize system is possible.
>> +
>> +2. **Multi-fsblock Atomic Writes with Bigalloc**:
>> + EXT4 now also supports atomic writes spanning multiple filesystem blocks
>> + using a feature known as bigalloc. The atomic write unit's minimum and
>> + maximum sizes are determined by the filesystem block size and cluster size,
>> + based on the underlying device’s supported atomic write unit limits.
>> +
>> +Requirements
>> +~~~~~~~~~~~~
>> +
>> +Basic requirements for atomic writes in ext4:
>> +
>> + 1. The extents feature must be enabled (default for ext4)
>> + 2. The underlying block device must support atomic writes
>> + 3. For single-fsblock atomic writes:
>> +
>> + 1. A filesystem with appropriate block size (up to the page size)
>> + 4. For multi-fsblock atomic writes:
>> +
>> + 1. The bigalloc feature must be enabled
>> + 2. The cluster size must be appropriately configured
>> +
>> +NOTE: EXT4 does not support software or COW based atomic write, which means
>> +atomic writes on ext4 are only supported if underlying storage device supports
>> +it.
>> +
>> +Multi-fsblock Implementation Details
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +The bigalloc feature changes ext4 to use clustered allocations. With bigalloc
>> +each bit within block bitmap represents clusters (power of 2 number of blocks)
>> +rather than individual filesystem blocks. EXT4 supports atomic writes using
>> +bigalloc by making sure that atomic write min and max are within [blocksize,
>> +clustersize].
>
> Should we add a line like:
>
> Atomic write max unit is capped to the max supported by the underlying
> device, in case it is less than the clustersize.
>
> Also, maybe we can have a line with something like "With bigalloc's
> clustered allocation we can be sure that an atomic write will always
> be allocated aligned blocks. The only thing we need to ensure is that
> we have a continuous mapping in the write range."
>
Yes, I guess the snippet provided by Darrick covers all of this. Will
make the change.
>> +
>> +Here is the block allocation strategy in bigalloc for atomic writes:
>> +
>> + * For regions with fully mapped extents, no additional allocation is needed
>> + * For append writes, a new mapped extent is allocated
>> + * For regions that are entirely holes, unwritten extent is created
>> + * For large unwritten extents, the extent gets split into two unwritten
>> + extents of appropriate requested size
>
> Are the above 4 points needed explicitly? Maybe we can have:
>
> Append writes, and writes on regions that are fully mapped,
> unwritten or hole follow the same flow as non atomic writes.
>
Putting it explicitly helps, I guess.
>> + * For mixed mapping regions (combinations of holes, unwritten extents, or
>> + mapped extents), ext4_map_blocks() is called in a loop with
>> + EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
>> + mapped extent
> Maybe:
>
> ... single continuous mapped extents by writing zeroes to it
>
> So that we explicitly mention what we are doing and not rely on people
> knowing the meaning of EXT4_GET_BLOCKS_ZERO flag.
>
Agreed.
>> +
>> +Note: Writing on a single contiguous underlying extent, whether mapped or
>> +unwritten, is not inherently problematic. However, writing to a mixed mapping
>> +region (i.e. one containing a combination of mapped and unwritten extents)
>> +must be avoided when performing atomic writes.
>> +
>> +The reason is that atomic writes, when issued via pwritev2() with the RWF_ATOMIC
>> +flag, require that either all data is written or none at all. In the event of
>> +a system crash or unexpected power loss during the write operation, the affected
>> +region (when later read) must reflect either the complete old data or the
>> +complete new data, but never a mix of both.
>> +
>> +To enforce this guarantee, we ensure that the write target is backed by
>> +a single, contiguous extent before any data is written. This is critical because
>> +ext4 defers the conversion of unwritten extents to written extents until the I/O
>> +completion path (typically in ->end_io()). If a write is allowed to proceed over
>> +a mixed mapping region (with mapped and unwritten extents) and a failure occurs
>> +mid-write, the system could observe partially updated regions after reboot, i.e.
>> +new data over mapped areas, and stale (old) data over unwritten extents that
>> +were never marked written. This violates the atomicity and/or torn write
>> +prevention guarantee.
>> +
>> +To prevent such torn writes, ext4 proactively allocates a single contiguous
>> +extent for the entire requested region in ``ext4_iomap_alloc()`` via
>> +``ext4_map_blocks_atomic()``. Only after this allocation is the write
>> +operation performed by iomap.
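
As a toy model of what "mixed mapping" means here (purely illustrative: the
``blk_state`` enum and ``range_is_mixed()`` are invented for this sketch, and
the real code works on extents returned by ``ext4_map_blocks()`` and also cares
about physical contiguity, which this ignores):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-block state, for illustration only. */
enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_MAPPED };

/*
 * Return true if the blocks in [start, start + len) do not all share the
 * same state, i.e. the range is a "mixed mapping" in the sense used above.
 */
static bool range_is_mixed(const enum blk_state *state, size_t start,
			   size_t len)
{
	for (size_t i = 1; i < len; i++)
		if (state[start + i] != state[start])
			return true;
	return false;
}
```

A range for which this returns true is the case that needs the slow path
before the data write is issued; a uniform range does not.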
>> +
>> +Handling Split Extents Across Leaf Blocks
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +There can be a special edge case where we have logically and physically
>> +contiguous extents stored in separate leaf nodes of the on-disk extent tree.
>> +This occurs because on-disk extent tree merges only happen within a leaf
>> +block, except for the case where a 2-level tree can get merged and
>> +collapsed entirely into the inode.
>> +If such a layout exists and, in the worst case, the extent status cache entries
>> +are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
>> +a single contiguous extent for these split leaf extents.
>> +
>> +To address this edge case, a new get block flag,
>> +``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS``, is added to enhance the
>> +``ext4_map_query_blocks()`` lookup behavior.
>> +
>> +This new get block flag allows ``ext4_map_blocks()`` to first checks if there is
>
> s/checks/check
>
Sure.
-ritesh
>> +an entry in the extent status cache for the full range.
>> +If not present, it consults the on-disk extent tree using
>> +``ext4_map_query_blocks()``.
>> +If the located extent is at the end of a leaf node, it probes the next logical
>> +block (lblk) to detect a contiguous extent in the adjacent leaf.
>> +
>> +For now only one additional leaf block is queried to maintain efficiency, as
>> +atomic writes are typically constrained to small sizes
>> +(e.g. [blocksize, clustersize]).
>> +
>> +
>> +Handling Journal Transactions
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +To support multi-fsblock atomic writes, we ensure enough journal credits are
>> +reserved during:
>> +
>> + 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
>> + could be a mixed mapping for the underlying requested range. If yes, then we
>> + reserve credits of up to ``m_len``, assuming every alternate block can be
>> + an unwritten extent followed by a hole.
>> +
>> + 2. During ``->end_io()`` call, we make sure a single transaction is started for
>> + doing unwritten-to-written conversion. The loop for conversion is mainly
>> + only required to handle a split extent across leaf blocks.
>> +
>> +How to
>> +------
>> +
>> +Creating Filesystems with Atomic Write Support
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +For single-fsblock atomic writes with a larger block size
>> +(on systems with block size < page size):
>> +
>> +.. code-block:: bash
>> +
>> + # Create an ext4 filesystem with a 16KB block size
>> + # (requires page size >= 16KB)
>> + mkfs.ext4 -b 16384 /dev/device
>> +
>> +For multi-fsblock atomic writes with bigalloc:
>> +
>> +.. code-block:: bash
>> +
>> + # Create an ext4 filesystem with bigalloc and 64KB cluster size
>> + mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device
>> +
>> +Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
>> +and ``-O bigalloc`` enables the bigalloc feature.
>> +
>> +Application Interface
>> +~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
>> +to perform atomic writes:
>> +
>> +.. code-block:: c
>> +
>> + pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);
>> +
>> +The write must be aligned to the filesystem's block size and not exceed the
>> +filesystem's maximum atomic write unit size.
>> +See ``generic_atomic_write_valid()`` for more details.
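
A minimal sketch of how an application might wrap this, for illustration only.
``atomic_write_args_ok()`` and ``atomic_pwrite()`` are hypothetical names, the
unit_min/unit_max values are assumed to come from ``statx()``, and the fd is
assumed to be opened with O_DIRECT. The pre-check mirrors the usual rules for
RWF_ATOMIC (power-of-2 length, naturally aligned offset, length within the
advertised limits); the kernel remains the authority and may still reject the
write.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040	/* fallback for older userspace headers */
#endif

/* Hypothetical userspace pre-check, not a kernel interface. */
static bool atomic_write_args_ok(uint64_t offset, uint64_t len,
				 uint64_t unit_min, uint64_t unit_max)
{
	/* Length must be a non-zero power of two. */
	if (len == 0 || (len & (len - 1)) != 0)
		return false;
	/* Length must be within the advertised limits. */
	if (len < unit_min || len > unit_max)
		return false;
	/* Offset must be naturally aligned to the length. */
	return (offset % len) == 0;
}

/* Issue one atomic write from a single buffer; returns -1 on error. */
static ssize_t atomic_pwrite(int fd, void *buf, size_t len, off_t offset,
			     uint64_t unit_min, uint64_t unit_max)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	if (!atomic_write_args_ok((uint64_t)offset, len, unit_min, unit_max)) {
		errno = EINVAL;
		return -1;
	}
	return pwritev2(fd, &iov, 1, offset, RWF_ATOMIC);
}
```

A single iovec is used on purpose, since the segment limit is currently one.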
>> +
>> +The ``statx()`` system call with the ``STATX_WRITE_ATOMIC`` flag provides the
>> +following details:
>> +
>> + * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
>> + * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
>> + * ``stx_atomic_write_segments_max``: Upper limit for segments, i.e. the number of
>> + separate memory buffers that can be gathered into a write operation
>> + (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.
>> +
>> +The ``STATX_ATTR_WRITE_ATOMIC`` flag in ``statx->attributes`` is set if atomic
>> +writes are supported.
>> +
>> +Hardware Support
>> +----------------
>> +
>> +The underlying storage device must support atomic write operations.
>> +Modern NVMe and SCSI devices often provide this capability.
>> +The Linux kernel exposes this information through sysfs:
>> +
>> +* ``/sys/block/<device>/queue/atomic_write_unit_min_bytes`` - Minimum atomic write size
>> +* ``/sys/block/<device>/queue/atomic_write_unit_max_bytes`` - Maximum atomic write size
>> +
>> +Nonzero values for these attributes indicate that the device supports
>> +atomic writes.
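
A small sketch of checking such an attribute from userspace, for illustration
only. ``read_queue_attr()`` is a hypothetical helper; the caller passes the
full path of the queue attribute it is interested in.

```c
#include <stdio.h>

/*
 * Read a single unsigned decimal value from a sysfs-style attribute file.
 * Returns 0 on success and -1 if the file cannot be opened or parsed.
 */
static int read_queue_attr(const char *path, unsigned long long *val)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", val) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

A value of zero read this way would mean the device does not advertise atomic
write support; a missing file likely means the kernel predates the attribute.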
>> +
>> +See Also
>> +--------
>> +
>> +* :doc:`bigalloc` - Documentation on the bigalloc feature
>> +* :doc:`allocators` - Documentation on block allocation in ext4
>> +* Support for atomic block writes in 6.13:
>> + https://lwn.net/Articles/1009298/
>> diff --git a/Documentation/filesystems/ext4/overview.rst b/Documentation/filesystems/ext4/overview.rst
>> index 0fad6eda6e15..9d4054c17ecb 100644
>> --- a/Documentation/filesystems/ext4/overview.rst
>> +++ b/Documentation/filesystems/ext4/overview.rst
>> @@ -25,3 +25,4 @@ order.
>> .. include:: inlinedata.rst
>> .. include:: eainode.rst
>> .. include:: verity.rst
>> +.. include:: atomic_writes.rst
>> --
>> 2.49.0
>>
end of thread, other threads:[~2025-05-15 2:27 UTC | newest]
Thread overview: 26+ messages
2025-05-08 20:50 [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani (IBM)
2025-05-08 20:50 ` [PATCH v3 1/7] ext4: Document an edge case for overwrites Ritesh Harjani (IBM)
2025-05-09 5:19 ` Ojaswin Mujoo
2025-05-14 16:23 ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 2/7] ext4: Check if inode uses extents in ext4_inode_can_atomic_write() Ritesh Harjani (IBM)
2025-05-09 5:20 ` Ojaswin Mujoo
2025-05-14 16:24 ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 3/7] ext4: Make ext4_meta_trans_blocks() non-static for later use Ritesh Harjani (IBM)
2025-05-09 5:21 ` Ojaswin Mujoo
2025-05-14 16:24 ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 4/7] ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS Ritesh Harjani (IBM)
2025-05-14 16:16 ` Darrick J. Wong
2025-05-14 18:47 ` Ritesh Harjani
2025-05-08 20:50 ` [PATCH v3 5/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani (IBM)
2025-05-14 16:19 ` Darrick J. Wong
2025-05-14 19:04 ` Ritesh Harjani
2025-05-08 20:50 ` [PATCH v3 6/7] ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc Ritesh Harjani (IBM)
2025-05-14 16:21 ` Darrick J. Wong
2025-05-08 20:50 ` [PATCH v3 7/7] ext4: Add atomic block write documentation Ritesh Harjani (IBM)
2025-05-09 7:34 ` Ojaswin Mujoo
2025-05-14 16:38 ` Darrick J. Wong
2025-05-15 2:15 ` Ritesh Harjani
2025-05-15 2:18 ` Ritesh Harjani
2025-05-09 17:42 ` [PATCH v3 0/7] ext4: Add multi-fsblock atomic write support with bigalloc Ritesh Harjani
2025-05-14 16:40 ` Darrick J. Wong
2025-05-14 18:55 ` Ritesh Harjani