From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
To: linux-ext4@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org,
John Garry <john.g.garry@oracle.com>,
djwong@kernel.org, linux-xfs@vger.kernel.org,
Theodore Ts'o <tytso@mit.edu>,
Ojaswin Mujoo <ojaswin@linux.ibm.com>,
"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Subject: [RFCv1 1/1] ext4: Add multi-fsblock atomic write support with bigalloc
Date: Sun, 23 Mar 2025 12:30:10 +0530
Message-ID: <6ce4303bfbccc4f5ed3be96b56eb1080b724b0da.1742699765.git.ritesh.list@gmail.com>
In-Reply-To: <cover.1742699765.git.ritesh.list@gmail.com>
EXT4 supports the bigalloc feature, which lets the filesystem allocate in
units of clusters (groups of blocks) rather than individual blocks. This patch
adds atomic write support for bigalloc, so that systems where the block size
equals the page size (bs = ps) can also create a filesystem using:
mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>
With bigalloc, ext4 can support multi-fsblock atomic writes. To do so, we
raise ext4's atomic write unit max value to the cluster size. This allows
atomic writes of sizes in the range [blocksize, clustersize].
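As a rough illustration of the [blocksize, clustersize] range, the sketch below
models the unit computation in plain userspace C. The names awu_limits and
compute_awu are hypothetical, and the device limits are passed in as plain
numbers instead of being read from the block device. It only shows the idea of
clamping the minimum to at least the block size and the maximum to at most the
cluster size.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's data structures. */
struct awu_limits {
	unsigned int min;   /* smallest atomic write unit, in bytes */
	unsigned int max;   /* largest atomic write unit, in bytes */
	bool enabled;       /* false when no valid [min, max] range exists */
};

/*
 * blkbits:      log2 of the filesystem block size
 * cluster_bits: log2 of blocks per cluster (0 without bigalloc)
 * dev_min/max:  atomic write unit limits assumed for the device, in bytes
 */
static struct awu_limits compute_awu(unsigned int blkbits,
				     unsigned int cluster_bits,
				     unsigned int dev_min,
				     unsigned int dev_max)
{
	struct awu_limits l = { 0, 0, false };
	unsigned int blocksize, clustersize;

	/* Keep the shifts below within 32 bits. */
	assert(blkbits + cluster_bits < 32);
	blocksize = 1U << blkbits;
	clustersize = 1U << (blkbits + cluster_bits);

	/* The minimum can never be smaller than one filesystem block. */
	l.min = blocksize > dev_min ? blocksize : dev_min;
	/* The maximum is capped by both the cluster size and the device. */
	l.max = clustersize < dev_max ? clustersize : dev_max;
	l.enabled = l.min && l.max && l.min <= l.max;
	return l;
}
```

For the mkfs.ext4 invocation above (-b 4096 -C 16384), blkbits is 12 and
cluster_bits is 2. Assuming a device that reports a 512-byte minimum and a
64 KiB maximum, compute_awu(12, 2, 512, 65536) gives min = 4096 and
max = 16384.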
We first query the underlying region of the requested range by calling
ext4_map_blocks(). We then handle the following cases for block allocation,
depending on the underlying mapping type:
1. If the underlying region for the entire requested range is a mapped extent,
then we don't call ext4_map_blocks() to allocate anything. We don't need to
even start the jbd2 txn in this case.
2. For an append write case, we create a mapped extent.
3. If the underlying region is entirely a hole, then we create an unwritten
extent for the requested range.
4. If the underlying region is a large unwritten extent, then we split the
extent into 2 unwritten extents of the required size.
5. If the underlying region has any type of mixed mapping, then we call
ext4_map_blocks() in a loop to zero out the unwritten and the hole regions
within the requested range. This then provides a single mapped extent
for the requested range.
Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO
flag only when the underlying extent mapping of the requested range is
not entirely a hole, an unwritten extent, or a fully mapped extent. That
is, if the underlying region contains a mix of hole(s), unwritten
extent(s), and mapped extent(s), we use this loop to ensure that all the
short mappings are zeroed out. This guarantees that the entire requested
range becomes a single, uniformly mapped extent. It is OK to do so
because this is only done on a bigalloc-enabled filesystem, where each
bit in the block bitmap represents an entire cluster.
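The five cases and the mixed-mapping loop described above can be sketched as a
small userspace model. This is a simplification under stated assumptions, not
the kernel code: the range is represented as one state per block, and the names
blk_state, alloc_case, classify_range and zero_mixed_range are hypothetical.
Each loop iteration stands in for one ext4_map_blocks() call that handles one
contiguous run of blocks in the same state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-block states for a userspace model of the range. */
enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_MAPPED };

/* One value per numbered case in the commit message. */
enum alloc_case {
	CASE_ALREADY_MAPPED = 1,  /* nothing to allocate */
	CASE_APPEND,              /* create a mapped extent */
	CASE_HOLE,                /* create an unwritten extent */
	CASE_UNWRITTEN,           /* split the unwritten extent */
	CASE_MIXED,               /* allocate and zero in a loop */
};

/* Return true when every block in [0, len) has state s. */
static bool all_in_state(const enum blk_state *blk, unsigned int len,
			 enum blk_state s)
{
	for (unsigned int i = 0; i < len; i++)
		if (blk[i] != s)
			return false;
	return true;
}

/* Pick the case for a requested range (simplified ordering of checks). */
static enum alloc_case classify_range(const enum blk_state *blk,
				      unsigned int len,
				      bool starts_at_or_past_eof)
{
	assert(len > 0);
	if (starts_at_or_past_eof)
		return CASE_APPEND;
	if (all_in_state(blk, len, BLK_MAPPED))
		return CASE_ALREADY_MAPPED;
	if (all_in_state(blk, len, BLK_HOLE))
		return CASE_HOLE;
	if (all_in_state(blk, len, BLK_UNWRITTEN))
		return CASE_UNWRITTEN;
	return CASE_MIXED;
}

/*
 * Model of the mixed-mapping loop: walk the range one same-state run at
 * a time and turn every block into a mapped (zeroed) block. Returns the
 * number of runs visited, i.e. the number of loop iterations.
 */
static unsigned int zero_mixed_range(enum blk_state *blk, unsigned int len)
{
	unsigned int done = 0, iterations = 0;

	while (done < len) {
		enum blk_state s = blk[done];
		unsigned int run = 0;

		while (done + run < len && blk[done + run] == s)
			blk[done + run++] = BLK_MAPPED;
		done += run;
		iterations++;
	}
	return iterations;
}
```

In this model, a four-block range of { mapped, hole, hole, unwritten } is
classified as mixed, takes three iterations to convert, and is classified as
already mapped afterwards, which corresponds to the final re-query of the
range in the patch.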
Cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
fs/ext4/inode.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++--
fs/ext4/super.c | 8 +++--
2 files changed, 93 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d04d8a7f12e7..0096a597ad04 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3332,6 +3332,67 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
iomap->addr = IOMAP_NULL_ADDR;
}
}
+/*
+ * ext4_map_blocks_atomic: Helper routine to ensure the entire requested mapping
+ * [map.m_lblk, map.m_len] is one single contiguous extent with no mixed
+ * mappings. This function is only called when the bigalloc is enabled, so we
+ * know that the allocated physical extent start is always aligned properly.
+ *
+ * We call EXT4_GET_BLOCKS_ZERO only when the underlying physical extent for the
+ * requested range does not have a single mapping type (Hole, Mapped, or
+ * Unwritten) throughout. In that case we will loop over the requested range to
+ * allocate and zero out the unwritten / holes in between, to get a single
+ * mapped extent from [m_lblk, m_len]. This case is mostly non-performance
+ * critical path, so it should be ok to loop using ext4_map_blocks() with
+ * appropriate flags to allocate & zero the underlying short holes/unwritten
+ * extents within the requested range.
+ */
+static int ext4_map_blocks_atomic(handle_t *handle, struct inode *inode,
+ struct ext4_map_blocks *map)
+{
+ ext4_lblk_t m_lblk = map->m_lblk;
+ unsigned int m_len = map->m_len;
+ unsigned int mapped_len = 0, flags = 0;
+ u8 blkbits = inode->i_blkbits;
+ int ret;
+
+ WARN_ON(!ext4_has_feature_bigalloc(inode->i_sb));
+
+ ret = ext4_map_blocks(handle, inode, map, 0);
+ if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
+ flags = EXT4_GET_BLOCKS_CREATE;
+ else if ((ret == 0 && map->m_len >= m_len) ||
+ (ret >= m_len && map->m_flags & EXT4_MAP_UNWRITTEN))
+ flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
+ else
+ flags = EXT4_GET_BLOCKS_CREATE_ZERO;
+
+ do {
+ ret = ext4_map_blocks(handle, inode, map, flags);
+ if (ret < 0)
+ return ret;
+ mapped_len += map->m_len;
+ map->m_lblk += map->m_len;
+ map->m_len = m_len - mapped_len;
+ } while (mapped_len < m_len);
+
+ map->m_lblk = m_lblk;
+ map->m_len = m_len;
+
+ /*
+ * We might have done some work in above loop. Let's ensure we query the
+ * start of the physical extent, based on the origin m_lblk and m_len
+ * and also ensure we were able to allocate the required range for doing
+ * atomic write.
+ */
+ ret = ext4_map_blocks(handle, inode, map, 0);
+ if (ret != m_len) {
+ ext4_warning_inode(inode, "allocation failed for atomic write request pos:%u, len:%u\n",
+ m_lblk, m_len);
+ return -EINVAL;
+ }
+ return mapped_len;
+}
static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
unsigned int flags)
@@ -3377,7 +3438,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
- ret = ext4_map_blocks(handle, inode, map, m_flags);
+ if (flags & IOMAP_ATOMIC && ext4_has_feature_bigalloc(inode->i_sb))
+ ret = ext4_map_blocks_atomic(handle, inode, map);
+ else
+ ret = ext4_map_blocks(handle, inode, map, m_flags);
/*
* We cannot fill holes in indirect tree based inodes as that could
@@ -3401,6 +3465,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
int ret;
struct ext4_map_blocks map;
u8 blkbits = inode->i_blkbits;
+ unsigned int m_len_orig;
if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
return -EINVAL;
@@ -3414,6 +3479,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
map.m_lblk = offset >> blkbits;
map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
+ m_len_orig = map.m_len;
if (flags & IOMAP_WRITE) {
/*
@@ -3424,8 +3490,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
*/
if (offset + length <= i_size_read(inode)) {
ret = ext4_map_blocks(NULL, inode, &map, 0);
- if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
- goto out;
+ /*
+ * For atomic writes the entire requested length should
+ * be mapped.
+ */
+ if (map.m_flags & EXT4_MAP_MAPPED) {
+ if ((!(flags & IOMAP_ATOMIC) && ret > 0) ||
+ (flags & IOMAP_ATOMIC && ret >= m_len_orig))
+ goto out;
+ }
+ map.m_len = m_len_orig;
}
ret = ext4_iomap_alloc(inode, &map, flags);
} else {
@@ -3442,6 +3516,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
*/
map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);
+ /*
+ * Before returning to iomap, let's ensure the allocated mapping
+ * covers the entire requested length for atomic writes.
+ */
+ if (flags & IOMAP_ATOMIC) {
+ if (map.m_len < (length >> blkbits)) {
+ WARN_ON(1);
+ return -EINVAL;
+ }
+ }
ext4_set_iomap(inode, iomap, &map, offset, length, flags);
return 0;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a50e5c31b937..cbb24d535d59 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4442,12 +4442,13 @@ static int ext4_handle_clustersize(struct super_block *sb)
/*
* ext4_atomic_write_init: Initializes filesystem min & max atomic write units.
* @sb: super block
- * TODO: Later add support for bigalloc
*/
static void ext4_atomic_write_init(struct super_block *sb)
{
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct block_device *bdev = sb->s_bdev;
+ unsigned int blkbits = sb->s_blocksize_bits;
+ unsigned int clustersize = sb->s_blocksize;
if (!bdev_can_atomic_write(bdev))
return;
@@ -4455,9 +4456,12 @@ static void ext4_atomic_write_init(struct super_block *sb)
if (!ext4_has_feature_extents(sb))
return;
+ if (ext4_has_feature_bigalloc(sb))
+ clustersize = 1U << (sbi->s_cluster_bits + blkbits);
+
sbi->s_awu_min = max(sb->s_blocksize,
bdev_atomic_write_unit_min_bytes(bdev));
- sbi->s_awu_max = min(sb->s_blocksize,
+ sbi->s_awu_max = min(clustersize,
bdev_atomic_write_unit_max_bytes(bdev));
if (sbi->s_awu_min && sbi->s_awu_max &&
sbi->s_awu_min <= sbi->s_awu_max) {
--
2.48.1
Thread overview: 15+ messages
2025-01-29 7:06 [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes Ojaswin Mujoo
2025-01-29 8:59 ` John Garry
2025-01-29 16:06 ` Ojaswin Mujoo
2025-01-30 14:08 ` John Garry
2025-02-01 7:12 ` Ojaswin Mujoo
2025-02-04 12:20 ` John Garry
2025-02-04 20:12 ` Dave Chinner
2025-02-07 6:08 ` Ojaswin Mujoo
2025-02-07 12:01 ` John Garry
2025-02-08 17:05 ` Ojaswin Mujoo
2025-03-23 7:00 ` [RFCv1 0/1] EXT4 support of multi-fsblock atomic write with bigalloc Ritesh Harjani (IBM)
2025-03-23 7:00 ` Ritesh Harjani (IBM) [this message]
2025-03-23 7:02 ` [RFCv1 1/1] ext4: Add multi-fsblock atomic write support " Ritesh Harjani (IBM)
2025-03-25 11:42 ` Ojaswin Mujoo
2025-03-23 7:02 ` [RFCv1 0/1] EXT4 support of multi-fsblock atomic write " Ritesh Harjani (IBM)