* Re: [PATCH v4 07/23] ext4: do not use data=ordered mode for inodes using buffered iomap path
From: Ojaswin Mujoo @ 2026-05-19 10:41 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
yangerkun, yukuai
In-Reply-To: <20260511072344.191271-8-yi.zhang@huaweicloud.com>
On Mon, May 11, 2026 at 03:23:27PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> The data=ordered mode introduces two fundamental conflicts with the
> iomap buffered write path, leading to potential deadlocks.
>
> 1) Lock ordering conflict
> In the iomap writeback path, each folio is processed sequentially:
> the folio lock is acquired first, followed by starting a transaction
> to create block mappings. In data=ordered mode, writeback triggered
> by the journal commit process may attempt to acquire a folio lock
> that is already held by iomap. Meanwhile, iomap, under that same
> folio lock, may start a new transaction and wait for the currently
> committing transaction to finish, resulting in a deadlock.
Right, makes sense.
>
> 2) Partial folio submission not supported
> When block size is smaller than folio size, a folio may contain both
> mapped and unmapped blocks. In data=ordered mode, if the journal
> waits for such a folio to be written back while the regular writeback
> process has already started committing it (with the writeback flag
> set), mapping the remaining unmapped blocks can deadlock. This is
> because the writeback flag is cleared only after the entire folio is
> processed and committed.
Okay so IIUC, if we do end up using iomap with ordered data, there are 2
codepaths with issues here:
txn_commit
ordered data writeback (say it goes via iomap)
folio_lock
iomap_writeback_folio
folio_start_writeback
iomap_writeback_range
ext4_map_block
txn_start
wait for tnx commit - DEADLOCK
Currently we avoid this by having ext4_normal_submit_inode_buffers()
pass can_map = 0 so journal flush makese sure not to start any txn.
Then we have
txn_commit background writeback (via iomap)
folio_lock()
ordered data writeback
folio_lock
iomap_writeback_folio
folio_start_writeback
iomap_writeback_range
ext4_map_block
txn_start
wait for txn commit - DEADLOCK
Currently, this is taken care because we try to start the txn before
taking any folio locks/starting writeback, and hence we cannot deadlock.
If the above description makes sense, do you think it'd be good to add
them to the commit message. The reason is that although these paths seem
obvious when we look at them a lot, it took me a good bit of time to
understand what deadlocks you are talking about here :p
Having the code traces like above makes it very clear.
>
> To support data=ordered mode, the iomap core would need two invasive
> changes:
> - Acquire the transaction handle before locking any folio for
> writeback.
> - Support partial folio submission.
>
> Both changes are complicated and risk performance regressions.
> Therefore, we must avoid using data=ordered mode when converting to the
> iomap path.
>
> Currently, data=ordered mode is used in three scenarios:
> - Append write
> - Post-EOF partial block truncate-up followed by append write
> - Online defragmentation
>
> We can address the first two without data=ordered mode:
> - For append write: always allocate unwritten blocks (i.e. always
> enable dioread_nolock), preserving the behavior of current
> extent-type inodes.
> - For post-EOF truncate-up + append write: postpone updating i_disksize
> until after the zeroed partial block has been written back.
I'm still going through how we are addressing no data=ordered so will
get back on this in some time.
Thanks,
Ojaswin
>
> Online defragmentation does not yet support iomap; this can be resolved
> separately in the future.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> fs/ext4/ext4_jbd2.h | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 63d17c5201b5..26999f173870 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -383,7 +383,12 @@ static inline int ext4_should_journal_data(struct inode *inode)
>
> static inline int ext4_should_order_data(struct inode *inode)
> {
> - return ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE;
> + /*
> + * inodes using the iomap buffered I/O path do not use the
> + * data=ordered mode.
> + */
> + return !ext4_inode_buffered_iomap(inode) &&
> + (ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE);
> }
>
> static inline int ext4_should_writeback_data(struct inode *inode)
> --
> 2.52.0
>
^ permalink raw reply
* Re: [PATCH v4 05/23] ext4: implement buffered read path using iomap
From: Ojaswin Mujoo @ 2026-05-19 10:00 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
yangerkun, yukuai
In-Reply-To: <20260511072344.191271-6-yi.zhang@huaweicloud.com>
On Mon, May 11, 2026 at 03:23:25PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Implement the iomap read path for ext4 by introducing a new
> ext4_iomap_buffered_read_ops instance. This provides the read_folio()
> and readahead() callbacks for ext4_iomap_aops. The implementation
> introduces:
>
> - ext4_iomap_map_blocks(): Helper function to query extent mappings for
> a given read range using ext4_map_blocks() and convert the mapping
> information to iomap type
> - ext4_iomap_buffered_read_begin(): The iomap_begin callback that maps
> blocks, validates filesystem state, and populates the iomap. It
> returns -ERANGE for inline data which is not yet supported.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
Looks good, feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Regards,
Ojaswin
> ---
> fs/ext4/inode.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 178ac2be37b7..6c4d9137b279 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3908,14 +3908,57 @@ const struct iomap_ops ext4_iomap_report_ops = {
> .iomap_begin = ext4_iomap_begin_report,
> };
>
> +static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> + loff_t length, struct ext4_map_blocks *map)
> +{
> + u8 blkbits = inode->i_blkbits;
> +
> + if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> + return -EINVAL;
> +
> + /* Calculate the first and last logical blocks respectively. */
> + map->m_lblk = offset >> blkbits;
> + map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> + EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
> +
> + return ext4_map_blocks(NULL, inode, map, 0);
> +}
> +
> +static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
> + loff_t length, unsigned int flags, struct iomap *iomap,
> + struct iomap *srcmap)
> +{
> + struct ext4_map_blocks map;
> + int ret;
> +
> + if (unlikely(ext4_forced_shutdown(inode->i_sb)))
> + return -EIO;
> +
> + /* Inline data support is not yet available. */
> + if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> + return -ERANGE;
> +
> + ret = ext4_iomap_map_blocks(inode, offset, length, &map);
> + if (ret < 0)
> + return ret;
> +
> + ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> + return 0;
> +}
> +
> +const struct iomap_ops ext4_iomap_buffered_read_ops = {
> + .iomap_begin = ext4_iomap_buffered_read_begin,
> +};
> +
> static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
> {
> + iomap_bio_read_folio(folio, &ext4_iomap_buffered_read_ops);
> return 0;
> }
>
> static void ext4_iomap_readahead(struct readahead_control *rac)
> {
> -
> + iomap_bio_readahead(rac, &ext4_iomap_buffered_read_ops);
> }
>
> static int ext4_iomap_writepages(struct address_space *mapping,
> --
> 2.52.0
>
^ permalink raw reply
* Re: [PATCH v4 06/23] ext4: pass out extent seq counter when mapping da blocks
From: Ojaswin Mujoo @ 2026-05-19 10:02 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
yangerkun, yukuai
In-Reply-To: <20260511072344.191271-7-yi.zhang@huaweicloud.com>
On Mon, May 11, 2026 at 03:23:26PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> The iomap buffered write path does not hold any locks between querying
> inode extent mapping information and performing buffered writes. It
> relies on the sequence counter saved in the inode to detect stale
> mappings.
>
> Commit 07c440e8da8f ("ext4: pass out extent seq counter when mapping
> blocks") added the m_seq field to ext4_map_blocks to pass out extent
> sequence numbers, but it missed two callsites within
> ext4_da_map_blocks(). These callsites are on the delayed allocation
> path, which is also used in the iomap buffered write path. Pass out the
> sequence counter to ensure stale mappings can be detected.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
Looks good,
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Regards,
Ojaswin
> ---
> fs/ext4/inode.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 6c4d9137b279..39577a6b65b9 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1929,7 +1929,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
> ext4_check_map_extents_env(inode);
>
> /* Lookup extent status tree firstly */
> - if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
> + if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
> map->m_len = min_t(unsigned int, map->m_len,
> es.es_len - (map->m_lblk - es.es_lblk));
>
> @@ -1982,7 +1982,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
> * is held in write mode, before inserting a new da entry in
> * the extent status tree.
> */
> - if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
> + if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
> map->m_len = min_t(unsigned int, map->m_len,
> es.es_len - (map->m_lblk - es.es_lblk));
>
> --
> 2.52.0
>
^ permalink raw reply
* Re: [PATCH v4 04/23] ext4: add iomap address space operations for buffered I/O
From: Ojaswin Mujoo @ 2026-05-19 9:28 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
yangerkun, yukuai
In-Reply-To: <20260511072344.191271-5-yi.zhang@huaweicloud.com>
On Mon, May 11, 2026 at 03:23:24PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Introduce initial support for iomap in the buffered I/O path for regular
> files on ext4.
>
> - Add a new inode state flag EXT4_STATE_BUFFERED_IOMAP to indicate the
> inode uses iomap instead of buffer_head for buffered I/O
> - Add helper ext4_inode_buffered_iomap() to check the flag
> - Add new address space operations ext4_iomap_aops with callbacks that
> will use generic iomap implementations
> - Add ext4_iomap_aops to ext4_set_aops() when the flag is set
>
> The following callbacks(read_folio(), readahead(), writepages()) are
> provided as placeholders and will be implemented in later patches.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
Hi Zhang, looks good to me. Just a questions below:
> ---
> fs/ext4/ext4.h | 7 +++++++
> fs/ext4/inode.c | 32 ++++++++++++++++++++++++++++++++
> 2 files changed, 39 insertions(+)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 94283a991e5c..1e27d73d7427 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1972,6 +1972,7 @@ enum {
> EXT4_STATE_FC_COMMITTING, /* Fast commit ongoing */
> EXT4_STATE_FC_FLUSHING_DATA, /* Fast commit flushing data */
> EXT4_STATE_ORPHAN_FILE, /* Inode orphaned in orphan file */
> + EXT4_STATE_BUFFERED_IOMAP, /* Inode use iomap for buffered IO */
> };
>
> #define EXT4_INODE_BIT_FNS(name, field, offset) \
> @@ -2040,6 +2041,12 @@ static inline bool ext4_inode_orphan_tracked(struct inode *inode)
> !list_empty(&EXT4_I(inode)->i_orphan);
> }
>
> +/* Whether the inode pass through the iomap infrastructure for buffered I/O */
> +static inline bool ext4_inode_buffered_iomap(struct inode *inode)
> +{
> + return ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
> +}
> +
> /*
> * Codes for operating systems
> */
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b1ef706987c3..178ac2be37b7 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3908,6 +3908,22 @@ const struct iomap_ops ext4_iomap_report_ops = {
> .iomap_begin = ext4_iomap_begin_report,
> };
>
> +static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
> +{
> + return 0;
> +}
> +
> +static void ext4_iomap_readahead(struct readahead_control *rac)
> +{
> +
> +}
> +
> +static int ext4_iomap_writepages(struct address_space *mapping,
> + struct writeback_control *wbc)
> +{
> + return 0;
> +}
> +
> /*
> * For data=journal mode, folio should be marked dirty only when it was
> * writeably mapped. When that happens, it was already attached to the
> @@ -3994,6 +4010,20 @@ static const struct address_space_operations ext4_da_aops = {
> .swap_activate = ext4_iomap_swap_activate,
> };
>
> +static const struct address_space_operations ext4_iomap_aops = {
> + .read_folio = ext4_iomap_read_folio,
> + .readahead = ext4_iomap_readahead,
> + .writepages = ext4_iomap_writepages,
> + .dirty_folio = iomap_dirty_folio,
> + .bmap = ext4_bmap,
> + .invalidate_folio = iomap_invalidate_folio,
> + .release_folio = iomap_release_folio,
> + .migrate_folio = filemap_migrate_folio,
> + .is_partially_uptodate = iomap_is_partially_uptodate,
> + .error_remove_folio = generic_error_remove_folio,
> + .swap_activate = ext4_iomap_swap_activate,
> +};
So one question, for ->release_folio() we are using
iomap_release_folio() instead of ext4_release_folio() here which doesnt
make the jbd2_journal_try_to_free_bufferes() call. IIUC this function
seems to be trying to clean up already checkpointed buffers.
I wanted to check if ->release_folio() can be called for folios with
ext4 metadata buffers? (from my limited understanding of
shrink_folio_list() -> filemap_release_folio() it seems we can) And if
it can be called, is it okay to skip the
jbd2_journal_try_to_free_buffers call?
Regards,
ojaswin
> +
> static const struct address_space_operations ext4_dax_aops = {
> .writepages = ext4_dax_writepages,
> .dirty_folio = noop_dirty_folio,
> @@ -4015,6 +4045,8 @@ void ext4_set_aops(struct inode *inode)
> }
> if (IS_DAX(inode))
> inode->i_mapping->a_ops = &ext4_dax_aops;
> + else if (ext4_inode_buffered_iomap(inode))
> + inode->i_mapping->a_ops = &ext4_iomap_aops;
> else if (test_opt(inode->i_sb, DELALLOC))
> inode->i_mapping->a_ops = &ext4_da_aops;
> else
> --
> 2.52.0
>
^ permalink raw reply
* Re: [PATCH v2] mm: do not install PMD mappings when handling a COW fault
From: William Kucharski @ 2026-05-19 9:02 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-mm, linux-ext4, linux-kernel, david, yi.zhang,
karol.wachowski, wangkefeng.wang, yangerkun, liuyongqiang13
In-Reply-To: <a3c8038a-43e9-4427-92dd-8fed619001e8@huaweicloud.com>
> On May 19, 2026, at 02:36, Zhang Yi <yi.zhang@huaweicloud.com> wrote:
>
> Gentle ping – could anyone take this patch?
>
> Thanks,
> Yi.
Could you update the comment to clarify why you shouldn't install PMD mappings
while doing CoW rather than just state it should never be done?
>
> On 10/24/2025 6:22 PM, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> When pinning a page with FOLL_LONGTERM in a CoW VMA and a PMD-aligned
>> (2MB on x86) large folio follow_page_mask() failed to obtain a valid
>> anonymous page, resulting in an infinite loop issue. The specific
>> triggering process is as follows:
>>
>> 1. User call mmap with a 2MB size in MAP_PRIVATE mode for a file that
>> has a 2MB large folio installed in the page cache.
>>
>> addr = mmap(NULL, 2*1024*1024, PROT_READ, MAP_PRIVATE, file_fd, 0);
>>
>> 2. The kernel driver pass this mapped address to pin_user_pages_fast()
>> in FOLL_LONGTERM mode.
>>
>> pin_user_pages_fast(addr, 512, FOLL_LONGTERM, pages);
>>
>> -> pin_user_pages_fast()
>> | gup_fast_fallback()
>> | __gup_longterm_locked()
>> | __get_user_pages_locked()
>> | __get_user_pages()
>> | follow_page_mask()
>> | follow_p4d_mask()
>> | follow_pud_mask()
>> | follow_pmd_mask() //pmd_leaf(pmdval) is true because the
>> | //huge PMD is installed. This is normal
>> | //in the first round, but it shouldn't
>> | //happen in the second round.
>> | follow_huge_pmd() //require an anonymous page
>> | return -EMLINK;
>> | faultin_page()
>> | handle_mm_fault()
>> | wp_huge_pmd() //remove PMD and fall back to PTE
>> | handle_pte_fault()
>> | do_pte_missing()
>> | do_fault()
>> | do_read_fault() //FAULT_FLAG_WRITE is not set
>> | finish_fault()
>> | do_set_pmd() //install a huge PMD again, this is wrong!!!
>> | do_wp_page() //create private anonymous pages
>> <- goto retry;
>>
>> Due to an incorrectly large PMD set in do_read_fault(),
>> follow_pmd_mask() always returns -EMLINK, causing an infinite loop.
>>
>> David pointed out that we can preallocate a page table and remap the PMD
>> to be mapped by a PTE table in wp_huge_pmd() in the future. But now we
>> can avoid this issue by not installing PMD mappings when handling a COW
>> and unshare fault in do_set_pmd().
>>
>> Fixes: a7f226604170 ("mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page")
>> Reported-by: Karol Wachowski <karol.wachowski@linux.intel.com>
>> Closes: https://lore.kernel.org/linux-ext4/844e5cd4-462e-4b88-b3b5-816465a3b7e3@linux.intel.com/
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> Acked-by: David Hildenbrand <david@redhat.com>
>> ---
>> mm/memory.c | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 0ba4f6b71847..0748a31367df 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5212,6 +5212,11 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
>> if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
>> return ret;
>>
>> + /* We're about to trigger CoW, so never map it through a PMD. */
>> + if (is_cow_mapping(vma->vm_flags) &&
>> + (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)))
>> + return ret;
>> +
>> if (folio_order(folio) != HPAGE_PMD_ORDER)
>> return ret;
>> page = &folio->page;
>
>
^ permalink raw reply
* Re: [PATCH v2] mm: do not install PMD mappings when handling a COW fault
From: Zhang Yi @ 2026-05-19 8:36 UTC (permalink / raw)
To: linux-mm
Cc: linux-ext4, linux-kernel, david, yi.zhang, karol.wachowski,
wangkefeng.wang, yangerkun, liuyongqiang13
In-Reply-To: <20251024102237.3332200-1-yi.zhang@huaweicloud.com>
Gentle ping – could anyone take this patch?
Thanks,
Yi.
On 10/24/2025 6:22 PM, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> When pinning a page with FOLL_LONGTERM in a CoW VMA and a PMD-aligned
> (2MB on x86) large folio follow_page_mask() failed to obtain a valid
> anonymous page, resulting in an infinite loop issue. The specific
> triggering process is as follows:
>
> 1. User call mmap with a 2MB size in MAP_PRIVATE mode for a file that
> has a 2MB large folio installed in the page cache.
>
> addr = mmap(NULL, 2*1024*1024, PROT_READ, MAP_PRIVATE, file_fd, 0);
>
> 2. The kernel driver pass this mapped address to pin_user_pages_fast()
> in FOLL_LONGTERM mode.
>
> pin_user_pages_fast(addr, 512, FOLL_LONGTERM, pages);
>
> -> pin_user_pages_fast()
> | gup_fast_fallback()
> | __gup_longterm_locked()
> | __get_user_pages_locked()
> | __get_user_pages()
> | follow_page_mask()
> | follow_p4d_mask()
> | follow_pud_mask()
> | follow_pmd_mask() //pmd_leaf(pmdval) is true because the
> | //huge PMD is installed. This is normal
> | //in the first round, but it shouldn't
> | //happen in the second round.
> | follow_huge_pmd() //require an anonymous page
> | return -EMLINK;
> | faultin_page()
> | handle_mm_fault()
> | wp_huge_pmd() //remove PMD and fall back to PTE
> | handle_pte_fault()
> | do_pte_missing()
> | do_fault()
> | do_read_fault() //FAULT_FLAG_WRITE is not set
> | finish_fault()
> | do_set_pmd() //install a huge PMD again, this is wrong!!!
> | do_wp_page() //create private anonymous pages
> <- goto retry;
>
> Due to an incorrectly large PMD set in do_read_fault(),
> follow_pmd_mask() always returns -EMLINK, causing an infinite loop.
>
> David pointed out that we can preallocate a page table and remap the PMD
> to be mapped by a PTE table in wp_huge_pmd() in the future. But now we
> can avoid this issue by not installing PMD mappings when handling a COW
> and unshare fault in do_set_pmd().
>
> Fixes: a7f226604170 ("mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page")
> Reported-by: Karol Wachowski <karol.wachowski@linux.intel.com>
> Closes: https://lore.kernel.org/linux-ext4/844e5cd4-462e-4b88-b3b5-816465a3b7e3@linux.intel.com/
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
> mm/memory.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 0ba4f6b71847..0748a31367df 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5212,6 +5212,11 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
> if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
> return ret;
>
> + /* We're about to trigger CoW, so never map it through a PMD. */
> + if (is_cow_mapping(vma->vm_flags) &&
> + (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)))
> + return ret;
> +
> if (folio_order(folio) != HPAGE_PMD_ORDER)
> return ret;
> page = &folio->page;
^ permalink raw reply
* Re: [PATCH v4 03/23] ext4: simplify error handling in ext4_setattr()
From: Ojaswin Mujoo @ 2026-05-19 6:08 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
yangerkun, yukuai
In-Reply-To: <20260511072344.191271-4-yi.zhang@huaweicloud.com>
On Mon, May 11, 2026 at 03:23:23PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Refactor the error handling in ext4_setattr() for better clarity:
>
> - Return directly on ext4_break_layouts() failure.
> - Propagate ext4_truncate() errors using the existing error variable
> and jump to the common 'err_out' label.
> - Propagate posix_acl_chmod() errors also through the error variable,
> as it theoretically does not return a non-fatal error.
>
> With these changes, every error path either returns immediately or jumps
> to err_out. Consequently, the "if (!error)" condition guarding
> setattr_copy() and mark_inode_dirty() becomes unreachable for error
> cases. Remove this redundant check and the unused rc variable can be
> removed as well.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Looks good to me:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
> fs/ext4/inode.c | 32 +++++++++++++++-----------------
> 1 file changed, 15 insertions(+), 17 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 35e958f89bd5..b1ef706987c3 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5989,7 +5989,7 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
> struct iattr *attr)
> {
> struct inode *inode = d_inode(dentry);
> - int error, rc = 0;
> + int error;
> int orphan = 0;
> const unsigned int ia_valid = attr->ia_valid;
> bool inc_ivers = true;
> @@ -6102,10 +6102,10 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>
> filemap_invalidate_lock(inode->i_mapping);
>
> - rc = ext4_break_layouts(inode);
> - if (rc) {
> + error = ext4_break_layouts(inode);
> + if (error) {
> filemap_invalidate_unlock(inode->i_mapping);
> - goto err_out;
> + return error;
> }
>
> if (attr->ia_size > oldsize)
> @@ -6117,15 +6117,19 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
> }
>
> filemap_invalidate_unlock(inode->i_mapping);
> + if (error)
> + goto err_out;
> }
>
> - if (!error) {
> - if (inc_ivers)
> - inode_inc_iversion(inode);
> - setattr_copy(idmap, inode, attr);
> - mark_inode_dirty(inode);
> - }
> + if (inc_ivers)
> + inode_inc_iversion(inode);
> + setattr_copy(idmap, inode, attr);
> + mark_inode_dirty(inode);
>
> + if (ia_valid & ATTR_MODE)
> + error = posix_acl_chmod(idmap, dentry, inode->i_mode);
> +
> +err_out:
> /*
> * If the call to ext4_truncate failed to get a transaction handle at
> * all, we need to clean up the in-core orphan list manually.
> @@ -6133,14 +6137,8 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
> if (orphan && inode->i_nlink)
> ext4_orphan_del(NULL, inode);
>
> - if (!error && (ia_valid & ATTR_MODE))
> - rc = posix_acl_chmod(idmap, dentry, inode->i_mode);
> -
> -err_out:
> - if (error)
> + if (error)
> ext4_std_error(inode->i_sb, error);
> - if (!error)
> - error = rc;
> return error;
> }
>
> --
> 2.52.0
>
^ permalink raw reply
* Re: [PATCH v4 02/23] ext4: factor out ext4_truncate_[up|down]()
From: Ojaswin Mujoo @ 2026-05-19 6:05 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
yangerkun, yukuai
In-Reply-To: <20260511072344.191271-3-yi.zhang@huaweicloud.com>
On Mon, May 11, 2026 at 03:23:22PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Refactor ext4_setattr() by introducing two helper functions,
> ext4_truncate_up() and ext4_truncate_down(), to handle size changes. The
> current ATTR_SIZE processing consolidates checks for both shrinking and
> non-shrinking cases, leading to cluttered code. Separating the
> truncation paths improves readability.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Looks good to me Zhang:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
> fs/ext4/inode.c | 199 +++++++++++++++++++++++++++---------------------
> 1 file changed, 112 insertions(+), 87 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 0751dc55e94f..35e958f89bd5 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5855,6 +5855,112 @@ static void ext4_wait_for_tail_page_commit(struct inode *inode)
> }
> }
>
> +/*
> + * Set i_size and i_disksize to 'newsize'.
> + *
> + * Both i_rwsem and i_data_sem are required here to avoid races between
> + * generic append writeback and concurrent truncate that also modify
> + * i_size and i_disksize.
> + */
> +static inline void ext4_set_inode_size(struct inode *inode, loff_t newsize)
> +{
> + WARN_ON_ONCE(S_ISREG(inode->i_mode) && !inode_is_locked(inode));
> +
> + down_write(&EXT4_I(inode)->i_data_sem);
> + i_size_write(inode, newsize);
> + EXT4_I(inode)->i_disksize = newsize;
> + up_write(&EXT4_I(inode)->i_data_sem);
> +}
> +
> +static int ext4_truncate_up(struct inode *inode, loff_t oldsize, loff_t newsize)
> +{
> + ext4_lblk_t old_lblk, new_lblk;
> + handle_t *handle;
> + int ret;
> +
> + if (!IS_ALIGNED(oldsize | newsize, i_blocksize(inode))) {
> + ret = ext4_inode_attach_jinode(inode);
> + if (ret)
> + return ret;
> + }
> +
> + inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
> + if (!IS_ALIGNED(oldsize, i_blocksize(inode))) {
> + ret = ext4_block_zero_eof(inode, oldsize, LLONG_MAX);
> + if (ret)
> + return ret;
> + }
> +
> + handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
> + if (IS_ERR(handle))
> + return PTR_ERR(handle);
> +
> + old_lblk = oldsize > 0 ? (oldsize - 1) >> inode->i_blkbits : 0;
> + new_lblk = newsize > 0 ? (newsize - 1) >> inode->i_blkbits : 0;
> + ext4_fc_track_range(handle, inode, old_lblk, new_lblk);
> +
> + ext4_set_inode_size(inode, newsize);
> +
> + ret = ext4_mark_inode_dirty(handle, inode);
> + ext4_journal_stop(handle);
> + if (ret)
> + return ret;
> + /*
> + * isize extend must be called outside an active handle due to
> + * the lock ordering of transaction start and folio lock in the
> + * iomap buffered I/O path (folio lock -> transaction start).
> + */
> + pagecache_isize_extended(inode, oldsize, newsize);
> + return 0;
> +}
> +
> +static int ext4_truncate_down(struct inode *inode, loff_t oldsize,
> + loff_t newsize, int *orphan)
> +{
> + ext4_lblk_t start_lblk;
> + handle_t *handle;
> + int ret;
> +
> + /* Do not change i_size. */
> + if (newsize == oldsize)
> + goto truncate;
> +
> + /* Shrink. */
> + handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
> + if (IS_ERR(handle))
> + return PTR_ERR(handle);
> +
> + if (ext4_handle_valid(handle)) {
> + ret = ext4_orphan_add(handle, inode);
> + *orphan = 1;
> + if (ret) {
> + ext4_journal_stop(handle);
> + return ret;
> + }
> + }
> +
> + start_lblk = newsize > 0 ? (newsize - 1) >> inode->i_blkbits : 0;
> + ext4_fc_track_range(handle, inode, start_lblk, EXT_MAX_BLOCKS - 1);
> +
> + ext4_set_inode_size(inode, newsize);
> +
> + ret = ext4_mark_inode_dirty(handle, inode);
> + ext4_journal_stop(handle);
> + if (ret)
> + return ret;
> +
> + if (ext4_should_journal_data(inode))
> + ext4_wait_for_tail_page_commit(inode);
> +truncate:
> + /*
> + * Truncate pagecache after we've waited for commit in data=journal
> + * mode to make pages freeable. Call ext4_truncate() even if
> + * i_size didn't change to truncatea possible preallocated blocks.
> + */
> + truncate_pagecache(inode, newsize);
> + return ext4_truncate(inode);
> +}
> +
> /*
> * ext4_setattr()
> *
> @@ -5951,7 +6057,6 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
> }
>
> if (attr->ia_valid & ATTR_SIZE) {
> - handle_t *handle;
> loff_t oldsize = inode->i_size;
> int shrink = (attr->ia_size < inode->i_size);
>
> @@ -6003,94 +6108,14 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
> goto err_out;
> }
>
> - if (attr->ia_size != inode->i_size) {
> - /* attach jbd2 jinode for EOF folio tail zeroing */
> - if (attr->ia_size & (inode->i_sb->s_blocksize - 1) ||
> - oldsize & (inode->i_sb->s_blocksize - 1)) {
> - error = ext4_inode_attach_jinode(inode);
> - if (error)
> - goto out_mmap_sem;
> - }
> -
> - /*
> - * Update c/mtime and tail zero the EOF folio on
> - * truncate up. ext4_truncate() handles the shrink case
> - * below.
> - */
> - if (!shrink) {
> - inode_set_mtime_to_ts(inode,
> - inode_set_ctime_current(inode));
> - if (oldsize & (inode->i_sb->s_blocksize - 1)) {
> - error = ext4_block_zero_eof(inode,
> - oldsize, LLONG_MAX);
> - if (error)
> - goto out_mmap_sem;
> - }
> - }
> -
> - handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
> - if (IS_ERR(handle)) {
> - error = PTR_ERR(handle);
> - goto out_mmap_sem;
> - }
> - if (ext4_handle_valid(handle) && shrink) {
> - error = ext4_orphan_add(handle, inode);
> - orphan = 1;
> - if (error)
> - goto out_handle;
> - }
> -
> - if (shrink)
> - ext4_fc_track_range(handle, inode,
> - (attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
> - inode->i_sb->s_blocksize_bits,
> - EXT_MAX_BLOCKS - 1);
> - else
> - ext4_fc_track_range(
> - handle, inode,
> - (oldsize > 0 ? oldsize - 1 : oldsize) >>
> - inode->i_sb->s_blocksize_bits,
> - (attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
> - inode->i_sb->s_blocksize_bits);
> -
> - /*
> - * We have to update i_size under i_data_sem together
> - * with i_disksize to avoid races with writeback code
> - * updating disksize in mpage_map_and_submit_extent().
> - */
> - down_write(&EXT4_I(inode)->i_data_sem);
> - i_size_write(inode, attr->ia_size);
> - EXT4_I(inode)->i_disksize = attr->ia_size;
> - up_write(&EXT4_I(inode)->i_data_sem);
> -
> - error = ext4_mark_inode_dirty(handle, inode);
> -out_handle:
> - ext4_journal_stop(handle);
> - if (error)
> - goto out_mmap_sem;
> - if (!shrink) {
> - pagecache_isize_extended(inode, oldsize,
> - inode->i_size);
> - } else if (ext4_should_journal_data(inode)) {
> - ext4_wait_for_tail_page_commit(inode);
> - }
> + if (attr->ia_size > oldsize)
> + error = ext4_truncate_up(inode, oldsize, attr->ia_size);
> + else {
> + /* Shrink or do not change i_size. */
> + error = ext4_truncate_down(inode, oldsize,
> + attr->ia_size, &orphan);
> }
>
> - /*
> - * Truncate pagecache after we've waited for commit
> - * in data=journal mode to make pages freeable.
> - */
> - truncate_pagecache(inode, inode->i_size);
> - /*
> - * Call ext4_truncate() even if i_size didn't change to
> - * truncate possible preallocated blocks.
> - */
> - if (attr->ia_size <= oldsize) {
> - rc = ext4_truncate(inode);
> - if (rc)
> - error = rc;
> - }
> -out_mmap_sem:
> filemap_invalidate_unlock(inode->i_mapping);
> }
>
> --
> 2.52.0
>
^ permalink raw reply
* [syzbot] [ext4?] possible deadlock in ext4_map_blocks (3)
From: syzbot @ 2026-05-19 5:54 UTC (permalink / raw)
To: adilger.kernel, jack, libaokun, linux-ext4, linux-kernel, ojaswin,
ritesh.list, syzkaller-bugs, tytso, yi.zhang
Hello,
syzbot found the following issue on:
HEAD commit: 66182ca873a4 Merge tag 'net-7.1-rc4' of git://git.kernel.o..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=10328d6a580000
kernel config: https://syzkaller.appspot.com/x/.config?x=f2e8ebfec4636d32
dashboard link: https://syzkaller.appspot.com/bug?extid=1d62d734f435d85cc693
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
Unfortunately, I don't have any reproducer for this issue yet.
Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/04f707fb980e/disk-66182ca8.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/88110d832473/vmlinux-66182ca8.xz
kernel image: https://storage.googleapis.com/syzbot-assets/fd09a1619478/bzImage-66182ca8.xz
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+1d62d734f435d85cc693@syzkaller.appspotmail.com
============================================
WARNING: possible recursive locking detected
syzkaller #0 Not tainted
--------------------------------------------
syz.4.1400/19208 is trying to acquire lock:
ffff888040b26cd0 (&ei->i_data_sem/2){++++}-{4:4}, at: ext4_map_blocks+0x7b5/0x11d0 fs/ext4/inode.c:823
but task is already holding lock:
ffff888040b231c0 (&ei->i_data_sem/2){++++}-{4:4}, at: ext4_map_blocks+0x7b5/0x11d0 fs/ext4/inode.c:823
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&ei->i_data_sem/2);
lock(&ei->i_data_sem/2);
*** DEADLOCK ***
May be due to missing lock nesting notation
6 locks held by syz.4.1400/19208:
#0: ffff888025696480 (sb_writers#4){.+.+}-{0:0}, at: file_start_write include/linux/fs.h:2724 [inline]
#0: ffff888025696480 (sb_writers#4){.+.+}-{0:0}, at: vfs_write+0x22d/0xba0 fs/read_write.c:684
#1: ffff888040b23360 (&sb->s_type->i_mutex_key#10){++++}-{4:4}, at: inode_lock include/linux/fs.h:1029 [inline]
#1: ffff888040b23360 (&sb->s_type->i_mutex_key#10){++++}-{4:4}, at: ext4_buffered_write_iter+0xa1/0x3a0 fs/ext4/file.c:311
#2: ffff888040b231c0 (&ei->i_data_sem/2){++++}-{4:4}, at: ext4_map_blocks+0x7b5/0x11d0 fs/ext4/inode.c:823
#3: ffffffff8e170dd8 (dquot_srcu){.+.+}-{0:0}, at: srcu_lock_acquire include/linux/srcu.h:187 [inline]
#3: ffffffff8e170dd8 (dquot_srcu){.+.+}-{0:0}, at: srcu_read_lock include/linux/srcu.h:294 [inline]
#3: ffffffff8e170dd8 (dquot_srcu){.+.+}-{0:0}, at: __dquot_alloc_space+0x18d/0xea0 fs/quota/dquot.c:1729
#4: ffff888062ce2a98 (&dquot->dq_lock){+.+.}-{4:4}, at: dquot_commit+0x5e/0x450 fs/quota/dquot.c:533
#5: ffff8880256961e8 (&s->s_dquot.dqio_sem){++++}-{4:4}, at: v2_write_dquot+0x9c/0x260 fs/quota/quota_v2.c:369
stack backtrace:
CPU: 1 UID: 0 PID: 19208 Comm: syz.4.1400 Not tainted syzkaller #0 PREEMPT_{RT,(full)}
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
print_deadlock_bug+0x279/0x290 kernel/locking/lockdep.c:3041
check_deadlock kernel/locking/lockdep.c:3093 [inline]
validate_chain kernel/locking/lockdep.c:3895 [inline]
__lock_acquire+0x253f/0x2cf0 kernel/locking/lockdep.c:5237
lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5868
down_write+0x3a/0x50 kernel/locking/rwsem.c:1625
ext4_map_blocks+0x7b5/0x11d0 fs/ext4/inode.c:823
ext4_getblk+0x1ca/0x780 fs/ext4/inode.c:992
ext4_bread+0x2a/0x180 fs/ext4/inode.c:1055
ext4_quota_write+0x239/0x580 fs/ext4/super.c:7398
qtree_write_dquot+0x25b/0x5e0 fs/quota/quota_tree.c:462
v2_write_dquot+0x183/0x260 fs/quota/quota_v2.c:372
dquot_commit+0x377/0x450 fs/quota/dquot.c:540
ext4_write_dquot+0x20a/0x380 fs/ext4/super.c:7010
mark_dquot_dirty fs/quota/dquot.c:398 [inline]
mark_all_dquot_dirty+0x205/0x460 fs/quota/dquot.c:438
__dquot_alloc_space+0x620/0xea0 fs/quota/dquot.c:1765
dquot_alloc_space_nodirty include/linux/quotaops.h:292 [inline]
dquot_alloc_space include/linux/quotaops.h:305 [inline]
dquot_alloc_block include/linux/quotaops.h:329 [inline]
ext4_mb_new_blocks+0xfb1/0x46d0 fs/ext4/mballoc.c:6276
ext4_ext_map_blocks+0x1515/0x5860 fs/ext4/extents.c:4461
ext4_map_create_blocks+0x11d/0x540 fs/ext4/inode.c:631
ext4_map_blocks+0x7cd/0x11d0 fs/ext4/inode.c:824
_ext4_get_block+0x1e3/0x470 fs/ext4/inode.c:924
ext4_get_block_unwritten+0x2e/0x100 fs/ext4/inode.c:957
ext4_block_write_begin+0xb14/0x1950 fs/ext4/inode.c:1211
ext4_write_begin+0xb40/0x1890 fs/ext4/ext4_jbd2.h:-1
ext4_da_write_begin+0x355/0xd60 fs/ext4/inode.c:3152
generic_perform_write+0x2af/0x8b0 mm/filemap.c:4325
ext4_buffered_write_iter+0xd0/0x3a0 fs/ext4/file.c:316
ext4_file_write_iter+0x299/0x1c10 fs/ext4/file.c:-1
new_sync_write fs/read_write.c:595 [inline]
vfs_write+0x629/0xba0 fs/read_write.c:688
ksys_pwrite64 fs/read_write.c:795 [inline]
__do_sys_pwrite64 fs/read_write.c:803 [inline]
__se_sys_pwrite64 fs/read_write.c:800 [inline]
__x64_sys_pwrite64+0x19c/0x230 fs/read_write.c:800
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f50b506ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f50b32be028 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
RAX: ffffffffffffffda RBX: 00007f50b52e5fa0 RCX: 00007f50b506ce59
RDX: 0000000000000001 RSI: 0000200000000140 RDI: 0000000000000007
RBP: 00007f50b5102d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000008080c61 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f50b52e6038 R14: 00007f50b52e5fa0 R15: 00007fff20f29238
</TASK>
---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title
If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)
If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report
If you want to undo deduplication, reply with:
#syz undup
^ permalink raw reply
* Re: [PATCH v4 01/23] ext4: simplify size updating in ext4_setattr()
From: Ojaswin Mujoo @ 2026-05-19 5:24 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
yangerkun, yukuai
In-Reply-To: <20260511072344.191271-2-yi.zhang@huaweicloud.com>
On Mon, May 11, 2026 at 03:23:21PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> The logic for updating the file size in ext4_setattr() is currently
> somewhat messy. By directly entering the error-handling path after
> failing to add an orphan inode, the unnecessary recovery process
> involving old_disksize and the file size can be avoided.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
Looks good, feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
> fs/ext4/inode.c | 22 +++++++++-------------
> 1 file changed, 9 insertions(+), 13 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c2c2d6ac7f3d..0751dc55e94f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5953,7 +5953,6 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
> if (attr->ia_valid & ATTR_SIZE) {
> handle_t *handle;
> loff_t oldsize = inode->i_size;
> - loff_t old_disksize;
> int shrink = (attr->ia_size < inode->i_size);
>
> if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
> @@ -6037,6 +6036,8 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
> if (ext4_handle_valid(handle) && shrink) {
> error = ext4_orphan_add(handle, inode);
> orphan = 1;
> + if (error)
> + goto out_handle;
> }
>
> if (shrink)
> @@ -6052,23 +6053,18 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
> (attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
> inode->i_sb->s_blocksize_bits);
>
> - down_write(&EXT4_I(inode)->i_data_sem);
> - old_disksize = EXT4_I(inode)->i_disksize;
> - EXT4_I(inode)->i_disksize = attr->ia_size;
> -
> /*
> * We have to update i_size under i_data_sem together
> * with i_disksize to avoid races with writeback code
> - * running ext4_wb_update_i_disksize().
> + * updating disksize in mpage_map_and_submit_extent().
> */
> - if (!error)
> - i_size_write(inode, attr->ia_size);
> - else
> - EXT4_I(inode)->i_disksize = old_disksize;
> + down_write(&EXT4_I(inode)->i_data_sem);
> + i_size_write(inode, attr->ia_size);
> + EXT4_I(inode)->i_disksize = attr->ia_size;
> up_write(&EXT4_I(inode)->i_data_sem);
> - rc = ext4_mark_inode_dirty(handle, inode);
> - if (!error)
> - error = rc;
> +
> + error = ext4_mark_inode_dirty(handle, inode);
> +out_handle:
> ext4_journal_stop(handle);
> if (error)
> goto out_mmap_sem;
> --
> 2.52.0
>
^ permalink raw reply
* Re: [PATCH v4 9/9] fstests: btrfs: test UUID consistency for clones with metadata_uuid
From: Anand Jain @ 2026-05-18 14:44 UTC (permalink / raw)
To: Christoph Hellwig, Anand Jain
Cc: fstests, linux-btrfs, linux-ext4, linux-xfs, linux-f2fs, amir73il,
zlang, djwong
In-Reply-To: <agK9MC0ufkTJj-zu@infradead.org>
On 12/5/26 13:40, Christoph Hellwig wrote:
> On Tue, Apr 28, 2026 at 02:42:59PM +0800, Anand Jain wrote:
>> Btrfs uses the metadata_uuid superblock feature to change the on-disk UUID
>> without rewriting every block header. This patch adds a sanity check to
>> ensure UUID consistency when a filesystem with metadata_uuid enabled is
>> cloned.
>
> xfs does the same.
>
> Can we abstract out the uuid change and generalize the test?
Sure. let me try.
Thanks.
^ permalink raw reply
* Re: [PATCH v4 4/9] fstests: verify fanotify isolation on cloned filesystems
From: Anand Jain @ 2026-05-18 14:43 UTC (permalink / raw)
To: Christoph Hellwig, Anand Jain
Cc: fstests, linux-btrfs, linux-ext4, linux-xfs, linux-f2fs, amir73il,
zlang, André Almeida
In-Reply-To: <agK83s6NBxpbDREJ@infradead.org>
On 12/5/26 13:38, Christoph Hellwig wrote:
> On Tue, Apr 28, 2026 at 02:42:54PM +0800, Anand Jain wrote:
>> Verify that fanotify events are correctly routed to the appropriate
>> watcher when cloned filesystems are mounted.
>> Helps verify kernel's event notification distinguishes between devices
>> sharing the same FSID/UUID.
>
> Do these tests pass with all major file systems? Or does this reproduce
> the previous btrfs issues in this area?
Yes. The major FSs (BTRFS, EXT4, XFS, and F2FS) all pass, except
EXT4 does not run `generic/801` because statfs `f_fsid` is not
unique when the cloned FS is mounted. Also, BTRFS requires the
kernel patch mentioned in test cases `generic/802` and `generic/804`.
`generic/801` fails in an SELinux environment; this can be
fixed with the following changes in `generic/801` I'll add
this change in v5.
-------
diff --git a/tests/generic/801 b/tests/generic/801
index e1282f4e3d71..384e82120d4e 100644
--- a/tests/generic/801
+++ b/tests/generic/801
@@ -23,6 +23,10 @@ _cleanup()
umount $mnt1 $mnt2 2>/dev/null
_loop_image_destroy "${devs[@]}" 2> /dev/null
rm -r -f $tmp.*
+
+ if [ -n "$old_selinux_state" ]; then
+ setenforce "$old_selinux_state"
+ fi
}
monitor_fanotify()
@@ -49,6 +53,14 @@ mnt2=$TEST_DIR/$seq/mnt2
mkdir -p $mnt1
mkdir -p $mnt2
+# Setting SELINUX_MOUNT_OPTIONS to null still fails fanotify with
+# permission failure, so set SELinux to permissive if it is enforcing.
+if command -v getenforce &>/dev/null && [ "$(getenforce)" = "Enforcing"
]; then
+ old_selinux_state="Enforcing"
+ setenforce Permissive
+else
+ old_selinux_state=""
+fi
_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]}
$mnt1 || \
_fail "Failed to mount dev1"
_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]}
$mnt2 || \
(END)
-------
^ permalink raw reply related
* Re: [PATCH v4 0/9] fstests: add test coverage for cloned filesystem ids
From: Anand Jain @ 2026-05-18 14:27 UTC (permalink / raw)
To: Christoph Hellwig, Anand Jain
Cc: fstests, linux-btrfs, linux-ext4, linux-xfs, linux-f2fs, amir73il,
zlang
In-Reply-To: <agK8gc2niqpTuHHt@infradead.org>
On 12/5/26 13:37, Christoph Hellwig wrote:
> This is missing a real cover letter saying what this series is trying
> to archive.
I should have copied from v1. See below [1]
SCRATCH_DEV_POOL support for non-btrfs was dropped in v2.
Instead, using the newly added helper _loop_image_create_clone().
[1]
https://lore.kernel.org/fstests/cover.1772095513.git.asj@kernel.org/
-----
This series adds fstests infrastructure and test cases to verify correct
filesystem identity behaviour when a filesystem is cloned (e.g. via
block-level copy), covering inotify, fanotify, f_fsid, libblkid, IMA,
and exportfs file handles.
- SCRATCH_DEV_POOL support extended to non-Btrfs filesystems
- _mkfs_scratch_sized_clone() helper to create a cloned filesystem
- _clone_mount_option() helper to apply per-filesystem clone mount options
New tests verify:
- inotify and fanotify events are isolated between cloned filesystems
- f_fsid is unique across cloned filesystem instances
- libblkid correctly resolves duplicate UUIDs to distinct devices
- IMA distinct identity for each cloned filesystem
- exportfs file handles resolve correctly on cloned filesystems
-----
^ permalink raw reply
* Re: improve the swap_activate interface
From: Christoph Hellwig @ 2026-05-18 5:29 UTC (permalink / raw)
To: Chris Li
Cc: Christoph Hellwig, Andrew Morton, Kairui Song, Christian Brauner,
Darrick J . Wong, Jens Axboe, David Sterba, Theodore Ts'o,
Jaegeuk Kim, Chao Yu, Trond Myklebust, Anna Schumaker,
Namjae Jeon, Hyunchul Lee, Steve French, Paulo Alcantara,
Carlos Maiolino, Damien Le Moal, Naohiro Aota, linux-xfs,
linux-fsdevel, linux-doc, linux-mm, linux-block, linux-btrfs,
linux-ext4, linux-f2fs-devel, linux-nfs, linux-cifs
In-Reply-To: <CACePvbUj0-fAd-gjRjxFXYz22hGQaT9upFL85KUqD=W=SWX+0Q@mail.gmail.com>
On Fri, May 15, 2026 at 02:40:09PM -0700, Chris Li wrote:
> BTW, I just tried it, this series conflicts with Kairui's swap table
> phase IV series. Might need to coordinate the merge order with Kairui.
Yes. I think the swap table should be a priority. Next would be
swap_ops (see my take on that from Friday) and then this series. The
iomap-fuse work from Darrick seems to be a bit stalled on the fuse side,
so I hope he can wait a bit on this as well. Especially as swapfile
support might not be the highest priority feature for that.
^ permalink raw reply
* Re: [PATCH RFC 01/17] lib/crc: add crc32c_flip_range() for incremental CRC update
From: Baokun Li @ 2026-05-17 4:18 UTC (permalink / raw)
To: Eric Biggers
Cc: linux-ext4, linux-crypto, ardb, tytso, adilger.kernel, jack,
yi.zhang, ojaswin, ritesh.list
In-Reply-To: <20260514035248.GA2816@sol>
Hi Eric,
Thanks for the feedback!
在 2026/5/14 11:52, Eric Biggers 写道:
> On Fri, May 08, 2026 at 08:15:23PM +0800, Baokun Li wrote:
>> When a contiguous range of bits in a buffer is flipped, the CRC32c
>> checksum can be updated incrementally without re-scanning the entire
>> buffer, by exploiting the linearity of CRCs over GF(2):
>>
>> New_CRC = Old_CRC ^ CRC(flip_mask << trailing_bits)
>>
>> Introduce crc32c_flip_range() which computes this delta using
>> precomputed GF(2) shift matrices and nibble-indexed lookup tables.
>> The implementation decomposes nbits and trailing_bits into
>> power-of-2 components and combines them via the CRC concatenation
>> property:
>>
>> CRC(A || B) = shift(CRC(A), len(B)) ^ CRC(B)
>>
>> This gives O(log N) complexity with only ~9.8KB of static tables
>> (fits in L1 cache). The current maximum supported buffer size is
>> 64KB (INCR_MAX_ORDER = 19, i.e. 2^19 bits = 524288 bits = 64KB).
> It will be a little while before I can do a full review of this, but
> just a high-level comment: "only ~9.8KB of static tables (fits in L1
> cache)" isn't ideal. Large tables tend to microbenchmark well, then
> have worse real-world performance due to lots of other things contending
> for the L1 cache.
You're right, and that's exactly the trap I fell into when picking
the initial size. I went with the variant that had the best kunit
microbenchmark while still fitting in a typical L1 -- the
nibble-indexed (4-bit) tables. I've now re-measured all three
candidate table sizes:
=== crc32c_flip_range benchmark (ns, speedup vs full) ===
bitmap full 1bit(2.5KB) 2bit(4.9KB) 4bit(9.8KB)
1024 46 165 (0.3x) 82 (0.6x) 48 (1.0x)
2048 88 180 (0.5x) 88 (1.0x) 53 (1.7x)
4096 181 194 (0.9x) 98 (1.8x) 58 (3.1x)
8192 358 207 (1.7x) 104 (3.4x) 63 (5.7x)
16384 707 222 (3.2x) 112 (6.3x) 68 (10.4x)
32768 1424 234 (6.1x) 121 (11.8x) 73 (19.5x)
65536 2846 248 (11.5x) 129 (22.1x) 79 (36.0x)
One thing worth mentioning: the upcoming crc32c_splice() API reuses
the same GF(2) shift tables for byte-granular CRC updates (extent
blocks, dir blocks, etc.). It's being posted as a separate series
because the ext4 integration is more involved, but roughly:
u32 crc32c_splice(const void *buf, u32 buflen, u32 old_crc,
u32 old_region_crc, u32 offset, u32 len)
{
u32 new_region_crc, delta, trail_bits;
[...]
new_region_crc = crc32c(0, (const u8 *)buf + offset, len);
delta = old_region_crc ^ new_region_crc;
if (!delta)
return old_crc;
trail_bits = (buflen - offset - len) * 8;
delta = gf2_shift_crc(delta, trail_bits);
return old_crc ^ delta;
}
The splice kunit numbers, for completeness:
=== crc32c_splice benchmark (ns, speedup vs full) ===
blk_regio full splice(1bit) splice(2bit) splice(4bit)
1024_12 46 8 (5.8x) 9 (5.1x) 9 (5.1x)
1024_32 46 15 (3.1x) 14 (3.3x) 15 (3.1x)
1024_64 46 20 (2.3x) 19 (2.4x) 20 (2.3x)
1024_128 46 30 (1.5x) 31 (1.5x) 30 (1.5x)
1024_264 46 53 (0.9x) 53 (0.9x) 53 (0.9x)
2048_12 88 8 (11.0x) 8 (11.0x) 8 (11.0x)
2048_32 88 15 (5.9x) 13 (6.8x) 15 (5.9x)
2048_64 89 20 (4.5x) 20 (4.5x) 20 (4.5x)
2048_128 89 31 (2.9x) 30 (3.0x) 30 (3.0x)
2048_264 88 53 (1.7x) 53 (1.7x) 53 (1.7x)
4096_12 181 9 (20.1x) 7 (25.9x) 9 (20.1x)
4096_32 181 14 (12.9x) 15 (12.1x) 15 (12.1x)
4096_64 181 20 (9.1x) 20 (9.1x) 19 (9.5x)
4096_128 181 31 (5.8x) 31 (5.8x) 30 (6.0x)
4096_264 182 54 (3.4x) 53 (3.4x) 54 (3.4x)
8192_12 358 9 (39.8x) 8 (44.8x) 10 (35.8x)
8192_32 358 15 (23.9x) 15 (23.9x) 15 (23.9x)
8192_64 358 21 (17.0x) 20 (17.9x) 21 (17.0x)
8192_128 358 32 (11.2x) 31 (11.5x) 31 (11.5x)
8192_264 358 54 (6.6x) 53 (6.8x) 53 (6.8x)
16384_12 707 10 (70.7x) 8 (88.4x) 8 (88.4x)
16384_32 706 15 (47.1x) 15 (47.1x) 15 (47.1x)
16384_64 706 21 (33.6x) 19 (37.2x) 19 (37.2x)
16384_128 707 30 (23.6x) 31 (22.8x) 30 (23.6x)
16384_264 707 54 (13.1x) 53 (13.3x) 53 (13.3x)
32768_12 1422 10 (142.2x) 9 (158.0x) 9 (158.0x)
32768_32 1422 15 (94.8x) 15 (94.8x) 15 (94.8x)
32768_64 1422 20 (71.1x) 19 (74.8x) 20 (71.1x)
32768_128 1422 31 (45.9x) 31 (45.9x) 31 (45.9x)
32768_264 1422 53 (26.8x) 53 (26.8x) 54 (26.3x)
65536_12 2841 10 (284.1x) 9 (315.7x) 8 (355.1x)
65536_32 2840 14 (202.9x) 15 (189.3x) 14 (202.9x)
65536_64 2840 21 (135.2x) 19 (149.5x) 20 (142.0x)
65536_128 2845 30 (94.8x) 31 (91.8x) 31 (91.8x)
65536_264 2841 53 (53.6x) 53 (53.6x) 53 (53.6x)
But, as you point out, what really matters is the real-world impact
once the tables are competing for L1 with everything else. I tested
all three table sizes on an ext4 fio workload (single-process
sequential fallocate of 64K extents) across a range of filesystem
block sizes. Results below, with both +flip_range alone and
+flip_range+splice applied:
=== default mkfs, single-process (GB/s) ===
config base raw-bit-flip raw-bit-splice 2-bit-flip 2-bit-splice
4-bit-flip 4-bit-splice
S_1k 15.4 15.3(-0.6%) 15.3(-0.6%) 15.1(-1.9%) 15.8(+2.6%)
15.0(-2.6%) 15.5(+0.6%)
S_2k 17.6 17.7(+0.6%) 17.9(+1.7%) 17.6(+0.0%) 18.3(+4.0%)
17.2(-2.3%) 18.6(+5.7%)
S_4k 16.9 17.0(+0.6%) 18.6(+10.1%) 17.4(+3.0%) 18.4(+8.9%)
17.3(+2.4%) 18.7(+10.7%)
S_8k 15.8 16.3(+3.2%) 18.1(+14.6%) 16.6(+5.1%) 18.3(+15.8%)
16.4(+3.8%) 17.8(+12.7%)
S_16k 12.5 13.1(+4.8%) 15.4(+23.2%) 13.0(+4.0%) 15.5(+24.0%)
12.9(+3.2%) 15.6(+24.8%)
S_32k 8.93 9.37(+4.9%) 12.5(+40.0%) 9.10(+1.9%) 13.1(+46.7%)
9.07(+1.6%) 12.5(+40.0%)
S_64k 8.17 8.43(+3.2%) 14.3(+75.0%) 8.64(+5.8%) 14.6(+78.7%)
8.39(+2.7%) 14.8(+81.2%)
So the larger tables do measure a bit faster, but the gain over 2-bit
is about 3% while the .rodata footprint doubles. All three variants
land within run-to-run noise on the real workload, which matches your
prediction exactly.
Based on this I'd lean toward the 2-bit (4.9 KB) variant for v2 as
the better trade-off. Would you prefer that, or the smaller 1-bit
(2.5 KB) version? The ext4 numbers say either is fine; 2-bit just
keeps a little more headroom on the microbench in case other
consumers show up later.
> Another consideration is that basically every Linux kernel has
> CONFIG_CRC32 enabled, regardless of whether they would actually find
> this new functionality useful.
Agreed. As large-block hardware becomes more common I expect other
filesystems beyond ext4 to hit the same large-buffer CRC overhead, so
I deliberately put this in lib/crc as a general-purpose API rather
than burying it inside ext4. But you're right that it shouldn't be
unconditionally compiled in. For v2 I'll add CONFIG_CRC32_INCR,
selected by consumers (initially just ext4), so kernels that don't
need it pay zero .text/.rodata cost.
> I'm not necessarily saying this should be its own option, especially if
> it's useful for ext4 even in the non-LBS case. But I do think it would
> be nice if it could be a bit smaller and more memory-optimized.
The non-LBS case does see some benefit, but it's modest -- the
incremental update mostly matters once group-descriptor-size CRCs
become large. The good news is that the regression on small-block
configs is essentially zero (see the S_1k / S_2k rows above), so I
left it unconditionally enabled in the current series to keep things
simple.
If there's concern about that, I'm happy to either gate it on a
sysfs/mount-option knob, or restrict it to LBS-only paths in ext4.
>
> Anyway, I'll look into the algorithm more when I have time.
>
Thanks again for taking the time on this -- the current series is
still rough around the edges and I'd appreciate any further feedback
once you get to a deeper review.
Cheers,
Baokun
^ permalink raw reply
* Re: [PATCH v14 00/15] Exposing case folding behavior
From: Chuck Lever @ 2026-05-16 14:48 UTC (permalink / raw)
To: Cedric Blancher, Christian Brauner
Cc: Chuck Lever, linux-fsdevel, linux-ext4, linux-xfs, linux-cifs,
linux-nfs, linux-api, linux-f2fs-devel, hirofumi, linkinjeon,
sj1557.seo, yuezhang.mo, almaz.alexandrovich, slava, glaubitz,
frank.li, tytso, adilger.kernel, cem, sfrench, pc, ronniesahlberg,
sprasad, trondmy, anna, jaegeuk, chao, hansg, senozhatsky,
Darrick J. Wong, Roland Mainz, Steve French
In-Reply-To: <CALXu0UdsurG-ayuYViqs0HXOfgyDw8gpNC+f=5y59cuuSPUbBA@mail.gmail.com>
On 5/16/26 2:43 AM, Cedric Blancher wrote:
> On Mon, 11 May 2026 at 16:11, Christian Brauner <brauner@kernel.org> wrote:
>>
>> On Thu, 07 May 2026 04:52:53 -0400, Chuck Lever wrote:
>>> Christian, let's lock this one in. I will post subsequent changes
>>> as delta patches.
>>>
>>> Following on from:
>>>
>>> https://lore.kernel.org/linux-nfs/20251021-zypressen-bazillus-545a44af57fd@brauner/T/#m0ba197d75b7921d994cf284f3cef3a62abb11aaa
>>>
>>> [...]
>>
>> Applied to the vfs-7.2.exportfs branch of the vfs/vfs.git tree.
>> Patches in the vfs-7.2.exportfs branch should appear in linux-next soon.
>
> @Chuck Lever Thank you!
>
> Does that mean the support for case-insensitive filesystems will work
> with Linux 7.2?
I don't want to make claims with 100% certainty, but we expect this
series to get merged into 7.2. It's early days, so there are likely to
be bugs -- there is so much subtle behavior under the covers.
--
Chuck Lever
^ permalink raw reply
* Re: [PATCH v14 00/15] Exposing case folding behavior
From: Cedric Blancher @ 2026-05-16 6:43 UTC (permalink / raw)
To: Christian Brauner, Chuck Lever
Cc: Chuck Lever, linux-fsdevel, linux-ext4, linux-xfs, linux-cifs,
linux-nfs, linux-api, linux-f2fs-devel, hirofumi, linkinjeon,
sj1557.seo, yuezhang.mo, almaz.alexandrovich, slava, glaubitz,
frank.li, tytso, adilger.kernel, cem, sfrench, pc, ronniesahlberg,
sprasad, trondmy, anna, jaegeuk, chao, hansg, senozhatsky,
Darrick J. Wong, Roland Mainz, Steve French
In-Reply-To: <20260511-wertverlust-vorbringen-070f016f3bd4@brauner>
On Mon, 11 May 2026 at 16:11, Christian Brauner <brauner@kernel.org> wrote:
>
> On Thu, 07 May 2026 04:52:53 -0400, Chuck Lever wrote:
> > Christian, let's lock this one in. I will post subsequent changes
> > as delta patches.
> >
> > Following on from:
> >
> > https://lore.kernel.org/linux-nfs/20251021-zypressen-bazillus-545a44af57fd@brauner/T/#m0ba197d75b7921d994cf284f3cef3a62abb11aaa
> >
> > [...]
>
> Applied to the vfs-7.2.exportfs branch of the vfs/vfs.git tree.
> Patches in the vfs-7.2.exportfs branch should appear in linux-next soon.
@Chuck Lever Thank you!
Does that mean the support for case-insensitive filesystems will work
with Linux 7.2?
Ced
--
Cedric Blancher <cedric.blancher@gmail.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur
^ permalink raw reply
* Re: [PATCH 4/4] iomap: fix out-of-bounds bitmap_set() with zero-length range
From: Zhang Yi @ 2026-05-16 1:25 UTC (permalink / raw)
To: Darrick J. Wong, Zhang Yi
Cc: linux-fsdevel, linux-xfs, linux-ext4, brauner, hch, yi.zhang,
yangerkun, yukuai
In-Reply-To: <20260515173047.GC9555@frogsfrogsfrogs>
On 5/16/2026 1:30 AM, Darrick J. Wong wrote:
> On Fri, May 15, 2026 at 09:50:08AM +0800, Zhang Yi wrote:
>> On 5/15/2026 2:10 AM, Darrick J. Wong wrote:
>>> On Thu, May 14, 2026 at 02:29:55PM +0800, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> ifs_set_range_dirty() and ifs_set_range_uptodate() compute last_blk
>>>> as (off + len - 1) >> i_blkbits. When off is 0 and len is 0, the
>>>> unsigned subtraction underflows to SIZE_MAX, producing a huge
>>>> last_blk and nr_blks value that causes bitmap_set() to write far
>>>> beyond the ifs->state allocation.
>>>>
>>>> Regarding ifs_set_range_uptodate(), it is temporarily safe because len
>>>> cannot be passed in as 0. However, for ifs_set_range_dirty() this is
>>>> reachable from __iomap_write_end(): when copy_folio_from_iter_atomic()
>>>> returns 0 (e.g. user buffer fault) and the folio is already uptodate,
>>>> the guard at the top of __iomap_write_end() does not trigger because
>>>> !folio_test_uptodate() is false, and iomap_set_range_dirty() is called
>>>> with copied == 0.
>>>>
>>>> Add a !len guard to both functions before the computation, so that a
>>>> zero-length range is a no-op.
>>>>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>> ---
>>>> fs/iomap/buffered-io.c | 23 +++++++++++++++--------
>>>> 1 file changed, 15 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>>>> index 27ab33edbdee..6fe5f7e998fd 100644
>>>> --- a/fs/iomap/buffered-io.c
>>>> +++ b/fs/iomap/buffered-io.c
>>>> @@ -67,11 +67,14 @@ static bool ifs_set_range_uptodate(struct folio *folio,
>>>> struct iomap_folio_state *ifs, size_t off, size_t len)
>>>> {
>>>> struct inode *inode = folio->mapping->host;
>>>> - unsigned int first_blk = off >> inode->i_blkbits;
>>>> - unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
>>>> - unsigned int nr_blks = last_blk - first_blk + 1;
>>>> + unsigned int first_blk, last_blk;
>>>>
>>>> - bitmap_set(ifs->state, first_blk, nr_blks);
>>>> + if (!len)
>>>> + return true;
>>>> +
>>>> + first_blk = off >> inode->i_blkbits;
>>>> + last_blk = (off + len - 1) >> inode->i_blkbits;
>>>> + bitmap_set(ifs->state, first_blk, last_blk - first_blk + 1);
>>>> return ifs_is_fully_uptodate(folio, ifs);
>>>> }
>>>>
>>>> @@ -203,13 +206,17 @@ static void ifs_set_range_dirty(struct folio *folio,
>>>> {
>>>> struct inode *inode = folio->mapping->host;
>>>> unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
>>>> - unsigned int first_blk = (off >> inode->i_blkbits);
>>>> - unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
>>>> - unsigned int nr_blks = last_blk - first_blk + 1;
>>>> + unsigned int first_blk, last_blk;
>>>> unsigned long flags;
>>>>
>>>> + if (!len)
>>>> + return;
>>>> +
>>>> + first_blk = off >> inode->i_blkbits;
>>>> + last_blk = (off + len - 1) >> inode->i_blkbits;
>>>> spin_lock_irqsave(&ifs->state_lock, flags);
>>>> - bitmap_set(ifs->state, first_blk + blks_per_folio, nr_blks);
>>>> + bitmap_set(ifs->state, first_blk + blks_per_folio,
>>>> + last_blk - first_blk + 1);
>>>
>>> I'm curious about the inconsistency in the computations between
>>> ifs_clear_range_dirty and ifs_set_range_dirty now. In the function that
>>> clears dirty bits, off/len are rounded inwards:
>>>
>>> unsigned int first_blk = round_up(off, i_blocksize(inode)) >>
>>> inode->i_blkbits;
>>> unsigned int last_blk = (off + len) >> inode->i_blkbits;
>>> unsigned long flags;
>>>
>>> if (first_blk >= last_blk)
>>> return;
>>>
>>> spin_lock_irqsave(&ifs->state_lock, flags);
>>> bitmap_clear(ifs->state, first_blk + blks_per_folio,
>>> last_blk - first_blk);
>>>
>>> but here we're still rounding outwards:
>>>
>>> if (!len)
>>> return;
>>>
>>> first_blk = off >> inode->i_blkbits;
>>> last_blk = (off + len - 1) >> inode->i_blkbits;
>>> spin_lock_irqsave(&ifs->state_lock, flags);
>>> bitmap_set(ifs->state, first_blk + blks_per_folio,
>>> last_blk - first_blk + 1);
>>>
>>> That doesn't quite sound right to me without an explanation in the code,
>>> which currently lacks one. I *think* the reason for the discrepancy is
>>> that if we want to dirty part of an fsblock, we need to mark the whole
>>> block dirty in the ifs so that all the blocks get written out; but when
>>> we're clearing dirty bits, we want to leave an fsblock dirty if we only
>>> wrote back part of that fsblock. Does that sound right?
>>>
>>> --D
>>>
>>
>> Yes, that's right. The primary purpose of clearing the dirty bit is for
>> invalidating a partial folio(e.g., when punching hole). Only when a full
>> fsblock has been invalidated can its corresponding dirty bit be cleared.
>> Write-back operations never seem to write back only part of a fsblock.
>>
>> I can add comments for them in my next iteration.
>
> Sounds good, thank you.
>
> (Does this patch also need a fixes tag like Joanne asked?)
Yeah, I will add it as well.
Thanks,
Yi.
>
> --D
>
>> Thanks,
>> Yi.
>>
>>
>>
^ permalink raw reply
* Re: [PATCH 01/12] swap: remove the maxpages variable in sys_swapon
From: Chris Li @ 2026-05-15 22:37 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, Kairui Song, Christian Brauner, Darrick J . Wong,
Jens Axboe, David Sterba, Theodore Ts'o, Jaegeuk Kim, Chao Yu,
Trond Myklebust, Anna Schumaker, Namjae Jeon, Hyunchul Lee,
Steve French, Paulo Alcantara, Carlos Maiolino, Damien Le Moal,
Naohiro Aota, linux-xfs, linux-fsdevel, linux-doc, linux-mm,
linux-block, linux-btrfs, linux-ext4, linux-f2fs-devel, linux-nfs,
linux-cifs
In-Reply-To: <20260512053625.2950900-2-hch@lst.de>
On Mon, May 11, 2026 at 10:36 PM Christoph Hellwig <hch@lst.de> wrote:
>
> Always use si->max which is updated setup_swap_extents instead of copying
> into and out of maxpages.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Chris Li <chrisl@kernel.org>
> ---
> mm/swapfile.c | 27 +++++++++++----------------
> 1 file changed, 11 insertions(+), 16 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 9174f1eeffb0..f7ebd97e28a3 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3350,10 +3350,9 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
> }
>
> static int setup_swap_clusters_info(struct swap_info_struct *si,
> - union swap_header *swap_header,
> - unsigned long maxpages)
> + union swap_header *swap_header)
> {
> - unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
> + unsigned long nr_clusters = DIV_ROUND_UP(si->max, SWAPFILE_CLUSTER);
> struct swap_cluster_info *cluster_info;
> int err = -ENOMEM;
> unsigned long i;
> @@ -3395,7 +3394,7 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
> if (err)
> goto err;
> }
> - for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++) {
> + for (i = si->max; i < round_up(si->max, SWAPFILE_CLUSTER); i++) {
> err = swap_cluster_setup_bad_slot(si, cluster_info, i, true);
Nitpick: I couldn't hlep but notice the si->max does not change
between setup bad slots, so in theory you can cache the si->max value
to a local variable for the loop. In real life, it will make no
difference, so feel free to keep it as is.
Chris
^ permalink raw reply
* Re: [PATCH 02/12] swap: move boilerplate code into the core swap code
From: Chris Li @ 2026-05-15 22:29 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, Kairui Song, Christian Brauner, Darrick J . Wong,
Jens Axboe, David Sterba, Theodore Ts'o, Jaegeuk Kim, Chao Yu,
Trond Myklebust, Anna Schumaker, Namjae Jeon, Hyunchul Lee,
Steve French, Paulo Alcantara, Carlos Maiolino, Damien Le Moal,
Naohiro Aota, linux-xfs, linux-fsdevel, linux-doc, linux-mm,
linux-block, linux-btrfs, linux-ext4, linux-f2fs-devel, linux-nfs,
linux-cifs
In-Reply-To: <20260512053625.2950900-3-hch@lst.de>
On Mon, May 11, 2026 at 10:36 PM Christoph Hellwig <hch@lst.de> wrote:
>
> Make the core swap code calculate sis->pages, nr_extents and the span,
> re-set sis->max based on it and don't require passing the current offset
> into the swap file to swap_add_extent as all that can trivially be
> calculated internally. Also truncate the spans based on the available
> information.
>
> All this removes a lot of boilerplate code in the callers.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Very nice cleanup.
Acked-by: Chris Li <chrisl@kernel.org>
Chris
^ permalink raw reply
* Re: [PATCH 03/12] swap,fs: move swapfile operations to struct file_operations
From: Chris Li @ 2026-05-15 22:24 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, Kairui Song, Christian Brauner, Darrick J . Wong,
Jens Axboe, David Sterba, Theodore Ts'o, Jaegeuk Kim, Chao Yu,
Trond Myklebust, Anna Schumaker, Namjae Jeon, Hyunchul Lee,
Steve French, Paulo Alcantara, Carlos Maiolino, Damien Le Moal,
Naohiro Aota, linux-xfs, linux-fsdevel, linux-doc, linux-mm,
linux-block, linux-btrfs, linux-ext4, linux-f2fs-devel, linux-nfs,
linux-cifs
In-Reply-To: <20260512053625.2950900-4-hch@lst.de>
On Mon, May 11, 2026 at 10:37 PM Christoph Hellwig <hch@lst.de> wrote:
>
> The swap operations have nothing to do with the address_space, which is
> used for pagecache operations. Move them to struct file_operations
> instead. This will allow moving the block device special cases into
> block/fops.c subsequently.
>
> Pass struct file first to ->swap_activate as file operations typically
> get the file or iocb as first argument and use swap_activate instead of
> swapfile_activate in all names to be consistent.
>
> Note that while the trivial iomap wrappers are moved to a new file when
> applicable to keep them local to the file operation instances, complex
> implementation are kept in their existing place. It might be worth to
> move them in follow-on patches if the maintainers desire so.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
That makes sense to me. I ack for the core swap part of the code.
Acked-by: Chris Li <chrisl@kernel.org>
Chris
^ permalink raw reply
* Re: [PATCH 04/12] swap: restrict to regular files or block devices
From: Chris Li @ 2026-05-15 22:19 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, Kairui Song, Christian Brauner, Darrick J . Wong,
Jens Axboe, David Sterba, Theodore Ts'o, Jaegeuk Kim, Chao Yu,
Trond Myklebust, Anna Schumaker, Namjae Jeon, Hyunchul Lee,
Steve French, Paulo Alcantara, Carlos Maiolino, Damien Le Moal,
Naohiro Aota, linux-xfs, linux-fsdevel, linux-doc, linux-mm,
linux-block, linux-btrfs, linux-ext4, linux-f2fs-devel, linux-nfs,
linux-cifs
In-Reply-To: <20260512053625.2950900-5-hch@lst.de>
On Mon, May 11, 2026 at 10:37 PM Christoph Hellwig <hch@lst.de> wrote:
>
> Various swap code assumes it runs either on a block device or on a
> regular file. Make this restriction explicit using checks right
> after opening the file.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Chris Li <chrisl@kernel.org>
Chris
^ permalink raw reply
* Re: [PATCH 05/12] swap: cleanup setup_swap_extents
From: Chris Li @ 2026-05-15 22:18 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, Kairui Song, Christian Brauner, Darrick J . Wong,
Jens Axboe, David Sterba, Theodore Ts'o, Jaegeuk Kim, Chao Yu,
Trond Myklebust, Anna Schumaker, Namjae Jeon, Hyunchul Lee,
Steve French, Paulo Alcantara, Carlos Maiolino, Damien Le Moal,
Naohiro Aota, linux-xfs, linux-fsdevel, linux-doc, linux-mm,
linux-block, linux-btrfs, linux-ext4, linux-f2fs-devel, linux-nfs,
linux-cifs
In-Reply-To: <20260512053625.2950900-6-hch@lst.de>
On Mon, May 11, 2026 at 10:37 PM Christoph Hellwig <hch@lst.de> wrote:
>
> Reflow setup_swap_extents so that the flag checking is not conditional on
> a swap_activate method. This is currently a no-op because the swapoff
> code still checks the presence of a swap_deactivate method, but it
> simplifies adding a new check, and also makes the SWP_ACTIVATED flag
> more consistent.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Chris Li <chrisl@kernel.org>
Chris
^ permalink raw reply
* Re: [PATCH 06/12] swap,block: move the block device swapon code into block/fops.c
From: Chris Li @ 2026-05-15 22:14 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, Kairui Song, Christian Brauner, Darrick J . Wong,
Jens Axboe, David Sterba, Theodore Ts'o, Jaegeuk Kim, Chao Yu,
Trond Myklebust, Anna Schumaker, Namjae Jeon, Hyunchul Lee,
Steve French, Paulo Alcantara, Carlos Maiolino, Damien Le Moal,
Naohiro Aota, linux-xfs, linux-fsdevel, linux-doc, linux-mm,
linux-block, linux-btrfs, linux-ext4, linux-f2fs-devel, linux-nfs,
linux-cifs
In-Reply-To: <20260512053625.2950900-7-hch@lst.de>
On Mon, May 11, 2026 at 10:37 PM Christoph Hellwig <hch@lst.de> wrote:
>
> Make use of the abstractions we have. This is a preparation for
> moving more special casing down into block/.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Chris Li <chrisl@kernel.org>
Chris
^ permalink raw reply
* Re: [PATCH 09/12] swap: push down setting sis->bdev into ->swap_activate
From: Chris Li @ 2026-05-15 22:12 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, Kairui Song, Christian Brauner, Darrick J . Wong,
Jens Axboe, David Sterba, Theodore Ts'o, Jaegeuk Kim, Chao Yu,
Trond Myklebust, Anna Schumaker, Namjae Jeon, Hyunchul Lee,
Steve French, Paulo Alcantara, Carlos Maiolino, Damien Le Moal,
Naohiro Aota, linux-xfs, linux-fsdevel, linux-doc, linux-mm,
linux-block, linux-btrfs, linux-ext4, linux-f2fs-devel, linux-nfs,
linux-cifs
In-Reply-To: <20260512053625.2950900-10-hch@lst.de>
On Mon, May 11, 2026 at 10:37 PM Christoph Hellwig <hch@lst.de> wrote:
>
> Only the file operation method knows what block device we'll swap
> to. So move down setting sis->bdev and the special blockdev flag
> into ->swap_activate.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
The core swap part of the code looks fine to me, I did not look much
deeper into the fs side.
Ack-by: Chris Li <chrisl@kernel.org>
Chris
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox