* [PATCH v2 0/1] ext4: Fix stale data exposure caused with dioread_nolock @ 2023-09-16 10:42 Ojaswin Mujoo 2023-09-16 10:42 ` [PATCH v2 1/1] ext4: Mark buffer new if it is unwritten to avoid stale data exposure Ojaswin Mujoo 0 siblings, 1 reply; 4+ messages in thread From: Ojaswin Mujoo @ 2023-09-16 10:42 UTC (permalink / raw) To: linux-ext4, Theodore Ts'o; +Cc: Ritesh Harjani, linux-kernel, Jan Kara ** Changes in v2 ** - Moved the logic to mark buffer as new in ext4_get_block_unwritten, as it is actually specific to this function, rather than _ext4_get_block which is a more generic function called from codepaths that dont need this logic - Tested with -g rw,quick as well as custom replication test (present in commit message). No issues with ps == bs as well as sub page size scenarios. - TODO: Add logic in ext4_block_zero_page_range() to ignore zeroing if the buffer is unwritten. This needs some review of the involved code paths hence sending the fix to stale data first right now and will get to this in a new patch. - v1: https://lore.kernel.org/linux-ext4/cover.1693909504.git.ojaswin@linux.ibm.com/ ** Original Cover ** The detailed report on the issues faced and the root cause can be found in the commit message. I've intentionally added all the details to commit message so that it can be tracked in the future, let me know if its too long and I can try stripping some info. For this particular fix, I've tested these patches with xfstests -g quick with: - 64k block size, 64k pagesize - 4k blocksize 64k pagesize - both with and without nodelalloc and I don't see any regressions. I'll plan to run more tests on this and report back if I notice anything. Suggestions or ideas are welcome. Regards, ojaswin Ojaswin Mujoo (1): ext4: Mark buffer new if it is unwritten to avoid stale data exposure fs/ext4/inode.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) -- 2.39.3 ^ permalink raw reply [flat|nested] 4+ messages in thread
* [PATCH v2 1/1] ext4: Mark buffer new if it is unwritten to avoid stale data exposure 2023-09-16 10:42 [PATCH v2 0/1] ext4: Fix stale data exposure caused with dioread_nolock Ojaswin Mujoo @ 2023-09-16 10:42 ` Ojaswin Mujoo 2023-09-18 7:26 ` Jan Kara 0 siblings, 1 reply; 4+ messages in thread From: Ojaswin Mujoo @ 2023-09-16 10:42 UTC (permalink / raw) To: linux-ext4, Theodore Ts'o; +Cc: Ritesh Harjani, linux-kernel, Jan Kara ** Short Version ** In ext4 with dioread_nolock, we could have a scenario where the bh returned by get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we never zero out the range of bh that is not under write, causing whatever stale data is present in the folio at that time to be written out to disk. To fix this mark the buffer as new in ext4_get_block_unwritten(), in case it is unwritten. ----- ** Long Version ** The issue mentioned above was resulting in two different bugs: 1. On block size < page size case in ext4, generic/269 was reliably failing with dioread_nolock. The state of the write was as follows: * The write was extending i_size. * The last block of the file was fallocated and had an unwritten extent * We were near ENOSPC and hence we were switching to non-delayed alloc allocation. In this case, the back trace that triggers the bug is as follows: ext4_da_write_begin() /* switch to nodelalloc due to low space */ ext4_write_begin() ext4_should_dioread_nolock() // true since mount flags still have delalloc __block_write_begin(..., ext4_get_block_unwritten) __block_write_begin_int() for(each buffer head in page) { /* first iteration, this is bh1 which contains i_size */ if (!buffer_mapped) get_block() /* returns bh with only UNWRITTEN and MAPPED */ /* second iteration, bh2 */ if (!buffer_mapped) get_block() /* we fail here, could be ENOSPC */ } if (err) /* * this would zero out all new buffers and mark them uptodate. * Since bh1 was never marked new, we skip it here which causes * the bug later. */ folio_zero_new_buffers(); /* ext4_wrte_begin() error handling */ ext4_truncate_failed_write() ext4_truncate() ext4_block_truncate_page() __ext4_block_zero_page_range() if(!buffer_uptodate()) ext4_read_bh_lock() ext4_read_bh() -> ... ext4_submit_bh_wbc() BUG_ON(buffer_unwritten(bh)); /* !!! */ 2. The second issue is stale data exposure with page size >= blocksize with dioread_nolock. The conditions needed for it to happen are same as the previous issue ie dioread_nolock around ENOSPC condition. The issue is also similar where in __block_write_begin_int() when we call ext4_get_block_unwritten() on the buffer_head and the underlying extent is unwritten, we get an unwritten and mapped buffer head. Since it is not new, we never zero out the partial range which is not under write, thus writing stale data to disk. This can be easily observed with the following reproducer: fallocate -l 4k testfile xfs_io -c "pwrite 2k 2k" testfile # hexdump output will have stale data in from byte 0 to 2k in testfile hexdump -C testfile NOTE: To trigger this, we need dioread_nolock enabled and write happening via ext4_write_begin(), which is usually used when we have -o nodealloc. Since dioread_nolock is disabled with nodelalloc, the only alternate way to call ext4_write_begin() is to make sure dellayed alloc switches to nodelalloc (ext4_da_write_begin() calls ext4_write_begin()). This will usually happen when ext4 is almost full like the way generic/269 was triggering it in Issue 1 above. This might make this issue harder to replicate hence for reliable replicate, I used the below patch to temporarily allow dioread_nolock with nodelalloc and then mount the disk with -o nodealloc,dioread_nolock. With this you can hit the stale data issue 100% of times: @@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode) if (ext4_should_journal_data(inode)) return 0; /* temporary fix to prevent generic/422 test failures */ - if (!test_opt(inode->i_sb, DELALLOC)) - return 0; + // if (!test_opt(inode->i_sb, DELALLOC)) + // return 0; return 1; } After applying this patch to mark buffer as NEW, both the above issues are fixed. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> --- fs/ext4/inode.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 6c490f05e2ba..8b286a800193 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -789,10 +789,22 @@ int ext4_get_block(struct inode *inode, sector_t iblock, int ext4_get_block_unwritten(struct inode *inode, sector_t iblock, struct buffer_head *bh_result, int create) { + int ret = 0; + ext4_debug("ext4_get_block_unwritten: inode %lu, create flag %d\n", inode->i_ino, create); - return _ext4_get_block(inode, iblock, bh_result, + ret = _ext4_get_block(inode, iblock, bh_result, EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT); + + /* + * If the buffer is marked unwritten, mark it as new to make sure it is + * zeroed out correctly in case of partial writes. Otherwise, there is + * a chance of stale data getting exposed. + */ + if (ret == 0 && buffer_unwritten(bh_result)) + set_buffer_new(bh_result); + + return ret; } /* Maximum number of blocks we map for direct IO at once. */ -- 2.39.3 ^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH v2 1/1] ext4: Mark buffer new if it is unwritten to avoid stale data exposure 2023-09-16 10:42 ` [PATCH v2 1/1] ext4: Mark buffer new if it is unwritten to avoid stale data exposure Ojaswin Mujoo @ 2023-09-18 7:26 ` Jan Kara 2023-09-18 10:20 ` Ojaswin Mujoo 0 siblings, 1 reply; 4+ messages in thread From: Jan Kara @ 2023-09-18 7:26 UTC (permalink / raw) To: Ojaswin Mujoo Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel, Jan Kara On Sat 16-09-23 16:12:13, Ojaswin Mujoo wrote: > ** Short Version ** > > In ext4 with dioread_nolock, we could have a scenario where the bh returned by > get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has > UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we > never zero out the range of bh that is not under write, causing whatever stale > data is present in the folio at that time to be written out to disk. To fix this > mark the buffer as new in ext4_get_block_unwritten(), in case it is unwritten. > > ----- I'm not sure if this separator isn't going to confuse some tools processing patches ;) > ** Long Version ** > > The issue mentioned above was resulting in two different bugs: > > 1. On block size < page size case in ext4, generic/269 was reliably > failing with dioread_nolock. The state of the write was as follows: > > * The write was extending i_size. > * The last block of the file was fallocated and had an unwritten extent > * We were near ENOSPC and hence we were switching to non-delayed alloc > allocation. > > In this case, the back trace that triggers the bug is as follows: > > ext4_da_write_begin() > /* switch to nodelalloc due to low space */ > ext4_write_begin() > ext4_should_dioread_nolock() // true since mount flags still have delalloc > __block_write_begin(..., ext4_get_block_unwritten) > __block_write_begin_int() > for(each buffer head in page) { > /* first iteration, this is bh1 which contains i_size */ > if (!buffer_mapped) > get_block() /* returns bh with only UNWRITTEN and MAPPED */ > /* second iteration, bh2 */ > if (!buffer_mapped) > get_block() /* we fail here, could be ENOSPC */ > } > if (err) > /* > * this would zero out all new buffers and mark them uptodate. > * Since bh1 was never marked new, we skip it here which causes > * the bug later. > */ > folio_zero_new_buffers(); > /* ext4_wrte_begin() error handling */ > ext4_truncate_failed_write() > ext4_truncate() > ext4_block_truncate_page() > __ext4_block_zero_page_range() > if(!buffer_uptodate()) > ext4_read_bh_lock() > ext4_read_bh() -> ... ext4_submit_bh_wbc() > BUG_ON(buffer_unwritten(bh)); /* !!! */ > > 2. The second issue is stale data exposure with page size >= blocksize > with dioread_nolock. The conditions needed for it to happen are same as > the previous issue ie dioread_nolock around ENOSPC condition. The issue > is also similar where in __block_write_begin_int() when we call > ext4_get_block_unwritten() on the buffer_head and the underlying extent > is unwritten, we get an unwritten and mapped buffer head. Since it is > not new, we never zero out the partial range which is not under write, > thus writing stale data to disk. This can be easily observed with the > following reproducer: > > fallocate -l 4k testfile > xfs_io -c "pwrite 2k 2k" testfile > # hexdump output will have stale data in from byte 0 to 2k in testfile > hexdump -C testfile > > NOTE: To trigger this, we need dioread_nolock enabled and write > happening via ext4_write_begin(), which is usually used when we have -o > nodealloc. Since dioread_nolock is disabled with nodelalloc, the only > alternate way to call ext4_write_begin() is to make sure dellayed > alloc switches to nodelalloc (ext4_da_write_begin() calls > ext4_write_begin()). This will usually happen when ext4 is almost full > like the way generic/269 was triggering it in Issue 1 above. This might > make this issue harder to replicate hence for reliable replicate, I used > the below patch to temporarily allow dioread_nolock with nodelalloc and > then mount the disk with -o nodealloc,dioread_nolock. With this you can > hit the stale data issue 100% of times: > > @@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode) > if (ext4_should_journal_data(inode)) > return 0; > /* temporary fix to prevent generic/422 test failures */ > - if (!test_opt(inode->i_sb, DELALLOC)) > - return 0; > + // if (!test_opt(inode->i_sb, DELALLOC)) > + // return 0; > return 1; > } > > After applying this patch to mark buffer as NEW, both the above issues are > fixed. > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Looks good to me. Thanks for fixing this! Feel free to add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > --- > fs/ext4/inode.c | 14 +++++++++++++- > 1 file changed, 13 insertions(+), 1 deletion(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 6c490f05e2ba..8b286a800193 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -789,10 +789,22 @@ int ext4_get_block(struct inode *inode, sector_t iblock, > int ext4_get_block_unwritten(struct inode *inode, sector_t iblock, > struct buffer_head *bh_result, int create) > { > + int ret = 0; > + > ext4_debug("ext4_get_block_unwritten: inode %lu, create flag %d\n", > inode->i_ino, create); > - return _ext4_get_block(inode, iblock, bh_result, > + ret = _ext4_get_block(inode, iblock, bh_result, > EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT); > + > + /* > + * If the buffer is marked unwritten, mark it as new to make sure it is > + * zeroed out correctly in case of partial writes. Otherwise, there is > + * a chance of stale data getting exposed. > + */ > + if (ret == 0 && buffer_unwritten(bh_result)) > + set_buffer_new(bh_result); > + > + return ret; > } > > /* Maximum number of blocks we map for direct IO at once. */ > -- > 2.39.3 > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH v2 1/1] ext4: Mark buffer new if it is unwritten to avoid stale data exposure 2023-09-18 7:26 ` Jan Kara @ 2023-09-18 10:20 ` Ojaswin Mujoo 0 siblings, 0 replies; 4+ messages in thread From: Ojaswin Mujoo @ 2023-09-18 10:20 UTC (permalink / raw) To: Jan Kara; +Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel Hi Jan, Thanks for the review! On Mon, Sep 18, 2023 at 09:26:09AM +0200, Jan Kara wrote: > On Sat 16-09-23 16:12:13, Ojaswin Mujoo wrote: > > ** Short Version ** > > > > In ext4 with dioread_nolock, we could have a scenario where the bh returned by > > get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has > > UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we > > never zero out the range of bh that is not under write, causing whatever stale > > data is present in the folio at that time to be written out to disk. To fix this > > mark the buffer as new in ext4_get_block_unwritten(), in case it is unwritten. > > > > ----- > > I'm not sure if this separator isn't going to confuse some tools processing > patches ;) Oh no, my bad :/ Let me quickly send a v3 to fix this. Thanks for pointing it out. Regards, ojaswin ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2023-09-18 10:21 UTC | newest] Thread overview: 4+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-09-16 10:42 [PATCH v2 0/1] ext4: Fix stale data exposure caused with dioread_nolock Ojaswin Mujoo 2023-09-16 10:42 ` [PATCH v2 1/1] ext4: Mark buffer new if it is unwritten to avoid stale data exposure Ojaswin Mujoo 2023-09-18 7:26 ` Jan Kara 2023-09-18 10:20 ` Ojaswin Mujoo
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).