* [GIT PULL] Btrfs updates for 6.15
@ 2025-03-24 16:37 David Sterba
2025-03-27 20:48 ` pr-tracker-bot
2025-03-28 13:27 ` Josef Bacik
0 siblings, 2 replies; 8+ messages in thread
From: David Sterba @ 2025-03-24 16:37 UTC (permalink / raw)
To: torvalds; +Cc: David Sterba, linux-btrfs, linux-kernel
Hi,
please pull the following btrfs updates, thanks.
User visible changes:
- fall back to buffered write if direct io is done on a file that requires
checksums
- this avoids a problem with checksum mismatch errors, observed e.g. on
virtual images when writes to pages under writeback cause the checksum
mismatch reports
- this may lead to some performance degradation but currently the
recommended setup for VM images is to use the NOCOW file attribute that
also disables checksums
- fast/realtime zstd levels -15 to -1
- supported by mount options (compress=zstd:-5) and defrag ioctl
- improved speed, reduced compression ratio, check the commit for sample
measurements
- defrag ioctl extended to accept negative compression levels
- subpage mode
- remove warning when subpage mode is used, the feature is now reasonably
complete and tested
- in debug mode allow to create 2K b-tree nodes to allow testing subpage on
x86_64 with 4K pages too
Performance improvements:
- in send, better file path caching improves runtime (on sample load by -30%)
- on s390x with hardware zlib support prepare the input buffer in a better way
to get the best results from the acceleration
- minor speed improvement in encoded read, avoid memory allocation in
synchronous mode
Core:
- enable stable writes on inodes, replacing manually waiting for writeback
and allowing to skip that on inodes without checksums
- add last checks and warnings for out-of-band dirty writes to pages,
requiring a fixup ("fixup worker"), this should not be necessary since 5.8
where get_user_page() and pin_user_pages*() prevent this
- long history behind that, we'll be happy to remove the whole infrastructure
in the near future
- more folio API conversions and preparations for large folio support
- subpage cleanups and refactoring, split handling of data and metadata to
allow future support for large folios
- readpage works as block-by-block, no change for normal mode, this is
preparation for future subpage updates
- block group refcount fixes and hardening
- delayed iput fixes
- in zoned mode, fix zone activation on filesystem with missing devices
- cleanups:
- inode parameter cleanups
- path auto-freeing updates
- code flow simplifications in send
- redundant parameter cleanups
----------------------------------------------------------------
The following changes since commit 4701f33a10702d5fc577c32434eb62adde0a1ae1:
Linux 6.14-rc7 (2025-03-16 12:55:17 -1000)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git tags/for-6.15-tag
for you to fetch changes up to 35fec1089ebb5617f85884d3fa6a699ce6337a75:
btrfs: zoned: fix zone finishing with missing devices (2025-03-18 20:35:57 +0100)
----------------------------------------------------------------
Anand Jain (1):
btrfs: sysfs: accept size suffixes for read policy values
Boris Burkov (5):
btrfs: fix block group refcount race in btrfs_create_pending_block_groups()
btrfs: harden block_group::bg_list against list_del() races
btrfs: make btrfs_discard_workfn() block_group ref explicit
btrfs: explicitly ref count block_group on new_bgs list
btrfs: codify pattern for adding block_group to bg_list
Dan Carpenter (1):
btrfs: return a literal instead of a variable in btrfs_init_dev_replace()
Daniel Vacek (3):
btrfs: keep private struct on stack for sync reads in btrfs_encoded_read_regular_fill_pages()
btrfs: zstd: enable negative compression levels mount option
btrfs: defrag: extend ioctl to accept compression levels
David Sterba (55):
btrfs: add __cold attribute to extent_io_tree_panic()
btrfs: async-thread: switch local variables need_order bool
btrfs: zstd: move zstd_parameters to the workspace
btrfs: zstd: remove local variable for storing page offsets
btrfs: unify ordering of btrfs_key initializations
btrfs: simplify returns and labels in btrfs_init_fs_root()
btrfs: update include and forward declarations in headers
btrfs: pass struct btrfs_inode to can_nocow_extent()
btrfs: pass struct btrfs_inode to extent_range_clear_dirty_for_io()
btrfs: pass struct btrfs_inode to btrfs_read_locked_inode()
btrfs: pass struct btrfs_inode to btrfs_iget_locked()
btrfs: pass struct btrfs_inode to new_simple_dir()
btrfs: pass struct btrfs_inode to btrfs_inode_type()
btrfs: pass struct btrfs_inode to btrfs_defrag_file()
btrfs: use struct btrfs_inode inside create_pending_snapshot()
btrfs: pass struct btrfs_inode to fill_stack_inode_item()
btrfs: pass struct btrfs_inode to btrfs_fill_inode()
btrfs: pass struct btrfs_inode to btrfs_load_inode_props()
btrfs: pass struct btrfs_inode to btrfs_inode_inherit_props()
btrfs: props: switch prop_handler::apply to struct btrfs_inode
btrfs: props: switch prop_handler::extract to struct btrfs_inode
btrfs: pass struct btrfs_inode to clone_copy_inline_extent()
btrfs: pass struct btrfs_inode to btrfs_double_mmap_lock()
btrfs: pass struct btrfs_inode to btrfs_double_mmap_unlock()
btrfs: pass struct btrfs_inode to btrfs_extent_same_range()
btrfs: use struct btrfs_inode inside btrfs_remap_file_range()
btrfs: use struct btrfs_inode inside btrfs_remap_file_range_prep()
btrfs: use struct btrfs_inode inside btrfs_get_parent()
btrfs: use struct btrfs_inode inside btrfs_get_name()
btrfs: don't pass nodesize to __alloc_extent_buffer()
btrfs: merge alloc_dummy_extent_buffer() helpers
btrfs: simplify parameters of metadata folio helpers
btrfs: add __pure attribute to eb page and folio counters
btrfs: use num_extent_folios() in for loop bounds
btrfs: do trivial BTRFS_PATH_AUTO_FREE conversions
btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_init_dev_replace()
btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_run_dev_replace()
btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_check_dir_item_collision()
btrfs: use BTRFS_PATH_AUTO_FREE in load_global_roots()
btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_init_root_free_objectid()
btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_get_name()
btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_lookup_extent_info()
btrfs: use BTRFS_PATH_AUTO_FREE in run_delayed_extent_op()
btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_lookup_bio_sums()
btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_remove_free_space_inode()
btrfs: use BTRFS_PATH_AUTO_FREE in populate_free_space_tree()
btrfs: use BTRFS_PATH_AUTO_FREE in clear_free_space_tree()
btrfs: use BTRFS_PATH_AUTO_FREE in load_free_space_tree()
btrfs: parameter constification in ioctl.c
btrfs: pass btrfs_root pointers to send ioctl parameters
btrfs: pass root pointers to search tree ioctl helpers
btrfs: pass struct btrfs_inode to btrfs_sync_inode_flags_to_i_flags()
btrfs: simplify local variables in btrfs_ioctl_resize()
btrfs: pass struct to btrfs_ioctl_subvol_getflags()
btrfs: unify inode variable naming
Filipe Manana (57):
btrfs: avoid assigning twice to block_start at btrfs_do_readpage()
btrfs: send: remove duplicated logic from fs_path_reset()
btrfs: send: make fs_path_len() inline and constify its argument
btrfs: send: always use fs_path_len() to determine a path's length
btrfs: send: simplify return logic from fs_path_prepare_for_add()
btrfs: send: simplify return logic from fs_path_add()
btrfs: send: implement fs_path_add_path() using fs_path_add()
btrfs: send: simplify return logic from fs_path_add_from_extent_buffer()
btrfs: send: return -ENAMETOOLONG when attempting a path that is too long
btrfs: send: simplify return logic from __get_cur_name_and_parent()
btrfs: send: simplify return logic from is_inode_existent()
btrfs: send: simplify return logic from get_cur_inode_state()
btrfs: send: factor out common logic when sending xattrs
btrfs: send: only use boolean variables at process_recorded_refs()
btrfs: send: add and use helper to rename current inode when processing refs
btrfs: send: simplify return logic from send_remove_xattr()
btrfs: send: simplify return logic from record_new_ref_if_needed()
btrfs: send: simplify return logic from record_deleted_ref_if_needed()
btrfs: send: simplify return logic from record_new_ref()
btrfs: send: simplify return logic from record_deleted_ref()
btrfs: send: simplify return logic from record_changed_ref()
btrfs: send: remove unnecessary return variable from process_new_xattr()
btrfs: send: simplify return logic from process_changed_xattr()
btrfs: send: simplify return logic from send_verity()
btrfs: send: simplify return logic from send_rename()
btrfs: send: simplify return logic from send_link()
btrfs: send: simplify return logic from send_unlink()
btrfs: send: simplify return logic from send_rmdir()
btrfs: send: keep the current inode's path cached
btrfs: send: avoid path allocation for the current inode when issuing commands
btrfs: send: simplify return logic from send_set_xattr()
btrfs: get zone unusable bytes while holding lock at btrfs_reclaim_bgs_work()
btrfs: get used bytes while holding lock at btrfs_reclaim_bgs_work()
btrfs: fix reclaimed bytes accounting after automatic block group reclaim
btrfs: fix non-empty delayed iputs list on unmount due to compressed write workers
btrfs: move __btrfs_bio_end_io() code into its single caller
btrfs: move btrfs_cleanup_bio() code into its single caller
btrfs: fix non-empty delayed iputs list on unmount due to async workers
btrfs: avoid unnecessary bio dereference at run_one_async_done()
btrfs: send: remove unnecessary inode lookup at send_encoded_inline_extent()
btrfs: send: simplify return logic from send_encoded_extent()
btrfs: return a btrfs_inode from btrfs_iget_logging()
btrfs: return a btrfs_inode from read_one_inode()
btrfs: pass a btrfs_inode to fixup_inode_link_count()
btrfs: make btrfs_iget() return a btrfs inode instead
btrfs: make btrfs_iget_path() return a btrfs inode instead
btrfs: remove unnecessary fs_info argument from create_reloc_inode()
btrfs: remove unnecessary fs_info argument from delete_block_group_cache()
btrfs: remove unnecessary fs_info argument from btrfs_add_block_group_cache()
btrfs: tests: fix chunk map leak after failure to add it to the tree
btrfs: avoid unnecessary memory allocation and copy at overwrite_item()
btrfs: use variables to store extent buffer and slot at overwrite_item()
btrfs: update outdated comment for overwrite_item()
btrfs: use memcmp_extent_buffer() at replay_one_extent()
btrfs: remove redundant else statement from btrfs_log_inode_parent()
btrfs: simplify condition for logging new dentries at btrfs_log_inode_parent()
btrfs: remove end_no_trans label from btrfs_log_inode_parent()
Johannes Thumshirn (3):
btrfs: zoned: exit btrfs_can_activate_zone if BTRFS_FS_NEED_ZONE_FINISH is set
btrfs: zoned: fix zone activation with missing devices
btrfs: zoned: fix zone finishing with missing devices
Mark Harmstone (2):
btrfs: avoid linker error in btrfs_find_create_tree_block()
btrfs: don't clobber ret in btrfs_validate_super()
Matthew Wilcox (Oracle) (2):
btrfs: update some folio related comments
btrfs: convert io_ctl_prepare_pages() to work on folios
Qu Wenruo (32):
btrfs: remove duplicated metadata folio flag update in end_bbio_meta_read()
btrfs: always fallback to buffered write if the inode requires checksum
btrfs: zlib: refactor S390x HW acceleration buffer preparation
btrfs: expose per-inode stable writes flag
btrfs: factor out nocow ordered extent and extent map generation into a helper
btrfs: move ordered extent cleanup to where they are allocated
btrfs: remove btrfs_fs_info::sectors_per_page
btrfs: factor out metadata subpage detection into a dedicated helper
btrfs: make subpage attach and detach handle metadata properly
btrfs: use metadata specific helpers to simplify extent buffer helpers
btrfs: simplify subpage handling of btrfs_clear_buffer_dirty()
btrfs: simplify subpage handling of write_one_eb()
btrfs: simplify subpage handling of read_extent_buffer_pages_nowait()
btrfs: require strict data/metadata split for subpage checks
btrfs: prevent inline data extents read from touching blocks beyond its range
btrfs: fix the qgroup data free range for inline data extents
btrfs: introduce a read path dedicated extent lock helper
btrfs: make btrfs_do_readpage() to do block-by-block read
btrfs: allow buffered write to avoid full page read if it's block aligned
btrfs: allow inline data extents creation if block size < page size
btrfs: remove the subpage related warning message
btrfs: properly limit inline data extent according to block size
btrfs: allow debug builds to accept 2K block size
btrfs: reject out-of-band dirty folios during writeback
btrfs: run btrfs_error_commit_super() early
btrfs: add extra warning if delayed iput is added when it's not allowed
btrfs: subpage: make btrfs_is_subpage() check against a folio
btrfs: add a size parameter to btrfs_alloc_subpage()
btrfs: replace PAGE_SIZE with folio_size for subpage.[ch]
btrfs: prepare btrfs_launcher_folio() for large folios support
btrfs: prepare extent_io.c for future large folio support
btrfs: prepare btrfs_page_mkwrite() for large folios
Sun YangKai (3):
btrfs: simplify the return value handling in search_ioctl()
btrfs: remove unnecessary btrfs_key local variable in btrfs_search_forward()
btrfs: avoid redundant path slot assignment in btrfs_search_forward()
fs/btrfs/accessors.h | 1 +
fs/btrfs/acl.h | 2 +
fs/btrfs/async-thread.c | 11 +-
fs/btrfs/backref.c | 4 +-
fs/btrfs/bio.c | 38 +--
fs/btrfs/block-group.c | 155 ++++++----
fs/btrfs/btrfs_inode.h | 17 +-
fs/btrfs/compression.c | 31 +-
fs/btrfs/compression.h | 26 +-
fs/btrfs/ctree.c | 18 +-
fs/btrfs/ctree.h | 2 +-
fs/btrfs/defrag.c | 78 ++---
fs/btrfs/defrag.h | 4 +-
fs/btrfs/delayed-inode.c | 99 ++++---
fs/btrfs/delayed-inode.h | 2 +-
fs/btrfs/delayed-ref.h | 2 +
fs/btrfs/dev-replace.c | 33 +--
fs/btrfs/dir-item.c | 24 +-
fs/btrfs/dir-item.h | 1 +
fs/btrfs/direct-io.c | 19 +-
fs/btrfs/direct-io.h | 2 +
fs/btrfs/discard.c | 34 ++-
fs/btrfs/discard.h | 1 +
fs/btrfs/disk-io.c | 109 ++++---
fs/btrfs/export.c | 51 ++--
fs/btrfs/extent-io-tree.c | 8 +-
fs/btrfs/extent-tree.c | 63 ++--
fs/btrfs/extent-tree.h | 1 -
fs/btrfs/extent_io.c | 589 ++++++++++++++++++++++----------------
fs/btrfs/extent_io.h | 9 +-
fs/btrfs/file-item.c | 30 +-
fs/btrfs/file-item.h | 2 +
fs/btrfs/file.c | 28 +-
fs/btrfs/file.h | 2 +
fs/btrfs/free-space-cache.c | 57 ++--
fs/btrfs/free-space-tree.c | 45 ++-
fs/btrfs/fs.c | 1 -
fs/btrfs/fs.h | 26 +-
fs/btrfs/inode-item.c | 6 +-
fs/btrfs/inode.c | 587 +++++++++++++++++++------------------
fs/btrfs/ioctl.c | 217 +++++++-------
fs/btrfs/ioctl.h | 4 +-
fs/btrfs/locking.c | 1 -
fs/btrfs/ordered-data.c | 23 +-
fs/btrfs/ordered-data.h | 9 +-
fs/btrfs/print-tree.h | 2 +
fs/btrfs/props.c | 66 ++---
fs/btrfs/props.h | 8 +-
fs/btrfs/qgroup.c | 2 +-
fs/btrfs/qgroup.h | 3 +
fs/btrfs/raid-stripe-tree.h | 1 +
fs/btrfs/reflink.c | 100 +++----
fs/btrfs/relocation.c | 30 +-
fs/btrfs/scrub.c | 4 +-
fs/btrfs/send.c | 544 ++++++++++++++++-------------------
fs/btrfs/send.h | 4 +-
fs/btrfs/space-info.c | 2 +-
fs/btrfs/subpage.c | 224 +++++++++------
fs/btrfs/subpage.h | 56 +++-
fs/btrfs/super.c | 6 +-
fs/btrfs/sysfs.c | 14 +-
fs/btrfs/sysfs.h | 1 +
fs/btrfs/tests/extent-io-tests.c | 6 +-
fs/btrfs/tests/extent-map-tests.c | 1 +
fs/btrfs/transaction.c | 39 ++-
fs/btrfs/tree-log.c | 392 ++++++++++++-------------
fs/btrfs/verity.c | 4 +-
fs/btrfs/volumes.c | 16 +-
fs/btrfs/volumes.h | 4 +
fs/btrfs/xattr.h | 2 +
fs/btrfs/zlib.c | 83 ++++--
fs/btrfs/zoned.c | 9 +
fs/btrfs/zstd.c | 66 +++--
include/uapi/linux/btrfs.h | 16 +-
74 files changed, 2263 insertions(+), 1914 deletions(-)
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: [GIT PULL] Btrfs updates for 6.15 2025-03-24 16:37 [GIT PULL] Btrfs updates for 6.15 David Sterba @ 2025-03-27 20:48 ` pr-tracker-bot 2025-03-28 13:27 ` Josef Bacik 1 sibling, 0 replies; 8+ messages in thread From: pr-tracker-bot @ 2025-03-27 20:48 UTC (permalink / raw) To: David Sterba; +Cc: torvalds, David Sterba, linux-btrfs, linux-kernel The pull request you sent on Mon, 24 Mar 2025 17:37:51 +0100: > git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git tags/for-6.15-tag has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/fd71def6d9abc5ae362fb9995d46049b7b0ed391 Thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/prtracker.html ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [GIT PULL] Btrfs updates for 6.15 2025-03-24 16:37 [GIT PULL] Btrfs updates for 6.15 David Sterba 2025-03-27 20:48 ` pr-tracker-bot @ 2025-03-28 13:27 ` Josef Bacik 2025-03-28 17:36 ` David Sterba 1 sibling, 1 reply; 8+ messages in thread From: Josef Bacik @ 2025-03-28 13:27 UTC (permalink / raw) To: David Sterba; +Cc: linux-btrfs On Mon, Mar 24, 2025 at 05:37:51PM +0100, David Sterba wrote: > Hi, > > please pull the following btrfs updates, thanks. > > User visible changes: > > - fall back to buffered write if direct io is done on a file that requires > checksums <trimming the everybody linux-btrfs from the cc list> What? We use this constantly in a bunch of places to avoid page cache overhead, it's perfectly legitimate to do DIO to a file that requires checksums. Does the vm case mess this up? Absolutely, but that's why we say use NOCOW for that case. We've always had this behavior, we've always been clear that if you break it you buy it. This is a huge regression for a pretty significant use case. Thanks, Josef ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [GIT PULL] Btrfs updates for 6.15 2025-03-28 13:27 ` Josef Bacik @ 2025-03-28 17:36 ` David Sterba 2025-03-28 19:39 ` Josef Bacik 0 siblings, 1 reply; 8+ messages in thread From: David Sterba @ 2025-03-28 17:36 UTC (permalink / raw) To: Josef Bacik; +Cc: David Sterba, linux-btrfs, Qu Wenruo On Fri, Mar 28, 2025 at 09:27:51AM -0400, Josef Bacik wrote: > On Mon, Mar 24, 2025 at 05:37:51PM +0100, David Sterba wrote: > > Hi, > > > > please pull the following btrfs updates, thanks. > > > > User visible changes: > > > > - fall back to buffered write if direct io is done on a file that requires > > checksums > > <trimming the everybody linux-btrfs from the cc list> > > What? We use this constantly in a bunch of places to avoid page cache overhead, > it's perfectly legitimate to do DIO to a file that requires checksums. Does the > vm case mess this up? Absolutely, but that's why we say use NOCOW for that > case. We've always had this behavior, we've always been clear that if you break > it you buy it. This is a huge regression for a pretty significant use case. The patch has been up for like 2 months and you could have said "don't because reasons" any time before the pull request. Now we're left with a revert, or other alternatives making the use cases working. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [GIT PULL] Btrfs updates for 6.15 2025-03-28 17:36 ` David Sterba @ 2025-03-28 19:39 ` Josef Bacik 2025-03-28 21:36 ` Qu Wenruo ` (2 more replies) 0 siblings, 3 replies; 8+ messages in thread From: Josef Bacik @ 2025-03-28 19:39 UTC (permalink / raw) To: David Sterba; +Cc: David Sterba, linux-btrfs, Qu Wenruo On Fri, Mar 28, 2025 at 06:36:44PM +0100, David Sterba wrote: > On Fri, Mar 28, 2025 at 09:27:51AM -0400, Josef Bacik wrote: > > On Mon, Mar 24, 2025 at 05:37:51PM +0100, David Sterba wrote: > > > Hi, > > > > > > please pull the following btrfs updates, thanks. > > > > > > User visible changes: > > > > > > - fall back to buffered write if direct io is done on a file that requires > > > checksums > > > > <trimming the everybody linux-btrfs from the cc list> > > > > What? We use this constantly in a bunch of places to avoid page cache overhead, > > it's perfectly legitimate to do DIO to a file that requires checksums. Does the > > vm case mess this up? Absolutely, but that's why we say use NOCOW for that > > case. We've always had this behavior, we've always been clear that if you break > > it you buy it. This is a huge regression for a pretty significant use case. > > The patch has been up for like 2 months and you could have said "don't > because reasons" any time before the pull request. Now we're left with a > revert, or other alternatives making the use cases working. Boris told me about this and I forgot. I took everybody off the CC list because I don't want to revert it, in fact generally speaking I'd love to never have these style of bug reports again. But it is a pretty significant change. Are we ok going forward saying you don't get O_DIRECT unless you want NOCOW? Should we maybe allow for users to indicate they're not dumb and can be trusted to do O_DIRECT properly? I just think this opens us up to a lot more uncomfortable conversations than the other behavior. I personally think this is better, if it's been sitting there for 2 months then hooray we're in agreement. But I'm also worried it'll come back to bite us. Thanks, Josef ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [GIT PULL] Btrfs updates for 6.15 2025-03-28 19:39 ` Josef Bacik @ 2025-03-28 21:36 ` Qu Wenruo 2025-03-31 9:35 ` Daniel Vacek 2025-03-31 23:58 ` David Sterba 2 siblings, 0 replies; 8+ messages in thread From: Qu Wenruo @ 2025-03-28 21:36 UTC (permalink / raw) To: Josef Bacik, David Sterba; +Cc: David Sterba, linux-btrfs 在 2025/3/29 06:09, Josef Bacik 写道: > On Fri, Mar 28, 2025 at 06:36:44PM +0100, David Sterba wrote: >> On Fri, Mar 28, 2025 at 09:27:51AM -0400, Josef Bacik wrote: >>> On Mon, Mar 24, 2025 at 05:37:51PM +0100, David Sterba wrote: >>>> Hi, >>>> >>>> please pull the following btrfs updates, thanks. >>>> >>>> User visible changes: >>>> >>>> - fall back to buffered write if direct io is done on a file that requires >>>> checksums >>> >>> <trimming the everybody linux-btrfs from the cc list> >>> >>> What? We use this constantly in a bunch of places to avoid page cache overhead, >>> it's perfectly legitimate to do DIO to a file that requires checksums. Does the >>> vm case mess this up? Absolutely, but that's why we say use NOCOW for that >>> case. We've always had this behavior, we've always been clear that if you break >>> it you buy it. This is a huge regression for a pretty significant use case. BTW, it's not clear, it's not in the documents (until recently, along with this change), and no one is really explaining that to end users until now. If you're a experienced fs/mm developer, sure it's obvious, but most users (millions of VM users who choose to use none cache mode) are not so experienced. Future more, a lot of such VM users are just trying to reduce page cache memory usage, not so about the zero-copy performance. And the buffered fallback should still make a lot of sense for them, and still allow working data checksum (which they may care more than the performance). >> >> The patch has been up for like 2 months and you could have said "don't >> because reasons" any time before the pull request. Now we're left with a >> revert, or other alternatives making the use cases working. > > Boris told me about this and I forgot. I took everybody off the CC list because > I don't want to revert it, in fact generally speaking I'd love to never have > these style of bug reports again. > > But it is a pretty significant change. Are we ok going forward saying you don't > get O_DIRECT unless you want NOCOW? Despite NOCOW, there are still other things (like alignment) can make O_DIRECT to fall back buffered, without even letting the users know. And it is always fs specific on whether it choose to fall buffered or return -EINVAL. So in the first place there is never any guarantee that O_DIRECT always results in zero-copy writes. > Should we maybe allow for users to indicate > they're not dumb and can be trusted to do O_DIRECT properly? I just think this > opens us up to a lot more uncomfortable conversations than the other behavior. For whatever reason, no matter dumb users or not, all operations through a fs should not lead to data/metadata corruption, no matter if it's real corruption or just false csum mismatch. (Not to mention it's not easy to fix the csum mismatch either) If dumb programs can still corrupt the fs, then it's a bug in the fs. To me, this is very straightforward, and the priority should be obvious. > > I personally think this is better, if it's been sitting there for 2 months then > hooray we're in agreement. But I'm also worried it'll come back to bite us. I'd say, let it bite. If it's some performance critical use cases, I believe the developer should be reasonable enough to be persuaded (either choose perf and go NODATASUM, or choose datacsum and accept the perf drop). If it's newbie level usages, they shouldn't even notice any difference, except the fact btrfs no longer corrupts their files. Even if it really bites in the future, I doubt it will be any worse. On the other hand, the data csum mismatch for VM images are already biting us for too long, and it's really damaging the reputation of Btrfs. Thanks, Qu > Thanks, > > Josef ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [GIT PULL] Btrfs updates for 6.15 2025-03-28 19:39 ` Josef Bacik 2025-03-28 21:36 ` Qu Wenruo @ 2025-03-31 9:35 ` Daniel Vacek 2025-03-31 23:58 ` David Sterba 2 siblings, 0 replies; 8+ messages in thread From: Daniel Vacek @ 2025-03-31 9:35 UTC (permalink / raw) To: Josef Bacik; +Cc: David Sterba, David Sterba, linux-btrfs, Qu Wenruo On Fri, 28 Mar 2025 at 20:40, Josef Bacik <josef@toxicpanda.com> wrote: > > ... Should we maybe allow for users to indicate > they're not dumb and can be trusted to do O_DIRECT properly? I just think this Seriously, I don't think it's a good idea. Though, I'd vote for adding a tracing point so that people chasing a performance issue/regression can check if this is possibly the cause. Yet, that wouldn't restore the performance for you, of course. Anyways, if you really trust your app that much (hey, it's your call, right?) you can always patch *your* kernel. But IMO such a hack does not belong to the upstream kernel. As Qu mentioned, the FS needs to stay consistent and not allow users to shoot themselves into a leg. --nX > > Josef > ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [GIT PULL] Btrfs updates for 6.15 2025-03-28 19:39 ` Josef Bacik 2025-03-28 21:36 ` Qu Wenruo 2025-03-31 9:35 ` Daniel Vacek @ 2025-03-31 23:58 ` David Sterba 2 siblings, 0 replies; 8+ messages in thread From: David Sterba @ 2025-03-31 23:58 UTC (permalink / raw) To: Josef Bacik; +Cc: David Sterba, David Sterba, linux-btrfs, Qu Wenruo On Fri, Mar 28, 2025 at 03:39:27PM -0400, Josef Bacik wrote: > On Fri, Mar 28, 2025 at 06:36:44PM +0100, David Sterba wrote: > > On Fri, Mar 28, 2025 at 09:27:51AM -0400, Josef Bacik wrote: > > > On Mon, Mar 24, 2025 at 05:37:51PM +0100, David Sterba wrote: > > > > Hi, > > > > > > > > please pull the following btrfs updates, thanks. > > > > > > > > User visible changes: > > > > > > > > - fall back to buffered write if direct io is done on a file that requires > > > > checksums > > > > > > <trimming the everybody linux-btrfs from the cc list> > > > > > > What? We use this constantly in a bunch of places to avoid page cache overhead, > > > it's perfectly legitimate to do DIO to a file that requires checksums. Does the > > > vm case mess this up? Absolutely, but that's why we say use NOCOW for that > > > case. We've always had this behavior, we've always been clear that if you break > > > it you buy it. This is a huge regression for a pretty significant use case. > > > > The patch has been up for like 2 months and you could have said "don't > > because reasons" any time before the pull request. Now we're left with a > > revert, or other alternatives making the use cases working. > > Boris told me about this and I forgot. I took everybody off the CC list because > I don't want to revert it, in fact generally speaking I'd love to never have > these style of bug reports again. > > But it is a pretty significant change. Are we ok going forward saying you don't > get O_DIRECT unless you want NOCOW? Should we maybe allow for users to indicate > they're not dumb and can be trusted to do O_DIRECT properly? I just think this > opens us up to a lot more uncomfortable conversations than the other behavior. > > I personally think this is better, if it's been sitting there for 2 months then > hooray we're in agreement. But I'm also worried it'll come back to bite us. I think we're in agreement where the fix is going and I also agree this is a significant change and can indeed become worse in some sense than what we have now. We already have exceptions for direct io when combined with various features, most notably with compression where it falls back to buffered right away. We may need to enumerate and document the common combinations and decide what to do about them and what are potential drawbacks. The upcoming uncached IO support can possibly fill some gaps. The fix is going towards correctness at the cost of performance and arguably this should have been the default already. We ended up with many people noticing checksum mismatches in syslog due to the VM setup. It's rather bad usability to tell people to ignore certain errors that would otherwise be quite significant. Some of the combinations: 1) dio + no csum + no cow -> true direct io 2) dio + no csum + cow -> almost direct io 3) dio + csum + cow + no compression -> fallback to buffered 4) dio + csum + cow compression -> fallback to buffered The changed case is 3. Quoting your reply: > We use this constantly in a bunch of places to avoid page cache overhead, > it's perfectly legitimate to do DIO to a file that requires checksums. So if you want to avoid page cache overhead the uncached mode does not help that much I guess, the pages and related structures still need to be set up. I'm wondering if the use of direct io is more for convenience (not to pollute the page cache) or for the real expectations of the performance gains when the caching is done on the application side (e.g. databases). The mutually exclusive feature is checksumming, so if the direct io is for performance then the application could also turn off the checksums. We don't have a magic bit and opening mode that can keep the dio + csum working as before, and currently don't have a suggestion how to implement it. Seems that one way or another the user application will have to do some work to get best the performance. ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2025-03-31 23:58 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-03-24 16:37 [GIT PULL] Btrfs updates for 6.15 David Sterba 2025-03-27 20:48 ` pr-tracker-bot 2025-03-28 13:27 ` Josef Bacik 2025-03-28 17:36 ` David Sterba 2025-03-28 19:39 ` Josef Bacik 2025-03-28 21:36 ` Qu Wenruo 2025-03-31 9:35 ` Daniel Vacek 2025-03-31 23:58 ` David Sterba
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox