public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Boris Burkov <boris@bur.io>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] btrfs: fix missing last_unlink_trans update when removing a directory
Date: Thu, 9 Apr 2026 09:29:16 -0700	[thread overview]
Message-ID: <20260409162846.GA2654001@zen.localdomain> (raw)
In-Reply-To: <3f193ac4b0faadc24d718c3633f8c1a9c61a687c.1775747553.git.fdmanana@suse.com>

On Thu, Apr 09, 2026 at 04:21:15PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When removing a directory we are not updating its last_unlink_trans field,
> which can result in incorrect fsync behaviour in case some one fsyncs the
> directory after it was removed because it's holding a file descriptor on
> it.
> 
> Example scenario:
> 
>    mkdir /mnt/dir1
>    mkdir /mnt/dir1/dir2
>    mkdir /mnt/dir3
> 
>    sync -f /mnt
> 
>    # Do some change to the directory and fsync it.
>    chmod 700 /mnt/dir1
>    xfs_io -c fsync /mnt/dir1
> 
>    # Move dir2 out of dir1 so that dir1 becomes empty.
>    mv /mnt/dir1/dir2 /mnt/dir3/
> 
>    open fd on /mnt/dir1
>    call rmdir(2) on path "/mnt/dir1"
>    fsync fd
> 
>    <trigger power failure>
> 
> When attempting to mount the filesystem, the log replay will fail with
> an -EIO error and dmesg/syslog has the following:
> 
>    [445771.626482] BTRFS info (device dm-0): first mount of filesystem 0368bbea-6c5e-44b5-b409-09abe496e650
>    [445771.626486] BTRFS info (device dm-0): using crc32c checksum algorithm
>    [445771.627912] BTRFS info (device dm-0): start tree-log replay
>    [445771.628335] page: refcount:2 mapcount:0 mapping:0000000061443ddc index:0x1d00 pfn:0x7072a5
>    [445771.629453] memcg:ffff89f400351b00
>    [445771.629892] aops:btree_aops [btrfs] ino:1
>    [445771.630737] flags: 0x17fffc00000402a(uptodate|lru|private|writeback|node=0|zone=2|lastcpupid=0x1ffff)
>    [445771.632359] raw: 017fffc00000402a fffff47284d950c8 fffff472907b7c08 ffff89f458e412b8
>    [445771.633713] raw: 0000000000001d00 ffff89f6c51d1a90 00000002ffffffff ffff89f400351b00
>    [445771.635029] page dumped because: eb page dump
>    [445771.635825] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=10 ino=258, invalid nlink: has 2 expect no more than 1 for dir
>    [445771.638088] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14878 owner 5
>    [445771.638091] BTRFS info (device dm-0): refs 4 lock_owner 0 current 3581087
>    [445771.638094] 	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
>    [445771.638097] 		inode generation 3 transid 9 size 16 nbytes 16384
>    [445771.638098] 		block group 0 mode 40755 links 1 uid 0 gid 0
>    [445771.638100] 		rdev 0 sequence 2 flags 0x0
>    [445771.638102] 		atime 1775744884.0
>    [445771.660056] 		ctime 1775744885.645502983
>    [445771.660058] 		mtime 1775744885.645502983
>    [445771.660060] 		otime 1775744884.0
>    [445771.660062] 	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
>    [445771.660064] 		index 0 name_len 2
>    [445771.660066] 	item 2 key (256 DIR_ITEM 1843588421) itemoff 16077 itemsize 34
>    [445771.660068] 		location key (259 1 0) type 2
>    [445771.660070] 		transid 9 data_len 0 name_len 4
>    [445771.660075] 	item 3 key (256 DIR_ITEM 2363071922) itemoff 16043 itemsize 34
>    [445771.660076] 		location key (257 1 0) type 2
>    [445771.660077] 		transid 9 data_len 0 name_len 4
>    [445771.660078] 	item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34
>    [445771.660079] 		location key (257 1 0) type 2
>    [445771.660080] 		transid 9 data_len 0 name_len 4
>    [445771.660081] 	item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34
>    [445771.660082] 		location key (259 1 0) type 2
>    [445771.660083] 		transid 9 data_len 0 name_len 4
>    [445771.660084] 	item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160
>    [445771.660086] 		inode generation 9 transid 9 size 8 nbytes 0
>    [445771.660087] 		block group 0 mode 40777 links 1 uid 0 gid 0
>    [445771.660088] 		rdev 0 sequence 2 flags 0x0
>    [445771.660089] 		atime 1775744885.641174097
>    [445771.660090] 		ctime 1775744885.645502983
>    [445771.660091] 		mtime 1775744885.645502983
>    [445771.660105] 		otime 1775744885.641174097
>    [445771.660106] 	item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14
>    [445771.660107] 		index 2 name_len 4
>    [445771.660108] 	item 8 key (257 DIR_ITEM 2676584006) itemoff 15767 itemsize 34
>    [445771.660109] 		location key (258 1 0) type 2
>    [445771.660110] 		transid 9 data_len 0 name_len 4
>    [445771.660111] 	item 9 key (257 DIR_INDEX 2) itemoff 15733 itemsize 34
>    [445771.660112] 		location key (258 1 0) type 2
>    [445771.660113] 		transid 9 data_len 0 name_len 4
>    [445771.660114] 	item 10 key (258 INODE_ITEM 0) itemoff 15573 itemsize 160
>    [445771.660115] 		inode generation 9 transid 10 size 0 nbytes 0
>    [445771.660116] 		block group 0 mode 40755 links 2 uid 0 gid 0
>    [445771.660117] 		rdev 0 sequence 0 flags 0x0
>    [445771.660118] 		atime 1775744885.645502983
>    [445771.660119] 		ctime 1775744885.645502983
>    [445771.660120] 		mtime 1775744885.645502983
>    [445771.660121] 		otime 1775744885.645502983
>    [445771.660122] 	item 11 key (258 INODE_REF 257) itemoff 15559 itemsize 14
>    [445771.660123] 		index 2 name_len 4
>    [445771.660124] 	item 12 key (258 INODE_REF 259) itemoff 15545 itemsize 14
>    [445771.660125] 		index 2 name_len 4
>    [445771.660126] 	item 13 key (259 INODE_ITEM 0) itemoff 15385 itemsize 160
>    [445771.660127] 		inode generation 9 transid 10 size 8 nbytes 0
>    [445771.660128] 		block group 0 mode 40755 links 1 uid 0 gid 0
>    [445771.660129] 		rdev 0 sequence 1 flags 0x0
>    [445771.660130] 		atime 1775744885.645502983
>    [445771.660130] 		ctime 1775744885.645502983
>    [445771.660131] 		mtime 1775744885.645502983
>    [445771.660132] 		otime 1775744885.645502983
>    [445771.660133] 	item 14 key (259 INODE_REF 256) itemoff 15371 itemsize 14
>    [445771.660134] 		index 3 name_len 4
>    [445771.660135] 	item 15 key (259 DIR_ITEM 2676584006) itemoff 15337 itemsize 34
>    [445771.660136] 		location key (258 1 0) type 2
>    [445771.660137] 		transid 10 data_len 0 name_len 4
>    [445771.660138] 	item 16 key (259 DIR_INDEX 2) itemoff 15303 itemsize 34
>    [445771.660139] 		location key (258 1 0) type 2
>    [445771.660140] 		transid 10 data_len 0 name_len 4
>    [445771.660144] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected
>    [445771.661650] ------------[ cut here ]------------
>    [445771.662358] WARNING: fs/btrfs/disk-io.c:326 at btree_csum_one_bio+0x217/0x230 [btrfs], CPU#8: mount/3581087
>    [445771.663588] Modules linked in: btrfs f2fs xfs (...)
>    [445771.671229] CPU: 8 UID: 0 PID: 3581087 Comm: mount Tainted: G        W           7.0.0-rc6-btrfs-next-230+ #2 PREEMPT(full)
>    [445771.672575] Tainted: [W]=WARN
>    [445771.672987] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
>    [445771.674460] RIP: 0010:btree_csum_one_bio+0x217/0x230 [btrfs]
>    [445771.675222] Code: 89 44 24 (...)
>    [445771.677364] RSP: 0018:ffffd23882247660 EFLAGS: 00010246
>    [445771.678029] RAX: 0000000000000000 RBX: ffff89f6c51d1a90 RCX: 0000000000000000
>    [445771.678975] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff89f406020000
>    [445771.679983] RBP: ffff89f821204000 R08: 0000000000000000 R09: 00000000ffefffff
>    [445771.680905] R10: ffffd23882247448 R11: 0000000000000003 R12: ffffd23882247668
>    [445771.681978] R13: ffff89f458e40fc0 R14: ffff89f737f4f500 R15: ffff89f737f4f500
>    [445771.682912] FS:  00007f0447a98840(0000) GS:ffff89fb9771d000(0000) knlGS:0000000000000000
>    [445771.684393] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    [445771.685230] CR2: 00007f0447bf1330 CR3: 000000017cb02002 CR4: 0000000000370ef0
>    [445771.686273] Call Trace:
>    [445771.686646]  <TASK>
>    [445771.686969]  btrfs_submit_bbio+0x83f/0x860 [btrfs]
>    [445771.687750]  ? write_one_eb+0x28f/0x340 [btrfs]
>    [445771.688428]  btree_writepages+0x2e3/0x550 [btrfs]
>    [445771.689180]  ? kmem_cache_alloc_noprof+0x12a/0x490
>    [445771.689963]  ? alloc_extent_state+0x19/0x120 [btrfs]
>    [445771.690801]  ? kmem_cache_free+0x135/0x380
>    [445771.691328]  ? preempt_count_add+0x69/0xa0
>    [445771.691831]  ? set_extent_bit+0x252/0x8e0 [btrfs]
>    [445771.692468]  ? xas_load+0x9/0xc0
>    [445771.692873]  ? xas_find+0x14d/0x1a0
>    [445771.693304]  do_writepages+0xc6/0x160
>    [445771.693756]  filemap_writeback+0xb8/0xe0
>    [445771.694274]  btrfs_write_marked_extents+0x61/0x170 [btrfs]
>    [445771.694999]  btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs]
>    [445771.695818]  btrfs_commit_transaction+0x5c8/0xd10 [btrfs]
>    [445771.696530]  ? kmem_cache_free+0x135/0x380
>    [445771.697120]  ? release_extent_buffer+0x34/0x160 [btrfs]
>    [445771.697786]  btrfs_recover_log_trees+0x7be/0x7e0 [btrfs]
>    [445771.698525]  ? __pfx_replay_one_buffer+0x10/0x10 [btrfs]
>    [445771.699206]  open_ctree+0x11e5/0x1810 [btrfs]
>    [445771.699776]  btrfs_get_tree.cold+0xb/0x162 [btrfs]
>    [445771.700463]  ? fscontext_read+0x165/0x180
>    [445771.701146]  ? rw_verify_area+0x50/0x180
>    [445771.701866]  vfs_get_tree+0x25/0xd0
>    [445771.702491]  vfs_cmd_create+0x59/0xe0
>    [445771.703125]  __do_sys_fsconfig+0x303/0x610
>    [445771.703603]  do_syscall_64+0xe9/0xf20
>    [445771.703974]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>    [445771.704700] RIP: 0033:0x7f0447cbd4aa
>    [445771.705108] Code: 73 01 c3 (...)
>    [445771.707263] RSP: 002b:00007ffc4e528318 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
>    [445771.708107] RAX: ffffffffffffffda RBX: 00005561585d8c20 RCX: 00007f0447cbd4aa
>    [445771.708931] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
>    [445771.709744] RBP: 00005561585d9120 R08: 0000000000000000 R09: 0000000000000000
>    [445771.710674] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>    [445771.711477] R13: 00007f0447e4f580 R14: 00007f0447e5126c R15: 00007f0447e36a23
>    [445771.712277]  </TASK>
>    [445771.712541] ---[ end trace 0000000000000000 ]---
>    [445771.713382] BTRFS error (device dm-0): error while writing out transaction: -5
>    [445771.714679] BTRFS warning (device dm-0): Skipping commit of aborted transaction.
>    [445771.715562] BTRFS error (device dm-0 state A): Transaction aborted (error -5)
>    [445771.716459] BTRFS: error (device dm-0 state A) in cleanup_transaction:2068: errno=-5 IO failure
>    [445771.717936] BTRFS error (device dm-0 state EA): failed to recover log trees with error: -5
>    [445771.719681] BTRFS error (device dm-0 state EA): open_ctree failed: -5
> 
> The problem is that such a fsync should have result in a fallback to a
> transaction commit, but that did not happen because through the
> btrfs_rmdir() we never update the directory's last_unlink_trans field.
> Any inode that had a link removed must have its last_unlink_trans updated
> to the ID of transaction used for the operation, otherwise fsync and log
> replay will not work correctly.
> 
> btrfs_rmdir() calls btrfs_unlink_inode() and through that call chain we
> never call btrfs_record_unlink_dir() in order to update last_unlink_trans.
> However btrfs_unlink(), which is used for unlinking regular files, calls
> btrfs_record_unlink_dir() and then calls btrfs_unlink_inode(). So fix
> this by moving the call to btrfs_record_unlink_dir() from btrfs_unlink()
> to btrfs_unlink_inode().
> 
> A test case for fstests will follow soon.
> 
> Reported-by: Slava0135 <slava.kovalevskiy.2014@gmail.com>
> Link: https://lore.kernel.org/linux-btrfs/CAAJYhww5ov62Hm+n+tmhcL-e_4cBobg+OWogKjOJxVUXivC=MQ@mail.gmail.com/
> CC: stable@vger.kernel.org
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/inode.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 40474014c03f..7d1ae5993f6a 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4508,6 +4508,7 @@ int btrfs_unlink_inode(struct btrfs_trans_handle *trans,
>  
>  	ret = __btrfs_unlink_inode(trans, dir, inode, name, NULL);
>  	if (!ret) {
> +		btrfs_record_unlink_dir(trans, dir, inode, false);

The old code was unconditionally calling btrfs_record_unlink_dir()
before btrfs_unlink_inode() while the new version only calls
btrfs_record_unlink_dir() on success.

Is that an intentional change? It appears that btrfs_unlink() is not
reacting particularly strongly to a non-zero return value (just ending
the transaction), so I think this could theoretically lead to a behavior
change.

I also noticed that btrfs_record_unlink_dir()'s documentation states:

 * Must be called before the unlink operations (updates to the subvolume tree,
 * inodes, etc) are done.

So if that is, in fact, inaccurate now, at least the comment should change.

Thanks,
Boris

>  		drop_nlink(&inode->vfs_inode);
>  		ret = btrfs_update_inode(trans, inode);
>  	}
> @@ -4549,9 +4550,6 @@ static int btrfs_unlink(struct inode *dir, struct dentry *dentry)
>  		goto fscrypt_free;
>  	}
>  
> -	btrfs_record_unlink_dir(trans, BTRFS_I(dir), BTRFS_I(d_inode(dentry)),
> -				false);
> -
>  	ret = btrfs_unlink_inode(trans, BTRFS_I(dir), BTRFS_I(d_inode(dentry)),
>  				 &fname.disk_name);
>  	if (ret)
> -- 
> 2.47.2
> 

  reply	other threads:[~2026-04-09 16:29 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-09 15:21 [PATCH] btrfs: fix missing last_unlink_trans update when removing a directory fdmanana
2026-04-09 16:29 ` Boris Burkov [this message]
2026-04-09 17:19   ` Filipe Manana
2026-04-09 17:19 ` [PATCH v2] " fdmanana
2026-04-09 17:41   ` Boris Burkov
2026-04-10 11:28 ` [PATCH v3] " fdmanana

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260409162846.GA2654001@zen.localdomain \
    --to=boris@bur.io \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox