From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Nikolay Borisov <nborisov@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: 'btrfs replace' hangs at end of replacing a device (v5.10.82)
Date: Sat, 1 Jan 2022 16:04:16 -0500
Message-ID: <YdDB0PSBKa2GMAPV@hungrycats.org>
In-Reply-To: <eb5804bc-10d0-ab12-73c4-bcaa08b297e0@suse.com>

On Sat, Jan 01, 2022 at 09:48:28PM +0200, Nikolay Borisov wrote:
> 
> 
> On 30.11.21 г. 15:55, Nikolay Borisov wrote:
> > 
> > 
> > On 29.11.21 г. 23:46, Zygo Blaxell wrote:
> >> Not a new bug, but it's still there.  btrfs replace ends in a transaction
> >> deadlock.
> >>
> >> 'btrfs replace status' reports the replace completed and exits:
> >>
> >> 	Started on 27.Nov 02:05:07, finished on 29.Nov 14:11:20, 0 write errs, 0 uncorr. read errs
> >>
> >> Magic-SysRq-W:
> >>
> >> 	sysrq: Show Blocked State
> >> 	task:btrfs-transacti state:D stack:    0 pid:29509 ppid:     2 flags:0x00004000
> >> 	Call Trace:
> >> 	 __schedule+0x35a/0xaa0
> >> 	 schedule+0x68/0xe0
> >> 	 schedule_preempt_disabled+0x15/0x20
> >> 	 __mutex_lock+0x1ac/0x7e0
> >> 	 ? lock_acquire+0x190/0x2d0
> >> 	 ? btrfs_run_dev_stats+0x46/0x450
> >> 	 ? rcu_read_lock_sched_held+0x16/0x80
> >> 	 mutex_lock_nested+0x1b/0x20
> >> 	 btrfs_run_dev_stats+0x46/0x450
> >> 	 ? _raw_spin_unlock+0x23/0x30
> >> 	 ? release_extent_buffer+0xa7/0xe0
> >> 	 commit_cowonly_roots+0xa2/0x2a0
> >> 	 ? btrfs_qgroup_account_extents+0x2d3/0x320
> >> 	 btrfs_commit_transaction+0x51f/0xc60
> >> 	 transaction_kthread+0x15a/0x180
> >> 	 kthread+0x151/0x170
> >> 	 ? btrfs_cleanup_transaction.isra.0+0x630/0x630
> >> 	 ? kthread_create_worker_on_cpu+0x70/0x70
> >> 	 ret_from_fork+0x22/0x30
> >> 	task:nfsd            state:D stack:    0 pid:31445 ppid:     2 flags:0x00004000
> >> 	Call Trace:
> >> 	 __schedule+0x35a/0xaa0
> >> 	 schedule+0x68/0xe0
> >> 	 btrfs_bio_counter_inc_blocked+0xe3/0x120
> >> 	 ? add_wait_queue_exclusive+0x80/0x80
> >> 	 btrfs_map_bio+0x4d/0x3f0
> >> 	 ? rcu_read_lock_sched_held+0x16/0x80
> >> 	 ? kmem_cache_alloc+0x2e8/0x360
> >> 	 btrfs_submit_metadata_bio+0xe9/0x100
> >> 	 submit_one_bio+0x67/0x80
> >> 	 read_extent_buffer_pages+0x277/0x380
> >> 	 btree_read_extent_buffer_pages+0xa1/0x120
> >> 	 read_tree_block+0x3b/0x70
> >> 	 read_block_for_search.isra.0+0x1a2/0x350
> >> 	 ? rcu_read_lock_sched_held+0x16/0x80
> >> 	 btrfs_search_slot+0x20f/0x910
> >> 	 btrfs_lookup_dir_item+0x78/0xc0
> >> 	 btrfs_lookup_dentry+0xca/0x540
> >> 	 btrfs_lookup+0x13/0x40
> >> 	 __lookup_slow+0x10d/0x1e0
> >> 	 ? rcu_read_lock_sched_held+0x16/0x80
> >> 	 lookup_one_len+0x77/0x90
> >> 	 nfsd_lookup_dentry+0xe0/0x440 [nfsd]
> >> 	 nfsd_lookup+0x89/0x150 [nfsd]
> >> 	 nfsd4_lookup+0x1a/0x20 [nfsd]
> >> 	 nfsd4_proc_compound+0x58b/0x8a0 [nfsd]
> >> 	 nfsd_dispatch+0xe6/0x1a0 [nfsd]
> >> 	 svc_process+0x55e/0x990 [sunrpc]
> >> 	 ? nfsd_svc+0x6a0/0x6a0 [nfsd]
> >> 	 nfsd+0x173/0x2a0 [nfsd]
> >> 	 kthread+0x151/0x170
> >> 	 ? nfsd_destroy+0x190/0x190 [nfsd]
> >> 	 ? kthread_create_worker_on_cpu+0x70/0x70
> >> 	 ret_from_fork+0x22/0x30
> >> 	task:btrfs           state:D stack:    0 pid:14692 ppid: 14687 flags:0x00004000
> >> 	Call Trace:
> >> 	 __schedule+0x35a/0xaa0
> >> 	 schedule+0x68/0xe0
> >> 	 btrfs_rm_dev_replace_blocked+0x8a/0xc0
> >> 	 ? add_wait_queue_exclusive+0x80/0x80
> >> 	 btrfs_dev_replace_finishing+0x59a/0x790
> >> 	 btrfs_dev_replace_by_ioctl+0x59d/0x6f0
> >> 	 ? btrfs_dev_replace_by_ioctl+0x59d/0x6f0
> >> 	 btrfs_ioctl+0x27b2/0x2fe0
> >> 	 ? _raw_spin_unlock_irq+0x28/0x40
> >> 	 ? _raw_spin_unlock_irq+0x28/0x40
> >> 	 ? trace_hardirqs_on+0x54/0xf0
> >> 	 ? _raw_spin_unlock_irq+0x28/0x40
> >> 	 ? do_sigaction+0xfd/0x250
> >> 	 ? __might_fault+0x79/0x80
> >> 	 __x64_sys_ioctl+0x91/0xc0
> >> 	 ? __x64_sys_ioctl+0x91/0xc0
> >> 	 do_syscall_64+0x38/0x90
> >> 	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> 	RIP: 0033:0x7f1d8a0f4cc7
> >> 	RSP: 002b:00007ffc6cffe588 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
> >> 	RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f1d8a0f4cc7
> >> 	RDX: 00007ffc6cfff400 RSI: 00000000ca289435 RDI: 0000000000000003
> >> 	RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> >> 	R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
> >> 	R13: 00005583a17fb2e0 R14: 00007ffc6d001b7a R15: 0000000000000001
> >> 	task:mkdir           state:D stack:    0 pid: 2349 ppid:  2346 flags:0x00000000
> >> 	Call Trace:
> >> 	 __schedule+0x35a/0xaa0
> >> 	 schedule+0x68/0xe0
> >> 	 wait_current_trans+0xed/0x150
> >> 	 ? add_wait_queue_exclusive+0x80/0x80
> >> 	 start_transaction+0x551/0x700
> >> 	 btrfs_start_transaction+0x1e/0x20
> >> 	 btrfs_mkdir+0x5f/0x210
> >> 	 vfs_mkdir+0x150/0x200
> >> 	 do_mkdirat+0x118/0x140
> >> 	 __x64_sys_mkdir+0x1b/0x20
> >> 	 do_syscall_64+0x38/0x90
> >> 	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> 	RIP: 0033:0x7f608a3c6b07
> >> 	RSP: 002b:00007fffbbd2bab8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
> >> 	RAX: ffffffffffffffda RBX: 0000563d79f1dc30 RCX: 00007f608a3c6b07
> >> 	RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00007fffbbd2db5e
> >> 	RBP: 00007fffbbd2db39 R08: 0000000000000000 R09: 0000563d79f1dd60
> >> 	R10: fffffffffffff284 R11: 0000000000000246 R12: 00000000000001ff
> >> 	R13: 00007fffbbd2bc30 R14: 00007fffbbd2db5e R15: 0000000000000000
> >>
> >> After a reboot (still in degraded mode), btrfs finishes the replace in
> >> a little under 5 seconds:
> >>
> >> 	[  508.664454] BTRFS info (device dm-34): continuing dev_replace from <missing disk> (devid 5) to target /dev/mapper/md17 @100%
> >> 	[  513.285473] BTRFS info (device dm-34): dev_replace from <missing disk> (devid 5) to /dev/mapper/md17 finished
> >>
> > 
> > 
> > I have a working hypothesis about what might be going wrong; however,
> > without a crash dump to investigate I can't really confirm it. Basically,
> > I think btrfs_rm_dev_replace_blocked is not seeing the decrement (i.e. the
> > store to the running bios count) because btrfs_bio_counter_sub uses
> > cond_wake_up_nomb. If I'm right, then the following should fix it:
> > 
> > @@ -1122,7 +1123,8 @@ void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info)
> >  void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
> >  {
> >         percpu_counter_sub(&fs_info->dev_replace.bio_counter, amount);
> > -       cond_wake_up_nomb(&fs_info->dev_replace.replace_wait);
> > +       /* paired with the wait_event barrier in replace_blocked */
> > +       cond_wake_up(&fs_info->dev_replace.replace_wait);
> >  }
> 
> Ping, any feedback on this patch?

I've had a VM running replaces, and 37 of them completed without hanging.
In the 2 failing cases, I hit the KASAN bug[1] and the
dedupe/logical_ino/bees lockup bug[2].

[1] https://lore.kernel.org/linux-btrfs/Ycqu1Wr8p3aJNcaf@hungrycats.org/
[2] https://lore.kernel.org/linux-btrfs/Ybz4JI+Kl2J7Py3z@hungrycats.org/

> > Can you apply it and see if the hang still reproduces? I don't know what
> > the incident rate of this bug is, so you'll have to decide at what point
> > to consider it fixed. In any case, this patch can't have any negative
> > functional impact; it just makes the ordering slightly stronger to ensure
> > the write happens before possibly waking up someone on the queue.
> > 
> > 
> 
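
For anyone following along, the ordering argument in the quoted patch can
be sketched in user space. This is only a rough model under stated
assumptions, not the kernel code: struct replace_model, waiter_must_sleep()
and bio_counter_sub_model() are hypothetical names, a single atomic stands
in for the percpu counter, a flag stands in for "replace_wait has a
sleeper", and C11 fences stand in for the kernel barriers. The point it
tries to show is that each side writes its own variable, issues a full
barrier, and only then reads the other side's variable, so at least one
side should observe the other's write.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-ins for the kernel objects in the patch. */
struct replace_model {
	atomic_long bio_counter;  /* models fs_info->dev_replace.bio_counter */
	atomic_bool has_sleeper;  /* models "replace_wait is non-empty" */
	atomic_int  wakeups;      /* counts wake-ups the waker decided to issue */
};

/*
 * Waiter side (models the wait in btrfs_rm_dev_replace_blocked):
 * announce that we are waiting, full barrier, then re-check the counter.
 * Returns true if the waiter would go to sleep.
 */
static bool waiter_must_sleep(struct replace_model *m)
{
	atomic_store_explicit(&m->has_sleeper, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&m->bio_counter, memory_order_relaxed) != 0;
}

/*
 * Waker side (models the patched btrfs_bio_counter_sub):
 * store the decrement, full barrier, then check for sleepers.
 * Dropping the fence models the _nomb variant, where the sleeper check
 * may be ordered before the decrement becomes visible.
 */
static void bio_counter_sub_model(struct replace_model *m, long amount)
{
	atomic_fetch_sub_explicit(&m->bio_counter, amount, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load_explicit(&m->has_sleeper, memory_order_relaxed))
		atomic_fetch_add_explicit(&m->wakeups, 1, memory_order_relaxed);
}
```

Without the fence on the waker side, both sides can read stale values at
the same time: the waker sees no sleeper and skips the wake-up, while the
waiter sees a non-zero counter and sleeps with nobody left to wake it. As
far as I can tell, that matches the stuck btrfs_rm_dev_replace_blocked
trace above.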
