From: Nikolay Borisov <nborisov@suse.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: 'btrfs replace' hangs at end of replacing a device (v5.10.82)
Date: Sat, 1 Jan 2022 23:07:24 +0200
Message-ID: <ac3a3216-2beb-3365-0430-38865faccc83@suse.com>
In-Reply-To: <YdDB0PSBKa2GMAPV@hungrycats.org>
On 1.01.22 г. 23:04, Zygo Blaxell wrote:
> On Sat, Jan 01, 2022 at 09:48:28PM +0200, Nikolay Borisov wrote:
>>
>>
>> On 30.11.21 г. 15:55, Nikolay Borisov wrote:
>>>
>>>
>>> On 29.11.21 г. 23:46, Zygo Blaxell wrote:
>>>> Not a new bug, but it's still there. btrfs replace ends in a transaction
>>>> deadlock.
>>>>
>>>> 'btrfs replace status' reports the replace completed and exits:
>>>>
>>>> Started on 27.Nov 02:05:07, finished on 29.Nov 14:11:20, 0 write errs, 0 uncorr. read errs
>>>>
>>>> Magic-SysRq-W:
>>>>
>>>> sysrq: Show Blocked State
>>>> task:btrfs-transacti state:D stack: 0 pid:29509 ppid: 2 flags:0x00004000
>>>> Call Trace:
>>>> __schedule+0x35a/0xaa0
>>>> schedule+0x68/0xe0
>>>> schedule_preempt_disabled+0x15/0x20
>>>> __mutex_lock+0x1ac/0x7e0
>>>> ? lock_acquire+0x190/0x2d0
>>>> ? btrfs_run_dev_stats+0x46/0x450
>>>> ? rcu_read_lock_sched_held+0x16/0x80
>>>> mutex_lock_nested+0x1b/0x20
>>>> btrfs_run_dev_stats+0x46/0x450
>>>> ? _raw_spin_unlock+0x23/0x30
>>>> ? release_extent_buffer+0xa7/0xe0
>>>> commit_cowonly_roots+0xa2/0x2a0
>>>> ? btrfs_qgroup_account_extents+0x2d3/0x320
>>>> btrfs_commit_transaction+0x51f/0xc60
>>>> transaction_kthread+0x15a/0x180
>>>> kthread+0x151/0x170
>>>> ? btrfs_cleanup_transaction.isra.0+0x630/0x630
>>>> ? kthread_create_worker_on_cpu+0x70/0x70
>>>> ret_from_fork+0x22/0x30
>>>> task:nfsd state:D stack: 0 pid:31445 ppid: 2 flags:0x00004000
>>>> Call Trace:
>>>> __schedule+0x35a/0xaa0
>>>> schedule+0x68/0xe0
>>>> btrfs_bio_counter_inc_blocked+0xe3/0x120
>>>> ? add_wait_queue_exclusive+0x80/0x80
>>>> btrfs_map_bio+0x4d/0x3f0
>>>> ? rcu_read_lock_sched_held+0x16/0x80
>>>> ? kmem_cache_alloc+0x2e8/0x360
>>>> btrfs_submit_metadata_bio+0xe9/0x100
>>>> submit_one_bio+0x67/0x80
>>>> read_extent_buffer_pages+0x277/0x380
>>>> btree_read_extent_buffer_pages+0xa1/0x120
>>>> read_tree_block+0x3b/0x70
>>>> read_block_for_search.isra.0+0x1a2/0x350
>>>> ? rcu_read_lock_sched_held+0x16/0x80
>>>> btrfs_search_slot+0x20f/0x910
>>>> btrfs_lookup_dir_item+0x78/0xc0
>>>> btrfs_lookup_dentry+0xca/0x540
>>>> btrfs_lookup+0x13/0x40
>>>> __lookup_slow+0x10d/0x1e0
>>>> ? rcu_read_lock_sched_held+0x16/0x80
>>>> lookup_one_len+0x77/0x90
>>>> nfsd_lookup_dentry+0xe0/0x440 [nfsd]
>>>> nfsd_lookup+0x89/0x150 [nfsd]
>>>> nfsd4_lookup+0x1a/0x20 [nfsd]
>>>> nfsd4_proc_compound+0x58b/0x8a0 [nfsd]
>>>> nfsd_dispatch+0xe6/0x1a0 [nfsd]
>>>> svc_process+0x55e/0x990 [sunrpc]
>>>> ? nfsd_svc+0x6a0/0x6a0 [nfsd]
>>>> nfsd+0x173/0x2a0 [nfsd]
>>>> kthread+0x151/0x170
>>>> ? nfsd_destroy+0x190/0x190 [nfsd]
>>>> ? kthread_create_worker_on_cpu+0x70/0x70
>>>> ret_from_fork+0x22/0x30
>>>> task:btrfs state:D stack: 0 pid:14692 ppid: 14687 flags:0x00004000
>>>> Call Trace:
>>>> __schedule+0x35a/0xaa0
>>>> schedule+0x68/0xe0
>>>> btrfs_rm_dev_replace_blocked+0x8a/0xc0
>>>> ? add_wait_queue_exclusive+0x80/0x80
>>>> btrfs_dev_replace_finishing+0x59a/0x790
>>>> btrfs_dev_replace_by_ioctl+0x59d/0x6f0
>>>> ? btrfs_dev_replace_by_ioctl+0x59d/0x6f0
>>>> btrfs_ioctl+0x27b2/0x2fe0
>>>> ? _raw_spin_unlock_irq+0x28/0x40
>>>> ? _raw_spin_unlock_irq+0x28/0x40
>>>> ? trace_hardirqs_on+0x54/0xf0
>>>> ? _raw_spin_unlock_irq+0x28/0x40
>>>> ? do_sigaction+0xfd/0x250
>>>> ? __might_fault+0x79/0x80
>>>> __x64_sys_ioctl+0x91/0xc0
>>>> ? __x64_sys_ioctl+0x91/0xc0
>>>> do_syscall_64+0x38/0x90
>>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>> RIP: 0033:0x7f1d8a0f4cc7
>>>> RSP: 002b:00007ffc6cffe588 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
>>>> RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f1d8a0f4cc7
>>>> RDX: 00007ffc6cfff400 RSI: 00000000ca289435 RDI: 0000000000000003
>>>> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
>>>> R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
>>>> R13: 00005583a17fb2e0 R14: 00007ffc6d001b7a R15: 0000000000000001
>>>> task:mkdir state:D stack: 0 pid: 2349 ppid: 2346 flags:0x00000000
>>>> Call Trace:
>>>> __schedule+0x35a/0xaa0
>>>> schedule+0x68/0xe0
>>>> wait_current_trans+0xed/0x150
>>>> ? add_wait_queue_exclusive+0x80/0x80
>>>> start_transaction+0x551/0x700
>>>> btrfs_start_transaction+0x1e/0x20
>>>> btrfs_mkdir+0x5f/0x210
>>>> vfs_mkdir+0x150/0x200
>>>> do_mkdirat+0x118/0x140
>>>> __x64_sys_mkdir+0x1b/0x20
>>>> do_syscall_64+0x38/0x90
>>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>> RIP: 0033:0x7f608a3c6b07
>>>> RSP: 002b:00007fffbbd2bab8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
>>>> RAX: ffffffffffffffda RBX: 0000563d79f1dc30 RCX: 00007f608a3c6b07
>>>> RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00007fffbbd2db5e
>>>> RBP: 00007fffbbd2db39 R08: 0000000000000000 R09: 0000563d79f1dd60
>>>> R10: fffffffffffff284 R11: 0000000000000246 R12: 00000000000001ff
>>>> R13: 00007fffbbd2bc30 R14: 00007fffbbd2db5e R15: 0000000000000000
>>>>
>>>> After a reboot (still in degraded mode), btrfs finishes the replace in
>>>> a little under 5 seconds:
>>>>
>>>> [ 508.664454] BTRFS info (device dm-34): continuing dev_replace from <missing disk> (devid 5) to target /dev/mapper/md17 @100%
>>>> [ 513.285473] BTRFS info (device dm-34): dev_replace from <missing disk> (devid 5) to /dev/mapper/md17 finished
>>>>
>>>
>>>
>>> I have a working hypothesis about what might be going wrong, but without
>>> a crash dump to investigate I can't really confirm it. Basically, I think
>>> btrfs_rm_dev_replace_blocked is not seeing the decrement (i.e. the store
>>> to the running bios count) because btrfs_bio_counter_sub uses
>>> cond_wake_up_nomb. If I'm right, the following should fix it:
>>>
>>> @@ -1122,7 +1123,8 @@ void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info)
>>>  void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
>>>  {
>>>  	percpu_counter_sub(&fs_info->dev_replace.bio_counter, amount);
>>> -	cond_wake_up_nomb(&fs_info->dev_replace.replace_wait);
>>> +	/* paired with the wait_event barrier in replace_blocked */
>>> +	cond_wake_up(&fs_info->dev_replace.replace_wait);
>>>  }
>>
>> Ping, any feedback on this patch?
>
> I've had a VM running: 37 replaces completed without hanging. In the
> 2 failing cases, I hit the KASAN bug[1] and the dedupe/logical_ino/bees
> lockup bug[2].
How does that compare with runs without the patch? The KASAN report looks
raid56-related, so I'd discount it. The logical_ino lockup also isn't
directly related to this patch. So without the patch, you should have seen
a nonzero incidence of the replace lockup?
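
To illustrate why the baseline matters, here is a rough sketch (not part of
the patch). It computes how likely it is that n independent runs all finish
cleanly if each run hangs with probability p. Both p and n are hypothetical
inputs; the real unpatched hang rate is exactly the unknown being asked
about, and the helper name prob_all_clean is made up for this example.

```c
#include <assert.h>

/*
 * Probability that n independent replace runs all complete without a hang,
 * assuming each run hangs with probability p. The independence assumption
 * and both inputs are illustrative only.
 */
static double prob_all_clean(double p, int n)
{
	double q = 1.0;

	assert(p >= 0.0 && p <= 1.0 && n >= 0);

	/* Multiply the per-run success probability n times. */
	for (int i = 0; i < n; i++)
		q *= 1.0 - p;

	return q;
}
```

With a hypothetical 5% hang rate, prob_all_clean(0.05, 37) comes out at
roughly 15%, so 37 clean runs alone would not be strong evidence; with a much
higher unpatched rate they would be.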
>
> [1] https://lore.kernel.org/linux-btrfs/Ycqu1Wr8p3aJNcaf@hungrycats.org/
> [2] https://lore.kernel.org/linux-btrfs/Ybz4JI+Kl2J7Py3z@hungrycats.org/
>
>>> Can you apply it and see whether the hang still reproduces? I don't know
>>> the incident rate of this bug, so you'll have to decide at what point to
>>> consider it fixed. In any case, this patch can't have any negative
>>> functional impact; it just makes the ordering slightly stronger to ensure
>>> the write happens before possibly waking up someone on the queue.
>>>
>>>
>>
>
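
The ordering argument in the quoted patch description can be sketched with a
toy, single-threaded model. This is not kernel code: it simulates one fixed
interleaving of a "waker" CPU whose counter store may linger in a store
buffer and a "waiter" that queues itself and then re-checks the counter. The
struct, the one-entry store buffer, and the function names are all
assumptions made for illustration; whether this is what actually happens in
the reported hang is still only a hypothesis.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: shared memory plus a one-entry store buffer for the waker. */
struct model {
	int mem_counter;      /* in-flight bios as visible in shared memory */
	bool mem_sleeper;     /* "a waiter is queued" as visible in memory */
	bool buffered;        /* waker has a store not yet visible to others */
	int buffered_counter; /* value of that pending store */
};

/* Make the waker's pending store visible (what a full barrier forces). */
static void drain(struct model *m)
{
	if (m->buffered) {
		m->mem_counter = m->buffered_counter;
		m->buffered = false;
	}
}

/*
 * Replay one fixed interleaving. Returns true if the waiter goes to sleep
 * and the waker never sends a wakeup (a missed wakeup).
 */
static bool missed_wakeup(bool waker_has_barrier)
{
	struct model m = { .mem_counter = 1 };

	/* Waker: decrement the counter; the store sits in the store buffer. */
	m.buffered_counter = m.mem_counter - 1;
	m.buffered = true;

	/* Waker: optional full barrier before looking at the wait queue. */
	if (waker_has_barrier)
		drain(&m);

	/* Waker: check for sleepers; none is queued yet, so no wakeup. */
	bool woke = m.mem_sleeper;

	/* Waiter: queue itself, then re-check the counter before sleeping. */
	m.mem_sleeper = true;
	bool sleeps = (m.mem_counter != 0);

	/* The buffered store lands eventually, but too late to help. */
	drain(&m);
	assert(m.mem_counter == 0);

	return sleeps && !woke;
}
```

In this model missed_wakeup(false) reports a lost wakeup, while
missed_wakeup(true) does not, because the barrier makes the decrement visible
before the waiter re-checks the counter.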