* [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
@ 2011-12-06 5:35 Miao Xie
2011-12-06 5:49 ` Al Viro
2011-12-06 9:59 ` Christoph Hellwig
0 siblings, 2 replies; 10+ messages in thread
From: Miao Xie @ 2011-12-06 5:35 UTC (permalink / raw)
To: Chris Mason, viro; +Cc: Linux Btrfs, Linux Fsdevel, Ito
The reason the deadlock is that:
Task Btrfs-cleaner
umount()
down_write(&s->s_umount)
close_ctree()
wait for the end of
btrfs-cleaner
start_transaction
reserve space
shrink_delalloc()
writeback_inodes_sb_nr_if_idle()
down_read(&sb->s_umount)
So, the deadlock has happened.
We fix it by trying to lock >s_umount, if _trylock_ fails, it means the fs
is on remounting or umounting. At this time, we will use the sync function of
btrfs to sync all the delalloc file. It may waste lots of time, but as a
corner case, we needn't care.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
fs/btrfs/extent-tree.c | 23 ++++++++++++++++++++++-
1 files changed, 22 insertions(+), 1 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 813c6bb..86c295d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3372,6 +3372,27 @@ out:
return ret;
}
+void btrfs_writeback_inodes_sb_nr(struct btrfs_root *root,
+ unsigned long nr_pages)
+{
+ struct super_block *sb = root->fs_info->sb;
+
+ if (writeback_in_progress(sb->s_bdi))
+ return;
+
+ /*
+ * If we can not get s_umount, it means the fs is on remounting or
+ * umounting. At this time, we just sync all the delalloc file.
+ */
+ if (down_read_trylock(&sb->s_umount)) {
+ writeback_inodes_sb_nr(sb, nr_pages);
+ up_read(&sb->s_umount);
+ } else {
+ btrfs_start_delalloc_inodes(root, 0);
+ btrfs_wait_ordered_extents(root, 0, 0);
+ }
+}
+
/*
* shrink metadata reservation for delalloc
*/
@@ -3416,7 +3437,7 @@ static int shrink_delalloc(struct btrfs_root *root, u64 to_reclaim,
smp_mb();
nr_pages = min_t(unsigned long, nr_pages,
root->fs_info->delalloc_bytes >> PAGE_CACHE_SHIFT);
- writeback_inodes_sb_nr_if_idle(root->fs_info->sb, nr_pages);
+ btrfs_writeback_inodes_sb_nr(root, nr_pages);
spin_lock(&space_info->lock);
if (reserved > space_info->bytes_may_use)
--
1.7.6.4
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
2011-12-06 5:35 [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount Miao Xie
@ 2011-12-06 5:49 ` Al Viro
2011-12-06 6:52 ` Miao Xie
2011-12-06 9:59 ` Christoph Hellwig
1 sibling, 1 reply; 10+ messages in thread
From: Al Viro @ 2011-12-06 5:49 UTC (permalink / raw)
To: Miao Xie; +Cc: Chris Mason, Linux Btrfs, Linux Fsdevel, Ito
> +void btrfs_writeback_inodes_sb_nr(struct btrfs_root *root,
> + unsigned long nr_pages)
> +{
> + struct super_block *sb = root->fs_info->sb;
> +
> + if (writeback_in_progress(sb->s_bdi))
> + return;
> +
> + /*
> + * If we can not get s_umount, it means the fs is on remounting or
> + * umounting. At this time, we just sync all the delalloc file.
> + */
> + if (down_read_trylock(&sb->s_umount)) {
> + writeback_inodes_sb_nr(sb, nr_pages);
> + up_read(&sb->s_umount);
> + } else {
> + btrfs_start_delalloc_inodes(root, 0);
> + btrfs_wait_ordered_extents(root, 0, 0);
> + }
> +}
If that can race with umount, what prevents sb, its ->s_bdi et.al. being freed
under you?
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
2011-12-06 5:49 ` Al Viro
@ 2011-12-06 6:52 ` Miao Xie
0 siblings, 0 replies; 10+ messages in thread
From: Miao Xie @ 2011-12-06 6:52 UTC (permalink / raw)
To: Al Viro; +Cc: Chris Mason, Linux Btrfs, Linux Fsdevel, Ito
On tue, 6 Dec 2011 05:49:06 +0000, Al Viro wrote:
>
>> +void btrfs_writeback_inodes_sb_nr(struct btrfs_root *root,
>> + unsigned long nr_pages)
>> +{
>> + struct super_block *sb = root->fs_info->sb;
>> +
>> + if (writeback_in_progress(sb->s_bdi))
>> + return;
>> +
>> + /*
>> + * If we can not get s_umount, it means the fs is on remounting or
>> + * umounting. At this time, we just sync all the delalloc file.
>> + */
>> + if (down_read_trylock(&sb->s_umount)) {
>> + writeback_inodes_sb_nr(sb, nr_pages);
>> + up_read(&sb->s_umount);
>> + } else {
>> + btrfs_start_delalloc_inodes(root, 0);
>> + btrfs_wait_ordered_extents(root, 0, 0);
>> + }
>> +}
>
> If that can race with umount, what prevents sb, its ->s_bdi et.al. being freed
> under you?
In fact, it happened. See the following mail.
http://marc.info/?l=linux-btrfs&m=131495252725296&w=2
The above function is called when some one want to modify the meta-data.
Btrfs will wait until all the meta-data operations end, and then free ->s_bdi
and the other objects. So we needn't worry about those objects.
(Maybe I misunderstood what you said. If yes, I'm sorry)
Thanks
Miao
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
2011-12-06 5:35 [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount Miao Xie
2011-12-06 5:49 ` Al Viro
@ 2011-12-06 9:59 ` Christoph Hellwig
2011-12-06 11:06 ` Miao Xie
1 sibling, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2011-12-06 9:59 UTC (permalink / raw)
To: Miao Xie; +Cc: Chris Mason, viro, Linux Btrfs, Linux Fsdevel, Ito
On Tue, Dec 06, 2011 at 01:35:47PM +0800, Miao Xie wrote:
> The reason the deadlock is that:
> Task Btrfs-cleaner
> umount()
> down_write(&s->s_umount)
> close_ctree()
> wait for the end of
> btrfs-cleaner
> start_transaction
> reserve space
> shrink_delalloc()
> writeback_inodes_sb_nr_if_idle()
> down_read(&sb->s_umount)
> So, the deadlock has happened.
>
> We fix it by trying to lock >s_umount, if _trylock_ fails, it means the fs
> is on remounting or umounting. At this time, we will use the sync function of
> btrfs to sync all the delalloc file. It may waste lots of time, but as a
> corner case, we needn't care.
I can't see why you need the writeout when the trylocks fails. Umount
needs to take care of writing out all pending file data anyway, so doing
it from the cleaner thread in addition doesn't sound like it would help.
So I'd rather suggest to move the trylock into
writeback_inodes_sb_nr_if_idle, and while you're at it also rewrite
writeback_inodes_sb_if_idle that ext4 is using to sit on top of
writeback_inodes_sb_nr_if_idle to share that logic, and drop the
unused writeback_inodes_sb_nr export.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
2011-12-06 9:59 ` Christoph Hellwig
@ 2011-12-06 11:06 ` Miao Xie
2011-12-06 11:23 ` Christoph Hellwig
0 siblings, 1 reply; 10+ messages in thread
From: Miao Xie @ 2011-12-06 11:06 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chris Mason, viro, Linux Btrfs, Linux Fsdevel, Ito
On tue, 6 Dec 2011 04:59:23 -0500, Christoph Hellwig wrote:
> On Tue, Dec 06, 2011 at 01:35:47PM +0800, Miao Xie wrote:
>> The reason the deadlock is that:
>> Task Btrfs-cleaner
>> umount()
>> down_write(&s->s_umount)
>> close_ctree()
>> wait for the end of
>> btrfs-cleaner
>> start_transaction
>> reserve space
>> shrink_delalloc()
>> writeback_inodes_sb_nr_if_idle()
>> down_read(&sb->s_umount)
>> So, the deadlock has happened.
>>
>> We fix it by trying to lock >s_umount, if _trylock_ fails, it means the fs
>> is on remounting or umounting. At this time, we will use the sync function of
>> btrfs to sync all the delalloc file. It may waste lots of time, but as a
>> corner case, we needn't care.
>
> I can't see why you need the writeout when the trylocks fails. Umount
> needs to take care of writing out all pending file data anyway, so doing
> it from the cleaner thread in addition doesn't sound like it would help.
umount invokes sync_fs() and write out all the dirty file data. For the
other file systems, its OK because the file system does not introduce dirty pages
by itself. But btrfs is different. Its automatic defragment will make lots of dirty
pages after sync_fs() and reserve lots of meta-data space for those pages.
And then the cleaner thread may find there is no enough space to reserve, it must
sync the dirty file data and release the reserved space which is for the dirty
file data.
>
> So I'd rather suggest to move the trylock into
> writeback_inodes_sb_nr_if_idle, and while you're at it also rewrite
> writeback_inodes_sb_if_idle that ext4 is using to sit on top of
> writeback_inodes_sb_nr_if_idle to share that logic, and drop the
> unused writeback_inodes_sb_nr export.
It is a good way. I will try it.
(Someone is using this way to fix the other deadlock between freeze and writeback)
Thanks
Miao
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
2011-12-06 11:06 ` Miao Xie
@ 2011-12-06 11:23 ` Christoph Hellwig
2011-12-06 21:36 ` Chris Mason
0 siblings, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2011-12-06 11:23 UTC (permalink / raw)
To: Miao Xie
Cc: Christoph Hellwig, Chris Mason, viro, Linux Btrfs, Linux Fsdevel,
Ito
On Tue, Dec 06, 2011 at 07:06:40PM +0800, Miao Xie wrote:
> > I can't see why you need the writeout when the trylocks fails. Umount
> > needs to take care of writing out all pending file data anyway, so doing
> > it from the cleaner thread in addition doesn't sound like it would help.
>
> umount invokes sync_fs() and write out all the dirty file data. For the
> other file systems, its OK because the file system does not introduce dirty pages
> by itself. But btrfs is different. Its automatic defragment will make lots of dirty
> pages after sync_fs() and reserve lots of meta-data space for those pages.
> And then the cleaner thread may find there is no enough space to reserve, it must
> sync the dirty file data and release the reserved space which is for the dirty
> file data.
I think the safest way to fix is is to write out all dirty data again
once the cleaner thread has been safely stopped.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
2011-12-06 11:23 ` Christoph Hellwig
@ 2011-12-06 21:36 ` Chris Mason
2011-12-07 2:31 ` Miao Xie
0 siblings, 1 reply; 10+ messages in thread
From: Chris Mason @ 2011-12-06 21:36 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Miao Xie, viro, Linux Btrfs, Linux Fsdevel, Ito
On Tue, Dec 06, 2011 at 06:23:23AM -0500, Christoph Hellwig wrote:
> On Tue, Dec 06, 2011 at 07:06:40PM +0800, Miao Xie wrote:
> > > I can't see why you need the writeout when the trylocks fails. Umount
> > > needs to take care of writing out all pending file data anyway, so doing
> > > it from the cleaner thread in addition doesn't sound like it would help.
> >
> > umount invokes sync_fs() and write out all the dirty file data. For the
> > other file systems, its OK because the file system does not introduce dirty pages
> > by itself. But btrfs is different. Its automatic defragment will make lots of dirty
> > pages after sync_fs() and reserve lots of meta-data space for those pages.
> > And then the cleaner thread may find there is no enough space to reserve, it must
> > sync the dirty file data and release the reserved space which is for the dirty
> > file data.
>
> I think the safest way to fix is is to write out all dirty data again
> once the cleaner thread has been safely stopped.
>
Said another way we want to stop the autodefrag code before the unmount
is ready to continue. We also want to stop balancing, scrub etc.
-chris
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
2011-12-06 21:36 ` Chris Mason
@ 2011-12-07 2:31 ` Miao Xie
2011-12-07 11:11 ` Ilya Dryomov
0 siblings, 1 reply; 10+ messages in thread
From: Miao Xie @ 2011-12-07 2:31 UTC (permalink / raw)
To: Chris Mason, Christoph Hellwig, viro, Linux Btrfs, Linux Fsdevel,
Ito
On tue, 6 Dec 2011 16:36:11 -0500, Chris Mason wrote:
> On Tue, Dec 06, 2011 at 06:23:23AM -0500, Christoph Hellwig wrote:
>> On Tue, Dec 06, 2011 at 07:06:40PM +0800, Miao Xie wrote:
>>>> I can't see why you need the writeout when the trylocks fails. Umount
>>>> needs to take care of writing out all pending file data anyway, so doing
>>>> it from the cleaner thread in addition doesn't sound like it would help.
>>>
>>> umount invokes sync_fs() and write out all the dirty file data. For the
>>> other file systems, its OK because the file system does not introduce dirty pages
>>> by itself. But btrfs is different. Its automatic defragment will make lots of dirty
>>> pages after sync_fs() and reserve lots of meta-data space for those pages.
>>> And then the cleaner thread may find there is no enough space to reserve, it must
>>> sync the dirty file data and release the reserved space which is for the dirty
>>> file data.
>>
>> I think the safest way to fix is is to write out all dirty data again
>> once the cleaner thread has been safely stopped.
>>
>
> Said another way we want to stop the autodefrag code before the unmount
> is ready to continue. We also want to stop balancing, scrub etc.
But there is no good interface to do it before umount gets s_umount lock.
I think trylock(in writeback_inodes_sb_nr_if_idle()) + dirty data flush
can help us to fix the bug perfectly.
Thanks
Miao
>
> -chris
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
2011-12-07 2:31 ` Miao Xie
@ 2011-12-07 11:11 ` Ilya Dryomov
2011-12-08 3:46 ` Miao Xie
0 siblings, 1 reply; 10+ messages in thread
From: Ilya Dryomov @ 2011-12-07 11:11 UTC (permalink / raw)
To: Miao Xie
Cc: Chris Mason, Christoph Hellwig, viro, Linux Btrfs, Linux Fsdevel,
Ito
On Wed, Dec 07, 2011 at 10:31:35AM +0800, Miao Xie wrote:
> On tue, 6 Dec 2011 16:36:11 -0500, Chris Mason wrote:
> > On Tue, Dec 06, 2011 at 06:23:23AM -0500, Christoph Hellwig wrote:
> >> On Tue, Dec 06, 2011 at 07:06:40PM +0800, Miao Xie wrote:
> >>>> I can't see why you need the writeout when the trylocks fails. Umount
> >>>> needs to take care of writing out all pending file data anyway, so doing
> >>>> it from the cleaner thread in addition doesn't sound like it would help.
> >>>
> >>> umount invokes sync_fs() and write out all the dirty file data. For the
> >>> other file systems, its OK because the file system does not introduce dirty pages
> >>> by itself. But btrfs is different. Its automatic defragment will make lots of dirty
> >>> pages after sync_fs() and reserve lots of meta-data space for those pages.
> >>> And then the cleaner thread may find there is no enough space to reserve, it must
> >>> sync the dirty file data and release the reserved space which is for the dirty
> >>> file data.
> >>
> >> I think the safest way to fix is is to write out all dirty data again
> >> once the cleaner thread has been safely stopped.
> >>
> >
> > Said another way we want to stop the autodefrag code before the unmount
> > is ready to continue. We also want to stop balancing, scrub etc.
>
> But there is no good interface to do it before umount gets s_umount lock.
> I think trylock(in writeback_inodes_sb_nr_if_idle()) + dirty data flush
> can help us to fix the bug perfectly.
But it won't fix the umount while balancing family of deadlocks (they
are really of the same nature, vfs grabs s_umount mutex and we need it
to proceed). (Balance cancelling code is part of restriper patches,
it's just a hook in close_ctree() that waits until we are done
relocating a chunk - very similar to cleaner wait)
One example would be that balancing code while dirtying pages calls
balance_dirty_pages_ratelimited() for each dirtied page, as it should.
And if balance_dirty_pages() then decides to initiate writeback we are
stuck schedule()ing forever, because writeback can't proceed w/o
read-taking s_umount mutex which is fully held by vfs - it just skips
the relocation inode.
Thanks,
Ilya
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
2011-12-07 11:11 ` Ilya Dryomov
@ 2011-12-08 3:46 ` Miao Xie
0 siblings, 0 replies; 10+ messages in thread
From: Miao Xie @ 2011-12-08 3:46 UTC (permalink / raw)
To: Ilya Dryomov
Cc: Chris Mason, Christoph Hellwig, viro, Linux Btrfs, Linux Fsdevel,
Ito
On Wed, 7 Dec 2011 13:11:58 +0200, Ilya Dryomov wrote:
> On Wed, Dec 07, 2011 at 10:31:35AM +0800, Miao Xie wrote:
>> On tue, 6 Dec 2011 16:36:11 -0500, Chris Mason wrote:
>>> On Tue, Dec 06, 2011 at 06:23:23AM -0500, Christoph Hellwig wrote:
>>>> On Tue, Dec 06, 2011 at 07:06:40PM +0800, Miao Xie wrote:
>>>>>> I can't see why you need the writeout when the trylocks fails. Umount
>>>>>> needs to take care of writing out all pending file data anyway, so doing
>>>>>> it from the cleaner thread in addition doesn't sound like it would help.
>>>>>
>>>>> umount invokes sync_fs() and write out all the dirty file data. For the
>>>>> other file systems, its OK because the file system does not introduce dirty pages
>>>>> by itself. But btrfs is different. Its automatic defragment will make lots of dirty
>>>>> pages after sync_fs() and reserve lots of meta-data space for those pages.
>>>>> And then the cleaner thread may find there is no enough space to reserve, it must
>>>>> sync the dirty file data and release the reserved space which is for the dirty
>>>>> file data.
>>>>
>>>> I think the safest way to fix is is to write out all dirty data again
>>>> once the cleaner thread has been safely stopped.
>>>>
>>>
>>> Said another way we want to stop the autodefrag code before the unmount
>>> is ready to continue. We also want to stop balancing, scrub etc.
>>
>> But there is no good interface to do it before umount gets s_umount lock.
>> I think trylock(in writeback_inodes_sb_nr_if_idle()) + dirty data flush
>> can help us to fix the bug perfectly.
>
> But it won't fix the umount while balancing family of deadlocks (they
> are really of the same nature, vfs grabs s_umount mutex and we need it
> to proceed). (Balance cancelling code is part of restriper patches,
> it's just a hook in close_ctree() that waits until we are done
> relocating a chunk - very similar to cleaner wait)
I will change the logic, we will add a while loop to check if something is
running(xxx_running is not zero), if yes, invoke btrfs_sync_fs() to do
dirty page flush.
>
> One example would be that balancing code while dirtying pages calls
> balance_dirty_pages_ratelimited() for each dirtied page, as it should.
> And if balance_dirty_pages() then decides to initiate writeback we are
> stuck schedule()ing forever, because writeback can't proceed w/o
> read-taking s_umount mutex which is fully held by vfs - it just skips
> the relocation inode.
AFAIK, balance_dirty_pages() just wake up the flush thread, and the flush
thread also doesn't grab s_umount. So we needn't worry about it.I think.
Thanks
Miao
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2011-12-08 3:46 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-12-06 5:35 [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount Miao Xie
2011-12-06 5:49 ` Al Viro
2011-12-06 6:52 ` Miao Xie
2011-12-06 9:59 ` Christoph Hellwig
2011-12-06 11:06 ` Miao Xie
2011-12-06 11:23 ` Christoph Hellwig
2011-12-06 21:36 ` Chris Mason
2011-12-07 2:31 ` Miao Xie
2011-12-07 11:11 ` Ilya Dryomov
2011-12-08 3:46 ` Miao Xie
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).