wait_block_group_cache_progress() waits forever in case of drive failure

From: Alex Lyakas @ 2013-06-04 16:23 UTC
  To: linux-btrfs

Greetings all,
when testing drive failures, I occasionally hit the following hang:

# Block group is being cached-in by caching_thread()
# caching_thread() hits an error, e.g., in btrfs_search_slot(), because
of the drive failure:
	ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
	if (ret < 0)
		goto err;

# caching thread exits:
err:
	btrfs_free_path(path);
	up_read(&fs_info->extent_commit_sem);

	free_excluded_extents(extent_root, block_group);

	mutex_unlock(&caching_ctl->mutex);
out:
	wake_up(&caching_ctl->wait);

	put_caching_control(caching_ctl);
	btrfs_put_block_group(block_group);

However, wait_block_group_cache_progress() is still stuck in a stack like this:
[<ffffffff816ec509>] schedule+0x29/0x70
[<ffffffffa044bd42>] wait_block_group_cache_progress+0xe2/0x110 [btrfs]
[<ffffffff8107fc10>] ? add_wait_queue+0x60/0x60
[<ffffffff8107fc10>] ? add_wait_queue+0x60/0x60
[<ffffffffa04568d6>] find_free_extent+0x306/0xb90 [btrfs]
[<ffffffffa04462ee>] ? btrfs_search_slot+0x2fe/0x820 [btrfs]
[<ffffffffa0457200>] btrfs_reserve_extent+0xa0/0x1b0 [btrfs]
...
because of:
	wait_event(caching_ctl->wait, block_group_cache_done(cache) ||
		   (cache->free_space_ctl->free_space >= num_bytes));

But on this error path cache->cached never becomes BTRFS_CACHE_FINISHED,
and cache->free_space_ctl->free_space will also not grow enough, so the
wake_up() finds the condition still false and the wait never finishes.
At this point, the system hangs completely.

The same problem can happen with wait_block_group_cache_done().

I am thinking: can we add an additional condition, like:
	wait_event(caching_ctl->wait,
		   test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state) ||
		   block_group_cache_done(cache) ||
		   (cache->free_space_ctl->free_space >= num_bytes));

So that when the transaction aborts and the FS is marked as "bad", all
these waits complete and the user can unmount?

Or some other way to fix this problem?
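
To illustrate the idea outside the kernel, here is a minimal userspace
sketch using pthreads. It is only an analogy, not btrfs code: struct
cache_state, caching_worker(), wait_cache_progress() and run_scenario()
are hypothetical names, and the fs_error flag merely stands in for
BTRFS_FS_STATE_ERROR. The point is that the waiter's predicate includes
the error flag, so a wakeup from a failed worker lets it return.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a block group's caching state. */
struct cache_state {
	pthread_mutex_t lock;
	pthread_cond_t wait;       /* plays the role of caching_ctl->wait */
	bool finished;             /* like cached == BTRFS_CACHE_FINISHED */
	bool fs_error;             /* like BTRFS_FS_STATE_ERROR being set */
	unsigned long free_space;  /* like free_space_ctl->free_space */
	bool simulate_failure;     /* test knob: make the worker fail */
};

/* Stand-in for caching_thread(): publish the result, then wake waiters. */
static void *caching_worker(void *arg)
{
	struct cache_state *cs = arg;

	pthread_mutex_lock(&cs->lock);
	if (cs->simulate_failure) {
		/* Error path: "finished" is never set, only the error flag. */
		cs->fs_error = true;
	} else {
		cs->free_space = 8192;
		cs->finished = true;
	}
	pthread_mutex_unlock(&cs->lock);

	pthread_cond_broadcast(&cs->wait);
	return NULL;
}

/* Returns 0 once caching is done or enough space exists, -1 on error. */
static int wait_cache_progress(struct cache_state *cs, unsigned long num_bytes)
{
	int ret;

	pthread_mutex_lock(&cs->lock);
	/* Without the fs_error test this loop would spin forever on failure. */
	while (!cs->fs_error && !cs->finished && cs->free_space < num_bytes)
		pthread_cond_wait(&cs->wait, &cs->lock);
	ret = cs->fs_error ? -1 : 0;
	pthread_mutex_unlock(&cs->lock);
	return ret;
}

/* Runs one worker and one waiter; returns the waiter's result. */
int run_scenario(bool simulate_failure)
{
	struct cache_state cs = { .simulate_failure = simulate_failure };
	pthread_t worker;
	int ret;

	pthread_mutex_init(&cs.lock, NULL);
	pthread_cond_init(&cs.wait, NULL);

	if (pthread_create(&worker, NULL, caching_worker, &cs) != 0)
		return -2;
	ret = wait_cache_progress(&cs, 4096);
	pthread_join(worker, NULL);

	pthread_cond_destroy(&cs.wait);
	pthread_mutex_destroy(&cs.lock);
	return ret;
}
```

In this sketch, run_scenario(true) models the drive failure and returns
-1 instead of blocking, while run_scenario(false) returns 0. Build with
-pthread.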

Thanks,
Alex.

P.S: should I open a bugzilla for this?


* Re: wait_block_group_cache_progress() waits forever in case of drive failure
From: Stefan Behrens @ 2013-06-05  9:17 UTC
  To: Alex Lyakas; +Cc: linux-btrfs

On Tue, 4 Jun 2013 19:23:18 +0300, Alex Lyakas wrote:
[...]
> P.S: should I open a bugzilla for this?

Yes.
Otherwise the bug report gets lost.

