* Scrub racing with btrfs workers on umount
From: Jan Schmidt @ 2012-11-20 13:34 UTC
To: linux-btrfs; +Cc: Arne Jansen, Chris Mason
I found a race condition in the way scrub interacts with btrfs workers:
Throughout this mail, please carefully distinguish "workers" (struct
btrfs_workers, a thread pool) from "worker" (struct btrfs_worker_thread, a
single thread in a pool).
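For orientation, the relationship between the two is roughly this
(paraphrased from fs/btrfs/async-thread.h, most members elided):

struct btrfs_workers {			/* a pool of threads */
	struct list_head worker_list;	/* busy threads */
	struct list_head idle_list;	/* idle threads */
	spinlock_t lock;		/* protects both lists */
	/* ... */
};

struct btrfs_worker_thread {		/* one thread in a pool */
	struct list_head worker_list;	/* link into one of the lists above */
	struct task_struct *task;
	/* ... */
};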
Process A: umount
Process B: scrub-workers
Process C: genwork-worker
A: close_ctree()
A: scrub_cancel()
B: scrub_workers_put()
B: btrfs_stop_workers()
B: -> lock(workers->lock)
B: -> splice idle worker list into worker list
B: -> while (worker list)
B: -> worker = delete from worker list
B: -> unlock(workers->lock)
B: -> kthread_stop(worker)
C: __btrfs_start_workers()
C: -> kthread_run starts new scrub-worker
C: -> lock(workers->lock) /* from scrub-workers */
C: -> add to workers idle list
C: -> unlock(workers->lock)
B: -> lock(workers->lock)
A: stop genwork-worker
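For reference, the loop in question has roughly this shape (paraphrased from
fs/btrfs/async-thread.c, refcounting details elided); the race window is the
stretch where workers->lock is dropped around kthread_stop():

void btrfs_stop_workers(struct btrfs_workers *workers)
{
	struct btrfs_worker_thread *worker;

	spin_lock_irq(&workers->lock);
	/* pull idle workers over so a single loop stops everything */
	list_splice_init(&workers->idle_list, &workers->worker_list);
	while (!list_empty(&workers->worker_list)) {
		worker = list_entry(workers->worker_list.next,
				    struct btrfs_worker_thread,
				    worker_list);
		list_del_init(&worker->worker_list);
		/*
		 * window: with the lock dropped, process C can park a
		 * freshly started worker on the already-spliced idle
		 * list, where this loop never looks again
		 */
		spin_unlock_irq(&workers->lock);
		kthread_stop(worker->task);
		spin_lock_irq(&workers->lock);
		put_worker(worker);
	}
	spin_unlock_irq(&workers->lock);
}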
In this situation, btrfs_stop_workers() returns even though the scrub-workers
pool still has one idle worker. There are several strategies to fix this:

1. Let close_ctree() call scrub_cancel() after stopping the generic worker
(see the sketch after this list).

2. Make btrfs_stop_workers() more careful about insertions into the idle
list, like:
diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index 58b7d14..2523781 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -431,6 +431,8 @@ void btrfs_stop_workers(struct btrfs_workers *workers)
 		kthread_stop(worker->task);
 		spin_lock_irq(&workers->lock);
 		put_worker(worker);
+		/* we dropped the lock above, check for new idle workers */
+		list_splice_init(&workers->idle_list, &workers->worker_list);
 	}
 	spin_unlock_irq(&workers->lock);
 }
3. Find out why __btrfs_start_workers() gets called in the above scenario at all.
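To illustrate option 1, the ordering would be roughly the following (a
sketch only; the wrapper name is made up, and the real close_ctree() body
and signatures differ):

static void close_ctree_stop_order(struct btrfs_fs_info *fs_info)
{
	/*
	 * stop the generic worker first, so nothing can restart
	 * scrub workers underneath us ...
	 */
	btrfs_stop_workers(&fs_info->generic_worker);
	/* ... and only then tear down the scrub workers */
	scrub_cancel(fs_info);
}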
Any thoughts? Otherwise, I'm going for option 3 and will report back.
-Jan
* Re: Scrub racing with btrfs workers on umount
From: Chris Mason @ 2012-11-21 12:58 UTC
To: Jan Schmidt; +Cc: linux-btrfs, Arne Jansen, Chris Mason
On Tue, Nov 20, 2012 at 06:34:13AM -0700, Jan Schmidt wrote:
> I found a race condition in the way scrub interacts with btrfs workers:
[...]
> In this situation, btrfs_stop_workers() returns even though the
> scrub-workers pool still has one idle worker. There are several
> strategies to fix this:
>
> 1. Let close_ctree() call scrub_cancel() after stopping the generic worker.
>
> 2. Make btrfs_stop_workers() more careful about insertions into the idle
> list.
>
> 3. Find out why __btrfs_start_workers() gets called in the above scenario at all.
>
> Any thoughts? Otherwise, I'm going for option 3 and will report back.
I'd go for a mixture of 1 and 3. My first guess is that we're getting
endio events for scrub reads and that is kicking the scrub state
machine.
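If that guess is right, the call path would be something like the paraphrase
below (locking, ordered-work handling and error paths elided; not verified
against the current tree):

/*
 * hypothesis: an endio completion for a scrub read arrives after
 * scrub_cancel() has started tearing the pool down
 */
void btrfs_queue_worker(struct btrfs_workers *workers,
			struct btrfs_work *work)
{
	struct btrfs_worker_thread *worker;

	worker = next_worker(workers);
	if (!worker) {
		/*
		 * the idle list is empty here because
		 * btrfs_stop_workers() already spliced it away, so the
		 * completion ends up starting a brand new scrub-worker:
		 * the __btrfs_start_workers() call in the trace above
		 */
		__btrfs_start_workers(workers);
		worker = next_worker(workers);
	}
	/* hand the work item to the chosen worker and wake it */
	list_add_tail(&work->list, &worker->pending);
	wake_up_process(worker->task);
}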
-chris