* [PATCHSET RFC] sched, jbd2: mark sleeps on journal->j_checkpoint_mutex as iowait @ 2016-10-28 16:58 Tejun Heo 2016-10-28 16:58 ` [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() Tejun Heo ` (3 more replies) 0 siblings, 4 replies; 16+ messages in thread
From: Tejun Heo @ 2016-10-28 16:58 UTC (permalink / raw)
To: torvalds, akpm, mingo, peterz, axboe, tytso, jack, adilger.kernel
Cc: linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo

Hello,

When there's heavy metadata operation traffic on ext4, the journal quickly fills up and the majority of filesystem users end up blocking on journal->j_checkpoint_mutex with a stacktrace similar to the following.

 [<ffffffff8c32e758>] __jbd2_log_wait_for_space+0xb8/0x1d0
 [<ffffffff8c3285f6>] add_transaction_credits+0x286/0x2a0
 [<ffffffff8c32876c>] start_this_handle+0x10c/0x400
 [<ffffffff8c328c5b>] jbd2__journal_start+0xdb/0x1e0
 [<ffffffff8c30ee5d>] __ext4_journal_start_sb+0x6d/0x120
 [<ffffffff8c2d713e>] __ext4_new_inode+0x64e/0x1330
 [<ffffffff8c2e9bf0>] ext4_create+0xc0/0x1c0
 [<ffffffff8c2570fd>] path_openat+0x124d/0x1380
 [<ffffffff8c258501>] do_filp_open+0x91/0x100
 [<ffffffff8c2462d0>] do_sys_open+0x130/0x220
 [<ffffffff8c2463de>] SyS_open+0x1e/0x20
 [<ffffffff8c7ec5b2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
 [<ffffffffffffffff>] 0xffffffffffffffff

Because the sleeps on the mutex aren't accounted as iowait, the system doesn't show the usual signs of being bogged down by IOs - both iowait and /proc/stat:procs_blocked stay misleadingly low. While propagation of iowait through locking constructs is far from being strict, heavy contention on j_checkpoint_mutex is easy to trigger, obviously IO-bound, and getting it right can help users in tracking down the issue quite a bit.

Due to the way io_schedule() is implemented, it is currently hairy to add an io variant to an existing interface - the schedule() call itself, which is usually buried deep, would have to be replaced with io_schedule(). As we already have current->in_iowait to mark the task as sleeping for iowait, this can be made easy by breaking up io_schedule() into multiple steps so that the preparation and marking can be done before calling an existing interface and the actual iowait accounting can be done from inside the scheduler.

What do you think? This patchset contains the following four patches.

 0001-sched-move-IO-scheduling-accounting-from-io_schedule.patch
 0002-sched-separate-out-io_schedule_prepare-and-io_schedu.patch
 0003-mutex-add-mutex_lock_io.patch
 0004-jbd2-use-mutex_lock_io-for-journal-j_checkpoint_mute.patch

0001-0002 implement io_schedule_prepare/finish(). 0003 implements mutex_lock_io() using io_schedule_prepare/finish(). 0004 uses mutex_lock_io() on journal->j_checkpoint_mutex.

This patchset is also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-mutex_lock_io

Thanks, diffstat follows.

 fs/jbd2/commit.c       |    2 -
 fs/jbd2/journal.c      |   14 ++++++-------
 include/linux/mutex.h  |    4 +++
 include/linux/sched.h  |    8 ++-----
 kernel/locking/mutex.c |   24 ++++++++++++++++++++++
 kernel/sched/core.c    |   52 +++++++++++++++++++++++++++++++++++++------------
 6 files changed, 79 insertions(+), 25 deletions(-)

--
tejun

^ permalink raw reply [flat|nested] 16+ messages in thread
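For orientation, the usage pattern the cover letter describes boils down to bracketing an existing blocking call with the prepare/finish helpers so the sleep is accounted as iowait. This is a minimal sketch that simply mirrors what patches 2 and 3 below implement (mutex_lock_io() built on io_schedule_prepare()/io_schedule_finish()); it does not add anything beyond the series.

	void mutex_lock_io(struct mutex *lock)
	{
		int token;

		token = io_schedule_prepare();	/* mark task as in_iowait, flush plugged IO */
		mutex_lock(lock);		/* existing blocking primitive, unchanged */
		io_schedule_finish(token);	/* restore the previous in_iowait state */
	}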
* [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() 2016-10-28 16:58 [PATCHSET RFC] sched, jbd2: mark sleeps on journal->j_checkpoint_mutex as iowait Tejun Heo @ 2016-10-28 16:58 ` Tejun Heo 2016-10-28 18:27 ` Peter Zijlstra ` (2 more replies) 2016-10-28 16:58 ` [PATCH 2/4] sched: separate out io_schedule_prepare() and io_schedule_finish() Tejun Heo ` (2 subsequent siblings) 3 siblings, 3 replies; 16+ messages in thread From: Tejun Heo @ 2016-10-28 16:58 UTC (permalink / raw) To: torvalds, akpm, mingo, peterz, axboe, tytso, jack, adilger.kernel Cc: linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo, Tejun Heo For an interface to support blocking for IOs, it must call io_schedule() instead of schedule(). This makes it tedious to add IO blocking to existing interfaces as the switching between schedule() and io_schedule() is often buried deep. As we already have a way to mark the task as IO scheduling, this can be made easier by separating out io_schedule() into multiple steps so that IO schedule preparation can be performed before invoking a blocking interface and the actual accounting happens inside schedule(). io_schedule_timeout() does the following three things prior to calling schedule_timeout(). 1. Mark the task as scheduling for IO. 2. Flush out plugged IOs. 3. Account the IO scheduling. #1 and #2 can be performed in the prepartaion step while #3 must be done close to the actual scheduling. This patch moves #3 into __schedule() so that later patches can separate out preparation and finish steps from io_schedule(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> --- kernel/sched/core.c | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 94732d1..f6baa38 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt) unsigned long *switch_count; struct pin_cookie cookie; struct rq *rq; - int cpu; + int cpu, in_iowait; cpu = smp_processor_id(); rq = cpu_rq(cpu); prev = rq->curr; + in_iowait = prev->in_iowait; + + if (in_iowait) { + delayacct_blkio_start(); + atomic_inc(&rq->nr_iowait); + } schedule_debug(prev); @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt) } balance_callback(rq); + + if (in_iowait) { + atomic_dec(&rq->nr_iowait); + delayacct_blkio_end(); + } } void __noreturn do_task_dead(void) @@ -5063,19 +5074,13 @@ EXPORT_SYMBOL_GPL(yield_to); long __sched io_schedule_timeout(long timeout) { int old_iowait = current->in_iowait; - struct rq *rq; long ret; current->in_iowait = 1; blk_schedule_flush_plug(current); - delayacct_blkio_start(); - rq = raw_rq(); - atomic_inc(&rq->nr_iowait); ret = schedule_timeout(timeout); current->in_iowait = old_iowait; - atomic_dec(&rq->nr_iowait); - delayacct_blkio_end(); return ret; } -- 2.7.4 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() 2016-10-28 16:58 ` [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() Tejun Heo @ 2016-10-28 18:27 ` Peter Zijlstra 2016-10-28 19:07 ` Peter Zijlstra 2016-11-03 15:33 ` Pavan Kondeti 2016-12-06 21:29 ` [PATCH v2 " Tejun Heo 2 siblings, 1 reply; 16+ messages in thread From: Peter Zijlstra @ 2016-10-28 18:27 UTC (permalink / raw) To: Tejun Heo Cc: torvalds, akpm, mingo, axboe, tytso, jack, adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo On Fri, Oct 28, 2016 at 12:58:09PM -0400, Tejun Heo wrote: > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt) > unsigned long *switch_count; > struct pin_cookie cookie; > struct rq *rq; > - int cpu; > + int cpu, in_iowait; > > cpu = smp_processor_id(); > rq = cpu_rq(cpu); > prev = rq->curr; > + in_iowait = prev->in_iowait; > + > + if (in_iowait) { > + delayacct_blkio_start(); > + atomic_inc(&rq->nr_iowait); > + } > > schedule_debug(prev); > > @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt) > } > > balance_callback(rq); > + > + if (in_iowait) { > + atomic_dec(&rq->nr_iowait); > + delayacct_blkio_end(); > + } > } > > void __noreturn do_task_dead(void) Urgh, can't say I like this much. It moves two branches into the schedule path. Nor do I really like the idea of having to annotate special mutexes for the iowait crap. I'll think more after KS/LPC etc.. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() 2016-10-28 18:27 ` Peter Zijlstra @ 2016-10-28 19:07 ` Peter Zijlstra 2016-10-28 19:12 ` Tejun Heo 0 siblings, 1 reply; 16+ messages in thread From: Peter Zijlstra @ 2016-10-28 19:07 UTC (permalink / raw) To: Tejun Heo Cc: torvalds, akpm, mingo, axboe, tytso, jack, adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo On Fri, Oct 28, 2016 at 08:27:12PM +0200, Peter Zijlstra wrote: > On Fri, Oct 28, 2016 at 12:58:09PM -0400, Tejun Heo wrote: > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt) > > unsigned long *switch_count; > > struct pin_cookie cookie; > > struct rq *rq; > > - int cpu; > > + int cpu, in_iowait; > > > > cpu = smp_processor_id(); > > rq = cpu_rq(cpu); > > prev = rq->curr; > > + in_iowait = prev->in_iowait; > > + > > + if (in_iowait) { > > + delayacct_blkio_start(); > > + atomic_inc(&rq->nr_iowait); > > + } > > > > schedule_debug(prev); > > > > @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt) > > } > > > > balance_callback(rq); > > + > > + if (in_iowait) { > > + atomic_dec(&rq->nr_iowait); > > + delayacct_blkio_end(); > > + } > > } > > > > void __noreturn do_task_dead(void) > > Urgh, can't say I like this much. It moves two branches into the > schedule path. > > Nor do I really like the idea of having to annotate special mutexes for > the iowait crap. > > I'll think more after KS/LPC etc.. One alternative is to inherit the iowait state of the task we block on. That'll not get rid of the branches much, but it will remove the new mutex APIs. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() 2016-10-28 19:07 ` Peter Zijlstra @ 2016-10-28 19:12 ` Tejun Heo 2016-10-29 3:21 ` Peter Zijlstra 0 siblings, 1 reply; 16+ messages in thread From: Tejun Heo @ 2016-10-28 19:12 UTC (permalink / raw) To: Peter Zijlstra Cc: Tejun Heo, torvalds, akpm, mingo, axboe, tytso, jack, adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo Hello, Peter. On Fri, Oct 28, 2016 at 09:07:02PM +0200, Peter Zijlstra wrote: > One alternative is to inherit the iowait state of the task we block on. > That'll not get rid of the branches much, but it will remove the new > mutex APIs. Yeah, thought about that briefly but we don't necessarily track mutex or other synchronization construct owners, things get gnarly with rwsems (the inode ones sometimes end up in a similar situation), and we'll probably end up dealing with some surprising propagations down the line. That said, getting such automatic propagation working would be great. Thanks. -- tejun ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() 2016-10-28 19:12 ` Tejun Heo @ 2016-10-29 3:21 ` Peter Zijlstra 2016-10-31 16:45 ` Tejun Heo 0 siblings, 1 reply; 16+ messages in thread From: Peter Zijlstra @ 2016-10-29 3:21 UTC (permalink / raw) To: Tejun Heo Cc: Tejun Heo, torvalds, akpm, mingo, axboe, tytso, jack, adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo On Fri, Oct 28, 2016 at 03:12:32PM -0400, Tejun Heo wrote: > Hello, Peter. > > On Fri, Oct 28, 2016 at 09:07:02PM +0200, Peter Zijlstra wrote: > > One alternative is to inherit the iowait state of the task we block on. > > That'll not get rid of the branches much, but it will remove the new > > mutex APIs. > > Yeah, thought about that briefly but we don't necessarily track mutex This one I actually fixed and should be in -next. And it would be sufficient to cover the use case here. > or other synchronization construct owners, things get gnarly with > rwsems (the inode ones sometimes end up in a similar situation), and > we'll probably end up dealing with some surprising propagations down > the line. rwsems could be done for writers only. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() 2016-10-29 3:21 ` Peter Zijlstra @ 2016-10-31 16:45 ` Tejun Heo 2016-12-06 21:30 ` Tejun Heo 0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2016-10-31 16:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tejun Heo, torvalds, akpm, mingo, axboe, tytso, jack, adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo

Hello,

On Sat, Oct 29, 2016 at 05:21:26AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 03:12:32PM -0400, Tejun Heo wrote:
> > Hello, Peter.
> >
> > On Fri, Oct 28, 2016 at 09:07:02PM +0200, Peter Zijlstra wrote:
> > > One alternative is to inherit the iowait state of the task we block on.
> > > That'll not get rid of the branches much, but it will remove the new
> > > mutex APIs.
> >
> > Yeah, thought about that briefly but we don't necessarily track mutex
>
> This one I actually fixed and should be in -next. And it would be
> sufficient to cover the use case here.

Tracking the owners of mutexes and rwsems does help quite a bit. I don't think it's as simple as inheriting the io sleep state from the current owner though. The owner might be running or in a non-IO sleep when others try to grab the mutex. It is an option to ignore those cases, but that has a real possibility of leading to surprising results in some corner cases. If we choose to propagate dynamically, it becomes a much more complex problem and I don't think it'd be justifiable.

Unless there can be a simple enough and reliable solution, I think it'd be better to stick with explicit marking.

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() 2016-10-31 16:45 ` Tejun Heo @ 2016-12-06 21:30 ` Tejun Heo 0 siblings, 0 replies; 16+ messages in thread From: Tejun Heo @ 2016-12-06 21:30 UTC (permalink / raw) To: Peter Zijlstra Cc: Tejun Heo, torvalds, akpm, mingo, axboe, tytso, jack, adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo Hello, On Mon, Oct 31, 2016 at 10:45:56AM -0600, Tejun Heo wrote: > Tracking the owners of mutexes and rwsems does help quite a bit. I > don't think it's as simple as inheriting io sleep state from the > current owner tho. The owner might be running or in a non-IO sleep > when others try to grab the mutex. It is an option to ignore those > cases but this would have a real possibility to lead to surprising > results in some corner cases. If we choose to propagate dynamically, > it becomes an a lot more complex problem and I don't think it'd be > justfiable. > > Unless there can be a simple enough and reliable solution, I think > it'd be better to stick with explicit marking. Just posted the fixed version for the first patch. Any more thoughts on this? Thanks. -- tejun ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() 2016-10-28 16:58 ` [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() Tejun Heo 2016-10-28 18:27 ` Peter Zijlstra @ 2016-11-03 15:33 ` Pavan Kondeti 2016-11-08 22:51 ` Tejun Heo 2016-12-06 21:29 ` [PATCH v2 " Tejun Heo 2 siblings, 1 reply; 16+ messages in thread From: Pavan Kondeti @ 2016-11-03 15:33 UTC (permalink / raw) To: Tejun Heo Cc: torvalds, akpm, mingo, Peter Zijlstra, axboe, tytso, jack, adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo Hi Tejun, > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 94732d1..f6baa38 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt) > unsigned long *switch_count; > struct pin_cookie cookie; > struct rq *rq; > - int cpu; > + int cpu, in_iowait; > > cpu = smp_processor_id(); > rq = cpu_rq(cpu); > prev = rq->curr; > + in_iowait = prev->in_iowait; > + > + if (in_iowait) { > + delayacct_blkio_start(); > + atomic_inc(&rq->nr_iowait); > + } > > schedule_debug(prev); > > @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt) > } > > balance_callback(rq); > + > + if (in_iowait) { > + atomic_dec(&rq->nr_iowait); > + delayacct_blkio_end(); > + } > } I think, the nr_iowait update can go wrong here. When the task migrates to a different CPU upon wakeup, this rq points to a different CPU from the one on which nr_iowait is incremented before. Thanks, Pavan -- Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc. Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() 2016-11-03 15:33 ` Pavan Kondeti @ 2016-11-08 22:51 ` Tejun Heo 0 siblings, 0 replies; 16+ messages in thread From: Tejun Heo @ 2016-11-08 22:51 UTC (permalink / raw) To: Pavan Kondeti Cc: torvalds, akpm, mingo, Peter Zijlstra, axboe, tytso, jack, adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo Hello, On Thu, Nov 03, 2016 at 09:03:45PM +0530, Pavan Kondeti wrote: > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index 94732d1..f6baa38 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt) > > unsigned long *switch_count; > > struct pin_cookie cookie; > > struct rq *rq; > > - int cpu; > > + int cpu, in_iowait; > > > > cpu = smp_processor_id(); > > rq = cpu_rq(cpu); > > prev = rq->curr; > > + in_iowait = prev->in_iowait; > > + > > + if (in_iowait) { > > + delayacct_blkio_start(); > > + atomic_inc(&rq->nr_iowait); > > + } > > > > schedule_debug(prev); > > > > @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt) > > } > > > > balance_callback(rq); > > + > > + if (in_iowait) { > > + atomic_dec(&rq->nr_iowait); > > + delayacct_blkio_end(); > > + } > > } > > I think, the nr_iowait update can go wrong here. > > When the task migrates to a different CPU upon wakeup, this rq points > to a different CPU from the one on which nr_iowait is incremented > before. Ah, you're right, it should remember the original rq. Thanks. -- tejun ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v2 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() 2016-10-28 16:58 ` [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() Tejun Heo 2016-10-28 18:27 ` Peter Zijlstra 2016-11-03 15:33 ` Pavan Kondeti @ 2016-12-06 21:29 ` Tejun Heo 2016-12-07 9:35 ` Peter Zijlstra 2 siblings, 1 reply; 16+ messages in thread From: Tejun Heo @ 2016-12-06 21:29 UTC (permalink / raw) To: torvalds, akpm, mingo, peterz, axboe, tytso, jack, adilger.kernel Cc: linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo For an interface to support blocking for IOs, it must call io_schedule() instead of schedule(). This makes it tedious to add IO blocking to existing interfaces as the switching between schedule() and io_schedule() is often buried deep. As we already have a way to mark the task as IO scheduling, this can be made easier by separating out io_schedule() into multiple steps so that IO schedule preparation can be performed before invoking a blocking interface and the actual accounting happens inside schedule(). io_schedule_timeout() does the following three things prior to calling schedule_timeout(). 1. Mark the task as scheduling for IO. 2. Flush out plugged IOs. 3. Account the IO scheduling. #1 and #2 can be performed in the prepartaion step while #3 must be done close to the actual scheduling. This patch moves #3 into __schedule() so that later patches can separate out preparation and finish steps from io_schedule(). v2: Remember the rq in @prev_rq and use it for decrementing nr_iowait to avoid misattributing the count after the task gets migrated to another CPU. Noticed by Pavan. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavan Kondeti <pkondeti@codeaurora.org> --- kernel/sched/core.c | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-) --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3335,12 +3335,18 @@ static void __sched notrace __schedule(b struct task_struct *prev, *next; unsigned long *switch_count; struct pin_cookie cookie; - struct rq *rq; - int cpu; + struct rq *rq, *prev_rq; + int cpu, in_iowait; cpu = smp_processor_id(); - rq = cpu_rq(cpu); + rq = prev_rq = cpu_rq(cpu); prev = rq->curr; + in_iowait = prev->in_iowait; + + if (in_iowait) { + delayacct_blkio_start(); + atomic_inc(&rq->nr_iowait); + } schedule_debug(prev); @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(b } balance_callback(rq); + + if (in_iowait) { + atomic_dec(&prev_rq->nr_iowait); + delayacct_blkio_end(); + } } void __noreturn do_task_dead(void) @@ -5063,19 +5074,13 @@ EXPORT_SYMBOL_GPL(yield_to); long __sched io_schedule_timeout(long timeout) { int old_iowait = current->in_iowait; - struct rq *rq; long ret; current->in_iowait = 1; blk_schedule_flush_plug(current); - delayacct_blkio_start(); - rq = raw_rq(); - atomic_inc(&rq->nr_iowait); ret = schedule_timeout(timeout); current->in_iowait = old_iowait; - atomic_dec(&rq->nr_iowait); - delayacct_blkio_end(); return ret; } ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() 2016-12-06 21:29 ` [PATCH v2 " Tejun Heo @ 2016-12-07 9:35 ` Peter Zijlstra 2016-12-07 20:48 ` [PATCH v3 1/4] sched: move IO scheduling accounting from io_schedule_timeout() into scheduler Tejun Heo 0 siblings, 1 reply; 16+ messages in thread From: Peter Zijlstra @ 2016-12-07 9:35 UTC (permalink / raw) To: Tejun Heo Cc: torvalds, akpm, mingo, axboe, tytso, jack, adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo On Tue, Dec 06, 2016 at 04:29:35PM -0500, Tejun Heo wrote: > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3335,12 +3335,18 @@ static void __sched notrace __schedule(b > struct task_struct *prev, *next; > unsigned long *switch_count; > struct pin_cookie cookie; > - struct rq *rq; > - int cpu; > + struct rq *rq, *prev_rq; > + int cpu, in_iowait; > > cpu = smp_processor_id(); > - rq = cpu_rq(cpu); > + rq = prev_rq = cpu_rq(cpu); > prev = rq->curr; > + in_iowait = prev->in_iowait; > + > + if (in_iowait) { > + delayacct_blkio_start(); > + atomic_inc(&rq->nr_iowait); > + } > > schedule_debug(prev); > > @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(b > } > > balance_callback(rq); > + > + if (in_iowait) { > + atomic_dec(&prev_rq->nr_iowait); > + delayacct_blkio_end(); > + } > } > > void __noreturn do_task_dead(void) So I really dislike this, it mucks with the fast paths for something of dubious utility. At the very least move this to the actual blocking paths such that regular context switches don't see this overhead (also delayacct is horrific crap). A little something like so... completely untested. --- kernel/sched/core.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 61 insertions(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8b08fb257856..935bcd91f4a4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2085,11 +2085,24 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) p->sched_contributes_to_load = !!task_contributes_to_load(p); p->state = TASK_WAKING; + if (p->in_iowait) { + delayacct_blkio_end(); + atomic_dec(&task_rq(p)->nr_iowait); + } + cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags); if (task_cpu(p) != cpu) { wake_flags |= WF_MIGRATED; set_task_cpu(p, cpu); } + +#else /* CONFIG_SMP */ + + if (p->in_iowait) { + delayacct_blkio_end(); + atomic_dec(&task_rq(p)->nr_iowait); + } + #endif /* CONFIG_SMP */ ttwu_queue(p, cpu, wake_flags); @@ -2139,8 +2152,13 @@ static void try_to_wake_up_local(struct task_struct *p, struct pin_cookie cookie trace_sched_waking(p); - if (!task_on_rq_queued(p)) + if (!task_on_rq_queued(p)) { + if (p->in_iowait) { + delayacct_blkio_end(); + atomic_dec(&task_rq(p)->nr_iowait); + } ttwu_activate(rq, p, ENQUEUE_WAKEUP); + } ttwu_do_wakeup(rq, p, 0, cookie); ttwu_stat(p, smp_processor_id(), 0); @@ -2948,6 +2966,36 @@ unsigned long long nr_context_switches(void) return sum; } +/* + * IO-wait accounting, and how its mostly bollocks (on SMP). + * + * The idea behind IO-wait account is to account the idle time that we could + * have spend running if it were not for IO. That is, if we were to improve the + * storage performance, we'd have a proportional reduction in IO-wait time. + * + * This all works nicely on UP, where, when a task blocks on IO, we account + * idle time as IO-wait, because if the storage were faster, it could've been + * running and we'd not be idle. 
+ * + * This has been extended to SMP, by doing the same for each CPU. This however + * is broken. + * + * Imagine for instance the case where two tasks block on one CPU, only the one + * CPU will have IO-wait accounted, while the other has regular idle. Even + * though, if the storage were faster, both could've ran at the same time, + * utilising both CPUs. + * + * This means, that when looking globally, the current IO-wait accounting on + * SMP is a lower bound, by reason of under accounting. + * + * Worse, since the numbers are provided per CPU, they are sometimes + * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly + * associated with any one particular CPU, it can wake to another CPU than it + * blocked on. This means the per CPU IO-wait number is meaningless. + * + * Task CPU affinities can make all that even more 'interesting'. + */ + unsigned long nr_iowait(void) { unsigned long i, sum = 0; @@ -2958,6 +3006,13 @@ unsigned long nr_iowait(void) return sum; } +/* + * Consumers of these two interfaces, like for example the cpufreq menu + * governor are using nonsensical data. Boosting frequency for a CPU that has + * IO-wait which might not even end up running the task when it does become + * runnable. + */ + unsigned long nr_iowait_cpu(int cpu) { struct rq *this = cpu_rq(cpu); @@ -3369,6 +3424,11 @@ static void __sched notrace __schedule(bool preempt) deactivate_task(rq, prev, DEQUEUE_SLEEP); prev->on_rq = 0; + if (prev->in_iowait) { + atomic_inc(&rq->nr_iowait); + delayacct_blkio_start(); + } + /* * If a worker went to sleep, notify and ask workqueue * whether it wants to wake up a task to maintain ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v3 1/4] sched: move IO scheduling accounting from io_schedule_timeout() into scheduler 2016-12-07 9:35 ` Peter Zijlstra @ 2016-12-07 20:48 ` Tejun Heo 0 siblings, 0 replies; 16+ messages in thread From: Tejun Heo @ 2016-12-07 20:48 UTC (permalink / raw) To: Peter Zijlstra Cc: torvalds, akpm, mingo, axboe, tytso, jack, adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo Hello, Yeah, that works. Here's v3 based on your patch. The other patches still apply correctly. Thanks. ------ 8< ------ For an interface to support blocking for IOs, it must call io_schedule() instead of schedule(). This makes it tedious to add IO blocking to existing interfaces as the switching between schedule() and io_schedule() is often buried deep. As we already have a way to mark the task as IO scheduling, this can be made easier by separating out io_schedule() into multiple steps so that IO schedule preparation can be performed before invoking a blocking interface and the actual accounting happens inside the scheduler. io_schedule_timeout() does the following three things prior to calling schedule_timeout(). 1. Mark the task as scheduling for IO. 2. Flush out plugged IOs. 3. Account the IO scheduling. #1 and #2 can be performed in the prepartaion step while #3 must be done close to the actual scheduling. This patch moves #3 into the scheduler so that later patches can separate out preparation and finish steps from io_schedule(). v3: Replaced with PeterZ's implementation which performs nr_iowait accounting in the sleep and wake up path to avoid unnecessarily burdening non sleeping paths in __schedule(). v2: Remember the rq in @prev_rq and use it for decrementing nr_iowait to avoid misattributing the count after the task gets migrated to another CPU. Noticed by Pavan. Signed-off-by: Tejun Heo <tj@kernel.org> Patch-originally-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavan Kondeti <pkondeti@codeaurora.org> --- kernel/sched/core.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 61 insertions(+), 7 deletions(-) --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2085,11 +2085,24 @@ try_to_wake_up(struct task_struct *p, un p->sched_contributes_to_load = !!task_contributes_to_load(p); p->state = TASK_WAKING; + if (p->in_iowait) { + delayacct_blkio_end(); + atomic_dec(&task_rq(p)->nr_iowait); + } + cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags); if (task_cpu(p) != cpu) { wake_flags |= WF_MIGRATED; set_task_cpu(p, cpu); } + +#else /* CONFIG_SMP */ + + if (p->in_iowait) { + delayacct_blkio_end(); + atomic_dec(&task_rq(p)->nr_iowait); + } + #endif /* CONFIG_SMP */ ttwu_queue(p, cpu, wake_flags); @@ -2139,8 +2152,13 @@ static void try_to_wake_up_local(struct trace_sched_waking(p); - if (!task_on_rq_queued(p)) + if (!task_on_rq_queued(p)) { + if (p->in_iowait) { + delayacct_blkio_end(); + atomic_dec(&rq->nr_iowait); + } ttwu_activate(rq, p, ENQUEUE_WAKEUP); + } ttwu_do_wakeup(rq, p, 0, cookie); ttwu_stat(p, smp_processor_id(), 0); @@ -2948,6 +2966,36 @@ unsigned long long nr_context_switches(v return sum; } +/* + * IO-wait accounting, and how its mostly bollocks (on SMP). + * + * The idea behind IO-wait account is to account the idle time that we could + * have spend running if it were not for IO. 
That is, if we were to improve the + * storage performance, we'd have a proportional reduction in IO-wait time. + * + * This all works nicely on UP, where, when a task blocks on IO, we account + * idle time as IO-wait, because if the storage were faster, it could've been + * running and we'd not be idle. + * + * This has been extended to SMP, by doing the same for each CPU. This however + * is broken. + * + * Imagine for instance the case where two tasks block on one CPU, only the one + * CPU will have IO-wait accounted, while the other has regular idle. Even + * though, if the storage were faster, both could've ran at the same time, + * utilising both CPUs. + * + * This means, that when looking globally, the current IO-wait accounting on + * SMP is a lower bound, by reason of under accounting. + * + * Worse, since the numbers are provided per CPU, they are sometimes + * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly + * associated with any one particular CPU, it can wake to another CPU than it + * blocked on. This means the per CPU IO-wait number is meaningless. + * + * Task CPU affinities can make all that even more 'interesting'. + */ + unsigned long nr_iowait(void) { unsigned long i, sum = 0; @@ -2958,6 +3006,13 @@ unsigned long nr_iowait(void) return sum; } +/* + * Consumers of these two interfaces, like for example the cpufreq menu + * governor are using nonsensical data. Boosting frequency for a CPU that has + * IO-wait which might not even end up running the task when it does become + * runnable. + */ + unsigned long nr_iowait_cpu(int cpu) { struct rq *this = cpu_rq(cpu); @@ -3369,6 +3424,11 @@ static void __sched notrace __schedule(b deactivate_task(rq, prev, DEQUEUE_SLEEP); prev->on_rq = 0; + if (prev->in_iowait) { + atomic_inc(&rq->nr_iowait); + delayacct_blkio_start(); + } + /* * If a worker went to sleep, notify and ask workqueue * whether it wants to wake up a task to maintain @@ -5063,19 +5123,13 @@ EXPORT_SYMBOL_GPL(yield_to); long __sched io_schedule_timeout(long timeout) { int old_iowait = current->in_iowait; - struct rq *rq; long ret; current->in_iowait = 1; blk_schedule_flush_plug(current); - delayacct_blkio_start(); - rq = raw_rq(); - atomic_inc(&rq->nr_iowait); ret = schedule_timeout(timeout); current->in_iowait = old_iowait; - atomic_dec(&rq->nr_iowait); - delayacct_blkio_end(); return ret; } ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 2/4] sched: separate out io_schedule_prepare() and io_schedule_finish() 2016-10-28 16:58 [PATCHSET RFC] sched, jbd2: mark sleeps on journal->j_checkpoint_mutex as iowait Tejun Heo 2016-10-28 16:58 ` [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() Tejun Heo @ 2016-10-28 16:58 ` Tejun Heo 2016-10-28 16:58 ` [PATCH 3/4] mutex: add mutex_lock_io() Tejun Heo 2016-10-28 16:58 ` [PATCH 4/4] jbd2: use mutex_lock_io() for journal->j_checkpoint_mutex Tejun Heo 3 siblings, 0 replies; 16+ messages in thread From: Tejun Heo @ 2016-10-28 16:58 UTC (permalink / raw) To: torvalds, akpm, mingo, peterz, axboe, tytso, jack, adilger.kernel Cc: linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo, Tejun Heo Now that IO schedule accounting is done inside __schedule(), io_schedule() can be split into three steps - prep, schedule, and finish - where the schedule part doesn't need any special annotation. This allows marking a sleep as iowait by simply wrapping an existing blocking function with io_schedule_prepare() and io_schedule_finish(). Because task_struct->in_iowait is single bit, the caller of io_schedule_prepare() needs to record and the pass its state to io_schedule_finish() to be safe regarding nesting. While this isn't the prettiest, these functions are mostly gonna be used by core functions and we don't want to use more space for ->in_iowait. While at it, as it's simple to do now, reimplement io_schedule() without unnecessarily going through io_schedule_timeout(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> --- include/linux/sched.h | 8 +++----- kernel/sched/core.c | 33 ++++++++++++++++++++++++++++----- 2 files changed, 31 insertions(+), 10 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 348f51b..c025f77 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -441,12 +441,10 @@ extern signed long schedule_timeout_idle(signed long timeout); asmlinkage void schedule(void); extern void schedule_preempt_disabled(void); +extern int __must_check io_schedule_prepare(void); +extern void io_schedule_finish(int token); extern long io_schedule_timeout(long timeout); - -static inline void io_schedule(void) -{ - io_schedule_timeout(MAX_SCHEDULE_TIMEOUT); -} +extern void io_schedule(void); void __noreturn do_task_dead(void); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f6baa38..30d3185 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5067,25 +5067,48 @@ int __sched yield_to(struct task_struct *p, bool preempt) } EXPORT_SYMBOL_GPL(yield_to); +int io_schedule_prepare(void) +{ + int old_iowait = current->in_iowait; + + current->in_iowait = 1; + blk_schedule_flush_plug(current); + + return old_iowait; +} + +void io_schedule_finish(int token) +{ + current->in_iowait = token; +} + /* * This task is about to go to sleep on IO. Increment rq->nr_iowait so * that process accounting knows that this is a task in IO wait state. 
*/ long __sched io_schedule_timeout(long timeout) { - int old_iowait = current->in_iowait; + int token; long ret; - current->in_iowait = 1; - blk_schedule_flush_plug(current); - + token = io_schedule_prepare(); ret = schedule_timeout(timeout); - current->in_iowait = old_iowait; + io_schedule_finish(token); return ret; } EXPORT_SYMBOL(io_schedule_timeout); +void io_schedule(void) +{ + int token; + + token = io_schedule_prepare(); + schedule(); + io_schedule_finish(token); +} +EXPORT_SYMBOL(io_schedule); + /** * sys_sched_get_priority_max - return maximum RT priority. * @policy: scheduling class. -- 2.7.4 ^ permalink raw reply related [flat|nested] 16+ messages in thread
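With io_schedule_prepare() and io_schedule_finish() split out as above, an existing sleeping interface can be marked as iowait without touching its internals. As a rough illustration only - this wrapper is not part of the patchset, and both its name and the choice of wait_for_completion() are just for the example - a caller could do:

	static void wait_for_completion_io_sketch(struct completion *x)
	{
		int token;

		token = io_schedule_prepare();	/* sets current->in_iowait, flushes plugged IO */
		wait_for_completion(x);		/* existing blocking interface, unchanged */
		io_schedule_finish(token);	/* restores the saved in_iowait state */
	}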
* [PATCH 3/4] mutex: add mutex_lock_io() 2016-10-28 16:58 [PATCHSET RFC] sched, jbd2: mark sleeps on journal->j_checkpoint_mutex as iowait Tejun Heo 2016-10-28 16:58 ` [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() Tejun Heo 2016-10-28 16:58 ` [PATCH 2/4] sched: separate out io_schedule_prepare() and io_schedule_finish() Tejun Heo @ 2016-10-28 16:58 ` Tejun Heo 2016-10-28 16:58 ` [PATCH 4/4] jbd2: use mutex_lock_io() for journal->j_checkpoint_mutex Tejun Heo 3 siblings, 0 replies; 16+ messages in thread From: Tejun Heo @ 2016-10-28 16:58 UTC (permalink / raw) To: torvalds, akpm, mingo, peterz, axboe, tytso, jack, adilger.kernel Cc: linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo, Tejun Heo We sometimes end up propagating IO blocking through mutexes; however, because there currently is no way of annotating mutex sleeps as iowait, there are cases where iowait and /proc/stat:procs_blocked report misleading numbers obscuring the actual state of the system. This patch adds mutex_lock_io() so that mutex sleeps can be marked as iowait in those cases. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> --- include/linux/mutex.h | 4 ++++ kernel/locking/mutex.c | 24 ++++++++++++++++++++++++ 2 files changed, 28 insertions(+) diff --git a/include/linux/mutex.h b/include/linux/mutex.h index 2cb7531..5d77ddd 100644 --- a/include/linux/mutex.h +++ b/include/linux/mutex.h @@ -142,10 +142,12 @@ extern int __must_check mutex_lock_interruptible_nested(struct mutex *lock, unsigned int subclass); extern int __must_check mutex_lock_killable_nested(struct mutex *lock, unsigned int subclass); +extern void mutex_lock_io_nested(struct mutex *lock, unsigned int subclass); #define mutex_lock(lock) mutex_lock_nested(lock, 0) #define mutex_lock_interruptible(lock) mutex_lock_interruptible_nested(lock, 0) #define mutex_lock_killable(lock) mutex_lock_killable_nested(lock, 0) +#define mutex_lock_io(lock) mutex_lock_io_nested(lock, 0) #define mutex_lock_nest_lock(lock, nest_lock) \ do { \ @@ -157,11 +159,13 @@ do { \ extern void mutex_lock(struct mutex *lock); extern int __must_check mutex_lock_interruptible(struct mutex *lock); extern int __must_check mutex_lock_killable(struct mutex *lock); +extern void mutex_lock_io(struct mutex *lock); # define mutex_lock_nested(lock, subclass) mutex_lock(lock) # define mutex_lock_interruptible_nested(lock, subclass) mutex_lock_interruptible(lock) # define mutex_lock_killable_nested(lock, subclass) mutex_lock_killable(lock) # define mutex_lock_nest_lock(lock, nest_lock) mutex_lock(lock) +# define mutex_lock_nest_io(lock, nest_lock) mutex_io(lock) #endif /* diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index a70b90d..a37709c 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -653,6 +653,20 @@ mutex_lock_interruptible_nested(struct mutex *lock, unsigned int subclass) EXPORT_SYMBOL_GPL(mutex_lock_interruptible_nested); +void __sched +mutex_lock_io_nested(struct mutex *lock, unsigned int subclass) +{ + int token; + + might_sleep(); + + token = io_schedule_prepare(); + __mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, + subclass, NULL, _RET_IP_, NULL, 0); + io_schedule_finish(token); +} +EXPORT_SYMBOL_GPL(mutex_lock_io_nested); + static inline int ww_mutex_deadlock_injection(struct ww_mutex *lock, struct 
ww_acquire_ctx *ctx) { @@ -816,6 +830,16 @@ int __sched mutex_lock_killable(struct mutex *lock) } EXPORT_SYMBOL(mutex_lock_killable); +void __sched mutex_lock_io(struct mutex *lock) +{ + int token; + + token = io_schedule_prepare(); + mutex_lock(lock); + io_schedule_finish(token); +} +EXPORT_SYMBOL_GPL(mutex_lock_io); + __visible void __sched __mutex_lock_slowpath(atomic_t *lock_count) { -- 2.7.4 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 4/4] jbd2: use mutex_lock_io() for journal->j_checkpoint_mutex 2016-10-28 16:58 [PATCHSET RFC] sched, jbd2: mark sleeps on journal->j_checkpoint_mutex as iowait Tejun Heo ` (2 preceding siblings ...) 2016-10-28 16:58 ` [PATCH 3/4] mutex: add mutex_lock_io() Tejun Heo @ 2016-10-28 16:58 ` Tejun Heo 3 siblings, 0 replies; 16+ messages in thread From: Tejun Heo @ 2016-10-28 16:58 UTC (permalink / raw) To: torvalds, akpm, mingo, peterz, axboe, tytso, jack, adilger.kernel Cc: linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo, Tejun Heo When an ext4 fs is bogged down by a lot of metadata IOs (in the reported case, it was deletion of millions of files, but any massive amount of journal writes would do), after the journal is filled up, tasks which try to access the filesystem and aren't currently performing the journal writes end up waiting in __jbd2_log_wait_for_space() for journal->j_checkpoint_mutex. Because those mutex sleeps aren't marked as iowait, this condition can lead to misleadingly low iowait and /proc/stat:procs_blocked. While iowait propagation is far from strict, this condition can be triggered fairly easily and annotating these sleeps correctly helps initial diagnosis quite a bit. Use the new mutex_lock_io() for journal->j_checkpoint_mutex so that these sleeps are properly marked as iowait. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Mingbo Wan <mingbo@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan Kara <jack@suse.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: linux-ext4@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org --- fs/jbd2/commit.c | 2 +- fs/jbd2/journal.c | 14 +++++++------- 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index 31f8ca0..e772881 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -392,7 +392,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) /* Do we need to erase the effects of a prior jbd2_journal_flush? */ if (journal->j_flags & JBD2_FLUSHED) { jbd_debug(3, "super block updated\n"); - mutex_lock(&journal->j_checkpoint_mutex); + mutex_lock_io(&journal->j_checkpoint_mutex); /* * We hold j_checkpoint_mutex so tail cannot change under us. * We don't need any special data guarantees for writing sb diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c index 927da49..0aa7d06 100644 --- a/fs/jbd2/journal.c +++ b/fs/jbd2/journal.c @@ -944,7 +944,7 @@ int __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block) */ void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block) { - mutex_lock(&journal->j_checkpoint_mutex); + mutex_lock_io(&journal->j_checkpoint_mutex); if (tid_gt(tid, journal->j_tail_sequence)) __jbd2_update_log_tail(journal, tid, block); mutex_unlock(&journal->j_checkpoint_mutex); @@ -1304,7 +1304,7 @@ static int journal_reset(journal_t *journal) journal->j_flags |= JBD2_FLUSHED; } else { /* Lock here to make assertions happy... */ - mutex_lock(&journal->j_checkpoint_mutex); + mutex_lock_io(&journal->j_checkpoint_mutex); /* * Update log tail information. 
We use WRITE_FUA since new * transaction will start reusing journal space and so we @@ -1691,7 +1691,7 @@ int jbd2_journal_destroy(journal_t *journal) spin_lock(&journal->j_list_lock); while (journal->j_checkpoint_transactions != NULL) { spin_unlock(&journal->j_list_lock); - mutex_lock(&journal->j_checkpoint_mutex); + mutex_lock_io(&journal->j_checkpoint_mutex); err = jbd2_log_do_checkpoint(journal); mutex_unlock(&journal->j_checkpoint_mutex); /* @@ -1713,7 +1713,7 @@ int jbd2_journal_destroy(journal_t *journal) if (journal->j_sb_buffer) { if (!is_journal_aborted(journal)) { - mutex_lock(&journal->j_checkpoint_mutex); + mutex_lock_io(&journal->j_checkpoint_mutex); write_lock(&journal->j_state_lock); journal->j_tail_sequence = @@ -1954,7 +1954,7 @@ int jbd2_journal_flush(journal_t *journal) spin_lock(&journal->j_list_lock); while (!err && journal->j_checkpoint_transactions != NULL) { spin_unlock(&journal->j_list_lock); - mutex_lock(&journal->j_checkpoint_mutex); + mutex_lock_io(&journal->j_checkpoint_mutex); err = jbd2_log_do_checkpoint(journal); mutex_unlock(&journal->j_checkpoint_mutex); spin_lock(&journal->j_list_lock); @@ -1964,7 +1964,7 @@ int jbd2_journal_flush(journal_t *journal) if (is_journal_aborted(journal)) return -EIO; - mutex_lock(&journal->j_checkpoint_mutex); + mutex_lock_io(&journal->j_checkpoint_mutex); if (!err) { err = jbd2_cleanup_journal_tail(journal); if (err < 0) { @@ -2024,7 +2024,7 @@ int jbd2_journal_wipe(journal_t *journal, int write) err = jbd2_journal_skip_recovery(journal); if (write) { /* Lock to make assertions happy... */ - mutex_lock(&journal->j_checkpoint_mutex); + mutex_lock_io(&journal->j_checkpoint_mutex); jbd2_mark_journal_empty(journal, WRITE_FUA); mutex_unlock(&journal->j_checkpoint_mutex); } -- 2.7.4 ^ permalink raw reply related [flat|nested] 16+ messages in thread