From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Kent Overstreet <kmo@daterainc.com>
Cc: Sam Fulcomer <enotty@brown.edu>,
Francis Moreau <francis.moro@gmail.com>,
Daniel J Blueman <daniel@quora.org>,
linux-bcache@vger.kernel.org
Subject: Re: [PATCH] bcache: fix writeback thread to sleep less intrusively
Date: Fri, 11 Apr 2014 19:31:01 -0700 [thread overview]
Message-ID: <20140412023101.GC9176@birch.djwong.org> (raw)
In-Reply-To: <20140411014135.GA18933@birch.djwong.org>
On Thu, Apr 10, 2014 at 06:41:35PM -0700, Darrick J. Wong wrote:
> On Thu, Apr 10, 2014 at 05:46:49PM -0700, Kent Overstreet wrote:
> > On Thu, Apr 10, 2014 at 05:26:36PM -0700, Darrick J. Wong wrote:
> > > Hi all,
> > >
> > > The attached patch fixes both the "writeback blocked for XXX seconds"
> > > complaints from the kernel and the oddly high load averages on idle systems
> > > problems for me. Can you give it a try to see if it fixes your problem too?
> > >
> > > --D
> > > ---
> > > Currently, the writeback thread performs uninterruptible sleep while
> > > it waits for enough dirty data to accumulate to start writeback.
> > > Unfortunately, uninterruptible sleep counts towards load average,
> > > which artificially inflates it. Since the wb thread is a kernel
> > > thread and kthreads don't receive signals, we can use the
> > > interruptible sleep call, which eliminates the high load average
> > > symptom.
> > >
> > > A second symptom is that if we mount a non-writeback cache, the
> > > writeback thread will be woken up. If the cache later accumulates
> > > dirty data and writeback_running=1 (this seems to be a default) then
> > > the writeback thread will enter uninterruptible sleep waiting for
> > > dirty data. This is unnecessary and (I think) results in the
> > > "bcache_writebac:155 blocked for more than XXX seconds" complaints
> > > that people have been talking about. The fix for this is simple -- if
> > > we're not in writeback mode, just go to (interruptible) sleep for a
> > > long time. Alternately, we could use wait_event until the cache mode
> > > changes.
> > >
> > > Finally, change bch_cached_dev_attach() to always wake up the
> > > writeback thread, because the newly created wb thread remains in
> > > uninterruptible sleep state until something explicitly wakes it up.
> > > This wakeup allows the thread to call bch_writeback_thread(),
> > > whereupon it will most likely end up in interruptible sleep. In
> > > theory we could just let the first write take care of this, but
> > > there's really no reason not to do the transition quickly.
> > >
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > > drivers/md/bcache/super.c | 2 +-
> > > drivers/md/bcache/writeback.c | 16 ++++++++++++++--
> > > 2 files changed, 15 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> > > index 24a3a15..3ffe970 100644
> > > --- a/drivers/md/bcache/super.c
> > > +++ b/drivers/md/bcache/super.c
> > > @@ -1048,8 +1048,8 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c)
> > > bch_sectors_dirty_init(dc);
> > > atomic_set(&dc->has_dirty, 1);
> > > atomic_inc(&dc->count);
> > > - bch_writeback_queue(dc);
> > > }
> > > + bch_writeback_queue(dc);
> > >
> > > bch_cached_dev_run(dc);
> > > bcache_device_link(&dc->disk, c, "bdev");
> > > diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> > > index f4300e4..f49e6b1 100644
> > > --- a/drivers/md/bcache/writeback.c
> > > +++ b/drivers/md/bcache/writeback.c
> > > @@ -239,7 +239,7 @@ static void read_dirty(struct cached_dev *dc)
> > > if (KEY_START(&w->key) != dc->last_read ||
> > > jiffies_to_msecs(delay) > 50)
> > > while (!kthread_should_stop() && delay)
> > > - delay = schedule_timeout_uninterruptible(delay);
> > > + delay = schedule_timeout_interruptible(delay);
> > >
> > > dc->last_read = KEY_OFFSET(&w->key);
> > >
> > > @@ -401,6 +401,18 @@ static int bch_writeback_thread(void *arg)
> > >
> > > while (!kthread_should_stop()) {
> > > down_write(&dc->writeback_lock);
> > > + if (BDEV_CACHE_MODE(&dc->sb) != CACHE_MODE_WRITEBACK) {
> > > + up_write(&dc->writeback_lock);
> > > + set_current_state(TASK_INTERRUPTIBLE);
> > > +
> > > + if (kthread_should_stop())
> > > + return 0;
> > > +
> > > + try_to_freeze();
> > > + schedule_timeout_interruptible(10 * HZ);
> > > + continue;
> > > + }
> > > +
> >
> > So this addition isn't correct - cache mode might've been flipped to
> > writethrough, but we might still have dirty data we need to flush: that's why
> > the line below is checking dc->has_dirty.
>
> Good point.
>
> > I don't think your schedule_timeout_interruptible() is correct either, and I'm
> > not seeing what's wrong with the code right below - where it's doing the
> > set_current_state() then the schedule() - but from your report of high idle load
> > average (what about cpu?) it sounds like something is wrong.
>
> The load average will settle at around 2.0 or so; powertop reports idleness of
> 90% or more for all cores, and the disks aren't doing anything either.
>
> On a freshly booted VM, it calls schedule and goes to sleep for quite a while.
> As soon as I start running some fs write tests, I see that has_dirty becomes 1.
> Once in a while, searched_full_index==1, but RB_EMPTY_ROOT(&dc->writeback_keys.keys)
> never hits 0, so has_dirty stays 1. When I kill the tests and go back to idle,
> we end up looping in read_dirty for a long time, I guess because ... aha! It's
> slowly trickling the dirty data out to the backing device, and cranking up
> writeback_rate makes it finish (and go back to that schedule() sleep) faster.
Ok, I have a little more to share about this -- the PD controller in charge of
WB tries to maintain (by default) 10% of the cache as dirty. On my relatively
idle laptop, it is frequently the case that < 10% of the cache is dirty. When
this happens, the writeback_rate falls to 512b/s. Meanwhile, read_dirty writes
out the dirty data at this relatively slow rate, apparently using
schedule_timeout_uninterruptible to throttle the writeout.
There's enough activity on my laptop to ensure that there's always _some_ dirty
data, hence the writeback thread is constantly in uninterruptible sleep. I
could probably set writeback_percent = 0 to mitigate this since in my case I
probably want as little dirty data as possible, but ... ugh.
So that brings me back to this question:
> Hmm. I'm wondering about the choice of _interruptible vs.
> _uninterruptible -- what are you trying to prevent from happening?
I changed it to _interruptible, and so far I haven't seen any problems running
xfstests.
> > Can you try and diagnose further? Also if you want to split the rest of the
> > changes out into a separate patch I'll definitely take that. Thanks!
>
> Do you mean the first chunk, which moves the bch_writeback_queue() in super.c?
Assuming you do, I'll issue a patch shortly.
--D
>
> --D
> >
> >
> > > if (!atomic_read(&dc->has_dirty) ||
> > > (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
> > > !dc->writeback_running)) {
> > > @@ -436,7 +448,7 @@ static int bch_writeback_thread(void *arg)
> > > while (delay &&
> > > !kthread_should_stop() &&
> > > !test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags))
> > > - delay = schedule_timeout_uninterruptible(delay);
> > > + delay = schedule_timeout_interruptible(delay);
> > > }
> > > }
> > >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2014-04-12 2:31 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-04-11 0:26 [PATCH] bcache: fix writeback thread to sleep less intrusively Darrick J. Wong
2014-04-11 0:46 ` Kent Overstreet
2014-04-11 1:41 ` Darrick J. Wong
2014-04-12 2:31 ` Darrick J. Wong [this message]
2014-04-12 2:33 ` [PATCH] bcache: let the writeback thread run at least once at startup Darrick J. Wong
2014-04-30 11:51 ` [PATCH] bcache: fix writeback thread to sleep less intrusively Daniel Smedegaard Buus
2014-04-30 17:24 ` Darrick J. Wong
2014-05-01 9:38 ` Daniel Smedegaard Buus
2014-05-01 21:54 ` Slava Pestov
2014-05-20 7:07 ` Daniel J Blueman
[not found] ` <CACHGV4+mNu_KV7JazT-34D++3S2NKhDkOmc_wo0QfrfdqpccoQ@mail.gmail.com>
2014-07-09 10:27 ` Daniel Smedegaard Buus
2014-07-09 15:53 ` Peter Kieser
2014-07-09 18:02 ` Daniel Smedegaard Buus
2014-07-09 18:19 ` Slava Pestov
2014-07-09 19:18 ` Daniel Smedegaard Buus
2014-04-11 7:26 ` Francis Moreau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140412023101.GC9176@birch.djwong.org \
--to=darrick.wong@oracle.com \
--cc=daniel@quora.org \
--cc=enotty@brown.edu \
--cc=francis.moro@gmail.com \
--cc=kmo@daterainc.com \
--cc=linux-bcache@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).