From: Coly Li <colyli@suse.de>
To: tang.junhui@zte.com.cn
Cc: bcache@lists.ewheeler.net, kubuxu@ipfs.io, linux-bcache@vger.kernel.org
Subject: Re: [PATCH 1/2] bcache: do not perform writeback if dirty data is no more than writeback_percent
Date: Wed, 19 Jul 2017 18:32:51 +0800 [thread overview]
Message-ID: <8a2ec66d-332b-d19f-d1ba-933d871db2cb@suse.de> (raw)
In-Reply-To: <OF2AB44ECA.94CE1BF0-ON48258162.00356FFD-48258162.00359404@zte.com.cn>
On 2017/7/19 下午5:45, tang.junhui@zte.com.cn wrote:
>> Currently bcache_dev_sectors_dirty() iterates all dirty stripes of
>> bcache device, it will take around 15 milliseconds for a 15.2TB cache
>> device on my hardware. The latency is unacceptable,
>
> Why it is unacceptable? It does not block the front-endincoming IOs.
Hi Junhui,
Imagine the following situation,
1) Cached device has dirty data but not exceed its writeback_percent,
after its writeback thread is woke up, no_writeback_now() will iterates
all dirty stripes of the bcache device (for 15 ms on my server), and
return true to bch_writeback_thread(), then the thread schedules out and
sleep.
2) Before dirty data exceeds writeback_percent of the cached device, it
is not difficult to have more than 1 writ request onto bcache device for
every 15 ms (on my hardware). If it happens, then the writeback thread
will be woke up repeated and every time no_writeback_now() will be
called and consume CPU cycles to iterate all stripes for 15ms.
3) The result is, there is potential possibility, that the writeback
thread always consume a full CPU core before dirty data exceeds cached
device's writeback_percent, even there is only a few write request on
the bcache device.
This is unacceptable in this patch set.
Thanks.
Coly
>
>
> 发件人: Coly Li <i@coly.li>
> 收件人: Coly Li <colyli@suse.de>, linux-bcache@vger.kernel.org,
> 抄送: bcache@lists.ewheeler.net, tang.junhui@zte.com.cn,
> kubuxu@ipfs.io
> 日期: 2017/07/19 16:20
> 主题: Re: [PATCH 1/2] bcache: do not perform writeback if dirty
> data is no more than writeback_percent
> 发件人: linux-bcache-owner@vger.kernel.org
> ------------------------------------------------------------------------
>
>
>
> On 2017/7/19 下午1:36, Coly Li wrote:
>> The functionanlity of sysfs entry writeback_percent is explained in file
>> Document/ABI/testing/sysfs-block-bcache:
>> "For backing devices: If nonzero, writeback from cache to
>> backing device only takes place when more than this percentage
>> of the cache is used, allowing more write coalescing to take
>> place and reducing total number of writes sent to the backing
>> device. Integer between 0 and 40."
>>
>> Indeed currently in writeback mode, even dirty data percent is not exceed
>> writeback_percent of the cached device, writeback still happens at least
>> one key per second. This is not a behavior as designed.
>>
>> This patch does three things to fix the above issue,
>> 1) Move writeback thread related checks from bch_writeback_thread()
>> to no_writeback_now().
>> 2) Add a duration to count time duration since last I/O request received
>> to the moment when no_writeback_now() is called. When duration is more
>> than BCH_IDLE_DURATION_MSECS, perform a proactive writeback no matter
>> dirty data exceeds writeback_percent or not.
>> 3) If duration is not timeout, and dirty data is not more than cached
>> device's writeback_percent, writeback thread just schedules out and
>> waits for another try after BCH_IDLE_DURATION_MSECS.
>>
>> By this patch, if a write request's dirty data does not exceed
>> writeback_percent, writeback will not happen. In this case dirty data may
>> stay in cache device for a very long time if I/O on bcache device is idle.
>> To fix this, bch_writeback_queue() is called in update_writeback_rate()
>> to wake up the writeback thread in period of
> writeback_rate_update_seconds.
>> Then if no I/O happens on bcache device for BCH_IDLE_DURATION_MSECS
>> milliseconds, at least we have chance to wake up writeback thread to find
>> the duration timeout in a not-that-much delay.
>>
>> Now writeback_percent acts as what it is defined for.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>> drivers/md/bcache/bcache.h | 4 ++++
>> drivers/md/bcache/request.c | 6 +++++
>> drivers/md/bcache/writeback.c | 55
> +++++++++++++++++++++++++++++++++++++++----
>> 3 files changed, 61 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>> index dee542fff68e..ab7e60336edb 100644
>> --- a/drivers/md/bcache/bcache.h
>> +++ b/drivers/md/bcache/bcache.h
>> @@ -662,6 +662,8 @@ struct cache_set {
>>
>> #define BUCKET_HASH_BITS 12
>> struct hlist_head bucket_hash[1 <<
> BUCKET_HASH_BITS];
>> +
>> + uint64_t
> last_request_time;
>> };
>>
>> struct bbio {
>> @@ -680,6 +682,8 @@ struct bbio {
>> #define BTREE_PRIO USHRT_MAX
>> #define INITIAL_PRIO 32768U
>>
>> +#define BCH_IDLE_DURATION_MSECS 10000U
>> +
>> #define btree_bytes(c)
> ((c)->btree_pages * PAGE_SIZE)
>> #define btree_blocks(b)
> \
>> ((unsigned) (KEY_SIZE(&b->key) >> (b)->c->block_bits))
>> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
>> index 019b3df9f1c6..98971486f7f1 100644
>> --- a/drivers/md/bcache/request.c
>> +++ b/drivers/md/bcache/request.c
>> @@ -957,10 +957,13 @@ static blk_qc_t cached_dev_make_request(struct
> request_queue *q,
>> struct search *s;
>> struct bcache_device *d =
> bio->bi_bdev->bd_disk->private_data;
>> struct cached_dev *dc = container_of(d, struct
> cached_dev, disk);
>> + struct cache_set *c = d->c;
>> int rw = bio_data_dir(bio);
>>
>> generic_start_io_acct(rw, bio_sectors(bio),
> &d->disk->part0);
>>
>> + if (c)
>> + c->last_request_time = local_clock();
>> bio->bi_bdev = dc->bdev;
>> bio->bi_iter.bi_sector += dc->sb.data_offset;
>>
>> @@ -991,6 +994,9 @@ static blk_qc_t cached_dev_make_request(struct
> request_queue *q,
>> else
>>
> generic_make_request(bio);
>> }
>> + /* update last_request_time again to make it to be
> more accurate */
>> + if (c)
>> + c->last_request_time = local_clock();
>>
>> return BLK_QC_T_NONE;
>> }
>> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
>> index 42c66e76f05e..13594dd7f564 100644
>> --- a/drivers/md/bcache/writeback.c
>> +++ b/drivers/md/bcache/writeback.c
>> @@ -78,8 +78,15 @@ static void update_writeback_rate(struct
> work_struct *work)
>> down_read(&dc->writeback_lock);
>>
>> if (atomic_read(&dc->has_dirty) &&
>> - dc->writeback_percent)
>> + dc->writeback_percent) {
>> __update_writeback_rate(dc);
>> + /*
>> + * wake up writeback thread to
> check whether request
>> + * duration is timeout in
> no_writeback_now(). If yes,
>> + * existing dirty data should be
> handled.
>> + */
>> + bch_writeback_queue(dc);
>> + }
>>
>> up_read(&dc->writeback_lock);
>>
>> @@ -412,6 +419,48 @@ static bool refill_dirty(struct cached_dev *dc)
>> return bkey_cmp(&buf->last_scanned, &start_pos) >= 0;
>> }
>>
>> +static bool no_writeback_now(struct cached_dev *dc)
>> +{
>> + uint64_t duration;
>> + struct cache_set *c = dc->disk.c;
>> + uint64_t cache_set_sectors;
>> + uint64_t dirty_sectors;
>> + uint64_t percent_sectors;
>> +
>> + if (!atomic_read(&dc->has_dirty))
>> + return true;
>> +
>> + if (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
>> + !dc->writeback_running)
>> + return true;
>> +
>> + /*
>> + * last_request_time is initialized to 0, it is
> possible that
>> + * duration is larger than BCH_IDLE_DURATION_MSECS
> when the first
>> + * time it is calculated. If there is dirty data of
> the cached
>> + * device, bcache will try to perform a proactive
> writeback.
>> + */
>> + duration = local_clock() - c->last_request_time;
>> + if ((duration/NSEC_PER_MSEC) > BCH_IDLE_DURATION_MSECS)
>> + return false;
>> +
>> + /*
>> + * If duration is not timeout, and dirty data does
> not exceed
>> + * writeback_percent, no writeback now.
>> + * This is what writeback_percent is designed for, see
>> + * /sys/block/<disk>/bcache/writeback_percentkernel in
>> + * Documentation/ABI/testing/sysfs-block-bcache
>> + */
>> + cache_set_sectors = c->nbuckets * c->sb.bucket_size;
>> + percent_sectors = div64_u64(
>> + cache_set_sectors *
> dc->writeback_percent, 100);
>> + dirty_sectors = bcache_dev_sectors_dirty(&dc->disk);
>
> Currently bcache_dev_sectors_dirty() iterates all dirty stripes of
> bcache device, it will take around 15 milliseconds for a 15.2TB cache
> device on my hardware. The latency is unacceptable, I also mentioned
> that on Junhui's work (previously posted patch titled "bcache: update
> bucket_in_use periodically").
>
> I am working on a low latency dirty sectors counter now, after it gets
> done, I will add them into this patch set and post again.
>
> Just FYI. Thanks.
>
>
> Coly
>
>
>
>> + if (dirty_sectors <= percent_sectors)
>> + return true;
>> +
>> + return false;
>> +}
>> +
>> static int bch_writeback_thread(void *arg)
>> {
>> struct cached_dev *dc = arg;
>> @@ -419,9 +468,7 @@ static int bch_writeback_thread(void *arg)
>>
>> while (!kthread_should_stop()) {
>> down_write(&dc->writeback_lock);
>> - if (!atomic_read(&dc->has_dirty) ||
>> -
> (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
>> - !dc->writeback_running)) {
>> + if (no_writeback_now(dc)) {
>>
> up_write(&dc->writeback_lock);
>>
> set_current_state(TASK_INTERRUPTIBLE);
>>
>>
>
>
> --
> Coly Li
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
next prev parent reply other threads:[~2017-07-19 10:33 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-07-19 5:36 set max writeback rate when bcache device is idle Coly Li
2017-07-19 5:36 ` [PATCH 1/2] bcache: do not perform writeback if dirty data is no more than writeback_percent Coly Li
2017-07-19 8:19 ` Coly Li
[not found] ` <OF2AB44ECA.94CE1BF0-ON48258162.00356FFD-48258162.00359404@zte.com.cn>
2017-07-19 10:32 ` Coly Li [this message]
[not found] ` <OF7ADE25A8.9CA7AF89-ON48258165.00108B81-48258165.0010FA59@zte.com.cn>
2017-07-22 15:49 ` Coly Li
2017-07-19 5:36 ` [PATCH 2/2] bcache: implement max_writeback_rate_when_idle option for writeback mode Coly Li
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8a2ec66d-332b-d19f-d1ba-933d871db2cb@suse.de \
--to=colyli@suse.de \
--cc=bcache@lists.ewheeler.net \
--cc=kubuxu@ipfs.io \
--cc=linux-bcache@vger.kernel.org \
--cc=tang.junhui@zte.com.cn \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox