* [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.
@ 2015-09-05 11:10 Vojtech Pavlik
2015-09-16 11:32 ` Kent Overstreet
0 siblings, 1 reply; 10+ messages in thread
From: Vojtech Pavlik @ 2015-09-05 11:10 UTC (permalink / raw)
To: Kent Overstreet
Cc: linux-kernel, linux-bcache, Kent Overstreet, Emmanuel Florac,
Jiri Kosina, Jens Axboe
Fix writeback_thread never finishing writing back all dirty data in bcache when
partial_stripes_expensive is set, spinning and consuming 100% of CPU instead.
Signed-off-by: Vojtech Pavlik <vojtech@suse.com>
---
This is a fix for the current upstream bcache, not the devel branch.
If partial_stripes_expensive is set for a cache set, then writeback_thread
always attempts to write full stripes back to the backing device first.
However, since it does that based on a bitmap rather than the simple linear
search used by the rest of refill_dirty(), it changes the last_scanned
pointer so that it never points to 'end'. refill_dirty() then never tries to
scan from 'start', resulting in writeback_thread looping, consuming 100% of
CPU, but never making progress in writing out the incomplete dirty stripes
in the cache.
Scanning the tree after not finding enough full stripes fixes the issue.
Incomplete dirty stripes are written to the backing device, and if nothing is
dirtying data, the device eventually reaches a clean state and
writeback_thread sleeps. This also fixes the cache device being impossible to
detach in the partial_stripes_expensive scenario.
It may be more efficient to separate the last_scanned field for normal and
stripe scans instead.
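To illustrate the failure mode, here is a minimal user-space model
(hypothetical names and values, not the kernel code): the stripe pass parks
last_scanned mid-keyspace, so the old searched-from-start logic never fires
and the caller spins. The one-line change in the diff below adds a tree scan
after the stripe pass so progress is still made.
#include <stdbool.h>
#include <stdio.h>

#define KEYSPACE_END 1000

static unsigned last_scanned;

/* Models refill_full_stripes(): it finds work via the stripe bitmap and
 * leaves last_scanned parked somewhere in the middle of the keyspace. */
static void refill_full_stripes_model(void)
{
    last_scanned = 400;
}

/* Models the old refill_dirty(): it only restarts from the beginning when
 * the previous pass reached the very end of the keyspace. */
static bool refill_dirty_model(void)
{
    bool searched_from_start = false;

    refill_full_stripes_model();

    if (last_scanned >= KEYSPACE_END) {  /* never true after a stripe pass */
        last_scanned = 0;
        searched_from_start = true;
    }
    /* ...the linear scan would continue from last_scanned here... */

    return searched_from_start;  /* always false: writeback_thread spins */
}

int main(void)
{
    for (int i = 0; i < 3; i++)
        printf("pass %d: searched_from_start=%d\n", i, refill_dirty_model());
    return 0;
}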
drivers/md/bcache/writeback.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index f1986bc..6f8b81d 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -382,6 +382,7 @@ static bool refill_dirty(struct cached_dev *dc)
refill_full_stripes(dc);
if (array_freelist_empty(&buf->freelist))
return false;
+ bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred);
}
if (bkey_cmp(&buf->last_scanned, &end) >= 0) {
--
2.1.4
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.
2015-09-05 11:10 [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes Vojtech Pavlik
@ 2015-09-16 11:32 ` Kent Overstreet
2015-09-16 21:08 ` Denis Bychkov
0 siblings, 1 reply; 10+ messages in thread
From: Kent Overstreet @ 2015-09-16 11:32 UTC (permalink / raw)
To: Vojtech Pavlik
Cc: linux-kernel, linux-bcache, Kent Overstreet, Emmanuel Florac,
Jiri Kosina, Jens Axboe
On Sat, Sep 05, 2015 at 01:10:12PM +0200, Vojtech Pavlik wrote:
> Fix writeback_thread never finishing writing back all dirty data in bcache when
> partial_stripes_expensive is set, spinning and consuming 100% of CPU instead.
>
> Signed-off-by: Vojtech Pavlik <vojtech@suse.com>
> ---
>
> This is a fix for the current upstream bcache, not the devel branch.
>
> If partial_stripes_expensive is set for a cache set, then writeback_thread
> always attempts to write full stripes back to the backing device first.
> However, since it does that based on a bitmap rather than the simple linear
> search used by the rest of refill_dirty(), it changes the last_scanned
> pointer so that it never points to 'end'. refill_dirty() then never tries to
> scan from 'start', resulting in writeback_thread looping, consuming 100% of
> CPU, but never making progress in writing out the incomplete dirty stripes
> in the cache.
>
> Scanning the tree after not finding enough full stripes fixes the issue.
>
> Incomplete dirty stripes are written to the backing device, and if nothing is
> dirtying data, the device eventually reaches a clean state and
> writeback_thread sleeps. This also fixes the cache device being impossible to
> detach in the partial_stripes_expensive scenario.
Good catch!
> It may be more efficient to separate the last_scanned field for normal and
> stripe scans instead.
Actually, I think I like your approach - I think it gives us the behaviour we
want, although it took some thinking to figure out why.
One of the reasons for last_scanned is just to do writeback in LBA order and
avoid making the disks seek around - so, if refill_full_stripes() did just queue
up some IO at last_scanned, we do want to keep scanning from that position for
non full stripes.
But since (as you noted) last_scanned is never going to be >= end after calling
refill_full_stripes() (if there were any full stripes) - really the correct thing
to do is loop around to the start if necessary so that we can successfully scan
everything.
The only inefficiency I see with your approach is that on the second scan, after
we've looped, we don't need to scan to the end of the disk - we only need to
scan up to where we initially started.
But, another observation - if we change refill_dirty() so that it always scans
the entire keyspace if necessary, regardless of where last_scanned was - well,
that really isn't specific to the refill_full_stripes() case - we can just
always do that.
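To make the intended control flow concrete, here is a minimal user-space
sketch (scan() is a hypothetical stand-in for bch_refill_keybuf(); this is a
model, not the kernel code): scan from last_scanned to the end and, if the
end was reached, wrap to the start and scan only up to where this pass began.
#include <stdbool.h>
#include <stdio.h>

#define KEYSPACE_END 1000

/* Stand-in for bch_refill_keybuf(): advance *last_scanned toward 'end'.
 * The real function may stop early once the keybuf fills up; this model
 * always completes the requested range. */
static void scan(unsigned *last_scanned, unsigned end)
{
    *last_scanned = end;
}

/* Returns true if the entire keyspace was scanned. */
static bool refill(unsigned *last_scanned)
{
    unsigned start_pos = *last_scanned;

    scan(last_scanned, KEYSPACE_END);
    if (*last_scanned < KEYSPACE_END)
        return false;  /* keybuf filled before reaching the end */

    /* Wrap around; only rescan what the first pass skipped. */
    *last_scanned = 0;
    scan(last_scanned, start_pos);

    return *last_scanned >= start_pos;
}

int main(void)
{
    /* e.g. a stripe pass left the cursor mid-disk */
    unsigned last_scanned = 700;

    printf("scanned everything: %d\n", refill(&last_scanned));
    return 0;
}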
Can you give this patch a try?
-- >8 --
Subject: [PATCH] bcache: Change refill_dirty() to always scan entire disk if necessary
Previously, it would only scan the entire disk if it was starting from the very
start of the disk - i.e. if the previous scan got to the end.
This was broken by refill_full_stripes(), which updates last_scanned so that
refill_dirty was never triggering the searched_from_start path.
But if we change refill_dirty() to always scan the entire disk if necessary,
regardless of what last_scanned was, the code gets cleaner and we fix that bug
too.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
drivers/md/bcache/writeback.c | 24 ++++++++++++++++--------
1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index cdde0f32f0..08a52db38b 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -359,11 +359,13 @@ next:
}
}
+/*
+ * Returns true if we scanned the entire disk
+ */
static bool refill_dirty(struct cached_dev *dc)
{
struct keybuf *buf = &dc->writeback_keys;
- struct bkey end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
- bool searched_from_start = false;
+ struct bkey start_pos, end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
if (dc->partial_stripes_expensive) {
refill_full_stripes(dc);
@@ -371,14 +373,20 @@ static bool refill_dirty(struct cached_dev *dc)
return false;
}
- if (bkey_cmp(&buf->last_scanned, &end) >= 0) {
- buf->last_scanned = KEY(dc->disk.id, 0, 0);
- searched_from_start = true;
- }
-
+ start_pos = buf->last_scanned;
bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred);
- return bkey_cmp(&buf->last_scanned, &end) >= 0 && searched_from_start;
+ if (bkey_cmp(&buf->last_scanned, &end) < 0)
+ return false;
+
+ /*
+ * If we get to the end start scanning again from the beginning, and
+ * only scan up to where we initially started scanning from:
+ */
+ buf->last_scanned = KEY(dc->disk.id, 0, 0);
+ bch_refill_keybuf(dc->disk.c, buf, &start_pos, dirty_pred);
+
+ return bkey_cmp(&buf->last_scanned, &start_pos) >= 0;
}
static void bch_writeback(struct cached_dev *dc)
--
2.5.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.
2015-09-16 11:32 ` Kent Overstreet
@ 2015-09-16 21:08 ` Denis Bychkov
2015-09-17 15:30 ` Denis Bychkov
2015-09-29 23:00 ` Peter Kieser
0 siblings, 2 replies; 10+ messages in thread
From: Denis Bychkov @ 2015-09-16 21:08 UTC (permalink / raw)
To: Kent Overstreet
Cc: Vojtech Pavlik, linux-kernel, linux-bcache, Kent Overstreet,
Emmanuel Florac, Jiri Kosina, Jens Axboe
[-- Attachment #1: Type: text/plain, Size: 6228 bytes --]
Hi Kent, Vojtech, list
Your fix works perfectly; I can finally remove my patch that disables the
partial_stripes_expensive branch and unleash the full bcache performance on
my RAID-6 array. I had been aware of this problem for some time, but never
got around to learning the bcache codebase well enough to find out why this
happens with partial-stripes writeback enabled, so I just disabled it to
avoid the CPU spinning. Thanks a lot to Vojtech for finding the reason and
to you for the patch. I have quite a collection of stability patches for
mainline bcache. Would you please be so kind as to review them when
convenient and merge upstream the ones you think are worth merging? That
would be the 4.4 merge window, I believe.
Please find all the patches attached. All of them are rebased on the
mainline 4.2 kernel. If somebody needs the 4.1.x versions, please let me
know.
On Wed, Sep 16, 2015 at 7:32 AM, Kent Overstreet
<kent.overstreet@gmail.com> wrote:
> On Sat, Sep 05, 2015 at 01:10:12PM +0200, Vojtech Pavlik wrote:
>> Fix writeback_thread never finishing writing back all dirty data in bcache when
>> partial_stripes_expensive is set, spinning and consuming 100% of CPU instead.
>>
>> Signed-off-by: Vojtech Pavlik <vojtech@suse.com>
>> ---
>>
>> This is a fix for the current upstream bcache, not the devel branch.
>>
>> If partial_stripes_expensive is set for a cache set, then writeback_thread
>> always attempts to write full stripes back to the backing device first.
>> However, since it does that based on a bitmap rather than the simple linear
>> search used by the rest of refill_dirty(), it changes the last_scanned
>> pointer so that it never points to 'end'. refill_dirty() then never tries to
>> scan from 'start', resulting in writeback_thread looping, consuming 100% of
>> CPU, but never making progress in writing out the incomplete dirty stripes
>> in the cache.
>>
>> Scanning the tree after not finding enough full stripes fixes the issue.
>>
>> Incomplete dirty stripes are written to the backing device, and if nothing is
>> dirtying data, the device eventually reaches a clean state and
>> writeback_thread sleeps. This also fixes the cache device being impossible to
>> detach in the partial_stripes_expensive scenario.
>
> Good catch!
>
>> It may be more efficient to separate the last_scanned field for normal and
>> stripe scans instead.
>
> Actually, I think I like your approach - I think it gives us the behaviour we
> want, although it took some thinking to figure out why.
>
> One of the reasons for last_scanned is just to do writeback in LBA order and
> avoid making the disks seek around - so, if refill_full_stripes() did just queue
> up some IO at last_scanned, we do want to keep scanning from that position for
> non full stripes.
>
> But since (as you noted) last_scanned is never going to be >= end after calling
> refill_full_stripes() (if there were any full stripes) - really the correct thing
> to do is loop around to the start if necessary so that we can successfully scan
> everything.
>
> The only inefficiency I see with your approach is that on the second scan, after
> we've looped, we don't need to scan to the end of the disk - we only need to
> scan up to where we initially started.
>
> But, another observation - if we change refill_dirty() so that it always scans
> the entire keyspace if necessary, regardless of where last_scanned was - well,
> that really isn't specific to the refill_full_stripes() case - we can just
> always do that.
>
> Can you give this patch a try?
>
> -- >8 --
> Subject: [PATCH] bcache: Change refill_dirty() to always scan entire disk if necessary
>
> Previously, it would only scan the entire disk if it was starting from the very
> start of the disk - i.e. if the previous scan got to the end.
>
> This was broken by refill_full_stripes(), which updates last_scanned so that
> refill_dirty was never triggering the searched_from_start path.
>
> But if we change refill_dirty() to always scan the entire disk if necessary,
> regardless of what last_scanned was, the code gets cleaner and we fix that bug
> too.
>
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> ---
> drivers/md/bcache/writeback.c | 24 ++++++++++++++++--------
> 1 file changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index cdde0f32f0..08a52db38b 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -359,11 +359,13 @@ next:
> }
> }
>
> +/*
> + * Returns true if we scanned the entire disk
> + */
> static bool refill_dirty(struct cached_dev *dc)
> {
> struct keybuf *buf = &dc->writeback_keys;
> - struct bkey end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
> - bool searched_from_start = false;
> + struct bkey start_pos, end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
>
> if (dc->partial_stripes_expensive) {
> refill_full_stripes(dc);
> @@ -371,14 +373,20 @@ static bool refill_dirty(struct cached_dev *dc)
> return false;
> }
>
> - if (bkey_cmp(&buf->last_scanned, &end) >= 0) {
> - buf->last_scanned = KEY(dc->disk.id, 0, 0);
> - searched_from_start = true;
> - }
> -
> + start_pos = buf->last_scanned;
> bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred);
>
> - return bkey_cmp(&buf->last_scanned, &end) >= 0 && searched_from_start;
> + if (bkey_cmp(&buf->last_scanned, &end) < 0)
> + return false;
> +
> + /*
> + * If we get to the end start scanning again from the beginning, and
> + * only scan up to where we initially started scanning from:
> + */
> + buf->last_scanned = KEY(dc->disk.id, 0, 0);
> + bch_refill_keybuf(dc->disk.c, buf, &start_pos, dirty_pred);
> +
> + return bkey_cmp(&buf->last_scanned, &start_pos) >= 0;
> }
>
> static void bch_writeback(struct cached_dev *dc)
> --
> 2.5.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Denis
[-- Attachment #2: bcache_correct_errors_on_register.patch --]
[-- Type: application/octet-stream, Size: 861 bytes --]
From: Denis Bychkov <manover@gmail.com>
Allows udev to use register instead of register_quiet, to avoid the "device busy" error.
The initial patch proposed at https://lkml.org/lkml/2013/8/26/549 by Gabriel de Perthuis
<g2p.code@gmail.com> does not unlock the mutex and hangs the kernel.
See http://thread.gmane.org/gmane.linux.kernel.bcache.devel/2594 for the discussion.
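The hazard being fixed is the classic release-the-lock-before-any-early-exit
discipline; a hedged user-space sketch of the pattern (generic names, not the
bcache code):
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t register_lock = PTHREAD_MUTEX_INITIALIZER;

/* 'quiet' mirrors register_quiet: suppress the error message, but the
 * mutex must be dropped before either exit path is taken. */
static int do_register(int busy, int quiet)
{
    int err = 0;

    pthread_mutex_lock(&register_lock);
    if (busy)
        err = -1;  /* "device busy" */
    pthread_mutex_unlock(&register_lock);  /* unlock BEFORE bailing out */

    if (err && !quiet)
        fprintf(stderr, "error opening device: device busy\n");
    return err;
}

int main(void)
{
    return do_register(1 /* busy */, 1 /* quiet */) ? 1 : 0;
}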
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1964,6 +1964,8 @@
else
err = "device busy";
mutex_unlock(&bch_register_lock);
+ if (attr == &ksysfs_register_quiet)
+ goto out;
}
goto err;
}
@@ -2002,8 +2004,7 @@
err_close:
blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
err:
- if (attr != &ksysfs_register_quiet)
- pr_info("error opening %s: %s", path, err);
+ pr_info("error opening %s: %s", path, err);
ret = -EINVAL;
goto out;
}
[-- Attachment #3: bcache-attach-detach-cleanup.patch --]
[-- Type: application/octet-stream, Size: 5134 bytes --]
From: Joshua Schmid <jschmid@suse.com>
Subject: [PATCH] bcache: [BUG] clear BCACHE_DEV_UNLINK_DONE flag when attaching a backing device
Newsgroups: gmane.linux.kernel.bcache.devel
Date: 2015-02-03 11:18:01 GMT
From: Zheng Liu <wenqing.lz@taobao.com>
This bug can be reproduced by the following script:
#!/bin/bash
bcache_sysfs="/sys/fs/bcache"
function clear_cache()
{
    if [ ! -e $bcache_sysfs ]; then
        echo "no bcache sysfs"
        exit
    fi
    cset_uuid=$(ls -l $bcache_sysfs|head -n 2|tail -n 1|awk '{print $9}')
    sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/detach"
    sleep 5
    sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/attach"
}
for ((i=0;i<10;i++)); do
    clear_cache
done
The warning messages look like below:
[ 275.948611] ------------[ cut here ]------------
[ 275.963840] WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xb8/0xd0() (Tainted: P W
--------------- )
[ 275.979253] Hardware name: Tecal RH2285
[ 275.994106] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:09.0/0000:08:00.0/host4/target4:2:1/4:2:1:0/block/sdb/sdb1/bcache/cache'
[ 276.024105] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler
bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801
i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas
pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
[ 276.072643] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1
[ 276.089315] Call Trace:
[ 276.105801] [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0
[ 276.122650] [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50
[ 276.139361] [<ffffffff81205c08>] ? sysfs_add_one+0xb8/0xd0
[ 276.156012] [<ffffffff8120609b>] ? sysfs_do_create_link+0x12b/0x170
[ 276.172682] [<ffffffff81206113>] ? sysfs_create_link+0x13/0x20
[ 276.189282] [<ffffffffa03bda21>] ? bcache_device_link+0xc1/0x110 [bcache]
[ 276.205993] [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache]
[ 276.222794] [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache]
[ 276.239680] [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110
[ 276.256594] [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170
[ 276.273364] [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0
[ 276.290133] [<ffffffff811890b1>] ? sys_write+0x51/0x90
[ 276.306368] [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b
[ 276.322301] ---[ end trace 9f5d4fcdd0c3edfb ]---
[ 276.338241] ------------[ cut here ]------------
[ 276.354109] WARNING: at /home/wenqing.lz/bcache/bcache/super.c:720
bcache_device_link+0xdf/0x110 [bcache]() (Tainted: P W --------------- )
[ 276.386017] Hardware name: Tecal RH2285
[ 276.401430] Couldn't create device <-> cache set symlinks
[ 276.401759] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler
bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801
i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas
pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
[ 276.465477] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1
[ 276.482169] Call Trace:
[ 276.498610] [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0
[ 276.515405] [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50
[ 276.532059] [<ffffffffa03bda3f>] ? bcache_device_link+0xdf/0x110 [bcache]
[ 276.548808] [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache]
[ 276.565569] [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache]
[ 276.582418] [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110
[ 276.599341] [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170
[ 276.616142] [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0
[ 276.632607] [<ffffffff811890b1>] ? sys_write+0x51/0x90
[ 276.648671] [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b
[ 276.664756] ---[ end trace 9f5d4fcdd0c3edfc ]---
We forget to clear the BCACHE_DEV_UNLINK_DONE flag in bcache_device_attach()
when a backing device is attached for the first time. After this backing
device is detached, the flag remains true, so sysfs_remove_link() isn't
called in bcache_device_unlink(). When we then attach this backing device
again, sysfs_create_link() returns an EEXIST error in bcache_device_link().
The fix is trivial: clear this flag in bcache_device_link().
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
---
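A rough user-space model of the attach/detach cycle (hypothetical flags, not
the driver code) shows why the clear is needed: without it, the flag stays
set after the first detach, the second detach skips the symlink removal, and
the next attach hits EEXIST.
#include <stdbool.h>
#include <stdio.h>

static bool unlink_done;     /* models BCACHE_DEV_UNLINK_DONE */
static bool symlink_exists;  /* models the sysfs "cache" symlink */

static void device_link(void)
{
    if (symlink_exists)
        printf("sysfs_create_link: -EEXIST\n");  /* the warning in the trace */
    else
        symlink_exists = true;

    unlink_done = false;  /* the fix: clear the flag on every attach */
}

static void device_unlink(void)
{
    /* without the fix, unlink_done stays true after the first detach,
     * so the symlink is never removed again */
    if (!unlink_done) {
        symlink_exists = false;
        unlink_done = true;
    }
}

int main(void)
{
    for (int i = 0; i < 3; i++) {  /* mirrors the reproduction script */
        device_link();
        device_unlink();
    }
    return 0;
}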
drivers/md/bcache/super.c | 2 ++
1 file changed, 2 insertions(+)
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -708,6 +708,8 @@
WARN(sysfs_create_link(&d->kobj, &c->kobj, "cache") ||
sysfs_create_link(&c->kobj, &d->kobj, d->name),
"Couldn't create device <-> cache set symlinks");
+
+ clear_bit(BCACHE_DEV_UNLINK_DONE, &d->flags);
}
static void bcache_device_detach(struct bcache_device *d)
[-- Attachment #4: bcache-cond_resched.patch --]
[-- Type: application/octet-stream, Size: 735 bytes --]
From f0e6320a7874af434575f37a11ec6e4992cef790 Mon Sep 17 00:00:00 2001
From: Kent Overstreet <kmo@daterainc.com>
Date: Sat, 1 Nov 2014 13:44:47 -0700
Subject: [PATCH 1/5] bcache: Add a cond_resched() call to gc
Git-commit: f0e6320a7874af434575f37a11ec6e4992cef790
Patch-mainline: Submitted
References: bnc#910440
Change-id: Id4f18c533b80ddb40df94ed0bb5e2a236a4bc325
Signed-off-by: Takashi Iwai <tiwai@suse.de>
---
drivers/md/bcache/btree.c | 1 +
1 file changed, 1 insertion(+)
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1741,6 +1741,7 @@
do {
ret = btree_root(gc_root, c, &op, &writes, &stats);
closure_sync(&writes);
+ cond_resched();
if (ret && ret != -EAGAIN)
pr_warn("gc failed!");
[-- Attachment #5: bcache-fix_passthrough_shutdown_crash.patch --]
[-- Type: application/octet-stream, Size: 1044 bytes --]
---
From e949c64fa6acbdaab999410250855a6a4fc6bad1 Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.bader@canonical.com>
Date: Mon, 18 Aug 2014 20:00:13 +0200
Subject: [PATCH] bcache: prevent crash on changing writeback_running
commit a664d0f05a2e ("bcache: fix crash on shutdown in passthrough mode")
added a safeguard in the shutdown case. At least while not being
attached it is also possible to trigger a kernel bug by writing into
writeback_running. This change adds the same check before trying to
wake up the thread for that case.
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
drivers/md/bcache/writeback.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -63,7 +63,8 @@
static inline void bch_writeback_queue(struct cached_dev *dc)
{
- wake_up_process(dc->writeback_thread);
+ if (!IS_ERR_OR_NULL(dc->writeback_thread))
+ wake_up_process(dc->writeback_thread);
}
static inline void bch_writeback_add(struct cached_dev *dc)
[-- Attachment #6: bcache-fix-memleak-bch_cached_dev_run.patch --]
[-- Type: application/octet-stream, Size: 802 bytes --]
From: Joshua Schmid <jschmid@suse.com>
Subject: [PATCH] fix a leak in bch_cached_dev_run()
Newsgroups: gmane.linux.kernel.bcache.devel
Date: 2015-02-03 11:24:06 GMT
From: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Joshua Schmid <jschmid@suse.com>
---
drivers/md/bcache/super.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -872,8 +872,11 @@
buf[SB_LABEL_SIZE] = '\0';
env[2] = kasprintf(GFP_KERNEL, "CACHED_LABEL=%s", buf);
- if (atomic_xchg(&dc->running, 1))
+ if (atomic_xchg(&dc->running, 1)) {
+ kfree(env[1]);
+ kfree(env[2]);
return;
+ }
if (!d->c &&
BDEV_STATE(&dc->sb) != BDEV_STATE_NONE) {
[-- Attachment #7: bcache-rcu-sched-bugfix.patch --]
[-- Type: application/octet-stream, Size: 2490 bytes --]
From: Zheng Liu <gnehzuil.liu@gmail.com>
To: linux-bcache@vger.kernel.org
Cc: Zheng Liu <wenqing.lz@taobao.com>, Joshua Schmid <jschmid@suse.com>, Zhu Yanhai <zhu.yanhai@gmail.com>, Kent Overstreet <kmo@daterainc.com>
Subject: [PATCH v2] bcache: fix a livelock in btree lock
Date: Wed, 25 Feb 2015 20:32:09 +0800
From: Zheng Liu <wenqing.lz@taobao.com>
This commit tries to fix a livelock in bcache. This livelock can
happen when a huge number of cache misses occur simultaneously.
When we get a cache miss, bcache executes the following path.
->cached_dev_make_request()
  ->cached_dev_read()
    ->cached_lookup()
      ->bch_btree_map_keys()
        ->btree_root() <-------------------------
          ->bch_btree_map_keys_recurse()        |
            ->cache_lookup_fn()                 |
              ->cached_dev_cache_miss()         |
                ->bch_btree_insert_check_key() -|
                [If btree->seq is not equal to seq + 1, we should return
                 EINTR and traverse the btree again.]
In bch_btree_insert_check_key() we first need to check the upgrade flag
(op->lock == -1); when this flag is true we need to release the read
btree->lock and try to take the write btree->lock. While this write lock
is taken and released, btree->seq is monotonically increased in order to
prevent other threads from modifying the btree during the cache miss (see
btree.h:74). But if many requests cause cache misses at the same time, we
can hit a livelock because btree->seq is always being changed by others,
so no one can make progress.
This commit takes the write btree->lock if it encounters a race while
traversing the btree. Although this sacrifices scalability, it ensures
that only one thread can modify the btree at a time.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
Cc: Joshua Schmid <jschmid@suse.com>
Cc: Zhu Yanhai <zhu.yanhai@gmail.com>
Cc: Kent Overstreet <kmo@daterainc.com>
---
changelog:
v2: fix a bug that stopped all concurrent writes unconditionally.
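For context, the scheme here is optimistic concurrency keyed on the node's
sequence number. Below is a minimal single-threaded user-space sketch
(hypothetical names; the concurrent writer is simulated inline) of why
unbounded optimistic retries can livelock, and where this fix bounds them:
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Models the btree node's sequence number, bumped on every write lock. */
static uint32_t node_seq;

/* Models another CPU taking the node's write lock in the window between
 * our read-unlock and write-lock. */
static void concurrent_writer(void)
{
    node_seq++;
}

/* Old behaviour: succeed only if nobody else locked the node in between,
 * i.e. seq is exactly old_seq + 1 (our own relock). Under a storm of
 * cache misses this check keeps failing -> livelock. */
static bool insert_check_key_optimistic(uint32_t old_seq)
{
    concurrent_writer();  /* somebody else always gets in first */
    node_seq++;           /* our own write relock */
    return node_seq == old_seq + 1;
}

int main(void)
{
    for (int attempt = 0; attempt < 3; attempt++) {
        uint32_t seq = node_seq;

        if (insert_check_key_optimistic(seq)) {
            puts("inserted check key");
            return 0;
        }
        puts("raced, retrying whole traversal (-EINTR)");
        /* the fix: on this retry path, set op->lock to the node's level
         * so the next traversal takes the write lock up front and cannot
         * lose the race again */
    }
    return 1;
}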
drivers/md/bcache/btree.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -2162,8 +2162,10 @@
rw_lock(true, b, b->level);
if (b->key.ptr[0] != btree_ptr ||
- b->seq != seq + 1)
+ b->seq != seq + 1) {
+ op->lock = b->level;
goto out;
+ }
}
SET_KEY_PTRS(check_key, 1);
[-- Attachment #8: bcache-refill_dirty_to_always_scan_entire_disk.patch --]
[-- Type: application/octet-stream, Size: 1963 bytes --]
Subject: [PATCH] bcache: Change refill_dirty() to always scan entire disk if necessary
Previously, it would only scan the entire disk if it was starting from the very
start of the disk - i.e. if the previous scan got to the end.
This was broken by refill_full_stripes(), which updates last_scanned so that
refill_dirty was never triggering the searched_from_start path.
But if we change refill_dirty() to always scan the entire disk if necessary,
regardless of what last_scanned was, the code gets cleaner and we fix that bug
too.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
drivers/md/bcache/writeback.c | 24 ++++++++++++++++--------
1 file changed, 16 insertions(+), 8 deletions(-)
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -372,11 +372,13 @@
}
}
+/*
+ * Returns true if we scanned the entire disk
+ */
static bool refill_dirty(struct cached_dev *dc)
{
struct keybuf *buf = &dc->writeback_keys;
- struct bkey end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
- bool searched_from_start = false;
+ struct bkey start_pos, end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
if (dc->partial_stripes_expensive) {
refill_full_stripes(dc);
@@ -384,14 +386,20 @@
return false;
}
- if (bkey_cmp(&buf->last_scanned, &end) >= 0) {
- buf->last_scanned = KEY(dc->disk.id, 0, 0);
- searched_from_start = true;
- }
-
+ start_pos = buf->last_scanned;
bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred);
- return bkey_cmp(&buf->last_scanned, &end) >= 0 && searched_from_start;
+ if (bkey_cmp(&buf->last_scanned, &end) < 0)
+ return false;
+
+ /*
+ * If we get to the end start scanning again from the beginning, and
+ * only scan up to where we initially started scanning from:
+ */
+ buf->last_scanned = KEY(dc->disk.id, 0, 0);
+ bch_refill_keybuf(dc->disk.c, buf, &start_pos, dirty_pred);
+
+ return bkey_cmp(&buf->last_scanned, &start_pos) >= 0;
}
static int bch_writeback_thread(void *arg)
[-- Attachment #9: bcache-remove-bio-splitting.patch --]
[-- Type: application/octet-stream, Size: 10266 bytes --]
From: Kent Overstreet <kent.overstreet@gmail.com>
The bcache driver has always accepted arbitrarily large bios and split
them internally. Now that every driver must accept arbitrarily large
bios, this code isn't necessary anymore.
Cc: linux-bcache@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
---
drivers/md/bcache/bcache.h | 18 --------
drivers/md/bcache/io.c | 100 +-----------------------------------------
drivers/md/bcache/journal.c | 4 +-
drivers/md/bcache/request.c | 16 +++----
drivers/md/bcache/super.c | 32 +-------------
drivers/md/bcache/util.h | 5 ++-
drivers/md/bcache/writeback.c | 4 +-
7 files changed, 18 insertions(+), 161 deletions(-)
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -243,19 +243,6 @@
DECLARE_ARRAY_ALLOCATOR(struct keybuf_key, freelist, KEYBUF_NR);
};
-struct bio_split_pool {
- struct bio_set *bio_split;
- mempool_t *bio_split_hook;
-};
-
-struct bio_split_hook {
- struct closure cl;
- struct bio_split_pool *p;
- struct bio *bio;
- bio_end_io_t *bi_end_io;
- void *bi_private;
-};
-
struct bcache_device {
struct closure cl;
@@ -288,8 +275,6 @@
int (*cache_miss)(struct btree *, struct search *,
struct bio *, unsigned);
int (*ioctl) (struct bcache_device *, fmode_t, unsigned, unsigned long);
-
- struct bio_split_pool bio_split_hook;
};
struct io {
@@ -454,8 +439,6 @@
atomic_long_t meta_sectors_written;
atomic_long_t btree_sectors_written;
atomic_long_t sectors_written;
-
- struct bio_split_pool bio_split_hook;
};
struct gc_stat {
@@ -873,7 +856,6 @@
void bch_bbio_free(struct bio *, struct cache_set *);
struct bio *bch_bbio_alloc(struct cache_set *);
-void bch_generic_make_request(struct bio *, struct bio_split_pool *);
void __bch_submit_bbio(struct bio *, struct cache_set *);
void bch_submit_bbio(struct bio *, struct cache_set *, struct bkey *, unsigned);
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -61,7 +61,7 @@
bio->bi_private = &cl;
bch_bio_map(bio, data);
- closure_bio_submit(bio, &cl, ca);
+ closure_bio_submit(bio, &cl);
closure_sync(&cl);
/* This function could be simpler now since we no longer write
@@ -648,7 +648,7 @@
spin_unlock(&c->journal.lock);
while ((bio = bio_list_pop(&list)))
- closure_bio_submit(bio, cl, c->cache[0]);
+ closure_bio_submit(bio, cl);
continue_at(cl, journal_write_done, NULL);
}
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -718,7 +718,7 @@
/* XXX: invalidate cache */
- closure_bio_submit(bio, cl, s->d);
+ closure_bio_submit(bio, cl);
}
continue_at(cl, cached_dev_cache_miss_done, NULL);
@@ -841,7 +841,7 @@
s->cache_miss = miss;
s->iop.bio = cache_bio;
bio_get(cache_bio);
- closure_bio_submit(cache_bio, &s->cl, s->d);
+ closure_bio_submit(cache_bio, &s->cl);
return ret;
out_put:
@@ -849,7 +849,7 @@
out_submit:
miss->bi_end_io = request_endio;
miss->bi_private = &s->cl;
- closure_bio_submit(miss, &s->cl, s->d);
+ closure_bio_submit(miss, &s->cl);
return ret;
}
@@ -914,7 +914,7 @@
if (!(bio->bi_rw & REQ_DISCARD) ||
blk_queue_discard(bdev_get_queue(dc->bdev)))
- closure_bio_submit(bio, cl, s->d);
+ closure_bio_submit(bio, cl);
} else if (s->iop.writeback) {
bch_writeback_add(dc);
s->iop.bio = bio;
@@ -929,12 +929,12 @@
flush->bi_end_io = request_endio;
flush->bi_private = cl;
- closure_bio_submit(flush, cl, s->d);
+ closure_bio_submit(flush, cl);
}
} else {
s->iop.bio = bio_clone_fast(bio, GFP_NOIO, dc->disk.bio_split);
- closure_bio_submit(bio, cl, s->d);
+ closure_bio_submit(bio, cl);
}
closure_call(&s->iop.cl, bch_data_insert, NULL, cl);
@@ -950,7 +950,7 @@
bch_journal_meta(s->iop.c, cl);
/* If it's a flush, we send the flush to the backing device too */
- closure_bio_submit(bio, cl, s->d);
+ closure_bio_submit(bio, cl);
continue_at(cl, cached_dev_bio_complete, NULL);
}
@@ -994,7 +994,7 @@
!blk_queue_discard(bdev_get_queue(dc->bdev)))
bio_endio(bio, 0);
else
- bch_generic_make_request(bio, &d->bio_split_hook);
+ generic_make_request(bio);
}
}
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -59,29 +59,6 @@
#define BTREE_MAX_PAGES (256 * 1024 / PAGE_SIZE)
-static void bio_split_pool_free(struct bio_split_pool *p)
-{
- if (p->bio_split_hook)
- mempool_destroy(p->bio_split_hook);
-
- if (p->bio_split)
- bioset_free(p->bio_split);
-}
-
-static int bio_split_pool_init(struct bio_split_pool *p)
-{
- p->bio_split = bioset_create(4, 0);
- if (!p->bio_split)
- return -ENOMEM;
-
- p->bio_split_hook = mempool_create_kmalloc_pool(4,
- sizeof(struct bio_split_hook));
- if (!p->bio_split_hook)
- return -ENOMEM;
-
- return 0;
-}
-
/* Superblock */
static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
@@ -537,7 +514,7 @@
bio->bi_private = ca;
bch_bio_map(bio, ca->disk_buckets);
- closure_bio_submit(bio, &ca->prio, ca);
+ closure_bio_submit(bio, &ca->prio);
closure_sync(cl);
}
@@ -759,7 +736,6 @@
put_disk(d->disk);
}
- bio_split_pool_free(&d->bio_split_hook);
if (d->bio_split)
bioset_free(d->bio_split);
kvfree(d->full_dirty_stripes);
@@ -806,7 +782,6 @@
return minor;
if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
- bio_split_pool_init(&d->bio_split_hook) ||
!(d->disk = alloc_disk(1))) {
ida_simple_remove(&bcache_minor, minor);
return -ENOMEM;
@@ -1798,8 +1773,6 @@
ca->set->cache[ca->sb.nr_this_dev] = NULL;
}
- bio_split_pool_free(&ca->bio_split_hook);
-
free_pages((unsigned long) ca->disk_buckets, ilog2(bucket_pages(ca)));
kfree(ca->prio_buckets);
vfree(ca->buckets);
@@ -1844,8 +1817,7 @@
ca->sb.nbuckets)) ||
!(ca->prio_buckets = kzalloc(sizeof(uint64_t) * prio_buckets(ca) *
2, GFP_KERNEL)) ||
- !(ca->disk_buckets = alloc_bucket_pages(GFP_KERNEL, ca)) ||
- bio_split_pool_init(&ca->bio_split_hook))
+ !(ca->disk_buckets = alloc_bucket_pages(GFP_KERNEL, ca)))
return -ENOMEM;
ca->prio_last_buckets = ca->prio_buckets + prio_buckets(ca);
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -4,6 +4,7 @@
#include <linux/blkdev.h>
#include <linux/errno.h>
+#include <linux/blkdev.h>
#include <linux/kernel.h>
#include <linux/llist.h>
#include <linux/ratelimit.h>
@@ -570,10 +571,10 @@
return bdev->bd_inode->i_size >> 9;
}
-#define closure_bio_submit(bio, cl, dev) \
+#define closure_bio_submit(bio, cl) \
do { \
closure_get(cl); \
- bch_generic_make_request(bio, &(dev)->bio_split_hook); \
+ generic_make_request(bio); \
} while (0)
uint64_t bch_crc64_update(uint64_t, const void *, size_t);
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -188,7 +188,7 @@
io->bio.bi_bdev = io->dc->bdev;
io->bio.bi_end_io = dirty_endio;
- closure_bio_submit(&io->bio, cl, &io->dc->disk);
+ closure_bio_submit(&io->bio, cl);
continue_at(cl, write_dirty_finish, system_wq);
}
@@ -208,7 +208,7 @@
{
struct dirty_io *io = container_of(cl, struct dirty_io, cl);
- closure_bio_submit(&io->bio, cl, &io->dc->disk);
+ closure_bio_submit(&io->bio, cl);
continue_at(cl, write_dirty, system_wq);
}
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -11,105 +11,6 @@
#include <linux/blkdev.h>
-static unsigned bch_bio_max_sectors(struct bio *bio)
-{
- struct request_queue *q = bdev_get_queue(bio->bi_bdev);
- struct bio_vec bv;
- struct bvec_iter iter;
- unsigned ret = 0, seg = 0;
-
- if (bio->bi_rw & REQ_DISCARD)
- return min(bio_sectors(bio), q->limits.max_discard_sectors);
-
- bio_for_each_segment(bv, bio, iter) {
- struct bvec_merge_data bvm = {
- .bi_bdev = bio->bi_bdev,
- .bi_sector = bio->bi_iter.bi_sector,
- .bi_size = ret << 9,
- .bi_rw = bio->bi_rw,
- };
-
- if (seg == min_t(unsigned, BIO_MAX_PAGES,
- queue_max_segments(q)))
- break;
-
- if (q->merge_bvec_fn &&
- q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
- break;
-
- seg++;
- ret += bv.bv_len >> 9;
- }
-
- ret = min(ret, queue_max_sectors(q));
-
- WARN_ON(!ret);
- ret = max_t(int, ret, bio_iovec(bio).bv_len >> 9);
-
- return ret;
-}
-
-static void bch_bio_submit_split_done(struct closure *cl)
-{
- struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
-
- s->bio->bi_end_io = s->bi_end_io;
- s->bio->bi_private = s->bi_private;
- bio_endio(s->bio, 0);
-
- closure_debug_destroy(&s->cl);
- mempool_free(s, s->p->bio_split_hook);
-}
-
-static void bch_bio_submit_split_endio(struct bio *bio, int error)
-{
- struct closure *cl = bio->bi_private;
- struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
-
- if (error)
- clear_bit(BIO_UPTODATE, &s->bio->bi_flags);
-
- bio_put(bio);
- closure_put(cl);
-}
-
-void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
-{
- struct bio_split_hook *s;
- struct bio *n;
-
- if (!bio_has_data(bio) && !(bio->bi_rw & REQ_DISCARD))
- goto submit;
-
- if (bio_sectors(bio) <= bch_bio_max_sectors(bio))
- goto submit;
-
- s = mempool_alloc(p->bio_split_hook, GFP_NOIO);
- closure_init(&s->cl, NULL);
-
- s->bio = bio;
- s->p = p;
- s->bi_end_io = bio->bi_end_io;
- s->bi_private = bio->bi_private;
- bio_get(bio);
-
- do {
- n = bio_next_split(bio, bch_bio_max_sectors(bio),
- GFP_NOIO, s->p->bio_split);
-
- n->bi_end_io = bch_bio_submit_split_endio;
- n->bi_private = &s->cl;
-
- closure_get(&s->cl);
- generic_make_request(n);
- } while (n != bio);
-
- continue_at(&s->cl, bch_bio_submit_split_done, NULL);
- return;
-submit:
- generic_make_request(bio);
-}
-
/* Bios with headers */
void bch_bbio_free(struct bio *bio, struct cache_set *c)
@@ -139,7 +40,7 @@
bio->bi_bdev = PTR_CACHE(c, &b->key, 0)->bdev;
b->submit_time_us = local_clock_us();
- closure_bio_submit(bio, bio->bi_private, PTR_CACHE(c, &b->key, 0));
+ closure_bio_submit(bio, bio->bi_private);
}
void bch_submit_bbio(struct bio *bio, struct cache_set *c,
[-- Attachment #10: bcache-unregister-reboot-notifier.patch --]
[-- Type: application/octet-stream, Size: 761 bytes --]
From: Zheng Liu <wenqing.lz@taobao.com>
The bcache_init() function forgets to unregister the reboot notifier if
bcache fails to register a block device. This commit fixes that.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
---
drivers/md/bcache/super.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2094,8 +2094,10 @@
closure_debug_init();
bcache_major = register_blkdev(0, "bcache");
- if (bcache_major < 0)
+ if (bcache_major < 0) {
+ unregister_reboot_notifier(&reboot);
return bcache_major;
+ }
if (!(bcache_wq = create_workqueue("bcache")) ||
!(bcache_kobj = kobject_create_and_add("bcache", fs_kobj)) ||
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.
2015-09-16 21:08 ` Denis Bychkov
@ 2015-09-17 15:30 ` Denis Bychkov
2015-09-17 16:40 ` Kent Overstreet
2015-09-29 23:00 ` Peter Kieser
1 sibling, 1 reply; 10+ messages in thread
From: Denis Bychkov @ 2015-09-17 15:30 UTC (permalink / raw)
To: Kent Overstreet
Cc: Vojtech Pavlik, linux-kernel, linux-bcache, Emmanuel Florac,
Jiri Kosina, Jens Axboe
Well, it turns out my celebration was a bit premature.
PLEASE, DO NOT APPLY THE PATCH POSTED BY KENT (not the one Vojtech
posted) ON A PRODUCTION SYSTEM, IT CAUSES DATA CORRUPTION.
The interesting thing is that it somehow damaged the partition that
was not supposed to receive any writes (the file system was mounted
read-only), so my guess is that the patch causes the blocks residing
in the write-back cache to flush to the wrong blocks on the backing
device.
Everything was going great until I rebooted and saw this in the log:
[ 19.639082] attempt to access beyond end of device
[ 19.643984] md1p2: rw=1, want=75497520, limit=62914560
[ 19.659033] attempt to access beyond end of device
[ 19.663929] md1p2: rw=1, want=75497624, limit=62914560
[ 19.669447] attempt to access beyond end of device
[ 19.674338] md1p2: rw=1, want=75497752, limit=62914560
[ 19.679195] attempt to access beyond end of device
[ 19.679199] md1p2: rw=1, want=75498080, limit=62914560
[ 19.689007] attempt to access beyond end of device
[ 19.689011] md1p2: rw=1, want=75563376, limit=62914560
[ 19.699055] attempt to access beyond end of device
[ 19.699059] md1p2: rw=1, want=79691816, limit=62914560
[ 19.719246] attempt to access beyond end of device
[ 19.724144] md1p2: rw=1, want=79691928, limit=62914560
......
(it's a small example, the list was much longer)
And the next thing I found out was that the super block on my 10-TB XFS RAID was gone. :)
Oh well, it's a good thing I have backups.
I knew what I was doing when trying the untested patches. I should
have made the RAID md partition read-only, not the file system. I kind
of expected that something could have gone wrong with the file system
I was testing, just did not expect it would fire nukes at the innocent
bystanders.
On Wed, Sep 16, 2015 at 5:08 PM, Denis Bychkov <manover@gmail.com> wrote:
> Hi Kent, Vojtech, list
>
> Your fix works perfectly; I can finally remove my patch that disables the
> partial_stripes_expensive branch and unleash the full bcache performance on
> my RAID-6 array. I had been aware of this problem for some time, but never
> got around to learning the bcache codebase well enough to find out why this
> happens with partial-stripes writeback enabled, so I just disabled it to
> avoid the CPU spinning. Thanks a lot to Vojtech for finding the reason and
> to you for the patch. I have quite a collection of stability patches for
> mainline bcache. Would you please be so kind as to review them when
> convenient and merge upstream the ones you think are worth merging? That
> would be the 4.4 merge window, I believe.
> Please find all the patches attached. All of them are rebased on the
> mainline 4.2 kernel. If somebody needs the 4.1.x versions, please let me
> know.
>
>
> On Wed, Sep 16, 2015 at 7:32 AM, Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
>> On Sat, Sep 05, 2015 at 01:10:12PM +0200, Vojtech Pavlik wrote:
>>> Fix writeback_thread never finishing writing back all dirty data in bcache when
>>> partial_stripes_expensive is set, spinning and consuming 100% of CPU instead.
>>>
>>> Signed-off-by: Vojtech Pavlik <vojtech@suse.com>
>>> ---
>>>
>>> This is a fix for the current upstream bcache, not the devel branch.
>>>
>>> If partial_stripes_expensive is set for a cache set, then writeback_thread
>>> always attempts to write full stripes back to the backing device first.
>>> However, since it does that based on a bitmap rather than the simple linear
>>> search used by the rest of refill_dirty(), it changes the last_scanned
>>> pointer so that it never points to 'end'. refill_dirty() then never tries to
>>> scan from 'start', resulting in writeback_thread looping, consuming 100% of
>>> CPU, but never making progress in writing out the incomplete dirty stripes
>>> in the cache.
>>>
>>> Scanning the tree after not finding enough full stripes fixes the issue.
>>>
>>> Incomplete dirty stripes are written to the backing device, and if nothing is
>>> dirtying data, the device eventually reaches a clean state and
>>> writeback_thread sleeps. This also fixes the cache device being impossible to
>>> detach in the partial_stripes_expensive scenario.
>>
>> Good catch!
>>
>>> It may be more efficient to separate the last_scanned field for normal and
>>> stripe scans instead.
>>
>> Actually, I think I like your approach - I think it gives us the behaviour we
>> want, although it took some thinking to figure out why.
>>
>> One of the reasons for last_scanned is just to do writeback in LBA order and
>> avoid making the disks seek around - so, if refill_full_stripes() did just queue
>> up some IO at last_scanned, we do want to keep scanning from that position for
>> non full stripes.
>>
>> But since (as you noted) last_scanned is never going to be >= end after calling
>> refill_full_stripes() (if there were any full stripes) - really the correct thing
>> to do is loop around to the start if necessary so that we can successfully scan
>> everything.
>>
>> The only inefficiency I see with your approach is that on the second scan, after
>> we've looped, we don't need to scan to the end of the disk - we only need to
>> scan up to where we initially started.
>>
>> But, another observation - if we change refill_dirty() so that it always scans
>> the entire keyspace if necessary, regardless of where last_scanned was - well,
>> that really isn't specific to the refill_full_stripes() case - we can just
>> always do that.
>>
>> Can you give this patch a try?
>>
>> -- >8 --
>> Subject: [PATCH] bcache: Change refill_dirty() to always scan entire disk if necessary
>>
>> Previously, it would only scan the entire disk if it was starting from the very
>> start of the disk - i.e. if the previous scan got to the end.
>>
>> This was broken by refill_full_stripes(), which updates last_scanned so that
>> refill_dirty was never triggering the searched_from_start path.
>>
>> But if we change refill_dirty() to always scan the entire disk if necessary,
>> regardless of what last_scanned was, the code gets cleaner and we fix that bug
>> too.
>>
>> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
>> ---
>> drivers/md/bcache/writeback.c | 24 ++++++++++++++++--------
>> 1 file changed, 16 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
>> index cdde0f32f0..08a52db38b 100644
>> --- a/drivers/md/bcache/writeback.c
>> +++ b/drivers/md/bcache/writeback.c
>> @@ -359,11 +359,13 @@ next:
>> }
>> }
>>
>> +/*
>> + * Returns true if we scanned the entire disk
>> + */
>> static bool refill_dirty(struct cached_dev *dc)
>> {
>> struct keybuf *buf = &dc->writeback_keys;
>> - struct bkey end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
>> - bool searched_from_start = false;
>> + struct bkey start_pos, end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
>>
>> if (dc->partial_stripes_expensive) {
>> refill_full_stripes(dc);
>> @@ -371,14 +373,20 @@ static bool refill_dirty(struct cached_dev *dc)
>> return false;
>> }
>>
>> - if (bkey_cmp(&buf->last_scanned, &end) >= 0) {
>> - buf->last_scanned = KEY(dc->disk.id, 0, 0);
>> - searched_from_start = true;
>> - }
>> -
>> + start_pos = buf->last_scanned;
>> bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred);
>>
>> - return bkey_cmp(&buf->last_scanned, &end) >= 0 && searched_from_start;
>> + if (bkey_cmp(&buf->last_scanned, &end) < 0)
>> + return false;
>> +
>> + /*
>> + * If we get to the end start scanning again from the beginning, and
>> + * only scan up to where we initially started scanning from:
>> + */
>> + buf->last_scanned = KEY(dc->disk.id, 0, 0);
>> + bch_refill_keybuf(dc->disk.c, buf, &start_pos, dirty_pred);
>> +
>> + return bkey_cmp(&buf->last_scanned, &start_pos) >= 0;
>> }
>>
>> static void bch_writeback(struct cached_dev *dc)
>> --
>> 2.5.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
>
> --
>
> Denis
--
Denis
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.
2015-09-17 15:30 ` Denis Bychkov
@ 2015-09-17 16:40 ` Kent Overstreet
2015-09-17 18:31 ` Kent Overstreet
0 siblings, 1 reply; 10+ messages in thread
From: Kent Overstreet @ 2015-09-17 16:40 UTC (permalink / raw)
To: Denis Bychkov
Cc: Vojtech Pavlik, linux-kernel, linux-bcache, Emmanuel Florac,
Jiri Kosina, Jens Axboe
On Thu, Sep 17, 2015 at 11:30:17AM -0400, Denis Bychkov wrote:
> Well, it turns out my celebration was a bit premature.
>
> PLEASE, DO NOT APPLY THE PATCH POSTED BY KENT (not the one Vojtech
> posted) ON A PRODUCTION SYSTEM, IT CAUSES DATA CORRUPTION.
>
> The interesting thing is that it somehow damaged the partition that
> was not supposed to receive any writes (the file system was mounted
> read-only), so my guess is that the patch causes the blocks residing
> in the write-back cache to flush to the wrong blocks on the backing
> device.
> Everything was going great until I rebooted and saw this in the log:
>
> [ 19.639082] attempt to access beyond end of device
> [ 19.643984] md1p2: rw=1, want=75497520, limit=62914560
> [ 19.659033] attempt to access beyond end of device
> [ 19.663929] md1p2: rw=1, want=75497624, limit=62914560
> [ 19.669447] attempt to access beyond end of device
> [ 19.674338] md1p2: rw=1, want=75497752, limit=62914560
> [ 19.679195] attempt to access beyond end of device
> [ 19.679199] md1p2: rw=1, want=75498080, limit=62914560
> [ 19.689007] attempt to access beyond end of device
> [ 19.689011] md1p2: rw=1, want=75563376, limit=62914560
> [ 19.699055] attempt to access beyond end of device
> [ 19.699059] md1p2: rw=1, want=79691816, limit=62914560
> [ 19.719246] attempt to access beyond end of device
> [ 19.724144] md1p2: rw=1, want=79691928, limit=62914560
> ......
> (it's a small example, the list was much longer)
> And the next thing I found out was that the super block on my 10-TB XFS RAID was gone. :)
> Oh well, it's a good thing I have backups.
> I knew what I was doing when trying the untested patches. I should
> have made the RAID md partition read-only, not the file system. I kind
> of expected that something could have gone wrong with the file system
> I was testing, just did not expect it would fire nukes at the innocent
> bystanders.
Aw, shit. That's just _bizarre_.
I have a theory - it appears that last_scanned isn't getting initialized before
it's used, so it's going to be all 0s the very first time... which it appears
could cause it to slurp up keys from the wrong device (and if that device was
bigger than the correct device, that could explain the accesses beyond the end
of the device).
Currently just a theory though, and I have no clue why it would only be exposed
with my patch.
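If that theory holds, the defensive fix is to clamp last_scanned into the
device's own key range before using it. A minimal sketch with hypothetical
types - the follow-up patch in the next message does this with real bkey
comparisons:
#include <stdio.h>

struct key {
    unsigned inode;  /* which backing device */
    unsigned long long offset;
};

static int key_cmp(const struct key *a, const struct key *b)
{
    if (a->inode != b->inode)
        return a->inode < b->inode ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* If last_scanned was never initialized (all zeroes), it can lie outside
 * [start, end] for this device's inode, and the scan would pick up keys
 * belonging to a different backing device; clamp it back first. */
static void clamp_last_scanned(struct key *last_scanned, unsigned inode)
{
    struct key start = { inode, 0 };
    struct key end = { inode, ~0ULL };

    if (key_cmp(last_scanned, &start) < 0 ||
        key_cmp(last_scanned, &end) > 0)
        *last_scanned = start;
}

int main(void)
{
    struct key last = { 0, 0 };    /* never initialized */

    clamp_last_scanned(&last, 3);  /* this backing device is inode 3 */
    printf("inode=%u offset=%llu\n", last.inode, last.offset);
    return 0;
}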
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.
2015-09-17 16:40 ` Kent Overstreet
@ 2015-09-17 18:31 ` Kent Overstreet
2015-09-17 20:54 ` Denis Bychkov
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Kent Overstreet @ 2015-09-17 18:31 UTC (permalink / raw)
To: Denis Bychkov
Cc: Vojtech Pavlik, linux-kernel, linux-bcache, Emmanuel Florac,
Jiri Kosina, Jens Axboe
On Thu, Sep 17, 2015 at 08:40:54AM -0800, Kent Overstreet wrote:
> On Thu, Sep 17, 2015 at 11:30:17AM -0400, Denis Bychkov wrote:
> > Well, it turns out my celebration was a bit premature.
> >
> > PLEASE, DO NOT APPLY THE PATCH POSTED BY KENT (not the one Vojtech
> > posted) ON A PRODUCTION SYSTEM, IT CAUSES DATA CORRUPTION.
> >
> > The interesting thing is that it somehow damaged the partition that
> > was not supposed to receive any writes (the file system was mounted
> > read-only), so my guess is that the patch causes the blocks residing
> > in the write-back cache to flush to the wrong blocks on the backing
> > device.
> > Everything was going great until I rebooted and saw this in the log:
> >
> > [ 19.639082] attempt to access beyond end of device
> > [ 19.643984] md1p2: rw=1, want=75497520, limit=62914560
> > [ 19.659033] attempt to access beyond end of device
> > [ 19.663929] md1p2: rw=1, want=75497624, limit=62914560
> > [ 19.669447] attempt to access beyond end of device
> > [ 19.674338] md1p2: rw=1, want=75497752, limit=62914560
> > [ 19.679195] attempt to access beyond end of device
> > [ 19.679199] md1p2: rw=1, want=75498080, limit=62914560
> > [ 19.689007] attempt to access beyond end of device
> > [ 19.689011] md1p2: rw=1, want=75563376, limit=62914560
> > [ 19.699055] attempt to access beyond end of device
> > [ 19.699059] md1p2: rw=1, want=79691816, limit=62914560
> > [ 19.719246] attempt to access beyond end of device
> > [ 19.724144] md1p2: rw=1, want=79691928, limit=62914560
> > ......
> > (it's a small example, the list was much longer)
> > And the next thing I found out was that the super block on my 10-TB XFS RAID was gone. :)
> > Oh well, it's a good thing I have backups.
> > I knew what I was doing when trying the untested patches. I should
> > have made the RAID md partition read-only, not the file system. I kind
> > of expected that something could have gone wrong with the file system
> > I was testing, just did not expect it would fire nukes at the innocent
> > bystanders.
>
> Aw, shit. That's just _bizarre_.
>
> I have a theory - it appears that last_scanned isn't getting initialized before
> it's used, so it's going to be all 0s the very first time... which it appears
> could cause it to slurp up keys from the wrong device (and if that device was
> bigger than the correct device, that could explain the accesses beyond the end
> of the device).
>
> Currently just a theory though, and I have no clue why it would only be exposed
> with my patch.
Here's an updated patch that has a fix for _that_ theory, and also a new
BUG_ON(). Any chance you could test it?
Oh - I didn't ask - _do_ you have multiple backing devices attached to the same
cache set? Because if you don't, this isn't it at all...
-- >8 --
Subject: [PATCH] bcache: Change refill_dirty() to always scan entire disk if necessary
Previously, it would only scan the entire disk if it was starting from the very
start of the disk - i.e. if the previous scan got to the end.
This was broken by refill_full_stripes(), which updates last_scanned so that
refill_dirty was never triggering the searched_from_start path.
But if we change refill_dirty() to always scan the entire disk if necessary,
regardless of what last_scanned was, the code gets cleaner and we fix that bug
too.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
drivers/md/bcache/writeback.c | 37 ++++++++++++++++++++++++++++++-------
1 file changed, 30 insertions(+), 7 deletions(-)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index cdde0f32f0..d383024247 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -310,6 +310,10 @@ void bcache_dev_sectors_dirty_add(struct cache_set *c, unsigned inode,
static bool dirty_pred(struct keybuf *buf, struct bkey *k)
{
+ struct cached_dev *dc = container_of(buf, struct cached_dev, writeback_keys);
+
+ BUG_ON(KEY_INODE(k) != dc->disk.id);
+
return KEY_DIRTY(k);
}
@@ -359,11 +363,24 @@ next:
}
}
+/*
+ * Returns true if we scanned the entire disk
+ */
static bool refill_dirty(struct cached_dev *dc)
{
struct keybuf *buf = &dc->writeback_keys;
+ struct bkey start = KEY(dc->disk.id, 0, 0);
struct bkey end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
- bool searched_from_start = false;
+ struct bkey start_pos;
+
+ /*
+ * make sure keybuf pos is inside the range for this disk - at bringup
+ * we might not be attached yet so this disk's inode nr isn't
+ * initialized then
+ */
+ if (bkey_cmp(&buf->last_scanned, &start) < 0 ||
+ bkey_cmp(&buf->last_scanned, &end) > 0)
+ buf->last_scanned = start;
if (dc->partial_stripes_expensive) {
refill_full_stripes(dc);
@@ -371,14 +388,20 @@ static bool refill_dirty(struct cached_dev *dc)
return false;
}
- if (bkey_cmp(&buf->last_scanned, &end) >= 0) {
- buf->last_scanned = KEY(dc->disk.id, 0, 0);
- searched_from_start = true;
- }
-
+ start_pos = buf->last_scanned;
bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred);
- return bkey_cmp(&buf->last_scanned, &end) >= 0 && searched_from_start;
+ if (bkey_cmp(&buf->last_scanned, &end) < 0)
+ return false;
+
+ /*
+ * If we get to the end start scanning again from the beginning, and
+ * only scan up to where we initially started scanning from:
+ */
+ buf->last_scanned = start;
+ bch_refill_keybuf(dc->disk.c, buf, &start_pos, dirty_pred);
+
+ return bkey_cmp(&buf->last_scanned, &start_pos) >= 0;
}
static void bch_writeback(struct cached_dev *dc)
--
2.5.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.
2015-09-17 18:31 ` Kent Overstreet
@ 2015-09-17 20:54 ` Denis Bychkov
2015-09-19 4:47 ` Denis Bychkov
2015-10-01 9:01 ` Tim Small
2 siblings, 0 replies; 10+ messages in thread
From: Denis Bychkov @ 2015-09-17 20:54 UTC (permalink / raw)
To: Kent Overstreet
Cc: Vojtech Pavlik, linux-kernel, linux-bcache, Emmanuel Florac,
Jiri Kosina, Jens Axboe
Yes, sure, I'll try it.
And yes, this was exactly my setup - several backing devices sharing
the same cache set, all in write back mode.
When I decided to try your patch yesterday, I created a few more backing
devices and attached them to the same cache set. I did not want to
experiment with the live ones, so I just remounted them RO to make sure
they wouldn't send any writes - instead of switching the md array into RO
mode. I am an idiot.
And today, after I detached the botched XFS partition from the cache
(to make sure there was no more dirty data left) and looked at the
area where there used to be an XFS super block, I found a super block
belonging to one of the new devices I created. So there is no mistake
here; this is exactly what happened. I think you pinned it.
The only thing I don't understand is how those writes ended up on a
different device. I'd understand it if it were the same device,
different partition - we would be writing to sectors that belong to a
different partition on the same disk. But this is not the case; it is
a different md device that received the unwanted writes. Unless ...
unless, of course, the addresses of the write requests stored in the
bcache cache set have already been translated into physical interface
(ATA/SCSI) addresses. Have they? If that is the case, they share the
same address space, because they are the same SATA drives. But that
can't be true, right? Bcache cannot and does not store physical drive
addresses in the cache, does it?
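
For what it's worth, bcache keys carry only logical addresses: a bkey
identifies data by (inode, offset), where the inode is the id of the
attached backing device within the cache set and the offset is a sector
on that device; physical (ATA/SCSI) addresses never enter the picture.
A toy sketch of how a stray key replays onto the wrong device
(simplified stand-in types, not bcache's real bkey macros):

#include <stdio.h>

/* Toy stand-in for a bcache key: addressing is purely logical. */
struct toy_bkey {
	unsigned long long inode;	/* backing device id in the cache set */
	unsigned long long offset;	/* sector offset on that device */
};

int main(void)
{
	/* A dirty key belonging to backing device 1. */
	struct toy_bkey k = { .inode = 1, .offset = 75497520ULL };

	/* Writeback resolves k.inode to a backing device and issues the
	 * write at k.offset on it.  If refill hands writeback a key whose
	 * inode doesn't match the device being flushed - exactly what the
	 * new BUG_ON() in dirty_pred() would catch - the same logical
	 * offset lands on the wrong device, possibly past its end if that
	 * device is smaller. */
	printf("write at sector %llu of backing device %llu\n",
	       k.offset, k.inode);
	return 0;
}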
On Thu, Sep 17, 2015 at 2:31 PM, Kent Overstreet
<kent.overstreet@gmail.com> wrote:
> On Thu, Sep 17, 2015 at 08:40:54AM -0800, Kent Overstreet wrote:
>> On Thu, Sep 17, 2015 at 11:30:17AM -0400, Denis Bychkov wrote:
>> > Well, it turns out my celebration was a bit premature.
>> >
>> > PLEASE, DO NOT APPLY THE PATCH POSTED BY KENT (not the one Vojtech
>> > posted) ON A PRODUCTION SYSTEM, IT CAUSES DATA CORRUPTION.
>> >
>> > The interesting thing is that it somehow damaged the partition that
>> > was not supposed to receive any writes (the file system was mounted
>> > read-only), so my guess is that the patch causes the blocks residing
>> > in the write-back cache to flush to the wrong blocks on the backing
>> > device.
>> > Everything was going great until I rebooted and saw this in the log:
>> >
>> > [ 19.639082] attempt to access beyond end of device
>> > [ 19.643984] md1p2: rw=1, want=75497520, limit=62914560
>> > [ 19.659033] attempt to access beyond end of device
>> > [ 19.663929] md1p2: rw=1, want=75497624, limit=62914560
>> > [ 19.669447] attempt to access beyond end of device
>> > [ 19.674338] md1p2: rw=1, want=75497752, limit=62914560
>> > [ 19.679195] attempt to access beyond end of device
>> > [ 19.679199] md1p2: rw=1, want=75498080, limit=62914560
>> > [ 19.689007] attempt to access beyond end of device
>> > [ 19.689011] md1p2: rw=1, want=75563376, limit=62914560
>> > [ 19.699055] attempt to access beyond end of device
>> > [ 19.699059] md1p2: rw=1, want=79691816, limit=62914560
>> > [ 19.719246] attempt to access beyond end of device
>> > [ 19.724144] md1p2: rw=1, want=79691928, limit=62914560
>> > ......
>> > (it's a small example, the list was much longer)
>> > And the next thing I found out, the super block on my 10-TB XFS RAID was gone. :)
>> > Oh well, it's a good thing I have backups.
>> > I knew what I was doing when trying the untested patches. I should
>> > have made the RAID md partition read-only, not the file system. I kind
>> > of expected that something could have gone wrong with the file system
>> > I was testing, just did not expect it would fire nukes at the innocent
>> > bystanders.
>>
>> Aw, shit. That's just _bizarre_.
>>
>> I have a theory - it appears that last_scanned isn't getting initialized before
>> it's used, so it's going to be all 0s the very first time... which it appears
>> could cause it to slurp up keys from the wrong device (and if that device was
>> bigger than the correct device, that could explain the accesses beyond the end
>> of the device).
>>
>> Currently just a theory though, and I have no clue why it would only be exposed
>> with my patch.
>
> Here's an updated patch that has a fix for _that_ theory, and also a new
> BUG_ON(). Any chance you could test it?
>
> Oh - I didn't ask - _do_ you have multiple backing devices attached to the same
> cache set? Because if you don't, this isn't it at all...
>
> -- >8 --
> Subject: [PATCH] bcache: Change refill_dirty() to always scan entire disk if necessary
>
> Previously, it would only scan the entire disk if it was starting from the very
> start of the disk - i.e. if the previous scan got to the end.
>
> This was broken by refill_full_stripes(), which updates last_scanned so that
> refill_dirty was never triggering the searched_from_start path.
>
> But if we change refill_dirty() to always scan the entire disk if necessary,
> regardless of what last_scanned was, the code gets cleaner and we fix that bug
> too.
>
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> ---
> drivers/md/bcache/writeback.c | 37 ++++++++++++++++++++++++++++++-------
> 1 file changed, 30 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index cdde0f32f0..d383024247 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -310,6 +310,10 @@ void bcache_dev_sectors_dirty_add(struct cache_set *c, unsigned inode,
>
> static bool dirty_pred(struct keybuf *buf, struct bkey *k)
> {
> + struct cached_dev *dc = container_of(buf, struct cached_dev, writeback_keys);
> +
> + BUG_ON(KEY_INODE(k) != dc->disk.id);
> +
> return KEY_DIRTY(k);
> }
>
> @@ -359,11 +363,24 @@ next:
> }
> }
>
> +/*
> + * Returns true if we scanned the entire disk
> + */
> static bool refill_dirty(struct cached_dev *dc)
> {
> struct keybuf *buf = &dc->writeback_keys;
> + struct bkey start = KEY(dc->disk.id, 0, 0);
> struct bkey end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
> - bool searched_from_start = false;
> + struct bkey start_pos;
> +
> + /*
> + * make sure keybuf pos is inside the range for this disk - at bringup
> + * we might not be attached yet so this disk's inode nr isn't
> + * initialized then
> + */
> + if (bkey_cmp(&buf->last_scanned, &start) < 0 ||
> + bkey_cmp(&buf->last_scanned, &end) > 0)
> + buf->last_scanned = start;
>
> if (dc->partial_stripes_expensive) {
> refill_full_stripes(dc);
> @@ -371,14 +388,20 @@ static bool refill_dirty(struct cached_dev *dc)
> return false;
> }
>
> - if (bkey_cmp(&buf->last_scanned, &end) >= 0) {
> - buf->last_scanned = KEY(dc->disk.id, 0, 0);
> - searched_from_start = true;
> - }
> -
> + start_pos = buf->last_scanned;
> bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred);
>
> - return bkey_cmp(&buf->last_scanned, &end) >= 0 && searched_from_start;
> + if (bkey_cmp(&buf->last_scanned, &end) < 0)
> + return false;
> +
> + /*
> + * If we get to the end start scanning again from the beginning, and
> + * only scan up to where we initially started scanning from:
> + */
> + buf->last_scanned = start;
> + bch_refill_keybuf(dc->disk.c, buf, &start_pos, dirty_pred);
> +
> + return bkey_cmp(&buf->last_scanned, &start_pos) >= 0;
> }
>
> static void bch_writeback(struct cached_dev *dc)
> --
> 2.5.1
>
--
Denis
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.
2015-09-17 18:31 ` Kent Overstreet
2015-09-17 20:54 ` Denis Bychkov
@ 2015-09-19 4:47 ` Denis Bychkov
2015-10-01 9:01 ` Tim Small
2 siblings, 0 replies; 10+ messages in thread
From: Denis Bychkov @ 2015-09-19 4:47 UTC (permalink / raw)
To: Kent Overstreet
Cc: Vojtech Pavlik, linux-kernel, linux-bcache, Emmanuel Florac,
Jiri Kosina, Jens Axboe
Hi Kent,
After running for a day with this new version of your patch, I did not
notice any problems, so I assume it works. I'll keep an eye on it and
report back if I find anything bad. I believe you finally fixed that
long-standing bug with the writeback threads spinning the CPU.
To Vojtech Pavlik: thank you for catching it! For a long time I had to
run bcache with the partial_stripes_expensive branch disabled, which
seriously affected its performance on my RAID-6.
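
As a rough illustration of why that branch matters on RAID-6: a
full-stripe write lets the array compute parity from the new data
alone, while a partial write forces a read-modify-write of the old data
and both parity blocks. A back-of-the-envelope sketch, assuming
textbook RAID-6 accounting on a hypothetical 8+2 array (not bcache
code):

#include <stdio.h>

int main(void)
{
	int data_disks = 8;	/* hypothetical RAID-6: 8 data + 2 parity */

	/* Full-stripe write: write every data block plus P and Q. */
	int full_ios = data_disks + 2;

	/* Single-block partial write (read-modify-write): read old
	 * data, P and Q, then write new data, P and Q. */
	int partial_ios = 3 + 3;

	printf("full stripe: %d I/Os for %d blocks (%.2f per block)\n",
	       full_ios, data_disks, (double)full_ios / data_disks);
	printf("partial:     %d I/Os for 1 block\n", partial_ios);
	return 0;
}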
On Thu, Sep 17, 2015 at 2:31 PM, Kent Overstreet
<kent.overstreet@gmail.com> wrote:
> On Thu, Sep 17, 2015 at 08:40:54AM -0800, Kent Overstreet wrote:
>> On Thu, Sep 17, 2015 at 11:30:17AM -0400, Denis Bychkov wrote:
>> > Well, it turns out my celebration was a bit premature.
>> >
>> > PLEASE, DO NOT APPLY THE PATCH POSTED BY KENT (not the one Vojtech
>> > posted) ON A PRODUCTION SYSTEM, IT CAUSES DATA CORRUPTION.
>> >
>> > The interesting thing is that it somehow damaged the partition that
>> > was not supposed to receive any writes (the file system was mounted
>> > read-only), so my guess is that the patch causes the blocks residing
>> > in the write-back cache to flush to the wrong blocks on the backing
>> > device.
>> > Everything was going great until I rebooted and saw this in the log:
>> >
>> > [ 19.639082] attempt to access beyond end of device
>> > [ 19.643984] md1p2: rw=1, want=75497520, limit=62914560
>> > [ 19.659033] attempt to access beyond end of device
>> > [ 19.663929] md1p2: rw=1, want=75497624, limit=62914560
>> > [ 19.669447] attempt to access beyond end of device
>> > [ 19.674338] md1p2: rw=1, want=75497752, limit=62914560
>> > [ 19.679195] attempt to access beyond end of device
>> > [ 19.679199] md1p2: rw=1, want=75498080, limit=62914560
>> > [ 19.689007] attempt to access beyond end of device
>> > [ 19.689011] md1p2: rw=1, want=75563376, limit=62914560
>> > [ 19.699055] attempt to access beyond end of device
>> > [ 19.699059] md1p2: rw=1, want=79691816, limit=62914560
>> > [ 19.719246] attempt to access beyond end of device
>> > [ 19.724144] md1p2: rw=1, want=79691928, limit=62914560
>> > ......
>> > (it's a small example, the list was much longer)
>> > And the next thing I found out, the super block on my 10-TB XFS RAID was gone. :)
>> > Oh well, it's a good thing I have backups.
>> > I knew what I was doing when trying the untested patches. I should
>> > have made the RAID md partition read-only, not the file system. I kind
>> > of expected that something could have gone wrong with the file system
>> > I was testing, just did not expect it would fire nukes at the innocent
>> > bystanders.
>>
>> Aw, shit. That's just _bizarre_.
>>
>> I have a theory - it appears that last_scanned isn't getting initialized before
>> it's used, so it's going to be all 0s the very first time... which it appears
>> could cause it to slurp up keys from the wrong device (and if that device was
>> bigger than the correct device, that could explain the accesses beyond the end
>> of the device).
>>
>> Currently just a theory though, and I have no clue why it would only be exposed
>> with my patch.
>
> Here's an updated patch that has a fix for _that_ theory, and also a new
> BUG_ON(). Any chance you could test it?
>
> Oh - I didn't ask - _do_ you have multiple backing devices attached to the same
> cache set? Because if you don't, this isn't it at all...
>
> -- >8 --
> Subject: [PATCH] bcache: Change refill_dirty() to always scan entire disk if necessary
>
> Previously, it would only scan the entire disk if it was starting from the very
> start of the disk - i.e. if the previous scan got to the end.
>
> This was broken by refill_full_stripes(), which updates last_scanned so that
> refill_dirty was never triggering the searched_from_start path.
>
> But if we change refill_dirty() to always scan the entire disk if necessary,
> regardless of what last_scanned was, the code gets cleaner and we fix that bug
> too.
>
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> ---
> drivers/md/bcache/writeback.c | 37 ++++++++++++++++++++++++++++++-------
> 1 file changed, 30 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index cdde0f32f0..d383024247 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -310,6 +310,10 @@ void bcache_dev_sectors_dirty_add(struct cache_set *c, unsigned inode,
>
> static bool dirty_pred(struct keybuf *buf, struct bkey *k)
> {
> + struct cached_dev *dc = container_of(buf, struct cached_dev, writeback_keys);
> +
> + BUG_ON(KEY_INODE(k) != dc->disk.id);
> +
> return KEY_DIRTY(k);
> }
>
> @@ -359,11 +363,24 @@ next:
> }
> }
>
> +/*
> + * Returns true if we scanned the entire disk
> + */
> static bool refill_dirty(struct cached_dev *dc)
> {
> struct keybuf *buf = &dc->writeback_keys;
> + struct bkey start = KEY(dc->disk.id, 0, 0);
> struct bkey end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
> - bool searched_from_start = false;
> + struct bkey start_pos;
> +
> + /*
> + * make sure keybuf pos is inside the range for this disk - at bringup
> + * we might not be attached yet so this disk's inode nr isn't
> + * initialized then
> + */
> + if (bkey_cmp(&buf->last_scanned, &start) < 0 ||
> + bkey_cmp(&buf->last_scanned, &end) > 0)
> + buf->last_scanned = start;
>
> if (dc->partial_stripes_expensive) {
> refill_full_stripes(dc);
> @@ -371,14 +388,20 @@ static bool refill_dirty(struct cached_dev *dc)
> return false;
> }
>
> - if (bkey_cmp(&buf->last_scanned, &end) >= 0) {
> - buf->last_scanned = KEY(dc->disk.id, 0, 0);
> - searched_from_start = true;
> - }
> -
> + start_pos = buf->last_scanned;
> bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred);
>
> - return bkey_cmp(&buf->last_scanned, &end) >= 0 && searched_from_start;
> + if (bkey_cmp(&buf->last_scanned, &end) < 0)
> + return false;
> +
> + /*
> + * If we get to the end start scanning again from the beginning, and
> + * only scan up to where we initially started scanning from:
> + */
> + buf->last_scanned = start;
> + bch_refill_keybuf(dc->disk.c, buf, &start_pos, dirty_pred);
> +
> + return bkey_cmp(&buf->last_scanned, &start_pos) >= 0;
> }
>
> static void bch_writeback(struct cached_dev *dc)
> --
> 2.5.1
>
--
Denis
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.
2015-09-16 21:08 ` Denis Bychkov
2015-09-17 15:30 ` Denis Bychkov
@ 2015-09-29 23:00 ` Peter Kieser
1 sibling, 0 replies; 10+ messages in thread
From: Peter Kieser @ 2015-09-29 23:00 UTC (permalink / raw)
To: Denis Bychkov, Kent Overstreet
Cc: Vojtech Pavlik, linux-kernel, linux-bcache, Kent Overstreet,
Emmanuel Florac, Jiri Kosina, Jens Axboe
My question still stands:
Why have none of these patches been submitted to mainline or backported
to stable kernels?
Thanks.
On 2015-09-16 2:08 PM, Denis Bychkov wrote:
> Hi Kent, Vojtech, list
>
> Your fix works perfectly; I can finally remove my patch that disables
> the partial_stripes_expensive branch and unleash the full bcache
> performance on my RAID-6 array. I had been aware of this problem for
> some time, but never learned the bcache codebase well enough to find
> out why this happens with partial-stripe write-back enabled, so I
> just disabled it to avoid the CPU spinning. Thanks a lot to Vojtech
> for finding the reason and to you for the patch. I have quite a
> collection of stability patches for mainline bcache. Would you please
> be kind enough to review them when convenient and merge upstream the
> ones you think are worth merging? It will be the 4.4 merge window, I
> believe.
> Please find all the patches attached. All of them are rebased on the
> mainline 4.2 kernel. If somebody needs the 4.1.x versions, please let
> me know.
>
>
> On Wed, Sep 16, 2015 at 7:32 AM, Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
>> On Sat, Sep 05, 2015 at 01:10:12PM +0200, Vojtech Pavlik wrote:
>>> Fix writeback_thread never finishing writing back all dirty data in bcache when
>>> partial_stripes_expensive is set, and spinning, consuming 100% of CPU instead.
>>>
>>> Signed-off-by: Vojtech Pavlik <vojtech@suse.com>
>>> ---
>>>
>>> This is a fix for the current upstream bcache, not the devel branch.
>>>
>>> If partial_stripes_expensive is set for a cache set, then writeback_thread
>>> always attempts to write full stripes out back to the backing device first.
>>> However, since it does that based on a bitmap and not a simple linear
>>> search, like the rest of the code of refill_dirty(), it changes the
>>> last_scanned pointer so that never points to 'end'. refill_dirty() then
>>> never tries to scan from 'start', resulting in the writeback_thread
>>> looping, consuming 100% of CPU, but never making progress in writing out
>>> the incomplete dirty stripes in the cache.
>>>
>>> Scanning the tree after not finding enough full stripes fixes the issue.
>>>
>>> Incomplete dirty stripes are written to the backing device, the device
>>> eventually reaches a clean state if there is nothing dirtying data and
>>> writeback_thread sleeps. This also fixes the problem of the cache device
>>> not being possible to detach in the partial_stripes_expensive scenario.
>> Good catch!
>>
>>> It may be more efficient to separate the last_scanned field for normal and
>>> stripe scans instead.
>> Actually, I think I like your approach - I think it gives us the behaviour we
>> want, although it took some thinking to figure out why.
>>
>> One of the reasons for last_scanned is just to do writeback in LBA order and
>> avoid making the disks seek around - so, if refill_full_stripes() did just queue
>> up some IO at last_scanned, we do want to keep scanning from that position for
>> non-full stripes.
>>
>> But since (as you noted) last_scanned is never going to be >= end after calling
>> refill_full_stripes() (if there were any full stripes) - really the correct thing
>> to do is loop around to the start if necessary so that we can successfully scan
>> everything.
>>
>> The only inefficiency I see with your approach is that on the second scan, after
>> we've looped, we don't need to scan to the end of the disk - we only need to
>> scan up to where we initially started.
>>
>> But, another observation - if we change refill_dirty() so that it always scans
>> the entire keyspace if necessary, regardless of where last_scanned was - well,
>> that really isn't specific to the refill_full_stripes() case - we can just
>> always do that.
>>
>> Can you give this patch a try?
>>
>> -- >8 --
>> Subject: [PATCH] bcache: Change refill_dirty() to always scan entire disk if necessary
>>
>> Previously, it would only scan the entire disk if it was starting from the very
>> start of the disk - i.e. if the previous scan got to the end.
>>
>> This was broken by refill_full_stripes(), which updates last_scanned so that
>> refill_dirty was never triggering the searched_from_start path.
>>
>> But if we change refill_dirty() to always scan the entire disk if necessary,
>> regardless of what last_scanned was, the code gets cleaner and we fix that bug
>> too.
>>
>> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
>> ---
>> drivers/md/bcache/writeback.c | 24 ++++++++++++++++--------
>> 1 file changed, 16 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
>> index cdde0f32f0..08a52db38b 100644
>> --- a/drivers/md/bcache/writeback.c
>> +++ b/drivers/md/bcache/writeback.c
>> @@ -359,11 +359,13 @@ next:
>> }
>> }
>>
>> +/*
>> + * Returns true if we scanned the entire disk
>> + */
>> static bool refill_dirty(struct cached_dev *dc)
>> {
>> struct keybuf *buf = &dc->writeback_keys;
>> - struct bkey end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
>> - bool searched_from_start = false;
>> + struct bkey start_pos, end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
>>
>> if (dc->partial_stripes_expensive) {
>> refill_full_stripes(dc);
>> @@ -371,14 +373,20 @@ static bool refill_dirty(struct cached_dev *dc)
>> return false;
>> }
>>
>> - if (bkey_cmp(&buf->last_scanned, &end) >= 0) {
>> - buf->last_scanned = KEY(dc->disk.id, 0, 0);
>> - searched_from_start = true;
>> - }
>> -
>> + start_pos = buf->last_scanned;
>> bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred);
>>
>> - return bkey_cmp(&buf->last_scanned, &end) >= 0 && searched_from_start;
>> + if (bkey_cmp(&buf->last_scanned, &end) < 0)
>> + return false;
>> +
>> + /*
>> + * If we get to the end start scanning again from the beginning, and
>> + * only scan up to where we initially started scanning from:
>> + */
>> + buf->last_scanned = KEY(dc->disk.id, 0, 0);
>> + bch_refill_keybuf(dc->disk.c, buf, &start_pos, dirty_pred);
>> +
>> + return bkey_cmp(&buf->last_scanned, &start_pos) >= 0;
>> }
>>
>> static void bch_writeback(struct cached_dev *dc)
>> --
>> 2.5.1
>>
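
The scan order Kent describes above - one pass from wherever the
previous refill stopped to the end of the keyspace, then a wrap-around
pass from the start up to the original position - can be modelled
standalone. A toy sketch with a hypothetical fixed keyspace and buffer
size (not the real bch_refill_keybuf()):

#include <stdbool.h>
#include <stdio.h>

#define NKEYS	16	/* toy keyspace */
#define BUF_CAP	 4	/* toy writeback buffer capacity */

/* Mirrors the return semantics of the patched refill_dirty(): true
 * only if the whole keyspace was covered this time around. */
static bool refill(const bool dirty[NKEYS], int *last_scanned,
		   int queued[BUF_CAP], int *nqueued)
{
	int start_pos = *last_scanned;

	*nqueued = 0;

	/* First pass: from where the previous refill stopped to the
	 * end of the keyspace. */
	for (int i = start_pos; i < NKEYS; i++) {
		if (dirty[i]) {
			if (*nqueued == BUF_CAP) {
				*last_scanned = i;	/* buffer full */
				return false;
			}
			queued[(*nqueued)++] = i;
		}
	}

	/* Reached the end: wrap and scan from the start up to where
	 * this refill began. */
	for (int i = 0; i < start_pos; i++) {
		if (dirty[i]) {
			if (*nqueued == BUF_CAP) {
				*last_scanned = i;
				return false;
			}
			queued[(*nqueued)++] = i;
		}
	}

	*last_scanned = start_pos;
	return true;		/* scanned the entire keyspace */
}

int main(void)
{
	bool dirty[NKEYS] = { [1] = true, [5] = true, [9] = true };
	int queued[BUF_CAP], nqueued, pos = 7;

	bool whole = refill(dirty, &pos, queued, &nqueued);
	printf("covered whole disk: %d, queued %d keys\n", whole, nqueued);
	return 0;
}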
--
Peter Kieser
604.338.9294 / peter@kieser.ca
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.
2015-09-17 18:31 ` Kent Overstreet
2015-09-17 20:54 ` Denis Bychkov
2015-09-19 4:47 ` Denis Bychkov
@ 2015-10-01 9:01 ` Tim Small
2 siblings, 0 replies; 10+ messages in thread
From: Tim Small @ 2015-10-01 9:01 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcache
On 17/09/15 19:31, Kent Overstreet wrote:
> Here's an updated patch that has a fix for _that_ theory, and also a new
> BUG_ON(). Any chance you could test it?
I've applied your patch (to the Ubuntu 4.2.1 tree, since I need some of
their non-mainline AppArmor changes on this box), and it has stopped the
excessive CPU usage by the bcache_writebac thread.
I'd first experienced this CPU spinning when I tried to disable cache
usage (I'd had some lockups on the machine that I was trying to debug,
so I switched bcache cache_mode to none whilst I debugged, as part of my
process of elimination). I assume the bug was hit whilst trying to flush
the dirty cache data.
I now have:
root@magic:~# cat /sys/block/bcache0/bcache/state
clean
root@magic:~# cat /sys/block/bcache0/bcache/dirty_data
1.5M
root@magic:~# cat /sys/block/bcache0/bcache/cache_mode
writethrough writeback writearound [none]
root@magic:~# uname -a
Linux magic 4.2.1 #2 SMP Sat Sep 26 10:00:40 UTC 2015 x86_64 x86_64
x86_64 GNU/Linux
i.e. state and cache_mode are inconsistent with dirty_data. Is this of
any concern and/or likely to cause problems if I re-enable the cache?
Cheers,
Tim.
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2015-10-01 9:38 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-05 11:10 [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes Vojtech Pavlik
2015-09-16 11:32 ` Kent Overstreet
2015-09-16 21:08 ` Denis Bychkov
2015-09-17 15:30 ` Denis Bychkov
2015-09-17 16:40 ` Kent Overstreet
2015-09-17 18:31 ` Kent Overstreet
2015-09-17 20:54 ` Denis Bychkov
2015-09-19 4:47 ` Denis Bychkov
2015-10-01 9:01 ` Tim Small
2015-09-29 23:00 ` Peter Kieser
This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).