* [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.

From: Vojtech Pavlik @ 2015-09-05 11:10 UTC
To: Kent Overstreet
Cc: linux-kernel, linux-bcache, Kent Overstreet, Emmanuel Florac,
	Jiri Kosina, Jens Axboe

Fix writeback_thread never finishing writing back all dirty data in
bcache when partial_stripes_expensive is set, and instead spinning,
consuming 100% of CPU.

Signed-off-by: Vojtech Pavlik <vojtech@suse.com>
---

This is a fix for the current upstream bcache, not the devel branch.

If partial_stripes_expensive is set for a cache set, then writeback_thread
always attempts to write full stripes back to the backing device first.
However, since it does that based on a bitmap and not a simple linear
search like the rest of the code in refill_dirty(), it moves the
last_scanned pointer so that it never points to 'end'. refill_dirty()
then never tries to scan from 'start', so writeback_thread loops,
consuming 100% of CPU, but never makes progress writing out the
incomplete dirty stripes in the cache.

Scanning the tree after not finding enough full stripes fixes the issue.

Incomplete dirty stripes are written to the backing device, the device
eventually reaches a clean state if nothing is dirtying data, and
writeback_thread sleeps. This also fixes the cache device being
impossible to detach in the partial_stripes_expensive scenario.

It may be more efficient to instead keep separate last_scanned fields
for the normal and stripe scans.

 drivers/md/bcache/writeback.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index f1986bc..6f8b81d 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -382,6 +382,7 @@ static bool refill_dirty(struct cached_dev *dc)
 		refill_full_stripes(dc);
 		if (array_freelist_empty(&buf->freelist))
 			return false;
+		bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred);
 	}
 
 	if (bkey_cmp(&buf->last_scanned, &end) >= 0) {
--
2.1.4
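A note for readers following the pointer comparisons here: bcache keys
order the keyspace by backing-device inode first, then offset, which is
what makes 'last_scanned', 'start' and 'end' comparable at all. A rough
model of that ordering (simplified stand-in types, not the driver's
actual struct bkey, KEY() or bkey_cmp() definitions):

#include <stdint.h>

/* Simplified model of a bcache key: 'inode' is the backing device's id,
 * 'offset' is the position on that device.  Keys sort by inode first,
 * then offset, so each device owns a contiguous slice of the keyspace. */
struct model_bkey {
	uint64_t inode;
	uint64_t offset;
};

static int model_bkey_cmp(const struct model_bkey *l,
			  const struct model_bkey *r)
{
	if (l->inode != r->inode)
		return l->inode < r->inode ? -1 : 1;
	if (l->offset != r->offset)
		return l->offset < r->offset ? -1 : 1;
	return 0;
}

With this ordering, refill_dirty()'s scan is a linear walk from
last_scanned towards KEY(dc->disk.id, MAX_KEY_OFFSET, 0); the bug above
is that the stripe-bitmap path parks last_scanned strictly below that
end key, so the reset-to-'start' branch never fires again.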
* Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.

From: Kent Overstreet @ 2015-09-16 11:32 UTC
To: Vojtech Pavlik
Cc: linux-kernel, linux-bcache, Kent Overstreet, Emmanuel Florac,
	Jiri Kosina, Jens Axboe

On Sat, Sep 05, 2015 at 01:10:12PM +0200, Vojtech Pavlik wrote:
> Fix writeback_thread never finishing writing back all dirty data in
> bcache when partial_stripes_expensive is set, and instead spinning,
> consuming 100% of CPU.
>
> Signed-off-by: Vojtech Pavlik <vojtech@suse.com>
> ---
>
> This is a fix for the current upstream bcache, not the devel branch.
>
> If partial_stripes_expensive is set for a cache set, then writeback_thread
> always attempts to write full stripes back to the backing device first.
> However, since it does that based on a bitmap and not a simple linear
> search like the rest of the code in refill_dirty(), it moves the
> last_scanned pointer so that it never points to 'end'. refill_dirty()
> then never tries to scan from 'start', so writeback_thread loops,
> consuming 100% of CPU, but never makes progress writing out the
> incomplete dirty stripes in the cache.
>
> Scanning the tree after not finding enough full stripes fixes the issue.
>
> Incomplete dirty stripes are written to the backing device, the device
> eventually reaches a clean state if nothing is dirtying data, and
> writeback_thread sleeps. This also fixes the cache device being
> impossible to detach in the partial_stripes_expensive scenario.

Good catch!

> It may be more efficient to instead keep separate last_scanned fields
> for the normal and stripe scans.

Actually, I think I like your approach - I think it gives us the
behaviour we want, although it took some thinking to figure out why.

One of the reasons for last_scanned is just to do writeback in LBA order
and avoid making the disks seek around - so, if refill_full_stripes()
did just queue up some IO at last_scanned, we do want to keep scanning
from that position for non full stripes.

But since (as you noted) last_scanned is never going to be >= end after
calling refill_full_stripes() (if there were any full stripes) - really
the correct thing to do is loop around to the start if necessary so that
we can successfully scan everything.

The only inefficiency I see with your approach is that on the second
scan, after we've looped, we don't need to scan to the end of the disk -
we only need to scan up to where we initially started.

But, another observation - if we change refill_dirty() so that it always
scans the entire keyspace if necessary, regardless of where last_scanned
was - well, that really isn't specific to the refill_full_stripes() case
- we can just always do that.

Can you give this patch a try?

-- >8 --
Subject: [PATCH] bcache: Change refill_dirty() to always scan entire disk if necessary

Previously, it would only scan the entire disk if it was starting from
the very start of the disk - i.e. if the previous scan got to the end.

This was broken by refill_full_stripes(), which updates last_scanned so
that refill_dirty() was never triggering the searched_from_start path.

But if we change refill_dirty() to always scan the entire disk if
necessary, regardless of what last_scanned was, the code gets cleaner
and we fix that bug too.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 drivers/md/bcache/writeback.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index cdde0f32f0..08a52db38b 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -359,11 +359,13 @@ next:
 	}
 }
 
+/*
+ * Returns true if we scanned the entire disk
+ */
 static bool refill_dirty(struct cached_dev *dc)
 {
 	struct keybuf *buf = &dc->writeback_keys;
-	struct bkey end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
-	bool searched_from_start = false;
+	struct bkey start_pos, end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
 
 	if (dc->partial_stripes_expensive) {
 		refill_full_stripes(dc);
@@ -371,14 +373,20 @@ static bool refill_dirty(struct cached_dev *dc)
 			return false;
 	}
 
-	if (bkey_cmp(&buf->last_scanned, &end) >= 0) {
-		buf->last_scanned = KEY(dc->disk.id, 0, 0);
-		searched_from_start = true;
-	}
-
+	start_pos = buf->last_scanned;
 	bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred);
 
-	return bkey_cmp(&buf->last_scanned, &end) >= 0 && searched_from_start;
+	if (bkey_cmp(&buf->last_scanned, &end) < 0)
+		return false;
+
+	/*
+	 * If we get to the end start scanning again from the beginning, and
+	 * only scan up to where we initially started scanning from:
+	 */
+	buf->last_scanned = KEY(dc->disk.id, 0, 0);
+	bch_refill_keybuf(dc->disk.c, buf, &start_pos, dirty_pred);
+
+	return bkey_cmp(&buf->last_scanned, &start_pos) >= 0;
 }
 
 static void bch_writeback(struct cached_dev *dc)
--
2.5.1
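The control flow of this patch, pulled out of the driver as a toy
(hypothetical scan_range() standing in for bch_refill_keybuf(), plain
integers standing in for struct bkey; only the wrap-around logic is the
point):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for bch_refill_keybuf(): collects dirty keys in
 * (*pos, limit], advancing *pos, and may stop early once the buffer of
 * keys to write back fills up. */
extern void scan_range(uint64_t *pos, uint64_t limit);

/* Returns true if the entire keyspace [0, max] was covered. */
static bool refill_wraparound(uint64_t *pos, uint64_t max)
{
	uint64_t start_pos = *pos;

	/* First pass: from wherever the previous refill stopped, to the end. */
	scan_range(pos, max);
	if (*pos < max)
		return false;	/* buffer filled before reaching the end */

	/*
	 * Second pass: wrap to the start, but only rescan what the first
	 * pass skipped - everything up to where we began.
	 */
	*pos = 0;
	scan_range(pos, start_pos);

	return *pos >= start_pos;
}

The second pass stopping at start_pos is exactly the inefficiency fix
Kent mentions: after looping around, rescanning past the original
starting point would only revisit keys the first pass already covered.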
* Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.

From: Denis Bychkov @ 2015-09-16 21:08 UTC
To: Kent Overstreet
Cc: Vojtech Pavlik, linux-kernel, linux-bcache, Kent Overstreet,
	Emmanuel Florac, Jiri Kosina, Jens Axboe

[-- Attachment #1: Type: text/plain, Size: 6228 bytes --]

Hi Kent, Vojtech, list,

Your fix works perfectly; I can finally remove my patch that disables
the partial_stripes_expensive branch and unleash the full bcache
performance on my RAID-6 array. I was aware of this problem for some
time, but never got to learning the bcache codebase well enough to find
out why this happens with partial-stripes write-back enabled, so I just
disabled it to avoid the CPU spinning. Thanks a lot to Vojtech for
finding the cause and to you for the patch.

I have quite a collection of stability patches for mainline bcache.
Would you please be kind enough to review them when convenient and
merge upstream the ones you think are worth merging? It will be the 4.4
merge window, I believe. Please find all the patches attached. All of
them are rebased on the mainline 4.2 kernel. If somebody needs the
4.1.x versions, please let me know.

On Wed, Sep 16, 2015 at 7:32 AM, Kent Overstreet
<kent.overstreet@gmail.com> wrote:
> [...]
--
Denis

[-- Attachment #2: bcache_correct_errors_on_register.patch --]
[-- Type: application/octet-stream, Size: 861 bytes --]

From: Denis Bychkov <manover@gmail.com>

Allow the use of register, not register_quiet, in udev to avoid the
"device busy" error. The initial patch proposed at
https://lkml.org/lkml/2013/8/26/549 by Gabriel de Perthuis
<g2p.code@gmail.com> does not unlock the mutex and hangs the kernel.
See http://thread.gmane.org/gmane.linux.kernel.bcache.devel/2594 for
the discussion.
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1964,6 +1964,8 @@
 			else
 				err = "device busy";
 			mutex_unlock(&bch_register_lock);
+			if (attr == &ksysfs_register_quiet)
+				goto out;
 		}
 		goto err;
 	}
@@ -2002,8 +2004,7 @@
 err_close:
 	blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
 err:
-	if (attr != &ksysfs_register_quiet)
-		pr_info("error opening %s: %s", path, err);
+	pr_info("error opening %s: %s", path, err);
 	ret = -EINVAL;
 	goto out;
 }

[-- Attachment #3: bcache-attach-detach-cleanup.patch --]
[-- Type: application/octet-stream, Size: 5134 bytes --]

From: Joshua Schmid <jschmid@suse.com>
Subject: [PATCH] bcache: [BUG] clear BCACHE_DEV_UNLINK_DONE flag when attaching a backing device
Newsgroups: gmane.linux.kernel.bcache.devel
Date: 2015-02-03 11:18:01 GMT

From: Zheng Liu <wenqing.lz@taobao.com>

This bug can be reproduced by the following script:

#!/bin/bash

bcache_sysfs="/sys/fs/bcache"

function clear_cache()
{
	if [ ! -e $bcache_sysfs ]; then
		echo "no bcache sysfs"
		exit
	fi

	cset_uuid=$(ls -l $bcache_sysfs|head -n 2|tail -n 1|awk '{print $9}')
	sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/detach"
	sleep 5
	sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/attach"
}

for ((i=0;i<10;i++)); do
	clear_cache
done

The warning messages look like below:

[  275.948611] ------------[ cut here ]------------
[  275.963840] WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xb8/0xd0() (Tainted: P W --------------- )
[  275.979253] Hardware name: Tecal RH2285
[  275.994106] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:09.0/0000:08:00.0/host4/target4:2:1/4:2:1:0/block/sdb/sdb1/bcache/cache'
[  276.024105] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
[  276.072643] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1
[  276.089315] Call Trace:
[  276.105801]  [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0
[  276.122650]  [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50
[  276.139361]  [<ffffffff81205c08>] ? sysfs_add_one+0xb8/0xd0
[  276.156012]  [<ffffffff8120609b>] ? sysfs_do_create_link+0x12b/0x170
[  276.172682]  [<ffffffff81206113>] ? sysfs_create_link+0x13/0x20
[  276.189282]  [<ffffffffa03bda21>] ? bcache_device_link+0xc1/0x110 [bcache]
[  276.205993]  [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache]
[  276.222794]  [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache]
[  276.239680]  [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110
[  276.256594]  [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170
[  276.273364]  [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0
[  276.290133]  [<ffffffff811890b1>] ? sys_write+0x51/0x90
[  276.306368]  [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b
[  276.322301] ---[ end trace 9f5d4fcdd0c3edfb ]---
[  276.338241] ------------[ cut here ]------------
[  276.354109] WARNING: at /home/wenqing.lz/bcache/bcache/super.c:720 bcache_device_link+0xdf/0x110 [bcache]() (Tainted: P W --------------- )
[  276.386017] Hardware name: Tecal RH2285
[  276.401430] Couldn't create device <-> cache set symlinks
[  276.401759] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
[  276.465477] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1
[  276.482169] Call Trace:
[  276.498610]  [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0
[  276.515405]  [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50
[  276.532059]  [<ffffffffa03bda3f>] ? bcache_device_link+0xdf/0x110 [bcache]
[  276.548808]  [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache]
[  276.565569]  [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache]
[  276.582418]  [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110
[  276.599341]  [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170
[  276.616142]  [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0
[  276.632607]  [<ffffffff811890b1>] ? sys_write+0x51/0x90
[  276.648671]  [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b
[  276.664756] ---[ end trace 9f5d4fcdd0c3edfc ]---

We forget to clear the BCACHE_DEV_UNLINK_DONE flag in the
bcache_device_attach() function when we attach a backing device the
first time. After detaching this backing device, this flag stays set
and sysfs_remove_link() isn't called in bcache_device_unlink(). Then
when we attach this backing device again, sysfs_create_link() returns
an EEXIST error in bcache_device_link(). So the fix is trivial: we
clear this flag in bcache_device_link().
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
---
 drivers/md/bcache/super.c | 2 ++
 1 file changed, 2 insertions(+)

--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -708,6 +708,8 @@
 	WARN(sysfs_create_link(&d->kobj, &c->kobj, "cache") ||
 	     sysfs_create_link(&c->kobj, &d->kobj, d->name),
 	     "Couldn't create device <-> cache set symlinks");
+
+	clear_bit(BCACHE_DEV_UNLINK_DONE, &d->flags);
 }
 
 static void bcache_device_detach(struct bcache_device *d)

[-- Attachment #4: bcache-cond_resched.patch --]
[-- Type: application/octet-stream, Size: 735 bytes --]

From f0e6320a7874af434575f37a11ec6e4992cef790 Mon Sep 17 00:00:00 2001
From: Kent Overstreet <kmo@daterainc.com>
Date: Sat, 1 Nov 2014 13:44:47 -0700
Subject: [PATCH 1/5] bcache: Add a cond_resched() call to gc

Git-commit: f0e6320a7874af434575f37a11ec6e4992cef790
Patch-mainline: Submitted
References: bnc#910440
Change-id: Id4f18c533b80ddb40df94ed0bb5e2a236a4bc325

Signed-off-by: Takashi Iwai <tiwai@suse.de>
---
 drivers/md/bcache/btree.c | 1 +
 1 file changed, 1 insertion(+)

--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1741,6 +1741,7 @@
 	do {
 		ret = btree_root(gc_root, c, &op, &writes, &stats);
 		closure_sync(&writes);
+		cond_resched();
 
 		if (ret && ret != -EAGAIN)
 			pr_warn("gc failed!");

[-- Attachment #5: bcache-fix_passthrough_shutdown_crash.patch --]
[-- Type: application/octet-stream, Size: 1044 bytes --]

From e949c64fa6acbdaab999410250855a6a4fc6bad1 Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.bader@canonical.com>
Date: Mon, 18 Aug 2014 20:00:13 +0200
Subject: [PATCH] bcache: prevent crash on changing writeback_running

Commit a664d0f05a2e ("bcache: fix crash on shutdown in passthrough
mode") added a safeguard in the shutdown case. At least while not being
attached it is also possible to trigger a kernel bug by writing into
writeback_running. This change adds the same check before trying to
wake up the thread for that case.
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 drivers/md/bcache/writeback.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -63,7 +63,8 @@
 
 static inline void bch_writeback_queue(struct cached_dev *dc)
 {
-	wake_up_process(dc->writeback_thread);
+	if (!IS_ERR_OR_NULL(dc->writeback_thread))
+		wake_up_process(dc->writeback_thread);
 }
 
 static inline void bch_writeback_add(struct cached_dev *dc)

[-- Attachment #6: bcache-fix-memleak-bch_cached_dev_run.patch --]
[-- Type: application/octet-stream, Size: 802 bytes --]

From: Joshua Schmid <jschmid@suse.com>
Subject: [PATCH] fix a leak in bch_cached_dev_run()
Newsgroups: gmane.linux.kernel.bcache.devel
Date: 2015-02-03 11:24:06 GMT

From: Al Viro <viro@ZenIV.linux.org.uk>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Joshua Schmid <jschmid@suse.com>
---
 drivers/md/bcache/super.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -872,8 +872,11 @@
 	buf[SB_LABEL_SIZE] = '\0';
 	env[2] = kasprintf(GFP_KERNEL, "CACHED_LABEL=%s", buf);
 
-	if (atomic_xchg(&dc->running, 1))
+	if (atomic_xchg(&dc->running, 1)) {
+		kfree(env[1]);
+		kfree(env[2]);
 		return;
+	}
 
 	if (!d->c &&
 	    BDEV_STATE(&dc->sb) != BDEV_STATE_NONE) {

[-- Attachment #7: bcache-rcu-sched-bugfix.patch --]
[-- Type: application/octet-stream, Size: 2490 bytes --]

From: Zheng Liu <gnehzuil.liu@gmail.com>
To: linux-bcache@vger.kernel.org
Cc: Zheng Liu <wenqing.lz@taobao.com>, Joshua Schmid <jschmid@suse.com>,
	Zhu Yanhai <zhu.yanhai@gmail.com>, Kent Overstreet <kmo@daterainc.com>
Subject: [PATCH v2] bcache: fix a livelock in btree lock
Date: Wed, 25 Feb 2015 20:32:09 +0800

From: Zheng Liu <wenqing.lz@taobao.com>

This commit tries to fix a livelock in bcache. This livelock can happen
when we cause a huge number of cache misses simultaneously.

When we get a cache miss, bcache executes the following path:

->cached_dev_make_request()
  ->cached_dev_read()
    ->cached_lookup()
      ->bch->btree_map_keys()
        ->btree_root()  <------------------------
          ->bch_btree_map_keys_recurse()         |
            ->cache_lookup_fn()                  |
              ->cached_dev_cache_miss()          |
                ->bch_btree_insert_check_key() --|
                  [If btree->seq is not equal to seq + 1, we should
                   return EINTR and traverse the btree again.]

In the bch_btree_insert_check_key() function we first need to check the
upgrade flag (op->lock == -1), and when this flag is true we need to
release the read btree->lock and try to take the write btree->lock.
While this write lock is taken and released, btree->seq is monotonically
increased in order to detect modification by other threads during the
cache miss (see btree.h:74). But if there are many cache misses caused
by other requests, we can hit a livelock because btree->seq is always
being changed by others, so no one can make progress.

This commit takes the write btree->lock if we encounter a race when we
traverse the btree. Although it sacrifices scalability, it ensures that
only one thread at a time can modify the btree.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
Cc: Joshua Schmid <jschmid@suse.com>
Cc: Zhu Yanhai <zhu.yanhai@gmail.com>
Cc: Kent Overstreet <kmo@daterainc.com>
---
changelog:
v2: fix a bug that stops all concurrent writes unconditionally.
 drivers/md/bcache/btree.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -2162,8 +2162,10 @@
 		rw_lock(true, b, b->level);
 
 		if (b->key.ptr[0] != btree_ptr ||
-		    b->seq != seq + 1)
+		    b->seq != seq + 1) {
+			op->lock = b->level;
 			goto out;
+		}
 	}
 
 	SET_KEY_PTRS(check_key, 1);

[-- Attachment #8: bcache-refill_dirty_to_always_scan_entire_disk.patch --]
[-- Type: application/octet-stream, Size: 1963 bytes --]

Subject: [PATCH] bcache: Change refill_dirty() to always scan entire disk if necessary

Previously, it would only scan the entire disk if it was starting from
the very start of the disk - i.e. if the previous scan got to the end.

This was broken by refill_full_stripes(), which updates last_scanned so
that refill_dirty() was never triggering the searched_from_start path.

But if we change refill_dirty() to always scan the entire disk if
necessary, regardless of what last_scanned was, the code gets cleaner
and we fix that bug too.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 drivers/md/bcache/writeback.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -372,11 +372,13 @@
 	}
 }
 
+/*
+ * Returns true if we scanned the entire disk
+ */
 static bool refill_dirty(struct cached_dev *dc)
 {
 	struct keybuf *buf = &dc->writeback_keys;
-	struct bkey end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
-	bool searched_from_start = false;
+	struct bkey start_pos, end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
 
 	if (dc->partial_stripes_expensive) {
 		refill_full_stripes(dc);
@@ -384,14 +386,20 @@
 			return false;
 	}
 
-	if (bkey_cmp(&buf->last_scanned, &end) >= 0) {
-		buf->last_scanned = KEY(dc->disk.id, 0, 0);
-		searched_from_start = true;
-	}
-
+	start_pos = buf->last_scanned;
 	bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred);
 
-	return bkey_cmp(&buf->last_scanned, &end) >= 0 && searched_from_start;
+	if (bkey_cmp(&buf->last_scanned, &end) < 0)
+		return false;
+
+	/*
+	 * If we get to the end start scanning again from the beginning, and
+	 * only scan up to where we initially started scanning from:
+	 */
+	buf->last_scanned = KEY(dc->disk.id, 0, 0);
+	bch_refill_keybuf(dc->disk.c, buf, &start_pos, dirty_pred);
+
+	return bkey_cmp(&buf->last_scanned, &start_pos) >= 0;
 }
 
 static int bch_writeback_thread(void *arg)

[-- Attachment #9: bcache-remove-bio-splitting.patch --]
[-- Type: application/octet-stream, Size: 10266 bytes --]

From: Kent Overstreet <kent.overstreet@gmail.com>

The bcache driver has always accepted arbitrarily large bios and split
them internally. Now that every driver must accept arbitrarily large
bios, this code isn't necessary anymore.
Cc: linux-bcache@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
---
 drivers/md/bcache/bcache.h    |  18 --------
 drivers/md/bcache/io.c        | 100 +-----------------------------------------
 drivers/md/bcache/journal.c   |   4 +-
 drivers/md/bcache/request.c   |  16 +++----
 drivers/md/bcache/super.c     |  32 +-------------
 drivers/md/bcache/util.h      |   5 ++-
 drivers/md/bcache/writeback.c |   4 +-
 7 files changed, 18 insertions(+), 161 deletions(-)

--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -243,19 +243,6 @@
 	DECLARE_ARRAY_ALLOCATOR(struct keybuf_key, freelist, KEYBUF_NR);
 };
 
-struct bio_split_pool {
-	struct bio_set		*bio_split;
-	mempool_t		*bio_split_hook;
-};
-
-struct bio_split_hook {
-	struct closure		cl;
-	struct bio_split_pool	*p;
-	struct bio		*bio;
-	bio_end_io_t		*bi_end_io;
-	void			*bi_private;
-};
-
 struct bcache_device {
 	struct closure		cl;
 
@@ -288,8 +275,6 @@
 	int (*cache_miss)(struct btree *, struct search *,
 			  struct bio *, unsigned);
 	int (*ioctl) (struct bcache_device *, fmode_t, unsigned, unsigned long);
-
-	struct bio_split_pool	bio_split_hook;
 };
 
 struct io {
@@ -454,8 +439,6 @@
 	atomic_long_t		meta_sectors_written;
 	atomic_long_t		btree_sectors_written;
 	atomic_long_t		sectors_written;
-
-	struct bio_split_pool	bio_split_hook;
 };
 
 struct gc_stat {
@@ -873,7 +856,6 @@
 
 void bch_bbio_free(struct bio *, struct cache_set *);
 struct bio *bch_bbio_alloc(struct cache_set *);
-void bch_generic_make_request(struct bio *, struct bio_split_pool *);
 void __bch_submit_bbio(struct bio *, struct cache_set *);
 void bch_submit_bbio(struct bio *, struct cache_set *, struct bkey *, unsigned);

--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -61,7 +61,7 @@
 	bio->bi_private = &cl;
 	bch_bio_map(bio, data);
 
-	closure_bio_submit(bio, &cl, ca);
+	closure_bio_submit(bio, &cl);
 	closure_sync(&cl);
 
 	/* This function could be simpler now since we no longer write
@@ -648,7 +648,7 @@
 	spin_unlock(&c->journal.lock);
 
 	while ((bio = bio_list_pop(&list)))
-		closure_bio_submit(bio, cl, c->cache[0]);
+		closure_bio_submit(bio, cl);
 
 	continue_at(cl, journal_write_done, NULL);
 }

--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -718,7 +718,7 @@
 
 		/* XXX: invalidate cache */
 
-		closure_bio_submit(bio, cl, s->d);
+		closure_bio_submit(bio, cl);
 	}
 
 	continue_at(cl, cached_dev_cache_miss_done, NULL);
@@ -841,7 +841,7 @@
 	s->cache_miss	= miss;
 	s->iop.bio	= cache_bio;
 	bio_get(cache_bio);
-	closure_bio_submit(cache_bio, &s->cl, s->d);
+	closure_bio_submit(cache_bio, &s->cl);
 
 	return ret;
 out_put:
@@ -849,7 +849,7 @@
 out_submit:
 	miss->bi_end_io		= request_endio;
 	miss->bi_private	= &s->cl;
-	closure_bio_submit(miss, &s->cl, s->d);
+	closure_bio_submit(miss, &s->cl);
 	return ret;
 }
 
@@ -914,7 +914,7 @@
 
 		if (!(bio->bi_rw & REQ_DISCARD) ||
 		    blk_queue_discard(bdev_get_queue(dc->bdev)))
-			closure_bio_submit(bio, cl, s->d);
+			closure_bio_submit(bio, cl);
 	} else if (s->iop.writeback) {
 		bch_writeback_add(dc);
 		s->iop.bio = bio;
@@ -929,12 +929,12 @@
 			flush->bi_end_io = request_endio;
 			flush->bi_private = cl;
 
-			closure_bio_submit(flush, cl, s->d);
+			closure_bio_submit(flush, cl);
 		}
 	} else {
 		s->iop.bio = bio_clone_fast(bio, GFP_NOIO, dc->disk.bio_split);
 
-		closure_bio_submit(bio, cl, s->d);
+		closure_bio_submit(bio, cl);
 	}
 
 	closure_call(&s->iop.cl, bch_data_insert, NULL, cl);
@@ -950,7 +950,7 @@
 		bch_journal_meta(s->iop.c, cl);
 
 	/* If it's a flush, we send the flush to the backing device too */
-	closure_bio_submit(bio, cl, s->d);
+	closure_bio_submit(bio, cl);
 
 	continue_at(cl, cached_dev_bio_complete, NULL);
 }
@@ -994,7 +994,7 @@
 		    !blk_queue_discard(bdev_get_queue(dc->bdev)))
 			bio_endio(bio, 0);
 		else
-			bch_generic_make_request(bio, &d->bio_split_hook);
+			generic_make_request(bio);
 	}
 }

--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -59,29 +59,6 @@
 
 #define BTREE_MAX_PAGES		(256 * 1024 / PAGE_SIZE)
 
-static void bio_split_pool_free(struct bio_split_pool *p)
-{
-	if (p->bio_split_hook)
-		mempool_destroy(p->bio_split_hook);
-
-	if (p->bio_split)
-		bioset_free(p->bio_split);
-}
-
-static int bio_split_pool_init(struct bio_split_pool *p)
-{
-	p->bio_split = bioset_create(4, 0);
-	if (!p->bio_split)
-		return -ENOMEM;
-
-	p->bio_split_hook = mempool_create_kmalloc_pool(4,
-				sizeof(struct bio_split_hook));
-	if (!p->bio_split_hook)
-		return -ENOMEM;
-
-	return 0;
-}
-
 /* Superblock */
 
 static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
@@ -537,7 +514,7 @@
 	bio->bi_private = ca;
 	bch_bio_map(bio, ca->disk_buckets);
 
-	closure_bio_submit(bio, &ca->prio, ca);
+	closure_bio_submit(bio, &ca->prio);
 	closure_sync(cl);
 }
 
@@ -759,7 +736,6 @@
 		put_disk(d->disk);
 	}
 
-	bio_split_pool_free(&d->bio_split_hook);
 	if (d->bio_split)
 		bioset_free(d->bio_split);
 	kvfree(d->full_dirty_stripes);
@@ -806,7 +782,6 @@
 		return minor;
 
 	if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
-	    bio_split_pool_init(&d->bio_split_hook) ||
 	    !(d->disk = alloc_disk(1))) {
 		ida_simple_remove(&bcache_minor, minor);
 		return -ENOMEM;
@@ -1798,8 +1773,6 @@
 		ca->set->cache[ca->sb.nr_this_dev] = NULL;
 	}
 
-	bio_split_pool_free(&ca->bio_split_hook);
-
 	free_pages((unsigned long) ca->disk_buckets, ilog2(bucket_pages(ca)));
 	kfree(ca->prio_buckets);
 	vfree(ca->buckets);
@@ -1844,8 +1817,7 @@
 					  ca->sb.nbuckets)) ||
 	    !(ca->prio_buckets	= kzalloc(sizeof(uint64_t) * prio_buckets(ca) *
 					  2, GFP_KERNEL)) ||
-	    !(ca->disk_buckets	= alloc_bucket_pages(GFP_KERNEL, ca)) ||
-	    bio_split_pool_init(&ca->bio_split_hook))
+	    !(ca->disk_buckets	= alloc_bucket_pages(GFP_KERNEL, ca)))
 		return -ENOMEM;
 
 	ca->prio_last_buckets = ca->prio_buckets + prio_buckets(ca);

--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -4,6 +4,7 @@
 
 #include <linux/blkdev.h>
 #include <linux/errno.h>
+#include <linux/blkdev.h>
 #include <linux/kernel.h>
 #include <linux/llist.h>
 #include <linux/ratelimit.h>
@@ -570,10 +571,10 @@
 	return bdev->bd_inode->i_size >> 9;
 }
 
-#define closure_bio_submit(bio, cl, dev)			\
+#define closure_bio_submit(bio, cl)				\
 do {								\
 	closure_get(cl);					\
-	bch_generic_make_request(bio, &(dev)->bio_split_hook);	\
+	generic_make_request(bio);				\
 } while (0)
 
 uint64_t bch_crc64_update(uint64_t, const void *, size_t);

--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -188,7 +188,7 @@
 	io->bio.bi_bdev		= io->dc->bdev;
 	io->bio.bi_end_io	= dirty_endio;
 
-	closure_bio_submit(&io->bio, cl, &io->dc->disk);
+	closure_bio_submit(&io->bio, cl);
 
 	continue_at(cl, write_dirty_finish, system_wq);
 }
@@ -208,7 +208,7 @@
 {
 	struct dirty_io *io = container_of(cl, struct dirty_io, cl);
 
-	closure_bio_submit(&io->bio, cl, &io->dc->disk);
+	closure_bio_submit(&io->bio, cl);
 
 	continue_at(cl, write_dirty, system_wq);
 }

--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -11,105 +11,6 @@
 
 #include <linux/blkdev.h>
 
-static unsigned bch_bio_max_sectors(struct bio *bio)
-{
-	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
-	struct bio_vec bv;
-	struct bvec_iter iter;
-	unsigned ret = 0, seg = 0;
-
-	if (bio->bi_rw & REQ_DISCARD)
-		return min(bio_sectors(bio), q->limits.max_discard_sectors);
-
-	bio_for_each_segment(bv, bio, iter) {
-		struct bvec_merge_data bvm = {
-			.bi_bdev	= bio->bi_bdev,
-			.bi_sector	= bio->bi_iter.bi_sector,
-			.bi_size	= ret << 9,
-			.bi_rw		= bio->bi_rw,
-		};
-
-		if (seg == min_t(unsigned, BIO_MAX_PAGES,
-				 queue_max_segments(q)))
-			break;
-
-		if (q->merge_bvec_fn &&
-		    q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
-			break;
-
-		seg++;
-		ret += bv.bv_len >> 9;
-	}
-
-	ret = min(ret, queue_max_sectors(q));
-
-	WARN_ON(!ret);
-	ret = max_t(int, ret, bio_iovec(bio).bv_len >> 9);
-
-	return ret;
-}
-
-static void bch_bio_submit_split_done(struct closure *cl)
-{
-	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
-
-	s->bio->bi_end_io = s->bi_end_io;
-	s->bio->bi_private = s->bi_private;
-	bio_endio(s->bio, 0);
-
-	closure_debug_destroy(&s->cl);
-	mempool_free(s, s->p->bio_split_hook);
-}
-
-static void bch_bio_submit_split_endio(struct bio *bio, int error)
-{
-	struct closure *cl = bio->bi_private;
-	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
-
-	if (error)
-		clear_bit(BIO_UPTODATE, &s->bio->bi_flags);
-
-	bio_put(bio);
-	closure_put(cl);
-}
-
-void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
-{
-	struct bio_split_hook *s;
-	struct bio *n;
-
-	if (!bio_has_data(bio) && !(bio->bi_rw & REQ_DISCARD))
-		goto submit;
-
-	if (bio_sectors(bio) <= bch_bio_max_sectors(bio))
-		goto submit;
-
-	s = mempool_alloc(p->bio_split_hook, GFP_NOIO);
-	closure_init(&s->cl, NULL);
-
-	s->bio		= bio;
-	s->p		= p;
-	s->bi_end_io	= bio->bi_end_io;
-	s->bi_private	= bio->bi_private;
-	bio_get(bio);
-
-	do {
-		n = bio_next_split(bio, bch_bio_max_sectors(bio),
-				   GFP_NOIO, s->p->bio_split);
-
-		n->bi_end_io	= bch_bio_submit_split_endio;
-		n->bi_private	= &s->cl;
-
-		closure_get(&s->cl);
-		generic_make_request(n);
-	} while (n != bio);
-
-	continue_at(&s->cl, bch_bio_submit_split_done, NULL);
-	return;
-submit:
-	generic_make_request(bio);
-}
-
 /* Bios with headers */
 
 void bch_bbio_free(struct bio *bio, struct cache_set *c)
@@ -139,7 +40,7 @@
 	bio->bi_bdev = PTR_CACHE(c, &b->key, 0)->bdev;
 	b->submit_time_us = local_clock_us();
 
-	closure_bio_submit(bio, bio->bi_private, PTR_CACHE(c, &b->key, 0));
+	closure_bio_submit(bio, bio->bi_private);
 }
 
 void bch_submit_bbio(struct bio *bio, struct cache_set *c,

[-- Attachment #10: bcache-unregister-reboot-notifier.patch --]
[-- Type: application/octet-stream, Size: 761 bytes --]

From: Zheng Liu <wenqing.lz@taobao.com>

In the bcache_init() function we forgot to unregister the reboot
notifier if bcache fails to register a block device. This commit fixes
that.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
---
 drivers/md/bcache/super.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2094,8 +2094,10 @@
 	closure_debug_init();
 
 	bcache_major = register_blkdev(0, "bcache");
-	if (bcache_major < 0)
+	if (bcache_major < 0) {
+		unregister_reboot_notifier(&reboot);
 		return bcache_major;
+	}
 
 	if (!(bcache_wq = create_workqueue("bcache")) ||
 	    !(bcache_kobj = kobject_create_and_add("bcache", fs_kobj)) ||
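For context on the livelock fix above (attachment #7): the pattern it
touches is a seq-validated lock upgrade. The sketch below is a loose
illustration with hypothetical names, not bcache's actual locking API -
in the real code the sequence number lives in the btree node and is
bumped whenever the write lock is taken (see btree.h):

#include <errno.h>

/* Hypothetical node with a modification sequence number; in bcache the
 * write lock bumps b->seq, letting readers detect concurrent writers. */
struct node {
	unsigned seq;
	/* lock, keys, ... */
};

extern void read_unlock(struct node *b);
extern void write_lock(struct node *b);	/* increments b->seq */

static int upgrade_and_check(struct node *b, unsigned seq_seen)
{
	read_unlock(b);
	write_lock(b);

	/*
	 * Exactly one seq bump since we looked means it was our own
	 * write_lock(); anything else means another writer got in.
	 * Pre-fix, the caller dropped everything and retried the whole
	 * traversal optimistically - under a storm of cache misses
	 * everyone keeps invalidating everyone else's seq and nobody
	 * progresses.  The fix (op->lock = b->level) makes the retry
	 * take the write lock up front, serializing the racers.
	 */
	if (b->seq != seq_seen + 1)
		return -EINTR;

	return 0;
}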
* Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.

From: Denis Bychkov @ 2015-09-17 15:30 UTC
To: Kent Overstreet
Cc: Vojtech Pavlik, linux-kernel, linux-bcache, Emmanuel Florac,
	Jiri Kosina, Jens Axboe

Well, it turns out my celebration was a bit premature.

PLEASE, DO NOT APPLY THE PATCH POSTED BY KENT (not the one Vojtech
posted) ON A PRODUCTION SYSTEM, IT CAUSES DATA CORRUPTION.

The interesting thing is that it somehow damaged a partition that was
not supposed to receive any writes (the file system was mounted
read-only), so my guess is that the patch causes the blocks residing in
the write-back cache to be flushed to the wrong blocks on the backing
device. Everything was going great until I rebooted and saw this in the
log:

[   19.639082] attempt to access beyond end of device
[   19.643984] md1p2: rw=1, want=75497520, limit=62914560
[   19.659033] attempt to access beyond end of device
[   19.663929] md1p2: rw=1, want=75497624, limit=62914560
[   19.669447] attempt to access beyond end of device
[   19.674338] md1p2: rw=1, want=75497752, limit=62914560
[   19.679195] attempt to access beyond end of device
[   19.679199] md1p2: rw=1, want=75498080, limit=62914560
[   19.689007] attempt to access beyond end of device
[   19.689011] md1p2: rw=1, want=75563376, limit=62914560
[   19.699055] attempt to access beyond end of device
[   19.699059] md1p2: rw=1, want=79691816, limit=62914560
[   19.719246] attempt to access beyond end of device
[   19.724144] md1p2: rw=1, want=79691928, limit=62914560
......
(it's a small example, the list was much longer)

And the next thing I found out was that the super block on my 10-TB XFS
RAID was gone. :) Oh well, it's a good thing I have backups. I knew what
I was doing when trying untested patches. I should have made the RAID md
partition read-only, not the file system. I kind of expected that
something could go wrong with the file system I was testing; I just did
not expect it would fire nukes at the innocent bystanders.

On Wed, Sep 16, 2015 at 5:08 PM, Denis Bychkov <manover@gmail.com> wrote:
> [...]
--
Denis
* Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.

From: Kent Overstreet @ 2015-09-17 16:40 UTC
To: Denis Bychkov
Cc: Vojtech Pavlik, linux-kernel, linux-bcache, Emmanuel Florac,
	Jiri Kosina, Jens Axboe

On Thu, Sep 17, 2015 at 11:30:17AM -0400, Denis Bychkov wrote:
> Well, it turns out my celebration was a bit premature.
>
> PLEASE, DO NOT APPLY THE PATCH POSTED BY KENT (not the one Vojtech
> posted) ON A PRODUCTION SYSTEM, IT CAUSES DATA CORRUPTION.
>
> The interesting thing is that it somehow damaged a partition that was
> not supposed to receive any writes (the file system was mounted
> read-only), so my guess is that the patch causes the blocks residing in
> the write-back cache to be flushed to the wrong blocks on the backing
> device.
> [...]

Aw, shit. That's just _bizarre_.

I have a theory - it appears that last_scanned isn't getting initialized
before it's used, so it's going to be all 0s the very first time...
which it appears could cause it to slurp up keys from the wrong device
(and if that device was bigger than the correct device, that could
explain the accesses beyond the end of the device).

Currently just a theory though, and I have no clue why it would only be
exposed with my patch.
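Kent's theory, restated with a toy model (simplified types, not the
driver's struct bkey; the real check in his follow-up patch below uses
bkey_cmp()): a zero-initialized cursor sorts below KEY(disk_id, 0, 0)
for any nonzero disk id, so an unclamped refill starting there would
walk other devices' slices of the keyspace first.

#include <stdint.h>

struct model_bkey { uint64_t inode, offset; };

static int model_bkey_cmp(const struct model_bkey *l,
			  const struct model_bkey *r)
{
	if (l->inode != r->inode)
		return l->inode < r->inode ? -1 : 1;
	return l->offset < r->offset ? -1 : l->offset > r->offset ? 1 : 0;
}

/* Clamp the scan cursor into this device's slice of the keyspace.
 * A zero-initialized cursor (inode 0) is below 'start' whenever the
 * device id is nonzero, and would otherwise pick up keys - and queue
 * writeback - belonging to a different backing device. */
static void clamp_cursor(struct model_bkey *cursor, uint64_t disk_id)
{
	struct model_bkey start = { disk_id, 0 };
	struct model_bkey end   = { disk_id, UINT64_MAX };

	if (model_bkey_cmp(cursor, &start) < 0 ||
	    model_bkey_cmp(cursor, &end) > 0)
		*cursor = start;
}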
* Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.

From: Kent Overstreet @ 2015-09-17 18:31 UTC
To: Denis Bychkov
Cc: Vojtech Pavlik, linux-kernel, linux-bcache, Emmanuel Florac,
	Jiri Kosina, Jens Axboe

On Thu, Sep 17, 2015 at 08:40:54AM -0800, Kent Overstreet wrote:
> On Thu, Sep 17, 2015 at 11:30:17AM -0400, Denis Bychkov wrote:
> > PLEASE, DO NOT APPLY THE PATCH POSTED BY KENT (not the one Vojtech
> > posted) ON A PRODUCTION SYSTEM, IT CAUSES DATA CORRUPTION.
> > [...]
>
> Aw, shit. That's just _bizarre_.
>
> I have a theory - it appears that last_scanned isn't getting
> initialized before it's used, so it's going to be all 0s the very first
> time... which it appears could cause it to slurp up keys from the wrong
> device (and if that device was bigger than the correct device, that
> could explain the accesses beyond the end of the device).
>
> Currently just a theory though, and I have no clue why it would only be
> exposed with my patch.

Here's an updated patch that has a fix for _that_ theory, and also a new
BUG_ON(). Any chance you could test it?

Oh - I didn't ask - _do_ you have multiple backing devices attached to
the same cache set? Because if you don't, this isn't it at all...

-- >8 --
Subject: [PATCH] bcache: Change refill_dirty() to always scan entire disk if necessary

Previously, it would only scan the entire disk if it was starting from
the very start of the disk - i.e. if the previous scan got to the end.
This was broken by refill_full_stripes(), which updates last_scanned so
that refill_dirty() was never triggering the searched_from_start path.

But if we change refill_dirty() to always scan the entire disk if
necessary, regardless of what last_scanned was, the code gets cleaner
and we fix that bug too.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 drivers/md/bcache/writeback.c | 37 ++++++++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index cdde0f32f0..d383024247 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -310,6 +310,10 @@ void bcache_dev_sectors_dirty_add(struct cache_set *c, unsigned inode,
 
 static bool dirty_pred(struct keybuf *buf, struct bkey *k)
 {
+	struct cached_dev *dc = container_of(buf, struct cached_dev, writeback_keys);
+
+	BUG_ON(KEY_INODE(k) != dc->disk.id);
+
 	return KEY_DIRTY(k);
 }
 
@@ -359,11 +363,24 @@ next:
 	}
 }
 
+/*
+ * Returns true if we scanned the entire disk
+ */
 static bool refill_dirty(struct cached_dev *dc)
 {
 	struct keybuf *buf = &dc->writeback_keys;
+	struct bkey start = KEY(dc->disk.id, 0, 0);
 	struct bkey end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
-	bool searched_from_start = false;
+	struct bkey start_pos;
+
+	/*
+	 * make sure keybuf pos is inside the range for this disk - at bringup
+	 * we might not be attached yet so this disk's inode nr isn't
+	 * initialized then
+	 */
+	if (bkey_cmp(&buf->last_scanned, &start) < 0 ||
+	    bkey_cmp(&buf->last_scanned, &end) > 0)
+		buf->last_scanned = start;
 
 	if (dc->partial_stripes_expensive) {
 		refill_full_stripes(dc);
@@ -371,14 +388,20 @@ static bool refill_dirty(struct cached_dev *dc)
 			return false;
 	}
 
-	if (bkey_cmp(&buf->last_scanned, &end) >= 0) {
-		buf->last_scanned = KEY(dc->disk.id, 0, 0);
-		searched_from_start = true;
-	}
-
+	start_pos = buf->last_scanned;
 	bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred);
 
-	return bkey_cmp(&buf->last_scanned, &end) >= 0 && searched_from_start;
+	if (bkey_cmp(&buf->last_scanned, &end) < 0)
+		return false;
+
+	/*
+	 * If we get to the end start scanning again from the beginning, and
+	 * only scan up to where we initially started scanning from:
+	 */
+	buf->last_scanned = start;
+	bch_refill_keybuf(dc->disk.c, buf, &start_pos, dirty_pred);
+
+	return bkey_cmp(&buf->last_scanned, &start_pos) >= 0;
 }
 
 static void bch_writeback(struct cached_dev *dc)
--
2.5.1
* Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.

From: Denis Bychkov @ 2015-09-17 20:54 UTC
To: Kent Overstreet
Cc: Vojtech Pavlik, linux-kernel, linux-bcache, Emmanuel Florac,
	Jiri Kosina, Jens Axboe

Yes, sure, I'll try it. And yes, this was exactly my setup - several
backing devices sharing the same cache set, all in write-back mode. When
I decided to try your patch yesterday, I created a few more and attached
them to the same cache set. I did not want to experiment with the live
ones, so I just remounted them read-only to make sure they wouldn't send
any writes, instead of switching the md array into RO mode - I am an
idiot.

And today, after I detached the botched XFS partition from the cache (to
make sure there was no more dirty data left) and looked at the area
where there used to be an XFS super block, I found a super block
belonging to one of the new devices I created. So there is no mistake
here, this is exactly what happened; I think you pinned it.

The only thing I don't understand is how those writes ended up on a
different device. I'd understand it if it were the same device,
different partition - we would write to sectors that belong to a
different partition on the same disk. But this is not the case; it is a
different md device that received the unwanted writes. Unless, of
course, the addresses of the write requests stored in the bcache cache
set have already been translated into physical interface (ATA/SCSI)
addresses. Have they? If that is the case, they share the same address
space, because they are the same SATA drives. But that can't be true,
right? Bcache cannot and does not store physical drive addresses in the
cache, does it?

On Thu, Sep 17, 2015 at 2:31 PM, Kent Overstreet
<kent.overstreet@gmail.com> wrote:
> [...]
>> > Everything was going great until I rebooted and saw this in the log: >> > >> > [ 19.639082] attempt to access beyond end of device >> > [ 19.643984] md1p2: rw=1, want=75497520, limit=62914560 >> > [ 19.659033] attempt to access beyond end of device >> > [ 19.663929] md1p2: rw=1, want=75497624, limit=62914560 >> > [ 19.669447] attempt to access beyond end of device >> > [ 19.674338] md1p2: rw=1, want=75497752, limit=62914560 >> > [ 19.679195] attempt to access beyond end of device >> > [ 19.679199] md1p2: rw=1, want=75498080, limit=62914560 >> > [ 19.689007] attempt to access beyond end of device >> > [ 19.689011] md1p2: rw=1, want=75563376, limit=62914560 >> > [ 19.699055] attempt to access beyond end of device >> > [ 19.699059] md1p2: rw=1, want=79691816, limit=62914560 >> > [ 19.719246] attempt to access beyond end of device >> > [ 19.724144] md1p2: rw=1, want=79691928, limit=62914560 >> > ...... >> > (it's a small example, the list was much longer) >> > And the next thing I found was that the super block on my 10-TB XFS RAID was gone. :) >> > Oh well, it's a good thing I have backups. >> > I knew what I was doing when trying the untested patches. I should >> > have made the RAID md partition read-only, not the file system. I kind >> > of expected that something could have gone wrong with the file system >> > I was testing, just did not expect it would fire nukes at the innocent >> > bystanders. >> >> Aw, shit. That's just _bizarre_. >> >> I have a theory - it appears that last_scanned isn't getting initialized before >> it's used, so it's going to be all 0s the very first time... which it appears >> could cause it to slurp up keys from the wrong device (and if that device was >> bigger than the correct device, that could explain the accesses beyond the end >> of the device). >> >> Currently just a theory though, and I have no clue why it would only be exposed >> with my patch. > > Here's an updated patch that has a fix for _that_ theory, and also a new > BUG_ON(). Any chance you could test it? > > Oh - I didn't ask - _do_ you have multiple backing devices attached to the same > cache set? Because if you don't, this isn't it at all... > > -- >8 -- > Subject: [PATCH] bcache: Change refill_dirty() to always scan entire disk if necessary > > Previously, it would only scan the entire disk if it was starting from the very > start of the disk - i.e. if the previous scan got to the end. > > This was broken by refill_full_stripes(), which updates last_scanned so that > refill_dirty was never triggering the searched_from_start path. > > But if we change refill_dirty() to always scan the entire disk if necessary, > regardless of what last_scanned was, the code gets cleaner and we fix that bug > too.
> > Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> > --- > drivers/md/bcache/writeback.c | 37 ++++++++++++++++++++++++++++++------- > 1 file changed, 30 insertions(+), 7 deletions(-) > > diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c > index cdde0f32f0..d383024247 100644 > --- a/drivers/md/bcache/writeback.c > +++ b/drivers/md/bcache/writeback.c > @@ -310,6 +310,10 @@ void bcache_dev_sectors_dirty_add(struct cache_set *c, unsigned inode, > > static bool dirty_pred(struct keybuf *buf, struct bkey *k) > { > + struct cached_dev *dc = container_of(buf, struct cached_dev, writeback_keys); > + > + BUG_ON(KEY_INODE(k) != dc->disk.id); > + > return KEY_DIRTY(k); > } > > @@ -359,11 +363,24 @@ next: > } > } > > +/* > + * Returns true if we scanned the entire disk > + */ > static bool refill_dirty(struct cached_dev *dc) > { > struct keybuf *buf = &dc->writeback_keys; > + struct bkey start = KEY(dc->disk.id, 0, 0); > struct bkey end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0); > - bool searched_from_start = false; > + struct bkey start_pos; > + > + /* > + * make sure keybuf pos is inside the range for this disk - at bringup > + * we might not be attached yet so this disk's inode nr isn't > + * initialized then > + */ > + if (bkey_cmp(&buf->last_scanned, &start) < 0 || > + bkey_cmp(&buf->last_scanned, &end) > 0) > + buf->last_scanned = start; > > if (dc->partial_stripes_expensive) { > refill_full_stripes(dc); > @@ -371,14 +388,20 @@ static bool refill_dirty(struct cached_dev *dc) > return false; > } > > - if (bkey_cmp(&buf->last_scanned, &end) >= 0) { > - buf->last_scanned = KEY(dc->disk.id, 0, 0); > - searched_from_start = true; > - } > - > + start_pos = buf->last_scanned; > bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred); > > - return bkey_cmp(&buf->last_scanned, &end) >= 0 && searched_from_start; > + if (bkey_cmp(&buf->last_scanned, &end) < 0) > + return false; > + > + /* > + * If we get to the end start scanning again from the beginning, and > + * only scan up to where we initially started scanning from: > + */ > + buf->last_scanned = start; > + bch_refill_keybuf(dc->disk.c, buf, &start_pos, dirty_pred); > + > + return bkey_cmp(&buf->last_scanned, &start_pos) >= 0; > } > > static void bch_writeback(struct cached_dev *dc) > -- > 2.5.1 > -- Denis ^ permalink raw reply [flat|nested] 10+ messages in thread
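As the KEY(dc->disk.id, ...) calls in the patch suggest, bcache indexes cached data by (backing device id, offset) pairs, not by physical drive addresses - so no, the cache does not store ATA/SCSI addresses. A toy C sketch, purely illustrative of the key ordering and consistent with Kent's uninitialized-last_scanned theory, shows how a zeroed scan position can sweep up another device's keys:

#include <inttypes.h>
#include <stdio.h>

/* Toy model of bcache key ordering: keys sort by (device id, offset),
 * so an all-zero scan position sorts before every device's range.
 * Names and types here are illustrative, not bcache API. */
struct toy_key { uint64_t dev, offset; };

static int toy_cmp(struct toy_key a, struct toy_key b)
{
	if (a.dev != b.dev)
		return a.dev < b.dev ? -1 : 1;
	if (a.offset != b.offset)
		return a.offset < b.offset ? -1 : 1;
	return 0;
}

int main(void)
{
	/* Dirty keys for two backing devices attached to one cache set;
	 * device 1 is the larger one. */
	struct toy_key keys[] = {
		{ 1, 75497520 }, { 1, 79691816 },	/* big device   */
		{ 2, 4096 }, { 2, 65536 },		/* small device */
	};
	/* Device 2's scan should begin at { 2, 0 }, but an
	 * uninitialized position is all zeroes: */
	struct toy_key last_scanned = { 0, 0 }, dev2_start = { 2, 0 };
	unsigned i;

	/* Everything between the zeroed position and device 2's real
	 * start belongs to other devices, yet an unclamped scan from
	 * last_scanned would sweep right through it. */
	for (i = 0; i < 4; i++)
		if (toy_cmp(last_scanned, keys[i]) < 0 &&
		    toy_cmp(keys[i], dev2_start) < 0)
			printf("foreign key swept up: dev %" PRIu64
			       " offset %" PRIu64 "\n",
			       keys[i].dev, keys[i].offset);
	return 0;
}

Since all devices' keys live in one sorted btree keyspace, an all-zero position sorts below every device's range; if keys queued from there are then written out by this device's writeback thread at the foreign offsets, a larger neighbour's offsets land beyond the end of a smaller device - the same shape as the md1p2 "want > limit" errors quoted above.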
* Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes. 2015-09-17 18:31 ` Kent Overstreet 2015-09-17 20:54 ` Denis Bychkov @ 2015-09-19 4:47 ` Denis Bychkov 2015-10-01 9:01 ` Tim Small 2 siblings, 0 replies; 10+ messages in thread From: Denis Bychkov @ 2015-09-19 4:47 UTC (permalink / raw) To: Kent Overstreet Cc: Vojtech Pavlik, linux-kernel, linux-bcache, Emmanuel Florac, Jiri Kosina, Jens Axboe Hi Kent, After running for a day with this new version of your patch, I did not notice any problems, so I assume it works. I'll keep an eye on it and report back if I find anything bad. I believe you finally fixed that long-lasting bug with the writeback thread spinning the CPU. To Vojtech Pavlik: thank you for catching it! For a long time I had to run bcache with the partial_stripes_expensive branch disabled, which seriously affected its performance on my RAID-6. On Thu, Sep 17, 2015 at 2:31 PM, Kent Overstreet <kent.overstreet@gmail.com> wrote: > On Thu, Sep 17, 2015 at 08:40:54AM -0800, Kent Overstreet wrote: >> On Thu, Sep 17, 2015 at 11:30:17AM -0400, Denis Bychkov wrote: >> > Well, it turns out my celebration was a bit premature. >> > >> > PLEASE, DO NOT APPLY THE PATCH POSTED BY KENT (not the one Vojtech >> > posted) ON A PRODUCTION SYSTEM, IT CAUSES DATA CORRUPTION. >> > >> > The interesting thing is that it somehow damaged the partition that >> > was not supposed to receive any writes (the file system was mounted >> > read-only), so my guess is that the patch causes the blocks residing >> > in the write-back cache to flush to the wrong blocks on the backing >> > device. >> > Everything was going great until I rebooted and saw this in the log: >> > >> > [ 19.639082] attempt to access beyond end of device >> > [ 19.643984] md1p2: rw=1, want=75497520, limit=62914560 >> > [ 19.659033] attempt to access beyond end of device >> > [ 19.663929] md1p2: rw=1, want=75497624, limit=62914560 >> > [ 19.669447] attempt to access beyond end of device >> > [ 19.674338] md1p2: rw=1, want=75497752, limit=62914560 >> > [ 19.679195] attempt to access beyond end of device >> > [ 19.679199] md1p2: rw=1, want=75498080, limit=62914560 >> > [ 19.689007] attempt to access beyond end of device >> > [ 19.689011] md1p2: rw=1, want=75563376, limit=62914560 >> > [ 19.699055] attempt to access beyond end of device >> > [ 19.699059] md1p2: rw=1, want=79691816, limit=62914560 >> > [ 19.719246] attempt to access beyond end of device >> > [ 19.724144] md1p2: rw=1, want=79691928, limit=62914560 >> > ...... >> > (it's a small example, the list was much longer) >> > And the next thing I found was that the super block on my 10-TB XFS RAID was gone. :) >> > Oh well, it's a good thing I have backups. >> > I knew what I was doing when trying the untested patches. I should >> > have made the RAID md partition read-only, not the file system. I kind >> > of expected that something could have gone wrong with the file system >> > I was testing, just did not expect it would fire nukes at the innocent >> > bystanders. >> >> Aw, shit. That's just _bizarre_. >> >> I have a theory - it appears that last_scanned isn't getting initialized before >> it's used, so it's going to be all 0s the very first time... which it appears >> could cause it to slurp up keys from the wrong device (and if that device was >> bigger than the correct device, that could explain the accesses beyond the end >> of the device). >> >> Currently just a theory though, and I have no clue why it would only be exposed >> with my patch.
> > Here's an updated patch that has a fix for _that_ theory, and also a new > BUG_ON(). Any chance you could test it? > > Oh - I didn't ask - _do_ you have multiple backing devices attached to the same > cache set? Because if you don't, this isn't it at all... > > -- >8 -- > Subject: [PATCH] bcache: Change refill_dirty() to always scan entire disk if necessary > > Previously, it would only scan the entire disk if it was starting from the very > start of the disk - i.e. if the previous scan got to the end. > > This was broken by refill_full_stripes(), which updates last_scanned so that > refill_dirty was never triggering the searched_from_start path. > > But if we change refill_dirty() to always scan the entire disk if necessary, > regardless of what last_scanned was, the code gets cleaner and we fix that bug > too. > > Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> > --- > drivers/md/bcache/writeback.c | 37 ++++++++++++++++++++++++++++++------- > 1 file changed, 30 insertions(+), 7 deletions(-) > > diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c > index cdde0f32f0..d383024247 100644 > --- a/drivers/md/bcache/writeback.c > +++ b/drivers/md/bcache/writeback.c > @@ -310,6 +310,10 @@ void bcache_dev_sectors_dirty_add(struct cache_set *c, unsigned inode, > > static bool dirty_pred(struct keybuf *buf, struct bkey *k) > { > + struct cached_dev *dc = container_of(buf, struct cached_dev, writeback_keys); > + > + BUG_ON(KEY_INODE(k) != dc->disk.id); > + > return KEY_DIRTY(k); > } > > @@ -359,11 +363,24 @@ next: > } > } > > +/* > + * Returns true if we scanned the entire disk > + */ > static bool refill_dirty(struct cached_dev *dc) > { > struct keybuf *buf = &dc->writeback_keys; > + struct bkey start = KEY(dc->disk.id, 0, 0); > struct bkey end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0); > - bool searched_from_start = false; > + struct bkey start_pos; > + > + /* > + * make sure keybuf pos is inside the range for this disk - at bringup > + * we might not be attached yet so this disk's inode nr isn't > + * initialized then > + */ > + if (bkey_cmp(&buf->last_scanned, &start) < 0 || > + bkey_cmp(&buf->last_scanned, &end) > 0) > + buf->last_scanned = start; > > if (dc->partial_stripes_expensive) { > refill_full_stripes(dc); > @@ -371,14 +388,20 @@ static bool refill_dirty(struct cached_dev *dc) > return false; > } > > - if (bkey_cmp(&buf->last_scanned, &end) >= 0) { > - buf->last_scanned = KEY(dc->disk.id, 0, 0); > - searched_from_start = true; > - } > - > + start_pos = buf->last_scanned; > bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred); > > - return bkey_cmp(&buf->last_scanned, &end) >= 0 && searched_from_start; > + if (bkey_cmp(&buf->last_scanned, &end) < 0) > + return false; > + > + /* > + * If we get to the end start scanning again from the beginning, and > + * only scan up to where we initially started scanning from: > + */ > + buf->last_scanned = start; > + bch_refill_keybuf(dc->disk.c, buf, &start_pos, dirty_pred); > + > + return bkey_cmp(&buf->last_scanned, &start_pos) >= 0; > } > > static void bch_writeback(struct cached_dev *dc) > -- > 2.5.1 > -- Denis ^ permalink raw reply [flat|nested] 10+ messages in thread
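The BUG_ON() that Denis has now run for a day without tripping is a structural guard: dirty_pred() only receives the keybuf, so the patch uses container_of() to step from the embedded writeback_keys member back to the owning cached_dev and assert that every key considered belongs to that device. A stripped-down user-space rendering of the same pattern, with assert() standing in for BUG_ON() and the types reduced to stubs:

#include <assert.h>
#include <stddef.h>

/* Same trick the kernel macro uses: subtract the member offset to
 * recover a pointer to the enclosing structure. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct keybuf { int dummy; };

struct cached_dev {
	unsigned id;			/* backing device inode nr */
	struct keybuf writeback_keys;	/* embedded, as in bcache */
};

/* Accept a key only if it is dirty; trip the guard if it belongs to
 * a different device than the keybuf's owner. */
static int dirty_pred(struct keybuf *buf, unsigned key_inode, int key_dirty)
{
	struct cached_dev *dc =
		container_of(buf, struct cached_dev, writeback_keys);

	assert(key_inode == dc->id);	/* models the patch's BUG_ON() */
	return key_dirty;
}

int main(void)
{
	struct cached_dev dc = { .id = 2 };

	return !dirty_pred(&dc.writeback_keys, 2, 1);	/* passes */
}

With the clamp on last_scanned in place the assertion should never fire; its job is to make any recurrence of the cross-device bug crash loudly rather than corrupt a bystander device silently.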
* Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes. 2015-09-17 18:31 ` Kent Overstreet 2015-09-17 20:54 ` Denis Bychkov 2015-09-19 4:47 ` Denis Bychkov @ 2015-10-01 9:01 ` Tim Small 2 siblings, 0 replies; 10+ messages in thread From: Tim Small @ 2015-10-01 9:01 UTC (permalink / raw) To: Kent Overstreet; +Cc: linux-bcache On 17/09/15 19:31, Kent Overstreet wrote: > Here's an updated patch that has a fix for _that_ theory, and also a new > BUG_ON(). Any chance you could test it? I've applied your patch (to the Ubuntu 4.2.1 tree, since I need some of their non-mainline AppArmor changes on this box), and it's stopped the excessive CPU usage by the bcache_writebac thread. I'd first experienced this CPU spinning when I tried to disable cache usage (I'd had some lockups on the machine and was trying to debug them, so I switched bcache cache_mode to none whilst debugging, as part of my process of elimination). I assume the bug was hit whilst trying to flush the dirty cache data. I now have: root@magic:~# cat /sys/block/bcache0/bcache/state clean root@magic:~# cat /sys/block/bcache0/bcache/dirty_data 1.5M root@magic:~# cat /sys/block/bcache0/bcache/cache_mode writethrough writeback writearound [none] root@magic:~# uname -a Linux magic 4.2.1 #2 SMP Sat Sep 26 10:00:40 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux i.e. state and cache_mode are inconsistent with dirty_data. Is this of any concern and/or likely to cause problems if I re-enable the cache? Cheers, Tim. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes. 2015-09-16 21:08 ` Denis Bychkov 2015-09-17 15:30 ` Denis Bychkov @ 2015-09-29 23:00 ` Peter Kieser 1 sibling, 0 replies; 10+ messages in thread From: Peter Kieser @ 2015-09-29 23:00 UTC (permalink / raw) To: Denis Bychkov, Kent Overstreet Cc: Vojtech Pavlik, linux-kernel, linux-bcache, Kent Overstreet, Emmanuel Florac, Jiri Kosina, Jens Axboe [-- Attachment #1: Type: text/plain, Size: 6764 bytes --] My question still stands: Why are none of these patches submitted to mainline or backported to stable kernels? Thanks. On 2015-09-16 2:08 PM, Denis Bychkov wrote: > Hi Kent, Vojtech, list > > Your fix works perfectly, I can finally remove my patch that disables > the partial_stripes_expensive branch and unleash the full bcache > performance on my RAID-6 array. I was aware of this problem for some > time now, but never got to learning the bcache codebase enough to find > out why this happens with partial stripes write-back enabled, so I > just disabled it to avoid CPU spinning. Thanks a lot to Vojtech for > finding the reason and to you for the patch. I have quite a collection > of stability patches for the mainline bcache. Would you please be kind > to review them when convenient and merge upstream the ones you think > are worth merging. It will be 4.4 merge window, I believe. > Please find all the patches attached. All of them are rebased on > mainline 4.2 kernel. If somebody needs the 4.1.x versions, please let > me know. > > > On Wed, Sep 16, 2015 at 7:32 AM, Kent Overstreet > <kent.overstreet@gmail.com> wrote: >> On Sat, Sep 05, 2015 at 01:10:12PM +0200, Vojtech Pavlik wrote: >>> Fix writeback_thread never finishing writing back all dirty data in bcache when >>> partial_stripes_expensive is set, and spinning, consuming 100% of CPU instead. >>> >>> Signed-off-by: Vojtech Pavlik <vojtech@suse.com> >>> --- >>> >>> This is a fix for the current upstream bcache, not the devel branch. >>> >>> If partial_stripes_expensive is set for a cache set, then writeback_thread >>> always attempts to write full stripes out back to the backing device first. >>> However, since it does that based on a bitmap and not a simple linear >>> search, like the rest of the code of refill_dirty(), it changes the >>> last_scanned pointer so that never points to 'end'. refill_dirty() then >>> never tries to scan from 'start', resulting in the writeback_thread >>> looping, consuming 100% of CPU, but never making progress in writing out >>> the incomplete dirty stripes in the cache. >>> >>> Scanning the tree after not finding enough full stripes fixes the issue. >>> >>> Incomplete dirty stripes are written to the backing device, the device >>> eventually reaches a clean state if there is nothing dirtying data and >>> writeback_thread sleeps. This also fixes the problem of the cache device >>> not being possible to detach in the partial_stripes_expensive scenario. >> Good catch! >> >>> It may be more efficient to separate the last_scanned field for normal and >>> stripe scans instead. >> Actually, I think I like your approach - I think it gives us the behaviour we >> want, although it took some thinking to figure out why. >> >> One of the reasons for last_scanned is just to do writeback in LBA order and >> avoid making the disks seek around - so, if refill_full_stripes() did just queue >> up some IO at last_scanned, we do want to keep scanning from that position for >> non full stripes. 
>> >> But since (as you noted) last_scanned is never going to be >= end after calling >> refill_full_strips() (if there were any full stripes) - really the correct thing >> to do is loop around to the start if necessary so that we can successfully scan >> everything. >> >> The only inefficiency I see with your approach is that on the second scan, after >> we've looped, we don't need to scan to the end of the disk - we only need to >> scan up to where we initially started. >> >> But, another observation - if we change refill_dirty() so that it always scans >> the entire keyspace if necessary, regardless of where last_scanned was - well, >> that really isn't specific to the refill_full_stripes() case - we can just >> always do that. >> >> Can you give this patch a try? >> >> -- >8 -- >> Subject: [PATCH] bcache: Change refill_dirty() to always scan entire disk if necessary >> >> Previously, it would only scan the entire disk if it was starting from the very >> start of the disk - i.e. if the previous scan got to the end. >> >> This was broken by refill_full_stripes(), which updates last_scanned so that >> refill_dirty was never triggering the searched_from_start path. >> >> But if we change refill_dirty() to always scan the entire disk if necessary, >> regardless of what last_scanned was, the code gets cleaner and we fix that bug >> too. >> >> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> >> --- >> drivers/md/bcache/writeback.c | 24 ++++++++++++++++-------- >> 1 file changed, 16 insertions(+), 8 deletions(-) >> >> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c >> index cdde0f32f0..08a52db38b 100644 >> --- a/drivers/md/bcache/writeback.c >> +++ b/drivers/md/bcache/writeback.c >> @@ -359,11 +359,13 @@ next: >> } >> } >> >> +/* >> + * Returns true if we scanned the entire disk >> + */ >> static bool refill_dirty(struct cached_dev *dc) >> { >> struct keybuf *buf = &dc->writeback_keys; >> - struct bkey end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0); >> - bool searched_from_start = false; >> + struct bkey start_pos, end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0); >> >> if (dc->partial_stripes_expensive) { >> refill_full_stripes(dc); >> @@ -371,14 +373,20 @@ static bool refill_dirty(struct cached_dev *dc) >> return false; >> } >> >> - if (bkey_cmp(&buf->last_scanned, &end) >= 0) { >> - buf->last_scanned = KEY(dc->disk.id, 0, 0); >> - searched_from_start = true; >> - } >> - >> + start_pos = buf->last_scanned; >> bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred); >> >> - return bkey_cmp(&buf->last_scanned, &end) >= 0 && searched_from_start; >> + if (bkey_cmp(&buf->last_scanned, &end) < 0) >> + return false; >> + >> + /* >> + * If we get to the end start scanning again from the beginning, and >> + * only scan up to where we initially started scanning from: >> + */ >> + buf->last_scanned = KEY(dc->disk.id, 0, 0); >> + bch_refill_keybuf(dc->disk.c, buf, &start_pos, dirty_pred); >> + >> + return bkey_cmp(&buf->last_scanned, &start_pos) >= 0; >> } >> >> static void bch_writeback(struct cached_dev *dc) >> -- >> 2.5.1 >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- Peter Kieser 604.338.9294 / peter@kieser.ca [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 4311 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
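Note that the patch quoted in Peter's message is Kent's first version - the one Denis's corruption report was against. It lacks the two additions from the revised 2015-09-17 patch that Denis and Tim later tested successfully: the clamp that keeps last_scanned inside the device's key range, and the BUG_ON() in dirty_pred(). A sketch of the missing clamp, again with illustrative types rather than real bkeys:

#include <stdint.h>

struct toy_key { uint64_t dev, offset; };

static int toy_cmp(const struct toy_key *a, const struct toy_key *b)
{
	if (a->dev != b->dev)
		return a->dev < b->dev ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* The revised patch resets the scan position to this device's start
 * key whenever it lies outside [start, end] - e.g. still zeroed
 * because the device was not yet attached at bringup. */
static void clamp_last_scanned(struct toy_key *last_scanned,
			       const struct toy_key *start,
			       const struct toy_key *end)
{
	if (toy_cmp(last_scanned, start) < 0 ||
	    toy_cmp(last_scanned, end) > 0)
		*last_scanned = *start;
}

int main(void)
{
	struct toy_key start = { 2, 0 }, end = { 2, UINT64_MAX };
	struct toy_key last_scanned = { 0, 0 };	/* uninitialized case */

	clamp_last_scanned(&last_scanned, &start, &end);
	return !(last_scanned.dev == 2 && last_scanned.offset == 0);
}

Anyone picking this thread up for mainline or stable backports, per Peter's question, wants the revised patch, not this earlier draft.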
end of thread, other threads:[~2015-10-01 9:38 UTC | newest] Thread overview: 10+ messages -- 2015-09-05 11:10 [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes Vojtech Pavlik 2015-09-16 11:32 ` Kent Overstreet 2015-09-16 21:08 ` Denis Bychkov 2015-09-17 15:30 ` Denis Bychkov 2015-09-17 16:40 ` Kent Overstreet 2015-09-17 18:31 ` Kent Overstreet 2015-09-17 20:54 ` Denis Bychkov 2015-09-19 4:47 ` Denis Bychkov 2015-10-01 9:01 ` Tim Small 2015-09-29 23:00 ` Peter Kieser