public inbox for linux-bcache@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v1 00/10] cache device failure handling improvement
@ 2018-01-03 14:03 Coly Li
  2018-01-03 14:03 ` [PATCH v1 01/10] bcache: exit bch_writeback_thread() with proper task state Coly Li
                   ` (10 more replies)
  0 siblings, 11 replies; 35+ messages in thread
From: Coly Li @ 2018-01-03 14:03 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, mlyle, tang.junhui, Coly Li

Hi maintainers and folks,

This patch set tries to improve cache device failure handling. The basic
idea to handle a failed cache device is,
- Unregister cache set
- Detach all backing devices attached to this cache set
- Stop all bcache devices linked to this cache set
The above process is named 'cache set retire' by me. The result of cache
set retire is, the cache set and bcache devices are all removed, and
following I/O requests will fail immediately to notify upper layer or user
space code that the cache device is failed or disconnected.

The first 8 patches of this patch set fix existing bugs in bcache, and
the last 2 patches do the real improvement. The order of applying these
patches is important: if the last 2 patches are applied first, kernel panic
or process hang will be observed. Therefore I suggest applying the first 8
fixes, then applying the last 2 patches.

The patch set is tested with writethrough, writeback and writearound modes,
with read/write/readwrite workloads, and so far it works as expected. IMHO
the cache set retire logic is complicated, so I need your help to review
the patches; any question is warmly welcome.

Coly Li (10):
  bcache: exit bch_writeback_thread() with proper task state
  bcache: set task properly in allocator_wait()
  bcache: reduce cache_set devices iteration by devices_max_used
  bcache: fix cached_dev->count usage for bch_cache_set_error()
  bcache: stop dc->writeback_rate_update if cache set is stopping
  bcache: stop dc->writeback_rate_update, dc->writeback_thread earlier
  bcache: set error_limit correctly
  bcache: fix misleading error message in bch_count_io_errors()
  bcache: add io_disable to struct cache_set
  bcache: stop all attached bcache devices for a retired cache set

 drivers/md/bcache/alloc.c     |  5 ++---
 drivers/md/bcache/bcache.h    | 19 +++++++++++++++-
 drivers/md/bcache/btree.c     |  8 ++++---
 drivers/md/bcache/io.c        | 15 ++++++++-----
 drivers/md/bcache/journal.c   |  4 ++--
 drivers/md/bcache/request.c   | 26 ++++++++++++++++------
 drivers/md/bcache/super.c     | 51 +++++++++++++++++++++++++++++++++++--------
 drivers/md/bcache/sysfs.c     |  8 +++++--
 drivers/md/bcache/util.h      |  6 -----
 drivers/md/bcache/writeback.c | 51 +++++++++++++++++++++++++++++++++----------
 drivers/md/bcache/writeback.h |  4 +---
 11 files changed, 144 insertions(+), 53 deletions(-)

Thanks in advance.

Coly Li

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v1 01/10] bcache: exit bch_writeback_thread() with proper task state
  2018-01-03 14:03 [PATCH v1 00/10] cache device failure handling improvement Coly Li
@ 2018-01-03 14:03 ` Coly Li
  2018-01-03 17:08   ` Michael Lyle
  2018-01-08  7:09   ` Hannes Reinecke
  2018-01-03 14:03 ` [PATCH v1 02/10] bcache: set task properly in allocator_wait() Coly Li
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 35+ messages in thread
From: Coly Li @ 2018-01-03 14:03 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, mlyle, tang.junhui, Coly Li

Kernel thread routine bch_writeback_thread() has the following code block,

452                         set_current_state(TASK_INTERRUPTIBLE);
453
454                         if (kthread_should_stop())
455                                 return 0;
456
457                         schedule();
458                         continue;

At line 452, its task state is set to TASK_INTERRUPTIBLE, and at line 454
if kthread_should_stop() is true, the "return 0" at line 455 will return to
function kernel/kthread.c:kthread(), which then calls do_exit().

It is not good to enter do_exit() with task state TASK_INTERRUPTIBLE; in
the following code path might_sleep() is called and a warning message is
reported by __might_sleep(): "WARNING: do not call blocking ops when
!TASK_RUNNING; state=1 set at [xxxx]".

Indeed it does not hurt when a kernel thread exits with TASK_INTERRUPTIBLE
state, but this warning message scares users and makes them feel there
might be something risky with bcache that could hurt their data.

In this patch, TASK_INTERRUPTIBLE is set after kthread_should_stop(), so
the writeback kernel thread can exit and enter do_exit() with TASK_RUNNING
state. The warning message from might_sleep() is then avoided.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/writeback.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 56a37884ca8b..a57149803df6 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -449,11 +449,11 @@ static int bch_writeback_thread(void *arg)
 		    (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
 		     !dc->writeback_running)) {
 			up_write(&dc->writeback_lock);
-			set_current_state(TASK_INTERRUPTIBLE);
 
 			if (kthread_should_stop())
 				return 0;
 
+			set_current_state(TASK_INTERRUPTIBLE);
 			schedule();
 			continue;
 		}
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v1 02/10] bcache: set task properly in allocator_wait()
  2018-01-03 14:03 [PATCH v1 00/10] cache device failure handling improvement Coly Li
  2018-01-03 14:03 ` [PATCH v1 01/10] bcache: exit bch_writeback_thread() with proper task state Coly Li
@ 2018-01-03 14:03 ` Coly Li
  2018-01-03 17:09   ` Michael Lyle
  2018-01-08  7:10   ` Hannes Reinecke
  2018-01-03 14:03 ` [PATCH v1 03/10] bcache: reduce cache_set devices iteration by devices_max_used Coly Li
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 35+ messages in thread
From: Coly Li @ 2018-01-03 14:03 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, mlyle, tang.junhui, Coly Li

Kernel thread routine bch_allocator_thread() references macro
allocator_wait() to wait for a condition or quit to do_exit()
when kthread_should_stop() is true.

Macro allocator_wait() has 2 issues in setting task state, let's
see its code piece,

284         while (1) {                                                   \
285                 set_current_state(TASK_INTERRUPTIBLE);                \
286                 if (cond)                                             \
287                         break;                                        \
288                                                                       \
289                 mutex_unlock(&(ca)->set->bucket_lock);                \
290                 if (kthread_should_stop())                            \
291                         return 0;                                     \
292                                                                       \
293                 schedule();                                           \
294                 mutex_lock(&(ca)->set->bucket_lock);                  \
295         }                                                             \
296         __set_current_state(TASK_RUNNING);                            \

1) At line 285, task state is set to TASK_INTERRUPTIBLE. If at line 290
kthread_should_stop() is true, the kernel thread will terminate and return
to kernel/kthread.c:kthread(), which then calls do_exit() with
TASK_INTERRUPTIBLE state. This is not a suggested behavior and a warning
message will be reported by might_sleep() in the do_exit() code path:
"WARNING: do not call blocking ops when !TASK_RUNNING; state=1 set at
[xxxx]".

2) Because task state is set to TASK_INTERRUPTIBLE at line 285, when the
while-loop breaks the task state has to be set back to TASK_RUNNING at line
296. Indeed this is unnecessary: if task state is set to TASK_INTERRUPTIBLE
just before calling schedule() at line 293, we don't need to set the state
back to TASK_RUNNING at line 296 anymore. The reason is, the allocator
kthread is only woken up by wake_up_process(), and this routine makes sure
the task state of the allocator kthread will be TASK_RUNNING after it
returns from schedule() at line 294 (see
kernel/sched/core.c:try_to_wake_up() for more detailed information).

This patch fixes the above 2 issues by,
1) Setting TASK_INTERRUPTIBLE state just before calling schedule().
2) Removing the then-unnecessary TASK_RUNNING setting at line 296.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/alloc.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index a0cc1bc6d884..48c002faf08d 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -282,7 +282,6 @@ static void invalidate_buckets(struct cache *ca)
 #define allocator_wait(ca, cond)					\
 do {									\
 	while (1) {							\
-		set_current_state(TASK_INTERRUPTIBLE);			\
 		if (cond)						\
 			break;						\
 									\
@@ -290,10 +289,10 @@ do {									\
 		if (kthread_should_stop())				\
 			return 0;					\
 									\
+		set_current_state(TASK_INTERRUPTIBLE);			\
 		schedule();						\
 		mutex_lock(&(ca)->set->bucket_lock);			\
 	}								\
-	__set_current_state(TASK_RUNNING);				\
 } while (0)
 
 static int bch_allocator_push(struct cache *ca, long bucket)
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v1 03/10] bcache: reduce cache_set devices iteration by devices_max_used
  2018-01-03 14:03 [PATCH v1 00/10] cache device failure handling improvement Coly Li
  2018-01-03 14:03 ` [PATCH v1 01/10] bcache: exit bch_writeback_thread() with proper task state Coly Li
  2018-01-03 14:03 ` [PATCH v1 02/10] bcache: set task properly in allocator_wait() Coly Li
@ 2018-01-03 14:03 ` Coly Li
  2018-01-03 17:11   ` Michael Lyle
  2018-01-08  7:12   ` Hannes Reinecke
  2018-01-03 14:03 ` [PATCH v1 04/10] bcache: fix cached_dev->count usage for bch_cache_set_error() Coly Li
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 35+ messages in thread
From: Coly Li @ 2018-01-03 14:03 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, mlyle, tang.junhui, Coly Li

Member devices of struct cache_set is used to reference all bcache devices
attached to this cache set. It is an array of pointers, whose size is
indicated by member nr_uuids of struct cache_set.

nr_uuids is calculated in drivers/md/bcache/super.c:bch_cache_set_alloc(),
	bucket_bytes(c) / sizeof(struct uuid_entry)
Bucket size is determined by the user space tool "make-bcache"; by default
it is 1024 sectors (defined in bcache-tools/make-bcache.c:main()). So the
default nr_uuids value is 4096 from the above calculation.

Every time bcache code iterates the bcache devices of a cache set, all 4096
pointers are checked even if only 1 bcache device is attached to the cache
set; that's a waste of time and unnecessary.

This patch adds a member devices_max_used to struct cache_set. Its value is
1 + the maximum used index of devices[] in a cache set. When iterating all
valid bcache devices of a cache set, using c->devices_max_used as the
for-loop bound may avoid a lot of useless checking.

Personally, my motivation for this patch is not performance; I use it in
bcache debugging, where it helps me to narrow down the scope when checking
valid bcache devices of a cache set.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/bcache.h    | 1 +
 drivers/md/bcache/btree.c     | 2 +-
 drivers/md/bcache/super.c     | 9 ++++++---
 drivers/md/bcache/writeback.h | 2 +-
 4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 843877e017e1..83c569942bd0 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -488,6 +488,7 @@ struct cache_set {
 	int			caches_loaded;
 
 	struct bcache_device	**devices;
+	unsigned		devices_max_used;
 	struct list_head	cached_devs;
 	uint64_t		cached_dev_sectors;
 	struct closure		caching;
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 81e8dc3dbe5e..bf0d7978bc3d 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1678,7 +1678,7 @@ static void bch_btree_gc_finish(struct cache_set *c)
 
 	/* don't reclaim buckets to which writeback keys point */
 	rcu_read_lock();
-	for (i = 0; i < c->nr_uuids; i++) {
+	for (i = 0; i < c->devices_max_used; i++) {
 		struct bcache_device *d = c->devices[i];
 		struct cached_dev *dc;
 		struct keybuf_key *w, *n;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index b4d28928dec5..064efd869017 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -721,6 +721,9 @@ static void bcache_device_attach(struct bcache_device *d, struct cache_set *c,
 	d->c = c;
 	c->devices[id] = d;
 
+	if (id >= c->devices_max_used)
+		c->devices_max_used = id + 1;
+
 	closure_get(&c->caching);
 }
 
@@ -1261,7 +1264,7 @@ static int flash_devs_run(struct cache_set *c)
 	struct uuid_entry *u;
 
 	for (u = c->uuids;
-	     u < c->uuids + c->nr_uuids && !ret;
+	     u < c->uuids + c->devices_max_used && !ret;
 	     u++)
 		if (UUID_FLASH_ONLY(u))
 			ret = flash_dev_run(c, u);
@@ -1427,7 +1430,7 @@ static void __cache_set_unregister(struct closure *cl)
 
 	mutex_lock(&bch_register_lock);
 
-	for (i = 0; i < c->nr_uuids; i++)
+	for (i = 0; i < c->devices_max_used; i++)
 		if (c->devices[i]) {
 			if (!UUID_FLASH_ONLY(&c->uuids[i]) &&
 			    test_bit(CACHE_SET_UNREGISTERING, &c->flags)) {
@@ -1490,7 +1493,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 	c->bucket_bits		= ilog2(sb->bucket_size);
 	c->block_bits		= ilog2(sb->block_size);
 	c->nr_uuids		= bucket_bytes(c) / sizeof(struct uuid_entry);
-
+	c->devices_max_used	= 0;
 	c->btree_pages		= bucket_pages(c);
 	if (c->btree_pages > BTREE_MAX_PAGES)
 		c->btree_pages = max_t(int, c->btree_pages / 4,
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index a9e3ffb4b03c..1d284f3d0363 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -21,7 +21,7 @@ static inline uint64_t  bcache_flash_devs_sectors_dirty(struct cache_set *c)
 
 	mutex_lock(&bch_register_lock);
 
-	for (i = 0; i < c->nr_uuids; i++) {
+	for (i = 0; i < c->devices_max_used; i++) {
 		struct bcache_device *d = c->devices[i];
 
 		if (!d || !UUID_FLASH_ONLY(&c->uuids[i]))
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v1 04/10] bcache: fix cached_dev->count usage for bch_cache_set_error()
  2018-01-03 14:03 [PATCH v1 00/10] cache device failure handling improvement Coly Li
                   ` (2 preceding siblings ...)
  2018-01-03 14:03 ` [PATCH v1 03/10] bcache: reduce cache_set devices iteration by devices_max_used Coly Li
@ 2018-01-03 14:03 ` Coly Li
  2018-01-08  7:16   ` Hannes Reinecke
  2018-01-03 14:03 ` [PATCH v1 05/10] bcache: stop dc->writeback_rate_update if cache set is stopping Coly Li
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 35+ messages in thread
From: Coly Li @ 2018-01-03 14:03 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, mlyle, tang.junhui, Coly Li

When bcache metadata I/O fails, bcache will call bch_cache_set_error() to
retire the whole cache set. The expected behavior when retiring a cache set
is to unregister the cache set, unregister all backing devices attached to
this cache set, then remove the sysfs entries of the cache set and all
attached backing devices, and finally release the memory of structs
cache_set, cache, cached_dev and bcache_device.

In my testing, when a journal I/O failure is triggered by a disconnected
cache device, sometimes the cache set cannot be retired, its sysfs entry
/sys/fs/bcache/<uuid> still exists and the backing device still references
it. This is not the expected behavior.

When metadata I/O fails, the call sequence to retire the whole cache set is,
	bch_cache_set_error()
	bch_cache_set_unregister()
	bch_cache_set_stop()
	__cache_set_unregister()     <- called as callback by calling
					closure_queue(&c->caching)
	cache_set_flush()            <- called as a callback when refcount
					of cache_set->caching is 0
	cache_set_free()             <- called as a callback when refcount
					of cache_set->cl is 0
	bch_cache_set_release()      <- called as a callback when refcount
					of cache_set->kobj is 0

I find that if kernel thread bch_writeback_thread() quits its while-loop
when kthread_should_stop() is true and searched_full_index is false, the
closure callback cache_set_flush() set by continue_at() will never be
called. The result is, bcache fails to retire the whole cache set.

cache_set_flush() will be called when the refcount of closure c->caching is
0, and in function bcache_device_detach() the refcount of closure
c->caching is released to 0 by closure_put(). In the metadata error code
path, function bcache_device_detach() is called by
cached_dev_detach_finish(), a callback routine that runs when
cached_dev->count reaches 0. This refcount is decreased by cached_dev_put().

The above dependency indicates that cache_set_flush() will be called when
the refcount of cache_set->cl is 0, and the refcount of cache_set->cl
reaches 0 when the refcount of cached_dev->count is 0.

The reason why sometimes cached_dev->count is not 0 (when metadata I/O
fails and bch_cache_set_error() is called) is that in
bch_writeback_thread() the refcount of cached_dev is not decreased properly.

In bch_writeback_thread(), cached_dev_put() is called only when
searched_full_index is true and cached_dev->writeback_keys is empty, a.k.a
there is no dirty data on cache. Most of the run time this is correct, but
when bch_writeback_thread() quits the while-loop while the cache is still
dirty, the current code forgets to call cached_dev_put() before this kernel
thread exits. This is why sometimes cache_set_flush() is not executed and
the cache set fails to be retired.

The reason cached_dev_get() is called in bch_writeback_add() is, when the
cache device changes from clean to dirty, a refcount is taken to make sure
that during writeback operations both backing and cache devices won't be
released.

Adding the following code in bch_writeback_thread() does not work,
    static int bch_writeback_thread(void *arg)
    [code snip]

    +       if (atomic_read(&dc->has_dirty))
    +               cached_dev_put()
    +
        return 0;
    [code snip]

because the writeback kernel thread can be woken up and started via the
sysfs entry:
	echo 1 > /sys/block/bcache<N>/bcache/writeback_running
It is difficult to check whether the backing device is dirty without races
and extra locking. So the above modification would introduce potential
refcount underflow in some conditions.

The correct fix is to take the cached dev refcount when creating the kernel
thread, and put it before the kernel thread exits. Then bcache does not
need to take a cached dev refcount when the cache turns from clean to
dirty, or to put a cached dev refcount when the cache turns from dirty to
clean. The writeback kernel thread is always safe to reference data
structures from the cache set, cache and cached device (because a refcount
of the cached device is already taken for it), and no matter whether the
kernel thread is stopped by I/O errors or system reboot,
cached_dev->count can always be used correctly.

The patch is simple, but understanding how it works is quite complicated.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c     |  1 -
 drivers/md/bcache/writeback.c | 10 +++++++---
 drivers/md/bcache/writeback.h |  2 --
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 064efd869017..5401d2356aa3 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1044,7 +1044,6 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c)
 	if (BDEV_STATE(&dc->sb) == BDEV_STATE_DIRTY) {
 		bch_sectors_dirty_init(&dc->disk);
 		atomic_set(&dc->has_dirty, 1);
-		refcount_inc(&dc->count);
 		bch_writeback_queue(dc);
 	}
 
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index a57149803df6..0789a9e18337 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -451,7 +451,7 @@ static int bch_writeback_thread(void *arg)
 			up_write(&dc->writeback_lock);
 
 			if (kthread_should_stop())
-				return 0;
+				break;
 
 			set_current_state(TASK_INTERRUPTIBLE);
 			schedule();
@@ -463,7 +463,6 @@ static int bch_writeback_thread(void *arg)
 		if (searched_full_index &&
 		    RB_EMPTY_ROOT(&dc->writeback_keys.keys)) {
 			atomic_set(&dc->has_dirty, 0);
-			cached_dev_put(dc);
 			SET_BDEV_STATE(&dc->sb, BDEV_STATE_CLEAN);
 			bch_write_bdev_super(dc, NULL);
 		}
@@ -484,6 +483,8 @@ static int bch_writeback_thread(void *arg)
 		}
 	}
 
+	cached_dev_put(dc);
+
 	return 0;
 }
 
@@ -547,10 +548,13 @@ int bch_cached_dev_writeback_start(struct cached_dev *dc)
 	if (!dc->writeback_write_wq)
 		return -ENOMEM;
 
+	cached_dev_get(dc);
 	dc->writeback_thread = kthread_create(bch_writeback_thread, dc,
 					      "bcache_writeback");
-	if (IS_ERR(dc->writeback_thread))
+	if (IS_ERR(dc->writeback_thread)) {
+		cached_dev_put(dc);
 		return PTR_ERR(dc->writeback_thread);
+	}
 
 	schedule_delayed_work(&dc->writeback_rate_update,
 			      dc->writeback_rate_update_seconds * HZ);
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 1d284f3d0363..aab21afe49cf 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -92,8 +92,6 @@ static inline void bch_writeback_add(struct cached_dev *dc)
 {
 	if (!atomic_read(&dc->has_dirty) &&
 	    !atomic_xchg(&dc->has_dirty, 1)) {
-		refcount_inc(&dc->count);
-
 		if (BDEV_STATE(&dc->sb) != BDEV_STATE_DIRTY) {
 			SET_BDEV_STATE(&dc->sb, BDEV_STATE_DIRTY);
 			/* XXX: should do this synchronously */
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v1 05/10] bcache: stop dc->writeback_rate_update if cache set is stopping
  2018-01-03 14:03 [PATCH v1 00/10] cache device failure handling improvement Coly Li
                   ` (3 preceding siblings ...)
  2018-01-03 14:03 ` [PATCH v1 04/10] bcache: fix cached_dev->count usage for bch_cache_set_error() Coly Li
@ 2018-01-03 14:03 ` Coly Li
  2018-01-08  7:22   ` Hannes Reinecke
  2018-01-03 14:03 ` [PATCH v1 06/10] bcache: stop dc->writeback_rate_update, dc->writeback_thread earlier Coly Li
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 35+ messages in thread
From: Coly Li @ 2018-01-03 14:03 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, mlyle, tang.junhui, Coly Li

struct delayed_work writeback_rate_update in struct cached_dev is a delayed
worker calling function update_writeback_rate() periodically (the interval
is defined by dc->writeback_rate_update_seconds).

When a metadata I/O error happens on the cache device, the bcache error
handling routine bch_cache_set_error() will call bch_cache_set_unregister()
to retire the whole cache set. On the unregister code path,
cached_dev_free() calls cancel_delayed_work_sync(&dc->writeback_rate_update)
to stop this delayed work.

dc->writeback_rate_update is a delayed work different from others in
bcache: in its routine update_writeback_rate(), this delayed work is
re-armed after a period of time. That means even after
cancel_delayed_work_sync() returns, this delayed work can still be executed
several seconds later, as defined by dc->writeback_rate_update_seconds.

The problem is, after cancel_delayed_work_sync() returns, the cache set
unregister code path will eventually release the memory of struct
cache_set. Then when the delayed work is scheduled to run, inside its
routine update_writeback_rate() the already released cache set will be
accessed. Now a NULL pointer dereference panic is triggered.

In order to avoid the above problem, this patch checks the cache set flags
in the delayed work routine update_writeback_rate(). If flag
CACHE_SET_STOPPING is set, this routine will quit without re-arming the
delayed work. Then the NULL pointer dereference panic won't happen after
the cache set is released.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/writeback.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 0789a9e18337..745d9b2a326f 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -91,6 +91,11 @@ static void update_writeback_rate(struct work_struct *work)
 	struct cached_dev *dc = container_of(to_delayed_work(work),
 					     struct cached_dev,
 					     writeback_rate_update);
+	struct cache_set *c = dc->disk.c;
+
+	/* quit directly if cache set is stopping */
+	if (test_bit(CACHE_SET_STOPPING, &c->flags))
+		return;
 
 	down_read(&dc->writeback_lock);
 
@@ -100,6 +105,10 @@ static void update_writeback_rate(struct work_struct *work)
 
 	up_read(&dc->writeback_lock);
 
+	/* do not schedule delayed work if cache set is stopping */
+	if (test_bit(CACHE_SET_STOPPING, &c->flags))
+		return;
+
 	schedule_delayed_work(&dc->writeback_rate_update,
 			      dc->writeback_rate_update_seconds * HZ);
 }
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v1 06/10] bcache: stop dc->writeback_rate_update, dc->writeback_thread earlier
  2018-01-03 14:03 [PATCH v1 00/10] cache device failure handling improvement Coly Li
                   ` (4 preceding siblings ...)
  2018-01-03 14:03 ` [PATCH v1 05/10] bcache: stop dc->writeback_rate_update if cache set is stopping Coly Li
@ 2018-01-03 14:03 ` Coly Li
  2018-01-08  7:25   ` Hannes Reinecke
  2018-01-03 14:03 ` [PATCH v1 07/10] bcache: set error_limit correctly Coly Li
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 35+ messages in thread
From: Coly Li @ 2018-01-03 14:03 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, mlyle, tang.junhui, Coly Li

Delayed worker dc->writeback_rate_update and kernel thread
dc->writeback_thread reference the cache set data structure in their
routines. Therefore, before they are stopped, the cache set should not be
released; otherwise, a NULL pointer dereference will be triggered.

Currently delayed worker dc->writeback_rate_update and kernel thread
dc->writeback_thread are stopped in cached_dev_free(). When a cache set is
retired because of too many I/O errors, cached_dev_free() is called when
the refcount of the bcache device's closure (disk.cl) reaches 0. In most
cases, the last refcount of disk.cl is dropped in the last line of
cached_dev_detach_finish(). But in cached_dev_detach_finish(), before
calling closure_put(&dc->disk.cl), bcache_device_detach() is called, and
inside bcache_device_detach() the refcount of cache_set->caching is dropped
by closure_put(&d->c->caching).

It is very probably this is the last refcount of this closure, so routine
cache_set_flush() will be called (it is set in __cache_set_unregister()),
and its parent closure cache_set->cl may also drop its last refcount and
cache_set_free() is called too. In cache_set_free() the last refcount of
cache_set->kobj is dropped and then bch_cache_set_release() is called. Now
in bch_cache_set_release(), the memory of struct cache_set is freeed.

bch_cache_set_release() is called before cached_dev_free(); then there is a
time window after the cache set memory is freed and before
dc->writeback_thread and dc->writeback_rate_update are stopped. If one of
them is scheduled to run in this window, a NULL pointer dereference will be
triggered.

This patch fixes the above problem by stopping dc->writeback_thread and
dc->writeback_rate_update earlier, in bcache_device_detach() before calling
closure_put(&d->c->caching). Because cancel_delayed_work_sync() and
kthread_stop() are synchronous operations, we can make sure the cache set
is still available when the delayed work and kthread are being stopped.

Because cached_dev_free() can also be called by writing 1 to the sysfs file
/sys/block/bcache<N>/bcache/stop, and this code path may not call
bcache_device_detach() if d->c is NULL, stopping dc->writeback_thread and
dc->writeback_rate_update in cached_dev_free() is still necessary. In order
to avoid stopping them twice, dc->rate_update_canceled is added to indicate
dc->writeback_rate_update is canceled, and dc->writeback_thread is set to
NULL to indicate it is stopped.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/bcache.h    |  1 +
 drivers/md/bcache/super.c     | 21 +++++++++++++++++++--
 drivers/md/bcache/writeback.c |  1 +
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 83c569942bd0..395b87942a2f 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -322,6 +322,7 @@ struct cached_dev {
 
 	struct bch_ratelimit	writeback_rate;
 	struct delayed_work	writeback_rate_update;
+	bool			rate_update_canceled;
 
 	/*
 	 * Internal to the writeback code, so read_dirty() can keep track of
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 5401d2356aa3..8912be4165c5 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -696,8 +696,20 @@ static void bcache_device_link(struct bcache_device *d, struct cache_set *c,
 
 static void bcache_device_detach(struct bcache_device *d)
 {
+	struct cached_dev *dc;
+
 	lockdep_assert_held(&bch_register_lock);
 
+	dc = container_of(d, struct cached_dev, disk);
+	if (!IS_ERR_OR_NULL(dc->writeback_thread)) {
+		kthread_stop(dc->writeback_thread);
+		dc->writeback_thread = NULL;
+	}
+	if (!dc->rate_update_canceled) {
+		cancel_delayed_work_sync(&dc->writeback_rate_update);
+		dc->rate_update_canceled = true;
+	}
+
 	if (test_bit(BCACHE_DEV_DETACHING, &d->flags)) {
 		struct uuid_entry *u = d->c->uuids + d->id;
 
@@ -1071,9 +1083,14 @@ static void cached_dev_free(struct closure *cl)
 {
 	struct cached_dev *dc = container_of(cl, struct cached_dev, disk.cl);
 
-	cancel_delayed_work_sync(&dc->writeback_rate_update);
-	if (!IS_ERR_OR_NULL(dc->writeback_thread))
+	if (!dc->rate_update_canceled) {
+		cancel_delayed_work_sync(&dc->writeback_rate_update);
+		dc->rate_update_canceled = true;
+	}
+	if (!IS_ERR_OR_NULL(dc->writeback_thread)) {
 		kthread_stop(dc->writeback_thread);
+		dc->writeback_thread = NULL;
+	}
 	if (dc->writeback_write_wq)
 		destroy_workqueue(dc->writeback_write_wq);
 
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 745d9b2a326f..ab2ac3d72393 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -548,6 +548,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
 	dc->writeback_rate_i_term_inverse = 10000;
 
 	INIT_DELAYED_WORK(&dc->writeback_rate_update, update_writeback_rate);
+	dc->rate_update_canceled = false;
 }
 
 int bch_cached_dev_writeback_start(struct cached_dev *dc)
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v1 07/10] bcache: set error_limit correctly
  2018-01-03 14:03 [PATCH v1 00/10] cache device failure handling improvement Coly Li
                   ` (5 preceding siblings ...)
  2018-01-03 14:03 ` [PATCH v1 06/10] bcache: stop dc->writeback_rate_update, dc->writeback_thread earlier Coly Li
@ 2018-01-03 14:03 ` Coly Li
  2018-01-08  7:26   ` Hannes Reinecke
  2018-01-03 14:03 ` [PATCH v1 08/10] bcache: fix misleading error message in bch_count_io_errors() Coly Li
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 35+ messages in thread
From: Coly Li @ 2018-01-03 14:03 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, mlyle, tang.junhui, Coly Li

Struct cache uses io_errors for two purposes,
- Error decay: when cache set error_decay is set, io_errors is used to
  generate a small delay when an I/O error happens.
- I/O error counter: in order to generate a big enough value for error
  decay, the I/O error counter value is stored left-shifted by 20 bits
  (a.k.a. IO_ERROR_SHIFT).

In function bch_count_io_errors(), if the I/O errors counter reaches the
cache set error limit, bch_cache_set_error() will be called to retire the
whole cache set. But the current code is problematic when checking the error
limit; see the following code piece from bch_count_io_errors(),

 90   if (error) {
 91           char buf[BDEVNAME_SIZE];
 92           unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT,
 93                                               &ca->io_errors);
 94           errors >>= IO_ERROR_SHIFT;
 95
 96           if (errors < ca->set->error_limit)
 97                   pr_err("%s: IO error on %s, recovering",
 98                          bdevname(ca->bdev, buf), m);
 99           else
100                   bch_cache_set_error(ca->set,
101                                       "%s: too many IO errors %s",
102                                       bdevname(ca->bdev, buf), m);
103   }

At line 94, errors is right shifted by IO_ERROR_SHIFT bits, so it is now the
real errors counter compared at line 96. But ca->set->error_limit is
initialized with an amplified value in bch_cache_set_alloc(),
1545   c->error_limit  = 8 << IO_ERROR_SHIFT;

It means that by default, bch_count_io_errors() won't call
bch_cache_set_error() to retire the problematic cache device until 8<<20
errors have happened. If the average request size is 64KB, it means bcache
won't handle the failed device until 512GB of data has been requested. This
is far too large to be an I/O error threshold, so I believe the correct
error limit should be much less.

This patch sets the default cache set error limit to 8, so that in
bch_count_io_errors(), when the errors counter reaches 8 (the default
value), bch_cache_set_error() will be called to retire the whole cache set.
This patch also removes the bit shifting when storing or showing the
io_error_limit value via the sysfs interface.

Nowadays most SSDs handle internal flash failures automatically by LBA
address re-mapping. If an I/O error can be observed by upper layer code, it
is a notable error, because the SSD could not re-map the problematic LBA
address to an available flash block. This situation indicates the whole SSD
will fail very soon. Therefore setting 8 as the default I/O error limit
makes sense; it is enough for most cache devices.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/bcache.h | 1 +
 drivers/md/bcache/super.c  | 2 +-
 drivers/md/bcache/sysfs.c  | 4 ++--
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 395b87942a2f..a31dc3737dae 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -654,6 +654,7 @@ struct cache_set {
 		ON_ERROR_UNREGISTER,
 		ON_ERROR_PANIC,
 	}			on_error;
+#define DEFAULT_IO_ERROR_LIMIT 8
 	unsigned		error_limit;
 	unsigned		error_decay;
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 8912be4165c5..02d9d7110769 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1561,7 +1561,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 
 	c->congested_read_threshold_us	= 2000;
 	c->congested_write_threshold_us	= 20000;
-	c->error_limit	= 8 << IO_ERROR_SHIFT;
+	c->error_limit	= DEFAULT_IO_ERROR_LIMIT;
 
 	return c;
 err:
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index b4184092c727..d7ce9a05b304 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -556,7 +556,7 @@ SHOW(__bch_cache_set)
 
 	/* See count_io_errors for why 88 */
 	sysfs_print(io_error_halflife,	c->error_decay * 88);
-	sysfs_print(io_error_limit,	c->error_limit >> IO_ERROR_SHIFT);
+	sysfs_print(io_error_limit,	c->error_limit);
 
 	sysfs_hprint(congested,
 		     ((uint64_t) bch_get_congested(c)) << 9);
@@ -656,7 +656,7 @@ STORE(__bch_cache_set)
 	}
 
 	if (attr == &sysfs_io_error_limit)
-		c->error_limit = strtoul_or_return(buf) << IO_ERROR_SHIFT;
+		c->error_limit = strtoul_or_return(buf);
 
 	/* See count_io_errors() for why 88 */
 	if (attr == &sysfs_io_error_halflife)
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v1 08/10] bcache: fix misleading error message in bch_count_io_errors()
  2018-01-03 14:03 [PATCH v1 00/10] cache device failure handling improvement Coly Li
                   ` (6 preceding siblings ...)
  2018-01-03 14:03 ` [PATCH v1 07/10] bcache: set error_limit correctly Coly Li
@ 2018-01-03 14:03 ` Coly Li
  2018-01-03 17:14   ` Michael Lyle
  2018-01-08  7:27   ` Hannes Reinecke
  2018-01-03 14:03 ` [PATCH v1 09/10] bcache: add io_disable to struct cache_set Coly Li
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 35+ messages in thread
From: Coly Li @ 2018-01-03 14:03 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, mlyle, tang.junhui, Coly Li

Bcache only does recoverable I/O for read operations, by calling
cached_dev_read_error(). For write operations there is no I/O recovery for
failed requests.

But in bch_count_io_errors(), for both read and write I/Os, before the
errors counter reaches the I/O error limit, pr_err() always prints "IO
error on %s, recovering". For write requests this message is misleading,
because there is no I/O recovery at all.

This patch adds a parameter 'is_read' to bch_count_io_errors(), so that
pr_err() only prints "recovering" when the bio direction is READ.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/bcache.h    |  2 +-
 drivers/md/bcache/io.c        | 13 +++++++++----
 drivers/md/bcache/super.c     |  4 +++-
 drivers/md/bcache/writeback.c |  4 +++-
 4 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index a31dc3737dae..c53f312b2216 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -855,7 +855,7 @@ static inline void wake_up_allocators(struct cache_set *c)
 
 /* Forward declarations */
 
-void bch_count_io_errors(struct cache *, blk_status_t, const char *);
+void bch_count_io_errors(struct cache *, blk_status_t, int, const char *);
 void bch_bbio_count_io_errors(struct cache_set *, struct bio *,
 			      blk_status_t, const char *);
 void bch_bbio_endio(struct cache_set *, struct bio *, blk_status_t,
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index fac97ec2d0e2..a783c5a41ff1 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -51,7 +51,10 @@ void bch_submit_bbio(struct bio *bio, struct cache_set *c,
 
 /* IO errors */
 
-void bch_count_io_errors(struct cache *ca, blk_status_t error, const char *m)
+void bch_count_io_errors(struct cache *ca,
+			 blk_status_t error,
+			 int is_read,
+			 const char *m)
 {
 	/*
 	 * The halflife of an error is:
@@ -94,8 +97,9 @@ void bch_count_io_errors(struct cache *ca, blk_status_t error, const char *m)
 		errors >>= IO_ERROR_SHIFT;
 
 		if (errors < ca->set->error_limit)
-			pr_err("%s: IO error on %s, recovering",
-			       bdevname(ca->bdev, buf), m);
+			pr_err("%s: IO error on %s%s",
+			       bdevname(ca->bdev, buf), m,
+			       is_read ? ", recovering." : ".");
 		else
 			bch_cache_set_error(ca->set,
 					    "%s: too many IO errors %s",
@@ -108,6 +112,7 @@ void bch_bbio_count_io_errors(struct cache_set *c, struct bio *bio,
 {
 	struct bbio *b = container_of(bio, struct bbio, bio);
 	struct cache *ca = PTR_CACHE(c, &b->key, 0);
+	int is_read = (bio_data_dir(bio) == READ ? 1 : 0);
 
 	unsigned threshold = op_is_write(bio_op(bio))
 		? c->congested_write_threshold_us
@@ -129,7 +134,7 @@ void bch_bbio_count_io_errors(struct cache_set *c, struct bio *bio,
 			atomic_inc(&c->congested);
 	}
 
-	bch_count_io_errors(ca, error, m);
+	bch_count_io_errors(ca, error, is_read, m);
 }
 
 void bch_bbio_endio(struct cache_set *c, struct bio *bio,
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 02d9d7110769..bbe911847eea 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -274,7 +274,9 @@ static void write_super_endio(struct bio *bio)
 {
 	struct cache *ca = bio->bi_private;
 
-	bch_count_io_errors(ca, bio->bi_status, "writing superblock");
+	/* is_read = 0 */
+	bch_count_io_errors(ca, bio->bi_status, 0,
+			    "writing superblock");
 	closure_put(&ca->set->sb_write);
 }
 
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index ab2ac3d72393..e58f9be5ae43 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -228,8 +228,10 @@ static void read_dirty_endio(struct bio *bio)
 	struct keybuf_key *w = bio->bi_private;
 	struct dirty_io *io = w->private;
 
+	/* is_read = 1 */
 	bch_count_io_errors(PTR_CACHE(io->dc->disk.c, &w->key, 0),
-			    bio->bi_status, "reading dirty data from cache");
+			    bio->bi_status, 1,
+			    "reading dirty data from cache");
 
 	dirty_endio(bio);
 }
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v1 09/10] bcache: add io_disable to struct cache_set
  2018-01-03 14:03 [PATCH v1 00/10] cache device failure handling improvement Coly Li
                   ` (7 preceding siblings ...)
  2018-01-03 14:03 ` [PATCH v1 08/10] bcache: fix misleading error message in bch_count_io_errors() Coly Li
@ 2018-01-03 14:03 ` Coly Li
  2018-01-08  7:30   ` Hannes Reinecke
  2018-01-03 14:03 ` [PATCH v1 10/10] bcache: stop all attached bcache devices for a retired cache set Coly Li
  2018-01-03 17:07 ` [PATCH v1 00/10] cache device failure handling improvement Michael Lyle
  10 siblings, 1 reply; 35+ messages in thread
From: Coly Li @ 2018-01-03 14:03 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, mlyle, tang.junhui, Coly Li

When too many I/Os fail on the cache device, bch_cache_set_error() is called
in the error handling code path to retire the whole problematic cache set.
If new I/O requests continue to come in and take the refcount dc->count, the
cache set won't be retired immediately, which is a problem.

Furthermore, several kernel threads and self-armed kernel workers may still
be running after bch_cache_set_error() is called. It takes quite a while for
them to stop, or they won't stop at all. They also prevent the cache set
from being retired.

The solution in this patch is to add a per-cache-set flag that disables I/O
requests on the cache and all attached backing devices. New incoming I/O
requests can then be rejected in *_make_request() before taking the
refcount, and kernel threads and self-armed kernel workers can stop very
quickly once io_disable is true.

Because bcache also does internal I/O for writeback, garbage collection,
bucket allocation and journaling, this kind of I/O should also be disabled
after bch_cache_set_error() is called. So closure_bio_submit() is modified
to check whether cache_set->io_disable is true. If it is true,
closure_bio_submit() sets bio->bi_status to BLK_STS_IOERR and returns
without calling generic_make_request().

A sysfs interface is also added for cache_set->io_disable, to read and set
the io_disable value for debugging. It is helpful for triggering more
corner case issues of a failed cache device.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/alloc.c     |  2 +-
 drivers/md/bcache/bcache.h    | 14 ++++++++++++++
 drivers/md/bcache/btree.c     |  6 ++++--
 drivers/md/bcache/io.c        |  2 +-
 drivers/md/bcache/journal.c   |  4 ++--
 drivers/md/bcache/request.c   | 26 +++++++++++++++++++-------
 drivers/md/bcache/super.c     |  7 ++++++-
 drivers/md/bcache/sysfs.c     |  4 ++++
 drivers/md/bcache/util.h      |  6 ------
 drivers/md/bcache/writeback.c | 34 ++++++++++++++++++++++------------
 10 files changed, 73 insertions(+), 32 deletions(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index 48c002faf08d..3be737582f27 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -286,7 +286,7 @@ do {									\
 			break;						\
 									\
 		mutex_unlock(&(ca)->set->bucket_lock);			\
-		if (kthread_should_stop())				\
+		if (kthread_should_stop() || ca->set->io_disable)	\
 			return 0;					\
 									\
 		set_current_state(TASK_INTERRUPTIBLE);			\
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index c53f312b2216..9c7f9b1cb791 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -481,6 +481,7 @@ struct cache_set {
 	struct cache_accounting accounting;
 
 	unsigned long		flags;
+	bool			io_disable;
 
 	struct cache_sb		sb;
 
@@ -853,6 +854,19 @@ static inline void wake_up_allocators(struct cache_set *c)
 		wake_up_process(ca->alloc_thread);
 }
 
+static inline void closure_bio_submit(struct cache_set *c,
+				      struct bio *bio,
+				      struct closure *cl)
+{
+	closure_get(cl);
+	if (unlikely(c->io_disable)) {
+		bio->bi_status = BLK_STS_IOERR;
+		bio_endio(bio);
+		return;
+	}
+	generic_make_request(bio);
+}
+
 /* Forward declarations */
 
 void bch_count_io_errors(struct cache *, blk_status_t, int, const char *);
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index bf0d7978bc3d..75470cce1177 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1788,9 +1788,11 @@ static int bch_gc_thread(void *arg)
 
 	while (1) {
 		wait_event_interruptible(c->gc_wait,
-			   kthread_should_stop() || gc_should_run(c));
+					kthread_should_stop() ||
+					c->io_disable ||
+					gc_should_run(c));
 
-		if (kthread_should_stop())
+		if (kthread_should_stop() || c->io_disable)
 			break;
 
 		set_gc_sectors(c);
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index a783c5a41ff1..8013ecbcdbda 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -38,7 +38,7 @@ void __bch_submit_bbio(struct bio *bio, struct cache_set *c)
 	bio_set_dev(bio, PTR_CACHE(c, &b->key, 0)->bdev);
 
 	b->submit_time_us = local_clock_us();
-	closure_bio_submit(bio, bio->bi_private);
+	closure_bio_submit(c, bio, bio->bi_private);
 }
 
 void bch_submit_bbio(struct bio *bio, struct cache_set *c,
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index a87165c1d8e5..979873641030 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -62,7 +62,7 @@ reread:		left = ca->sb.bucket_size - offset;
 		bio_set_op_attrs(bio, REQ_OP_READ, 0);
 		bch_bio_map(bio, data);
 
-		closure_bio_submit(bio, &cl);
+		closure_bio_submit(ca->set, bio, &cl);
 		closure_sync(&cl);
 
 		/* This function could be simpler now since we no longer write
@@ -653,7 +653,7 @@ static void journal_write_unlocked(struct closure *cl)
 	spin_unlock(&c->journal.lock);
 
 	while ((bio = bio_list_pop(&list)))
-		closure_bio_submit(bio, cl);
+		closure_bio_submit(c, bio, cl);
 
 	continue_at(cl, journal_write_done, NULL);
 }
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 643c3021624f..a85d6a605a8e 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -725,7 +725,7 @@ static void cached_dev_read_error(struct closure *cl)
 
 		/* XXX: invalidate cache */
 
-		closure_bio_submit(bio, cl);
+		closure_bio_submit(s->iop.c, bio, cl);
 	}
 
 	continue_at(cl, cached_dev_cache_miss_done, NULL);
@@ -850,7 +850,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
 	s->cache_miss	= miss;
 	s->iop.bio	= cache_bio;
 	bio_get(cache_bio);
-	closure_bio_submit(cache_bio, &s->cl);
+	closure_bio_submit(s->iop.c, cache_bio, &s->cl);
 
 	return ret;
 out_put:
@@ -858,7 +858,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
 out_submit:
 	miss->bi_end_io		= request_endio;
 	miss->bi_private	= &s->cl;
-	closure_bio_submit(miss, &s->cl);
+	closure_bio_submit(s->iop.c, miss, &s->cl);
 	return ret;
 }
 
@@ -923,7 +923,7 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
 
 		if ((bio_op(bio) != REQ_OP_DISCARD) ||
 		    blk_queue_discard(bdev_get_queue(dc->bdev)))
-			closure_bio_submit(bio, cl);
+			closure_bio_submit(s->iop.c, bio, cl);
 	} else if (s->iop.writeback) {
 		bch_writeback_add(dc);
 		s->iop.bio = bio;
@@ -938,12 +938,12 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
 			flush->bi_private = cl;
 			flush->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
 
-			closure_bio_submit(flush, cl);
+			closure_bio_submit(s->iop.c, flush, cl);
 		}
 	} else {
 		s->iop.bio = bio_clone_fast(bio, GFP_NOIO, dc->disk.bio_split);
 
-		closure_bio_submit(bio, cl);
+		closure_bio_submit(s->iop.c, bio, cl);
 	}
 
 	closure_call(&s->iop.cl, bch_data_insert, NULL, cl);
@@ -959,7 +959,7 @@ static void cached_dev_nodata(struct closure *cl)
 		bch_journal_meta(s->iop.c, cl);
 
 	/* If it's a flush, we send the flush to the backing device too */
-	closure_bio_submit(bio, cl);
+	closure_bio_submit(s->iop.c, bio, cl);
 
 	continue_at(cl, cached_dev_bio_complete, NULL);
 }
@@ -974,6 +974,12 @@ static blk_qc_t cached_dev_make_request(struct request_queue *q,
 	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
 	int rw = bio_data_dir(bio);
 
+	if (unlikely(d->c && d->c->io_disable)) {
+		bio->bi_status = BLK_STS_IOERR;
+		bio_endio(bio);
+		return BLK_QC_T_NONE;
+	}
+
 	generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0);
 
 	bio_set_dev(bio, dc->bdev);
@@ -1089,6 +1095,12 @@ static blk_qc_t flash_dev_make_request(struct request_queue *q,
 	struct bcache_device *d = bio->bi_disk->private_data;
 	int rw = bio_data_dir(bio);
 
+	if (unlikely(d->c->io_disable)) {
+		bio->bi_status = BLK_STS_IOERR;
+		bio_endio(bio);
+		return BLK_QC_T_NONE;
+	}
+
 	generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0);
 
 	s = search_alloc(bio, d);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index bbe911847eea..7aa76c3e3556 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -521,7 +521,7 @@ static void prio_io(struct cache *ca, uint64_t bucket, int op,
 	bio_set_op_attrs(bio, op, REQ_SYNC|REQ_META|op_flags);
 	bch_bio_map(bio, ca->disk_buckets);
 
-	closure_bio_submit(bio, &ca->prio);
+	closure_bio_submit(ca->set, bio, &ca->prio);
 	closure_sync(cl);
 }
 
@@ -1333,6 +1333,10 @@ bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...)
 	acquire_console_sem();
 	*/
 
+	c->io_disable = true;
+	/* make others know io_disable is true earlier */
+	smp_mb();
+
 	printk(KERN_ERR "bcache: error on %pU: ", c->sb.set_uuid);
 
 	va_start(args, fmt);
@@ -1564,6 +1568,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 	c->congested_read_threshold_us	= 2000;
 	c->congested_write_threshold_us	= 20000;
 	c->error_limit	= DEFAULT_IO_ERROR_LIMIT;
+	c->io_disable = false;
 
 	return c;
 err:
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index d7ce9a05b304..acce7c82e111 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -92,6 +92,7 @@ read_attribute(partial_stripes_expensive);
 
 rw_attribute(synchronous);
 rw_attribute(journal_delay_ms);
+rw_attribute(io_disable);
 rw_attribute(discard);
 rw_attribute(running);
 rw_attribute(label);
@@ -573,6 +574,7 @@ SHOW(__bch_cache_set)
 	sysfs_printf(gc_always_rewrite,		"%i", c->gc_always_rewrite);
 	sysfs_printf(btree_shrinker_disabled,	"%i", c->shrinker_disabled);
 	sysfs_printf(copy_gc_enabled,		"%i", c->copy_gc_enabled);
+	sysfs_printf(io_disable,		"%i", c->io_disable);
 
 	if (attr == &sysfs_bset_tree_stats)
 		return bch_bset_print_stats(c, buf);
@@ -663,6 +665,7 @@ STORE(__bch_cache_set)
 		c->error_decay = strtoul_or_return(buf) / 88;
 
 	sysfs_strtoul(journal_delay_ms,		c->journal_delay_ms);
+	sysfs_strtoul_clamp(io_disable,		c->io_disable, 0, 1);
 	sysfs_strtoul(verify,			c->verify);
 	sysfs_strtoul(key_merging_disabled,	c->key_merging_disabled);
 	sysfs_strtoul(expensive_debug_checks,	c->expensive_debug_checks);
@@ -744,6 +747,7 @@ static struct attribute *bch_cache_set_internal_files[] = {
 	&sysfs_gc_always_rewrite,
 	&sysfs_btree_shrinker_disabled,
 	&sysfs_copy_gc_enabled,
+	&sysfs_io_disable,
 	NULL
 };
 KTYPE(bch_cache_set_internal);
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index ed5e8a412eb8..03e533631798 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -564,12 +564,6 @@ static inline sector_t bdev_sectors(struct block_device *bdev)
 	return bdev->bd_inode->i_size >> 9;
 }
 
-#define closure_bio_submit(bio, cl)					\
-do {									\
-	closure_get(cl);						\
-	generic_make_request(bio);					\
-} while (0)
-
 uint64_t bch_crc64_update(uint64_t, const void *, size_t);
 uint64_t bch_crc64(const void *, size_t);
 
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index e58f9be5ae43..54add41d2569 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -93,8 +93,11 @@ static void update_writeback_rate(struct work_struct *work)
 					     writeback_rate_update);
 	struct cache_set *c = dc->disk.c;
 
-	/* quit directly if cache set is stopping */
-	if (test_bit(CACHE_SET_STOPPING, &c->flags))
+	/*
+	 * quit directly if cache set is stopping. c->io_disable
+	 * can be set via sysfs, check it here too.
+	 */
+	if (test_bit(CACHE_SET_STOPPING, &c->flags) || c->io_disable)
 		return;
 
 	down_read(&dc->writeback_lock);
@@ -105,8 +108,11 @@ static void update_writeback_rate(struct work_struct *work)
 
 	up_read(&dc->writeback_lock);
 
-	/* do not schedule delayed work if cache set is stopping */
-	if (test_bit(CACHE_SET_STOPPING, &c->flags))
+	/*
+	 * do not schedule delayed work if cache set is stopping,
+	 * c->io_disable can be set via sysfs, check it here too.
+	 */
+	if (test_bit(CACHE_SET_STOPPING, &c->flags) || c->io_disable)
 		return;
 
 	schedule_delayed_work(&dc->writeback_rate_update,
@@ -217,7 +223,7 @@ static void write_dirty(struct closure *cl)
 		bio_set_dev(&io->bio, io->dc->bdev);
 		io->bio.bi_end_io	= dirty_endio;
 
-		closure_bio_submit(&io->bio, cl);
+		closure_bio_submit(io->dc->disk.c, &io->bio, cl);
 	}
 
 	continue_at(cl, write_dirty_finish, io->dc->writeback_write_wq);
@@ -240,7 +246,7 @@ static void read_dirty_submit(struct closure *cl)
 {
 	struct dirty_io *io = container_of(cl, struct dirty_io, cl);
 
-	closure_bio_submit(&io->bio, cl);
+	closure_bio_submit(io->dc->disk.c, &io->bio, cl);
 
 	continue_at(cl, write_dirty, io->dc->writeback_write_wq);
 }
@@ -259,7 +265,7 @@ static void read_dirty(struct cached_dev *dc)
 	 * mempools.
 	 */
 
-	while (!kthread_should_stop()) {
+	while (!(kthread_should_stop() || dc->disk.c->io_disable)) {
 
 		w = bch_keybuf_next(&dc->writeback_keys);
 		if (!w)
@@ -269,7 +275,9 @@ static void read_dirty(struct cached_dev *dc)
 
 		if (KEY_START(&w->key) != dc->last_read ||
 		    jiffies_to_msecs(delay) > 50)
-			while (!kthread_should_stop() && delay)
+			while (!kthread_should_stop() &&
+			       !dc->disk.c->io_disable &&
+			       delay)
 				delay = schedule_timeout_interruptible(delay);
 
 		dc->last_read	= KEY_OFFSET(&w->key);
@@ -450,18 +458,19 @@ static bool refill_dirty(struct cached_dev *dc)
 static int bch_writeback_thread(void *arg)
 {
 	struct cached_dev *dc = arg;
+	struct cache_set *c = dc->disk.c;
 	bool searched_full_index;
 
 	bch_ratelimit_reset(&dc->writeback_rate);
 
-	while (!kthread_should_stop()) {
+	while (!(kthread_should_stop() || c->io_disable)) {
 		down_write(&dc->writeback_lock);
 		if (!atomic_read(&dc->has_dirty) ||
 		    (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
 		     !dc->writeback_running)) {
 			up_write(&dc->writeback_lock);
 
-			if (kthread_should_stop())
+			if (kthread_should_stop() || c->io_disable)
 				break;
 
 			set_current_state(TASK_INTERRUPTIBLE);
@@ -485,8 +494,8 @@ static int bch_writeback_thread(void *arg)
 		if (searched_full_index) {
 			unsigned delay = dc->writeback_delay * HZ;
 
-			while (delay &&
-			       !kthread_should_stop() &&
+			while (delay && !kthread_should_stop() &&
+			       !c->io_disable &&
 			       !test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags))
 				delay = schedule_timeout_interruptible(delay);
 
@@ -494,6 +503,7 @@ static int bch_writeback_thread(void *arg)
 		}
 	}
 
+	dc->writeback_thread = NULL;
 	cached_dev_put(dc);
 
 	return 0;
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v1 10/10] bcache: stop all attached bcache devices for a retired cache set
  2018-01-03 14:03 [PATCH v1 00/10] cache device failure handling improvement Coly Li
                   ` (8 preceding siblings ...)
  2018-01-03 14:03 ` [PATCH v1 09/10] bcache: add io_disable to struct cache_set Coly Li
@ 2018-01-03 14:03 ` Coly Li
  2018-01-08  7:31   ` Hannes Reinecke
  2018-01-03 17:07 ` [PATCH v1 00/10] cache device failure handling improvement Michael Lyle
  10 siblings, 1 reply; 35+ messages in thread
From: Coly Li @ 2018-01-03 14:03 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, mlyle, tang.junhui, Coly Li

When there are too many I/O errors on the cache device, current bcache code
retires the whole cache set and detaches all bcache devices. But the
detached bcache devices are not stopped, which is problematic when bcache
is in writeback mode.

If the retired cache set has dirty data for the backing devices, continued
writes to the bcache device go directly to the backing device. If the LBA
of a write request has a dirty version cached on the cache device, then the
next time the cache device is re-registered and the backing device
re-attached, the stale dirty data on the cache device will be written to
the backing device, overwriting the newer, directly written data. This
situation causes serious data corruption.

This patch checks whether cache_set->io_disable is true in
__cache_set_unregister(). If cache_set->io_disable is true, the cache set
is being unregistered because of too many I/O errors, so all attached
bcache devices are stopped as well. If cache_set->io_disable is not true,
__cache_set_unregister() was triggered by writing 1 to the sysfs file
/sys/fs/bcache/<UUID>/bcache/stop. This is an exception: because users
requested it explicitly, this patch keeps the existing behavior and does
not stop any bcache device.

Even if the failed cache device has no dirty data, stopping the bcache
device is still desired behavior for many Ceph and database users: their
applications will report I/O errors due to the disappeared bcache device,
and operations people will know the cache device is broken or disconnected.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 49d6fedf89c3..20a7a6959506 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1458,6 +1458,14 @@ static void __cache_set_unregister(struct closure *cl)
 				dc = container_of(c->devices[i],
 						  struct cached_dev, disk);
 				bch_cached_dev_detach(dc);
+				/*
+				 * If we come here by too many I/O errors,
+				 * bcache device should be stopped too, to
+				 * keep data consistency on cache and
+				 * backing devices.
+				 */
+				if (c->io_disable)
+					bcache_device_stop(c->devices[i]);
 			} else {
 				bcache_device_stop(c->devices[i]);
 			}
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 00/10] cache device failure handling improvement
  2018-01-03 14:03 [PATCH v1 00/10] cache device failure handling improvement Coly Li
                   ` (9 preceding siblings ...)
  2018-01-03 14:03 ` [PATCH v1 10/10] bcache: stop all attached bcache devices for a retired cache set Coly Li
@ 2018-01-03 17:07 ` Michael Lyle
  2018-01-04  2:20   ` Coly Li
  10 siblings, 1 reply; 35+ messages in thread
From: Michael Lyle @ 2018-01-03 17:07 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, tang.junhui

Hey Coly,

On 01/03/2018 06:03 AM, Coly Li wrote:
[snip]
> The first 8 patches of this patch set is to fix existing bugs in bcache,
> the last 2 patches do the real improvement. Order of applying these patches
> is important, if the last 2 patches are applied firstly, kernel panic or
> process hang will be observed. Therefore I suggest to apply the first 8
> fixes, then apply the last 2 patches.

Wow, this is a lot of changes.  :D  Thanks for the fixes.  I've skimmed
through these (no real review yet) and overall what you're doing looks
good.  I think I'm going to concentrate on the first several patches in
the set for now as a strategy to get at least some of this in for the
first pass of 4.16 and we can figure out what to do from there.

We're up to RC6, so it's getting to be time to get things into next, and
pretty soon I need to focus on testing for awhile.

[off-topic, I wasn't able to find the time to go through the lock model
for 4.16 as I had hoped-- hopefully these changes make it to 4.17].

Mike

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 01/10] bcache: exit bch_writeback_thread() with proper task state
  2018-01-03 14:03 ` [PATCH v1 01/10] bcache: exit bch_writeback_thread() with proper task state Coly Li
@ 2018-01-03 17:08   ` Michael Lyle
  2018-01-05 17:05     ` Coly Li
  2018-01-08  7:09   ` Hannes Reinecke
  1 sibling, 1 reply; 35+ messages in thread
From: Michael Lyle @ 2018-01-03 17:08 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, tang.junhui

On 01/03/2018 06:03 AM, Coly Li wrote:
> Kernel thread routine bch_writeback_thread() has the following code block,
> 
> 452                         set_current_state(TASK_INTERRUPTIBLE);
> 453
> 454                         if (kthread_should_stop())
> 455                                 return 0;
> 456
> 457                         schedule();
> 458                         continue;
> 
> At line 452, its state is set to TASK_INTERRUPTIBLE, and at line 454 if
> kthread_should_stop() is true, a "return 0" at line 455 will return to
> function kernel/kthread.c:kthread(), which then calls do_exit().
> 
> It is not good to enter do_exit() with task state TASK_INTERRUPTIBLE; in
> the following code path might_sleep() is called and a warning message is
> reported by __might_sleep(): "WARNING: do not call blocking ops when
> !TASK_RUNNING; state=1 set at [xxxx]".
> 
> Indeed it does not hurt when a kernel thread exits with TASK_INTERRUPTIBLE
> state, but this warning message scares users, making them feel there might
> be something risky with bcache that could hurt their data.
> 
> In this patch, TASK_INTERRUPTIBLE is set after kthread_should_stop(),
> so the writeback kernel thread can exit and enter do_exit() with
> TASK_RUNNING state. The warning message from might_sleep() is removed.
> 
> Signed-off-by: Coly Li <colyli@suse.de>

Reviewed-by: Michael Lyle <mlyle@lyle.org>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 02/10] bcache: set task properly in allocator_wait()
  2018-01-03 14:03 ` [PATCH v1 02/10] bcache: set task properly in allocator_wait() Coly Li
@ 2018-01-03 17:09   ` Michael Lyle
  2018-01-05 17:11     ` Coly Li
  2018-01-08  7:10   ` Hannes Reinecke
  1 sibling, 1 reply; 35+ messages in thread
From: Michael Lyle @ 2018-01-03 17:09 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, tang.junhui

On 01/03/2018 06:03 AM, Coly Li wrote:
> Kernel thread routine bch_allocator_thread() references macro
> allocator_wait() to wait for a condition or quit to do_exit()
> when kthread_should_stop() is true.
> 
> Macro allocator_wait() has 2 issues in setting task state, let's
> see its code piece,
> 
> 284         while (1) {                                                   \
> 285                 set_current_state(TASK_INTERRUPTIBLE);                \
> 286                 if (cond)                                             \
> 287                         break;                                        \
> 288                                                                       \
> 289                 mutex_unlock(&(ca)->set->bucket_lock);                \
> 290                 if (kthread_should_stop())                            \
> 291                         return 0;                                     \
> 292                                                                       \
> 293                 schedule();                                           \
> 294                 mutex_lock(&(ca)->set->bucket_lock);                  \
> 295         }                                                             \
> 296         __set_current_state(TASK_RUNNING);                            \
> 
> 1) At line 285, task state is set to TASK_INTERRUPTIBLE; if at line 290
> kthread_should_stop() is true, the kernel thread will terminate and return
> to kernel/kthread.c:kthread(), which then calls do_exit() with
> TASK_INTERRUPTIBLE state. This is not a suggested behavior, and a warning
> message will be reported by might_sleep() in the do_exit() code path:
> "WARNING: do not call blocking ops when !TASK_RUNNING; state=1 set at [xxxx]".
> 
> 2) Because task state is set to TASK_INTERRUPTIBLE at line 285, when the
> while-loop breaks the task state has to be set back to TASK_RUNNING at
> line 296. Indeed this is unnecessary: if task state is set to
> TASK_INTERRUPTIBLE just before calling schedule() at line 293, we don't
> need to set the state back to TASK_RUNNING at line 296 anymore. The reason
> is that the allocator kthread is only woken up by wake_up_process(), and
> this routine makes sure the task state of the allocator kthread will be
> TASK_RUNNING after it returns from schedule() at line 294 (see
> kernel/sched/core.c:try_to_wake_up() for more detailed information).
> 
> This patch fixes the above 2 issues by:
> 1) Setting TASK_INTERRUPTIBLE state just before calling schedule().
> 2) Removing the now-unnecessary TASK_RUNNING setting at line 296.
> 
> Signed-off-by: Coly Li <colyli@suse.de>

Reviewed-by: Michael Lyle <mlyle@lyle.org>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 03/10] bcache: reduce cache_set devices iteration by devices_max_used
  2018-01-03 14:03 ` [PATCH v1 03/10] bcache: reduce cache_set devices iteration by devices_max_used Coly Li
@ 2018-01-03 17:11   ` Michael Lyle
  2018-01-08  7:12   ` Hannes Reinecke
  1 sibling, 0 replies; 35+ messages in thread
From: Michael Lyle @ 2018-01-03 17:11 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, tang.junhui

On 01/03/2018 06:03 AM, Coly Li wrote:
> Member devices of struct cache_set is used to reference all bcache
> devices attached to this cache set. It is an array of pointers, and the
> size of devices[] is indicated by member nr_uuids of struct cache_set.
> 
> nr_uuids is calculated in drivers/md/bcache/super.c:bch_cache_set_alloc(),
> 	bucket_bytes(c) / sizeof(struct uuid_entry)
> Bucket size is determined by user space tool "make-bcache", by default it
> is 1024 sectors (defined in bcache-tools/make-bcache.c:main()). So default
> nr_uuids value is 4096 from the above calculation.
> 
> Every time the bcache code iterates the bcache devices of a cache set, all
> 4096 pointers are checked even if only 1 bcache device is attached to the
> cache set; that's an unnecessary waste of time.
> 
> This patch adds a member devices_max_used to struct cache_set. Its value
> is 1 + the maximum used index of devices[] in a cache set. When iterating
> all valid bcache devices of a cache set, using c->devices_max_used in the
> for-loop avoids a lot of useless checking.
> 
> Personally, my motivation for this patch is not performance; I use it
> in bcache debugging, where it helps me narrow down the scope to check
> valid bcache devices of a cache set.

Oh, OK.  I was going to argue that I didn't think this was worth it, but
I can see how it would be nice for inspecting memory and debug code.

> 
> Signed-off-by: Coly Li <colyli@suse.de>

Reviewed-by: Michael Lyle <mlyle@lyle.org>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 08/10] bcache: fix misleading error message in bch_count_io_errors()
  2018-01-03 14:03 ` [PATCH v1 08/10] bcache: fix misleading error message in bch_count_io_errors() Coly Li
@ 2018-01-03 17:14   ` Michael Lyle
  2018-01-08  7:27   ` Hannes Reinecke
  1 sibling, 0 replies; 35+ messages in thread
From: Michael Lyle @ 2018-01-03 17:14 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, tang.junhui

On 01/03/2018 06:03 AM, Coly Li wrote:
> Bcache only does recoverable I/O for read operations by calling
> cached_dev_read_error(). For write operations there is no I/O recovery for
> failed requests.
> 
> But in bch_count_io_errors(), no matter whether the I/O is a read or a
> write, before the error counter reaches the io error limit, pr_err()
> always prints "IO error on %s, recovering". For write requests this
> information is misleading, because there is no I/O recovery at all.
> 
> This patch adds a parameter 'is_read' to bch_count_io_errors(), and only
> prints "recovering" by pr_err() when the bio direction is READ.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
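
For context, the resulting interface presumably looks roughly like the
sketch below. This is my reading of the commit description, not the actual
diff; the field and helper names (io_errors, IO_ERROR_SHIFT, error_limit,
bdevname()) are assumptions based on drivers/md/bcache/io.c and may differ:

```c
/* Sketch only -- based on the description above, not the actual patch. */
void bch_count_io_errors(struct cache *ca, blk_status_t error,
			 int is_read, const char *m)
{
	/* ... error accounting as before ... */
	if (error) {
		char buf[BDEVNAME_SIZE];
		unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT,
						    &ca->io_errors);
		errors >>= IO_ERROR_SHIFT;

		if (errors < ca->set->error_limit)
			pr_err("%s: IO error on %s%s",
			       bdevname(ca->bdev, buf), m,
			       is_read ? ", recovering." : ".");
		else
			bch_cache_set_error(ca->set,
					    "%s: too many IO errors %s",
					    bdevname(ca->bdev, buf), m);
	}
}
```

The only behavioral change is the is_read ternary: "recovering" is printed
for reads, where cached_dev_read_error() actually retries, and a plain "."
for writes.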

Reviewed-by: Michael Lyle <mlyle@lyle.org>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 01/10] bcache: exit bch_writeback_thread() with proper task state
@ 2018-01-03 18:30 tang.junhui
  0 siblings, 0 replies; 35+ messages in thread
From: tang.junhui @ 2018-01-03 18:30 UTC (permalink / raw)
  To: colyli; +Cc: mlyle, linux-bcache, linux-block, tang.junhui

From: Tang Junhui <tang.junhui@zte.com.cn>

LGTM.

Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>

>Kernel thread routine bch_writeback_thread() has the following code block,
>
>452                         set_current_state(TASK_INTERRUPTIBLE);
>453
>454                         if (kthread_should_stop())
>455                                 return 0;
>456
>457                         schedule();
>458                         continue;
>
>At line 452, its status is set to TASK_INTERRUPTIBLE, and at line 454 if
>kthread_should_stop() is true, a "return 0" at line 455 will return to
>function kernel/kthread.c:kthread() and call do_exit().
>
>It is not good to enter do_exit() with task state TASK_INTERRUPTIBLE; in
>the following code path, might_sleep() is called and a warning message is
>reported by __might_sleep(): "WARNING: do not call blocking ops when
>!TASK_RUNNING; state=1 set at [xxxx]".
>
>Indeed it does not hurt when a kernel thread exits with TASK_INTERRUPTIBLE
>state, but this warning message scares users, making them feel there might
>be something risky with bcache that could hurt their data.
>
>In this patch, TASK_INTERRUPTIBLE is set after kthread_should_stop(),
>so the writeback kernel thread can exit and enter do_exit() with
>TASK_RUNNING state. The warning message from might_sleep() is avoided.
>
>Signed-off-by: Coly Li <colyli@suse.de>
>---
> drivers/md/bcache/writeback.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
>index 56a37884ca8b..a57149803df6 100644
>--- a/drivers/md/bcache/writeback.c
>+++ b/drivers/md/bcache/writeback.c
>@@ -449,11 +449,11 @@ static int bch_writeback_thread(void *arg)
>             (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
>              !dc->writeback_running)) {
>             up_write(&dc->writeback_lock);
>-            set_current_state(TASK_INTERRUPTIBLE);
> 
>             if (kthread_should_stop())
>                 return 0;
> 
>+            set_current_state(TASK_INTERRUPTIBLE);
>             schedule();
>             continue;
>         }
>-- 
>2.15.1

Thanks,
Tang Junhui

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 00/10] cache device failure handling improvement
  2018-01-03 17:07 ` [PATCH v1 00/10] cache device failure handling improvement Michael Lyle
@ 2018-01-04  2:20   ` Coly Li
  2018-01-04 17:46     ` Michael Lyle
  0 siblings, 1 reply; 35+ messages in thread
From: Coly Li @ 2018-01-04  2:20 UTC (permalink / raw)
  To: Michael Lyle, linux-bcache; +Cc: linux-block, tang.junhui

On 04/01/2018 1:07 AM, Michael Lyle wrote:
> Hey Coly,
> 
> On 01/03/2018 06:03 AM, Coly Li wrote:
> [snip]
>> The first 8 patches of this patch set fix existing bugs in bcache;
>> the last 2 patches do the real improvement. The order of applying these
>> patches is important: if the last 2 patches are applied first, kernel
>> panics or process hangs will be observed. Therefore I suggest applying
>> the first 8 fixes, then the last 2 patches.
> 

Hi Mike,

> Wow, this is a lot of changes.  :D  Thanks for the fixes.  I've skimmed
> through these (no real review yet) and overall what you're doing looks
> good.  I think I'm going to concentrate on the first several patches in
> the set for now as a strategy to get at least some of this in for the
> first pass of 4.16 and we can figure out what to do from there.
> 

Yes, I am also testing it from my side, and after the v2 patches (I will
rebase the remaining patches against bcache-for-next) our partners will
join the verification too.


> We're up to RC6, so it's getting to be time to get things into next, and
> pretty soon I need to focus on testing for awhile.
> 

Thanks for doing this. When you test the patch set, please apply them
together, because the bugs that the first 8 patches try to fix won't be
triggered without applying the last 2 patches, especially the io_disable
patch.

The io_disable patch fixes the dc->count reference issue, so
cache_set_flush() can be executed and the bugs are triggered.

> [off-topic, I wasn't able to find the time to go through the lock model
> for 4.16 as I had hoped-- hopefully these changes make it to 4.17].

No rush, I am glad to review the patches once they are done.

Thanks for your effort, again :-)

Coly Li

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 00/10] cache device failure handling improvement
  2018-01-04  2:20   ` Coly Li
@ 2018-01-04 17:46     ` Michael Lyle
  2018-01-05  4:04       ` Coly Li
  0 siblings, 1 reply; 35+ messages in thread
From: Michael Lyle @ 2018-01-04 17:46 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, tang.junhui

Coly--

On 01/03/2018 06:20 PM, Coly Li wrote:
> Thanks for doing this. When you test the patch set, please apply them
> together, because the bugs that the first 8 patches try to fix won't be
> triggered without applying the last 2 patches, especially the io_disable
> patch.

Eventually I will test & review the whole set, but I need to concentrate
on testing something for a first submission to 4.16.  So I've taken
1,2,3 and 8 because they're easiest and have some benefit on their own
without 9 & 10.

This means that next pass through the review we'll at least have a few
fewer patches to look at, too.

Thanks,

Mike

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 00/10] cache device failure handling improvement
  2018-01-04 17:46     ` Michael Lyle
@ 2018-01-05  4:04       ` Coly Li
  0 siblings, 0 replies; 35+ messages in thread
From: Coly Li @ 2018-01-05  4:04 UTC (permalink / raw)
  To: Michael Lyle, linux-bcache; +Cc: linux-block, tang.junhui

On 05/01/2018 1:46 AM, Michael Lyle wrote:
> Coly--
> 
> On 01/03/2018 06:20 PM, Coly Li wrote:
>> Thanks for doing this. When you test the patch set, please apply them
>> together, because the bugs that the first 8 patches try to fix won't be
>> triggered without applying the last 2 patches, especially the io_disable
>> patch.
> 
> Eventually I will test & review the whole set, but I need to concentrate
> on testing something for a first submission to 4.16.  So I've taken
> 1,2,3 and 8 because they're easiest and have some benefit on their own
> without 9 & 10.
> 
> This means that next pass through the review we'll at least have a few
> fewer patches to look at, too.

Hi Mike,

Sure, I agree with you. Thanks :-)

Coly Li

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 01/10] bcache: exit bch_writeback_thread() with proper task state
  2018-01-03 17:08   ` Michael Lyle
@ 2018-01-05 17:05     ` Coly Li
  2018-01-05 17:09       ` Michael Lyle
  0 siblings, 1 reply; 35+ messages in thread
From: Coly Li @ 2018-01-05 17:05 UTC (permalink / raw)
  To: Michael Lyle, linux-bcache; +Cc: linux-block, tang.junhui

On 04/01/2018 1:08 AM, Michael Lyle wrote:
> On 01/03/2018 06:03 AM, Coly Li wrote:
>> Kernel thread routine bch_writeback_thread() has the following code block,
>>
>> 452                         set_current_state(TASK_INTERRUPTIBLE);
>> 453
>> 454                         if (kthread_should_stop())
>> 455                                 return 0;
>> 456
>> 457                         schedule();
>> 458                         continue;
>>
>> At line 452, its status is set to TASK_INTERRUPTIBLE, and at line 454 if
>> kthread_should_stop() is true, a "return 0" at line 455 will return to
>> function kernel/kthread.c:kthread() and call do_exit().
>>
>> It is not good to enter do_exit() with task state TASK_INTERRUPTIBLE; in
>> the following code path, might_sleep() is called and a warning message is
>> reported by __might_sleep(): "WARNING: do not call blocking ops when
>> !TASK_RUNNING; state=1 set at [xxxx]".
>>
>> Indeed it does not hurt when a kernel thread exits with TASK_INTERRUPTIBLE
>> state, but this warning message scares users, making them feel there might
>> be something risky with bcache that could hurt their data.
>>
>> In this patch, TASK_INTERRUPTIBLE is set after kthread_should_stop(),
>> so the writeback kernel thread can exit and enter do_exit() with
>> TASK_RUNNING state. The warning message from might_sleep() is avoided.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>

Hi Mike,

I feel uncomfortable with this patch, can you help me to double check ?

In my patch, I set task state to TASK_INTERRUPTIBLE after checking the
following conditions,
469	if (!atomic_read(&dc->has_dirty) ||
470	    (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
471	     !dc->writeback_running)) {

And set_current_state(TASK_INTERRUPTIBLE) is after
up_write(&dc->writeback_lock), so there is a potential race window
between up_write() and set_current_state() in my patch. If
wake_up_process(dc->writeback_thread) is called in this window, and then
the current state is set to TASK_INTERRUPTIBLE again, the writeback
thread will lose a chance to be woken up.

It seems I need to call set_current_state(TASK_INTERRUPTIBLE) before
up_write(&dc->writeback_lock), so that dc->writeback_lock protects
against the race. Or move set_current_state(TASK_INTERRUPTIBLE) to the
location before the condition checks in lines 469-471.

This race window is very small, but I believe it exists; I will fix it in
the v2 patch set.
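
Concretely, the second option would give an ordering like the sketch below
(a sketch against the code quoted above, not the actual v2 patch):

```c
/* Sketch of the proposed ordering in bch_writeback_thread(); not the
 * actual v2 patch.  The task state is set to TASK_INTERRUPTIBLE *before*
 * the condition checks, so a wake_up_process() arriving anywhere between
 * the checks and schedule() puts the task back to TASK_RUNNING and
 * schedule() returns immediately -- no lost wakeup. */
while (!kthread_should_stop()) {
	down_write(&dc->writeback_lock);
	set_current_state(TASK_INTERRUPTIBLE);
	if (!atomic_read(&dc->has_dirty) ||
	    (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
	     !dc->writeback_running)) {
		up_write(&dc->writeback_lock);

		if (kthread_should_stop()) {
			set_current_state(TASK_RUNNING);
			return 0;
		}

		schedule();
		continue;
	}
	__set_current_state(TASK_RUNNING);
	/* ... writeback work ... */
}
```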

Thanks.

Coly Li

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 01/10] bcache: exit bch_writeback_thread() with proper task state
  2018-01-05 17:05     ` Coly Li
@ 2018-01-05 17:09       ` Michael Lyle
  0 siblings, 0 replies; 35+ messages in thread
From: Michael Lyle @ 2018-01-05 17:09 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-bcache, linux-block, Junhui Tang

Coly---

On Fri, Jan 5, 2018 at 9:05 AM, Coly Li <colyli@suse.de> wrote:
> It seems I need to call set_current_state(TASK_INTERRUPTIBLE) before
> up_write(&dc->writeback_lock), then dc->writeback_lock will protect and
> avoid the race. Or move set_current_state(TASK_INTERRUPTIBLE) to the
> location before the condition checks in line 469-471.

I think you're right.  Sorry I missed this in my review.  I will pull
this from my testing/for-416 set.

Mike

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 02/10] bcache: set task properly in allocator_wait()
  2018-01-03 17:09   ` Michael Lyle
@ 2018-01-05 17:11     ` Coly Li
  0 siblings, 0 replies; 35+ messages in thread
From: Coly Li @ 2018-01-05 17:11 UTC (permalink / raw)
  To: Michael Lyle, linux-bcache; +Cc: linux-block, tang.junhui

On 04/01/2018 1:09 AM, Michael Lyle wrote:
> On 01/03/2018 06:03 AM, Coly Li wrote:
>> Kernel thread routine bch_allocator_thread() references macro
>> allocator_wait() to wait for a condition or quit to do_exit()
>> when kthread_should_stop() is true.
>>
>> Macro allocator_wait() has 2 issues in setting task state, let's
>> see its code piece,
>>
>> 284         while (1) {                                                   \
>> 285                 set_current_state(TASK_INTERRUPTIBLE);                \
>> 286                 if (cond)                                             \
>> 287                         break;                                        \
>> 288                                                                       \
>> 289                 mutex_unlock(&(ca)->set->bucket_lock);                \
>> 290                 if (kthread_should_stop())                            \
>> 291                         return 0;                                     \
>> 292                                                                       \
>> 293                 schedule();                                           \
>> 294                 mutex_lock(&(ca)->set->bucket_lock);                  \
>> 295         }                                                             \
>> 296         __set_current_state(TASK_RUNNING);                            \
>>
>> 1) At line 285, task state is set to TASK_INTERRUPTIBLE; if at line 290
>> kthread_should_stop() is true, the kernel thread will terminate and return
>> to kernel/kthread.c:kthread(), which then calls do_exit() with
>> TASK_INTERRUPTIBLE state. This is not a suggested behavior, and a warning
>> message will be reported by might_sleep() in the do_exit() code path:
>> "WARNING: do not call blocking ops when !TASK_RUNNING; state=1 set at [xxxx]".
>>
>> 2) Because task state is set to TASK_INTERRUPTIBLE at line 285, when the
>> while-loop breaks the task state has to be set back to TASK_RUNNING at
>> line 296. Indeed this is unnecessary: if task state is set to
>> TASK_INTERRUPTIBLE just before calling schedule() at line 293, we don't
>> need to set the state back to TASK_RUNNING at line 296 anymore. The reason
>> is that the allocator kthread is only woken up by wake_up_process(), and
>> this routine makes sure the task state of the allocator kthread will be
>> TASK_RUNNING after it returns from schedule() at line 294 (see
>> kernel/sched/core.c:try_to_wake_up() for more detailed information).
>>
>> This patch fixes the above 2 issues by:
>> 1) Setting TASK_INTERRUPTIBLE state just before calling schedule().
>> 2) Removing the now-unnecessary TASK_RUNNING setting at line 296.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>

Hi Mike,

I also feel I made a mistake here. set_current_state(TASK_INTERRUPTIBLE)
must be called before the condition check, otherwise there will be a
race window on task state setting. A correct fix should be setting the
task state to TASK_RUNNING before calling 'return 0', or before calling
mutex_unlock(&(ca)->set->bucket_lock).

When I reviewed the new closure_sync() patches, I realized I had made a
mistake here. It will be fixed in the v2 patch set.
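
Such a corrected allocator_wait() would look roughly like the sketch below
(a sketch, not the actual v2 patch):

```c
/* Sketch of a corrected allocator_wait(); not the actual v2 patch.
 * TASK_INTERRUPTIBLE is set before the condition check (so a wakeup
 * between the check and schedule() is not lost), and the state is put
 * back to TASK_RUNNING on every exit path, including the early return. */
#define allocator_wait(ca, cond)					\
do {									\
	while (1) {							\
		set_current_state(TASK_INTERRUPTIBLE);			\
		if (cond)						\
			break;						\
									\
		mutex_unlock(&(ca)->set->bucket_lock);			\
		if (kthread_should_stop()) {				\
			set_current_state(TASK_RUNNING);		\
			return 0;					\
		}							\
									\
		schedule();						\
		mutex_lock(&(ca)->set->bucket_lock);			\
	}								\
	__set_current_state(TASK_RUNNING);				\
} while (0)
```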

Thanks.

Coly Li

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 01/10] bcache: exit bch_writeback_thread() with proper task state
  2018-01-03 14:03 ` [PATCH v1 01/10] bcache: exit bch_writeback_thread() with proper task state Coly Li
  2018-01-03 17:08   ` Michael Lyle
@ 2018-01-08  7:09   ` Hannes Reinecke
  2018-01-08 13:50     ` Coly Li
  1 sibling, 1 reply; 35+ messages in thread
From: Hannes Reinecke @ 2018-01-08  7:09 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, mlyle, tang.junhui

On 01/03/2018 03:03 PM, Coly Li wrote:
> Kernel thread routine bch_writeback_thread() has the following code block,
> 
> 452                         set_current_state(TASK_INTERRUPTIBLE);
> 453
> 454                         if (kthread_should_stop())
> 455                                 return 0;
> 456
> 457                         schedule();
> 458                         continue;
> 
> At line 452, its status is set to TASK_INTERRUPTIBLE, and at line 454 if
> kthread_should_stop() is true, a "return 0" at line 455 will return to
> function kernel/kthread.c:kthread() and call do_exit().
> 
> It is not good to enter do_exit() with task state TASK_INTERRUPTIBLE; in
> the following code path, might_sleep() is called and a warning message is
> reported by __might_sleep(): "WARNING: do not call blocking ops when
> !TASK_RUNNING; state=1 set at [xxxx]".
> 
> Indeed it does not hurt when a kernel thread exits with TASK_INTERRUPTIBLE
> state, but this warning message scares users, making them feel there might
> be something risky with bcache that could hurt their data.
> 
> In this patch, TASK_INTERRUPTIBLE is set after kthread_should_stop(),
> so the writeback kernel thread can exit and enter do_exit() with
> TASK_RUNNING state. The warning message from might_sleep() is avoided.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>  drivers/md/bcache/writeback.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index 56a37884ca8b..a57149803df6 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -449,11 +449,11 @@ static int bch_writeback_thread(void *arg)
>  		    (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
>  		     !dc->writeback_running)) {
>  			up_write(&dc->writeback_lock);
> -			set_current_state(TASK_INTERRUPTIBLE);
>  
>  			if (kthread_should_stop())
>  				return 0;
>  
> +			set_current_state(TASK_INTERRUPTIBLE);
>  			schedule();
>  			continue;
>  		}
> 
Actually, TASK_INTERRUPTIBLE is okay for kthread_should_stop(); you just
need to set it to 'TASK_RUNNING' before calling 'return 0'.

So I think a fix like

set_current_state(TASK_INTERRUPTIBLE);

if (kthread_should_stop()) {
    set_current_state(TASK_RUNNING);
    return 0;
}

schedule();

would be better.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 02/10] bcache: set task properly in allocator_wait()
  2018-01-03 14:03 ` [PATCH v1 02/10] bcache: set task properly in allocator_wait() Coly Li
  2018-01-03 17:09   ` Michael Lyle
@ 2018-01-08  7:10   ` Hannes Reinecke
  1 sibling, 0 replies; 35+ messages in thread
From: Hannes Reinecke @ 2018-01-08  7:10 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, mlyle, tang.junhui

On 01/03/2018 03:03 PM, Coly Li wrote:
> Kernel thread routine bch_allocator_thread() references macro
> allocator_wait() to wait for a condition or quit to do_exit()
> when kthread_should_stop() is true.
> 
> Macro allocator_wait() has 2 issues in setting task state, let's
> see its code piece,
> 
> 284         while (1) {                                                   \
> 285                 set_current_state(TASK_INTERRUPTIBLE);                \
> 286                 if (cond)                                             \
> 287                         break;                                        \
> 288                                                                       \
> 289                 mutex_unlock(&(ca)->set->bucket_lock);                \
> 290                 if (kthread_should_stop())                            \
> 291                         return 0;                                     \
> 292                                                                       \
> 293                 schedule();                                           \
> 294                 mutex_lock(&(ca)->set->bucket_lock);                  \
> 295         }                                                             \
> 296         __set_current_state(TASK_RUNNING);                            \
> 
> 1) At line 285, task state is set to TASK_INTERRUPTIBLE; if at line 290
> kthread_should_stop() is true, the kernel thread will terminate and return
> to kernel/kthread.c:kthread(), which then calls do_exit() with
> TASK_INTERRUPTIBLE state. This is not a suggested behavior, and a warning
> message will be reported by might_sleep() in the do_exit() code path:
> "WARNING: do not call blocking ops when !TASK_RUNNING; state=1 set at [xxxx]".
> 
> 2) Because task state is set to TASK_INTERRUPTIBLE at line 285, when the
> while-loop breaks the task state has to be set back to TASK_RUNNING at
> line 296. Indeed this is unnecessary: if task state is set to
> TASK_INTERRUPTIBLE just before calling schedule() at line 293, we don't
> need to set the state back to TASK_RUNNING at line 296 anymore. The reason
> is that the allocator kthread is only woken up by wake_up_process(), and
> this routine makes sure the task state of the allocator kthread will be
> TASK_RUNNING after it returns from schedule() at line 294 (see
> kernel/sched/core.c:try_to_wake_up() for more detailed information).
> 
> This patch fixes the above 2 issues by:
> 1) Setting TASK_INTERRUPTIBLE state just before calling schedule().
> 2) Removing the now-unnecessary TASK_RUNNING setting at line 296.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>  drivers/md/bcache/alloc.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
> index a0cc1bc6d884..48c002faf08d 100644
> --- a/drivers/md/bcache/alloc.c
> +++ b/drivers/md/bcache/alloc.c
> @@ -282,7 +282,6 @@ static void invalidate_buckets(struct cache *ca)
>  #define allocator_wait(ca, cond)					\
>  do {									\
>  	while (1) {							\
> -		set_current_state(TASK_INTERRUPTIBLE);			\
>  		if (cond)						\
>  			break;						\
>  									\
> @@ -290,10 +289,10 @@ do {									\
>  		if (kthread_should_stop())				\
>  			return 0;					\
>  									\
> +		set_current_state(TASK_INTERRUPTIBLE);			\
>  		schedule();						\
>  		mutex_lock(&(ca)->set->bucket_lock);			\
>  	}								\
> -	__set_current_state(TASK_RUNNING);				\
>  } while (0)
>  
>  static int bch_allocator_push(struct cache *ca, long bucket)
> 
Same as with the previous patch; why not simply set it to TASK_RUNNING
before 'return 0'?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 03/10] bcache: reduce cache_set devices iteration by devices_max_used
  2018-01-03 14:03 ` [PATCH v1 03/10] bcache: reduce cache_set devices iteration by devices_max_used Coly Li
  2018-01-03 17:11   ` Michael Lyle
@ 2018-01-08  7:12   ` Hannes Reinecke
  1 sibling, 0 replies; 35+ messages in thread
From: Hannes Reinecke @ 2018-01-08  7:12 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, mlyle, tang.junhui

On 01/03/2018 03:03 PM, Coly Li wrote:
> Member devices of struct cache_set is used to reference all bcache
> devices attached to this cache set. It is an array of pointers, and the
> size of devices[] is indicated by member nr_uuids of struct cache_set.
> 
> nr_uuids is calculated in drivers/md/bcache/super.c:bch_cache_set_alloc(),
> 	bucket_bytes(c) / sizeof(struct uuid_entry)
> Bucket size is determined by user space tool "make-bcache", by default it
> is 1024 sectors (defined in bcache-tools/make-bcache.c:main()). So default
> nr_uuids value is 4096 from the above calculation.
> 
> Every time the bcache code iterates the bcache devices of a cache set, all
> 4096 pointers are checked even if only 1 bcache device is attached to the
> cache set; that's an unnecessary waste of time.
> 
> This patch adds a member devices_max_used to struct cache_set. Its value
> is 1 + the maximum used index of devices[] in a cache set. When iterating
> all valid bcache devices of a cache set, using c->devices_max_used in the
> for-loop avoids a lot of useless checking.
> 
> Personally, my motivation for this patch is not performance; I use it
> in bcache debugging, where it helps me narrow down the scope to check
> valid bcache devices of a cache set.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>  drivers/md/bcache/bcache.h    | 1 +
>  drivers/md/bcache/btree.c     | 2 +-
>  drivers/md/bcache/super.c     | 9 ++++++---
>  drivers/md/bcache/writeback.h | 2 +-
>  4 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 843877e017e1..83c569942bd0 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -488,6 +488,7 @@ struct cache_set {
>  	int			caches_loaded;
>  
>  	struct bcache_device	**devices;
> +	unsigned		devices_max_used;
>  	struct list_head	cached_devs;
>  	uint64_t		cached_dev_sectors;
>  	struct closure		caching;
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 81e8dc3dbe5e..bf0d7978bc3d 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -1678,7 +1678,7 @@ static void bch_btree_gc_finish(struct cache_set *c)
>  
>  	/* don't reclaim buckets to which writeback keys point */
>  	rcu_read_lock();
> -	for (i = 0; i < c->nr_uuids; i++) {
> +	for (i = 0; i < c->devices_max_used; i++) {
>  		struct bcache_device *d = c->devices[i];
>  		struct cached_dev *dc;
>  		struct keybuf_key *w, *n;
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index b4d28928dec5..064efd869017 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -721,6 +721,9 @@ static void bcache_device_attach(struct bcache_device *d, struct cache_set *c,
>  	d->c = c;
>  	c->devices[id] = d;
>  
> +	if (id >= c->devices_max_used)
> +		c->devices_max_used = id + 1;
> +
>  	closure_get(&c->caching);
>  }
>  
> @@ -1261,7 +1264,7 @@ static int flash_devs_run(struct cache_set *c)
>  	struct uuid_entry *u;
>  
>  	for (u = c->uuids;
> -	     u < c->uuids + c->nr_uuids && !ret;
> +	     u < c->uuids + c->devices_max_used && !ret;
>  	     u++)
>  		if (UUID_FLASH_ONLY(u))
>  			ret = flash_dev_run(c, u);
> @@ -1427,7 +1430,7 @@ static void __cache_set_unregister(struct closure *cl)
>  
>  	mutex_lock(&bch_register_lock);
>  
> -	for (i = 0; i < c->nr_uuids; i++)
> +	for (i = 0; i < c->devices_max_used; i++)
>  		if (c->devices[i]) {
>  			if (!UUID_FLASH_ONLY(&c->uuids[i]) &&
>  			    test_bit(CACHE_SET_UNREGISTERING, &c->flags)) {
> @@ -1490,7 +1493,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
>  	c->bucket_bits		= ilog2(sb->bucket_size);
>  	c->block_bits		= ilog2(sb->block_size);
>  	c->nr_uuids		= bucket_bytes(c) / sizeof(struct uuid_entry);
> -
> +	c->devices_max_used	= 0;
>  	c->btree_pages		= bucket_pages(c);
>  	if (c->btree_pages > BTREE_MAX_PAGES)
>  		c->btree_pages = max_t(int, c->btree_pages / 4,
> diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
> index a9e3ffb4b03c..1d284f3d0363 100644
> --- a/drivers/md/bcache/writeback.h
> +++ b/drivers/md/bcache/writeback.h
> @@ -21,7 +21,7 @@ static inline uint64_t  bcache_flash_devs_sectors_dirty(struct cache_set *c)
>  
>  	mutex_lock(&bch_register_lock);
>  
> -	for (i = 0; i < c->nr_uuids; i++) {
> +	for (i = 0; i < c->devices_max_used; i++) {
>  		struct bcache_device *d = c->devices[i];
>  
>  		if (!d || !UUID_FLASH_ONLY(&c->uuids[i]))
> 
Hmm. This just cries out for using IDA ...
but maybe for a later patchset.
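
For reference, the IDA approach alluded to above might look roughly like
this (a sketch only, not part of this patch set; in the 4.15 era the API
was ida_simple_get()/ida_simple_remove(), and the d->id field is an
assumption):

```c
#include <linux/idr.h>

/* Sketch only: manage devices[] slot allocation with an IDA instead of
 * scanning for a free index.  Note the IDA only replaces the slot
 * bookkeeping; iterating attached devices would still need a bound
 * such as devices_max_used, or a radix-tree walk. */
static DEFINE_IDA(bcache_device_ida);

/* on attach: allocate the lowest free index below nr_uuids */
int id = ida_simple_get(&bcache_device_ida, 0, c->nr_uuids, GFP_KERNEL);
if (id < 0)
	return id;
d->id = id;
c->devices[id] = d;

/* on detach: return the index */
c->devices[d->id] = NULL;
ida_simple_remove(&bcache_device_ida, d->id);
```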

Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 04/10] bcache: fix cached_dev->count usage for bch_cache_set_error()
  2018-01-03 14:03 ` [PATCH v1 04/10] bcache: fix cached_dev->count usage for bch_cache_set_error() Coly Li
@ 2018-01-08  7:16   ` Hannes Reinecke
  0 siblings, 0 replies; 35+ messages in thread
From: Hannes Reinecke @ 2018-01-08  7:16 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, mlyle, tang.junhui

On 01/03/2018 03:03 PM, Coly Li wrote:
> When bcache metadata I/O fails, bcache will call bch_cache_set_error()
> to retire the whole cache set. The expected behavior to retire a cache
> set is to unregister the cache set, and unregister all backing device
> attached to this cache set, then remove sysfs entries of the cache set
> and all attached backing devices, finally release memory of structs
> cache_set, cache, cached_dev and bcache_device.
> 
> In my testing, when a journal I/O failure is triggered by a disconnected
> cache device, sometimes the cache set cannot be retired: its sysfs entry
> /sys/fs/bcache/<uuid> still exists and the backing device still
> references it. This is not the expected behavior.
> 
> When metadata I/O fails, the call sequence to retire the whole cache set is,
> 	bch_cache_set_error()
> 	bch_cache_set_unregister()
> 	bch_cache_set_stop()
> 	__cache_set_unregister()     <- called as callback by calling
> 					closure_queue(&c->caching)
> 	cache_set_flush()            <- called as a callback when refcount
> 					of cache_set->caching is 0
> 	cache_set_free()             <- called as a callback when refcount
> 					of cache_set->cl is 0
> 	bch_cache_set_release()      <- called as a callback when refcount
> 					of cache_set->kobj is 0
> 
> I find that if the kernel thread bch_writeback_thread() quits its
> while-loop when kthread_should_stop() is true and searched_full_index is
> false, the closure callback cache_set_flush() set by continue_at() will
> never be called. The result is, bcache fails to retire the whole cache set.
> 
> cache_set_flush() will be called when refcount of closure c->caching is 0,
> and in function bcache_device_detach() refcount of closure c->caching is
> released to 0 by closure_put(). In the metadata error code path, function
> bcache_device_detach() is called by cached_dev_detach_finish(). This is a
> callback routine being called when cached_dev->count is 0. This refcount
> is decreased by cached_dev_put().
> 
> The above dependency indicates that cache_set_flush() will be called when
> the refcount of cache_set->cl is 0, and the refcount of cache_set->cl
> reaches 0 only when the refcount of cached_dev->count is 0.
> 
> The reason why sometimes cached_dev->count is not 0 (when metadata I/O
> fails and bch_cache_set_error() is called) is that in
> bch_writeback_thread(), the refcount of cached_dev is not decreased
> properly.
> 
> In bch_writeback_thread(), cached_dev_put() is called only when
> searched_full_index is true and cached_dev->writeback_keys is empty, i.e.
> there is no dirty data on the cache. Most of the time this is correct, but
> when bch_writeback_thread() quits the while-loop while the cache is still
> dirty, the current code forgets to call cached_dev_put() before the kernel
> thread exits. This is why sometimes cache_set_flush() is not executed and
> cache set fails to be retired.
> 
> The reason cached_dev_put() is called in bch_writeback_rate() is that when
> the cache device changes from clean to dirty, cached_dev_get() is called,
> to make sure that during writeback operations neither the backing nor the
> cache device is released.
> 
> Adding following code in bch_writeback_thread() does not work,
>     static int bch_writeback_thread(void *arg)
>     [code snip]
> 
>     +       if (atomic_read(&dc->has_dirty))
>     +               cached_dev_put()
>     +
>         return 0;
>     [code snip]
> 
> because the writeback kernel thread can be woken up and started via sysfs entry:
> 	echo 1 > /sys/block/bcache<N>/bcache/writeback_running
> It is difficult to check whether the backing device is dirty without races
> and extra locking. So the above modification would introduce a potential refcount
> underflow in some conditions.
> 
> The correct fix is, to take cached dev refcount when creating the kernel
> thread, and put it before the kernel thread exits. Then bcache does not
> need to take a cached dev refcount when cache turns from clean to dirty,
> or to put a cached dev refcount when cache turns from dirty to clean. The
> writeback kernel thread is always safe to reference data structures from
> cache set, cache and cached device (because a refcount of cache device is
> taken for it already), and no matter the kernel thread is stopped by I/O
> errors or system reboot, cached_dev->count can always be used correctly.
> 
> The patch is simple, but understanding how it works is quite complicated.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>  drivers/md/bcache/super.c     |  1 -
>  drivers/md/bcache/writeback.c | 10 +++++++---
>  drivers/md/bcache/writeback.h |  2 --
>  3 files changed, 7 insertions(+), 6 deletions(-)
> 
Personally, I'm not a big fan to remove an entire device on error.
How is one supposed to do error recovery here?
Or (gasp) even error _debugging_?
All relevant structures have been removed...

But again, hardly the fault of this patch, and should be fixed in a
later patchset.

Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 05/10] bcache: stop dc->writeback_rate_update if cache set is stopping
  2018-01-03 14:03 ` [PATCH v1 05/10] bcache: stop dc->writeback_rate_update if cache set is stopping Coly Li
@ 2018-01-08  7:22   ` Hannes Reinecke
  2018-01-08 16:01     ` Coly Li
  0 siblings, 1 reply; 35+ messages in thread
From: Hannes Reinecke @ 2018-01-08  7:22 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, mlyle, tang.junhui

On 01/03/2018 03:03 PM, Coly Li wrote:
> struct delayed_work writeback_rate_update in struct cached_dev is a
> delayed worker that calls update_writeback_rate() periodically (the
> interval is defined by dc->writeback_rate_update_seconds).
> 
> When a metadata I/O error happens on the cache device, bcache error handling
> routine bch_cache_set_error() will call bch_cache_set_unregister() to
> retire whole cache set. On the unregister code path, cached_dev_free()
> calls cancel_delayed_work_sync(&dc->writeback_rate_update) to stop this
> delayed work.
> 
> dc->writeback_rate_update is a delayed work unlike the others in bcache:
> in its routine update_writeback_rate(), the delayed work re-arms itself
> after a while. That means even after cancel_delayed_work_sync() returns,
> this delayed work can still be executed several seconds later, as defined
> by dc->writeback_rate_update_seconds.
> 
> The problem is, after cancel_delayed_work_sync() returns, the cache set
> unregister code path will eventually release the memory of struct
> cache_set. If the delayed work is then scheduled to run, its routine
> update_writeback_rate() will access the already-released cache set, and a
> NULL pointer dereference panic is triggered.
> 
> In order to avoid the above problem, this patch checks cache set flags in
> delayed work routine update_writeback_rate(). If flag CACHE_SET_STOPPING
> is set, this routine will quit without re-arming the delayed work. Then the
> NULL pointer dereference panic won't happen after the cache set is released.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>  drivers/md/bcache/writeback.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index 0789a9e18337..745d9b2a326f 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -91,6 +91,11 @@ static void update_writeback_rate(struct work_struct *work)
>  	struct cached_dev *dc = container_of(to_delayed_work(work),
>  					     struct cached_dev,
>  					     writeback_rate_update);
> +	struct cache_set *c = dc->disk.c;
> +
> +	/* quit directly if cache set is stopping */
> +	if (test_bit(CACHE_SET_STOPPING, &c->flags))
> +		return;
>  
>  	down_read(&dc->writeback_lock);
>  
> @@ -100,6 +105,10 @@ static void update_writeback_rate(struct work_struct *work)
>  
>  	up_read(&dc->writeback_lock);
>  
> +	/* do not schedule delayed work if cache set is stopping */
> +	if (test_bit(CACHE_SET_STOPPING, &c->flags))
> +		return;
> +
>  	schedule_delayed_work(&dc->writeback_rate_update,
>  			      dc->writeback_rate_update_seconds * HZ);
>  }
> 
This is actually not quite correct; the function might still be called
after 'struct cached_dev' has been removed.
The correct way of fixing is to either take a reference to struct
cached_dev and release it once 'update_writeback_rate' is finished, or
to call 'cancel_delayed_work_sync()' before deleting struct cached_dev.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 06/10] bcache: stop dc->writeback_rate_update, dc->writeback_thread earlier
  2018-01-03 14:03 ` [PATCH v1 06/10] bcache: stop dc->writeback_rate_update, dc->writeback_thread earlier Coly Li
@ 2018-01-08  7:25   ` Hannes Reinecke
  0 siblings, 0 replies; 35+ messages in thread
From: Hannes Reinecke @ 2018-01-08  7:25 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, mlyle, tang.junhui

On 01/03/2018 03:03 PM, Coly Li wrote:
> Delayed worker dc->writeback_rate_update and kernel thread
> dc->writeback_thread reference cache set data structures in their
> routines. Therefore, before they are stopped, the cache set must not be
> released; otherwise a NULL pointer dereference will be triggered.
> 
> Currently, delayed worker dc->writeback_rate_update and kernel thread
> dc->writeback_thread are stopped in cached_dev_free(). When cache set is
> retiring by too many I/O errors, cached_dev_free() is called when refcount
> of bcache device's closure (disk.cl) reaches 0. In most of cases, last
> refcount of disk.cl is dropped in last line of cached_dev_detach_finish().
> But in cached_dev_detach_finish() before calling closure_put(&dc->disk.cl),
> bcache_device_detach() is called, and inside bcache_device_detach()
> refcount of cache_set->caching is dropped by closure_put(&d->c->caching).
> 
> It is very probable that this is the last refcount of this closure, so routine
> cache_set_flush() will be called (it is set in __cache_set_unregister()),
> and its parent closure cache_set->cl may also drop its last refcount and
> cache_set_free() is called too. In cache_set_free() the last refcount of
> cache_set->kobj is dropped and then bch_cache_set_release() is called. Now
> in bch_cache_set_release(), the memory of struct cache_set is freed.
> 
> bch_cache_set_release() is called before cached_dev_free(), so there is a
> time window after the cache set memory is freed and before
> dc->writeback_thread and dc->writeback_rate_update are stopped. If one of
> them is scheduled to run in that window, a NULL pointer dereference will
> be triggered.
> 
> This patch fixes the above problem by stopping dc->writeback_thread and
> dc->writeback_rate_update earlier in bcache_device_detach() before calling
> closure_put(&d->c->caching). Because cancel_delayed_work_sync() and
> kthread_stop() are synchronized operations, we can make sure cache set
> is available when the delayed work and kthread are stopping.
> 
> Because cached_dev_free() can also be called by writing 1 to sysfs file
> /sys/block/bcache<N>/bcache/stop, this code path may not call
> bcache_device_detach() if d->c is NULL. So stopping dc->writeback_thread
> and dc->writeback_rate_update in cached_dev_free() is still necessary. In
> order to avoid stop them twice, dc->rate_update_canceled is added to
> indicate dc->writeback_rate_update is canceled, and dc->writeback_thread
> is set to NULL to indicate it is stopped.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>  drivers/md/bcache/bcache.h    |  1 +
>  drivers/md/bcache/super.c     | 21 +++++++++++++++++++--
>  drivers/md/bcache/writeback.c |  1 +
>  3 files changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 83c569942bd0..395b87942a2f 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -322,6 +322,7 @@ struct cached_dev {
>  
>  	struct bch_ratelimit	writeback_rate;
>  	struct delayed_work	writeback_rate_update;
> +	bool			rate_update_canceled;
>  
>  	/*
>  	 * Internal to the writeback code, so read_dirty() can keep track of
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 5401d2356aa3..8912be4165c5 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -696,8 +696,20 @@ static void bcache_device_link(struct bcache_device *d, struct cache_set *c,
>  
>  static void bcache_device_detach(struct bcache_device *d)
>  {
> +	struct cached_dev *dc;
> +
>  	lockdep_assert_held(&bch_register_lock);
>  
> +	dc = container_of(d, struct cached_dev, disk);
> +	if (!IS_ERR_OR_NULL(dc->writeback_thread)) {
> +		kthread_stop(dc->writeback_thread);
> +		dc->writeback_thread = NULL;
> +	}
> +	if (!dc->rate_update_canceled) {
> +		cancel_delayed_work_sync(&dc->writeback_rate_update);
> +		dc->rate_update_canceled = true;
> +	}
> +
>  	if (test_bit(BCACHE_DEV_DETACHING, &d->flags)) {
>  		struct uuid_entry *u = d->c->uuids + d->id;
>  
> @@ -1071,9 +1083,14 @@ static void cached_dev_free(struct closure *cl)
>  {
>  	struct cached_dev *dc = container_of(cl, struct cached_dev, disk.cl);
>  
> -	cancel_delayed_work_sync(&dc->writeback_rate_update);
> -	if (!IS_ERR_OR_NULL(dc->writeback_thread))
> +	if (!dc->rate_update_canceled) {
> +		cancel_delayed_work_sync(&dc->writeback_rate_update);
> +		dc->rate_update_canceled = true;
> +	}
> +	if (!IS_ERR_OR_NULL(dc->writeback_thread)) {
>  		kthread_stop(dc->writeback_thread);
> +		dc->writeback_thread = NULL;
> +	}
>  	if (dc->writeback_write_wq)
>  		destroy_workqueue(dc->writeback_write_wq);
>  
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index 745d9b2a326f..ab2ac3d72393 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -548,6 +548,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
>  	dc->writeback_rate_i_term_inverse = 10000;
>  
>  	INIT_DELAYED_WORK(&dc->writeback_rate_update, update_writeback_rate);
> +	dc->rate_update_canceled = false;
>  }
>  
>  int bch_cached_dev_writeback_start(struct cached_dev *dc)
> 
Hehe. Just as I said in the comment to the previous patch.
I would suggest merge this and the previous patch :-)

But in general, I don't think you need 'rate_update_canceled'.
cancel_delayed_work_sync() will be a no-op if no work item has been
scheduled.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 07/10] bcache: set error_limit correctly
  2018-01-03 14:03 ` [PATCH v1 07/10] bcache: set error_limit correctly Coly Li
@ 2018-01-08  7:26   ` Hannes Reinecke
  0 siblings, 0 replies; 35+ messages in thread
From: Hannes Reinecke @ 2018-01-08  7:26 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, mlyle, tang.junhui

On 01/03/2018 03:03 PM, Coly Li wrote:
> Struct cache uses io_errors for two purposes,
> - Error decay: when cache set error_decay is set, io_errors is used to
>   generate a small piece of delay when I/O error happens.
> - I/O errors counter: in order to generate big enough value for error
>   decay, I/O errors counter value is stored by left shifting 20 bits (a.k.a
>   IO_ERROR_SHIFT).
> 
> In function bch_count_io_errors(), if I/O errors counter reaches cache set
> error limit, bch_cache_set_error() will be called to retire the whole cache
> set. But current code is problematic when checking the error limit, see the
> following code piece from bch_count_io_errors(),
> 
>  90   if (error) {
>  91           char buf[BDEVNAME_SIZE];
>  92           unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT,
>  93                                               &ca->io_errors);
>  94           errors >>= IO_ERROR_SHIFT;
>  95
>  96           if (errors < ca->set->error_limit)
>  97                   pr_err("%s: IO error on %s, recovering",
>  98                          bdevname(ca->bdev, buf), m);
>  99           else
> 100                   bch_cache_set_error(ca->set,
> 101                                       "%s: too many IO errors %s",
> 102                                       bdevname(ca->bdev, buf), m);
> 103   }
> 
> At line 94, errors is right-shifted by IO_ERROR_SHIFT bits, so it is the
> real error counter compared at line 96. But ca->set->error_limit is
> initialized with an amplified value in bch_cache_set_alloc(),
> 1545   c->error_limit  = 8 << IO_ERROR_SHIFT;
> 
> It means that by default, in bch_count_io_errors(), bch_cache_set_error()
> won't be called to retire the problematic cache device until 8<<20 errors
> have happened. If the average request size is 64KB, it means bcache won't
> handle the failed device until 512GB of data has been requested. This is
> far too large for an I/O error threshold, so I believe the correct error
> limit should be much smaller.
> 
> This patch sets default cache set error limit to 8, then in
> bch_count_io_errors() when errors counter reaches 8 (if it is default
> value), function bch_cache_set_error() will be called to retire the whole
> cache set. This patch also removes bits shifting when store or show
> io_error_limit value via sysfs interface.
> 
> Nowadays most of SSDs handle internal flash failure automatically by LBA
> address re-indirect mapping. If an I/O error can be observed by upper layer
> code, it will be a notable error because that SSD can not re-indirect
> map the problematic LBA address to an available flash block. This situation
> indicates the whole SSD will be failed very soon. Therefore setting 8 as
> the default io error limit value makes sense, it is enough for most of
> cache devices.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>  drivers/md/bcache/bcache.h | 1 +
>  drivers/md/bcache/super.c  | 2 +-
>  drivers/md/bcache/sysfs.c  | 4 ++--
>  3 files changed, 4 insertions(+), 3 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 08/10] bcache: fix misleading error message in bch_count_io_errors()
  2018-01-03 14:03 ` [PATCH v1 08/10] bcache: fix misleading error message in bch_count_io_errors() Coly Li
  2018-01-03 17:14   ` Michael Lyle
@ 2018-01-08  7:27   ` Hannes Reinecke
  1 sibling, 0 replies; 35+ messages in thread
From: Hannes Reinecke @ 2018-01-08  7:27 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, mlyle, tang.junhui

On 01/03/2018 03:03 PM, Coly Li wrote:
> Bcache only does recoverable I/O for read operations by calling
> cached_dev_read_error(). For write operations there is no I/O recovery for
> failed requests.
> 
> But in bch_count_io_errors(), for both read and write I/Os, before the
> error counter reaches the io error limit, pr_err() always prints "IO
> error on %s, recovering". For write requests this information is
> misleading, because there is no I/O recovery at all.
> 
> This patch adds a parameter 'is_read' to bch_count_io_errors(), and only
> prints "recovering" by pr_err() when the bio direction is READ.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>  drivers/md/bcache/bcache.h    |  2 +-
>  drivers/md/bcache/io.c        | 13 +++++++++----
>  drivers/md/bcache/super.c     |  4 +++-
>  drivers/md/bcache/writeback.c |  4 +++-
>  4 files changed, 16 insertions(+), 7 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 09/10] bcache: add io_disable to struct cache_set
  2018-01-03 14:03 ` [PATCH v1 09/10] bcache: add io_disable to struct cache_set Coly Li
@ 2018-01-08  7:30   ` Hannes Reinecke
  0 siblings, 0 replies; 35+ messages in thread
From: Hannes Reinecke @ 2018-01-08  7:30 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, mlyle, tang.junhui

On 01/03/2018 03:03 PM, Coly Li wrote:
> When too many I/Os fail on the cache device, bch_cache_set_error() is
> called in the error handling code path to retire the whole problematic
> cache set. If new I/O requests continue to come in and take the refcount
> dc->count, the cache set won't be retired immediately, which is a problem.
> 
> Furthermore, several kernel threads and self-armed kernel workers may
> still be running after bch_cache_set_error() is called. It takes quite a
> while to wait for them to stop, or they won't stop at all. They also
> prevent the cache set from being retired.
> 
> The solution in this patch is to add a per-cache-set flag to disable I/O
> requests on this cache and all attached backing devices. Then newly
> coming I/O requests can be rejected in *_make_request() before taking a
> refcount, and kernel threads and self-armed kernel workers can stop very
> quickly when io_disable is true.
> 
> Because bcache also does internal I/Os for writeback, garbage collection,
> bucket allocation, journaling, this kind of I/O should be disabled after
> bch_cache_set_error() is called. So closure_bio_submit() is modified to
> check whether cache_set->io_disable is true. If cache_set->io_disable is
> true, closure_bio_submit() will set bio->bi_status to BLK_STS_IOERR and
> return, generic_make_request() won't be called.
> 
> A sysfs interface is also added for cache_set->io_disable, to read and set
> the io_disable value for debugging. It is helpful for triggering more
> corner-case issues on a failed cache device.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>  drivers/md/bcache/alloc.c     |  2 +-
>  drivers/md/bcache/bcache.h    | 14 ++++++++++++++
>  drivers/md/bcache/btree.c     |  6 ++++--
>  drivers/md/bcache/io.c        |  2 +-
>  drivers/md/bcache/journal.c   |  4 ++--
>  drivers/md/bcache/request.c   | 26 +++++++++++++++++++-------
>  drivers/md/bcache/super.c     |  7 ++++++-
>  drivers/md/bcache/sysfs.c     |  4 ++++
>  drivers/md/bcache/util.h      |  6 ------
>  drivers/md/bcache/writeback.c | 34 ++++++++++++++++++++++------------
>  10 files changed, 73 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
> index 48c002faf08d..3be737582f27 100644
> --- a/drivers/md/bcache/alloc.c
> +++ b/drivers/md/bcache/alloc.c
> @@ -286,7 +286,7 @@ do {									\
>  			break;						\
>  									\
>  		mutex_unlock(&(ca)->set->bucket_lock);			\
> -		if (kthread_should_stop())				\
> +		if (kthread_should_stop() || ca->set->io_disable)	\
>  			return 0;					\
>  									\
>  		set_current_state(TASK_INTERRUPTIBLE);			\
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index c53f312b2216..9c7f9b1cb791 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -481,6 +481,7 @@ struct cache_set {
>  	struct cache_accounting accounting;
>  
>  	unsigned long		flags;
> +	bool			io_disable;
>  
>  	struct cache_sb		sb;
>  
> @@ -853,6 +854,19 @@ static inline void wake_up_allocators(struct cache_set *c)
>  		wake_up_process(ca->alloc_thread);
>  }
>  
> +static inline void closure_bio_submit(struct cache_set *c,
> +				      struct bio *bio,
> +				      struct closure *cl)
> +{
> +	closure_get(cl);
> +	if (unlikely(c->io_disable)) {
> +		bio->bi_status = BLK_STS_IOERR;
> +		bio_endio(bio);
> +		return;
> +	}
> +	generic_make_request(bio);
> +}
> +
>  /* Forward declarations */
>  
>  void bch_count_io_errors(struct cache *, blk_status_t, int, const char *);
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index bf0d7978bc3d..75470cce1177 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -1788,9 +1788,11 @@ static int bch_gc_thread(void *arg)
>  
>  	while (1) {
>  		wait_event_interruptible(c->gc_wait,
> -			   kthread_should_stop() || gc_should_run(c));
> +					kthread_should_stop() ||
> +					c->io_disable ||
> +					gc_should_run(c));
>  
> -		if (kthread_should_stop())
> +		if (kthread_should_stop() || c->io_disable)
>  			break;
>  
>  		set_gc_sectors(c);
> diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
> index a783c5a41ff1..8013ecbcdbda 100644
> --- a/drivers/md/bcache/io.c
> +++ b/drivers/md/bcache/io.c
> @@ -38,7 +38,7 @@ void __bch_submit_bbio(struct bio *bio, struct cache_set *c)
>  	bio_set_dev(bio, PTR_CACHE(c, &b->key, 0)->bdev);
>  
>  	b->submit_time_us = local_clock_us();
> -	closure_bio_submit(bio, bio->bi_private);
> +	closure_bio_submit(c, bio, bio->bi_private);
>  }
>  
>  void bch_submit_bbio(struct bio *bio, struct cache_set *c,
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index a87165c1d8e5..979873641030 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -62,7 +62,7 @@ reread:		left = ca->sb.bucket_size - offset;
>  		bio_set_op_attrs(bio, REQ_OP_READ, 0);
>  		bch_bio_map(bio, data);
>  
> -		closure_bio_submit(bio, &cl);
> +		closure_bio_submit(ca->set, bio, &cl);
>  		closure_sync(&cl);
>  
>  		/* This function could be simpler now since we no longer write
> @@ -653,7 +653,7 @@ static void journal_write_unlocked(struct closure *cl)
>  	spin_unlock(&c->journal.lock);
>  
>  	while ((bio = bio_list_pop(&list)))
> -		closure_bio_submit(bio, cl);
> +		closure_bio_submit(c, bio, cl);
>  
>  	continue_at(cl, journal_write_done, NULL);
>  }
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 643c3021624f..a85d6a605a8e 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -725,7 +725,7 @@ static void cached_dev_read_error(struct closure *cl)
>  
>  		/* XXX: invalidate cache */
>  
> -		closure_bio_submit(bio, cl);
> +		closure_bio_submit(s->iop.c, bio, cl);
>  	}
>  
>  	continue_at(cl, cached_dev_cache_miss_done, NULL);
> @@ -850,7 +850,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
>  	s->cache_miss	= miss;
>  	s->iop.bio	= cache_bio;
>  	bio_get(cache_bio);
> -	closure_bio_submit(cache_bio, &s->cl);
> +	closure_bio_submit(s->iop.c, cache_bio, &s->cl);
>  
>  	return ret;
>  out_put:
> @@ -858,7 +858,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
>  out_submit:
>  	miss->bi_end_io		= request_endio;
>  	miss->bi_private	= &s->cl;
> -	closure_bio_submit(miss, &s->cl);
> +	closure_bio_submit(s->iop.c, miss, &s->cl);
>  	return ret;
>  }
>  
> @@ -923,7 +923,7 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
>  
>  		if ((bio_op(bio) != REQ_OP_DISCARD) ||
>  		    blk_queue_discard(bdev_get_queue(dc->bdev)))
> -			closure_bio_submit(bio, cl);
> +			closure_bio_submit(s->iop.c, bio, cl);
>  	} else if (s->iop.writeback) {
>  		bch_writeback_add(dc);
>  		s->iop.bio = bio;
> @@ -938,12 +938,12 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
>  			flush->bi_private = cl;
>  			flush->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
>  
> -			closure_bio_submit(flush, cl);
> +			closure_bio_submit(s->iop.c, flush, cl);
>  		}
>  	} else {
>  		s->iop.bio = bio_clone_fast(bio, GFP_NOIO, dc->disk.bio_split);
>  
> -		closure_bio_submit(bio, cl);
> +		closure_bio_submit(s->iop.c, bio, cl);
>  	}
>  
>  	closure_call(&s->iop.cl, bch_data_insert, NULL, cl);
> @@ -959,7 +959,7 @@ static void cached_dev_nodata(struct closure *cl)
>  		bch_journal_meta(s->iop.c, cl);
>  
>  	/* If it's a flush, we send the flush to the backing device too */
> -	closure_bio_submit(bio, cl);
> +	closure_bio_submit(s->iop.c, bio, cl);
>  
>  	continue_at(cl, cached_dev_bio_complete, NULL);
>  }
> @@ -974,6 +974,12 @@ static blk_qc_t cached_dev_make_request(struct request_queue *q,
>  	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
>  	int rw = bio_data_dir(bio);
>  
> +	if (unlikely(d->c && d->c->io_disable)) {
> +		bio->bi_status = BLK_STS_IOERR;
> +		bio_endio(bio);
> +		return BLK_QC_T_NONE;
> +	}
> +
>  	generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0);
>  
>  	bio_set_dev(bio, dc->bdev);
> @@ -1089,6 +1095,12 @@ static blk_qc_t flash_dev_make_request(struct request_queue *q,
>  	struct bcache_device *d = bio->bi_disk->private_data;
>  	int rw = bio_data_dir(bio);
>  
> +	if (unlikely(d->c->io_disable)) {
> +		bio->bi_status = BLK_STS_IOERR;
> +		bio_endio(bio);
> +		return BLK_QC_T_NONE;
> +	}
> +
>  	generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0);
>  
>  	s = search_alloc(bio, d);
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index bbe911847eea..7aa76c3e3556 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -521,7 +521,7 @@ static void prio_io(struct cache *ca, uint64_t bucket, int op,
>  	bio_set_op_attrs(bio, op, REQ_SYNC|REQ_META|op_flags);
>  	bch_bio_map(bio, ca->disk_buckets);
>  
> -	closure_bio_submit(bio, &ca->prio);
> +	closure_bio_submit(ca->set, bio, &ca->prio);
>  	closure_sync(cl);
>  }
>  
> @@ -1333,6 +1333,10 @@ bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...)
>  	acquire_console_sem();
>  	*/
>  
> +	c->io_disable = true;
> +	/* make others know io_disable is true earlier */
> +	smp_mb();
> +
>  	printk(KERN_ERR "bcache: error on %pU: ", c->sb.set_uuid);
>  
>  	va_start(args, fmt);
> @@ -1564,6 +1568,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
>  	c->congested_read_threshold_us	= 2000;
>  	c->congested_write_threshold_us	= 20000;
>  	c->error_limit	= DEFAULT_IO_ERROR_LIMIT;
> +	c->io_disable = false;
>  
>  	return c;
>  err:
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index d7ce9a05b304..acce7c82e111 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -92,6 +92,7 @@ read_attribute(partial_stripes_expensive);
>  
>  rw_attribute(synchronous);
>  rw_attribute(journal_delay_ms);
> +rw_attribute(io_disable);
>  rw_attribute(discard);
>  rw_attribute(running);
>  rw_attribute(label);
> @@ -573,6 +574,7 @@ SHOW(__bch_cache_set)
>  	sysfs_printf(gc_always_rewrite,		"%i", c->gc_always_rewrite);
>  	sysfs_printf(btree_shrinker_disabled,	"%i", c->shrinker_disabled);
>  	sysfs_printf(copy_gc_enabled,		"%i", c->copy_gc_enabled);
> +	sysfs_printf(io_disable,		"%i", c->io_disable);
>  
>  	if (attr == &sysfs_bset_tree_stats)
>  		return bch_bset_print_stats(c, buf);
> @@ -663,6 +665,7 @@ STORE(__bch_cache_set)
>  		c->error_decay = strtoul_or_return(buf) / 88;
>  
>  	sysfs_strtoul(journal_delay_ms,		c->journal_delay_ms);
> +	sysfs_strtoul_clamp(io_disable,		c->io_disable, 0, 1);
>  	sysfs_strtoul(verify,			c->verify);
>  	sysfs_strtoul(key_merging_disabled,	c->key_merging_disabled);
>  	sysfs_strtoul(expensive_debug_checks,	c->expensive_debug_checks);
> @@ -744,6 +747,7 @@ static struct attribute *bch_cache_set_internal_files[] = {
>  	&sysfs_gc_always_rewrite,
>  	&sysfs_btree_shrinker_disabled,
>  	&sysfs_copy_gc_enabled,
> +	&sysfs_io_disable,
>  	NULL
>  };
>  KTYPE(bch_cache_set_internal);
> diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
> index ed5e8a412eb8..03e533631798 100644
> --- a/drivers/md/bcache/util.h
> +++ b/drivers/md/bcache/util.h
> @@ -564,12 +564,6 @@ static inline sector_t bdev_sectors(struct block_device *bdev)
>  	return bdev->bd_inode->i_size >> 9;
>  }
>  
> -#define closure_bio_submit(bio, cl)					\
> -do {									\
> -	closure_get(cl);						\
> -	generic_make_request(bio);					\
> -} while (0)
> -
>  uint64_t bch_crc64_update(uint64_t, const void *, size_t);
>  uint64_t bch_crc64(const void *, size_t);
>  
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index e58f9be5ae43..54add41d2569 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -93,8 +93,11 @@ static void update_writeback_rate(struct work_struct *work)
>  					     writeback_rate_update);
>  	struct cache_set *c = dc->disk.c;
>  
> -	/* quit directly if cache set is stopping */
> -	if (test_bit(CACHE_SET_STOPPING, &c->flags))
> +	/*
> +	 * quit directly if cache set is stopping. c->io_disable
> +	 * can be set via sysfs, check it here too.
> +	 */
> +	if (test_bit(CACHE_SET_STOPPING, &c->flags) || c->io_disable)
>  		return;
>  
>  	down_read(&dc->writeback_lock);
> @@ -105,8 +108,11 @@ static void update_writeback_rate(struct work_struct *work)
>  
>  	up_read(&dc->writeback_lock);
>  
> -	/* do not schedule delayed work if cache set is stopping */
> -	if (test_bit(CACHE_SET_STOPPING, &c->flags))
> +	/*
> +	 * do not schedule delayed work if cache set is stopping,
> +	 * c->io_disable can be set via sysfs, check it here too.
> +	 */
> +	if (test_bit(CACHE_SET_STOPPING, &c->flags) || c->io_disable)
>  		return;
>  
>  	schedule_delayed_work(&dc->writeback_rate_update,
> @@ -217,7 +223,7 @@ static void write_dirty(struct closure *cl)
>  		bio_set_dev(&io->bio, io->dc->bdev);
>  		io->bio.bi_end_io	= dirty_endio;
>  
> -		closure_bio_submit(&io->bio, cl);
> +		closure_bio_submit(io->dc->disk.c, &io->bio, cl);
>  	}
>  
>  	continue_at(cl, write_dirty_finish, io->dc->writeback_write_wq);
> @@ -240,7 +246,7 @@ static void read_dirty_submit(struct closure *cl)
>  {
>  	struct dirty_io *io = container_of(cl, struct dirty_io, cl);
>  
> -	closure_bio_submit(&io->bio, cl);
> +	closure_bio_submit(io->dc->disk.c, &io->bio, cl);
>  
>  	continue_at(cl, write_dirty, io->dc->writeback_write_wq);
>  }
> @@ -259,7 +265,7 @@ static void read_dirty(struct cached_dev *dc)
>  	 * mempools.
>  	 */
>  
> -	while (!kthread_should_stop()) {
> +	while (!(kthread_should_stop() || dc->disk.c->io_disable)) {
>  
>  		w = bch_keybuf_next(&dc->writeback_keys);
>  		if (!w)
> @@ -269,7 +275,9 @@ static void read_dirty(struct cached_dev *dc)
>  
>  		if (KEY_START(&w->key) != dc->last_read ||
>  		    jiffies_to_msecs(delay) > 50)
> -			while (!kthread_should_stop() && delay)
> +			while (!kthread_should_stop() &&
> +			       !dc->disk.c->io_disable &&
> +			       delay)
>  				delay = schedule_timeout_interruptible(delay);
>  
>  		dc->last_read	= KEY_OFFSET(&w->key);
> @@ -450,18 +458,19 @@ static bool refill_dirty(struct cached_dev *dc)
>  static int bch_writeback_thread(void *arg)
>  {
>  	struct cached_dev *dc = arg;
> +	struct cache_set *c = dc->disk.c;
>  	bool searched_full_index;
>  
>  	bch_ratelimit_reset(&dc->writeback_rate);
>  
> -	while (!kthread_should_stop()) {
> +	while (!(kthread_should_stop() || c->io_disable)) {
>  		down_write(&dc->writeback_lock);
>  		if (!atomic_read(&dc->has_dirty) ||
>  		    (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
>  		     !dc->writeback_running)) {
>  			up_write(&dc->writeback_lock);
>  
> -			if (kthread_should_stop())
> +			if (kthread_should_stop() || c->io_disable)
>  				break;
>  
>  			set_current_state(TASK_INTERRUPTIBLE);
> @@ -485,8 +494,8 @@ static int bch_writeback_thread(void *arg)
>  		if (searched_full_index) {
>  			unsigned delay = dc->writeback_delay * HZ;
>  
> -			while (delay &&
> -			       !kthread_should_stop() &&
> +			while (delay && !kthread_should_stop() &&
> +			       !c->io_disable &&
>  			       !test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags))
>  				delay = schedule_timeout_interruptible(delay);
>  
> @@ -494,6 +503,7 @@ static int bch_writeback_thread(void *arg)
>  		}
>  	}
>  
> +	dc->writeback_thread = NULL;
>  	cached_dev_put(dc);
>  
>  	return 0;
> 
The last hunk actually belongs to the previous patch; please move it
there.

Otherwise:

Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 10/10] bcache: stop all attached bcache devices for a retired cache set
  2018-01-03 14:03 ` [PATCH v1 10/10] bcache: stop all attached bcache devices for a retired cache set Coly Li
@ 2018-01-08  7:31   ` Hannes Reinecke
  0 siblings, 0 replies; 35+ messages in thread
From: Hannes Reinecke @ 2018-01-08  7:31 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, mlyle, tang.junhui

On 01/03/2018 03:03 PM, Coly Li wrote:
> When there are too many I/O errors on the cache device, current bcache
> code will retire the whole cache set and detach all bcache devices. But
> the detached bcache devices are not stopped, which is problematic when
> bcache is in writeback mode.
> 
> If the retired cache set has dirty data for the backing devices,
> continuing to write to the bcache device will write to the backing
> device directly. If the LBA of a write request has a dirty version
> cached on the cache device, then next time the cache device is
> re-registered and the backing device re-attached to it, the stale dirty
> data on the cache device will be written to the backing device and
> overwrite the newer directly written data. This situation causes serious
> data corruption.
> 
> This patch checks whether cache_set->io_disable is true in
> __cache_set_unregister(). If cache_set->io_disable is true, it means the
> cache set is being unregistered because of too many I/O errors, and all
> attached bcache devices will be stopped as well. If cache_set->io_disable
> is not true, it means __cache_set_unregister() was triggered by writing 1
> to the sysfs file /sys/fs/bcache/<UUID>/bcache/stop. This is an exception
> because users do it explicitly, so this patch keeps the existing behavior
> and does not stop any bcache device.
> 
> Even if the failed cache device has no dirty data, stopping the bcache
> device is still desired behavior for many Ceph and database users: their
> applications will report I/O errors due to the disappeared bcache device,
> and operations staff will know the cache device is broken or disconnected.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>  drivers/md/bcache/super.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 49d6fedf89c3..20a7a6959506 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1458,6 +1458,14 @@ static void __cache_set_unregister(struct closure *cl)
>  				dc = container_of(c->devices[i],
>  						  struct cached_dev, disk);
>  				bch_cached_dev_detach(dc);
> +				/*
> +				 * If we reach here due to too many I/O
> +				 * errors, the bcache device should be
> +				 * stopped too, to keep data consistent
> +				 * on cache and backing devices.
> +				 */
> +				if (c->io_disable)
> +					bcache_device_stop(c->devices[i]);
>  			} else {
>  				bcache_device_stop(c->devices[i]);
>  			}
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 01/10] bcache: exit bch_writeback_thread() with proper task state
  2018-01-08  7:09   ` Hannes Reinecke
@ 2018-01-08 13:50     ` Coly Li
  0 siblings, 0 replies; 35+ messages in thread
From: Coly Li @ 2018-01-08 13:50 UTC (permalink / raw)
  To: Hannes Reinecke, linux-bcache; +Cc: linux-block, mlyle, tang.junhui

On 08/01/2018 3:09 PM, Hannes Reinecke wrote:
> On 01/03/2018 03:03 PM, Coly Li wrote:
>> Kernel thread routine bch_writeback_thread() has the following code block,
>>
>> 452                         set_current_state(TASK_INTERRUPTIBLE);
>> 453
>> 454                         if (kthread_should_stop())
>> 455                                 return 0;
>> 456
>> 457                         schedule();
>> 458                         continue;
>>
>> At line 452, the task state is set to TASK_INTERRUPTIBLE, and at line 454
>> if kthread_should_stop() is true, the "return 0" at line 455 returns to
>> function kernel/kthread.c:kthread(), which then calls do_exit().
>>
>> It is not good to enter do_exit() with task state TASK_INTERRUPTIBLE; in
>> the following code path might_sleep() is called and a warning message is
>> reported by __might_sleep(): "WARNING: do not call blocking ops when
>> !TASK_RUNNING; state=1 set at [xxxx]".
>>
>> Indeed it does not hurt when a kernel thread exits in TASK_INTERRUPTIBLE
>> state, but this warning message scares users, making them feel there
>> might be something risky in bcache that could hurt their data.
>>
>> In this patch, TASK_INTERRUPTIBLE is set after kthread_should_stop(), so
>> the writeback kernel thread can exit and enter do_exit() in the
>> TASK_RUNNING state. The warning message from might_sleep() is gone.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>  drivers/md/bcache/writeback.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
>> index 56a37884ca8b..a57149803df6 100644
>> --- a/drivers/md/bcache/writeback.c
>> +++ b/drivers/md/bcache/writeback.c
>> @@ -449,11 +449,11 @@ static int bch_writeback_thread(void *arg)
>>  		    (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
>>  		     !dc->writeback_running)) {
>>  			up_write(&dc->writeback_lock);
>> -			set_current_state(TASK_INTERRUPTIBLE);
>>  
>>  			if (kthread_should_stop())
>>  				return 0;
>>  
>> +			set_current_state(TASK_INTERRUPTIBLE);
>>  			schedule();
>>  			continue;
>>  		}
>>
> Actually, TASK_INTERRUPTIBLE is okay for kthread_should_stop(); you just
> need to set it to 'TASK_RUNNING' before calling 'return 0'.
> 
> So I think a fix like
> 
> set_current_state(TASK_INTERRUPTIBLE);
> 
> if (kthread_should_stop()) {
>     set_current_state(TASK_RUNNING);
>     return 0;
> }
> 
> schedule();
> 
> would be better.

Hi Hannes,

Yes, this is the same as what the v2 patch does. Thanks.

Coly Li

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v1 05/10] bcache: stop dc->writeback_rate_update if cache set is stopping
  2018-01-08  7:22   ` Hannes Reinecke
@ 2018-01-08 16:01     ` Coly Li
  0 siblings, 0 replies; 35+ messages in thread
From: Coly Li @ 2018-01-08 16:01 UTC (permalink / raw)
  To: Hannes Reinecke, linux-bcache; +Cc: linux-block, mlyle, tang.junhui

On 08/01/2018 3:22 PM, Hannes Reinecke wrote:
> On 01/03/2018 03:03 PM, Coly Li wrote:
>> struct delayed_work writeback_rate_update in struct cached_dev is delayed
>> work that calls update_writeback_rate() periodically (the interval is
>> defined by dc->writeback_rate_update_seconds).
>>
>> When a metadata I/O error happens on the cache device, the bcache error
>> handling routine bch_cache_set_error() calls bch_cache_set_unregister()
>> to retire the whole cache set. On the unregister code path,
>> cached_dev_free() calls cancel_delayed_work_sync(&dc->writeback_rate_update)
>> to stop this delayed work.
>>
>> dc->writeback_rate_update is special among bcache's delayed works. In its
>> routine update_writeback_rate(), the delayed work re-arms itself. That
>> means even after cancel_delayed_work_sync() returns, this delayed work
>> can still be executed several seconds later, as defined by
>> dc->writeback_rate_update_seconds.
>>
>> The problem is, after cancel_delayed_work_sync() returns, the cache set
>> unregister code path will eventually release the memory of struct
>> cache_set. Then when the delayed work is scheduled to run, its routine
>> update_writeback_rate() accesses the already released cache set, and a
>> NULL pointer dereference panic is triggered.
>>
>> In order to avoid the above problem, this patch checks the cache set
>> flags in the delayed work routine update_writeback_rate(). If the flag
>> CACHE_SET_STOPPING is set, the routine quits without re-arming the
>> delayed work. Then the NULL pointer dereference panic won't happen after
>> the cache set is released.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>  drivers/md/bcache/writeback.c | 9 +++++++++
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
>> index 0789a9e18337..745d9b2a326f 100644
>> --- a/drivers/md/bcache/writeback.c
>> +++ b/drivers/md/bcache/writeback.c
>> @@ -91,6 +91,11 @@ static void update_writeback_rate(struct work_struct *work)
>>  	struct cached_dev *dc = container_of(to_delayed_work(work),
>>  					     struct cached_dev,
>>  					     writeback_rate_update);
>> +	struct cache_set *c = dc->disk.c;
>> +
>> +	/* quit directly if cache set is stopping */
>> +	if (test_bit(CACHE_SET_STOPPING, &c->flags))
>> +		return;
>>  
>>  	down_read(&dc->writeback_lock);
>>  
>> @@ -100,6 +105,10 @@ static void update_writeback_rate(struct work_struct *work)
>>  
>>  	up_read(&dc->writeback_lock);
>>  
>> +	/* do not schedule delayed work if cache set is stopping */
>> +	if (test_bit(CACHE_SET_STOPPING, &c->flags))
>> +		return;
>> +
>>  	schedule_delayed_work(&dc->writeback_rate_update,
>>  			      dc->writeback_rate_update_seconds * HZ);
>>  }
>>
> This is actually not quite correct; the function might still be called
> after 'struct cached_dev' has been removed.
> The correct way to fix it is to either take a reference to struct
> cached_dev and release it once 'update_writeback_rate' is finished, or
> to call 'cancel_delayed_work_sync()' before deleting struct cached_dev.

Hi Hannes,

The problem is not cached_dev, it is cache_set. In
__update_writeback_rate(), struct cache_set is referenced. The solution
is similar to what you suggested: call cancel_delayed_work_sync() before
deleting struct cache_set. Junhui posted another patch to fix the
duplicated writeback threads issue, which also fixes this problem.
Therefore just preventing this kworker from re-arming itself should be
enough, and my next patch to stop dc->writeback_thread and
dc->writeback_rate_update earlier can be ignored; Junhui's patch is in
bcache-for-next already.

Thanks.

Coly Li

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2018-01-08 16:01 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-03 14:03 [PATCH v1 00/10] cache device failure handling improvement Coly Li
2018-01-03 14:03 ` [PATCH v1 01/10] bcache: exit bch_writeback_thread() with proper task state Coly Li
2018-01-03 17:08   ` Michael Lyle
2018-01-05 17:05     ` Coly Li
2018-01-05 17:09       ` Michael Lyle
2018-01-08  7:09   ` Hannes Reinecke
2018-01-08 13:50     ` Coly Li
2018-01-03 14:03 ` [PATCH v1 02/10] bcache: set task properly in allocator_wait() Coly Li
2018-01-03 17:09   ` Michael Lyle
2018-01-05 17:11     ` Coly Li
2018-01-08  7:10   ` Hannes Reinecke
2018-01-03 14:03 ` [PATCH v1 03/10] bcache: reduce cache_set devices iteration by devices_max_used Coly Li
2018-01-03 17:11   ` Michael Lyle
2018-01-08  7:12   ` Hannes Reinecke
2018-01-03 14:03 ` [PATCH v1 04/10] bcache: fix cached_dev->count usage for bch_cache_set_error() Coly Li
2018-01-08  7:16   ` Hannes Reinecke
2018-01-03 14:03 ` [PATCH v1 05/10] bcache: stop dc->writeback_rate_update if cache set is stopping Coly Li
2018-01-08  7:22   ` Hannes Reinecke
2018-01-08 16:01     ` Coly Li
2018-01-03 14:03 ` [PATCH v1 06/10] bcache: stop dc->writeback_rate_update, dc->writeback_thread earlier Coly Li
2018-01-08  7:25   ` Hannes Reinecke
2018-01-03 14:03 ` [PATCH v1 07/10] bcache: set error_limit correctly Coly Li
2018-01-08  7:26   ` Hannes Reinecke
2018-01-03 14:03 ` [PATCH v1 08/10] bcache: fix misleading error message in bch_count_io_errors() Coly Li
2018-01-03 17:14   ` Michael Lyle
2018-01-08  7:27   ` Hannes Reinecke
2018-01-03 14:03 ` [PATCH v1 09/10] bcache: add io_disable to struct cache_set Coly Li
2018-01-08  7:30   ` Hannes Reinecke
2018-01-03 14:03 ` [PATCH v1 10/10] bcache: stop all attached bcache devices for a retired cache set Coly Li
2018-01-08  7:31   ` Hannes Reinecke
2018-01-03 17:07 ` [PATCH v1 00/10] cache device failure handling improvement Michael Lyle
2018-01-04  2:20   ` Coly Li
2018-01-04 17:46     ` Michael Lyle
2018-01-05  4:04       ` Coly Li
  -- strict thread matches above, loose matches on Subject: below --
2018-01-03 18:30 [PATCH v1 01/10] bcache: exit bch_writeback_thread() with proper task state tang.junhui

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox