Review of dm-block-manager.c

* Review of dm-block-manager.c
@ 2011-08-01 21:00 Mikulas Patocka
  2011-08-01 21:17 ` Mike Snitzer
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Mikulas Patocka @ 2011-08-01 21:00 UTC (permalink / raw)
  To: Alasdair G. Kergon, Edward Thornber; +Cc: dm-devel

Hi

This is a review of dm-block-manager.c:

char buffer_cache_name[32];
sprintf(bm->buffer_cache_name, "dm_block_buffer-%d:%d",
--- it may not fit in 32 bytes.

__wait_block uses a TASK_INTERRUPTIBLE sleep and returns the error code
-ERESTARTSYS if interrupted by a signal. But this error code is never
checked. Consequently, if the process receives a signal, the signal will
interrupt the wait, and the rest of the buffer management code will
mistakenly think that the awaited event has happened.
This should be replaced by a TASK_UNINTERRUPTIBLE sleep, and the functions
__wait_io, __wait_unlocked, __wait_read_lockable, __wait_all_writes,
__wait_all_io and __wait_clean should be changed to return void (because
their return code is never checked anyway).

The code uses only a spinlock to protect its state. When the spinlock is
dropped (for example during a wait), the buffer may have been reused for
another purpose, but this is not checked. There is a comment "/* FIXME: Can b
have been recycled between io completion and here? */" indicating that Joe
is aware of the problem.

b->write_lock_pending++;
__wait_unlocked(b, &flags);
b->write_lock_pending--;
if (b->where != block)
        goto retry;
If the buffer was reused while we were waiting, b->write_lock_pending was 
already reset to zero (in __transition BS_EMPTY). We decrement it to 
0xffffffff.
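
To make the wrap-around concrete, here is a small userspace model (not the
kernel code). The struct, its field types and the helper names are
assumptions made for illustration; in particular the counter is assumed to
be an unsigned int, which is what the 0xffffffff value above suggests.

```c
#include <limits.h>

/* Hypothetical stand-in for struct dm_block; only the two fields we need. */
struct model_block {
	unsigned where;			/* block number currently cached */
	unsigned write_lock_pending;	/* assumed to be an unsigned counter */
};

/* Models the reset done when a buffer is recycled for another block. */
static void model_recycle(struct model_block *b, unsigned new_where)
{
	b->where = new_where;
	b->write_lock_pending = 0;
}

/*
 * Models the increment / wait / decrement sequence, with a recycle
 * happening at the point where the real code would sleep in
 * __wait_unlocked. Returns the counter value left behind.
 */
static unsigned model_buggy_wait(struct model_block *b, unsigned new_where)
{
	b->write_lock_pending++;
	model_recycle(b, new_where);	/* happens while the lock is dropped */
	b->write_lock_pending--;	/* 0 - 1 wraps around */
	return b->write_lock_pending;
}
```

Calling model_buggy_wait() on a block whose counter starts at zero leaves
the counter at UINT_MAX, which is 0xffffffff where unsigned int is 32 bits.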

Error buffers are linked on error_list, and this list is only flushed in
one specific case (in __wait_flush). If there are many i/o errors (for
example, the disk is unplugged) and __wait_flush is not called
sufficiently often, all existing buffers will be moved to error_list and
the code will then deadlock, as there will be no empty or clean buffers.

The code uses a fixed-size cache of 4096 buffers, and a single process may
hold more than one buffer. This may deadlock under massive
parallelism --- for example, imagine that 4096 processes arrive
concurrently, each requesting two buffers --- each process allocates one
buffer and then a deadlock happens: each process waits for a free buffer
that never comes. (This bug already existed last year when I looked at
the code.)
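
The scenario above can be sketched as a counting model in userspace. This
is only an illustration of the worst-case interleaving described, not the
block manager's allocation code; the function name and the two-phase
structure are assumptions.

```c
/*
 * Worst-case interleaving: every worker takes its first buffer before any
 * worker asks for its second. Returns how many workers can then finish.
 */
static unsigned model_workers_finished(unsigned nr_buffers, unsigned nr_workers)
{
	unsigned free_bufs = nr_buffers;
	unsigned holding_one = 0;	/* workers holding one buffer, wanting a second */
	unsigned finished = 0;

	/* Phase 1: first buffers are handed out until workers or buffers run out. */
	while (holding_one < nr_workers && free_bufs > 0) {
		free_bufs--;
		holding_one++;
	}

	/*
	 * Phase 2: a worker can finish only if a buffer is still free. It takes
	 * that buffer and then releases both, so the free count rises by one.
	 */
	while (holding_one > 0 && free_bufs > 0) {
		free_bufs++;
		holding_one--;
		finished++;
	}

	return finished;
}
```

With 4096 buffers and 4096 workers the model reports that nobody finishes;
with 4095 workers one buffer stays free after phase 1 and everybody
finishes in turn.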

Mikulas

* Re: Review of dm-block-manager.c
  2011-08-01 21:00 Review of dm-block-manager.c Mikulas Patocka
@ 2011-08-01 21:17 ` Mike Snitzer
  2011-08-02  0:15   ` Mike Snitzer
  2011-08-02  0:30   ` Mike Snitzer
  2011-08-02 13:07 ` Joe Thornber
  2011-08-02 14:36 ` [PATCH 1/4] The return code from the various wait functions is never acted upon. So change to uninterrupible waits and change the return type to void Joe Thornber
  2 siblings, 2 replies; 15+ messages in thread
From: Mike Snitzer @ 2011-08-01 21:17 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: device-mapper development, Edward Thornber, Alasdair G. Kergon

On Mon, Aug 01 2011 at  5:00pm -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> Hi
> 
> This is a review of dm-block-manager.c:
> 
> 
> char buffer_cache_name[32];
> sprintf(bm->buffer_cache_name, "dm_block_buffer-%d:%d",
> --- it may not fit in 32 bytes.

It can accommodate nearly 1 trillion DM devices:

dm_block_buffer-253:9999999999

The goal is to move to using a common slab cache per blocksize long
before this limit becomes a concern.

Mike

* Re: Review of dm-block-manager.c
  2011-08-01 21:17 ` Mike Snitzer
@ 2011-08-02  0:15   ` Mike Snitzer
  2011-08-02  0:30   ` Mike Snitzer
  1 sibling, 0 replies; 15+ messages in thread
From: Mike Snitzer @ 2011-08-02  0:15 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: device-mapper development, Edward Thornber, Alasdair G. Kergon

On Mon, Aug 01 2011 at  5:17pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Mon, Aug 01 2011 at  5:00pm -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > Hi
> > 
> > This is a review of dm-block-manager.c:
> > 
> > 
> > char buffer_cache_name[32];
> > sprintf(bm->buffer_cache_name, "dm_block_buffer-%d:%d",
> > --- it may not fit in 32 bytes.
> 
> It can accommodate nearly 1 trillion DM devices:
> 
> dm_block_buffer-253:9999999999

But more importantly, as agk pointed out to me, it will work with
maximum maj=2^12 min=2^20
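
The bound is easy to check in userspace. The helper below is a
hypothetical sketch, not code from the driver; it relies on the standard
C99 behaviour that snprintf() with a NULL buffer and a size of zero
returns the length the formatted string would have.

```c
#include <stdio.h>

/*
 * Hypothetical helper: number of bytes, including the terminating NUL,
 * needed to hold the cache name for a given major:minor pair.
 */
static int cache_name_bytes(int major, int minor)
{
	/* With a NULL buffer and size 0, snprintf only reports the length. */
	return snprintf(NULL, 0, "dm_block_buffer-%d:%d", major, minor) + 1;
}
```

For major 4095 and minor 1048575 (the largest values under 2^12 and 2^20)
this gives 29 bytes, which is inside the 32-byte buffer. Arbitrary int
values could need more (two INT_MIN arguments would need 40 bytes with a
32-bit int), so formatting with snprintf() and
sizeof(bm->buffer_cache_name) would bound the write regardless.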

* Re: Review of dm-block-manager.c
  2011-08-01 21:17 ` Mike Snitzer
  2011-08-02  0:15   ` Mike Snitzer
@ 2011-08-02  0:30   ` Mike Snitzer
  1 sibling, 0 replies; 15+ messages in thread
From: Mike Snitzer @ 2011-08-02  0:30 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: device-mapper development, Edward Thornber, Alasdair G. Kergon

On Mon, Aug 01 2011 at  5:17pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Mon, Aug 01 2011 at  5:00pm -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > Hi
> > 
> > This is a review of dm-block-manager.c:
> > 
> > 
> > char buffer_cache_name[32];
> > sprintf(bm->buffer_cache_name, "dm_block_buffer-%d:%d",
> > --- it may not fit in 32 bytes.
> 
> It can accommodate nearly 1 trillion DM devices:
> 
> dm_block_buffer-253:9999999999

Um, not nearly 1 trillion... no idea how I got that ;)

(it's a moot point anyway).

* Re: Review of dm-block-manager.c
  2011-08-01 21:00 Review of dm-block-manager.c Mikulas Patocka
  2011-08-01 21:17 ` Mike Snitzer
@ 2011-08-02 13:07 ` Joe Thornber
  2011-08-02 13:29   ` Joe Thornber
  2011-08-02 14:36 ` [PATCH 1/4] The return code from the various wait functions is never acted upon. So change to uninterrupible waits and change the return type to void Joe Thornber
  2 siblings, 1 reply; 15+ messages in thread
From: Joe Thornber @ 2011-08-02 13:07 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Alasdair G. Kergon

Hi Mikulas,

Thanks for taking the time to review.

On Mon, Aug 01, 2011 at 05:00:32PM -0400, Mikulas Patocka wrote:
> Hi
> 
> This is a review of dm-block-manager.c:
> 
> 
> char buffer_cache_name[32];
> sprintf(bm->buffer_cache_name, "dm_block_buffer-%d:%d",
> --- it may not fit in 32 bytes.
> 
> 
> __wait_block uses a TASK_INTERRUPTIBLE sleep and returns the error code
> -ERESTARTSYS if interrupted by a signal. But this error code is never
> checked. Consequently, if the process receives a signal, the signal will
> interrupt the wait, and the rest of the buffer management code will
> mistakenly think that the awaited event has happened.
> This should be replaced by a TASK_UNINTERRUPTIBLE sleep, and the functions
> __wait_io, __wait_unlocked, __wait_read_lockable, __wait_all_writes,
> __wait_all_io and __wait_clean should be changed to return void (because
> their return code is never checked anyway).

ok.  Sounds simple.

> The code uses only a spinlock to protect its state. When the spinlock is
> dropped (for example during a wait), the buffer may have been reused for
> another purpose, but this is not checked. There is a comment "/* FIXME: Can b
> have been recycled between io completion and here? */" indicating that Joe
> is aware of the problem.

Yep.

> b->write_lock_pending++;
> __wait_unlocked(b, &flags);
> b->write_lock_pending--;
> if (b->where != block)
>         goto retry;
> If the buffer was reused while we were waiting, b->write_lock_pending was 
> already reset to zero (in __transition BS_EMPTY). We decrement it to 
> 0xffffffff.

Sounds like the same block recycling issue.

> Error buffers are linked on error_list, and this list is only flushed in
> one specific case (in __wait_flush). If there are many i/o errors (for
> example, the disk is unplugged) and __wait_flush is not called
> sufficiently often, all existing buffers will be moved to error_list and
> the code will then deadlock, as there will be no empty or clean buffers.

Ouch.


> The code uses a fixed-size cache of 4096 buffers, and a single process may
> hold more than one buffer. This may deadlock under massive
> parallelism --- for example, imagine that 4096 processes arrive
> concurrently, each requesting two buffers --- each process allocates one
> buffer and then a deadlock happens: each process waits for a free buffer
> that never comes. (This bug already existed last year when I looked at
> the code.)

There isn't that degree of parallelism.  We can't have multiple
threads pulling the cache in different directions, for performance
reasons.  So we have multiple threads that use this in a non-blocking
mode, i.e. they use the try_lock variants, and only get the data if
it's already available in the cache.  If a non-blocking request
fails then it gets passed across to a worker thread to deal with.
This is the only thread that updates the cache.  There is no issue
here.

Fancy digging through the btree next?  Or submitting patches for the
above?

- Joe

* Re: Review of dm-block-manager.c
  2011-08-02 13:07 ` Joe Thornber
@ 2011-08-02 13:29   ` Joe Thornber
  0 siblings, 0 replies; 15+ messages in thread
From: Joe Thornber @ 2011-08-02 13:29 UTC (permalink / raw)
  To: Mikulas Patocka, Alasdair G. Kergon, dm-devel

On Tue, Aug 02, 2011 at 02:07:55PM +0100, Joe Thornber wrote:
> There isn't that degree of parallelism.  We can't have multiple
> threads pulling the cache in different directions, for performance
> reasons.  So we have multiple threads that use this in a non-blocking
> mode, i.e. they use the try_lock variants, and only get the data if
> it's already available in the cache.  If a non-blocking request
> fails then it gets passed across to a worker thread to deal with.
> This is the only thread that updates the cache.  There is no issue
> here.

In fact because we have only a single mutator the block recycling
concerns are not an issue for thinp, though they should still be
fixed.

- Joe

* [PATCH 1/4] The return code from the various wait functions is never acted upon. So change to uninterrupible waits and change the return type to void.
  2011-08-01 21:00 Review of dm-block-manager.c Mikulas Patocka
  2011-08-01 21:17 ` Mike Snitzer
  2011-08-02 13:07 ` Joe Thornber
@ 2011-08-02 14:36 ` Joe Thornber
  2011-08-02 14:36   ` [PATCH 2/4] Fix a race between reading a new block and having it recycled Joe Thornber
                     ` (3 more replies)
  2 siblings, 4 replies; 15+ messages in thread
From: Joe Thornber @ 2011-08-02 14:36 UTC (permalink / raw)
  To: mpatocka; +Cc: dm-devel, Joe Thornber

---
 drivers/md/persistent-data/dm-block-manager.c |   23 +++++++----------------
 1 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/drivers/md/persistent-data/dm-block-manager.c b/drivers/md/persistent-data/dm-block-manager.c
index 4e2f240..c9fb132 100644
--- a/drivers/md/persistent-data/dm-block-manager.c
+++ b/drivers/md/persistent-data/dm-block-manager.c
@@ -371,46 +371,37 @@ static void __clear_errors(struct dm_block_manager *bm)
 
 #define __wait_block(wq, lock, flags, sched_fn, condition)	\
 do {								\
-	int r = 0;						\
-								\
 	DEFINE_WAIT(wait);					\
 	add_wait_queue(wq, &wait);				\
 								\
 	for (;;) {						\
-		prepare_to_wait(wq, &wait, TASK_INTERRUPTIBLE); \
+		prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE); \
 		if (condition)					\
 			break;					\
 								\
 		spin_unlock_irqrestore(lock, flags);		\
-		if (signal_pending(current)) {			\
-			r = -ERESTARTSYS;			\
-			spin_lock_irqsave(lock, flags);		\
-			break;					\
-		}						\
-								\
 		sched_fn();					\
 		spin_lock_irqsave(lock, flags);			\
 	}							\
 								\
 	finish_wait(wq, &wait);					\
-	return r;						\
 } while (0)
 
-static int __wait_io(struct dm_block *b, unsigned long *flags)
+static void __wait_io(struct dm_block *b, unsigned long *flags)
 	__retains(&b->bm->lock)
 {
 	__wait_block(&b->io_q, &b->bm->lock, *flags, io_schedule,
 		     ((b->state != BS_READING) && (b->state != BS_WRITING)));
 }
 
-static int __wait_unlocked(struct dm_block *b, unsigned long *flags)
+static void __wait_unlocked(struct dm_block *b, unsigned long *flags)
 	__retains(&b->bm->lock)
 {
 	__wait_block(&b->io_q, &b->bm->lock, *flags, schedule,
 		     ((b->state == BS_CLEAN) || (b->state == BS_DIRTY)));
 }
 
-static int __wait_read_lockable(struct dm_block *b, unsigned long *flags)
+static void __wait_read_lockable(struct dm_block *b, unsigned long *flags)
 	__retains(&b->bm->lock)
 {
 	__wait_block(&b->io_q, &b->bm->lock, *flags, schedule,
@@ -419,21 +410,21 @@ static int __wait_read_lockable(struct dm_block *b, unsigned long *flags)
 						 b->state == BS_READ_LOCKED)));
 }
 
-static int __wait_all_writes(struct dm_block_manager *bm, unsigned long *flags)
+static void __wait_all_writes(struct dm_block_manager *bm, unsigned long *flags)
 	__retains(&bm->lock)
 {
 	__wait_block(&bm->io_q, &bm->lock, *flags, io_schedule,
 		     !bm->writing_count);
 }
 
-static int __wait_all_io(struct dm_block_manager *bm, unsigned long *flags)
+static void __wait_all_io(struct dm_block_manager *bm, unsigned long *flags)
 	__retains(&bm->lock)
 {
 	__wait_block(&bm->io_q, &bm->lock, *flags, io_schedule,
 		     !bm->writing_count && !bm->reading_count);
 }
 
-static int __wait_clean(struct dm_block_manager *bm, unsigned long *flags)
+static void __wait_clean(struct dm_block_manager *bm, unsigned long *flags)
 	__retains(&bm->lock)
 {
 	__wait_block(&bm->io_q, &bm->lock, *flags, io_schedule,
-- 
1.7.4.1

* [PATCH 2/4] Fix a race between reading a new block and having it recycled.
  2011-08-02 14:36 ` [PATCH 1/4] The return code from the various wait functions is never acted upon. So change to uninterrupible waits and change the return type to void Joe Thornber
@ 2011-08-02 14:36   ` Joe Thornber
  2011-08-03 14:53     ` Mikulas Patocka
  2011-08-02 14:36   ` [PATCH 3/4] [block-manager] remove spurious decrement of write_lock_pending in the case of a recycled block Joe Thornber
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Joe Thornber @ 2011-08-02 14:36 UTC (permalink / raw)
  To: mpatocka; +Cc: dm-devel, Joe Thornber

---
 drivers/md/persistent-data/dm-block-manager.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/md/persistent-data/dm-block-manager.c b/drivers/md/persistent-data/dm-block-manager.c
index c9fb132..b68be88 100644
--- a/drivers/md/persistent-data/dm-block-manager.c
+++ b/drivers/md/persistent-data/dm-block-manager.c
@@ -447,6 +447,7 @@ static int recycle_block(struct dm_block_manager *bm, dm_block_t where,
 	 * Wait for a block to appear on the empty or clean lists.
 	 */
 	spin_lock_irqsave(&bm->lock, flags);
+retry:
 	while (1) {
 		/*
 		 * Once we can lock and do io concurrently then we should
@@ -486,7 +487,11 @@ static int recycle_block(struct dm_block_manager *bm, dm_block_t where,
 		spin_lock_irqsave(&bm->lock, flags);
 		__wait_io(b, &flags);
 
-		/* FIXME: Can b have been recycled between io completion and here? */
+		/*
+		 * Has b been recycled whilst we were unlocked?
+		 */
+		if (b->where != where)
+			goto retry;
 
 		/*
 		 * Did the io succeed?
-- 
1.7.4.1

* [PATCH 3/4] [block-manager] remove spurious decrement of write_lock_pending in the case of a recycled block.
  2011-08-02 14:36 ` [PATCH 1/4] The return code from the various wait functions is never acted upon. So change to uninterrupible waits and change the return type to void Joe Thornber
  2011-08-02 14:36   ` [PATCH 2/4] Fix a race between reading a new block and having it recycled Joe Thornber
@ 2011-08-02 14:36   ` Joe Thornber
  2011-08-03 14:50     ` Mikulas Patocka
  2011-08-02 14:36   ` [PATCH 4/4] Track errored blocks Joe Thornber
  2011-08-03 14:42   ` [PATCH 1/4] The return code from the various wait functions is never acted upon. So change to uninterrupible waits and change the return type to void Mikulas Patocka
  3 siblings, 1 reply; 15+ messages in thread
From: Joe Thornber @ 2011-08-02 14:36 UTC (permalink / raw)
  To: mpatocka; +Cc: dm-devel, Joe Thornber

---
 drivers/md/persistent-data/dm-block-manager.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/drivers/md/persistent-data/dm-block-manager.c b/drivers/md/persistent-data/dm-block-manager.c
index b68be88..d27ab6e 100644
--- a/drivers/md/persistent-data/dm-block-manager.c
+++ b/drivers/md/persistent-data/dm-block-manager.c
@@ -756,9 +756,15 @@ retry:

 				b->write_lock_pending++;
 				__wait_unlocked(b, &flags);
-				b->write_lock_pending--;
 				if (b->where != block)
+					/*
+					 * Recycled blocks have their
+					 * write_lock_pending count reset
+					 * to zero, so no need to undo the
+					 * above increment.
+					 */
 					goto retry;
+				b->write_lock_pending--;
 			}
 			break;
 		}
-- 
1.7.4.1

* [PATCH 4/4] Track errored blocks
  2011-08-02 14:36 ` [PATCH 1/4] The return code from the various wait functions is never acted upon. So change to uninterrupible waits and change the return type to void Joe Thornber
  2011-08-02 14:36   ` [PATCH 2/4] Fix a race between reading a new block and having it recycled Joe Thornber
  2011-08-02 14:36   ` [PATCH 3/4] [block-manager] remove spurious decrement of write_lock_pending in the case of a recycled block Joe Thornber
@ 2011-08-02 14:36   ` Joe Thornber
  2011-08-03 15:00     ` Mikulas Patocka
  2011-08-03 14:42   ` [PATCH 1/4] The return code from the various wait functions is never acted upon. So change to uninterrupible waits and change the return type to void Mikulas Patocka
  3 siblings, 1 reply; 15+ messages in thread
From: Joe Thornber @ 2011-08-02 14:36 UTC (permalink / raw)
  To: mpatocka; +Cc: dm-devel, Joe Thornber

i) Keep track of how many blocks are in the error state.

ii) Make the client pass in the max number of held locks by a thread
at any one time.

iii) Change recycle_block to give up if there are too many in error
state.
---
 drivers/md/dm-thin-metadata.c                 |    2 +-
 drivers/md/persistent-data/dm-block-manager.c |   24 +++++++++++++++++++++---
 drivers/md/persistent-data/dm-block-manager.h |    9 ++++++++-
 3 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index f3b8825..4c9470a 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -561,7 +561,7 @@ struct dm_pool_metadata *dm_pool_metadata_open(struct block_device *bdev,
 	int create;
 
 	bm = dm_block_manager_create(bdev, THIN_METADATA_BLOCK_SIZE,
-				     THIN_METADATA_CACHE_SIZE);
+				     THIN_METADATA_CACHE_SIZE, 3);
 	if (!bm) {
 		DMERR("could not create block manager");
 		return ERR_PTR(-ENOMEM);
diff --git a/drivers/md/persistent-data/dm-block-manager.c b/drivers/md/persistent-data/dm-block-manager.c
index d27ab6e..dd22ef2 100644
--- a/drivers/md/persistent-data/dm-block-manager.c
+++ b/drivers/md/persistent-data/dm-block-manager.c
@@ -58,7 +58,8 @@ struct dm_block {
 
 struct dm_block_manager {
 	struct block_device *bdev;
-	unsigned cache_size;	/* In bytes */
+	unsigned cache_size;
+	unsigned max_held_per_thread;
 	unsigned block_size;	/* In bytes */
 	dm_block_t nr_blocks;
 
@@ -74,6 +75,7 @@ struct dm_block_manager {
 	 */
 	spinlock_t lock;
 
+	unsigned error_count;
 	unsigned available_count;
 	unsigned reading_count;
 	unsigned writing_count;
@@ -161,8 +163,10 @@ static void __transition(struct dm_block *b, enum dm_block_state new_state)
 		b->io_flags = 0;
 		b->validator = NULL;
 
-		if (b->state == BS_ERROR)
+		if (b->state == BS_ERROR) {
+			bm->error_count--;
 			bm->available_count++;
+		}
 		break;
 
 	case BS_CLEAN:
@@ -244,6 +248,7 @@ static void __transition(struct dm_block *b, enum dm_block_state new_state)
 		/* DOT: reading -> error */
 		BUG_ON(!((b->state == BS_WRITING) ||
 			 (b->state == BS_READING)));
+		bm->error_count++;
 		list_add_tail(&b->list, &bm->error_list);
 		break;
 	}
@@ -450,6 +455,16 @@ static int recycle_block(struct dm_block_manager *bm, dm_block_t where,
 retry:
 	while (1) {
 		/*
+		 * The calling thread may hold some locks on blocks, and
+		 * the rest be errored.  In which case we're never going to
+		 * succeed here.
+		 */
+		if (bm->error_count == bm->cache_size - bm->max_held_per_thread) {
+			spin_unlock_irqrestore(&bm->lock, flags);
+			return -ENOMEM;
+		}
+
+		/*
 		 * Once we can lock and do io concurrently then we should
 		 * probably flush at bm->cache_size / 2 and write _all_
 		 * dirty blocks.
@@ -599,7 +614,8 @@ static unsigned calc_hash_size(unsigned cache_size)
 
 struct dm_block_manager *dm_block_manager_create(struct block_device *bdev,
 						 unsigned block_size,
-						 unsigned cache_size)
+						 unsigned cache_size,
+						 unsigned max_held_per_thread)
 {
 	unsigned i;
 	unsigned hash_size = calc_hash_size(cache_size);
@@ -613,6 +629,7 @@ struct dm_block_manager *dm_block_manager_create(struct block_device *bdev,
 
 	bm->bdev = bdev;
 	bm->cache_size = max(MAX_CACHE_SIZE, cache_size);
+	bm->max_held_per_thread = max_held_per_thread;
 	bm->block_size = block_size;
 	bm->nr_blocks = i_size_read(bdev->bd_inode);
 	do_div(bm->nr_blocks, block_size);
@@ -623,6 +640,7 @@ struct dm_block_manager *dm_block_manager_create(struct block_device *bdev,
 	INIT_LIST_HEAD(&bm->clean_list);
 	INIT_LIST_HEAD(&bm->dirty_list);
 	INIT_LIST_HEAD(&bm->error_list);
+	bm->error_count = 0;
 	bm->available_count = 0;
 	bm->reading_count = 0;
 	bm->writing_count = 0;
diff --git a/drivers/md/persistent-data/dm-block-manager.h b/drivers/md/persistent-data/dm-block-manager.h
index ebea2d5..38c49c7 100644
--- a/drivers/md/persistent-data/dm-block-manager.h
+++ b/drivers/md/persistent-data/dm-block-manager.h
@@ -37,7 +37,14 @@ static inline uint32_t dm_block_csum_data(const void *data_le, unsigned length)
 /*----------------------------------------------------------------*/
 
 struct dm_block_manager;
-struct dm_block_manager *dm_block_manager_create(struct block_device *bdev, unsigned block_size, unsigned cache_size);
+
+/*
+ * @max_held_per_thread should be the maximum number of locks, read or
+ * write, that an individual thread holds at any one time.
+ */
+struct dm_block_manager *dm_block_manager_create(
+	struct block_device *bdev, unsigned block_size,
+	unsigned cache_size, unsigned max_held_per_thread);
 void dm_block_manager_destroy(struct dm_block_manager *bm);
 
 unsigned dm_bm_block_size(struct dm_block_manager *bm);
-- 
1.7.4.1
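
For readers following the arithmetic, the new give-up test in
recycle_block can be restated as a standalone userspace predicate. This is
a sketch only: struct bm_model and the function name are hypothetical, the
field names are taken from the patch, and the assert() is an assumption of
the model (the unsigned subtraction only makes sense when
max_held_per_thread does not exceed cache_size), not something the patch
itself checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model holding only the three counters the test uses. */
struct bm_model {
	unsigned cache_size;		/* number of buffers in the cache */
	unsigned max_held_per_thread;	/* locks one thread may hold at once */
	unsigned error_count;		/* buffers currently in the error state */
};

/*
 * True when every buffer the caller might not be holding itself is in the
 * error state, so waiting for an empty or clean buffer cannot succeed.
 */
static bool model_recycle_must_give_up(const struct bm_model *bm)
{
	/* Model-only precondition: avoids unsigned wrap in the subtraction. */
	assert(bm->max_held_per_thread <= bm->cache_size);

	return bm->error_count == bm->cache_size - bm->max_held_per_thread;
}
```

With a cache of 4096 buffers and max_held_per_thread of 3 (the value
passed from dm-thin-metadata.c), the predicate becomes true once 4093
buffers are errored.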

* Re: [PATCH 1/4] The return code from the various wait functions is never acted upon. So change to uninterrupible waits and change the return type to void.
  2011-08-02 14:36 ` [PATCH 1/4] The return code from the various wait functions is never acted upon. So change to uninterrupible waits and change the return type to void Joe Thornber
                     ` (2 preceding siblings ...)
  2011-08-02 14:36   ` [PATCH 4/4] Track errored blocks Joe Thornber
@ 2011-08-03 14:42   ` Mikulas Patocka
  3 siblings, 0 replies; 15+ messages in thread
From: Mikulas Patocka @ 2011-08-03 14:42 UTC (permalink / raw)
  To: Joe Thornber; +Cc: dm-devel

Ack.

Mikulas

On Tue, 2 Aug 2011, Joe Thornber wrote:

> ---
>  drivers/md/persistent-data/dm-block-manager.c |   23 +++++++----------------
>  1 files changed, 7 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/md/persistent-data/dm-block-manager.c b/drivers/md/persistent-data/dm-block-manager.c
> index 4e2f240..c9fb132 100644
> --- a/drivers/md/persistent-data/dm-block-manager.c
> +++ b/drivers/md/persistent-data/dm-block-manager.c
> @@ -371,46 +371,37 @@ static void __clear_errors(struct dm_block_manager *bm)
>  
>  #define __wait_block(wq, lock, flags, sched_fn, condition)	\
>  do {								\
> -	int r = 0;						\
> -								\
>  	DEFINE_WAIT(wait);					\
>  	add_wait_queue(wq, &wait);				\
>  								\
>  	for (;;) {						\
> -		prepare_to_wait(wq, &wait, TASK_INTERRUPTIBLE); \
> +		prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE); \
>  		if (condition)					\
>  			break;					\
>  								\
>  		spin_unlock_irqrestore(lock, flags);		\
> -		if (signal_pending(current)) {			\
> -			r = -ERESTARTSYS;			\
> -			spin_lock_irqsave(lock, flags);		\
> -			break;					\
> -		}						\
> -								\
>  		sched_fn();					\
>  		spin_lock_irqsave(lock, flags);			\
>  	}							\
>  								\
>  	finish_wait(wq, &wait);					\
> -	return r;						\
>  } while (0)
>  
> -static int __wait_io(struct dm_block *b, unsigned long *flags)
> +static void __wait_io(struct dm_block *b, unsigned long *flags)
>  	__retains(&b->bm->lock)
>  {
>  	__wait_block(&b->io_q, &b->bm->lock, *flags, io_schedule,
>  		     ((b->state != BS_READING) && (b->state != BS_WRITING)));
>  }
>  
> -static int __wait_unlocked(struct dm_block *b, unsigned long *flags)
> +static void __wait_unlocked(struct dm_block *b, unsigned long *flags)
>  	__retains(&b->bm->lock)
>  {
>  	__wait_block(&b->io_q, &b->bm->lock, *flags, schedule,
>  		     ((b->state == BS_CLEAN) || (b->state == BS_DIRTY)));
>  }
>  
> -static int __wait_read_lockable(struct dm_block *b, unsigned long *flags)
> +static void __wait_read_lockable(struct dm_block *b, unsigned long *flags)
>  	__retains(&b->bm->lock)
>  {
>  	__wait_block(&b->io_q, &b->bm->lock, *flags, schedule,
> @@ -419,21 +410,21 @@ static int __wait_read_lockable(struct dm_block *b, unsigned long *flags)
>  						 b->state == BS_READ_LOCKED)));
>  }
>  
> -static int __wait_all_writes(struct dm_block_manager *bm, unsigned long *flags)
> +static void __wait_all_writes(struct dm_block_manager *bm, unsigned long *flags)
>  	__retains(&bm->lock)
>  {
>  	__wait_block(&bm->io_q, &bm->lock, *flags, io_schedule,
>  		     !bm->writing_count);
>  }
>  
> -static int __wait_all_io(struct dm_block_manager *bm, unsigned long *flags)
> +static void __wait_all_io(struct dm_block_manager *bm, unsigned long *flags)
>  	__retains(&bm->lock)
>  {
>  	__wait_block(&bm->io_q, &bm->lock, *flags, io_schedule,
>  		     !bm->writing_count && !bm->reading_count);
>  }
>  
> -static int __wait_clean(struct dm_block_manager *bm, unsigned long *flags)
> +static void __wait_clean(struct dm_block_manager *bm, unsigned long *flags)
>  	__retains(&bm->lock)
>  {
>  	__wait_block(&bm->io_q, &bm->lock, *flags, io_schedule,
> -- 
> 1.7.4.1
> 

* Re: [PATCH 3/4] [block-manager] remove spurious decrement of write_lock_pending in the case of a recycled block.
  2011-08-02 14:36   ` [PATCH 3/4] [block-manager] remove spurious decrement of write_lock_pending in the case of a recycled block Joe Thornber
@ 2011-08-03 14:50     ` Mikulas Patocka
  2011-08-04  9:06       ` Joe Thornber
  0 siblings, 1 reply; 15+ messages in thread
From: Mikulas Patocka @ 2011-08-03 14:50 UTC (permalink / raw)
  To: Joe Thornber; +Cc: dm-devel

I think this is not correct.

The problem here is that the block may have been recycled and the newly 
created block may have the same block number as the old block.

If b->where != block, we know that the block was recycled.
If b->where == block, the block may have been recycled or not and we 
don't know.

I think the correct solution could be: make write_lock_pending a boolean
variable, not a counter.

Set write_lock_pending inside __wait_block when we are about to wait (the
block may have been recycled each time we waited, so we need to set it
each time we are going to wait).
Clear write_lock_pending when __wait_unlocked exits.

If we make it a boolean variable, double clearing does no harm.
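
A userspace sketch of the boolean idea, to show why it tolerates a recycle
to the same block number. It is a model under stated assumptions, not a
tested patch: the struct and helper names are hypothetical, and the
set-before-each-sleep loop inside __wait_block is collapsed into a single
pass.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct dm_block with a boolean flag. */
struct model_block {
	unsigned where;			/* block number currently cached */
	bool write_lock_pending;	/* flag instead of a counter */
};

/* Models a recycle: the buffer is reassigned and its flag is cleared. */
static void model_recycle(struct model_block *b, unsigned new_where)
{
	b->where = new_where;
	b->write_lock_pending = false;
}

/*
 * One pass of "wait until unlocked". If recycle_during_wait is set, the
 * buffer is recycled to the same block number while we sleep, which the
 * b->where comparison alone cannot detect. Returns the flag afterwards.
 */
static bool model_wait_unlocked(struct model_block *b, unsigned block,
				bool recycle_during_wait)
{
	b->write_lock_pending = true;	/* set each time we are about to wait */

	if (recycle_during_wait)
		model_recycle(b, block);	/* first clear */

	b->write_lock_pending = false;	/* second clear is harmless */
	return b->write_lock_pending;
}
```

Whether or not the recycle happens, the flag ends up false; there is no
value to underflow.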

Mikulas

On Tue, 2 Aug 2011, Joe Thornber wrote:

> ---
>  drivers/md/persistent-data/dm-block-manager.c |    8 +++++++-
>  1 files changed, 7 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/md/persistent-data/dm-block-manager.c b/drivers/md/persistent-data/dm-block-manager.c
> index b68be88..d27ab6e 100644
> --- a/drivers/md/persistent-data/dm-block-manager.c
> +++ b/drivers/md/persistent-data/dm-block-manager.c
> @@ -756,9 +756,15 @@ retry:
>  
>  				b->write_lock_pending++;
>  				__wait_unlocked(b, &flags);
> -				b->write_lock_pending--;
>  				if (b->where != block)
> +					/*
> +					 * Recycled blocks have their
> +					 * write_lock_pending count reset
> +					 * to zero, so no need to undo the
> +					 * above increment.
> +					 */
>  					goto retry;
> +				b->write_lock_pending--;
>  			}
>  			break;
>  		}
> -- 
> 1.7.4.1
> 

* Re: [PATCH 2/4] Fix a race between reading a new block and having it recycled.
  2011-08-02 14:36   ` [PATCH 2/4] Fix a race between reading a new block and having it recycled Joe Thornber
@ 2011-08-03 14:53     ` Mikulas Patocka
  0 siblings, 0 replies; 15+ messages in thread
From: Mikulas Patocka @ 2011-08-03 14:53 UTC (permalink / raw)
  To: Joe Thornber; +Cc: dm-devel

Ack.

Mikulas

On Tue, 2 Aug 2011, Joe Thornber wrote:

> ---
>  drivers/md/persistent-data/dm-block-manager.c |    7 ++++++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/md/persistent-data/dm-block-manager.c b/drivers/md/persistent-data/dm-block-manager.c
> index c9fb132..b68be88 100644
> --- a/drivers/md/persistent-data/dm-block-manager.c
> +++ b/drivers/md/persistent-data/dm-block-manager.c
> @@ -447,6 +447,7 @@ static int recycle_block(struct dm_block_manager *bm, dm_block_t where,
>  	 * Wait for a block to appear on the empty or clean lists.
>  	 */
>  	spin_lock_irqsave(&bm->lock, flags);
> +retry:
>  	while (1) {
>  		/*
>  		 * Once we can lock and do io concurrently then we should
> @@ -486,7 +487,11 @@ static int recycle_block(struct dm_block_manager *bm, dm_block_t where,
>  		spin_lock_irqsave(&bm->lock, flags);
>  		__wait_io(b, &flags);
>  
> -		/* FIXME: Can b have been recycled between io completion and here? */
> +		/*
> +		 * Has b been recycled whilst we were unlocked?
> +		 */
> +		if (b->where != where)
> +			goto retry;
>  
>  		/*
>  		 * Did the io succeed?
> -- 
> 1.7.4.1
> 

* Re: [PATCH 4/4] Track errored blocks
  2011-08-02 14:36   ` [PATCH 4/4] Track errored blocks Joe Thornber
@ 2011-08-03 15:00     ` Mikulas Patocka
  0 siblings, 0 replies; 15+ messages in thread
From: Mikulas Patocka @ 2011-08-03 15:00 UTC (permalink / raw)
  To: Joe Thornber; +Cc: dm-devel

Ack.

BTW. I found another quirk in recycle_block:
if (b->state == BS_ERROR) {
	__transition(b, BS_EMPTY);
	r = -EIO;
}
if (b->validator) {
	r = b->validator->check(b->validator, b, bm->block_size);
	...
}

--- I think erroneous buffers should not be validated; change it to
"if (!r && b->validator)"
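
The effect of the guard can be seen in a small userspace model. All names
here are hypothetical, the check callback stands in for
b->validator->check, and the model deliberately leaves the callback in
place after the state change, which may differ from what __transition does
in the real code.

```c
#include <errno.h>
#include <stddef.h>

enum model_state { MODEL_CLEAN, MODEL_ERROR, MODEL_EMPTY };

struct model_block;
typedef int (*model_check_fn)(struct model_block *b);

/* Hypothetical stand-in for struct dm_block. */
struct model_block {
	enum model_state state;
	model_check_fn check;	/* stands in for b->validator->check */
	unsigned check_calls;	/* how many times the check ran */
};

/* A validator that always passes and counts its invocations. */
static int model_count_check(struct model_block *b)
{
	b->check_calls++;
	return 0;
}

/* Models the tail of recycle_block with the proposed "!r &&" guard. */
static int model_finish_read(struct model_block *b)
{
	int r = 0;

	if (b->state == MODEL_ERROR) {
		b->state = MODEL_EMPTY;
		r = -EIO;
	}

	/* Skip validation when the read has already failed. */
	if (!r && b->check)
		r = b->check(b);

	return r;
}
```

For a block in the error state the function returns -EIO without calling
the check, so the i/o error is not overwritten by the validator's result.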

Mikulas

On Tue, 2 Aug 2011, Joe Thornber wrote:

> i) Keep track of how many blocks are in the error state.
> 
> ii) Make the client pass in the max number of held locks by a thread
> at any one time.
> 
> iii) Change recycle_block to give up if there are too many in error
> state.
> ---
>  drivers/md/dm-thin-metadata.c                 |    2 +-
>  drivers/md/persistent-data/dm-block-manager.c |   24 +++++++++++++++++++++---
>  drivers/md/persistent-data/dm-block-manager.h |    9 ++++++++-
>  3 files changed, 30 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
> index f3b8825..4c9470a 100644
> --- a/drivers/md/dm-thin-metadata.c
> +++ b/drivers/md/dm-thin-metadata.c
> @@ -561,7 +561,7 @@ struct dm_pool_metadata *dm_pool_metadata_open(struct block_device *bdev,
>  	int create;
>  
>  	bm = dm_block_manager_create(bdev, THIN_METADATA_BLOCK_SIZE,
> -				     THIN_METADATA_CACHE_SIZE);
> +				     THIN_METADATA_CACHE_SIZE, 3);
>  	if (!bm) {
>  		DMERR("could not create block manager");
>  		return ERR_PTR(-ENOMEM);
> diff --git a/drivers/md/persistent-data/dm-block-manager.c b/drivers/md/persistent-data/dm-block-manager.c
> index d27ab6e..dd22ef2 100644
> --- a/drivers/md/persistent-data/dm-block-manager.c
> +++ b/drivers/md/persistent-data/dm-block-manager.c
> @@ -58,7 +58,8 @@ struct dm_block {
>  
>  struct dm_block_manager {
>  	struct block_device *bdev;
> -	unsigned cache_size;	/* In bytes */
> +	unsigned cache_size;
> +	unsigned max_held_per_thread;
>  	unsigned block_size;	/* In bytes */
>  	dm_block_t nr_blocks;
>  
> @@ -74,6 +75,7 @@ struct dm_block_manager {
>  	 */
>  	spinlock_t lock;
>  
> +	unsigned error_count;
>  	unsigned available_count;
>  	unsigned reading_count;
>  	unsigned writing_count;
> @@ -161,8 +163,10 @@ static void __transition(struct dm_block *b, enum dm_block_state new_state)
>  		b->io_flags = 0;
>  		b->validator = NULL;
>  
> -		if (b->state == BS_ERROR)
> +		if (b->state == BS_ERROR) {
> +			bm->error_count--;
>  			bm->available_count++;
> +		}
>  		break;
>  
>  	case BS_CLEAN:
> @@ -244,6 +248,7 @@ static void __transition(struct dm_block *b, enum dm_block_state new_state)
>  		/* DOT: reading -> error */
>  		BUG_ON(!((b->state == BS_WRITING) ||
>  			 (b->state == BS_READING)));
> +		bm->error_count++;
>  		list_add_tail(&b->list, &bm->error_list);
>  		break;
>  	}
> @@ -450,6 +455,16 @@ static int recycle_block(struct dm_block_manager *bm, dm_block_t where,
>  retry:
>  	while (1) {
>  		/*
> +		 * The calling thread may hold some locks on blocks, and
> +		 * the rest be errored.  In which case we're never going to
> +		 * succeed here.
> +		 */
> +		if (bm->error_count == bm->cache_size - bm->max_held_per_thread) {
> +			spin_unlock_irqrestore(&bm->lock, flags);
> +			return -ENOMEM;
> +		}
> +
> +		/*
>  		 * Once we can lock and do io concurrently then we should
>  		 * probably flush at bm->cache_size / 2 and write _all_
>  		 * dirty blocks.
> @@ -599,7 +614,8 @@ static unsigned calc_hash_size(unsigned cache_size)
>  
>  struct dm_block_manager *dm_block_manager_create(struct block_device *bdev,
>  						 unsigned block_size,
> -						 unsigned cache_size)
> +						 unsigned cache_size,
> +						 unsigned max_held_per_thread)
>  {
>  	unsigned i;
>  	unsigned hash_size = calc_hash_size(cache_size);
> @@ -613,6 +629,7 @@ struct dm_block_manager *dm_block_manager_create(struct block_device *bdev,
>  
>  	bm->bdev = bdev;
>  	bm->cache_size = max(MAX_CACHE_SIZE, cache_size);
> +	bm->max_held_per_thread = max_held_per_thread;
>  	bm->block_size = block_size;
>  	bm->nr_blocks = i_size_read(bdev->bd_inode);
>  	do_div(bm->nr_blocks, block_size);
> @@ -623,6 +640,7 @@ struct dm_block_manager *dm_block_manager_create(struct block_device *bdev,
>  	INIT_LIST_HEAD(&bm->clean_list);
>  	INIT_LIST_HEAD(&bm->dirty_list);
>  	INIT_LIST_HEAD(&bm->error_list);
> +	bm->error_count = 0;
>  	bm->available_count = 0;
>  	bm->reading_count = 0;
>  	bm->writing_count = 0;
> diff --git a/drivers/md/persistent-data/dm-block-manager.h b/drivers/md/persistent-data/dm-block-manager.h
> index ebea2d5..38c49c7 100644
> --- a/drivers/md/persistent-data/dm-block-manager.h
> +++ b/drivers/md/persistent-data/dm-block-manager.h
> @@ -37,7 +37,14 @@ static inline uint32_t dm_block_csum_data(const void *data_le, unsigned length)
>  /*----------------------------------------------------------------*/
>  
>  struct dm_block_manager;
> -struct dm_block_manager *dm_block_manager_create(struct block_device *bdev, unsigned block_size, unsigned cache_size);
> +
> +/*
> + * @max_held_per_thread should be the maximum number of locks, read or
> + * write, that an individual thread holds at any one time.
> + */
> +struct dm_block_manager *dm_block_manager_create(
> +	struct block_device *bdev, unsigned block_size,
> +	unsigned cache_size, unsigned max_held_per_thread);
>  void dm_block_manager_destroy(struct dm_block_manager *bm);
>  
>  unsigned dm_bm_block_size(struct dm_block_manager *bm);
> -- 
> 1.7.4.1
> 
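
The accounting in the patch above can be modelled in user space roughly as
follows. This is a sketch under assumptions: struct bm_counts, mark_error(),
clear_error() and can_recycle() are hypothetical names, and the test uses
">=" with an underflow guard rather than the "==" comparison in the patch,
purely as a defensive variation.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical subset of the block manager state, counters only. */
struct bm_counts {
	unsigned cache_size;		/* number of buffers in the cache */
	unsigned max_held_per_thread;	/* locks one thread may hold at once */
	unsigned error_count;		/* buffers currently in the error state */
};

/* Mirrors the increment added on the transition into the error state. */
static void mark_error(struct bm_counts *bm)
{
	bm->error_count++;
}

/* Mirrors the decrement added on the error -> empty transition. */
static void clear_error(struct bm_counts *bm)
{
	bm->error_count--;
}

/*
 * Returns 0 if recycling can still make progress, -ENOMEM if every buffer
 * the caller does not already hold is errored.
 */
static int can_recycle(const struct bm_counts *bm)
{
	unsigned usable = 0;

	/* Guard against unsigned underflow when the cache is very small. */
	if (bm->cache_size > bm->max_held_per_thread)
		usable = bm->cache_size - bm->max_held_per_thread;

	if (bm->error_count >= usable)
		return -ENOMEM;

	return 0;
}
```

With a cache of 8 buffers and max_held_per_thread of 3, the sketch gives up
once 5 buffers are errored and resumes as soon as one of them is cleared.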


* Re: [PATCH 3/4] [block-manager] remove spurious decrement of write_lock_pending in the case of a recycled block.
  2011-08-03 14:50     ` Mikulas Patocka
@ 2011-08-04  9:06       ` Joe Thornber
  0 siblings, 0 replies; 15+ messages in thread
From: Joe Thornber @ 2011-08-04  9:06 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Joe Thornber, dm-devel

On Wed, Aug 03, 2011 at 10:50:33AM -0400, Mikulas Patocka wrote:
> I think this is not correct.

I had a similar thought last night, however my concern was the
previous 'read' patch that you've acked.  I'll go back and look at
these today.

- Joe

> 
> The problem here is that the block may have been recycled and the newly 
> created block may have the same block number as the old block.
> 
> If b->where != block, we know that the block was recycled.
> If b->where == block, the block may have been recycled or not and we 
> don't know.
> 
> 
> I think the correct solution could be: make write_lock_pending a boolean 
> variable, not a counter.
> 
> Set write_lock_pending inside __wait_block when we are about to wait (the 
> block may have been recycled each time we waited, so we need to set it 
> each time we are going to wait).
> Clear write_lock_pending when __wait_unlocked exits.
> 
> If we make it a boolean variable, double clearing does no harm.
> 
> Mikulas
> 
> On Tue, 2 Aug 2011, Joe Thornber wrote:
> 
> > ---
> >  drivers/md/persistent-data/dm-block-manager.c |    8 +++++++-
> >  1 files changed, 7 insertions(+), 1 deletions(-)
> > 
> > diff --git a/drivers/md/persistent-data/dm-block-manager.c b/drivers/md/persistent-data/dm-block-manager.c
> > index b68be88..d27ab6e 100644
> > --- a/drivers/md/persistent-data/dm-block-manager.c
> > +++ b/drivers/md/persistent-data/dm-block-manager.c
> > @@ -756,9 +756,15 @@ retry:
> >  
> >  				b->write_lock_pending++;
> >  				__wait_unlocked(b, &flags);
> > -				b->write_lock_pending--;
> >  				if (b->where != block)
> > +					/*
> > +					 * Recycled blocks have their
> > +					 * write_lock_pending count reset
> > +					 * to zero, so no need to undo the
> > +					 * above increment.
> > +					 */
> >  					goto retry;
> > +				b->write_lock_pending--;
> >  			}
> >  			break;
> >  		}
> > -- 
> > 1.7.4.1
> > 
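
The boolean-flag idea quoted above can be sketched in user space as below.
The struct and the helpers recycle(), begin_wait() and end_wait() are
hypothetical names used only for illustration, and the sketch models a single
waiter with no locking. The point is that clearing a flag is idempotent,
whereas decrementing an unsigned counter that recycling has already reset
to zero wraps around.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified buffer: just the fields the discussion needs. */
struct block {
	unsigned long where;		/* block number this buffer holds */
	bool write_lock_pending;	/* a writer is waiting for the lock */
};

/* Stand-in for recycling: the buffer is reassigned and its flag reset. */
static void recycle(struct block *b, unsigned long new_where)
{
	b->where = new_where;
	b->write_lock_pending = false;
}

/* Called each time the writer is about to sleep. */
static void begin_wait(struct block *b)
{
	b->write_lock_pending = true;
}

/*
 * Called when the wait returns.  Safe even if the buffer was recycled in
 * the meantime, because clearing an already-clear flag changes nothing.
 */
static void end_wait(struct block *b)
{
	b->write_lock_pending = false;
}
```

This holds even when the recycled buffer ends up with the same block number
as before, which is the case the b->where comparison cannot detect.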


end of thread, other threads:[~2011-08-04  9:06 UTC | newest]

Thread overview: 15+ messages
2011-08-01 21:00 Review of dm-block-manager.c Mikulas Patocka
2011-08-01 21:17 ` Mike Snitzer
2011-08-02  0:15   ` Mike Snitzer
2011-08-02  0:30   ` Mike Snitzer
2011-08-02 13:07 ` Joe Thornber
2011-08-02 13:29   ` Joe Thornber
2011-08-02 14:36 ` [PATCH 1/4] The return code from the various wait functions is never acted upon. So change to uninterrupible waits and change the return type to void Joe Thornber
2011-08-02 14:36   ` [PATCH 2/4] Fix a race between reading a new block and having it recycled Joe Thornber
2011-08-03 14:53     ` Mikulas Patocka
2011-08-02 14:36   ` [PATCH 3/4] [block-manager] remove spurious decrement of write_lock_pending in the case of a recycled block Joe Thornber
2011-08-03 14:50     ` Mikulas Patocka
2011-08-04  9:06       ` Joe Thornber
2011-08-02 14:36   ` [PATCH 4/4] Track errored blocks Joe Thornber
2011-08-03 15:00     ` Mikulas Patocka
2011-08-03 14:42   ` [PATCH 1/4] The return code from the various wait functions is never acted upon. So change to uninterrupible waits and change the return type to void Mikulas Patocka
