Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* [PATCH 2/2] block: trace completion of all bios.
From: NeilBrown @ 2017-04-07  1:10 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-raid, Mike Snitzer, Christoph Hellwig, Ming Lei,
	linux-kernel, linux-block, dm-devel, Shaohua Li, Alasdair Kergon
In-Reply-To: <149152735878.17489.16036848644242802943.stgit@noble>

Currently only dm and md/raid5 bios trigger
trace_block_bio_complete().  Now that we have bio_chain() and
bio_inc_remaining(), it is not possible, in general, for a driver to
know when the bio is really complete.  Only bio_endio() knows that.

So move the trace_block_bio_complete() call to bio_endio().

Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.

There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
   trace event at the 'request' level, there is no point generating
   one at the bio level too.  In this case the bi_sector and bi_size
   will have changed, so the bio level event would be wrong

2/ If the bio hasn't actually been queued yet, but is being aborted
   early, then a trace event could be confusing.  Some filesystems
   call bio_endio() but do not want tracing.

3/ The bio_integrity code interposes itself by replacing bi_end_io,
   then restoring it and calling bio_endio() again.  This would produce
   two identical trace events if left like that.

To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
produce the trace event when this is set.
We address point 1 above by clearing the flag in blk_update_request().
We address point 2 above by only setting the flag when
generic_make_request() is called.
We address point 3 above by clearing the flag after generating a
completion event.

When bio_split() is used on a bio, particularly in blk_queue_split(),
there is an extra complication.  A new bio is split off the front, and
may be handle directly without going through generic_make_request().
The old bio, which has been advanced, is passed to
generic_make_request(), so it will trigger a trace event a second
time.
Probably the best result when a split happens is to see a single
'queue' event for the whole bio, then multiple 'complete' events - one
for each component.  To achieve this was can:
- copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
- avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
This way, the split-off bio won't create a queue event, the original
won't either even if it re-submitted to generic_make_request(),
but both will produce completion events, each for their own range.

So if generic_make_request() is called (which generates a QUEUED
event), then bi_endio() will create a single COMPLETE event for each
range that the bio is split into, unless the driver has explicitly
requested it not to.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/bio.c               |   13 +++++++++++++
 block/blk-core.c          |   10 +++++++++-
 drivers/md/dm.c           |    1 -
 drivers/md/raid5.c        |    2 --
 include/linux/blk_types.h |    2 ++
 5 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 12c2837c4277..3db98fc3ee87 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1789,6 +1789,11 @@ static inline bool bio_remaining_done(struct bio *bio)
  *   bio_endio() will end I/O on the whole bio. bio_endio() is the preferred
  *   way to end I/O on a bio. No one should call bi_end_io() directly on a
  *   bio unless they own it and thus know that it has an end_io function.
+ *
+ *   bio_endio() can be called several times on a bio that has been chained
+ *   using bio_chain().  The ->bi_end_io() function will only be called the
+ *   last time.  At this point the BLK_TA_COMPLETE tracing event will be
+ *   generated if BIO_TRACE_COMPLETION is set.
  **/
 void bio_endio(struct bio *bio)
 {
@@ -1809,6 +1814,11 @@ void bio_endio(struct bio *bio)
 		goto again;
 	}
 
+	if (bio->bi_bdev && bio_flagged(bio, BIO_TRACE_COMPLETION)) {
+		trace_block_bio_complete(bdev_get_queue(bio->bi_bdev),
+					 bio, bio->bi_error);
+		bio_clear_flag(bio, BIO_TRACE_COMPLETION);
+	}
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio);
 }
@@ -1847,6 +1857,9 @@ struct bio *bio_split(struct bio *bio, int sectors,
 
 	bio_advance(bio, split->bi_iter.bi_size);
 
+	if (bio_flagged(bio, BIO_TRACE_COMPLETION))
+		bio_set_flag(bio, BIO_TRACE_COMPLETION);
+
 	return split;
 }
 EXPORT_SYMBOL(bio_split);
diff --git a/block/blk-core.c b/block/blk-core.c
index d772c221cc17..2a7993063a7e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1936,7 +1936,13 @@ generic_make_request_checks(struct bio *bio)
 	if (!blkcg_bio_issue_check(q, bio))
 		return false;
 
-	trace_block_bio_queue(q, bio);
+	if (!bio_flagged(bio, BIO_TRACE_COMPLETION)) {
+		trace_block_bio_queue(q, bio);
+		/* Now that enqueuing has been traced, we need to trace
+		 * completion as well.
+		 */
+		bio_set_flag(bio, BIO_TRACE_COMPLETION);
+	}
 	return true;
 
 not_supported:
@@ -2601,6 +2607,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
 		if (bio_bytes == bio->bi_iter.bi_size)
 			req->bio = bio->bi_next;
 
+		/* Completion has already been traced */
+		bio_clear_flag(bio, BIO_TRACE_COMPLETION);
 		req_bio_endio(req, bio, bio_bytes, error);
 
 		total_bytes += bio_bytes;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dfb75979e455..cd93a3b9ceca 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -810,7 +810,6 @@ static void dec_pending(struct dm_io *io, int error)
 			queue_io(md, bio);
 		} else {
 			/* done with normal IO or empty flush */
-			trace_block_bio_complete(md->queue, bio, io_error);
 			bio->bi_error = io_error;
 			bio_endio(bio);
 		}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d8bd25f235dc..14557259dd69 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5138,8 +5138,6 @@ static void raid5_align_endio(struct bio *bi)
 	rdev_dec_pending(rdev, conf->mddev);
 
 	if (!error) {
-		trace_block_bio_complete(bdev_get_queue(raid_bi->bi_bdev),
-					 raid_bi, 0);
 		bio_endio(raid_bi);
 		if (atomic_dec_and_test(&conf->active_aligned_reads))
 			wake_up(&conf->wait_for_quiescent);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index ff07d891f38f..56c2a8180ce1 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -102,6 +102,8 @@ struct bio {
 #define BIO_REFFED	8	/* bio has elevated ->bi_cnt */
 #define BIO_THROTTLED	9	/* This bio has already been subjected to
 				 * throttling rules. Don't do it again. */
+#define BIO_TRACE_COMPLETION 10	/* bio_endio() should trace the final completion
+				 * of this bio. */
 /* See BVEC_POOL_OFFSET below before adding new flags */
 
 /*

^ permalink raw reply related

* [PATCH 1/2] block: simple improvements for bio->flags
From: NeilBrown @ 2017-04-07  1:10 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-raid, Mike Snitzer, Christoph Hellwig, Ming Lei,
	linux-kernel, linux-block, dm-devel, Shaohua Li, Alasdair Kergon
In-Reply-To: <149152735878.17489.16036848644242802943.stgit@noble>

The comment for the 'flags' field of 'bio' mentions
"command" which is no longer stored there, and doesn't
mention the bvec pool number, which is.

BIO_RESET_BITS is set in such a way that it would need to be
updated if new bits were added, which is easy to miss.

BVEC_POOL_BITS is larger than needed.  The BVEC_POOL_IDX()
ranges from 0 to 6, so 3 bits are sufficient.

This patch make improvements in each of these areas.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/blk_types.h |   22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d703acb55d0f..ff07d891f38f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -29,7 +29,7 @@ struct bio {
 						 * top bits REQ_OP. Use
 						 * accessors.
 						 */
-	unsigned short		bi_flags;	/* status, command, etc */
+	unsigned short		bi_flags;	/* status, etc and bvec pool number */
 	unsigned short		bi_ioprio;
 
 	struct bvec_iter	bi_iter;
@@ -102,12 +102,7 @@ struct bio {
 #define BIO_REFFED	8	/* bio has elevated ->bi_cnt */
 #define BIO_THROTTLED	9	/* This bio has already been subjected to
 				 * throttling rules. Don't do it again. */
-
-/*
- * Flags starting here get preserved by bio_reset() - this includes
- * BVEC_POOL_IDX()
- */
-#define BIO_RESET_BITS	10
+/* See BVEC_POOL_OFFSET below before adding new flags */
 
 /*
  * We support 6 different bvec pools, the last one is magic in that it
@@ -117,13 +112,22 @@ struct bio {
 #define BVEC_POOL_MAX		(BVEC_POOL_NR - 1)
 
 /*
- * Top 4 bits of bio flags indicate the pool the bvecs came from.  We add
+ * Top 3 bits of bio flags indicate the pool the bvecs came from.  We add
  * 1 to the actual index so that 0 indicates that there are no bvecs to be
  * freed.
  */
-#define BVEC_POOL_BITS		(4)
+#define BVEC_POOL_BITS		(3)
 #define BVEC_POOL_OFFSET	(16 - BVEC_POOL_BITS)
 #define BVEC_POOL_IDX(bio)	((bio)->bi_flags >> BVEC_POOL_OFFSET)
+#if (1<< BVEC_POOL_BITS) < (BVEC_POOL_NR+1)
+# error "BVEC_POOL_BITS is too small"
+#endif
+
+/*
+ * Flags starting here get preserved by bio_reset() - this includes
+ * only BVEC_POOL_IDX()
+ */
+#define BIO_RESET_BITS	BVEC_POOL_OFFSET
 
 /*
  * Operations and flags common to the bio and request structures.

^ permalink raw reply related

* [PATCH 0/2] Trace completion of all bios
From: NeilBrown @ 2017-04-07  1:10 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-raid, Mike Snitzer, Christoph Hellwig, Ming Lei,
	linux-kernel, linux-block, dm-devel, Shaohua Li, Alasdair Kergon

Hi Jens,
 I think all objections to this patch have been answered so I'm
 resending.
 I've added a small cleanup patch first which makes some small
 enhancements to the documentation and #defines for the bio->flags
 field.

Thanks,
NeilBrown

---

NeilBrown (2):
      block: simple improvements for bio->flags
      block: trace completion of all bios.


 block/bio.c               |   13 +++++++++++++
 block/blk-core.c          |   10 +++++++++-
 drivers/md/dm.c           |    1 -
 drivers/md/raid5.c        |    2 --
 include/linux/blk_types.h |   24 +++++++++++++++---------
 5 files changed, 37 insertions(+), 13 deletions(-)

--
Signature

^ permalink raw reply

* [PATCH v3] loop: Add PF_LESS_THROTTLE to block/loop device thread.
From: NeilBrown @ 2017-04-06 23:47 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Michal Hocko, linux-block, linux-mm, LKML, Ming Lei
In-Reply-To: <20170406065326.GB5497@dhcp22.suse.cz>

[-- Attachment #1: Type: text/plain, Size: 2574 bytes --]

When a filesystem is mounted from a loop device, writes are
throttled by balance_dirty_pages() twice: once when writing
to the filesystem and once when the loop_handle_cmd() writes
to the backing file.  This double-throttling can trigger
positive feedback loops that create significant delays.  The
throttling at the lower level is seen by the upper level as
a slow device, so it throttles extra hard.

The PF_LESS_THROTTLE flag was created to handle exactly this
circumstance, though with an NFS filesystem mounted from a
local NFS server.  It reduces the throttling on the lower
layer so that it can proceed largely unthrottled.

To demonstrate this, create a filesystem on a loop device
and write (e.g. with dd) several large files which combine
to consume significantly more than the limit set by
/proc/sys/vm/dirty_ratio or dirty_bytes.  Measure the total
time taken.

When I do this directly on a device (no loop device) the
total time for several runs (mkfs, mount, write 200 files,
umount) is fairly stable: 28-35 seconds.
When I do this over a loop device the times are much worse
and less stable.  52-460 seconds.  Half below 100seconds,
half above.
When I apply this patch, the times become stable again,
though not as fast as the no-loop-back case: 53-72 seconds.

There may be room for further improvement as the total overhead still
seems too high, but this is a big improvement.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---

Hi Jens,
 I think this version meets with everyone's approval.

Thanks,
NeilBrown

 drivers/block/loop.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 0ecb6461ed81..035b8651b8bf 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -844,10 +844,16 @@ static void loop_unprepare_queue(struct loop_device *lo)
 	kthread_stop(lo->worker_task);
 }

+static int loop_kthread_worker_fn(void *worker_ptr)
+{
+	current->flags |= PF_LESS_THROTTLE;
+	return kthread_worker_fn(worker_ptr);
+}
+
 static int loop_prepare_queue(struct loop_device *lo)
 {
 	kthread_init_worker(&lo->worker);
-	lo->worker_task = kthread_run(kthread_worker_fn,
+	lo->worker_task = kthread_run(loop_kthread_worker_fn,
 			&lo->worker, "loop%d", lo->lo_number);
 	if (IS_ERR(lo->worker_task))
 		return -ENOMEM;
-- 
2.12.2

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* Re: [PATCH 7/8] nowait aio: xfs
From: Darrick J. Wong @ 2017-04-06 22:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Goldwyn Rodrigues, linux-fsdevel, jack, linux-block, linux-btrfs,
	linux-ext4, linux-xfs, sagi, avi, axboe, linux-api, willy,
	tom.leiming, Goldwyn Rodrigues
In-Reply-To: <20170404065211.GD21942@infradead.org>

On Mon, Apr 03, 2017 at 11:52:11PM -0700, Christoph Hellwig wrote:
> > +	if (unaligned_io) {
> > +		/* If we are going to wait for other DIO to finish, bail */
> > +		if ((iocb->ki_flags & IOCB_NOWAIT) &&
> > +		     atomic_read(&inode->i_dio_count))
> > +			return -EAGAIN;
> >  		inode_dio_wait(inode);
> 
> This checks i_dio_count twice in the nowait case, I think it should be:
> 
> 	if (iocb->ki_flags & IOCB_NOWAIT) {
> 		if (atomic_read(&inode->i_dio_count))
> 			return -EAGAIN;
> 	} else {
> 		inode_dio_wait(inode);
> 	}
> 
> >  	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
> >  		if (flags & IOMAP_DIRECT) {
> > +			/* A reflinked inode will result in CoW alloc */
> > +			if (flags & IOMAP_NOWAIT) {
> > +				error = -EAGAIN;
> > +				goto out_unlock;
> > +			}
> 
> This is a bit pessimistic - just because the inode has any shared
> extents we could still write into unshared ones.  For now I think this
> pessimistic check is fine, but the comment should be corrected.

Consider what happens in both _reflink_{allocate,reserve}_cow.  If there
is already an existing reservation in the CoW fork then we'll have to
CoW and therefore can't satisfy the NOWAIT flag.  If there isn't already
anything in the CoW fork, then we have to see if there are shared blocks
by calling _reflink_trim_around_shared.  That performs a refcountbt
lookup, which involves locking the AGF, so we also can't satisfy NOWAIT.

IOWs, I think this hunk has to move outside the IOMAP_DIRECT check to
cover both write-to-reflinked-file cases.

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 14/25] nbd: don't use req->errors
From: Josef Bacik @ 2017-04-06 21:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Josef Bacik, James Smart, Konrad Rzeszutek Wilk,
	Roger Pau Monné, linux-scsi, linux-nvme, linux-block,
	dm-devel
In-Reply-To: <20170406153944.10058-15-hch@lst.de>


> On Apr 6, 2017, at 11:39 AM, Christoph Hellwig <hch@lst.de> wrote:
> 
> Add a nbd-specific field instead.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

This is fine with me, you can add,

Reviewed-by: Josef Bacik <jbacik@fb.com>

Thanks,

Josef

^ permalink raw reply

* Re: [PATCH 00/12] nbd: Netlink interface and path failure enhancements
From: Josef Bacik @ 2017-04-06 21:05 UTC (permalink / raw)
  To: Jens Axboe, nbd-general, linux-block, kernel-team
In-Reply-To: <1491512527-4286-1-git-send-email-jbacik@fb.com>


> On Apr 6, 2017, at 5:01 PM, Josef Bacik <josef@toxicpanda.com> wrote:
>=20
> This patchset adds a new netlink configuration interface to NBD as =
well as a
> bunch of enhancments around path failures.  The patches provide the =
following
> enhancemnts to NBD
>=20
> - Netlink configuration interface that doesn't leave a userspace =
application
>   waiting in kernel space for the device to disconnect.
> - Netlink reconfigure interface for adding re-connected sockets to =
replace dead
>   sockets.
> - A flag to destroy the NBD device on disconnect, much like how mount =
-o loop
>   works.
> - A status interface that currently will only report whether a device =
is
>   connected or not, but can be extended to include whatever in the =
future.
> - A netlink multicast notification scheme to notify user space when =
there are
>   connection issues to allow for seamless reconnects.
> - Dead link handling.  You can specify a dead link timeout and the NBD =
device
>   will pause IO for that timeout waiting to see if the connection can =
be
>   re-established.  This is helpful to allow for things like nbd server =
upgrades
>   where the whole server disappears for a short period of time.
>=20
> These patches have been thorougly and continuously tested for about a =
month.
> I've been finding bugs in various places, but this batch has been =
solid for the
> last few days of testing, which include a constant =
disconnect/reconnect torture
> test.  Thanks,

Oops forgot to mention that I have patches for the nbd user space side =
that utilizes all of these new interfaces, you can find the tree here

https://github.com/josefbacik/nbd

The user space patches are a bit rough, but utilize everything so good =
for testing.  They will be cleaned up and submitted once the kernel part =
is upstream.  Thanks,

Josef=

^ permalink raw reply

* [PATCH 12/12] nbd: add a flag to destroy an nbd device on disconnect
From: Josef Bacik @ 2017-04-06 21:02 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team
In-Reply-To: <1491512527-4286-1-git-send-email-jbacik@fb.com>

For ease of management it would be nice for users to specify that the
device node for a nbd device is destroyed once it is disconnected and
there are no more users.  Add a client flag and enable this operation to
happen.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c      | 30 ++++++++++++++++++++++++++++++
 include/uapi/linux/nbd.h |  6 +++++-
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index b33014a..d220045 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -74,6 +74,7 @@ struct link_dead_args {
 #define NBD_HAS_PID_FILE		3
 #define NBD_HAS_CONFIG_REF		4
 #define NBD_BOUND			5
+#define NBD_DESTROY_ON_DISCONNECT	6
 
 struct nbd_config {
 	u32 flags;
@@ -174,6 +175,7 @@ static void nbd_dev_remove(struct nbd_device *nbd)
 		del_gendisk(disk);
 		blk_cleanup_queue(disk->queue);
 		blk_mq_free_tag_set(&nbd->tag_set);
+		disk->private_data = NULL;
 		put_disk(disk);
 	}
 	kfree(nbd);
@@ -1028,6 +1030,7 @@ static void nbd_config_put(struct nbd_device *nbd)
 			kfree(config->socks);
 		}
 		nbd_reset(nbd);
+
 		mutex_unlock(&nbd->config_lock);
 		nbd_put(nbd);
 		module_put(THIS_MODULE);
@@ -1540,6 +1543,7 @@ static int nbd_genl_connect(struct sk_buff *skb, struct genl_info *info)
 	struct nbd_config *config;
 	int index = -1;
 	int ret;
+	bool put_dev = false;
 
 	if (!netlink_capable(skb, CAP_SYS_ADMIN))
 		return -EPERM;
@@ -1634,6 +1638,15 @@ static int nbd_genl_connect(struct sk_buff *skb, struct genl_info *info)
 	if (info->attrs[NBD_ATTR_SERVER_FLAGS])
 		config->flags =
 			nla_get_u64(info->attrs[NBD_ATTR_SERVER_FLAGS]);
+	if (info->attrs[NBD_ATTR_CLIENT_FLAGS]) {
+		u64 flags = nla_get_u64(info->attrs[NBD_ATTR_CLIENT_FLAGS]);
+		if (flags & NBD_CFLAG_DESTROY_ON_DISCONNECT) {
+			set_bit(NBD_DESTROY_ON_DISCONNECT,
+				&config->runtime_flags);
+			put_dev = true;
+		}
+	}
+
 	if (info->attrs[NBD_ATTR_SOCKETS]) {
 		struct nlattr *attr;
 		int rem, fd;
@@ -1671,6 +1684,8 @@ static int nbd_genl_connect(struct sk_buff *skb, struct genl_info *info)
 		nbd_connect_reply(info, nbd->index);
 	}
 	nbd_config_put(nbd);
+	if (put_dev)
+		nbd_put(nbd);
 	return ret;
 }
 
@@ -1723,6 +1738,7 @@ static int nbd_genl_reconfigure(struct sk_buff *skb, struct genl_info *info)
 	struct nbd_config *config;
 	int index;
 	int ret = -EINVAL;
+	bool put_dev = false;
 
 	if (!netlink_capable(skb, CAP_SYS_ADMIN))
 		return -EPERM;
@@ -1774,6 +1790,18 @@ static int nbd_genl_reconfigure(struct sk_buff *skb, struct genl_info *info)
 			nla_get_u64(info->attrs[NBD_ATTR_DEAD_CONN_TIMEOUT]);
 		config->dead_conn_timeout *= HZ;
 	}
+	if (info->attrs[NBD_ATTR_CLIENT_FLAGS]) {
+		u64 flags = nla_get_u64(info->attrs[NBD_ATTR_CLIENT_FLAGS]);
+		if (flags & NBD_CFLAG_DESTROY_ON_DISCONNECT) {
+			if (!test_and_set_bit(NBD_DESTROY_ON_DISCONNECT,
+					      &config->runtime_flags))
+				put_dev = true;
+		} else {
+			if (test_and_clear_bit(NBD_DESTROY_ON_DISCONNECT,
+					       &config->runtime_flags))
+				refcount_inc(&nbd->refs);
+		}
+	}
 
 	if (info->attrs[NBD_ATTR_SOCKETS]) {
 		struct nlattr *attr;
@@ -1811,6 +1839,8 @@ static int nbd_genl_reconfigure(struct sk_buff *skb, struct genl_info *info)
 	mutex_unlock(&nbd->config_lock);
 	nbd_config_put(nbd);
 	nbd_put(nbd);
+	if (put_dev)
+		nbd_put(nbd);
 	return ret;
 }
 
diff --git a/include/uapi/linux/nbd.h b/include/uapi/linux/nbd.h
index c91c642..155e33f 100644
--- a/include/uapi/linux/nbd.h
+++ b/include/uapi/linux/nbd.h
@@ -37,7 +37,7 @@ enum {
 	NBD_CMD_TRIM = 4
 };
 
-/* values for flags field */
+/* values for flags field, these are server interaction specific. */
 #define NBD_FLAG_HAS_FLAGS	(1 << 0) /* nbd-server supports flags */
 #define NBD_FLAG_READ_ONLY	(1 << 1) /* device is read-only */
 #define NBD_FLAG_SEND_FLUSH	(1 << 2) /* can flush writeback cache */
@@ -45,6 +45,10 @@ enum {
 #define NBD_FLAG_SEND_TRIM	(1 << 5) /* send trim/discard */
 #define NBD_FLAG_CAN_MULTI_CONN	(1 << 8)	/* Server supports multiple connections per export. */
 
+/* These are client behavior specific flags. */
+#define NBD_CFLAG_DESTROY_ON_DISCONNECT	(1 << 0) /* delete the nbd device on
+						    disconnect. */
+
 /* userspace doesn't need the nbd_device structure */
 
 /* These are sent over the network in the request/reply magic fields */
-- 
2.7.4

^ permalink raw reply related

* [PATCH 11/12] nbd: add device refcounting
From: Josef Bacik @ 2017-04-06 21:02 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team
In-Reply-To: <1491512527-4286-1-git-send-email-jbacik@fb.com>

In order to support deleting the device on disconnect we need to
refcount the actual nbd_device struct.  So add the refcounting framework
and change how we free the normal devices at rmmod time so we can catch
reference leaks.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c | 104 +++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 86 insertions(+), 18 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index e1289d0..b33014a 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -99,10 +99,12 @@ struct nbd_device {
 
 	int index;
 	refcount_t config_refs;
+	refcount_t refs;
 	struct nbd_config *config;
 	struct mutex config_lock;
 	struct gendisk *disk;
 
+	struct list_head list;
 	struct task_struct *task_recv;
 	struct task_struct *task_setup;
 };
@@ -165,6 +167,28 @@ static struct device_attribute pid_attr = {
 	.show = pid_show,
 };
 
+static void nbd_dev_remove(struct nbd_device *nbd)
+{
+	struct gendisk *disk = nbd->disk;
+	if (disk) {
+		del_gendisk(disk);
+		blk_cleanup_queue(disk->queue);
+		blk_mq_free_tag_set(&nbd->tag_set);
+		put_disk(disk);
+	}
+	kfree(nbd);
+}
+
+static void nbd_put(struct nbd_device *nbd)
+{
+	if (refcount_dec_and_mutex_lock(&nbd->refs,
+					&nbd_index_mutex)) {
+		idr_remove(&nbd_index_idr, nbd->index);
+		mutex_unlock(&nbd_index_mutex);
+		nbd_dev_remove(nbd);
+	}
+}
+
 static int nbd_disconnected(struct nbd_config *config)
 {
 	return test_bit(NBD_DISCONNECTED, &config->runtime_flags) ||
@@ -1005,6 +1029,7 @@ static void nbd_config_put(struct nbd_device *nbd)
 		}
 		nbd_reset(nbd);
 		mutex_unlock(&nbd->config_lock);
+		nbd_put(nbd);
 		module_put(THIS_MODULE);
 	}
 }
@@ -1199,6 +1224,10 @@ static int nbd_open(struct block_device *bdev, fmode_t mode)
 		ret = -ENXIO;
 		goto out;
 	}
+	if (!refcount_inc_not_zero(&nbd->refs)) {
+		ret = -ENXIO;
+		goto out;
+	}
 	if (!refcount_inc_not_zero(&nbd->config_refs)) {
 		struct nbd_config *config;
 
@@ -1214,6 +1243,7 @@ static int nbd_open(struct block_device *bdev, fmode_t mode)
 			goto out;
 		}
 		refcount_set(&nbd->config_refs, 1);
+		refcount_inc(&nbd->refs);
 		mutex_unlock(&nbd->config_lock);
 	}
 out:
@@ -1225,6 +1255,7 @@ static void nbd_release(struct gendisk *disk, fmode_t mode)
 {
 	struct nbd_device *nbd = disk->private_data;
 	nbd_config_put(nbd);
+	nbd_put(nbd);
 }
 
 static const struct block_device_operations nbd_fops =
@@ -1378,18 +1409,6 @@ static const struct blk_mq_ops nbd_mq_ops = {
 	.timeout	= nbd_xmit_timeout,
 };
 
-static void nbd_dev_remove(struct nbd_device *nbd)
-{
-	struct gendisk *disk = nbd->disk;
-	if (disk) {
-		del_gendisk(disk);
-		blk_cleanup_queue(disk->queue);
-		blk_mq_free_tag_set(&nbd->tag_set);
-		put_disk(disk);
-	}
-	kfree(nbd);
-}
-
 static int nbd_dev_add(int index)
 {
 	struct nbd_device *nbd;
@@ -1453,6 +1472,8 @@ static int nbd_dev_add(int index)
 
 	mutex_init(&nbd->config_lock);
 	refcount_set(&nbd->config_refs, 0);
+	refcount_set(&nbd->refs, 1);
+	INIT_LIST_HEAD(&nbd->list);
 	disk->major = NBD_MAJOR;
 	disk->first_minor = index << part_shift;
 	disk->fops = &nbd_fops;
@@ -1550,16 +1571,26 @@ static int nbd_genl_connect(struct sk_buff *skb, struct genl_info *info)
 	} else {
 		nbd = idr_find(&nbd_index_idr, index);
 	}
-	mutex_unlock(&nbd_index_mutex);
 	if (!nbd) {
 		printk(KERN_ERR "nbd: couldn't find device at index %d\n",
 		       index);
+		mutex_unlock(&nbd_index_mutex);
+		return -EINVAL;
+	}
+	if (!refcount_inc_not_zero(&nbd->refs)) {
+		mutex_unlock(&nbd_index_mutex);
+		if (index == -1)
+			goto again;
+		printk(KERN_ERR "nbd: device at index %d is going down\n",
+		       index);
 		return -EINVAL;
 	}
+	mutex_unlock(&nbd_index_mutex);
 
 	mutex_lock(&nbd->config_lock);
 	if (refcount_read(&nbd->config_refs)) {
 		mutex_unlock(&nbd->config_lock);
+		nbd_put(nbd);
 		if (index == -1)
 			goto again;
 		printk(KERN_ERR "nbd: nbd%d already in use\n", index);
@@ -1567,11 +1598,13 @@ static int nbd_genl_connect(struct sk_buff *skb, struct genl_info *info)
 	}
 	if (WARN_ON(nbd->config)) {
 		mutex_unlock(&nbd->config_lock);
+		nbd_put(nbd);
 		return -EINVAL;
 	}
 	config = nbd->config = nbd_alloc_config();
 	if (!nbd->config) {
 		mutex_unlock(&nbd->config_lock);
+		nbd_put(nbd);
 		printk(KERN_ERR "nbd: couldn't allocate config\n");
 		return -ENOMEM;
 	}
@@ -1656,14 +1689,23 @@ static int nbd_genl_disconnect(struct sk_buff *skb, struct genl_info *info)
 	index = nla_get_u32(info->attrs[NBD_ATTR_INDEX]);
 	mutex_lock(&nbd_index_mutex);
 	nbd = idr_find(&nbd_index_idr, index);
-	mutex_unlock(&nbd_index_mutex);
 	if (!nbd) {
+		mutex_unlock(&nbd_index_mutex);
 		printk(KERN_ERR "nbd: couldn't find device at index %d\n",
 		       index);
 		return -EINVAL;
 	}
-	if (!refcount_inc_not_zero(&nbd->config_refs))
+	if (!refcount_inc_not_zero(&nbd->refs)) {
+		mutex_unlock(&nbd_index_mutex);
+		printk(KERN_ERR "nbd: device at index %d is going down\n",
+		       index);
+		return -EINVAL;
+	}
+	mutex_unlock(&nbd_index_mutex);
+	if (!refcount_inc_not_zero(&nbd->config_refs)) {
+		nbd_put(nbd);
 		return 0;
+	}
 	mutex_lock(&nbd->config_lock);
 	nbd_disconnect(nbd);
 	mutex_unlock(&nbd->config_lock);
@@ -1671,6 +1713,7 @@ static int nbd_genl_disconnect(struct sk_buff *skb, struct genl_info *info)
 			       &nbd->config->runtime_flags))
 		nbd_config_put(nbd);
 	nbd_config_put(nbd);
+	nbd_put(nbd);
 	return 0;
 }
 
@@ -1691,16 +1734,24 @@ static int nbd_genl_reconfigure(struct sk_buff *skb, struct genl_info *info)
 	index = nla_get_u32(info->attrs[NBD_ATTR_INDEX]);
 	mutex_lock(&nbd_index_mutex);
 	nbd = idr_find(&nbd_index_idr, index);
-	mutex_unlock(&nbd_index_mutex);
 	if (!nbd) {
+		mutex_unlock(&nbd_index_mutex);
 		printk(KERN_ERR "nbd: couldn't find a device at index %d\n",
 		       index);
 		return -EINVAL;
 	}
+	if (!refcount_inc_not_zero(&nbd->refs)) {
+		mutex_unlock(&nbd_index_mutex);
+		printk(KERN_ERR "nbd: device at index %d is going down\n",
+		       index);
+		return -EINVAL;
+	}
+	mutex_unlock(&nbd_index_mutex);
 
 	if (!refcount_inc_not_zero(&nbd->config_refs)) {
 		dev_err(nbd_to_dev(nbd),
 			"not configured, cannot reconfigure\n");
+		nbd_put(nbd);
 		return -EINVAL;
 	}
 
@@ -1759,6 +1810,7 @@ static int nbd_genl_reconfigure(struct sk_buff *skb, struct genl_info *info)
 out:
 	mutex_unlock(&nbd->config_lock);
 	nbd_config_put(nbd);
+	nbd_put(nbd);
 	return ret;
 }
 
@@ -2004,16 +2056,32 @@ static int __init nbd_init(void)
 
 static int nbd_exit_cb(int id, void *ptr, void *data)
 {
+	struct list_head *list = (struct list_head *)data;
 	struct nbd_device *nbd = ptr;
-	nbd_dev_remove(nbd);
+
+	refcount_inc(&nbd->refs);
+	list_add_tail(&nbd->list, list);
 	return 0;
 }
 
 static void __exit nbd_cleanup(void)
 {
+	struct nbd_device *nbd;
+	LIST_HEAD(del_list);
+
 	nbd_dbg_close();
 
-	idr_for_each(&nbd_index_idr, &nbd_exit_cb, NULL);
+	mutex_lock(&nbd_index_mutex);
+	idr_for_each(&nbd_index_idr, &nbd_exit_cb, &del_list);
+	mutex_unlock(&nbd_index_mutex);
+
+	list_for_each_entry(nbd, &del_list, list) {
+		if (refcount_read(&nbd->refs) != 2)
+			printk(KERN_ERR "nbd: possibly leaking a device\n");
+		nbd_put(nbd);
+		nbd_put(nbd);
+	}
+
 	idr_destroy(&nbd_index_idr);
 	genl_unregister_family(&nbd_genl_family);
 	destroy_workqueue(recv_workqueue);
-- 
2.7.4

^ permalink raw reply related

* [PATCH 10/12] nbd: add a status netlink command
From: Josef Bacik @ 2017-04-06 21:02 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team
In-Reply-To: <1491512527-4286-1-git-send-email-jbacik@fb.com>

Allow users to query the status of existing nbd devices.  Right now this
only returns whether or not the device is connected, but could be
extended in the future to include more information.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c              | 108 +++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/nbd-netlink.h |  25 +++++++++
 2 files changed, 133 insertions(+)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index fd3d535..e1289d0 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -45,6 +45,7 @@
 
 static DEFINE_IDR(nbd_index_idr);
 static DEFINE_MUTEX(nbd_index_mutex);
+static int nbd_total_devices = 0;
 
 struct nbd_sock {
 	struct socket *sock;
@@ -130,6 +131,7 @@ static int nbd_dev_dbg_init(struct nbd_device *nbd);
 static void nbd_dev_dbg_close(struct nbd_device *nbd);
 static void nbd_config_put(struct nbd_device *nbd);
 static void nbd_connect_reply(struct genl_info *info, int index);
+static int nbd_genl_status(struct sk_buff *skb, struct genl_info *info);
 static void nbd_dead_link_work(struct work_struct *work);
 
 static inline struct device *nbd_to_dev(struct nbd_device *nbd)
@@ -1458,6 +1460,7 @@ static int nbd_dev_add(int index)
 	sprintf(disk->disk_name, "nbd%d", index);
 	nbd_reset(nbd);
 	add_disk(disk);
+	nbd_total_devices++;
 	return index;
 
 out_free_tags:
@@ -1494,12 +1497,22 @@ static struct nla_policy nbd_attr_policy[NBD_ATTR_MAX + 1] = {
 	[NBD_ATTR_CLIENT_FLAGS]		=	{ .type = NLA_U64 },
 	[NBD_ATTR_SOCKETS]		=	{ .type = NLA_NESTED},
 	[NBD_ATTR_DEAD_CONN_TIMEOUT]	=	{ .type = NLA_U64 },
+	[NBD_ATTR_DEVICE_LIST]		=	{ .type = NLA_NESTED},
 };
 
 static struct nla_policy nbd_sock_policy[NBD_SOCK_MAX + 1] = {
 	[NBD_SOCK_FD]			=	{ .type = NLA_U32 },
 };
 
+/* We don't use this right now since we don't parse the incoming list, but we
+ * still want it here so userspace knows what to expect.
+ */
+static struct nla_policy __attribute__((unused))
+nbd_device_policy[NBD_DEVICE_ATTR_MAX + 1] = {
+	[NBD_DEVICE_INDEX]		=	{ .type = NLA_U32 },
+	[NBD_DEVICE_CONNECTED]		=	{ .type = NLA_U8 },
+};
+
 static int nbd_genl_connect(struct sk_buff *skb, struct genl_info *info)
 {
 	struct nbd_device *nbd = NULL;
@@ -1765,6 +1778,11 @@ static const struct genl_ops nbd_connect_genl_ops[] = {
 		.policy	= nbd_attr_policy,
 		.doit	= nbd_genl_reconfigure,
 	},
+	{
+		.cmd	= NBD_CMD_STATUS,
+		.policy	= nbd_attr_policy,
+		.doit	= nbd_genl_status,
+	},
 };
 
 static const struct genl_multicast_group nbd_mcast_grps[] = {
@@ -1783,6 +1801,96 @@ static struct genl_family nbd_genl_family __ro_after_init = {
 	.n_mcgrps	= ARRAY_SIZE(nbd_mcast_grps),
 };
 
+static int populate_nbd_status(struct nbd_device *nbd, struct sk_buff *reply)
+{
+	struct nlattr *dev_opt;
+	u8 connected = 0;
+	int ret;
+
+	/* This is a little racey, but for status it's ok.  The
+	 * reason we don't take a ref here is because we can't
+	 * take a ref in the index == -1 case as we would need
+	 * to put under the nbd_index_mutex, which could
+	 * deadlock if we are configured to remove ourselves
+	 * once we're disconnected.
+	 */
+	if (refcount_read(&nbd->config_refs))
+		connected = 1;
+	dev_opt = nla_nest_start(reply, NBD_DEVICE_ITEM);
+	if (!dev_opt)
+		return -EMSGSIZE;
+	ret = nla_put_u32(reply, NBD_DEVICE_INDEX, nbd->index);
+	if (ret)
+		return -EMSGSIZE;
+	ret = nla_put_u8(reply, NBD_DEVICE_CONNECTED,
+			 connected);
+	if (ret)
+		return -EMSGSIZE;
+	nla_nest_end(reply, dev_opt);
+	return 0;
+}
+
+static int status_cb(int id, void *ptr, void *data)
+{
+	struct nbd_device *nbd = ptr;
+	return populate_nbd_status(nbd, (struct sk_buff *)data);
+}
+
+static int nbd_genl_status(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr *dev_list;
+	struct sk_buff *reply;
+	void *reply_head;
+	size_t msg_size;
+	int index = -1;
+	int ret = -ENOMEM;
+
+	if (info->attrs[NBD_ATTR_INDEX])
+		index = nla_get_u32(info->attrs[NBD_ATTR_INDEX]);
+
+	mutex_lock(&nbd_index_mutex);
+
+	msg_size = nla_total_size(nla_attr_size(sizeof(u32)) +
+				  nla_attr_size(sizeof(u8)));
+	msg_size *= (index == -1) ? nbd_total_devices : 1;
+
+	reply = genlmsg_new(msg_size, GFP_KERNEL);
+	if (!reply)
+		goto out;
+	reply_head = genlmsg_put_reply(reply, info, &nbd_genl_family, 0,
+				       NBD_CMD_STATUS);
+	if (!reply_head) {
+		nlmsg_free(reply);
+		goto out;
+	}
+
+	dev_list = nla_nest_start(reply, NBD_ATTR_DEVICE_LIST);
+	if (index == -1) {
+		ret = idr_for_each(&nbd_index_idr, &status_cb, reply);
+		if (ret) {
+			nlmsg_free(reply);
+			goto out;
+		}
+	} else {
+		struct nbd_device *nbd;
+		nbd = idr_find(&nbd_index_idr, index);
+		if (nbd) {
+			ret = populate_nbd_status(nbd, reply);
+			if (ret) {
+				nlmsg_free(reply);
+				goto out;
+			}
+		}
+	}
+	nla_nest_end(reply, dev_list);
+	genlmsg_end(reply, reply_head);
+	genlmsg_reply(reply, info);
+	ret = 0;
+out:
+	mutex_unlock(&nbd_index_mutex);
+	return ret;
+}
+
 static void nbd_connect_reply(struct genl_info *info, int index)
 {
 	struct sk_buff *skb;
diff --git a/include/uapi/linux/nbd-netlink.h b/include/uapi/linux/nbd-netlink.h
index c2209c75..6f7ca3d 100644
--- a/include/uapi/linux/nbd-netlink.h
+++ b/include/uapi/linux/nbd-netlink.h
@@ -33,11 +33,35 @@ enum {
 	NBD_ATTR_CLIENT_FLAGS,
 	NBD_ATTR_SOCKETS,
 	NBD_ATTR_DEAD_CONN_TIMEOUT,
+	NBD_ATTR_DEVICE_LIST,
 	__NBD_ATTR_MAX,
 };
 #define NBD_ATTR_MAX (__NBD_ATTR_MAX - 1)
 
 /*
+ * This is the format for multiple devices with NBD_ATTR_DEVICE_LIST
+ *
+ * [NBD_ATTR_DEVICE_LIST]
+ *   [NBD_DEVICE_ITEM]
+ *     [NBD_DEVICE_INDEX]
+ *     [NBD_DEVICE_CONNECTED]
+ */
+enum {
+	NBD_DEVICE_ITEM_UNSPEC,
+	NBD_DEVICE_ITEM,
+	__NBD_DEVICE_ITEM_MAX,
+};
+#define NBD_DEVICE_ITEM_MAX (__NBD_DEVICE_ITEM_MAX - 1)
+
+enum {
+	NBD_DEVICE_UNSPEC,
+	NBD_DEVICE_INDEX,
+	NBD_DEVICE_CONNECTED,
+	__NBD_DEVICE_MAX,
+};
+#define NBD_DEVICE_ATTR_MAX (__NBD_DEVICE_MAX - 1)
+
+/*
  * This is the format for multiple sockets with NBD_ATTR_SOCKETS
  *
  * [NBD_ATTR_SOCKETS]
@@ -66,6 +90,7 @@ enum {
 	NBD_CMD_DISCONNECT,
 	NBD_CMD_RECONFIGURE,
 	NBD_CMD_LINK_DEAD,
+	NBD_CMD_STATUS,
 	__NBD_CMD_MAX,
 };
 #define NBD_CMD_MAX	(__NBD_CMD_MAX - 1)
-- 
2.7.4

^ permalink raw reply related

* [PATCH 09/12] nbd: handle dead connections
From: Josef Bacik @ 2017-04-06 21:02 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team
In-Reply-To: <1491512527-4286-1-git-send-email-jbacik@fb.com>

Sometimes we like to upgrade our server without making all of our
clients freak out and reconnect.  This patch provides a way to specify a
dead connection timeout to allow us to pause all requests and wait for
new connections to be opened.  With this in place I can take down the
nbd server for less than the dead connection timeout time and bring it
back up and everything resumes gracefully.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c              | 63 +++++++++++++++++++++++++++++++++++++---
 include/uapi/linux/nbd-netlink.h |  1 +
 2 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 70c5e75..fd3d535 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -77,9 +77,12 @@ struct link_dead_args {
 struct nbd_config {
 	u32 flags;
 	unsigned long runtime_flags;
+	u64 dead_conn_timeout;
 
 	struct nbd_sock **socks;
 	int num_connections;
+	atomic_t live_connections;
+	wait_queue_head_t conn_wait;
 
 	atomic_t recv_threads;
 	wait_queue_head_t recv_wq;
@@ -178,8 +181,10 @@ static void nbd_mark_nsock_dead(struct nbd_device *nbd, struct nbd_sock *nsock,
 			queue_work(system_wq, &args->work);
 		}
 	}
-	if (!nsock->dead)
+	if (!nsock->dead) {
 		kernel_sock_shutdown(nsock->sock, SHUT_RDWR);
+		atomic_dec(&nbd->config->live_connections);
+	}
 	nsock->dead = true;
 	nsock->pending = NULL;
 	nsock->sent = 0;
@@ -257,6 +262,14 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 		return BLK_EH_HANDLED;
 	}
 
+	/* If we are waiting on our dead timer then we could get timeout
+	 * callbacks for our request.  For this we just want to reset the timer
+	 * and let the queue side take care of everything.
+	 */
+	if (!completion_done(&cmd->send_complete)) {
+		nbd_config_put(nbd);
+		return BLK_EH_RESET_TIMER;
+	}
 	config = nbd->config;
 
 	if (config->num_connections > 1) {
@@ -665,6 +678,19 @@ static int find_fallback(struct nbd_device *nbd, int index)
 	return new_index;
 }
 
+static int wait_for_reconnect(struct nbd_device *nbd)
+{
+	struct nbd_config *config = nbd->config;
+	if (!config->dead_conn_timeout)
+		return 0;
+	if (test_bit(NBD_DISCONNECTED, &config->runtime_flags))
+		return 0;
+	wait_event_interruptible_timeout(config->conn_wait,
+					 atomic_read(&config->live_connections),
+					 config->dead_conn_timeout);
+	return atomic_read(&config->live_connections);
+}
+
 static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
 {
 	struct request *req = blk_mq_rq_from_pdu(cmd);
@@ -691,12 +717,24 @@ static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
 	nsock = config->socks[index];
 	mutex_lock(&nsock->tx_lock);
 	if (nsock->dead) {
+		int old_index = index;
 		index = find_fallback(nbd, index);
+		mutex_unlock(&nsock->tx_lock);
 		if (index < 0) {
-			ret = -EIO;
-			goto out;
+			if (wait_for_reconnect(nbd)) {
+				index = old_index;
+				goto again;
+			}
+			/* All the sockets should already be down at this point,
+			 * we just want to make sure that DISCONNECTED is set so
+			 * any requests that come in that were queue'ed waiting
+			 * for the reconnect timer don't trigger the timer again
+			 * and instead just error out.
+			 */
+			sock_shutdown(nbd);
+			nbd_config_put(nbd);
+			return -EIO;
 		}
-		mutex_unlock(&nsock->tx_lock);
 		goto again;
 	}
 
@@ -809,6 +847,7 @@ static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg,
 	nsock->sent = 0;
 	nsock->cookie = 0;
 	socks[config->num_connections++] = nsock;
+	atomic_inc(&config->live_connections);
 
 	return 0;
 }
@@ -860,6 +899,9 @@ static int nbd_reconnect_socket(struct nbd_device *nbd, unsigned long arg)
 		 * need to queue_work outside of the tx_mutex.
 		 */
 		queue_work(recv_workqueue, &args->work);
+
+		atomic_inc(&config->live_connections);
+		wake_up(&config->conn_wait);
 		return 0;
 	}
 	sockfd_put(sock);
@@ -1137,7 +1179,9 @@ static struct nbd_config *nbd_alloc_config(void)
 		return NULL;
 	atomic_set(&config->recv_threads, 0);
 	init_waitqueue_head(&config->recv_wq);
+	init_waitqueue_head(&config->conn_wait);
 	config->blksize = 1024;
+	atomic_set(&config->live_connections, 0);
 	try_module_get(THIS_MODULE);
 	return config;
 }
@@ -1449,6 +1493,7 @@ static struct nla_policy nbd_attr_policy[NBD_ATTR_MAX + 1] = {
 	[NBD_ATTR_SERVER_FLAGS]		=	{ .type = NLA_U64 },
 	[NBD_ATTR_CLIENT_FLAGS]		=	{ .type = NLA_U64 },
 	[NBD_ATTR_SOCKETS]		=	{ .type = NLA_NESTED},
+	[NBD_ATTR_DEAD_CONN_TIMEOUT]	=	{ .type = NLA_U64 },
 };
 
 static struct nla_policy nbd_sock_policy[NBD_SOCK_MAX + 1] = {
@@ -1535,6 +1580,11 @@ static int nbd_genl_connect(struct sk_buff *skb, struct genl_info *info)
 		nbd->tag_set.timeout = timeout * HZ;
 		blk_queue_rq_timeout(nbd->disk->queue, timeout * HZ);
 	}
+	if (info->attrs[NBD_ATTR_DEAD_CONN_TIMEOUT]) {
+		config->dead_conn_timeout =
+			nla_get_u64(info->attrs[NBD_ATTR_DEAD_CONN_TIMEOUT]);
+		config->dead_conn_timeout *= HZ;
+	}
 	if (info->attrs[NBD_ATTR_SERVER_FLAGS])
 		config->flags =
 			nla_get_u64(info->attrs[NBD_ATTR_SERVER_FLAGS]);
@@ -1655,6 +1705,11 @@ static int nbd_genl_reconfigure(struct sk_buff *skb, struct genl_info *info)
 		nbd->tag_set.timeout = timeout * HZ;
 		blk_queue_rq_timeout(nbd->disk->queue, timeout * HZ);
 	}
+	if (info->attrs[NBD_ATTR_DEAD_CONN_TIMEOUT]) {
+		config->dead_conn_timeout =
+			nla_get_u64(info->attrs[NBD_ATTR_DEAD_CONN_TIMEOUT]);
+		config->dead_conn_timeout *= HZ;
+	}
 
 	if (info->attrs[NBD_ATTR_SOCKETS]) {
 		struct nlattr *attr;
diff --git a/include/uapi/linux/nbd-netlink.h b/include/uapi/linux/nbd-netlink.h
index b69105cc..c2209c75 100644
--- a/include/uapi/linux/nbd-netlink.h
+++ b/include/uapi/linux/nbd-netlink.h
@@ -32,6 +32,7 @@ enum {
 	NBD_ATTR_SERVER_FLAGS,
 	NBD_ATTR_CLIENT_FLAGS,
 	NBD_ATTR_SOCKETS,
+	NBD_ATTR_DEAD_CONN_TIMEOUT,
 	__NBD_ATTR_MAX,
 };
 #define NBD_ATTR_MAX (__NBD_ATTR_MAX - 1)
-- 
2.7.4

^ permalink raw reply related

* [PATCH 08/12] nbd: only clear the queue on device teardown
From: Josef Bacik @ 2017-04-06 21:02 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team
In-Reply-To: <1491512527-4286-1-git-send-email-jbacik@fb.com>

When running a disconnect torture test I noticed that sometimes we would
crash with a negative ref count on our queue.  This was because we were
ending the same request twice.  Turns out we were racing with
NBD_CLEAR_SOCK clearing the requests as well as the teardown of the
device clearing the requests.  So instead make the ioctl only shutdown
the sockets and make it so that we only ever run nbd_clear_que from the
device teardown.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 911e36c..70c5e75 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -616,7 +616,9 @@ static void nbd_clear_req(struct request *req, void *data, bool reserved)
 
 static void nbd_clear_que(struct nbd_device *nbd)
 {
+	blk_mq_stop_hw_queues(nbd->disk->queue);
 	blk_mq_tagset_busy_iter(&nbd->tag_set, nbd_clear_req, NULL);
+	blk_mq_start_hw_queues(nbd->disk->queue);
 	dev_dbg(disk_to_dev(nbd->disk), "queue cleared\n");
 }
 
@@ -1041,7 +1043,7 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *b
 static void nbd_clear_sock_ioctl(struct nbd_device *nbd,
 				 struct block_device *bdev)
 {
-	nbd_clear_sock(nbd);
+	sock_shutdown(nbd);
 	kill_bdev(bdev);
 	nbd_bdev_reset(bdev);
 	if (test_and_clear_bit(NBD_HAS_CONFIG_REF,
-- 
2.7.4

^ permalink raw reply related

* [PATCH 07/12] nbd: multicast dead link notifications
From: Josef Bacik @ 2017-04-06 21:02 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team
In-Reply-To: <1491512527-4286-1-git-send-email-jbacik@fb.com>

Provide a mechanism to notify userspace that there's been a link problem
on a NBD device.  This will allow userspace to re-establish a connection
and provide the new socket to the device without disrupting the device.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c              | 89 ++++++++++++++++++++++++++++++++++------
 include/uapi/linux/nbd-netlink.h |  6 ++-
 2 files changed, 81 insertions(+), 14 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 27958c3..911e36c 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -53,6 +53,7 @@ struct nbd_sock {
 	int sent;
 	bool dead;
 	int fallback_index;
+	int cookie;
 };
 
 struct recv_thread_args {
@@ -61,6 +62,11 @@ struct recv_thread_args {
 	int index;
 };
 
+struct link_dead_args {
+	struct work_struct work;
+	int index;
+};
+
 #define NBD_TIMEDOUT			0
 #define NBD_DISCONNECT_REQUESTED	1
 #define NBD_DISCONNECTED		2
@@ -100,6 +106,7 @@ struct nbd_device {
 struct nbd_cmd {
 	struct nbd_device *nbd;
 	int index;
+	int cookie;
 	struct completion send_complete;
 };
 
@@ -120,6 +127,7 @@ static int nbd_dev_dbg_init(struct nbd_device *nbd);
 static void nbd_dev_dbg_close(struct nbd_device *nbd);
 static void nbd_config_put(struct nbd_device *nbd);
 static void nbd_connect_reply(struct genl_info *info, int index);
+static void nbd_dead_link_work(struct work_struct *work);
 
 static inline struct device *nbd_to_dev(struct nbd_device *nbd)
 {
@@ -152,8 +160,24 @@ static struct device_attribute pid_attr = {
 	.show = pid_show,
 };
 
-static void nbd_mark_nsock_dead(struct nbd_sock *nsock)
+static int nbd_disconnected(struct nbd_config *config)
+{
+	return test_bit(NBD_DISCONNECTED, &config->runtime_flags) ||
+		test_bit(NBD_DISCONNECT_REQUESTED, &config->runtime_flags);
+}
+
+static void nbd_mark_nsock_dead(struct nbd_device *nbd, struct nbd_sock *nsock,
+				int notify)
 {
+	if (!nsock->dead && notify && !nbd_disconnected(nbd->config)) {
+		struct link_dead_args *args;
+		args = kmalloc(sizeof(struct link_dead_args), GFP_NOIO);
+		if (args) {
+			INIT_WORK(&args->work, nbd_dead_link_work);
+			args->index = nbd->index;
+			queue_work(system_wq, &args->work);
+		}
+	}
 	if (!nsock->dead)
 		kernel_sock_shutdown(nsock->sock, SHUT_RDWR);
 	nsock->dead = true;
@@ -215,8 +239,7 @@ static void sock_shutdown(struct nbd_device *nbd)
 	for (i = 0; i < config->num_connections; i++) {
 		struct nbd_sock *nsock = config->socks[i];
 		mutex_lock(&nsock->tx_lock);
-		kernel_sock_shutdown(nsock->sock, SHUT_RDWR);
-		nbd_mark_nsock_dead(nsock);
+		nbd_mark_nsock_dead(nbd, nsock, 0);
 		mutex_unlock(&nsock->tx_lock);
 	}
 	dev_warn(disk_to_dev(nbd->disk), "shutting down sockets\n");
@@ -248,7 +271,14 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 				struct nbd_sock *nsock =
 					config->socks[cmd->index];
 				mutex_lock(&nsock->tx_lock);
-				nbd_mark_nsock_dead(nsock);
+				/* We can have multiple outstanding requests, so
+				 * we don't want to mark the nsock dead if we've
+				 * already reconnected with a new socket, so
+				 * only mark it dead if its the same socket we
+				 * were sent out on.
+				 */
+				if (cmd->cookie == nsock->cookie)
+					nbd_mark_nsock_dead(nbd, nsock, 1);
 				mutex_unlock(&nsock->tx_lock);
 			}
 			blk_mq_requeue_request(req, true);
@@ -370,6 +400,7 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 		iov_iter_advance(&from, sent);
 	}
 	cmd->index = index;
+	cmd->cookie = nsock->cookie;
 	request.type = htonl(type);
 	if (type != NBD_CMD_FLUSH) {
 		request.from = cpu_to_be64((u64)blk_rq_pos(req) << 9);
@@ -458,12 +489,6 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 	return 0;
 }
 
-static int nbd_disconnected(struct nbd_config *config)
-{
-	return test_bit(NBD_DISCONNECTED, &config->runtime_flags) ||
-		test_bit(NBD_DISCONNECT_REQUESTED, &config->runtime_flags);
-}
-
 /* NULL returned = something went wrong, inform userspace */
 static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
 {
@@ -564,7 +589,7 @@ static void recv_work(struct work_struct *work)
 			struct nbd_sock *nsock = config->socks[args->index];
 
 			mutex_lock(&nsock->tx_lock);
-			nbd_mark_nsock_dead(nsock);
+			nbd_mark_nsock_dead(nbd, nsock, 1);
 			mutex_unlock(&nsock->tx_lock);
 			ret = PTR_ERR(cmd);
 			break;
@@ -691,7 +716,7 @@ static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
 	if (ret == -EAGAIN) {
 		dev_err_ratelimited(disk_to_dev(nbd->disk),
 				    "Request send failed trying another connection\n");
-		nbd_mark_nsock_dead(nsock);
+		nbd_mark_nsock_dead(nbd, nsock, 1);
 		mutex_unlock(&nsock->tx_lock);
 		goto again;
 	}
@@ -780,6 +805,7 @@ static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg,
 	nsock->sock = sock;
 	nsock->pending = NULL;
 	nsock->sent = 0;
+	nsock->cookie = 0;
 	socks[config->num_connections++] = nsock;
 
 	return 0;
@@ -824,6 +850,7 @@ static int nbd_reconnect_socket(struct nbd_device *nbd, unsigned long arg)
 		INIT_WORK(&args->work, recv_work);
 		args->index = i;
 		args->nbd = nbd;
+		nsock->cookie++;
 		mutex_unlock(&nsock->tx_lock);
 		sockfd_put(old);
 
@@ -1683,6 +1710,10 @@ static const struct genl_ops nbd_connect_genl_ops[] = {
 	},
 };
 
+static const struct genl_multicast_group nbd_mcast_grps[] = {
+	{ .name = NBD_GENL_MCAST_GROUP_NAME, },
+};
+
 static struct genl_family nbd_genl_family __ro_after_init = {
 	.hdrsize	= 0,
 	.name		= NBD_GENL_FAMILY_NAME,
@@ -1691,6 +1722,8 @@ static struct genl_family nbd_genl_family __ro_after_init = {
 	.ops		= nbd_connect_genl_ops,
 	.n_ops		= ARRAY_SIZE(nbd_connect_genl_ops),
 	.maxattr	= NBD_ATTR_MAX,
+	.mcgrps		= nbd_mcast_grps,
+	.n_mcgrps	= ARRAY_SIZE(nbd_mcast_grps),
 };
 
 static void nbd_connect_reply(struct genl_info *info, int index)
@@ -1717,6 +1750,38 @@ static void nbd_connect_reply(struct genl_info *info, int index)
 	genlmsg_reply(skb, info);
 }
 
+static void nbd_mcast_index(int index)
+{
+	struct sk_buff *skb;
+	void *msg_head;
+	int ret;
+
+	skb = genlmsg_new(nla_total_size(sizeof(u32)), GFP_KERNEL);
+	if (!skb)
+		return;
+	msg_head = genlmsg_put(skb, 0, 0, &nbd_genl_family, 0,
+				     NBD_CMD_LINK_DEAD);
+	if (!msg_head) {
+		nlmsg_free(skb);
+		return;
+	}
+	ret = nla_put_u32(skb, NBD_ATTR_INDEX, index);
+	if (ret) {
+		nlmsg_free(skb);
+		return;
+	}
+	genlmsg_end(skb, msg_head);
+	genlmsg_multicast(&nbd_genl_family, skb, 0, 0, GFP_KERNEL);
+}
+
+static void nbd_dead_link_work(struct work_struct *work)
+{
+	struct link_dead_args *args = container_of(work, struct link_dead_args,
+						   work);
+	nbd_mcast_index(args->index);
+	kfree(args);
+}
+
 static int __init nbd_init(void)
 {
 	int i;
diff --git a/include/uapi/linux/nbd-netlink.h b/include/uapi/linux/nbd-netlink.h
index f932f96..b69105cc 100644
--- a/include/uapi/linux/nbd-netlink.h
+++ b/include/uapi/linux/nbd-netlink.h
@@ -18,8 +18,9 @@
 #ifndef _UAPILINUX_NBD_NETLINK_H
 #define _UAPILINUX_NBD_NETLINK_H
 
-#define NBD_GENL_FAMILY_NAME	"nbd"
-#define NBD_GENL_VERSION	0x1
+#define NBD_GENL_FAMILY_NAME		"nbd"
+#define NBD_GENL_VERSION		0x1
+#define NBD_GENL_MCAST_GROUP_NAME	"nbd_mc_group"
 
 /* Configuration policy attributes, used for CONNECT */
 enum {
@@ -63,6 +64,7 @@ enum {
 	NBD_CMD_CONNECT,
 	NBD_CMD_DISCONNECT,
 	NBD_CMD_RECONFIGURE,
+	NBD_CMD_LINK_DEAD,
 	__NBD_CMD_MAX,
 };
 #define NBD_CMD_MAX	(__NBD_CMD_MAX - 1)
-- 
2.7.4

^ permalink raw reply related

* [PATCH 06/12] nbd: add a reconfigure netlink command
From: Josef Bacik @ 2017-04-06 21:02 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team
In-Reply-To: <1491512527-4286-1-git-send-email-jbacik@fb.com>

We want to be able to reconnect dead connections to existing block
devices, so add a reconfigure netlink command.  We will also allow users
to change their timeout on the fly, but everything else will require a
disconnect and reconnect.  You won't be able to add more connections
either, simply replace dead connections with new more lively
connections.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c              | 141 +++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/nbd-netlink.h |   1 +
 2 files changed, 142 insertions(+)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index b8764e1..27958c3 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -785,6 +785,59 @@ static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg,
 	return 0;
 }
 
+static int nbd_reconnect_socket(struct nbd_device *nbd, unsigned long arg)
+{
+	struct nbd_config *config = nbd->config;
+	struct socket *sock, *old;
+	struct recv_thread_args *args;
+	int i;
+	int err;
+
+	sock = sockfd_lookup(arg, &err);
+	if (!sock)
+		return err;
+
+	args = kzalloc(sizeof(*args), GFP_KERNEL);
+	if (!args) {
+		sockfd_put(sock);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < config->num_connections; i++) {
+		struct nbd_sock *nsock = config->socks[i];
+
+		if (!nsock->dead)
+			continue;
+
+		mutex_lock(&nsock->tx_lock);
+		if (!nsock->dead) {
+			mutex_unlock(&nsock->tx_lock);
+			continue;
+		}
+		sk_set_memalloc(sock->sk);
+		atomic_inc(&config->recv_threads);
+		refcount_inc(&nbd->config_refs);
+		old = nsock->sock;
+		nsock->fallback_index = -1;
+		nsock->sock = sock;
+		nsock->dead = false;
+		INIT_WORK(&args->work, recv_work);
+		args->index = i;
+		args->nbd = nbd;
+		mutex_unlock(&nsock->tx_lock);
+		sockfd_put(old);
+
+		/* We take the tx_mutex in an error path in the recv_work, so we
+		 * need to queue_work outside of the tx_mutex.
+		 */
+		queue_work(recv_workqueue, &args->work);
+		return 0;
+	}
+	sockfd_put(sock);
+	kfree(args);
+	return -ENOSPC;
+}
+
 /* Reset all properties of an NBD device */
 static void nbd_reset(struct nbd_device *nbd)
 {
@@ -1529,6 +1582,89 @@ static int nbd_genl_disconnect(struct sk_buff *skb, struct genl_info *info)
 	return 0;
 }
 
+static int nbd_genl_reconfigure(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nbd_device *nbd = NULL;
+	struct nbd_config *config;
+	int index;
+	int ret = -EINVAL;
+
+	if (!netlink_capable(skb, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (!info->attrs[NBD_ATTR_INDEX]) {
+		printk(KERN_ERR "nbd: must specify a device to reconfigure\n");
+		return -EINVAL;
+	}
+	index = nla_get_u32(info->attrs[NBD_ATTR_INDEX]);
+	mutex_lock(&nbd_index_mutex);
+	nbd = idr_find(&nbd_index_idr, index);
+	mutex_unlock(&nbd_index_mutex);
+	if (!nbd) {
+		printk(KERN_ERR "nbd: couldn't find a device at index %d\n",
+		       index);
+		return -EINVAL;
+	}
+
+	if (!refcount_inc_not_zero(&nbd->config_refs)) {
+		dev_err(nbd_to_dev(nbd),
+			"not configured, cannot reconfigure\n");
+		return -EINVAL;
+	}
+
+	mutex_lock(&nbd->config_lock);
+	config = nbd->config;
+	if (!test_bit(NBD_BOUND, &config->runtime_flags) ||
+	    !nbd->task_recv) {
+		dev_err(nbd_to_dev(nbd),
+			"not configured, cannot reconfigure\n");
+		goto out;
+	}
+
+	if (info->attrs[NBD_ATTR_TIMEOUT]) {
+		u64 timeout = nla_get_u64(info->attrs[NBD_ATTR_TIMEOUT]);
+		nbd->tag_set.timeout = timeout * HZ;
+		blk_queue_rq_timeout(nbd->disk->queue, timeout * HZ);
+	}
+
+	if (info->attrs[NBD_ATTR_SOCKETS]) {
+		struct nlattr *attr;
+		int rem, fd;
+
+		nla_for_each_nested(attr, info->attrs[NBD_ATTR_SOCKETS],
+				    rem) {
+			struct nlattr *socks[NBD_SOCK_MAX+1];
+
+			if (nla_type(attr) != NBD_SOCK_ITEM) {
+				printk(KERN_ERR "nbd: socks must be embedded in a SOCK_ITEM attr\n");
+				ret = -EINVAL;
+				goto out;
+			}
+			ret = nla_parse_nested(socks, NBD_SOCK_MAX, attr,
+					       nbd_sock_policy);
+			if (ret != 0) {
+				printk(KERN_ERR "nbd: error processing sock list\n");
+				ret = -EINVAL;
+				goto out;
+			}
+			if (!socks[NBD_SOCK_FD])
+				continue;
+			fd = (int)nla_get_u32(socks[NBD_SOCK_FD]);
+			ret = nbd_reconnect_socket(nbd, fd);
+			if (ret) {
+				if (ret == -ENOSPC)
+					ret = 0;
+				goto out;
+			}
+			dev_info(nbd_to_dev(nbd), "reconnected socket\n");
+		}
+	}
+out:
+	mutex_unlock(&nbd->config_lock);
+	nbd_config_put(nbd);
+	return ret;
+}
+
 static const struct genl_ops nbd_connect_genl_ops[] = {
 	{
 		.cmd	= NBD_CMD_CONNECT,
@@ -1540,6 +1676,11 @@ static const struct genl_ops nbd_connect_genl_ops[] = {
 		.policy	= nbd_attr_policy,
 		.doit	= nbd_genl_disconnect,
 	},
+	{
+		.cmd	= NBD_CMD_RECONFIGURE,
+		.policy	= nbd_attr_policy,
+		.doit	= nbd_genl_reconfigure,
+	},
 };
 
 static struct genl_family nbd_genl_family __ro_after_init = {
diff --git a/include/uapi/linux/nbd-netlink.h b/include/uapi/linux/nbd-netlink.h
index fd0f4e4..f932f96 100644
--- a/include/uapi/linux/nbd-netlink.h
+++ b/include/uapi/linux/nbd-netlink.h
@@ -62,6 +62,7 @@ enum {
 	NBD_CMD_UNSPEC,
 	NBD_CMD_CONNECT,
 	NBD_CMD_DISCONNECT,
+	NBD_CMD_RECONFIGURE,
 	__NBD_CMD_MAX,
 };
 #define NBD_CMD_MAX	(__NBD_CMD_MAX - 1)
-- 
2.7.4

^ permalink raw reply related

* [PATCH 05/12] nbd: add a basic netlink interface
From: Josef Bacik @ 2017-04-06 21:02 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team
In-Reply-To: <1491512527-4286-1-git-send-email-jbacik@fb.com>

The existing ioctl interface for configuring NBD devices is a bit
cumbersome and hard to extend.  The other problem is we leave a
userspace app sitting in it's syscall until the device disconnects,
which is less than ideal.

This patch introduces a netlink interface for adding and disconnecting
nbd devices.  This has the benefits of being easily extendable without
breaking older userspace applications, and allows us to configure a nbd
device without leaving a userspace app sitting waiting for the device to
disconnect.

With this interface we also gain the ability to configure more devices
than are preallocated at insmod time.  We also have gained the ability
to not specify a particular device and be provided one for us so that
userspace doesn't need to find a free device to configure.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c              | 316 +++++++++++++++++++++++++++++++++++----
 include/uapi/linux/nbd-netlink.h |  69 +++++++++
 2 files changed, 359 insertions(+), 26 deletions(-)
 create mode 100644 include/uapi/linux/nbd-netlink.h

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 8891889..b8764e1 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -40,6 +40,8 @@
 #include <asm/types.h>
 
 #include <linux/nbd.h>
+#include <linux/nbd-netlink.h>
+#include <net/genetlink.h>
 
 static DEFINE_IDR(nbd_index_idr);
 static DEFINE_MUTEX(nbd_index_mutex);
@@ -63,6 +65,8 @@ struct recv_thread_args {
 #define NBD_DISCONNECT_REQUESTED	1
 #define NBD_DISCONNECTED		2
 #define NBD_HAS_PID_FILE		3
+#define NBD_HAS_CONFIG_REF		4
+#define NBD_BOUND			5
 
 struct nbd_config {
 	u32 flags;
@@ -83,6 +87,7 @@ struct nbd_config {
 struct nbd_device {
 	struct blk_mq_tag_set tag_set;
 
+	int index;
 	refcount_t config_refs;
 	struct nbd_config *config;
 	struct mutex config_lock;
@@ -114,6 +119,7 @@ static int part_shift;
 static int nbd_dev_dbg_init(struct nbd_device *nbd);
 static void nbd_dev_dbg_close(struct nbd_device *nbd);
 static void nbd_config_put(struct nbd_device *nbd);
+static void nbd_connect_reply(struct genl_info *info, int index);
 
 static inline struct device *nbd_to_dev(struct nbd_device *nbd)
 {
@@ -728,7 +734,8 @@ static int nbd_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return ret;
 }
 
-static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg)
+static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg,
+			  bool netlink)
 {
 	struct nbd_config *config = nbd->config;
 	struct socket *sock;
@@ -740,13 +747,17 @@ static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg)
 	if (!sock)
 		return err;
 
-	if (!nbd->task_setup)
+	if (!netlink && !nbd->task_setup &&
+	    !test_bit(NBD_BOUND, &config->runtime_flags))
 		nbd->task_setup = current;
-	if (nbd->task_setup != current) {
+
+	if (!netlink &&
+	    (nbd->task_setup != current ||
+	     test_bit(NBD_BOUND, &config->runtime_flags))) {
 		dev_err(disk_to_dev(nbd->disk),
 			"Device being setup by another task");
 		sockfd_put(sock);
-		return -EINVAL;
+		return -EBUSY;
 	}
 
 	socks = krealloc(config->socks, (config->num_connections + 1) *
@@ -872,7 +883,7 @@ static void nbd_config_put(struct nbd_device *nbd)
 	}
 }
 
-static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
+static int nbd_start_device(struct nbd_device *nbd)
 {
 	struct nbd_config *config = nbd->config;
 	int num_connections = config->num_connections;
@@ -888,11 +899,8 @@ static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
 		return -EINVAL;
 	}
 
-	if (max_part)
-		bdev->bd_invalidated = 1;
 	blk_mq_update_nr_hw_queues(&nbd->tag_set, config->num_connections);
 	nbd->task_recv = current;
-	mutex_unlock(&nbd->config_lock);
 
 	nbd_parse_flags(nbd);
 
@@ -901,11 +909,7 @@ static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
 		dev_err(disk_to_dev(nbd->disk), "device_create_file failed!\n");
 		return error;
 	}
-
 	set_bit(NBD_HAS_PID_FILE, &config->runtime_flags);
-	if (max_part)
-		bdev->bd_invalidated = 1;
-	bd_set_size(bdev, config->bytesize);
 
 	nbd_dev_dbg_init(nbd);
 	for (i = 0; i < num_connections; i++) {
@@ -924,18 +928,34 @@ static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
 		args->index = i;
 		queue_work(recv_workqueue, &args->work);
 	}
-	error = wait_event_interruptible(config->recv_wq,
+	return error;
+}
+
+static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *bdev)
+{
+	struct nbd_config *config = nbd->config;
+	int ret;
+
+	ret = nbd_start_device(nbd);
+	if (ret)
+		return ret;
+
+	bd_set_size(bdev, config->bytesize);
+	if (max_part)
+		bdev->bd_invalidated = 1;
+	mutex_unlock(&nbd->config_lock);
+	ret = wait_event_interruptible(config->recv_wq,
 					 atomic_read(&config->recv_threads) == 0);
-	if (error)
+	if (ret)
 		sock_shutdown(nbd);
 	mutex_lock(&nbd->config_lock);
-
+	bd_set_size(bdev, 0);
 	/* user requested, ignore socket errors */
 	if (test_bit(NBD_DISCONNECT_REQUESTED, &config->runtime_flags))
-		error = 0;
+		ret = 0;
 	if (test_bit(NBD_TIMEDOUT, &config->runtime_flags))
-		error = -ETIMEDOUT;
-	return error;
+		ret = -ETIMEDOUT;
+	return ret;
 }
 
 static void nbd_clear_sock_ioctl(struct nbd_device *nbd,
@@ -944,6 +964,9 @@ static void nbd_clear_sock_ioctl(struct nbd_device *nbd,
 	nbd_clear_sock(nbd);
 	kill_bdev(bdev);
 	nbd_bdev_reset(bdev);
+	if (test_and_clear_bit(NBD_HAS_CONFIG_REF,
+			       &nbd->config->runtime_flags))
+		nbd_config_put(nbd);
 }
 
 /* Must be called with config_lock held */
@@ -959,7 +982,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 		nbd_clear_sock_ioctl(nbd, bdev);
 		return 0;
 	case NBD_SET_SOCK:
-		return nbd_add_socket(nbd, arg);
+		return nbd_add_socket(nbd, arg, false);
 	case NBD_SET_BLKSIZE:
 		nbd_size_set(nbd, arg,
 			     div_s64(config->bytesize, arg));
@@ -982,7 +1005,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 		config->flags = arg;
 		return 0;
 	case NBD_DO_IT:
-		return nbd_start_device(nbd, bdev);
+		return nbd_start_device_ioctl(nbd, bdev);
 	case NBD_CLEAR_QUE:
 		/*
 		 * This is for compatibility only.  The queue is always cleared
@@ -1003,13 +1026,22 @@ static int nbd_ioctl(struct block_device *bdev, fmode_t mode,
 		     unsigned int cmd, unsigned long arg)
 {
 	struct nbd_device *nbd = bdev->bd_disk->private_data;
-	int error;
+	struct nbd_config *config = nbd->config;
+	int error = -EINVAL;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
 	mutex_lock(&nbd->config_lock);
-	error = __nbd_ioctl(bdev, nbd, cmd, arg);
+
+	/* Don't allow ioctl operations on a nbd device that was created with
+	 * netlink, unless it's DISCONNECT or CLEAR_SOCK, which are fine.
+	 */
+	if (!test_bit(NBD_BOUND, &config->runtime_flags) ||
+	    (cmd == NBD_DISCONNECT || cmd == NBD_CLEAR_SOCK))
+		error = __nbd_ioctl(bdev, nbd, cmd, arg);
+	else
+		dev_err(nbd_to_dev(nbd), "Cannot use ioctl interface on a netlink controlled device.\n");
 	mutex_unlock(&nbd->config_lock);
 	return error;
 }
@@ -1258,6 +1290,7 @@ static int nbd_dev_add(int index)
 	if (err < 0)
 		goto out_free_disk;
 
+	nbd->index = index;
 	nbd->disk = disk;
 	nbd->tag_set.ops = &nbd_mq_ops;
 	nbd->tag_set.nr_hw_queues = 1;
@@ -1313,10 +1346,235 @@ static int nbd_dev_add(int index)
 	return err;
 }
 
-/*
- * And here should be modules and kernel interface 
- *  (Just smiley confuses emacs :-)
- */
+static int find_free_cb(int id, void *ptr, void *data)
+{
+	struct nbd_device *nbd = ptr;
+	struct nbd_device **found = data;
+
+	if (!refcount_read(&nbd->config_refs)) {
+		*found = nbd;
+		return 1;
+	}
+	return 0;
+}
+
+/* Netlink interface. */
+static struct nla_policy nbd_attr_policy[NBD_ATTR_MAX + 1] = {
+	[NBD_ATTR_INDEX]		=	{ .type = NLA_U32 },
+	[NBD_ATTR_SIZE_BYTES]		=	{ .type = NLA_U64 },
+	[NBD_ATTR_BLOCK_SIZE_BYTES]	=	{ .type = NLA_U64 },
+	[NBD_ATTR_TIMEOUT]		=	{ .type = NLA_U64 },
+	[NBD_ATTR_SERVER_FLAGS]		=	{ .type = NLA_U64 },
+	[NBD_ATTR_CLIENT_FLAGS]		=	{ .type = NLA_U64 },
+	[NBD_ATTR_SOCKETS]		=	{ .type = NLA_NESTED},
+};
+
+static struct nla_policy nbd_sock_policy[NBD_SOCK_MAX + 1] = {
+	[NBD_SOCK_FD]			=	{ .type = NLA_U32 },
+};
+
+static int nbd_genl_connect(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nbd_device *nbd = NULL;
+	struct nbd_config *config;
+	int index = -1;
+	int ret;
+
+	if (!netlink_capable(skb, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (info->attrs[NBD_ATTR_INDEX])
+		index = nla_get_u32(info->attrs[NBD_ATTR_INDEX]);
+	if (!info->attrs[NBD_ATTR_SOCKETS]) {
+		printk(KERN_ERR "nbd: must specify at least one socket\n");
+		return -EINVAL;
+	}
+	if (!info->attrs[NBD_ATTR_SIZE_BYTES]) {
+		printk(KERN_ERR "nbd: must specify a size in bytes for the device\n");
+		return -EINVAL;
+	}
+again:
+	mutex_lock(&nbd_index_mutex);
+	if (index == -1) {
+		ret = idr_for_each(&nbd_index_idr, &find_free_cb, &nbd);
+		if (ret == 0) {
+			int new_index;
+			new_index = nbd_dev_add(-1);
+			if (new_index < 0) {
+				mutex_unlock(&nbd_index_mutex);
+				printk(KERN_ERR "nbd: failed to add new device\n");
+				return ret;
+			}
+			nbd = idr_find(&nbd_index_idr, new_index);
+		}
+	} else {
+		nbd = idr_find(&nbd_index_idr, index);
+	}
+	mutex_unlock(&nbd_index_mutex);
+	if (!nbd) {
+		printk(KERN_ERR "nbd: couldn't find device at index %d\n",
+		       index);
+		return -EINVAL;
+	}
+
+	mutex_lock(&nbd->config_lock);
+	if (refcount_read(&nbd->config_refs)) {
+		mutex_unlock(&nbd->config_lock);
+		if (index == -1)
+			goto again;
+		printk(KERN_ERR "nbd: nbd%d already in use\n", index);
+		return -EBUSY;
+	}
+	if (WARN_ON(nbd->config)) {
+		mutex_unlock(&nbd->config_lock);
+		return -EINVAL;
+	}
+	config = nbd->config = nbd_alloc_config();
+	if (!nbd->config) {
+		mutex_unlock(&nbd->config_lock);
+		printk(KERN_ERR "nbd: couldn't allocate config\n");
+		return -ENOMEM;
+	}
+	refcount_set(&nbd->config_refs, 1);
+	set_bit(NBD_BOUND, &config->runtime_flags);
+
+	if (info->attrs[NBD_ATTR_SIZE_BYTES]) {
+		u64 bytes = nla_get_u64(info->attrs[NBD_ATTR_SIZE_BYTES]);
+		nbd_size_set(nbd, config->blksize,
+			     div64_u64(bytes, config->blksize));
+	}
+	if (info->attrs[NBD_ATTR_BLOCK_SIZE_BYTES]) {
+		u64 bsize =
+			nla_get_u64(info->attrs[NBD_ATTR_BLOCK_SIZE_BYTES]);
+		nbd_size_set(nbd, bsize, div64_u64(config->bytesize, bsize));
+	}
+	if (info->attrs[NBD_ATTR_TIMEOUT]) {
+		u64 timeout = nla_get_u64(info->attrs[NBD_ATTR_TIMEOUT]);
+		nbd->tag_set.timeout = timeout * HZ;
+		blk_queue_rq_timeout(nbd->disk->queue, timeout * HZ);
+	}
+	if (info->attrs[NBD_ATTR_SERVER_FLAGS])
+		config->flags =
+			nla_get_u64(info->attrs[NBD_ATTR_SERVER_FLAGS]);
+	if (info->attrs[NBD_ATTR_SOCKETS]) {
+		struct nlattr *attr;
+		int rem, fd;
+
+		nla_for_each_nested(attr, info->attrs[NBD_ATTR_SOCKETS],
+				    rem) {
+			struct nlattr *socks[NBD_SOCK_MAX+1];
+
+			if (nla_type(attr) != NBD_SOCK_ITEM) {
+				printk(KERN_ERR "nbd: socks must be embedded in a SOCK_ITEM attr\n");
+				ret = -EINVAL;
+				goto out;
+			}
+			ret = nla_parse_nested(socks, NBD_SOCK_MAX, attr,
+					       nbd_sock_policy);
+			if (ret != 0) {
+				printk(KERN_ERR "nbd: error processing sock list\n");
+				ret = -EINVAL;
+				goto out;
+			}
+			if (!socks[NBD_SOCK_FD])
+				continue;
+			fd = (int)nla_get_u32(socks[NBD_SOCK_FD]);
+			ret = nbd_add_socket(nbd, fd, true);
+			if (ret)
+				goto out;
+		}
+	}
+	ret = nbd_start_device(nbd);
+out:
+	mutex_unlock(&nbd->config_lock);
+	if (!ret) {
+		set_bit(NBD_HAS_CONFIG_REF, &config->runtime_flags);
+		refcount_inc(&nbd->config_refs);
+		nbd_connect_reply(info, nbd->index);
+	}
+	nbd_config_put(nbd);
+	return ret;
+}
+
+static int nbd_genl_disconnect(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nbd_device *nbd;
+	int index;
+
+	if (!netlink_capable(skb, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (!info->attrs[NBD_ATTR_INDEX]) {
+		printk(KERN_ERR "nbd: must specify an index to disconnect\n");
+		return -EINVAL;
+	}
+	index = nla_get_u32(info->attrs[NBD_ATTR_INDEX]);
+	mutex_lock(&nbd_index_mutex);
+	nbd = idr_find(&nbd_index_idr, index);
+	mutex_unlock(&nbd_index_mutex);
+	if (!nbd) {
+		printk(KERN_ERR "nbd: couldn't find device at index %d\n",
+		       index);
+		return -EINVAL;
+	}
+	if (!refcount_inc_not_zero(&nbd->config_refs))
+		return 0;
+	mutex_lock(&nbd->config_lock);
+	nbd_disconnect(nbd);
+	mutex_unlock(&nbd->config_lock);
+	if (test_and_clear_bit(NBD_HAS_CONFIG_REF,
+			       &nbd->config->runtime_flags))
+		nbd_config_put(nbd);
+	nbd_config_put(nbd);
+	return 0;
+}
+
+static const struct genl_ops nbd_connect_genl_ops[] = {
+	{
+		.cmd	= NBD_CMD_CONNECT,
+		.policy	= nbd_attr_policy,
+		.doit	= nbd_genl_connect,
+	},
+	{
+		.cmd	= NBD_CMD_DISCONNECT,
+		.policy	= nbd_attr_policy,
+		.doit	= nbd_genl_disconnect,
+	},
+};
+
+static struct genl_family nbd_genl_family __ro_after_init = {
+	.hdrsize	= 0,
+	.name		= NBD_GENL_FAMILY_NAME,
+	.version	= NBD_GENL_VERSION,
+	.module		= THIS_MODULE,
+	.ops		= nbd_connect_genl_ops,
+	.n_ops		= ARRAY_SIZE(nbd_connect_genl_ops),
+	.maxattr	= NBD_ATTR_MAX,
+};
+
+static void nbd_connect_reply(struct genl_info *info, int index)
+{
+	struct sk_buff *skb;
+	void *msg_head;
+	int ret;
+
+	skb = genlmsg_new(nla_total_size(sizeof(u32)), GFP_KERNEL);
+	if (!skb)
+		return;
+	msg_head = genlmsg_put_reply(skb, info, &nbd_genl_family, 0,
+				     NBD_CMD_CONNECT);
+	if (!msg_head) {
+		nlmsg_free(skb);
+		return;
+	}
+	ret = nla_put_u32(skb, NBD_ATTR_INDEX, index);
+	if (ret) {
+		nlmsg_free(skb);
+		return;
+	}
+	genlmsg_end(skb, msg_head);
+	genlmsg_reply(skb, info);
+}
 
 static int __init nbd_init(void)
 {
@@ -1359,6 +1617,11 @@ static int __init nbd_init(void)
 		return -EIO;
 	}
 
+	if (genl_register_family(&nbd_genl_family)) {
+		unregister_blkdev(NBD_MAJOR, "nbd");
+		destroy_workqueue(recv_workqueue);
+		return -EINVAL;
+	}
 	nbd_dbg_init();
 
 	mutex_lock(&nbd_index_mutex);
@@ -1381,6 +1644,7 @@ static void __exit nbd_cleanup(void)
 
 	idr_for_each(&nbd_index_idr, &nbd_exit_cb, NULL);
 	idr_destroy(&nbd_index_idr);
+	genl_unregister_family(&nbd_genl_family);
 	destroy_workqueue(recv_workqueue);
 	unregister_blkdev(NBD_MAJOR, "nbd");
 }
diff --git a/include/uapi/linux/nbd-netlink.h b/include/uapi/linux/nbd-netlink.h
new file mode 100644
index 0000000..fd0f4e4
--- /dev/null
+++ b/include/uapi/linux/nbd-netlink.h
@@ -0,0 +1,69 @@
+/*
+ * Copyright (C) 2017 Facebook.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#ifndef _UAPILINUX_NBD_NETLINK_H
+#define _UAPILINUX_NBD_NETLINK_H
+
+#define NBD_GENL_FAMILY_NAME	"nbd"
+#define NBD_GENL_VERSION	0x1
+
+/* Configuration policy attributes, used for CONNECT */
+enum {
+	NBD_ATTR_UNSPEC,
+	NBD_ATTR_INDEX,
+	NBD_ATTR_SIZE_BYTES,
+	NBD_ATTR_BLOCK_SIZE_BYTES,
+	NBD_ATTR_TIMEOUT,
+	NBD_ATTR_SERVER_FLAGS,
+	NBD_ATTR_CLIENT_FLAGS,
+	NBD_ATTR_SOCKETS,
+	__NBD_ATTR_MAX,
+};
+#define NBD_ATTR_MAX (__NBD_ATTR_MAX - 1)
+
+/*
+ * This is the format for multiple sockets with NBD_ATTR_SOCKETS
+ *
+ * [NBD_ATTR_SOCKETS]
+ *   [NBD_SOCK_ITEM]
+ *     [NBD_SOCK_FD]
+ *   [NBD_SOCK_ITEM]
+ *     [NBD_SOCK_FD]
+ */
+enum {
+	NBD_SOCK_ITEM_UNSPEC,
+	NBD_SOCK_ITEM,
+	__NBD_SOCK_ITEM_MAX,
+};
+#define NBD_SOCK_ITEM_MAX (__NBD_SOCK_ITEM_MAX - 1)
+
+enum {
+	NBD_SOCK_UNSPEC,
+	NBD_SOCK_FD,
+	__NBD_SOCK_MAX,
+};
+#define NBD_SOCK_MAX (__NBD_SOCK_MAX - 1)
+
+enum {
+	NBD_CMD_UNSPEC,
+	NBD_CMD_CONNECT,
+	NBD_CMD_DISCONNECT,
+	__NBD_CMD_MAX,
+};
+#define NBD_CMD_MAX	(__NBD_CMD_MAX - 1)
+
+#endif /* _UAPILINUX_NBD_NETLINK_H */
-- 
2.7.4

^ permalink raw reply related

* [PATCH 04/12] nbd: stop using the bdev everywhere
From: Josef Bacik @ 2017-04-06 21:01 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team
In-Reply-To: <1491512527-4286-1-git-send-email-jbacik@fb.com>

In preparation for the upcoming netlink interface we need to not rely on
already having the bdev for the NBD device we are doing operations on.
Instead of passing the bdev around, just use it in places where we know
we already have the bdev.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c | 92 +++++++++++++++++++++--------------------------------
 1 file changed, 37 insertions(+), 55 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index c2d9923..8891889 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -120,11 +120,6 @@ static inline struct device *nbd_to_dev(struct nbd_device *nbd)
 	return disk_to_dev(nbd->disk);
 }
 
-static bool nbd_is_connected(struct nbd_device *nbd)
-{
-	return !!nbd->task_recv;
-}
-
 static const char *nbdcmd_to_ascii(int cmd)
 {
 	switch (cmd) {
@@ -160,36 +155,30 @@ static void nbd_mark_nsock_dead(struct nbd_sock *nsock)
 	nsock->sent = 0;
 }
 
-static int nbd_size_clear(struct nbd_device *nbd, struct block_device *bdev)
+static void nbd_size_clear(struct nbd_device *nbd)
 {
 	if (nbd->config->bytesize) {
-		if (bdev->bd_openers <= 1)
-			bd_set_size(bdev, 0);
 		set_capacity(nbd->disk, 0);
 		kobject_uevent(&nbd_to_dev(nbd)->kobj, KOBJ_CHANGE);
 	}
-
-	return 0;
 }
 
-static void nbd_size_update(struct nbd_device *nbd, struct block_device *bdev)
+static void nbd_size_update(struct nbd_device *nbd)
 {
 	struct nbd_config *config = nbd->config;
 	blk_queue_logical_block_size(nbd->disk->queue, config->blksize);
 	blk_queue_physical_block_size(nbd->disk->queue, config->blksize);
-	bd_set_size(bdev, config->bytesize);
 	set_capacity(nbd->disk, config->bytesize >> 9);
 	kobject_uevent(&nbd_to_dev(nbd)->kobj, KOBJ_CHANGE);
 }
 
-static void nbd_size_set(struct nbd_device *nbd, struct block_device *bdev,
-			loff_t blocksize, loff_t nr_blocks)
+static void nbd_size_set(struct nbd_device *nbd, loff_t blocksize,
+			 loff_t nr_blocks)
 {
 	struct nbd_config *config = nbd->config;
 	config->blksize = blocksize;
 	config->bytesize = blocksize * nr_blocks;
-	if (nbd_is_connected(nbd))
-		nbd_size_update(nbd, bdev);
+	nbd_size_update(nbd);
 }
 
 static void nbd_end_request(struct nbd_cmd *cmd)
@@ -739,8 +728,7 @@ static int nbd_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return ret;
 }
 
-static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
-			  unsigned long arg)
+static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg)
 {
 	struct nbd_config *config = nbd->config;
 	struct socket *sock;
@@ -783,8 +771,6 @@ static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
 	nsock->sent = 0;
 	socks[config->num_connections++] = nsock;
 
-	if (max_part)
-		bdev->bd_invalidated = 1;
 	return 0;
 }
 
@@ -800,19 +786,20 @@ static void nbd_bdev_reset(struct block_device *bdev)
 {
 	if (bdev->bd_openers > 1)
 		return;
-	set_device_ro(bdev, false);
-	bdev->bd_inode->i_size = 0;
+	bd_set_size(bdev, 0);
 	if (max_part > 0) {
 		blkdev_reread_part(bdev);
 		bdev->bd_invalidated = 1;
 	}
 }
 
-static void nbd_parse_flags(struct nbd_device *nbd, struct block_device *bdev)
+static void nbd_parse_flags(struct nbd_device *nbd)
 {
 	struct nbd_config *config = nbd->config;
 	if (config->flags & NBD_FLAG_READ_ONLY)
-		set_device_ro(bdev, true);
+		set_disk_ro(nbd->disk, true);
+	else
+		set_disk_ro(nbd->disk, false);
 	if (config->flags & NBD_FLAG_SEND_TRIM)
 		queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, nbd->disk->queue);
 	if (config->flags & NBD_FLAG_SEND_FLUSH)
@@ -841,52 +828,36 @@ static void send_disconnects(struct nbd_device *nbd)
 	}
 }
 
-static int nbd_disconnect(struct nbd_device *nbd, struct block_device *bdev)
+static int nbd_disconnect(struct nbd_device *nbd)
 {
 	struct nbd_config *config = nbd->config;
 
 	dev_info(disk_to_dev(nbd->disk), "NBD_DISCONNECT\n");
-	mutex_unlock(&nbd->config_lock);
-	fsync_bdev(bdev);
-	mutex_lock(&nbd->config_lock);
-
 	if (!test_and_set_bit(NBD_DISCONNECT_REQUESTED,
 			      &config->runtime_flags))
 		send_disconnects(nbd);
 	return 0;
 }
 
-static int nbd_clear_sock(struct nbd_device *nbd, struct block_device *bdev)
+static void nbd_clear_sock(struct nbd_device *nbd)
 {
 	sock_shutdown(nbd);
 	nbd_clear_que(nbd);
-
-	__invalidate_device(bdev, true);
-	nbd_bdev_reset(bdev);
 	nbd->task_setup = NULL;
-	return 0;
 }
 
 static void nbd_config_put(struct nbd_device *nbd)
 {
 	if (refcount_dec_and_mutex_lock(&nbd->config_refs,
 					&nbd->config_lock)) {
-		struct block_device *bdev;
 		struct nbd_config *config = nbd->config;
-
-		bdev = bdget_disk(nbd->disk, 0);
-		if (!bdev) {
-			mutex_unlock(&nbd->config_lock);
-			return;
-		}
-
 		nbd_dev_dbg_close(nbd);
-		nbd_size_clear(nbd, bdev);
+		nbd_size_clear(nbd);
 		if (test_and_clear_bit(NBD_HAS_PID_FILE,
 				       &config->runtime_flags))
 			device_remove_file(disk_to_dev(nbd->disk), &pid_attr);
 		nbd->task_recv = NULL;
-		nbd_clear_sock(nbd, bdev);
+		nbd_clear_sock(nbd);
 		if (config->num_connections) {
 			int i;
 			for (i = 0; i < config->num_connections; i++) {
@@ -897,7 +868,6 @@ static void nbd_config_put(struct nbd_device *nbd)
 		}
 		nbd_reset(nbd);
 		mutex_unlock(&nbd->config_lock);
-		bdput(bdev);
 		module_put(THIS_MODULE);
 	}
 }
@@ -912,27 +882,30 @@ static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
 		return -EBUSY;
 	if (!config->socks)
 		return -EINVAL;
-
 	if (num_connections > 1 &&
 	    !(config->flags & NBD_FLAG_CAN_MULTI_CONN)) {
 		dev_err(disk_to_dev(nbd->disk), "server does not support multiple connections per device.\n");
 		return -EINVAL;
 	}
 
+	if (max_part)
+		bdev->bd_invalidated = 1;
 	blk_mq_update_nr_hw_queues(&nbd->tag_set, config->num_connections);
 	nbd->task_recv = current;
 	mutex_unlock(&nbd->config_lock);
 
-	nbd_parse_flags(nbd, bdev);
+	nbd_parse_flags(nbd);
 
 	error = device_create_file(disk_to_dev(nbd->disk), &pid_attr);
 	if (error) {
 		dev_err(disk_to_dev(nbd->disk), "device_create_file failed!\n");
 		return error;
 	}
-	set_bit(NBD_HAS_PID_FILE, &config->runtime_flags);
 
-	nbd_size_update(nbd, bdev);
+	set_bit(NBD_HAS_PID_FILE, &config->runtime_flags);
+	if (max_part)
+		bdev->bd_invalidated = 1;
+	bd_set_size(bdev, config->bytesize);
 
 	nbd_dev_dbg_init(nbd);
 	for (i = 0; i < num_connections; i++) {
@@ -965,6 +938,14 @@ static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
 	return error;
 }
 
+static void nbd_clear_sock_ioctl(struct nbd_device *nbd,
+				 struct block_device *bdev)
+{
+	nbd_clear_sock(nbd);
+	kill_bdev(bdev);
+	nbd_bdev_reset(bdev);
+}
+
 /* Must be called with config_lock held */
 static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 		       unsigned int cmd, unsigned long arg)
@@ -973,21 +954,22 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 
 	switch (cmd) {
 	case NBD_DISCONNECT:
-		return nbd_disconnect(nbd, bdev);
+		return nbd_disconnect(nbd);
 	case NBD_CLEAR_SOCK:
-		return nbd_clear_sock(nbd, bdev);
+		nbd_clear_sock_ioctl(nbd, bdev);
+		return 0;
 	case NBD_SET_SOCK:
-		return nbd_add_socket(nbd, bdev, arg);
+		return nbd_add_socket(nbd, arg);
 	case NBD_SET_BLKSIZE:
-		nbd_size_set(nbd, bdev, arg,
+		nbd_size_set(nbd, arg,
 			     div_s64(config->bytesize, arg));
 		return 0;
 	case NBD_SET_SIZE:
-		nbd_size_set(nbd, bdev, config->blksize,
+		nbd_size_set(nbd, config->blksize,
 			     div_s64(arg, config->blksize));
 		return 0;
 	case NBD_SET_SIZE_BLOCKS:
-		nbd_size_set(nbd, bdev, config->blksize, arg);
+		nbd_size_set(nbd, config->blksize, arg);
 		return 0;
 	case NBD_SET_TIMEOUT:
 		if (arg) {
-- 
2.7.4

^ permalink raw reply related

* [PATCH 03/12] nbd: separate out the config information
From: Josef Bacik @ 2017-04-06 21:01 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team
In-Reply-To: <1491512527-4286-1-git-send-email-jbacik@fb.com>

In order to properly refcount the various aspects of a NBD device we
need to separate out the configuration elements of the nbd device.  The
configuration of a NBD device has a different lifetime from the actual
device, so it doesn't make sense to bundle these two concepts.  Add a
config_refs to keep track of the configuration structure, that way we
can be sure that we never access it when we've torn down the device.
Add a new nbd_config structure to hold all of the transient
configuration information.  Finally create this when we open the device
so that it is in place when we start to configure the device.  This has
a nice side-effect of fixing a long standing problem where you could end
up with a half-configured nbd device that needed to be "disconnected" in
order to be usable again.  Now once we close our device the
configuration will be discarded.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c | 437 +++++++++++++++++++++++++++++++---------------------
 1 file changed, 263 insertions(+), 174 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 4e64d60..c2d9923 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -53,35 +53,45 @@ struct nbd_sock {
 	int fallback_index;
 };
 
+struct recv_thread_args {
+	struct work_struct work;
+	struct nbd_device *nbd;
+	int index;
+};
+
 #define NBD_TIMEDOUT			0
 #define NBD_DISCONNECT_REQUESTED	1
 #define NBD_DISCONNECTED		2
-#define NBD_RUNNING			3
+#define NBD_HAS_PID_FILE		3
 
-struct nbd_device {
+struct nbd_config {
 	u32 flags;
 	unsigned long runtime_flags;
-	struct nbd_sock **socks;
-	int magic;
-
-	struct blk_mq_tag_set tag_set;
 
-	struct mutex config_lock;
-	struct gendisk *disk;
+	struct nbd_sock **socks;
 	int num_connections;
+
 	atomic_t recv_threads;
 	wait_queue_head_t recv_wq;
 	loff_t blksize;
 	loff_t bytesize;
-
-	struct task_struct *task_recv;
-	struct task_struct *task_setup;
-
 #if IS_ENABLED(CONFIG_DEBUG_FS)
 	struct dentry *dbg_dir;
 #endif
 };
 
+struct nbd_device {
+	struct blk_mq_tag_set tag_set;
+
+	refcount_t config_refs;
+	struct nbd_config *config;
+	struct mutex config_lock;
+	struct gendisk *disk;
+
+	struct task_struct *task_recv;
+	struct task_struct *task_setup;
+};
+
 struct nbd_cmd {
 	struct nbd_device *nbd;
 	int index;
@@ -103,7 +113,7 @@ static int part_shift;
 
 static int nbd_dev_dbg_init(struct nbd_device *nbd);
 static void nbd_dev_dbg_close(struct nbd_device *nbd);
-
+static void nbd_config_put(struct nbd_device *nbd);
 
 static inline struct device *nbd_to_dev(struct nbd_device *nbd)
 {
@@ -127,6 +137,20 @@ static const char *nbdcmd_to_ascii(int cmd)
 	return "invalid";
 }
 
+static ssize_t pid_show(struct device *dev,
+			struct device_attribute *attr, char *buf)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+	struct nbd_device *nbd = (struct nbd_device *)disk->private_data;
+
+	return sprintf(buf, "%d\n", task_pid_nr(nbd->task_recv));
+}
+
+static struct device_attribute pid_attr = {
+	.attr = { .name = "pid", .mode = S_IRUGO},
+	.show = pid_show,
+};
+
 static void nbd_mark_nsock_dead(struct nbd_sock *nsock)
 {
 	if (!nsock->dead)
@@ -138,28 +162,32 @@ static void nbd_mark_nsock_dead(struct nbd_sock *nsock)
 
 static int nbd_size_clear(struct nbd_device *nbd, struct block_device *bdev)
 {
-	if (bdev->bd_openers <= 1)
-		bd_set_size(bdev, 0);
-	set_capacity(nbd->disk, 0);
-	kobject_uevent(&nbd_to_dev(nbd)->kobj, KOBJ_CHANGE);
+	if (nbd->config->bytesize) {
+		if (bdev->bd_openers <= 1)
+			bd_set_size(bdev, 0);
+		set_capacity(nbd->disk, 0);
+		kobject_uevent(&nbd_to_dev(nbd)->kobj, KOBJ_CHANGE);
+	}
 
 	return 0;
 }
 
 static void nbd_size_update(struct nbd_device *nbd, struct block_device *bdev)
 {
-	blk_queue_logical_block_size(nbd->disk->queue, nbd->blksize);
-	blk_queue_physical_block_size(nbd->disk->queue, nbd->blksize);
-	bd_set_size(bdev, nbd->bytesize);
-	set_capacity(nbd->disk, nbd->bytesize >> 9);
+	struct nbd_config *config = nbd->config;
+	blk_queue_logical_block_size(nbd->disk->queue, config->blksize);
+	blk_queue_physical_block_size(nbd->disk->queue, config->blksize);
+	bd_set_size(bdev, config->bytesize);
+	set_capacity(nbd->disk, config->bytesize >> 9);
 	kobject_uevent(&nbd_to_dev(nbd)->kobj, KOBJ_CHANGE);
 }
 
 static void nbd_size_set(struct nbd_device *nbd, struct block_device *bdev,
 			loff_t blocksize, loff_t nr_blocks)
 {
-	nbd->blksize = blocksize;
-	nbd->bytesize = blocksize * nr_blocks;
+	struct nbd_config *config = nbd->config;
+	config->blksize = blocksize;
+	config->bytesize = blocksize * nr_blocks;
 	if (nbd_is_connected(nbd))
 		nbd_size_update(nbd, bdev);
 }
@@ -181,17 +209,19 @@ static void nbd_end_request(struct nbd_cmd *cmd)
  */
 static void sock_shutdown(struct nbd_device *nbd)
 {
+	struct nbd_config *config = nbd->config;
 	int i;
 
-	if (nbd->num_connections == 0)
+	if (config->num_connections == 0)
 		return;
-	if (test_and_set_bit(NBD_DISCONNECTED, &nbd->runtime_flags))
+	if (test_and_set_bit(NBD_DISCONNECTED, &config->runtime_flags))
 		return;
 
-	for (i = 0; i < nbd->num_connections; i++) {
-		struct nbd_sock *nsock = nbd->socks[i];
+	for (i = 0; i < config->num_connections; i++) {
+		struct nbd_sock *nsock = config->socks[i];
 		mutex_lock(&nsock->tx_lock);
 		kernel_sock_shutdown(nsock->sock, SHUT_RDWR);
+		nbd_mark_nsock_dead(nsock);
 		mutex_unlock(&nsock->tx_lock);
 	}
 	dev_warn(disk_to_dev(nbd->disk), "shutting down sockets\n");
@@ -202,38 +232,43 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 {
 	struct nbd_cmd *cmd = blk_mq_rq_to_pdu(req);
 	struct nbd_device *nbd = cmd->nbd;
+	struct nbd_config *config;
+
+	if (!refcount_inc_not_zero(&nbd->config_refs)) {
+		req->errors = -EIO;
+		return BLK_EH_HANDLED;
+	}
 
-	if (nbd->num_connections > 1) {
+	config = nbd->config;
+
+	if (config->num_connections > 1) {
 		dev_err_ratelimited(nbd_to_dev(nbd),
 				    "Connection timed out, retrying\n");
-		mutex_lock(&nbd->config_lock);
 		/*
 		 * Hooray we have more connections, requeue this IO, the submit
 		 * path will put it on a real connection.
 		 */
-		if (nbd->socks && nbd->num_connections > 1) {
-			if (cmd->index < nbd->num_connections) {
+		if (config->socks && config->num_connections > 1) {
+			if (cmd->index < config->num_connections) {
 				struct nbd_sock *nsock =
-					nbd->socks[cmd->index];
+					config->socks[cmd->index];
 				mutex_lock(&nsock->tx_lock);
 				nbd_mark_nsock_dead(nsock);
 				mutex_unlock(&nsock->tx_lock);
 			}
-			mutex_unlock(&nbd->config_lock);
 			blk_mq_requeue_request(req, true);
+			nbd_config_put(nbd);
 			return BLK_EH_NOT_HANDLED;
 		}
-		mutex_unlock(&nbd->config_lock);
 	} else {
 		dev_err_ratelimited(nbd_to_dev(nbd),
 				    "Connection timed out\n");
 	}
-	set_bit(NBD_TIMEDOUT, &nbd->runtime_flags);
+	set_bit(NBD_TIMEDOUT, &config->runtime_flags);
 	req->errors = -EIO;
-
-	mutex_lock(&nbd->config_lock);
 	sock_shutdown(nbd);
-	mutex_unlock(&nbd->config_lock);
+	nbd_config_put(nbd);
+
 	return BLK_EH_HANDLED;
 }
 
@@ -243,7 +278,8 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 static int sock_xmit(struct nbd_device *nbd, int index, int send,
 		     struct iov_iter *iter, int msg_flags, int *sent)
 {
-	struct socket *sock = nbd->socks[index]->sock;
+	struct nbd_config *config = nbd->config;
+	struct socket *sock = config->socks[index]->sock;
 	int result;
 	struct msghdr msg;
 	unsigned long pflags = current->flags;
@@ -289,7 +325,8 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send,
 static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 {
 	struct request *req = blk_mq_rq_from_pdu(cmd);
-	struct nbd_sock *nsock = nbd->socks[index];
+	struct nbd_config *config = nbd->config;
+	struct nbd_sock *nsock = config->socks[index];
 	int result;
 	struct nbd_request request = {.magic = htonl(NBD_REQUEST_MAGIC)};
 	struct kvec iov = {.iov_base = &request, .iov_len = sizeof(request)};
@@ -320,7 +357,7 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 	}
 
 	if (rq_data_dir(req) == WRITE &&
-	    (nbd->flags & NBD_FLAG_READ_ONLY)) {
+	    (config->flags & NBD_FLAG_READ_ONLY)) {
 		dev_err_ratelimited(disk_to_dev(nbd->disk),
 				    "Write on read-only\n");
 		return -EIO;
@@ -426,15 +463,16 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 	return 0;
 }
 
-static int nbd_disconnected(struct nbd_device *nbd)
+static int nbd_disconnected(struct nbd_config *config)
 {
-	return test_bit(NBD_DISCONNECTED, &nbd->runtime_flags) ||
-		test_bit(NBD_DISCONNECT_REQUESTED, &nbd->runtime_flags);
+	return test_bit(NBD_DISCONNECTED, &config->runtime_flags) ||
+		test_bit(NBD_DISCONNECT_REQUESTED, &config->runtime_flags);
 }
 
 /* NULL returned = something went wrong, inform userspace */
 static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
 {
+	struct nbd_config *config = nbd->config;
 	int result;
 	struct nbd_reply reply;
 	struct nbd_cmd *cmd;
@@ -448,7 +486,7 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
 	iov_iter_kvec(&to, READ | ITER_KVEC, &iov, 1, sizeof(reply));
 	result = sock_xmit(nbd, index, 0, &to, MSG_WAITALL, NULL);
 	if (result <= 0) {
-		if (!nbd_disconnected(nbd))
+		if (!nbd_disconnected(config))
 			dev_err(disk_to_dev(nbd->disk),
 				"Receive control failed (result %d)\n", result);
 		return ERR_PTR(result);
@@ -498,8 +536,8 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
 				 * and let the timeout stuff handle resubmitting
 				 * this request onto another connection.
 				 */
-				if (nbd_disconnected(nbd) ||
-				    nbd->num_connections <= 1) {
+				if (nbd_disconnected(config) ||
+				    config->num_connections <= 1) {
 					req->errors = -EIO;
 					return cmd;
 				}
@@ -515,40 +553,20 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
 	return cmd;
 }
 
-static ssize_t pid_show(struct device *dev,
-			struct device_attribute *attr, char *buf)
-{
-	struct gendisk *disk = dev_to_disk(dev);
-	struct nbd_device *nbd = (struct nbd_device *)disk->private_data;
-
-	return sprintf(buf, "%d\n", task_pid_nr(nbd->task_recv));
-}
-
-static struct device_attribute pid_attr = {
-	.attr = { .name = "pid", .mode = S_IRUGO},
-	.show = pid_show,
-};
-
-struct recv_thread_args {
-	struct work_struct work;
-	struct nbd_device *nbd;
-	int index;
-};
-
 static void recv_work(struct work_struct *work)
 {
 	struct recv_thread_args *args = container_of(work,
 						     struct recv_thread_args,
 						     work);
 	struct nbd_device *nbd = args->nbd;
+	struct nbd_config *config = nbd->config;
 	struct nbd_cmd *cmd;
 	int ret = 0;
 
-	BUG_ON(nbd->magic != NBD_MAGIC);
 	while (1) {
 		cmd = nbd_read_stat(nbd, args->index);
 		if (IS_ERR(cmd)) {
-			struct nbd_sock *nsock = nbd->socks[args->index];
+			struct nbd_sock *nsock = config->socks[args->index];
 
 			mutex_lock(&nsock->tx_lock);
 			nbd_mark_nsock_dead(nsock);
@@ -559,8 +577,10 @@ static void recv_work(struct work_struct *work)
 
 		nbd_end_request(cmd);
 	}
-	atomic_dec(&nbd->recv_threads);
-	wake_up(&nbd->recv_wq);
+	atomic_dec(&config->recv_threads);
+	wake_up(&config->recv_wq);
+	nbd_config_put(nbd);
+	kfree(args);
 }
 
 static void nbd_clear_req(struct request *req, void *data, bool reserved)
@@ -576,39 +596,38 @@ static void nbd_clear_req(struct request *req, void *data, bool reserved)
 
 static void nbd_clear_que(struct nbd_device *nbd)
 {
-	BUG_ON(nbd->magic != NBD_MAGIC);
-
 	blk_mq_tagset_busy_iter(&nbd->tag_set, nbd_clear_req, NULL);
 	dev_dbg(disk_to_dev(nbd->disk), "queue cleared\n");
 }
 
 static int find_fallback(struct nbd_device *nbd, int index)
 {
+	struct nbd_config *config = nbd->config;
 	int new_index = -1;
-	struct nbd_sock *nsock = nbd->socks[index];
+	struct nbd_sock *nsock = config->socks[index];
 	int fallback = nsock->fallback_index;
 
-	if (test_bit(NBD_DISCONNECTED, &nbd->runtime_flags))
+	if (test_bit(NBD_DISCONNECTED, &config->runtime_flags))
 		return new_index;
 
-	if (nbd->num_connections <= 1) {
+	if (config->num_connections <= 1) {
 		dev_err_ratelimited(disk_to_dev(nbd->disk),
 				    "Attempted send on invalid socket\n");
 		return new_index;
 	}
 
-	if (fallback >= 0 && fallback < nbd->num_connections &&
-	    !nbd->socks[fallback]->dead)
+	if (fallback >= 0 && fallback < config->num_connections &&
+	    !config->socks[fallback]->dead)
 		return fallback;
 
 	if (nsock->fallback_index < 0 ||
-	    nsock->fallback_index >= nbd->num_connections ||
-	    nbd->socks[nsock->fallback_index]->dead) {
+	    nsock->fallback_index >= config->num_connections ||
+	    config->socks[nsock->fallback_index]->dead) {
 		int i;
-		for (i = 0; i < nbd->num_connections; i++) {
+		for (i = 0; i < config->num_connections; i++) {
 			if (i == index)
 				continue;
-			if (!nbd->socks[i]->dead) {
+			if (!config->socks[i]->dead) {
 				new_index = i;
 				break;
 			}
@@ -628,23 +647,34 @@ static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
 {
 	struct request *req = blk_mq_rq_from_pdu(cmd);
 	struct nbd_device *nbd = cmd->nbd;
+	struct nbd_config *config;
 	struct nbd_sock *nsock;
 	int ret;
 
-	if (index >= nbd->num_connections) {
+	if (!refcount_inc_not_zero(&nbd->config_refs)) {
+		dev_err_ratelimited(disk_to_dev(nbd->disk),
+				    "Socks array is empty\n");
+		return -EINVAL;
+	}
+	config = nbd->config;
+
+	if (index >= config->num_connections) {
 		dev_err_ratelimited(disk_to_dev(nbd->disk),
 				    "Attempted send on invalid socket\n");
+		nbd_config_put(nbd);
 		return -EINVAL;
 	}
 	req->errors = 0;
 again:
-	nsock = nbd->socks[index];
+	nsock = config->socks[index];
 	mutex_lock(&nsock->tx_lock);
 	if (nsock->dead) {
 		index = find_fallback(nbd, index);
+		if (index < 0) {
+			ret = -EIO;
+			goto out;
+		}
 		mutex_unlock(&nsock->tx_lock);
-		if (index < 0)
-			return -EIO;
 		goto again;
 	}
 
@@ -672,6 +702,7 @@ static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
 	}
 out:
 	mutex_unlock(&nsock->tx_lock);
+	nbd_config_put(nbd);
 	return ret;
 }
 
@@ -711,6 +742,7 @@ static int nbd_queue_rq(struct blk_mq_hw_ctx *hctx,
 static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
 			  unsigned long arg)
 {
+	struct nbd_config *config = nbd->config;
 	struct socket *sock;
 	struct nbd_sock **socks;
 	struct nbd_sock *nsock;
@@ -729,7 +761,7 @@ static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
 		return -EINVAL;
 	}
 
-	socks = krealloc(nbd->socks, (nbd->num_connections + 1) *
+	socks = krealloc(config->socks, (config->num_connections + 1) *
 			 sizeof(struct nbd_sock *), GFP_KERNEL);
 	if (!socks) {
 		sockfd_put(sock);
@@ -741,7 +773,7 @@ static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
 		return -ENOMEM;
 	}
 
-	nbd->socks = socks;
+	config->socks = socks;
 
 	nsock->fallback_index = -1;
 	nsock->dead = false;
@@ -749,7 +781,7 @@ static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
 	nsock->sock = sock;
 	nsock->pending = NULL;
 	nsock->sent = 0;
-	socks[nbd->num_connections++] = nsock;
+	socks[config->num_connections++] = nsock;
 
 	if (max_part)
 		bdev->bd_invalidated = 1;
@@ -759,11 +791,7 @@ static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
 /* Reset all properties of an NBD device */
 static void nbd_reset(struct nbd_device *nbd)
 {
-	nbd->runtime_flags = 0;
-	nbd->blksize = 1024;
-	nbd->bytesize = 0;
-	set_capacity(nbd->disk, 0);
-	nbd->flags = 0;
+	nbd->config = NULL;
 	nbd->tag_set.timeout = 0;
 	queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, nbd->disk->queue);
 }
@@ -782,11 +810,12 @@ static void nbd_bdev_reset(struct block_device *bdev)
 
 static void nbd_parse_flags(struct nbd_device *nbd, struct block_device *bdev)
 {
-	if (nbd->flags & NBD_FLAG_READ_ONLY)
+	struct nbd_config *config = nbd->config;
+	if (config->flags & NBD_FLAG_READ_ONLY)
 		set_device_ro(bdev, true);
-	if (nbd->flags & NBD_FLAG_SEND_TRIM)
+	if (config->flags & NBD_FLAG_SEND_TRIM)
 		queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, nbd->disk->queue);
-	if (nbd->flags & NBD_FLAG_SEND_FLUSH)
+	if (config->flags & NBD_FLAG_SEND_FLUSH)
 		blk_queue_write_cache(nbd->disk->queue, true, false);
 	else
 		blk_queue_write_cache(nbd->disk->queue, false, false);
@@ -794,6 +823,7 @@ static void nbd_parse_flags(struct nbd_device *nbd, struct block_device *bdev)
 
 static void send_disconnects(struct nbd_device *nbd)
 {
+	struct nbd_config *config = nbd->config;
 	struct nbd_request request = {
 		.magic = htonl(NBD_REQUEST_MAGIC),
 		.type = htonl(NBD_CMD_DISC),
@@ -802,7 +832,7 @@ static void send_disconnects(struct nbd_device *nbd)
 	struct iov_iter from;
 	int i, ret;
 
-	for (i = 0; i < nbd->num_connections; i++) {
+	for (i = 0; i < config->num_connections; i++) {
 		iov_iter_kvec(&from, WRITE | ITER_KVEC, &iov, 1, sizeof(request));
 		ret = sock_xmit(nbd, i, 1, &from, 0, NULL);
 		if (ret <= 0)
@@ -813,20 +843,15 @@ static void send_disconnects(struct nbd_device *nbd)
 
 static int nbd_disconnect(struct nbd_device *nbd, struct block_device *bdev)
 {
-	dev_info(disk_to_dev(nbd->disk), "NBD_DISCONNECT\n");
-	if (!nbd->socks)
-		return -EINVAL;
+	struct nbd_config *config = nbd->config;
 
+	dev_info(disk_to_dev(nbd->disk), "NBD_DISCONNECT\n");
 	mutex_unlock(&nbd->config_lock);
 	fsync_bdev(bdev);
 	mutex_lock(&nbd->config_lock);
 
-	/* Check again after getting mutex back.  */
-	if (!nbd->socks)
-		return -EINVAL;
-
 	if (!test_and_set_bit(NBD_DISCONNECT_REQUESTED,
-			      &nbd->runtime_flags))
+			      &config->runtime_flags))
 		send_disconnects(nbd);
 	return 0;
 }
@@ -838,51 +863,63 @@ static int nbd_clear_sock(struct nbd_device *nbd, struct block_device *bdev)
 
 	__invalidate_device(bdev, true);
 	nbd_bdev_reset(bdev);
-	/*
-	 * We want to give the run thread a chance to wait for everybody
-	 * to clean up and then do it's own cleanup.
-	 */
-	if (!test_bit(NBD_RUNNING, &nbd->runtime_flags) &&
-	    nbd->num_connections) {
-		int i;
+	nbd->task_setup = NULL;
+	return 0;
+}
+
+static void nbd_config_put(struct nbd_device *nbd)
+{
+	if (refcount_dec_and_mutex_lock(&nbd->config_refs,
+					&nbd->config_lock)) {
+		struct block_device *bdev;
+		struct nbd_config *config = nbd->config;
 
-		for (i = 0; i < nbd->num_connections; i++) {
-			sockfd_put(nbd->socks[i]->sock);
-			kfree(nbd->socks[i]);
+		bdev = bdget_disk(nbd->disk, 0);
+		if (!bdev) {
+			mutex_unlock(&nbd->config_lock);
+			return;
 		}
-		kfree(nbd->socks);
-		nbd->socks = NULL;
-		nbd->num_connections = 0;
-	}
-	nbd->task_setup = NULL;
 
-	return 0;
+		nbd_dev_dbg_close(nbd);
+		nbd_size_clear(nbd, bdev);
+		if (test_and_clear_bit(NBD_HAS_PID_FILE,
+				       &config->runtime_flags))
+			device_remove_file(disk_to_dev(nbd->disk), &pid_attr);
+		nbd->task_recv = NULL;
+		nbd_clear_sock(nbd, bdev);
+		if (config->num_connections) {
+			int i;
+			for (i = 0; i < config->num_connections; i++) {
+				sockfd_put(config->socks[i]->sock);
+				kfree(config->socks[i]);
+			}
+			kfree(config->socks);
+		}
+		nbd_reset(nbd);
+		mutex_unlock(&nbd->config_lock);
+		bdput(bdev);
+		module_put(THIS_MODULE);
+	}
 }
 
 static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
 {
-	struct recv_thread_args *args;
-	int num_connections = nbd->num_connections;
+	struct nbd_config *config = nbd->config;
+	int num_connections = config->num_connections;
 	int error = 0, i;
 
 	if (nbd->task_recv)
 		return -EBUSY;
-	if (!nbd->socks)
+	if (!config->socks)
 		return -EINVAL;
+
 	if (num_connections > 1 &&
-	    !(nbd->flags & NBD_FLAG_CAN_MULTI_CONN)) {
+	    !(config->flags & NBD_FLAG_CAN_MULTI_CONN)) {
 		dev_err(disk_to_dev(nbd->disk), "server does not support multiple connections per device.\n");
-		error = -EINVAL;
-		goto out_err;
+		return -EINVAL;
 	}
 
-	set_bit(NBD_RUNNING, &nbd->runtime_flags);
-	blk_mq_update_nr_hw_queues(&nbd->tag_set, nbd->num_connections);
-	args = kcalloc(num_connections, sizeof(*args), GFP_KERNEL);
-	if (!args) {
-		error = -ENOMEM;
-		goto out_err;
-	}
+	blk_mq_update_nr_hw_queues(&nbd->tag_set, config->num_connections);
 	nbd->task_recv = current;
 	mutex_unlock(&nbd->config_lock);
 
@@ -891,41 +928,40 @@ static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
 	error = device_create_file(disk_to_dev(nbd->disk), &pid_attr);
 	if (error) {
 		dev_err(disk_to_dev(nbd->disk), "device_create_file failed!\n");
-		goto out_recv;
+		return error;
 	}
+	set_bit(NBD_HAS_PID_FILE, &config->runtime_flags);
 
 	nbd_size_update(nbd, bdev);
 
 	nbd_dev_dbg_init(nbd);
 	for (i = 0; i < num_connections; i++) {
-		sk_set_memalloc(nbd->socks[i]->sock->sk);
-		atomic_inc(&nbd->recv_threads);
-		INIT_WORK(&args[i].work, recv_work);
-		args[i].nbd = nbd;
-		args[i].index = i;
-		queue_work(recv_workqueue, &args[i].work);
+		struct recv_thread_args *args;
+
+		args = kzalloc(sizeof(*args), GFP_KERNEL);
+		if (!args) {
+			sock_shutdown(nbd);
+			return -ENOMEM;
+		}
+		sk_set_memalloc(config->socks[i]->sock->sk);
+		atomic_inc(&config->recv_threads);
+		refcount_inc(&nbd->config_refs);
+		INIT_WORK(&args->work, recv_work);
+		args->nbd = nbd;
+		args->index = i;
+		queue_work(recv_workqueue, &args->work);
 	}
-	wait_event_interruptible(nbd->recv_wq,
-				 atomic_read(&nbd->recv_threads) == 0);
-	for (i = 0; i < num_connections; i++)
-		flush_work(&args[i].work);
-	nbd_dev_dbg_close(nbd);
-	nbd_size_clear(nbd, bdev);
-	device_remove_file(disk_to_dev(nbd->disk), &pid_attr);
-out_recv:
+	error = wait_event_interruptible(config->recv_wq,
+					 atomic_read(&config->recv_threads) == 0);
+	if (error)
+		sock_shutdown(nbd);
 	mutex_lock(&nbd->config_lock);
-	nbd->task_recv = NULL;
-out_err:
-	clear_bit(NBD_RUNNING, &nbd->runtime_flags);
-	nbd_clear_sock(nbd, bdev);
 
 	/* user requested, ignore socket errors */
-	if (test_bit(NBD_DISCONNECT_REQUESTED, &nbd->runtime_flags))
+	if (test_bit(NBD_DISCONNECT_REQUESTED, &config->runtime_flags))
 		error = 0;
-	if (test_bit(NBD_TIMEDOUT, &nbd->runtime_flags))
+	if (test_bit(NBD_TIMEDOUT, &config->runtime_flags))
 		error = -ETIMEDOUT;
-
-	nbd_reset(nbd);
 	return error;
 }
 
@@ -933,6 +969,8 @@ static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
 static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 		       unsigned int cmd, unsigned long arg)
 {
+	struct nbd_config *config = nbd->config;
+
 	switch (cmd) {
 	case NBD_DISCONNECT:
 		return nbd_disconnect(nbd, bdev);
@@ -942,14 +980,14 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 		return nbd_add_socket(nbd, bdev, arg);
 	case NBD_SET_BLKSIZE:
 		nbd_size_set(nbd, bdev, arg,
-			     div_s64(nbd->bytesize, arg));
+			     div_s64(config->bytesize, arg));
 		return 0;
 	case NBD_SET_SIZE:
-		nbd_size_set(nbd, bdev, nbd->blksize,
-			     div_s64(arg, nbd->blksize));
+		nbd_size_set(nbd, bdev, config->blksize,
+			     div_s64(arg, config->blksize));
 		return 0;
 	case NBD_SET_SIZE_BLOCKS:
-		nbd_size_set(nbd, bdev, nbd->blksize, arg);
+		nbd_size_set(nbd, bdev, config->blksize, arg);
 		return 0;
 	case NBD_SET_TIMEOUT:
 		if (arg) {
@@ -959,7 +997,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 		return 0;
 
 	case NBD_SET_FLAGS:
-		nbd->flags = arg;
+		config->flags = arg;
 		return 0;
 	case NBD_DO_IT:
 		return nbd_start_device(nbd, bdev);
@@ -988,18 +1026,70 @@ static int nbd_ioctl(struct block_device *bdev, fmode_t mode,
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	BUG_ON(nbd->magic != NBD_MAGIC);
-
 	mutex_lock(&nbd->config_lock);
 	error = __nbd_ioctl(bdev, nbd, cmd, arg);
 	mutex_unlock(&nbd->config_lock);
-
 	return error;
 }
 
+static struct nbd_config *nbd_alloc_config(void)
+{
+	struct nbd_config *config;
+
+	config = kzalloc(sizeof(struct nbd_config), GFP_NOFS);
+	if (!config)
+		return NULL;
+	atomic_set(&config->recv_threads, 0);
+	init_waitqueue_head(&config->recv_wq);
+	config->blksize = 1024;
+	try_module_get(THIS_MODULE);
+	return config;
+}
+
+static int nbd_open(struct block_device *bdev, fmode_t mode)
+{
+	struct nbd_device *nbd;
+	int ret = 0;
+
+	mutex_lock(&nbd_index_mutex);
+	nbd = bdev->bd_disk->private_data;
+	if (!nbd) {
+		ret = -ENXIO;
+		goto out;
+	}
+	if (!refcount_inc_not_zero(&nbd->config_refs)) {
+		struct nbd_config *config;
+
+		mutex_lock(&nbd->config_lock);
+		if (refcount_inc_not_zero(&nbd->config_refs)) {
+			mutex_unlock(&nbd->config_lock);
+			goto out;
+		}
+		config = nbd->config = nbd_alloc_config();
+		if (!config) {
+			ret = -ENOMEM;
+			mutex_unlock(&nbd->config_lock);
+			goto out;
+		}
+		refcount_set(&nbd->config_refs, 1);
+		mutex_unlock(&nbd->config_lock);
+	}
+out:
+	mutex_unlock(&nbd_index_mutex);
+	return ret;
+}
+
+static void nbd_release(struct gendisk *disk, fmode_t mode)
+{
+	struct nbd_device *nbd = disk->private_data;
+	nbd_config_put(nbd);
+}
+
 static const struct block_device_operations nbd_fops =
 {
 	.owner =	THIS_MODULE,
+	.open =		nbd_open,
+	.release =	nbd_release,
 	.ioctl =	nbd_ioctl,
 	.compat_ioctl =	nbd_ioctl,
 };
@@ -1031,7 +1121,7 @@ static const struct file_operations nbd_dbg_tasks_ops = {
 static int nbd_dbg_flags_show(struct seq_file *s, void *unused)
 {
 	struct nbd_device *nbd = s->private;
-	u32 flags = nbd->flags;
+	u32 flags = nbd->config->flags;
 
 	seq_printf(s, "Hex: 0x%08x\n\n", flags);
 
@@ -1064,6 +1154,7 @@ static const struct file_operations nbd_dbg_flags_ops = {
 static int nbd_dev_dbg_init(struct nbd_device *nbd)
 {
 	struct dentry *dir;
+	struct nbd_config *config = nbd->config;
 
 	if (!nbd_dbg_dir)
 		return -EIO;
@@ -1074,12 +1165,12 @@ static int nbd_dev_dbg_init(struct nbd_device *nbd)
 			nbd_name(nbd));
 		return -EIO;
 	}
-	nbd->dbg_dir = dir;
+	config->dbg_dir = dir;
 
 	debugfs_create_file("tasks", 0444, dir, nbd, &nbd_dbg_tasks_ops);
-	debugfs_create_u64("size_bytes", 0444, dir, &nbd->bytesize);
+	debugfs_create_u64("size_bytes", 0444, dir, &config->bytesize);
 	debugfs_create_u32("timeout", 0444, dir, &nbd->tag_set.timeout);
-	debugfs_create_u64("blocksize", 0444, dir, &nbd->blksize);
+	debugfs_create_u64("blocksize", 0444, dir, &config->blksize);
 	debugfs_create_file("flags", 0444, dir, nbd, &nbd_dbg_flags_ops);
 
 	return 0;
@@ -1087,7 +1178,7 @@ static int nbd_dev_dbg_init(struct nbd_device *nbd)
 
 static void nbd_dev_dbg_close(struct nbd_device *nbd)
 {
-	debugfs_remove_recursive(nbd->dbg_dir);
+	debugfs_remove_recursive(nbd->config->dbg_dir);
 }
 
 static int nbd_dbg_init(void)
@@ -1148,7 +1239,6 @@ static const struct blk_mq_ops nbd_mq_ops = {
 static void nbd_dev_remove(struct nbd_device *nbd)
 {
 	struct gendisk *disk = nbd->disk;
-	nbd->magic = 0;
 	if (disk) {
 		del_gendisk(disk);
 		blk_cleanup_queue(disk->queue);
@@ -1218,14 +1308,13 @@ static int nbd_dev_add(int index)
 	blk_queue_max_hw_sectors(disk->queue, 65536);
 	disk->queue->limits.max_sectors = 256;
 
-	nbd->magic = NBD_MAGIC;
 	mutex_init(&nbd->config_lock);
+	refcount_set(&nbd->config_refs, 0);
 	disk->major = NBD_MAJOR;
 	disk->first_minor = index << part_shift;
 	disk->fops = &nbd_fops;
 	disk->private_data = nbd;
 	sprintf(disk->disk_name, "nbd%d", index);
-	init_waitqueue_head(&nbd->recv_wq);
 	nbd_reset(nbd);
 	add_disk(disk);
 	return index;
-- 
2.7.4

^ permalink raw reply related

* [PATCH 02/12] nbd: handle single path failures gracefully
From: Josef Bacik @ 2017-04-06 21:01 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team
In-Reply-To: <1491512527-4286-1-git-send-email-jbacik@fb.com>

Currently if we have multiple connections and one of them goes down we will tear
down the whole device.  However there's no reason we need to do this as we
could have other connections that are working fine.  Deal with this by keeping
track of the state of the different connections, and if we lose one we mark it
as dead and send all IO destined for that socket to one of the other healthy
sockets.  Any outstanding requests that were on the dead socket will timeout and
be re-submitted properly.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c | 151 +++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 125 insertions(+), 26 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index a4fd82e..4e64d60 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -49,6 +49,8 @@ struct nbd_sock {
 	struct mutex tx_lock;
 	struct request *pending;
 	int sent;
+	bool dead;
+	int fallback_index;
 };
 
 #define NBD_TIMEDOUT			0
@@ -82,6 +84,7 @@ struct nbd_device {
 
 struct nbd_cmd {
 	struct nbd_device *nbd;
+	int index;
 	struct completion send_complete;
 };
 
@@ -124,6 +127,15 @@ static const char *nbdcmd_to_ascii(int cmd)
 	return "invalid";
 }
 
+static void nbd_mark_nsock_dead(struct nbd_sock *nsock)
+{
+	if (!nsock->dead)
+		kernel_sock_shutdown(nsock->sock, SHUT_RDWR);
+	nsock->dead = true;
+	nsock->pending = NULL;
+	nsock->sent = 0;
+}
+
 static int nbd_size_clear(struct nbd_device *nbd, struct block_device *bdev)
 {
 	if (bdev->bd_openers <= 1)
@@ -191,7 +203,31 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 	struct nbd_cmd *cmd = blk_mq_rq_to_pdu(req);
 	struct nbd_device *nbd = cmd->nbd;
 
-	dev_err(nbd_to_dev(nbd), "Connection timed out, shutting down connection\n");
+	if (nbd->num_connections > 1) {
+		dev_err_ratelimited(nbd_to_dev(nbd),
+				    "Connection timed out, retrying\n");
+		mutex_lock(&nbd->config_lock);
+		/*
+		 * Hooray we have more connections, requeue this IO, the submit
+		 * path will put it on a real connection.
+		 */
+		if (nbd->socks && nbd->num_connections > 1) {
+			if (cmd->index < nbd->num_connections) {
+				struct nbd_sock *nsock =
+					nbd->socks[cmd->index];
+				mutex_lock(&nsock->tx_lock);
+				nbd_mark_nsock_dead(nsock);
+				mutex_unlock(&nsock->tx_lock);
+			}
+			mutex_unlock(&nbd->config_lock);
+			blk_mq_requeue_request(req, true);
+			return BLK_EH_NOT_HANDLED;
+		}
+		mutex_unlock(&nbd->config_lock);
+	} else {
+		dev_err_ratelimited(nbd_to_dev(nbd),
+				    "Connection timed out\n");
+	}
 	set_bit(NBD_TIMEDOUT, &nbd->runtime_flags);
 	req->errors = -EIO;
 
@@ -301,6 +337,7 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 		}
 		iov_iter_advance(&from, sent);
 	}
+	cmd->index = index;
 	request.type = htonl(type);
 	if (type != NBD_CMD_FLUSH) {
 		request.from = cpu_to_be64((u64)blk_rq_pos(req) << 9);
@@ -328,7 +365,7 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 		}
 		dev_err_ratelimited(disk_to_dev(nbd->disk),
 			"Send control failed (result %d)\n", result);
-		return -EIO;
+		return -EAGAIN;
 	}
 send_pages:
 	if (type != NBD_CMD_WRITE)
@@ -370,7 +407,7 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 				dev_err(disk_to_dev(nbd->disk),
 					"Send data failed (result %d)\n",
 					result);
-				return -EIO;
+				return -EAGAIN;
 			}
 			/*
 			 * The completion might already have come in,
@@ -389,6 +426,12 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 	return 0;
 }
 
+static int nbd_disconnected(struct nbd_device *nbd)
+{
+	return test_bit(NBD_DISCONNECTED, &nbd->runtime_flags) ||
+		test_bit(NBD_DISCONNECT_REQUESTED, &nbd->runtime_flags);
+}
+
 /* NULL returned = something went wrong, inform userspace */
 static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
 {
@@ -405,8 +448,7 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
 	iov_iter_kvec(&to, READ | ITER_KVEC, &iov, 1, sizeof(reply));
 	result = sock_xmit(nbd, index, 0, &to, MSG_WAITALL, NULL);
 	if (result <= 0) {
-		if (!test_bit(NBD_DISCONNECTED, &nbd->runtime_flags) &&
-		    !test_bit(NBD_DISCONNECT_REQUESTED, &nbd->runtime_flags))
+		if (!nbd_disconnected(nbd))
 			dev_err(disk_to_dev(nbd->disk),
 				"Receive control failed (result %d)\n", result);
 		return ERR_PTR(result);
@@ -449,8 +491,19 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
 			if (result <= 0) {
 				dev_err(disk_to_dev(nbd->disk), "Receive data failed (result %d)\n",
 					result);
-				req->errors = -EIO;
-				return cmd;
+				/*
+				 * If we've disconnected or we only have 1
+				 * connection then we need to make sure we
+				 * complete this request, otherwise error out
+				 * and let the timeout stuff handle resubmitting
+				 * this request onto another connection.
+				 */
+				if (nbd_disconnected(nbd) ||
+				    nbd->num_connections <= 1) {
+					req->errors = -EIO;
+					return cmd;
+				}
+				return ERR_PTR(-EIO);
 			}
 			dev_dbg(nbd_to_dev(nbd), "request %p: got %d bytes data\n",
 				cmd, bvec.bv_len);
@@ -495,19 +548,17 @@ static void recv_work(struct work_struct *work)
 	while (1) {
 		cmd = nbd_read_stat(nbd, args->index);
 		if (IS_ERR(cmd)) {
+			struct nbd_sock *nsock = nbd->socks[args->index];
+
+			mutex_lock(&nsock->tx_lock);
+			nbd_mark_nsock_dead(nsock);
+			mutex_unlock(&nsock->tx_lock);
 			ret = PTR_ERR(cmd);
 			break;
 		}
 
 		nbd_end_request(cmd);
 	}
-
-	/*
-	 * We got an error, shut everybody down if this wasn't the result of a
-	 * disconnect request.
-	 */
-	if (ret && !test_bit(NBD_DISCONNECT_REQUESTED, &nbd->runtime_flags))
-		sock_shutdown(nbd);
 	atomic_dec(&nbd->recv_threads);
 	wake_up(&nbd->recv_wq);
 }
@@ -531,6 +582,47 @@ static void nbd_clear_que(struct nbd_device *nbd)
 	dev_dbg(disk_to_dev(nbd->disk), "queue cleared\n");
 }
 
+static int find_fallback(struct nbd_device *nbd, int index)
+{
+	int new_index = -1;
+	struct nbd_sock *nsock = nbd->socks[index];
+	int fallback = nsock->fallback_index;
+
+	if (test_bit(NBD_DISCONNECTED, &nbd->runtime_flags))
+		return new_index;
+
+	if (nbd->num_connections <= 1) {
+		dev_err_ratelimited(disk_to_dev(nbd->disk),
+				    "Attempted send on invalid socket\n");
+		return new_index;
+	}
+
+	if (fallback >= 0 && fallback < nbd->num_connections &&
+	    !nbd->socks[fallback]->dead)
+		return fallback;
+
+	if (nsock->fallback_index < 0 ||
+	    nsock->fallback_index >= nbd->num_connections ||
+	    nbd->socks[nsock->fallback_index]->dead) {
+		int i;
+		for (i = 0; i < nbd->num_connections; i++) {
+			if (i == index)
+				continue;
+			if (!nbd->socks[i]->dead) {
+				new_index = i;
+				break;
+			}
+		}
+		nsock->fallback_index = new_index;
+		if (new_index < 0) {
+			dev_err_ratelimited(disk_to_dev(nbd->disk),
+					    "Dead connection, failed to find a fallback\n");
+			return new_index;
+		}
+	}
+	new_index = nsock->fallback_index;
+	return new_index;
+}
 
 static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
 {
@@ -544,22 +636,16 @@ static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
 				    "Attempted send on invalid socket\n");
 		return -EINVAL;
 	}
-
-	if (test_bit(NBD_DISCONNECTED, &nbd->runtime_flags)) {
-		dev_err_ratelimited(disk_to_dev(nbd->disk),
-				    "Attempted send on closed socket\n");
-		return -EINVAL;
-	}
-
 	req->errors = 0;
-
+again:
 	nsock = nbd->socks[index];
 	mutex_lock(&nsock->tx_lock);
-	if (unlikely(!nsock->sock)) {
+	if (nsock->dead) {
+		index = find_fallback(nbd, index);
 		mutex_unlock(&nsock->tx_lock);
-		dev_err_ratelimited(disk_to_dev(nbd->disk),
-				    "Attempted send on closed socket\n");
-		return -EINVAL;
+		if (index < 0)
+			return -EIO;
+		goto again;
 	}
 
 	/* Handle the case that we have a pending request that was partially
@@ -572,7 +658,18 @@ static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
 		ret = 0;
 		goto out;
 	}
+	/*
+	 * Some failures are related to the link going down, so anything that
+	 * returns EAGAIN can be retried on a different socket.
+	 */
 	ret = nbd_send_cmd(nbd, cmd, index);
+	if (ret == -EAGAIN) {
+		dev_err_ratelimited(disk_to_dev(nbd->disk),
+				    "Request send failed trying another connection\n");
+		nbd_mark_nsock_dead(nsock);
+		mutex_unlock(&nsock->tx_lock);
+		goto again;
+	}
 out:
 	mutex_unlock(&nsock->tx_lock);
 	return ret;
@@ -646,6 +743,8 @@ static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
 
 	nbd->socks = socks;
 
+	nsock->fallback_index = -1;
+	nsock->dead = false;
 	mutex_init(&nsock->tx_lock);
 	nsock->sock = sock;
 	nsock->pending = NULL;
-- 
2.7.4

^ permalink raw reply related

* [PATCH 01/12] nbd: put socket in error cases
From: Josef Bacik @ 2017-04-06 21:01 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team
In-Reply-To: <1491512527-4286-1-git-send-email-jbacik@fb.com>

When adding a new socket we look it up and then try to add it to our
configuration.  If any of those steps fail we need to make sure we put
the socket so we don't leak them.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 03ae729..a4fd82e 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -628,16 +628,21 @@ static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
 	if (nbd->task_setup != current) {
 		dev_err(disk_to_dev(nbd->disk),
 			"Device being setup by another task");
+		sockfd_put(sock);
 		return -EINVAL;
 	}
 
 	socks = krealloc(nbd->socks, (nbd->num_connections + 1) *
 			 sizeof(struct nbd_sock *), GFP_KERNEL);
-	if (!socks)
+	if (!socks) {
+		sockfd_put(sock);
 		return -ENOMEM;
+	}
 	nsock = kzalloc(sizeof(struct nbd_sock), GFP_KERNEL);
-	if (!nsock)
+	if (!nsock) {
+		sockfd_put(sock);
 		return -ENOMEM;
+	}
 
 	nbd->socks = socks;
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH 00/12] nbd: Netlink interface and path failure enhancements
From: Josef Bacik @ 2017-04-06 21:01 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team

This patchset adds a new netlink configuration interface to NBD as well as a
bunch of enhancments around path failures.  The patches provide the following
enhancemnts to NBD

 - Netlink configuration interface that doesn't leave a userspace application
   waiting in kernel space for the device to disconnect.
 - Netlink reconfigure interface for adding re-connected sockets to replace dead
   sockets.
 - A flag to destroy the NBD device on disconnect, much like how mount -o loop
   works.
 - A status interface that currently will only report whether a device is
   connected or not, but can be extended to include whatever in the future.
 - A netlink multicast notification scheme to notify user space when there are
   connection issues to allow for seamless reconnects.
 - Dead link handling.  You can specify a dead link timeout and the NBD device
   will pause IO for that timeout waiting to see if the connection can be
   re-established.  This is helpful to allow for things like nbd server upgrades
   where the whole server disappears for a short period of time.

These patches have been thorougly and continuously tested for about a month.
I've been finding bugs in various places, but this batch has been solid for the
last few days of testing, which include a constant disconnect/reconnect torture
test.  Thanks,

Josef

^ permalink raw reply

* Re: kill req->errors
From: Konrad Rzeszutek Wilk @ 2017-04-06 20:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Josef Bacik, James Smart, Roger Pau Monné,
	linux-scsi, linux-nvme, linux-block, dm-devel
In-Reply-To: <20170406153944.10058-1-hch@lst.de>

On Thu, Apr 06, 2017 at 05:39:19PM +0200, Christoph Hellwig wrote:
> Currently the request structure has an errors field that is used in
> various different ways.  The oldest drivers use it as an error count,
> blk-mq and the generic timeout code assume that it holds a Linux
> errno for block completions, and various drivers use it for internal
> status values, often overwriting them with Linux errnos later,
> that is unless they are processing passthrough requests in which
> case they'll leave their errors in it.
> 
> This series kills the ->errors field and replaced it with new fields
> in the drivers (or an odd hack in a union in struct request for
> two bitrotting old drivers).
> 
> Note that this series expects that the patch to remove the mg_disk
> driver has been applied already on top of the block for-next tree.

You wouldn't have a git tree to easily test it? Thanks.

^ permalink raw reply

* Re: [PATCH V2 04/16] block, bfq: modify the peak-rate estimator
From: Paolo Valente @ 2017-04-06 19:37 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	fchecconi@gmail.com, linus.walleij@linaro.org, axboe@kernel.dk,
	Arianna Avanzini, broonie@kernel.org, tj@kernel.org,
	ulf.hansson@linaro.org
In-Reply-To: <1491319738.2513.2.camel@sandisk.com>


> Il giorno 04 apr 2017, alle ore 17:28, Bart Van Assche =
<bart.vanassche@sandisk.com> ha scritto:
>=20
> On Tue, 2017-04-04 at 12:42 +0200, Paolo Valente wrote:
>>> Il giorno 31 mar 2017, alle ore 17:31, Bart Van Assche =
<bart.vanassche@sandisk.com> ha scritto:
>>>=20
>>> On Fri, 2017-03-31 at 14:47 +0200, Paolo Valente wrote:
>>>> +               delta_ktime =3D ktime_get();
>>>> +       delta_ktime =3D ktime_sub(delta_ktime, =
bfqd->last_budget_start);
>>>> +       delta_usecs =3D ktime_to_us(delta_ktime);
>>>=20
>>> This patch changes the type of the variable in which the result of =
ktime_to_us()
>>> is stored from u64 into u32 and next compares that result with =
LONG_MAX. Since
>>> ktime_to_us() returns a signed 64-bit number, are you sure you want =
to store that
>>> result in a 32-bit variable? If ktime_to_us() would e.g. return =
0xffffffff00000100
>>> or 0x100000100 then the assignment will truncate these numbers to =
0x100.
>>=20
>> The instruction above the assignment you highlight stores in
>> delta_ktime the difference between 'now' and the last budget start.
>> The latter may have happened at most about 100 ms before 'now'.  So
>> there should be no overflow issue.
>=20
> Hello Paolo,
>=20
> Please double check the following code: if (delta_usecs < 1000 || =
delta_usecs >=3D LONG_MAX)
> Since delta_usecs is a 32-bit variable and LONG_MAX a 64-bit constant =
on 64-bit systems
> I'm not sure that code will do what it is intended to do.
>=20

Yes, sorry. Actually, it never occurred to me to see that extra =
condition to hold over the last eight years on 32-bit systems. So I =
think I will just remove it. Unless Fabio, who inserted that condition =
several years ago, has something to say.

Thanks,
Paolo

> Thanks,
>=20
> Bart.

^ permalink raw reply

* Re: [PATCH v3 1/8] blk-mq: use the right hctx when getting a driver tag fails
From: Jens Axboe @ 2017-04-06 19:29 UTC (permalink / raw)
  To: Ming Lei, Omar Sandoval; +Cc: linux-block, kernel-team
In-Reply-To: <20170406082330.GA3863@ming.t460p>

On 04/06/2017 02:23 AM, Ming Lei wrote:
> On Thu, Apr 06, 2017 at 12:57:51AM -0700, Omar Sandoval wrote:
>> On Thu, Apr 06, 2017 at 12:31:18PM +0800, Ming Lei wrote:
>>> On Wed, Apr 05, 2017 at 12:01:29PM -0700, Omar Sandoval wrote:
>>>> From: Omar Sandoval <osandov@fb.com>
>>>>
>>>> While dispatching requests, if we fail to get a driver tag, we mark the
>>>> hardware queue as waiting for a tag and put the requests on a
>>>> hctx->dispatch list to be run later when a driver tag is freed. However,
>>>> blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
>>>> queues if using a single-queue scheduler with a multiqueue device. If
>>>
>>> It can't perform well by using a SQ scheduler on a MQ device, so just be
>>> curious why someone wants to do that in this way,:-)
>>
>> I don't know why anyone would want to, but it has to work :) The only
>> reason we noticed this is because when the NBD device is created, it
>> only has a single queue, so we automatically assign mq-deadline to it.
>> Later, we update the number of queues, but it's still using mq-deadline.
>>
>>> I guess you mean that ops.mq.dispatch_request() may dispatch requests
>>> from other hardware queues in blk_mq_sched_dispatch_requests() instead
>>> of current hctx.
>>
>> Yup, that's right. It's weird, and I talked to Jens about just forcing
>> the MQ device into an SQ mode when using an SQ scheduler, but this way
>> works fine more or less.
> 
> Or just switch the elevator to the MQ default one when the device becomes
> MQ? Or let mq-deadline's .dispatch_request() just return reqs in current
> hctx?

No, that would be a really bad idea imho. First of all, I don't want
kernel driven scheduler changes. Secondly, the framework should work
with a non-direct mapping between hardware dispatch queues and
scheduling queues.

While we could force a single queue usage to make that a 1:1 mapping
always, that loses big benefits on eg nbd, which uses multiple hardware
queues to up the bandwidth. Similarly on nvme, for example, we still
scale better with N submission queues and 1 scheduling queue compared to
having just 1 submission queue.

>>> If that is true, it looks like a issue in usage of I/O scheduler, since
>>> the mq-deadline scheduler just queues requests in one per-request_queue
>>> linked list, for MQ device, the scheduler queue should have been per-hctx.
>>
>> That's an option, but that's a different scheduling policy. Again, I
>> agree that it's strange, but it's reasonable behavior.
> 
> IMO, the current mq-deadline isn't good/ready for MQ device, and it
> doesn't make sense to use it for MQ.

I don't think that's true at all. I do agree that it's somewhat quirky
since it does introduce scheduling dependencies between the hardware
queues, and we have to work at making that well understood and explicit,
as not to introduce bugs due to that. But in reality, all multiqueue
hardware we are deadling with are mapped to a single resource. As such,
it makes a lot of sense to schedule it as such. Hence I don't think that
a single queue deadline approach is necessarily a bad idea for even fast
storage.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH v3 2/5] blk-mq: Restart a single queue if tag sets are shared
From: Jens Axboe @ 2017-04-06 19:21 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, Christoph Hellwig, Hannes Reinecke
In-Reply-To: <4f5ecfba-f2c9-8b54-f879-86b84f407599@kernel.dk>

On 04/06/2017 01:12 PM, Jens Axboe wrote:
> On 04/06/2017 12:10 PM, Bart Van Assche wrote:
>> +		for (i = 0; i < queue->nr_hw_queues; i++) {
>> +			j = (i + hctx->queue_num + 1) % queue->nr_hw_queues;
>> +			h = queue->queue_hw_ctx[j];
>> +			if (h->tags == tags && blk_mq_sched_restart_hctx(h))
>> +				break;
> 
> I'm pretty sure that doing:
> 
> 	j = i + hctx->queue_num + 1;;

And 'i' too many there of course:

 	j = hctx->queue_num + 1;;

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH v3 2/5] blk-mq: Restart a single queue if tag sets are shared
From: Jens Axboe @ 2017-04-06 19:12 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170406181050.12137-3-bart.vanassche@sandisk.com>

On 04/06/2017 12:10 PM, Bart Van Assche wrote:
> +		for (i = 0; i < queue->nr_hw_queues; i++) {
> +			j = (i + hctx->queue_num + 1) % queue->nr_hw_queues;
> +			h = queue->queue_hw_ctx[j];
> +			if (h->tags == tags && blk_mq_sched_restart_hctx(h))
> +				break;

I'm pretty sure that doing:

	j = i + hctx->queue_num + 1;;
	for (i = 0; i < queue->nr_hw_queues; i++, j++) {
		if (j == queue->nr_hw_queues)
			j = 0;
		h = queue->queue_hw_ctx[j];
		if (h->tags == tags && blk_mq_sched_restart_hctx(h))
			break;
	}

would be considerably more efficient than a modulo for each loop. And
let's rename 'h' to 'hctx', readability is much better that way.
Especially since you have both i and j as iterators.

-- 
Jens Axboe

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox