Linux block layer
 help / color / mirror / Atom feed
* Re: [PATCH 01/28] ibtrs: add header shared between ibtrs_client and ibtrs_server
From: Johannes Thumshirn @ 2017-03-24 14:31 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: linux-block, linux-rdma@vger.kernel.org, Doug Ledford, Jens Axboe,
	hch, Fabian Holler, Milind Dumbare, Michael Wang, Kleber Souza,
	Danil Kipnis, Roman Pen
In-Reply-To: <CAMGffEn7Q+Tchaj4RXV1zMk0MzHqGRv=0W5Vd1G_-GvZaG8tPA@mail.gmail.com>

On Fri, Mar 24, 2017 at 01:54:04PM +0100, Jinpu Wang wrote:
> >> +
> >> +#define XX(a) case (a): return #a
> >
> > please no macros with retun in them and XX isn't quite too descriptive as
> > well.
> >
> > [...]
> >
> >> +static inline const char *ib_wc_opcode_str(enum ib_wc_opcode opcode)
> >> +{
> >> +     switch (opcode) {
> >> +     XX(IB_WC_SEND);
> >> +     XX(IB_WC_RDMA_WRITE);
> >> +     XX(IB_WC_RDMA_READ);
> >> +     XX(IB_WC_COMP_SWAP);
> >> +     XX(IB_WC_FETCH_ADD);
> >> +     /* recv-side); inbound completion */
> >> +     XX(IB_WC_RECV);
> >> +     XX(IB_WC_RECV_RDMA_WITH_IMM);
> >> +     default: return "IB_WC_OPCODE_UNKNOWN";
> >> +     }
> >> +}
> >
> > How about:
> >
> > struct {
> >         char *name;
> >         enum ib_wc_opcode opcode;
> > } ib_wc_opcode_table[] = {
> >         { stringyfy(IB_WC_SEND), IB_WC_SEND },
> >         { stringyfy(IB_WC_RDMA_WRITE), IB_WC_RDMA_WRITE },
> >         { stringyfy(IB_WC_RDMA_READ ), IB_WC_RDMA_READ }
> >         { stringyfy(IB_WC_COMP_SWAP), IB_WC_COMP_SWAP },
> >         { stringyfy(IB_WC_FETCH_ADD), IB_WC_FETCH_ADD },
> >         { stringyfy(IB_WC_RECV), IB_WC_RECV },
> >         { stringyfy(IB_WC_RECV_RDMA_WITH_IMM), IB_WC_RECV_RDMA_WITH_IMM },
> >         { NULL, 0 },
> > };
> >
> > static inline const char *ib_wc_opcode_str(enum ib_wc_opcode opcode)
> > {
> >         int i;
> >
> >         for (i = 0; i < ARRAY_SIZE(ib_wc_opcode_table); i++)
> >                 if (ib_wc_opcode_table[i].opcode == opcode)
> >                         return ib_wc_opcode_table[i].name;
> >
> >         return "IB_WC_OPCODE_UNKNOWN";
> > }
> >
> Looks nice, might be better to put it into ib_verbs.h?

Probably yes, as are your kvec functions for lib/iov_iter.c

[...]

> > What about resolving the kernel bug instead of making workarounds?
> I tried to send a patch upsteam, but was rejected by Sean.
> http://www.spinics.net/lists/linux-rdma/msg22381.html
> 

I don't see a NACK in this thread.

>From http://www.spinics.net/lists/linux-rdma/msg22410.html:
"The port space (which maps to the service ID) needs to be included as part of
the check that determines the format of the private data, and not simply the
address family." 

After such a state I would have expected to see a v2 of the patch with above
comment addressed.

Byte,
	Johannes
-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N�rnberg
GF: Felix Imend�rffer, Jane Smithard, Graham Norton
HRB 21284 (AG N�rnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply

* Re: [PATCH 01/28] ibtrs: add header shared between ibtrs_client and ibtrs_server
From: Jinpu Wang @ 2017-03-24 14:35 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: linux-block, linux-rdma@vger.kernel.org, Doug Ledford, Jens Axboe,
	hch, Fabian Holler, Milind Dumbare, Michael Wang, Kleber Souza,
	Danil Kipnis, Roman Pen
In-Reply-To: <20170324143127.GI3571@linux-x5ow.site>

On Fri, Mar 24, 2017 at 3:31 PM, Johannes Thumshirn <jthumshirn@suse.de> wr=
ote:
> On Fri, Mar 24, 2017 at 01:54:04PM +0100, Jinpu Wang wrote:
>> >> +
>> >> +#define XX(a) case (a): return #a
>> >
>> > please no macros with retun in them and XX isn't quite too descriptive=
 as
>> > well.
>> >
>> > [...]
>> >
>> >> +static inline const char *ib_wc_opcode_str(enum ib_wc_opcode opcode)
>> >> +{
>> >> +     switch (opcode) {
>> >> +     XX(IB_WC_SEND);
>> >> +     XX(IB_WC_RDMA_WRITE);
>> >> +     XX(IB_WC_RDMA_READ);
>> >> +     XX(IB_WC_COMP_SWAP);
>> >> +     XX(IB_WC_FETCH_ADD);
>> >> +     /* recv-side); inbound completion */
>> >> +     XX(IB_WC_RECV);
>> >> +     XX(IB_WC_RECV_RDMA_WITH_IMM);
>> >> +     default: return "IB_WC_OPCODE_UNKNOWN";
>> >> +     }
>> >> +}
>> >
>> > How about:
>> >
>> > struct {
>> >         char *name;
>> >         enum ib_wc_opcode opcode;
>> > } ib_wc_opcode_table[] =3D {
>> >         { stringyfy(IB_WC_SEND), IB_WC_SEND },
>> >         { stringyfy(IB_WC_RDMA_WRITE), IB_WC_RDMA_WRITE },
>> >         { stringyfy(IB_WC_RDMA_READ ), IB_WC_RDMA_READ }
>> >         { stringyfy(IB_WC_COMP_SWAP), IB_WC_COMP_SWAP },
>> >         { stringyfy(IB_WC_FETCH_ADD), IB_WC_FETCH_ADD },
>> >         { stringyfy(IB_WC_RECV), IB_WC_RECV },
>> >         { stringyfy(IB_WC_RECV_RDMA_WITH_IMM), IB_WC_RECV_RDMA_WITH_IM=
M },
>> >         { NULL, 0 },
>> > };
>> >
>> > static inline const char *ib_wc_opcode_str(enum ib_wc_opcode opcode)
>> > {
>> >         int i;
>> >
>> >         for (i =3D 0; i < ARRAY_SIZE(ib_wc_opcode_table); i++)
>> >                 if (ib_wc_opcode_table[i].opcode =3D=3D opcode)
>> >                         return ib_wc_opcode_table[i].name;
>> >
>> >         return "IB_WC_OPCODE_UNKNOWN";
>> > }
>> >
>> Looks nice, might be better to put it into ib_verbs.h?
>
> Probably yes, as are your kvec functions for lib/iov_iter.c
Thanks, will do in next round!

>
> [...]
>
>> > What about resolving the kernel bug instead of making workarounds?
>> I tried to send a patch upsteam, but was rejected by Sean.
>> http://www.spinics.net/lists/linux-rdma/msg22381.html
>>
>
> I don't see a NACK in this thread.
>
> From http://www.spinics.net/lists/linux-rdma/msg22410.html:
> "The port space (which maps to the service ID) needs to be included as pa=
rt of
> the check that determines the format of the private data, and not simply =
the
> address family."
>
> After such a state I would have expected to see a v2 of the patch with ab=
ove
> comment addressed.
I might busy with other staff at that time, I will check again and
revisit the bug.

>
> Byte,
>         Johannes
> --
> Johannes Thumshirn                                          Storage
> jthumshirn@suse.de                                +49 911 74053 689
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N=C3=BCrnberg
> GF: Felix Imend=C3=B6rffer, Jane Smithard, Graham Norton
> HRB 21284 (AG N=C3=BCrnberg)
> Key fingerprint =3D EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Regards,
--=20
Jack Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Gesch=C3=A4ftsf=C3=BChrer: Achim Weiss

^ permalink raw reply

* Re: [RFC PATCH 00/28] INFINIBAND NETWORK BLOCK DEVICE (IBNBD)
From: Jinpu Wang @ 2017-03-24 14:37 UTC (permalink / raw)
  To: Steve Wise
  Cc: linux-block, linux-rdma@vger.kernel.org, Doug Ledford, Jens Axboe,
	hch, Fabian Holler, Milind Dumbare, Michael Wang
In-Reply-To: <03bc01d2a4a9$b8b69f10$2a23dd30$@opengridcomputing.com>

On Fri, Mar 24, 2017 at 3:20 PM, Steve Wise <swise@opengridcomputing.com> w=
rote:
>>
>> From: Jack Wang <jinpu.wang@profitbricks.com>
>>
>> This series introduces IBNBD/IBTRS kernel modules.
>>
>> IBNBD (InfiniBand network block device) allows for an RDMA transfer of b=
lock
> IO
>> over InfiniBand network. The driver presents itself as a block device on
> client
>> side and transmits the block requests in a zero-copy fashion to the
> server-side
>> via InfiniBand. The server part of the driver converts the incoming buff=
ers
> back
>> into BIOs and hands them down to the underlying block device. As soon as=
 IO
>> responses come back from the drive, they are being transmitted back to t=
he
>> client.
>
> Hey Jack, why is this IB specific?  Can it work over iWARP transports as =
well?
>
> Steve.
>
>
>
Hi Steve,

Because we only use IB in our production, as I replied to Bart, sorry, not =
yet.

Regards,
--=20
Jack Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Gesch=C3=A4ftsf=C3=BChrer: Achim Weiss

^ permalink raw reply

* Re: [PATCH 5/8] nowait aio: return on congested block device
From: Jens Axboe @ 2017-03-24 14:39 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-fsdevel
  Cc: jack, hch, linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi,
	avi, linux-api, willy, Goldwyn Rodrigues
In-Reply-To: <47d742e9-1343-ef5b-f7d4-29a6e22bef84@suse.de>

On 03/24/2017 05:32 AM, Goldwyn Rodrigues wrote:
> 
> 
> On 03/16/2017 09:33 AM, Jens Axboe wrote:
>> On 03/15/2017 03:51 PM, Goldwyn Rodrigues wrote:
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 0eeb99e..2e5cba2 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -2014,7 +2019,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>>>  	do {
>>>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>>>  
>>> -		if (likely(blk_queue_enter(q, false) == 0)) {
>>> +		if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
>>>  			struct bio_list hold;
>>>  			struct bio_list lower, same;
>>>  
>>> @@ -2040,7 +2045,10 @@ blk_qc_t generic_make_request(struct bio *bio)
>>>  			bio_list_merge(&bio_list_on_stack, &same);
>>>  			bio_list_merge(&bio_list_on_stack, &hold);
>>>  		} else {
>>> -			bio_io_error(bio);
>>> +			if (unlikely(bio_flagged(bio, BIO_NOWAIT)))
>>> +				bio_wouldblock_error(bio);
>>> +			else
>>> +				bio_io_error(bio);
>>
>> This doesn't look right. What if the queue is dying, and BIO_NOWAIT just
>> happened to be set?
>>
> 
> Yes, I need to add a condition here to check for blk_queue_dying(). Thanks.
> 
>> And you're missing wbt_wait() as well as a blocking point. Ditto in
>> blk-mq.
> 
> wbt_wait() does not apply to WRITE_ODIRECT

It doesn't _currently_ block for it, but that could change. It should
have an explicit check.

Additionally, the more weird assumptions you make, the more half assed
this will be. The other assumption made here is that only direct IO
would set nonblock. We have to assume that any IO can have NONBLOCK set,
and not make any assumptions about the caller.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH v2 2/4] block: add a read barrier in blk_queue_enter()
From: Hannes Reinecke @ 2017-03-24 15:18 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig; +Cc: Bart Van Assche
In-Reply-To: <20170324123621.5227-3-tom.leiming@gmail.com>

On 03/24/2017 01:36 PM, Ming Lei wrote:
> Without the barrier, reading DEAD flag of .q_usage_counter
> and reading .mq_freeze_depth may be reordered, then the
> following wait_event_interruptible() may never return.
> 
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> ---
>  block/blk-core.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index ad388d5e309a..44eed17319c0 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -669,6 +669,14 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
>  		if (nowait)
>  			return -EBUSY;
>  
> +		/*
> +		 * read pair of barrier in blk_mq_freeze_queue_start(),
> +		 * we need to order reading DEAD flag of .q_usage_counter
> +		 * and reading .mq_freeze_depth, otherwise the following
> +		 * wait may never return if the two read are reordered.
> +		 */
> +		smp_rmb();
> +
>  		ret = wait_event_interruptible(q->mq_freeze_wq,
>  				!atomic_read(&q->mq_freeze_depth) ||
>  				blk_queue_dying(q));
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 N�rnberg
GF: J. Hawn, J. Guild, F. Imend�rffer, HRB 16746 (AG N�rnberg)

^ permalink raw reply

* Re: [PATCH v2 3/4] block: rename blk_mq_freeze_queue_start()
From: Hannes Reinecke @ 2017-03-24 15:20 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig; +Cc: Bart Van Assche
In-Reply-To: <20170324123621.5227-4-tom.leiming@gmail.com>

On 03/24/2017 01:36 PM, Ming Lei wrote:
> As the .q_usage_counter is used by both legacy and
> mq path, we need to block new I/O if queue becomes
> dead in blk_queue_enter().
> 
> So rename it and we can use this function in both
> pathes.
> 
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> ---
>  block/blk-core.c                  |  2 +-
>  block/blk-mq.c                    | 10 +++++-----
>  drivers/block/mtip32xx/mtip32xx.c |  2 +-
>  drivers/nvme/host/core.c          |  2 +-
>  include/linux/blk-mq.h            |  2 +-
>  5 files changed, 9 insertions(+), 9 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 N�rnberg
GF: J. Hawn, J. Guild, F. Imend�rffer, HRB 16746 (AG N�rnberg)

^ permalink raw reply

* Re: [PATCH v3 08/14] block: introduce bio_copy_data_partial
From: Jens Axboe @ 2017-03-24 16:41 UTC (permalink / raw)
  To: Ming Lei, Shaohua Li, linux-raid, linux-block, Christoph Hellwig
In-Reply-To: <20170316161235.27110-9-tom.leiming@gmail.com>

On 03/16/2017 10:12 AM, Ming Lei wrote:
> Turns out we can use bio_copy_data in raid1's write behind,
> and we can make alloc_behind_pages() more clean/efficient,
> but we need to partial version of bio_copy_data().
> 
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>

Reviewed-by: Jens Axboe <axboe@fb.com>

Shaohua, feel free to pull this through the md tree, that will be much
easier.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH v3 02/14] md: move two macros into md.h
From: Shaohua Li @ 2017-03-24 16:53 UTC (permalink / raw)
  To: NeilBrown
  Cc: Ming Lei, Jens Axboe, linux-raid, linux-block, Christoph Hellwig
In-Reply-To: <87tw6j8be6.fsf@notabene.neil.brown.name>

On Fri, Mar 24, 2017 at 04:57:37PM +1100, Neil Brown wrote:
> On Fri, Mar 17 2017, Ming Lei wrote:
> 
> > Both raid1 and raid10 share common resync
> > block size and page count, so move them into md.h.
> 
> I don't think this is necessary.
> These are just "magic" numbers.  They don't have any real
> meaning and so don't belong in md.h, or and .h file.
> 
> Possibly we should find more meaningful numbers, or make them auto-size
> or something.  I'm also happy for them to stay as they are for now.
> But I don't think we should pretend that they are meaningful.

I had the same concern when I looked at this patch firstly. The number for
raid1/10 doesn't need to be the same. But if we don't move the number to a
generic header, the third patch will become a little more complicated. I
eventually ignored this issue. If we really need different number for raid1/10,
lets do it at that time.

I think your suggestion that moving the number to raid1-10.h makes sense, and
add a comment declaring the number isn't required to be the same for raid1/10.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v2 2/4] block: add a read barrier in blk_queue_enter()
From: Bart Van Assche @ 2017-03-24 17:24 UTC (permalink / raw)
  To: hch@infradead.org, linux-block@vger.kernel.org,
	tom.leiming@gmail.com, axboe@fb.com
  Cc: hare@suse.de
In-Reply-To: <20170324123621.5227-3-tom.leiming@gmail.com>

On Fri, 2017-03-24 at 20:36 +0800, Ming Lei wrote:
> Without the barrier, reading DEAD flag of .q_usage_counter
> and reading .mq_freeze_depth may be reordered, then the
> following wait_event_interruptible() may never return.
>=20
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> ---
>  block/blk-core.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>=20
> diff --git a/block/blk-core.c b/block/blk-core.c
> index ad388d5e309a..44eed17319c0 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -669,6 +669,14 @@ int blk_queue_enter(struct request_queue *q, bool no=
wait)
>  		if (nowait)
>  			return -EBUSY;
> =20
> +		/*
> +		 * read pair of barrier in blk_mq_freeze_queue_start(),
> +		 * we need to order reading DEAD flag of .q_usage_counter
> +		 * and reading .mq_freeze_depth, otherwise the following
> +		 * wait may never return if the two read are reordered.
> +		 */
> +		smp_rmb();
> +
>  		ret =3D wait_event_interruptible(q->mq_freeze_wq,
>  				!atomic_read(&q->mq_freeze_depth) ||
>  				blk_queue_dying(q));

Hello Ming,

The code looks fine to me but the comment not. You probably wanted to refer
to the "dying" flag instead of the "dead" flag? The read order has to be
enforced for the "dying" flag and q_usage_counter because of the order in
which these are set by blk_set_queue_dying().

Thanks,

Bart.=

^ permalink raw reply

* Re: [PATCH v2 3/4] block: rename blk_mq_freeze_queue_start()
From: Bart Van Assche @ 2017-03-24 17:29 UTC (permalink / raw)
  To: hch@infradead.org, linux-block@vger.kernel.org,
	tom.leiming@gmail.com, axboe@fb.com
  Cc: hare@suse.de
In-Reply-To: <20170324123621.5227-4-tom.leiming@gmail.com>

On Fri, 2017-03-24 at 20:36 +0800, Ming Lei wrote:
> As the .q_usage_counter is used by both legacy and
> mq path, we need to block new I/O if queue becomes
> dead in blk_queue_enter().
>=20
> So rename it and we can use this function in both
> pathes.

Should "pathes" be changed into "paths" in the commit message? Additionally=
,
this patch breaks the symmetry the comment in blk_mq_freeze_queue() refers
to. Anyway:

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>=

^ permalink raw reply

* Re: [PATCH v2 2/4] block: add a read barrier in blk_queue_enter()
From: Ming Lei @ 2017-03-24 17:38 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch@infradead.org, linux-block@vger.kernel.org, axboe@fb.com,
	hare@suse.de
In-Reply-To: <1490376267.3312.1.camel@sandisk.com>

On Sat, Mar 25, 2017 at 1:24 AM, Bart Van Assche
<Bart.VanAssche@sandisk.com> wrote:
> On Fri, 2017-03-24 at 20:36 +0800, Ming Lei wrote:
>> Without the barrier, reading DEAD flag of .q_usage_counter
>> and reading .mq_freeze_depth may be reordered, then the
>> following wait_event_interruptible() may never return.
>>
>> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>> ---
>>  block/blk-core.c | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index ad388d5e309a..44eed17319c0 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -669,6 +669,14 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
>>               if (nowait)
>>                       return -EBUSY;
>>
>> +             /*
>> +              * read pair of barrier in blk_mq_freeze_queue_start(),
>> +              * we need to order reading DEAD flag of .q_usage_counter
>> +              * and reading .mq_freeze_depth, otherwise the following
>> +              * wait may never return if the two read are reordered.
>> +              */
>> +             smp_rmb();
>> +
>>               ret = wait_event_interruptible(q->mq_freeze_wq,
>>                               !atomic_read(&q->mq_freeze_depth) ||
>>                               blk_queue_dying(q));
>
> Hello Ming,
>
> The code looks fine to me but the comment not. You probably wanted to refer
> to the "dying" flag instead of the "dead" flag? The read order has to be

No, looks you misunderstand the issue.

I mean the order between reading __PERCPU_REF_DEAD of .q_usage_counter
and reading .mq_freeze_depth should be enhanced, especially it is in
blk_queue_enter() vs. blk_mq_freeze_queue_start().

In the last patch, you will find the dying flag is mentioned in above comment
after we call blk_freeze_queue_start() just after the dying flag is set.

> enforced for the "dying" flag and q_usage_counter because of the order in
> which these are set by blk_set_queue_dying().

As I explained, the dying flag should only be mentioned after we change
the code in blk_set_queue_dying().


Thanks,
Ming Lei

^ permalink raw reply

* [PATCH] blk-mq: include errors in did_work calculation
From: Jens Axboe @ 2017-03-24 17:39 UTC (permalink / raw)
  To: linux-block@vger.kernel.org

Currently we return true in blk_mq_dispatch_rq_list() if we queued IO
successfully, but we really want to return whether or not the we made
progress. Progress includes if we got an error return.  If we don't,
this can lead to a hang in blk_mq_sched_dispatch_requests() when a
driver is draining IO by returning BLK_MQ_QUEUE_ERROR instead of
manually ending the IO in error and return BLK_MQ_QUEUE_OK.

Signed-off-by: Jens Axboe <axboe@fb.com>

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a4546f060e80..e3b09abf9d5b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -978,7 +978,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 	struct request *rq;
 	LIST_HEAD(driver_list);
 	struct list_head *dptr;
-	int queued, ret = BLK_MQ_RQ_QUEUE_OK;
+	int errors, queued, ret = BLK_MQ_RQ_QUEUE_OK;
 
 	/*
 	 * Start off with dptr being NULL, so we start the first request
@@ -989,7 +989,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 	/*
 	 * Now process all the entries, sending them to the driver.
 	 */
-	queued = 0;
+	errors = queued = 0;
 	while (!list_empty(list)) {
 		struct blk_mq_queue_data bd;
 
@@ -1046,6 +1046,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 		default:
 			pr_err("blk-mq: bad return on queue: %d\n", ret);
 		case BLK_MQ_RQ_QUEUE_ERROR:
+			errors++;
 			rq->errors = -EIO;
 			blk_mq_end_request(rq, rq->errors);
 			break;
@@ -1097,7 +1098,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 			blk_mq_run_hw_queue(hctx, true);
 	}
 
-	return queued != 0;
+	return (queued + errors) != 0;
 }
 
 static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)

^ permalink raw reply related

* Re: [PATCH v2 4/4] block: block new I/O just after queue is set as dying
From: Bart Van Assche @ 2017-03-24 17:45 UTC (permalink / raw)
  To: hch@infradead.org, linux-block@vger.kernel.org,
	tom.leiming@gmail.com, axboe@fb.com
  Cc: tj@kernel.org, hare@suse.de
In-Reply-To: <20170324123621.5227-5-tom.leiming@gmail.com>

On Fri, 2017-03-24 at 20:36 +0800, Ming Lei wrote:=20
> +	/* block new I/O coming */
> +	blk_freeze_queue_start(q);

As I have already mentioned two times, the comment above
blk_freeze_queue_start() should be made more clear. It should mention that
without that call blk_queue_enter() won't check the "dying" flag after it
has been set. If that is not mentioned in a comment the next person who
reads the blk_set_queue_dying() function will wonder why the
blk_freeze_queue_start() call is really needed and whether it can be remove=
d.

>  		/*
>  		 * read pair of barrier in blk_freeze_queue_start(),
>  		 * we need to order reading DEAD flag of .q_usage_counter
> -		 * and reading .mq_freeze_depth, otherwise the following
> -		 * wait may never return if the two read are reordered.
> +		 * and reading .mq_freeze_depth or dying flag, otherwise
> +		 * the following wait may never return if the two read
> +		 * are reordered.
>  		 */
>  		smp_rmb();

Please fix the spelling in the above comment ("two read").

Thanks,

Bart.

^ permalink raw reply

* Re: [PATCH v2 3/4] block: rename blk_mq_freeze_queue_start()
From: Ming Lei @ 2017-03-24 17:52 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch@infradead.org, linux-block@vger.kernel.org, axboe@fb.com,
	hare@suse.de
In-Reply-To: <1490376549.3312.3.camel@sandisk.com>

On Sat, Mar 25, 2017 at 1:29 AM, Bart Van Assche
<Bart.VanAssche@sandisk.com> wrote:
> On Fri, 2017-03-24 at 20:36 +0800, Ming Lei wrote:
>> As the .q_usage_counter is used by both legacy and
>> mq path, we need to block new I/O if queue becomes
>> dead in blk_queue_enter().
>>
>> So rename it and we can use this function in both
>> pathes.
>
> Should "pathes" be changed into "paths" in the commit message? Additionally,
> this patch breaks the symmetry the comment in blk_mq_freeze_queue() refers
> to. Anyway:

Really? Is there one function named as blk_mq_freeze_queue_end()?

The comment means blk_mq_freeze_queue() vs. blk_mq_unfreeze_queue(), which can't
be affected by this patch.


Thanks,
Ming Lei

^ permalink raw reply

* [PATCH] block: remove bio_clone_bioset_partial()
From: Shaohua Li @ 2017-03-24 17:55 UTC (permalink / raw)
  To: linux-block; +Cc: Christoph Hellwig, Jens Axboe, Ming Lei

commit c18a1e0(block: introduce bio_clone_bioset_partial()) introduced
bio_clone_bioset_partial() for raid1 write behind IO. Now the write behind is
rewritten by Ming. We don't need the API any more, so revert the commit.

Jens,
this depends on Ming's patches, so it would be great I put this to md branch.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/bio.c         | 61 ++++++++++++-----------------------------------------
 include/linux/bio.h | 11 ++--------
 2 files changed, 15 insertions(+), 57 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 1ccff0d..0364359 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -631,20 +631,21 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
 }
 EXPORT_SYMBOL(bio_clone_fast);
 
-static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
-				      struct bio_set *bs, int offset,
-				      int size)
+/**
+ * 	bio_clone_bioset - clone a bio
+ * 	@bio_src: bio to clone
+ *	@gfp_mask: allocation priority
+ *	@bs: bio_set to allocate from
+ *
+ *	Clone bio. Caller will own the returned bio, but not the actual data it
+ *	points to. Reference count of returned bio will be one.
+ */
+struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
+			     struct bio_set *bs)
 {
 	struct bvec_iter iter;
 	struct bio_vec bv;
 	struct bio *bio;
-	struct bvec_iter iter_src = bio_src->bi_iter;
-
-	/* for supporting partial clone */
-	if (offset || size != bio_src->bi_iter.bi_size) {
-		bio_advance_iter(bio_src, &iter_src, offset);
-		iter_src.bi_size = size;
-	}
 
 	/*
 	 * Pre immutable biovecs, __bio_clone() used to just do a memcpy from
@@ -668,8 +669,7 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
 	 *    __bio_clone_fast() anyways.
 	 */
 
-	bio = bio_alloc_bioset(gfp_mask, __bio_segments(bio_src,
-			       &iter_src), bs);
+	bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
 	if (!bio)
 		return NULL;
 	bio->bi_bdev		= bio_src->bi_bdev;
@@ -686,7 +686,7 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
 		bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
 		break;
 	default:
-		__bio_for_each_segment(bv, bio_src, iter, iter_src)
+		bio_for_each_segment(bv, bio_src, iter)
 			bio->bi_io_vec[bio->bi_vcnt++] = bv;
 		break;
 	}
@@ -705,44 +705,9 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
 
 	return bio;
 }
-
-/**
- * 	bio_clone_bioset - clone a bio
- * 	@bio_src: bio to clone
- *	@gfp_mask: allocation priority
- *	@bs: bio_set to allocate from
- *
- *	Clone bio. Caller will own the returned bio, but not the actual data it
- *	points to. Reference count of returned bio will be one.
- */
-struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
-			     struct bio_set *bs)
-{
-	return __bio_clone_bioset(bio_src, gfp_mask, bs, 0,
-				  bio_src->bi_iter.bi_size);
-}
 EXPORT_SYMBOL(bio_clone_bioset);
 
 /**
- * 	bio_clone_bioset_partial - clone a partial bio
- * 	@bio_src: bio to clone
- *	@gfp_mask: allocation priority
- *	@bs: bio_set to allocate from
- *	@offset: cloned starting from the offset
- *	@size: size for the cloned bio
- *
- *	Clone bio. Caller will own the returned bio, but not the actual data it
- *	points to. Reference count of returned bio will be one.
- */
-struct bio *bio_clone_bioset_partial(struct bio *bio_src, gfp_t gfp_mask,
-				     struct bio_set *bs, int offset,
-				     int size)
-{
-	return __bio_clone_bioset(bio_src, gfp_mask, bs, offset, size);
-}
-EXPORT_SYMBOL(bio_clone_bioset_partial);
-
-/**
  *	bio_add_pc_page	-	attempt to add page to bio
  *	@q: the target queue
  *	@bio: destination bio
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 42b62a0..fafef63 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -183,7 +183,7 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
 
 #define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
 
-static inline unsigned __bio_segments(struct bio *bio, struct bvec_iter *bvec)
+static inline unsigned bio_segments(struct bio *bio)
 {
 	unsigned segs = 0;
 	struct bio_vec bv;
@@ -205,17 +205,12 @@ static inline unsigned __bio_segments(struct bio *bio, struct bvec_iter *bvec)
 		break;
 	}
 
-	__bio_for_each_segment(bv, bio, iter, *bvec)
+	bio_for_each_segment(bv, bio, iter)
 		segs++;
 
 	return segs;
 }
 
-static inline unsigned bio_segments(struct bio *bio)
-{
-	return __bio_segments(bio, &bio->bi_iter);
-}
-
 /*
  * get a reference to a bio, so it won't disappear. the intended use is
  * something like:
@@ -389,8 +384,6 @@ extern void bio_put(struct bio *);
 extern void __bio_clone_fast(struct bio *, struct bio *);
 extern struct bio *bio_clone_fast(struct bio *, gfp_t, struct bio_set *);
 extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *bs);
-extern struct bio *bio_clone_bioset_partial(struct bio *, gfp_t,
-					    struct bio_set *, int, int);
 
 extern struct bio_set *fs_bio_set;
 
-- 
2.9.3

^ permalink raw reply related

* Re: [PATCH] blk-mq: include errors in did_work calculation
From: Josef Bacik @ 2017-03-24 17:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block@vger.kernel.org
In-Reply-To: <f82c3ba7-2c19-1bf4-a68b-1793b3a2ac4e@kernel.dk>


> On Mar 24, 2017, at 1:39 PM, Jens Axboe <axboe@kernel.dk> wrote:
>=20
> Currently we return true in blk_mq_dispatch_rq_list() if we queued IO
> successfully, but we really want to return whether or not the we made
> progress. Progress includes if we got an error return.  If we don't,
> this can lead to a hang in blk_mq_sched_dispatch_requests() when a
> driver is draining IO by returning BLK_MQ_QUEUE_ERROR instead of
> manually ending the IO in error and return BLK_MQ_QUEUE_OK.
>=20
> Signed-off-by: Jens Axboe <axboe@fb.com>
>=20
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a4546f060e80..e3b09abf9d5b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -978,7 +978,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx =
*hctx, struct list_head *list)
> 	struct request *rq;
> 	LIST_HEAD(driver_list);
> 	struct list_head *dptr;
> -	int queued, ret =3D BLK_MQ_RQ_QUEUE_OK;
> +	int errors, queued, ret =3D BLK_MQ_RQ_QUEUE_OK;
>=20
> 	/*
> 	 * Start off with dptr being NULL, so we start the first request
> @@ -989,7 +989,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx =
*hctx, struct list_head *list)
> 	/*
> 	 * Now process all the entries, sending them to the driver.
> 	 */
> -	queued =3D 0;
> +	errors =3D queued =3D 0;
> 	while (!list_empty(list)) {
> 		struct blk_mq_queue_data bd;
>=20
> @@ -1046,6 +1046,7 @@ bool blk_mq_dispatch_rq_list(struct =
blk_mq_hw_ctx *hctx, struct list_head *list)
> 		default:
> 			pr_err("blk-mq: bad return on queue: %d\n", =
ret);
> 		case BLK_MQ_RQ_QUEUE_ERROR:
> +			errors++;
> 			rq->errors =3D -EIO;
> 			blk_mq_end_request(rq, rq->errors);
> 			break;
> @@ -1097,7 +1098,7 @@ bool blk_mq_dispatch_rq_list(struct =
blk_mq_hw_ctx *hctx, struct list_head *list)
> 			blk_mq_run_hw_queue(hctx, true);
> 	}
>=20
> -	return queued !=3D 0;
> +	return (queued + errors) !=3D 0;
> }
>=20
> static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
>=20

Thanks this fixed it, you can add

Tested-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef=

^ permalink raw reply

* Re: [PATCH] blk-mq: include errors in did_work calculation
From: Bart Van Assche @ 2017-03-24 18:01 UTC (permalink / raw)
  To: linux-block@vger.kernel.org, axboe@kernel.dk
In-Reply-To: <f82c3ba7-2c19-1bf4-a68b-1793b3a2ac4e@kernel.dk>

On Fri, 2017-03-24 at 11:39 -0600, Jens Axboe wrote:
> Currently we return true in blk_mq_dispatch_rq_list() if we queued IO
> successfully, but we really want to return whether or not the we made
> progress. Progress includes if we got an error return.  If we don't,
> this can lead to a hang in blk_mq_sched_dispatch_requests() when a
> driver is draining IO by returning BLK_MQ_QUEUE_ERROR instead of
> manually ending the IO in error and return BLK_MQ_QUEUE_OK.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>=

^ permalink raw reply

* Re: [PATCH] blk-mq: include errors in did_work calculation
From: Omar Sandoval @ 2017-03-24 18:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block@vger.kernel.org
In-Reply-To: <f82c3ba7-2c19-1bf4-a68b-1793b3a2ac4e@kernel.dk>

On Fri, Mar 24, 2017 at 11:39:10AM -0600, Jens Axboe wrote:
> Currently we return true in blk_mq_dispatch_rq_list() if we queued IO
> successfully, but we really want to return whether or not the we made
> progress. Progress includes if we got an error return.  If we don't,
> this can lead to a hang in blk_mq_sched_dispatch_requests() when a
> driver is draining IO by returning BLK_MQ_QUEUE_ERROR instead of
> manually ending the IO in error and return BLK_MQ_QUEUE_OK.
> 
> Signed-off-by: Jens Axboe <axboe@fb.com>

Reviewed-by: Omar Sandoval <osandov@fb.com>

^ permalink raw reply

* [PATCH 0/4] nbd fixes for this cycle
From: Josef Bacik @ 2017-03-24 18:08 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team

These 4 patches are to fix up various regressions and problems in NBD.  The
ERESTARTSYS is the biggest patch but has been pretty well tested with a debug
patch that forced the behavior to happen.  Everything else is relatively small,
and the queue timeout patch is a regression from last cycle.  Thanks,

Josef

^ permalink raw reply

* [PATCH 1/4] nbd: handle ERESTARTSYS properly
From: Josef Bacik @ 2017-03-24 18:08 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team; +Cc: Josef Bacik
In-Reply-To: <1490378909-4056-1-git-send-email-josef@toxicpanda.com>

From: Josef Bacik <jbacik@fb.com>

We can submit IO in a processes context, which means there can be
pending signals.  This isn't a fatal error for NBD, but it does require
some finesse.  If the signal happens before we transmit anything then we
are ok, just requeue the request and carry on.  However if we've done a
partial transmit we can't allow anything else to be transmitted on this
socket until we transmit the remaining part of the request.  Deal with
this by keeping track of how much we've sent for the current request,
and if we get an ERESTARTSYS during any part of our transmission save
the state of that request and requeue the IO.  If anybody tries to
submit a request that isn't our pending request then requeue that
request until we are able to service the one that is pending.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c | 115 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 89 insertions(+), 26 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 7e4287b..3d1fc37a 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -47,6 +47,8 @@ static DEFINE_MUTEX(nbd_index_mutex);
 struct nbd_sock {
 	struct socket *sock;
 	struct mutex tx_lock;
+	struct request *pending;
+	int sent;
 };
 
 #define NBD_TIMEDOUT			0
@@ -202,7 +204,7 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
  *  Send or receive packet.
  */
 static int sock_xmit(struct nbd_device *nbd, int index, int send,
-		     struct iov_iter *iter, int msg_flags)
+		     struct iov_iter *iter, int msg_flags, int *sent)
 {
 	struct socket *sock = nbd->socks[index]->sock;
 	int result;
@@ -237,6 +239,8 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send,
 				result = -EPIPE; /* short read */
 			break;
 		}
+		if (sent)
+			*sent += result;
 	} while (msg_data_left(&msg));
 
 	tsk_restore_flags(current, pflags, PF_MEMALLOC);
@@ -248,6 +252,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send,
 static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 {
 	struct request *req = blk_mq_rq_from_pdu(cmd);
+	struct nbd_sock *nsock = nbd->socks[index];
 	int result;
 	struct nbd_request request = {.magic = htonl(NBD_REQUEST_MAGIC)};
 	struct kvec iov = {.iov_base = &request, .iov_len = sizeof(request)};
@@ -256,6 +261,7 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 	struct bio *bio;
 	u32 type;
 	u32 tag = blk_mq_unique_tag(req);
+	int sent = nsock->sent, skip = 0;
 
 	iov_iter_kvec(&from, WRITE | ITER_KVEC, &iov, 1, sizeof(request));
 
@@ -283,6 +289,17 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 		return -EIO;
 	}
 
+	/* We did a partial send previously, and we at least sent the whole
+	 * request struct, so just go and send the rest of the pages in the
+	 * request.
+	 */
+	if (sent) {
+		if (sent >= sizeof(request)) {
+			skip = sent - sizeof(request);
+			goto send_pages;
+		}
+		iov_iter_advance(&from, sent);
+	}
 	request.type = htonl(type);
 	if (type != NBD_CMD_FLUSH) {
 		request.from = cpu_to_be64((u64)blk_rq_pos(req) << 9);
@@ -294,15 +311,27 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 		cmd, nbdcmd_to_ascii(type),
 		(unsigned long long)blk_rq_pos(req) << 9, blk_rq_bytes(req));
 	result = sock_xmit(nbd, index, 1, &from,
-			(type == NBD_CMD_WRITE) ? MSG_MORE : 0);
+			(type == NBD_CMD_WRITE) ? MSG_MORE : 0, &sent);
 	if (result <= 0) {
+		if (result == -ERESTARTSYS) {
+			/* If we havne't sent anything we can just return BUSY,
+			 * however if we have sent something we need to make
+			 * sure we only allow this req to be sent until we are
+			 * completely done.
+			 */
+			if (sent) {
+				nsock->pending = req;
+				nsock->sent = sent;
+			}
+			return BLK_MQ_RQ_QUEUE_BUSY;
+		}
 		dev_err_ratelimited(disk_to_dev(nbd->disk),
 			"Send control failed (result %d)\n", result);
 		return -EIO;
 	}
-
+send_pages:
 	if (type != NBD_CMD_WRITE)
-		return 0;
+		goto out;
 
 	bio = req->bio;
 	while (bio) {
@@ -318,8 +347,25 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 				cmd, bvec.bv_len);
 			iov_iter_bvec(&from, ITER_BVEC | WRITE,
 				      &bvec, 1, bvec.bv_len);
-			result = sock_xmit(nbd, index, 1, &from, flags);
+			if (skip) {
+				if (skip >= iov_iter_count(&from)) {
+					skip -= iov_iter_count(&from);
+					continue;
+				}
+				iov_iter_advance(&from, skip);
+				skip = 0;
+			}
+			result = sock_xmit(nbd, index, 1, &from, flags, &sent);
 			if (result <= 0) {
+				if (result == -ERESTARTSYS) {
+					/* We've already sent the header, we
+					 * have no choice but to set pending and
+					 * return BUSY.
+					 */
+					nsock->pending = req;
+					nsock->sent = sent;
+					return BLK_MQ_RQ_QUEUE_BUSY;
+				}
 				dev_err(disk_to_dev(nbd->disk),
 					"Send data failed (result %d)\n",
 					result);
@@ -336,6 +382,9 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 		}
 		bio = next;
 	}
+out:
+	nsock->pending = NULL;
+	nsock->sent = 0;
 	return 0;
 }
 
@@ -353,7 +402,7 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
 
 	reply.magic = 0;
 	iov_iter_kvec(&to, READ | ITER_KVEC, &iov, 1, sizeof(reply));
-	result = sock_xmit(nbd, index, 0, &to, MSG_WAITALL);
+	result = sock_xmit(nbd, index, 0, &to, MSG_WAITALL, NULL);
 	if (result <= 0) {
 		if (!test_bit(NBD_DISCONNECTED, &nbd->runtime_flags) &&
 		    !test_bit(NBD_DISCONNECT_REQUESTED, &nbd->runtime_flags))
@@ -395,7 +444,7 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
 		rq_for_each_segment(bvec, req, iter) {
 			iov_iter_bvec(&to, ITER_BVEC | READ,
 				      &bvec, 1, bvec.bv_len);
-			result = sock_xmit(nbd, index, 0, &to, MSG_WAITALL);
+			result = sock_xmit(nbd, index, 0, &to, MSG_WAITALL, NULL);
 			if (result <= 0) {
 				dev_err(disk_to_dev(nbd->disk), "Receive data failed (result %d)\n",
 					result);
@@ -482,22 +531,23 @@ static void nbd_clear_que(struct nbd_device *nbd)
 }
 
 
-static void nbd_handle_cmd(struct nbd_cmd *cmd, int index)
+static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
 {
 	struct request *req = blk_mq_rq_from_pdu(cmd);
 	struct nbd_device *nbd = cmd->nbd;
 	struct nbd_sock *nsock;
+	int ret;
 
 	if (index >= nbd->num_connections) {
 		dev_err_ratelimited(disk_to_dev(nbd->disk),
 				    "Attempted send on invalid socket\n");
-		goto error_out;
+		return -EINVAL;
 	}
 
 	if (test_bit(NBD_DISCONNECTED, &nbd->runtime_flags)) {
 		dev_err_ratelimited(disk_to_dev(nbd->disk),
 				    "Attempted send on closed socket\n");
-		goto error_out;
+		return -EINVAL;
 	}
 
 	req->errors = 0;
@@ -508,29 +558,30 @@ static void nbd_handle_cmd(struct nbd_cmd *cmd, int index)
 		mutex_unlock(&nsock->tx_lock);
 		dev_err_ratelimited(disk_to_dev(nbd->disk),
 				    "Attempted send on closed socket\n");
-		goto error_out;
+		return -EINVAL;
 	}
 
-	if (nbd_send_cmd(nbd, cmd, index) != 0) {
-		dev_err_ratelimited(disk_to_dev(nbd->disk),
-				    "Request send failed\n");
-		req->errors++;
-		nbd_end_request(cmd);
+	/* Handle the case that we have a pending request that was partially
+	 * transmitted that _has_ to be serviced first.  We need to call requeue
+	 * here so that it gets put _after_ the request that is already on the
+	 * dispatch list.
+	 */
+	if (unlikely(nsock->pending && nsock->pending != req)) {
+		blk_mq_requeue_request(req, true);
+		ret = 0;
+		goto out;
 	}
-
+	ret = nbd_send_cmd(nbd, cmd, index);
+out:
 	mutex_unlock(&nsock->tx_lock);
-
-	return;
-
-error_out:
-	req->errors++;
-	nbd_end_request(cmd);
+	return ret;
 }
 
 static int nbd_queue_rq(struct blk_mq_hw_ctx *hctx,
 			const struct blk_mq_queue_data *bd)
 {
 	struct nbd_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
+	int ret;
 
 	/*
 	 * Since we look at the bio's to send the request over the network we
@@ -543,10 +594,20 @@ static int nbd_queue_rq(struct blk_mq_hw_ctx *hctx,
 	 */
 	init_completion(&cmd->send_complete);
 	blk_mq_start_request(bd->rq);
-	nbd_handle_cmd(cmd, hctx->queue_num);
+
+	/* We can be called directly from the user space process, which means we
+	 * could possibly have signals pending so our sendmsg will fail.  In
+	 * this case we need to return that we are busy, otherwise error out as
+	 * appropriate.
+	 */
+	ret = nbd_handle_cmd(cmd, hctx->queue_num);
+	if (ret < 0)
+		ret = BLK_MQ_RQ_QUEUE_ERROR;
+	if (!ret)
+		ret = BLK_MQ_RQ_QUEUE_OK;
 	complete(&cmd->send_complete);
 
-	return BLK_MQ_RQ_QUEUE_OK;
+	return ret;
 }
 
 static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
@@ -581,6 +642,8 @@ static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
 
 	mutex_init(&nsock->tx_lock);
 	nsock->sock = sock;
+	nsock->pending = NULL;
+	nsock->sent = 0;
 	socks[nbd->num_connections++] = nsock;
 
 	if (max_part)
@@ -634,7 +697,7 @@ static void send_disconnects(struct nbd_device *nbd)
 
 	for (i = 0; i < nbd->num_connections; i++) {
 		iov_iter_kvec(&from, WRITE | ITER_KVEC, &iov, 1, sizeof(request));
-		ret = sock_xmit(nbd, i, 1, &from, 0);
+		ret = sock_xmit(nbd, i, 1, &from, 0, NULL);
 		if (ret <= 0)
 			dev_err(disk_to_dev(nbd->disk),
 				"Send disconnect failed %d\n", ret);
-- 
2.7.4

^ permalink raw reply related

* [PATCH 2/4] nbd: set rq->errors to actual error code
From: Josef Bacik @ 2017-03-24 18:08 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team; +Cc: Josef Bacik
In-Reply-To: <1490378909-4056-1-git-send-email-josef@toxicpanda.com>

From: Josef Bacik <jbacik@fb.com>

We've been relying on the block layer to assume rq->errors being set
translates into -EIO.  I noticed in testing that sometimes this isn't
true, and really there's not much of a reason to have a counter instead
of just using -EIO.  So set it properly so we don't leak random numbers
to unsuspecting victims.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 3d1fc37a..dbc22f4 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -192,7 +192,7 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 
 	dev_err(nbd_to_dev(nbd), "Connection timed out, shutting down connection\n");
 	set_bit(NBD_TIMEDOUT, &nbd->runtime_flags);
-	req->errors++;
+	req->errors = -EIO;
 
 	mutex_lock(&nbd->config_lock);
 	sock_shutdown(nbd);
@@ -432,7 +432,7 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
 	if (ntohl(reply.error)) {
 		dev_err(disk_to_dev(nbd->disk), "Other side returned error (%d)\n",
 			ntohl(reply.error));
-		req->errors++;
+		req->errors = -EIO;
 		return cmd;
 	}
 
@@ -448,7 +448,7 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
 			if (result <= 0) {
 				dev_err(disk_to_dev(nbd->disk), "Receive data failed (result %d)\n",
 					result);
-				req->errors++;
+				req->errors = -EIO;
 				return cmd;
 			}
 			dev_dbg(nbd_to_dev(nbd), "request %p: got %d bytes data\n",
@@ -518,7 +518,7 @@ static void nbd_clear_req(struct request *req, void *data, bool reserved)
 	if (!blk_mq_request_started(req))
 		return;
 	cmd = blk_mq_rq_to_pdu(req);
-	req->errors++;
+	req->errors = -EIO;
 	nbd_end_request(cmd);
 }
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH 3/4] nbd: set queue timeout properly
From: Josef Bacik @ 2017-03-24 18:08 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team; +Cc: Josef Bacik
In-Reply-To: <1490378909-4056-1-git-send-email-josef@toxicpanda.com>

From: Josef Bacik <jbacik@fb.com>

We can't just set the timeout on the tagset, we have to set it on the
queue as it would have been setup already at this point.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index dbc22f4..b0003da 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -844,7 +844,10 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 		nbd_size_set(nbd, bdev, nbd->blksize, arg);
 		return 0;
 	case NBD_SET_TIMEOUT:
-		nbd->tag_set.timeout = arg * HZ;
+		if (arg) {
+			nbd->tag_set.timeout = arg * HZ;
+			blk_queue_rq_timeout(nbd->disk->queue, arg * HZ);
+		}
 		return 0;
 
 	case NBD_SET_FLAGS:
-- 
2.7.4

^ permalink raw reply related

* [PATCH 4/4] nbd: replace kill_bdev() with __invalidate_device()
From: Josef Bacik @ 2017-03-24 18:08 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team; +Cc: Ratna Manoj Bolla
In-Reply-To: <1490378909-4056-1-git-send-email-josef@toxicpanda.com>

From: Ratna Manoj Bolla <manoj.br@gmail.com>

When a filesystem is mounted on a nbd device and on a disconnect, because
of kill_bdev(), and resetting bdev size to zero, buffer_head mappings are
getting destroyed under mounted filesystem.

After a bdev size reset(i.e bdev->bd_inode->i_size = 0) on a disconnect,
followed by a sys_umount(),
        generic_shutdown_super()->...
        ->__sync_blockdev()->...
        -blkdev_writepages()->...
        ->do_invalidatepage()->...
        -discard_buffer()   is discarding superblock buffer_head assumed
to be in mapped state by ext4_commit_super().

[mlin: ported to 4.11-rc2]
Signed-off-by: Ratna Manoj Bolla <manoj.br@gmail.com
Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index b0003da..d8a2356 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -126,7 +126,8 @@ static const char *nbdcmd_to_ascii(int cmd)
 
 static int nbd_size_clear(struct nbd_device *nbd, struct block_device *bdev)
 {
-	bd_set_size(bdev, 0);
+	if (bdev->bd_openers <= 1)
+		bd_set_size(bdev, 0);
 	set_capacity(nbd->disk, 0);
 	kobject_uevent(&nbd_to_dev(nbd)->kobj, KOBJ_CHANGE);
 
@@ -665,6 +666,8 @@ static void nbd_reset(struct nbd_device *nbd)
 
 static void nbd_bdev_reset(struct block_device *bdev)
 {
+	if (bdev->bd_openers > 1)
+		return;
 	set_device_ro(bdev, false);
 	bdev->bd_inode->i_size = 0;
 	if (max_part > 0) {
@@ -728,7 +731,8 @@ static int nbd_clear_sock(struct nbd_device *nbd, struct block_device *bdev)
 {
 	sock_shutdown(nbd);
 	nbd_clear_que(nbd);
-	kill_bdev(bdev);
+
+	__invalidate_device(bdev, true);
 	nbd_bdev_reset(bdev);
 	/*
 	 * We want to give the run thread a chance to wait for everybody
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH] block: remove bio_clone_bioset_partial()
From: Jens Axboe @ 2017-03-24 18:11 UTC (permalink / raw)
  To: Shaohua Li, linux-block; +Cc: Christoph Hellwig, Ming Lei
In-Reply-To: <99da2cea22d37790b5ca93f07da3b16f890d16c2.1490377861.git.shli@fb.com>

On 03/24/2017 11:55 AM, Shaohua Li wrote:
> commit c18a1e0(block: introduce bio_clone_bioset_partial()) introduced
> bio_clone_bioset_partial() for raid1 write behind IO. Now the write behind is
> rewritten by Ming. We don't need the API any more, so revert the commit.
> 
> Jens,
> this depends on Ming's patches, so it would be great I put this to md branch.

Looks fine to me, feel free to do so.

Reviewed-by: Jens Axboe <axboe@fb.com>

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH v2 2/4] block: add a read barrier in blk_queue_enter()
From: Bart Van Assche @ 2017-03-24 18:45 UTC (permalink / raw)
  To: tom.leiming@gmail.com
  Cc: hch@infradead.org, linux-block@vger.kernel.org, hare@suse.de,
	axboe@fb.com
In-Reply-To: <CACVXFVN7_PovXhKFofqf7N99Fq6RHEtFojHzwc6AN1oEpFLYxA@mail.gmail.com>

On Sat, 2017-03-25 at 01:38 +0800, Ming Lei wrote:
> As I explained, the dying flag should only be mentioned after we change
> the code in blk_set_queue_dying().

Hello Ming,

If patches 2 and 4 would be combined into a single patch then it wouldn't
be necessary anymore to update the comment introduced in patch 2 in patch 4=
.
I think that would make this patch series easier to review.

Since the issues fixed by your patches are longstanding issues, have you
considered to add a "Cc: stable" tag?

Thanks,

Bart.=

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox