Linux block layer
 help / color / mirror / Atom feed
* Re: [PATCH 4/4] mtd: nand: nandsim: convert to memalloc_noreclaim_*()
From: Adrian Hunter @ 2017-04-06  6:33 UTC (permalink / raw)
  To: Vlastimil Babka, Richard Weinberger, Michal Hocko
  Cc: Andrew Morton, linux-mm, linux-kernel, Mel Gorman,
	Johannes Weiner, linux-block, nbd-general, open-iscsi, linux-scsi,
	netdev, Boris Brezillon
In-Reply-To: <9b9d5bca-e125-e07b-b700-196cc800bbd7@suse.cz>

On 05/04/17 14:39, Vlastimil Babka wrote:
> On 04/05/2017 01:36 PM, Richard Weinberger wrote:
>> Michal,
>>
>> Am 05.04.2017 um 13:31 schrieb Michal Hocko:
>>> On Wed 05-04-17 09:47:00, Vlastimil Babka wrote:
>>>> Nandsim has own functions set_memalloc() and clear_memalloc() for robust
>>>> setting and clearing of PF_MEMALLOC. Replace them by the new generic helpers.
>>>> No functional change.
>>>
>>> This one smells like an abuser. Why the hell should read/write path
>>> touch memory reserves at all!
>>
>> Could be. Let's ask Adrian, AFAIK he wrote that code.
>> Adrian, can you please clarify why nandsim needs to play with PF_MEMALLOC?
> 
> I was thinking about it and concluded that since the simulator can be
> used as a block device where reclaimed pages go to, writing the data out
> is a memalloc operation. Then reading can be called as part of r-m-w
> cycle, so reading as well.

IIRC it was to avoid getting stuck with nandsim waiting on memory reclaim
and memory reclaim waiting on nandsim.

^ permalink raw reply

* Re: [Nbd] [PATCH 3/4] treewide: convert PF_MEMALLOC manipulations to new helpers
From: Wouter Verhelst @ 2017-04-06  6:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, nbd-general, Chris Leech, linux-scsi,
	Josef Bacik, netdev, linux-kernel, linux-block, linux-mm,
	Eric Dumazet, Lee Duncan, Johannes Weiner, Andrew Morton,
	open-iscsi, Mel Gorman, David S. Miller
In-Reply-To: <20170405113030.GL6035@dhcp22.suse.cz>

On Wed, Apr 05, 2017 at 01:30:31PM +0200, Michal Hocko wrote:
> On Wed 05-04-17 09:46:59, Vlastimil Babka wrote:
> > We now have memalloc_noreclaim_{save,restore} helpers for robust setting and
> > clearing of PF_MEMALLOC. Let's convert the code which was using the generic
> > tsk_restore_flags(). No functional change.
> 
> It would be really great to revisit why those places outside of the mm
> proper really need this flag. I know this is a painful exercise but I
> wouldn't be surprised if there were abusers there.
[...]
> > ---
> >  drivers/block/nbd.c      | 7 ++++---
> >  drivers/scsi/iscsi_tcp.c | 7 ++++---
> >  net/core/dev.c           | 7 ++++---
> >  net/core/sock.c          | 7 ++++---
> >  4 files changed, 16 insertions(+), 12 deletions(-)

These were all done to make swapping over network safe. The idea is that
if a socket has SOCK_MEMALLOC set, incoming packets for that socket can
access PFMEMALLOC reserves (whereas other sockets cannot); this all in
the hope that one packe destined to that socket will contain the TCP ACK
that confirms the swapout was successful and we can now release RAM
pages for other processes.

I don't know whether they need the PF_MEMALLOC flag specifically (not a
kernel hacker), but they do need to interact with it at any rate.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply

* Re: [PATCH 1/2] block: Implement global tagset
From: Arun Easi @ 2017-04-06  6:27 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, Omar Sandoval, Martin K. Petersen, James Bottomley,
	Christoph Hellwig, Bart van Assche, linux-block, linux-scsi,
	Hannes Reinecke
In-Reply-To: <1491307665-47656-2-git-send-email-hare@suse.de>

Hi Hannes,

Thanks for taking a crack at the issue. My comments below..

On Tue, 4 Apr 2017, 5:07am, Hannes Reinecke wrote:

> Most legacy HBAs have a tagset per HBA, not per queue. To map
> these devices onto block-mq this patch implements a new tagset
> flag BLK_MQ_F_GLOBAL_TAGS, which will cause the tag allocator
> to use just one tagset for all hardware queues.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.com>
> ---
>  block/blk-mq-tag.c     | 12 ++++++++----
>  block/blk-mq.c         | 10 ++++++++--
>  include/linux/blk-mq.h |  1 +
>  3 files changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index e48bc2c..a14e76c 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -276,9 +276,11 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
>  void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>  		busy_tag_iter_fn *fn, void *priv)
>  {
> -	int i;
> +	int i, lim = tagset->nr_hw_queues;
>  
> -	for (i = 0; i < tagset->nr_hw_queues; i++) {
> +	if (tagset->flags & BLK_MQ_F_GLOBAL_TAGS)
> +		lim = 1;
> +	for (i = 0; i < lim; i++) {
>  		if (tagset->tags && tagset->tags[i])
>  			blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
>  	}
> @@ -287,12 +289,14 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>  
>  int blk_mq_reinit_tagset(struct blk_mq_tag_set *set)
>  {
> -	int i, j, ret = 0;
> +	int i, j, ret = 0, lim = set->nr_hw_queues;
>  
>  	if (!set->ops->reinit_request)
>  		goto out;
>  
> -	for (i = 0; i < set->nr_hw_queues; i++) {
> +	if (set->flags & BLK_MQ_F_GLOBAL_TAGS)
> +		lim = 1;
> +	for (i = 0; i < lim; i++) {
>  		struct blk_mq_tags *tags = set->tags[i];
>  
>  		for (j = 0; j < tags->nr_tags; j++) {
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 159187a..db96ed0 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2061,6 +2061,10 @@ static bool __blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, int hctx_idx)
>  {
>  	int ret = 0;
>  
> +	if ((set->flags & BLK_MQ_F_GLOBAL_TAGS) && hctx_idx != 0) {
> +		set->tags[hctx_idx] = set->tags[0];
> +		return true;
> +	}

So, this effectively make all request allocations to the same NUMA node 
locality of the hctx_idx 0, correct? Is the performance hit you were 
talking about in the cover letter?

Do you have any other alternatives in mind? Dynamic growing/shrinking 
tags/request-pool in hctx with a fixed base as start?

One alternative that comes to my mind is to move the divvy up logic to 
SCSI (instead of LLD doing it), IOW:

1. Have SCSI set tag_set.queue_depth to can_queue/nr_hw_queues
2. Have blk_mq_unique_tag() (or a new i/f) returning "hwq * nr_hw_queue + 
   rq->tag"

That would make the tags linear in the can_queue space, but could result 
in poor use of LLD resource if a given hctx has used up all it's tags.

On a related note, would not the current use of can_queue in SCSI lead to 
poor resource utilization in MQ cases? Like, block layer allocating 
nr_hw_queues * tags+request+driver_data.etc * can_queue, but SCSI limiting 
the number of requests to can_queue.

BTW, if you would like me to try out this patch on my setup, please let me 
know.

Regards,
-Arun

>  	set->tags[hctx_idx] = blk_mq_alloc_rq_map(set, hctx_idx,
>  					set->queue_depth, set->reserved_tags);
>  	if (!set->tags[hctx_idx])
> @@ -2080,8 +2084,10 @@ static void blk_mq_free_map_and_requests(struct blk_mq_tag_set *set,
>  					 unsigned int hctx_idx)
>  {
>  	if (set->tags[hctx_idx]) {
> -		blk_mq_free_rqs(set, set->tags[hctx_idx], hctx_idx);
> -		blk_mq_free_rq_map(set->tags[hctx_idx]);
> +		if (!(set->flags & BLK_MQ_F_GLOBAL_TAGS) || hctx_idx == 0) {
> +			blk_mq_free_rqs(set, set->tags[hctx_idx], hctx_idx);
> +			blk_mq_free_rq_map(set->tags[hctx_idx]);
> +		}
>  		set->tags[hctx_idx] = NULL;
>  	}
>  }
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index b296a90..eee27b016 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -155,6 +155,7 @@ enum {
>  	BLK_MQ_F_DEFER_ISSUE	= 1 << 4,
>  	BLK_MQ_F_BLOCKING	= 1 << 5,
>  	BLK_MQ_F_NO_SCHED	= 1 << 6,
> +	BLK_MQ_F_GLOBAL_TAGS	= 1 << 7,
>  	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
>  	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
>  
> 

^ permalink raw reply

* Re: [PATCH 27/27] scsi: sd: Remove LBPRZ dependency for discards
From: Hannes Reinecke @ 2017-04-06  6:18 UTC (permalink / raw)
  To: Christoph Hellwig, axboe, martin.petersen, agk, snitzer, shli,
	philipp.reisner, lars.ellenberg
  Cc: linux-block, linux-scsi, drbd-dev, dm-devel, linux-raid
In-Reply-To: <20170405172125.22600-28-hch@lst.de>

On 04/05/2017 07:21 PM, Christoph Hellwig wrote:
> From: "Martin K. Petersen" <martin.petersen@oracle.com>
> 
> Separating discards and zeroout operations allows us to remove the LBPRZ
> block zeroing constraints from discards and honor the device preferences
> for UNMAP commands.
> 
> If supported by the device, we'll also choose UNMAP over one of the
> WRITE SAME variants for discards.
> 
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/scsi/sd.c | 25 ++++++-------------------
>  1 file changed, 6 insertions(+), 19 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N�rnberg
GF: F. Imend�rffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG N�rnberg)

^ permalink raw reply

* Re: [PATCH 26/27] scsi: sd: Separate zeroout and discard command choices
From: Hannes Reinecke @ 2017-04-06  6:17 UTC (permalink / raw)
  To: Christoph Hellwig, axboe, martin.petersen, agk, snitzer, shli,
	philipp.reisner, lars.ellenberg
  Cc: linux-block, linux-scsi, drbd-dev, dm-devel, linux-raid
In-Reply-To: <20170405172125.22600-27-hch@lst.de>

On 04/05/2017 07:21 PM, Christoph Hellwig wrote:
> From: "Martin K. Petersen" <martin.petersen@oracle.com>
> 
> Now that zeroout and discards are distinct operations we need to
> separate the policy of choosing the appropriate command. Create a
> zeroing_mode which can be one of:
> 
> write:			Zeroout assist not present, use regular WRITE
> writesame:		Allow WRITE SAME(10/16) with a zeroed payload
> writesame_16_unmap:	Allow WRITE SAME(16) with UNMAP
> writesame_10_unmap:	Allow WRITE SAME(10) with UNMAP
> 
> The last two are conditional on the device being thin provisioned with
> LBPRZ=1 and LBPWS=1 or LBPWS10=1 respectively.
> 
> Whether to set the UNMAP bit or not depends on the REQ_NOUNMAP flag. And
> if none of the _unmap variants are supported, regular WRITE SAME will be
> used if the device supports it.
> 
> The zeroout_mode is exported in sysfs and the detected mode for a given
> device can be overridden using the string constants above.
> 
> With this change in place we can now issue WRITE SAME(16) with UNMAP set
> for block zeroing applications that require hard guarantees and
> logical_block_size granularity. And at the same time use the UNMAP
> command with the device's preferred granulary and alignment for discard
> operations.
> 
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/scsi/sd.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  drivers/scsi/sd.h |  8 ++++++++
>  2 files changed, 61 insertions(+), 3 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N�rnberg
GF: F. Imend�rffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG N�rnberg)

^ permalink raw reply

* RE: [PATCH] block-mq: set both block queue and hardware queue restart bit for restart
From: Long Li @ 2017-04-06  5:38 UTC (permalink / raw)
  To: KY Srinivasan, Bart Van Assche, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org, axboe@kernel.dk
  Cc: Stephen Hemminger
In-Reply-To: <DM5PR03MB2490E70D9B18A701A81005C8A00D0@DM5PR03MB2490.namprd03.prod.outlook.com>

DQoNCj4gLS0tLS1PcmlnaW5hbCBNZXNzYWdlLS0tLS0NCj4gRnJvbTogS1kgU3Jpbml2YXNhbg0K
PiBTZW50OiBXZWRuZXNkYXksIEFwcmlsIDUsIDIwMTcgOToyMSBQTQ0KPiBUbzogQmFydCBWYW4g
QXNzY2hlIDxCYXJ0LlZhbkFzc2NoZUBzYW5kaXNrLmNvbT47IGxpbnV4LQ0KPiBrZXJuZWxAdmdl
ci5rZXJuZWwub3JnOyBsaW51eC1ibG9ja0B2Z2VyLmtlcm5lbC5vcmc7IExvbmcgTGkNCj4gPGxv
bmdsaUBtaWNyb3NvZnQuY29tPjsgYXhib2VAa2VybmVsLmRrDQo+IENjOiBTdGVwaGVuIEhlbW1p
bmdlciA8c3RoZW1taW5AbWljcm9zb2Z0LmNvbT4NCj4gU3ViamVjdDogUkU6IFtQQVRDSF0gYmxv
Y2stbXE6IHNldCBib3RoIGJsb2NrIHF1ZXVlIGFuZCBoYXJkd2FyZSBxdWV1ZQ0KPiByZXN0YXJ0
IGJpdCBmb3IgcmVzdGFydA0KPiANCj4gDQo+IA0KPiA+IC0tLS0tT3JpZ2luYWwgTWVzc2FnZS0t
LS0tDQo+ID4gRnJvbTogQmFydCBWYW4gQXNzY2hlIFttYWlsdG86QmFydC5WYW5Bc3NjaGVAc2Fu
ZGlzay5jb21dDQo+ID4gU2VudDogV2VkbmVzZGF5LCBBcHJpbCA1LCAyMDE3IDg6NDYgUE0NCj4g
PiBUbzogbGludXgta2VybmVsQHZnZXIua2VybmVsLm9yZzsgbGludXgtYmxvY2tAdmdlci5rZXJu
ZWwub3JnOyBMb25nIExpDQo+ID4gPGxvbmdsaUBtaWNyb3NvZnQuY29tPjsgYXhib2VAa2VybmVs
LmRrDQo+ID4gQ2M6IFN0ZXBoZW4gSGVtbWluZ2VyIDxzdGhlbW1pbkBtaWNyb3NvZnQuY29tPjsg
S1kgU3Jpbml2YXNhbg0KPiA+IDxreXNAbWljcm9zb2Z0LmNvbT4NCj4gPiBTdWJqZWN0OiBSZTog
W1BBVENIXSBibG9jay1tcTogc2V0IGJvdGggYmxvY2sgcXVldWUgYW5kIGhhcmR3YXJlIHF1ZXVl
DQo+ID4gcmVzdGFydCBiaXQgZm9yIHJlc3RhcnQNCj4gPg0KPiA+IE9uIFRodSwgMjAxNy0wNC0w
NiBhdCAwMzozOCArMDAwMCwgTG9uZyBMaSB3cm90ZToNCj4gPiA+ID4gLS0tLS1PcmlnaW5hbCBN
ZXNzYWdlLS0tLS0NCj4gPiA+ID4gRnJvbTogQmFydCBWYW4gQXNzY2hlIFttYWlsdG86QmFydC5W
YW5Bc3NjaGVAc2FuZGlzay5jb21dDQo+ID4gPiA+DQo+ID4gPiA+IFBsZWFzZSBkcm9wIHRoaXMg
cGF0Y2guIEknbSB3b3JraW5nIG9uIGEgYmV0dGVyIHNvbHV0aW9uLg0KPiA+ID4NCj4gPiA+IFRo
YW5rIHlvdS4gTG9va2luZyBmb3J3YXJkIHRvIHlvdXIgcGF0Y2guDQo+ID4NCj4gPiBIZWxsbyBM
b25nLA0KPiA+DQo+ID4gSXQgd291bGQgaGVscCBpZiB5b3UgY291bGQgc2hhcmUgdGhlIG5hbWUg
b2YgdGhlIGJsb2NrIG9yIFNDU0kgZHJpdmVyDQo+ID4gd2l0aCB3aGljaCB5b3UgcmFuIGludG8g
dGhhdCBsb2NrdXAgYW5kIGFsc28gaWYgeW91IGNvdWxkIHNoYXJlIHRoZQ0KPiA+IG5hbWUgb2Yg
dGhlIEkvTyBzY2hlZHVsZXIgdXNlZCBpbiB5b3VyIHRlc3QuDQo+IA0KPiBUaGUgdGVzdHMgdGhh
dCBpbmRpY2F0ZWQgdGhlIGlzc3VlIHdlcmUgcnVuIEh5cGVyLVYuIFRoZSBkcml2ZXIgaXMNCj4g
c3RvcnZzY19kcnYuYyBUaGUgSS9PIHNjaGVkdWxlciB3YXMgSSB0aGluayBub29wLg0KDQpZZXMs
IHdlIHNlZSBJL08gaHVuZyBvbiBzY2hlZHVsZXIgbm9uZS4gQWxzbyB0cmllZCBvbiBtcS1kZWFk
bGluZSwgc2FtZSBodW5nIHdpdGggdGhlIHNhbWUgY2F1c2UuDQoNCj4gDQo+IEsuIFkNCj4gPg0K
PiA+IFRoYW5rcywNCj4gPg0KPiA+IEJhcnQuDQo=

^ permalink raw reply

* Re: [PATCH v3 1/8] blk-mq: use the right hctx when getting a driver tag fails
From: Ming Lei @ 2017-04-06  4:31 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Jens Axboe, linux-block, kernel-team
In-Reply-To: <d1b5da71a5cba1f17d4eaafb207547a4651f81a9.1491418411.git.osandov@fb.com>

On Wed, Apr 05, 2017 at 12:01:29PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> While dispatching requests, if we fail to get a driver tag, we mark the
> hardware queue as waiting for a tag and put the requests on a
> hctx->dispatch list to be run later when a driver tag is freed. However,
> blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
> queues if using a single-queue scheduler with a multiqueue device. If

It can't perform well by using a SQ scheduler on a MQ device, so just be
curious why someone wants to do that in this way,:-)

I guess you mean that ops.mq.dispatch_request() may dispatch requests
from other hardware queues in blk_mq_sched_dispatch_requests() instead
of current hctx.

If that is true, it looks like a issue in usage of I/O scheduler, since
the mq-deadline scheduler just queues requests in one per-request_queue
linked list, for MQ device, the scheduler queue should have been per-hctx.

> blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
> are processing. This means we end up using the hardware queue of the
> previous request, which may or may not be the same as that of the
> current request. If it isn't, the wrong hardware queue will end up
> waiting for a tag, and the requests will be on the wrong dispatch list,
> leading to a hang.
> 
> The fix is twofold:
> 
> 1. Make sure we save which hardware queue we were trying to get a
>    request for in blk_mq_get_driver_tag() regardless of whether it
>    succeeds or not.

It shouldn't have be needed, once rq->mq_ctx is set, the hctx should been
fixed already, that means we can just pass the hctx to blk_mq_get_driver_tag().

> 2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
>    blk_mq_hw_queue to make it clear that it must handle multiple
>    hardware queues, since I've already messed this up on a couple of
>    occasions.
> 
> This didn't appear in testing with nvme and mq-deadline because nvme has
> more driver tags than the default number of scheduler tags. However,
> with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  block/blk-mq-sched.c |  9 +++++----
>  block/blk-mq.c       | 25 +++++++++++++------------
>  block/blk-mq.h       |  2 +-
>  3 files changed, 19 insertions(+), 17 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 09af8ff18719..fc00f00898d3 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -171,7 +171,8 @@ void blk_mq_sched_put_request(struct request *rq)
>  
>  void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  {
> -	struct elevator_queue *e = hctx->queue->elevator;
> +	struct request_queue *q = hctx->queue;
> +	struct elevator_queue *e = q->elevator;
>  	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
>  	bool did_work = false;
>  	LIST_HEAD(rq_list);
> @@ -203,10 +204,10 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	 */
>  	if (!list_empty(&rq_list)) {
>  		blk_mq_sched_mark_restart_hctx(hctx);
> -		did_work = blk_mq_dispatch_rq_list(hctx, &rq_list);
> +		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
>  	} else if (!has_sched_dispatch) {
>  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
> -		blk_mq_dispatch_rq_list(hctx, &rq_list);
> +		blk_mq_dispatch_rq_list(q, &rq_list);
>  	}
>  
>  	/*
> @@ -222,7 +223,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  			if (!rq)
>  				break;
>  			list_add(&rq->queuelist, &rq_list);
> -		} while (blk_mq_dispatch_rq_list(hctx, &rq_list));
> +		} while (blk_mq_dispatch_rq_list(q, &rq_list));
>  	}
>  }
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f7cd3208bcdf..09cff6d1ba76 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -863,12 +863,8 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
>  		.flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
>  	};
>  
> -	if (rq->tag != -1) {
> -done:
> -		if (hctx)
> -			*hctx = data.hctx;
> -		return true;
> -	}
> +	if (rq->tag != -1)
> +		goto done;
>  
>  	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
>  		data.flags |= BLK_MQ_REQ_RESERVED;
> @@ -880,10 +876,12 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
>  			atomic_inc(&data.hctx->nr_active);
>  		}
>  		data.hctx->tags->rqs[rq->tag] = rq;
> -		goto done;
>  	}
>  
> -	return false;
> +done:
> +	if (hctx)
> +		*hctx = data.hctx;
> +	return rq->tag != -1;
>  }
>  
>  static void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
> @@ -980,17 +978,20 @@ static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
>  	return true;
>  }
>  
> -bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
> +bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
>  {
> -	struct request_queue *q = hctx->queue;
> +	struct blk_mq_hw_ctx *hctx;
>  	struct request *rq;
>  	int errors, queued, ret = BLK_MQ_RQ_QUEUE_OK;
>  
> +	if (list_empty(list))
> +		return false;
> +
>  	/*
>  	 * Now process all the entries, sending them to the driver.
>  	 */
>  	errors = queued = 0;
> -	while (!list_empty(list)) {
> +	do {
>  		struct blk_mq_queue_data bd;
>  
>  		rq = list_first_entry(list, struct request, queuelist);
> @@ -1053,7 +1054,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
>  
>  		if (ret == BLK_MQ_RQ_QUEUE_BUSY)
>  			break;
> -	}
> +	} while (!list_empty(list));
>  
>  	hctx->dispatched[queued_to_index(queued)]++;
>  
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 8d49c06fc520..7e6f2e467696 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -30,7 +30,7 @@ void blk_mq_freeze_queue(struct request_queue *q);
>  void blk_mq_free_queue(struct request_queue *q);
>  int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
>  void blk_mq_wake_waiters(struct request_queue *q);
> -bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
> +bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *);
>  void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
>  bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
>  bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
> -- 
> 2.12.2
> 

-- 
Ming

^ permalink raw reply

* RE: [PATCH] block-mq: set both block queue and hardware queue restart bit for restart
From: KY Srinivasan @ 2017-04-06  4:21 UTC (permalink / raw)
  To: Bart Van Assche, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org, Long Li, axboe@kernel.dk
  Cc: Stephen Hemminger
In-Reply-To: <1491450335.8994.1.camel@sandisk.com>

DQoNCj4gLS0tLS1PcmlnaW5hbCBNZXNzYWdlLS0tLS0NCj4gRnJvbTogQmFydCBWYW4gQXNzY2hl
IFttYWlsdG86QmFydC5WYW5Bc3NjaGVAc2FuZGlzay5jb21dDQo+IFNlbnQ6IFdlZG5lc2RheSwg
QXByaWwgNSwgMjAxNyA4OjQ2IFBNDQo+IFRvOiBsaW51eC1rZXJuZWxAdmdlci5rZXJuZWwub3Jn
OyBsaW51eC1ibG9ja0B2Z2VyLmtlcm5lbC5vcmc7IExvbmcgTGkNCj4gPGxvbmdsaUBtaWNyb3Nv
ZnQuY29tPjsgYXhib2VAa2VybmVsLmRrDQo+IENjOiBTdGVwaGVuIEhlbW1pbmdlciA8c3RoZW1t
aW5AbWljcm9zb2Z0LmNvbT47IEtZIFNyaW5pdmFzYW4NCj4gPGt5c0BtaWNyb3NvZnQuY29tPg0K
PiBTdWJqZWN0OiBSZTogW1BBVENIXSBibG9jay1tcTogc2V0IGJvdGggYmxvY2sgcXVldWUgYW5k
IGhhcmR3YXJlIHF1ZXVlDQo+IHJlc3RhcnQgYml0IGZvciByZXN0YXJ0DQo+IA0KPiBPbiBUaHUs
IDIwMTctMDQtMDYgYXQgMDM6MzggKzAwMDAsIExvbmcgTGkgd3JvdGU6DQo+ID4gPiAtLS0tLU9y
aWdpbmFsIE1lc3NhZ2UtLS0tLQ0KPiA+ID4gRnJvbTogQmFydCBWYW4gQXNzY2hlIFttYWlsdG86
QmFydC5WYW5Bc3NjaGVAc2FuZGlzay5jb21dDQo+ID4gPg0KPiA+ID4gUGxlYXNlIGRyb3AgdGhp
cyBwYXRjaC4gSSdtIHdvcmtpbmcgb24gYSBiZXR0ZXIgc29sdXRpb24uDQo+ID4NCj4gPiBUaGFu
ayB5b3UuIExvb2tpbmcgZm9yd2FyZCB0byB5b3VyIHBhdGNoLg0KPiANCj4gSGVsbG8gTG9uZywN
Cj4gDQo+IEl0IHdvdWxkIGhlbHAgaWYgeW91IGNvdWxkIHNoYXJlIHRoZSBuYW1lIG9mIHRoZSBi
bG9jayBvciBTQ1NJIGRyaXZlciB3aXRoDQo+IHdoaWNoIHlvdSByYW4gaW50byB0aGF0IGxvY2t1
cCBhbmQgYWxzbyBpZiB5b3UgY291bGQgc2hhcmUgdGhlIG5hbWUgb2YgdGhlDQo+IEkvTyBzY2hl
ZHVsZXIgdXNlZCBpbiB5b3VyIHRlc3QuDQoNClRoZSB0ZXN0cyB0aGF0IGluZGljYXRlZCB0aGUg
aXNzdWUgd2VyZSBydW4gSHlwZXItVi4gVGhlIGRyaXZlciBpcyBzdG9ydnNjX2Rydi5jDQpUaGUg
SS9PIHNjaGVkdWxlciB3YXMgSSB0aGluayBub29wLg0KDQpLLiBZDQo+IA0KPiBUaGFua3MsDQo+
IA0KPiBCYXJ0Lg0K

^ permalink raw reply

* Re: [PATCH] block-mq: set both block queue and hardware queue restart bit for restart
From: Bart Van Assche @ 2017-04-06  3:45 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	longli@microsoft.com, axboe@kernel.dk
  Cc: sthemmin@microsoft.com, kys@microsoft.com
In-Reply-To: <BN3PR03MB2227E6DBCAE0365A013B7972CE0D0@BN3PR03MB2227.namprd03.prod.outlook.com>

T24gVGh1LCAyMDE3LTA0LTA2IGF0IDAzOjM4ICswMDAwLCBMb25nIExpIHdyb3RlOg0KPiA+IC0t
LS0tT3JpZ2luYWwgTWVzc2FnZS0tLS0tDQo+ID4gRnJvbTogQmFydCBWYW4gQXNzY2hlIFttYWls
dG86QmFydC5WYW5Bc3NjaGVAc2FuZGlzay5jb21dDQo+ID4gDQo+ID4gUGxlYXNlIGRyb3AgdGhp
cyBwYXRjaC4gSSdtIHdvcmtpbmcgb24gYSBiZXR0ZXIgc29sdXRpb24uDQo+IA0KPiBUaGFuayB5
b3UuIExvb2tpbmcgZm9yd2FyZCB0byB5b3VyIHBhdGNoLg0KDQpIZWxsbyBMb25nLA0KDQpJdCB3
b3VsZCBoZWxwIGlmIHlvdSBjb3VsZCBzaGFyZSB0aGUgbmFtZSBvZiB0aGUgYmxvY2sgb3IgU0NT
SSBkcml2ZXIgd2l0aA0Kd2hpY2ggeW91IHJhbiBpbnRvIHRoYXQgbG9ja3VwIGFuZCBhbHNvIGlm
IHlvdSBjb3VsZCBzaGFyZSB0aGUgbmFtZSBvZiB0aGUNCkkvTyBzY2hlZHVsZXIgdXNlZCBpbiB5
b3VyIHRlc3QuDQoNClRoYW5rcywNCg0KQmFydC4=

^ permalink raw reply

* RE: [PATCH] block-mq: set both block queue and hardware queue restart bit for restart
From: Long Li @ 2017-04-06  3:38 UTC (permalink / raw)
  To: Bart Van Assche, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org, axboe@kernel.dk
  Cc: Stephen Hemminger, KY Srinivasan
In-Reply-To: <1491438735.2787.18.camel@sandisk.com>



> -----Original Message-----
> From: Bart Van Assche [mailto:Bart.VanAssche@sandisk.com]
> Sent: Wednesday, April 5, 2017 5:32 PM
> To: linux-kernel@vger.kernel.org; linux-block@vger.kernel.org; Long Li
> <longli@microsoft.com>; axboe@kernel.dk
> Cc: Stephen Hemminger <sthemmin@microsoft.com>; KY Srinivasan
> <kys@microsoft.com>; Long Li <longli@microsoft.com>
> Subject: Re: [PATCH] block-mq: set both block queue and hardware queue
> restart bit for restart
>=20
> On Wed, 2017-04-05 at 17:16 -0700, Long Li wrote:
> > Under heavy I/O, one hardware queue may be unable to dispatch any I/O
> > to the device layer. This poses a problem with restarting this
> > hardware queue on I/O finish in blk_mq_sched_restart_queues(), becaue
> > there is nothing pending that will finish in future on this hardware qe=
ueu.
> This will result in deadlock.
> >
> > With this patch, we check for all possible stalled hardware queues
> > when I/O finishes on any hardware queues. This prevents this deadlock.
> >
> > Signed-off-by: Long Li <longli@microsoft.com>
> > ---
> >  block/blk-mq-sched.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index
> > 09af8ff..f7f3d44 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -202,7 +202,7 @@ void blk_mq_sched_dispatch_requests(struct
> blk_mq_hw_ctx *hctx)
> >  	 * needing a restart in that case.
> >  	 */
> >  	if (!list_empty(&rq_list)) {
> > -		blk_mq_sched_mark_restart_hctx(hctx);
> > +		blk_mq_sched_mark_restart_queue(hctx);
> >  		did_work =3D blk_mq_dispatch_rq_list(hctx, &rq_list);
> >  	} else if (!has_sched_dispatch) {
> >  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
>=20
> Please drop this patch. I'm working on a better solution.

Thank you. Looking forward to your patch.

>=20
> Thanks,
>=20
> Bart.

^ permalink raw reply

* Re: [PATCH v2] loop: Add PF_LESS_THROTTLE to block/loop device thread.
From: NeilBrown @ 2017-04-06  2:23 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Jens Axboe, linux-block, linux-mm, LKML, Ming Lei
In-Reply-To: <20170405073233.GD6035@dhcp22.suse.cz>

[-- Attachment #1: Type: text/plain, Size: 4074 bytes --]

On Wed, Apr 05 2017, Michal Hocko wrote:

> On Wed 05-04-17 09:19:27, Michal Hocko wrote:
>> On Wed 05-04-17 14:33:50, NeilBrown wrote:
> [...]
>> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
>> > index 0ecb6461ed81..44b3506fd086 100644
>> > --- a/drivers/block/loop.c
>> > +++ b/drivers/block/loop.c
>> > @@ -852,6 +852,7 @@ static int loop_prepare_queue(struct loop_device *lo)
>> >  	if (IS_ERR(lo->worker_task))
>> >  		return -ENOMEM;
>> >  	set_user_nice(lo->worker_task, MIN_NICE);
>> > +	lo->worker_task->flags |= PF_LESS_THROTTLE;
>> >  	return 0;
>> 
>> As mentioned elsewhere, PF flags should be updated only on the current
>> task otherwise there is potential rmw race. Is this safe? The code runs
>> concurrently with the worker thread.
>
> I believe you need something like this instead
> ---
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index f347285c67ec..07b2a909e4fb 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -844,10 +844,16 @@ static void loop_unprepare_queue(struct loop_device *lo)
>  	kthread_stop(lo->worker_task);
>  }
>  
> +int loop_kthread_worker_fn(void *worker_ptr)
> +{
> +	current->flags |= PF_LESS_THROTTLE;
> +	return kthread_worker_fn(worker_ptr);
> +}
> +
>  static int loop_prepare_queue(struct loop_device *lo)
>  {
>  	kthread_init_worker(&lo->worker);
> -	lo->worker_task = kthread_run(kthread_worker_fn,
> +	lo->worker_task = kthread_run(loop_kthread_worker_fn,
>  			&lo->worker, "loop%d", lo->lo_number);
>  	if (IS_ERR(lo->worker_task))
>  		return -ENOMEM;

Arg - of course.
How about we just split the kthread_create from the wake_up?

Thanks,
NeilBrown


From: NeilBrown <neilb@suse.com>
Subject: [PATCH] loop: Add PF_LESS_THROTTLE to block/loop device thread.

When a filesystem is mounted from a loop device, writes are
throttled by balance_dirty_pages() twice: once when writing
to the filesystem and once when the loop_handle_cmd() writes
to the backing file.  This double-throttling can trigger
positive feedback loops that create significant delays.  The
throttling at the lower level is seen by the upper level as
a slow device, so it throttles extra hard.

The PF_LESS_THROTTLE flag was created to handle exactly this
circumstance, though with an NFS filesystem mounted from a
local NFS server.  It reduces the throttling on the lower
layer so that it can proceed largely unthrottled.

To demonstrate this, create a filesystem on a loop device
and write (e.g. with dd) several large files which combine
to consume significantly more than the limit set by
/proc/sys/vm/dirty_ratio or dirty_bytes.  Measure the total
time taken.

When I do this directly on a device (no loop device) the
total time for several runs (mkfs, mount, write 200 files,
umount) is fairly stable: 28-35 seconds.
When I do this over a loop device the times are much worse
and less stable.  52-460 seconds.  Half below 100seconds,
half above.
When I apply this patch, the times become stable again,
though not as fast as the no-loop-back case: 53-72 seconds.

There may be room for further improvement as the total overhead still
seems too high, but this is a big improvement.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/block/loop.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 0ecb6461ed81..95679d988725 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -847,10 +847,12 @@ static void loop_unprepare_queue(struct loop_device *lo)
 static int loop_prepare_queue(struct loop_device *lo)
 {
 	kthread_init_worker(&lo->worker);
-	lo->worker_task = kthread_run(kthread_worker_fn,
+	lo->worker_task = kthread_create(kthread_worker_fn,
 			&lo->worker, "loop%d", lo->lo_number);
 	if (IS_ERR(lo->worker_task))
 		return -ENOMEM;
+	lo->worker_task->flags |= PF_LESS_THROTTLE;
+	wake_up_process(lo->worker_task);
 	set_user_nice(lo->worker_task, MIN_NICE);
 	return 0;
 }
-- 
2.12.2


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* Re: [PATCH] block-mq: set both block queue and hardware queue restart bit for restart
From: Bart Van Assche @ 2017-04-06  0:32 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	longli@exchange.microsoft.com, axboe@kernel.dk
  Cc: sthemmin@microsoft.com, kys@microsoft.com, longli@microsoft.com
In-Reply-To: <1491437781-3565-1-git-send-email-longli@exchange.microsoft.com>

On Wed, 2017-04-05 at 17:16 -0700, Long Li wrote:
> Under heavy I/O, one hardware queue may be unable to dispatch any I/O to =
the=20
> device layer. This poses a problem with restarting this hardware queue on=
 I/O
> finish in blk_mq_sched_restart_queues(), becaue there is nothing pending =
that
> will finish in future on this hardware qeueu. This will result in deadloc=
k.
>=20
> With this patch, we check for all possible stalled hardware queues when I=
/O
> finishes on any hardware queues. This prevents this deadlock.
>=20
> Signed-off-by: Long Li <longli@microsoft.com>
> ---
>  block/blk-mq-sched.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>=20
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 09af8ff..f7f3d44 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -202,7 +202,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_=
ctx *hctx)
>  	 * needing a restart in that case.
>  	 */
>  	if (!list_empty(&rq_list)) {
> -		blk_mq_sched_mark_restart_hctx(hctx);
> +		blk_mq_sched_mark_restart_queue(hctx);
>  		did_work =3D blk_mq_dispatch_rq_list(hctx, &rq_list);
>  	} else if (!has_sched_dispatch) {
>  		blk_mq_flush_busy_ctxs(hctx, &rq_list);

Please drop this patch. I'm working on a better solution.

Thanks,

Bart.=

^ permalink raw reply

* [PATCH] block-mq: set both block queue and hardware queue restart bit for restart
From: Long Li @ 2017-04-06  0:16 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-kernel
  Cc: K. Y. Srinivasan, Stephen Hemminger, Long Li

From: Long Li <longli@microsoft.com>

Under heavy I/O, one hardware queue may be unable to dispatch any I/O to the 
device layer. This poses a problem with restarting this hardware queue on I/O
finish in blk_mq_sched_restart_queues(), becaue there is nothing pending that
will finish in future on this hardware qeueu. This will result in deadlock.

With this patch, we check for all possible stalled hardware queues when I/O
finishes on any hardware queues. This prevents this deadlock.

Signed-off-by: Long Li <longli@microsoft.com>
---
 block/blk-mq-sched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 09af8ff..f7f3d44 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -202,7 +202,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 * needing a restart in that case.
 	 */
 	if (!list_empty(&rq_list)) {
-		blk_mq_sched_mark_restart_hctx(hctx);
+		blk_mq_sched_mark_restart_queue(hctx);
 		did_work = blk_mq_dispatch_rq_list(hctx, &rq_list);
 	} else if (!has_sched_dispatch) {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v2 3/5] blk-mq: Introduce blk_mq_ops.restart_hctx
From: Omar Sandoval @ 2017-04-05 20:54 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch@lst.de, James.Bottomley@HansenPartnership.com,
	linux-block@vger.kernel.org, hare@suse.com,
	martin.petersen@oracle.com, axboe@kernel.dk
In-Reply-To: <1491425510.2787.15.camel@sandisk.com>

On Wed, Apr 05, 2017 at 08:51:51PM +0000, Bart Van Assche wrote:
> On Wed, 2017-04-05 at 13:41 -0700, Omar Sandoval wrote:
> > On Mon, Apr 03, 2017 at 04:22:26PM -0700, Bart Van Assche wrote:
> > > If a tag set is shared among multiple hardware queues, leave
> > > it to the block driver to rerun hardware queues. Hence remove
> > > QUEUE_FLAG_RESTART and introduce blk_mq_ops.restart_hctx.
> > > Remove blk_mq_sched_mark_restart_queue() because this
> > > function has no callers.
> > 
> > Kyber uses blk_mq_sched_mark_restart_queue() and the QUEUE_FLAG_RESTART
> > bit. If it's not too much trouble, it'd make things easier for me if you
> > left it in place. If it's a pain, it's fine if you get rid of it, I can
> > reintroduce it in my series.
> 
> Hello Omar,
> 
> Would it be OK for you to reintroduce blk_mq_sched_mark_restart_queue()?
> Since that function does not yet have any users I can't test any changes
> I make to that function ...
> 
> Thanks,
> 
> Bart.

Yeah, that's fine.

^ permalink raw reply

* Re: [PATCH v2 3/5] blk-mq: Introduce blk_mq_ops.restart_hctx
From: Bart Van Assche @ 2017-04-05 20:51 UTC (permalink / raw)
  To: osandov@osandov.com
  Cc: hch@lst.de, James.Bottomley@HansenPartnership.com,
	linux-block@vger.kernel.org, hare@suse.com,
	martin.petersen@oracle.com, axboe@kernel.dk
In-Reply-To: <20170405204120.GC22645@vader.DHCP.thefacebook.com>

On Wed, 2017-04-05 at 13:41 -0700, Omar Sandoval wrote:
> On Mon, Apr 03, 2017 at 04:22:26PM -0700, Bart Van Assche wrote:
> > If a tag set is shared among multiple hardware queues, leave
> > it to the block driver to rerun hardware queues. Hence remove
> > QUEUE_FLAG_RESTART and introduce blk_mq_ops.restart_hctx.
> > Remove blk_mq_sched_mark_restart_queue() because this
> > function has no callers.
>=20
> Kyber uses blk_mq_sched_mark_restart_queue() and the QUEUE_FLAG_RESTART
> bit. If it's not too much trouble, it'd make things easier for me if you
> left it in place. If it's a pain, it's fine if you get rid of it, I can
> reintroduce it in my series.

Hello Omar,

Would it be OK for you to reintroduce blk_mq_sched_mark_restart_queue()?
Since that function does not yet have any users I can't test any changes
I make to that function ...

Thanks,

Bart.=

^ permalink raw reply

* Re: [PATCH v2 3/5] blk-mq: Introduce blk_mq_ops.restart_hctx
From: Omar Sandoval @ 2017-04-05 20:41 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Martin K . Petersen, James Bottomley,
	Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170403232228.11208-4-bart.vanassche@sandisk.com>

On Mon, Apr 03, 2017 at 04:22:26PM -0700, Bart Van Assche wrote:
> If a tag set is shared among multiple hardware queues, leave
> it to the block driver to rerun hardware queues. Hence remove
> QUEUE_FLAG_RESTART and introduce blk_mq_ops.restart_hctx.
> Remove blk_mq_sched_mark_restart_queue() because this
> function has no callers.

Hi, Bart,

Kyber uses blk_mq_sched_mark_restart_queue() and the QUEUE_FLAG_RESTART
bit. If it's not too much trouble, it'd make things easier for me if you
left it in place. If it's a pain, it's fine if you get rid of it, I can
reintroduce it in my series.

^ permalink raw reply

* [PATCH v3 8/8] blk-mq: use true instead of 1 for blk_mq_queue_data.last
From: Omar Sandoval @ 2017-04-05 19:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491418411.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Trivial cleanup.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index de57e7727c52..c82fcb923ff8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1435,7 +1435,7 @@ static void __blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie,
 	struct request_queue *q = rq->q;
 	struct blk_mq_queue_data bd = {
 		.rq = rq,
-		.last = 1
+		.last = true,
 	};
 	struct blk_mq_hw_ctx *hctx;
 	blk_qc_t new_cookie;
-- 
2.12.2

^ permalink raw reply related

* [PATCH v3 7/8] blk-mq: make driver tag failure path easier to follow
From: Omar Sandoval @ 2017-04-05 19:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491418411.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Minor cleanup that makes it easier to figure out what's going on in the
driver tag allocation failure path of blk_mq_dispatch_rq_list().

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 39e9af3498ac..de57e7727c52 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1003,17 +1003,16 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 			 * The initial allocation attempt failed, so we need to
 			 * rerun the hardware queue when a tag is freed.
 			 */
-			if (blk_mq_dispatch_wait_add(hctx)) {
-				/*
-				 * It's possible that a tag was freed in the
-				 * window between the allocation failure and
-				 * adding the hardware queue to the wait queue.
-				 */
-				if (!blk_mq_get_driver_tag(rq, &hctx, false))
-					break;
-			} else {
+			if (!blk_mq_dispatch_wait_add(hctx))
+				break;
+
+			/*
+			 * It's possible that a tag was freed in the window
+			 * between the allocation failure and adding the
+			 * hardware queue to the wait queue.
+			 */
+			if (!blk_mq_get_driver_tag(rq, &hctx, false))
 				break;
-			}
 		}
 
 		list_del_init(&rq->queuelist);
-- 
2.12.2

^ permalink raw reply related

* [PATCH v3 6/8] blk-mq-sched: provide hooks for initializing hardware queue data
From: Omar Sandoval @ 2017-04-05 19:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491418411.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Schedulers need to be informed when a hardware queue is added or removed
at runtime so they can allocate/free per-hardware queue data. So,
replace the blk_mq_sched_init_hctx_data() helper, which only makes sense
at init time, with .init_hctx() and .exit_hctx() hooks.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c     | 81 +++++++++++++++++++++++++-----------------------
 block/blk-mq-sched.h     |  4 ---
 include/linux/elevator.h |  2 ++
 3 files changed, 45 insertions(+), 42 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index e8c2ed654ef0..9d7f6d6ca693 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -30,43 +30,6 @@ void blk_mq_sched_free_hctx_data(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
 
-int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
-				int (*init)(struct blk_mq_hw_ctx *),
-				void (*exit)(struct blk_mq_hw_ctx *))
-{
-	struct blk_mq_hw_ctx *hctx;
-	int ret;
-	int i;
-
-	queue_for_each_hw_ctx(q, hctx, i) {
-		hctx->sched_data = kmalloc_node(size, GFP_KERNEL, hctx->numa_node);
-		if (!hctx->sched_data) {
-			ret = -ENOMEM;
-			goto error;
-		}
-
-		if (init) {
-			ret = init(hctx);
-			if (ret) {
-				/*
-				 * We don't want to give exit() a partially
-				 * initialized sched_data. init() must clean up
-				 * if it fails.
-				 */
-				kfree(hctx->sched_data);
-				hctx->sched_data = NULL;
-				goto error;
-			}
-		}
-	}
-
-	return 0;
-error:
-	blk_mq_sched_free_hctx_data(q, exit);
-	return ret;
-}
-EXPORT_SYMBOL_GPL(blk_mq_sched_init_hctx_data);
-
 static void __blk_mq_sched_assign_ioc(struct request_queue *q,
 				      struct request *rq,
 				      struct bio *bio,
@@ -465,11 +428,24 @@ int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			   unsigned int hctx_idx)
 {
 	struct elevator_queue *e = q->elevator;
+	int ret;
 
 	if (!e)
 		return 0;
 
-	return blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
+	ret = blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
+	if (ret)
+		return ret;
+
+	if (e->type->ops.mq.init_hctx) {
+		ret = e->type->ops.mq.init_hctx(hctx, hctx_idx);
+		if (ret) {
+			blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
+			return ret;
+		}
+	}
+
+	return 0;
 }
 
 void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
@@ -480,12 +456,18 @@ void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 	if (!e)
 		return;
 
+	if (e->type->ops.mq.exit_hctx && hctx->sched_data) {
+		e->type->ops.mq.exit_hctx(hctx, hctx_idx);
+		hctx->sched_data = NULL;
+	}
+
 	blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
 }
 
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 {
 	struct blk_mq_hw_ctx *hctx;
+	struct elevator_queue *eq;
 	unsigned int i;
 	int ret;
 
@@ -510,6 +492,18 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	if (ret)
 		goto err;
 
+	if (e->ops.mq.init_hctx) {
+		queue_for_each_hw_ctx(q, hctx, i) {
+			ret = e->ops.mq.init_hctx(hctx, i);
+			if (ret) {
+				eq = q->elevator;
+				blk_mq_exit_sched(q, eq);
+				kobject_put(&eq->kobj);
+				return ret;
+			}
+		}
+	}
+
 	return 0;
 
 err:
@@ -520,6 +514,17 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 
 void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
 {
+	struct blk_mq_hw_ctx *hctx;
+	unsigned int i;
+
+	if (e->type->ops.mq.exit_hctx) {
+		queue_for_each_hw_ctx(q, hctx, i) {
+			if (hctx->sched_data) {
+				e->type->ops.mq.exit_hctx(hctx, i);
+				hctx->sched_data = NULL;
+			}
+		}
+	}
 	if (e->type->ops.mq.exit_sched)
 		e->type->ops.mq.exit_sched(e);
 	blk_mq_sched_tags_teardown(q);
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index e704956e0862..c6e760df0fb4 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -4,10 +4,6 @@
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
 
-int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
-				int (*init)(struct blk_mq_hw_ctx *),
-				void (*exit)(struct blk_mq_hw_ctx *));
-
 void blk_mq_sched_free_hctx_data(struct request_queue *q,
 				 void (*exit)(struct blk_mq_hw_ctx *));
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 22d39e8d4de1..b7ec315ee7e7 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -93,6 +93,8 @@ struct blk_mq_hw_ctx;
 struct elevator_mq_ops {
 	int (*init_sched)(struct request_queue *, struct elevator_type *);
 	void (*exit_sched)(struct elevator_queue *);
+	int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
+	void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
 
 	bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
 	bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);
-- 
2.12.2

^ permalink raw reply related

* [PATCH v3 5/8] blk-mq: remap queues when adding/removing hardware queues
From: Omar Sandoval @ 2017-04-05 19:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491418411.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

blk_mq_update_nr_hw_queues() used to remap hardware queues, which is the
behavior that drivers expect. However, commit 4e68a011428a changed
blk_mq_queue_reinit() to not remap queues for the case of CPU
hotplugging, inadvertently making blk_mq_update_nr_hw_queues() not remap
queues as well. This breaks, for example, NBD's multi-connection mode,
leaving the added hardware queues unused. Fix it by making
blk_mq_update_nr_hw_queues() explicitly remap the queues.

Fixes: 4e68a011428a ("blk-mq: don't redistribute hardware queues on a CPU hotplug event")
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 147cc53134d8..39e9af3498ac 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2467,6 +2467,14 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 	return 0;
 }
 
+static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
+{
+	if (set->ops->map_queues)
+		return set->ops->map_queues(set);
+	else
+		return blk_mq_map_queues(set);
+}
+
 /*
  * Alloc a tag set to be associated with one or more request queues.
  * May fail with EINVAL for various error conditions. May adjust the
@@ -2521,10 +2529,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (!set->mq_map)
 		goto out_free_tags;
 
-	if (set->ops->map_queues)
-		ret = set->ops->map_queues(set);
-	else
-		ret = blk_mq_map_queues(set);
+	ret = blk_mq_update_queue_map(set);
 	if (ret)
 		goto out_free_mq_map;
 
@@ -2616,6 +2621,7 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 		blk_mq_freeze_queue(q);
 
 	set->nr_hw_queues = nr_hw_queues;
+	blk_mq_update_queue_map(set);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_realloc_hw_ctxs(set, q);
 		blk_mq_queue_reinit(q, cpu_online_mask);
-- 
2.12.2

^ permalink raw reply related

* [PATCH v3 4/8] blk-mq-sched: fix crash in switch error path
From: Omar Sandoval @ 2017-04-05 19:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491418411.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
back to the original scheduler. However, at this point, we've already
torn down the original scheduler's tags, so this causes a crash. Doing
the fallback like the legacy elevator path is much harder for mq, so fix
it by just falling back to none, instead.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c     | 13 +++++--
 block/blk-mq-sched.h     |  2 +-
 block/blk-mq.c           |  2 --
 block/blk-sysfs.c        |  2 +-
 block/elevator.c         | 94 +++++++++++++++++++++++++++---------------------
 include/linux/elevator.h |  2 +-
 6 files changed, 67 insertions(+), 48 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 0bb13bb51daa..e8c2ed654ef0 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -451,7 +451,7 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
 	return ret;
 }
 
-void blk_mq_sched_teardown(struct request_queue *q)
+static void blk_mq_sched_tags_teardown(struct request_queue *q)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
 	struct blk_mq_hw_ctx *hctx;
@@ -513,10 +513,19 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	return 0;
 
 err:
-	blk_mq_sched_teardown(q);
+	blk_mq_sched_tags_teardown(q);
+	q->elevator = NULL;
 	return ret;
 }
 
+void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
+{
+	if (e->type->ops.mq.exit_sched)
+		e->type->ops.mq.exit_sched(e);
+	blk_mq_sched_tags_teardown(q);
+	q->elevator = NULL;
+}
+
 int blk_mq_sched_init(struct request_queue *q)
 {
 	int ret;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 19db25e0c95a..e704956e0862 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -33,7 +33,7 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
 			struct request *(*get_rq)(struct blk_mq_hw_ctx *));
 
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
-void blk_mq_sched_teardown(struct request_queue *q);
+void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
 
 int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			   unsigned int hctx_idx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 672430c8c342..147cc53134d8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2137,8 +2137,6 @@ void blk_mq_release(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 
-	blk_mq_sched_teardown(q);
-
 	/* hctx kobj stays in hctx */
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 45854266e398..c47db43a40cc 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -803,7 +803,7 @@ static void blk_release_queue(struct kobject *kobj)
 
 	if (q->elevator) {
 		ioc_clear_queue(q);
-		elevator_exit(q->elevator);
+		elevator_exit(q, q->elevator);
 	}
 
 	blk_free_queue_stats(q->stats);
diff --git a/block/elevator.c b/block/elevator.c
index f236ef1d2be9..dbeecf7be719 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -252,11 +252,11 @@ int elevator_init(struct request_queue *q, char *name)
 }
 EXPORT_SYMBOL(elevator_init);
 
-void elevator_exit(struct elevator_queue *e)
+void elevator_exit(struct request_queue *q, struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
 	if (e->uses_mq && e->type->ops.mq.exit_sched)
-		e->type->ops.mq.exit_sched(e);
+		blk_mq_exit_sched(q, e);
 	else if (!e->uses_mq && e->type->ops.sq.elevator_exit_fn)
 		e->type->ops.sq.elevator_exit_fn(e);
 	mutex_unlock(&e->sysfs_lock);
@@ -941,6 +941,45 @@ void elv_unregister(struct elevator_type *e)
 }
 EXPORT_SYMBOL_GPL(elv_unregister);
 
+static int elevator_switch_mq(struct request_queue *q,
+			      struct elevator_type *new_e)
+{
+	int ret;
+
+	blk_mq_freeze_queue(q);
+	blk_mq_quiesce_queue(q);
+
+	if (q->elevator) {
+		if (q->elevator->registered)
+			elv_unregister_queue(q);
+		ioc_clear_queue(q);
+		elevator_exit(q, q->elevator);
+	}
+
+	ret = blk_mq_init_sched(q, new_e);
+	if (ret)
+		goto out;
+
+	if (new_e) {
+		ret = elv_register_queue(q);
+		if (ret) {
+			elevator_exit(q, q->elevator);
+			goto out;
+		}
+	}
+
+	if (new_e)
+		blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+	else
+		blk_add_trace_msg(q, "elv switch: none");
+
+out:
+	blk_mq_unfreeze_queue(q);
+	blk_mq_start_stopped_hw_queues(q, true);
+	return ret;
+
+}
+
 /*
  * switch to new_e io scheduler. be careful not to introduce deadlocks -
  * we don't free the old io scheduler, before we have allocated what we
@@ -953,10 +992,8 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	bool old_registered = false;
 	int err;
 
-	if (q->mq_ops) {
-		blk_mq_freeze_queue(q);
-		blk_mq_quiesce_queue(q);
-	}
+	if (q->mq_ops)
+		return elevator_switch_mq(q, new_e);
 
 	/*
 	 * Turn on BYPASS and drain all requests w/ elevator private data.
@@ -968,11 +1005,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	if (old) {
 		old_registered = old->registered;
 
-		if (old->uses_mq)
-			blk_mq_sched_teardown(q);
-
-		if (!q->mq_ops)
-			blk_queue_bypass_start(q);
+		blk_queue_bypass_start(q);
 
 		/* unregister and clear all auxiliary data of the old elevator */
 		if (old_registered)
@@ -982,53 +1015,32 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	}
 
 	/* allocate, init and register new elevator */
-	if (q->mq_ops)
-		err = blk_mq_init_sched(q, new_e);
-	else
-		err = new_e->ops.sq.elevator_init_fn(q, new_e);
+	err = new_e->ops.sq.elevator_init_fn(q, new_e);
 	if (err)
 		goto fail_init;
 
-	if (new_e) {
-		err = elv_register_queue(q);
-		if (err)
-			goto fail_register;
-	}
+	err = elv_register_queue(q);
+	if (err)
+		goto fail_register;
 
 	/* done, kill the old one and finish */
 	if (old) {
-		elevator_exit(old);
-		if (!q->mq_ops)
-			blk_queue_bypass_end(q);
+		elevator_exit(q, old);
+		blk_queue_bypass_end(q);
 	}
 
-	if (q->mq_ops) {
-		blk_mq_unfreeze_queue(q);
-		blk_mq_start_stopped_hw_queues(q, true);
-	}
-
-	if (new_e)
-		blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
-	else
-		blk_add_trace_msg(q, "elv switch: none");
+	blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
 
 	return 0;
 
 fail_register:
-	if (q->mq_ops)
-		blk_mq_sched_teardown(q);
-	elevator_exit(q->elevator);
+	elevator_exit(q, q->elevator);
 fail_init:
 	/* switch failed, restore and re-register old elevator */
 	if (old) {
 		q->elevator = old;
 		elv_register_queue(q);
-		if (!q->mq_ops)
-			blk_queue_bypass_end(q);
-	}
-	if (q->mq_ops) {
-		blk_mq_unfreeze_queue(q);
-		blk_mq_start_stopped_hw_queues(q, true);
+		blk_queue_bypass_end(q);
 	}
 
 	return err;
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index aebecc4ed088..22d39e8d4de1 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -211,7 +211,7 @@ extern ssize_t elv_iosched_show(struct request_queue *, char *);
 extern ssize_t elv_iosched_store(struct request_queue *, const char *, size_t);
 
 extern int elevator_init(struct request_queue *, char *);
-extern void elevator_exit(struct elevator_queue *);
+extern void elevator_exit(struct request_queue *, struct elevator_queue *);
 extern int elevator_change(struct request_queue *, const char *);
 extern bool elv_bio_merge_ok(struct request *, struct bio *);
 extern struct elevator_queue *elevator_alloc(struct request_queue *,
-- 
2.12.2

^ permalink raw reply related

* [PATCH v3 3/8] blk-mq-sched: set up scheduler tags when bringing up new queues
From: Omar Sandoval @ 2017-04-05 19:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491418411.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

If a new hardware queue is added at runtime, we don't allocate scheduler
tags for it, leading to a crash. This hooks up the scheduler framework
to blk_mq_{init,exit}_hctx() to make sure everything gets properly
initialized/freed.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c | 22 ++++++++++++++++++++++
 block/blk-mq-sched.h |  5 +++++
 block/blk-mq.c       |  9 ++++++++-
 3 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 6bd1758ea29b..0bb13bb51daa 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -461,6 +461,28 @@ void blk_mq_sched_teardown(struct request_queue *q)
 		blk_mq_sched_free_tags(set, hctx, i);
 }
 
+int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			   unsigned int hctx_idx)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (!e)
+		return 0;
+
+	return blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
+}
+
+void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			    unsigned int hctx_idx)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (!e)
+		return;
+
+	blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
+}
+
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 {
 	struct blk_mq_hw_ctx *hctx;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 873f9af5a35b..19db25e0c95a 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -35,6 +35,11 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
 void blk_mq_sched_teardown(struct request_queue *q);
 
+int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			   unsigned int hctx_idx);
+void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			    unsigned int hctx_idx);
+
 int blk_mq_sched_init(struct request_queue *q);
 
 static inline bool
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 09cff6d1ba76..672430c8c342 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1823,6 +1823,8 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 				       hctx->fq->flush_rq, hctx_idx,
 				       flush_start_tag + hctx_idx);
 
+	blk_mq_sched_exit_hctx(q, hctx, hctx_idx);
+
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
 
@@ -1889,9 +1891,12 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
 		goto free_bitmap;
 
+	if (blk_mq_sched_init_hctx(q, hctx, hctx_idx))
+		goto exit_hctx;
+
 	hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size);
 	if (!hctx->fq)
-		goto exit_hctx;
+		goto sched_exit_hctx;
 
 	if (set->ops->init_request &&
 	    set->ops->init_request(set->driver_data,
@@ -1906,6 +1911,8 @@ static int blk_mq_init_hctx(struct request_queue *q,
 
  free_fq:
 	kfree(hctx->fq);
+ sched_exit_hctx:
+	blk_mq_sched_exit_hctx(q, hctx, hctx_idx);
  exit_hctx:
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
-- 
2.12.2

^ permalink raw reply related

* [PATCH v3 2/8] blk-mq-sched: refactor scheduler initialization
From: Omar Sandoval @ 2017-04-05 19:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491418411.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Preparation cleanup for the next couple of fixes, push
blk_mq_sched_setup() and e->ops.mq.init_sched() into a helper.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c | 82 ++++++++++++++++++++++++++++------------------------
 block/blk-mq-sched.h |  2 +-
 block/elevator.c     | 32 ++++++++------------
 3 files changed, 57 insertions(+), 59 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index fc00f00898d3..6bd1758ea29b 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -432,11 +432,45 @@ static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
 	}
 }
 
-int blk_mq_sched_setup(struct request_queue *q)
+static int blk_mq_sched_alloc_tags(struct request_queue *q,
+				   struct blk_mq_hw_ctx *hctx,
+				   unsigned int hctx_idx)
+{
+	struct blk_mq_tag_set *set = q->tag_set;
+	int ret;
+
+	hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
+					       set->reserved_tags);
+	if (!hctx->sched_tags)
+		return -ENOMEM;
+
+	ret = blk_mq_alloc_rqs(set, hctx->sched_tags, hctx_idx, q->nr_requests);
+	if (ret)
+		blk_mq_sched_free_tags(set, hctx, hctx_idx);
+
+	return ret;
+}
+
+void blk_mq_sched_teardown(struct request_queue *q)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
 	struct blk_mq_hw_ctx *hctx;
-	int ret, i;
+	int i;
+
+	queue_for_each_hw_ctx(q, hctx, i)
+		blk_mq_sched_free_tags(set, hctx, i);
+}
+
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned int i;
+	int ret;
+
+	if (!e) {
+		q->elevator = NULL;
+		return 0;
+	}
 
 	/*
 	 * Default to 256, since we don't split into sync/async like the
@@ -444,49 +478,21 @@ int blk_mq_sched_setup(struct request_queue *q)
 	 */
 	q->nr_requests = 2 * BLKDEV_MAX_RQ;
 
-	/*
-	 * We're switching to using an IO scheduler, so setup the hctx
-	 * scheduler tags and switch the request map from the regular
-	 * tags to scheduler tags. First allocate what we need, so we
-	 * can safely fail and fallback, if needed.
-	 */
-	ret = 0;
 	queue_for_each_hw_ctx(q, hctx, i) {
-		hctx->sched_tags = blk_mq_alloc_rq_map(set, i,
-				q->nr_requests, set->reserved_tags);
-		if (!hctx->sched_tags) {
-			ret = -ENOMEM;
-			break;
-		}
-		ret = blk_mq_alloc_rqs(set, hctx->sched_tags, i, q->nr_requests);
+		ret = blk_mq_sched_alloc_tags(q, hctx, i);
 		if (ret)
-			break;
+			goto err;
 	}
 
-	/*
-	 * If we failed, free what we did allocate
-	 */
-	if (ret) {
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (!hctx->sched_tags)
-				continue;
-			blk_mq_sched_free_tags(set, hctx, i);
-		}
-
-		return ret;
-	}
+	ret = e->ops.mq.init_sched(q, e);
+	if (ret)
+		goto err;
 
 	return 0;
-}
 
-void blk_mq_sched_teardown(struct request_queue *q)
-{
-	struct blk_mq_tag_set *set = q->tag_set;
-	struct blk_mq_hw_ctx *hctx;
-	int i;
-
-	queue_for_each_hw_ctx(q, hctx, i)
-		blk_mq_sched_free_tags(set, hctx, i);
+err:
+	blk_mq_sched_teardown(q);
+	return ret;
 }
 
 int blk_mq_sched_init(struct request_queue *q)
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index a75b16b123f7..873f9af5a35b 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -32,7 +32,7 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
 			struct list_head *rq_list,
 			struct request *(*get_rq)(struct blk_mq_hw_ctx *));
 
-int blk_mq_sched_setup(struct request_queue *q);
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
 void blk_mq_sched_teardown(struct request_queue *q);
 
 int blk_mq_sched_init(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index 01139f549b5b..f236ef1d2be9 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -242,17 +242,12 @@ int elevator_init(struct request_queue *q, char *name)
 		}
 	}
 
-	if (e->uses_mq) {
-		err = blk_mq_sched_setup(q);
-		if (!err)
-			err = e->ops.mq.init_sched(q, e);
-	} else
+	if (e->uses_mq)
+		err = blk_mq_init_sched(q, e);
+	else
 		err = e->ops.sq.elevator_init_fn(q, e);
-	if (err) {
-		if (e->uses_mq)
-			blk_mq_sched_teardown(q);
+	if (err)
 		elevator_put(e);
-	}
 	return err;
 }
 EXPORT_SYMBOL(elevator_init);
@@ -987,21 +982,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	}
 
 	/* allocate, init and register new elevator */
-	if (new_e) {
-		if (new_e->uses_mq) {
-			err = blk_mq_sched_setup(q);
-			if (!err)
-				err = new_e->ops.mq.init_sched(q, new_e);
-		} else
-			err = new_e->ops.sq.elevator_init_fn(q, new_e);
-		if (err)
-			goto fail_init;
+	if (q->mq_ops)
+		err = blk_mq_init_sched(q, new_e);
+	else
+		err = new_e->ops.sq.elevator_init_fn(q, new_e);
+	if (err)
+		goto fail_init;
 
+	if (new_e) {
 		err = elv_register_queue(q);
 		if (err)
 			goto fail_register;
-	} else
-		q->elevator = NULL;
+	}
 
 	/* done, kill the old one and finish */
 	if (old) {
-- 
2.12.2

^ permalink raw reply related

* [PATCH v3 1/8] blk-mq: use the right hctx when getting a driver tag fails
From: Omar Sandoval @ 2017-04-05 19:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491418411.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

While dispatching requests, if we fail to get a driver tag, we mark the
hardware queue as waiting for a tag and put the requests on a
hctx->dispatch list to be run later when a driver tag is freed. However,
blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
queues if using a single-queue scheduler with a multiqueue device. If
blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
are processing. This means we end up using the hardware queue of the
previous request, which may or may not be the same as that of the
current request. If it isn't, the wrong hardware queue will end up
waiting for a tag, and the requests will be on the wrong dispatch list,
leading to a hang.

The fix is twofold:

1. Make sure we save which hardware queue we were trying to get a
   request for in blk_mq_get_driver_tag() regardless of whether it
   succeeds or not.
2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
   blk_mq_hw_queue to make it clear that it must handle multiple
   hardware queues, since I've already messed this up on a couple of
   occasions.

This didn't appear in testing with nvme and mq-deadline because nvme has
more driver tags than the default number of scheduler tags. However,
with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c |  9 +++++----
 block/blk-mq.c       | 25 +++++++++++++------------
 block/blk-mq.h       |  2 +-
 3 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 09af8ff18719..fc00f00898d3 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -171,7 +171,8 @@ void blk_mq_sched_put_request(struct request *rq)
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
-	struct elevator_queue *e = hctx->queue->elevator;
+	struct request_queue *q = hctx->queue;
+	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
 	bool did_work = false;
 	LIST_HEAD(rq_list);
@@ -203,10 +204,10 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		did_work = blk_mq_dispatch_rq_list(hctx, &rq_list);
+		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
 	} else if (!has_sched_dispatch) {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
-		blk_mq_dispatch_rq_list(hctx, &rq_list);
+		blk_mq_dispatch_rq_list(q, &rq_list);
 	}
 
 	/*
@@ -222,7 +223,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 			if (!rq)
 				break;
 			list_add(&rq->queuelist, &rq_list);
-		} while (blk_mq_dispatch_rq_list(hctx, &rq_list));
+		} while (blk_mq_dispatch_rq_list(q, &rq_list));
 	}
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f7cd3208bcdf..09cff6d1ba76 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -863,12 +863,8 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 		.flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
 	};
 
-	if (rq->tag != -1) {
-done:
-		if (hctx)
-			*hctx = data.hctx;
-		return true;
-	}
+	if (rq->tag != -1)
+		goto done;
 
 	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
 		data.flags |= BLK_MQ_REQ_RESERVED;
@@ -880,10 +876,12 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 			atomic_inc(&data.hctx->nr_active);
 		}
 		data.hctx->tags->rqs[rq->tag] = rq;
-		goto done;
 	}
 
-	return false;
+done:
+	if (hctx)
+		*hctx = data.hctx;
+	return rq->tag != -1;
 }
 
 static void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
@@ -980,17 +978,20 @@ static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
 	return true;
 }
 
-bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
+bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 {
-	struct request_queue *q = hctx->queue;
+	struct blk_mq_hw_ctx *hctx;
 	struct request *rq;
 	int errors, queued, ret = BLK_MQ_RQ_QUEUE_OK;
 
+	if (list_empty(list))
+		return false;
+
 	/*
 	 * Now process all the entries, sending them to the driver.
 	 */
 	errors = queued = 0;
-	while (!list_empty(list)) {
+	do {
 		struct blk_mq_queue_data bd;
 
 		rq = list_first_entry(list, struct request, queuelist);
@@ -1053,7 +1054,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 
 		if (ret == BLK_MQ_RQ_QUEUE_BUSY)
 			break;
-	}
+	} while (!list_empty(list));
 
 	hctx->dispatched[queued_to_index(queued)]++;
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 8d49c06fc520..7e6f2e467696 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -30,7 +30,7 @@ void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
-bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
+bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *);
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
 bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
-- 
2.12.2

^ permalink raw reply related

* [PATCH v3 0/8] blk-mq: various fixes and cleanups
From: Omar Sandoval @ 2017-04-05 19:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

Rebase of v2 [1] onto block-next. As with v2,

- Patch 1 is the new fix for a hang that Josef reported after trying v1.
- Patches 2-6 are the original series. Patch 5 now has Christoph's and
  Sagi's Reviewed-by.
- Patches 7 and 8 are trivial cleanups.

Patches 1-5 should probably go into 4.11, and 6-8 are for 4.12.

1: http://marc.info/?l=linux-block&m=149141696306216&w=2

Omar Sandoval (8):
  blk-mq: use the right hctx when getting a driver tag fails
  blk-mq-sched: refactor scheduler initialization
  blk-mq-sched: set up scheduler tags when bringing up new queues
  blk-mq-sched: fix crash in switch error path
  blk-mq: remap queues when adding/removing hardware queues
  blk-mq-sched: provide hooks for initializing hardware queue data
  blk-mq: make driver tag failure path easier to follow
  blk-mq: use true instead of 1 for blk_mq_queue_data.last

 block/blk-mq-sched.c     | 187 +++++++++++++++++++++++++++++------------------
 block/blk-mq-sched.h     |  13 ++--
 block/blk-mq.c           |  71 ++++++++++--------
 block/blk-mq.h           |   2 +-
 block/blk-sysfs.c        |   2 +-
 block/elevator.c         | 114 +++++++++++++++--------------
 include/linux/elevator.h |   4 +-
 7 files changed, 227 insertions(+), 166 deletions(-)

-- 
2.12.2

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox