Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* [PATCH v3 1/2] blk-stat: convert blk-stat bucket callback to signed
From: sbates @ 2017-04-07 12:24 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-nvme, Damien.LeMoal, osandov, sbates, sagi
In-Reply-To: <1491567843-26190-1-git-send-email-sbates@raithlin.com>

From: Stephen Bates <sbates@raithlin.com>

In order to allow for filtering of IO based on some other properties
of the request than direction we allow the bucket function to return
an int.

If the bucket callback returns a negative do no count it in the stats
accumulation.

Signed-off-by: Stephen Bates <sbates@raithlin.com>
---
 block/blk-stat.c | 6 ++++--
 block/blk-stat.h | 9 +++++----
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/block/blk-stat.c b/block/blk-stat.c
index e77ec52..dde9d39 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -19,7 +19,7 @@ struct blk_queue_stats {
 	bool enable_accounting;
 };
 
-unsigned int blk_stat_rq_ddir(const struct request *rq)
+int blk_stat_rq_ddir(const struct request *rq)
 {
 	return rq_data_dir(rq);
 }
@@ -104,6 +104,8 @@ void blk_stat_add(struct request *rq)
 	list_for_each_entry_rcu(cb, &q->stats->callbacks, list) {
 		if (blk_stat_is_active(cb)) {
 			bucket = cb->bucket_fn(rq);
+			if (bucket < 0)
+				continue;
 			stat = &this_cpu_ptr(cb->cpu_stat)[bucket];
 			__blk_stat_add(stat, value);
 		}
@@ -135,7 +137,7 @@ static void blk_stat_timer_fn(unsigned long data)
 
 struct blk_stat_callback *
 blk_stat_alloc_callback(void (*timer_fn)(struct blk_stat_callback *),
-			unsigned int (*bucket_fn)(const struct request *),
+			int (*bucket_fn)(const struct request *),
 			unsigned int buckets, void *data)
 {
 	struct blk_stat_callback *cb;
diff --git a/block/blk-stat.h b/block/blk-stat.h
index 53f08a6..622a62c 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -48,9 +48,10 @@ struct blk_stat_callback {
 
 	/**
 	 * @bucket_fn: Given a request, returns which statistics bucket it
-	 * should be accounted under.
+	 * should be accounted under. Return -1 for no bucket for this
+	 * request.
 	 */
-	unsigned int (*bucket_fn)(const struct request *);
+	int (*bucket_fn)(const struct request *);
 
 	/**
 	 * @buckets: Number of statistics buckets.
@@ -120,7 +121,7 @@ void blk_stat_enable_accounting(struct request_queue *q);
  *
  * Return: Data direction of the request, either READ or WRITE.
  */
-unsigned int blk_stat_rq_ddir(const struct request *rq);
+int blk_stat_rq_ddir(const struct request *rq);
 
 /**
  * blk_stat_alloc_callback() - Allocate a block statistics callback.
@@ -135,7 +136,7 @@ unsigned int blk_stat_rq_ddir(const struct request *rq);
  */
 struct blk_stat_callback *
 blk_stat_alloc_callback(void (*timer_fn)(struct blk_stat_callback *),
-			unsigned int (*bucket_fn)(const struct request *),
+			int (*bucket_fn)(const struct request *),
 			unsigned int buckets, void *data);
 
 /**
-- 
2.7.4

^ permalink raw reply related

* [PATCH v3 0/2] blk-stat: Add ability to not bucket IO; improve IO polling.
From: sbates @ 2017-04-07 12:24 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-nvme, Damien.LeMoal, osandov, sbates, sagi

From: Stephen Bates <sbates@raithlin.com>

Omar recently developed some patches for block layer stats that use
callbacks to determine which bucket an IO should be considered for. At
the same time there was discussion at LSF/MM that we might not want to
consider all IO when generating stats for certain algorithms (e.g. IO
completion polling) or to bucket them in a more optimal fashion.

This set does two things. It makes the bucket callback for stats
signed so we can now ignore IO that cause a negative to be returned
from the bucket function. It then improves the IO polling latency
estimations by bucketing stats based on IO size and direction.

This patchset applies cleanly on 6809ef67eb7b4b68d (Merge branch
'for-4.12/block' into for-next) in Jens' for-next tree.

I've lightly tested this using QEMU and a real NVMe low-latency
device. I do not have performance number yet. Feedback would be
appreciated! I am not *super* happy with how the bucketing by size is
done. Any suggestions on how to improve this would be appreciated!

Cc: Damien.LeMoal@wdc.com
Cc: osandov@osandov.com

Changes since v1:
  Modified the bucket function as per Jens' suggestions.

Changes since v1:
  Dropped the cast in blk_stat_rq_ddir() as per Omar's suggestion.
  Moved to an array of buckets based on IO size rather than a filter
  as suggested by Jens and Damien.
  Rebased on Jen's for-next tree

Stephen Bates (2):
  blk-stat: convert blk-stat bucket callback to signed
  blk-mq: Add a polling specific stats function

 block/blk-mq.c   | 45 +++++++++++++++++++++++++++++++++++----------
 block/blk-stat.c |  6 ++++--
 block/blk-stat.h |  9 +++++----
 3 files changed, 44 insertions(+), 16 deletions(-)

-- 
2.7.4

^ permalink raw reply

* Re: [PATCH v2 2/2] blk-mq: Add a polling specific stats function
From: Stephen  Bates @ 2017-04-07 12:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
	Damien.LeMoal@wdc.com, osandov@osandov.com, sagi@grimberg.me
In-Reply-To: <370c8854-1735-b0e7-089f-4b04068ca4f1@kernel.dk>

T24gMjAxNy0wNC0wNSwgNzoxNCBQTSwgIkplbnMgQXhib2UiIDxheGJvZUBrZXJuZWwuZGs+IHdy
b3RlOg0KDQo+IFdoeSBub3QganVzdCBoYXZlIDggYnVja2V0cywgYW5kIG1ha2UgaXQ6DQo+DQo+
CWJ1Y2tldCA9IGRkaXIgKyBpbG9nMihieXRlcykgLSA5Ow0KPg0KPiBhbmQgY2FwIGl0IGF0IE1B
WF9CVUNLRVQgKDgpIGFuZCBwdXQgYWxsIHRob3NlIGFib3ZlIGludG8gdGhlIHRvcA0KPiBidWNr
ZXQuDQoNClRoYW5rcy4gSG93ZXZlciwgdGhhdCBlcXVhdGlvbiBkb2VzIG5vdCBkaWZmZXJlbnRp
YXRlIGJldHdlZW4gZGlyZWN0aW9uIGFuZCBzaXplLiBJbnN0ZWFkIHdlIGNhbiB1c2UNCg0KYnVj
a2V0ID0gZGRpciArIDIqKGlsb2cyKGJ5dGVzKSAtIDkpOw0KDQphbmQgdGhlbiBiaW4gYW55IElP
IG92ZXIgNjRLIGluIHRoZSBsYXJnZXN0IG9mIHRoZSB0d28gYnVja2V0cyBiYXNlZCBvbiBkaXJl
Y3Rpb24uIEnigJlsbCBpbXBsZW1lbnQgdGhpcyBpbiBhIHYz4oCmLg0KDQpDaGVlcnMNCg0KU3Rl
cGhlbg0KDQoNCg0KDQo=

^ permalink raw reply

* Re: [PATCH 7/8] nowait aio: xfs
From: Goldwyn Rodrigues @ 2017-04-07 11:34 UTC (permalink / raw)
  To: Darrick J. Wong, Christoph Hellwig
  Cc: linux-fsdevel, jack, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, sagi, avi, axboe, linux-api, willy, tom.leiming,
	Goldwyn Rodrigues
In-Reply-To: <20170406225415.GC4864@birch.djwong.org>



On 04/06/2017 05:54 PM, Darrick J. Wong wrote:
> On Mon, Apr 03, 2017 at 11:52:11PM -0700, Christoph Hellwig wrote:
>>> +	if (unaligned_io) {
>>> +		/* If we are going to wait for other DIO to finish, bail */
>>> +		if ((iocb->ki_flags & IOCB_NOWAIT) &&
>>> +		     atomic_read(&inode->i_dio_count))
>>> +			return -EAGAIN;
>>>  		inode_dio_wait(inode);
>>
>> This checks i_dio_count twice in the nowait case, I think it should be:
>>
>> 	if (iocb->ki_flags & IOCB_NOWAIT) {
>> 		if (atomic_read(&inode->i_dio_count))
>> 			return -EAGAIN;
>> 	} else {
>> 		inode_dio_wait(inode);
>> 	}
>>
>>>  	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
>>>  		if (flags & IOMAP_DIRECT) {
>>> +			/* A reflinked inode will result in CoW alloc */
>>> +			if (flags & IOMAP_NOWAIT) {
>>> +				error = -EAGAIN;
>>> +				goto out_unlock;
>>> +			}
>>
>> This is a bit pessimistic - just because the inode has any shared
>> extents we could still write into unshared ones.  For now I think this
>> pessimistic check is fine, but the comment should be corrected.
> 
> Consider what happens in both _reflink_{allocate,reserve}_cow.  If there
> is already an existing reservation in the CoW fork then we'll have to
> CoW and therefore can't satisfy the NOWAIT flag.  If there isn't already
> anything in the CoW fork, then we have to see if there are shared blocks
> by calling _reflink_trim_around_shared.  That performs a refcountbt
> lookup, which involves locking the AGF, so we also can't satisfy NOWAIT.
> 
> IOWs, I think this hunk has to move outside the IOMAP_DIRECT check to
> cover both write-to-reflinked-file cases.
> 

IOMAP_NOWAIT is set only with IOMAP_DIRECT since the nowait feature is
for direct-IO only. This is checked early on, when we are checking for
user-passed flags, and if not, -EINVAL is returned.


-- 
Goldwyn

^ permalink raw reply

* [PATCHv7 2/2] loop: support 4k physical blocksize
From: Hannes Reinecke @ 2017-04-07 10:50 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Omar Sandoval, Ming Lei, linux-block, Hannes Reinecke,
	Hannes Reinecke
In-Reply-To: <1491562257-82574-1-git-send-email-hare@suse.de>

When generating bootable VM images certain systems (most notably
s390x) require devices with 4k blocksize. This patch implements
a new flag 'LO_FLAGS_BLOCKSIZE' which will set the physical
blocksize to that of the underlying device, and allow to change
the logical blocksize for up to the physical blocksize.

Signed-off-by: Hannes Reinecke <hare@suse.com>
---
 drivers/block/loop.c      | 40 +++++++++++++++++++++++++++++++++++-----
 drivers/block/loop.h      |  1 +
 include/uapi/linux/loop.h |  3 +++
 3 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 81b3900..0909e06 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -221,7 +221,8 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
 }
 
 static int
-figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit)
+figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit,
+		 loff_t logical_blocksize)
 {
 	loff_t size = get_size(offset, sizelimit, lo->lo_backing_file);
 	sector_t x = (sector_t)size;
@@ -233,6 +234,12 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
 		lo->lo_offset = offset;
 	if (lo->lo_sizelimit != sizelimit)
 		lo->lo_sizelimit = sizelimit;
+	if (lo->lo_flags & LO_FLAGS_BLOCKSIZE) {
+		lo->lo_logical_blocksize = logical_blocksize;
+		blk_queue_physical_block_size(lo->lo_queue, lo->lo_blocksize);
+		blk_queue_logical_block_size(lo->lo_queue,
+					     lo->lo_logical_blocksize);
+	}
 	set_capacity(lo->lo_disk, x);
 	bd_set_size(bdev, (loff_t)get_capacity(bdev->bd_disk) << 9);
 	/* let user-space know about the new size */
@@ -814,6 +821,7 @@ static void loop_config_discard(struct loop_device *lo)
 	struct file *file = lo->lo_backing_file;
 	struct inode *inode = file->f_mapping->host;
 	struct request_queue *q = lo->lo_queue;
+	int lo_bits = 9;
 
 	/*
 	 * We use punch hole to reclaim the free space used by the
@@ -833,7 +841,10 @@ static void loop_config_discard(struct loop_device *lo)
 
 	q->limits.discard_granularity = inode->i_sb->s_blocksize;
 	q->limits.discard_alignment = 0;
-	blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
+	if (lo->lo_flags & LO_FLAGS_BLOCKSIZE)
+		lo_bits = blksize_bits(lo->lo_logical_blocksize);
+
+	blk_queue_max_discard_sectors(q, UINT_MAX >> lo_bits);
 	q->limits.discard_zeroes_data = 1;
 	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
 }
@@ -1087,6 +1098,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	int err;
 	struct loop_func_table *xfer;
 	kuid_t uid = current_uid();
+	int lo_flags = lo->lo_flags;
 
 	if (lo->lo_encrypt_key_size &&
 	    !uid_eq(lo->lo_key_owner, uid) &&
@@ -1119,9 +1131,26 @@ static int loop_clr_fd(struct loop_device *lo)
 	if (err)
 		goto exit;
 
+	if (info->lo_flags & LO_FLAGS_BLOCKSIZE) {
+		if (!(lo->lo_flags & LO_FLAGS_BLOCKSIZE))
+			lo->lo_logical_blocksize = 512;
+		lo->lo_flags |= LO_FLAGS_BLOCKSIZE;
+		if (LO_INFO_BLOCKSIZE(info) != 512 &&
+		    LO_INFO_BLOCKSIZE(info) != 1024 &&
+		    LO_INFO_BLOCKSIZE(info) != 2048 &&
+		    LO_INFO_BLOCKSIZE(info) != 4096)
+			return -EINVAL;
+		if (LO_INFO_BLOCKSIZE(info) > lo->lo_blocksize)
+			return -EINVAL;
+	}
+
 	if (lo->lo_offset != info->lo_offset ||
-	    lo->lo_sizelimit != info->lo_sizelimit)
-		if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit)) {
+	    lo->lo_sizelimit != info->lo_sizelimit ||
+	    lo->lo_flags != lo_flags ||
+	    ((lo->lo_flags & LO_FLAGS_BLOCKSIZE) &&
+	     lo->lo_logical_blocksize != LO_INFO_BLOCKSIZE(info))) {
+		if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit,
+				     LO_INFO_BLOCKSIZE(info)))
 			err = -EFBIG;
 			goto exit;
 		}
@@ -1312,7 +1341,8 @@ static int loop_set_capacity(struct loop_device *lo)
 	if (unlikely(lo->lo_state != Lo_bound))
 		return -ENXIO;
 
-	return figure_loop_size(lo, lo->lo_offset, lo->lo_sizelimit);
+	return figure_loop_size(lo, lo->lo_offset, lo->lo_sizelimit,
+				lo->lo_logical_blocksize);
 }
 
 static int loop_set_dio(struct loop_device *lo, unsigned long arg)
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index fb2237c..579f2f7 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -49,6 +49,7 @@ struct loop_device {
 	struct file *	lo_backing_file;
 	struct block_device *lo_device;
 	unsigned	lo_blocksize;
+	unsigned	lo_logical_blocksize;
 	void		*key_data; 
 
 	gfp_t		old_gfp_mask;
diff --git a/include/uapi/linux/loop.h b/include/uapi/linux/loop.h
index c8125ec..a3960f9 100644
--- a/include/uapi/linux/loop.h
+++ b/include/uapi/linux/loop.h
@@ -22,6 +22,7 @@ enum {
 	LO_FLAGS_AUTOCLEAR	= 4,
 	LO_FLAGS_PARTSCAN	= 8,
 	LO_FLAGS_DIRECT_IO	= 16,
+	LO_FLAGS_BLOCKSIZE	= 32,
 };
 
 #include <asm/posix_types.h>	/* for __kernel_old_dev_t */
@@ -59,6 +60,8 @@ struct loop_info64 {
 	__u64		   lo_init[2];
 };
 
+#define LO_INFO_BLOCKSIZE(l) (l)->lo_init[0]
+
 /*
  * Loop filter types
  */
-- 
1.8.5.6

^ permalink raw reply related

* [PATCHv7 1/2] loop: Remove unused 'bdev' argument from loop_set_capacity
From: Hannes Reinecke @ 2017-04-07 10:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Omar Sandoval, Ming Lei, linux-block, Hannes Reinecke
In-Reply-To: <1491562257-82574-1-git-send-email-hare@suse.de>

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/loop.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cc981f3..81b3900 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1307,7 +1307,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	return err;
 }
 
-static int loop_set_capacity(struct loop_device *lo, struct block_device *bdev)
+static int loop_set_capacity(struct loop_device *lo)
 {
 	if (unlikely(lo->lo_state != Lo_bound))
 		return -ENXIO;
@@ -1370,7 +1370,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
 	case LOOP_SET_CAPACITY:
 		err = -EPERM;
 		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
-			err = loop_set_capacity(lo, bdev);
+			err = loop_set_capacity(lo);
 		break;
 	case LOOP_SET_DIRECT_IO:
 		err = -EPERM;
-- 
1.8.5.6

^ permalink raw reply related

* [PATCHv7 0/2] loop: enable different logical blocksizes
From: Hannes Reinecke @ 2017-04-07 10:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Omar Sandoval, Ming Lei, linux-block, Hannes Reinecke

Currently the loop driver just simulates 512-byte blocks. When
creating bootable images for virtual machines it might be required
to use a different physical blocksize (eg 4k for S/390 DASD), as
the some bootloaders (like lilo or zipl for S/390) need to know
the physical block addresses of the kernel and initrd.

With this patchset the loop driver will export the logical and
physical blocksize and the current LOOP_SET_STATUS64 ioctl is
extended to set the logical blocksize by re-using the existing
'init' fields, which are currently unused.

As usual, comments and reviews are welcome.

Changes to v1:
- Move LO_FLAGS_BLOCKSIZE definition
- Reshuffle patches
Changes to v2:
- Include reviews from Ming Lei
Changes to v3:
- Include reviews from Christoph
- Merge patches
Changes to v4:
- Add LO_INFO_BLOCKSIZE definition
Changes to v5:
- Rediff to latest upstream
Changes to v6:
- Include review from Ming Lei

Hannes Reinecke (2):
  loop: Remove unused 'bdev' argument from loop_set_capacity
  loop: support 4k physical blocksize

 drivers/block/loop.c      | 44 +++++++++++++++++++++++++++++++++++++-------
 drivers/block/loop.h      |  1 +
 include/uapi/linux/loop.h |  3 +++
 3 files changed, 41 insertions(+), 7 deletions(-)

-- 
1.8.5.6

^ permalink raw reply

* Re: [PATCH v3 4/5] blk-mq: Introduce blk_mq_delay_run_hw_queue()
From: Ming Lei @ 2017-04-07 10:23 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Hannes Reinecke,
	Long Li, K . Y . Srinivasan
In-Reply-To: <20170406181050.12137-5-bart.vanassche@sandisk.com>

On Thu, Apr 06, 2017 at 11:10:49AM -0700, Bart Van Assche wrote:
> Introduce a function that runs a hardware queue unconditionally

I appreciate if some background can be provided for this function,
like fixing some issues?


Thanks,
Ming

^ permalink raw reply

* Re: [PATCHv6 2/2] loop: support 4k physical blocksize
From: Ming Lei @ 2017-04-07 10:07 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, Omar Sandoval, Ming Lei, linux-block, Hannes Reinecke
In-Reply-To: <a03c7c2e-9f21-5123-eae4-1001dfc0b267@suse.de>

On Fri, Apr 07, 2017 at 11:52:27AM +0200, Hannes Reinecke wrote:
> On 04/07/2017 11:38 AM, Hannes Reinecke wrote:
> > On 04/07/2017 11:34 AM, Ming Lei wrote:
> >> On Fri, Apr 07, 2017 at 08:58:58AM +0200, Hannes Reinecke wrote:
> >>> When generating bootable VM images certain systems (most notably
> >>> s390x) require devices with 4k blocksize. This patch implements
> >>> a new flag 'LO_FLAGS_BLOCKSIZE' which will set the physical
> >>> blocksize to that of the underlying device, and allow to change
> >>> the logical blocksize for up to the physical blocksize.
> >>
> >> Actually this UAPI flag is only for setting logical block size. 
> >>
> > Hmm? No, we're setting the physical blocksize, too.
> > Or am I missing something?
> > 
> >>>
> >>> Signed-off-by: Hannes Reinecke <hare@suse.com>
> >>> ---
> >>>  drivers/block/loop.c      | 36 +++++++++++++++++++++++++++++++-----
> >>>  drivers/block/loop.h      |  1 +
> >>>  include/uapi/linux/loop.h |  3 +++
> >>>  3 files changed, 35 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> >>> index 81b3900..f098681 100644
> >>> --- a/drivers/block/loop.c
> >>> +++ b/drivers/block/loop.c
> >>> @@ -221,7 +221,8 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
> >>>  }
> >>>  
> >>>  static int
> >>> -figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit)
> >>> +figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit,
> >>> +		 loff_t logical_blocksize)
> >>>  {
> >>>  	loff_t size = get_size(offset, sizelimit, lo->lo_backing_file);
> >>>  	sector_t x = (sector_t)size;
> >>> @@ -233,6 +234,12 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
> >>>  		lo->lo_offset = offset;
> >>>  	if (lo->lo_sizelimit != sizelimit)
> >>>  		lo->lo_sizelimit = sizelimit;
> >>> +	if (lo->lo_flags & LO_FLAGS_BLOCKSIZE) {
> >>> +		lo->lo_logical_blocksize = logical_blocksize;
> >>> +		blk_queue_physical_block_size(lo->lo_queue, lo->lo_blocksize);
> >>> +		blk_queue_logical_block_size(lo->lo_queue,
> >>> +					     lo->lo_logical_blocksize);
> >>> +	}
> >>
> >> We can move setting physical block size into loop_set_fd(), and set
> >> 512 bytes as default logical block size in loop_set_fd() too.
> >>
> > Okay.
> > 
> After thinking about it some more I'm not sure if I agree with that
> reasoning.
> 
> One of the goals of this patchset is to keep compability with existing
> installations.
> If we move setting the physical blocksize into loop_set_fd() (which
> needs to be called before loop_set_status()) the physical blocksize
> would always be set.
> Which would induce a user-visible change.
> 
> Hence I've formulated my patch to _not_ change the default setup if the
> new flag isn't set.

OK, better to not break current users, :-)

Thanks,
Ming

^ permalink raw reply

* Re: [PATCHv6 2/2] loop: support 4k physical blocksize
From: Hannes Reinecke @ 2017-04-07  9:52 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Omar Sandoval, Ming Lei, linux-block, Hannes Reinecke
In-Reply-To: <2aa7767b-b8e4-1d54-0752-63253283ad5c@suse.de>

On 04/07/2017 11:38 AM, Hannes Reinecke wrote:
> On 04/07/2017 11:34 AM, Ming Lei wrote:
>> On Fri, Apr 07, 2017 at 08:58:58AM +0200, Hannes Reinecke wrote:
>>> When generating bootable VM images certain systems (most notably
>>> s390x) require devices with 4k blocksize. This patch implements
>>> a new flag 'LO_FLAGS_BLOCKSIZE' which will set the physical
>>> blocksize to that of the underlying device, and allow to change
>>> the logical blocksize for up to the physical blocksize.
>>
>> Actually this UAPI flag is only for setting logical block size. 
>>
> Hmm? No, we're setting the physical blocksize, too.
> Or am I missing something?
> 
>>>
>>> Signed-off-by: Hannes Reinecke <hare@suse.com>
>>> ---
>>>  drivers/block/loop.c      | 36 +++++++++++++++++++++++++++++++-----
>>>  drivers/block/loop.h      |  1 +
>>>  include/uapi/linux/loop.h |  3 +++
>>>  3 files changed, 35 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
>>> index 81b3900..f098681 100644
>>> --- a/drivers/block/loop.c
>>> +++ b/drivers/block/loop.c
>>> @@ -221,7 +221,8 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
>>>  }
>>>  
>>>  static int
>>> -figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit)
>>> +figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit,
>>> +		 loff_t logical_blocksize)
>>>  {
>>>  	loff_t size = get_size(offset, sizelimit, lo->lo_backing_file);
>>>  	sector_t x = (sector_t)size;
>>> @@ -233,6 +234,12 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
>>>  		lo->lo_offset = offset;
>>>  	if (lo->lo_sizelimit != sizelimit)
>>>  		lo->lo_sizelimit = sizelimit;
>>> +	if (lo->lo_flags & LO_FLAGS_BLOCKSIZE) {
>>> +		lo->lo_logical_blocksize = logical_blocksize;
>>> +		blk_queue_physical_block_size(lo->lo_queue, lo->lo_blocksize);
>>> +		blk_queue_logical_block_size(lo->lo_queue,
>>> +					     lo->lo_logical_blocksize);
>>> +	}
>>
>> We can move setting physical block size into loop_set_fd(), and set
>> 512 bytes as default logical block size in loop_set_fd() too.
>>
> Okay.
> 
After thinking about it some more I'm not sure if I agree with that
reasoning.

One of the goals of this patchset is to keep compability with existing
installations.
If we move setting the physical blocksize into loop_set_fd() (which
needs to be called before loop_set_status()) the physical blocksize
would always be set.
Which would induce a user-visible change.

Hence I've formulated my patch to _not_ change the default setup if the
new flag isn't set.

If this is of no concern, however, I can easily modify the patch to
always set the physical blocksize; in fact, that's what I did
originally, but that was deemed bad style.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nï¿½rnberg
GF: F. Imendï¿½rffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nï¿½rnberg)

^ permalink raw reply

* Re: [PATCH v3 1/5] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list
From: Ming Lei @ 2017-04-07  9:46 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170406181050.12137-2-bart.vanassche@sandisk.com>

On Thu, Apr 06, 2017 at 11:10:46AM -0700, Bart Van Assche wrote:
> Since the next patch in this series will use RCU to iterate over
> tag_list, make this safe. Add lockdep_assert_held() statements
> in functions that iterate over tag_list to make clear that using
> list_for_each_entry() instead of list_for_each_entry_rcu() is
> fine in these functions.
> 
> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hannes Reinecke <hare@suse.com>
> ---
>  block/blk-mq.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f7cd3208bcdf..b5580b09b4a5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2076,6 +2076,8 @@ static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set, bool shared)
>  {
>  	struct request_queue *q;
>  
> +	lockdep_assert_held(&set->tag_list_lock);
> +
>  	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>  		blk_mq_freeze_queue(q);
>  		queue_set_hctx_shared(q, shared);
> @@ -2096,6 +2098,8 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
>  		blk_mq_update_tag_set_depth(set, false);
>  	}
>  	mutex_unlock(&set->tag_list_lock);
> +
> +	synchronize_rcu();

Looks synchronize_rcu() is only needed in deletion path, so it can
be moved to blk_mq_del_queue_tag_set().

Also list_del_init/list_add_tail() need to be replaced with RCU
safe version in functions operating &set->tag_list.

>  }
>  
>  static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
> @@ -2601,6 +2605,8 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
>  {
>  	struct request_queue *q;
>  
> +	lockdep_assert_held(&set->tag_list_lock);
> +
>  	if (nr_hw_queues > nr_cpu_ids)
>  		nr_hw_queues = nr_cpu_ids;
>  	if (nr_hw_queues < 1 || nr_hw_queues == set->nr_hw_queues)
> -- 
> 2.12.0
> 

-- 
Ming

^ permalink raw reply

* Re: [PATCH v3 0/5] Avoid that scsi-mq queue processing stalls
From: Ming Lei @ 2017-04-07  9:41 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Jens Axboe, linux-block
In-Reply-To: <20170406181050.12137-1-bart.vanassche@sandisk.com>

On Thu, Apr 06, 2017 at 11:10:45AM -0700, Bart Van Assche wrote:
> Hello Jens,
> 
> The five patches in this patch series fix the queue lockup I reported
> recently on the linux-block mailing list. Please consider these patches
> for inclusion in the upstream kernel.

I read the commit log of the 5 patches, looks not found descriptions
about root cause of the queue lockup, so could you explain a bit about
the reason behind?

Thanks,
Ming

^ permalink raw reply

* Re: [PATCHv6 2/2] loop: support 4k physical blocksize
From: Hannes Reinecke @ 2017-04-07  9:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Omar Sandoval, Ming Lei, linux-block, Hannes Reinecke
In-Reply-To: <20170407093428.GA27631@ming.t460p>

On 04/07/2017 11:34 AM, Ming Lei wrote:
> On Fri, Apr 07, 2017 at 08:58:58AM +0200, Hannes Reinecke wrote:
>> When generating bootable VM images certain systems (most notably
>> s390x) require devices with 4k blocksize. This patch implements
>> a new flag 'LO_FLAGS_BLOCKSIZE' which will set the physical
>> blocksize to that of the underlying device, and allow to change
>> the logical blocksize for up to the physical blocksize.
> 
> Actually this UAPI flag is only for setting logical block size. 
> 
Hmm? No, we're setting the physical blocksize, too.
Or am I missing something?

>>
>> Signed-off-by: Hannes Reinecke <hare@suse.com>
>> ---
>>  drivers/block/loop.c      | 36 +++++++++++++++++++++++++++++++-----
>>  drivers/block/loop.h      |  1 +
>>  include/uapi/linux/loop.h |  3 +++
>>  3 files changed, 35 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
>> index 81b3900..f098681 100644
>> --- a/drivers/block/loop.c
>> +++ b/drivers/block/loop.c
>> @@ -221,7 +221,8 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
>>  }
>>  
>>  static int
>> -figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit)
>> +figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit,
>> +		 loff_t logical_blocksize)
>>  {
>>  	loff_t size = get_size(offset, sizelimit, lo->lo_backing_file);
>>  	sector_t x = (sector_t)size;
>> @@ -233,6 +234,12 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
>>  		lo->lo_offset = offset;
>>  	if (lo->lo_sizelimit != sizelimit)
>>  		lo->lo_sizelimit = sizelimit;
>> +	if (lo->lo_flags & LO_FLAGS_BLOCKSIZE) {
>> +		lo->lo_logical_blocksize = logical_blocksize;
>> +		blk_queue_physical_block_size(lo->lo_queue, lo->lo_blocksize);
>> +		blk_queue_logical_block_size(lo->lo_queue,
>> +					     lo->lo_logical_blocksize);
>> +	}
> 
> We can move setting physical block size into loop_set_fd(), and set
> 512 bytes as default logical block size in loop_set_fd() too.
> 
Okay.

>>  	set_capacity(lo->lo_disk, x);
>>  	bd_set_size(bdev, (loff_t)get_capacity(bdev->bd_disk) << 9);
>>  	/* let user-space know about the new size */
>> @@ -814,6 +821,7 @@ static void loop_config_discard(struct loop_device *lo)
>>  	struct file *file = lo->lo_backing_file;
>>  	struct inode *inode = file->f_mapping->host;
>>  	struct request_queue *q = lo->lo_queue;
>> +	int lo_bits = blksize_bits(lo->lo_logical_blocksize);
>>  
>>  	/*
>>  	 * We use punch hole to reclaim the free space used by the
>> @@ -833,7 +841,7 @@ static void loop_config_discard(struct loop_device *lo)
>>  
>>  	q->limits.discard_granularity = inode->i_sb->s_blocksize;
>>  	q->limits.discard_alignment = 0;
>> -	blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
>> +	blk_queue_max_discard_sectors(q, UINT_MAX >> lo_bits);
>>  	q->limits.discard_zeroes_data = 1;
>>  	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
>>  }
>> @@ -922,6 +930,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
>>  
>>  	lo->use_dio = false;
>>  	lo->lo_blocksize = lo_blocksize;
>> +	lo->lo_logical_blocksize = 512;
> 
> It isn't enough to just this local vaiable as 512 here, since the last
> logical block size is still kept in queue's setting.
> 
Right. Will be updating the queue setting, too.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nï¿½rnberg
GF: F. Imendï¿½rffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nï¿½rnberg)

^ permalink raw reply

* Re: [PATCHv6 1/2] loop: Remove unused 'bdev' argument from loop_set_capacity
From: Ming Lei @ 2017-04-07  9:37 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Jens Axboe, Omar Sandoval, Ming Lei, linux-block
In-Reply-To: <1491548338-71005-2-git-send-email-hare@suse.de>

On Fri, Apr 07, 2017 at 08:58:57AM +0200, Hannes Reinecke wrote:
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/block/loop.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index cc981f3..81b3900 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1307,7 +1307,7 @@ static int loop_clr_fd(struct loop_device *lo)
>  	return err;
>  }
>  
> -static int loop_set_capacity(struct loop_device *lo, struct block_device *bdev)
> +static int loop_set_capacity(struct loop_device *lo)
>  {
>  	if (unlikely(lo->lo_state != Lo_bound))
>  		return -ENXIO;
> @@ -1370,7 +1370,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
>  	case LOOP_SET_CAPACITY:
>  		err = -EPERM;
>  		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
> -			err = loop_set_capacity(lo, bdev);
> +			err = loop_set_capacity(lo);
>  		break;
>  	case LOOP_SET_DIRECT_IO:
>  		err = -EPERM;

Reviewed-by: Ming Lei <ming.lei@redhat.com>


Thanks,
Ming

^ permalink raw reply

* Re: [PATCHv6 2/2] loop: support 4k physical blocksize
From: Ming Lei @ 2017-04-07  9:34 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, Omar Sandoval, Ming Lei, linux-block, Hannes Reinecke
In-Reply-To: <1491548338-71005-3-git-send-email-hare@suse.de>

On Fri, Apr 07, 2017 at 08:58:58AM +0200, Hannes Reinecke wrote:
> When generating bootable VM images certain systems (most notably
> s390x) require devices with 4k blocksize. This patch implements
> a new flag 'LO_FLAGS_BLOCKSIZE' which will set the physical
> blocksize to that of the underlying device, and allow to change
> the logical blocksize for up to the physical blocksize.

Actually this UAPI flag is only for setting logical block size. 

> 
> Signed-off-by: Hannes Reinecke <hare@suse.com>
> ---
>  drivers/block/loop.c      | 36 +++++++++++++++++++++++++++++++-----
>  drivers/block/loop.h      |  1 +
>  include/uapi/linux/loop.h |  3 +++
>  3 files changed, 35 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 81b3900..f098681 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -221,7 +221,8 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
>  }
>  
>  static int
> -figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit)
> +figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit,
> +		 loff_t logical_blocksize)
>  {
>  	loff_t size = get_size(offset, sizelimit, lo->lo_backing_file);
>  	sector_t x = (sector_t)size;
> @@ -233,6 +234,12 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
>  		lo->lo_offset = offset;
>  	if (lo->lo_sizelimit != sizelimit)
>  		lo->lo_sizelimit = sizelimit;
> +	if (lo->lo_flags & LO_FLAGS_BLOCKSIZE) {
> +		lo->lo_logical_blocksize = logical_blocksize;
> +		blk_queue_physical_block_size(lo->lo_queue, lo->lo_blocksize);
> +		blk_queue_logical_block_size(lo->lo_queue,
> +					     lo->lo_logical_blocksize);
> +	}

We can move setting physical block size into loop_set_fd(), and set
512 bytes as default logical block size in loop_set_fd() too.

>  	set_capacity(lo->lo_disk, x);
>  	bd_set_size(bdev, (loff_t)get_capacity(bdev->bd_disk) << 9);
>  	/* let user-space know about the new size */
> @@ -814,6 +821,7 @@ static void loop_config_discard(struct loop_device *lo)
>  	struct file *file = lo->lo_backing_file;
>  	struct inode *inode = file->f_mapping->host;
>  	struct request_queue *q = lo->lo_queue;
> +	int lo_bits = blksize_bits(lo->lo_logical_blocksize);
>  
>  	/*
>  	 * We use punch hole to reclaim the free space used by the
> @@ -833,7 +841,7 @@ static void loop_config_discard(struct loop_device *lo)
>  
>  	q->limits.discard_granularity = inode->i_sb->s_blocksize;
>  	q->limits.discard_alignment = 0;
> -	blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> +	blk_queue_max_discard_sectors(q, UINT_MAX >> lo_bits);
>  	q->limits.discard_zeroes_data = 1;
>  	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
>  }
> @@ -922,6 +930,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
>  
>  	lo->use_dio = false;
>  	lo->lo_blocksize = lo_blocksize;
> +	lo->lo_logical_blocksize = 512;

It isn't enough to just this local vaiable as 512 here, since the last
logical block size is still kept in queue's setting.

>  	lo->lo_device = bdev;
>  	lo->lo_flags = lo_flags;
>  	lo->lo_backing_file = file;
> @@ -1087,6 +1096,7 @@ static int loop_clr_fd(struct loop_device *lo)
>  	int err;
>  	struct loop_func_table *xfer;
>  	kuid_t uid = current_uid();
> +	int lo_flags = lo->lo_flags;
>  
>  	if (lo->lo_encrypt_key_size &&
>  	    !uid_eq(lo->lo_key_owner, uid) &&
> @@ -1119,9 +1129,24 @@ static int loop_clr_fd(struct loop_device *lo)
>  	if (err)
>  		goto exit;
>  
> +	if (info->lo_flags & LO_FLAGS_BLOCKSIZE) {
> +		lo->lo_flags |= LO_FLAGS_BLOCKSIZE;
> +		if (LO_INFO_BLOCKSIZE(info) != 512 &&
> +		    LO_INFO_BLOCKSIZE(info) != 1024 &&
> +		    LO_INFO_BLOCKSIZE(info) != 2048 &&
> +		    LO_INFO_BLOCKSIZE(info) != 4096)
> +			return -EINVAL;
> +		if (LO_INFO_BLOCKSIZE(info) > lo->lo_blocksize)
> +			return -EINVAL;
> +	}
> +
>  	if (lo->lo_offset != info->lo_offset ||
> -	    lo->lo_sizelimit != info->lo_sizelimit)
> -		if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit)) {
> +	    lo->lo_sizelimit != info->lo_sizelimit ||
> +	    lo->lo_flags != lo_flags ||
> +	    ((lo->lo_flags & LO_FLAGS_BLOCKSIZE) &&
> +	     lo->lo_logical_blocksize != LO_INFO_BLOCKSIZE(info))) {
> +		if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit,
> +				     info->lo_init[0]))
>  			err = -EFBIG;
>  			goto exit;
>  		}
> @@ -1312,7 +1337,8 @@ static int loop_set_capacity(struct loop_device *lo)
>  	if (unlikely(lo->lo_state != Lo_bound))
>  		return -ENXIO;
>  
> -	return figure_loop_size(lo, lo->lo_offset, lo->lo_sizelimit);
> +	return figure_loop_size(lo, lo->lo_offset, lo->lo_sizelimit,
> +				lo->lo_logical_blocksize);
>  }
>  
>  static int loop_set_dio(struct loop_device *lo, unsigned long arg)
> diff --git a/drivers/block/loop.h b/drivers/block/loop.h
> index fb2237c..579f2f7 100644
> --- a/drivers/block/loop.h
> +++ b/drivers/block/loop.h
> @@ -49,6 +49,7 @@ struct loop_device {
>  	struct file *	lo_backing_file;
>  	struct block_device *lo_device;
>  	unsigned	lo_blocksize;
> +	unsigned	lo_logical_blocksize;
>  	void		*key_data; 
>  
>  	gfp_t		old_gfp_mask;
> diff --git a/include/uapi/linux/loop.h b/include/uapi/linux/loop.h
> index c8125ec..a3960f9 100644
> --- a/include/uapi/linux/loop.h
> +++ b/include/uapi/linux/loop.h
> @@ -22,6 +22,7 @@ enum {
>  	LO_FLAGS_AUTOCLEAR	= 4,
>  	LO_FLAGS_PARTSCAN	= 8,
>  	LO_FLAGS_DIRECT_IO	= 16,
> +	LO_FLAGS_BLOCKSIZE	= 32,
>  };
>  
>  #include <asm/posix_types.h>	/* for __kernel_old_dev_t */
> @@ -59,6 +60,8 @@ struct loop_info64 {
>  	__u64		   lo_init[2];
>  };
>  
> +#define LO_INFO_BLOCKSIZE(l) (l)->lo_init[0]
> +
>  /*
>   * Loop filter types
>   */
> -- 
> 1.8.5.6
> 

-- 
Ming

^ permalink raw reply

* Re: [PATCH 1/4] mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
From: Vlastimil Babka @ 2017-04-07  9:21 UTC (permalink / raw)
  To: Andrey Ryabinin, Andrew Morton
  Cc: linux-mm, linux-kernel, Michal Hocko, Mel Gorman, Johannes Weiner,
	linux-block, nbd-general, open-iscsi, linux-scsi, netdev, stable
In-Reply-To: <1f26f654-abaf-3878-6abb-5e27ff3a289e@virtuozzo.com>

On 04/05/2017 01:40 PM, Andrey Ryabinin wrote:
> On 04/05/2017 10:46 AM, Vlastimil Babka wrote:
>> The function __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent
>> deadlock during page migration by lock_page() (see the comment in
>> __unmap_and_move()). Then it unconditionally clears the flag, which can clear a
>> pre-existing PF_MEMALLOC flag and result in recursive reclaim. This was not a
>> problem until commit a8161d1ed609 ("mm, page_alloc: restructure direct
>> compaction handling in slowpath"), because direct compation was called only
>> after direct reclaim, which was skipped when PF_MEMALLOC flag was set.
>> 
>> Even now it's only a theoretical issue, as the new callsite of
>> __alloc_pages_direct_compact() is reached only for costly orders and when
>> gfp_pfmemalloc_allowed() is true, which means either __GFP_NOMEMALLOC is in
>                            is false			
> 
>> gfp_flags or in_interrupt() is true. There is no such known context, but let's
>> play it safe and make __alloc_pages_direct_compact() robust for cases where
>> PF_MEMALLOC is already set.
>> 
>> Fixes: a8161d1ed609 ("mm, page_alloc: restructure direct compaction handling in slowpath")
>> Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> Cc: <stable@vger.kernel.org>
>> ---
>>  mm/page_alloc.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>> 
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3589f8be53be..b84e6ffbe756 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3288,6 +3288,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>>  		enum compact_priority prio, enum compact_result *compact_result)
>>  {
>>  	struct page *page;
>> +	unsigned int noreclaim_flag = current->flags & PF_MEMALLOC;
>>  
>>  	if (!order)
>>  		return NULL;
>> @@ -3295,7 +3296,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>>  	current->flags |= PF_MEMALLOC;
>>  	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>>  									prio);
>> -	current->flags &= ~PF_MEMALLOC;
>> +	current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag;
> 
> Perhaps this would look better:
> 
> 	tsk_restore_flags(current, noreclaim_flag, PF_MEMALLOC);
> 
> ?

Well, I didn't care much considering this is for stable only, and patch 2/4
rewrites this to the new api.

>>  	if (*compact_result <= COMPACT_INACTIVE)
>>  		return NULL;
>> 
> 

^ permalink raw reply

* Re: [PATCH V2 11/16] block, bfq: reduce idling only in symmetric scenarios
From: Paolo Valente @ 2017-04-07  7:47 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: tj@kernel.org, axboe@kernel.dk, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org, samuele.zecchini92@gmail.com,
	fchecconi@gmail.com, linus.walleij@linaro.org, Arianna Avanzini,
	broonie@kernel.org, riccardo.pizzetti@gmail.com,
	ulf.hansson@linaro.org
In-Reply-To: <1490973609.2587.3.camel@sandisk.com>


> Il giorno 31 mar 2017, alle ore 17:20, Bart Van Assche =
<bart.vanassche@sandisk.com> ha scritto:
>=20
> On Fri, 2017-03-31 at 14:47 +0200, Paolo Valente wrote:
>> +       entity->weight_counter =3D kzalloc(sizeof(struct =
bfq_weight_counter),
>> +                                        GFP_ATOMIC);
>> +       entity->weight_counter->weight =3D entity->weight;
>=20
> GFP_ATOMIC allocations are more likely to fail than GFP_KERNEL =
allocations.
> What will happen if kzalloc() returns NULL?
>=20

A plain crash :( I'm adding the simple handling of this forgotten =
exception. If I don't get other reviews in the next days, I'll post a V3 =
addressing this and the other issue you highlighted.

Thanks,
Paolo

> Bart.

^ permalink raw reply

* Re: [PATCH 2/4] mm: introduce memalloc_noreclaim_{save,restore}
From: Hillf Danton @ 2017-04-07  7:38 UTC (permalink / raw)
  To: 'Vlastimil Babka', 'Andrew Morton'
  Cc: linux-mm, linux-kernel, 'Michal Hocko',
	'Mel Gorman', 'Johannes Weiner', linux-block,
	nbd-general, open-iscsi, linux-scsi, netdev,
	'Michal Hocko'
In-Reply-To: <20170405074700.29871-3-vbabka@suse.cz>


On April 05, 2017 3:47 PM Vlastimil Babka wrote: 
> 
> The previous patch has shown that simply setting and clearing PF_MEMALLOC in
> current->flags can result in wrongly clearing a pre-existing PF_MEMALLOC flag
> and potentially lead to recursive reclaim. Let's introduce helpers that support
> proper nesting by saving the previous stat of the flag, similar to the existing
> memalloc_noio_* and memalloc_nofs_* helpers. Convert existing setting/clearing
> of PF_MEMALLOC within mm to the new helpers.
> 
> There are no known issues with the converted code, but the change makes it more
> robust.
> 
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

^ permalink raw reply

* Re: [PATCH 1/4] mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
From: Hillf Danton @ 2017-04-07  7:33 UTC (permalink / raw)
  To: 'Vlastimil Babka', 'Andrew Morton'
  Cc: linux-mm, linux-kernel, 'Michal Hocko',
	'Mel Gorman', 'Johannes Weiner', linux-block,
	nbd-general, open-iscsi, linux-scsi, netdev, stable,
	'Andrey Ryabinin'
In-Reply-To: <20170405074700.29871-2-vbabka@suse.cz>

On April 05, 2017 3:47 PM Vlastimil Babka wrote: 
> 
> The function __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent
> deadlock during page migration by lock_page() (see the comment in
> __unmap_and_move()). Then it unconditionally clears the flag, which can clear a
> pre-existing PF_MEMALLOC flag and result in recursive reclaim. This was not a
> problem until commit a8161d1ed609 ("mm, page_alloc: restructure direct
> compaction handling in slowpath"), because direct compation was called only
> after direct reclaim, which was skipped when PF_MEMALLOC flag was set.
> 
> Even now it's only a theoretical issue, as the new callsite of
> __alloc_pages_direct_compact() is reached only for costly orders and when
> gfp_pfmemalloc_allowed() is true, which means either __GFP_NOMEMALLOC is in
> gfp_flags or in_interrupt() is true. There is no such known context, but let's
> play it safe and make __alloc_pages_direct_compact() robust for cases where
> PF_MEMALLOC is already set.
> 
> Fixes: a8161d1ed609 ("mm, page_alloc: restructure direct compaction handling in slowpath")
> Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: <stable@vger.kernel.org>
> ---
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

>  mm/page_alloc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3589f8be53be..b84e6ffbe756 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3288,6 +3288,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  		enum compact_priority prio, enum compact_result *compact_result)
>  {
>  	struct page *page;
> +	unsigned int noreclaim_flag = current->flags & PF_MEMALLOC;
> 
>  	if (!order)
>  		return NULL;
> @@ -3295,7 +3296,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	current->flags |= PF_MEMALLOC;
>  	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>  									prio);
> -	current->flags &= ~PF_MEMALLOC;
> +	current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag;
> 
>  	if (*compact_result <= COMPACT_INACTIVE)
>  		return NULL;
> --
> 2.12.2

^ permalink raw reply

* Re: kill req->errors
From: Christoph Hellwig @ 2017-04-07  7:11 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Christoph Hellwig, Jens Axboe, Josef Bacik, James Smart,
	Roger Pau Monné, linux-scsi, linux-nvme, linux-block,
	dm-devel
In-Reply-To: <20170406200024.GC6140@localhost.localdomain>

On Thu, Apr 06, 2017 at 04:00:24PM -0400, Konrad Rzeszutek Wilk wrote:
> You wouldn't have a git tree to easily test it? Thanks.

git://git.infradead.org/users/hch/block.git request-errors

Gitweb:

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/request-errors

^ permalink raw reply

* [PATCHv6 2/2] loop: support 4k physical blocksize
From: Hannes Reinecke @ 2017-04-07  6:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Omar Sandoval, Ming Lei, linux-block, Hannes Reinecke,
	Hannes Reinecke
In-Reply-To: <1491548338-71005-1-git-send-email-hare@suse.de>

When generating bootable VM images certain systems (most notably
s390x) require devices with 4k blocksize. This patch implements
a new flag 'LO_FLAGS_BLOCKSIZE' which will set the physical
blocksize to that of the underlying device, and allow to change
the logical blocksize for up to the physical blocksize.

Signed-off-by: Hannes Reinecke <hare@suse.com>
---
 drivers/block/loop.c      | 36 +++++++++++++++++++++++++++++++-----
 drivers/block/loop.h      |  1 +
 include/uapi/linux/loop.h |  3 +++
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 81b3900..f098681 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -221,7 +221,8 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
 }
 
 static int
-figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit)
+figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit,
+		 loff_t logical_blocksize)
 {
 	loff_t size = get_size(offset, sizelimit, lo->lo_backing_file);
 	sector_t x = (sector_t)size;
@@ -233,6 +234,12 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
 		lo->lo_offset = offset;
 	if (lo->lo_sizelimit != sizelimit)
 		lo->lo_sizelimit = sizelimit;
+	if (lo->lo_flags & LO_FLAGS_BLOCKSIZE) {
+		lo->lo_logical_blocksize = logical_blocksize;
+		blk_queue_physical_block_size(lo->lo_queue, lo->lo_blocksize);
+		blk_queue_logical_block_size(lo->lo_queue,
+					     lo->lo_logical_blocksize);
+	}
 	set_capacity(lo->lo_disk, x);
 	bd_set_size(bdev, (loff_t)get_capacity(bdev->bd_disk) << 9);
 	/* let user-space know about the new size */
@@ -814,6 +821,7 @@ static void loop_config_discard(struct loop_device *lo)
 	struct file *file = lo->lo_backing_file;
 	struct inode *inode = file->f_mapping->host;
 	struct request_queue *q = lo->lo_queue;
+	int lo_bits = blksize_bits(lo->lo_logical_blocksize);
 
 	/*
 	 * We use punch hole to reclaim the free space used by the
@@ -833,7 +841,7 @@ static void loop_config_discard(struct loop_device *lo)
 
 	q->limits.discard_granularity = inode->i_sb->s_blocksize;
 	q->limits.discard_alignment = 0;
-	blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
+	blk_queue_max_discard_sectors(q, UINT_MAX >> lo_bits);
 	q->limits.discard_zeroes_data = 1;
 	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
 }
@@ -922,6 +930,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 
 	lo->use_dio = false;
 	lo->lo_blocksize = lo_blocksize;
+	lo->lo_logical_blocksize = 512;
 	lo->lo_device = bdev;
 	lo->lo_flags = lo_flags;
 	lo->lo_backing_file = file;
@@ -1087,6 +1096,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	int err;
 	struct loop_func_table *xfer;
 	kuid_t uid = current_uid();
+	int lo_flags = lo->lo_flags;
 
 	if (lo->lo_encrypt_key_size &&
 	    !uid_eq(lo->lo_key_owner, uid) &&
@@ -1119,9 +1129,24 @@ static int loop_clr_fd(struct loop_device *lo)
 	if (err)
 		goto exit;
 
+	if (info->lo_flags & LO_FLAGS_BLOCKSIZE) {
+		lo->lo_flags |= LO_FLAGS_BLOCKSIZE;
+		if (LO_INFO_BLOCKSIZE(info) != 512 &&
+		    LO_INFO_BLOCKSIZE(info) != 1024 &&
+		    LO_INFO_BLOCKSIZE(info) != 2048 &&
+		    LO_INFO_BLOCKSIZE(info) != 4096)
+			return -EINVAL;
+		if (LO_INFO_BLOCKSIZE(info) > lo->lo_blocksize)
+			return -EINVAL;
+	}
+
 	if (lo->lo_offset != info->lo_offset ||
-	    lo->lo_sizelimit != info->lo_sizelimit)
-		if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit)) {
+	    lo->lo_sizelimit != info->lo_sizelimit ||
+	    lo->lo_flags != lo_flags ||
+	    ((lo->lo_flags & LO_FLAGS_BLOCKSIZE) &&
+	     lo->lo_logical_blocksize != LO_INFO_BLOCKSIZE(info))) {
+		if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit,
+				     info->lo_init[0]))
 			err = -EFBIG;
 			goto exit;
 		}
@@ -1312,7 +1337,8 @@ static int loop_set_capacity(struct loop_device *lo)
 	if (unlikely(lo->lo_state != Lo_bound))
 		return -ENXIO;
 
-	return figure_loop_size(lo, lo->lo_offset, lo->lo_sizelimit);
+	return figure_loop_size(lo, lo->lo_offset, lo->lo_sizelimit,
+				lo->lo_logical_blocksize);
 }
 
 static int loop_set_dio(struct loop_device *lo, unsigned long arg)
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index fb2237c..579f2f7 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -49,6 +49,7 @@ struct loop_device {
 	struct file *	lo_backing_file;
 	struct block_device *lo_device;
 	unsigned	lo_blocksize;
+	unsigned	lo_logical_blocksize;
 	void		*key_data; 
 
 	gfp_t		old_gfp_mask;
diff --git a/include/uapi/linux/loop.h b/include/uapi/linux/loop.h
index c8125ec..a3960f9 100644
--- a/include/uapi/linux/loop.h
+++ b/include/uapi/linux/loop.h
@@ -22,6 +22,7 @@ enum {
 	LO_FLAGS_AUTOCLEAR	= 4,
 	LO_FLAGS_PARTSCAN	= 8,
 	LO_FLAGS_DIRECT_IO	= 16,
+	LO_FLAGS_BLOCKSIZE	= 32,
 };
 
 #include <asm/posix_types.h>	/* for __kernel_old_dev_t */
@@ -59,6 +60,8 @@ struct loop_info64 {
 	__u64		   lo_init[2];
 };
 
+#define LO_INFO_BLOCKSIZE(l) (l)->lo_init[0]
+
 /*
  * Loop filter types
  */
-- 
1.8.5.6

^ permalink raw reply related

* [PATCHv6 1/2] loop: Remove unused 'bdev' argument from loop_set_capacity
From: Hannes Reinecke @ 2017-04-07  6:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Omar Sandoval, Ming Lei, linux-block, Hannes Reinecke
In-Reply-To: <1491548338-71005-1-git-send-email-hare@suse.de>

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/block/loop.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cc981f3..81b3900 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1307,7 +1307,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	return err;
 }
 
-static int loop_set_capacity(struct loop_device *lo, struct block_device *bdev)
+static int loop_set_capacity(struct loop_device *lo)
 {
 	if (unlikely(lo->lo_state != Lo_bound))
 		return -ENXIO;
@@ -1370,7 +1370,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
 	case LOOP_SET_CAPACITY:
 		err = -EPERM;
 		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
-			err = loop_set_capacity(lo, bdev);
+			err = loop_set_capacity(lo);
 		break;
 	case LOOP_SET_DIRECT_IO:
 		err = -EPERM;
-- 
1.8.5.6

^ permalink raw reply related

* [PATCHv6 0/2] loop: enable different logical blocksizes
From: Hannes Reinecke @ 2017-04-07  6:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Omar Sandoval, Ming Lei, linux-block, Hannes Reinecke

Currently the loop driver just simulates 512-byte blocks. When
creating bootable images for virtual machines it might be required
to use a different physical blocksize (eg 4k for S/390 DASD), as
the some bootloaders (like lilo or zipl for S/390) need to know
the physical block addresses of the kernel and initrd.

With this patchset the loop driver will export the logical and
physical blocksize and the current LOOP_SET_STATUS64 ioctl is
extended to set the logical blocksize by re-using the existing
'init' fields, which are currently unused.

As usual, comments and reviews are welcome.

Changes to v1:
- Move LO_FLAGS_BLOCKSIZE definition
- Reshuffle patches
Changes to v2:
- Include reviews from Ming Lei
Changes to v3:
- Include reviews from Christoph
- Merge patches
Changes to v4:
- Add LO_INFO_BLOCKSIZE definition
Changes to v5:
- Rediff to latest upstream

Hannes Reinecke (2):
  loop: Remove unused 'bdev' argument from loop_set_capacity
  loop: support 4k physical blocksize

 drivers/block/loop.c      | 40 +++++++++++++++++++++++++++++++++-------
 drivers/block/loop.h      |  1 +
 include/uapi/linux/loop.h |  3 +++
 3 files changed, 37 insertions(+), 7 deletions(-)

-- 
1.8.5.6

^ permalink raw reply

* Re: [PATCH 1/2] block: Implement global tagset
From: Arun Easi @ 2017-04-07  6:21 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Hannes Reinecke, Jens Axboe, Omar Sandoval, Martin K. Petersen,
	James Bottomley, Christoph Hellwig, Bart van Assche, linux-block,
	linux-scsi
In-Reply-To: <b80f5b80-4b4f-279f-379e-c1d74be7fc7c@suse.com>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3105 bytes --]

On Thu, 6 Apr 2017, 1:49am, Hannes Reinecke wrote:

> On 04/06/2017 08:27 AM, Arun Easi wrote:
> > Hi Hannes,
> > 
> > Thanks for taking a crack at the issue. My comments below..
> > 
> > On Tue, 4 Apr 2017, 5:07am, Hannes Reinecke wrote:
> > 
> >> Most legacy HBAs have a tagset per HBA, not per queue. To map
> >> these devices onto block-mq this patch implements a new tagset
> >> flag BLK_MQ_F_GLOBAL_TAGS, which will cause the tag allocator
> >> to use just one tagset for all hardware queues.
> >>
> >> Signed-off-by: Hannes Reinecke <hare@suse.com>
> >> ---
> >>  block/blk-mq-tag.c     | 12 ++++++++----
> >>  block/blk-mq.c         | 10 ++++++++--
> >>  include/linux/blk-mq.h |  1 +
> >>  3 files changed, 17 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> >> index e48bc2c..a14e76c 100644
> >> --- a/block/blk-mq-tag.c
> >> +++ b/block/blk-mq-tag.c
> >> @@ -276,9 +276,11 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> >>  void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
> >>  		busy_tag_iter_fn *fn, void *priv)
> >>  {
> >> -	int i;
> >> +	int i, lim = tagset->nr_hw_queues;
> >>  
> >> -	for (i = 0; i < tagset->nr_hw_queues; i++) {
> >> +	if (tagset->flags & BLK_MQ_F_GLOBAL_TAGS)
> >> +		lim = 1;
> >> +	for (i = 0; i < lim; i++) {
> >>  		if (tagset->tags && tagset->tags[i])
> >>  			blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
> >>  	}
> >> @@ -287,12 +289,14 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
> >>  
> >>  int blk_mq_reinit_tagset(struct blk_mq_tag_set *set)
> >>  {
> >> -	int i, j, ret = 0;
> >> +	int i, j, ret = 0, lim = set->nr_hw_queues;
> >>  
> >>  	if (!set->ops->reinit_request)
> >>  		goto out;
> >>  
> >> -	for (i = 0; i < set->nr_hw_queues; i++) {
> >> +	if (set->flags & BLK_MQ_F_GLOBAL_TAGS)
> >> +		lim = 1;
> >> +	for (i = 0; i < lim; i++) {
> >>  		struct blk_mq_tags *tags = set->tags[i];
> >>  
> >>  		for (j = 0; j < tags->nr_tags; j++) {
> >> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >> index 159187a..db96ed0 100644
> >> --- a/block/blk-mq.c
> >> +++ b/block/blk-mq.c
> >> @@ -2061,6 +2061,10 @@ static bool __blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, int hctx_idx)
> >>  {
> >>  	int ret = 0;
> >>  
> >> +	if ((set->flags & BLK_MQ_F_GLOBAL_TAGS) && hctx_idx != 0) {
> >> +		set->tags[hctx_idx] = set->tags[0];
> >> +		return true;
> >> +	}
> > 
:
> 
> > BTW, if you would like me to try out this patch on my setup, please let me 
> > know.
> > 
> Oh, yes. Please do.
> 

Ran the tests on my setup (Dell R730, 2 Node). This change did not drop 
any IOPs (got ~2M 512b). The cache miss percentage was varying based on if 
the tests were running on one node or both (latter yperformed worse). All 
interrupts were directed to only 1 node. Interestingly, the cache miss 
percentage was lowest when MQ was off.

I hit a fdisk hang (open path), btw, not sure if it has anything todo with 
this change, though.

Notes and hang stack attached.

Let me know if you are interested in any specific perf event/command-line.

Regards,
-Arun

[-- Attachment #2: Type: TEXT/plain, Size: 3423 bytes --]

perf stat, ran on a short 10 second load.

---1port-1node-new-mq----
 Performance counter stats for 'CPU(s) 2':

 188,642,696      LLC-loads                                            (66.66%)
   3,615,142      LLC-load-misses  #    1.92% of all LL-cache hits     (66.67%)
  86,488,341      LLC-stores                                           (33.34%)
  10,820,977      LLC-store-misses                                     (33.33%)
 391,370,104      cache-references                                     (49.99%)
  14,498,491      cache-misses     #    3.705 % of all cache refs      (66.66%)

---1port-1node-mq---
 Performance counter stats for 'CPU(s) 2':

 145,025,999      LLC-loads                                            (66.67%)
   3,793,427      LLC-load-misses  #    2.62% of all LL-cache hits     (66.67%)
  60,878,939      LLC-stores                                           (33.33%)
   8,044,714      LLC-store-misses                                     (33.33%)
 294,713,070      cache-references                                     (50.00%)
  11,923,354      cache-misses     #    4.046 % of all cache refs      (66.66%)

---1port-1node-nomq---
 Performance counter stats for 'CPU(s) 2':

 157,375,709      LLC-loads                                            (66.66%)
     476,117      LLC-load-misses  #    0.30% of all LL-cache hits     (66.66%)
  76,046,098      LLC-stores                                           (33.34%)
     840,756      LLC-store-misses                                     (33.34%)
 326,230,969      cache-references                                     (50.00%)
   1,332,398      cache-misses     #    0.408 % of all cache refs      (66.67%)

======================

--2port-allnodes-new-mq--
 Performance counter stats for 'CPU(s) 2':

  55,455,533      LLC-loads                                            (66.67%)
  37,996,545      LLC-load-misses  #   68.52% of all LL-cache hits     (66.67%)
  14,030,291      LLC-stores                                           (33.33%)
   7,096,931      LLC-store-misses                                     (33.33%)
  76,711,197      cache-references                                     (49.99%)
  45,170,719      cache-misses     #   58.884 % of all cache refs      (66.66%)

--2port-allnodes-mq--
 Performance counter stats for 'CPU(s) 2':

  59,303,410      LLC-loads                                            (66.66%)
  31,115,601      LLC-load-misses  #   52.47% of all LL-cache hits     (66.66%)
  17,496,477      LLC-stores                                           (33.34%)
   6,201,373      LLC-store-misses                                     (33.34%)
  89,035,272      cache-references                                     (50.00%)
  37,372,777      cache-misses     #   41.975 % of all cache refs      (66.66%)

--2port-allnodes-nomq--
 Performance counter stats for 'CPU(s) 2':

  86,724,905      LLC-loads                                            (66.67%)
  27,154,245      LLC-load-misses  #   31.31% of all LL-cache hits     (66.67%)
  33,710,265      LLC-stores                                           (33.34%)
   6,521,394      LLC-store-misses                                     (33.33%)
 139,089,528      cache-references                                     (50.00%)
  33,682,000      cache-misses     #   24.216 % of all cache refs      (66.66%)


[-- Attachment #3: Type: TEXT/PLAIN, Size: 2937 bytes --]


Apr  6 17:34:05 avlnxperf kernel: INFO: task fdisk:27745 blocked for more than 120 seconds.
Apr  6 17:34:05 avlnxperf kernel:      Tainted: G    B      OE   4.11.0-rc4-newblk-ae+ #4
Apr  6 17:34:05 avlnxperf kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr  6 17:34:05 avlnxperf kernel: fdisk           D    0 27745  27743 0x00000080
Apr  6 17:34:05 avlnxperf kernel: Call Trace:
Apr  6 17:34:05 avlnxperf kernel: __schedule+0x289/0x8f0
Apr  6 17:34:05 avlnxperf kernel: schedule+0x36/0x80
Apr  6 17:34:05 avlnxperf kernel: schedule_timeout+0x249/0x300
Apr  6 17:34:05 avlnxperf kernel: ? sched_clock_cpu+0x11/0xb0
Apr  6 17:34:05 avlnxperf kernel: ? try_to_wake_up+0x59/0x450
Apr  6 17:34:05 avlnxperf kernel: wait_for_completion+0x121/0x180
Apr  6 17:34:05 avlnxperf kernel: ? wake_up_q+0x80/0x80
Apr  6 17:34:05 avlnxperf kernel: flush_work+0x11d/0x1c0
Apr  6 17:34:05 avlnxperf kernel: ? wake_up_worker+0x30/0x30
Apr  6 17:34:05 avlnxperf kernel: __cancel_work_timer+0x10e/0x1d0
Apr  6 17:34:05 avlnxperf kernel: ? kobj_lookup+0x10d/0x160
Apr  6 17:34:05 avlnxperf kernel: cancel_delayed_work_sync+0x13/0x20
Apr  6 17:34:05 avlnxperf kernel: disk_block_events+0x77/0x80
Apr  6 17:34:05 avlnxperf kernel: __blkdev_get+0x11b/0x4b0
Apr  6 17:34:05 avlnxperf kernel: blkdev_get+0x1c3/0x320
Apr  6 17:34:05 avlnxperf kernel: blkdev_open+0x5b/0x70
Apr  6 17:34:05 avlnxperf kernel: do_dentry_open+0x213/0x330
Apr  6 17:34:05 avlnxperf kernel: ? bd_acquire+0xd0/0xd0
Apr  6 17:34:05 avlnxperf kernel: vfs_open+0x4f/0x70
Apr  6 17:34:05 avlnxperf kernel: ? may_open+0x9b/0x100
Apr  6 17:34:05 avlnxperf kernel: path_openat+0x557/0x13c0
Apr  6 17:34:05 avlnxperf kernel: ? generic_file_read_iter+0x746/0x8c0
Apr  6 17:34:05 avlnxperf kernel: ? scsi_bios_ptable+0x54/0x130
Apr  6 17:34:05 avlnxperf kernel: do_filp_open+0x91/0x100
Apr  6 17:34:05 avlnxperf kernel: ? __alloc_fd+0x46/0x170
Apr  6 17:34:05 avlnxperf kernel: do_sys_open+0x124/0x210
Apr  6 17:34:05 avlnxperf kernel: ? __audit_syscall_exit+0x209/0x290
Apr  6 17:34:05 avlnxperf kernel: SyS_open+0x1e/0x20
Apr  6 17:34:05 avlnxperf kernel: do_syscall_64+0x67/0x180
Apr  6 17:34:05 avlnxperf kernel: entry_SYSCALL64_slow_path+0x25/0x25
Apr  6 17:34:05 avlnxperf kernel: RIP: 0033:0x7faef86b0a10
Apr  6 17:34:05 avlnxperf kernel: RSP: 002b:00007fffa7159438 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
Apr  6 17:34:05 avlnxperf kernel: RAX: ffffffffffffffda RBX: 0000000000b34310 RCX: 00007faef86b0a10
Apr  6 17:34:05 avlnxperf kernel: RDX: 00007fffa7159598 RSI: 0000000000080000 RDI: 00007fffa7159590
Apr  6 17:34:05 avlnxperf kernel: RBP: 00007fffa7159590 R08: 00007faef8610938 R09: 0000000000000008
Apr  6 17:34:05 avlnxperf kernel: R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
Apr  6 17:34:05 avlnxperf kernel: R13: 0000000000b34550 R14: 0000000000000005 R15: 0000000000000000

^ permalink raw reply

* Re: [PATCH v3 1/8] blk-mq: use the right hctx when getting a driver tag fails
From: Ming Lei @ 2017-04-07  3:23 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Omar Sandoval, linux-block, kernel-team
In-Reply-To: <000e804d-fa00-d248-a3ce-dcbebc34cda9@fb.com>

Hi Jens,

Thanks for your comment!

On Thu, Apr 06, 2017 at 01:29:26PM -0600, Jens Axboe wrote:
> On 04/06/2017 02:23 AM, Ming Lei wrote:
> > On Thu, Apr 06, 2017 at 12:57:51AM -0700, Omar Sandoval wrote:
> >> On Thu, Apr 06, 2017 at 12:31:18PM +0800, Ming Lei wrote:
> >>> On Wed, Apr 05, 2017 at 12:01:29PM -0700, Omar Sandoval wrote:
> >>>> From: Omar Sandoval <osandov@fb.com>
> >>>>
> >>>> While dispatching requests, if we fail to get a driver tag, we mark the
> >>>> hardware queue as waiting for a tag and put the requests on a
> >>>> hctx->dispatch list to be run later when a driver tag is freed. However,
> >>>> blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
> >>>> queues if using a single-queue scheduler with a multiqueue device. If
> >>>
> >>> It can't perform well by using a SQ scheduler on a MQ device, so just be
> >>> curious why someone wants to do that in this way,:-)
> >>
> >> I don't know why anyone would want to, but it has to work :) The only
> >> reason we noticed this is because when the NBD device is created, it
> >> only has a single queue, so we automatically assign mq-deadline to it.
> >> Later, we update the number of queues, but it's still using mq-deadline.
> >>
> >>> I guess you mean that ops.mq.dispatch_request() may dispatch requests
> >>> from other hardware queues in blk_mq_sched_dispatch_requests() instead
> >>> of current hctx.
> >>
> >> Yup, that's right. It's weird, and I talked to Jens about just forcing
> >> the MQ device into an SQ mode when using an SQ scheduler, but this way
> >> works fine more or less.
> > 
> > Or just switch the elevator to the MQ default one when the device becomes
> > MQ? Or let mq-deadline's .dispatch_request() just return reqs in current
> > hctx?
> 
> No, that would be a really bad idea imho. First of all, I don't want
> kernel driven scheduler changes. Secondly, the framework should work
> with a non-direct mapping between hardware dispatch queues and
> scheduling queues.
> 
> While we could force a single queue usage to make that a 1:1 mapping
> always, that loses big benefits on eg nbd, which uses multiple hardware
> queues to up the bandwidth. Similarly on nvme, for example, we still
> scale better with N submission queues and 1 scheduling queue compared to
> having just 1 submission queue.

Looks that isn't what I meant. And my 2nd point is to make mq-deadline's
dispatch_request(hctx) just returns requests mapped to the hw queue of
'hctx', then we can avoid to mess up blk-mq.c and blk-mq-sched.c.

> 
> >>> If that is true, it looks like a issue in usage of I/O scheduler, since
> >>> the mq-deadline scheduler just queues requests in one per-request_queue
> >>> linked list, for MQ device, the scheduler queue should have been per-hctx.
> >>
> >> That's an option, but that's a different scheduling policy. Again, I
> >> agree that it's strange, but it's reasonable behavior.
> > 
> > IMO, the current mq-deadline isn't good/ready for MQ device, and it
> > doesn't make sense to use it for MQ.
> 
> I don't think that's true at all. I do agree that it's somewhat quirky
> since it does introduce scheduling dependencies between the hardware
> queues, and we have to work at making that well understood and explicit,
> as not to introduce bugs due to that. But in reality, all multiqueue
> hardware we are deadling with are mapped to a single resource. As such,
> it makes a lot of sense to schedule it as such. Hence I don't think that
> a single queue deadline approach is necessarily a bad idea for even fast
> storage.

When we map all mq into one single deadline queue, it can't scale well, for
example, I just run a simple write test(libaio, dio, bs:4k, 4jobs) over one
commodity NVMe in a dual-socket(four cores) system:
	
	IO scheduler|	CPU utilization
	-------------------------------
	none		|   60%
	-------------------------------
	mq-deadline |	100%
	-------------------------------

And IO throughput is basically same in two cases.

Thanks,
Ming

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox