Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* Re: [PATCH 2/2] block: fix leak of q->rq_wb
From: Omar Sandoval @ 2017-03-28  3:45 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, kernel-team
In-Reply-To: <20170328033850.GB23217@ming.t460p>

On Tue, Mar 28, 2017 at 11:43:04AM +0800, Ming Lei wrote:
> On Mon, Mar 27, 2017 at 10:43:59AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > CONFIG_DEBUG_TEST_DRIVER_REMOVE found a possible leak of q->rq_wb in a
> > couple of cases: when a request queue is reregistered and when gendisks
> > share a request_queue. This has been a problem since wbt was introduced,
> > but the WARN_ON(!list_empty(&stats->callbacks)) in the blk-stat rework
> > exposed it. The fix is unfortunately a hack until we fix all of the
> > drivers sharing a request_queue.
> > 
> > Fixes: 87760e5eef35 ("block: hook up writeback throttling")
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  block/blk-sysfs.c | 16 +++++++++++++++-
> >  1 file changed, 15 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index fa831cb2fc30..a187e3f70028 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -893,7 +893,21 @@ int blk_register_queue(struct gendisk *disk)
> >  
> >  	kobject_uevent(&q->kobj, KOBJ_ADD);
> >  
> > -	blk_wb_init(q);
> > +	/*
> > +	 * There are two cases where wbt may have already been initialized:
> > +	 * 1. A call sequence of blk_register_queue(); blk_unregister_queue();
> > +	 *    blk_register_queue().
> > +	 * 2. Multiple gendisks sharing a request_queue.
> > +	 *
> > +	 * To fix case 1, we'd like to call wbt_exit() in
> > +	 * blk_unregister_queue(). However, that's unsafe for case 2. So, we're
> > +	 * forced to do this and call wbt_exit() in blk_release_queue() instead.
> > +	 *
> > +	 * Note that in case 2, wbt will account across disks until those legacy
> > +	 * drivers are fixed.
> > +	 */
> > +	if (!q->rq_wb)
> > +		blk_wb_init(q);
> 
> Since 'rq_wb' is per-queue and its life time is same with queue's, I
> am wondering why blk_wb_init() isn't put into blk_alloc_queue_node() or
> queue's initialization api(blk_init_allocated_queue(), or
> blk_mq_init_allocated_queue())?

Doing it at queue init time might be cleaner, I'll try that.

^ permalink raw reply

* [PATCH 0/6] block: convert remaining drivers which share a request queue
From: Omar Sandoval @ 2017-03-28  6:28 UTC (permalink / raw)
  To: Jens Axboe, Vivek Goyal, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

This is clearly the pinnacle of my career: converting all remaining
block drivers which share a request queue across gendisks to use a
separate request queue per gendisk. These are all compile tested (but
the last two platform-specific ones involved hacking the Kconfig and
commenting out a bunch of arch-dependent code to get the rest to
compile), no runtime testing at all.

Let me know if I missed any. Even better, let me know if we can just
delete some of these entirely.

Thanks!

Omar Sandoval (6):
  hd: stop sharing request queue across multiple gendisks
  parport/pd: stop sharing request queue across multiple gendisks
  parport/pcd: stop sharing request queue across multiple gendisks
  parport/pf: stop sharing request queue across multiple gendisks
  swim: stop sharing request queue across multiple gendisks
  jsflash: stop sharing request queue across multiple gendisks

 drivers/block/hd.c          | 62 ++++++++++++++++++++++++++++-----------------
 drivers/block/paride/pcd.c  | 57 +++++++++++++++++++++++++++--------------
 drivers/block/paride/pd.c   | 50 ++++++++++++++++++++++++------------
 drivers/block/paride/pf.c   | 57 +++++++++++++++++++++++++++--------------
 drivers/block/swim.c        | 55 ++++++++++++++++++++++++++--------------
 drivers/sbus/char/jsflash.c | 50 ++++++++++++++++++++++++++----------
 6 files changed, 221 insertions(+), 110 deletions(-)

-- 
2.12.1

^ permalink raw reply

* [PATCH 1/6] hd: stop sharing request queue across multiple gendisks
From: Omar Sandoval @ 2017-03-28  6:28 UTC (permalink / raw)
  To: Jens Axboe, Vivek Goyal, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1490681910.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Compile-tested only.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 drivers/block/hd.c | 62 ++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 39 insertions(+), 23 deletions(-)

diff --git a/drivers/block/hd.c b/drivers/block/hd.c
index 6043648da1e8..79d63b20a297 100644
--- a/drivers/block/hd.c
+++ b/drivers/block/hd.c
@@ -95,7 +95,7 @@
 #define ICRC_ERR		0x80	/* new meaning:  CRC error during transfer */
 
 static DEFINE_SPINLOCK(hd_lock);
-static struct request_queue *hd_queue;
+static unsigned int hd_queue;
 static struct request *hd_req;
 
 #define TIMEOUT_VALUE	(6*HZ)
@@ -537,7 +537,7 @@ static void hd_times_out(unsigned long dummy)
 	if (!hd_req)
 		return;
 
-	spin_lock_irq(hd_queue->queue_lock);
+	spin_lock_irq(&hd_lock);
 	reset = 1;
 	name = hd_req->rq_disk->disk_name;
 	printk("%s: timeout\n", name);
@@ -548,7 +548,7 @@ static void hd_times_out(unsigned long dummy)
 		hd_end_request_cur(-EIO);
 	}
 	hd_request();
-	spin_unlock_irq(hd_queue->queue_lock);
+	spin_unlock_irq(&hd_lock);
 }
 
 static int do_special_op(struct hd_i_struct *disk, struct request *req)
@@ -566,6 +566,25 @@ static int do_special_op(struct hd_i_struct *disk, struct request *req)
 	return 1;
 }
 
+static int set_next_request(void)
+{
+	struct request_queue *q;
+	int old_pos = hd_queue;
+
+	do {
+		q = hd_gendisk[hd_queue]->queue;
+		if (++hd_queue == NR_HD)
+			hd_queue = 0;
+		if (q) {
+			hd_req = blk_fetch_request(q);
+			if (hd_req)
+				break;
+		}
+	} while (hd_queue != old_pos);
+
+	return hd_req != NULL;
+}
+
 /*
  * The driver enables interrupts as much as possible.  In order to do this,
  * (a) the device-interrupt is disabled before entering hd_request(),
@@ -587,12 +606,9 @@ static void hd_request(void)
 repeat:
 	del_timer(&device_timer);
 
-	if (!hd_req) {
-		hd_req = blk_fetch_request(hd_queue);
-		if (!hd_req) {
-			do_hd = NULL;
-			return;
-		}
+	if (!hd_req && !set_next_request()) {
+		do_hd = NULL;
+		return;
 	}
 	req = hd_req;
 
@@ -676,7 +692,7 @@ static irqreturn_t hd_interrupt(int irq, void *dev_id)
 {
 	void (*handler)(void) = do_hd;
 
-	spin_lock(hd_queue->queue_lock);
+	spin_lock(&hd_lock);
 
 	do_hd = NULL;
 	del_timer(&device_timer);
@@ -684,7 +700,7 @@ static irqreturn_t hd_interrupt(int irq, void *dev_id)
 		handler = unexpected_hd_interrupt;
 	handler();
 
-	spin_unlock(hd_queue->queue_lock);
+	spin_unlock(&hd_lock);
 
 	return IRQ_HANDLED;
 }
@@ -700,16 +716,8 @@ static int __init hd_init(void)
 	if (register_blkdev(HD_MAJOR, "hd"))
 		return -1;
 
-	hd_queue = blk_init_queue(do_hd_request, &hd_lock);
-	if (!hd_queue) {
-		unregister_blkdev(HD_MAJOR, "hd");
-		return -ENOMEM;
-	}
-
-	blk_queue_max_hw_sectors(hd_queue, 255);
 	init_timer(&device_timer);
 	device_timer.function = hd_times_out;
-	blk_queue_logical_block_size(hd_queue, 512);
 
 	if (!NR_HD) {
 		/*
@@ -742,7 +750,11 @@ static int __init hd_init(void)
 		sprintf(disk->disk_name, "hd%c", 'a'+drive);
 		disk->private_data = p;
 		set_capacity(disk, p->head * p->sect * p->cyl);
-		disk->queue = hd_queue;
+		disk->queue = blk_init_queue(do_hd_request, &hd_lock);
+		if (!disk->queue)
+			goto Enomem;
+		blk_queue_max_hw_sectors(disk->queue, 255);
+		blk_queue_logical_block_size(disk->queue, 512);
 		p->unit = drive;
 		hd_gendisk[drive] = disk;
 		printk("%s: %luMB, CHS=%d/%d/%d\n",
@@ -781,11 +793,15 @@ static int __init hd_init(void)
 out:
 	del_timer(&device_timer);
 	unregister_blkdev(HD_MAJOR, "hd");
-	blk_cleanup_queue(hd_queue);
 	return -1;
 Enomem:
-	while (drive--)
-		put_disk(hd_gendisk[drive]);
+	for (drive = 0; drive < NR_HD; drive++) {
+		if (hd_gendisk[drive]) {
+			if (hd_gendisk[drive]->queue)
+				blk_cleanup_queue(hd_gendisk[drive]->queue);
+			put_disk(hd_gendisk[drive]);
+		}
+	}
 	goto out;
 }
 
-- 
2.12.1

^ permalink raw reply related

* [PATCH 4/6] parport/pf: stop sharing request queue across multiple gendisks
From: Omar Sandoval @ 2017-03-28  6:28 UTC (permalink / raw)
  To: Jens Axboe, Vivek Goyal, linux-block; +Cc: kernel-team, Tim Waugh
In-Reply-To: <cover.1490681910.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Compile-tested only.

Cc: Tim Waugh <tim@cyberelk.net>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 drivers/block/paride/pf.c | 57 +++++++++++++++++++++++++++++++----------------
 1 file changed, 38 insertions(+), 19 deletions(-)

diff --git a/drivers/block/paride/pf.c b/drivers/block/paride/pf.c
index 14c5d32f5d8b..f24ca7315ddc 100644
--- a/drivers/block/paride/pf.c
+++ b/drivers/block/paride/pf.c
@@ -287,6 +287,12 @@ static void __init pf_init_units(void)
 		struct gendisk *disk = alloc_disk(1);
 		if (!disk)
 			continue;
+		disk->queue = blk_init_queue(do_pf_request, &pf_spin_lock);
+		if (!disk->queue) {
+			put_disk(disk);
+			return;
+		}
+		blk_queue_max_segments(disk->queue, cluster);
 		pf->disk = disk;
 		pf->pi = &pf->pia;
 		pf->media_status = PF_NM;
@@ -772,7 +778,28 @@ static int pf_ready(void)
 	return (((status_reg(pf_current) & (STAT_BUSY | pf_mask)) == pf_mask));
 }
 
-static struct request_queue *pf_queue;
+static int pf_queue;
+
+static int set_next_request(void)
+{
+	struct pf_unit *pf;
+	struct request_queue *q;
+	int old_pos = pf_queue;
+
+	do {
+		pf = &units[pf_queue];
+		q = pf->present ? pf->disk->queue : NULL;
+		if (++pf_queue == PF_UNITS)
+			pf_queue = 0;
+		if (q) {
+			pf_req = blk_fetch_request(q);
+			if (pf_req)
+				break;
+		}
+	} while (pf_queue != old_pos);
+
+	return pf_req != NULL;
+}
 
 static void pf_end_request(int err)
 {
@@ -780,16 +807,13 @@ static void pf_end_request(int err)
 		pf_req = NULL;
 }
 
-static void do_pf_request(struct request_queue * q)
+static void pf_request(void)
 {
 	if (pf_busy)
 		return;
 repeat:
-	if (!pf_req) {
-		pf_req = blk_fetch_request(q);
-		if (!pf_req)
-			return;
-	}
+	if (!pf_req && !set_next_request())
+		return;
 
 	pf_current = pf_req->rq_disk->private_data;
 	pf_block = blk_rq_pos(pf_req);
@@ -817,6 +841,11 @@ static void do_pf_request(struct request_queue * q)
 	}
 }
 
+static void do_pf_request(struct request_queue *q)
+{
+	pf_request();
+}
+
 static int pf_next_buf(void)
 {
 	unsigned long saved_flags;
@@ -846,7 +875,7 @@ static inline void next_request(int err)
 	spin_lock_irqsave(&pf_spin_lock, saved_flags);
 	pf_end_request(err);
 	pf_busy = 0;
-	do_pf_request(pf_queue);
+	pf_request();
 	spin_unlock_irqrestore(&pf_spin_lock, saved_flags);
 }
 
@@ -972,15 +1001,6 @@ static int __init pf_init(void)
 			put_disk(pf->disk);
 		return -EBUSY;
 	}
-	pf_queue = blk_init_queue(do_pf_request, &pf_spin_lock);
-	if (!pf_queue) {
-		unregister_blkdev(major, name);
-		for (pf = units, unit = 0; unit < PF_UNITS; pf++, unit++)
-			put_disk(pf->disk);
-		return -ENOMEM;
-	}
-
-	blk_queue_max_segments(pf_queue, cluster);
 
 	for (pf = units, unit = 0; unit < PF_UNITS; pf++, unit++) {
 		struct gendisk *disk = pf->disk;
@@ -988,7 +1008,6 @@ static int __init pf_init(void)
 		if (!pf->present)
 			continue;
 		disk->private_data = pf;
-		disk->queue = pf_queue;
 		add_disk(disk);
 	}
 	return 0;
@@ -1003,10 +1022,10 @@ static void __exit pf_exit(void)
 		if (!pf->present)
 			continue;
 		del_gendisk(pf->disk);
+		blk_cleanup_queue(pf->disk->queue);
 		put_disk(pf->disk);
 		pi_release(pf->pi);
 	}
-	blk_cleanup_queue(pf_queue);
 }
 
 MODULE_LICENSE("GPL");
-- 
2.12.1

^ permalink raw reply related

* [PATCH 6/6] jsflash: stop sharing request queue across multiple gendisks
From: Omar Sandoval @ 2017-03-28  6:28 UTC (permalink / raw)
  To: Jens Axboe, Vivek Goyal, linux-block; +Cc: kernel-team, David S . Miller
In-Reply-To: <cover.1490681910.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Compile-tested only (by hacking it to compile on x86).

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 drivers/sbus/char/jsflash.c | 50 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 36 insertions(+), 14 deletions(-)

diff --git a/drivers/sbus/char/jsflash.c b/drivers/sbus/char/jsflash.c
index 6ff61dad5e21..62fed9dc893e 100644
--- a/drivers/sbus/char/jsflash.c
+++ b/drivers/sbus/char/jsflash.c
@@ -183,11 +183,33 @@ static void jsfd_read(char *buf, unsigned long p, size_t togo) {
 	}
 }
 
-static void jsfd_do_request(struct request_queue *q)
+static int jsfd_queue;
+
+static struct request *jsfd_next_request(void)
+{
+	struct request_queue *q;
+	struct request *rq;
+	int old_pos = jsfd_queue;
+
+	do {
+		q = jsfd_disk[jsfd_queue]->queue;
+		if (++jsfd_queue == JSF_MAX)
+			jsfd_queue = 0;
+		if (q) {
+			rq = blk_fetch_request(q);
+			if (rq)
+				return rq;
+		}
+	} while (jsfd_queue != old_pos);
+
+	return NULL;
+}
+
+static void jsfd_request(void)
 {
 	struct request *req;
 
-	req = blk_fetch_request(q);
+	req = jsfd_next_request();
 	while (req) {
 		struct jsfd_part *jdp = req->rq_disk->private_data;
 		unsigned long offset = blk_rq_pos(req) << 9;
@@ -211,10 +233,15 @@ static void jsfd_do_request(struct request_queue *q)
 		err = 0;
 	end:
 		if (!__blk_end_request_cur(req, err))
-			req = blk_fetch_request(q);
+			req = jsfd_next_request();
 	}
 }
 
+static void jsfd_do_request(struct request_queue *q)
+{
+	jsfd_request();
+}
+
 /*
  * The memory devices use the full 32/64 bits of the offset, and so we cannot
  * check against negative addresses: they are ok. The return value is weird,
@@ -544,8 +571,6 @@ static int jsflash_init(void)
 	return 0;
 }
 
-static struct request_queue *jsf_queue;
-
 static int jsfd_init(void)
 {
 	static DEFINE_SPINLOCK(lock);
@@ -562,6 +587,11 @@ static int jsfd_init(void)
 		struct gendisk *disk = alloc_disk(1);
 		if (!disk)
 			goto out;
+		disk->queue = blk_init_queue(jsfd_do_request, &lock);
+		if (!disk->queue) {
+			put_disk(disk);
+			goto out;
+		}
 		jsfd_disk[i] = disk;
 	}
 
@@ -570,13 +600,6 @@ static int jsfd_init(void)
 		goto out;
 	}
 
-	jsf_queue = blk_init_queue(jsfd_do_request, &lock);
-	if (!jsf_queue) {
-		err = -ENOMEM;
-		unregister_blkdev(JSFD_MAJOR, "jsfd");
-		goto out;
-	}
-
 	for (i = 0; i < JSF_MAX; i++) {
 		struct gendisk *disk = jsfd_disk[i];
 		if ((i & JSF_PART_MASK) >= JSF_NPART) continue;
@@ -589,7 +612,6 @@ static int jsfd_init(void)
 		disk->fops = &jsfd_fops;
 		set_capacity(disk, jdp->dsize >> 9);
 		disk->private_data = jdp;
-		disk->queue = jsf_queue;
 		add_disk(disk);
 		set_disk_ro(disk, 1);
 	}
@@ -619,6 +641,7 @@ static void __exit jsflash_cleanup_module(void)
 	for (i = 0; i < JSF_MAX; i++) {
 		if ((i & JSF_PART_MASK) >= JSF_NPART) continue;
 		del_gendisk(jsfd_disk[i]);
+		blk_cleanup_queue(jsfd_disk[i]->queue);
 		put_disk(jsfd_disk[i]);
 	}
 	if (jsf0.busy)
@@ -628,7 +651,6 @@ static void __exit jsflash_cleanup_module(void)
 
 	misc_deregister(&jsf_dev);
 	unregister_blkdev(JSFD_MAJOR, "jsfd");
-	blk_cleanup_queue(jsf_queue);
 }
 
 module_init(jsflash_init_module);
-- 
2.12.1

^ permalink raw reply related

* [PATCH 2/6] parport/pd: stop sharing request queue across multiple gendisks
From: Omar Sandoval @ 2017-03-28  6:28 UTC (permalink / raw)
  To: Jens Axboe, Vivek Goyal, linux-block; +Cc: kernel-team, Tim Waugh
In-Reply-To: <cover.1490681910.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Compile-tested only.

Cc: Tim Waugh <tim@cyberelk.net>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 drivers/block/paride/pd.c | 50 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 34 insertions(+), 16 deletions(-)

diff --git a/drivers/block/paride/pd.c b/drivers/block/paride/pd.c
index 9cfd2e06a649..b05e151c9b38 100644
--- a/drivers/block/paride/pd.c
+++ b/drivers/block/paride/pd.c
@@ -381,12 +381,33 @@ static enum action do_pd_write_start(void);
 static enum action do_pd_read_drq(void);
 static enum action do_pd_write_done(void);
 
-static struct request_queue *pd_queue;
+static int pd_queue;
 static int pd_claimed;
 
 static struct pd_unit *pd_current; /* current request's drive */
 static PIA *pi_current; /* current request's PIA */
 
+static int set_next_request(void)
+{
+	struct gendisk *disk;
+	struct request_queue *q;
+	int old_pos = pd_queue;
+
+	do {
+		disk = pd[pd_queue].gd;
+		q = disk ? disk->queue : NULL;
+		if (++pd_queue == PD_UNITS)
+			pd_queue = 0;
+		if (q) {
+			pd_req = blk_fetch_request(q);
+			if (pd_req)
+				break;
+		}
+	} while (pd_queue != old_pos);
+
+	return pd_req != NULL;
+}
+
 static void run_fsm(void)
 {
 	while (1) {
@@ -418,8 +439,7 @@ static void run_fsm(void)
 				spin_lock_irqsave(&pd_lock, saved_flags);
 				if (!__blk_end_request_cur(pd_req,
 						res == Ok ? 0 : -EIO)) {
-					pd_req = blk_fetch_request(pd_queue);
-					if (!pd_req)
+					if (!set_next_request())
 						stop = 1;
 				}
 				spin_unlock_irqrestore(&pd_lock, saved_flags);
@@ -839,7 +859,13 @@ static void pd_probe_drive(struct pd_unit *disk)
 	p->first_minor = (disk - pd) << PD_BITS;
 	disk->gd = p;
 	p->private_data = disk;
-	p->queue = pd_queue;
+	p->queue = blk_init_queue(do_pd_request, &pd_lock);
+	if (!p->queue) {
+		disk->gd = NULL;
+		put_disk(p);
+		return;
+	}
+	blk_queue_max_hw_sectors(p->queue, cluster);
 
 	if (disk->drive == -1) {
 		for (disk->drive = 0; disk->drive <= 1; disk->drive++)
@@ -919,26 +945,18 @@ static int __init pd_init(void)
 	if (disable)
 		goto out1;
 
-	pd_queue = blk_init_queue(do_pd_request, &pd_lock);
-	if (!pd_queue)
-		goto out1;
-
-	blk_queue_max_hw_sectors(pd_queue, cluster);
-
 	if (register_blkdev(major, name))
-		goto out2;
+		goto out1;
 
 	printk("%s: %s version %s, major %d, cluster %d, nice %d\n",
 	       name, name, PD_VERSION, major, cluster, nice);
 	if (!pd_detect())
-		goto out3;
+		goto out2;
 
 	return 0;
 
-out3:
-	unregister_blkdev(major, name);
 out2:
-	blk_cleanup_queue(pd_queue);
+	unregister_blkdev(major, name);
 out1:
 	return -ENODEV;
 }
@@ -953,11 +971,11 @@ static void __exit pd_exit(void)
 		if (p) {
 			disk->gd = NULL;
 			del_gendisk(p);
+			blk_cleanup_queue(p->queue);
 			put_disk(p);
 			pi_release(disk->pi);
 		}
 	}
-	blk_cleanup_queue(pd_queue);
 }
 
 MODULE_LICENSE("GPL");
-- 
2.12.1

^ permalink raw reply related

* [PATCH 5/6] swim: stop sharing request queue across multiple gendisks
From: Omar Sandoval @ 2017-03-28  6:28 UTC (permalink / raw)
  To: Jens Axboe, Vivek Goyal, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1490681910.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Compile-tested only (by hacking it to compile on x86).

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 drivers/block/swim.c | 55 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 36 insertions(+), 19 deletions(-)

diff --git a/drivers/block/swim.c b/drivers/block/swim.c
index b5afd495d482..3064be6cf375 100644
--- a/drivers/block/swim.c
+++ b/drivers/block/swim.c
@@ -211,7 +211,7 @@ enum head {
 struct swim_priv {
 	struct swim __iomem *base;
 	spinlock_t lock;
-	struct request_queue *queue;
+	int fdc_queue;
 	int floppy_count;
 	struct floppy_state unit[FD_MAX_UNIT];
 };
@@ -525,12 +525,33 @@ static int floppy_read_sectors(struct floppy_state *fs,
 	return 0;
 }
 
-static void redo_fd_request(struct request_queue *q)
+static struct request *swim_next_request(struct swim_priv *swd)
 {
+	struct request_queue *q;
+	struct request *rq;
+	int old_pos = swd->fdc_queue;
+
+	do {
+		q = swd->unit[swd->fdc_queue].disk->queue;
+		if (++swd->fdc_queue == swd->floppy_count)
+			swd->fdc_queue = 0;
+		if (q) {
+			rq = blk_fetch_request(q);
+			if (rq)
+				return rq;
+		}
+	} while (swd->fdc_queue != old_pos);
+
+	return NULL;
+}
+
+static void do_fd_request(struct request_queue *q)
+{
+	struct swim_priv *swd = q->queuedata;
 	struct request *req;
 	struct floppy_state *fs;
 
-	req = blk_fetch_request(q);
+	req = swim_next_request(swd);
 	while (req) {
 		int err = -EIO;
 
@@ -554,15 +575,10 @@ static void redo_fd_request(struct request_queue *q)
 		}
 	done:
 		if (!__blk_end_request_cur(req, err))
-			req = blk_fetch_request(q);
+			req = swim_next_request(swd);
 	}
 }
 
-static void do_fd_request(struct request_queue *q)
-{
-	redo_fd_request(q);
-}
-
 static struct floppy_struct floppy_type[4] = {
 	{    0,  0, 0,  0, 0, 0x00, 0x00, 0x00, 0x00, NULL }, /* no testing   */
 	{  720,  9, 1, 80, 0, 0x2A, 0x02, 0xDF, 0x50, NULL }, /* 360KB SS 3.5"*/
@@ -833,22 +849,25 @@ static int swim_floppy_init(struct swim_priv *swd)
 		return -EBUSY;
 	}
 
+	spin_lock_init(&swd->lock);
+
 	for (drive = 0; drive < swd->floppy_count; drive++) {
 		swd->unit[drive].disk = alloc_disk(1);
 		if (swd->unit[drive].disk == NULL) {
 			err = -ENOMEM;
 			goto exit_put_disks;
 		}
+		swd->unit[drive].disk->queue = blk_init_queue(do_fd_request,
+							      &swd->lock);
+		if (!swd->unit[drive].disk->queue) {
+			err = -ENOMEM;
+			put_disk(swd->unit[drive].disk);
+			goto exit_put_disks;
+		}
+		swd->unit[drive].disk->queue->queuedata = swd;
 		swd->unit[drive].swd = swd;
 	}
 
-	spin_lock_init(&swd->lock);
-	swd->queue = blk_init_queue(do_fd_request, &swd->lock);
-	if (!swd->queue) {
-		err = -ENOMEM;
-		goto exit_put_disks;
-	}
-
 	for (drive = 0; drive < swd->floppy_count; drive++) {
 		swd->unit[drive].disk->flags = GENHD_FL_REMOVABLE;
 		swd->unit[drive].disk->major = FLOPPY_MAJOR;
@@ -856,7 +875,6 @@ static int swim_floppy_init(struct swim_priv *swd)
 		sprintf(swd->unit[drive].disk->disk_name, "fd%d", drive);
 		swd->unit[drive].disk->fops = &floppy_fops;
 		swd->unit[drive].disk->private_data = &swd->unit[drive];
-		swd->unit[drive].disk->queue = swd->queue;
 		set_capacity(swd->unit[drive].disk, 2880);
 		add_disk(swd->unit[drive].disk);
 	}
@@ -943,13 +961,12 @@ static int swim_remove(struct platform_device *dev)
 
 	for (drive = 0; drive < swd->floppy_count; drive++) {
 		del_gendisk(swd->unit[drive].disk);
+		blk_cleanup_queue(swd->unit[drive].disk->queue);
 		put_disk(swd->unit[drive].disk);
 	}
 
 	unregister_blkdev(FLOPPY_MAJOR, "fd");
 
-	blk_cleanup_queue(swd->queue);
-
 	/* eject floppies */
 
 	for (drive = 0; drive < swd->floppy_count; drive++)
-- 
2.12.1

^ permalink raw reply related

* [PATCH 3/6] parport/pcd: stop sharing request queue across multiple gendisks
From: Omar Sandoval @ 2017-03-28  6:28 UTC (permalink / raw)
  To: Jens Axboe, Vivek Goyal, linux-block; +Cc: kernel-team, Tim Waugh
In-Reply-To: <cover.1490681910.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Compile-tested only.

Cc: Tim Waugh <tim@cyberelk.net>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 drivers/block/paride/pcd.c | 57 ++++++++++++++++++++++++++++++----------------
 1 file changed, 38 insertions(+), 19 deletions(-)

diff --git a/drivers/block/paride/pcd.c b/drivers/block/paride/pcd.c
index 939641d6e262..b1267ef34d5a 100644
--- a/drivers/block/paride/pcd.c
+++ b/drivers/block/paride/pcd.c
@@ -300,6 +300,11 @@ static void pcd_init_units(void)
 		struct gendisk *disk = alloc_disk(1);
 		if (!disk)
 			continue;
+		disk->queue = blk_init_queue(do_pcd_request, &pcd_lock);
+		if (!disk->queue) {
+			put_disk(disk);
+			continue;
+		}
 		cd->disk = disk;
 		cd->pi = &cd->pia;
 		cd->present = 0;
@@ -735,18 +740,36 @@ static int pcd_detect(void)
 }
 
 /* I/O request processing */
-static struct request_queue *pcd_queue;
+static int pcd_queue;
+
+static int set_next_request(void)
+{
+	struct pcd_unit *cd;
+	struct request_queue *q;
+	int old_pos = pcd_queue;
+
+	do {
+		cd = &pcd[pcd_queue];
+		q = cd->present ? cd->disk->queue : NULL;
+		if (++pcd_queue == PCD_UNITS)
+			pcd_queue = 0;
+		if (q) {
+			pcd_req = blk_fetch_request(q);
+			if (pcd_req)
+				break;
+		}
+	} while (pcd_queue != old_pos);
+
+	return pcd_req != NULL;
+}
 
-static void do_pcd_request(struct request_queue * q)
+static void pcd_request(void)
 {
 	if (pcd_busy)
 		return;
 	while (1) {
-		if (!pcd_req) {
-			pcd_req = blk_fetch_request(q);
-			if (!pcd_req)
-				return;
-		}
+		if (!pcd_req && !set_next_request())
+			return;
 
 		if (rq_data_dir(pcd_req) == READ) {
 			struct pcd_unit *cd = pcd_req->rq_disk->private_data;
@@ -766,6 +789,11 @@ static void do_pcd_request(struct request_queue * q)
 	}
 }
 
+static void do_pcd_request(struct request_queue *q)
+{
+	pcd_request();
+}
+
 static inline void next_request(int err)
 {
 	unsigned long saved_flags;
@@ -774,7 +802,7 @@ static inline void next_request(int err)
 	if (!__blk_end_request_cur(pcd_req, err))
 		pcd_req = NULL;
 	pcd_busy = 0;
-	do_pcd_request(pcd_queue);
+	pcd_request();
 	spin_unlock_irqrestore(&pcd_lock, saved_flags);
 }
 
@@ -849,7 +877,7 @@ static void do_pcd_read_drq(void)
 
 	do_pcd_read();
 	spin_lock_irqsave(&pcd_lock, saved_flags);
-	do_pcd_request(pcd_queue);
+	pcd_request();
 	spin_unlock_irqrestore(&pcd_lock, saved_flags);
 }
 
@@ -957,19 +985,10 @@ static int __init pcd_init(void)
 		return -EBUSY;
 	}
 
-	pcd_queue = blk_init_queue(do_pcd_request, &pcd_lock);
-	if (!pcd_queue) {
-		unregister_blkdev(major, name);
-		for (unit = 0, cd = pcd; unit < PCD_UNITS; unit++, cd++)
-			put_disk(cd->disk);
-		return -ENOMEM;
-	}
-
 	for (unit = 0, cd = pcd; unit < PCD_UNITS; unit++, cd++) {
 		if (cd->present) {
 			register_cdrom(&cd->info);
 			cd->disk->private_data = cd;
-			cd->disk->queue = pcd_queue;
 			add_disk(cd->disk);
 		}
 	}
@@ -988,9 +1007,9 @@ static void __exit pcd_exit(void)
 			pi_release(cd->pi);
 			unregister_cdrom(&cd->info);
 		}
+		blk_cleanup_queue(cd->disk->queue);
 		put_disk(cd->disk);
 	}
-	blk_cleanup_queue(pcd_queue);
 	unregister_blkdev(major, name);
 	pi_unregister_driver(par_drv);
 }
-- 
2.12.1

^ permalink raw reply related

* Re: [PATCH 1/6] hd: stop sharing request queue across multiple gendisks
From: Christoph Hellwig @ 2017-03-28  6:32 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Jens Axboe, Vivek Goyal, linux-block, kernel-team
In-Reply-To: <dc3e8fc2dd692fc27f43c51eaf607ea0c3ddda5f.1490681910.git.osandov@fb.com>

Can we just remove this driver entirely?  I'm pretty sure Linux will
have all kinds of problems with hardware of that age anyway.

^ permalink raw reply

* Re: [PATCH 0/6] block: convert remaining drivers which share a request queue
From: Christoph Hellwig @ 2017-03-28  6:33 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Jens Axboe, Vivek Goyal, linux-block, kernel-team
In-Reply-To: <cover.1490681910.git.osandov@fb.com>

On Mon, Mar 27, 2017 at 11:28:41PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> This is clearly the pinnacle of my career: converting all remaining
> block drivers which share a request queue across gendisks to use a
> separate request queue per gendisk. These are all compile tested (but
> the last two platform-specific ones involved hacking the Kconfig and
> commenting out a bunch of arch-dependent code to get the rest to
> compile), no runtime testing at all.
> 
> Let me know if I missed any. Even better, let me know if we can just
> delete some of these entirely.

Weren't the floppy drivers the prime example of shared request queue?
I haven't looked at any of them for a while, so I'm not sure if that's
still the case.

^ permalink raw reply

* Re: [PATCH 0/6] block: convert remaining drivers which share a request queue
From: Omar Sandoval @ 2017-03-28  6:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Vivek Goyal, linux-block, kernel-team
In-Reply-To: <20170328063319.GB11751@infradead.org>

On Mon, Mar 27, 2017 at 11:33:19PM -0700, Christoph Hellwig wrote:
> On Mon, Mar 27, 2017 at 11:28:41PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > This is clearly the pinnacle of my career: converting all remaining
> > block drivers which share a request queue across gendisks to use a
> > separate request queue per gendisk. These are all compile tested (but
> > the last two platform-specific ones involved hacking the Kconfig and
> > commenting out a bunch of arch-dependent code to get the rest to
> > compile), no runtime testing at all.
> > 
> > Let me know if I missed any. Even better, let me know if we can just
> > delete some of these entirely.
> 
> Weren't the floppy drivers the prime example of shared request queue?
> I haven't looked at any of them for a while, so I'm not sure if that's
> still the case.


Those have all been fixed up, besides the swim one in this series:

786029ff810f ("amiga floppy: Stop sharing request queue across multiple gendisks")
639e2f2aa76e ("atari floppy: Stop sharing request queue across multiple gendisks")
488211844e0c ("floppy: switch to one queue per drive instead of sharing a queue")

^ permalink raw reply

* Re: [PATCH 06/16] mmc: core: replace waitqueue with worker
From: Linus Walleij @ 2017-03-28  7:46 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: linux-mmc@vger.kernel.org, Ulf Hansson, Paolo Valente,
	Chunyan Zhang, Baolin Wang, linux-block, Jens Axboe,
	Christoph Hellwig, Arnd Bergmann
In-Reply-To: <3fc89f9f-fbcf-113d-3644-b6c9dae003f0@intel.com>

On Fri, Mar 10, 2017 at 3:21 PM, Adrian Hunter <adrian.hunter@intel.com> wrote:
> On 10/03/17 00:49, Linus Walleij wrote:
>> On Wed, Feb 22, 2017 at 2:29 PM, Adrian Hunter <adrian.hunter@intel.com> wrote:
>>> On 09/02/17 17:33, Linus Walleij wrote:

>>>> This is a central change that let us do many other changes since
>>>> we have broken the submit and complete code paths in two, and we
>>>> can potentially remove the NULL flushing of the asynchronous
>>>> pipeline and report block requests as finished directly from
>>>> the worker.
>>>
>>> This needs more thought.  The completion should go straight to the mmc block
>>> driver from the ->done() callback.  And from there straight back to the
>>> block layer if recovery is not needed.  We want to stop using
>>> mmc_start_areq() altogether because we never want to wait - we always want
>>> to issue (if possible) and return.
>>
>> I don't quite follow this. Isn't what you request exactly what
>> patch 15/16 "mmc: queue: issue requests in massive parallel"
>> is doing?
>
> There is the latency for the worker that runs mmc_finalize_areq() and then
> another latency to wake up the worker that is running mmc_start_areq().
> That is 2 wake-ups instead of 1.

That is correct (though the measured effect is small).

However when we switch to MQ it must happen like this due to its asynchronous
nature of issuing requests to us.

Then we have MQ's submission thread coming in from one en and our worker
to manage retries and errors on the other side. We obviously cannot do
the retries and resends in MQs context as it blocks subsequent requests.

> As a side note, ideally we would be able to issue the next request from the
> interrupt or soft interrupt context of the completion (i.e. 0 wake-ups
> between requests), but we would probably have to look at the host API to
> support that.

I looked at that and couldn't find a good way to get to that point.

Mainly because of mmc_start_bkops() that may
arbitrarily fire after every command and start new requests to the
card, and that of course require a process context to happen. Then there
is the retune thing that I do not fully understand how it schedules, but
it might be fine since I'm under the impression that it is done at the
start of the next request if need be. Maybe both can be overcome by
quick checks in IRQ context and then this can be done. (I'm not smart enough
to see that yet, sorry.)

However since we activate the blocking context in MQ I don't know
if it can even deal with having requests completed in interrupt context
so that the next thing that happens after completing the request and
returning from the interrupt is that the block layer thread gets scheduled
(unless something more important is going on), I guess it is possible?
It looks like it could be really efficient.

>>> For the blk-mq port, the queue thread should also be retained, partly
>>> because it solves some synchronization problems, but mostly because, at this
>>> stage, we anyway don't have solutions for all the different ways the driver
>>> can block.
>>> (as listed here https://marc.info/?l=linux-mmc&m=148336571720463&w=2 )
>>
>> Essentially I take out that thread and replace it with this one worker
>> introduced in this very patch. I agree the driver can block in many ways
>> and that is why I need to have it running in process context, and this
>> is what the worker introduced here provides.
>
> The last time I looked at the blk-mq I/O scheduler code, it pulled up to
> qdepth requests from the I/O scheduler and left them on a local list while
> running ->queue_rq().  That means blocking in ->queue_rq() leaves some
> number of requests in limbo (not issued but also not in the I/O scheduler)
> for that time.

I think Jens provided a patch for this bug (don't see the patch
upstream though).

Yours,
Linus Walleij

^ permalink raw reply

* Re: [PATCH 06/16] mmc: core: replace waitqueue with worker
From: Linus Walleij @ 2017-03-28  7:47 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Adrian Hunter, linux-mmc@vger.kernel.org, Ulf Hansson,
	Paolo Valente, Chunyan Zhang, Baolin Wang, linux-block,
	Christoph Hellwig, Arnd Bergmann
In-Reply-To: <a3216031-0801-45e2-320c-bfc88fa742b4@kernel.dk>

On Tue, Mar 14, 2017 at 3:36 PM, Jens Axboe <axboe@kernel.dk> wrote:

> There's one case that doesn't look like it was converted properly, but
> that's a mistake. The general insert-and-run cases run inline if we can,
> but the direct-issue needs a fixup, see below.
>
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 159187a28d66..4196d6bee92d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1434,7 +1434,8 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
>         return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
>  }
>
> -static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
> +static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie,
> +                                     bool can_block)
>  {
>         struct request_queue *q = rq->q;
>         struct blk_mq_queue_data bd = {
> @@ -1475,7 +1476,7 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
>         }
>
>  insert:
> -       blk_mq_sched_insert_request(rq, false, true, true, false);
> +       blk_mq_sched_insert_request(rq, false, true, false, can_block);
>  }
>
>  /*
> @@ -1569,11 +1570,11 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
>
>                 if (!(data.hctx->flags & BLK_MQ_F_BLOCKING)) {
>                         rcu_read_lock();
> -                       blk_mq_try_issue_directly(old_rq, &cookie);
> +                       blk_mq_try_issue_directly(old_rq, &cookie, false);
>                         rcu_read_unlock();
>                 } else {
>                         srcu_idx = srcu_read_lock(&data.hctx->queue_rq_srcu);
> -                       blk_mq_try_issue_directly(old_rq, &cookie);
> +                       blk_mq_try_issue_directly(old_rq, &cookie, true);
>                         srcu_read_unlock(&data.hctx->queue_rq_srcu, srcu_idx);
>                 }
>                 goto done;

Jens do you have this in your patch queue or is it something you want
us to test and
submit back to you?

Yours,
Linus Walleij

^ permalink raw reply

* Re: [PATCH RFC 22/39] mmc: block: Prepare CQE data
From: Linus Walleij @ 2017-03-28  7:57 UTC (permalink / raw)
  To: Adrian Hunter, Paolo Valente, Jens Axboe
  Cc: linux-block, Ulf Hansson, linux-mmc, Alex Lemberg, Mateusz Nowak,
	Yuliy Izrailov, Jaehoon Chung, Dong Aisheng, Das Asutosh,
	Zhangfei Gao, Dorfman Konstantin, David Griego, Sahitya Tummala,
	Harjani Ritesh, Venu Byravarasu
In-Reply-To: <f823dfb9-2acd-e9be-7132-a0b28659c316@intel.com>

On Fri, Mar 10, 2017 at 9:29 AM, Adrian Hunter <adrian.hunter@intel.com> wrote:
> On 10/03/17 00:39, Linus Walleij wrote:
>> On Fri, Mar 3, 2017 at 1:22 PM, Adrian Hunter <adrian.hunter@intel.com> wrote:
>>> On 15/02/17 15:49, Linus Walleij wrote:
>>>> On Fri, Feb 10, 2017 at 1:55 PM, Adrian Hunter <adrian.hunter@intel.com> wrote:
>>>>
>>>>> Enhance mmc_blk_data_prep() to support CQE requests.
>>>>>
>>>>> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
>>>>
>>>> Hey:
>>>>
>>>>> +#include <linux/ioprio.h>
>>>> (...)
>>>>> +       if (IOPRIO_PRIO_CLASS(req_get_ioprio(req)) == IOPRIO_CLASS_RT)
>>>>> +               brq->data.flags |= MMC_DATA_PRIO;
>>>>
>>>> What is this?
>>>
>>> It is the command queue priority.
>>>
>>> The command queue supports 2 priorities: "high" (1) and "simple" (0).  The
>>> eMMC will give "high" priority tasks priority over "simple" priority tasks.
>>>
>>> So here we give priority to IOPRIO_CLASS_RT which seems appropriate.
>>
>> So if I understand correctly, you are obtaining the block layer scheduling
>> priorities, that can (only?) be set by ionice has from the command line?
>
> AFAICS it is the ioprio_set() system call .

OK to be clear, correct, it is set by that system call.

And as far as I can tell there is exactly one program in the world that issues
that syscall, and that is ionice(1).

It is described to the world as a way for users to give hints to the
I/O scheduler
about how to manage requests. For example a user can hint that some IO
is a backup job and not important and some IO is a media player and really
important.

And since the block layer only handles this hint in the legacy blk CFQ scheduler
I want to know from the block layer developers if this priority has any future.
BFQ does not use this hint at all, I think Paolo described the ionice command
as problematic, and we need to understand if user-provided IO scheduling hints
is something that is going to be used in the future.

Yours,
Linus Walleij

^ permalink raw reply

* Re: [PATCH] blk-mq: Add NULL pointer check for HW dispatch queue
From: Somnath Kotur @ 2017-03-28  8:35 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jitendra Bhivare, linux-block
In-Reply-To: <20170327091419.GD6879@infradead.org>

On Mon, Mar 27, 2017 at 2:44 PM, Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Mar 20, 2017 at 03:10:01PM +0530, Jitendra Bhivare wrote:
> > As part of blk_mq_realloc_hw_ctx(), if the init_hctx() ops is
> > failed by the underyling transport, the hctx pointer is freed and
> > initialized to NULL.
> > However, functions down the line, access this hwctx pointer without
> > a NULL pointer check, which could lead to a kernel crash.
>
> Shouldn't we fail initializing the queue if any of the hctx allocations
> fail?

Well, just to give a better background of the issue, here is the
dump_stack of where/when the failure happens

Mar 18 08:27:31 dhcp-10-192-204-6 kernel: [<ffffffffa05d42d6>]
ib_alloc_mr+0x26/0x50 [ib_core]
Mar 18 08:27:31 dhcp-10-192-204-6 kernel: [<ffffffffa0a37691>]
__nvme_rdma_init_request+0xc1/0x230 [nvme_rdma]
Mar 18 08:27:31 dhcp-10-192-204-6 kernel: [<ffffffffa0a37831>]
nvme_rdma_init_request+0x11/0x20 [nvme_rdma]
Mar 18 08:27:31 dhcp-10-192-204-6 kernel: [<ffffffff813429bb>]
blk_mq_init_rq_map+0x23b/0x2b0
Mar 18 08:27:31 dhcp-10-192-204-6 kernel: [<ffffffff81342e25>]
blk_mq_alloc_tag_set+0x135/0x2c0
Mar 18 08:27:31 dhcp-10-192-204-6 kernel: [<ffffffffa0a37cc3>]
nvme_rdma_create_ctrl+0x483/0x710 [nvme_rdma]
Mar 18 08:27:31 dhcp-10-192-204-6 kernel: [<ffffffffa0a2c127>]
nvmf_dev_write+0x727/0x93c [nvme_fabrics]
Mar 18 08:27:31 dhcp-10-192-204-6 kernel: [<ffffffff812320e7>]
__vfs_write+0x37/0x160

the ctrl->queue_count in nvme_rdma_create_ctrl() is initialized like so:

ctrl->queue_count = opts->nr_io_queues + 1; /* +1 for admin queue */

where opts->nr_io_queues is typically set to num_online_cpus() which
in my case turned out to be 16, while the failure i encountered was
for the 14th CPU , the failure being alloc_mr() because we reached the
limitation of MRs in our chip.

The point is that post this failure, functions like
blk_mq_init_cpu_queues() and blk_mq_map_swqueue() use code snippet
like below to access the hctxs:

for_each_possible_cpu(i) {
....
 hctx = blk_mq_map_queue(q, i);
 hctx->....                                          // crash if ptr is NULL
..
}

I'm not that familiar with the blk code itself, so perhaps there is a
better way of fixing it, but have pointed out the problem and a
possible fix, this is more of a bug
in the error-handling path?

Thanks
Som

^ permalink raw reply

* Re: [PATCH 2/2] blk-stat: add a poll_size value to the request_queue struct
From: Sagi Grimberg @ 2017-03-28 10:50 UTC (permalink / raw)
  To: sbates, axboe; +Cc: linux-block, Damien.LeMoal, osandov, linux-nvme
In-Reply-To: <1490494692-2416-3-git-send-email-sbates@raithlin.com>



On 26/03/17 05:18, sbates@raithlin.com wrote:
> From: Stephen Bates <sbates@raithlin.com>
>
> In order to bucket IO for the polling algorithm we use a sysfs entry
> to set the filter value. It is signed and we will use that as follows:
>
>  0   : No filtering. All IO are considered in stat generation
>  > 0 : Filtering based on IO of exactly this size only.
>  < 0 : Filtering based on IO less than or equal to -1 time this value.

I'd say that this is a fairly non-trivial semantic meanning to this...

Is there any use for the size exact match filter? If not then
I suggest we always make it (<=) in its semantics...

> Use this member to implement a new bucket function for the io polling
> stats.
>
> Signed-off-by: Stephen Bates <sbates@raithlin.com>
> ---
>  block/blk-mq.c         | 17 +++++++++++++++--
>  block/blk-sysfs.c      | 30 ++++++++++++++++++++++++++++++
>  include/linux/blkdev.h |  1 +
>  3 files changed, 46 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 5ff66f2..5ea9481 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2327,6 +2327,18 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
>  	blk_mq_sysfs_register(q);
>  }
>
> +int blk_mq_poll_stats_bucket(const struct request *rq)
> +{
> +	int val = rq->q->poll_size;
> +
> +	if ((val == 0) ||
> +	    (val > 0 && blk_rq_bytes(rq) == val) ||
> +	    (val < 0 && blk_rq_bytes(rq) <= -val))
> +		return (int)rq_data_dir(rq);
> +
> +	return -1;
> +}
> +
>  struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>  						  struct request_queue *q)
>  {
> @@ -2338,7 +2350,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>  		goto err_exit;
>
>  	q->poll_cb = blk_stat_alloc_callback(blk_mq_poll_stats_fn,
> -					     blk_stat_rq_ddir, 2, q);
> +					     blk_mq_poll_stats_bucket, 2, q);
>  	if (!q->poll_cb)
>  		goto err_exit;
>
> @@ -2387,9 +2399,10 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>  	q->nr_requests = set->queue_depth;
>
>  	/*
> -	 * Default to classic polling
> +	 * Default to classic polling. Default to considering all IO.
>  	 */
>  	q->poll_nsec = -1;
> +	q->poll_size = 0;
>
>  	if (set->ops->complete)
>  		blk_queue_softirq_done(q, set->ops->complete);
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index fa831cb..000d5db 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -394,6 +394,29 @@ static ssize_t queue_poll_delay_store(struct request_queue *q, const char *page,
>  	return count;
>  }
>
> +static ssize_t queue_poll_size_show(struct request_queue *q, char *page)
> +{
> +	return sprintf(page, "%d\n", q->poll_size);
> +}
> +
> +static ssize_t queue_poll_size_store(struct request_queue *q, const char *page,
> +				     size_t count)
> +{
> +	int err, val;
> +
> +	if (!q->mq_ops || !q->mq_ops->poll)
> +		return -EINVAL;
> +
> +	err = kstrtoint(page, 10, &val);
> +	if (err < 0)
> +		return err;
> +
> +	q->poll_size = val;
> +
> +	return count;
> +}
> +
> +
>  static ssize_t queue_poll_show(struct request_queue *q, char *page)
>  {
>  	return queue_var_show(test_bit(QUEUE_FLAG_POLL, &q->queue_flags), page);
> @@ -654,6 +677,12 @@ static struct queue_sysfs_entry queue_poll_entry = {
>  	.store = queue_poll_store,
>  };
>
> +static struct queue_sysfs_entry queue_poll_size_entry = {
> +	.attr = {.name = "io_poll_size", .mode = S_IRUGO | S_IWUSR },
> +	.show = queue_poll_size_show,
> +	.store = queue_poll_size_store,
> +};
> +
>  static struct queue_sysfs_entry queue_poll_delay_entry = {
>  	.attr = {.name = "io_poll_delay", .mode = S_IRUGO | S_IWUSR },
>  	.show = queue_poll_delay_show,
> @@ -710,6 +739,7 @@ static struct attribute *default_attrs[] = {
>  	&queue_dax_entry.attr,
>  	&queue_wb_lat_entry.attr,
>  	&queue_poll_delay_entry.attr,
> +	&queue_poll_size_entry.attr,
>  	NULL,
>  };
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 1a7dc42..7ff5388 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -517,6 +517,7 @@ struct request_queue {
>
>  	unsigned int		rq_timeout;
>  	int			poll_nsec;
> +	int			poll_size;
>
>  	struct blk_stat_callback	*poll_cb;
>  	struct blk_rq_stat	poll_stat[2];
>

^ permalink raw reply

* Re: [PATCH 1/7] ѕd: split sd_setup_discard_cmnd
From: axboe @ 2017-03-28 14:05 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch@lst.de, tj@kernel.org, martin.petersen@oracle.com,
	linux-scsi@vger.kernel.org, linux-block@vger.kernel.org,
	linux-ide@vger.kernel.org
In-Reply-To: <1490653474.7897.13.camel@sandisk.com>

On Mon, Mar 27 2017, Bart Van Assche wrote:
> On Mon, 2017-03-20 at 16:43 -0400, Christoph Hellwig wrote:
> > +	u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
> > +	u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
> 
> Although I know this is an issue in the existing code and not something
> introduced by you: please consider using logical_to_sectors() instead of
> open-coding this function. Otherwise this patch looks fine to me.

The downside of doing that is that we still call ilog2() twice, which
sucks. Would be faster to cache ilog2(sector_size) and use that in the
shift calculation.

-- 
Jens Axboe

^ permalink raw reply

* Re: v4.11-rc blk-mq lockup?
From: Jens Axboe @ 2017-03-28 14:06 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block@vger.kernel.org
In-Reply-To: <1490651076.7897.11.camel@sandisk.com>

On Mon, Mar 27 2017, Bart Van Assche wrote:
> Hello Jens,
> 
> If I leave the srp-test software running for a few minutes using the
> following command:
> 
> # while ~bart/software/infiniband/srp-test/run_tests -d -r 30; do :; done
> 
> then after some time the following complaint appears for multiple
> kworkers:
> 
> INFO: task kworker/9:0:65 blocked for more than 480 seconds.
> ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½Tainted: Gï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½Iï¿½ï¿½ï¿½ï¿½ï¿½4.11.0-rc4-dbg+ #5
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/9:0ï¿½ï¿½ï¿½ï¿½ï¿½Dï¿½ï¿½ï¿½ï¿½0ï¿½ï¿½ï¿½ï¿½65ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½2 0x00000000
> Workqueue: dio/dm-0 dio_aio_complete_work
> Call Trace:
> ï¿½__schedule+0x3df/0xc10
> ï¿½schedule+0x38/0x90
> ï¿½rwsem_down_write_failed+0x2c4/0x4c0
> ï¿½call_rwsem_down_write_failed+0x17/0x30
> ï¿½down_write+0x5a/0x70
> ï¿½__generic_file_fsync+0x43/0x90
> ï¿½ext4_sync_file+0x2d0/0x550
> ï¿½vfs_fsync_range+0x46/0xa0
> ï¿½dio_complete+0x181/0x1b0
> ï¿½dio_aio_complete_work+0x17/0x20
> ï¿½process_one_work+0x208/0x6a0
> ï¿½worker_thread+0x49/0x4a0
> ï¿½kthread+0x107/0x140
> ï¿½ret_from_fork+0x2e/0x40
> 
> I had not yet observed this behavior with kernel v4.10 or older. If this
> happens and I check the queue state with the following script:

Can you include the 'state' file in your script?

Do you know when this started happening? You say it doesn't happen in
4.10, but did it pass earlier in the 4.11-rc cycle?

Does it reproduce with dm?

I can't tell from your report if this is new in the 4.11 series,

> The kernel tree I used in my tests is the result of merging the
> following commits:
> * commit 3dca2c2f3d3bï¿½from git://git.kernel.dk/linux-block.git
>   ("Merge branch 'for-4.12/block' into for-next")
> * commit f88ab0c4b481 from git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git
>   ("scsi: libsas: fix ata xfer length")
> * commit ad0376eb1483 from git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>   ("Merge tag 'edac_for_4.11_2' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp")

Can we try and isolate it a bit - -rc4 alone, for instance?

-- 
Jens Axboe

^ permalink raw reply

* Re: [RFC PATCH 00/28] INFINIBAND NETWORK BLOCK DEVICE (IBNBD)
From: Roman Penyaev @ 2017-03-28 14:17 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Sagi Grimberg, linux-block, linux-rdma@vger.kernel.org,
	Doug Ledford, Jens Axboe, hch, Fabian Holler, Milind Dumbare,
	Michael Wang, Danil Kipnis
In-Reply-To: <CAMGffEmgsP9GjQQp63matwxpzJT+SZu+VPU+Co-h99b8D5g5Zw@mail.gmail.com>

Hi Bart and Sagi,

Thanks for warm welcome and early feedback.  I will respond both of you
but here, on Jack's email, since I am not in CC in the first cover letter
(what a bummer).  Sorry for mess.

Sagi Grimberg <sagi@grimberg.me> wrote:

> - Is there room for this ibnbd? If we were to take every block driver
>   that was submitted without sufficient justification, it'd be very
>   hard to maintain. What advantage (if any) does this buy anyone over
>   existing rdma based protocols (srp, iser, nvmf)? I'm really (*really*)
>   not sold on this one...

It seems that better to start from the history.  IBNBD project, as it was
presented, is not only the block device (which supposed to be thin), but
mainly it is an rdma transport, client and server logic, which is called
IBTRS in our terms.  IBTRS was the starter, the main idea for us, which is
planned to bind our own replicated storage solution via infiniband.

We wanted clear transport interface, which should not be very much different
from what normal bsd sockets provide, e.g.:

  ibtrs_clt_open()               - Opens a session to a server.
  ibtrs_clt_rdma_write()         - WRITE, i.e transfer data to server.
  ibtrs_clt_request_rdma_write() - READ, i.e. request data transfer from server.

We did not want to rely, depend and embed any existent commands sets and
protocols (e.g. SCSI) inside our transport.  IBTRS should stay apart from
any storage knowledge and should be able to do only two things:

  1. establish connection to a server (accept connections from clients)
  2. read/write any data

Thinking about transport as a layer, IBNBD is just a user of that layer,
thin block device with only 4 commands, which establishes one connection
to the server and maps N block devices thru that connection.

I pretty well realize, that I've described obvious things with this
layering, but again, what we wanted and achieved is an independent
IB transport, which is planned to be used for replication and in that
project IBNBD won't exist.

> - To me, the fundamental design that the client side owns a pool of
>   buffers that it issues writes too, seems inferior than the
>   one taken in iser/nvmf (immediate data).

That of course discussable, since we consider that as a feature :)
But indeed, we never tested against nvmf.  And of course, I am open
to any surprises.

>   IMO, the ibnbd design has
>   scalability issues both in terms of server side resources,

Each client connection eats ~64Mb of memory on server side for the IO
buffer pool.  Since we are talking about hundreds of connections this
is reasonable trade off (memory is considerably cheap) to avoid any
allocations on IO path and make it completely lockless.

>   client side contention

We do not have any noticeable contentions.  What we have is a single
queue of buffers for all devices mapped on a client per connection
(session) and I can consider that as a single point of resource, where
each device have to fight in order to get an empty bit from a bitmap.
In practice all benchmarks we did (I have to say, that this was a pretty
old 4.4 kernel, and fresh data - is the amount of work which obviously
should be done) show that even we share a queue a bottleneck is always
an infiniband.

>   and network congestion (on infiniband the latter is
>   less severe).

We rely on IB flow control and if a server side did not respond
with

  ib_post_recv()

we do not reach the wire from a client.  That should be enough.

> - I honestly do not understand why you need *19057* LOC to implement
>   a rdma based block driver. Thats almost larger than all of our
>   existing block drivers combined...

Oh, those LOC numbers :)  IBNBD client (block device) itself is around
2800 lines with all sysfs handy things.  What is really bloated is a
transport client and server sides, which pretend to be smart and cover:

  o reconnects
  o heartbeats
  o quite huge FSM implementation for each connection (session)

True, this code can be split on common parts and deflated.

>   First glance at the code provides
>   some explanations,
> (1) you have some strange code that has no business
>   in a block driver like ibtrs_malloc/ibtrs_zalloc (yikes)

Indeed, that is crap which is left from those times, when we tried
to cover all generic kernel calls in order to do fault injection.

>   open-coding various existing logging routines,

No excuses, will be vanished.

> (2) you are for some
>   reason adding a second tag allocation scheme (why?)

Several reasons.  a) We share single queue of buffers for mapped devices
per client connection and we still support RQ mode.  b) MQ shared tags
indeed are shared, but not between different hctxs.  I.e. because of
our single queue for N devices per connection we are not able to create
M hctxs and share this single queue between them.  Sometimes we need to
do blk_mq_stop_hw_queue() in order to stop the hw queue and then restart
it from a completion.  Obviously, simplest approach is to equally share
static number of buffers in queue between N devices.  But that won't work
since with big N we finish with very small queue depth per device.

Of course, would be nice if along with BLK_MQ_F_TAG_SHARED another flag
exists, say, BLK_MQ_F_TAG_GLOBALLY_SHARED.  Unfortunately with our design
MQ does not help us a lot with current shared tags implementation.

> (3) you are open
>   coding a lot of stuff that we added to the stack in the past months...

Could you please point precisely in the code what do you mean?

If you are saying that you added a lot to nvmf, I would like to take a
pause and postpone nvmf discussion till we have fresh numbers and more
understanding.

>   (4) you seem to over-layer your code for reasons that I do not
>   really understand.

Frankly, I did not get this.  Here 'over-layer code' sounds very abstract
to me.

>   And I didn't really look deep at all into the
>   code, just to get the feel of it, and it seems like it needs a lot
>   of work before it can even be considered upstream ready.

Definetly.

Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:

> * Doesn't scale in terms of number of CPUs submitting I/O. The graphs shown
>   during the Vault talk clearly illustrate this. This is probably the result
>   of sharing a data structure across all client CPUs, maybe the bitmap that
>   tracks which parts of the target buffer space are in use.

Probably that was not clear from the presentation, but that exactly what
was fixed by CPU affinity (next slide after those hapless graphs).  What
is left unclear to us is the NUMA effect, which is not related to distances,
reported by numactl.

But still your advice to run everything against perf and check cache misses
is more than valid.  Thanks.

> * Supports IB but none of the other RDMA transports (RoCE / iWARP).

We did a lot of experiments with SoftRoCe and virtualized environment
in order to have quick testing results on one host without hardware.
That works quite well (despite SoftRoCe bugs which were fixed locally).
So should not be a big deal to cover iWARP.

> We also need performance numbers that compare IBNBD against SRP and/or
> NVMeOF with memory registration disabled to see whether and how much faster
> IBNBD is compared to these two protocols.

Indeed, we've never tested against nvmf and that is a must.

Thanks.

--
Roman

^ permalink raw reply

* Re: [PATCH 0/3] blk-throttle: add .low limit fix
From: Jens Axboe @ 2017-03-28 15:58 UTC (permalink / raw)
  To: Shaohua Li, linux-kernel, linux-block
  Cc: axboe, tj, Vivek Goyal, jmoyer, Kernel-team
In-Reply-To: <cover.1490651903.git.shli@fb.com>

On 03/27/2017 04:19 PM, Shaohua Li wrote:
> Jens,
> 
> The 3 patches replace the last two patches (patch 17/18) of the series I sent
> out today. The new patch overloads blk stat as you suggested.

Thanks! With that, I've queued it up for 4.12.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH 01/23] block: renumber REQ_OP_WRITE_ZEROES
From: Bart Van Assche @ 2017-03-28 16:12 UTC (permalink / raw)
  To: agk@redhat.com, lars.ellenberg@linbit.com, snitzer@redhat.com,
	hch@lst.de, martin.petersen@oracle.com,
	philipp.reisner@linbit.com, axboe@kernel.dk, shli@kernel.org
  Cc: linux-scsi@vger.kernel.org, dm-devel@redhat.com,
	drbd-dev@lists.linbit.com, linux-block@vger.kernel.org,
	linux-raid@vger.kernel.org
In-Reply-To: <20170323143341.31549-2-hch@lst.de>

On Thu, 2017-03-23 at 10:33 -0400, Christoph Hellwig wrote:
> Make life easy for implementations that needs to send a data buffer
> to the device (e.g. SCSI) by numbering it as a data out command.
>=20
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  include/linux/blk_types.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>=20
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index d703acb55d0f..6393c13a6498 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -160,7 +160,7 @@ enum req_opf {
>  	/* write the same sector many times */
>  	REQ_OP_WRITE_SAME	=3D 7,
>  	/* write the zero filled sector many times */
> -	REQ_OP_WRITE_ZEROES	=3D 8,
> +	REQ_OP_WRITE_ZEROES	=3D 9,
> =20
>  	/* SCSI passthrough using struct scsi_request */
>  	REQ_OP_SCSI_IN		=3D 32,

Hello Christoph,

Since REQ_OP_WRITE_ZEROES was introduced in kernel v4.10, do we need
"Cc: stable" and "Fixes: a6f0788ec2881" tags for this patch?

Thanks,

Bart.=

^ permalink raw reply

* Re: v4.11-rc blk-mq lockup?
From: Bart Van Assche @ 2017-03-28 16:25 UTC (permalink / raw)
  To: axboe@kernel.dk; +Cc: linux-block@vger.kernel.org
In-Reply-To: <20170328140655.GB27578@kernel.dk>

On Tue, 2017-03-28 at 08:06 -0600, Jens Axboe wrote:
> On Mon, Mar 27 2017, Bart Van Assche wrote:
> > Hello Jens,
> >=20
> > If I leave the srp-test software running for a few minutes using the
> > following command:
> >=20
> > # while ~bart/software/infiniband/srp-test/run_tests -d -r 30; do :; do=
ne
> >=20
> > then after some time the following complaint appears for multiple
> > kworkers:
> >=20
> > INFO: task kworker/9:0:65 blocked for more than 480 seconds.
> > =A0=A0=A0=A0=A0=A0Tainted: G=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0I=A0=A0=A0=A0=
=A04.11.0-rc4-dbg+ #5
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this messag=
e.
> > kworker/9:0=A0=A0=A0=A0=A0D=A0=A0=A0=A00=A0=A0=A0=A065=A0=A0=A0=A0=A0=
=A02 0x00000000
> > Workqueue: dio/dm-0 dio_aio_complete_work
> > Call Trace:
> > =A0__schedule+0x3df/0xc10
> > =A0schedule+0x38/0x90
> > =A0rwsem_down_write_failed+0x2c4/0x4c0
> > =A0call_rwsem_down_write_failed+0x17/0x30
> > =A0down_write+0x5a/0x70
> > =A0__generic_file_fsync+0x43/0x90
> > =A0ext4_sync_file+0x2d0/0x550
> > =A0vfs_fsync_range+0x46/0xa0
> > =A0dio_complete+0x181/0x1b0
> > =A0dio_aio_complete_work+0x17/0x20
> > =A0process_one_work+0x208/0x6a0
> > =A0worker_thread+0x49/0x4a0
> > =A0kthread+0x107/0x140
> > =A0ret_from_fork+0x2e/0x40
> >=20
> > I had not yet observed this behavior with kernel v4.10 or older. If thi=
s
> > happens and I check the queue state with the following script:
>=20
> Can you include the 'state' file in your script?
>=20
> Do you know when this started happening? You say it doesn't happen in
> 4.10, but did it pass earlier in the 4.11-rc cycle?
>=20
> Does it reproduce with dm?
>=20
> I can't tell from your report if this is new in the 4.11 series,
>=20
> > The kernel tree I used in my tests is the result of merging the
> > following commits:
> > * commit 3dca2c2f3d3b=A0from git://git.kernel.dk/linux-block.git
> >   ("Merge branch 'for-4.12/block' into for-next")
> > * commit f88ab0c4b481 from git://git.kernel.org/pub/scm/linux/kernel/gi=
t/mkp/scsi.git
> >   ("scsi: libsas: fix ata xfer length")
> > * commit ad0376eb1483 from git://git.kernel.org/pub/scm/linux/kernel/gi=
t/torvalds/linux.git
> >   ("Merge tag 'edac_for_4.11_2' of git://git.kernel.org/pub/scm/linux/k=
ernel/git/bp/bp")
>=20
> Can we try and isolate it a bit - -rc4 alone, for instance?

Hello Jens,

Sorry but performing a bisect would be hard: without recent SCSI and block
layer fixes this test triggers other failures before the lockup reported in
this e-mail is triggered. See e.g.
https://marc.info/?l=3Dlinux-scsi&m=3D148979716822799.

I do not know whether it would be possible to modify the test such that onl=
y
the dm driver is involved but no SCSI code.

When I reran the test this morning the hang was triggered by the 02-sq-on-m=
q
test. This means that dm was used in blk-sq mode and that blk-mq was used f=
or
the ib_srp SCSI device instances.

Please find below the updated script and its output.

---

#!/bin/bash

show_state() {
=A0=A0=A0=A0local a dev=3D$1

=A0=A0=A0=A0for a in device/state queue/scheduler; do
	[ -e "$dev/$a" ] && grep -aH '' "$dev/$a"
=A0=A0=A0=A0done
}

cd /sys/class/block || exit $?
for dev in *; do
=A0=A0=A0=A0if [ -e "$dev/mq" ]; then
	echo "$dev"
	pending=3D0
	for f in "$dev"/mq/*/{pending,*/rq_list}; do
	=A0=A0=A0=A0[ -e "$f" ] || continue
	=A0=A0=A0=A0if { read -r line1 && read -r line2; } <"$f"; then
		echo "$f"
		echo "$line1 $line2" >/dev/null
		head -n 9 "$f"
		((pending++))
	=A0=A0=A0=A0fi
	done
	(
	=A0=A0=A0=A0busy=3D0
	=A0=A0=A0=A0cd /sys/kernel/debug/block >&/dev/null &&
	=A0=A0=A0=A0for d in "$dev"/mq/*; do
		[ ! -d "$d" ] && continue
		grep -q '^busy=3D0$' "$d/tags" && continue
		((busy++))
	=A0=A0=A0=A0=A0=A0=A0=A0for f in "$d"/{dispatch,tags*,cpu*/rq_list}; do
		=A0=A0=A0=A0[ -e "$f" ] && grep -aH '' "$f"
		done
	=A0=A0=A0=A0done
	=A0=A0=A0=A0exit $busy
	)
	pending=3D$((pending+$?))
	[ "$pending" -gt 0 ] && show_state "$dev"
=A0=A0=A0=A0fi
done

---

sda
sdb
sdc
sdd
sde
sde/mq/2/dispatch:ffff8803f5b65d00 {.cmd_flags=3D0xca01, .rq_flags=3D0x2040=
, .tag=3D37, .internal_tag=3D-1}
sde/mq/2/tags:nr_tags=3D62
sde/mq/2/tags:nr_reserved_tags=3D0
sde/mq/2/tags:active_queues=3D0
sde/mq/2/tags:
sde/mq/2/tags:bitmap_tags:
sde/mq/2/tags:depth=3D62
sde/mq/2/tags:busy=3D31
sde/mq/2/tags:bits_per_word=3D8
sde/mq/2/tags:map_nr=3D8
sde/mq/2/tags:alloc_hint=3D{54, 43, 44, 43, 22, 42, 52, 4, 10, 7, 16, 32, 1=
1, 17, 44, 26, 51, 59, 9, 45, 9, 55, 10, 44, 22, 46, 25, 25, 21, 18, 52, 32=
}
sde/mq/2/tags:wake_batch=3D7
sde/mq/2/tags:wake_index=3D0
sde/mq/2/tags:ws=3D{
sde/mq/2/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sde/mq/2/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sde/mq/2/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sde/mq/2/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sde/mq/2/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sde/mq/2/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sde/mq/2/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sde/mq/2/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sde/mq/2/tags:}
sde/mq/2/tags:round_robin=3D0
sde/mq/2/tags_bitmap:00000000: 7f00 0000 e0ff ff1f
sde/mq/2/cpu9/rq_list:ffff8803f5b67440 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D38, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff8803f5b68b80 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D39, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff8803f5b6a2c0 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D40, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff8803f5b6ba00 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D41, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff8803f5b6d140 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D42, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff8803f5b6e880 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D43, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff880373ac0000 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D44, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff880373ac1740 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D45, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff880373ac2e80 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D46, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff880373ac45c0 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D47, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff880373ac5d00 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D48, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff880373ac7440 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D49, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff880373ac8b80 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D50, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff880373aca2c0 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D51, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff880373acba00 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D52, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff880373acd140 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D53, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff880373ace880 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D54, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff8803f4950000 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D55, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff8803f4951740 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D56, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff8803f4952e80 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D57, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff8803f49545c0 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D58, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff8803f4955d00 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D59, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff8803f4957440 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D60, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff88036bfe0000 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D0, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff88036bfe1740 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D1, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff88036bfe2e80 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D2, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff88036bfe45c0 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D3, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff88036bfe5d00 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D4, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff88036bfe7440 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D5, .internal_tag=3D-1}
sde/mq/2/cpu9/rq_list:ffff88036bfe8b80 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D6, .internal_tag=3D-1}
sde/device/state:running
sde/queue/scheduler:[none]=A0
sdf
sdf/mq/2/tags:nr_tags=3D62
sdf/mq/2/tags:nr_reserved_tags=3D0
sdf/mq/2/tags:active_queues=3D0
sdf/mq/2/tags:
sdf/mq/2/tags:bitmap_tags:
sdf/mq/2/tags:depth=3D62
sdf/mq/2/tags:busy=3D31
sdf/mq/2/tags:bits_per_word=3D8
sdf/mq/2/tags:map_nr=3D8
sdf/mq/2/tags:alloc_hint=3D{54, 43, 44, 43, 22, 42, 52, 4, 10, 7, 16, 32, 1=
1, 17, 44, 26, 51, 59, 9, 45, 9, 55, 10, 44, 22, 46, 25, 25, 21, 18, 52, 32=
}
sdf/mq/2/tags:wake_batch=3D7
sdf/mq/2/tags:wake_index=3D0
sdf/mq/2/tags:ws=3D{
sdf/mq/2/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sdf/mq/2/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sdf/mq/2/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sdf/mq/2/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sdf/mq/2/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sdf/mq/2/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sdf/mq/2/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sdf/mq/2/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sdf/mq/2/tags:}
sdf/mq/2/tags:round_robin=3D0
sdf/mq/2/tags_bitmap:00000000: 7f00 0000 e0ff ff1f
sdf/device/state:running
sdf/queue/scheduler:[none]=A0
sdg
sdh
sdi
sdj
sr0

^ permalink raw reply

* Re: v4.11-rc blk-mq lockup?
From: Jens Axboe @ 2017-03-28 16:30 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block@vger.kernel.org
In-Reply-To: <1490718332.2573.6.camel@sandisk.com>

On 03/28/2017 10:25 AM, Bart Van Assche wrote:
> On Tue, 2017-03-28 at 08:06 -0600, Jens Axboe wrote:
>> On Mon, Mar 27 2017, Bart Van Assche wrote:
>>> Hello Jens,
>>>
>>> If I leave the srp-test software running for a few minutes using the
>>> following command:
>>>
>>> # while ~bart/software/infiniband/srp-test/run_tests -d -r 30; do :; done
>>>
>>> then after some time the following complaint appears for multiple
>>> kworkers:
>>>
>>> INFO: task kworker/9:0:65 blocked for more than 480 seconds.
>>>       Tainted: G          I     4.11.0-rc4-dbg+ #5
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> kworker/9:0     D    0    65      2 0x00000000
>>> Workqueue: dio/dm-0 dio_aio_complete_work
>>> Call Trace:
>>>  __schedule+0x3df/0xc10
>>>  schedule+0x38/0x90
>>>  rwsem_down_write_failed+0x2c4/0x4c0
>>>  call_rwsem_down_write_failed+0x17/0x30
>>>  down_write+0x5a/0x70
>>>  __generic_file_fsync+0x43/0x90
>>>  ext4_sync_file+0x2d0/0x550
>>>  vfs_fsync_range+0x46/0xa0
>>>  dio_complete+0x181/0x1b0
>>>  dio_aio_complete_work+0x17/0x20
>>>  process_one_work+0x208/0x6a0
>>>  worker_thread+0x49/0x4a0
>>>  kthread+0x107/0x140
>>>  ret_from_fork+0x2e/0x40
>>>
>>> I had not yet observed this behavior with kernel v4.10 or older. If this
>>> happens and I check the queue state with the following script:
>>
>> Can you include the 'state' file in your script?
>>
>> Do you know when this started happening? You say it doesn't happen in
>> 4.10, but did it pass earlier in the 4.11-rc cycle?
>>
>> Does it reproduce with dm?
>>
>> I can't tell from your report if this is new in the 4.11 series,
>>
>>> The kernel tree I used in my tests is the result of merging the
>>> following commits:
>>> * commit 3dca2c2f3d3b from git://git.kernel.dk/linux-block.git
>>>   ("Merge branch 'for-4.12/block' into for-next")
>>> * commit f88ab0c4b481 from git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git
>>>   ("scsi: libsas: fix ata xfer length")
>>> * commit ad0376eb1483 from git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>>>   ("Merge tag 'edac_for_4.11_2' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp")
>>
>> Can we try and isolate it a bit - -rc4 alone, for instance?
> 
> Hello Jens,
> 
> Sorry but performing a bisect would be hard: without recent SCSI and block
> layer fixes this test triggers other failures before the lockup reported in
> this e-mail is triggered. See e.g.
> https://marc.info/?l=linux-scsi&m=148979716822799.

Yeah, I realize that. Not necessarily a huge problem. If I can reproduce
it here, then I can poke enough at it to find out wtf is going on here.

> I do not know whether it would be possible to modify the test such that only
> the dm driver is involved but no SCSI code.

How about the other way around? Just SCSI, but no dm?

> When I reran the test this morning the hang was triggered by the 02-sq-on-mq
> test. This means that dm was used in blk-sq mode and that blk-mq was used for
> the ib_srp SCSI device instances.
> 
> Please find below the updated script and its output.

Thanks for running it again, but it's the wrong state file. I should have
been more clear. The one I'm interested in is in the mq/<num>/ directories,
like the 'tags' etc files.

> 
> ---
> 
> #!/bin/bash
> 
> show_state() {
>     local a dev=$1
> 
>     for a in device/state queue/scheduler; do
> 	[ -e "$dev/$a" ] && grep -aH '' "$dev/$a"
>     done
> }
> 
> cd /sys/class/block || exit $?
> for dev in *; do
>     if [ -e "$dev/mq" ]; then
> 	echo "$dev"
> 	pending=0
> 	for f in "$dev"/mq/*/{pending,*/rq_list}; do
> 	    [ -e "$f" ] || continue
> 	    if { read -r line1 && read -r line2; } <"$f"; then
> 		echo "$f"
> 		echo "$line1 $line2" >/dev/null
> 		head -n 9 "$f"
> 		((pending++))
> 	    fi
> 	done
> 	(
> 	    busy=0
> 	    cd /sys/kernel/debug/block >&/dev/null &&
> 	    for d in "$dev"/mq/*; do
> 		[ ! -d "$d" ] && continue
> 		grep -q '^busy=0$' "$d/tags" && continue
> 		((busy++))
> 	        for f in "$d"/{dispatch,tags*,cpu*/rq_list}; do

Ala:


 	        for f in "$d"/{dispatch,state,tags*,cpu*/rq_list}; do

Also, can you include the involved dm devices as well for this state
dump?

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH 12/23] sd: handle REQ_UNMAP
From: Bart Van Assche @ 2017-03-28 16:48 UTC (permalink / raw)
  To: agk@redhat.com, lars.ellenberg@linbit.com, snitzer@redhat.com,
	hch@lst.de, martin.petersen@oracle.com,
	philipp.reisner@linbit.com, axboe@kernel.dk, shli@kernel.org
  Cc: linux-scsi@vger.kernel.org, dm-devel@redhat.com,
	drbd-dev@lists.linbit.com, linux-block@vger.kernel.org,
	linux-raid@vger.kernel.org
In-Reply-To: <20170323143341.31549-13-hch@lst.de>

On Thu, 2017-03-23 at 10:33 -0400, Christoph Hellwig wrote:
> Try to use a write same with unmap bit variant if the device supports it
> and the caller asks for it.
>=20
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/scsi/sd.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>=20
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index b6f70a09a301..ca96bb33471b 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -871,6 +871,16 @@ static int sd_setup_write_zeroes_cmnd(struct scsi_cm=
nd *cmd)
>  			return BLKPREP_INVALID;
>  		return sd_setup_ata_trim_cmnd(cmd);
>  	}
> +
> +	if (rq->cmd_flags & REQ_UNMAP) {
> +		switch (sdkp->provisioning_mode) {
> +		case SD_LBP_WS16:
> +			return sd_setup_write_same16_cmnd(cmd, true);
> +		case SD_LBP_WS10:
> +			return sd_setup_write_same10_cmnd(cmd, true);
> +		}
> +	}
> +
>  	if (sdp->no_write_same)
>  		return BLKPREP_INVALID;
>  	if (sdkp->ws16 || sector > 0xffffffff || nr_sectors > 0xffff)

Users can change the provisioning mode from user space from=A0SD_LBP_WS16 i=
nto
SD_LBP_WS10 so I'm not sure it's safe to skip the (sdkp->ws16 || sector >
0xffffffff || nr_sectors > 0xffff) check if REQ_UNMAP is set.

Bart.=

^ permalink raw reply

* Re: [PATCH 11/23] block_dev: use blkdev_issue_zerout for hole punches
From: Bart Van Assche @ 2017-03-28 16:50 UTC (permalink / raw)
  To: agk@redhat.com, lars.ellenberg@linbit.com, snitzer@redhat.com,
	hch@lst.de, martin.petersen@oracle.com,
	philipp.reisner@linbit.com, axboe@kernel.dk, shli@kernel.org
  Cc: linux-scsi@vger.kernel.org, dm-devel@redhat.com,
	drbd-dev@lists.linbit.com, linux-block@vger.kernel.org,
	linux-raid@vger.kernel.org
In-Reply-To: <20170323143341.31549-12-hch@lst.de>

On Thu, 2017-03-23 at 10:33 -0400, Christoph Hellwig wrote:
> This gets us support for non-discard efficient write of zeroes (e.g. NVMe=
)
> and preparse for removing the discard_zeroes_data flag.

Hello Christoph,

"preparse" probably should have been "prepare"?

Thanks,

Bart.=

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox