Linux block layer
 help / color / mirror / Atom feed
* [PATCH 03/11] bdi: Mark congested->bdi as internal
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

congested->bdi pointer is used only to be able to remove congested
structure from bdi->cgwb_congested_tree on structure release. Moreover
the pointer can become NULL when we unregister the bdi. Rename the field
to __bdi and add a comment to make it more explicit this is internal
stuff of memcg writeback code and people should not use the field as
such use will be likely race prone.

We do not bother with converting congested->bdi to a proper refcounted
reference. It will be slightly ugly to special-case bdi->wb.congested to
avoid effectively a cyclic reference of bdi to itself and the reference
gets cleared from bdi_unregister() making it impossible to reference
a freed bdi.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h |  4 +++-
 mm/backing-dev.c                 | 10 +++++-----
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index ad955817916d..8fb3dcdebc80 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -54,7 +54,9 @@ struct bdi_writeback_congested {
 	atomic_t refcnt;		/* nr of attached wb's and blkg */
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-	struct backing_dev_info *bdi;	/* the associated bdi */
+	struct backing_dev_info *__bdi;	/* the associated bdi, set to NULL
+					 * on bdi unregistration. For memcg-wb
+					 * internal use only! */
 	int blkcg_id;			/* ID of the associated blkcg */
 	struct rb_node rb_node;		/* on bdi->cgwb_congestion_tree */
 #endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c6f2a37028c2..12408f86783c 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -438,7 +438,7 @@ wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp)
 		return NULL;
 
 	atomic_set(&new_congested->refcnt, 0);
-	new_congested->bdi = bdi;
+	new_congested->__bdi = bdi;
 	new_congested->blkcg_id = blkcg_id;
 	goto retry;
 
@@ -466,10 +466,10 @@ void wb_congested_put(struct bdi_writeback_congested *congested)
 	}
 
 	/* bdi might already have been destroyed leaving @congested unlinked */
-	if (congested->bdi) {
+	if (congested->__bdi) {
 		rb_erase(&congested->rb_node,
-			 &congested->bdi->cgwb_congested_tree);
-		congested->bdi = NULL;
+			 &congested->__bdi->cgwb_congested_tree);
+		congested->__bdi = NULL;
 	}
 
 	spin_unlock_irqrestore(&cgwb_lock, flags);
@@ -752,7 +752,7 @@ static void cgwb_bdi_exit(struct backing_dev_info *bdi)
 			rb_entry(rbn, struct bdi_writeback_congested, rb_node);
 
 		rb_erase(rbn, &bdi->cgwb_congested_tree);
-		congested->bdi = NULL;	/* mark @congested unlinked */
+		congested->__bdi = NULL;	/* mark @congested unlinked */
 	}
 	spin_unlock_irq(&cgwb_lock);
 }
-- 
2.10.2

^ permalink raw reply related

* [PATCH 04/11] bdi: Make wb->bdi a proper reference
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

Make wb->bdi a proper refcounted reference to bdi for all bdi_writeback
structures except for the one embedded inside struct backing_dev_info.
That will allow us to simplify bdi unregistration.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/backing-dev.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 12408f86783c..03d4ba27c133 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -294,6 +294,8 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 
 	memset(wb, 0, sizeof(*wb));
 
+	if (wb != &bdi->wb)
+		bdi_get(bdi);
 	wb->bdi = bdi;
 	wb->last_old_flush = jiffies;
 	INIT_LIST_HEAD(&wb->b_dirty);
@@ -314,8 +316,10 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	wb->dirty_sleep = jiffies;
 
 	wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp);
-	if (!wb->congested)
-		return -ENOMEM;
+	if (!wb->congested) {
+		err = -ENOMEM;
+		goto out_put_bdi;
+	}
 
 	err = fprop_local_init_percpu(&wb->completions, gfp);
 	if (err)
@@ -335,6 +339,9 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	fprop_local_destroy_percpu(&wb->completions);
 out_put_cong:
 	wb_congested_put(wb->congested);
+out_put_bdi:
+	if (wb != &bdi->wb)
+		bdi_put(bdi);
 	return err;
 }
 
@@ -372,6 +379,8 @@ static void wb_exit(struct bdi_writeback *wb)
 
 	fprop_local_destroy_percpu(&wb->completions);
 	wb_congested_put(wb->congested);
+	if (wb != &wb->bdi->wb)
+		bdi_put(wb->bdi);
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-- 
2.10.2

^ permalink raw reply related

* [PATCH 06/11] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

Currently we waited for all cgwbs to get freed in cgwb_bdi_destroy()
which also means that writeback has been shutdown on them. Since this
wait is going away, directly shutdown writeback on cgwbs from
cgwb_bdi_destroy() to avoid live writeback structures after
bdi_unregister() has finished. To make that safe with concurrent
shutdown from cgwb_release_workfn(), we also have to make sure
wb_shutdown() returns only after the bdi_writeback structure is really
shutdown.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h |  1 +
 mm/backing-dev.c                 | 22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8fb3dcdebc80..8af720f22a2d 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -21,6 +21,7 @@ struct dentry;
  */
 enum wb_state {
 	WB_registered,		/* bdi_register() was done */
+	WB_shutting_down,	/* wb_shutdown() in progress */
 	WB_writeback_running,	/* Writeback is in progress */
 	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
 };
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e3d56dba4da8..b67be4fc12c4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -356,8 +356,15 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	spin_lock_bh(&wb->work_lock);
 	if (!test_and_clear_bit(WB_registered, &wb->state)) {
 		spin_unlock_bh(&wb->work_lock);
+		/*
+		 * Wait for wb shutdown to finish if someone else is just
+		 * running wb_shutdown(). Otherwise we could proceed to wb /
+		 * bdi destruction before wb_shutdown() is finished.
+		 */
+		wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
 		return;
 	}
+	set_bit(WB_shutting_down, &wb->state);
 	spin_unlock_bh(&wb->work_lock);
 
 	cgwb_remove_from_bdi_list(wb);
@@ -369,6 +376,12 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
 	flush_delayed_work(&wb->dwork);
 	WARN_ON(!list_empty(&wb->work_list));
+	/*
+	 * Make sure bit gets cleared after shutdown is finished. Matches with
+	 * the barrier provided by test_and_clear_bit() above.
+	 */
+	smp_wmb();
+	clear_bit(WB_shutting_down, &wb->state);
 }
 
 static void wb_exit(struct bdi_writeback *wb)
@@ -699,12 +712,21 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 {
 	struct radix_tree_iter iter;
 	void **slot;
+	struct bdi_writeback *wb;
 
 	WARN_ON(test_bit(WB_registered, &bdi->wb.state));
 
 	spin_lock_irq(&cgwb_lock);
 	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
 		cgwb_kill(*slot);
+
+	while (!list_empty(&bdi->wb_list)) {
+		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
+				      bdi_node);
+		spin_unlock_irq(&cgwb_lock);
+		wb_shutdown(wb);
+		spin_lock_irq(&cgwb_lock);
+	}
 	spin_unlock_irq(&cgwb_lock);
 
 	/*
-- 
2.10.2

^ permalink raw reply related

* [PATCH 05/11] bdi: Unify bdi->wb_list handling for root wb_writeback
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

Currently root wb_writeback structure is added to bdi->wb_list in
bdi_init() and never removed. That is different from all other
wb_writeback structures which get added to the list when created and
removed from it before wb_shutdown().

So move list addition of root bdi_writeback to bdi_register() and list
removal of all wb_writeback structures to wb_shutdown(). That way a
wb_writeback structure is on bdi->wb_list if and only if it can handle
writeback and it will make it easier for us to handle shutdown of all
wb_writeback structures in bdi_unregister().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/backing-dev.c | 34 ++++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 03d4ba27c133..e3d56dba4da8 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -345,6 +345,8 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	return err;
 }
 
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb);
+
 /*
  * Remove bdi from the global list and shutdown any threads we have running
  */
@@ -358,6 +360,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	}
 	spin_unlock_bh(&wb->work_lock);
 
+	cgwb_remove_from_bdi_list(wb);
 	/*
 	 * Drain work list and shutdown the delayed_work.  !WB_registered
 	 * tells wb_workfn() that @wb is dying and its work_list needs to
@@ -491,10 +494,6 @@ static void cgwb_release_workfn(struct work_struct *work)
 						release_work);
 	struct backing_dev_info *bdi = wb->bdi;
 
-	spin_lock_irq(&cgwb_lock);
-	list_del_rcu(&wb->bdi_node);
-	spin_unlock_irq(&cgwb_lock);
-
 	wb_shutdown(wb);
 
 	css_put(wb->memcg_css);
@@ -526,6 +525,13 @@ static void cgwb_kill(struct bdi_writeback *wb)
 	percpu_ref_kill(&wb->refcnt);
 }
 
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
+{
+	spin_lock_irq(&cgwb_lock);
+	list_del_rcu(&wb->bdi_node);
+	spin_unlock_irq(&cgwb_lock);
+}
+
 static int cgwb_create(struct backing_dev_info *bdi,
 		       struct cgroup_subsys_state *memcg_css, gfp_t gfp)
 {
@@ -766,6 +772,13 @@ static void cgwb_bdi_exit(struct backing_dev_info *bdi)
 	spin_unlock_irq(&cgwb_lock);
 }
 
+static void cgwb_bdi_register(struct backing_dev_info *bdi)
+{
+	spin_lock_irq(&cgwb_lock);
+	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
+	spin_unlock_irq(&cgwb_lock);
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static int cgwb_bdi_init(struct backing_dev_info *bdi)
@@ -793,6 +806,16 @@ static void cgwb_bdi_exit(struct backing_dev_info *bdi)
 	wb_congested_put(bdi->wb_congested);
 }
 
+static void cgwb_bdi_register(struct backing_dev_info *bdi)
+{
+	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
+}
+
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
+{
+	list_del_rcu(&wb->bdi_node);
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 int bdi_init(struct backing_dev_info *bdi)
@@ -811,8 +834,6 @@ int bdi_init(struct backing_dev_info *bdi)
 
 	ret = cgwb_bdi_init(bdi);
 
-	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
-
 	return ret;
 }
 EXPORT_SYMBOL(bdi_init);
@@ -848,6 +869,7 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 	if (IS_ERR(dev))
 		return PTR_ERR(dev);
 
+	cgwb_bdi_register(bdi);
 	bdi->dev = dev;
 
 	bdi_debug_register(bdi, dev_name(dev));
-- 
2.10.2

^ permalink raw reply related

* [PATCH 09/11] block: Fix oops in locked_inode_to_wb_and_lock_list()
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

When block device is closed, we call inode_detach_wb() in __blkdev_put()
which sets inode->i_wb to NULL. That is contrary to expectations that
inode->i_wb stays valid once set during the whole inode's lifetime and
leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because
inode_to_wb() returned NULL.

The reason why we called inode_detach_wb() is not valid anymore though.
BDI is guaranteed to stay along until we call bdi_put() from
bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
moment.

Also add a warning to catch if someone uses inode_detach_wb() in a
dangerous way.

Reported-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c            | 8 ++------
 include/linux/writeback.h | 1 +
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 5ec8750f5332..c66f5dd4a02a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -885,6 +885,8 @@ static void bdev_evict_inode(struct inode *inode)
 	spin_lock(&bdev_lock);
 	list_del_init(&bdev->bd_list);
 	spin_unlock(&bdev_lock);
+	/* Detach inode from wb early as bdi_put() may free bdi->wb */
+	inode_detach_wb(inode);
 	if (bdev->bd_bdi != &noop_backing_dev_info) {
 		bdi_put(bdev->bd_bdi);
 		bdev->bd_bdi = &noop_backing_dev_info;
@@ -1880,12 +1882,6 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		kill_bdev(bdev);
 
 		bdev_write_inode(bdev);
-		/*
-		 * Detaching bdev inode from its wb in __destroy_inode()
-		 * is too late: the queue which embeds its bdi (along with
-		 * root wb) can be gone as soon as we put_disk() below.
-		 */
-		inode_detach_wb(bdev->bd_inode);
 	}
 	if (bdev->bd_contains == bdev) {
 		if (disk->fops->release)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a3c0cbd7c888..d5815794416c 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -237,6 +237,7 @@ static inline void inode_attach_wb(struct inode *inode, struct page *page)
 static inline void inode_detach_wb(struct inode *inode)
 {
 	if (inode->i_wb) {
+		WARN_ON_ONCE(!(inode->i_state & I_CLEAR));
 		wb_put(inode->i_wb);
 		inode->i_wb = NULL;
 	}
-- 
2.10.2

^ permalink raw reply related

* [PATCH 11/11] block: Fix oops scsi_disk_get()
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

When device open races with device shutdown, we can get the following
oops in scsi_disk_get():

[11863.044351] general protection fault: 0000 [#1] SMP
[11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop]
[11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W      4.10.0-rc2-xen+ #35
[11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[11863.048030] task: ffff88007f438200 task.stack: ffffc90000fd0000
[11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70
[11863.048030] RSP: 0018:ffffc90000fd3a08 EFLAGS: 00010202
[11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007f56d000 RCX: 0000000000000000
[11863.048030] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff81a8d880
[11863.048030] RBP: ffffc90000fd3a18 R08: 0000000000000000 R09: 0000000000000001
[11863.059217] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffa
[11863.059217] R13: ffff880078872800 R14: ffff880070915540 R15: 000000000000001d
[11863.059217] FS:  00007f2611f71800(0000) GS:ffff88007f0c0000(0000) knlGS:0000000000000000
[11863.059217] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11863.059217] CR2: 000000000060e048 CR3: 00000000778d4000 CR4: 00000000000006e0
[11863.059217] Call Trace:
[11863.059217]  ? disk_get_part+0x22/0x1f0
[11863.059217]  sd_open+0x39/0x130
[11863.059217]  __blkdev_get+0x69/0x430
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? bd_acquire+0x96/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_get+0x126/0x350
[11863.059217]  ? _raw_spin_unlock+0x2b/0x40
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_open+0x65/0x80
...

As you can see RAX value is already poisoned showing that gendisk we got
is already freed. The problem is that get_gendisk() looks up device
number in ext_devt_idr and then does get_disk() which does kobject_get()
on the disks kobject. However the disk gets removed from ext_devt_idr
only in disk_release() (through blk_free_devt()) at which moment it has
already 0 refcount and is already on its way to be freed. Indeed we've
got a warning from kobject_get() about 0 refcount shortly before the
oops.

We fix the problem by using kobject_get_unless_zero() in get_disk() so
that get_disk() cannot get reference on a disk that is already being
freed.

Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/genhd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index a9c516a8b37d..510aac1486cb 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1352,7 +1352,7 @@ struct kobject *get_disk(struct gendisk *disk)
 	owner = disk->fops->owner;
 	if (owner && !try_module_get(owner))
 		return NULL;
-	kobj = kobject_get(&disk_to_dev(disk)->kobj);
+	kobj = kobject_get_unless_zero(&disk_to_dev(disk)->kobj);
 	if (kobj == NULL) {
 		module_put(owner);
 		return NULL;
-- 
2.10.2

^ permalink raw reply related

* [PATCH 10/11] kobject: Export kobject_get_unless_zero()
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara, Greg Kroah-Hartman
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

Make the function available for outside use and fortify it against NULL
kobject.

CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/kobject.h | 2 ++
 lib/kobject.c           | 5 ++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index e6284591599e..ca85cb80e99a 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -108,6 +108,8 @@ extern int __must_check kobject_rename(struct kobject *, const char *new_name);
 extern int __must_check kobject_move(struct kobject *, struct kobject *);
 
 extern struct kobject *kobject_get(struct kobject *kobj);
+extern struct kobject * __must_check kobject_get_unless_zero(
+						struct kobject *kobj);
 extern void kobject_put(struct kobject *kobj);
 
 extern const void *kobject_namespace(struct kobject *kobj);
diff --git a/lib/kobject.c b/lib/kobject.c
index 445dcaeb0f56..763d70a18941 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -601,12 +601,15 @@ struct kobject *kobject_get(struct kobject *kobj)
 }
 EXPORT_SYMBOL(kobject_get);
 
-static struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
+struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
 {
+	if (!kobj)
+		return NULL;
 	if (!kref_get_unless_zero(&kobj->kref))
 		kobj = NULL;
 	return kobj;
 }
+EXPORT_SYMBOL(kobject_get_unless_zero);
 
 /*
  * kobject_cleanup - free kobject resources.
-- 
2.10.2

^ permalink raw reply related

* [PATCH 08/11] bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister()
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() as it gets called
from bdi_unregister() which is not necessarily called from bdi_destroy()
and thus the name is somewhat misleading.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/backing-dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8c30b1a7aae5..3ea3bbd921d6 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -700,7 +700,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 	return ret;
 }
 
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 {
 	struct radix_tree_iter iter;
 	void **slot;
@@ -801,7 +801,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 	return 0;
 }
 
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi) { }
 
 static void cgwb_bdi_exit(struct backing_dev_info *bdi)
 {
@@ -925,7 +925,7 @@ void bdi_unregister(struct backing_dev_info *bdi)
 	/* make sure nobody finds us on the bdi_list anymore */
 	bdi_remove_from_list(bdi);
 	wb_shutdown(&bdi->wb);
-	cgwb_bdi_destroy(bdi);
+	cgwb_bdi_unregister(bdi);
 
 	if (bdi->dev) {
 		bdi_debug_unregister(bdi);
-- 
2.10.2

^ permalink raw reply related

* [PATCH 07/11] bdi: Do not wait for cgwbs release in bdi_unregister()
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

Currently we wait for all cgwbs to get released in cgwb_bdi_destroy()
(called from bdi_unregister()). That is however unnecessary now when
cgwb->bdi is a proper refcounted reference (thus bdi cannot get
released before all cgwbs are released) and when cgwb_bdi_destroy()
shuts down writeback directly.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h |  1 -
 mm/backing-dev.c                 | 22 +---------------------
 2 files changed, 1 insertion(+), 22 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8af720f22a2d..e66d4722db8e 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -164,7 +164,6 @@ struct backing_dev_info {
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
 	struct rb_root cgwb_congested_tree; /* their congested states */
-	atomic_t usage_cnt; /* counts both cgwbs and cgwb_contested's */
 #else
 	struct bdi_writeback_congested *wb_congested;
 #endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index b67be4fc12c4..8c30b1a7aae5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -406,11 +406,9 @@ static void wb_exit(struct bdi_writeback *wb)
 /*
  * cgwb_lock protects bdi->cgwb_tree, bdi->cgwb_congested_tree,
  * blkcg->cgwb_list, and memcg->cgwb_list.  bdi->cgwb_tree is also RCU
- * protected.  cgwb_release_wait is used to wait for the completion of cgwb
- * releases from bdi destruction path.
+ * protected.
  */
 static DEFINE_SPINLOCK(cgwb_lock);
-static DECLARE_WAIT_QUEUE_HEAD(cgwb_release_wait);
 
 /**
  * wb_congested_get_create - get or create a wb_congested
@@ -505,7 +503,6 @@ static void cgwb_release_workfn(struct work_struct *work)
 {
 	struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
 						release_work);
-	struct backing_dev_info *bdi = wb->bdi;
 
 	wb_shutdown(wb);
 
@@ -516,9 +513,6 @@ static void cgwb_release_workfn(struct work_struct *work)
 	percpu_ref_exit(&wb->refcnt);
 	wb_exit(wb);
 	kfree_rcu(wb, rcu);
-
-	if (atomic_dec_and_test(&bdi->usage_cnt))
-		wake_up_all(&cgwb_release_wait);
 }
 
 static void cgwb_release(struct percpu_ref *refcnt)
@@ -608,7 +602,6 @@ static int cgwb_create(struct backing_dev_info *bdi,
 		/* we might have raced another instance of this function */
 		ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb);
 		if (!ret) {
-			atomic_inc(&bdi->usage_cnt);
 			list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);
 			list_add(&wb->memcg_node, memcg_cgwb_list);
 			list_add(&wb->blkcg_node, blkcg_cgwb_list);
@@ -698,7 +691,6 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 
 	INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
 	bdi->cgwb_congested_tree = RB_ROOT;
-	atomic_set(&bdi->usage_cnt, 1);
 
 	ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
 	if (!ret) {
@@ -728,18 +720,6 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 		spin_lock_irq(&cgwb_lock);
 	}
 	spin_unlock_irq(&cgwb_lock);
-
-	/*
-	 * All cgwb's must be shutdown and released before returning.  Drain
-	 * the usage counter to wait for all cgwb's ever created on @bdi.
-	 */
-	atomic_dec(&bdi->usage_cnt);
-	wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt));
-	/*
-	 * Grab back our reference so that we hold it when @bdi gets
-	 * re-registered.
-	 */
-	atomic_inc(&bdi->usage_cnt);
 }
 
 /**
-- 
2.10.2

^ permalink raw reply related

* [PATCH 0/11 v4] block: Fix block device shutdown related races
From: Jan Kara @ 2017-03-13 15:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara

Hello,

this is a series with the remaining patches (on top of 4.11-rc2) to fix several
different races and issues I've found when testing device shutdown and reuse.
The first two patches fix possible (theoretical) problems when opening of a
block device races with shutdown of a gendisk structure. Patches 3-9 fix oops
that is triggered by __blkdev_put() calling inode_detach_wb() too early (the
problem reported by Thiago). Patches 10 and 11 fix oops due to a bug in gendisk
code where get_gendisk() can return already freed gendisk structure (again
triggered by Omar's stress test).

People, please have a look at patches. They are mostly simple however the
interactions are rather complex so I may have missed something. Also I'm
happy for any additional testing these patches can get - I've stressed them
with Omar's script, tested memcg writeback, tested static (not udev managed)
device inodes.

Changes since v3:
* Rebased on top of 4.11-rc2
* Reworked patch 2 (block: Fix race of bdev open with gendisk shutdown) based
  on Tejun's feedback
* Significantly updated patch 5 (and dropped previous Tejun's ack) to
  accommodate for fixes to SCSI re-registration of BDI that went to 4.11-rc2

Changes since v2:
* Added Tejun's acks
* Rebased on top of 4.11-rc1
* Fixed two possible races between blkdev_open() and del_gendisk()
* Fixed possible race between concurrent shutdown of cgwb spotted by Tejun

Changes since v1:
* Added Acks and Tested-by tags for patches in areas that did not change
* Reworked inode_detach_wb() related fixes based on Tejun's feedback

								Honza

^ permalink raw reply

* NULL deref in cpu hot unplug on jens for-linus branch
From: Sagi Grimberg @ 2017-03-13 15:24 UTC (permalink / raw)
  To: linux-block@vger.kernel.org, Jens Axboe, linux-nvme

Hey Jens,

After some fixes to nvme-rdma in the area of cpu hot unplug and
rebase to jens for-linus branch I get the following NULL deref [1]

This crash did not happen before I rebased to for-linus (unless I
screwed up something).

I'm on my way out so I just send it out in hope that someone can
figure it out before I do...

After I offlined a cpu, I got the nvmf target to disconnect
from the host, the host then schedules a reconnect. after the
host reconnects it issues a namespace scanning which removes
an old namespace. Then we get to blk_cleanup_queue which
then triggers the NULL deref.

The strange thing is that we pass the
  (blk_mq_hw_queue_mapped(hctx)) condition but still hit a NULL...

[1]
--
[   55.865818] BUG: unable to handle kernel NULL pointer dereference at 
0000000000000008
[   55.867094] IP: __blk_mq_tag_idle+0x19/0x30
[   55.867825] PGD 0

[   55.868477] Oops: 0002 [#1] SMP
[   55.869010] Modules linked in: nvme_rdma nvme_fabrics nvme_core 
mlx5_ib ppdev kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper 
cryptd joydev input_leds serio_raw i2c_piix4 parport_pc parport mac_hid 
ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp 
libiscsi sunrpc scsi_transport_iscsi autofs4 cirrus ttm drm_kms_helper 
syscopyarea sysfillrect sysimgblt mlx5_core fb_sys_fops ptp psmouse drm 
floppy pps_core pata_acpi
[   55.876358] CPU: 0 PID: 21 Comm: kworker/0:1 Not tainted 4.11.0-rc1+ #136
[   55.877492] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   55.879055] Workqueue: events nvme_scan_work [nvme_core]
[   55.879940] task: ffffa0b13e1d9080 task.stack: ffffad2000244000
[   55.880921] RIP: 0010:__blk_mq_tag_idle+0x19/0x30
[   55.881713] RSP: 0018:ffffad2000247c70 EFLAGS: 00010203
[   55.882582] RAX: 0000000000000000 RBX: ffffa0b13376f400 RCX: 
ffffa0b13fc11d00
[   55.883808] RDX: 0000000000000001 RSI: ffffa0b13376f400 RDI: 
ffffa0b13376f400
[   55.884983] RBP: ffffad2000247c70 R08: 0000000000000000 R09: 
ffffffffbee42e20
[   55.886168] R10: ffffad2000247b88 R11: 0000000000000008 R12: 
ffffa0b1384c6018
[   55.887343] R13: 0000000000000001 R14: 0000000000000080 R15: 
0000000000000000
[   55.888517] FS:  0000000000000000(0000) GS:ffffa0b13fc00000(0000) 
knlGS:0000000000000000
[   55.889816] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   55.890738] CR2: 0000000000000008 CR3: 000000003ba2f000 CR4: 
00000000003406f0
[   55.891878] Call Trace:
[   55.892285]  blk_mq_exit_hctx.isra.41+0xc4/0xd0
[   55.893020]  blk_mq_free_queue+0x110/0x130
[   55.893693]  blk_cleanup_queue+0xe0/0x150
[   55.894346]  nvme_ns_remove+0x78/0xd0 [nvme_core]
[   55.895109]  nvme_validate_ns+0x8c/0x290 [nvme_core]
[   55.895911]  ? nvme_scan_work+0x28a/0x370 [nvme_core]
[   55.896726]  nvme_scan_work+0x2ad/0x370 [nvme_core]
[   55.897523]  process_one_work+0x16b/0x480
[   55.898174]  worker_thread+0x4b/0x500
[   55.898771]  kthread+0x101/0x140
[   55.899299]  ? process_one_work+0x480/0x480
[   55.899977]  ? kthread_create_on_node+0x40/0x40
[   55.900711]  ? start_kernel+0x3bc/0x461
[   55.901336]  ? acpi_early_init+0x83/0xf9
[   55.901980]  ? acpi_load_tables+0x31/0x85
[   55.902632]  ret_from_fork+0x2c/0x40
[   55.903215] Code: 74 09 48 8d 7b 48 e8 67 4b 06 00 5b 41 5c 5d c3 66 
90 0f 1f 44 00 00 48 8b 87 08 01 00 00 f0 0f ba 77 18 01 72 01 c3 55 48 
89 e5 <f0> ff 48 08 48 8d 78 10 e8 3a 4b 06 00 5d c3 0f 1f 84 00 00 00
[   55.906220] RIP: __blk_mq_tag_idle+0x19/0x30 RSP: ffffad2000247c70
[   55.907209] CR2: 0000000000000008
[   55.907750] ---[ end trace f016dee1082237cf ]---
--

^ permalink raw reply

* Re: NULL deref in cpu hot unplug on jens for-linus branch
From: Jens Axboe @ 2017-03-13 15:42 UTC (permalink / raw)
  To: Sagi Grimberg, linux-block@vger.kernel.org, linux-nvme
In-Reply-To: <a4e0cb7c-c779-b681-e66c-2159c5f2b09f@grimberg.me>

On 03/13/2017 09:24 AM, Sagi Grimberg wrote:
> Hey Jens,
> 
> After some fixes to nvme-rdma in the area of cpu hot unplug and
> rebase to jens for-linus branch I get the following NULL deref [1]
> 
> This crash did not happen before I rebased to for-linus (unless I
> screwed up something).
> 
> I'm on my way out so I just send it out in hope that someone can
> figure it out before I do...
> 
> After I offlined a cpu, I got the nvmf target to disconnect
> from the host, the host then schedules a reconnect. after the
> host reconnects it issues a namespace scanning which removes
> an old namespace. Then we get to blk_cleanup_queue which
> then triggers the NULL deref.
> 
> The strange thing is that we pass the
>   (blk_mq_hw_queue_mapped(hctx)) condition but still hit a NULL...
> 
> [1]
> --
> [   55.865818] BUG: unable to handle kernel NULL pointer dereference at 
> 0000000000000008
> [   55.867094] IP: __blk_mq_tag_idle+0x19/0x30
> [   55.867825] PGD 0
> 
> [   55.868477] Oops: 0002 [#1] SMP
> [   55.869010] Modules linked in: nvme_rdma nvme_fabrics nvme_core 
> mlx5_ib ppdev kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul 
> ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper 
> cryptd joydev input_leds serio_raw i2c_piix4 parport_pc parport mac_hid 
> ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp 
> libiscsi sunrpc scsi_transport_iscsi autofs4 cirrus ttm drm_kms_helper 
> syscopyarea sysfillrect sysimgblt mlx5_core fb_sys_fops ptp psmouse drm 
> floppy pps_core pata_acpi
> [   55.876358] CPU: 0 PID: 21 Comm: kworker/0:1 Not tainted 4.11.0-rc1+ #136
> [   55.877492] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [   55.879055] Workqueue: events nvme_scan_work [nvme_core]
> [   55.879940] task: ffffa0b13e1d9080 task.stack: ffffad2000244000
> [   55.880921] RIP: 0010:__blk_mq_tag_idle+0x19/0x30
> [   55.881713] RSP: 0018:ffffad2000247c70 EFLAGS: 00010203
> [   55.882582] RAX: 0000000000000000 RBX: ffffa0b13376f400 RCX: 
> ffffa0b13fc11d00
> [   55.883808] RDX: 0000000000000001 RSI: ffffa0b13376f400 RDI: 
> ffffa0b13376f400
> [   55.884983] RBP: ffffad2000247c70 R08: 0000000000000000 R09: 
> ffffffffbee42e20
> [   55.886168] R10: ffffad2000247b88 R11: 0000000000000008 R12: 
> ffffa0b1384c6018
> [   55.887343] R13: 0000000000000001 R14: 0000000000000080 R15: 
> 0000000000000000
> [   55.888517] FS:  0000000000000000(0000) GS:ffffa0b13fc00000(0000) 
> knlGS:0000000000000000
> [   55.889816] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   55.890738] CR2: 0000000000000008 CR3: 000000003ba2f000 CR4: 
> 00000000003406f0
> [   55.891878] Call Trace:
> [   55.892285]  blk_mq_exit_hctx.isra.41+0xc4/0xd0
> [   55.893020]  blk_mq_free_queue+0x110/0x130
> [   55.893693]  blk_cleanup_queue+0xe0/0x150
> [   55.894346]  nvme_ns_remove+0x78/0xd0 [nvme_core]
> [   55.895109]  nvme_validate_ns+0x8c/0x290 [nvme_core]
> [   55.895911]  ? nvme_scan_work+0x28a/0x370 [nvme_core]
> [   55.896726]  nvme_scan_work+0x2ad/0x370 [nvme_core]
> [   55.897523]  process_one_work+0x16b/0x480
> [   55.898174]  worker_thread+0x4b/0x500
> [   55.898771]  kthread+0x101/0x140
> [   55.899299]  ? process_one_work+0x480/0x480
> [   55.899977]  ? kthread_create_on_node+0x40/0x40
> [   55.900711]  ? start_kernel+0x3bc/0x461
> [   55.901336]  ? acpi_early_init+0x83/0xf9
> [   55.901980]  ? acpi_load_tables+0x31/0x85
> [   55.902632]  ret_from_fork+0x2c/0x40
> [   55.903215] Code: 74 09 48 8d 7b 48 e8 67 4b 06 00 5b 41 5c 5d c3 66 
> 90 0f 1f 44 00 00 48 8b 87 08 01 00 00 f0 0f ba 77 18 01 72 01 c3 55 48 
> 89 e5 <f0> ff 48 08 48 8d 78 10 e8 3a 4b 06 00 5d c3 0f 1f 84 00 00 00
> [   55.906220] RIP: __blk_mq_tag_idle+0x19/0x30 RSP: ffffad2000247c70
> [   55.907209] CR2: 0000000000000008
> [   55.907750] ---[ end trace f016dee1082237cf ]---

Are you saying your code works on top of 4.11-rc2, but not on top of my
for-linus? That seems odd. Looking at the oops, you are crashing with
!tags in __blk_mq_tag_idle. The below should work around it, but I'm
puzzled why this is new. Is it related to the other path you fixed in
this patch:

commit 0067d4b020ea07a58540acb2c5fcd3364bf326e0
Author: Sagi Grimberg <sagi@grimberg.me>
Date:   Mon Mar 13 16:10:11 2017 +0200

    blk-mq: Fix tagset reinit in the presence of cpu hot-unplug

Since that's also handling hctx->tags == NULL.


diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9d97bfc4d465..1283f74bfdfb 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -54,9 +54,11 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 	if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
 		return;
 
-	atomic_dec(&tags->active_queues);
+	if (tags) {
+		atomic_dec(&tags->active_queues);
 
-	blk_mq_tag_wakeup_all(tags, false);
+		blk_mq_tag_wakeup_all(tags, false);
+	}
 }
 
 /*

-- 
Jens Axboe

^ permalink raw reply related

* unify and streamline the blk-mq make_request implementations
From: Christoph Hellwig @ 2017-03-13 15:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-block

A bunch of cleanups to get us a nice I/O submission path.

^ permalink raw reply

* [PATCH 1/4] blk-mq: remove BLK_MQ_F_DEFER_ISSUE
From: Christoph Hellwig @ 2017-03-13 15:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-block
In-Reply-To: <20170313154833.14165-1-hch@lst.de>

This flag was never used since it was introduced.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c         | 8 +-------
 include/linux/blk-mq.h | 1 -
 2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 159187a28d66..acf0ddf4af52 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1534,13 +1534,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	}
 
 	plug = current->plug;
-	/*
-	 * If the driver supports defer issued based on 'last', then
-	 * queue it up like normal since we can potentially save some
-	 * CPU this way.
-	 */
-	if (((plug && !blk_queue_nomerges(q)) || is_sync) &&
-	    !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
+	if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
 		struct request *old_rq = NULL;
 
 		blk_mq_bio_to_request(rq, bio);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b296a9006117..5b3e201c8d4f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -152,7 +152,6 @@ enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
 	BLK_MQ_F_SG_MERGE	= 1 << 2,
-	BLK_MQ_F_DEFER_ISSUE	= 1 << 4,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
-- 
2.11.0

^ permalink raw reply related

* [PATCH 2/4] blk-mq: merge mq and sq make_request instances
From: Christoph Hellwig @ 2017-03-13 15:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-block
In-Reply-To: <20170313154833.14165-1-hch@lst.de>

They are mostly the same code anyway - this just one small conditional
for the plug case that is different for both variants.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 164 +++++++++++----------------------------------------------
 1 file changed, 31 insertions(+), 133 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index acf0ddf4af52..53e49a3f6f0a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1478,11 +1478,6 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
 	blk_mq_sched_insert_request(rq, false, true, true, false);
 }
 
-/*
- * Multiple hardware queue variant. This will not use per-process plugs,
- * but will attempt to bypass the hctx queueing if we can go straight to
- * hardware for SYNC IO.
- */
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 {
 	const int is_sync = op_is_sync(bio->bi_opf);
@@ -1534,7 +1529,36 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	}
 
 	plug = current->plug;
-	if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
+	if (plug && q->nr_hw_queues == 1) {
+		struct request *last = NULL;
+
+		blk_mq_bio_to_request(rq, bio);
+
+		/*
+		 * @request_count may become stale because of schedule
+		 * out, so check the list again.
+		 */
+		if (list_empty(&plug->mq_list))
+			request_count = 0;
+		else if (blk_queue_nomerges(q))
+			request_count = blk_plug_queued_count(q);
+
+		if (!request_count)
+			trace_block_plug(q);
+		else
+			last = list_entry_rq(plug->mq_list.prev);
+
+		blk_mq_put_ctx(data.ctx);
+
+		if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
+		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
+			blk_flush_plug_list(plug, false);
+			trace_block_plug(q);
+		}
+
+		list_add_tail(&rq->queuelist, &plug->mq_list);
+		goto done;
+	} else if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
 		struct request *old_rq = NULL;
 
 		blk_mq_bio_to_request(rq, bio);
@@ -1596,119 +1620,6 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	return cookie;
 }
 
-/*
- * Single hardware queue variant. This will attempt to use any per-process
- * plug for merging and IO deferral.
- */
-static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
-{
-	const int is_sync = op_is_sync(bio->bi_opf);
-	const int is_flush_fua = op_is_flush(bio->bi_opf);
-	struct blk_plug *plug;
-	unsigned int request_count = 0;
-	struct blk_mq_alloc_data data = { .flags = 0 };
-	struct request *rq;
-	blk_qc_t cookie;
-	unsigned int wb_acct;
-
-	blk_queue_bounce(q, &bio);
-
-	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
-		bio_io_error(bio);
-		return BLK_QC_T_NONE;
-	}
-
-	blk_queue_split(q, &bio, q->bio_split);
-
-	if (!is_flush_fua && !blk_queue_nomerges(q)) {
-		if (blk_attempt_plug_merge(q, bio, &request_count, NULL))
-			return BLK_QC_T_NONE;
-	} else
-		request_count = blk_plug_queued_count(q);
-
-	if (blk_mq_sched_bio_merge(q, bio))
-		return BLK_QC_T_NONE;
-
-	wb_acct = wbt_wait(q->rq_wb, bio, NULL);
-
-	trace_block_getrq(q, bio, bio->bi_opf);
-
-	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
-	if (unlikely(!rq)) {
-		__wbt_done(q->rq_wb, wb_acct);
-		return BLK_QC_T_NONE;
-	}
-
-	wbt_track(&rq->issue_stat, wb_acct);
-
-	cookie = request_to_qc_t(data.hctx, rq);
-
-	if (unlikely(is_flush_fua)) {
-		if (q->elevator)
-			goto elv_insert;
-		blk_mq_bio_to_request(rq, bio);
-		blk_insert_flush(rq);
-		goto run_queue;
-	}
-
-	/*
-	 * A task plug currently exists. Since this is completely lockless,
-	 * utilize that to temporarily store requests until the task is
-	 * either done or scheduled away.
-	 */
-	plug = current->plug;
-	if (plug) {
-		struct request *last = NULL;
-
-		blk_mq_bio_to_request(rq, bio);
-
-		/*
-		 * @request_count may become stale because of schedule
-		 * out, so check the list again.
-		 */
-		if (list_empty(&plug->mq_list))
-			request_count = 0;
-		if (!request_count)
-			trace_block_plug(q);
-		else
-			last = list_entry_rq(plug->mq_list.prev);
-
-		blk_mq_put_ctx(data.ctx);
-
-		if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
-		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
-			blk_flush_plug_list(plug, false);
-			trace_block_plug(q);
-		}
-
-		list_add_tail(&rq->queuelist, &plug->mq_list);
-		return cookie;
-	}
-
-	if (q->elevator) {
-elv_insert:
-		blk_mq_put_ctx(data.ctx);
-		blk_mq_bio_to_request(rq, bio);
-		blk_mq_sched_insert_request(rq, false, true,
-						!is_sync || is_flush_fua, true);
-		goto done;
-	}
-	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
-		/*
-		 * For a SYNC request, send it to the hardware immediately. For
-		 * an ASYNC request, just ensure that we run it later on. The
-		 * latter allows for merging opportunities and more efficient
-		 * dispatching.
-		 */
-run_queue:
-		blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
-	}
-
-	blk_mq_put_ctx(data.ctx);
-done:
-	return cookie;
-}
-
 void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx)
 {
@@ -2366,10 +2277,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	INIT_LIST_HEAD(&q->requeue_list);
 	spin_lock_init(&q->requeue_lock);
 
-	if (q->nr_hw_queues > 1)
-		blk_queue_make_request(q, blk_mq_make_request);
-	else
-		blk_queue_make_request(q, blk_sq_make_request);
+	blk_queue_make_request(q, blk_mq_make_request);
 
 	/*
 	 * Do this after blk_queue_make_request() overrides it...
@@ -2717,16 +2625,6 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 	set->nr_hw_queues = nr_hw_queues;
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_realloc_hw_ctxs(set, q);
-
-		/*
-		 * Manually set the make_request_fn as blk_queue_make_request
-		 * resets a lot of the queue settings.
-		 */
-		if (q->nr_hw_queues > 1)
-			q->make_request_fn = blk_mq_make_request;
-		else
-			q->make_request_fn = blk_sq_make_request;
-
 		blk_mq_queue_reinit(q, cpu_online_mask);
 	}
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH 3/4] blk-mq: improve blk_mq_try_issue_directly
From: Christoph Hellwig @ 2017-03-13 15:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-block
In-Reply-To: <20170313154833.14165-1-hch@lst.de>

Rename blk_mq_try_issue_directly to __blk_mq_try_issue_directly and add a
new wrapper that takes care of RCU / SRCU locking to avoid having
boileplate code in the caller which would get duplicated with new callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 53e49a3f6f0a..48748cb799ed 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1434,7 +1434,7 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
 	return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
 }
 
-static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
+static void __blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
 {
 	struct request_queue *q = rq->q;
 	struct blk_mq_queue_data bd = {
@@ -1478,13 +1478,27 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
 	blk_mq_sched_insert_request(rq, false, true, true, false);
 }
 
+static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, blk_qc_t *cookie)
+{
+	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
+		rcu_read_lock();
+		__blk_mq_try_issue_directly(rq, cookie);
+		rcu_read_unlock();
+	} else {
+		unsigned int srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu);
+		__blk_mq_try_issue_directly(rq, cookie);
+		srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx);
+	}
+}
+
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 {
 	const int is_sync = op_is_sync(bio->bi_opf);
 	const int is_flush_fua = op_is_flush(bio->bi_opf);
 	struct blk_mq_alloc_data data = { .flags = 0 };
 	struct request *rq;
-	unsigned int request_count = 0, srcu_idx;
+	unsigned int request_count = 0;
 	struct blk_plug *plug;
 	struct request *same_queue_rq = NULL;
 	blk_qc_t cookie;
@@ -1582,18 +1596,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		} else /* is_sync */
 			old_rq = rq;
 		blk_mq_put_ctx(data.ctx);
-		if (!old_rq)
-			goto done;
-
-		if (!(data.hctx->flags & BLK_MQ_F_BLOCKING)) {
-			rcu_read_lock();
-			blk_mq_try_issue_directly(old_rq, &cookie);
-			rcu_read_unlock();
-		} else {
-			srcu_idx = srcu_read_lock(&data.hctx->queue_rq_srcu);
-			blk_mq_try_issue_directly(old_rq, &cookie);
-			srcu_read_unlock(&data.hctx->queue_rq_srcu, srcu_idx);
-		}
+		if (old_rq)
+			blk_mq_try_issue_directly(data.hctx, old_rq, &cookie);
 		goto done;
 	}
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH 4/4] blk-mq: streamline blk_mq_make_request
From: Christoph Hellwig @ 2017-03-13 15:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-block
In-Reply-To: <20170313154833.14165-1-hch@lst.de>

Turn the different ways of merging or issuing I/O into a series of if/else
statements instead of the current maze of gotos.  Note that this means we
pin the CPU a little longer for some cases as the CTX put is moved to
common code at the end of the function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 67 +++++++++++++++++++++++-----------------------------------
 1 file changed, 27 insertions(+), 40 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 48748cb799ed..18e449cc832f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1534,16 +1534,17 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 
 	cookie = request_to_qc_t(data.hctx, rq);
 
+	plug = current->plug;
 	if (unlikely(is_flush_fua)) {
-		if (q->elevator)
-			goto elv_insert;
 		blk_mq_bio_to_request(rq, bio);
-		blk_insert_flush(rq);
-		goto run_queue;
-	}
-
-	plug = current->plug;
-	if (plug && q->nr_hw_queues == 1) {
+		if (q->elevator) {
+			blk_mq_sched_insert_request(rq, false, true,
+						!is_sync || is_flush_fua, true);
+		} else {
+			blk_insert_flush(rq);
+			blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
+		}
+	} else if (plug && q->nr_hw_queues == 1) {
 		struct request *last = NULL;
 
 		blk_mq_bio_to_request(rq, bio);
@@ -1562,8 +1563,6 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		else
 			last = list_entry_rq(plug->mq_list.prev);
 
-		blk_mq_put_ctx(data.ctx);
-
 		if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
 		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
 			blk_flush_plug_list(plug, false);
@@ -1571,56 +1570,44 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		}
 
 		list_add_tail(&rq->queuelist, &plug->mq_list);
-		goto done;
-	} else if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
-		struct request *old_rq = NULL;
-
+	} else if (plug && !blk_queue_nomerges(q)) {
 		blk_mq_bio_to_request(rq, bio);
 
 		/*
 		 * We do limited plugging. If the bio can be merged, do that.
 		 * Otherwise the existing request in the plug list will be
 		 * issued. So the plug list will have one request at most
+		 *
+		 * The plug list might get flushed before this. If that happens,
+		 * the plug list is emptry and same_queue_rq is invalid.
 		 */
-		if (plug) {
-			/*
-			 * The plug list might get flushed before this. If that
-			 * happens, same_queue_rq is invalid and plug list is
-			 * empty
-			 */
-			if (same_queue_rq && !list_empty(&plug->mq_list)) {
-				old_rq = same_queue_rq;
-				list_del_init(&old_rq->queuelist);
-			}
-			list_add_tail(&rq->queuelist, &plug->mq_list);
-		} else /* is_sync */
-			old_rq = rq;
-		blk_mq_put_ctx(data.ctx);
-		if (old_rq)
-			blk_mq_try_issue_directly(data.hctx, old_rq, &cookie);
-		goto done;
-	}
+		if (!list_empty(&plug->mq_list))
+			list_del_init(&same_queue_rq->queuelist);
+		else
+			same_queue_rq = NULL;
 
-	if (q->elevator) {
-elv_insert:
-		blk_mq_put_ctx(data.ctx);
+		list_add_tail(&rq->queuelist, &plug->mq_list);
+		if (same_queue_rq)
+			blk_mq_try_issue_directly(data.hctx, same_queue_rq,
+					&cookie);
+	} else if (is_sync) {
+		blk_mq_bio_to_request(rq, bio);
+		blk_mq_try_issue_directly(data.hctx, rq, &cookie);
+	} else if (q->elevator) {
 		blk_mq_bio_to_request(rq, bio);
 		blk_mq_sched_insert_request(rq, false, true,
 						!is_sync || is_flush_fua, true);
-		goto done;
-	}
-	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
+	} else if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
 		/*
 		 * For a SYNC request, send it to the hardware immediately. For
 		 * an ASYNC request, just ensure that we run it later on. The
 		 * latter allows for merging opportunities and more efficient
 		 * dispatching.
 		 */
-run_queue:
 		blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
 	}
+
 	blk_mq_put_ctx(data.ctx);
-done:
 	return cookie;
 }
 
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH v5] blkcg: allocate struct blkcg_gq outside request queue spinlock
From: Tahsin Erdogan @ 2017-03-13 16:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Tejun Heo, linux-block, David Rientjes, linux-kernel
In-Reply-To: <c514c076-18ea-8605-d439-e46730aa29e0@kernel.dk>

>> Do you mean, you prefer the approach that was taken in v1 patch or
>> something else?
>
> I can no longer find v1 of the patch, just v2 and on. Can you send a
> link to it?

https://lkml.org/lkml/2017/2/28/8

^ permalink raw reply

* Re: 4.11.0-rc1 boot resulted in WARNING: CPU: 14 PID: 1722 at fs/sysfs/dir.c:31 .sysfs_warn_dup+0x78/0xb0
From: Abdul Haleem @ 2017-03-13 17:30 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Brian Foster, linux-xfs, linux-block, mpe, linuxppc-dev,
	linux-kernel
In-Reply-To: <158127ce-d71d-56a5-3dd3-f676b106c65d@kernel.dk>

On Sat, 2017-03-11 at 15:46 -0700, Jens Axboe wrote:
> On 03/09/2017 05:59 AM, Brian Foster wrote:
> > cc linux-block
> > 
> > On Thu, Mar 09, 2017 at 04:20:06PM +0530, Abdul Haleem wrote:
> >> On Wed, 2017-03-08 at 08:17 -0500, Brian Foster wrote:
> >>> On Tue, Mar 07, 2017 at 10:01:04PM +0530, Abdul Haleem wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> Today's mainline (4.11.0-rc1) booted with warnings on Power7 LPAR.
> >>>>
> >>>> Issue is not reproducible all the time.
> 
> Is that still the case with -git as of yesterday? Check that you
> have this merge:
> 
> 34bbce9e344b47e8871273409632f525973afad4
> 
> in your tree.
> 

Thanks for pointing out, with the below merge commit warnings disappear.

commit 34bbce9e344b47e8871273409632f525973afad4
Merge: bb61ce5 672a2c8
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Thu Mar 9 15:53:25 2017 -0800

    Merge branch 'for-linus' of git://git.kernel.dk/linux-block


Thanks for the fix !

Reported-by & Tested-by : Abdul Haleem <abdhalee@linux.vnet.ibm.com>

-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre

^ permalink raw reply

* Re: [PATCH 0/11 v4] block: Fix block device shutdown related races
From: Dan Williams @ 2017-03-13 18:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Thiago Jung Bauermann,
	Tejun Heo, Tahsin Erdogan, Omar Sandoval
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

On Mon, Mar 13, 2017 at 8:13 AM, Jan Kara <jack@suse.cz> wrote:
> Hello,
>
> this is a series with the remaining patches (on top of 4.11-rc2) to fix several
> different races and issues I've found when testing device shutdown and reuse.
> The first two patches fix possible (theoretical) problems when opening of a
> block device races with shutdown of a gendisk structure. Patches 3-9 fix oops
> that is triggered by __blkdev_put() calling inode_detach_wb() too early (the
> problem reported by Thiago). Patches 10 and 11 fix oops due to a bug in gendisk
> code where get_gendisk() can return already freed gendisk structure (again
> triggered by Omar's stress test).
>
> People, please have a look at patches. They are mostly simple however the
> interactions are rather complex so I may have missed something. Also I'm
> happy for any additional testing these patches can get - I've stressed them
> with Omar's script, tested memcg writeback, tested static (not udev managed)
> device inodes.

Passes testing with the libnvdimm unit tests that have been tripped up
by block-unplug bugs in the past.

^ permalink raw reply

* Re: [PATCH] brd: make rd_size static
From: Bart Van Assche @ 2017-03-13 20:07 UTC (permalink / raw)
  To: linux-block@vger.kernel.org, yanaijie@huawei.com, axboe@kernel.dk
  Cc: zhaohongjiang@huawei.com, miaoxie@huawei.com
In-Reply-To: <4dd2367b-61b9-d390-5bb1-955e90fef0b0@kernel.dk>

On Sat, 2017-03-11 at 15:29 -0700, Jens Axboe wrote:
> On 03/10/2017 12:32 AM, Jason Yan wrote:
> > Fixes the following sparse warning:
> >=20
> > drivers/block/brd.c:411:15: warning: symbol 'rd_size' was not declared.
> > Should it be static?
>=20
> If you do a search on this topic, you'll find others that attempted
> to do the same. Arm uses it for tag parsing, for some reason, your
> patch below would break it.
>=20
> It'd be great if this was fixed up for real, though.

How about something like the (untested) patch below?


Subject: [PATCH] arch/arm/kernel/atags_parse.c: Fix rd_size declaration

Ensure that the ARM setup code treats "rd_size" as unsigned long instead of=
 int.

---
 arch/arm/kernel/atags_parse.c | 3 ++-
 drivers/block/brd.c           | 2 ++
 drivers/block/brd.h           | 1 +
 3 files changed, 5 insertions(+), 1 deletion(-)
 create mode 100644 drivers/block/brd.h

diff --git a/arch/arm/kernel/atags_parse.c b/arch/arm/kernel/atags_parse.c
index 68c6ae0b9e4c..f18b6deaf050 100644
--- a/arch/arm/kernel/atags_parse.c
+++ b/arch/arm/kernel/atags_parse.c
@@ -30,6 +30,7 @@
 #include <asm/mach/arch.h>
=20
 #include "atags.h"
+#include "../../../drivers/block/brd.h"
=20
 static char default_command_line[COMMAND_LINE_SIZE] __initdata =3D CONFIG_=
CMDLINE;
=20
@@ -91,7 +92,7 @@ __tagtable(ATAG_VIDEOTEXT, parse_tag_videotext);
 #ifdef CONFIG_BLK_DEV_RAM
 static int __init parse_tag_ramdisk(const struct tag *tag)
 {
-	extern int rd_size, rd_image_start, rd_prompt, rd_doload;
+	extern int rd_image_start, rd_prompt, rd_doload;
=20
 	rd_image_start =3D tag->u.ramdisk.start;
 	rd_doload =3D (tag->u.ramdisk.flags & 1) =3D=3D 0;
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3adc32a3153b..f1f9f0338fbd 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -25,6 +25,8 @@
=20
 #include <linux/uaccess.h>
=20
+#include "brd.h"
+
 #define SECTOR_SHIFT		9
 #define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
 #define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
diff --git a/drivers/block/brd.h b/drivers/block/brd.h
new file mode 100644
index 000000000000..dbb0f92fefc8
--- /dev/null
+++ b/drivers/block/brd.h
@@ -0,0 +1 @@
+extern unsigned long rd_size;
--=20
2.12.0

^ permalink raw reply related

* Re: [PATCH 1/4] blk-mq: remove BLK_MQ_F_DEFER_ISSUE
From: Bart Van Assche @ 2017-03-13 20:52 UTC (permalink / raw)
  To: hch@lst.de, axboe@kernel.dk; +Cc: linux-block@vger.kernel.org
In-Reply-To: <20170313154833.14165-2-hch@lst.de>

On Mon, 2017-03-13 at 09:48 -0600, Christoph Hellwig wrote:
> This flag was never used since it was introduced.
>=20
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/blk-mq.c         | 8 +-------
>  include/linux/blk-mq.h | 1 -
>  2 files changed, 1 insertion(+), 8 deletions(-)
>=20
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 159187a28d66..acf0ddf4af52 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1534,13 +1534,7 @@ static blk_qc_t blk_mq_make_request(struct request=
_queue *q, struct bio *bio)
>  	}
> =20
>  	plug =3D current->plug;
> -	/*
> -	 * If the driver supports defer issued based on 'last', then
> -	 * queue it up like normal since we can potentially save some
> -	 * CPU this way.
> -	 */
> -	if (((plug && !blk_queue_nomerges(q)) || is_sync) &&
> -	    !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
> +	if (((plug && !blk_queue_nomerges(q)) || is_sync)) {

A minor comment: due to this change the outer pair of parentheses
became superfluous. Please consider removing these.

Thanks,

Bart.=

^ permalink raw reply

* Re: [PATCH 2/4] blk-mq: merge mq and sq make_request instances
From: Bart Van Assche @ 2017-03-13 21:01 UTC (permalink / raw)
  To: hch@lst.de, axboe@kernel.dk; +Cc: linux-block@vger.kernel.org
In-Reply-To: <20170313154833.14165-3-hch@lst.de>

On Mon, 2017-03-13 at 09:48 -0600, Christoph Hellwig wrote:
> @@ -1534,7 +1529,36 @@ static blk_qc_t blk_mq_make_request(struct request=
_queue *q, struct bio *bio)
>  	}
> =20
>  	plug =3D current->plug;
> -	if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
> +	if (plug && q->nr_hw_queues =3D=3D 1) {
> +		struct request *last =3D NULL;
> +
> +		blk_mq_bio_to_request(rq, bio);
> +
> +		/*
> +		 * @request_count may become stale because of schedule
> +		 * out, so check the list again.
> +		 */

The above comment was relevant as long as there was a request_count assignm=
ent
above blk_mq_sched_get_request(). This patch moves that assignment inside i=
f
(plug && q->nr_hw_queues =3D=3D 1). Does that mean that the above comment s=
hould be
removed entirely?

> +		if (list_empty(&plug->mq_list))
> +			request_count =3D 0;
> +		else if (blk_queue_nomerges(q))
> +			request_count =3D blk_plug_queued_count(q);
> +
> +		if (!request_count)
> +			trace_block_plug(q);
> +		else
> +			last =3D list_entry_rq(plug->mq_list.prev);
> +
> +		blk_mq_put_ctx(data.ctx);
> +
> +		if (request_count >=3D BLK_MAX_REQUEST_COUNT || (last &&
> +		    blk_rq_bytes(last) >=3D BLK_PLUG_FLUSH_SIZE)) {
> +			blk_flush_plug_list(plug, false);
> +			trace_block_plug(q);
> +		}
> +
> +		list_add_tail(&rq->queuelist, &plug->mq_list);
> +		goto done;
> +	} else if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
>  		struct request *old_rq =3D NULL;
> =20
>  		blk_mq_bio_to_request(rq, bio);

Bart.=

^ permalink raw reply

* Re: [PATCH 3/4] blk-mq: improve blk_mq_try_issue_directly
From: Bart Van Assche @ 2017-03-13 21:02 UTC (permalink / raw)
  To: hch@lst.de, axboe@kernel.dk; +Cc: linux-block@vger.kernel.org
In-Reply-To: <20170313154833.14165-4-hch@lst.de>

On Mon, 2017-03-13 at 09:48 -0600, Christoph Hellwig wrote:
> Rename blk_mq_try_issue_directly to __blk_mq_try_issue_directly and add a
> new wrapper that takes care of RCU / SRCU locking to avoid having
> boileplate code in the caller which would get duplicated with new callers=
.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>=

^ permalink raw reply

* Re: NULL deref in cpu hot unplug on jens for-linus branch
From: Sagi Grimberg @ 2017-03-13 21:46 UTC (permalink / raw)
  To: Jens Axboe, linux-block@vger.kernel.org, linux-nvme
In-Reply-To: <54f833f1-ab98-5d22-e7d4-5e6059a4c467@fb.com>


> Are you saying your code works on top of 4.11-rc2, but not on top of my
> for-linus?

I was actually on Linus 4.11-rc1 before I rebased on top of your
for-linus.

> That seems odd. Looking at the oops, you are crashing with
> !tags in __blk_mq_tag_idle. The below should work around it, but I'm
> puzzled why this is new.

I got it just once (out of a single run :)), but maybe it is
possible that its racy and not really new.

But another example where this can happen:
blk_mq_realloc_hw_ctxs explicitly checks on hctx->tags != NULL
but right after calls blk_mq_exit_hctx() which goes in the
same route, won't this happen there too? Or is it assumed that
hctx->state does not have BLK_MQ_S_TAG_ACTIVE on here?

> Is it related to the other path you fixed in this patch:
>
> commit 0067d4b020ea07a58540acb2c5fcd3364bf326e0
> Author: Sagi Grimberg <sagi@grimberg.me>
> Date:   Mon Mar 13 16:10:11 2017 +0200
>
>     blk-mq: Fix tagset reinit in the presence of cpu hot-unplug
>
> Since that's also handling hctx->tags == NULL.

The above patch prevented a NULL deref earlier when the
tags were reinitialized, now we are all setup and we
happen to remove an old namespace.

> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 9d97bfc4d465..1283f74bfdfb 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -54,9 +54,11 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>  	if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
>  		return;
>
> -	atomic_dec(&tags->active_queues);
> +	if (tags) {
> +		atomic_dec(&tags->active_queues);
>
> -	blk_mq_tag_wakeup_all(tags, false);
> +		blk_mq_tag_wakeup_all(tags, false);
> +	}
>  }
>
>  /*
>

I'll see if I can test it out later this week. thanks.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox