* [PATCH 1/6] blk-mq: Fix a use-after-free
From: Bart Van Assche @ 2014-12-09 15:57 UTC
To: Jens Axboe
Cc: Christoph Hellwig, Robert Elliott, Ming Lei, Alexander Gordeev,
linux-kernel
blk-mq users are allowed to free the memory that request_queue.tag_set
points at after blk_cleanup_queue() has finished but before
blk_release_queue() has started. This can happen e.g. in the SCSI
core: the SCSI core embeds the tag_set structure in a SCSI host
structure, and the SCSI host structure is freed by
scsi_host_dev_release(). That function is called after
blk_cleanup_queue() has finished but may be called before
blk_release_queue().
This means that it is not safe to access request_queue.tag_set from
inside blk_release_queue(). Hence remove the blk_sync_queue() call
from blk_release_queue(). That call is not necessary: all outstanding
requests must have finished before blk_release_queue() is
called. Additionally, move the blk_mq_free_queue() call from
blk_release_queue() to blk_cleanup_queue() so that struct
request_queue.tag_set is no longer accessed after it has been freed.
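For illustration, here is a minimal userspace sketch of the lifetime
ordering involved. The structures and functions below are simplified,
hypothetical stand-ins for the kernel ones, not the real API:

#include <stdio.h>
#include <stdlib.h>

struct tag_set { int nr_tags; };
struct request_queue { struct tag_set *tag_set; };

/* Stand-in for blk_cleanup_queue(): guaranteed to run while the
 * tag_set owner (e.g. the SCSI host) still exists, so mq resources
 * that need q->tag_set must be torn down here. */
static void cleanup_queue(struct request_queue *q)
{
    printf("cleanup: %d tags to free\n", q->tag_set->nr_tags);
}

/* Stand-in for blk_release_queue(): may run after the tag_set owner
 * has been freed, so it must not dereference q->tag_set. */
static void release_queue(struct request_queue *q)
{
    (void)q;    /* deliberately does not touch q->tag_set */
    printf("release: tag_set not accessed\n");
}

int main(void)
{
    struct tag_set *set = malloc(sizeof(*set));
    struct request_queue q = { .tag_set = set };

    set->nr_tags = 32;
    cleanup_queue(&q);    /* tag_set is still valid here */
    free(set);            /* e.g. scsi_host_dev_release() */
    release_queue(&q);    /* q.tag_set now dangles */
    return 0;
}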
This patch prevents the following kernel oops, which could be
triggered when deleting a SCSI host for which scsi-mq was enabled:
Call Trace:
[<ffffffff8109a7c4>] lock_acquire+0xc4/0x270
[<ffffffff814ce111>] mutex_lock_nested+0x61/0x380
[<ffffffff812575f0>] blk_mq_free_queue+0x30/0x180
[<ffffffff8124d654>] blk_release_queue+0x84/0xd0
[<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0
[<ffffffff8126c140>] kobject_put+0x30/0x70
[<ffffffff81245895>] blk_put_queue+0x15/0x20
[<ffffffff8125c409>] disk_release+0x99/0xd0
[<ffffffff8133d056>] device_release+0x36/0xb0
[<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0
[<ffffffff8126c140>] kobject_put+0x30/0x70
[<ffffffff8125a78a>] put_disk+0x1a/0x20
[<ffffffff811d4cb5>] __blkdev_put+0x135/0x1b0
[<ffffffff811d56a0>] blkdev_put+0x50/0x160
[<ffffffff81199eb4>] kill_block_super+0x44/0x70
[<ffffffff8119a2a4>] deactivate_locked_super+0x44/0x60
[<ffffffff8119a87e>] deactivate_super+0x4e/0x70
[<ffffffff811b9833>] cleanup_mnt+0x43/0x90
[<ffffffff811b98d2>] __cleanup_mnt+0x12/0x20
[<ffffffff8107252c>] task_work_run+0xac/0xe0
[<ffffffff81002c01>] do_notify_resume+0x61/0xa0
[<ffffffff814d2c58>] int_signal+0x12/0x17
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: <stable@vger.kernel.org> # v3.13+
---
block/blk-core.c | 3 +++
block/blk-sysfs.c | 12 ++++--------
2 files changed, 7 insertions(+), 8 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 2e7424b..a681827 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -525,6 +525,9 @@ void blk_cleanup_queue(struct request_queue *q)
del_timer_sync(&q->backing_dev_info.laptop_mode_wb_timer);
blk_sync_queue(q);
+ if (q->mq_ops)
+ blk_mq_free_queue(q);
+
spin_lock_irq(lock);
if (q->queue_lock != &q->__queue_lock)
q->queue_lock = &q->__queue_lock;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 1fac434..935ea2a 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -492,17 +492,15 @@ static void blk_free_queue_rcu(struct rcu_head *rcu_head)
* Currently, its primary task it to free all the &struct request
* structures that were allocated to the queue and the queue itself.
*
- * Caveat:
- * Hopefully the low level driver will have finished any
- * outstanding requests first...
+ * Note:
+ * The low level driver must have finished any outstanding requests first
+ * via blk_cleanup_queue().
**/
static void blk_release_queue(struct kobject *kobj)
{
struct request_queue *q =
container_of(kobj, struct request_queue, kobj);
- blk_sync_queue(q);
-
blkcg_exit_queue(q);
if (q->elevator) {
@@ -517,9 +515,7 @@ static void blk_release_queue(struct kobject *kobj)
if (q->queue_tags)
__blk_queue_free_tags(q);
- if (q->mq_ops)
- blk_mq_free_queue(q);
- else
+ if (!q->mq_ops)
blk_free_flush_queue(q->fq);
blk_trace_shutdown(q);
--
2.1.2
* [PATCH 2/6] blk-mq: Avoid that __bt_get_word() wraps multiple times
From: Bart Van Assche @ 2014-12-09 15:58 UTC
To: Jens Axboe
Cc: Christoph Hellwig, Robert Elliott, Ming Lei, Alexander Gordeev,
linux-kernel
If __bt_get_word() is called with last_tag != 0, the following
scenario triggers a second wrap-around: the first find_next_zero_bit()
call fails; after the wrap-around, find_next_zero_bit() succeeds but
the test_and_set_bit() call fails; the next find_next_zero_bit() call
then does not find a zero bit. Avoid that the search wraps around more
than once by introducing an additional local variable that records
whether a wrap-around has already occurred.
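As a sanity check, the fixed search loop can be modelled standalone.
The sketch below simplifies find_next_zero_bit() and
test_and_set_bit() into plain, non-atomic bit operations, so it only
demonstrates the wrap logic, not the concurrency:

#include <stdbool.h>
#include <stdio.h>

#define DEPTH 8    /* bm->depth in the real code */

/* Find and claim a zero bit in *word, starting the search at
 * last_tag and wrapping around to bit 0 at most once. */
static int get_bit(unsigned long *word, int last_tag)
{
    int org_last_tag = last_tag;
    int end = DEPTH;
    bool wrap = last_tag != 0;
    int tag;

restart:
    for (tag = last_tag; tag < end; tag++) {
        if (!(*word & (1UL << tag))) {
            *word |= 1UL << tag;    /* claim the bit */
            return tag;
        }
    }
    if (wrap) {
        wrap = false;    /* guarantees at most one wrap-around */
        end = org_last_tag;
        last_tag = 0;
        goto restart;
    }
    return -1;
}

int main(void)
{
    unsigned long word = 0xfb;    /* 8-bit map; only bit 2 is free */

    printf("first allocation: tag %d\n", get_bit(&word, 5));   /* 2 */
    printf("second allocation: tag %d\n", get_bit(&word, 5));  /* -1 */
    return 0;
}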
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: <stable@vger.kernel.org> # v3.13+
---
block/blk-mq-tag.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 728b9a4..5a3db28 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -137,6 +137,7 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
static int __bt_get_word(struct blk_align_bitmap *bm, unsigned int last_tag)
{
int tag, org_last_tag, end;
+ bool wrap = last_tag != 0;
org_last_tag = last_tag;
end = bm->depth;
@@ -148,8 +149,9 @@ restart:
* We started with an offset, start from 0 to
* exhaust the map.
*/
- if (org_last_tag && last_tag) {
- end = last_tag;
+ if (wrap) {
+ wrap = false;
+ end = org_last_tag;
last_tag = 0;
goto restart;
}
--
2.1.2
* [PATCH 3/6] blk-mq: Fix a race between bt_clear_tag() and bt_get()
From: Bart Van Assche @ 2014-12-09 15:58 UTC
To: Jens Axboe
Cc: Christoph Hellwig, Robert Elliott, Ming Lei, Alexander Gordeev,
linux-kernel
The following two guarantees are needed:
* Any thread that observes the effect of the test_and_set_bit() by
__bt_get_word() also observes the preceding addition of 'current'
to the appropriate wait list. This is guaranteed by the semantics
of the spin_unlock() operation performed by prepare_to_wait().
Hence the conversion of test_and_set_bit_lock() into
test_and_set_bit().
* The wait lists are examined by bt_clear_tag() only after the tag
bit has been cleared. clear_bit_unlock() guarantees that any thread
that observes that the bit has been cleared also observes the store
operations preceding clear_bit_unlock(). However, clear_bit_unlock()
does not prevent the wait lists from being examined before the tag
bit is cleared. Hence the addition of a full memory barrier between
clear_bit() and the wait list examination.
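These ordering requirements can be modelled in userspace with C11
atomics. The sketch below is illustrative only: a plain counter stands
in for the wait lists, and atomic_thread_fence() plays the role of
smp_mb():

#include <stdatomic.h>
#include <stdio.h>

static atomic_ulong tag_word;    /* one bit per tag */
static atomic_int nr_waiters;    /* stand-in for the wait lists */

/* Allocation side: register as a waiter first, then try to grab the
 * bit. The fetch_or is seq_cst, mirroring the full ordering of
 * test_and_set_bit(); an acquire-only RMW (the analogue of
 * test_and_set_bit_lock()) would allow the waiter registration to
 * be observed after the bit test. */
static int try_get_tag(int tag)
{
    atomic_fetch_add(&nr_waiters, 1);
    if (!(atomic_fetch_or(&tag_word, 1UL << tag) & (1UL << tag))) {
        atomic_fetch_sub(&nr_waiters, 1);
        return tag;    /* got the tag */
    }
    return -1;    /* would sleep; stays registered as a waiter */
}

/* Free side: clear the bit, issue a full barrier, then look for
 * waiters. Without the barrier the waiter check could be reordered
 * before the clear, and a concurrent try_get_tag() could go to
 * sleep with no one left to wake it up. */
static void clear_tag(int tag)
{
    atomic_fetch_and_explicit(&tag_word, ~(1UL << tag),
                              memory_order_relaxed);    /* clear_bit() */
    atomic_thread_fence(memory_order_seq_cst);          /* smp_mb() */
    if (atomic_load_explicit(&nr_waiters, memory_order_relaxed))
        printf("wake up a waiter\n");
}

int main(void)
{
    printf("got tag %d\n", try_get_tag(3));
    clear_tag(3);
    return 0;
}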
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: <stable@vger.kernel.org> # v3.13+
---
block/blk-mq-tag.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 5a3db28..70d9f9e 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -158,7 +158,7 @@ restart:
return -1;
}
last_tag = tag + 1;
- } while (test_and_set_bit_lock(tag, &bm->word));
+ } while (test_and_set_bit(tag, &bm->word));
return tag;
}
@@ -342,11 +342,10 @@ static void bt_clear_tag(struct blk_mq_bitmap_tags *bt, unsigned int tag)
struct bt_wait_state *bs;
int wait_cnt;
- /*
- * The unlock memory barrier need to order access to req in free
- * path and clearing tag bit
- */
- clear_bit_unlock(TAG_TO_BIT(bt, tag), &bt->map[index].word);
+ clear_bit(TAG_TO_BIT(bt, tag), &bt->map[index].word);
+
+ /* Ensure that the wait list checks occur after clear_bit(). */
+ smp_mb();
bs = bt_wake_ptr(bt);
if (!bs)
--
2.1.2
* [PATCH 4/6] blk-mq: Avoid that I/O hangs in bt_get()
From: Bart Van Assche @ 2014-12-09 15:58 UTC
To: Jens Axboe
Cc: Christoph Hellwig, Robert Elliott, Ming Lei, Alexander Gordeev,
linux-kernel
Avoid that bt_get() hangs if there are fewer hardware queues than CPU
threads. The symptoms of the hang were as follows:
* All tags allocated for a particular hardware queue.
* (nr_tags) pending commands for that hardware queue.
* No pending commands for the software queues associated with that
hardware queue.
The call stack that corresponds to the hang is as follows:
io_schedule+0x9c/0x130
bt_get+0xef/0x180
blk_mq_get_tag+0x9f/0xd0
__blk_mq_alloc_request+0x16/0x1f0
blk_mq_map_request+0x123/0x130
blk_mq_make_request+0x69/0x280
generic_make_request+0xc0/0x110
submit_bio+0x64/0x130
do_blockdev_direct_IO+0x1dc8/0x2da0
__blockdev_direct_IO+0x47/0x50
blkdev_direct_IO+0x49/0x50
generic_file_read_iter+0x546/0x610
blkdev_read_iter+0x32/0x40
aio_run_iocb+0x1f8/0x400
do_io_submit+0x121/0x490
SyS_io_submit+0xb/0x10
system_call_fastpath+0x12/0x17
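The shape of the fix is easier to see in a stubbed-out userspace
sketch. run_hw_queue() below is a hypothetical stand-in for
blk_mq_run_hw_queue(), and the actual sleeping is elided:

#include <stdio.h>

static int tags_free;    /* all tags are in use when this is zero */

static int __bt_get(void)
{
    if (tags_free > 0) {
        tags_free--;
        return 0;    /* a tag index; 0 is enough for the sketch */
    }
    return -1;
}

/* Running the hardware queue lets queued requests complete, which is
 * what eventually returns tags to the pool. */
static void run_hw_queue(void)
{
    tags_free++;    /* pretend one queued request completed */
}

static int bt_get(void)
{
    int tag;

    for (;;) {
        tag = __bt_get();
        if (tag != -1)
            break;

        /* The fix: kick the hardware queue and retry once before
         * going to sleep. If all tags are allocated and nothing
         * else will run the queue, sleeping first would hang. */
        run_hw_queue();
        tag = __bt_get();
        if (tag != -1)
            break;

        /* io_schedule() would go here */
    }
    return tag;
}

int main(void)
{
    tags_free = 0;
    printf("bt_get() returned tag %d\n", bt_get());
    return 0;
}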
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: <stable@vger.kernel.org> # v3.13+
---
block/blk-mq-tag.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 70d9f9e..795996e 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -256,6 +256,12 @@ static int bt_get(struct blk_mq_alloc_data *data,
if (tag != -1)
break;
+ blk_mq_run_hw_queue(hctx, false);
+
+ tag = __bt_get(hctx, bt, last_tag);
+ if (tag != -1)
+ break;
+
blk_mq_put_ctx(data->ctx);
io_schedule();
--
2.1.2
* Re: [PATCH 4/6] blk-mq: Avoid that I/O hangs in bt_get()
From: Jens Axboe @ 2014-12-09 16:10 UTC
To: Bart Van Assche
Cc: Christoph Hellwig, Robert Elliott, Ming Lei, Alexander Gordeev,
linux-kernel
On 12/09/2014 08:58 AM, Bart Van Assche wrote:
> Avoid that bt_get() hangs if there are fewer hardware queues than CPU
> threads. The symptoms of the hang were as follows:
> * All tags allocated for a particular hardware queue.
> * (nr_tags) pending commands for that hardware queue.
> * No pending commands for the software queues associated with that
> hardware queue.
I already queued up the previous one, and added the extra tag get myself.
--
Jens Axboe
* [PATCH 5/6] blk-mq: Use all available hardware queues
From: Bart Van Assche @ 2014-12-09 15:59 UTC
To: Jens Axboe
Cc: Christoph Hellwig, Robert Elliott, Ming Lei, Alexander Gordeev,
linux-kernel
Suppose that a system has two CPU sockets, three cores per socket,
that it does not support hyperthreading and that four hardware
queues are provided by a block driver. With the current algorithm
this will lead to the following assignment of CPU cores to hardware
queues:
HWQ 0: 0 1
HWQ 1: 2 3
HWQ 2: 4 5
HWQ 3: (none)
This patch changes the queue assignment into:
HWQ 0: 0 1
HWQ 1: 2
HWQ 2: 3 4
HWQ 3: 5
In other words, this patch has the following three effects:
- All four hardware queues are used instead of only three.
- CPU cores are spread more evenly over hardware queues. For the
above example the range of the number of CPU cores associated
with a single HWQ is reduced from [0..2] to [1..2].
- If the number of HWQs is a multiple of the number of CPU sockets
it is now guaranteed that all CPU cores associated with a single
HWQ reside on the same CPU socket.
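The two tables above can be reproduced with a few lines of standalone
C (6 CPU cores, 4 hardware queues):

#include <stdio.h>

static int old_map(unsigned int nr_cpus, unsigned int nr_queues,
                   unsigned int cpu)
{
    return cpu / ((nr_cpus + nr_queues - 1) / nr_queues);
}

static int new_map(unsigned int nr_cpus, unsigned int nr_queues,
                   unsigned int cpu)
{
    return cpu * nr_queues / nr_cpus;
}

int main(void)
{
    unsigned int cpu;

    for (cpu = 0; cpu < 6; cpu++)
        printf("cpu %u: old HWQ %d, new HWQ %d\n", cpu,
               old_map(6, 4, cpu), new_map(6, 4, cpu));
    /* old: 0 0 1 1 2 2 (HWQ 3 unused); new: 0 0 1 2 2 3 */
    return 0;
}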
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
---
block/blk-mq-cpumap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 1065d7c..8e56455 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -17,7 +17,7 @@
static int cpu_to_queue_index(unsigned int nr_cpus, unsigned int nr_queues,
const int cpu)
{
- return cpu / ((nr_cpus + nr_queues - 1) / nr_queues);
+ return cpu * nr_queues / nr_cpus;
}
static int get_first_sibling(unsigned int cpu)
--
2.1.2
* Re: [PATCH 5/6] blk-mq: Use all available hardware queues
From: Jens Axboe @ 2014-12-09 16:10 UTC
To: Bart Van Assche
Cc: Christoph Hellwig, Robert Elliott, Ming Lei, Alexander Gordeev,
linux-kernel
On 12/09/2014 08:59 AM, Bart Van Assche wrote:
> Suppose that a system has two CPU sockets, three cores per socket,
> that it does not support hyperthreading and that four hardware
> queues are provided by a block driver. With the current algorithm
> this will lead to the following assignment of CPU cores to hardware
> queues:
>
> HWQ 0: 0 1
> HWQ 1: 2 3
> HWQ 2: 4 5
> HWQ 3: (none)
>
> This patch changes the queue assignment into:
>
> HWQ 0: 0 1
> HWQ 1: 2
> HWQ 2: 3 4
> HWQ 3: 5
>
> In other words, this patch has the following three effects:
> - All four hardware queues are used instead of only three.
> - CPU cores are spread more evenly over hardware queues. For the
> above example the range of the number of CPU cores associated
> with a single HWQ is reduced from [0..2] to [1..2].
> - If the number of HWQs is a multiple of the number of CPU sockets
> it is now guaranteed that all CPU cores associated with a single
> HWQ reside on the same CPU socket.
I have thought about this since your last posting, and I think it should
be a win for most cases to do this, even if we end up with asymmetric
queue <-> cpu mappings.
--
Jens Axboe
* [PATCH 6/6] blk-mq: Micro-optimize bt_get()
From: Bart Van Assche @ 2014-12-09 15:59 UTC
To: Jens Axboe
Cc: Christoph Hellwig, Robert Elliott, Ming Lei, Alexander Gordeev,
linux-kernel
Remove a superfluous finish_wait() call. Convert the two bt_wait_ptr()
calls into a single call.
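For reference, a stubbed userspace sketch of the restructured loop;
the helpers below are hypothetical stand-ins, and the actual sleeping
and waking are elided:

#include <stdio.h>

struct bt_wait_state { int id; };

static struct bt_wait_state states[2] = { { 0 }, { 1 } };
static unsigned int cursor;

/* The hardware queue, and therefore the wait queue to use, can change
 * between loop iterations, which is why bt_wait_ptr() is now called
 * once at the top of each iteration instead of twice around the loop. */
static struct bt_wait_state *bt_wait_ptr(void)
{
    return &states[cursor++ % 2];
}

static int attempts;

static int __bt_get(void)
{
    return ++attempts >= 3 ? 42 : -1;    /* succeed on the third try */
}

int main(void)
{
    struct bt_wait_state *bs;
    int tag;

    do {
        bs = bt_wait_ptr();    /* the single remaining call site */
        /* prepare_to_wait(&bs->wait, &wait, TASK_UNINTERRUPTIBLE); */
        tag = __bt_get();
        if (tag != -1)
            break;
        /* io_schedule(); the hctx may have changed on wakeup */
    } while (1);
    /* finish_wait(&bs->wait, &wait);  -- the one remaining call */

    printf("tag %d via wait state %d\n", tag, bs->id);
    return 0;
}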
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
---
block/blk-mq-tag.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 795996e..2373724 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -248,8 +248,8 @@ static int bt_get(struct blk_mq_alloc_data *data,
if (!(data->gfp & __GFP_WAIT))
return -1;
- bs = bt_wait_ptr(bt, hctx);
do {
+ bs = bt_wait_ptr(bt, hctx);
prepare_to_wait(&bs->wait, &wait, TASK_UNINTERRUPTIBLE);
tag = __bt_get(hctx, bt, last_tag);
@@ -276,8 +276,6 @@ static int bt_get(struct blk_mq_alloc_data *data,
hctx = data->hctx;
bt = &hctx->tags->bitmap_tags;
}
- finish_wait(&bs->wait, &wait);
- bs = bt_wait_ptr(bt, hctx);
} while (1);
finish_wait(&bs->wait, &wait);
--
2.1.2