public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] blk-mq: two fixes
@ 2015-08-09  7:41 Ming Lei
  2015-08-09  7:41 ` [PATCH 1/2] blk-mq: fix buffer overflow when reading sysfs file of 'pending' Ming Lei
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Ming Lei @ 2015-08-09  7:41 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel; +Cc: Christoph Hellwig

Hi Jens,

The 1st patch fixes one buffer overflow issue when reading
sysfs file of hctx's pending.

The 2nd patch fixes one race between timeout and request freeing,
which also simplifies implementation of blk_mq_tag_to_rq() a lot.

 block/blk-flush.c    | 15 ++++++++++++++-
 block/blk-mq-sysfs.c | 25 ++++++++++++++++++-------
 block/blk-mq-tag.c   |  4 ++--
 block/blk-mq-tag.h   | 12 ++++++++++++
 block/blk-mq.c       | 16 +---------------
 block/blk.h          |  6 ++++++
 6 files changed, 53 insertions(+), 25 deletions(-)


thanks,
Ming



^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/2] blk-mq: fix buffer overflow when reading sysfs file of 'pending'
  2015-08-09  7:41 [PATCH 0/2] blk-mq: two fixes Ming Lei
@ 2015-08-09  7:41 ` Ming Lei
  2015-08-09  7:41 ` [PATCH 2/2] blk-mq: fix race between timeout and freeing request Ming Lei
  2015-08-15  1:10 ` [PATCH 0/2] blk-mq: two fixes Ming Lei
  2 siblings, 0 replies; 6+ messages in thread
From: Ming Lei @ 2015-08-09  7:41 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel; +Cc: Christoph Hellwig, Ming Lei, stable

There may be lots of pending requests so that the buffer of PAGE_SIZE
can't hold them at all.

One typical example is scsi-mq, the queue depth(.can_queue) of
scsi_host and blk-mq is quite big but scsi_device's queue_depth
is a bit small(.cmd_per_lun), then it is quite easy to have lots
of pending requests in hw queue.

This patch fixes the following warning and the related memory
destruction.

[  359.025101] fill_read_buffer: blk_mq_hw_sysfs_show+0x0/0x7d returned bad count^M
[  359.055595] irq event stamp: 15537^M
[  359.055606] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ^M
[  359.055614] Dumping ftrace buffer:^M
[  359.055660]    (ftrace buffer empty)^M
[  359.055672] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw^M
[  359.055678] CPU: 4 PID: 21631 Comm: stress-ng-sysfs Not tainted 4.2.0-rc5-next-20150805 #434^M
[  359.055679] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011^M
[  359.055682] task: ffff8802161cc000 ti: ffff88021b4a8000 task.ti: ffff88021b4a8000^M
[  359.055693] RIP: 0010:[<ffffffff811541c5>]  [<ffffffff811541c5>] __kmalloc+0xe8/0x152^M

Cc: <stable@vger.kernel.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 block/blk-mq-sysfs.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index b79685e..279c5d6 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -141,15 +141,26 @@ static ssize_t blk_mq_sysfs_completed_show(struct blk_mq_ctx *ctx, char *page)
 
 static ssize_t sysfs_list_show(char *page, struct list_head *list, char *msg)
 {
-	char *start_page = page;
 	struct request *rq;
+	int len = snprintf(page, PAGE_SIZE - 1, "%s:\n", msg);
+
+	list_for_each_entry(rq, list, queuelist) {
+		const int rq_len = 2 * sizeof(rq) + 2;
+
+		/* if the output will be truncated */
+		if (PAGE_SIZE - 1 < len + rq_len) {
+			/* backspacing if it can't hold '\t...\n' */
+			if (PAGE_SIZE - 1 < len + 5)
+				len -= rq_len;
+			len += snprintf(page + len, PAGE_SIZE - 1 - len,
+					"\t...\n");
+			break;
+		}
+		len += snprintf(page + len, PAGE_SIZE - 1 - len,
+				"\t%p\n", rq);
+	}
 
-	page += sprintf(page, "%s:\n", msg);
-
-	list_for_each_entry(rq, list, queuelist)
-		page += sprintf(page, "\t%p\n", rq);
-
-	return page - start_page;
+	return len;
 }
 
 static ssize_t blk_mq_sysfs_rq_list_show(struct blk_mq_ctx *ctx, char *page)
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/2] blk-mq: fix race between timeout and freeing request
  2015-08-09  7:41 [PATCH 0/2] blk-mq: two fixes Ming Lei
  2015-08-09  7:41 ` [PATCH 1/2] blk-mq: fix buffer overflow when reading sysfs file of 'pending' Ming Lei
@ 2015-08-09  7:41 ` Ming Lei
  2015-08-15  1:10 ` [PATCH 0/2] blk-mq: two fixes Ming Lei
  2 siblings, 0 replies; 6+ messages in thread
From: Ming Lei @ 2015-08-09  7:41 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel; +Cc: Christoph Hellwig, Ming Lei, stable

Inside timeout handler, blk_mq_tag_to_rq() is called
to retrieve the request from one tag. This way is obviously
wrong because the request can be freed any time and some
fiedds of the request can't be trusted, then kernel oops
might be triggered[1].

Currently wrt. blk_mq_tag_to_rq(), the only special case is
that the flush request can share same tag with the request
cloned from, and the two requests can't be active at the same
time, so this patch fixes the above issue by updating tags->rqs[tag]
with the active request(either flush rq or the request cloned
from) of the tag.

Also blk_mq_tag_to_rq() gets much simplified with this patch.

Given blk_mq_tag_to_rq() is mainly for drivers and the caller must
make sure the request can't be freed, so in bt_for_each() this
helper is replaced with tags->rqs[tag].

[1] kernel oops log
[  439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158^M
[  439.697162] IP: [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
[  439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0 ^M
[  439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ^M
[  439.700653] Dumping ftrace buffer:^M
[  439.700653]    (ftrace buffer empty)^M
[  439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw^M
[  439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265^M
[  439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011^M
[  439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000^M
[  439.730500] RIP: 0010:[<ffffffff812d89ba>]  [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
[  439.730500] RSP: 0018:ffff880819203da0  EFLAGS: 00010283^M
[  439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002^M
[  439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000^M
[  439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000^M
[  439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202^M
[  439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00^M
[  439.730500] FS:  00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000^M
[  439.730500] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b^M
[  439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0^M
[  439.730500] Stack:^M
[  439.730500]  0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104^M
[  439.755663]  ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80^M
[  439.755663] Call Trace:^M
[  439.755663]  <IRQ> ^M
[  439.755663]  [<ffffffff812dc104>] bt_for_each+0x6e/0xc8^M
[  439.755663]  [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a^M
[  439.755663]  [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a^M
[  439.755663]  [<ffffffff812dc1b3>] blk_mq_tag_busy_iter+0x55/0x5e^M
[  439.755663]  [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38^M
[  439.755663]  [<ffffffff812d8911>] blk_mq_rq_timer+0x5d/0xd4^M
[  439.755663]  [<ffffffff810a3e10>] call_timer_fn+0xf7/0x284^M
[  439.755663]  [<ffffffff810a3d1e>] ? call_timer_fn+0x5/0x284^M
[  439.755663]  [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38^M
[  439.755663]  [<ffffffff810a46d6>] run_timer_softirq+0x1ce/0x1f8^M
[  439.755663]  [<ffffffff8104c367>] __do_softirq+0x181/0x3a4^M
[  439.755663]  [<ffffffff8104c76e>] irq_exit+0x40/0x94^M
[  439.755663]  [<ffffffff81031482>] smp_apic_timer_interrupt+0x33/0x3e^M
[  439.755663]  [<ffffffff815559a4>] apic_timer_interrupt+0x84/0x90^M
[  439.755663]  <EOI> ^M
[  439.755663]  [<ffffffff81554350>] ? _raw_spin_unlock_irq+0x32/0x4a^M
[  439.755663]  [<ffffffff8106a98b>] finish_task_switch+0xe0/0x163^M
[  439.755663]  [<ffffffff8106a94d>] ? finish_task_switch+0xa2/0x163^M
[  439.755663]  [<ffffffff81550066>] __schedule+0x469/0x6cd^M
[  439.755663]  [<ffffffff8155039b>] schedule+0x82/0x9a^M
[  439.789267]  [<ffffffff8119b28b>] signalfd_read+0x186/0x49a^M
[  439.790911]  [<ffffffff8106d86a>] ? wake_up_q+0x47/0x47^M
[  439.790911]  [<ffffffff811618c2>] __vfs_read+0x28/0x9f^M
[  439.790911]  [<ffffffff8117a289>] ? __fget_light+0x4d/0x74^M
[  439.790911]  [<ffffffff811620a7>] vfs_read+0x7a/0xc6^M
[  439.790911]  [<ffffffff8116292b>] SyS_read+0x49/0x7f^M
[  439.790911]  [<ffffffff81554c17>] entry_SYSCALL_64_fastpath+0x12/0x6f^M
[  439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89
f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b
53 38 <48> 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10
^M
[  439.790911] RIP  [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
[  439.790911]  RSP <ffff880819203da0>^M
[  439.790911] CR2: 0000000000000158^M
[  439.790911] ---[ end trace d40af58949325661 ]---^M

Cc: <stable@vger.kernel.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 block/blk-flush.c  | 15 ++++++++++++++-
 block/blk-mq-tag.c |  4 ++--
 block/blk-mq-tag.h | 12 ++++++++++++
 block/blk-mq.c     | 16 +---------------
 block/blk.h        |  6 ++++++
 5 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 20badd7..9c423e5 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -73,6 +73,7 @@
 
 #include "blk.h"
 #include "blk-mq.h"
+#include "blk-mq-tag.h"
 
 /* FLUSH/FUA sequences */
 enum {
@@ -226,7 +227,12 @@ static void flush_end_io(struct request *flush_rq, int error)
 	struct blk_flush_queue *fq = blk_get_flush_queue(q, flush_rq->mq_ctx);
 
 	if (q->mq_ops) {
+		struct blk_mq_hw_ctx *hctx;
+
+		/* release the tag's ownership to the req cloned from */
 		spin_lock_irqsave(&fq->mq_flush_lock, flags);
+		hctx = q->mq_ops->map_queue(q, flush_rq->mq_ctx->cpu);
+		blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
 		flush_rq->tag = -1;
 	}
 
@@ -308,11 +314,18 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq)
 
 	/*
 	 * Borrow tag from the first request since they can't
-	 * be in flight at the same time.
+	 * be in flight at the same time. And acquire the tag's
+	 * ownership for flush req.
 	 */
 	if (q->mq_ops) {
+		struct blk_mq_hw_ctx *hctx;
+
 		flush_rq->mq_ctx = first_rq->mq_ctx;
 		flush_rq->tag = first_rq->tag;
+		fq->orig_rq = first_rq;
+
+		hctx = q->mq_ops->map_queue(q, first_rq->mq_ctx->cpu);
+		blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
 	}
 
 	flush_rq->cmd_type = REQ_TYPE_FS;
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9b6e288..9115c6d 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -429,7 +429,7 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx,
 		for (bit = find_first_bit(&bm->word, bm->depth);
 		     bit < bm->depth;
 		     bit = find_next_bit(&bm->word, bm->depth, bit + 1)) {
-		     	rq = blk_mq_tag_to_rq(hctx->tags, off + bit);
+			rq = hctx->tags->rqs[off + bit];
 			if (rq->q == hctx->queue)
 				fn(hctx, rq, data, reserved);
 		}
@@ -453,7 +453,7 @@ static void bt_tags_for_each(struct blk_mq_tags *tags,
 		for (bit = find_first_bit(&bm->word, bm->depth);
 		     bit < bm->depth;
 		     bit = find_next_bit(&bm->word, bm->depth, bit + 1)) {
-			rq = blk_mq_tag_to_rq(tags, off + bit);
+			rq = tags->rqs[off + bit];
 			fn(rq, data, reserved);
 		}
 
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 75893a3..9eb2cf4 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -89,4 +89,16 @@ static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 	__blk_mq_tag_idle(hctx);
 }
 
+/*
+ * This helper should only be used for flush request to share tag
+ * with the request cloned from, and both the two requests can't be
+ * in flight at the same time. The caller has to make sure the tag
+ * can't be freed.
+ */
+static inline void blk_mq_tag_set_rq(struct blk_mq_hw_ctx *hctx,
+		unsigned int tag, struct request *rq)
+{
+	hctx->tags->rqs[tag] = rq;
+}
+
 #endif
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7d842db..176262e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -559,23 +559,9 @@ void blk_mq_abort_requeue_list(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_mq_abort_requeue_list);
 
-static inline bool is_flush_request(struct request *rq,
-		struct blk_flush_queue *fq, unsigned int tag)
-{
-	return ((rq->cmd_flags & REQ_FLUSH_SEQ) &&
-			fq->flush_rq->tag == tag);
-}
-
 struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 {
-	struct request *rq = tags->rqs[tag];
-	/* mq_ctx of flush rq is always cloned from the corresponding req */
-	struct blk_flush_queue *fq = blk_get_flush_queue(rq->q, rq->mq_ctx);
-
-	if (!is_flush_request(rq, fq, tag))
-		return rq;
-
-	return fq->flush_rq;
+	return tags->rqs[tag];
 }
 EXPORT_SYMBOL(blk_mq_tag_to_rq);
 
diff --git a/block/blk.h b/block/blk.h
index 026d959..838188b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -22,6 +22,12 @@ struct blk_flush_queue {
 	struct list_head	flush_queue[2];
 	struct list_head	flush_data_in_flight;
 	struct request		*flush_rq;
+
+	/*
+	 * flush_rq shares tag with this rq, both can't be active
+	 * at the same time
+	 */
+	struct request		*orig_rq;
 	spinlock_t		mq_flush_lock;
 };
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 0/2] blk-mq: two fixes
  2015-08-09  7:41 [PATCH 0/2] blk-mq: two fixes Ming Lei
  2015-08-09  7:41 ` [PATCH 1/2] blk-mq: fix buffer overflow when reading sysfs file of 'pending' Ming Lei
  2015-08-09  7:41 ` [PATCH 2/2] blk-mq: fix race between timeout and freeing request Ming Lei
@ 2015-08-15  1:10 ` Ming Lei
  2015-08-15 15:44   ` Jens Axboe
  2 siblings, 1 reply; 6+ messages in thread
From: Ming Lei @ 2015-08-15  1:10 UTC (permalink / raw)
  To: Jens Axboe, Linux Kernel Mailing List; +Cc: Christoph Hellwig

On Sun, Aug 9, 2015 at 3:41 PM, Ming Lei <ming.lei@canonical.com> wrote:
> Hi Jens,
>
> The 1st patch fixes one buffer overflow issue when reading
> sysfs file of hctx's pending.
>
> The 2nd patch fixes one race between timeout and request freeing,
> which also simplifies implementation of blk_mq_tag_to_rq() a lot.

Ping...


>  block/blk-flush.c    | 15 ++++++++++++++-
>  block/blk-mq-sysfs.c | 25 ++++++++++++++++++-------
>  block/blk-mq-tag.c   |  4 ++--
>  block/blk-mq-tag.h   | 12 ++++++++++++
>  block/blk-mq.c       | 16 +---------------
>  block/blk.h          |  6 ++++++
>  6 files changed, 53 insertions(+), 25 deletions(-)
>
>
> thanks,
> Ming
>
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 0/2] blk-mq: two fixes
  2015-08-15  1:10 ` [PATCH 0/2] blk-mq: two fixes Ming Lei
@ 2015-08-15 15:44   ` Jens Axboe
  2015-08-16  9:25     ` Ming Lei
  0 siblings, 1 reply; 6+ messages in thread
From: Jens Axboe @ 2015-08-15 15:44 UTC (permalink / raw)
  To: Ming Lei, Linux Kernel Mailing List; +Cc: Christoph Hellwig

On 08/14/2015 07:10 PM, Ming Lei wrote:
> On Sun, Aug 9, 2015 at 3:41 PM, Ming Lei <ming.lei@canonical.com> wrote:
>> Hi Jens,
>>
>> The 1st patch fixes one buffer overflow issue when reading
>> sysfs file of hctx's pending.
>>
>> The 2nd patch fixes one race between timeout and request freeing,
>> which also simplifies implementation of blk_mq_tag_to_rq() a lot.
>
> Ping...

Sorry, 1/2 is a no brainer, needed a bit of time to look at 2/2. I think 
it looks fine, and I like how it simplifies (and gets rid of) the flush 
special case tag lookup. Applied.


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 0/2] blk-mq: two fixes
  2015-08-15 15:44   ` Jens Axboe
@ 2015-08-16  9:25     ` Ming Lei
  0 siblings, 0 replies; 6+ messages in thread
From: Ming Lei @ 2015-08-16  9:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linux Kernel Mailing List, Christoph Hellwig

On Sat, Aug 15, 2015 at 11:44 PM, Jens Axboe <axboe@fb.com> wrote:
> On 08/14/2015 07:10 PM, Ming Lei wrote:
>>
>> On Sun, Aug 9, 2015 at 3:41 PM, Ming Lei <ming.lei@canonical.com> wrote:
>>>
>>> Hi Jens,
>>>
>>> The 1st patch fixes one buffer overflow issue when reading
>>> sysfs file of hctx's pending.
>>>
>>> The 2nd patch fixes one race between timeout and request freeing,
>>> which also simplifies implementation of blk_mq_tag_to_rq() a lot.
>>
>>
>> Ping...
>
>
> Sorry, 1/2 is a no brainer, needed a bit of time to look at 2/2. I think it
> looks fine, and I like how it simplifies (and gets rid of) the flush special
> case tag lookup. Applied.

Great thanks, :-)Thanks,


Ming

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2015-08-16  9:25 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-08-09  7:41 [PATCH 0/2] blk-mq: two fixes Ming Lei
2015-08-09  7:41 ` [PATCH 1/2] blk-mq: fix buffer overflow when reading sysfs file of 'pending' Ming Lei
2015-08-09  7:41 ` [PATCH 2/2] blk-mq: fix race between timeout and freeing request Ming Lei
2015-08-15  1:10 ` [PATCH 0/2] blk-mq: two fixes Ming Lei
2015-08-15 15:44   ` Jens Axboe
2015-08-16  9:25     ` Ming Lei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox