All of lore.kernel.org
 help / color / mirror / Atom feed
From: longli@linuxonhyperv.com
To: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Keith Busch <keith.busch@intel.com>, Jens Axboe <axboe@fb.com>,
	Christoph Hellwig <hch@lst.de>, Sagi Grimberg <sagi@grimberg.me>,
	linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org
Cc: Long Li <longli@microsoft.com>
Subject: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts
Date: Mon, 19 Aug 2019 23:14:29 -0700	[thread overview]
Message-ID: <1566281669-48212-4-git-send-email-longli@linuxonhyperv.com> (raw)
In-Reply-To: <1566281669-48212-1-git-send-email-longli@linuxonhyperv.com>

From: Long Li <longli@microsoft.com>

When a NVMe hardware queue is mapped to several CPU queues, it is possible
that the CPU this hardware queue is bound to is flooded by returning I/O for
other CPUs.

For example, consider the following scenario:
1. CPU 0, 1, 2 and 3 share the same hardware queue
2. the hardware queue interrupts CPU 0 for I/O response
3. processes from CPU 1, 2 and 3 keep sending I/Os

CPU 0 may be flooded with interrupts from NVMe device that are I/O responses
for CPU 1, 2 and 3. Under heavy I/O load, it is possible that CPU 0 spends
all the time serving NVMe and other system interrupts, but doesn't have a
chance to run in process context.

To fix this, CPU 0 can schedule a work to complete the I/O request when it
detects the scheduler is not making progress. This serves multiple purposes:

1. This CPU has to be scheduled to complete the request. The other CPUs can't
issue more I/Os until some previous I/Os are completed. This helps this CPU
get out of NVMe interrupts.

2. This acts a throttling mechanisum for NVMe devices, in that it can not
starve a CPU while servicing I/Os from other CPUs.

3. This CPU can make progress on RCU and other work items on its queue.

Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/nvme.h |  1 +
 2 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 6a9dd68c0f4f..576bb6fce293 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -28,6 +28,7 @@
 #include <linux/t10-pi.h>
 #include <linux/pm_qos.h>
 #include <asm/unaligned.h>
+#include <linux/sched.h>
 
 #define CREATE_TRACE_POINTS
 #include "trace.h"
@@ -97,6 +98,15 @@ static dev_t nvme_chr_devt;
 static struct class *nvme_class;
 static struct class *nvme_subsys_class;
 
+/*
+ * The following are for detecting if this CPU is flooded with interrupts.
+ * The timestamp for the last context switch is recorded. If that is at least
+ * MAX_SCHED_TIMEOUT ago, try to recover from interrupt flood
+ */
+static DEFINE_PER_CPU(u64, last_switch);
+static DEFINE_PER_CPU(u64, last_clock);
+#define MAX_SCHED_TIMEOUT 2000000000	// 2 seconds in ns
+
 static int nvme_revalidate_disk(struct gendisk *disk);
 static void nvme_put_subsystem(struct nvme_subsystem *subsys);
 static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
@@ -260,9 +270,54 @@ static void nvme_retry_req(struct request *req)
 	blk_mq_delay_kick_requeue_list(req->q, delay);
 }
 
+static void nvme_complete_rq_work(struct work_struct *work)
+{
+	struct nvme_request *nvme_rq =
+		container_of(work, struct nvme_request, work);
+	struct request *req = blk_mq_rq_from_pdu(nvme_rq);
+
+	nvme_complete_rq(req);
+}
+
+
 void nvme_complete_rq(struct request *req)
 {
-	blk_status_t status = nvme_error_status(req);
+	blk_status_t status;
+	int cpu;
+	u64 switches;
+	struct nvme_request *nvme_rq;
+
+	if (!in_interrupt())
+		goto skip_check;
+
+	nvme_rq = nvme_req(req);
+	cpu = smp_processor_id();
+	if (idle_cpu(cpu))
+		goto skip_check;
+
+	/* Check if this CPU is flooded with interrupts */
+	switches = get_cpu_rq_switches(cpu);
+	if (this_cpu_read(last_switch) == switches) {
+		/*
+		 * If this CPU hasn't made a context switch in
+		 * MAX_SCHED_TIMEOUT ns (and it's not idle), schedule a work to
+		 * complete this I/O. This forces this CPU run non-interrupt
+		 * code and throttle the other CPU issuing the I/O
+		 */
+		if (sched_clock() - this_cpu_read(last_clock)
+				> MAX_SCHED_TIMEOUT) {
+			INIT_WORK(&nvme_rq->work, nvme_complete_rq_work);
+			schedule_work_on(cpu, &nvme_rq->work);
+			return;
+		}
+
+	} else {
+		this_cpu_write(last_switch, switches);
+		this_cpu_write(last_clock, sched_clock());
+	}
+
+skip_check:
+	status = nvme_error_status(req);
 
 	trace_nvme_complete_rq(req);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 0a4a7f5e0de7..a8876e69e476 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -113,6 +113,7 @@ struct nvme_request {
 	u8			flags;
 	u16			status;
 	struct nvme_ctrl	*ctrl;
+	struct work_struct	work;
 };
 
 /*
-- 
2.17.1


  parent reply	other threads:[~2019-08-20  6:14 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-20  6:14 [PATCH 0/3] fix interrupt swamp in NVMe longli
2019-08-20  6:14 ` [PATCH 1/3] sched: define a function to report the number of context switches on a CPU longli
2019-08-20  9:38   ` Peter Zijlstra
2019-08-21  8:20     ` Long Li
2019-08-21 10:34       ` Peter Zijlstra
2019-08-20  9:39   ` Peter Zijlstra
2019-08-20  6:14 ` [PATCH 2/3] sched: export idle_cpu() longli
2019-08-20  6:14 ` longli [this message]
2019-08-20  9:52   ` [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts Peter Zijlstra
2019-08-21  8:37     ` Long Li
2019-08-21 10:35       ` Peter Zijlstra
2019-08-20 17:33   ` Sagi Grimberg
2019-08-21  8:39     ` Long Li
2019-08-21 17:36       ` Long Li
2019-08-21 21:54         ` Sagi Grimberg
2019-08-24  0:13           ` Long Li
2019-08-23  3:21     ` Ming Lei
2019-08-24  0:27       ` Long Li
2019-08-24 12:55         ` Ming Lei
2019-08-20  8:25 ` [PATCH 0/3] fix interrupt swamp in NVMe Ming Lei
2019-08-20  8:59   ` John Garry
2019-08-20 15:05     ` Keith Busch
2019-08-21  7:47     ` Long Li
2019-08-21  9:44       ` Ming Lei
2019-08-21 10:03         ` John Garry
2019-08-21 16:27         ` Long Li
2019-08-22  1:33           ` Ming Lei
2019-08-22  2:00             ` Keith Busch
2019-08-22  2:23               ` Ming Lei
2019-08-22  9:48               ` Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1566281669-48212-4-git-send-email-longli@linuxonhyperv.com \
    --to=longli@linuxonhyperv.com \
    --cc=axboe@fb.com \
    --cc=hch@lst.de \
    --cc=keith.busch@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=longli@microsoft.com \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=sagi@grimberg.me \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.