All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ming Lei <ming.lei@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Juri Lelli <juri.lelli@redhat.com>, Ming Lei <minlei@redhat.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: Kernel-managed IRQ affinity (cont)
Date: Fri, 20 Dec 2019 00:11:15 +0800	[thread overview]
Message-ID: <20191219161115.GA18672@ming.t460p> (raw)
In-Reply-To: <20191219143214.GA50561@xz-x1>

On Thu, Dec 19, 2019 at 09:32:14AM -0500, Peter Xu wrote:
> On Thu, Dec 19, 2019 at 04:28:19PM +0800, Ming Lei wrote:
> > Hi Peter,
> 
> Hi, Ming,
> 
> > 
> > On Mon, Dec 16, 2019 at 02:57:12PM -0500, Peter Xu wrote:
> > > Hi, Thomas,
> > > 
> > > (Sorry I must have lost the discussion during an email migration, so
> > >  I'll start with a new one)
> > > 
> > > This is a continued discussion of previous one on kernel managed IRQ
> > > affinity [1].  I think at that time the conclusion is that we don't
> > > have a usage scenario to change current policy [2].  However recently
> > > I noticed that it is probably a very fundamental requirement for some
> > > real-time scenarios, even when there's no multi-queue involved.
> > > 
> > > In my test case, it was a very common realtime guest with 10 vcpus,
> > > 0-1 are housekeeping vcpus, 2-9 are realtime vcpus.  The guest has one
> > > virtio-blk device as boot disk.  With a distribution very close to
> > > latest upstream, we can observe high spikes, probably due to the IRQs.
> > > 
> > > To guarantee realtime responsiveness, we need to make sure the IRQs
> > > will be managable, say, when I run a real-time workload on vcpu9, we
> > > should be able to move all the IRQs from vcpu9 to the other vcpus
> > > (most probably vcpu0 and vcpu1).  However with the kernel managed IRQs
> > > we can't echo to /proc/irq/N/smp_affinity.  Here, vcpu9 gets IRQ 38
> > > from the virtio-blk device:
> > > 
> > >   # cat /proc/interrupts | grep -w 38
> > >   38: 0 0 0 0 0 0 0 0 0 15206 PCI-MSI 2621441-edge virtio2-req.0
> > >   # cat /proc/irq/38/smp_affinity
> > >   3ff
> > >   # cat /proc/irq/38/effective_affinity
> > >   200
> > > 
> > > Meanwhile, I don't think there's anything special for VMs, so this
> > > issue should exist even for hosts as long as the IRQ is managed in the
> > > same way here as the virtio-blk device.
> > > 
> > > As Ming has mentioned in previous discussions [3], I think it would be
> > > at least good if the kernel IRQ system can respect "irqaffinity=" when
> > > assigning IRQs to the cores.  Currently it's not.  What would you
> > > suggest in this case?  Do you think this is a valid user scenario?
> > > 
> > > Thanks,
> > > 
> > > [1] https://lkml.org/lkml/2019/3/18/15
> > > [2] https://lkml.org/lkml/2019/3/25/562
> > > [3] https://lkml.org/lkml/2019/3/25/308
> > 
> > The following patch supposes to implementation the requirement for you,
> > can you test it by passing 'isolcpus=managed_irq,X-Y'?
> 
> I really appreciate your patch!  I'll keep this version, while before
> I start to test it...
> 
> > 
> > With this kind of change, you can't run any IO from any isolated
> > CPU core, otherwise, unpredictable error may be triggered, either oops or
> > IO hang.
> 
> ... I'm not sure whether this can be acceptable for a production
> environment.
> 
> In our case, the IRQ should come from virtio-blk which is the root
> disk, so I assume even the RT core could use it at least when loading
> the executable into RAM.  So...
> 
> > 
> > Another conservative approach is to only select effective CPU from
> > non-isolated cpus, however, the assigned CPUs may not be balanced among
> > interrupt vectors. But it is safer, since the system still works even if
> > someone submits IO from any isolated cpu core.
> 
> ... this one seems to be more appealing at least to me.

OK, please try the following patch:


diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 6c8512d3be88..0fbcbacd1b29 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -13,6 +13,7 @@ enum hk_flags {
 	HK_FLAG_TICK		= (1 << 4),
 	HK_FLAG_DOMAIN		= (1 << 5),
 	HK_FLAG_WQ		= (1 << 6),
+	HK_FLAG_MANAGED_IRQ	= (1 << 7),
 };
 
 #ifdef CONFIG_CPU_ISOLATION
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 1753486b440c..0a75a09cc4e8 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -20,6 +20,7 @@
 #include <linux/sched/task.h>
 #include <uapi/linux/sched/types.h>
 #include <linux/task_work.h>
+#include <linux/sched/isolation.h>
 
 #include "internals.h"
 
@@ -212,12 +213,33 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 {
 	struct irq_desc *desc = irq_data_to_desc(data);
 	struct irq_chip *chip = irq_data_get_irq_chip(data);
+	const struct cpumask *housekeeping_mask =
+		housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
 	int ret;
+	cpumask_var_t tmp_mask;
 
 	if (!chip || !chip->irq_set_affinity)
 		return -EINVAL;
 
-	ret = chip->irq_set_affinity(data, mask, force);
+	if (!zalloc_cpumask_var(&tmp_mask, GFP_KERNEL))
+		return -EINVAL;
+
+	/*
+	 * Userspace can't change managed irq's affinity, make sure
+	 * that isolated CPU won't be selected as the effective CPU
+	 * if this irq's affinity includes both isolated CPU and
+	 * housekeeping CPU.
+	 *
+	 * This way guarantees that isolated CPU won't be interrupted
+	 * by IO submitted from housekeeping CPU.
+	 */
+	if (irqd_affinity_is_managed(data) &&
+			cpumask_intersects(mask, housekeeping_mask))
+		cpumask_and(tmp_mask, mask, housekeeping_mask);
+	else
+		cpumask_copy(tmp_mask, mask);
+
+	ret = chip->irq_set_affinity(data, tmp_mask, force);
 	switch (ret) {
 	case IRQ_SET_MASK_OK:
 	case IRQ_SET_MASK_OK_DONE:
@@ -229,6 +251,8 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 		ret = 0;
 	}
 
+	free_cpumask_var(tmp_mask);
+
 	return ret;
 }
 
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 9fcb2a695a41..008d6ac2342b 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -163,6 +163,12 @@ static int __init housekeeping_isolcpus_setup(char *str)
 			continue;
 		}
 
+		if (!strncmp(str, "managed_irq,", 12)) {
+			str += 12;
+			flags |= HK_FLAG_MANAGED_IRQ;
+			continue;
+		}
+
 		pr_warn("isolcpus: Error, unknown flag\n");
 		return 0;
 	}

-- 
Ming


  reply	other threads:[~2019-12-19 16:11 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-12-16 19:57 Kernel-managed IRQ affinity (cont) Peter Xu
2019-12-19  8:28 ` Ming Lei
2019-12-19 14:32   ` Peter Xu
2019-12-19 16:11     ` Ming Lei [this message]
2019-12-19 18:09       ` Peter Xu
2019-12-23 19:18         ` Peter Xu
2020-01-09 20:02       ` Thomas Gleixner
2020-01-10  1:28         ` Ming Lei
2020-01-10 19:43           ` Thomas Gleixner
2020-01-11  2:48             ` Ming Lei
2020-01-14 13:45               ` Thomas Gleixner
2020-01-14 23:38                 ` Ming Lei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191219161115.GA18672@ming.t460p \
    --to=ming.lei@redhat.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=minlei@redhat.com \
    --cc=peterx@redhat.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.