From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from szxga01-in.huawei.com (szxga01-in.huawei.com [45.249.212.187]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 620F81C68C for ; Wed, 26 Jun 2024 02:22:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.187 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719368579; cv=none; b=Jd4WyixqEbJL5P6fW6PJfL5T5ZT2R0UvgHxCba/dcgCMvqb9eef3O//Lvmsm4Z+3+63q/v0DbTNP7v0TjZo7CQUL48tCn6zllLhbAWWpmUWXMkmkmliIqzaQjloCZpuO1ImqXQ8QEZlWJ9k/HHHPZFVc6h5DuVfZLvp6IalH8e4= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719368579; c=relaxed/simple; bh=tf8cAd4CGDxMkSIb4/CzIqW7Z1l7xYr6/PgLpBFWqmQ=; h=Subject:To:CC:References:From:Message-ID:Date:MIME-Version: In-Reply-To:Content-Type; b=tVhjoJmEuTwWCbGWYFZFXTknj/Zx9+b8pH/leJ7UsnHaXp2aZapY9WT0kPO02xGpn7b5tDzMYIqYQs3F80PNm/5HecFfCxSZTAPtGj+PKp3nHwf7cD7C3C/iFS1pTOtmTr979R4o6B0Wz7MJ/xJkTI802t+2OBQ0Klpz8gvamSY= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.187 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.163.252]) by szxga01-in.huawei.com (SkyGuard) with ESMTP id 4W853B0G6NzZcG9; Wed, 26 Jun 2024 10:18:30 +0800 (CST) Received: from kwepemd500014.china.huawei.com (unknown [7.221.188.63]) by mail.maildlp.com (Postfix) with ESMTPS id CEEFB180086; Wed, 26 Jun 2024 10:22:53 +0800 (CST) Received: from [10.67.146.137] (10.67.146.137) by kwepemd500014.china.huawei.com (7.221.188.63) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1258.34; Wed, 26 Jun 2024 10:22:53 +0800 Subject: Re: [RESEND PATCH] irqchip/gic-v4.1: Use the ITS of the NUMA node where current cpu is located To: Marc Zyngier CC: , , , , References: <20240625014019.3914240-1-tangnianyao@huawei.com> <86wmmdihkf.wl-maz@kernel.org> From: Tangnianyao Message-ID: <60de5bd6-51db-e327-5808-280407a6285d@huawei.com> Date: Wed, 26 Jun 2024 10:22:52 +0800 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.2.1 Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 In-Reply-To: <86wmmdihkf.wl-maz@kernel.org> Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Content-Language: en-US X-ClientProxiedBy: dggems704-chm.china.huawei.com (10.3.19.181) To kwepemd500014.china.huawei.com (7.221.188.63) On 6/25/2024 15:53, Marc Zyngier wrote: > On Tue, 25 Jun 2024 02:40:19 +0100, > Nianyao Tang wrote: >> When GICv4.1 enabled, guest sending IPI use the last ITS reported. >> On multi-NUMA environment with more than one ITS, it makes IPI performance >> various from VM to VM, depending on which NUMA the VM is deployed on. >> We can use closer ITS instead of the last ITS reported. > Closer to *what*? the SGI sender? or the receiver? Something else? VSGI sender. VSGI sender use original find_4_1_its to inject vsgi, it always find the last reported 4_1 ITS, regardless of which NUMA the VSGI sender cpu is located on. > >> Modify find_4_1_its to find the ITS of the NUMA node where current >> cpu is located and save it with per cpu variable. > But find_4_1_its() isn't only used for SGIs. Is it valid to do this > trick for all use cases? To consider this case, I've implemented original find_4_1_its function, finding a 4_1 ITS in system and return, even NUMA is not match. Would it be enough to be compatitable with other code ? A new find_4_1_its can firstly select 4_1 ITS on the same NUMA as the current cpu(VSGI sender), and if fail to find, then return 4_1 ITS on other NUMA. > >> (There's format issues with the previous patch, resend it) > In the future, please move this sort of comment to a note after the > --- delimiter. ok, get it. > >> Signed-off-by: Nianyao Tang >> --- >> drivers/irqchip/irq-gic-v3-its.c | 27 ++++++++++++++++++--------- >> 1 file changed, 18 insertions(+), 9 deletions(-) >> >> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c >> index 3c755d5dad6e..d35b42f3b2af 100644 >> --- a/drivers/irqchip/irq-gic-v3-its.c >> +++ b/drivers/irqchip/irq-gic-v3-its.c >> @@ -193,6 +193,8 @@ static DEFINE_RAW_SPINLOCK(vmovp_lock); >> >> static DEFINE_IDA(its_vpeid_ida); >> >> +static DEFINE_PER_CPU(struct its_node *, its_on_cpu); > I don't really get the "its_on_cpu" name. "local_its" would at least > indicate a notion being "close". I want to mean ITS on the current cpu NUMA node. Yes, "local_its" is better. > >> + >> #define gic_data_rdist() (raw_cpu_ptr(gic_rdists->rdist)) >> #define gic_data_rdist_cpu(cpu) (per_cpu_ptr(gic_rdists->rdist, cpu)) >> #define gic_data_rdist_rd_base() (gic_data_rdist()->rd_base) >> @@ -4058,19 +4060,25 @@ static struct irq_chip its_vpe_irq_chip = { >> >> static struct its_node *find_4_1_its(void) >> { >> - static struct its_node *its = NULL; >> + struct its_node *its = NULL; >> + struct its_node *its_non_cpu_node = NULL; >> + int cpu = smp_processor_id(); >> >> - if (!its) { >> - list_for_each_entry(its, &its_nodes, entry) { >> - if (is_v4_1(its)) >> - return its; >> - } >> + if (per_cpu(its_on_cpu, cpu)) >> + return per_cpu(its_on_cpu, cpu); >> >> - /* Oops? */ >> - its = NULL; >> - } >> + list_for_each_entry(its, &its_nodes, entry) { >> + if (is_v4_1(its) && its->numa_node == cpu_to_node(cpu)) { >> + per_cpu(its_on_cpu, cpu) = its; >> + return its; >> + } else if (is_v4_1(its)) >> + its_non_cpu_node = its; >> + } > Why do you consider the NUMA node instead of the ITS' own affinity? > SVPET gives you some notion of distance with the RDs, and that'd > probably be useful. I assumed BIOS should report NUMA node following real topology, use NUMA node for simplicity. > >> >> - return its; >> + if (!per_cpu(its_on_cpu, cpu) && its_non_cpu_node) >> + per_cpu(its_on_cpu, cpu) = its_non_cpu_node; >> + >> + return its_non_cpu_node; >> } > Urgh. Mixing init and runtime is awful. Why isn't this initialised > when a CPU comes up? We already have all the infrastructure. The original find_4_1_its use "static struct its_node *its" to save 4_1 ITS, and it's init inside this function. So, to follow this, I tried to not modify this usage. > > But the biggest question is "what sort of performance improvement does > this bring"? You give no numbers, no way to evaluate anything. > > I've asked for that times and times again: if your changes are > claiming a performance improvement, please back it up. It's not that > hard. On a 2-socket environment, reported as 2-NUMA, each socket with one ITS and 32 cpu, GICv4.1 enabled. For performance, I deploy a 4U8G guest, 4 vcpu on same socket. When I deploy guest on socket0, kvm-unit-tests ipi_hw result is 850ns. It test the delay from one vcpu sending ipi to another vcpu receiving ipi in guest. When I deploy guest on socket1, the result is 750ns. The reason is VSGI sender always use lasted reported ITS to inject VSGI. The access from cpu to other-socket ITS will cost 100ns more compared to cpu to local ITS. > > Thanks, > > M. >