From mboxrd@z Thu Jan  1 00:00:00 1970
From: Sasha Levin
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Adrian Huang, Keith Busch, Jiwei Sun, Christoph Hellwig, Sasha Levin,
    sagi@grimberg.me, linux-nvme@lists.infradead.org
Subject: [PATCH AUTOSEL 6.3 03/67] nvme-pci: clamp max_hw_sectors based on DMA optimized limitation
Date: Thu, 25 May 2023 14:30:40 -0400
Message-Id: <20230525183144.1717540-3-sashal@kernel.org>
In-Reply-To: <20230525183144.1717540-1-sashal@kernel.org>
References: <20230525183144.1717540-1-sashal@kernel.org>

From: Adrian Huang

[ Upstream commit 3710e2b056cb92ad816e4d79fa54a6a5b6ad8cbd ]

When running a fio test on a 448-core AMD server with an NVMe disk, a
soft-lockup or hard-lockup call trace is reported:

[soft lockup]
watchdog: BUG: soft lockup - CPU#126 stuck for 23s! [swapper/126:0]
RIP: 0010:_raw_spin_unlock_irqrestore+0x21/0x50
...
Call Trace:
  fq_flush_timeout+0x7d/0xd0
  ? __pfx_fq_flush_timeout+0x10/0x10
  call_timer_fn+0x2e/0x150
  run_timer_softirq+0x48a/0x560
  ? __pfx_fq_flush_timeout+0x10/0x10
  ? clockevents_program_event+0xaf/0x130
  __do_softirq+0xf1/0x335
  irq_exit_rcu+0x9f/0xd0
  sysvec_apic_timer_interrupt+0xb4/0xd0
  asm_sysvec_apic_timer_interrupt+0x1f/0x30
...

Obviously, fq_flush_timeout spends over 20 seconds.
Here is the ftrace log:

               |  fq_flush_timeout() {
               |    fq_ring_free() {
               |      put_pages_list() {
   0.170 us    |        free_unref_page_list();
   0.810 us    |      }
               |      free_iova_fast() {
               |        free_iova() {
 * 85622.66 us |          _raw_spin_lock_irqsave();
   2.860 us    |          remove_iova();
   0.600 us    |          _raw_spin_unlock_irqrestore();
   0.470 us    |          lock_info_report();
   2.420 us    |          free_iova_mem.part.0();
 * 85638.27 us |        }
 * 85638.84 us |      }
               |      put_pages_list() {
   0.230 us    |        free_unref_page_list();
   0.470 us    |      }
               |      ...
               |      ...
 $ 31017069 us |  }

Most of the cores are under lock contention for acquiring
iova_rbtree_lock due to the iova flush queue mechanism.

[hard lockup]
NMI watchdog: Watchdog detected hard LOCKUP on cpu 351
RIP: 0010:native_queued_spin_lock_slowpath+0x2d8/0x330

Call Trace:
  _raw_spin_lock_irqsave+0x4f/0x60
  free_iova+0x27/0xd0
  free_iova_fast+0x4d/0x1d0
  fq_ring_free+0x9b/0x150
  iommu_dma_free_iova+0xb4/0x2e0
  __iommu_dma_unmap+0x10b/0x140
  iommu_dma_unmap_sg+0x90/0x110
  dma_unmap_sg_attrs+0x4a/0x50
  nvme_unmap_data+0x5d/0x120 [nvme]
  nvme_pci_complete_batch+0x77/0xc0 [nvme]
  nvme_irq+0x2ee/0x350 [nvme]
  ? __pfx_nvme_pci_complete_batch+0x10/0x10 [nvme]
  __handle_irq_event_percpu+0x53/0x1a0
  handle_irq_event_percpu+0x19/0x60
  handle_irq_event+0x3d/0x60
  handle_edge_irq+0xb3/0x210
  __common_interrupt+0x7f/0x150
  common_interrupt+0xc5/0xf0
  asm_common_interrupt+0x2b/0x40
...

ftrace shows fq_ring_free spends over 10 seconds [1]. Again, most of
the cores are under lock contention for acquiring iova_rbtree_lock due
to the iova flush queue mechanism.

[Root Cause]
The NVMe disk's max_hw_sectors_kb (mdts=10) is 4096 KB, but streaming
DMA mappings longer than 128 KB cannot benefit from the scalable IOVA
mechanism introduced by commit 9257b4a206fc ("iommu/iova: introduce
per-cpu caching to iova allocation"), so every large unmap falls back
to the rbtree serialized by iova_rbtree_lock.

To fix the lock contention issue, clamp max_hw_sectors based on the
DMA optimized limitation in order to leverage the scalable IOVA
mechanism.

Note: The issue does not happen with another NVMe disk (mdts = 5 and
max_hw_sectors_kb = 128).

[1] https://gist.github.com/AdrianHuang/bf8ec7338204837631fbdaed25d19cc4

Suggested-by: Keith Busch
Reported-and-tested-by: Jiwei Sun
Signed-off-by: Adrian Huang
Reviewed-by: Keith Busch
Signed-off-by: Christoph Hellwig
Signed-off-by: Sasha Levin
---
 drivers/nvme/host/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a7772c0194d5a..a389f1ea0b151 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2960,7 +2960,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 	 * over a single page.
 	 */
 	dev->ctrl.max_hw_sectors = min_t(u32,
-		NVME_MAX_KB_SZ << 1, dma_max_mapping_size(&pdev->dev) >> 9);
+		NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
 	dev->ctrl.max_segments = NVME_MAX_SEGS;

 	/*
-- 
2.39.2
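
For reference, dma_opt_mapping_size() returns the largest mapping size
that the IOMMU DMA layer's per-CPU IOVA caches can still serve (128 KB
with 4 KB pages), while dma_max_mapping_size() returns a much larger
hard limit, which is why the old clamp let the 4 MB driver cap win.
Below is a minimal user-space sketch of the arithmetic behind the fix;
the MIN() helper, the stand-in dma_opt_mapping_size(), and the constant
values are illustrative assumptions rather than the kernel's
definitions.

#include <stdio.h>
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

#define PAGE_SIZE                 4096ULL /* assuming 4 KiB pages */
#define IOVA_RANGE_CACHE_MAX_SIZE 6      /* caches up to 2^(6-1) = 32 pages */
#define NVME_MAX_KB_SZ            4096ULL /* illustrative driver cap, in KiB */

/* Stand-in for the kernel helper: the IOVA rcache ceiling, 128 KiB here. */
static uint64_t dma_opt_mapping_size(void)
{
	return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
}

int main(void)
{
	/*
	 * KiB -> 512-byte sectors is "<< 1"; bytes -> sectors is ">> 9".
	 * The patched clamp picks whichever is smaller: the driver cap or
	 * the DMA-optimized mapping size.
	 */
	uint64_t max_hw_sectors = MIN(NVME_MAX_KB_SZ << 1,
				      dma_opt_mapping_size() >> 9);

	/* 131072 B / 512 B = 256 sectors, i.e. max_hw_sectors_kb = 128 */
	printf("max_hw_sectors = %llu (%llu KiB per I/O)\n",
	       (unsigned long long)max_hw_sectors,
	       (unsigned long long)(max_hw_sectors >> 1));
	return 0;
}

With mdts=10 the controller advertises 4 MB transfers; after this
change the block layer splits such I/O at 128 KB, so each mapping stays
on the per-CPU IOVA cache fast path instead of contending on
iova_rbtree_lock.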