From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: performance regression noted in v5.11-rc after c062db039f40
From: Robin Murphy
To: Chuck Lever, Lu Baolu
Cc: "iommu@lists.linux-foundation.org", Will Deacon, linux-rdma,
 "logang@deltatee.com", Christoph Hellwig, "murphyt7@tcd.ie",
 "isaacm@codeaurora.org"
Date: Fri, 22 Jan 2021 17:38:58 +0000
In-Reply-To: <3A4451BB-41BD-429B-BE0C-12AE7D03A99B@oracle.com>
References: <20210112143819.GA9689@willie-the-truck>
 <607648D8-BF0C-40D6-9B43-2359F45EE74C@oracle.com>
 <3568C74A-A587-4464-8840-24F7A93ABA06@oracle.com>
 <990a7c1e-e8c0-a6a8-f057-03b104cebca3@linux.intel.com>
 <3A4451BB-41BD-429B-BE0C-12AE7D03A99B@oracle.com>
X-Mailing-List: linux-rdma@vger.kernel.org

On 2021-01-22 16:18, Chuck Lever wrote:
> 
> 
>> On Jan 21, 2021, at 10:00 PM, Lu Baolu wrote:
>>
>> +Isaac
>>
>> On 1/22/21 3:09 AM, Chuck Lever wrote:
>>>> On Jan 18, 2021, at 1:00 PM, Robin Murphy wrote:
>>>>
>>>> On 2021-01-18 16:18, Chuck Lever wrote:
>>>>>> On Jan 12, 2021, at 9:38 AM, Will Deacon wrote:
>>>>>>
>>>>>> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
>>>>>>
>>>>>> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
>>>>>>> Hi-
>>>>>>>
>>>>>>> [ Please cc: me on replies, I'm not currently subscribed to
>>>>>>> iommu@lists ].
>>>>>>>
>>>>>>> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
>>>>>>> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
>>>>>>>
>>>>>>> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
>>>>>>>
>>>>>>> For those not familiar with the way storage protocols use RDMA, the
>>>>>>> initiator/client sets up memory regions and the target/server uses
>>>>>>> RDMA Read and Write to move data out of and into those regions. The
>>>>>>> initiator/client uses only RDMA memory registration and invalidation
>>>>>>> operations, and the target/server uses RDMA Read and Write.
>>>>>>>
>>>>>>> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
>>>>>>> enabled using the kernel command line options "intel_iommu=on
>>>>>>> iommu=strict".
>>>>>>>
>>>>>>> Recently I've noticed a significant (25-30%) loss in NFS throughput.
>>>>>>> I was able to bisect on my client to the following commits.
>>>>>>>
>>>>>>> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>>>>>>> map_sg"). This is about normal for this test.
>>>>>>>
>>>>>>> Children see throughput for 12 initial writers = 4732581.09 kB/sec
>>>>>>> Parent sees throughput for 12 initial writers  = 4646810.21 kB/sec
>>>>>>> Min throughput per process                     =  387764.34 kB/sec
>>>>>>> Max throughput per process                     =  399655.47 kB/sec
>>>>>>> Avg throughput per process                     =  394381.76 kB/sec
>>>>>>> Min xfer                                       = 1017344.00 kB
>>>>>>> CPU Utilization: Wall time 2.671 CPU time 1.974 CPU utilization 73.89 %
>>>>>>> Children see throughput for 12 rewriters       = 4837741.94 kB/sec
>>>>>>> Parent sees throughput for 12 rewriters        = 4833509.35 kB/sec
>>>>>>> Min throughput per process                     =  398983.72 kB/sec
>>>>>>> Max throughput per process                     =  406199.66 kB/sec
>>>>>>> Avg throughput per process                     =  403145.16 kB/sec
>>>>>>> Min xfer                                       = 1030656.00 kB
>>>>>>> CPU utilization: Wall time 2.584 CPU time 1.959 CPU utilization 75.82 %
>>>>>>> Children see throughput for 12 readers         = 5921370.94 kB/sec
>>>>>>> Parent sees throughput for 12 readers          = 5914106.69 kB/sec
>>>>>>> Min throughput per process                     =  491812.38 kB/sec
>>>>>>> Max throughput per process                     =  494777.28 kB/sec
>>>>>>> Avg throughput per process                     =  493447.58 kB/sec
>>>>>>> Min xfer                                       = 1042688.00 kB
>>>>>>> CPU utilization: Wall time 2.122 CPU time 1.968 CPU utilization 92.75 %
>>>>>>> Children see throughput for 12 re-readers      = 5947985.69 kB/sec
>>>>>>> Parent sees throughput for 12 re-readers       = 5941348.51 kB/sec
>>>>>>> Min throughput per process                     =  492805.81 kB/sec
>>>>>>> Max throughput per process                     =  497280.19 kB/sec
>>>>>>> Avg throughput per process                     =  495665.47 kB/sec
>>>>>>> Min xfer                                       = 1039360.00 kB
>>>>>>> CPU utilization: Wall time 2.111 CPU time 1.968 CPU utilization 93.22 %
>>>>>>>
>>>>>>> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
>>>>>>> iommu_ops.at(de)tach_dev"). It's losing some steam here.
>>>>>>>
>>>>>>> Children see throughput for 12 initial writers = 4342419.12 kB/sec
>>>>>>> Parent sees throughput for 12 initial writers  = 4310612.79 kB/sec
>>>>>>> Min throughput per process                     =  359299.06 kB/sec
>>>>>>> Max throughput per process                     =  363866.16 kB/sec
>>>>>>> Avg throughput per process                     =  361868.26 kB/sec
>>>>>>> Min xfer                                       = 1035520.00 kB
>>>>>>> CPU Utilization: Wall time 2.902 CPU time 1.951 CPU utilization 67.22 %
>>>>>>> Children see throughput for 12 rewriters       = 4408576.66 kB/sec
>>>>>>> Parent sees throughput for 12 rewriters        = 4404280.87 kB/sec
>>>>>>> Min throughput per process                     =  364553.88 kB/sec
>>>>>>> Max throughput per process                     =  370029.28 kB/sec
>>>>>>> Avg throughput per process                     =  367381.39 kB/sec
>>>>>>> Min xfer                                       = 1033216.00 kB
>>>>>>> CPU utilization: Wall time 2.836 CPU time 1.956 CPU utilization 68.97 %
>>>>>>> Children see throughput for 12 readers         = 5406879.47 kB/sec
>>>>>>> Parent sees throughput for 12 readers          = 5401862.78 kB/sec
>>>>>>> Min throughput per process                     =  449583.03 kB/sec
>>>>>>> Max throughput per process                     =  451761.69 kB/sec
>>>>>>> Avg throughput per process                     =  450573.29 kB/sec
>>>>>>> Min xfer                                       = 1044224.00 kB
>>>>>>> CPU utilization: Wall time 2.323 CPU time 1.977 CPU utilization 85.12 %
>>>>>>> Children see throughput for 12 re-readers      = 5410601.12 kB/sec
>>>>>>> Parent sees throughput for 12 re-readers       = 5403504.40 kB/sec
>>>>>>> Min throughput per process                     =  449918.12 kB/sec
>>>>>>> Max throughput per process                     =  452489.28 kB/sec
>>>>>>> Avg throughput per process                     =  450883.43 kB/sec
>>>>>>> Min xfer                                       = 1043456.00 kB
>>>>>>> CPU utilization: Wall time 2.321 CPU time 1.978 CPU utilization 85.21 %
>>>>>>>
>>>>>>> And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
>>>>>>> the iommu ops"). Significant throughput loss.
>>>>>>>
>>>>>>> Children see throughput for 12 initial writers = 3812036.91 kB/sec
>>>>>>> Parent sees throughput for 12 initial writers  = 3753683.40 kB/sec
>>>>>>> Min throughput per process                     =  313672.25 kB/sec
>>>>>>> Max throughput per process                     =  321719.44 kB/sec
>>>>>>> Avg throughput per process                     =  317669.74 kB/sec
>>>>>>> Min xfer                                       = 1022464.00 kB
>>>>>>> CPU Utilization: Wall time 3.309 CPU time 1.986 CPU utilization 60.02 %
>>>>>>> Children see throughput for 12 rewriters       = 3786831.94 kB/sec
>>>>>>> Parent sees throughput for 12 rewriters        = 3783205.58 kB/sec
>>>>>>> Min throughput per process                     =  313654.44 kB/sec
>>>>>>> Max throughput per process                     =  317844.50 kB/sec
>>>>>>> Avg throughput per process                     =  315569.33 kB/sec
>>>>>>> Min xfer                                       = 1035520.00 kB
>>>>>>> CPU utilization: Wall time 3.302 CPU time 1.945 CPU utilization 58.90 %
>>>>>>> Children see throughput for 12 readers         = 4265828.28 kB/sec
>>>>>>> Parent sees throughput for 12 readers          = 4261844.88 kB/sec
>>>>>>> Min throughput per process                     =  352305.00 kB/sec
>>>>>>> Max throughput per process                     =  357726.22 kB/sec
>>>>>>> Avg throughput per process                     =  355485.69 kB/sec
>>>>>>> Min xfer                                       = 1032960.00 kB
>>>>>>> CPU utilization: Wall time 2.934 CPU time 1.942 CPU utilization 66.20 %
>>>>>>> Children see throughput for 12 re-readers      = 4220651.19 kB/sec
>>>>>>> Parent sees throughput for 12 re-readers       = 4216096.04 kB/sec
>>>>>>> Min throughput per process                     =  348677.16 kB/sec
>>>>>>> Max throughput per process                     =  353467.44 kB/sec
>>>>>>> Avg throughput per process                     =  351720.93 kB/sec
>>>>>>> Min xfer                                       = 1035264.00 kB
>>>>>>> CPU utilization: Wall time 2.969 CPU time 1.952 CPU utilization 65.74 %
>>>>>>>
>>>>>>> The regression appears to be 100% reproducible.
>>>>> Any thoughts?
>>>>> How about some tools to try or debugging advice? I don't know where to
>>>>> start.
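[Editor's note: as a quick cross-check of the bisect above, the quoted
"Children see throughput" figures can be turned into percentage drops
relative to the 65f746e8285f baseline. A small sketch, not part of the
original report; the dict keys are just labels for the four iozone phases:]

```python
# Aggregate "Children see throughput" figures (kB/sec) quoted above,
# one dict per bisect point.
baseline = {"write": 4732581.09, "rewrite": 4837741.94,
            "read": 5921370.94, "reread": 5947985.69}   # 65f746e8285f
geometry = {"write": 4342419.12, "rewrite": 4408576.66,
            "read": 5406879.47, "reread": 5410601.12}   # c062db039f40
iommu_ops = {"write": 3812036.91, "rewrite": 3786831.94,
             "read": 4265828.28, "reread": 4220651.19}  # c588072bba6b

def drop(new, old):
    """Percentage of baseline throughput lost."""
    return 100.0 * (old - new) / old

for phase in baseline:
    print(f"{phase:8s} c062db039f40: {drop(geometry[phase], baseline[phase]):5.1f}%  "
          f"c588072bba6b: {drop(iommu_ops[phase], baseline[phase]):5.1f}%")
```

The middle commit costs under 10% in every phase, while the final commit
loses roughly 19% on writes and 28-29% on reads, consistent with the
reported 25-30% overall regression.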
>>>>
>>>> I'm not familiar enough with VT-d internals or InfiniBand to have a
>>>> clue why the middle commit makes any difference (the calculation
>>>> itself is not on a fast path, so AFAICS the worst it could do is
>>>> change your maximum DMA address size from 48/57 bits to 47/56, and
>>>> that seems relatively benign).
>>>>
>>>> With the last commit, though, at least part of it is likely to be the
>>>> unfortunate but inevitable overhead of the internal indirection
>>>> through the IOMMU API. There's a coincidental performance-related
>>>> thread where we've already started pondering some ideas in that
>>>> area[1] (note that Intel is the last one to the party here; AMD has
>>>> been using this path for a while, and it's all that arm64 systems
>>>> have ever known). I'm not sure if there's any difference in the
>>>> strict invalidation behaviour between the IOMMU API calls and the old
>>>> intel_dma_ops, but I suppose that might be worth quickly
>>>> double-checking as well. I guess the main thing would be to do some
>>>> profiling to see where time is being spent in iommu-dma and
>>>> intel-iommu vs. just different parts of intel-iommu before, and
>>>> whether anything in particular stands out beyond the extra call
>>>> overhead currently incurred by iommu_{map,unmap}.
>>> I did a function_graph trace of the above iozone test on a v5.10 NFS
>>> client and again on v5.11-rc. There is a substantial timing difference
>>> in dma_map_sg_attrs. Each excerpt below is for DMA-mapping a 120KB set
>>> of pages that are part of an NFS/RDMA WRITE operation.
>>> v5.10:
>>> 1072.028308: funcgraph_entry: | dma_map_sg_attrs() {
>>> 1072.028308: funcgraph_entry: | intel_map_sg() {
>>> 1072.028309: funcgraph_entry: | find_domain() {
>>> 1072.028309: funcgraph_entry: 0.280 us | get_domain_info();
>>> 1072.028310: funcgraph_exit: 0.930 us | }
>>> 1072.028310: funcgraph_entry: 0.360 us | domain_get_iommu();
>>> 1072.028311: funcgraph_entry: | intel_alloc_iova() {
>>> 1072.028311: funcgraph_entry: | alloc_iova_fast() {
>>> 1072.028311: funcgraph_entry: 0.375 us | _raw_spin_lock_irqsave();
>>> 1072.028312: funcgraph_entry: 0.285 us | __lock_text_start();
>>> 1072.028313: funcgraph_exit: 1.500 us | }
>>> 1072.028313: funcgraph_exit: 2.052 us | }
>>> 1072.028313: funcgraph_entry: | domain_mapping() {
>>> 1072.028313: funcgraph_entry: | __domain_mapping() {
>>> 1072.028314: funcgraph_entry: 0.350 us | pfn_to_dma_pte();
>>> 1072.028315: funcgraph_entry: 0.942 us | domain_flush_cache();
>>> 1072.028316: funcgraph_exit: 2.852 us | }
>>> 1072.028316: funcgraph_entry: 0.275 us | iommu_flush_write_buffer();
>>> 1072.028317: funcgraph_exit: 4.213 us | }
>>> 1072.028318: funcgraph_exit: 9.392 us | }
>>> 1072.028318: funcgraph_exit: + 10.073 us | }
>>> 1072.028323: xprtrdma_mr_map: mr.id=115 nents=30 122880@0xe476ca03f1180000:0x18011105 (TO_DEVICE)
>>> 1072.028323: xprtrdma_chunk_read: task:63879@5 pos=148 122880@0xe476ca03f1180000:0x18011105 (more)
>>> v5.11-rc:
>>> 57.602990: funcgraph_entry: | dma_map_sg_attrs() {
>>> 57.602990: funcgraph_entry: | iommu_dma_map_sg() {
>>> 57.602990: funcgraph_entry: 0.285 us | iommu_get_dma_domain();
>>> 57.602991: funcgraph_entry: 0.270 us | iommu_dma_deferred_attach();
>>> 57.602991: funcgraph_entry: | iommu_dma_sync_sg_for_device() {
>>> 57.602992: funcgraph_entry: 0.268 us | dev_is_untrusted();
>>> 57.602992: funcgraph_exit: 0.815 us | }
>>> 57.602993: funcgraph_entry: 0.267 us | dev_is_untrusted();
>>> 57.602993: funcgraph_entry: | iommu_dma_alloc_iova() {
>>> 57.602994: funcgraph_entry: | alloc_iova_fast() {
>>> 57.602994: funcgraph_entry: 0.260 us | _raw_spin_lock_irqsave();
>>> 57.602995: funcgraph_entry: 0.293 us | _raw_spin_lock();
>>> 57.602995: funcgraph_entry: 0.273 us | _raw_spin_unlock_irqrestore();
>>> 57.602996: funcgraph_entry: 1.147 us | alloc_iova();
>>> 57.602997: funcgraph_exit: 3.370 us | }
>>> 57.602997: funcgraph_exit: 3.945 us | }
>>> 57.602998: funcgraph_entry: 0.272 us | dma_info_to_prot();
>>> 57.602998: funcgraph_entry: | iommu_map_sg_atomic() {
>>> 57.602998: funcgraph_entry: | __iommu_map_sg() {
>>> 57.602999: funcgraph_entry: 1.733 us | __iommu_map();
>>> 57.603001: funcgraph_entry: 1.642 us | __iommu_map();
>>> 57.603003: funcgraph_entry: 1.638 us | __iommu_map();
>>> 57.603005: funcgraph_entry: 1.645 us | __iommu_map();
>>> 57.603007: funcgraph_entry: 1.630 us | __iommu_map();
>>> 57.603009: funcgraph_entry: 1.770 us | __iommu_map();
>>> 57.603011: funcgraph_entry: 1.730 us | __iommu_map();
>>> 57.603013: funcgraph_entry: 1.633 us | __iommu_map();
>>> 57.603015: funcgraph_entry: 1.605 us | __iommu_map();
>>> 57.603017: funcgraph_entry: 2.847 us | __iommu_map();
>>> 57.603020: funcgraph_entry: 2.847 us | __iommu_map();
>>> 57.603024: funcgraph_entry: 2.955 us | __iommu_map();
>>> 57.603027: funcgraph_entry: 2.928 us | __iommu_map();
>>> 57.603030: funcgraph_entry: 2.933 us | __iommu_map();
>>> 57.603034: funcgraph_entry: 2.943 us | __iommu_map();
>>> 57.603037: funcgraph_entry: 2.928 us | __iommu_map();
>>> 57.603040: funcgraph_entry: 2.857 us | __iommu_map();
>>> 57.603044: funcgraph_entry: 2.953 us | __iommu_map();
>>> 57.603047: funcgraph_entry: 3.023 us | __iommu_map();
>>> 57.603050: funcgraph_entry: 1.645 us | __iommu_map();
>>> 57.603052: funcgraph_exit: + 53.648 us | }
>>> 57.603052: funcgraph_exit: + 54.178 us | }
>>> 57.603053: funcgraph_exit: + 62.953 us | }
>>> 57.603053: funcgraph_exit: + 63.567 us | }
>>> 57.603059: xprtrdma_mr_map: task:60@5 mr.id=4 nents=30 122880@0xd79cc0e2f18c0000:0x00010501 (TO_DEVICE)
>>> 57.603060: xprtrdma_chunk_read: task:60@5 pos=148 122880@0xd79cc0e2f18c0000:0x00010501 (more)
>>
>> I kind of believe it's due to the indirect calls. This is also reported
>> on ARM.
>>
>> https://lore.kernel.org/linux-iommu/1610376862-927-1-git-send-email-isaacm@codeaurora.org/
>>
>> Maybe we can try changing indirect calls to static ones to verify this
>> problem.
> 
> I liked the idea of map_sg() enough to try my hand at building a PoC for
> Intel, based on Isaac's patch series. It's just a cut-and-paste of the
> generic iommu.c code with the indirect calls to ops->map() replaced.
> 
> The indirect calls do not seem to be the problem. Calling intel_iommu_map
> directly appears to be as costly as calling it indirectly.
> 
> However, perhaps there are other ways map_sg() can be beneficial. In
> v5.10, __domain_mapping and iommu_flush_write_buffer() appear to be
> invoked just once for each large map operation, for example.

Oh, if the driver needs to do maintenance beyond just installing PTEs,
that should probably be devolved to iotlb_sync_map anyway. There's a
patch series here generalising that to be more useful, which is
hopefully just waiting to be merged now:

https://lore.kernel.org/linux-iommu/20210107122909.16317-1-yong.wu@mediatek.com/

Robin.
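[Editor's note: summing the per-call times in the two traces quoted above
makes the gap concrete. A small sketch using only numbers transcribed from
the traces:]

```python
# __iommu_map() durations (us) transcribed from the v5.11-rc trace above:
# one call per mapped range for the 120KB scatterlist.
iommu_map_us = [1.733, 1.642, 1.638, 1.645, 1.630, 1.770, 1.730, 1.633,
                1.605, 2.847, 2.847, 2.955, 2.928, 2.933, 2.943, 2.928,
                2.857, 2.953, 3.023, 1.645]

# In the v5.10 trace the equivalent page-table work is a single
# domain_mapping() call (including its one write-buffer flush).
v510_domain_mapping_us = 4.213

total = sum(iommu_map_us)
print(f"{len(iommu_map_us)} x __iommu_map(): {total:.1f} us "
      f"vs one domain_mapping(): {v510_domain_mapping_us} us "
      f"(~{total / v510_domain_mapping_us:.0f}x)")
```

The per-range calls alone account for roughly 46 of the 63.6 us spent in
dma_map_sg_attrs() on v5.11-rc, versus about 4 us of equivalent work in
the v5.10 path.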
> Here's a trace of my prototype in operation:
> 
> 380.620150: funcgraph_entry: | iommu_dma_map_sg() {
> 380.620150: funcgraph_entry: 0.285 us | iommu_get_dma_domain();
> 380.620150: funcgraph_entry: 0.265 us | iommu_dma_deferred_attach();
> 380.620151: funcgraph_entry: | iommu_dma_sync_sg_for_device() {
> 380.620151: funcgraph_entry: 0.285 us | dev_is_untrusted();
> 380.620152: funcgraph_exit: 0.860 us | }
> 380.620152: funcgraph_entry: 0.263 us | dev_is_untrusted();
> 380.620153: funcgraph_entry: | iommu_dma_alloc_iova() {
> 380.620153: funcgraph_entry: | alloc_iova_fast() {
> 380.620153: funcgraph_entry: 0.268 us | _raw_spin_lock_irqsave();
> 380.620154: funcgraph_entry: 0.275 us | _raw_spin_unlock_irqrestore();
> 380.620155: funcgraph_exit: 1.402 us | }
> 380.620155: funcgraph_exit: 1.955 us | }
> 380.620155: funcgraph_entry: 0.265 us | dma_info_to_prot();
> 380.620156: funcgraph_entry: | iommu_map_sg_atomic() {
> 380.620156: funcgraph_entry: | __iommu_map_sg() {
> 380.620156: funcgraph_entry: | intel_iommu_map_sg() {
> 380.620157: funcgraph_entry: 0.270 us | iommu_pgsize();
> 380.620157: funcgraph_entry: | intel_iommu_map() {
> 380.620157: funcgraph_entry: 0.970 us | __domain_mapping();
> 380.620159: funcgraph_entry: 0.265 us | iommu_flush_write_buffer();
> 380.620159: funcgraph_exit: 2.322 us | }
> 380.620160: funcgraph_entry: 0.270 us | iommu_pgsize();
> 380.620160: funcgraph_entry: | intel_iommu_map() {
> 380.620161: funcgraph_entry: 0.957 us | __domain_mapping();
> 380.620162: funcgraph_entry: 0.275 us | iommu_flush_write_buffer();
> 380.620163: funcgraph_exit: 2.315 us | }
> 380.620163: funcgraph_entry: 0.265 us | iommu_pgsize();
> 380.620163: funcgraph_entry: | intel_iommu_map() {
> 380.620164: funcgraph_entry: 0.940 us | __domain_mapping();
> 380.620165: funcgraph_entry: 0.270 us | iommu_flush_write_buffer();
> 380.620166: funcgraph_exit: 2.295 us | }
> 
> ....
> 
> 380.620247: funcgraph_entry: 0.262 us | iommu_pgsize();
> 380.620248: funcgraph_entry: | intel_iommu_map() {
> 380.620248: funcgraph_entry: 0.935 us | __domain_mapping();
> 380.620249: funcgraph_entry: 0.305 us | iommu_flush_write_buffer();
> 380.620250: funcgraph_exit: 2.315 us | }
> 380.620250: funcgraph_entry: 0.273 us | iommu_pgsize();
> 380.620251: funcgraph_entry: | intel_iommu_map() {
> 380.620251: funcgraph_entry: 0.967 us | __domain_mapping();
> 380.620253: funcgraph_entry: 0.265 us | iommu_flush_write_buffer();
> 380.620253: funcgraph_exit: 2.310 us | }
> 380.620254: funcgraph_exit: + 97.388 us | }
> 380.620254: funcgraph_exit: + 97.960 us | }
> 380.620254: funcgraph_exit: + 98.482 us | }
> 380.620255: funcgraph_exit: ! 105.175 us | }
> 380.620260: xprtrdma_mr_map: task:1607@5 mr.id=126 nents=30 122880@0xf06ee5bbf1920000:0x70011104 (TO_DEVICE)
> 380.620261: xprtrdma_chunk_read: task:1607@5 pos=148 122880@0xf06ee5bbf1920000:0x70011104 (more)
> 
> 
> -- 
> Chuck Lever
> 
> 
>
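[Editor's note: a rough back-of-envelope for the iotlb_sync_map suggestion,
using figures from the prototype trace above. The per-flush cost of ~0.27 us
is read off the iommu_flush_write_buffer() entries; treating nents=30 as the
number of per-range flushes is an assumption, since the trace is elided:]

```python
# Prototype trace: each intel_iommu_map() pays its own
# iommu_flush_write_buffer(), ~0.27 us per call in the quoted entries.
n_maps = 30        # nents=30 from the xprtrdma_mr_map event (assumed 1:1)
flush_us = 0.27    # typical iommu_flush_write_buffer() duration above

# If the flush were devolved to a single iotlb_sync_map() after the whole
# scatterlist is mapped, all but one of the per-map flushes would go away.
saved_us = (n_maps - 1) * flush_us
print(f"estimated saving from batching the write-buffer flush: ~{saved_us:.1f} us")
```

That only recovers a few percent of the ~105 us total, which suggests the
remaining per-call setup and indirection overhead around each ~0.95 us
__domain_mapping() is the larger target for batching.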