From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 1 Aug 2023 08:38:05 +0200
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: Petr Tesarik
Cc: Stefano Stabellini, Russell King, Thomas Bogendoerfer,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	"maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)", "H. Peter Anvin",
	"Rafael J. Wysocki", Juergen Gross, Oleksandr Tyshchenko,
	Christoph Hellwig, Marek Szyprowski, Robin Murphy,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka, Roman Gushchin,
	Hyeonggon Yoo <42.hyeyoo@gmail.com>, Petr Tesarik,
	Jonathan Corbet, Andy Shevchenko, Hans de Goede, James Seo,
	James Clark, Kees Cook, "moderated list:XEN HYPERVISOR ARM",
	"moderated list:ARM PORT", open list, "open list:MIPS",
	"open list:XEN SWIOTLB SUBSYSTEM", "open list:SLAB ALLOCATOR",
	Roberto Sassu, petr@tesarici.cz
Subject: Re: [PATCH v7 0/9] Allow dynamic allocation of software IO TLB bounce buffers
Message-ID: <2023080144-cardigan-nerd-2bed@gregkh>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Tue, Aug 01, 2023 at 08:23:55AM +0200, Petr Tesarik wrote:
> From: Petr Tesarik
> 
> Motivation
> ==========
> 
> The software IO TLB was designed with these assumptions:
> 
> 1) It would not be used much. Small systems (little RAM) don't need it,
>    and big systems (lots of RAM) would have modern DMA controllers and
>    an IOMMU chip to handle legacy devices.
> 2) A small fixed memory area (64 MiB by default) is sufficient to
>    handle the few cases which require a bounce buffer.
> 3) 64 MiB is little enough that it has no impact on the rest of the
>    system.
> 4) Bounce buffers require large contiguous chunks of low memory. Such
>    memory is precious and can be allocated only early at boot.
> 
> It turns out that these assumptions are not always true:
> 
> 1) Embedded systems may have more than 4 GiB RAM but no IOMMU, and
>    legacy 32-bit peripheral buses and/or DMA controllers.
> 2) CoCo VMs use bounce buffers for all I/O but may need substantially
>    more than 64 MiB.
> 3) Embedded developers put as many features as possible into the
>    available memory. A few dozen "missing" megabytes may limit what
>    features can be implemented.
> 4) If CMA is available, it can allocate large contiguous chunks even
>    after the system has run for some time.
> 
> Goals
> =====
> 
> The goal of this work is to start with a small software IO TLB at boot
> and expand it later when/if needed.
> 
> Design
> ======
> 
> This version of the patch series retains the current slot allocation
> algorithm with multiple areas to reduce lock contention, but additional
> slots can be added when necessary.
> 
> These alternatives have been considered:
> 
> - Allocate and free buffers as needed using the direct DMA API. This
>   works quite well, except in CoCo VMs, where each allocation/free
>   requires decrypting/encrypting memory, which is a very expensive
>   operation.
> 
> - Allocate a very large software IO TLB at boot, but allow migrating
>   pages to/from it (like CMA does). For systems with CMA, this would
>   mean two big allocations at boot. Finding the balance between CMA,
>   SWIOTLB and the rest of available RAM can be challenging. More
>   importantly, there is no clear benefit compared to allocating SWIOTLB
>   memory pools from the CMA.
> 
> Implementation Constraints
> ==========================
> 
> These constraints have been taken into account:
> 
> 1) Minimize impact on devices which do not benefit from the change.
> 2) Minimize the number of memory decryption/encryption operations.
> 3) Avoid contention on a lock or atomic variable to preserve parallel
>    scalability.
> 
> Additionally, the software IO TLB code is also used to implement
> restricted DMA pools. These pools are restricted to a pre-defined
> physical memory region and must not use any other memory. In other
> words, dynamic allocation of memory pools must be disabled for
> restricted DMA pools.
> 
> Data Structures
> ===============
> 
> The existing struct io_tlb_mem is the central type for a SWIOTLB
> allocator, but it now contains multiple memory pools::
> 
>   io_tlb_mem
>   +---------+   io_tlb_pool
>   | SWIOTLB |   +-------+   +-------+   +-------+
>   |allocator|-->|default|-->|dynamic|-->|dynamic|-->...
>   |         |   |memory |   |memory |   |memory |
>   +---------+   | pool  |   | pool  |   | pool  |
>                 +-------+   +-------+   +-------+
> 
> The allocator structure contains global state (such as flags and
> counters) and the structures needed to schedule new allocations. Each
> memory pool contains the actual buffer slots and metadata. The first
> memory pool in the list is the default memory pool, allocated
> statically at early boot.
> 
> New memory pools are allocated from a kernel worker thread. That's
> because bounce buffers are allocated when mapping a DMA buffer, which
> may happen in interrupt context, where large atomic allocations would
> probably fail. Allocation from process context is much more likely to
> succeed, especially if it can use CMA.
> 
> Nonetheless, the onset of a load spike may fill up the SWIOTLB before
> the worker has a chance to run. In that case, try to allocate a small
> transient memory pool to accommodate the request. If memory is
> encrypted and the device cannot do DMA to encrypted memory, this buffer
> is allocated from the coherent atomic DMA memory pool. Reducing the
> size of the SWIOTLB may therefore require increasing the size of the
> coherent pool with the "coherent_pool" command-line parameter.
> 
> Performance
> ===========
> 
> All testing compared a vanilla v6.4-rc6 kernel with a fully patched
> kernel. The kernel was booted with "swiotlb=force" to allow
> stress-testing the software IO TLB on a high-performance device that
> would otherwise not need it. CONFIG_DEBUG_FS was set to 'y' to match
> the configuration of popular distribution kernels; it is understood
> that parallel workloads suffer from contention on the recently added
> debugfs atomic counters.
> 
> These benchmarks were run:
> 
> - small: single-threaded I/O of 4 KiB blocks,
> - big: single-threaded I/O of 64 KiB blocks,
> - 4way: 4-way parallel I/O of 4 KiB blocks.
> 
> In all tested cases, the default 64 MiB SWIOTLB would be sufficient
> (but wasteful).
> The "default" pair of columns shows the performance impact when booted
> with a 64 MiB SWIOTLB (i.e. the current state). The "growing" pair of
> columns shows the impact when booted with a 1 MiB initial SWIOTLB,
> which grew to 5 MiB at run time. The "var" column in the tables below
> is the coefficient of variation over 5 runs of the test; the "diff"
> column is the difference in read-write I/O bandwidth (MiB/s). The very
> first column is the coefficient of variation in the results of the
> base unpatched kernel.
> 
> First, on an x86 VM against a QEMU virtio SATA driver backed by a
> RAM-based block device on the host:
> 
>            base     default       growing
>            var      var   diff    var   diff
>   small    1.96%    0.47% -1.5%   0.52% -2.2%
>   big      2.03%    1.35% +0.9%   2.22% +2.9%
>   4way     0.80%    0.45% -0.7%   1.22% <0.1%
> 
> Second, on a Raspberry Pi 4 with 8 GiB RAM and a class 10 A1 microSD
> card:
> 
>            base     default       growing
>            var      var   diff    var   diff
>   small    1.09%    1.69% +0.5%   2.14% -0.2%
>   big      0.03%    0.28% -0.5%   0.03% -0.1%
>   4way     5.15%    2.39% +0.2%   0.66% <0.1%
> 
> Third, on a CoCo VM. This was a bigger system, so I also added a
> 24-thread parallel I/O test:
> 
>            base     default       growing
>            var      var   diff    var    diff
>   small    2.41%    6.02% +1.1%   10.33% +6.7%
>   big      9.20%    2.81% -0.6%   16.84% -0.2%
>   4way     0.86%    2.66% -0.1%    2.22% -4.9%
>   24way    3.19%    6.19% +4.4%    4.08% -5.9%
> 
> Note the increased variance in the CoCo VM results, although the host
> was not otherwise loaded. This is caused by the first run, which
> includes the overhead of allocating additional bounce buffers and
> sharing them with the hypervisor. The system was not rebooted between
> successive runs.
> 
> Parallel tests suffer from a reduced number of areas in the dynamically
> allocated memory pools. This can be improved by allocating a larger
> pool from CMA (not implemented in this series yet).
> 
> I have no good explanation for the increase in performance of the
> 24-thread I/O test with the default (non-growing) memory pool. Although
> the difference is within variance, it seems to be real. The average
> bandwidth is consistently above that of the unpatched kernel.
> 
> To sum it up:
> 
> - All workloads benefit from the reduced memory footprint.
> - No performance regressions have been observed with the default size
>   of the software IO TLB.
> - Most workloads retain their former performance even if the software
>   IO TLB grows at run time.

For the driver-core touched portions:

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>