From: Uladzislau Rezki <urezki@gmail.com>
Date: Thu, 22 Feb 2024 09:35:40 +0100
To: Matthew Wilcox, Mel Gorman, kirill.shutemov@linux.intel.com,
    Vishal Moola
Cc: Andrew Morton, LKML, Baoquan He, Lorenzo Stoakes,
    Christoph Hellwig, "Liam R. Howlett", Dave Chinner,
    "Paul E. McKenney", Joel Fernandes, Oleksiy Avramchenko,
    linux-mm@kvack.org
Subject: Re: [PATCH v3 00/11] Mitigate a vmap lock contention v3
In-Reply-To: <20240102184633.748113-1-urezki@gmail.com>

Hello, Folk!

> This is v3. It is based on 6.7.0-rc8.
>
> 1. Motivation
>
> - Offload the global vmap locks so the code scales with the number
>   of CPUs;
> - If possible, and if there is agreement, remove the "per-CPU kva
>   allocator" to make the vmap code simpler;
> - There were complaints from XFS folks that vmalloc can be contended
>   on their workloads.
>
> 2. Design (high-level overview)
>
> We introduce vmap node logic. A node behaves as an independent entity
> that serves an allocation request directly (if possible) from its own
> pool. That way it bypasses the global vmap space, which is protected
> by its own lock.
>
> Access to the pools is serialized per CPU. The number of nodes equals
> the number of CPUs in the system, with an upper bound of 128 nodes.
>
> Pools are size-segregated and populated based on system demand. The
> maximum allocation request that can be stored in segregated storage
> is 256 pages. The lazy drain path decays a pool by 25% as a first
> step, and as a second step repopulates it with freshly freed VAs for
> reuse, instead of returning them to the global space.
>
> When a VA is obtained (alloc path), it is stored in one of the nodes.
> The va->va_start address is converted to the node where the VA should
> be placed and reside. Balancing VAs across the nodes this way makes
> access scalable. The addr_to_node() function does the address-to-node
> conversion.
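To illustrate the above, a simplified sketch of the per-node layout.
The struct and field names here are illustrative rather than the exact
ones from the patches; see mm/vmalloc.c in the series for the real
definitions:

#define MAX_VA_SIZE_PAGES	256	/* biggest request a pool caches */

struct rb_list {
	struct rb_root root;	/* vmap_area tree, keyed by va_start */
	struct list_head head;
	spinlock_t lock;
};

struct vmap_pool {
	struct list_head head;	/* cached, ready-to-reuse VAs */
	unsigned long len;	/* number of cached VAs */
};

struct vmap_node {
	/* Size-segregated caches: pool[i] holds VAs of (i + 1) pages. */
	struct vmap_pool pool[MAX_VA_SIZE_PAGES];
	spinlock_t pool_lock;

	/* VAs currently allocated from this node. */
	struct rb_list busy;

	/* Freed VAs waiting for the lazy drain kworker. */
	struct rb_list lazy;
};

An allocation is served from a node under that node's lock only; the
global free space is touched only when a pool cannot satisfy the
request.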
>
> The vmap space is divided into segments of a fixed size of 16 pages,
> so that any address can be associated with a segment number. The
> number of segments equals num_possible_cpus(), but is not greater
> than 128. The numeration starts from 0. See below how an address is
> converted:
>
> static inline unsigned int
> addr_to_node_id(unsigned long addr)
> {
> 	return (addr / zone_size) % nr_nodes;
> }
>
> On the free path, a VA can easily be found by converting its
> "va_start" address to the node it resides in. It is moved from the
> "busy" to the "lazy" data structure. Later on, as noted earlier, the
> lazy kworker decays each node's pool and repopulates it with fresh
> incoming VAs. Please note, a VA is returned to the node that served
> the allocation request.
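To make the conversion concrete, a quick worked example with assumed
values: with 4 KiB pages, zone_size = 16 * 4096 = 65536, and with 64
CPUs, nr_nodes = 64 (the real values are computed at boot):

	addr_to_node_id(0xffffc90000000000);	/* node 0 */
	addr_to_node_id(0xffffc90000010000);	/* node 1 */
	addr_to_node_id(0xffffc90000400000);	/* node 0 again, wrapped */

Consecutive 64 KiB segments land on consecutive nodes, wrapping around
every nr_nodes * zone_size = 4 MiB of address space, which is what
spreads VAs, and therefore contention, evenly across the nodes.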
>
> 3. Test on AMD Ryzen Threadripper 3970X 32-Core Processor
>
> sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
>
>  94.41%  0.89%  [kernel]        [k] _raw_spin_lock
>  93.35% 93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
>  76.13%  0.28%  [kernel]        [k] __vmalloc_node_range
>  72.96%  0.81%  [kernel]        [k] alloc_vmap_area
>  56.94%  0.00%  [kernel]        [k] __get_vm_area_node
>  41.95%  0.00%  [kernel]        [k] vmalloc
>  37.15%  0.01%  [test_vmalloc]  [k] full_fit_alloc_test
>  35.17%  0.00%  [kernel]        [k] ret_from_fork_asm
>  35.17%  0.00%  [kernel]        [k] ret_from_fork
>  35.17%  0.00%  [kernel]        [k] kthread
>  35.08%  0.00%  [test_vmalloc]  [k] test_func
>  34.45%  0.00%  [test_vmalloc]  [k] fix_size_alloc_test
>  28.09%  0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
>  23.53%  0.25%  [kernel]        [k] vfree.part.0
>  21.72%  0.00%  [kernel]        [k] remove_vm_area
>  20.08%  0.21%  [kernel]        [k] find_unlink_vmap_area
>   2.34%  0.61%  [kernel]        [k] free_vmap_area_noflush
>
> vs
>
>  82.32%  0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
>  63.36%  0.02%  [kernel]        [k] vmalloc
>  63.34%  2.64%  [kernel]        [k] __vmalloc_node_range
>  30.42%  4.46%  [kernel]        [k] vfree.part.0
>  28.98%  2.51%  [kernel]        [k] __alloc_pages_bulk
>  27.28%  0.19%  [kernel]        [k] __get_vm_area_node
>  26.13%  1.50%  [kernel]        [k] alloc_vmap_area
>  21.72% 21.67%  [kernel]        [k] clear_page_rep
>  19.51%  2.43%  [kernel]        [k] _raw_spin_lock
>  16.61% 16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
>  13.40%  2.07%  [kernel]        [k] free_unref_page
>  10.62%  0.01%  [kernel]        [k] remove_vm_area
>   9.02%  8.73%  [kernel]        [k] insert_vmap_area
>   8.94%  0.00%  [kernel]        [k] ret_from_fork_asm
>   8.94%  0.00%  [kernel]        [k] ret_from_fork
>   8.94%  0.00%  [kernel]        [k] kthread
>   8.29%  0.00%  [test_vmalloc]  [k] test_func
>   7.81%  0.05%  [test_vmalloc]  [k] full_fit_alloc_test
>   5.30%  4.73%  [kernel]        [k] purge_vmap_node
>   4.47%  2.65%  [kernel]        [k] free_vmap_area_noflush
>
> This confirms that native_queued_spin_lock_slowpath goes down from
> 93.07% to 16.51%.
>
> The throughput is ~12x higher:
>
> urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
> Run the test with following parameters: run_test_mask=7 nr_threads=64
> Done.
> Check the kernel ring buffer to see the summary.
>
> real    10m51.271s
> user    0m0.013s
> sys     0m0.187s
> urezki@pc638:~$
>
> urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
> Run the test with following parameters: run_test_mask=7 nr_threads=64
> Done.
> Check the kernel ring buffer to see the summary.
>
> real    0m51.301s
> user    0m0.015s
> sys     0m0.040s
> urezki@pc638:~$
>
> 4. Changelog
>
> v1: https://lore.kernel.org/linux-mm/ZIAqojPKjChJTssg@pc636/T/
> v2: https://lore.kernel.org/lkml/20230829081142.3619-1-urezki@gmail.com/
>
> Delta v2 -> v3:
> - fix comments from v2 feedback;
> - switch from the pre-fetch chunk logic to less complex size-based
>   pools.
>
> Baoquan He (1):
>   mm/vmalloc: remove vmap_area_list
>
> Uladzislau Rezki (Sony) (10):
>   mm: vmalloc: Add va_alloc() helper
>   mm: vmalloc: Rename adjust_va_to_fit_type() function
>   mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
>   mm: vmalloc: Remove global vmap_area_root rb-tree
>   mm: vmalloc: Remove global purge_vmap_area_root rb-tree
>   mm: vmalloc: Offload free_vmap_area_lock lock
>   mm: vmalloc: Support multiple nodes in vread_iter
>   mm: vmalloc: Support multiple nodes in vmallocinfo
>   mm: vmalloc: Set nr_nodes based on CPUs in a system
>   mm: vmalloc: Add a shrinker to drain vmap pools
>
>  .../admin-guide/kdump/vmcoreinfo.rst |    8 +-
>  arch/arm64/kernel/crash_core.c       |    1 -
>  arch/riscv/kernel/crash_core.c       |    1 -
>  include/linux/vmalloc.h              |    1 -
>  kernel/crash_core.c                  |    4 +-
>  kernel/kallsyms_selftest.c           |    1 -
>  mm/nommu.c                           |    2 -
>  mm/vmalloc.c                         | 1049 ++++++++++++-----
>  8 files changed, 786 insertions(+), 281 deletions(-)
>
> --
> 2.39.2
>

There is one thing that I have to clarify and which is still open for
me.

Test machine:
	qemu x86_64 system, 64 CPUs, 64G of memory

Test suite:
	test_vmalloc.sh

Environment:
	mm-unstable, branch next-20240220, where this series is located.
	On top of it I locally added Suren Baghdasaryan's "Memory
	allocation profiling" v3 for a better understanding of memory
	usage.

Before running the test, the condition is as below:

urezki@pc638:~$ sort -h /proc/allocinfo
   27.2MiB     6970 mm/memory.c:1122 module:memory func:folio_prealloc
   79.1MiB    20245 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
    112MiB     8689 mm/slub.c:2202 module:slub func:alloc_slab_page
    122MiB    31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172         936       63618           0         134       63236
Swap:              0           0           0
urezki@pc638:~$

The test suite stresses the vmap/vmalloc layer by creating workers
which in a tight loop do alloc/free, i.e. the load is considered
extreme.
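For reference, a simplified sketch of what such a worker does. This is
roughly the shape of the fix_size_alloc_test case in lib/test_vmalloc.c,
not the exact code:

static int fix_size_alloc_test(void)
{
	void *ptr;
	int i;

	for (i = 0; i < test_loop_count; i++) {
		ptr = vmalloc(3 * PAGE_SIZE);
		if (!ptr)
			return -1;

		/* Touch the memory to make sure it is really used. */
		*((__u8 *)ptr) = 1;

		vfree(ptr);
	}

	return 0;
}

With nr_threads=N, N such kworkers run concurrently, hammering
alloc_vmap_area() and the free path at the same time.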
Below, three identical tests were done with only one difference, which
is 64, 128 and 256 kworkers:

1) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=64

urezki@pc638:~$ sort -h /proc/allocinfo
   80.1MiB    20518 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
    122MiB    31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
    153MiB    39048 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
    178MiB    13259 mm/slub.c:2202 module:slub func:alloc_slab_page
    350MiB    89656 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172        1417       63054           0         298       62755
Swap:              0           0           0
urezki@pc638:~$

2) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=128

urezki@pc638:~$ sort -h /proc/allocinfo
    122MiB    31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
    154MiB    39440 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
    196MiB    14038 mm/slub.c:2202 module:slub func:alloc_slab_page
   1.20GiB   315655 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172        2556       61914           0         302       61616
Swap:              0           0           0
urezki@pc638:~$

3) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256

urezki@pc638:~$ sort -h /proc/allocinfo
    127MiB    32565 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
    197MiB    50506 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
    278MiB    18519 mm/slub.c:2202 module:slub func:alloc_slab_page
   5.36GiB  1405072 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172        6741       57652           0         394       57431
Swap:              0           0           0
urezki@pc638:~$

pagetable_alloc grows as soon as higher pressure is applied by
increasing the number of workers. Running the same number of jobs on a
subsequent run does not increase it further; it stays at the same
level as on the previous run.

/**
 * pagetable_alloc - Allocate pagetables
 * @gfp:    GFP flags
 * @order:  desired pagetable order
 *
 * pagetable_alloc allocates memory for page tables as well as a page
 * table descriptor to describe that memory.
 *
 * Return: The ptdesc describing the allocated page tables.
 */
static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
{
	struct page *page = alloc_pages(gfp | __GFP_COMP, order);

	return page_ptdesc(page);
}

Could you please comment on it? Do you have any thoughts? Is it
expected? Are page tables ever shrunk?

/proc/slabinfo does not show a high "active" or "num" object count for
any cache. In /proc/meminfo, "VmallocUsed" stays low after those three
tests. I have checked it with KASAN and KMEMLEAK and I do not see any
issues.

Thank you for the help!

--
Uladzislau Rezki