From: Baoquan He <bhe@redhat.com>
To: Uladzislau Rezki <urezki@gmail.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>,
	Matthew Wilcox <willy@infradead.org>,
	Mel Gorman <mgorman@suse.de>,
	kirill.shutemov@linux.intel.com,
	Vishal Moola <vishal.moola@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Lorenzo Stoakes <lstoakes@gmail.com>,
	Christoph Hellwig <hch@infradead.org>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Dave Chinner <david@fromorbit.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	Joel Fernandes <joel@joelfernandes.org>,
	Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>,
	linux-mm@kvack.org
Subject: Re: [PATCH v3 00/11] Mitigate a vmap lock contention v3
Date: Wed, 28 Feb 2024 17:27:06 +0800
Message-ID: <Zd78aiZ8uiM6ZP16@MiWiFi-R3L-srv>
In-Reply-To: <ZdjqDRLbpnExRhSZ@pc638.lan>

[-- Attachment #1: Type: text/plain, Size: 5978 bytes --]

On 02/23/24 at 07:55pm, Uladzislau Rezki wrote:
> On Fri, Feb 23, 2024 at 11:57:25PM +0800, Baoquan He wrote:
> > On 02/23/24 at 12:06pm, Uladzislau Rezki wrote:
> > > > On 02/23/24 at 10:34am, Uladzislau Rezki wrote:
> > > > > On Thu, Feb 22, 2024 at 11:15:59PM +0000, Pedro Falcato wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > On Thu, Feb 22, 2024 at 8:35 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > > >
> > > > > > > Hello, Folk!
> > > > > > >
> > > > > > >[...]
> > > > > > > > pagetable_alloc - increases as soon as higher pressure is applied by
> > > > > > > > increasing the number of workers. Running the same number of jobs on the
> > > > > > > > next run does not increase it; it stays at the same level as before.
> > > > > > >
> > > > > > > /**
> > > > > > >  * pagetable_alloc - Allocate pagetables
> > > > > > >  * @gfp:    GFP flags
> > > > > > >  * @order:  desired pagetable order
> > > > > > >  *
> > > > > > >  * pagetable_alloc allocates memory for page tables as well as a page table
> > > > > > >  * descriptor to describe that memory.
> > > > > > >  *
> > > > > > >  * Return: The ptdesc describing the allocated page tables.
> > > > > > >  */
> > > > > > > static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
> > > > > > > {
> > > > > > >         struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> > > > > > >
> > > > > > >         return page_ptdesc(page);
> > > > > > > }
> > > > > > >
> > > > > > > > Could you please comment on it? Or do you have any thoughts? Is it expected?
> > > > > > > > Is a page table ever shrunk?
> > > > > > 
> > > > > > It's my understanding that the vunmap_range helpers don't actively
> > > > > > free page tables, they just clear PTEs. munmap does free them in
> > > > > > mmap.c:free_pgtables, maybe something could be worked up for vmalloc
> > > > > > too.
> > > > > >
> > > > > > Right. I see that for user space, pgtables are removed. There was
> > > > > > work on that.
> > > > > 
> > > > > >
> > > > > > I would not be surprised if the memory increase you're seeing is more
> > > > > > or less correlated to the maximum vmalloc footprint throughout the
> > > > > > whole test.
> > > > > > 
> > > > > > Yes, the vmalloc footprint follows the memory usage. Some use cases
> > > > > > map a lot of memory.
> > > > 
> > > > > The 'nr_threads=256' testing may be too radical. I ran the test on
> > > > > a bare metal machine as below; it's still running and hanging there after
> > > > > 30 minutes. I did this after system boot. I am looking for other
> > > > > machines with more processors.
> > > > 
> > > > [root@dell-r640-068 ~]# nproc 
> > > > 64
> > > > [root@dell-r640-068 ~]# free -h
> > > >                total        used        free      shared  buff/cache   available
> > > > Mem:           187Gi        18Gi       169Gi        12Mi       262Mi       168Gi
> > > > Swap:          4.0Gi          0B       4.0Gi
> > > > [root@dell-r640-068 ~]# 
> > > > 
> > > > [root@dell-r640-068 linux]# tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
> > > > Run the test with following parameters: run_test_mask=127 nr_threads=256
> > > > 
> > > > Agreed, nr_threads=256 is way too radical :) Mine took 50 minutes to
> > > > complete. So wait longer :)
> > 
> > Right, mine could take a similar time to finish. I got a machine
> > with 288 cpus, and will see if I can get some clues. When I went through
> > the code flow, I suddenly realized it could be drain_vmap_area_work that
> > is the bottleneck and causes the tremendous page table page cost.
> > 
> > On your system, there are 64 cpus, so
> > 
> > nr_lazy_max = lazy_max_pages() = 7*32M = 224M;
> > 
> > So with nr_threads=128 or 256, it's very easy to reach nr_lazy_max
> > and trigger drain_vmap_work(). When cpu resources are very limited, the
> > lazy vmap purging will be very slow, while the alloc/free in
> > lib/test_vmalloc.c goes far faster and more easily than vmap reclaiming.
> > If old va is not reused, new va is allocated and keeps extending, so new
> > page tables surely need to be created to cover it.
> > 
> > I will run the test on the system with 288 cpus, and will update when
> > testing is done.
> > 
> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 12caa794abd4..a90c5393d85f 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1754,6 +1754,8 @@ size_to_va_pool(struct vmap_node *vn, unsigned long size)
>  	return NULL;
>  }
>  
> +static unsigned long lazy_max_pages(void);
> +
>  static bool
>  node_pool_add_va(struct vmap_node *n, struct vmap_area *va)
>  {
> @@ -1763,6 +1765,9 @@ node_pool_add_va(struct vmap_node *n, struct vmap_area *va)
>  	if (!vp)
>  		return false;
>  
> +	if (READ_ONCE(vp->len) > lazy_max_pages())
> +		return false;
> +
>  	spin_lock(&n->pool_lock);
>  	list_add(&va->list, &vp->head);
>  	WRITE_ONCE(vp->len, vp->len + 1);
> @@ -2170,9 +2175,9 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
>  				INIT_WORK(&vn->purge_work, purge_vmap_node);
>  
>  				if (cpumask_test_cpu(i, cpu_online_mask))
> -					schedule_work_on(i, &vn->purge_work);
> +					queue_work_on(i, system_highpri_wq, &vn->purge_work);
>  				else
> -					schedule_work(&vn->purge_work);
> +					queue_work(system_highpri_wq, &vn->purge_work);
>  
>  				nr_purge_helpers--;
>  			} else {
> <snip>
> 
> We need this. It settles things back to a normal PTE usage. Tomorrow I
> will check whether the cache length should be limited. I tested on my 64-CPU
> system with a radical 256 kworkers. It looks good.

I finally finished testing without and with your improvement patch
above. The testing was done on a system with 128 cpus; the system with 288
cpus is not available because of a console connection problem. The log is
attached. In some runs after rebooting, I found a test could take more
than 30 minutes; I am not sure whether that was caused by my messy code
changes. In the end I cleaned all of them up, took a clean linux-next for
testing, and then applied your draft code above.
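
For reference, the nr_lazy_max arithmetic quoted above (7*32M = 224M on
64 cpus) can be reproduced with a small userspace sketch. It assumes 4K
pages and that lazy_max_pages() is fls(num_online_cpus()) times 32M worth
of pages, which matches the quoted figure. The sketch_* names are made up
for illustration and are not kernel symbols.

```c
#include <assert.h>

/* Assumption: 4K pages. The kernel uses the architecture's PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Userspace stand-in for the kernel's fls(): 1-based index of the
 * highest set bit, or 0 when x is 0.
 */
static unsigned int sketch_fls(unsigned int x)
{
	unsigned int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Lazy-free threshold in pages: fls(ncpus) chunks of 32M each. */
static unsigned long sketch_lazy_max_pages(unsigned int ncpus)
{
	assert(ncpus > 0);
	return sketch_fls(ncpus) * (32UL * 1024 * 1024 / SKETCH_PAGE_SIZE);
}

/* Same threshold expressed in MiB, to compare with the "224M" above. */
static unsigned long sketch_lazy_max_mib(unsigned int ncpus)
{
	return sketch_lazy_max_pages(ncpus) * SKETCH_PAGE_SIZE / (1024UL * 1024);
}
```

Under these assumptions, sketch_lazy_max_mib(64) gives 224 and the 128-cpu
test machine gives 256, so the threshold grows only logarithmically with
the cpu count while the number of test threads grows linearly.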

[-- Attachment #2: vmalloc_node.log --]
[-- Type: text/plain, Size: 7736 bytes --]

[root@dell-per6515-03 linux]# nproc 
128
[root@dell-per6515-03 linux]# free -h
               total        used        free      shared  buff/cache   available
Mem:           124Gi       2.6Gi       122Gi        21Mi       402Mi       122Gi
Swap:          4.0Gi          0B       4.0Gi

1) linux-next kernel without the improvement code from Uladzislau
-------------------------------------------------------
[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=64
Run the test with following parameters: run_test_mask=127 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real	4m28.018s
user	0m0.015s
sys	0m4.712s
[root@dell-per6515-03 ~]# sort -h /proc/allocinfo | tail -10
    21405696     5226 mm/memory.c:1122 func:folio_prealloc 
    26199936     7980 kernel/fork.c:309 func:alloc_thread_stack_node 
    29822976     7281 mm/readahead.c:247 func:page_cache_ra_unbounded 
    99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc 
   107638784     6320 mm/readahead.c:468 func:ra_alloc_folio 
   120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash 
   134742016    32896 mm/percpu-vm.c:95 func:pcpu_alloc_pages 
   263192576    64256 mm/page_ext.c:270 func:alloc_page_ext 
   266797056    65136 include/linux/mm.h:2848 func:pagetable_alloc 
   507617280    32796 mm/slub.c:2305 func:alloc_slab_page 
[root@dell-per6515-03 ~]# 
[root@dell-per6515-03 ~]# 
[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=128
Run the test with following parameters: run_test_mask=127 nr_threads=128
Done.
Check the kernel ring buffer to see the summary.

real	6m19.328s
user	0m0.005s
sys	0m9.476s
[root@dell-per6515-03 ~]# sort -h /proc/allocinfo | tail -10
    21405696     5226 mm/memory.c:1122 func:folio_prealloc 
    26889408     8190 kernel/fork.c:309 func:alloc_thread_stack_node 
    29822976     7281 mm/readahead.c:247 func:page_cache_ra_unbounded 
    99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc 
   107638784     6320 mm/readahead.c:468 func:ra_alloc_folio 
   120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash 
   134742016    32896 mm/percpu-vm.c:95 func:pcpu_alloc_pages 
   263192576    64256 mm/page_ext.c:270 func:alloc_page_ext 
   550068224    34086 mm/slub.c:2305 func:alloc_slab_page 
   664535040   162240 include/linux/mm.h:2848 func:pagetable_alloc 
[root@dell-per6515-03 ~]# 
[root@dell-per6515-03 ~]# 
[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
Run the test with following parameters: run_test_mask=127 nr_threads=256
Done.
Check the kernel ring buffer to see the summary.

real	19m10.657s
user	0m0.015s
sys	0m20.959s
[root@dell-per6515-03 ~]# sort -h /proc/allocinfo | tail -10
    22441984     5479 mm/shmem.c:1634 func:shmem_alloc_folio 
    26758080     8150 kernel/fork.c:309 func:alloc_thread_stack_node 
    35880960     8760 mm/readahead.c:247 func:page_cache_ra_unbounded 
    99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc 
   120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash 
   122355712     7852 mm/readahead.c:468 func:ra_alloc_folio 
   134742016    32896 mm/percpu-vm.c:95 func:pcpu_alloc_pages 
   263192576    64256 mm/page_ext.c:270 func:alloc_page_ext 
   708231168    50309 mm/slub.c:2305 func:alloc_slab_page 
  1107296256   270336 include/linux/mm.h:2848 func:pagetable_alloc 
[root@dell-per6515-03 ~]# 
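
Each /proc/allocinfo line in the output above carries a byte total, a call
count, a source location and a func: tag. A minimal sketch of pulling those
fields out of one line, assuming the layout shown in this log (the struct
and function names are hypothetical):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields of interest from one allocinfo line (hypothetical struct). */
struct allocinfo_entry {
	unsigned long long bytes;	/* first column: total bytes */
	unsigned long long calls;	/* second column: allocation count */
	char func[64];			/* name following the "func:" tag */
};

/* Returns 1 if the line was parsed, 0 if it does not match the layout. */
static int parse_allocinfo_line(const char *line, struct allocinfo_entry *out)
{
	const char *tag;

	assert(line && out);

	/* The two leading numeric columns. */
	if (sscanf(line, "%llu %llu", &out->bytes, &out->calls) != 2)
		return 0;

	/* Skip location and optional module tag such as "[xfs]". */
	tag = strstr(line, "func:");
	if (!tag)
		return 0;

	/* Function name runs up to the next whitespace. */
	if (sscanf(tag + strlen("func:"), "%63s", out->func) != 1)
		return 0;

	return 1;
}
```

For the pagetable_alloc lines in this log, bytes divided by calls comes out
to 4096, i.e. one 4K page per allocation.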

2) linux-next kernel with the improvement code from Uladzislau
-----------------------------------------------------
[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=64
Run the test with following parameters: run_test_mask=127 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real	4m27.226s
user	0m0.006s
sys	0m4.709s
[root@dell-per6515-03 linux]# sort -h /proc/allocinfo | tail -10
    38023168     9283 mm/readahead.c:247 func:page_cache_ra_unbounded 
    72228864    17634 fs/xfs/xfs_buf.c:390 [xfs] func:xfs_buf_alloc_pages 
    99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc 
    99863552    97523 fs/xfs/xfs_icache.c:81 [xfs] func:xfs_inode_alloc 
   120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash 
   136314880    33280 mm/percpu-vm.c:95 func:pcpu_alloc_pages 
   184176640    10684 mm/readahead.c:468 func:ra_alloc_folio 
   263192576    64256 mm/page_ext.c:270 func:alloc_page_ext 
   284700672    69507 include/linux/mm.h:2848 func:pagetable_alloc 
   601427968    36377 mm/slub.c:2305 func:alloc_slab_page 
[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=128
Run the test with following parameters: run_test_mask=127 nr_threads=128
Done.
Check the kernel ring buffer to see the summary.

real	6m16.960s
user	0m0.007s
sys	0m9.465s
[root@dell-per6515-03 linux]# sort -h /proc/allocinfo | tail -10
    38158336     9316 mm/readahead.c:247 func:page_cache_ra_unbounded 
    72220672    17632 fs/xfs/xfs_buf.c:390 [xfs] func:xfs_buf_alloc_pages 
    99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc 
    99863552    97523 fs/xfs/xfs_icache.c:81 [xfs] func:xfs_inode_alloc 
   120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash 
   136314880    33280 mm/percpu-vm.c:95 func:pcpu_alloc_pages 
   184504320    10710 mm/readahead.c:468 func:ra_alloc_folio 
   263192576    64256 mm/page_ext.c:270 func:alloc_page_ext 
   427884544   104464 include/linux/mm.h:2848 func:pagetable_alloc 
   697311232    45159 mm/slub.c:2305 func:alloc_slab_page
[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
Run the test with following parameters: run_test_mask=127 nr_threads=256
Done.
Check the kernel ring buffer to see the summary.

real	21m15.673s
user	0m0.008s
sys	0m20.259s
[root@dell-per6515-03 linux]# sort -h /proc/allocinfo | tail -10
    38158336     9316 mm/readahead.c:247 func:page_cache_ra_unbounded 
    72224768    17633 fs/xfs/xfs_buf.c:390 [xfs] func:xfs_buf_alloc_pages 
    99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc 
    99863552    97523 fs/xfs/xfs_icache.c:81 [xfs] func:xfs_inode_alloc 
   120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash 
   136314880    33280 mm/percpu-vm.c:95 func:pcpu_alloc_pages 
   184504320    10710 mm/readahead.c:468 func:ra_alloc_folio 
   263192576    64256 mm/page_ext.c:270 func:alloc_page_ext 
   506974208   123773 include/linux/mm.h:2848 func:pagetable_alloc 
   809504768    53621 mm/slub.c:2305 func:alloc_slab_page
[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
Run the test with following parameters: run_test_mask=127 nr_threads=256
Done.
Check the kernel ring buffer to see the summary.

real	21m36.580s
user	0m0.012s
sys	0m19.912s
[root@dell-per6515-03 linux]# sort -h /proc/allocinfo | tail -10
    38977536     9516 mm/readahead.c:247 func:page_cache_ra_unbounded 
    72273920    17645 fs/xfs/xfs_buf.c:390 [xfs] func:xfs_buf_alloc_pages 
    99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc 
    99895296    97554 fs/xfs/xfs_icache.c:81 [xfs] func:xfs_inode_alloc 
   120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash 
   141033472    34432 mm/percpu-vm.c:95 func:pcpu_alloc_pages 
   186064896    10841 mm/readahead.c:468 func:ra_alloc_folio 
   263192576    64256 mm/page_ext.c:270 func:alloc_page_ext 
   541237248   132138 include/linux/mm.h:2848 func:pagetable_alloc 
   694718464    41216 mm/slub.c:2305 func:alloc_slab_page
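
To compare the pagetable_alloc totals between the two kernels, a small
helper along these lines could be used. This is only a sketch: the byte
counts are copied from the nr_threads=256 runs in this log, and the helper
and constant names are hypothetical.

```c
#include <assert.h>

/* pagetable_alloc byte totals for nr_threads=256, copied from the log. */
static const unsigned long long baseline_256 = 1107296256ULL;
static const unsigned long long patched_256_run1 = 506974208ULL;
static const unsigned long long patched_256_run2 = 541237248ULL;

/*
 * Percentage by which "after" is smaller than "before".
 * A negative result means the value grew.
 */
static double percent_reduction(unsigned long long before,
				unsigned long long after)
{
	assert(before > 0);
	return 100.0 * ((double)before - (double)after) / (double)before;
}

/* Convert a byte total to MiB for easier reading. */
static double bytes_to_mib(unsigned long long bytes)
{
	return (double)bytes / (1024.0 * 1024.0);
}
```

With these inputs, both patched nr_threads=256 runs end up with a bit more
than half of the baseline page table memory removed, while the
nr_threads=64 totals in the log are close to each other on both kernels.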





