Re: [PATCH 1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Uladzislau Rezki <urezki@gmail.com>
To: Adrian Huang <adrianhuang0701@gmail.com>
Cc: Adrian Huang <adrianhuang0701@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Christoph Hellwig <hch@infradead.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Adrian Huang <ahuang12@lenovo.com>
Subject: Re: [PATCH 1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area
Date: Fri, 30 Aug 2024 18:26:46 +0200	[thread overview]
Message-ID: <ZtHyxvscMuxHQkaO@pc636> (raw)
In-Reply-To: <ZtDFQHGHMq6TfbKA@pc636>

On Thu, Aug 29, 2024 at 09:00:16PM +0200, Uladzislau Rezki wrote:
> On Thu, Aug 29, 2024 at 09:06:33PM +0800, Adrian Huang wrote:
> > From: Adrian Huang <ahuang12@lenovo.com>
> > 
> > When running the vmalloc stress on a 448-core system, observe the average
> > latency of purge_vmap_node() is about 2 seconds by using the eBPF/bcc
> > 'funclatency.py' tool [1].
> > 
> >   # /your-git-repo/bcc/tools/funclatency.py -u purge_vmap_node & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1
> > 
> >      usecs             : count    distribution
> >         0 -> 1         : 0       |                                        |
> >         2 -> 3         : 29      |                                        |
> >         4 -> 7         : 19      |                                        |
> >         8 -> 15        : 56      |                                        |
> >        16 -> 31        : 483     |****                                    |
> >        32 -> 63        : 1548    |************                            |
> >        64 -> 127       : 2634    |*********************                   |
> >       128 -> 255       : 2535    |*********************                   |
> >       256 -> 511       : 1776    |**************                          |
> >       512 -> 1023      : 1015    |********                                |
> >      1024 -> 2047      : 573     |****                                    |
> >      2048 -> 4095      : 488     |****                                    |
> >      4096 -> 8191      : 1091    |*********                               |
> >      8192 -> 16383     : 3078    |*************************               |
> >     16384 -> 32767     : 4821    |****************************************|
> >     32768 -> 65535     : 3318    |***************************             |
> >     65536 -> 131071    : 1718    |**************                          |
> >    131072 -> 262143    : 2220    |******************                      |
> >    262144 -> 524287    : 1147    |*********                               |
> >    524288 -> 1048575   : 1179    |*********                               |
> >   1048576 -> 2097151   : 822     |******                                  |
> >   2097152 -> 4194303   : 906     |*******                                 |
> >   4194304 -> 8388607   : 2148    |*****************                       |
> >   8388608 -> 16777215  : 4497    |*************************************   |
> >  16777216 -> 33554431  : 289     |**                                      |
> > 
> >   avg = 2041714 usecs, total: 78381401772 usecs, count: 38390
> > 
> >   The worst case is over 16-33 seconds, so soft lockup is triggered [2].
> > 
> > [Root Cause]
> > 1) Each purge_list has the long list. The following shows the number of
> >    vmap_area is purged.
> > 
> >    crash> p vmap_nodes
> >    vmap_nodes = $27 = (struct vmap_node *) 0xff2de5a900100000
> >    crash> vmap_node 0xff2de5a900100000 128 | grep nr_purged
> >      nr_purged = 663070
> >      ...
> >      nr_purged = 821670
> >      nr_purged = 692214
> >      nr_purged = 726808
> >      ...
> > 
> > 2) atomic_long_sub() employs the 'lock' prefix to ensure the atomic
> >    operation when purging each vmap_area. However, the iteration is over
> >    600000 vmap_area (See 'nr_purged' above).
> > 
> >    Here is objdump output:
> > 
> >      $ objdump -D vmlinux
> >      ffffffff813e8c80 <purge_vmap_node>:
> >      ...
> >      ffffffff813e8d70:  f0 48 29 2d 68 0c bb  lock sub %rbp,0x2bb0c68(%rip)
> >      ...
> > 
> >    Quote from "Instruction tables" pdf file [3]:
> >      Instructions with a LOCK prefix have a long latency that depends on
> >      cache organization and possibly RAM speed. If there are multiple
> >      processors or cores or direct memory access (DMA) devices, then all
> >      locked instructions will lock a cache line for exclusive access,
> >      which may involve RAM access. A LOCK prefix typically costs more
> >      than a hundred clock cycles, even on single-processor systems.
> > 
> >    That's why the latency of purge_vmap_node() dramatically increases
> >    on a many-core system: One core is busy on purging each vmap_area of
> >    the *long* purge_list and executing atomic_long_sub() for each
> >    vmap_area, while other cores free vmalloc allocations and execute
> >    atomic_long_add_return() in free_vmap_area_noflush().
> > 
> > [Solution]
> > Employ a local variable to record the total purged pages, and execute
> > atomic_long_sub() after the traversal of the purge_list is done. The
> > experiment result shows the latency improvement is 99%.
> > 
> > [Experiment Result]
> > 1) System Configuration: Three servers (with HT-enabled) are tested.
> >      * 72-core server: 3rd Gen Intel Xeon Scalable Processor*1
> >      * 192-core server: 5th Gen Intel Xeon Scalable Processor*2
> >      * 448-core server: AMD Zen 4 Processor*2
> > 
> > 2) Kernel Config
> >      * CONFIG_KASAN is disabled
> > 
> > 3) The data in column "w/o patch" and "w/ patch"
> >      * Unit: micro seconds (us)
> >      * Each data is the average of 3-time measurements
> > 
> >          System        w/o patch (us)   w/ patch (us)    Improvement (%)
> >      ---------------   --------------   -------------    -------------
> >      72-core server          2194              14            99.36%
> >      192-core server       143799            1139            99.21%
> >      448-core server      1992122            6883            99.65%
> > 
> > [1] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
> > [2] https://gist.github.com/AdrianHuang/37c15f67b45407b83c2d32f918656c12 
> > [3] https://www.agner.org/optimize/instruction_tables.pdf
> > 
> > Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
> > ---
> >  mm/vmalloc.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 3f9b6bd707d2..607697c81e60 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2210,6 +2210,7 @@ static void purge_vmap_node(struct work_struct *work)
> >  {
> >  	struct vmap_node *vn = container_of(work,
> >  		struct vmap_node, purge_work);
> > +	unsigned long nr_purged_pages = 0;
> >  	struct vmap_area *va, *n_va;
> >  	LIST_HEAD(local_list);
> >  
> > @@ -2224,7 +2225,7 @@ static void purge_vmap_node(struct work_struct *work)
> >  
> >  		list_del_init(&va->list);
> >  
> > -		atomic_long_sub(nr, &vmap_lazy_nr);
> > +		nr_purged_pages += nr;
> >  		vn->nr_purged++;
> >  
> >  		if (is_vn_id_valid(vn_id) && !vn->skip_populate)
> > @@ -2235,6 +2236,8 @@ static void purge_vmap_node(struct work_struct *work)
> >  		list_add(&va->list, &local_list);
> >  	}
> >  
> > +	atomic_long_sub(nr_purged_pages, &vmap_lazy_nr);
> > +
> >  	reclaim_list_global(&local_list);
> >  }
> >  
> > -- 
> > 2.34.1
> > 
> I see the point and it looks good to me.
> 
> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> 
> Thank you for improving this. There is one more spot which i detected
> earlier, it is:
> 
> <snip>
> static void free_vmap_area_noflush(struct vmap_area *va)
> {
> 	unsigned long nr_lazy_max = lazy_max_pages();
> 	unsigned long va_start = va->va_start;
> 	unsigned int vn_id = decode_vn_id(va->flags);
> 	struct vmap_node *vn;
> 	unsigned long nr_lazy;
> 
> 	if (WARN_ON_ONCE(!list_empty(&va->list)))
> 		return;
> 
> 	nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
> 				PAGE_SHIFT, &vmap_lazy_nr);
> 
> ...
> <snip>
> 
> atomic_long_add_return() might also introduce a high contention. We can
> optimize by splitting into more light atomics. Can you check it on your
> 448-cores system?
> 
I have checked the free_vmap_area_noflush() on my hardware. It is 64
cores system:

<perf cycles>
...
+    7.84%     5.18%  [kernel]          [k] free_vmap_area_noflush
+    6.16%     1.61%  [kernel]          [k] free_unref_page
+    5.57%     1.51%  [kernel]          [k] find_unlink_vmap_area
...
<perf cycles>

<perf cycles annotate>
..
            │     arch_atomic64_add_return():
   23352402 │       mov  %r12,%rdx
            │       lock xadd %rdx,vmap_lazy_nr
            │     is_vn_id_valid():
52364447314 │       mov  nr_vmap_nodes,%ecx <----- the hotest spot which consumes the CPU cycles the most(99%)
            │     arch_atomic64_add_return():
   45547180 │       add  %rdx,%r12        
            │     is_vn_id_valid():
...
<perf cycles annotate>

At least in my case, HW, i do not see that atomic_long_add_return() is a
top when it comes to CPU cycles. Below one is the hottest instead:

static bool
is_vn_id_valid(unsigned int node_id)
{
	if (node_id < nr_vmap_nodes)
		return true;

	return false;
}

access to "nr_vmap_nodes" which is read-only and globally defined:

static __read_mostly unsigned int nr_vmap_nodes = 1;

Any thoughts?

--
Uladzislau Rezki

next prev parent reply	other threads:[~2024-08-30 16:27 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-29 13:06 [PATCH 1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area Adrian Huang
2024-08-29 19:00 ` Uladzislau Rezki
2024-08-30 16:26   ` Uladzislau Rezki [this message]
2024-08-31  7:03     ` Christophe JAILLET
2024-09-02 17:03       ` Uladzislau Rezki
2024-09-02 12:00   ` Adrian Huang
2024-09-02 17:00     ` Uladzislau Rezki
2024-08-31  0:33 ` Andrew Morton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZtHyxvscMuxHQkaO@pc636 \
    --to=urezki@gmail.com \
    --cc=adrianhuang0701@gmail.com \
    --cc=ahuang12@lenovo.com \
    --cc=akpm@linux-foundation.org \
    --cc=hch@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.