[PATCH v3 0/3] mm, lru_gen: batch update pages when aging
From: Kairui Song @ 2024-01-23 18:45 UTC
  To: linux-mm
  Cc: Andrew Morton, Yu Zhao, Wei Xu, Chris Li, Matthew Wilcox,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Link V1:
https://lore.kernel.org/linux-mm/20231222102255.56993-1-ryncsn@gmail.com/

Link V2:
https://lore.kernel.org/linux-mm/20240111183321.19984-1-ryncsn@gmail.com/

Currently, when MGLRU ages, it moves pages one by one and updates the
MM counters page by page. This is correct, but the overhead can be
reduced by batching these operations.
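
To illustrate the idea, here is a simplified userspace sketch (the names
gen_batch, nr_pages and MAX_NR_GENS are made up for illustration, not
the kernel's): instead of two shared counter updates per page, the walk
accumulates a local delta for one (old_gen, new_gen) pair and folds it
into the shared counters once per batch. Patch 3/3 applies the same idea
to the list moves themselves, gathering pages and splicing them in one
go instead of one list_move per page.

  #include <stdio.h>

  /* Stand-in for the per-lruvec, per-generation page counters. */
  #define MAX_NR_GENS 4
  static long nr_pages[MAX_NR_GENS];

  /* Unbatched aging: two shared counter updates per page moved. */
  static void age_page_unbatched(int old_gen, int new_gen)
  {
          nr_pages[old_gen]--;
          nr_pages[new_gen]++;
  }

  /* Batched aging: track one (old_gen, new_gen) pair plus a local
   * delta, and only touch the shared counters when flushing. */
  struct gen_batch {
          int old_gen, new_gen;
          long delta;
  };

  static void age_page_batched(struct gen_batch *b)
  {
          b->delta++;             /* cheap, cache-local */
  }

  static void gen_batch_flush(struct gen_batch *b)
  {
          nr_pages[b->old_gen] -= b->delta;
          nr_pages[b->new_gen] += b->delta;
          b->delta = 0;
  }

  int main(void)
  {
          struct gen_batch b = { .old_gen = 0, .new_gen = 1, .delta = 0 };
          int i;

          nr_pages[0] = 1001;
          age_page_unbatched(0, 1);       /* old scheme: per-page update */
          for (i = 0; i < 1000; i++)
                  age_page_batched(&b);
          gen_batch_flush(&b);    /* one shared update instead of 1000 */

          printf("gen0=%ld gen1=%ld\n", nr_pages[0], nr_pages[1]);
          return 0;
  }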

I rebased the series and ran more tests to check for regressions or
improvements. Everything looks OK except the memtier test, where I
reduced the repeat count (-x) compared to V1 and V2 and simply ran the
test more times instead. It now shows a minor regression; if real, it
is caused by the prefetch patch. However, the noise (standard deviation)
is fairly high, so I am not sure that test is credible. The result of
each individual patch is in its commit message.
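
For reference, the prefetch in patch 1/3 is only a hint to start pulling
the next list entry into cache while the current one is processed. A
rough userspace illustration of that pattern, using a toy list and the
GCC/Clang __builtin_prefetch builtin rather than the kernel's folio
lists:

  #include <stdio.h>

  /* Toy circular doubly-linked list standing in for the LRU list. */
  struct node {
          struct node *prev, *next;
          long data;
  };

  static long scan_list(struct node *head)
  {
          struct node *n;
          long sum = 0;

          for (n = head->next; n != head; n = n->next) {
                  /* Hint the CPU to fetch the next entry while we work
                   * on this one, hiding part of the cache miss. */
                  if (n->next != head)
                          __builtin_prefetch(n->next, 0 /* read */, 3);
                  sum += n->data;
          }
          return sum;
  }

  int main(void)
  {
          struct node head = { &head, &head, 0 };
          struct node a, b;

          /* Build head <-> a <-> b. */
          a.data = 1; b.data = 2;
          head.next = &a; a.prev = &head;
          a.next = &b;    b.prev = &a;
          b.next = &head; head.prev = &b;

          printf("sum = %ld\n", scan_list(&head));
          return 0;
  }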

Test 1: Ramdisk fio read-only test in a 4G memcg on an EPYC 7K62:
  fio -name=mglru --numjobs=16 --directory=/mnt --size=960m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=zipf:0.5 --norandommap \
    --time_based --ramp_time=1m --runtime=6m --group_reporting

Before this series:
bw (  MiB/s): min= 7758, max= 9239, per=100.00%, avg=8747.59, stdev=16.51, samples=11488
iops        : min=1986251, max=2365323, avg=2239380.87, stdev=4225.93, samples=11488

After this series (+7.1%):
bw (  MiB/s): min= 8359, max= 9796, per=100.00%, avg=9367.29, stdev=15.75, samples=11488
iops        : min=2140113, max=2507928, avg=2398024.65, stdev=4033.07, samples=11488

Test 2: Ramdisk fio hybrid test for 30m in a 4G memcg on an EPYC 7K62 (3 times):
  fio --buffered=1 --numjobs=8 --size=960m --directory=/mnt \
    --time_based --ramp_time=1m --runtime=30m \
    --ioengine=io_uring --iodepth=128 --iodepth_batch_submit=32 \
    --iodepth_batch_complete=32 --norandommap \
    --name=mglru-ro --rw=randread --random_distribution=zipf:0.7 \
    --name=mglru-rw --rw=randrw --random_distribution=zipf:0.7

Before this series:
 READ: 6622.0 MiB/s, Stdev: 22.090722
WRITE: 1256.3 MiB/s, Stdev: 5.249339

After this series (+5.4%, +3.9%):
 READ: 6981.0 MiB/s, Stdev: 15.556349
WRITE: 1305.7 MiB/s, Stdev: 2.357023

Test 3: 30m MySQL test in a 6G memcg with swap (12 times):
  echo 'set GLOBAL innodb_buffer_pool_size=16106127360;' | \
    mysql -u USER -h localhost --password=PASS
  sysbench /usr/share/sysbench/oltp_read_only.lua \
    --mysql-user=USER --mysql-password=PASS --mysql-db=DB \
    --tables=48 --table-size=2000000 --threads=16 --time=1800 run

Before this series
Avg: 134743.714545 qps. Stdev: 582.242189

After this series (+0.3%):
Avg: 135099.210000 qps. Stdev: 351.488863

Test 4: Build the Linux kernel with make -j48 in a 2G memcg with swap
        (to create memory stress, 18 times):

Before this series:
Avg: 1456.768899 s. Stdev: 20.106973

After this series (-0.5%):
Avg: 1464.178154 s. Stdev: 17.992974

Test 5: Memtier test in a 4G cgroup using brd as swap (18 times):
  memcached -u nobody -m 16384 -s /tmp/memcached.socket \
    -a 0766 -t 16 -B binary &
  memtier_benchmark -S /tmp/memcached.socket \
    -P memcache_binary -n allkeys \
    --key-minimum=1 --key-maximum=16000000 -d 1024 \
    --ratio=1:0 --key-pattern=P:P -c 1 -t 16 --pipeline 8 -x 3

Before this series:
Avg: 50317.984000 Ops/sec. Stdev: 2568.965458

After this series (-2.7%):
Avg: 48959.374118 Ops/sec. Stdev: 3488.559744

Updates from V2:
- Simplify patch 2/3 so the batch tracks only one generation's info, as
  Wei Xu suggested the batch struct may use too much stack.
- Add more tests, and test each individual patch as requested by Wei Xu.
- Fix a typo pointed out by Andrew Morton.

Updates from V1:
- Fix a function argument type as suggested by Chris Li.

Kairui Song (3):
  mm, lru_gen: try to prefetch next page when scanning LRU
  mm, lru_gen: batch update counters on aging
  mm, lru_gen: move pages in bulk when aging

 mm/vmscan.c | 145 ++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 125 insertions(+), 20 deletions(-)

-- 
2.43.0


