* [PATCH v2 0/3] mm, lru_gen: batch update pages when aging
@ 2024-01-11 18:33 Kairui Song
2024-01-11 18:33 ` [PATCH v2 1/3] mm, lru_gen: batch update counters on againg Kairui Song
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Kairui Song @ 2024-01-11 18:33 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Yu Zhao, Chris Li, Matthew Wilcox, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Hi, this is an updated version of the previous series:
https://lore.kernel.org/linux-mm/20231222102255.56993-1-ryncsn@gmail.com/
Currently when MGLRU ages, it moves the pages one by one and updates the mm
counters page by page, which is correct, but the overhead can be reduced
by batching these operations.
In the previous series I only tested with memtier, which didn't show a large
enough improvement. Actually, in-memory fio benefits the most from patch 3:
Ramdisk fio test in a 4G memcg on an EPYC 7K62 with:
fio -name=mglru --numjobs=16 --directory=/mnt --size=960m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=zipf:0.5 --norandommap \
--time_based --ramp_time=1m --runtime=5m --group_reporting
Before this series:
bw ( MiB/s): min= 7644, max= 9293, per=100.00%, avg=8777.77, stdev=16.59, samples=9568
iops : min=1956954, max=2379053, avg=2247108.51, stdev=4247.22, samples=9568
After this series (+7.5%):
bw ( MiB/s): min= 8462, max= 9902, per=100.00%, avg=9444.77, stdev=16.43, samples=9568
iops : min=2166433, max=2535135, avg=2417858.23, stdev=4205.15, samples=9568
However, it's highly dependent on the actual timing and use case.
Besides, batch moving also has a good effect on LRU ordering. Currently, when
MGLRU ages, it walks the LRU backward, and the protected pages are moved to
the tail of the newer gen one by one, which reverses the order of pages in the
LRU. Moving them in batches helps keep their order, though only within a small
scope, due to the scan limit of MAX_LRU_BATCH pages.
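To make the ordering effect concrete (a purely illustrative example, not taken
from the code): suppose one old-gen list holds pages A, B, C from head to tail
and all three get protected. The aging walk starts from the tail, so it
processes C, then B, then A, and list_move_tail() appends each of them to the
newer gen in that order, leaving the newer gen as C, B, A, i.e. the original
order reversed. A single list_bulk_move_tail() of the whole run instead lands
the pages on the newer gen's tail still ordered A, B, C, at least within one
MAX_LRU_BATCH scan window.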
I noticed a higher performance gain when a lot of pages are getting
protected, but that is hard to reproduce, so instead I tested using a simpler
benchmark, memtier, also for a more generic result. The main overhead
here is not aging, but the result still looks good:
Average result of 18 test runs:
Before: 44017.78 Ops/sec
After patch 1-3: 44890.50 Ops/sec (+2.0%)
Some more test results are in the commit messages.
Kairui Song (3):
mm, lru_gen: batch update counters on againg
mm, lru_gen: move pages in bulk when aging
mm, lru_gen: try to prefetch next page when canning LRU
mm/vmscan.c | 140 ++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 124 insertions(+), 16 deletions(-)
--
2.43.0
* [PATCH v2 1/3] mm, lru_gen: batch update counters on againg
2024-01-11 18:33 [PATCH v2 0/3] mm, lru_gen: batch update pages when aging Kairui Song
@ 2024-01-11 18:33 ` Kairui Song
2024-01-11 18:37 ` Kairui Song
2024-01-12 21:01 ` Wei Xu
2024-01-11 18:33 ` [PATCH v2 2/3] mm, lru_gen: move pages in bulk when aging Kairui Song
2024-01-11 18:33 ` [PATCH v2 3/3] mm, lru_gen: try to prefetch next page when canning LRU Kairui Song
2 siblings, 2 replies; 9+ messages in thread
From: Kairui Song @ 2024-01-11 18:33 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Yu Zhao, Chris Li, Matthew Wilcox, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
When lru_gen is aging, it will update mm counters page by page,
which causes a higher overhead if aging happens frequently or there
are a lot of pages in one generation getting moved.
Optimize this by doing the counter updates in batches.
Although most __mod_*_state helpers have their own caches, the overhead
is still observable.
Tested in a 4G memcg on an EPYC 7K62 with:
memcached -u nobody -m 16384 -s /tmp/memcached.socket \
-a 0766 -t 16 -B binary &
memtier_benchmark -S /tmp/memcached.socket \
-P memcache_binary -n allkeys \
--key-minimum=1 --key-maximum=16000000 -d 1024 \
--ratio=1:0 --key-pattern=P:P -c 2 -t 16 --pipeline 8 -x 6
Average result of 18 test runs:
Before: 44017.78 Ops/sec
After: 44687.08 Ops/sec (+1.5%)
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 64 +++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 55 insertions(+), 9 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f9c854ce6cc..185d53607c7e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3113,9 +3113,47 @@ static int folio_update_gen(struct folio *folio, int gen)
return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
}
+/*
+ * Update the LRU gen counters in batches. Each batch covers one gen / type /
+ * zone level LRU list and is applied after scanning of that list is finished
+ * or aborted.
+ */
+struct gen_update_batch {
+ int delta[MAX_NR_GENS];
+};
+
+static void lru_gen_update_batch(struct lruvec *lruvec, int type, int zone,
+ struct gen_update_batch *batch)
+{
+ int gen;
+ int promoted = 0;
+ struct lru_gen_folio *lrugen = &lruvec->lrugen;
+ enum lru_list lru = type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
+
+ for (gen = 0; gen < MAX_NR_GENS; gen++) {
+ int delta = batch->delta[gen];
+
+ if (!delta)
+ continue;
+
+ WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
+ lrugen->nr_pages[gen][type][zone] + delta);
+
+ if (lru_gen_is_active(lruvec, gen))
+ promoted += delta;
+ }
+
+ if (promoted) {
+ __update_lru_size(lruvec, lru, zone, -promoted);
+ __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, promoted);
+ }
+}
+
/* protect pages accessed multiple times through file descriptors */
-static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio,
+ bool reclaiming, struct gen_update_batch *batch)
{
+ int delta = folio_nr_pages(folio);
int type = folio_is_file_lru(folio);
struct lru_gen_folio *lrugen = &lruvec->lrugen;
int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
@@ -3138,7 +3176,8 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
new_flags |= BIT(PG_reclaim);
} while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
- lru_gen_update_size(lruvec, folio, old_gen, new_gen);
+ batch->delta[old_gen] -= delta;
+ batch->delta[new_gen] += delta;
return new_gen;
}
@@ -3672,6 +3711,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
{
int zone;
int remaining = MAX_LRU_BATCH;
+ struct gen_update_batch batch = { };
struct lru_gen_folio *lrugen = &lruvec->lrugen;
int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
@@ -3690,12 +3730,15 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
- new_gen = folio_inc_gen(lruvec, folio, false);
+ new_gen = folio_inc_gen(lruvec, folio, false, &batch);
list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
- if (!--remaining)
+ if (!--remaining) {
+ lru_gen_update_batch(lruvec, type, zone, &batch);
return false;
+ }
}
+ lru_gen_update_batch(lruvec, type, zone, &batch);
}
done:
reset_ctrl_pos(lruvec, type, true);
@@ -4215,7 +4258,7 @@ void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
******************************************************************************/
static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc,
- int tier_idx)
+ int tier_idx, struct gen_update_batch *batch)
{
bool success;
int gen = folio_lru_gen(folio);
@@ -4257,7 +4300,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
if (tier > tier_idx || refs == BIT(LRU_REFS_WIDTH)) {
int hist = lru_hist_from_seq(lrugen->min_seq[type]);
- gen = folio_inc_gen(lruvec, folio, false);
+ gen = folio_inc_gen(lruvec, folio, false, batch);
list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
WRITE_ONCE(lrugen->protected[hist][type][tier - 1],
@@ -4267,7 +4310,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
/* ineligible */
if (zone > sc->reclaim_idx || skip_cma(folio, sc)) {
- gen = folio_inc_gen(lruvec, folio, false);
+ gen = folio_inc_gen(lruvec, folio, false, batch);
list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
return true;
}
@@ -4275,7 +4318,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
/* waiting for writeback */
if (folio_test_locked(folio) || folio_test_writeback(folio) ||
(type == LRU_GEN_FILE && folio_test_dirty(folio))) {
- gen = folio_inc_gen(lruvec, folio, true);
+ gen = folio_inc_gen(lruvec, folio, true, batch);
list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
return true;
}
@@ -4341,6 +4384,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
for (i = MAX_NR_ZONES; i > 0; i--) {
LIST_HEAD(moved);
int skipped_zone = 0;
+ struct gen_update_batch batch = { };
int zone = (sc->reclaim_idx + i) % MAX_NR_ZONES;
struct list_head *head = &lrugen->folios[gen][type][zone];
@@ -4355,7 +4399,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
scanned += delta;
- if (sort_folio(lruvec, folio, sc, tier))
+ if (sort_folio(lruvec, folio, sc, tier, &batch))
sorted += delta;
else if (isolate_folio(lruvec, folio, sc)) {
list_add(&folio->lru, list);
@@ -4375,6 +4419,8 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
skipped += skipped_zone;
}
+ lru_gen_update_batch(lruvec, type, zone, &batch);
+
if (!remaining || isolated >= MIN_LRU_BATCH)
break;
}
--
2.43.0
* [PATCH v2 2/3] mm, lru_gen: move pages in bulk when aging
2024-01-11 18:33 [PATCH v2 0/3] mm, lru_gen: batch update pages when aging Kairui Song
2024-01-11 18:33 ` [PATCH v2 1/3] mm, lru_gen: batch update counters on againg Kairui Song
@ 2024-01-11 18:33 ` Kairui Song
2024-01-11 18:33 ` [PATCH v2 3/3] mm, lru_gen: try to prefetch next page when canning LRU Kairui Song
2 siblings, 0 replies; 9+ messages in thread
From: Kairui Song @ 2024-01-11 18:33 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Yu Zhao, Chris Li, Matthew Wilcox, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Another overhead of aging is page moving. Actually, in most cases,
pages are being moved to the same gen after folio_inc_gen is called,
especially the protected pages. So it's better to move them in bulk.
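For reference, the bulk path below leans on list_bulk_move_tail() from
<linux/list.h>, which splices a whole run of entries onto the tail of another
list in constant time. A minimal sketch of the call (first_folio and
last_folio are placeholder names, not taken from the patch):

	/* Splice the sub-list [first_folio, last_folio] onto the target list's tail. */
	list_bulk_move_tail(&lrugen->folios[new_gen][type][zone],
			    &first_folio->lru, &last_folio->lru);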
This also has a good effect on LRU order. Currently, when MGLRU
ages, it walks the LRU backwards, and the protected pages are moved to
the tail of the newer gen one by one, which actually reverses the order of
pages in the LRU. Moving them in batches helps keep their order, though only
within a small scope, due to the scan limit of MAX_LRU_BATCH pages.
After this commit, we can see a slight performance gain (with
CONFIG_DEBUG_LIST=n):
Test 1: memcached in a 4G memcg on an EPYC 7K62 with:
memcached -u nobody -m 16384 -s /tmp/memcached.socket \
-a 0766 -t 16 -B binary &
memtier_benchmark -S /tmp/memcached.socket \
-P memcache_binary -n allkeys \
--key-minimum=1 --key-maximum=16000000 -d 1024 \
--ratio=1:0 --key-pattern=P:P -c 2 -t 16 --pipeline 8 -x 6
Average result of 18 test runs:
Before: 44017.78 Ops/sec
After patch 1-2: 44810.01 Ops/sec (+1.8%)
Test 2: MySQL in a 6G memcg with:
echo 'set GLOBAL innodb_buffer_pool_size=16106127360;' | \
mysql -u USER -h localhost --password=PASS
sysbench /usr/share/sysbench/oltp_read_only.lua \
--mysql-user=USER --mysql-password=PASS --mysql-db=sb \
--tables=48 --table-size=2000000 --threads=16 --time=1800 \
--report-interval=5 run
QPS of 6 test runs:
Before:
134126.83
134352.13
134045.19
133985.12
134787.47
134554.43
After patch 1-2 (+0.4%):
134913.38
134695.35
134891.31
134662.66
135090.32
134901.14
Only about 10% of CPU time is spent in kernel space for the MySQL test, so the
improvement is quite small.
There could be a higher performance gain when pages are being
protected aggressively.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 84 ++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 71 insertions(+), 13 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 185d53607c7e..57b6549946c3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3120,9 +3120,46 @@ static int folio_update_gen(struct folio *folio, int gen)
*/
struct gen_update_batch {
int delta[MAX_NR_GENS];
+ struct folio *head, *tail;
};
-static void lru_gen_update_batch(struct lruvec *lruvec, int type, int zone,
+static inline void lru_gen_inc_bulk_finish(struct lru_gen_folio *lrugen,
+ int bulk_gen, bool type, int zone,
+ struct gen_update_batch *batch)
+{
+ if (!batch->head)
+ return;
+
+ list_bulk_move_tail(&lrugen->folios[bulk_gen][type][zone],
+ &batch->head->lru,
+ &batch->tail->lru);
+
+ batch->head = NULL;
+}
+
+/*
+ * When aging, protected pages will go to the tail of the same higher
+ * gen, so they can be moved in batches. Besides reducing overhead, this
+ * also avoids changing their LRU order, within a small scope.
+ */
+static inline void lru_gen_try_inc_bulk(struct lru_gen_folio *lrugen, struct folio *folio,
+ int bulk_gen, int gen, bool type, int zone,
+ struct gen_update_batch *batch)
+{
+ /*
+ * If the folio is not moving to the bulk_gen, it raced with promotion,
+ * so it needs to go to the head of another LRU list.
+ */
+ if (bulk_gen != gen)
+ list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
+
+ if (!batch->head)
+ batch->tail = folio;
+
+ batch->head = folio;
+}
+
+static void lru_gen_update_batch(struct lruvec *lruvec, int bulk_gen, int type, int zone,
struct gen_update_batch *batch)
{
int gen;
@@ -3130,6 +3167,8 @@ static void lru_gen_update_batch(struct lruvec *lruvec, int type, int zone,
struct lru_gen_folio *lrugen = &lruvec->lrugen;
enum lru_list lru = type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
+ lru_gen_inc_bulk_finish(lrugen, bulk_gen, type, zone, batch);
+
for (gen = 0; gen < MAX_NR_GENS; gen++) {
int delta = batch->delta[gen];
@@ -3714,6 +3753,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
struct gen_update_batch batch = { };
struct lru_gen_folio *lrugen = &lruvec->lrugen;
int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
+ int bulk_gen = (old_gen + 1) % MAX_NR_GENS;
if (type == LRU_GEN_ANON && !can_swap)
goto done;
@@ -3721,24 +3761,33 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
/* prevent cold/hot inversion if force_scan is true */
for (zone = 0; zone < MAX_NR_ZONES; zone++) {
struct list_head *head = &lrugen->folios[old_gen][type][zone];
+ struct folio *prev = NULL;
- while (!list_empty(head)) {
- struct folio *folio = lru_to_folio(head);
+ if (!list_empty(head))
+ prev = lru_to_folio(head);
+ while (prev) {
+ struct folio *folio = prev;
VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
+ if (unlikely(list_is_first(&folio->lru, head)))
+ prev = NULL;
+ else
+ prev = lru_to_folio(&folio->lru);
+
new_gen = folio_inc_gen(lruvec, folio, false, &batch);
- list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
+ lru_gen_try_inc_bulk(lrugen, folio, bulk_gen, new_gen, type, zone, &batch);
if (!--remaining) {
- lru_gen_update_batch(lruvec, type, zone, &batch);
+ lru_gen_update_batch(lruvec, bulk_gen, type, zone, &batch);
return false;
}
}
- lru_gen_update_batch(lruvec, type, zone, &batch);
+
+ lru_gen_update_batch(lruvec, bulk_gen, type, zone, &batch);
}
done:
reset_ctrl_pos(lruvec, type, true);
@@ -4258,7 +4307,7 @@ void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
******************************************************************************/
static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc,
- int tier_idx, struct gen_update_batch *batch)
+ int tier_idx, int bulk_gen, struct gen_update_batch *batch)
{
bool success;
int gen = folio_lru_gen(folio);
@@ -4301,7 +4350,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
int hist = lru_hist_from_seq(lrugen->min_seq[type]);
gen = folio_inc_gen(lruvec, folio, false, batch);
- list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
+ lru_gen_try_inc_bulk(lrugen, folio, bulk_gen, gen, type, zone, batch);
WRITE_ONCE(lrugen->protected[hist][type][tier - 1],
lrugen->protected[hist][type][tier - 1] + delta);
@@ -4311,7 +4360,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
/* ineligible */
if (zone > sc->reclaim_idx || skip_cma(folio, sc)) {
gen = folio_inc_gen(lruvec, folio, false, batch);
- list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
+ lru_gen_try_inc_bulk(lrugen, folio, bulk_gen, gen, type, zone, batch);
return true;
}
@@ -4385,11 +4434,16 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
LIST_HEAD(moved);
int skipped_zone = 0;
struct gen_update_batch batch = { };
+ int bulk_gen = (gen + 1) % MAX_NR_GENS;
int zone = (sc->reclaim_idx + i) % MAX_NR_ZONES;
struct list_head *head = &lrugen->folios[gen][type][zone];
+ struct folio *prev = NULL;
- while (!list_empty(head)) {
- struct folio *folio = lru_to_folio(head);
+ if (!list_empty(head))
+ prev = lru_to_folio(head);
+
+ while (prev) {
+ struct folio *folio = prev;
int delta = folio_nr_pages(folio);
VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
@@ -4398,8 +4452,12 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
scanned += delta;
+ if (unlikely(list_is_first(&folio->lru, head)))
+ prev = NULL;
+ else
+ prev = lru_to_folio(&folio->lru);
- if (sort_folio(lruvec, folio, sc, tier, &batch))
+ if (sort_folio(lruvec, folio, sc, tier, bulk_gen, &batch))
sorted += delta;
else if (isolate_folio(lruvec, folio, sc)) {
list_add(&folio->lru, list);
@@ -4419,7 +4477,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
skipped += skipped_zone;
}
- lru_gen_update_batch(lruvec, type, zone, &batch);
+ lru_gen_update_batch(lruvec, bulk_gen, type, zone, &batch);
if (!remaining || isolated >= MIN_LRU_BATCH)
break;
--
2.43.0
* [PATCH v2 3/3] mm, lru_gen: try to prefetch next page when canning LRU
2024-01-11 18:33 [PATCH v2 0/3] mm, lru_gen: batch update pages when aging Kairui Song
2024-01-11 18:33 ` [PATCH v2 1/3] mm, lru_gen: batch update counters on againg Kairui Song
2024-01-11 18:33 ` [PATCH v2 2/3] mm, lru_gen: move pages in bulk when aging Kairui Song
@ 2024-01-11 18:33 ` Kairui Song
2024-01-11 18:35 ` Kairui Song
2 siblings, 1 reply; 9+ messages in thread
From: Kairui Song @ 2024-01-11 18:33 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Yu Zhao, Chris Li, Matthew Wilcox, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Prefetch for the inactive/active LRU has long existed; apply the same
optimization to MGLRU.
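For reference, the long-existing prefetch on the classic LRU isolation path
follows roughly the pattern sketched below (simplified and paraphrased rather
than copied; "src" stands for the LRU list being scanned): while handling the
folio just taken from the tail, warm up the cache line of the folio that will
be scanned next.

	while (!list_empty(src)) {
		struct folio *folio = lru_to_folio(src);

		/* Prefetch the next candidate unless this folio is the last one. */
		if (folio->lru.prev != src)
			prefetchw(&lru_to_folio(&folio->lru)->flags);

		/* ... isolate or skip the current folio ... */
	}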
Ramdisk-based swap test in a 4G memcg on an EPYC 7K62 with:
memcached -u nobody -m 16384 -s /tmp/memcached.socket \
-a 0766 -t 16 -B binary &
memtier_benchmark -S /tmp/memcached.socket \
-P memcache_binary -n allkeys \
--key-minimum=1 --key-maximum=16000000 -d 1024 \
--ratio=1:0 --key-pattern=P:P -c 2 -t 16 --pipeline 8 -x 6
Average result of 18 test runs:
Before: 44017.78 Ops/sec
After patch 1-3: 44890.50 Ops/sec (+2.0%)
Ramdisk fio test in a 4G memcg on an EPYC 7K62 with:
fio -name=mglru --numjobs=16 --directory=/mnt --size=960m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=zipf:0.5 --norandommap \
--time_based --ramp_time=1m --runtime=5m --group_reporting
Before this patch:
bw ( MiB/s): min= 7644, max= 9293, per=100.00%, avg=8777.77, stdev=16.59, samples=9568
iops : min=1956954, max=2379053, avg=2247108.51, stdev=4247.22, samples=9568
After this patch (+7.5%):
bw ( MiB/s): min= 8462, max= 9902, per=100.00%, avg=9444.77, stdev=16.43, samples=9568
iops : min=2166433, max=2535135, avg=2417858.23, stdev=4205.15, samples=9568
Prefetching is highly dependent on timing and architecture, so it may only help
in certain cases; some extra tests showed at least no regression here for
the series:
Ramdisk memtier test above in an 8G memcg on an Intel i7-9700:
memtier_benchmark -S /tmp/memcached.socket \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=36000000 --key-pattern=P:P -c 1 -t 12 \
--ratio 1:0 --pipeline 8 -d 1024 -x 4
Average result of 12 test runs:
Before: 61241.96 Ops/sec
After patch 1-3: 61268.53 Ops/sec (+0.0%)
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 57b6549946c3..4ef83db40adb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3773,10 +3773,12 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
- if (unlikely(list_is_first(&folio->lru, head)))
+ if (unlikely(list_is_first(&folio->lru, head))) {
prev = NULL;
- else
+ } else {
prev = lru_to_folio(&folio->lru);
+ prefetchw(&prev->flags);
+ }
new_gen = folio_inc_gen(lruvec, folio, false, &batch);
lru_gen_try_inc_bulk(lrugen, folio, bulk_gen, new_gen, type, zone, &batch);
@@ -4452,10 +4454,12 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
scanned += delta;
- if (unlikely(list_is_first(&folio->lru, head)))
+ if (unlikely(list_is_first(&folio->lru, head))) {
prev = NULL;
- else
+ } else {
prev = lru_to_folio(&folio->lru);
+ prefetchw(&prev->flags);
+ }
if (sort_folio(lruvec, folio, sc, tier, bulk_gen, &batch))
sorted += delta;
--
2.43.0
* Re: [PATCH v2 3/3] mm, lru_gen: try to prefetch next page when canning LRU
2024-01-11 18:33 ` [PATCH v2 3/3] mm, lru_gen: try to prefetch next page when canning LRU Kairui Song
@ 2024-01-11 18:35 ` Kairui Song
0 siblings, 0 replies; 9+ messages in thread
From: Kairui Song @ 2024-01-11 18:35 UTC (permalink / raw)
To: linux-mm; +Cc: Andrew Morton, Yu Zhao, Chris Li, Matthew Wilcox, linux-kernel
Kairui Song <ryncsn@gmail.com> wrote on Fri, Jan 12, 2024 at 02:33:
>
> From: Kairui Song <kasong@tencent.com>
>
> Prefetch for the inactive/active LRU has long existed; apply the same
> optimization to MGLRU.
>
> Ramdisk-based swap test in a 4G memcg on an EPYC 7K62 with:
>
Hi, my apologies, I just realized I forgot to fix the typo in the title
right after I sent the patch... it should be:
mm, lru_gen: try to prefetch next page when scanning LRU
* Re: [PATCH v2 1/3] mm, lru_gen: batch update counters on againg
2024-01-11 18:33 ` [PATCH v2 1/3] mm, lru_gen: batch update counters on againg Kairui Song
@ 2024-01-11 18:37 ` Kairui Song
2024-01-12 21:01 ` Wei Xu
1 sibling, 0 replies; 9+ messages in thread
From: Kairui Song @ 2024-01-11 18:37 UTC (permalink / raw)
To: linux-mm; +Cc: Andrew Morton, Yu Zhao, Chris Li, Matthew Wilcox, linux-kernel
Kairui Song <ryncsn@gmail.com> wrote on Fri, Jan 12, 2024 at 02:33:
>
> From: Kairui Song <kasong@tencent.com>
>
> When lru_gen is aging, it will update mm counters page by page,
> which causes a higher overhead if aging happens frequently or there
> are a lot of pages in one generation getting moved.
> Optimize this by doing the counter updates in batches.
>
> Although most __mod_*_state helpers have their own caches, the overhead
> is still observable.
>
> Tested in a 4G memcg on an EPYC 7K62 with:
>
> memcached -u nobody -m 16384 -s /tmp/memcached.socket \
> -a 0766 -t 16 -B binary &
>
> memtier_benchmark -S /tmp/memcached.socket \
> -P memcache_binary -n allkeys \
> --key-minimum=1 --key-maximum=16000000 -d 1024 \
> --ratio=1:0 --key-pattern=P:P -c 2 -t 16 --pipeline 8 -x 6
>
> Average result of 18 test runs:
>
> Before: 44017.78 Ops/sec
> After: 44687.08 Ops/sec (+1.5%)
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/vmscan.c | 64 +++++++++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 55 insertions(+), 9 deletions(-)
>
My apologies for being careless again here... the title should be:
mm, lru_gen: batch update counters on aging
* Re: [PATCH v2 1/3] mm, lru_gen: batch update counters on againg
2024-01-11 18:33 ` [PATCH v2 1/3] mm, lru_gen: batch update counters on againg Kairui Song
2024-01-11 18:37 ` Kairui Song
@ 2024-01-12 21:01 ` Wei Xu
2024-01-14 17:42 ` Kairui Song
1 sibling, 1 reply; 9+ messages in thread
From: Wei Xu @ 2024-01-12 21:01 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Yu Zhao, Chris Li, Matthew Wilcox,
linux-kernel, Greg Thelen
On Thu, Jan 11, 2024 at 10:33 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> When lru_gen is aging, it will update mm counters page by page,
> which causes a higher overhead if aging happens frequently or there
> are a lot of pages in one generation getting moved.
> Optimize this by doing the counter updates in batches.
>
> Although most __mod_*_state helpers have their own caches, the overhead
> is still observable.
>
> Tested in a 4G memcg on an EPYC 7K62 with:
>
> memcached -u nobody -m 16384 -s /tmp/memcached.socket \
> -a 0766 -t 16 -B binary &
>
> memtier_benchmark -S /tmp/memcached.socket \
> -P memcache_binary -n allkeys \
> --key-minimum=1 --key-maximum=16000000 -d 1024 \
> --ratio=1:0 --key-pattern=P:P -c 2 -t 16 --pipeline 8 -x 6
>
> Average result of 18 test runs:
>
> Before: 44017.78 Ops/sec
> After: 44687.08 Ops/sec (+1.5%)
I see the same performance numbers quoted in all 3 patches.
How much performance improvement does this particular patch provide
(and likewise for the other 2 patches)? If, as the cover letter says,
most of the performance benefit comes from patch 3 (prefetching), can we
just take that patch alone to avoid the extra complexity?
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/vmscan.c | 64 +++++++++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 55 insertions(+), 9 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4f9c854ce6cc..185d53607c7e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3113,9 +3113,47 @@ static int folio_update_gen(struct folio *folio, int gen)
> return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> }
>
> +/*
> + * Update the LRU gen counters in batches. Each batch covers one gen / type /
> + * zone level LRU list and is applied after scanning of that list is finished
> + * or aborted.
> + */
> +struct gen_update_batch {
> + int delta[MAX_NR_GENS];
> +};
> +
> +static void lru_gen_update_batch(struct lruvec *lruvec, int type, int zone,
> + struct gen_update_batch *batch)
> +{
> + int gen;
> + int promoted = 0;
> + struct lru_gen_folio *lrugen = &lruvec->lrugen;
> + enum lru_list lru = type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
> +
> + for (gen = 0; gen < MAX_NR_GENS; gen++) {
> + int delta = batch->delta[gen];
> +
> + if (!delta)
> + continue;
> +
> + WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
> + lrugen->nr_pages[gen][type][zone] + delta);
> +
> + if (lru_gen_is_active(lruvec, gen))
> + promoted += delta;
> + }
> +
> + if (promoted) {
> + __update_lru_size(lruvec, lru, zone, -promoted);
> + __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, promoted);
> + }
> +}
> +
> /* protect pages accessed multiple times through file descriptors */
> -static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio,
> + bool reclaiming, struct gen_update_batch *batch)
> {
> + int delta = folio_nr_pages(folio);
> int type = folio_is_file_lru(folio);
> struct lru_gen_folio *lrugen = &lruvec->lrugen;
> int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> @@ -3138,7 +3176,8 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
> new_flags |= BIT(PG_reclaim);
> } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
>
> - lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> + batch->delta[old_gen] -= delta;
> + batch->delta[new_gen] += delta;
>
> return new_gen;
> }
> @@ -3672,6 +3711,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> {
> int zone;
> int remaining = MAX_LRU_BATCH;
> + struct gen_update_batch batch = { };
Can this batch variable be moved away from the stack? We (Google) use
a much larger value for MAX_NR_GENS internally. This large stack
allocation from "struct gen_update_batch batch" can significantly
increase the risk of stack overflow for our use cases.
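For reference, with the upstream MAX_NR_GENS of 4 this is only 4 * sizeof(int)
= 16 bytes, but the array grows linearly with MAX_NR_GENS (plus the head/tail
pointers added in patch 2), so a much larger generation count quickly turns it
into a non-trivial on-stack object.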
> struct lru_gen_folio *lrugen = &lruvec->lrugen;
> int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
>
> @@ -3690,12 +3730,15 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
> VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
>
> - new_gen = folio_inc_gen(lruvec, folio, false);
> + new_gen = folio_inc_gen(lruvec, folio, false, &batch);
> list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
>
> - if (!--remaining)
> + if (!--remaining) {
> + lru_gen_update_batch(lruvec, type, zone, &batch);
> return false;
> + }
> }
> + lru_gen_update_batch(lruvec, type, zone, &batch);
> }
> done:
> reset_ctrl_pos(lruvec, type, true);
> @@ -4215,7 +4258,7 @@ void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
> ******************************************************************************/
>
> static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc,
> - int tier_idx)
> + int tier_idx, struct gen_update_batch *batch)
> {
> bool success;
> int gen = folio_lru_gen(folio);
> @@ -4257,7 +4300,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> if (tier > tier_idx || refs == BIT(LRU_REFS_WIDTH)) {
> int hist = lru_hist_from_seq(lrugen->min_seq[type]);
>
> - gen = folio_inc_gen(lruvec, folio, false);
> + gen = folio_inc_gen(lruvec, folio, false, batch);
> list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
>
> WRITE_ONCE(lrugen->protected[hist][type][tier - 1],
> @@ -4267,7 +4310,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>
> /* ineligible */
> if (zone > sc->reclaim_idx || skip_cma(folio, sc)) {
> - gen = folio_inc_gen(lruvec, folio, false);
> + gen = folio_inc_gen(lruvec, folio, false, batch);
> list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
> return true;
> }
> @@ -4275,7 +4318,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> /* waiting for writeback */
> if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> - gen = folio_inc_gen(lruvec, folio, true);
> + gen = folio_inc_gen(lruvec, folio, true, batch);
> list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> return true;
> }
> @@ -4341,6 +4384,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> for (i = MAX_NR_ZONES; i > 0; i--) {
> LIST_HEAD(moved);
> int skipped_zone = 0;
> + struct gen_update_batch batch = { };
> int zone = (sc->reclaim_idx + i) % MAX_NR_ZONES;
> struct list_head *head = &lrugen->folios[gen][type][zone];
>
> @@ -4355,7 +4399,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
>
> scanned += delta;
>
> - if (sort_folio(lruvec, folio, sc, tier))
> + if (sort_folio(lruvec, folio, sc, tier, &batch))
> sorted += delta;
> else if (isolate_folio(lruvec, folio, sc)) {
> list_add(&folio->lru, list);
> @@ -4375,6 +4419,8 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> skipped += skipped_zone;
> }
>
> + lru_gen_update_batch(lruvec, type, zone, &batch);
> +
> if (!remaining || isolated >= MIN_LRU_BATCH)
> break;
> }
> --
> 2.43.0
>
>
* Re: [PATCH v2 1/3] mm, lru_gen: batch update counters on againg
2024-01-12 21:01 ` Wei Xu
@ 2024-01-14 17:42 ` Kairui Song
2024-01-15 17:09 ` Kairui Song
0 siblings, 1 reply; 9+ messages in thread
From: Kairui Song @ 2024-01-14 17:42 UTC (permalink / raw)
To: Wei Xu
Cc: linux-mm, Andrew Morton, Yu Zhao, Chris Li, Matthew Wilcox,
linux-kernel, Greg Thelen
Wei Xu <weixugc@google.com> wrote on Sat, Jan 13, 2024 at 05:01:
>
> On Thu, Jan 11, 2024 at 10:33 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > When lru_gen is aging, it will update mm counters page by page,
> > which causes a higher overhead if aging happens frequently or there
> > are a lot of pages in one generation getting moved.
> > Optimize this by doing the counter updates in batches.
> >
> > Although most __mod_*_state helpers have their own caches, the overhead
> > is still observable.
> >
> > Tested in a 4G memcg on an EPYC 7K62 with:
> >
> > memcached -u nobody -m 16384 -s /tmp/memcached.socket \
> > -a 0766 -t 16 -B binary &
> >
> > memtier_benchmark -S /tmp/memcached.socket \
> > -P memcache_binary -n allkeys \
> > --key-minimum=1 --key-maximum=16000000 -d 1024 \
> > --ratio=1:0 --key-pattern=P:P -c 2 -t 16 --pipeline 8 -x 6
> >
> > Average result of 18 test runs:
> >
> > Before: 44017.78 Ops/sec
> > After: 44687.08 Ops/sec (+1.5%)
>
> I see the same performance numbers quoted in all 3 patches.
> How much performance improvement does this particular patch provide
> (and likewise for the other 2 patches)? If, as the cover letter says,
> most of the performance benefit comes from patch 3 (prefetching), can we
> just take that patch alone to avoid the extra complexity?
Hi Wei,
Indeed these are two different optimization techniques. I can reorder
the series so that prefetch comes first, since it should be more acceptable,
and the other optimizations can come later, with standalone info about
the improvement from the batch operations.
>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/vmscan.c | 64 +++++++++++++++++++++++++++++++++++++++++++++--------
> > 1 file changed, 55 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 4f9c854ce6cc..185d53607c7e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -3113,9 +3113,47 @@ static int folio_update_gen(struct folio *folio, int gen)
> > return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> > }
> >
> > +/*
> > + * Update the LRU gen counters in batches. Each batch covers one gen / type /
> > + * zone level LRU list and is applied after scanning of that list is finished
> > + * or aborted.
> > + */
> > +struct gen_update_batch {
> > + int delta[MAX_NR_GENS];
> > +};
> > +
> > +static void lru_gen_update_batch(struct lruvec *lruvec, int type, int zone,
> > + struct gen_update_batch *batch)
> > +{
> > + int gen;
> > + int promoted = 0;
> > + struct lru_gen_folio *lrugen = &lruvec->lrugen;
> > + enum lru_list lru = type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
> > +
> > + for (gen = 0; gen < MAX_NR_GENS; gen++) {
> > + int delta = batch->delta[gen];
> > +
> > + if (!delta)
> > + continue;
> > +
> > + WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
> > + lrugen->nr_pages[gen][type][zone] + delta);
> > +
> > + if (lru_gen_is_active(lruvec, gen))
> > + promoted += delta;
> > + }
> > +
> > + if (promoted) {
> > + __update_lru_size(lruvec, lru, zone, -promoted);
> > + __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, promoted);
> > + }
> > +}
> > +
> > /* protect pages accessed multiple times through file descriptors */
> > -static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio,
> > + bool reclaiming, struct gen_update_batch *batch)
> > {
> > + int delta = folio_nr_pages(folio);
> > int type = folio_is_file_lru(folio);
> > struct lru_gen_folio *lrugen = &lruvec->lrugen;
> > int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> > @@ -3138,7 +3176,8 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
> > new_flags |= BIT(PG_reclaim);
> > } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
> >
> > - lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> > + batch->delta[old_gen] -= delta;
> > + batch->delta[new_gen] += delta;
> >
> > return new_gen;
> > }
> > @@ -3672,6 +3711,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> > {
> > int zone;
> > int remaining = MAX_LRU_BATCH;
> > + struct gen_update_batch batch = { };
>
> Can this batch variable be moved away from the stack? We (Google) use
> a much larger value for MAX_NR_GENS internally. This large stack
> allocation from "struct gen_update_batch batch" can significantly
> increase the risk of stack overflow for our use cases.
>
Thanks for the info.
Do you have any suggestion about where we should put the batch info? I
thought about merging it with lru_gen_mm_walk, but that depends on
kzalloc and is not usable in the slow allocation path, so the overhead could
be larger than the benefit in many cases.
Not sure if we can use something like a preallocated per-cpu cache
here to avoid all these issues.
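One rough sketch of that per-CPU direction (purely illustrative, the names are
made up for the example, and it glosses over preemption and PREEMPT_RT details
by assuming the caller stays on one CPU while the batch is in use, e.g. under
the lruvec lock):

	/* A per-CPU scratch batch instead of an on-stack one (sketch only). */
	static DEFINE_PER_CPU(struct gen_update_batch, lru_gen_scratch_batch);

	static struct gen_update_batch *lru_gen_get_batch(void)
	{
		/* Assumes the caller cannot migrate while the batch is in use. */
		struct gen_update_batch *batch = this_cpu_ptr(&lru_gen_scratch_batch);

		memset(batch, 0, sizeof(*batch));
		return batch;
	}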
* Re: [PATCH v2 1/3] mm, lru_gen: batch update counters on againg
2024-01-14 17:42 ` Kairui Song
@ 2024-01-15 17:09 ` Kairui Song
0 siblings, 0 replies; 9+ messages in thread
From: Kairui Song @ 2024-01-15 17:09 UTC (permalink / raw)
To: Wei Xu
Cc: linux-mm, Andrew Morton, Yu Zhao, Chris Li, Matthew Wilcox,
linux-kernel, Greg Thelen
Kairui Song <ryncsn@gmail.com> wrote on Mon, Jan 15, 2024 at 01:42:
>
> Wei Xu <weixugc@google.com> wrote on Sat, Jan 13, 2024 at 05:01:
> >
> > On Thu, Jan 11, 2024 at 10:33 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > When lru_gen is aging, it will update mm counters page by page,
> > > which causes a higher overhead if aging happens frequently or there
> > > are a lot of pages in one generation getting moved.
> > > Optimize this by doing the counter updates in batches.
> > >
> > > Although most __mod_*_state helpers have their own caches, the overhead
> > > is still observable.
> > >
> > > Tested in a 4G memcg on an EPYC 7K62 with:
> > >
> > > memcached -u nobody -m 16384 -s /tmp/memcached.socket \
> > > -a 0766 -t 16 -B binary &
> > >
> > > memtier_benchmark -S /tmp/memcached.socket \
> > > -P memcache_binary -n allkeys \
> > > --key-minimum=1 --key-maximum=16000000 -d 1024 \
> > > --ratio=1:0 --key-pattern=P:P -c 2 -t 16 --pipeline 8 -x 6
> > >
> > > Average result of 18 test runs:
> > >
> > > Before: 44017.78 Ops/sec
> > > After: 44687.08 Ops/sec (+1.5%)
> >
> > I see the same performance numbers quoted in all 3 patches.
> > How much performance improvement does this particular patch provide
> > (and likewise for the other 2 patches)? If, as the cover letter says,
> > most of the performance benefit comes from patch 3 (prefetching), can we
> > just take that patch alone to avoid the extra complexity?
>
> Hi Wei,
>
> Indeed these are two different optimization techniques. I can reorder
> the series so that prefetch comes first, since it should be more acceptable,
> and the other optimizations can come later, with standalone info about
> the improvement from the batch operations.
>
> >
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > > mm/vmscan.c | 64 +++++++++++++++++++++++++++++++++++++++++++++--------
> > > 1 file changed, 55 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 4f9c854ce6cc..185d53607c7e 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -3113,9 +3113,47 @@ static int folio_update_gen(struct folio *folio, int gen)
> > > return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> > > }
> > >
> > > +/*
> > > + * Update the LRU gen counters in batches. Each batch covers one gen / type /
> > > + * zone level LRU list and is applied after scanning of that list is finished
> > > + * or aborted.
> > > + */
> > > +struct gen_update_batch {
> > > + int delta[MAX_NR_GENS];
> > > +};
> > > +
> > > +static void lru_gen_update_batch(struct lruvec *lruvec, int type, int zone,
> > > + struct gen_update_batch *batch)
> > > +{
> > > + int gen;
> > > + int promoted = 0;
> > > + struct lru_gen_folio *lrugen = &lruvec->lrugen;
> > > + enum lru_list lru = type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
> > > +
> > > + for (gen = 0; gen < MAX_NR_GENS; gen++) {
> > > + int delta = batch->delta[gen];
> > > +
> > > + if (!delta)
> > > + continue;
> > > +
> > > + WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
> > > + lrugen->nr_pages[gen][type][zone] + delta);
> > > +
> > > + if (lru_gen_is_active(lruvec, gen))
> > > + promoted += delta;
> > > + }
> > > +
> > > + if (promoted) {
> > > + __update_lru_size(lruvec, lru, zone, -promoted);
> > > + __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, promoted);
> > > + }
> > > +}
> > > +
> > > /* protect pages accessed multiple times through file descriptors */
> > > -static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > > +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio,
> > > + bool reclaiming, struct gen_update_batch *batch)
> > > {
> > > + int delta = folio_nr_pages(folio);
> > > int type = folio_is_file_lru(folio);
> > > struct lru_gen_folio *lrugen = &lruvec->lrugen;
> > > int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> > > @@ -3138,7 +3176,8 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
> > > new_flags |= BIT(PG_reclaim);
> > > } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
> > >
> > > - lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> > > + batch->delta[old_gen] -= delta;
> > > + batch->delta[new_gen] += delta;
> > >
> > > return new_gen;
> > > }
> > > @@ -3672,6 +3711,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> > > {
> > > int zone;
> > > int remaining = MAX_LRU_BATCH;
> > > + struct gen_update_batch batch = { };
> >
> > Can this batch variable be moved away from the stack? We (Google) use
> > a much larger value for MAX_NR_GENS internally. This large stack
> > allocation from "struct gen_update_batch batch" can significantly
> > increase the risk of stack overflow for our use cases.
> >
>
> Thanks for the info.
> Do you have any suggestion about where we should put the batch info? I
> thought about merging it with lru_gen_mm_walk, but that depends on
> kzalloc and is not usable in the slow allocation path, so the overhead could
> be larger than the benefit in many cases.
>
> Not sure if we can use something like a preallocated per-cpu cache
> here to avoid all these issues.
Hi Wei,
After a second thought, the batch is mostly used together with
folio_inc_gen, which means most pages are only being moved between two
gens (being protected or found unreclaimable), so I think only one int
counter is needed in the batch. I'll update this patch and do some tests
based on this.
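In rough form (an illustrative sketch only, not the final patch; nr_bulk is a
made-up field name), the batch would just remember how many pages moved from
the old gen into the bulk gen, and anything that lands in a different gen would
keep taking the existing unbatched path:

	struct gen_update_batch {
		int nr_bulk;			/* pages moved old_gen -> bulk_gen */
		struct folio *head, *tail;	/* pending list_bulk_move_tail() range */
	};

	/* Applied once per scanned LRU list instead of one delta[] entry per gen: */
	WRITE_ONCE(lrugen->nr_pages[old_gen][type][zone],
		   lrugen->nr_pages[old_gen][type][zone] - batch->nr_bulk);
	WRITE_ONCE(lrugen->nr_pages[bulk_gen][type][zone],
		   lrugen->nr_pages[bulk_gen][type][zone] + batch->nr_bulk);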
Thread overview: 9+ messages
2024-01-11 18:33 [PATCH v2 0/3] mm, lru_gen: batch update pages when aging Kairui Song
2024-01-11 18:33 ` [PATCH v2 1/3] mm, lru_gen: batch update counters on againg Kairui Song
2024-01-11 18:37 ` Kairui Song
2024-01-12 21:01 ` Wei Xu
2024-01-14 17:42 ` Kairui Song
2024-01-15 17:09 ` Kairui Song
2024-01-11 18:33 ` [PATCH v2 2/3] mm, lru_gen: move pages in bulk when aging Kairui Song
2024-01-11 18:33 ` [PATCH v2 3/3] mm, lru_gen: try to prefetch next page when canning LRU Kairui Song
2024-01-11 18:35 ` Kairui Song